WO2024011903A1 - Video generation method and device, and computer-readable storage medium - Google Patents

Video generation method and device, and computer-readable storage medium

Info

Publication number
WO2024011903A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
feature
video
real
frame
Prior art date
Application number
PCT/CN2023/075752
Other languages
English (en)
French (fr)
Inventor
白亚龙
周默涵
张炜
梅涛
Original Assignee
北京京东尚科信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京京东尚科信息技术有限公司 filed Critical 北京京东尚科信息技术有限公司
Publication of WO2024011903A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/09 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being zero crossing rates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals

Definitions

  • the present invention is based on a Chinese patent application with application number 202210834191.6 and a filing date of July 14, 2022, and claims the priority of the Chinese patent application.
  • the entire content of the Chinese patent application is hereby incorporated by reference into the present invention.
  • the invention relates to the field of human-computer interaction, and in particular to a video generation method and device, and a computer-readable storage medium.
  • Good communication is a two-way process: it is not one-way information input or output, but an interactive exchange of information.
  • Real person-to-person communication is a process of continuously switching and cycling between two states, listening and speaking, and the two are equally important. Both are essential for building anthropomorphic digital humans for human-computer interaction. On the one hand, a digital human is required to express its opinions as clearly and concisely as possible in a language the other party understands; on the other hand, an anthropomorphic digital human must be good at listening to and understanding other people's opinions.
  • The existing technology mainly generates a corresponding speaker video based on the speaker's reference image and a time-varying signal. The main approach is to parameterize the speaker using face key points, a 3D face model, a human skeleton model and the like, fit these parameters with a deep neural network, and render an image from the parameters as the generated result. The generated images are of poor quality.
  • Embodiments of the present invention provide a video generation method and device, and a computer-readable storage medium, which can generate a virtual object video sequence based on the audio and video sequence of a real object, thereby improving the vividness and accuracy of generating a virtual object video sequence.
  • An embodiment of the present invention provides a video generation method, which method includes:
  • the preset standard features are features corresponding to the reference object;
  • the first features represent different attitudes;
  • the video sequence of the virtual object is a video sequence of the corresponding reaction of the virtual object generated based on the audio and video sequence of the real object;
  • a video sequence of the virtual object is presented.
  • the virtual prediction network, preset standard features, and the first feature are used to predict the anthropomorphic features and generate a video sequence of the virtual object, including:
  • the preset standard features include first gesture expression features and first identity features;
  • based on the first gesture expression feature, the anthropomorphic feature and the first feature, prediction and decoding are performed through the virtual prediction network to determine the multi-frame gesture expression features of the virtual object;
  • a video sequence of the virtual object is generated.
  • the acquisition of preset standard features includes:
  • the standard image represents an image of the reference object
  • Feature extraction is performed on the standard image through a face reconstruction model to obtain the preset standard features.
  • the anthropomorphic features include: multi-frame anthropomorphic features corresponding to the audio and video sequence;
  • the virtual prediction network includes: a first processing module and a second processing module; predicting and decoding through the virtual prediction network based on the first gesture expression features, the anthropomorphic features and the first feature to determine the multi-frame gesture expression features of the virtual object includes:
  • the first characteristic is one of positive attitude, negative attitude and ordinary attitude
  • decoding the next predicted video frame through the second processing module to determine the next gesture expression feature of the virtual object; based on the next gesture expression feature and the next frame of anthropomorphic features in the multi-frame anthropomorphic features, prediction and decoding are continued until the last gesture expression feature of the virtual object corresponding to the last predicted video frame is obtained, thereby obtaining the multi-frame gesture expression features of the virtual object; the first gesture expression feature is used as the first frame in the multi-frame gesture expression features.
  • generating a video sequence of the virtual object based on the multi-frame gesture expression feature of the virtual object and the first identity feature includes:
  • Each frame of gesture expression features in the multi-frame gesture expression features is respectively fused with the first identity feature to obtain a plurality of second features; the second features represent the fusion result of the identity feature and the gesture expression feature;
  • the plurality of second features are passed through a renderer to generate a video sequence of the virtual object;
  • the first identity feature includes a first identity mark, a first material and first lighting information.
  • the audio and video sequences include: audio sequences of real objects and video sequences of real objects; feature extraction of the audio and video sequences to determine anthropomorphic features includes:
  • the encoder performs feature extraction on the audio sequence of the real object to obtain multiple audio features;
  • the audio features include loudness, zero-crossing rate, and cepstrum coefficients;
  • anthropomorphic features are multi-frame anthropomorphic features corresponding to the audio and video sequences
  • the anthropomorphic features include video features and audio features.
  • the encoder performs feature extraction on the video sequence of the real object to obtain multiple video features, including:
  • Feature extraction is performed on each video frame of the video sequence of the real object through a face reconstruction model to obtain multiple video frame features;
  • the video frame features include a second identity feature and a second gesture expression feature;
  • before using a virtual prediction network, preset standard features, and first features to predict the anthropomorphic features and generating a video sequence of the virtual object, the method further includes:
  • predicted face features corresponding to the audio and video sequence samples of the real training objects are generated;
  • the predicted face features include predicted posture features and predicted expression features;
  • feature extraction is performed through the face reconstruction model to determine the real face features, where the real face features include real posture features and real expression features;
  • the initial encoder is continuously optimized through the first loss function and the anthropomorphic sample characteristics until the first loss function value meets the first preset threshold, and the encoder is determined;
  • the initial virtual prediction network is continuously optimized through the second loss function and the third loss function until the second loss function value and the third loss function value meet the second preset threshold, and the virtual prediction network is determined.
  • the continuous optimization of the initial virtual prediction network through the second loss function and the third loss function, until the second loss function value and the third loss function value satisfy the second preset threshold, to determine the virtual prediction network includes:
  • the second loss function is used to ensure that the predicted expressions and predicted postures are similar to the real expressions and real postures;
  • a third loss function is determined; the third loss function is used to ensure that the inter-frame continuity of the predicted face features is similar to that of the real face features;
  • the initial virtual prediction network is continuously optimized until the second loss function value and the third loss function value satisfy the second preset threshold, to determine the virtual prediction network.
  • An embodiment of the present invention provides a video generation device, which is characterized in that the video generation device includes an acquisition part, a determination part and a generation part; wherein,
  • the acquisition part is configured to collect audio and video sequences of real objects
  • the determining part is configured to perform feature extraction on the audio and video sequence and determine anthropomorphic features
  • the generation part is configured to use a virtual prediction network, preset standard features, and first features to predict the anthropomorphic features and generate a video sequence of the virtual object; the preset standard features are features corresponding to the reference object;
  • the first feature represents different attitudes;
  • the video sequence of the virtual object is a video sequence that generates a corresponding reaction of the virtual object based on the audio and video sequence of the real object; and presents the video sequence of the virtual object.
  • An embodiment of the present invention provides a video generation device, which is characterized in that the video generation device includes:
  • a memory configured to store executable instructions;
  • a processor configured to execute executable instructions stored in the memory. When the executable instructions are executed, the processor executes the video generation method.
  • An embodiment of the present invention provides a computer-readable storage medium, which is characterized in that executable instructions are stored therein.
  • when the executable instructions are executed by a processor, the processor executes the video generation method.
  • Embodiments of the present invention provide a video generation method and device, and a computer-readable storage medium.
  • the method includes: collecting audio and video sequences of real objects; performing feature extraction on the audio and video sequences to determine anthropomorphic features; using a virtual prediction network, preset standard features, and a first feature to predict the anthropomorphic features and generate a video sequence of the virtual object; the preset standard features are the features corresponding to the reference object; the first feature represents different attitudes; the video sequence of the virtual object is a video sequence of the corresponding reaction of the virtual object generated based on the audio and video sequence of the real object; and the video sequence of the virtual object is presented.
  • Embodiments of the present invention generate video sequences of virtual objects based on audio and video sequences of real objects, and the presented video sequences of virtual objects are more vivid and accurate.
  • Figure 1 is a schematic diagram of an optional terminal operation that provides a video generation method according to an embodiment of the present invention
  • Figure 2 is an optional flow diagram 1 of a video generation method provided by an embodiment of the present invention.
  • Figure 3a is a schematic diagram 1 of an optional speaker video generation that provides a video generation method according to an embodiment of the present invention
  • Figure 3b is a schematic diagram 2 of an optional speaker video generation that provides a video generation method according to the embodiment of the present invention
  • Figure 3c is a schematic diagram 3 of an optional speaker video generation that provides a video generation method according to the embodiment of the present invention
  • Figure 4 is an optional flow diagram 2 of a video generation method provided by an embodiment of the present invention.
  • Figure 5 is an optional flowchart 3 of a video generation method provided by an embodiment of the present invention.
  • Figure 6 is an optional flowchart 4 of a video generation method provided by an embodiment of the present invention.
  • Figure 7 is an optional flow diagram 5 of a video generation method provided by an embodiment of the present invention.
  • Figure 8a is a first diagram of virtual object video sequence results of a video generation method provided by an embodiment of the present invention
  • Figure 8b is a second diagram of virtual object video sequence results of a video generation method provided by an embodiment of the present invention.
  • Figure 9 is an optional flow chart 6 of a video generation method provided by an embodiment of the present invention.
  • Figure 10 is an optional model architecture diagram that provides a video generation method according to an embodiment of the present invention.
  • Figure 11 is a schematic structural diagram of a video generation device provided by an embodiment of the present invention.
  • Figure 12 is a schematic structural diagram 2 of a video generation device provided by an embodiment of the present invention.
  • Figure 1 is a schematic diagram of an optional terminal operation for providing a video generation method according to an embodiment of the present invention.
  • the terminal includes a speaker encoder, a virtual prediction network and a virtual human interface (not shown in the figure); the terminal can perform feature extraction on the audio and video of a real object through the speaker encoder (equivalent to the encoder), input the extracted features, the attitude (equivalent to the first feature) and the reference image (equivalent to the preset standard image) into the listener decoder (equivalent to the virtual prediction network) for prediction, and generate the listener's head movements and expression changes arranged on a timeline, thereby obtaining a virtual object video sequence.
  • FIG. 2 is an optional flowchart 1 of a video generation method provided by an embodiment of the present invention, which will be described in conjunction with the steps shown in FIG. 2 .
  • listening is also a functional behavior during communication.
  • listening behavior styles can be divided into four categories, namely non-listeners, marginal listeners, evaluative listeners and active listeners.
  • active listening is the most effective and plays a key role in communication. It requires the listener to concentrate completely on what the speaker is saying, to listen carefully, and at the same time to show some visible reaction to the speaker. These reactions feed back to the speaker whether the listener is interested in, understands and agrees with the content of the speech, so that the speaker can adjust the rhythm and process of the conversation and promote smooth communication.
  • Figure 3a, Figure 3b and Figure 3c are respectively an optional speaker video generation schematic diagram 1, speaker video generation schematic diagram 2 and speaker video generation schematic diagram 3 of a video generation method provided by an embodiment of the present invention. As shown in Figure 3a,
  • the speaker video generation task includes generation of the speaker's body posture; as shown in Figure 3b, the speaker video generation task includes generation of the speaker's lip movements; as shown in Figure 3c, the speaker video generation task includes motion generation of the speaker's head (including the face).
  • the speaker's body posture is generated by processing a time-varying signal input by the dotted line box through the body posture generation model to obtain the body posture shown in the dotted line box.
  • the speaker's lip movement is generated by processing a period of time-varying signal input in the dotted box and a general reference image through the lip movement generation model, and outputting the speaker's lip movement image frames shown in the dotted box.
  • the motion generation of the speaker's head mainly processes a period of time-varying signal input in the dotted box together with the reference image, the speaker and the emotion through the head motion generation model, and the processed result is rendered through
  • the head rendering model, which outputs the moving image frames of the speaker's head (including the face) shown in the dotted box.
  • the terminal can collect audio sequences and video sequences of real objects through a collection device.
  • the collection device may be a device that collects video and audio, such as a camera, but the invention is not limited thereto;
  • the real object may be a person speaking in a scene;
  • the audio sequence and the video sequence may be obtained, in a tourist attraction scenario, while a tourist makes inquiries to self-service consultation equipment (the carrier of the virtual object).
  • the present invention is applied to situations that require human-computer interaction. For example, intelligent consulting equipment in a shopping mall can generate corresponding videos based on videos of shoppers to guide the shoppers.
  • the audio and video sequences include audio sequences of real objects and video sequences of real objects.
  • feature extraction can be implemented through a neural network model.
  • the process of feature extraction is: input each frame of the video sequence into the neural network, perform feature extraction through multiple convolution layers and pooling layers, and obtain multiple video features.
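  • As an illustration of the convolution-and-pooling extraction just described, the following is a minimal sketch; the layer sizes and the FrameEncoder name are assumptions for illustration rather than the network specified by the embodiment.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Illustrative per-frame feature extractor: stacked convolution and pooling layers."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # downsample after each convolution block
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),              # global pooling to a fixed-size vector
        )
        self.proj = nn.Linear(64, feature_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, H, W) -> (num_frames, feature_dim)
        x = self.backbone(frames).flatten(1)
        return self.proj(x)

# Example: 8 RGB frames of size 224x224 -> 8 video feature vectors
features = FrameEncoder()(torch.randn(8, 3, 224, 224))
print(features.shape)  # torch.Size([8, 128])
```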
  • Anthropomorphic features refer to features obtained by characterizing the body movements that accompany people's speech in real scenes.
  • Anthropomorphic features are features that include audio features and video features;
  • anthropomorphic features are audio features and video features obtained after feature extraction of audio and video sequences. For example, in a video, the speaker raises his hand while speaking, and the anthropomorphic features can be the features corresponding to the hand-raising motion.
  • the terminal can perform feature extraction on the video sequence of the real object through an encoder to obtain multiple video features; perform feature extraction on the audio sequence of the real object through the encoder to obtain multiple audio features, where the audio features include loudness, zero-crossing rate and cepstral coefficients; and perform feature transformation on the multiple video features and multiple audio features through the feature fusion function to determine the anthropomorphic features.
  • Figure 4 is an optional flow diagram 2 of a video generation method provided by an embodiment of the present invention.
  • S102 can be implemented through S1021-S1023, as follows:
  • the video feature is a feature obtained by recording head rotation and facial expression changes during a person's communication process.
  • Video features are features obtained after feature extraction from video sequences; video features include posture and expression.
  • the terminal can perform feature extraction on each video frame of the video sequence of the real object through the face reconstruction model to obtain multiple video frame features; the second posture expression features are used as the video features.
  • Figure 5 is an optional flow diagram 3 of a video generation method provided by an embodiment of the present invention.
  • S1021 can be implemented through S10211 and S10212, as follows:
  • the video frame features record factors such as the head rotation, facial expression and shooting conditions of the character in each video frame of the video sequence.
  • Video frame features include second identity features and second gesture expression features.
  • the second identity feature is the result of recording the environmental factors of the video sequence shooting and the identity information of the photographed subject. The second identity feature includes the identity mark, material and lighting of the real object.
  • the second gesture expression feature is the result of recording the real subject's head rotation and facial expression changes when he or she is speaking.
  • the second posture expression feature includes the head posture and facial expression of the real object.
  • the face reconstruction model generally selects a 3D face reconstruction model, and the present invention is not limited thereto.
  • the terminal can perform feature extraction on the character and background in each video frame of the video sequence of the real object through the face reconstruction model; for each video frame, the identity mark of the character, the material of the video frame and the lighting during shooting are used as the second identity feature, and the head posture and facial expression of the character are used as the second posture expression feature; all of these features (i.e., the identity mark of the character, the material of the video frame, the lighting during shooting, the character's head posture and the character's facial expression) together form one video frame feature.
  • in the face reconstruction model, one parameter represents the identity of the current face, one represents the expression of the person's face, one represents the material of the video frame, p represents the posture of the character's face, and one represents the lighting during shooting; all of these features combined together form one video frame feature. These parameters are divided into two categories.
  • the terminal can perform feature extraction on each video frame of the video sequence of the real object through the face reconstruction model to obtain multiple video frame features; the video sequence of the real object All the corresponding second posture expression features in the video are used as video features.
  • the inherent identity features (such as the identity mark of the person, the material of the video frame and the lighting during shooting) are removed, and only the common features are retained, which improves the effectiveness of feature extraction and provides data support for the subsequent generation of virtual object video sequences.
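  • A minimal sketch of the two-way split described above may help; the coefficient dimensions and the FaceCoefficients container are illustrative assumptions, since the embodiment does not fix them here.

```python
from dataclasses import dataclass
import torch

@dataclass
class FaceCoefficients:
    """Per-frame output of a face reconstruction model, split into two groups (illustrative)."""
    identity: torch.Tensor    # identity mark of the person
    material: torch.Tensor    # material of the video frame
    lighting: torch.Tensor    # lighting during shooting
    pose: torch.Tensor        # head posture (rotation and translation)
    expression: torch.Tensor  # facial expression

    def identity_feature(self) -> torch.Tensor:
        # Second identity feature: inherent, shot-specific factors that are removed here.
        return torch.cat([self.identity, self.material, self.lighting], dim=-1)

    def gesture_expression_feature(self) -> torch.Tensor:
        # Second posture expression feature: what is kept as the video feature.
        return torch.cat([self.pose, self.expression], dim=-1)

# Example with assumed dimensions
coeff = FaceCoefficients(torch.zeros(80), torch.zeros(80), torch.zeros(27),
                         torch.zeros(6), torch.zeros(64))
video_feature = coeff.gesture_expression_feature()  # used downstream; the identity part is dropped
```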
  • the audio features are features that accompany the speaker's speech during human communication.
  • Audio features are features obtained after feature extraction of audio sequences; audio features include corresponding loudness, zero-crossing rate, and cepstrum coefficients when real objects speak.
  • the terminal can perform feature extraction on the frequency spectrum of the real object's audio sequence through the encoder to obtain the loudness, zero-crossing rate and cepstrum coefficient of each moment of the real object's speech; the loudness, zero-crossing rate and cepstral coefficients are regarded as audio features, thereby obtaining multiple audio features.
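  • A sketch of per-moment audio feature extraction follows, using librosa as one possible toolkit; the sampling rate and frame/hop sizes are illustrative assumptions.

```python
import numpy as np
import librosa

def extract_audio_features(wav_path: str, sr: int = 16000,
                           frame_length: int = 1024, hop_length: int = 512) -> np.ndarray:
    """Return one feature vector per moment: loudness (RMS), zero-crossing rate, cepstral coefficients."""
    y, sr = librosa.load(wav_path, sr=sr)
    loudness = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)          # (1, T)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length, hop_length=hop_length)  # (1, T)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)                      # (13, T)
    T = min(loudness.shape[1], zcr.shape[1], mfcc.shape[1])
    # One row per moment: [loudness, zero-crossing rate, 13 cepstral coefficients]
    return np.concatenate([loudness[:, :T], zcr[:, :T], mfcc[:, :T]], axis=0).T  # (T, 15)
```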
  • the terminal can perform feature extraction based on the audio and video sequences generated when real objects speak, and obtain corresponding multiple audio features and multiple video features; it can quickly extract effective features of audio and effective features of video, improving The speed of feature extraction and the effectiveness of features.
  • the anthropomorphic features are multi-frame anthropomorphic features corresponding to the audio and video sequences; the anthropomorphic features include video features and audio features; the feature fusion function can perform a nonlinear conversion on the features.
  • the terminal performs nonlinear feature conversion on multiple audio features and multiple video features through the multi-modal feature fusion function in the encoder to obtain the feature representation of the real object, that is, to determine the anthropomorphism feature.
  • the terminal can improve the speed of processing audio and video sequences by extracting features from the audio and video sequences; it quickly obtains multiple video features and multiple audio features, and fuses the multiple video features and multiple audio features through the feature fusion function; converting them in this way changes the representation of the features and makes them easier for the terminal to process, which improves the feasibility of processing video features and audio features.
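  • A minimal sketch of the multi-modal fusion step: audio and video features for the same moment are concatenated and passed through a small nonlinear mapping. The MLP structure is an assumption; the embodiment only states that the fusion function performs a feature transformation.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Nonlinear fusion of per-frame video features and per-moment audio features."""
    def __init__(self, video_dim: int, audio_dim: int, out_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(video_dim + audio_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim), nn.Tanh(),   # nonlinear transformation
        )

    def forward(self, video_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (T, video_dim), audio_feats: (T, audio_dim) -> anthropomorphic features (T, out_dim)
        return self.mlp(torch.cat([video_feats, audio_feats], dim=-1))

# Example: 32 time steps, 70-dimensional video features, 15-dimensional audio features
fusion = MultiModalFusion(video_dim=70, audio_dim=15)
anthropomorphic = fusion(torch.randn(32, 70), torch.randn(32, 15))  # (32, 256)
```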
  • the video sequence of the virtual object is a video sequence of the corresponding response of the virtual object generated based on the audio and video sequence of the real object;
  • the preset standard features are the features corresponding to the reference object;
  • the preset standard features include the first posture expression feature and the first identity feature of the reference object;
  • the first posture expression feature is the head posture and facial expression of the reference object;
  • the first identity feature is the identity-related information corresponding to the reference object.
  • the first characteristic is the emotional attitude characteristic of a person when speaking.
  • the first characteristic represents different attitudes.
  • the first characteristic can be a positive attitude, a negative attitude, or an ordinary attitude.
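  • One simple way to feed the first feature (the attitude) into the network as a condition is a learned embedding over the three attitude labels; this encoding is an illustrative assumption, not a requirement of the embodiment.

```python
import torch
import torch.nn as nn

ATTITUDES = {"positive": 0, "negative": 1, "ordinary": 2}

class AttitudeEncoder(nn.Module):
    """Map an attitude label to a conditioning vector for the virtual prediction network."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.embedding = nn.Embedding(len(ATTITUDES), dim)

    def forward(self, attitude: str) -> torch.Tensor:
        idx = torch.tensor([ATTITUDES[attitude]])
        return self.embedding(idx).squeeze(0)  # (dim,)

e = AttitudeEncoder()("ordinary")  # first feature used as a condition during prediction
```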
  • the terminal can obtain preset standard features; based on the first gesture expression feature, the anthropomorphic features and the first feature, perform prediction and decoding through the virtual prediction network to determine the multi-frame gesture expression features of the virtual object; and generate a video sequence of the virtual object through the multi-frame gesture expression features and the first identity feature of the virtual object.
  • the virtual object generation task aims to generate a virtual object video sequence at time step t+1, which can be expressed by formula (1).
  • Figure 6 is an optional flow diagram 4 of a video generation method provided by an embodiment of the present invention.
  • S103 can be implemented through S1031-S1033, as follows:
  • the terminal can obtain a standard image; perform feature extraction on the standard image through a face reconstruction model to obtain the first gesture expression feature and the first identity feature; and use the first gesture expression feature and the first identity feature as the preset standard features.
  • S1031 can be implemented through S10311 and S10312, as follows:
  • the standard image is an image of a reference object; the standard image is any face image obtained randomly in the image library that is different from the real object.
  • the terminal can perform feature extraction on the standard image through the face reconstruction model to obtain the identity mark of the person in the standard image, the material of the video frame, the lighting during shooting, the person's head posture and the person's facial expression.
  • the identity of the object, the material of the video frame and the lighting during shooting are used as the first identity feature;
  • the head posture of the person in the standard image and the expression of the person's face are used as the first posture expression feature;
  • the first identity feature and the first Posture expression features are used as preset standard features.
  • the terminal can obtain standard images; by extracting features from the standard images through the face reconstruction model, the first posture expression features can be quickly and accurately obtained, which improves the accuracy and efficiency of standard image processing; the first posture expression Features can be used to generate video sequences of virtual objects, ensuring the accuracy of generating video sequences of virtual objects.
  • the terminal can perform prediction through the first processing module using the first frame of anthropomorphic features among the multi-frame anthropomorphic features, the first gesture expression feature and the first feature, to obtain the next predicted video frame; decode the next predicted video frame through the second processing module to determine the next posture expression feature of the virtual object corresponding to the next predicted video frame; and, using the next posture expression feature and the next frame of anthropomorphic features among the multi-frame anthropomorphic features, continue to predict and decode until the last gesture expression feature of the virtual object corresponding to the last predicted video frame is obtained, thereby obtaining the multi-frame gesture expression features of the virtual object.
  • S1032 can be implemented through S10321, S10322 and S10323, as follows:
  • the first characteristic is any one of positive attitude, negative attitude and ordinary attitude
  • the first processing module is a coding module in the virtual prediction network
  • the function is to implement coding in prediction processing.
  • the terminal can select any one of a positive attitude, a negative attitude and a normal attitude as the first feature.
  • for example, the normal attitude is selected; under the normal attitude, the first frame of anthropomorphic features among the multi-frame anthropomorphic features and the first gesture expression feature are input to the first processing module in the virtual prediction network for prediction, and the next predicted video frame is obtained.
  • the second processing module is a decoding module in the virtual prediction network; its function is to implement decoding in prediction processing.
  • the terminal can decode the next predicted video frame through the second processing module in the virtual prediction network to obtain the next gesture expression feature of the virtual object corresponding to the next predicted video frame;
  • the next gesture Expression features include next posture features and next expression features.
  • the first gesture expression feature is used as the first frame in the multi-frame gesture expression feature.
  • the terminal inputs the next gesture expression feature and the second frame of anthropomorphic features in the multi-frame anthropomorphic features to the first processing module in the virtual prediction network for prediction to obtain the next predicted video frame; decodes through the second processing module in the virtual prediction network to determine the next posture expression feature of the virtual object corresponding to the next predicted video frame; and continues to predict and decode through the virtual prediction network until the last gesture expression feature of the virtual object corresponding to the last predicted video frame is obtained, thereby determining the multi-frame gesture expression features of the virtual object.
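  • A sketch of the autoregressive prediction-and-decoding loop described above, with an LSTM cell standing in for the first processing module and a small linear decoder for the second (a long short-term memory encoder is mentioned later in the description); all dimensions and names are otherwise assumptions.

```python
import torch
import torch.nn as nn

class ListenerPredictor(nn.Module):
    """First processing module (LSTM cell) + second processing module (decoder), illustrative only."""
    def __init__(self, anthro_dim: int, gesture_dim: int, attitude_dim: int, hidden: int = 256):
        super().__init__()
        self.cell = nn.LSTMCell(anthro_dim + gesture_dim + attitude_dim, hidden)
        self.decoder = nn.Linear(hidden, gesture_dim)  # decodes a predicted frame into gesture expression features

    def forward(self, anthro_seq, first_gesture, attitude):
        # anthro_seq: (T, anthro_dim); first_gesture: (gesture_dim,); attitude: (attitude_dim,)
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        gestures = [first_gesture]                       # first gesture expression feature is the first frame
        for t in range(anthro_seq.shape[0]):
            x = torch.cat([anthro_seq[t], gestures[-1], attitude]).unsqueeze(0)
            h, c = self.cell(x, (h, c))                  # predict the next video frame representation
            gestures.append(self.decoder(h).squeeze(0))  # decode it into the next gesture expression feature
        return torch.stack(gestures)                     # (T + 1, gesture_dim) multi-frame gesture expression features

preds = ListenerPredictor(256, 70, 16)(torch.randn(32, 256), torch.zeros(70), torch.randn(16))
```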
  • the terminal can make predictions through the first frame anthropomorphic features, the first posture expression features and the first feature.
  • the first feature can improve the diversity of generated predicted video frames.
  • the first frame of anthropomorphic features and the first posture expression features can improve the accuracy of generating predicted video frames; the next predicted video frame is generated based on the first predicted video frame, and each subsequent video frame is predicted from the continuously updated predicted video frame.
  • Continuous video frames can thus be generated, improving the accuracy of the video frames.
  • the continuity between the video frames ensures the integrity of the video sequence; by decoding multiple predicted video frames to obtain multi-frame posture expression features, the head reaction and facial reaction of the virtual object when hearing the audio and video of the real object can be fully reflected, which improves the accuracy of generating video sequences of virtual objects.
  • the virtual object is a simulated object with a simple communication function, which generally exists in an interactive device.
  • the terminal can fuse each frame of gesture expression features in the multi-frame gesture expression features of the virtual object with the first identity feature respectively to obtain multiple second features; the multiple second features are passed through a renderer to generate a video sequence of the virtual object.
  • the terminal can obtain preset standard features; performing prediction and decoding through the virtual prediction network using the first gesture expression feature among the preset standard features, the anthropomorphic features and the first feature to determine the multi-frame gesture expression features of the virtual object can improve the accuracy of the multi-frame gesture expression features; generating the virtual object's video sequence based on the virtual object's multi-frame gesture expression features and first identity feature can improve the accuracy and vividness of the virtual object's video sequence.
  • S1033 can be implemented through S10331 and S10332, as follows:
  • the second feature is a fusion feature formed by assigning gesture expression features to identity features.
  • the second feature is the fusion result of identity features and posture and expression features.
  • the terminal can fuse each of the multi-frame gesture expression features of the virtual object with the first identity feature to obtain multiple second features corresponding to the multi-frame gesture expression features.
  • the first identity feature includes a first identity mark, a first material, and first lighting information; the first identity mark corresponds to the identity mark of the person in the standard image, and the first material corresponds to the material of the video frame. and the first illumination information corresponds to the illumination at the time of shooting.
  • the terminal may generate a video sequence of the virtual object from a plurality of second features through a renderer.
  • the video sequence generation task of virtual objects can be expressed by formula (2) and formula (3).
  • one term is the pose characteristics of the virtual object predicted by the virtual prediction network; the other is the feature of the virtual object associated with its identity, which is used together with the predicted posture features to generate a video sequence of the virtual object through the renderer.
  • the terminal can obtain multiple second features by fusing each frame of gesture expression features among the multi-frame gesture expression features of the virtual object with the first identity feature respectively; giving the second features identity attributes improves the distinguishability of the second features, and passing the multiple second features through the renderer generates the video sequence of the virtual object in a targeted manner, making the video sequence more vivid and accurate.
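  • A sketch of this final assembly: each predicted gesture expression frame is fused with the shared first identity feature and handed to a renderer. The renderer call is a placeholder for whatever face renderer is actually used.

```python
import torch

def assemble_second_features(gesture_frames: torch.Tensor, identity_feature: torch.Tensor) -> torch.Tensor:
    """Fuse every predicted gesture expression frame with the shared first identity feature."""
    T = gesture_frames.shape[0]
    identity = identity_feature.unsqueeze(0).expand(T, -1)        # same identity for every frame
    return torch.cat([gesture_frames, identity], dim=-1)          # (T, gesture_dim + identity_dim)

def render_video(second_features: torch.Tensor, renderer) -> torch.Tensor:
    """Pass each fused (second) feature through the renderer to get one video frame per feature."""
    return torch.stack([renderer(f) for f in second_features])    # (T, 3, H, W) video sequence

# Example with a dummy renderer that simply returns a blank image per feature
dummy_renderer = lambda f: torch.zeros(3, 64, 64)
frames = render_video(assemble_second_features(torch.randn(33, 70), torch.randn(187)), dummy_renderer)
```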
  • the terminal may present a video sequence of a virtual object on the virtual human interface.
  • the terminal collects audio and video sequences of real objects; performs feature extraction on the audio and video sequences to determine anthropomorphic features; and uses the virtual prediction network and preset standard features, as well as the first features, to predict the anthropomorphic features.
  • video sequences of virtual objects are generated, and the video sequences of virtual objects presented are more vivid and accurate.
  • Figure 7 is an optional flow diagram 5 of a video generation method provided by an embodiment of the present invention. As shown in Figure 7, before executing S103, S105-S1010 are also executed, as follows:
  • S105 Collect audio and video sequence samples of real talking objects and their corresponding face images of real listening objects.
  • the audio and video sequence samples are formed by recording the speech of the real talking partner and the corresponding movements and expressions during the speech.
  • the terminal can collect audio and video sequence samples of real talking objects and their corresponding face images of real listening objects through the collection device.
  • the face images of the real listening subjects are collected under different first features.
  • the first features include positive attitude, negative attitude and ordinary attitude.
  • the anthropomorphic sample features are features obtained by characterizing the body movements that accompany people's speech in the scene corresponding to the audio and video sequence samples.
  • Anthropomorphic sample features are obtained after feature extraction of audio and video sequence samples.
  • Anthropomorphic sample features include video sample features and audio sample features.
  • the terminal can perform feature extraction on audio and video sequence samples through an initial encoder to obtain anthropomorphic sample features, where the anthropomorphic sample features include multi-frame anthropomorphic sample features.
  • predicting facial features includes predicting posture features and predicting expression features.
  • the terminal can generate predicted posture features and predicted expression features under audio and video sequence samples of training real objects through the initial virtual prediction network and anthropomorphic sample features; the predicted posture features include multi-frame predicted posture features ; Predicting expression features includes predicting expression features in multiple frames; using predicted posture features and predicted expression features as predicted face features; predicting face features includes predicting face features in multiple frames.
  • real facial features include real posture features and real expression features
  • the terminal can perform feature extraction on the face image of the real listening subject through the face reconstruction model to obtain real posture features and real expression features, and use the real posture features and real expression features as real face features .
  • the number of face images of real listening subjects is consistent with the number of multi-frame anthropomorphic features obtained through audio and video sequence samples.
  • Real posture features include multi-frame real posture features;
  • real expression features include multi-frame real expression features;
  • Real face features also include multi-frame real face features, and the number of real face features is consistent with the number of predicted face features.
  • the terminal can continuously optimize the initial encoder through the first loss function and the anthropomorphic sample features. If the first loss function value is greater than or equal to the first preset threshold, the encoder can be considered to have been trained, and the encoder is determined; if the first loss function value is less than the first preset threshold, training of the encoder continues until the first loss function value is greater than or equal to the first preset threshold, and the encoder is determined.
  • the terminal can use the real facial features and the predicted facial features to continuously optimize the initial virtual prediction network through the second loss function and the third loss function. If the sum of the second loss function value and the third loss function value is greater than or equal to the second preset threshold, the virtual prediction network is determined; if the sum of the second loss function value and the third loss function value is less than the second preset threshold, training of the virtual prediction network continues until the sum of the second loss function value and the third loss function value is greater than or equal to the second preset threshold, and the virtual prediction network is determined.
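  • A schematic of the threshold-based optimization described above; the optimizer, thresholds, loss callables and the meets predicate (whatever convergence criterion is chosen) are placeholders, not the embodiment's exact procedure.

```python
import torch

def train_until_threshold(model, loss_fn, data_iter, threshold, meets, lr=1e-4, max_steps=100000):
    """Optimize a network until its loss value meets the preset threshold (generic sketch)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step, batch in zip(range(max_steps), data_iter):
        loss = loss_fn(model, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if meets(loss.item(), threshold):  # "meets the preset threshold" -- chosen convergence criterion
            break
    return model

# Stage 1: optimize the initial encoder with the first loss function.
# encoder = train_until_threshold(initial_encoder, first_loss, samples, first_threshold, meets)
# Stage 2: optimize the initial virtual prediction network with the second + third loss functions.
# predictor = train_until_threshold(initial_predictor,
#                                   lambda m, b: second_loss(m, b) + third_loss(m, b),
#                                   samples, second_threshold, meets)
```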
  • Figures 8a and 8b are respectively Figure 1 and Figure 2 of the video sequence results of a virtual object using a video generation method provided by an embodiment of the present invention.
  • the abscissa represents continuous video frames, including 0-32 frames;
  • Figure 8a shows the video sequence results of generating virtual objects in the domain (meaning that the training data contains the talking person or the listening person in the test data set).
  • Figure 8b shows the video sequence results of generating virtual objects out of the domain (meaning that the face data of the talking person and the listening person never appeared in the training set; this mainly tests the generalization ability of the model on unseen human faces).
  • the real listener in Figure 8a has a positive attitude
  • the real listener in Figure 8b has a natural attitude
  • Rows 4, 5, and 6 in Figure 8a and Rows 4, 5, and 6 in Figure 8b respectively show the video sequence results of virtual objects generated conditioned on three different attitudes.
  • the generated listeners in Figure 8a and Figure 8b are virtual objects.
  • frame 0 is the reference frame.
  • the different triangles in the figure mark significant changes: upper right, vision changes with eye movement; lower left, head movement; lower right, the representative frame of each row; upper left, frames (in the column direction) that differ significantly under different attitudes.
  • the virtual prediction network is able to capture universal listener (equivalent to virtual object) patterns (such as eye, mouth and head movements, etc.), which may be different from real listeners, but are still meaningful.
  • the virtual prediction network is able to present the visual patterns of virtual objects under different attitudes. From the results shown in Figure 8a, it can be seen that although the virtual object with the neutral attitude also laughs (frames 2-8), the laugh lasts shorter than under the positive attitude (frames 2-16). Under the negative attitude, the virtual object did not pay attention to the conversation and looked toward the lower side of the screen at frames 10, 16, 22, and 30.
  • Attitude classification test: given a generated video sequence of a virtual object, volunteers need to determine its emotion (positive, negative, natural). It should be noted that natural is equivalent to the ordinary attitude.
  • Table 1 summarizes the mean and variance of the two “best listeners” numbers.
  • volunteers voted that nearly 20% of the generated virtual objects looked more reasonable than real listeners, validating that the model can generate responsive listeners that are consistent with human subjective perception.
  • results generated in out-of-domain data were liked by more volunteers.
  • the terminal can collect the audio and video sequence samples of the real talking object and the corresponding face image of the real listening object.
  • the face images of the real listening objects are collected under different first features, which can ensure the diversity of the training samples; by optimizing and training the initial encoder and the initial virtual prediction network through the audio and video sequence samples and their corresponding face images of the real listening objects, the encoder and the virtual prediction network are determined, which can improve the accuracy of the output results of the encoder and the virtual prediction network.
  • Figure 9 is an optional flow chart 6 of a video generation method provided by an embodiment of the present invention. As shown in Figure 9, S1011-S1013 are also executed before executing S1010, as follows:
  • the second loss function is used to ensure that the predicted expression and predicted posture are similar to the real expression and real posture.
  • the terminal can perform a difference modulus operation based on the real facial features and the predicted facial features to determine the second loss function.
  • the second loss function can be obtained by formula (4).
  • the generation result of the last frame of the initial virtual prediction network, that is, the posture and expression features of the virtual object corresponding to the last predicted video frame (equivalent to the last frame of predicted facial features), is discarded.
  • the third loss function is used to ensure that the inter-frame continuity of the predicted facial features is similar to the real facial features.
  • the terminal can perform a difference modulus operation through the change function corresponding to the real facial feature and the change function corresponding to the predicted facial feature to determine the third loss function.
  • the third loss function can be obtained by formula (5).
  • the terminal can determine the loss function of the virtual prediction network as the sum of the second loss function and the third loss function, and continue to optimize the initial virtual prediction network through this loss function until the loss function value (equivalent to the sum of the second loss function value and the third loss function value) meets the second preset threshold, whereupon the virtual prediction network is determined.
  • the loss function of the virtual prediction network can be obtained by formula (6).
  • the terminal can determine the second loss function based on the real facial features and the predicted facial features, and determine the third loss function based on the change function corresponding to the real facial features and the change function corresponding to the predicted facial features,
  • which improves the effectiveness of the loss function of the virtual prediction network; the initial virtual prediction network is continuously optimized through the second loss function and the third loss function until the sum of the second loss function value and the third loss function value meets the second preset threshold, whereupon the virtual prediction network is determined; this improves the prediction accuracy and prediction effect of the virtual prediction network.
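  • The two losses are described as (i) a difference between the predicted and real pose/expression features and (ii) a difference between their frame-to-frame changes; the exact form of formulas (4) to (6) is not reproduced in this text, so the L1 differences below are an assumption.

```python
import torch
import torch.nn.functional as F

def second_loss(pred_face: torch.Tensor, real_face: torch.Tensor) -> torch.Tensor:
    """Keep predicted posture/expression close to the real posture/expression (per frame)."""
    return F.l1_loss(pred_face, real_face)

def third_loss(pred_face: torch.Tensor, real_face: torch.Tensor) -> torch.Tensor:
    """Keep inter-frame continuity of the prediction close to that of the real sequence."""
    pred_delta = pred_face[1:] - pred_face[:-1]    # change between consecutive predicted frames
    real_delta = real_face[1:] - real_face[:-1]    # change between consecutive real frames
    return F.l1_loss(pred_delta, real_delta)

def total_loss(pred_face, real_face):
    # Overall training objective of the virtual prediction network: sum of the two terms.
    return second_loss(pred_face, real_face) + third_loss(pred_face, real_face)

# pred_face / real_face: (T, feature_dim) multi-frame posture + expression features
loss = total_loss(torch.randn(32, 70), torch.randn(32, 70))
```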
  • Figure 10 is an optional model architecture diagram of a video generation method provided by an embodiment of the present invention.
  • the terminal obtains the speaker's video (equivalent to a video sequence of a real object) and preprocesses the speaker video through the speaker encoder (equivalent to the encoder) to obtain continuous video frames; feature extraction is performed on each frame of the speaker video through the face reconstruction model to obtain the identity mark, material, lighting, expression and posture corresponding to each frame; the expression and posture are used as video features.
  • the speaker's audio is processed through the encoder to obtain the audio at multiple moments; features are extracted from the audio at each moment among the audio at multiple moments to obtain the corresponding audio features at each moment.
  • the listener decoder in the terminal obtains the reference image of the listener (equivalent to the standard image), and extracts features from the reference image through the face reconstruction model to obtain the identity mark, material, lighting, expression and posture (the expression and posture are shown in Figure 10); the identity mark, material and lighting are used as the first identity feature, and the expression and posture are used as the first posture expression feature.
  • the first gesture expression feature is input into the long short-term memory network encoder (equivalent to the first processing module), and the attitude e (equivalent to the first feature) is combined with the anthropomorphic features to generate multi-frame gesture expression features, t+1 frames in total. The first identity feature is shared with the decoder, and the decoder (equivalent to the second processing module) fuses each frame of gesture expression features among the multi-frame gesture expression features with the first identity feature to obtain the virtual object video sequence.
  • the speaker's audio features s_t and video features are first extracted; then a multi-modal feature fusion function f_am is used to perform a nonlinear feature transformation to obtain the anthropomorphic features.
  • the attitude e and the features of the listener's reference image (equivalent to the first gesture expression feature) are used as the first frame of the virtual object video sequence.
  • at each time step t, the speaker's fused features (equivalent to the anthropomorphic features) are taken as input to generate the predicted video frame of step t+1.
  • the predicted video frame is decoded using the listener decoder; the result contains two feature vectors, one representing the expression and one representing the posture (rotation and translation).
  • the terminal supports speaker input of any length. This process can be expressed by formula (7):
  • the component unit is used to generate the multi-frame gesture expression features; one of the terms represents the fusion features; h_t represents the predicted video frame; and c_t represents the stored predicted video frames.
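  • Tying the components of Figure 10 together, the following is a hedged wiring sketch in which every callable is an illustrative placeholder for the corresponding module (speaker encoder, fusion function, listener decoder, renderer), not the exact implementation.

```python
import torch

def generate_listener_video(speaker_video_frames, speaker_audio, reference_image, attitude,
                            frame_encoder, audio_encoder, fusion, face_reconstruct,
                            predictor, renderer):
    """End-to-end flow of Figure 10 with the components passed in as callables (illustrative wiring only)."""
    # Speaker encoder side: per-frame video features + per-moment audio features, fused per time step.
    video_feats = frame_encoder(speaker_video_frames)          # (T, video_dim) posture + expression per frame
    audio_feats = audio_encoder(speaker_audio)                 # (T, audio_dim) loudness, ZCR, cepstral coeffs
    anthropomorphic = fusion(video_feats, audio_feats)         # (T, fused_dim) speaker representation

    # Listener decoder side: the reference image provides the first gesture expression and identity features.
    first_gesture, identity = face_reconstruct(reference_image)
    gestures = predictor(anthropomorphic, first_gesture, attitude)   # (T + 1, gesture_dim)

    # Fuse each predicted frame with the shared identity feature and render the listener video.
    n = gestures.shape[0]
    fused = torch.cat([gestures, identity.unsqueeze(0).expand(n, -1)], dim=-1)
    return torch.stack([renderer(f) for f in fused])           # (T + 1, 3, H, W)
```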
  • the terminal can generate a virtual object video sequence based on the collected audio and video sequence of the real object through speaker encoder and listener decoder, making the virtual object video sequence more vivid and accurate.
  • Figure 11 is a schematic structural diagram of a video generation device provided by an embodiment of the present invention.
  • the device 11 includes: an obtaining part 1101, a determining part 1102 and a generating part 1103; where,
  • the acquisition part 1101 is configured to collect audio and video sequences of real objects
  • the determination part 1102 is configured to perform feature extraction on the audio and video sequence and determine anthropomorphic features
  • the generation part 1103 is configured to use a virtual prediction network, preset standard features, and first features to predict the anthropomorphic features and generate a virtual object video sequence;
  • the preset standard features are features corresponding to the reference object;
  • the first feature represents different attitudes;
  • the video sequence of the virtual object is a video sequence that generates a corresponding reaction of the virtual object based on the audio and video sequence of the real object; and presents the video sequence of the virtual object.
  • the acquisition part 1101 is configured to acquire preset standard features; the preset standard features include first gesture expression features and first identity features;
  • the determining part 1102 is configured to predict and decode through the virtual prediction network based on the first gesture expression feature, the anthropomorphic feature and the first feature, and determine the multi-frame gesture expression feature of the virtual object ;
  • the generating part 1103 is configured to generate a video sequence of the virtual object based on the multi-frame gesture expression features of the virtual object and the first identity feature.
  • the acquisition part 1101 is configured to acquire a standard image; the standard image represents an image of a reference object; feature extraction is performed on the standard image through a face reconstruction model to obtain the preset standard features.
  • the anthropomorphic features include: multi-frame anthropomorphic features corresponding to the audio and video sequence;
  • the virtual prediction network includes: a first processing module and a second processing module; the determining part 1102 is configured to perform prediction through the first processing module based on the first frame of anthropomorphic features among the multi-frame anthropomorphic features, the first gesture expression feature and the first feature, to obtain the next predicted video frame;
  • the first feature is one of a positive attitude, a negative attitude and a normal attitude; the next predicted video frame is decoded through the second processing module to determine the corresponding next gesture expression feature of the virtual object; based on the next gesture expression feature and the next frame of anthropomorphic features among the multi-frame anthropomorphic features, prediction and decoding continue until the last gesture expression feature of the virtual object corresponding to the last predicted video frame is obtained, thereby obtaining the multi-frame gesture expression features of the virtual object; the first gesture expression feature is used as the first frame of the multi-frame gesture expression features.
  • the acquisition part 1101 is configured to fuse each frame of gesture expression features in the multi-frame gesture expression features with the first identity feature to obtain a plurality of second features.
  • the second feature represents the fusion result of identity feature and gesture expression feature;
  • the generating part 1103 is configured to pass the plurality of second features through a renderer to generate the video sequence of the virtual object;
  • the first identity feature includes a first identity mark, a first material and first lighting information.
  • the audio and video sequences include: an audio sequence of the real object and a video sequence of the real object; the acquisition part 1101 is configured to perform feature extraction on the video sequence of the real object through an encoder to obtain multiple video features, and to perform feature extraction on the audio sequence of the real object through an encoder to obtain multiple audio features; the audio features include loudness, zero-crossing rate and cepstral coefficients;
  • the determination part 1102 is configured to perform feature conversion through a feature fusion function based on the multiple video features and the multiple audio features to determine the anthropomorphic features; the anthropomorphic features are multi-frame anthropomorphic features corresponding to the audio and video sequence; the anthropomorphic features include video features and audio features.
  • the acquisition part 1101 is configured to perform feature extraction on each video frame of the video sequence of the real object through a face reconstruction model to obtain the multiple video frame features;
  • the video frame features include second identity features and second gesture expression features;
  • the determining part 1102 is configured to use all the second gesture expression features corresponding to the video sequence of the real object as the video features.
  • the acquisition part 1101 is configured to collect audio and video sequence samples of real talking objects and their corresponding face images of real listening objects;
  • the determination part 1102 is configured to perform feature extraction on the audio and video sequence samples through an initial encoder to determine the anthropomorphic sample features;
  • the generation part 1103 is configured to generate, through the initial virtual prediction network and the anthropomorphic sample features, predicted facial features under the audio and video sequence samples of the training real object; the predicted facial features include predicted posture features and predicted expression features;
  • the determination part 1102 is configured to perform feature extraction through a face reconstruction model based on the face image of the real listening object, and determine real face features, where the real face features include real posture features and real expression features;
  • the initial encoder is continuously optimized through the first loss function and the anthropomorphic sample features until the first loss function value meets the first preset threshold, and the encoder is determined; based on the real facial features and the predicted facial features, the initial virtual prediction network is continuously optimized through the second loss function and the third loss function until the sum of the second loss function value and the third loss function value meets the second preset threshold, and the virtual prediction network is determined.
  • the determining part 1102 is configured to determine a second loss function based on the real facial features and the predicted facial features; the second loss function is used to ensure that the predicted expression and predicted posture are similar to the real expression and real posture; a third loss function is determined based on the change function corresponding to the real facial features and the change function corresponding to the predicted facial features; the third loss function is used to ensure that the inter-frame continuity of the predicted facial features is similar to that of the real facial features;
  • through the second loss function and the third loss function, the initial virtual prediction network is continuously optimized until the sum of the second loss function value and the third loss function value satisfies the second preset threshold, and the virtual prediction network is determined.
  • Figure 12 is a second schematic structural diagram of a video generation device provided by an embodiment of the present invention.
  • the device 12 includes: a processor 1201 and a memory 1202; the memory 1202 stores one or more programs executable by the processor; when the one or more programs are executed, any video generation method of the aforementioned embodiments is executed by the processor 1201.
  • embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, etc.) embodying computer-usable program code therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, and the instruction means implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • Embodiments of the present invention provide a video generation method and device, and a computer-readable storage medium.
  • the method includes: collecting an audio and video sequence of a real object; performing feature extraction on the audio and video sequence to determine anthropomorphic features; predicting the anthropomorphic features by using a virtual prediction network, preset standard features and a first feature to generate a video sequence of a virtual object; the preset standard features are the features corresponding to a reference object; the first feature represents different attitudes; the video sequence of the virtual object is a video sequence of the virtual object's corresponding response generated based on the audio and video sequence of the real object; and the video sequence of the virtual object is presented.
  • Embodiments of the present invention generate video sequences of virtual objects based on audio and video sequences of real objects, and the presented video sequences of virtual objects are more vivid and accurate.
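The LSTM-based listener decoder summarized above (formula (7)) can be illustrated with a minimal code sketch. The following PyTorch snippet is only an illustration under assumptions: the class name ListenerDecoder, the feature dimensions, and the way the attitude e is injected into the initial state are choices made for the example, not details taken from the patent; only the overall flow — seed the state with the first gesture-expression feature and the attitude, run an LSTM cell over the speaker's fused features, and decode each hidden state with a head D_m into expression and pose — follows the description.

```python
# Minimal sketch of an LSTM listener decoder (assumed names and shapes; not the patented code).
import torch
import torch.nn as nn

class ListenerDecoder(nn.Module):
    def __init__(self, fused_dim=256, hidden_dim=256, exp_dim=64, pose_dim=6, num_attitudes=3):
        super().__init__()
        self.attitude_emb = nn.Embedding(num_attitudes, hidden_dim)      # e: positive / negative / normal
        self.init_proj = nn.Linear(exp_dim + pose_dim, hidden_dim)       # first gesture-expression feature -> h_0
        self.cell = nn.LSTMCell(fused_dim, hidden_dim)                   # consumes the speaker's fused features
        self.d_m = nn.Linear(hidden_dim, exp_dim + pose_dim)             # D_m: hidden state -> (expression, pose)
        self.exp_dim = exp_dim

    def forward(self, fused_seq, first_motion, attitude_id):
        # fused_seq: (T, B, fused_dim) anthropomorphic features of the speaker
        # first_motion: (B, exp_dim + pose_dim) expression + pose of the listener reference image
        # attitude_id: (B,) integer attitude label e
        h = self.init_proj(first_motion) + self.attitude_emb(attitude_id)  # condition the state on reference + attitude
        c = torch.zeros_like(h)                                            # c_t stores the running state
        outputs = []
        for f_t in fused_seq:                                              # one step per speaker frame
            h, c = self.cell(f_t, (h, c))                                  # h_{t+1}, c_{t+1} = LSTM(f_t, h_t, c_t)
            outputs.append(self.d_m(h))                                    # (expression, pose) = D_m(h_{t+1})
        motions = torch.stack(outputs)                                     # (T, B, exp_dim + pose_dim)
        return motions[..., :self.exp_dim], motions[..., self.exp_dim:]    # expressions, poses

# Example usage with random tensors.
decoder = ListenerDecoder()
expr, pose = decoder(torch.randn(10, 2, 256), torch.randn(2, 70), torch.tensor([0, 2]))
print(expr.shape, pose.shape)  # torch.Size([10, 2, 64]) torch.Size([10, 2, 6])
```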

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

本发明实施例提供了一种视频生成方法及装置、计算机可读存储介质,其中,方法包括:采集真实对象的音视频序列;对音视频序列进行特征提取,确定拟人化特征;利用虚拟预测网络、预设的标准特征,以及第一特征对拟人化特征进行预测,生成虚拟对象的视频序列;预设的标准特征为参考对象对应的特征;第一特征表征不同的态度;虚拟对象的视频序列是根据真实对象的音视频序列生成虚拟对象相应反应的视频序列;呈现虚拟对象的视频序列。本发明实施例根据真实对象的音视频序列,生成虚拟对象的视频序列,呈现的虚拟对象的视频序列更加生动和准确。

Description

一种视频生成方法及装置、计算机可读存储介质
相关申请的交叉引用
本发明基于申请号为202210834191.6、申请日为2022年07月14日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本发明作为参考。
技术领域
本发明涉及人机交互领域,尤其涉及一种视频生成方法及装置、计算机可读存储介质。
背景技术
从人类行为学上讲,良好的沟通指的就是一种双向的沟通过程,不是单向的信息输入或者输出,而是伴随着信息的交互,真实的人和人的沟通交流是在倾听和诉说这两个状态间不断切换、循环的过程。其中,倾听与诉说是同等重要的。这两者对于构建拟人化的数字人进行人机交互是必不可少的。一方面,要求数字人要用对方明白的语言,尽量清晰、简洁、明了地表达自己的观点,另一方面,拟人化的数字人还要善于倾听和理解别人的观点。现有技术的主要针对讲者的参考图像和时变信号生成相应的讲者视频,主要做法是使用人脸关键点、人脸3D模型、人体骨架模型等将讲者参数化,再通过深度神经网络来拟合这些参数,并将这些参数渲染图像作为生成结果,生成的图像效果差。
发明内容
本发明实施例提供一种视频生成方法及装置、计算机可读存储介质,能够根据真实对象的音视频序列生成虚拟对象视频序列,提高了生成虚拟对象视频序列的生动性和准确性。
本发明的技术方案是这样实现的:
本发明实施例提供了一种视频生成方法,所述方法包括:
采集真实对象的音视频序列;
对所述音视频序列进行特征提取,确定拟人化特征;
利用虚拟预测网络、预设的标准特征,以及第一特征对所述拟人化特征进行预测,生成虚拟对象视频序列;所述预设的标准特征为参考对象对应的特征;所述第一特征表征不同的态度;所述虚拟对象的视频序列是根据真实对象的音视频序列生成虚拟对象相应反应的视频序列;
呈现所述虚拟对象的视频序列。
上述方案中,所述利用虚拟预测网络、预设的标准特征,以及第一特征对所述拟人化特征进行预测,生成虚拟对象的视频序列,包括:
获取预设的标准特征;所述预设的标准特征包括第一姿态表情特征和第一身份特征;
基于所述第一姿态表情特征、所述拟人化特征和所述第一特征,通过所述虚拟预测网络进行预测和解码,确定虚拟对象的多帧姿态表情特征;
基于所述虚拟对象的多帧姿态表情特征和所述第一身份特征,生成所述虚拟对象的视频序列。
上述方案中,所述获取预设的标准特征,包括:
获取标准图像;所述标准图像表征参考对象的图像;
通过人脸重建模型对所述标准图像进行特征提取,得到所述预设的标准特征。
上述方案中,所述拟人化特征包括:所述音视频序列对应的多帧拟人化特征;所述虚拟预测网络包括:第一处理模块和第二处理模块;所述基于所述第一姿态表情特征、所述拟人化特征和所述第一特征,通过所述虚拟预测网络进行预测和解码,确定虚拟对象的多帧姿态表情特征,包括:
基于所述多帧拟人化特征中的第一帧拟人化特征、所述第一姿态表情特征和所述第一特征,通过所述第一处理模块进行预测,得到下一个预测视频帧;所述第一特征为积极态度、消极态度和普通态度中的一种;
通过所述第二处理模块对所述下一个预测视频帧进行解码,确定所述下一个预测视频帧对应的所述虚拟对象的下一个姿态表情特征;
基于所述下一个姿态表情特征和所述多帧拟人化特征中的下一帧拟人化特征,继续进行预测和解码,直至得到最后一个预测视频帧对应的所述虚拟对象的最后一个姿态表情特征时为止,从而得到所述虚拟对象的多帧姿态表情特征;所述第一姿态表情特征作为所述多帧姿态表情特征中的第一帧。
上述方案中,所述基于所述虚拟对象的多帧姿态表情特征和所述第一身份特征,生成所述虚拟对象的视频序列,包括:
将所述多帧姿态表情特征中的每一帧姿态表情特征分别与所述第一身份特征进行融合,得到多个第二特征;所述第二特征表征身份特征与姿态表情特征的融合结果;
将所述多个第二特征通过渲染器,生成所述虚拟对象的视频序列;所述第一身份特征包括第一身份标识、第一材质和第一光照信息。
上述方案中,所述音视频序列包括:真实对象的音频序列和真实对象的视频序列;所述对所述音视频序列进行特征提取,确定拟人化特征,包括:
通过编码器对所述真实对象的视频序列进行预处理,得到多个视频特征;
通过编码器对所述真实对象的音频序列进行特征提取,得到多个音频特征;所述音频特征包括响度、过零率和倒频谱系数;
基于所述多个视频特征和所述多个音频特征,通过特征融合函数进行特征转换,确定所述拟人化特征;所述拟人化特征为音视频序列对应的多帧拟人化特征;所述拟人化特征包括视频特征和音频特征。
上述方案中,所述通过编码器对所述真实对象的视频序列进行特征提取,得到多个视频特征,包括:
通过人脸重建模型对所述真实对象的视频序列的每一帧视频帧进行特征提取,得到多个视频帧特征;所述视频帧特征包括第二身份特征和第二姿态表情特征;
将所述真实对象的视频序列中对应的所有所述第二姿态表情特征作为所述视频特征。
上述方案中,所述利用虚拟预测网络和预设的标准特征,以及第一特征对所述拟人化特征进行预测,生成虚拟对象的视频序列之前,所述方法还包括:
采集真实倾诉对象的音视频序列样本,及其对应的真实倾听对象的人脸图像;
通过初始编码器对所述样本音视频序列进行特征提取,确定拟人化样本特征;
通过初始虚拟预测网络和所述拟人化样本特征,生成训练真实对象的音视频序列样本下的预测人脸特征;所述预测人脸特征包含预测姿态特征和预测表情特征;
根据所述真实倾听对象的人脸图像,通过人脸重建模型进行特征提取,确定真实人脸特征,所述真实人脸特征包括真实姿态特征和真实表情特征;
通过第一损失函数和所述拟人化样本特征不断优化所述初始编码器,直到第一损失函数值满足第一预设阈值,确定所述编码器;
基于所述真实人脸特征和预测人脸特征,通过第二损失函数和第三损失函数不断优化初始虚拟预测网络,直到第二损失函数值和第三损失函数值满足第二预设阈值,确定所述虚拟预测网络。
上述方案中,所述基于所述真实人脸特征和预测人脸特征,通过第二损失函数和第三损失函数不断优化初始虚拟预测网络,直到第二损失函数值和第三损失函数值满足第二预设阈值,确定所述虚拟预测网络,包括:
基于所述真实人脸特征和所述预测人脸特征,确定第二损失函数;所述第二损失函数用来保证预测表情、预测姿态与真实表情、真实姿态相似;
基于所述真实人脸特征对应的变化函数和所述预测人脸特征对应的变化函数,确定第三损失函数;所述第三损失函数用来保证预测人脸特征的帧间连续性与真实人脸特征相似;
通过所述第二损失函数和所述第三损失函数，持续优化初始虚拟预测网络，直到第二损失函数值和第三损失函数值满足第二预设阈值，确定所述虚拟预测网络。
本发明实施例提供了一种视频生成装置,其特征在于,所述视频生成装置包括获取部分、确定部分和生成部分;其中,
所述获取部分,被配置为采集真实对象的音视频序列;
所述确定部分,被配置为对所述音视频序列进行特征提取,确定拟人化特征;
所述生成部分,被配置为利用虚拟预测网络、预设的标准特征,以及第一特征对所述拟人化特征进行预测,生成虚拟对象的视频序列;所述预设的标准特征为参考对象对应的特征;所述第一特征表征不同的态度;所述虚拟对象的视频序列是根据真实对象的音视频序列生成虚拟对象相应反应的视频序列;呈现所述虚拟对象的视频序列。
本发明实施例提供了一种视频生成装置,其特征在于,所述视频生成装置包括:
存储器,用于存储可执行指令;
处理器,用于执行所述存储器中存储的可执行指令,当所述可执行指令被执行时,所述处理器执行所述的视频生成方法。
本发明实施例提供了一种计算机可读存储介质,其特征在于,存储有可执行指令,当所述可执行指令被一个或多个处理器执行的时候,所述处理器执行所述的视频生成方法。
本发明实施例提供了一种视频生成方法及装置、计算机可读存储介质,其中,方法包括:采集真实对象的音视频序列;对所述音视频序列进行特征提取,确定拟人化特征;利用虚拟预测网络、预设的标准特征,以及第一特征对所述拟人化特征进行预测,生成虚拟对象的视频序列;所述预设的标准特征为参考对象对应的特征;所述第一特征表征不同的态度;所述虚拟对象的视频序列是根据真实对象的音视频序列生成虚拟对象相应反应的视频序列;呈现所述虚拟对象的视频序列。本发明实施例根据真实对象的音视频序列,生成虚拟对象的视频序列,呈现的虚拟对象视频序列更加生动和准确。
附图说明
图1为本发明实施例提供一种视频生成方法的一个可选的终端工作示意图;
图2为本发明实施例提供一种视频生成方法的一个可选的流程示意图一;
图3a为本发明实施例提供一种视频生成方法的一个可选的讲者视频生成示意图一;
图3b为本发明实施例提供一种视频生成方法的一个可选的讲者视频生成示意图二;
图3c为本发明实施例提供一种视频生成方法的一个可选的讲者视频生成示意图三;
图4为本发明实施例提供一种视频生成方法的一个可选的流程示意图二;
图5为本发明实施例提供一种视频生成方法的一个可选的流程示意图三;
图6为本发明实施例提供一种视频生成方法的一个可选的流程示意图四;
图7为本发明实施例提供一种视频生成方法的一个可选的流程示意图五;
图8a为本发明实施例提供一种视频生成方法的虚拟对象视频序列结果图一;
图8b为本发明实施例提供一种视频生成方法的虚拟对象视频序列结果图二;
图9为本发明实施例提供一种视频生成方法的一个可选的流程示意图六;
图10为本发明实施例提供一种视频生成方法的一个可选的模型架构图;
图11为本发明实施例提供的一种视频生成装置的结构示意图一;
图12为本发明实施例提供的一种视频生成装置的结构示意图二。
具体实施方式
下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部实施例。基于本发明的实施例，本领域普通技术人员在没有做出创造性劳动前提下，所获得的所有其他实施例，都属于本发明保护范围。
为了使本技术领域的人员更好地理解本发明方案,下面结合附图和具体实施方式对本发明作进一步的详细说明。图1为本发明实施例提供一种视频生成方法的一个可选的终端工作示意图,如图1所示,终端里面包括讲者编码器和虚拟预测网络和虚拟人界面(图中未示出),终端可以通过讲者编码器(相当于编码器)对真实对象的音视频进行特征提取,将提取的特征、态度(相当于第一特征)和参考者图像(相当于预设的标准图像)输入至听者解码器(相当于虚拟预测网络)中进行预测,生成按时间线排列的倾听者的头部运动和表情变化,从而得到虚拟对象视频序列。
在本发明的一些实施例中,图2是本发明实施例提供一种视频生成方法的一个可选的流程示意图一,将结合图2示出的步骤进行说明。
S101、采集真实对象的音视频序列。
在本发明的一些实施例中,根据社会心理学和人类学的概念,“听”也是一种沟通时的功能性行为。其中,倾听行为风格可以分为四类,即非倾听者、边缘倾听者、评价性倾听者和积极倾听者。其中,积极响应的倾听是最有效的一种,它在沟通中也起到了关键作用。它要求听者完全专注于一个人所说的内容、仔细聆听,同时对说话者表现出一些视觉反应。这些反应可以反馈给讲者关于听者是否感兴趣,是否理解,是否同意讲话内容,以调节对话节奏、进程及促进沟通的顺利进行。
对于积极响应的聆听而言,听众在表达自己的观点时会存在常见的视觉模式,例如,对称和循环运动被用来表示“是”、“不是”或类似的信号;小幅度的线性运动与对方讲话中的强调音节相配合;而范围更大的线性运动则常在对方讲话的停顿期间出现。在人类面对面的交互中,甚至连听者眨眼的时间都可以被视为交流信号。因此,基于一些音视频序列,生成虚拟对象听到音视频序列下的视频序列具有重要的意义。
示例性的,通过给定讲者的参考图像和一段时变信号的条件下,生成一个讲者的能够与该时变信号相匹配的拟真片段。图3a、图3b、和图3c分别是本发明实施例提供一种视频生成方法的一个可选的讲者视频生成示意图一、讲者视频生成示意图二和讲者视频生成示意图三,如图3a所示,讲者视频生成任务包括讲者身体姿势的生成;如图3b所示,讲者视频生成任务包括讲者唇部运动的生成;如图3c所示,讲者视频生成任务包括讲者头部(含脸部)的运动生成。在图3a中,讲者身体姿势的生成是通过身体姿势生成模型对虚线框输入的一段时变信号进行处理,得到点线虚线框所示的身体姿势。在图3b中,讲者唇部运动的生成是通过唇部运动生成模型对虚线框输入的一段时变信号和一般的参考图像进行处理,输出点线虚线框所示的讲者唇部运动图像帧。在图3c中,讲者头部(含脸部)的运动生成主要是通过头部运动生成模型对虚线框输入的一段时变信号和参考图像讲者和情绪进行处理,对处理后的结果通过头部渲染模型进行渲染,输出点线虚线框所示的讲者头部(含脸部)的运动图像帧。
在本发明的一些实施例中,终端可以通过收集设备采集真实对象的音频序列和视频序列。
示例性的,收集设备可以是具有采集视频和音频的设备,例如,摄像头,本发明不限于此;真实对象可以是在一个场景下正在讲话的人;音频序列和视频序列可以是在游客旅游景点场景下,游客询问自助咨询设备(虚拟对象的载体)过程中获取的。
在本发明的一些实施例中,本发明应用于需要人机交互的场合,例如,商场的智能咨询设备,可以根据购物者展示的视频,做出相应的视频,对购物者进行指引。
S102、对音视频序列进行特征提取,确定拟人化特征。
在本发明的一些实施例中,音视频序列包括真实对象的音频序列和真实对象的视频序列。
在本发明的一些实施例中,特征提取可以通过神经网络模型实现。特征提取的过程是:将视频序列的每一帧输入到神经网络中,通过多个卷积层和池化层进行特征提取,得到多个视频特征。拟人化特征指在真实场景中,人说话时会伴随一些肢体动作,将肢体动作进行特征化得到的特征。拟人化特征是包含音频特性和视频特性的特征;拟人化特征是对音视频序列进行特征提取后,得到的音频特征和视频特征。示例性的,在一段视频中,讲者说话时伴随抬手动作,拟人化特征就可以是抬手动作对应的特征。
在本发明的一些实施例中，终端可以通过编码器对真实对象的视频序列进行特征提取，得到多个视频特征；通过编码器对真实对象的音频序列进行特征提取，得到多个音频特征；音频特征包括响度、过零率和倒频谱系数；通过多个视频特征和多个音频特征，通过特征融合函数进行特征转换，确定拟人化特征。
在本发明的一些实施例中,图4是本发明实施例提供一种视频生成方法的一个可选的流程示意图二,如图4所示,S102可以通过S1021-S1023实现,如下:
S1021、通过编码器对真实对象的视频序列进行特征提取,得到多个视频特征。
在本发明的一些实施例中,视频特征是人进行交流的过程中,头部会有转动以及面部会有一些表情变化,将头部的转动以及面部的表情进行记录得到的特征。视频特征是对视频序列进行特征提取后得到的特征;视频特征包括姿态和表情。
在本发明的一些实施例中,终端可以通过人脸重建模型对真实对象的视频序列的每一帧视频帧进行特征提取,得到多个视频帧特征;将真实对象的视频序列中对应的所有第二姿态表情特征作为视频特征。
在本发明的一些实施例中,图5是本发明实施例提供一种视频生成方法的一个可选的流程示意图三,如图5所示,S1021可以通过S10211和S10212实现,如下:
S10211、通过人脸重建模型对真实对象的视频序列的每一帧视频帧进行特征提取,得到多个视频帧特征。
在本发明的一些实施例中,视频帧特征是针对视频序列里面的每一视频帧中的人物的头部转动,面部表情以及拍摄的多种因素进行记录得到的。视频帧特征包括第二身份特征和第二姿态表情特征。第二身份特征是对视频序列拍摄的环境因素以及被拍摄对象的身份信息进行记录的结果。第二身份特征包括真实对象的身份标识、材质和光照。第二姿态表情特征是真实对象在进行说话时,伴随头部转动以及面部表情变化,对其进行记录得到的结果。第二姿态表情特征包括真实对象的头部的姿态和面部的表情。人脸重建模型一般选择3D人脸重建模型,本发明不限于此。
在本发明的一些实施例中,终端可以通过人脸重建模型对真实对象的视频序列的每一帧视频帧中的人物和背景进行特征提取,每一帧视频帧中都会得到人物的身份标识、视频帧的材质、拍摄时的光照、人物的头部姿势和人物脸部的表情。将人物的身份标识、视频帧的材质和拍摄时的光照作为第二身份特征,将人物的头部姿势和人物脸部的表情作为第二姿态表情特征;将所有的特征(即,人物的身份标识、视频帧的材质、拍摄时的光照、人物的头部姿势和人物脸部的表情)作为一个视频帧特征。
示例性的,对于视频序列使用3D人脸重建模型提取其人脸3D参数。对于每一帧图片,都可以提取形如{α,β,δ,p,γ}的五元组参数,其中,α代表当前人脸的身份标识、β代表人物脸部的表情、δ代表视频帧的材质、p代表人物脸部的姿态和γ代表拍摄时的光照,这些所有特征组合在一起就是一个视频帧特征。
S10212、将真实对象的视频序列中对应的所有第二姿态表情特征作为视频特征。
在本发明的一些实施例中,终端将真实对象的视频序列中对应的所有第二姿态表情特征确定为视频序列对应的视频特征。
示例性的，将{α,β,δ,p,γ}参数分为两类，一类是相对固定的、与身份标识信息耦合较为紧密的特征 $(\alpha,\delta,\gamma)$（相当于第二身份特征），另一类是相对动态的、与身份标识信息独立的特征 $m=(\beta,p)$（相当于第二姿态表情特征）。由于只考虑真实对象的表情变化和头部运动，忽略其身份标识信息，因此，对于输入的真实对象的视频序列，可以抽取它们的视频特征为 $m_{1:t}=\{(\beta_i,p_i)\}_{i=1}^{t}$。
可以理解的是,在本发明的一些实施例中,终端可以通过人脸重建模型对真实对象的视频序列的每一帧视频帧进行特征提取,得到多个视频帧特征;将真实对象的视频序列中对应的所有第二姿态表情特征作为视频特征,去除了身份固有特征(例如人物的身份标识、视频帧的材质和拍摄时的光照),只保留通用的特征,提高了特征提取的有效性,为后续生成虚拟对象视频序列提供数据支持。
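As a rough illustration of the coefficient split described above, the sketch below groups the five-tuple {α, β, δ, p, γ} into identity-coupled features (α, δ, γ) and identity-independent motion m = (β, p). The reconstruct_face function and all dimensions are hypothetical placeholders; the patent does not name a concrete 3D face reconstruction model here.

```python
# Sketch: split per-frame 3D face coefficients into identity features and gesture-expression features.
# `reconstruct_face` is a hypothetical stand-in for the face reconstruction model; dimensions are assumed.
from typing import Dict, List, Tuple
import numpy as np

def reconstruct_face(frame: np.ndarray) -> Dict[str, np.ndarray]:
    # Placeholder: a real 3D face reconstruction model would regress these coefficients from the image.
    return {"alpha": np.zeros(80), "beta": np.zeros(64), "delta": np.zeros(80),
            "p": np.zeros(6), "gamma": np.zeros(27)}

def split_coefficients(frames: List[np.ndarray]) -> Tuple[List[dict], List[np.ndarray]]:
    identity_feats, motion_feats = [], []
    for frame in frames:
        coeff = reconstruct_face(frame)
        # Second identity feature: identity id, material and lighting (alpha, delta, gamma).
        identity_feats.append({k: coeff[k] for k in ("alpha", "delta", "gamma")})
        # Second gesture-expression feature: expression and head pose m = (beta, p).
        motion_feats.append(np.concatenate([coeff["beta"], coeff["p"]]))
    return identity_feats, motion_feats

# Only the motion features of all frames are kept as the video features; identity features are dropped.
_, video_features = split_coefficients([np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(8)])
print(len(video_features), video_features[0].shape)  # 8 (70,)
```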
S1022、通过编码器对真实对象的音频序列进行特征提取,得到多个音频特征。
在本发明的一些实施例中,音频特征是针对人交流过程中,说话者说话时伴随的一些特征。音频特征是对音频序列进行特征提取后得到的特征;音频特征包括真实对象说话时对应的响度、过零率和倒频谱系数。
在本发明的一些实施例中,终端可以通过编码器对真实对象的音频序列的频谱进行特征提取,得到真实对象讲话每一时刻的响度、过零率和倒频谱系数;将响度、过零率和倒频谱系数都是视为音频特征,从而得到多个音频特征。
示例性的，对于真实对象的音频序列抽取其能量特征、时域特征及频域特征作为其特征；分别提取了响度、过零率（ZCR）、倒频谱系数（MFCC）特征、相应的MFCC Delta和Delta-Delta特征作为音频段的音频特征 $s_i$。从中提取的音频特征表示为 $s_{1:t}=\{s_i\}_{i=1}^{t}$。
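For illustration only, the loudness, zero-crossing rate, MFCC and MFCC Delta/Delta-Delta features mentioned above could be extracted with an off-the-shelf library such as librosa, as sketched below; the sampling rate, default frame/hop sizes and 13 MFCC coefficients are assumptions for the example, not values specified in the patent.

```python
# Sketch: per-frame audio features (loudness, ZCR, MFCC + deltas) for the speaker's audio track.
import librosa
import numpy as np

def extract_audio_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)
    rms = librosa.feature.rms(y=y)                          # loudness proxy, shape (1, T)
    zcr = librosa.feature.zero_crossing_rate(y)             # zero-crossing rate, shape (1, T)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # cepstral coefficients, shape (13, T)
    mfcc_d = librosa.feature.delta(mfcc)                    # MFCC Delta
    mfcc_dd = librosa.feature.delta(mfcc, order=2)          # MFCC Delta-Delta
    feats = np.concatenate([rms, zcr, mfcc, mfcc_d, mfcc_dd], axis=0)
    return feats.T                                          # (T, feature_dim): one vector s_t per audio frame

# Example (assumes a local file): features = extract_audio_features("speaker.wav"); print(features.shape)
```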
可以理解的是,终端可以根据真实对象讲话时产生的音视频序列,进行特征提取,获取对应的多个音频特征和多个视频特征;可以快速提取音频的有效特征和视频的有效特征,提高了特征提取的速度以及特征的有效性。
S1023、基于多个视频特征和多个音频特征,通过特征融合函数进行特征转换,确定拟人化特征。
在本发明的一些实施例中,拟人化特征为音视频序列对应的多帧拟人化特征;拟人化特征包括视频特征和音频特征;特征融合函数可以将非线性特征转换为线性特征。
在本发明的一些实施例中,终端通过编码器中的多模态的特征融合函数对多个音频特征和多个视频特征进行非线性特征转换,得到真实对象的特征表示,即就是确定拟人化特征。
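The multi-modal fusion function is only characterized in the text as a nonlinear transformation of the audio and video features; one common way to realize such a function f_am is a small MLP over the concatenated per-frame features, as in the following sketch (layer widths and feature sizes are assumptions).

```python
# Sketch: a nonlinear multi-modal fusion function f_am mapping per-frame (audio, video) feature
# pairs to anthropomorphic features. Layer widths are illustrative assumptions only.
import torch
import torch.nn as nn

class FusionFam(nn.Module):
    def __init__(self, audio_dim=41, video_dim=70, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + video_dim, 512),
            nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, audio_feat, video_feat):
        # audio_feat: (T, audio_dim), video_feat: (T, video_dim) -> (T, out_dim) fused features
        return self.mlp(torch.cat([audio_feat, video_feat], dim=-1))

f_am = FusionFam()
fused = f_am(torch.randn(10, 41), torch.randn(10, 70))
print(fused.shape)  # torch.Size([10, 256])
```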
可以理解的是,终端通过对音视频序列进行特征提取,可以提高处理音视频序列的速度;快速得到多个视频特征和多个音频特征,通过特征融合函数将多个视频特征和多个音频特征转化为第一特征,可以转换特征的表达方式,便于终端对其进行处理,提高了视频特征和音频特征处理的可行性。
S103、利用虚拟预测网络、预设的标准特征,以及第一特征对拟人化特征进行预测,生成虚拟对象的视频序列。
在本发明的一些实施例中,虚拟对象的视频序列是根据真实对象的音视频序列生成虚拟对象相应反应的视频序列;预设的标准特征为参考对象对应的特征;预设的标准特征包括参考对象对应的第一姿态表情特征和第一身份特征;第一姿态表情特征是参考对象的头部姿态和面部表情;第一身份特征是参考对象对应的身份相关信息。第一特征是人在说话时,具有感情色彩的态度特征。第一特征表征不同的态度,第一特征可以是积极态度,也可以是消极态度,还可以是普通态度。
在本发明的一些实施例中,终端可以获取预设的标准特征;基于第一姿态表情特征、拟人化特征和第一特征,通过虚拟预测网络进行预测和解码,确定虚拟对象的多帧姿态表情特征;通过虚拟对象的多帧姿态表情特征和第一身份特征,生成虚拟对象的视频序列。
示例性的，给定一段时长从1到t的真实对象的输入视频序列 $v_{1:t}$ 及其对应的音频序列 $s_{1:t}$（相当于真实对象的音视频序列），虚拟对象生成任务旨在生成时间步t+1的虚拟对象视频序列 $\hat{l}_{t+1}$，可以通过以下公式(1)表示：
$$\hat{l}_{t+1}=\mathcal{F}\big(v_{1:t},\,s_{1:t},\,l^{ref},\,e\big) \tag{1}$$
其中，$l^{ref}$ 是听者的参考图像（相当于标准图像），$e$ 是听者的态度，生成的听者视频的全体可以表示为 $\hat{l}_{1:T+1}$。
在本发明的一些实施例中,图6是本发明实施例提供一种视频生成方法的一个可选的流程示意图四,如图6所示,S103可以通过S1031-S1033实现,如下:
S1031、获取预设的标准特征。
在本发明的一些实施例中,终端可以获取标准图像;通过人脸重建模型对标准图像进行特征提取,得到第一姿态表情特征和第一身份特征,将第一姿态表情特征和第一身份特征作为预设的标准特征。
在本发明的一些实施例中,S1031可以通过S10311和S10312实现,如下:
S10311、获取标准图像。
在本发明的一些实施例中,标准图像是参考对象的图像;标准图像是随机在图像库中获取的任意一张不同于真实对象的人脸图像。
S10312、通过人脸重建模型对标准图像进行特征提取,得到预设的标准特征。
在本发明的一些实施例中，终端可以通过人脸重建模型对标准图像进行特征提取，得到标准图像中人物的身份标识、视频帧的材质、拍摄时的光照、人物的头部姿势和人物脸部的表情。将标准图像中人物的身份标识、视频帧的材质和拍摄时的光照作为第一身份特征；将标准图像中人物的头部姿势和人物脸部的表情作为第一姿态表情特征；将第一身份特征和第一姿态表情特征作为预设的标准特征。
可以理解的是,终端可以获取标准图像;通过人脸重建模型对标准图像进行特征提取,可以快速准确的获取第一姿态表情特征,提高了对标准图像处理的准确度和效率;第一姿态表情特征可以用于生成虚拟对象的视频序列,保证了生成虚拟对象视频序列的准确性。
S1032、基于第一姿态表情特征、拟人化特征和第一特征,通过虚拟预测网络进行预测和解码,确定虚拟对象的多帧姿态表情特征。
在本发明的一些实施例中,终端可以利用多帧拟人化特征中的第一帧拟人化特征、第一姿态表情特征和第一特征,通过第一处理模块进行预测,得到下一个预测视频帧;通过第二处理模块对下一个预测视频帧进行解码,确定下一个预测视频帧对应的虚拟对象的下一个姿态表情特征;通过下一个姿态表情特征和多帧拟人化特征中的下一帧拟人化特征,继续进行预测和解码,直至得到最后一个预测视频帧对应的虚拟对象的最后一个姿态表情特征时为止,从而得到虚拟对象的多帧姿态表情特征。
在本发明的一些实施例中,S1032可以通过S10321、S10322和S10323实现,如下:
S10321、基于多帧拟人化特征中的第一帧拟人化特征、第一姿态表情特征和第一特征,通过第一处理模块进行预测,得到下一个预测视频帧。
在本发明的一些实施例中,第一特征为积极态度、消极态度和普通态度中的任意一种态度;第一处理模块是虚拟预测网络中的编码模块;功能是实现预测处理中的编码。
在本发明的一些实施例中,终端可以选择积极态度、消极态度和普通态度中的任意一种态度作为第一特征,本发明实施例中,选择普通态度;在普通态度下,将多帧拟人化特征中的第一帧拟人化特征、第一姿态表情特征输入到虚拟预测网络中的第一处理模块中进行预测,得到下一个视频预测帧。
S10322、通过第二处理模块对下一个预测视频帧进行解码,确定下一个预测视频帧对应的虚拟对象的下一个姿态表情特征。
在本发明的一些实施例中,第二处理模块是虚拟预测网络中的解码模块;功能是实现预测处理中的解码。
在本发明的一些实施例中,终端可以通过虚拟预测网络中的第二处理模块对下一个预测视频帧进行解码,得到下一个预测视频帧对应的虚拟对象的下一个姿态表情特征;下一个姿态表情特征包括下一个姿态特征和下一个表情特征。
S10323、基于下一个姿态表情特征和多帧拟人化特征中的下一帧拟人化特征,继续进行预测和解码,直至得到最后一个预测视频帧对应的虚拟对象的最后一个姿态表情特征时为止,从而得到虚拟对象的多帧姿态表情特征。
在本发明的一些实施例中,第一姿态表情特征作为多帧姿态表情特征中的第一帧。
在本发明的一些实施例中,终端将下一个姿态表情特征和多帧拟人化特征中的第二帧拟人化特征输入到虚拟预测网络中的第一处理模块进行预测,得到下一个预测视频帧;通过虚拟预测网络中的第二处理模块进行解码,确定下一个预测视频帧对应的虚拟对象的下一个姿态表情特征;继续通过虚拟预测网络进行预测和解码,直至得到最后一个预测视频帧对应的虚拟对象的最后一个姿态表情特征,从而确定虚拟对象的多帧姿态表情特征。
可以理解的是,终端可以通过第一帧拟人化特征、第一姿态表情特征和第一特征进行预测,第一特征可以提高生成预测视频帧的多样性,第一帧拟人化特征和第一姿态表情特征可以提高生成预测视频帧的准确性;基于第一预测视频帧生成下一个预测视频帧,通过实时更新的预测视频帧预测下一个视频帧,可以生成连续的视频帧,提高了视频帧之间的连续性,保证了视频序列的完整性;通过对多个预测视频帧解码得到多帧姿态表情特征,可以充分体现虚拟对象听到真实对象的音视频时的头部反应和面部反应,提高了生成虚拟对象的视频序列的准确性。
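The predict-then-decode loop of S10321-S10323 can be summarized by the following sketch; first_module and second_module are hypothetical stand-ins for the first and second processing modules of the virtual prediction network, and the toy lambdas exist only to make the example runnable.

```python
# Sketch of the S10321-S10323 loop: seed with the first gesture-expression feature and the attitude,
# then alternately predict the next video frame and decode it into the next gesture-expression feature.
def rollout(anthro_feats, first_motion, attitude, first_module, second_module):
    motions = [first_motion]                 # the first gesture-expression feature is frame 0
    state = None
    for anthro_t in anthro_feats:            # one anthropomorphic feature per input frame
        frame, state = first_module(anthro_t, motions[-1], attitude, state)  # next predicted video frame
        motions.append(second_module(frame))                                 # next gesture-expression feature
    return motions                           # multi-frame gesture-expression features (len = T + 1)

# Example with trivial stand-ins:
dummy_first = lambda a, m, e, s: (a + m + e, s)
dummy_second = lambda frame: frame * 0.5
print(len(rollout([1.0, 2.0, 3.0], 0.0, 0.1, dummy_first, dummy_second)))  # 4
```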
S1033、基于虚拟对象的多帧姿态表情特征和第一身份特征,生成虚拟对象的视频序列。
在本发明的一些实施例中,虚拟对象是具有简单沟通功能的模拟对象,一般存在于交互设备中。
在本发明的一些实施例中，终端可以通过将虚拟对象的多帧姿态表情特征中的每一帧姿态表情特征分别与第一身份特征进行融合，得到多个第二特征；将多个第二特征通过渲染器生成虚拟对象的视频序列。
可以理解的是,终端可以获取预设的标准特征;通过预设的标准特征中的第一姿态表情特征、拟人化特征和第一特征,通过虚拟预测网络进行预测和解码,确定虚拟对象的多帧姿态表情特征,可以提高多帧姿态表情特征的准确性;基于虚拟对象的多帧姿态表情特征和第一身份特征,生成虚拟对象的视频序列;可以提高虚拟对象的视频序列的准确性和生动性。
在本发明的一些实施例中,S1033可以通过S10331和S10332实现,如下:
S10331、将虚拟对象的多帧姿态表情特征中的每一帧姿态表情特征分别与第一身份特征进行融合,得到多个第二特征。
在本发明的一些实施例中,第二特征是将姿态表情特征赋予身份特征之后,所形成的融合特征。第二特征是身份特征与姿态表情特征的融合结果。
在本发明的一些实施例中,终端可以将虚拟对象的多帧姿态表情特征中的每一帧姿态表情特征分别与第一身份特征进行融合,得到多帧姿态表情特征对应的多个第二特征。
S10332、将多个第二特征通过渲染器,生成虚拟对象的视频序列。
在本发明的一些实施例中,第一身份特征包括第一身份标识、第一材质和第一光照信息;第一身份标识对应标准图像的中人物的身份标识、第一材质对应视频帧的材质和第一光照信息对应拍摄时的光照。
在本发明的一些实施例中,终端可以通过渲染器将多个第二特征生成虚拟对象的视频序列。
示例性的，虚拟对象的视频序列生成任务可以通过公式(2)和公式(3)表示：
$$\hat{m}^{l}_{t+1}=\big(\hat{\beta}^{l}_{t+1},\,\hat{p}^{l}_{t+1}\big) \tag{2}$$
$$\hat{l}_{t+1}=R\big(\hat{m}^{l}_{t+1},\,(\alpha^{l},\delta^{l},\gamma^{l})\big) \tag{3}$$
其中，$\hat{m}^{l}_{t+1}$ 为虚拟预测网络预测的虚拟对象的姿态特征；$(\alpha^{l},\delta^{l},\gamma^{l})$ 为虚拟对象的与其身份标识相关联的特征，它会与预测的姿态特征一起，通过渲染器 $R$ 生成虚拟对象的视频序列 $\hat{l}_{t+1}$。
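A minimal sketch of the fusion-and-render step of formulas (2) and (3): each predicted gesture-expression feature is combined with the first identity feature and passed to a renderer. The render_frame stub is a hypothetical placeholder, since the patent does not specify a concrete renderer.

```python
# Sketch: fuse each predicted gesture-expression feature with the listener's identity feature
# and render one video frame per fused feature. `render_frame` is a hypothetical renderer stub.
import numpy as np

def render_frame(fused: dict) -> np.ndarray:
    # Placeholder renderer: a real implementation would rasterize the 3D face coefficients to an image.
    return np.zeros((256, 256, 3), dtype=np.uint8)

def generate_listener_video(motion_frames, identity_feature):
    video = []
    for beta_p in motion_frames:                                 # predicted (expression, pose) per frame
        second_feature = {**identity_feature, "motion": beta_p}  # fusion of identity and gesture-expression
        video.append(render_frame(second_feature))
    return video

identity = {"alpha": np.zeros(80), "delta": np.zeros(80), "gamma": np.zeros(27)}
frames = generate_listener_video([np.zeros(70) for _ in range(5)], identity)
print(len(frames), frames[0].shape)  # 5 (256, 256, 3)
```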
可以理解的是,终端可以通过将虚拟对象的多帧姿态表情特征中的每一帧姿态表情特征分别与第一身份特征进行融合,得到多个第二特征;使得第二特征具有身份属性,可以提高第二特征的区分性,将多个第二特征通过渲染器生成虚拟对象的视频序列,针对性的生成的虚拟对象的视频序列,视频序列更加生动和准确。
S104、呈现虚拟对象的视频序列。
在本发明的一些实施例中,终端可以将虚拟对象的视频序列呈现在虚拟人界面上。
可以理解的是,终端采集真实对象的音视频序列;对音视频序列进行特征提取,确定拟人化特征;利用虚拟预测网络和预设的标准特征,以及第一特征,对拟人化特征进行预测,生成虚拟对象的视频序列;预设的标准特征为参考对象对应的特征;第一特征表征不同的态度;虚拟对象的视频序列是根据真实对象的音视频序列生成虚拟对象相应反应的视频序列;呈现虚拟对象的视频序列。根据真实对象的音视频序列,生成虚拟对象的视频序列,呈现的虚拟对象的视频序列更加生动和准确。
在本发明的一些实施例中,图7是本发明实施例提供一种视频生成方法的一个可选的流程示意图五,如图7所示,在执行S103之前,还执行S105-S1010,如下:
S105、采集真实倾诉对象的音视频序列样本,及其对应的真实倾听对象的人脸图像。
在本发明的一些实施例中,音视频序列样本是记录真实倾诉对象的讲话以及对应讲话时的动作神情所形成的。
在本发明的一些实施例中,终端可以通过采集设备采集真实倾诉对象的音视频序列样本,及其对应的真实倾听对象的人脸图像。采集的真实倾听对象的人脸图像是在不同的第一特征下采集的,第一特征包括是积极态度、消极态度和普通态度。
S106、通过初始编码器对音视频序列样本进行特征提取,确定拟人化样本特征。
在本发明的一些实施例中,拟人化样本特征是音视频序列样本对应的场景下,人说话时会伴随一些肢体动作,将肢体动作进行特征化得到的特征。拟人化样本特征是对音视频序列样本进行特征提取后得到的,拟人化样本特征包括视频样本特征和音频样本特征。
在本发明的一些实施例中,终端可以通过初始编码器对音视频序列样本进行特征提取,得到拟人化样本特征,其中,拟人化样本特征包括多帧拟人化样本特征。
S107、通过初始虚拟预测网络和拟人化样本特征,生成训练真实对象的样本音视频序列下的预测人脸特征。
在本发明的一些实施例中,预测人脸特征包含预测姿态特征和预测表情特征。
在本发明的一些实施例中,终端可以通过初始虚拟预测网络和拟人化样本特征,生成训练真实对象的音视频序列样本下的预测姿态特征和预测表情特征;预测姿态特征包括多帧预测姿态特征;预测表情特征包括多帧预测表情特征;将预测姿态特征和预测表情特征作为预测人脸特征;预测人脸特征包括多帧预测人脸特征。
S108、根据真实倾听对象的人脸图像,通过人脸重建模型进行特征提取,确定真实人脸特征。
在本发明的一些实施例中,真实人脸特征包括真实姿态特征和真实表情特征;
在本发明的一些实施例中,终端可以通过人脸重建模型对真实倾听对象的人脸图像进行特征提取,得到真实姿态特征和真实表情特征,将真实姿态特征和真实表情特征作为真实人脸特征。其中,真实倾听对象的人脸图像的数量与通过音视频序列样本得到的多帧拟人化特征的数量是一致的,真实姿态特征包括多帧真实姿态特征;真实表情特征包括多帧真实表情特征;真实人脸特征也包括多帧真实人脸特征,真实人脸特征数量与预测人脸特征数量一致。
S109、通过第一损失函数和拟人化样本特征不断优化初始编码器,直到第一损失函数值满足第一预设阈值,确定编码器。
在本发明的一些实施例中,终端可以通过第一损失函数和拟人化样本特征不断优化初始编码器,若第一损失函数值大于或者等于第一预设阈值,则可以视为编码器已经训练好了,确定编码器;若第一损失函数值小于第一预设阈值,则继续训练编码器,直到第一损失函数值大于或者等于第一预设阈值,确定编码器。
S1010、基于真实人脸特征和预测人脸特征,通过第二损失函数和第三损失函数不断优化初始虚拟预测网络,直到第二损失函数值和第三损失函数值均满足第二预设阈值,确定虚拟预测网络。
在本发明的一些实施例中,终端可以利用真实人脸特征和预测人脸特征,通过第二损失函数和第三损失函数不断优化初始虚拟预测网络,若第二损失函数值与第三损失函数值的和大于或者等于第二预设阈值,则确定虚拟预测网络;若第二损失函数值与第三损失函数值的和小于第二预设阈值,则继续训练虚拟预测网络,直到第二损失函数值与第三损失函数值的和大于或者等于第二预设阈值,确定虚拟预测网络。
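A minimal sketch of the optimization criterion just described: the initial virtual prediction network is updated with the sum of the second and third losses until that sum satisfies the second preset threshold. The optimizer, learning rate, data interface and the meets_threshold predicate are assumptions; only the stopping criterion follows the text.

```python
# Sketch: keep optimizing the initial virtual prediction network with the second and third
# losses until their summed value satisfies the second preset threshold.
import torch

def train_until_threshold(network, data_loader, second_loss, third_loss, meets_threshold,
                          max_steps=100000):
    opt = torch.optim.Adam(network.parameters(), lr=1e-4)
    for step, (inputs, real_feats) in enumerate(data_loader):
        pred_feats = network(inputs)
        loss = second_loss(pred_feats, real_feats) + third_loss(pred_feats, real_feats)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if meets_threshold(loss.item()) or step >= max_steps:  # stop once the summed loss meets the threshold
            break
    return network

# Example with a toy network and synthetic data:
net = torch.nn.Linear(8, 8)
loader = [(torch.randn(4, 8), torch.randn(4, 8)) for _ in range(50)]
mse = torch.nn.functional.mse_loss
train_until_threshold(net, loader, mse, lambda p, r: mse(p[1:], r[1:]), lambda v: v < 0.5)
```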
示例性的,图8a和图8b分别为本发明实施例提供一种视频生成方法的虚拟对象的视频序列结果图一和本发明实施例提供一种视频生成方法的虚拟对象的视频序列结果图二,如图8a和图8b所示,横坐标表示连续的视频帧,包括0-32帧;图8a中展示了生成虚拟对象的视频序列结果在域内(指训练数据中包含测试数据集中的倾诉人或者倾听人)上测试的结果,图8b中展示了生成虚拟对象的视频序列结果在域外(指倾诉人和倾听人的人脸数据都不曾在训练集中出现过,主要考验模型在未见过的人脸上的泛化能力)上测试的结果。图8a中真实的倾听者态度为积极,图8b中为自然态度的真实倾听者。图8a中第4、5和6行和图8b中第4、5、6行分别显示了以三种不同态度为条件生成的虚拟对象的视频序列结果,图8a和图8b中的生成听者就是虚拟对象。其中,第0帧是参考帧。图中的不同的三角形表示显著的变化,其中,右上方视线随眼球运动而变化;左下方头部运动;右下方每一行的代表帧;左上方在不同态度下有显著差异的帧(列方向)。
可以看到虚拟预测网络能够捕捉到通用的倾听者(相当于虚拟对象)的模式(如眼睛、嘴巴和头部运动等),这些模式可能与真实的听者不同,但仍然是有意义的。此外,该虚拟预测网络能够呈现不同态度下虚拟对象的视觉模式。对于图8a所示的结果,可以看到,虽然中性态度也会笑(第2-8帧),但它保持的时间比积极的态度(第2-16帧)短。而对于消极态度的人来说,虚拟对象并不关注谈话内容,它们在第10、16、22和30帧时看向屏幕的下侧。
最后，在图8b中，域外数据上也有相对好的生成结果。持有积极态度的虚拟对象在第6-14帧微笑，持有消极态度的听众皱着眉头，并在整个过程中表现出负面的嘴型。负面态度的虚拟对象动作变化小、眼神游离，而中性虚拟对象则保持着相对平静的表情，同时伴随着头部的有规律移动。
对于生成的虚拟对象视频序列结果进行评价,有10个志愿者做如下两个测试:
最佳匹配测试。在给定态度,真实对象音频序列,真实对象视频序列,真实的倾听者视频和生成的虚拟对象视频序列的情况下,志愿者需要选择感官上最恰当、最符合给定态度的倾听者。
态度分类测试。在给定生成的虚拟对象视频序列的情况下,志愿者需要确定它的情绪(积极、消极、自然)。需要说明的是,自然相当于普通态度。
两个测试都以双盲的形式进行,且结果如表1和表2所示。
表1最佳匹配测试结果
由表1中统计了两种“最好的倾听者”个数的均值和方差。在域内数据中,志愿者投票认为近20%的生成的虚拟对象看起来比真实的听者更合理,这验证了模型可以生成与人类主观感知相一致的响应式听者。而且,在域外数据中生成的结果被更多的志愿者喜欢。
表2表情分类测试结果
由表2可以看出,对于每一种态度,计算所有志愿者分类精度的均值和方差,得到模型可以在一定程度上生成指定态度的视频。
可以理解的是,终端可以根据采集真实倾诉对象的音视频序列样本,及其对应的真实倾听对象的人脸图像,真实倾听对象的人脸图像是在不同的第二特征下采集的,可以保证训练样本的多样性;通过音视频序列样本,及其对应的真实倾听对象的人脸图像,对初始编码器和初始虚拟预测网络进行优化训练,确定编码器和虚拟预测网络,可以提高编码器和虚拟预测网络输出结果的准确度。
在本发明的一些实施例中,图9是本发明实施例提供一种视频生成方法的一个可选的流程示意图六,如图9所示,在执行S1010之前还执行S1011-S1013,如下:
S1011、基于真实人脸特征和预测人脸特征,确定第二损失函数。
在本发明的一些实施例中,第二损失函数用来保证预测表情、预测姿态与真实表情、真实姿态相似。
在本发明的一些实施例中,终端可以根据真实人脸特征和预测人脸特征,进行作差求模运算,确定第二损失函数。
示例性的，第二损失函数可以通过以下公式(4)得到：
$$\mathcal{L}_{sim}=\sum_{t=1}^{T}\Big(\big\|\hat{\beta}^{l}_{t}-\beta^{l}_{t}\big\|+\big\|\hat{p}^{l}_{t}-p^{l}_{t}\big\|\Big) \tag{4}$$
其中，$\mathcal{L}_{sim}$ 表示第二损失函数；$\hat{\beta}^{l}_{t}$ 表示预测人脸特征中的预测表情特征；$\beta^{l}_{t}$ 表示真实人脸特征中的真实表情特征；$\hat{p}^{l}_{t}$ 表示预测人脸特征中的预测姿态特征；$p^{l}_{t}$ 表示真实人脸特征中的真实姿态特征。对于优化过程，有真实倾听者的真实人脸特征 $\{(\beta^{l}_{t},p^{l}_{t})\}_{t=1}^{T}$；由于缺乏T+1帧的监督信号，丢弃初始虚拟预测网络最后一帧的生成结果，即最后一个预测视频帧对应的虚拟对象的姿态表情特征（相当于预测人脸特征的最后一帧）。
S1012、基于真实人脸特征对应的变化函数和预测人脸特征对应的变化函数,确定第三损失函数。
在本发明的一些实施例中,第三损失函数用来保证预测人脸特征的帧间连续性与真实人脸特征相似。
在本发明的一些实施例中,终端可以通过真实人脸特征对应的变化函数和预测人脸特征对应的变化函数,进行作差求模运算,确定第三损失函数。
示例性的，第三损失函数可以通过以下公式(5)得到：
$$\mathcal{L}_{cont}=\sum_{t=1}^{T}\Big(\big\|\mu(\hat{\beta}^{l}_{t})-\mu(\beta^{l}_{t})\big\|+\big\|\mu(\hat{p}^{l}_{t})-\mu(p^{l}_{t})\big\|\Big) \tag{5}$$
其中，$\mathcal{L}_{cont}$ 表示第三损失函数；$\mu(\hat{\beta}^{l}_{t})$ 表示预测人脸特征中的预测表情特征对应的变化函数；$\mu(\beta^{l}_{t})$ 表示真实人脸特征中的真实表情特征对应的变化函数；$\mu(\hat{p}^{l}_{t})$ 表示预测人脸特征中的预测姿态特征对应的变化函数；$\mu(p^{l}_{t})$ 表示真实人脸特征中的真实姿态特征对应的变化函数；$\mu$ 是衡量当前帧和其相邻的前一帧的帧间变化的函数，即 $\mu(x_{t})=x_{t}-x_{t-1}$。
S1013、通过第二损失函数和第三损失函数,持续优化初始虚拟预测网络,直到第二损失函数值与第三损失函数值的和满足第二预设阈值,确定虚拟预测网络。
在本发明的一些实施例中,终端可以通过第二损失函数和第三损失函数求和,确定虚拟预测网络的损失函数,通过损失函数持续优化初始虚拟预测网络,直到损失函数值(相当于第二损失函数值与第三损失函数值的和)满足第二预设阈值,确定虚拟预测网络。
示例性的，虚拟预测网络的损失函数可以通过以下公式(6)得到：
$$\mathcal{L}=\mathcal{L}_{sim}+w\,\mathcal{L}_{cont} \tag{6}$$
其中，$\mathcal{L}$ 表示虚拟预测网络的损失函数；$\mathcal{L}_{sim}$ 表示第二损失函数；$\mathcal{L}_{cont}$ 表示第三损失函数；$w$ 是用来平衡这两个损失函数的尺度。
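The structure of formulas (4)-(6) translates directly into code. The sketch below assumes the predicted and real expression/pose features are stacked as (T, dim) tensors; the choice of L1 norm, the use of a mean instead of a sum, and w = 1.0 are illustrative assumptions, since the patent fixes only the structure of the losses.

```python
# Sketch of the loss terms in formulas (4)-(6): a similarity term on expression/pose, a
# continuity term on their frame-to-frame changes, and a weighted sum of the two.
import torch

def mu(x):                       # frame-to-frame change: mu(x_t) = x_t - x_{t-1}
    return x[1:] - x[:-1]

def second_loss(pred_exp, real_exp, pred_pose, real_pose):      # formula (4)
    return (pred_exp - real_exp).abs().mean() + (pred_pose - real_pose).abs().mean()

def third_loss(pred_exp, real_exp, pred_pose, real_pose):       # formula (5)
    return (mu(pred_exp) - mu(real_exp)).abs().mean() + (mu(pred_pose) - mu(real_pose)).abs().mean()

def total_loss(pred_exp, real_exp, pred_pose, real_pose, w=1.0):  # formula (6)
    return (second_loss(pred_exp, real_exp, pred_pose, real_pose)
            + w * third_loss(pred_exp, real_exp, pred_pose, real_pose))

# Example on random (T, dim) tensors:
pe, re_, pp, rp = torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 6), torch.randn(16, 6)
print(total_loss(pe, re_, pp, rp).item())
```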
可以理解的是,终端可以通过基于真实人脸特征和预测人脸特征,确定第二损失函数;通过真实人脸特征对应的变化函数和预测人脸特征对应的变化函数,确定第三损失函数,提高了虚拟预测网络的损失函数的有效性;通过第二损失函数和第三损失函数,持续优化初始虚拟预测网络,直到第二损失函数值与第三损失函数值的和满足第二预设阈值,确定虚拟预测网络;提高了虚拟预测网络预测的准确性和虚拟预测网络的预测效果。
下面,将说明本发明实施例在一个实际的应用场景中的示例性应用。
在本发明的一些实施例中，图10是本发明实施例提供一种视频生成方法的一个可选的模型架构图，如图10所示，终端获取讲者视频（相当于真实对象的视频序列），通过讲者编码器（相当于编码器）对讲者视频进行预处理，得到连续视频帧；通过人脸重建模型对讲者视频的每一帧进行特征提取，得到每一帧对应的身份标识、材质、光照、表情和姿态；将表情和姿态作为视频特征。获取讲者音频（相当于真实对象的音频序列），通过编码器对讲者音频进行处理，得到多个时刻的音频；对多个时刻的音频中的每一时刻的音频进行特征提取，得到每一时刻对应的响度、过零率和倒频谱系数；将响度、过零率和倒频谱系数作为音频特征；将视频特征和音频特征通过特征融合函数 $f_{am}$ 转换成拟人化特征。终端中的听者解码器（相当于虚拟预测网络）获取听者的参考图像（相当于标准图像），通过人脸重建模型对参考图像进行特征提取，得到身份标识、材质、光照、表情和姿态（表情和姿态如图10中所示）；将身份标识、材质、光照作为第一身份特征，将表情和姿态作为第一姿态表情特征。将第一姿态表情特征输入到长短期记忆网络编码器（相当于第一处理模块）中，利用态度 $e$（相当于第一特征）结合拟人化特征生成多帧姿态表情特征（相当于 $\hat{m}^{l}_{1}$ 至 $\hat{m}^{l}_{t+1}$），一共有t+1帧。将第一身份特征共享至解码器中，通过解码器（相当于第二处理模块）将多帧姿态表情特征中的每一帧姿态表情特征分别和第一身份特征进行融合，得到虚拟对象视频序列。
示例性的，对于讲者编码器，在每个时间步t，首先提取音频特征 $s_{t}$ 和讲者的视频特征 $m_{t}$，然后使用一个多模态的特征融合函数 $f_{am}$ 进行非线性特征转换，得到拟人化特征 $f_{t}$。
为了确保虚拟对象能够以某种态度做出反应，并产生更自然的头部动作和表情变化，将态度 $e$ 和听者的参考图像的特征 $m^{l}_{1}$（相当于第一姿态表情特征）作为虚拟对象视频序列的第一帧。然后，在每个时间步t，将讲者的融合特征 $f_{t}$（相当于拟人化特征）作为输入，生成t+1步的预测视频帧 $h_{t+1}$。最后，使用听者解码器将预测视频帧解码为 $\hat{m}^{l}_{t+1}$，其中包含两个特征向量，即 $\hat{\beta}^{l}_{t+1}$ 表示表情，$\hat{p}^{l}_{t+1}$ 表示姿态（旋转和平移）。终端支持任意长度的讲者输入。该流程可以通过公式(7)表述，如下：
$$h_{t+1},c_{t+1}=\mathrm{LSTM}\big(f_{t},h_{t},c_{t}\big),\qquad \hat{m}^{l}_{t+1}=\big(\hat{\beta}^{l}_{t+1},\hat{p}^{l}_{t+1}\big)=D_{m}\big(h_{t+1}\big) \tag{7}$$
其中，$\hat{\beta}^{l}_{t+1}$ 表示预测的表情，$\hat{p}^{l}_{t+1}$ 表示预测的姿态（旋转和平移）；$D_{m}$ 和 LSTM 均是听者解码器中的组成单元，用于生成多帧姿态表情特征；$f_{t}$ 表示融合特征；$h_{t}$ 表示预测视频帧；$c_{t}$ 表示存储预测视频帧。
可以理解的是,终端可以根据采集的真实对象的音视频序列,通过讲者编码器和听者解码器进行处理,生成虚拟对象视频序列,使得虚拟对象的视频序列更加生动和准确。
基于上述实施例的视频生成方法,本发明实施例还提供了一种视频生成装置,如图11所述,图11为本发明实施例提供的一种视频生成的结构示意图一,该装置11包括:获取部分1101、确定部分1102和生成部分1103;其中,
所述获取部分1101,被配置为采集真实对象的音视频序列;
所述确定部分1102,被配置为对所述音视频序列进行特征提取,确定拟人化特征;
所述生成部分1103,被配置为利用虚拟预测网络、预设的标准特征,以及第一特征对所述拟人化特征进行预测,生成虚拟对象视频序列;所述预设的标准特征为参考对象对应的特征;所述第一特征表征不同的态度;所述虚拟对象的视频序列是根据真实对象的音视频序列生成虚拟对象相应反应的视频序列;呈现所述虚拟对象的视频序列。
在本发明的一些实施例中,所述获取部分1101,被配置为获取预设的标准特征;所述预设的标准特征包括第一姿态表情特征和第一身份特征;
所述确定部分1102,被配置为基于所述第一姿态表情特征、所述拟人化特征和所述第一特征,通过所述虚拟预测网络进行预测和解码,确定虚拟对象的多帧姿态表情特征;
所述生成部分1103,被配置为基于所述虚拟对象的多帧姿态表情特征和所述第一身份特征,生成所述虚拟对象的视频序列。
在本发明的一些实施例中,所述获取部分1101,被配置为获取标准图像;所述标准图像表征参考对象的图像;通过人脸重建模型对所述标准图像进行特征提取,得到所述预设的标准特征。
在本发明的一些实施例中,所述拟人化特征包括:所述音视频序列对应的多帧拟人化特征;所述虚拟预测网络包括:第一处理模块和第二处理模块;所述确定部分1102,被配置为基于所述多帧拟人化特征中的第一帧拟人化特征、所述第一姿态表情特征和所述第一特征,通过所述第一处理模块进行预测,得到下一个预测视频帧;所述第一特征为积极态度、消极态度和普通态度中的一种;通过所述第二处理模块对所述下一个预测视频帧进行解码,确定所述下一个预测视频帧对应的所述虚拟对象的下一个姿态表情特征;基于所述下一个姿态表情特征和所述多帧拟人化特征中的下一帧拟人化特征,继续进行预测和解码,直至得到最后一个预测视频帧对应的所述虚拟对象的最后一个姿态表情特征时为止,从而得到所述虚拟对象的多帧姿态表情特征;所述第一姿态表情特征作为所述多帧姿态表情特征中的第一帧。
在本发明的一些实施例中,所述获取部分1101,被配置为将所述多帧姿态表情特征中的每一帧姿态表情特征分别与所述第一身份特征进行融合,得到多个第二特征;所述第二特征表征身份特征与姿态表情特征的融合结果;
所述生成部分1103,被配置为将所述多个第二特征通过渲染器,生成所述虚拟对象对应的所述虚拟对象的视频序列;所述第一身份特征包括第一身份标识、第一材质和第一光照信息。
在本发明的一些实施例中,所述音视频序列包括:真实对象的音频序列和真实对象的视频序列;所述获取部分1101,被配置为通过编码器对所述真实对象的视频序列进行特征提取,得到多个视频特征;通过编码器对所述真实对象的音频序列进行特征提取,得到多个音频特征;所述音频特征包括响度、过零率和倒频谱系数;
所述确定部分1102,被配置为基于所述多个视频特征和所述多个音频特征,通过特征融合函数进行特征转换,确定所述拟人化特征;所述拟人化特征为音视频序列对应的多帧拟人化特征;所述拟人化特征包括视频特征和音频特征。
在本发明的一些实施例中,所述获取部分1101,被配置为通过人脸重建模型对所述真实对象的视频序列的每一帧视频帧进行特征提取,得到所述多个视频帧特征;所述视频帧特征包括第二身份特征和第二姿态表情特征;
    所述确定部分1102，被配置为将所述真实对象的视频序列中对应的所有所述第二姿态表情特征作为所述视频特征。
在本发明的一些实施例中,所述获取部分1101,被配置为采集真实倾诉对象的音视频序列样本,及其对应的真实倾听对象的人脸图像;
所述确定部分1102,被配置为通过初始编码器对所述音视频序列样本进行特征提取,确定拟人化样本特征;
所述生成部分1103,被配置为通过所述初始虚拟预测网络和所述拟人化样本特征,生成训练真实对象的样本音视频序列下的预测人脸特征;所述预测人脸特征包含预测姿态特征和预测表情特征;
所述确定部分1102,被配置为根据所述真实倾听对象的人脸图像,通过人脸重建模型进行特征提取,确定真实人脸特征,所述真实人脸特征包括真实姿态特征和真实表情特征;通过第一损失函数和所述拟人化样本特征不断优化所述初始编码器,直到第一损失函数值满足第一预设阈值,确定所述编码器;基于所述真实人脸特征和预测人脸特征,通过第二损失函数和第三损失函数不断优化初始虚拟预测网络,直到第二损失函数值与第三损失函数值的和满足第二预设阈值,确定所述虚拟预测网络。
在本发明的一些实施例中,所述确定部分1102,被配置为基于所述真实人脸特征和所述预测人脸特征,确定第二损失函数;所述第二损失函数用来保证预测表情、预测姿态与真实表情、真实姿态相似;基于所述真实人脸特征对应的变化函数和所述预测人脸特征对应的变化函数,确定第三损失函数;所述第三损失函数用来保证预测人脸特征的帧间连续性与真实人脸特征相似;通过所述第二损失函数和所述第三损失函数,持续优化初始虚拟预测网络,直到第二损失函数值与第三损失函数值的和满足第二预设阈值,确定所述虚拟预测网络。
需要说明的是,在进行视频生成时,仅以上述各程序模块的划分进行举例说明,实际应用中,可以根据需要而将上述处理分配由不同的程序模块完成,即将装置的内部结构划分成不同的程序模块,以完成以上描述的全部或者部分处理。另外,上述实施例提供的视频生成装置与视频生成方法实施例属于同一构思,其具体实现过程及有益效果详见方法实施例,这里不再赘述。对于本装置实施例中未披露的技术细节,请参照本发明方法实施例的描述而理解。
基于上述实施例的视频生成方法,本发明实施例还提供一种视频生成装置,如图12所示,图12为本发明实施例提供的一种视频生成装置的结构示意图二,该装置12包括:处理器1201和存储器1202;存储器1202存储处理器可执行的一个或者多个程序,当一个或者多个程序被执行时,通过处理器1201执行如前所述实施例的任意一种视频生成方法。
本领域内的技术人员应明白,本发明的实施例可提供为方法、系统、或计算机程序产品。因此,本发明可采用硬件实施例、软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器和光学存储器等)上实施的计算机程序产品的形式。
本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
以上所述,仅为本发明的较佳实施例而已,并非用于限定本发明的保护范围。
工业实用性
本发明实施例提供了一种视频生成方法及装置、计算机可读存储介质,其中,方法包括:采集真实对象的音视频序列;对音视频序列进行特征提取,确定拟人化特征;利用虚拟预测网络、预设的标准特征,以及第一特征对拟人化特征进行预测,生成虚拟对象的视频序列;预设的标准特征为参考对象对应的特征;第一特征表征不同的态度;虚拟对象的视频序列是根据真实对象的音视频序列生成虚拟对象相应反应的视频序列;呈现虚拟对象的视频序列。本发明实施例根据真实对象的音视频序列,生成虚拟对象的视频序列,呈现的虚拟对象的视频序列更加生动和准确。

Claims (20)

  1. 一种视频生成方法,包括:
    采集真实对象的音视频序列;
    对所述音视频序列进行特征提取,确定拟人化特征;
    利用虚拟预测网络、预设的标准特征,以及第一特征对所述拟人化特征进行预测,生成虚拟对象的视频序列;所述预设的标准特征为参考对象对应的特征;所述第一特征表征不同的态度;所述虚拟对象的视频序列是根据真实对象的音视频序列生成虚拟对象相应反应的视频序列;
    呈现所述虚拟对象的视频序列。
  2. 根据权利要求1所述的方法,其中,所述利用虚拟预测网络、预设的标准特征,以及第一特征对所述拟人化特征进行预测,生成虚拟对象的视频序列,包括:
    获取预设的标准特征;所述预设的标准特征包括第一姿态表情特征和第一身份特征;
    基于所述第一姿态表情特征、所述拟人化特征和所述第一特征,通过所述虚拟预测网络进行预测和解码,确定虚拟对象的多帧姿态表情特征;
    基于所述虚拟对象的多帧姿态表情特征和所述第一身份特征,生成所述虚拟对象的视频序列。
  3. 根据权利要求2所述的方法,其中,所述获取预设的标准特征,包括:
    获取标准图像;所述标准图像表征参考对象的图像;
    通过人脸重建模型对所述标准图像进行特征提取,得到所述预设的标准特征。
  4. 根据权利要求2所述的方法,其中,所述拟人化特征包括:所述音视频序列对应的多帧拟人化特征;所述虚拟预测网络包括:第一处理模块和第二处理模块;
    所述基于所述第一姿态表情特征、所述拟人化特征和所述第一特征,通过所述虚拟预测网络进行预测和解码,确定虚拟对象的多帧姿态表情特征,包括:
    基于所述多帧拟人化特征中的第一帧拟人化特征、所述第一姿态表情特征和所述第一特征,通过所述第一处理模块进行预测,得到下一个预测视频帧;其中,所述第一特征为积极态度、消极态度和普通态度中的一种;
    通过所述第二处理模块对所述下一个预测视频帧进行解码,确定所述下一个预测视频帧对应的所述虚拟对象的下一个姿态表情特征;
    基于所述下一个姿态表情特征和所述多帧拟人化特征中的下一帧拟人化特征,继续进行预测和解码,直至得到最后一个预测视频帧对应的所述虚拟对象的最后一个姿态表情特征时为止,从而得到所述虚拟对象的多帧姿态表情特征;所述第一姿态表情特征作为所述多帧姿态表情特征中的第一帧。
  5. 根据权利要求2所述的方法,其中,所述基于所述虚拟对象的多帧姿态表情特征和所述第一身份特征,生成所述虚拟对象的视频序列,包括:
    将所述多帧姿态表情特征中的每一帧姿态表情特征分别与所述第一身份特征进行融合,得到多个第二特征;所述第二特征表征身份特征与姿态表情特征的融合结果;
    将所述多个第二特征通过渲染器,生成所述虚拟对象的视频序列;所述第一身份特征包括第一身份标识、第一材质和第一光照信息。
  6. 根据权利要求1-5任一项所述的方法,其中,所述音视频序列包括:真实对象的音频序列和真实对象的视频序列;
    所述对所述音视频序列进行特征提取,确定拟人化特征,包括:
    通过编码器对所述真实对象的视频序列进行特征提取,得到多个视频特征;
    通过编码器对所述真实对象的音频序列进行特征提取,得到多个音频特征;所述音频特征包括响度、过零率和倒频谱系数;
    基于所述多个视频特征和所述多个音频特征,通过特征融合函数进行特征转换,确定所述拟人化特征;所述拟人化特征为音视频序列对应的多帧拟人化特征;所述拟人化特征包括视频特征和音频特征。
  7. 根据权利要求6所述的方法,其中,所述通过编码器对所述真实对象的视频序列进行特征提取,得到多个视频特征,包括:
    通过人脸重建模型对所述真实对象的视频序列的每一帧视频帧进行特征提取,得到多个视频帧特征;所述视频帧特征包括第二身份特征和第二姿态表情特征;
    将所述真实对象的视频序列中对应的所有所述第二姿态表情特征作为所述视频特征。
  8. 根据权利要求1-5任一项所述的方法,其中,所述利用虚拟预测网络和预设的标准特征,以及第一特征对所述拟人化特征进行预测,生成虚拟对象的视频序列之前,所述方法还包括:
    采集真实倾诉对象的音视频序列样本,及其对应的真实倾听对象的人脸图像;
    通过初始编码器对所述音视频序列样本进行特征提取,确定拟人化样本特征;
    通过初始虚拟预测网络和所述拟人化样本特征,生成训练真实对象的音视频序列样本下的预测人脸特征;所述预测人脸特征包含预测姿态特征和预测表情特征;
    根据所述真实倾听对象的人脸图像,通过人脸重建模型进行特征提取,确定真实人脸特征,所述真实人脸特征包括真实姿态特征和真实表情特征;
    通过第一损失函数和所述拟人化样本特征不断优化所述初始编码器,直到第一损失函数值满足第一预设阈值,确定所述编码器;
    基于所述真实人脸特征和预测人脸特征,通过第二损失函数和第三损失函数不断优化初始虚拟预测网络,直到第二损失函数值和第三损失函数值均满足第二预设阈值,确定所述虚拟预测网络。
  9. 根据权利要求8所述的方法,其中,所述基于所述真实人脸特征和预测人脸特征,通过第二损失函数和第三损失函数不断优化初始虚拟预测网络,直到第二损失函数值和第三损失函数值满足第二预设阈值,确定所述虚拟预测网络,包括:
    基于所述真实人脸特征和所述预测人脸特征,确定第二损失函数;所述第二损失函数用来保证预测表情、预测姿态与真实表情、真实姿态相似;
    基于所述真实人脸特征对应的变化函数和所述预测人脸特征对应的变化函数,确定第三损失函数;所述第三损失函数用来保证预测人脸特征的帧间连续性与真实人脸特征相似;
    通过所述第二损失函数和所述第三损失函数,持续优化初始虚拟预测网络,直到第二损失函数值和第三损失函数值满足第二预设阈值,确定所述虚拟预测网络。
  10. 一种视频生成装置,包括获取部分、确定部分和生成部分;其中,
    所述获取部分,被配置为采集真实对象的音视频序列;
    所述确定部分,被配置为对所述音视频序列进行特征提取,确定拟人化特征;
    所述生成部分,被配置为利用虚拟预测网络、预设的标准特征,以及第一特征对所述拟人化特征进行预测,生成虚拟对象的视频序列;所述预设的标准特征为参考对象对应的特征;所述第一特征表征不同的态度;所述虚拟对象的视频序列是根据真实对象的音视频序列生成虚拟对象相应反应的视频序列;呈现所述虚拟对象的视频序列。
  11. 根据权利要求10所述的装置,其中,所述获取部分,还被配置为获取预设的标准特征;所述预设的标准特征包括第一姿态表情特征和第一身份特征;
    所述确定部分,还被配置为基于所述第一姿态表情特征、所述拟人化特征和所述第一特征,通过所述虚拟预测网络进行预测和解码,确定虚拟对象的多帧姿态表情特征;
    所述生成部分,还被配置为基于所述虚拟对象的多帧姿态表情特征和所述第一身份特征,生成所述虚拟对象的视频序列。
  12. 根据权利要求11所述的装置,其中,所述获取部分,还被配置为获取标准图像;所述标准图像表征参考对象的图像;通过人脸重建模型对所述标准图像进行特征提取,得到所述预设的标准特征。
  13. 根据权利要求11所述的装置,其中,所述拟人化特征包括:所述音视频序列对应的多帧拟人化特征;所述虚拟预测网络包括:第一处理模块和第二处理模块;
    所述获取部分，还被配置为基于所述多帧拟人化特征中的第一帧拟人化特征、所述第一姿态表情特征和所述第一特征，通过所述第一处理模块进行预测，得到下一个预测视频帧；其中，所述第一特征为积极态度、消极态度和普通态度中的一种；
    所述确定部分,还被配置为通过所述第二处理模块对所述下一个预测视频帧进行解码,确定所述下一个预测视频帧对应的所述虚拟对象的下一个姿态表情特征;
    所述获取部分,还被配置为基于所述下一个姿态表情特征和所述多帧拟人化特征中的下一帧拟人化特征,继续进行预测和解码,直至得到最后一个预测视频帧对应的所述虚拟对象的最后一个姿态表情特征时为止,从而得到所述虚拟对象的多帧姿态表情特征;所述第一姿态表情特征作为所述多帧姿态表情特征中的第一帧。
  14. 根据权利要求11所述的装置,其中,所述获取部分,还被配置为将所述多帧姿态表情特征中的每一帧姿态表情特征分别与所述第一身份特征进行融合,得到多个第二特征;所述第二特征表征身份特征与姿态表情特征的融合结果;
    所述生成部分,还被配置为将所述多个第二特征通过渲染器,生成所述虚拟对象的视频序列;所述第一身份特征包括第一身份标识、第一材质和第一光照信息。
  15. 根据权利要求10-14任一项所述的装置,其中,所述音视频序列包括:真实对象的音频序列和真实对象的视频序列;
    所述获取部分,还被配置为通过编码器对所述真实对象的视频序列进行特征提取,得到多个视频特征;通过编码器对所述真实对象的音频序列进行特征提取,得到多个音频特征;所述音频特征包括响度、过零率和倒频谱系数;
    所述确定部分,还被配置为基于所述多个视频特征和所述多个音频特征,通过特征融合函数进行特征转换,确定所述拟人化特征;所述拟人化特征为音视频序列对应的多帧拟人化特征;所述拟人化特征包括视频特征和音频特征。
  16. 根据权利要求15所述的装置,其中,所述获取部分,还被配置为通过人脸重建模型对所述真实对象的视频序列的每一帧视频帧进行特征提取,得到多个视频帧特征;所述视频帧特征包括第二身份特征和第二姿态表情特征;
    所述确定部分,还被配置为将所述真实对象的视频序列中对应的所有所述第二姿态表情特征作为所述视频特征。
  17. 根据权利要求10-14任一项所述的装置,其中,所述获取部分,还被配置为采集真实倾诉对象的音视频序列样本,及其对应的真实倾听对象的人脸图像;
    所述确定部分,还被配置为通过初始编码器对所述音视频序列样本进行特征提取,确定拟人化样本特征;
    所述生成部分,还被配置为通过初始虚拟预测网络和所述拟人化样本特征,生成训练真实对象的音视频序列样本下的预测人脸特征;所述预测人脸特征包含预测姿态特征和预测表情特征;
    所述确定部分,还被配置为根据所述真实倾听对象的人脸图像,通过人脸重建模型进行特征提取,确定真实人脸特征,所述真实人脸特征包括真实姿态特征和真实表情特征;通过第一损失函数和所述拟人化样本特征不断优化所述初始编码器,直到第一损失函数值满足第一预设阈值,确定所述编码器;基于所述真实人脸特征和预测人脸特征,通过第二损失函数和第三损失函数不断优化初始虚拟预测网络,直到第二损失函数值和第三损失函数值均满足第二预设阈值,确定所述虚拟预测网络。
  18. 根据权利要求17所述的装置，其中，所述确定部分，还被配置为基于所述真实人脸特征和所述预测人脸特征，确定第二损失函数；所述第二损失函数用来保证预测表情、预测姿态与真实表情、真实姿态相似；基于所述真实人脸特征对应的变化函数和所述预测人脸特征对应的变化函数，确定第三损失函数；所述第三损失函数用来保证预测人脸特征的帧间连续性与真实人脸特征相似；通过所述第二损失函数和所述第三损失函数，持续优化初始虚拟预测网络，直到第二损失函数值和第三损失函数值满足第二预设阈值，确定所述虚拟预测网络。
  19. 一种视频生成装置,包括:
    存储器,用于存储可执行指令;
    处理器，用于执行所述存储器中存储的可执行指令时，实现权利要求1-9任一项所述的视频生成方法。
  20. 一种计算机可读存储介质,所述存储介质存储有可执行指令,当所述可执行指令被执行时,用于引起处理器执行如权利要求1-9任一项所述的视频生成方法。
PCT/CN2023/075752 2022-07-14 2023-02-13 一种视频生成方法及装置、计算机可读存储介质 WO2024011903A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210834191.6A CN115225829A (zh) 2022-07-14 2022-07-14 一种视频生成方法及装置、计算机可读存储介质
CN202210834191.6 2022-07-14

Publications (1)

Publication Number Publication Date
WO2024011903A1 true WO2024011903A1 (zh) 2024-01-18

Family

ID=83612025

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/075752 WO2024011903A1 (zh) 2022-07-14 2023-02-13 一种视频生成方法及装置、计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN115225829A (zh)
WO (1) WO2024011903A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115225829A (zh) * 2022-07-14 2022-10-21 北京京东尚科信息技术有限公司 一种视频生成方法及装置、计算机可读存储介质
CN115953813B (zh) * 2022-12-19 2024-01-30 北京字跳网络技术有限公司 一种表情驱动方法、装置、设备及存储介质
CN116156277B (zh) * 2023-02-16 2024-05-07 平安科技(深圳)有限公司 基于姿态预测的视频生成方法及相关设备
CN116112761B (zh) * 2023-04-12 2023-06-27 海马云(天津)信息技术有限公司 生成虚拟形象视频的方法及装置、电子设备和存储介质
CN117593473A (zh) * 2024-01-17 2024-02-23 淘宝(中国)软件有限公司 动作图像与视频生成方法、设备与存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022062678A1 (zh) * 2020-09-25 2022-03-31 魔珐(上海)信息科技有限公司 虚拟直播方法、装置、系统及存储介质
CN114332976A (zh) * 2021-09-17 2022-04-12 广州繁星互娱信息科技有限公司 虚拟对象处理方法、电子设备及存储介质
CN114173142A (zh) * 2021-11-19 2022-03-11 广州繁星互娱信息科技有限公司 对象直播展示方法和装置、存储介质及电子设备
CN115225829A (zh) * 2022-07-14 2022-10-21 北京京东尚科信息技术有限公司 一种视频生成方法及装置、计算机可读存储介质

Also Published As

Publication number Publication date
CN115225829A (zh) 2022-10-21

Similar Documents

Publication Publication Date Title
WO2024011903A1 (zh) 一种视频生成方法及装置、计算机可读存储介质
Ginosar et al. Learning individual styles of conversational gesture
Avots et al. Audiovisual emotion recognition in wild
Chou et al. NNIME: The NTHU-NTUA Chinese interactive multimodal emotion corpus
McKeown et al. The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent
Niewiadomski et al. Laugh-aware virtual agent and its impact on user amusement
Urbain et al. AVLaughterCycle: Enabling a virtual agent to join in laughing with a conversational partner using a similarity-driven audiovisual laughter animation
WO2020081872A1 (en) Characterizing content for audio-video dubbing and other transformations
WO2022170848A1 (zh) 人机交互方法、装置、系统、电子设备以及计算机介质
CN113392201A (zh) 信息交互方法、装置、电子设备、介质和程序产品
Sun et al. Towards visual and vocal mimicry recognition in human-human interactions
Ishii et al. Multimodal fusion using respiration and gaze for predicting next speaker in multi-party meetings
Dupont et al. Laughter research: A review of the ILHAIRE project
CN112738557A (zh) 视频处理方法及装置
CN111737516A (zh) 一种互动音乐生成方法、装置、智能音箱及存储介质
Rebol et al. Passing a non-verbal turing test: Evaluating gesture animations generated from speech
WO2023246163A9 (zh) 一种虚拟数字人驱动方法、装置、设备和介质
Terrell et al. A regression-based approach to modeling addressee backchannels
CN113395569A (zh) 视频生成方法及装置
El Haddad et al. Laughter and smile processing for human-computer interactions
Kadiri et al. Naturalistic audio-visual emotion database
Shen et al. Vida-man: visual dialog with digital humans
Song et al. Emotional listener portrait: Neural listener head generation with emotion
CN115222857A (zh) 生成虚拟形象的方法、装置、电子设备和计算机可读介质
Kawahara Smart posterboard: Multi-modal sensing and analysis of poster conversations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23838405

Country of ref document: EP

Kind code of ref document: A1