CN117576608A - Voice-driven controllable similar key frame virtual face video generation method and interaction device - Google Patents

Info

Publication number
CN117576608A
Authority
CN
China
Prior art keywords
interaction
voice
information
face video
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311535537.3A
Other languages
Chinese (zh)
Inventor
李鹏
李响
刘鑫淼
顾恒文
尹莉莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202311535537.3A priority Critical patent/CN117576608A/en
Publication of CN117576608A publication Critical patent/CN117576608A/en

Classifications

    • G06V 20/40 - Scenes; scene-specific elements in video content
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G06N 3/045 - Combinations of networks
    • G06V 10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/761 - Proximity, similarity or dissimilarity measures
    • G06V 10/82 - Image or video recognition or understanding using neural networks
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 - Feedback of the input speech

Abstract

The invention provides a voice-driven controllable similar key frame virtual face video generation method and an interaction device, comprising the following steps: obtaining the current image information to be input through an image information obtaining unit; obtaining the current voice information to be input through an audio information obtaining unit; obtaining the voice or text information input by the interactor through an interaction information obtaining unit; storing a voice-driven controllable similar key frame virtual face video generation and interaction program in a program memory, the program being executed when read by the generation processor or the interaction processor; processing the image and audio information with the generation processor to output a virtual face video; processing the video and interaction information with the interaction processor to complete the virtual human interaction process; and assisting the interactor through the interaction interface. With this method and device, any character can be cloned into a virtual human whose appearance and voice are consistent with the original and which has a certain interaction capability; the method simplifies the generation of the virtual human's face video and enhances the realism and interactivity of the virtual human.

Description

Voice-driven controllable similar key frame virtual face video generation method and interaction device
Technical Field
The invention belongs to the field of artificial intelligence, and in particular relates to a voice-driven controllable similar key frame virtual face video generation method and an interaction device.
Background
With the rapid development of artificial intelligence and computer graphics, virtual human technology can create realistic character images that exhibit actions, emotions, and interactive capabilities similar to those of real humans. As one of the main interfaces linking the real and virtual worlds, virtual humans have gradually become excellent employees in many industries; talking and live-streaming virtual humans in particular satisfy the demands of round-the-clock availability, freedom from scandal, and high loyalty, making them an attractive solution in the industry. The two currently popular modeling approaches are scan-based reconstruction and modeling with rig binding: the former is costly, while the latter offers less fidelity, but with the development of artificial intelligence the latter is gradually becoming the trend.
As an emerging product, virtual humans still face several problems. Customization costs are high: a high-quality virtual human product generally requires paying for persona customization, software services, and motion-capture equipment. Virtual humans also have short life cycles: many in the talking and live-streaming industries are driven by rules or by motion-capture devices, and poor interactivity, weak innovation capability, and high operating costs cause many virtual humans to be retired or fade soon after their peak.
Summary of the Invention
This application provides a voice-driven controllable similar key frame virtual face video generation and interaction method, aiming to solve the problems of virtual human modeling and interaction driving. A 3D model is built and rendered from a single high-quality portrait image, and the model is driven by voice to generate a virtual face video whose initial frame and end frame are constrained to the portrait image. The interactive effect is achieved by splicing multiple video segments and inserting a silent state between them to acquire the input for the next round of interaction.
This application provides a voice-driven controllable similar key frame virtual face video generation method, comprising the following steps:
Picture information and voice information are input; the optimal picture is a standard ID-style photograph (medium shot, frontal face, neutral expression, closed mouth, stable illumination, a background color distinguishable from the subject, and no facial occlusion).
The positions of 68 facial key points in specific areas (such as the eyes, nose, and mouth) are detected with the shape_predictor_68_face_landmarks.dat model.
A three-dimensional morphable model (3DMM) space is used as the intermediate representation, with the α coefficients encoding identity and the β coefficients encoding expression; the three-dimensional face shape S is decoupled in the standard 3DMM form S = S̄ + α·B_id + β·B_exp, where S̄ is the mean face shape and B_id, B_exp are the identity and expression bases.
The introduced coefficients r and t represent head rotation and translation, respectively. The head pose ρ = [r, t] and the expression coefficients β are learned separately from the voice information, and the motion parameters are modeled as {β, r, t}, producing implicit face-rendering motion coefficients.
Audio and head data are encoded as input signals using a dual-stream Transformer structure, yielding separate representations of the audio-related and head-related motion.
After the input signals are processed with the dot-product attention mechanism, they are embedded by a linear layer and mapped into a feature space of the same dimension, producing the output sequences of audio and head motion together with the key-pose embedding sequence.
The generation of the video's terminal frames is constrained by a position-embedding matrix PE indexed by the relative temporal distance; the embedding of the relative position of a terminal frame is expressed as
PE_L = PE(|t - t_i|, ·).
The head data and the audio data are concatenated along the time dimension and fed into a cross-modal Transformer, which learns the association between them and then outputs driving information covering the head pose.
The head-pose driving information, the key frame coordinate information, and the key frame embedding information are taken together as the overall input of the cross-modal decoder.
The motion results in the output sequence are mapped through the implicit rendering mapping function and a linear transformation layer to obtain the final output sequence.
the output sequence is rendered to generate a final video.
This application further provides a voice-driven controllable similar key frame virtual face video interaction method, comprising the following steps:
the interactive method is based on the output of a voice-driven controllable similar key frame virtual face video generation method, namely, frames at the joint of two sections of video are kept consistent.
The interactive method is divided into a starting stage, an interactive stage and an ending stage. The interaction scene is a plurality of interaction questions and answers of a real person interactor and a virtual person. The interactors act as the initiator and terminator of the interaction. The virtual person is asked as a starting signal of the virtual person, and after the virtual person answers, the interactors do not ask again as an ending signal of the virtual person within a certain time.
Image information, voice-clone information, and question-answering-system information are input as the preconditions for starting the virtual human.
From the image information, a corresponding virtual human video 0 is generated by the voice-driven controllable similar key frame virtual face video generation method; this video is driven by 10 seconds of silent audio.
The interactor poses question 1 to the virtual human, in either voice or text form.
The virtual human uses the question-answering system to generate answer 1 to the question; the state of answer 1 is text.
The generated answer is converted into an audio file by voice cloning; the state of answer 1 is now audio.
The corresponding virtual human video 1 is generated by the voice-driven controllable similar key frame virtual face video generation method.
Video 1 is played to the interactor as the answer to question 1.
After video 1 finishes playing, video 0 is connected automatically; owing to the key-frame constraint, the end frame of video 1 coincides with the start frame of video 0.
After video 0 finishes playing, it automatically loops; owing to the key-frame constraint, the end frame of video 0 coincides with its own start frame.
If the interactor raises a new question while video 0 loops (up to 3 times), human-machine interaction proceeds in the cyclic interaction stage.
If the interactor raises no new question during the 3 loops of video 0, the ending stage is entered and the virtual human ends the interaction.
If the interactor asks to finish at any point during the interaction, the ending stage is entered and the virtual human ends the interaction.
This application further provides a voice-driven controllable similar key frame virtual face video interaction device, comprising:
an image information obtaining unit, configured to obtain the current image information to be input;
an audio information obtaining unit, configured to obtain the current voice information to be input;
an interaction information obtaining unit, configured to obtain the voice or text information input by the interactor;
a program memory, configured to store the voice-driven controllable similar key frame virtual face video generation and interaction program, which is executed when read by the generation processor or the interaction processor;
a generation processor, configured to process the image and audio information and complete video generation;
an interaction processor, configured to process the video and interaction information and complete the interaction process;
and an interaction interface, a visual window that facilitates the interactor's interaction.
Compared with the prior art, this application has the following advantages:
The voice-driven controllable similar key frame virtual face video generation and interaction method achieves key-frame-constrained virtual face video generation from image and voice information acquired in real time, and achieves interaction by combining the video generation method with voice cloning, a question-answering system, and video splicing once the interactor's question has been acquired. With this interaction method, any character can be cloned into a virtual human whose appearance and voice are consistent with the original and which has a certain interaction capability. The method simplifies the generation of the virtual human's face video and enhances the realism and interactivity of the virtual human.
Drawings
FIG. 1 is a flow chart of the voice-driven controllable similar key frame virtual face video generation method.
FIG. 2 is a flow chart of the voice-driven controllable similar key frame virtual face video interaction method.
FIG. 3 is a schematic diagram of the voice-driven controllable similar key frame virtual face video interaction device.
Detailed Description
The method and apparatus of embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings.
As shown in FIG. 1, the voice-driven controllable similar key frame virtual face video generation method of this embodiment comprises the following steps:
Picture information and voice information are input; the optimal picture is a standard ID-style photograph (medium shot, frontal face, neutral expression, closed mouth, stable illumination, a background color distinguishable from the subject, and no facial occlusion).
The positions of 68 facial key points in specific areas (such as the eyes, nose, and mouth) are detected with the shape_predictor_68_face_landmarks.dat model.
The shape_predictor_68_face_landmarks.dat model can be obtained as follows:
after downloading, the Dlib model file is decompressed and stored, and facial key-point detection is performed in the project with this model through the Dlib library. The embodiment uses OpenCV to load images and Dlib for face detection and key-point detection, as in the sketch below.
A three-dimensional morphable model (3DMM) space is used as the intermediate representation, with the α coefficients encoding identity and the β coefficients encoding expression; the three-dimensional face shape S is decoupled in the standard 3DMM form S = S̄ + α·B_id + β·B_exp, where S̄ is the mean face shape and B_id, B_exp are the identity and expression bases.
The three-dimensional morphable model space can be obtained as follows:
a large number of three-dimensional face data samples are calibrated and aligned so that the feature points of different faces correspond in space; a 3D shape and texture model of the face is then built with statistical methods such as principal component analysis, capturing the principal shape and texture variations. In the training phase the model parameters are optimized to best fit the training data, as in the sketch below.
The introduced coefficients r and t represent head rotation and translation, respectively. The head pose ρ = [r, t] and the expression coefficients β are learned separately from the voice information, and the motion parameters are modeled as {β, r, t}, producing implicit face-rendering motion coefficients.
Audio and head data are encoded as input signals using a dual-stream Transformer structure, yielding separate representations of the audio-related and head-related motion.
The dual-stream Transformer structure can be obtained as follows:
two parallel Transformer streams are designed, each focused on one task or data type; the input data are distributed to the two streams, and the model parameters are updated in training with suitable loss functions and optimization algorithms, as in the sketch below.
After the input signals are processed with the dot-product attention mechanism, they are embedded by a linear layer and mapped into a feature space of the same dimension, producing the output sequences of audio and head motion together with the key-pose embedding sequence.
The generation of the video's terminal frames is constrained by a position-embedding matrix PE indexed by the relative temporal distance; the embedding of the relative position of a terminal frame is expressed as
PE_L = PE(|t - t_i|, ·).
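One plausible realization of PE(|t - t_i|, ·) is the standard sinusoidal embedding indexed by the temporal distance to the key frame t_i; the sketch below makes that assumption:

```python
import math
import torch

def keyframe_position_embedding(t: torch.Tensor, t_i: float, d_model: int = 256):
    """Sinusoidal embedding of the distance |t - t_i| to key frame t_i."""
    dist = (t - t_i).abs().unsqueeze(-1)                    # (n_frames, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)
    freqs = torch.exp(-dims * math.log(10000.0) / d_model)  # (d_model / 2,)
    pe = torch.zeros(t.numel(), d_model)
    pe[:, 0::2] = torch.sin(dist * freqs)                   # even dimensions
    pe[:, 1::2] = torch.cos(dist * freqs)                   # odd dimensions
    return pe
```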
The head data and the audio data are concatenated along the time dimension and fed into a cross-modal Transformer, which learns the association between them and then outputs driving information covering the head pose.
The cross-modal Transformer can be obtained as follows:
the multi-modal input, the attention mechanism, and the loss function are designed, and a model that effectively captures the relations among the different data modalities is built with the help of a pre-trained model and data preprocessing. A simplified stand-in is sketched below.
The head-pose driving information, the key frame coordinate information, and the key frame embedding information are taken together as the overall input of the cross-modal decoder.
The motion results in the output sequence are mapped through the implicit rendering mapping function and a linear transformation layer to obtain the final output sequence.
the output sequence is rendered to generate a final video.
As shown in FIG. 2, the voice-driven controllable similar key frame virtual face video interaction method of this embodiment comprises the following steps:
the interaction method is based on the output of a voice-driven controllable similar key frame virtual face video generation technology, ensures that frames at the joint of two sections of videos are kept consistent, and improves the overall visual continuity.
The whole interaction process comprises three stages: starting, interaction, and ending.
In the interaction scenario, a real interactor conducts multiple rounds of question and answer with the virtual human, the interactor being both initiator and terminator. The start signal of the interaction is the interactor posing a question to the virtual human; the end signal is the interactor asking nothing further within a certain time after the virtual human's answer.
The preconditions for starting the virtual human are the input image information, voice-clone information, and question-answering-system information. Virtual human video 0 is generated from the image information by the voice-driven controllable similar key frame virtual face video generation method; this video is driven by 10 seconds of silent audio.
At the start of the interaction, the interactor may pose a question to the virtual human in voice or text form.
The virtual human generates answer 1 through the question-answering system; its initial state is text. Answer 1 is converted into an audio file by voice-cloning technology, entering the audio state. Virtual human video 1 is then generated by the voice-driven controllable similar key frame virtual face video generation method.
Video 1 is played to the interactor as the response to question 1. After video 1 finishes, the system automatically connects to video 0; the key-frame constraint ensures that the end frame of video 1 coincides with the start frame of video 0.
Video 0 then plays in a loop, preserving continuity. If the interactor raises a new question during the 3 loops of video 0, the cyclic interaction stage is entered; if no new question is raised during this time, the system enters the ending stage and the virtual human ends the interaction with the interactor.
If the interactor actively asks to finish during the interaction, the ending stage is entered immediately and the virtual human's interaction ends. The whole session loop is sketched below.
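A hedged sketch of this session loop; get_question, answer_question, synthesize_speech, generate_face_video, play_video, and idle_video are hypothetical stand-ins for the question-acquisition, question-answering, voice-cloning, video-generation, playback, and video 0 components:

```python
MAX_IDLE_LOOPS = 3  # video 0 loops at most 3 times while awaiting a question

def interaction_session(get_question, answer_question, synthesize_speech,
                        generate_face_video, play_video, idle_video):
    play_video(idle_video)                      # video 0: 10 s silent drive
    while True:
        question = None
        for _ in range(MAX_IDLE_LOOPS):         # cyclic interaction stage
            question = get_question()           # voice or text, or None
            if question is not None:
                break
            play_video(idle_video)              # loop video 0 once more
        if question is None or question == "end":
            return                              # ending stage
        text = answer_question(question)        # answer: text state
        audio = synthesize_speech(text)         # answer: audio state (clone)
        play_video(generate_face_video(audio))  # video 1, key-frame constrained
        play_video(idle_video)                  # seamless return to video 0
```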
As shown in FIG. 3, the voice-driven controllable similar key frame virtual face video interaction device of this embodiment comprises the following units:
The image information obtaining unit is a module designed to extract the current image information; its task is to acquire the data of the image to be input effectively. Through this unit, the system can capture image features more comprehensively and accurately, providing richer information for subsequent processing.
The audio information obtaining unit is a key component of the system, used to acquire the current voice information for processing. Through this module, the system can extract audio features efficiently, ensure an accurate understanding of the input voice, and provide a reliable basis for voice-driven application scenarios.
The interaction information obtaining unit is designed to acquire the voice or text information input by the interactor. This module lets the system receive user instructions flexibly, enables diverse interaction modes, and improves the system's user-friendliness and applicability.
The program memory is a module dedicated to storing the voice-driven controllable similar key frame virtual face video generation and interaction program. When read by the generation processor or the interaction processor, this memory ensures the program runs efficiently and stably.
The generation processor is one of the system's core components; it processes the image and audio information so that the video generation task is completed successfully. Through an efficient processing mechanism, it ensures that the quality and smoothness of the generated video reach the level users expect.
The interaction processor, another key module, focuses on processing the video and the interaction information to realize a smooth interaction process. Through intelligent algorithms and fast data processing, it makes the interaction between user and system more natural and seamless.
The interaction interface is the visual window the system provides to the interactor, intended to simplify interaction between the user and the system. Through this interface an interactor can communicate with the system easily and intuitively, improving the overall user experience. A schematic wiring of these units is sketched below.
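A schematic sketch of how these units might be wired together; every name here is a hypothetical stand-in for the modules of FIG. 3, not the patented hardware itself:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class InteractionDevice:
    get_image: Callable[[], bytes]            # image information obtaining unit
    get_audio: Callable[[], bytes]            # audio information obtaining unit
    get_interaction: Callable[[], str]        # interaction information obtaining unit
    generate: Callable[[bytes, bytes], str]   # generation processor -> video path
    interact: Callable[[str, str], None]      # interaction processor
    show: Callable[[str], None]               # interaction interface

    def run_round(self) -> None:
        # Generate the virtual face video, then run one interaction round.
        video = self.generate(self.get_image(), self.get_audio())
        self.show(video)
        self.interact(video, self.get_interaction())
```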
The embodiments described above are only some, rather than all, of the possible embodiments of the invention; all other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of this invention.

Claims (8)

1. A voice-driven controllable similar key frame virtual face video generation method and interaction device, mainly comprising the following steps:
(1) the image information obtaining unit obtains the current image information to be input;
(2) the audio information obtaining unit obtains the current voice information to be input;
(3) the interaction information obtaining unit obtains the voice or text information input by the interactor;
(4) the program memory stores a voice-driven controllable similar key frame virtual face video generation and interaction program, which is executed when read by the generation processor or the interaction processor;
(5) the generation processor processes the image and audio information and outputs the virtual face video;
(6) the interaction processor processes the video and interaction information to complete the virtual human interaction process;
(7) the interaction interface assists the interactor in interacting.
2. The voice-driven controllable similar key frame virtual face video generation method and interaction device according to claim 1, characterized in that step (1) has an input-interface feature, the interface connecting to external equipment so that the image information to be input can be received; an acquisition-method feature, the means of obtaining image information including sensors, cameras, scanning devices, and the like; a real-time feature, the ability to acquire the current image information in real time to ensure the system responds to the information promptly; a data-processing feature, covering any preprocessing, filtering, or other data-processing steps performed on the obtained image information; and a visual-cloning feature, including cloning of facial features, the head, texture mapping, processing details, and the like.
3. The voice-driven controllable similar key frame virtual face video generation method and interaction device according to claim 1, characterized in that step (2) has an input-interface feature, the interface connecting to external equipment so that the voice information to be input can be received; an acquisition-method feature, the means of obtaining voice information including microphones, voice-file import, and the like; a real-time feature, the ability to acquire the current voice information in real time to ensure the system responds to the information promptly; a data-processing feature, covering any preprocessing, filtering, or other data-processing steps performed on the obtained voice information; and a sound-cloning feature, including cloning of pronunciation style, intonation, timbre, and the like.
4. The voice-driven controllable similar key frame virtual face video generation method and interaction device according to claim 1, characterized in that step (3) has an input-diversity feature, accepting various input forms not limited to voice and text; a speech-recognition feature, involving speech-recognition algorithms and models; a text-processing feature, including text analysis, keyword extraction, and the like; a real-time interaction feature, the acquisition of and response to real-time input ensuring the system's immediacy; and a natural-language-processing feature, using natural-language-processing techniques to improve the understanding and handling of text information.
5. The voice-driven controllable similar key frame virtual face video generation method and interaction device according to claim 1, characterized in that step (4) has a voice-driving feature, virtual face video generation and interaction being driven by voice; a key-frame control feature, generating controllable key frames to control the appearance and motion of the terminal frames of the virtual face video; a virtual face video generation feature, the program's generation algorithm including three-dimensional modeling, animation rendering, and other techniques to ensure the realism of the virtual face video; an interaction-processing feature, interaction with the user including recognition of voice commands, real-time response, and the like; and a storage-and-reading feature, the program being stored in the memory in such a way that, when read by the generation processor or the interaction processor, it can be executed efficiently when needed.
6. The voice-driven controllable similar key frame virtual face video generation method and interaction device according to claim 1, characterized in that step (5) has a multi-modal information-processing feature, processing image and audio information simultaneously to provide more complete virtual face video generation; an image-processing feature, the handling of image information including three-dimensional modeling, texture mapping, and other techniques; an audio-processing feature, the handling of audio information including speech recognition, audio synthesis, and other techniques to achieve voice synchronization of the virtual face video; and a real-time generation feature, real-time processing capability ensuring that the virtual face video can be output rapidly while the image and audio information are processed.
7. The voice-driven controllable similar key frame virtual face video generation method and interaction device according to claim 1, characterized in that step (6) has a multi-modal information-processing feature, processing video and interaction information simultaneously to achieve comprehensive virtual human interaction; a video-processing feature, the handling of video information including image recognition, action recognition, and other techniques; an interaction-information-processing feature, the handling of the user's interaction information involving speech recognition, text analysis, and other techniques; and a real-time interaction feature, real-time processing capability ensuring that the virtual human interaction process can be completed rapidly while the video and interaction information are processed.
8. The voice-driven controllable similar key frame virtual face video generation method and interaction device according to claim 1, characterized in that step (7) has a multi-modal interaction feature, supporting multiple interaction modes including voice, gesture, touch, and the like; a user-interface design feature, the design of the interaction interface covering layout, icons, colors, and the like to improve user experience and interaction efficiency; a real-time interaction feature, the interaction interface having real-time response capability so that feedback can be provided rapidly during user interaction; and a personalized-interaction feature, the interaction interface having personalized customization functions to adapt to the interaction preferences and needs of different users.
CN202311535537.3A 2023-11-16 2023-11-16 Voice-driven-based controllable similar key frame virtual face video generation method and interaction device Pending CN117576608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311535537.3A CN117576608A (en) 2023-11-16 2023-11-16 Voice-driven-based controllable similar key frame virtual face video generation method and interaction device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311535537.3A CN117576608A (en) 2023-11-16 2023-11-16 Voice-driven-based controllable similar key frame virtual face video generation method and interaction device

Publications (1)

Publication Number Publication Date
CN117576608A (en) 2024-02-20

Family

Family ID: 89861974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311535537.3A Pending CN117576608A (en) 2023-11-16 2023-11-16 Voice-driven-based controllable similar key frame virtual face video generation method and interaction device

Country Status (1)

Country Link
CN (1) CN117576608A (en)

Similar Documents

Publication Publication Date Title
US20230316643A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
KR102503413B1 (en) Animation interaction method, device, equipment and storage medium
WO2022166709A1 (en) Virtual video live broadcast processing method and apparatus, and storage medium and electronic device
CN113554737A (en) Target object motion driving method, device, equipment and storage medium
CN111383307A (en) Video generation method and device based on portrait and storage medium
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
CN111045582A (en) Personalized virtual portrait activation interaction system and method
WO2022106654A2 (en) Methods and systems for video translation
CN111401101A (en) Video generation system based on portrait
CN112819933A (en) Data processing method and device, electronic equipment and storage medium
WO2023246163A9 (en) Virtual digital human driving method, apparatus, device, and medium
CN114863533A (en) Digital human generation method and device and storage medium
CN116524791A (en) Lip language learning auxiliary training system based on meta universe and application thereof
WO2024113701A1 (en) Voice-based video generation method and apparatus, server, and medium
JP3920889B2 (en) Image synthesizer
CN116912375A (en) Facial animation generation method and device, electronic equipment and storage medium
CN117115310A (en) Digital face generation method and system based on audio and image
CN117370605A (en) Virtual digital person driving method, device, equipment and medium
CN117576608A (en) Voice-driven-based controllable similar key frame virtual face video generation method and interaction device
CN114155321B (en) Face animation generation method based on self-supervision and mixed density network
CN114461772A (en) Digital human interaction system, method and device thereof, and computer readable storage medium
Verma et al. Animating expressive faces across languages
CN116843805B (en) Method, device, equipment and medium for generating virtual image containing behaviors
CN117456064A (en) Method and system for rapidly generating intelligent companion based on photo and short audio
CN117933318A (en) Method for constructing teaching digital person

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination