WO2024088321A1 - Virtual image face driving method and apparatus, electronic device and medium - Google Patents

Virtual image face driving method and apparatus, electronic device and medium

Info

Publication number
WO2024088321A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
information
parameter sequence
driving parameter
driving
Prior art date
Application number
PCT/CN2023/126582
Other languages
English (en)
French (fr)
Inventor
刘鑫
Original Assignee
维沃移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 维沃移动通信有限公司 filed Critical 维沃移动通信有限公司
Publication of WO2024088321A1 publication Critical patent/WO2024088321A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63Querying
    • G06F16/635Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the present application belongs to the field of artificial intelligence technology, and specifically relates to a virtual image face driving method, device, electronic device and medium.
  • a virtual image can be built and its facial expressions can be driven to simulate human speech.
  • each character corresponding to the voice segment and the lip movements corresponding to the facial data are aligned one by one to generate lip shape driving data corresponding to each character, so as to drive the lip shape of the virtual image to change.
  • the purpose of the embodiments of the present application is to provide a virtual image face driving method, device, electronic device and medium, which can solve the problem of poor synchronization effect caused by uncoordinated lip shape changes.
  • an embodiment of the present application provides a method for driving the face of a virtual image, the method comprising: obtaining first input information; the first input information comprising at least one of voice information and text information; generating voice-text alignment information based on the first input information; determining N phonemes corresponding to the first input information based on the voice-text alignment information, the phonemes comprising phoneme information, N being an integer greater than 1; generating a first driving parameter sequence based on the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes; and driving the face of the virtual image based on the first driving parameter sequence.
  • an embodiment of the present application provides a virtual image face driving device, the device comprising: an acquisition module, a generation module, a determination module and an execution module, wherein: the acquisition module is used to acquire first input information, and the first input information includes at least one of voice information and text information; the generation module is used to generate voice-text alignment information based on the first input information acquired by the acquisition module; the determination module is used to determine N phonemes corresponding to the first input information based on the voice-text alignment information generated by the generation module, the phonemes including phoneme information, and N being an integer greater than 1; the generation module is also used to generate a first driving parameter sequence based on the phonemes determined by the determination module, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes; the execution module is used to drive the face of the virtual image based on the first driving parameter sequence generated by the generation module.
  • an embodiment of the present application provides an electronic device, which includes a processor and a memory, wherein the memory stores programs or instructions that can be run on the processor, and when the program or instructions are executed by the processor, the steps of the method described in the first aspect are implemented.
  • an embodiment of the present application provides a readable storage medium, on which a program or instruction is stored, and when the program or instruction is executed by a processor, the steps of the method described in the first aspect are implemented.
  • an embodiment of the present application provides a chip, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a program or instruction to implement the method described in the first aspect.
  • an embodiment of the present application provides a computer program product, which is stored in a storage medium and is executed by at least one processor to implement the method described in the first aspect.
  • an electronic device can obtain first input information; the first input information includes at least one of voice information and text information; based on the first input information, voice-text alignment information is generated; based on the voice-text alignment information, N phonemes corresponding to the first input information are determined, the phonemes include phoneme information, and N is an integer greater than 1; based on the phonemes, the phoneme information, and the mapping relationship between the facial visual elements in the virtual image and the phonemes, a first driving parameter sequence is generated; based on the first driving parameter sequence, the face of the virtual image is driven.
  • FIG1 is a schematic diagram of a flow chart of a virtual image face driving method provided in an embodiment of the present application.
  • FIG2 is a schematic diagram of the structure of a virtual image face driving device provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • FIG. 4 is a hardware schematic diagram of an electronic device provided in an embodiment of the present application.
  • first, second, etc. in the specification and claims of the present application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It should be understood that the terms used in this way can be interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here, and the objects distinguished by "first”, “second”, etc. are generally of one type, and the number of objects is not limited.
  • the first object can be one or more.
  • “and/or” in the specification and claims represents at least one of the connected objects, and the character “/" generally indicates that the objects associated before and after are in an "or” relationship.
  • in the related art, when the electronic device generates the facial driving data for driving the virtual image, it first generates voice-text alignment information containing only text information based on the input text or voice, then obtains the lip movements corresponding to that text information, and finally generates the lip shape driving data for driving the virtual image.
  • because text information alone cannot precisely express the lip movements corresponding to the voice segment, the lip shape driving data finally generated is not precise, and the lip shape jitters.
  • in the virtual image face driving method, device, electronic device and medium provided in the embodiments of the present application, the electronic device can obtain first input information; the first input information includes at least one of voice information and text information; based on the first input information, voice-text alignment information is generated; based on the voice-text alignment information, N phonemes corresponding to the first input information are determined, the phonemes include phoneme information, and N is an integer greater than 1; the first driving parameter sequence is generated based on the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes; the face of the virtual image is driven based on the first driving parameter sequence.
  • the execution subject of the virtual image face driving method provided in this embodiment can be a virtual image face driving device, and the virtual image face driving device can be an electronic device, or a control module or a processing module in the electronic device.
  • the technical solution provided in the embodiment of the present application is described below by taking an electronic device as an example.
  • the embodiment of the present application provides a virtual image face driving method.
  • the virtual image face driving method may include the following steps 201 to 205:
  • Step 201 The electronic device obtains first input information.
  • the first input information includes at least one of voice information and text information.
  • the first input information is used to indicate the content to be expressed by the virtual image.
  • the above-mentioned virtual image may include a virtual character image generated by an electronic device.
  • Step 202 The electronic device generates speech-text alignment information based on the first input information.
  • the electronic device may align the above-mentioned voice information with the text information corresponding to the voice information to generate voice-text alignment information.
  • the above-mentioned voice-text alignment information is used to indicate the start time and end time of each word in the above-mentioned text information.
  • Step 203 The electronic device determines N phonemes corresponding to the first input information based on the voice-text alignment information.
  • the above-mentioned phoneme includes phoneme information.
  • N is an integer greater than 1
  • the above-mentioned phoneme information may be the pinyin information corresponding to the characters in the above-mentioned text information.
  • the above pinyin information can be divided into initial consonants and finals.
  • finals may include simple finals, complex finals, front nasal finals and back nasal finals.
  • the electronic device can divide the above N phonemes, according to pinyin type, into simple finals, complex finals, front nasal finals, back nasal finals, syllables read as a whole, and three-part syllables. Then, the three-part syllables and the syllables read as a whole are split into combinations of the first four kinds of finals to generate corresponding phoneme groups.
  • Step 204 The electronic device generates a first driving parameter sequence based on the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes.
  • the facial viseme may be a part or muscle of the face of the virtual image.
  • the facial visemes may include chin visemes, lip visemes, and other visemes.
  • the chin visemes and mouth visemes are used to determine lip movements, and the other visemes are used to determine facial expression movements of the eyes, nose, eyebrows and so on.
  • the above-mentioned chin part visual elements may include: front jaw, right lower jaw, left lower jaw, and lower jaw.
  • the above-mentioned mouth part visemes may include: mouth closed, mouth twisted, mouth twitching, right side of mouth, left side of mouth, left side of mouth smiling, right side of mouth smiling, mouth wrinkled to the left, mouth wrinkled to the right, dimple on the left side of mouth, dimple on the right side of mouth, mouth stretched to the left, mouth stretched to the right, mouth curled down, mouth curled up, lower lip raised, upper lip raised, pressing the left side of mouth, pressing the right side of mouth, lower left mouth, lower right mouth, upper left mouth, upper right mouth.
  • the above-mentioned other part visual elements may include: left eye blinking, left eye looking downward, left eye looking inward, left eye looking outward, left eye looking upward, left eye squinting, left eye wide open, right eye blinking, right eye looking downward, right eye looking inward, right eye looking outward, right eye looking upward, right eye squinting, right eye wide open, left eyebrow downward, right eyebrow downward, inner eyebrow upward, outer left eyebrow upward, outer right eyebrow upward, cheek puff, cheek slanted to the left, cheek slanted to the right, nose moving to the left, nose moving to the right, and sticking out tongue.
  • the mapping relationship can be pre-stored in the electronic device or obtained from the network side.
  • the following is an exemplary description of how the above mapping relationship may be generated:
  • the electronic device can first count the real-person facial viseme movements corresponding to each phoneme by recording a video with a real person, and record the corresponding driving parameters so that the virtual image is consistent with the facial movement of the real-person video. Then, based on the driving parameters corresponding to each of the above phonemes, the electronic device establishes a one-to-one correspondence between phonemes and visemes, that is, the above-mentioned mapping relationship.
  • the above mapping relationship can be a mapping value from phoneme to viseme.
  • for example, the mapping value of the front jaw is: 0.11426107876499998; the mapping value of the lower jaw is: 0.45334974318700005, etc.
  • Step 205 The electronic device drives the face of the virtual image based on the first driving parameter sequence.
  • the electronic device after obtaining the above-mentioned first driving parameter sequence, the electronic device can input the first driving parameter sequence into the driving engine, and can drive the lip movement of the virtual image's face through the first driving parameter sequence.
  • the driving engine may be a three-dimensional (3D) engine.
  • the electronic device can obtain the first input information; the first input information includes at least one of voice information and text information; based on the first input information, generate voice and text alignment information; based on the voice and text alignment information, determine the N phonemes corresponding to the first input information, the phonemes include phoneme information, and N is an integer greater than 1; based on the phonemes, the phoneme information, and the mapping relationship between the facial visual elements in the virtual image and the phonemes, generate a first driving parameter sequence; based on the first driving parameter sequence, drive the face of the virtual image.
  • the electronic device in the above step 204, “the electronic device generates a first driving parameter sequence based on the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes” may include the following steps 204a and 204b:
  • Step 204a The electronic device determines the importance weight and density weight corresponding to the phoneme based on the phoneme and the phoneme information.
  • the importance weight is used to characterize the importance of the phoneme in the facial driving of the virtual image.
  • the above density weight is used to characterize the density of each phoneme in the above N phonemes.
  • the electronic device can set a corresponding importance weight for each phoneme group based on the above phoneme groups. For example, for the importance weight, the weights of the initial, the simple final, the complex final, the front nasal final, and the back nasal final are set to (1.0, 0.9, 0.6, 0.5, 0.5) respectively.
  • Step 204b The electronic device generates a first driving parameter sequence based on the importance weight, the density weight, the phoneme, the phoneme information, and the mapping relationship between the facial visemes and the phonemes in the virtual image.
  • the electronic device in the above step 204b, “the electronic device generates a first driving parameter sequence based on the importance weight, the density weight, the phoneme, the phoneme information, and the mapping relationship between the facial visemes and the phonemes in the virtual image” may include the following steps 204b1 to 204b3:
  • Step 204b1 the electronic device obtains a phoneme sequence corresponding to the phoneme.
  • the above-mentioned phoneme sequence is used to indicate the order of the above-mentioned N phonemes.
  • the electronic device can arrange the N phonemes generated above according to the word order of the input information to obtain the phoneme sequence.
  • Step 204b2 The electronic device generates a first phoneme sequence based on the phoneme sequence, the importance weight, and the density weight.
  • the electronic device can discard phonemes with high density and low importance based on the above phoneme sequence, importance weight and density weight, and generate a new phoneme sequence, that is, the above first phoneme sequence.
  • Step 204b3 The electronic device converts the first phoneme sequence according to the phoneme information and the mapping relationship between the facial visemes and phonemes in the virtual image to generate a first driving parameter sequence.
  • for example, the electronic device can calculate the first driving parameter sequence by formula (1): v_i = min(S(p_i) * w_1i * w_2i, 1.0).
  • here, w_1i is the above importance weight, w_2i is the above density weight, and S is the above mapping relationship.
  • in this way, by converting the phoneme sequence into a viseme parameter sequence with temporal features, the electronic device can drive the virtual image based on the viseme parameter sequence, further improving the precision of driving the virtual image.
  • step 205 "the electronic device drives the face of the virtual image based on the first driving parameter sequence” may include optional steps 205a to 205c:
  • Step 205a The electronic device performs time-domain feature smoothing on the driving parameters corresponding to each phoneme in the first driving parameter sequence, respectively, to obtain a processed second driving parameter sequence.
  • the electronic device after obtaining the above-mentioned viseme parameter sequence, the electronic device can respectively perform smoothing processing on the viseme parameters of different parts.
  • the above-mentioned smoothing process may be performed using a convolution smoothing (Savitzky-Golay, SG) algorithm.
  • the electronic device can take each character in the above text information as a unit and smooth the driving parameters corresponding to the phonemes of each character, that is, apply the SG algorithm to the driving parameters corresponding to the phonemes of each character, so that the facial visemes corresponding to each character are more natural; finally, the above-mentioned second driving parameter sequence is obtained.
  • Step 205b the electronic device performs time domain feature smoothing processing on the second driving parameter sequence to obtain a third driving parameter sequence.
  • the facial driving data is related to the third driving parameter.
  • the electronic device may apply the SG algorithm to the second driving parameter sequence as a whole to ensure that the facial visemes corresponding to the entire input information are more natural, thereby obtaining the third driving parameter sequence.
  • taking the smoothing of the driving parameters corresponding to the chin as an example, the chin driving parameter sequence is obtained by formula (2): Vs_chin = SG((SG(v_1), ..., SG(v_m))).
  • here, m represents the number of characters in the above input information.
  • further, the electronic device can combine the driving parameter sequences corresponding to the different parts by formula (3), Vs = {Vs_chin, Vs_mouth, Vs_other}, to generate the final third driving parameter sequence.
  • Step 205c The electronic device drives the face of the virtual image based on the third driving parameter sequence.
  • the electronic device after obtaining the third driving parameter sequence, can input the third driving parameter sequence into the 3D engine, and can drive the lip movement of the virtual image's face through the third driving parameter sequence.
  • the driving parameters corresponding to each phoneme are first smoothed, and then the overall driving parameter sequence is smoothed, so that the generated driving parameter sequence is more refined, avoiding the unnaturalness and jitter of the virtual image caused by the driving parameter jumps in the transition stages of different phonemes.
  • step 205 "the electronic device drives the face of the virtual image based on the first driving parameter sequence” may include optional steps 205d to 205g:
  • Step 205d The electronic device generates an energy coefficient weight corresponding to each phoneme based on the short-time energy of the first input information.
  • the short-time energy includes the unvoiced part and the voiced part of the speech information.
  • the energy corresponding to the voiced sound part is greater than the energy corresponding to the unvoiced sound part.
  • the energy coefficient weight is used to characterize the proportion of the unvoiced part and the voiced part in the voice information. In other words, the greater the energy coefficient weight, the greater the volume of the corresponding voice information.
  • Step 205e The electronic device obtains an energy coefficient weight sequence corresponding to the phoneme based on the phoneme sequence and energy coefficient weight corresponding to the phoneme in the first input information.
  • the electronic device may process the energy coefficient weights according to the sequence indicated by the phoneme sequence to obtain an energy coefficient weight sequence.
  • for example, the electronic device can obtain the above energy coefficient weight sequence through formula (4).
  • Step 205f The electronic device generates a fourth driving parameter sequence based on the energy coefficient weight sequence, the intensity parameters of the facial visemes in the virtual image, and the first driving parameter sequence.
  • the facial driving data is related to the fourth driving parameter sequence.
  • the intensity parameter of the facial viseme is used to characterize the emotion information corresponding to the driving parameter sequence.
  • the above-mentioned emotional information includes happiness, sadness, anger, calmness, etc.
  • the electronic device can set different custom intensity parameters for the driving parameter sequences of the different parts.
  • then, the fourth driving parameter sequence is generated by formula (5): V_final = W_E * {w_st1 * Vs_chin, w_st2 * Vs_mouth, w_st3 * Vs_other}.
  • Step 205g The electronic device drives the face of the virtual image based on the fourth driving parameter sequence.
  • after obtaining the fourth driving parameter sequence, the electronic device can input the fourth driving parameter sequence into the 3D engine, and the lip movement of the virtual image can be driven by the fourth driving parameter sequence.
  • the electronic device can reduce the problem of lip jitter by discarding the phonemes that contribute little to lip movement through the importance weight and density weight of the phonemes.
  • a mapping scheme from phonemes to visemes is established, and facial drive data can be generated directly through phonemes.
  • the drive parameter sequence is smoothed through smoothing strategies of different granularities to make the lip movement more natural.
  • the drive parameter sequence can be dynamically adjusted according to the voice information and built-in strategies to achieve different speaking styles.
  • step 202 "the electronic device generates voice-text alignment information based on the first input information" may include step 202a and step 202b:
  • Step 202a The electronic device extracts acoustic feature information corresponding to the first voice information.
  • the first voice information is the input voice information, or the voice information converted from the text information.
  • the above-mentioned conversion of text information into voice information may include: generating a virtual voice corresponding to the text information through a speech synthesis (Text To Speech, TTS) interface.
  • the above-mentioned acoustic feature information is used to represent the pitch, intensity, and timbre of the above-mentioned first voice information.
  • the electronic device may input the above input information into a feature extraction model to extract the acoustic features of the corresponding speech.
  • the feature extraction model may include linear predictive coding and Mel spectrum.
  • Step 202b The electronic device performs voice-text alignment on the first voice information and the text information corresponding to the first voice information based on the acoustic feature information to generate voice-text alignment information.
  • the electronic device may input the above-mentioned acoustic feature information and text information into a statistical model or a deep learning method model for dynamic matching to generate speech-text alignment information.
  • the electronic device can more accurately obtain the content contained in the input information.
  • the phoneme information includes the duration of each phoneme
  • the step 204a of "the electronic device determines the density weight corresponding to the phoneme based on the phoneme and the phoneme information" may include the following steps 204a1 and 204a2:
  • Step 204a1 the electronic device divides the duration corresponding to the first input information into P time periods based on the duration of each phoneme.
  • P is an integer greater than 1.
  • the duration may be from the start time to the end time of each phoneme.
  • the duration corresponding to the above-mentioned input information may be the start time to the end time corresponding to the above-mentioned voice information.
  • the above-mentioned P time periods may be time periods of the same length.
  • Step 204a2 The electronic device determines a density weight corresponding to the phoneme based on the density information of each phoneme contained in each of the P time periods.
  • the above density information is used to indicate the quantity of each phoneme in each time period.
  • for example, the electronic device can calculate the density weight by formula (6), in which T represents the time length corresponding to the above P time periods, t_i represents the i-th phoneme among the above N phonemes, t_max represents the duration of the longest phoneme within the time length T, and P is the number of the above time periods.
  • in this way, the electronic device can discard phonemes that are densely packed but have little effect on the facial visemes based on the calculated density weight, thereby avoiding the problem of lip jitter.
  • the virtual image face driving method provided in the embodiment of the present application can be executed by a virtual image face driving device.
  • the virtual image face driving device executing the virtual image face driving method is taken as an example to illustrate the virtual image face driving device provided in the embodiment of the present application.
  • the virtual image face driving device 400 comprises: an acquisition module 401, a generation module 402, a determination module 403 and an execution module 404, wherein: the acquisition module 401 is used to acquire first input information, the input information including at least one of voice information and text information; the generation module 402 is used to generate speech-text alignment information based on the first input information obtained by the acquisition module 401; the determination module 403 is used to determine the N phonemes corresponding to the first input information based on the speech-text alignment information generated by the generation module 402, the phonemes including phoneme information, and N being an integer greater than 1; the generation module 402 is also used to generate a first driving parameter sequence based on the phonemes determined by the determination module 403, the phoneme information and the mapping relationship between the facial visual elements in the virtual image and the phonemes; the execution module 404 is used to drive the face of the virtual image based on the first driving parameter sequence generated by the generation module 402.
  • the above-mentioned determination module 403 is also used to determine the importance weight and density weight corresponding to the above-mentioned phoneme based on the above-mentioned phoneme and the above-mentioned phoneme information, the above-mentioned importance weight is used to characterize the importance of the above-mentioned phoneme in the facial driving of the above-mentioned virtual image, and the above-mentioned density weight is used to characterize the density of each phoneme in the above-mentioned N phonemes; the above-mentioned generation module 402 is specifically used to generate a first driving parameter sequence based on the above-mentioned importance weight, the above-mentioned density weight, the above-mentioned phoneme, the above-mentioned phoneme information determined by the determination module 403 and the mapping relationship between the facial visual elements in the above-mentioned virtual image and the above-mentioned phonemes.
  • the acquisition module 401 is further used to acquire a phoneme sequence corresponding to the phoneme; the generation module 402 is specifically used to: generate a first phoneme sequence based on the phoneme sequence acquired by the acquisition module 401, the importance weight and the density weight; convert the first phoneme sequence according to the mapping relationship between the phoneme information and the facial visemes in the virtual image and the phonemes to generate the first driving parameter sequence.
  • the execution module 404 is specifically used to perform time domain feature smoothing processing on the driving parameters corresponding to each of the above phonemes in the above first driving parameter sequence generated by the generation module 402 to obtain a smoothed second driving parameter sequence; perform time domain feature smoothing processing on the second driving parameter sequence to obtain a third driving parameter sequence; and drive the face of the above virtual image based on the third driving parameter sequence.
  • the execution module 404 is specifically used to generate an energy coefficient weight corresponding to each of the phonemes based on the short-time energy of the first input information; obtain an energy coefficient weight sequence corresponding to the phonemes based on the phoneme sequence corresponding to the phonemes in the first input information and the energy coefficient weights; generate a fourth driving parameter sequence based on the energy coefficient weight sequence, the intensity parameters of the facial visual elements in the virtual image, and the first driving parameter sequence; and drive the face of the virtual image based on the fourth driving parameter sequence.
  • the virtual image facial driving device 400 further includes: an extraction module, wherein: the extraction module is used to extract acoustic feature information corresponding to the first voice information; the first voice information is the input voice information, or the voice information converted from the text information; the generation module 402 is specifically used to perform voice-to-text alignment on the first voice information and the text information corresponding to the first voice information based on the acoustic feature information extracted by the extraction module, so as to generate the voice-to-text alignment information.
  • the above-mentioned phoneme information includes the duration of each of the above-mentioned phonemes; the above-mentioned determination module 403 is specifically used to divide the duration corresponding to the above-mentioned first input information into P time periods based on the duration of each of the above-mentioned phonemes, where P is an integer greater than 1; based on the density information of each phoneme contained in each of the above-mentioned P time periods, determine the density weight corresponding to the above-mentioned phoneme.
  • the virtual image face driving device can obtain first input information; the first input information includes at least one of voice information and text information; based on the above first input information, generate voice and text alignment information; based on the voice and text alignment information, determine the N phonemes corresponding to the above first input information, the phonemes include phoneme information, and N is an integer greater than 1; based on the above phonemes, the above phoneme information, and the mapping relationship between the facial visual elements in the virtual image and the above phonemes, generate a first driving parameter sequence; based on the first driving parameter sequence, drive the face of the above virtual image.
  • the virtual image face driving device in the embodiment of the present application can be an electronic device or a component in the electronic device, such as an integrated circuit or a chip.
  • the electronic device can be a terminal or other devices other than a terminal.
  • the electronic device can be a mobile phone, a tablet computer, a laptop computer, a palmtop computer, a vehicle-mounted electronic device, a mobile Internet device (Mobile Internet Device, MID), an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, a robot, a wearable device, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook or a personal digital assistant (personal digital assistant, PDA), etc., and can also be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (personal computer, PC), a television (television, TV), a teller machine or a self-service machine, etc.
  • the virtual image face driving device in the embodiment of the present application may be a device having an operating system.
  • the operating system may be an Android operating system, an iOS operating system, or other possible operating systems, which are not specifically limited in the embodiment of the present application.
  • the virtual image face driving device provided in the embodiment of the present application can implement each process implemented by the method embodiment of Figure 1. To avoid repetition, it will not be repeated here.
  • an embodiment of the present application also provides an electronic device 600, including a processor 601 and a memory 602, and the memory 602 stores a program or instruction that can be executed on the processor 601. When the program or instruction is executed by the processor 601, the various steps of the above-mentioned virtual image face driving method embodiment are implemented, and the same technical effect can be achieved. To avoid repetition, it will not be repeated here.
  • the electronic devices in the embodiments of the present application include the mobile electronic devices and non-mobile electronic devices mentioned above.
  • FIG. 4 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of the present application.
  • the electronic device 100 includes but is not limited to components such as a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, and a processor 110.
  • the electronic device 100 can also include a power source (such as a battery) for supplying power to each component, and the power source can be logically connected to the processor 110 through a power management system, so that the power management system can manage charging, discharging, and power consumption management.
  • the electronic device structure shown in FIG4 does not constitute a limitation on the electronic device, and the electronic device can include more or fewer components than shown in the figure, or combine certain components, or arrange components differently, which will not be described in detail here.
  • the above-mentioned processor 110 is used to obtain first input information, which includes at least one of voice information and text information; based on the above-mentioned first input information, generate voice-text alignment information; based on the above-mentioned voice-text alignment information, determine N phonemes corresponding to the above-mentioned first input information, the phonemes include phoneme information, and N is an integer greater than 1; based on the above-mentioned phonemes, the above-mentioned phoneme information and the mapping relationship between the facial visual elements in the virtual image and the above-mentioned phonemes, generate a first driving parameter sequence; based on the first driving parameter sequence, drive the face of the above-mentioned virtual image.
  • the processor 110 is further used to determine the importance weight and density weight corresponding to the phoneme based on the phoneme and the phoneme information, wherein the importance weight is used to characterize the importance of the phoneme in the facial driving of the virtual image, and the density weight is used to characterize the density of each phoneme in the N phonemes; the processor 110 is specifically used to generate a first driving parameter sequence based on the importance weight, the density weight, the phoneme, the phoneme information and the mapping relationship between the facial visual elements in the virtual image and the phonemes.
  • the processor 110 is further used to: obtain a phoneme sequence corresponding to the phoneme; the processor 110 is specifically used to: generate a first phoneme sequence based on the phoneme sequence, the importance weight and the density weight; convert the first phoneme sequence according to the mapping relationship between the phoneme information and the facial visual elements in the virtual image and the phonemes to generate the first driving parameter sequence.
  • the processor 110 is specifically used to perform time domain feature smoothing processing on the driving parameters corresponding to each of the phonemes in the first driving parameter sequence to obtain a smoothed second driving parameter sequence; perform time domain feature smoothing processing on the second driving parameter sequence to obtain a third driving parameter sequence; and drive the face of the virtual image based on the third driving parameter sequence.
  • the processor 110 is specifically used to generate an energy coefficient weight corresponding to each of the phonemes based on the short-time energy of the first input information; obtain an energy coefficient weight sequence corresponding to the phonemes based on a phoneme sequence corresponding to the phonemes in the first input information and the energy coefficient weights; generate a fourth driving parameter sequence based on the energy coefficient weight sequence, the intensity parameters of the facial visual elements in the virtual image, and the first driving parameter sequence; and drive the face of the virtual image based on the fourth driving parameter sequence.
  • the processor 110 is also used to extract acoustic feature information corresponding to the first voice information; the first voice information is the input voice information, or the voice information converted from the text information; the processor 110 is specifically used to perform voice-to-text alignment on the first voice information and the text information corresponding to the first voice information based on the acoustic feature information to generate the voice-to-text alignment information.
  • the phoneme information includes the duration of each phoneme; the processor 110 is specifically configured to divide the duration corresponding to the first input information into P time periods based on the duration of each phoneme, where P is an integer greater than 1, and to determine the density weight corresponding to the phoneme based on the density information of each phoneme contained in each of the P time periods.
  • the electronic device can obtain first input information; the first input information includes at least one of voice information and text information; based on the first input information, generate voice-text alignment information; based on the voice-text alignment information, determine N phonemes corresponding to the first input information, the phonemes include phoneme information, and N is an integer greater than 1; based on the phonemes, the phoneme information, and the mapping relationship between the facial visual elements in the virtual image and the phonemes, generate a first driving parameter sequence; based on the first driving parameter sequence, drive the face of the virtual image.
  • the input unit 104 may include a graphics processing unit (GPU) 1041 and a microphone 1042, and the graphics processor 1041 processes the image data of a static picture or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode.
  • the display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display, an organic light emitting diode, etc.
  • the user input unit 107 includes a touch panel 1071 and at least one of other input devices 1072.
  • the touch panel 1071 is also called a touch screen.
  • the touch panel 1071 may include two parts: a touch detection device and a touch controller.
  • Other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (such as a volume control key, a switch key, etc.), a trackball, a mouse, and a joystick, which will not be repeated here.
  • the memory 109 can be used to store software programs and various data.
  • the memory 109 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, wherein the first storage area may store an operating system, an application program or instructions required for at least one function (such as a sound playback function, an image playback function, etc.), etc.
  • the memory 109 may include a volatile memory or a non-volatile memory, or the memory 109 may include both volatile and non-volatile memories.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (Random Access Memory, RAM).
  • the memory 109 in the embodiment of the present application includes but is not limited to these and any other suitable types of memory.
  • the processor 110 may include one or more processing units; optionally, the processor 110 integrates an application processor and a modem processor, wherein the application processor mainly processes operations related to an operating system, a user interface, and application programs, and the modem processor mainly processes wireless communication signals, such as a baseband processor. It is understandable that the modem processor may not be integrated into the processor 110.
  • An embodiment of the present application also provides a readable storage medium, on which a program or instruction is stored. When the program or instruction is executed by a processor, the various processes of the above-mentioned virtual image face driving method embodiment are implemented, and the same technical effect can be achieved. To avoid repetition, it will not be repeated here.
  • the processor is the processor in the electronic device described in the above embodiment.
  • the readable storage medium includes a computer readable storage medium, such as a computer read-only memory ROM, a random access memory RAM, a magnetic disk or an optical disk.
  • An embodiment of the present application further provides a chip, which includes a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the various processes of the above-mentioned virtual image facial driving method embodiment, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.
  • the chip mentioned in the embodiments of the present application can also be called a system-level chip, a system chip, a chip system or a system-on-chip chip, etc.
  • An embodiment of the present application provides a computer program product, which is stored in a storage medium.
  • the program product is executed by at least one processor to implement the various processes of the above-mentioned virtual image face driving method embodiment, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.
  • the technical solution of the present application can be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk), and includes a number of instructions for a terminal (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A virtual image face driving method and apparatus, an electronic device, a medium and a computer program product, belonging to the field of artificial intelligence technology. The virtual image face driving method includes: obtaining first input information, the input information including at least one of voice information and text information (201); generating voice-text alignment information based on the first input information (202); determining, based on the voice-text alignment information, N phonemes corresponding to the first input information, the phonemes including phoneme information, N being an integer greater than 1 (203); generating a first driving parameter sequence based on the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes (204); and driving the face of the virtual image based on the first driving parameter sequence (205).

Description

Virtual image face driving method and apparatus, electronic device and medium
Cross-reference to related application
This application claims priority to Chinese Patent Application No. 202211325775.7 filed in China on October 27, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the field of artificial intelligence technology, and specifically relates to a virtual image face driving method and apparatus, an electronic device and a medium.
Background
With the development of artificial intelligence technology and big data technology, the range of applications of virtual images is becoming wider and wider. For example, a virtual image can be built and its facial expressions can be driven to simulate human speech.
In the related art, when the facial expressions of a virtual image are driven, each character corresponding to a voice segment is aligned one by one with the lip movements corresponding to the facial data to generate lip shape driving data corresponding to each character, so as to drive the lip shape of the virtual image to change.
However, since the above scheme merely aligns characters with lip movements, the generated lip shape driving data is not precise, so that the lip shape changes finally presented are uncoordinated, which in turn leads to a poor final synchronization effect.
Summary of the Invention
The purpose of the embodiments of this application is to provide a virtual image face driving method and apparatus, an electronic device and a medium, which can solve the problem of a poor synchronization effect caused by uncoordinated lip shape changes.
In a first aspect, an embodiment of this application provides a virtual image face driving method, the method including: obtaining first input information, the first input information including at least one of voice information and text information; generating voice-text alignment information based on the first input information; determining, based on the voice-text alignment information, N phonemes corresponding to the first input information, the phonemes including phoneme information, N being an integer greater than 1; generating a first driving parameter sequence based on the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes; and driving the face of the virtual image based on the first driving parameter sequence.
In a second aspect, an embodiment of this application provides a virtual image face driving apparatus, the apparatus including: an acquisition module, a generation module, a determination module and an execution module, wherein: the acquisition module is used to obtain first input information, the first input information including at least one of voice information and text information; the generation module is used to generate voice-text alignment information based on the first input information obtained by the acquisition module; the determination module is used to determine, based on the voice-text alignment information generated by the generation module, N phonemes corresponding to the first input information, the phonemes including phoneme information, N being an integer greater than 1; the generation module is further used to generate a first driving parameter sequence based on the phonemes determined by the determination module, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes; and the execution module is used to drive the face of the virtual image based on the first driving parameter sequence generated by the generation module.
In a third aspect, an embodiment of this application provides an electronic device, the electronic device including a processor and a memory, the memory storing a program or instructions that can be run on the processor, and when the program or instructions are executed by the processor, the steps of the method described in the first aspect are implemented.
In a fourth aspect, an embodiment of this application provides a readable storage medium, on which a program or instructions are stored, and when the program or instructions are executed by a processor, the steps of the method described in the first aspect are implemented.
In a fifth aspect, an embodiment of this application provides a chip, the chip including a processor and a communication interface, the communication interface being coupled to the processor, and the processor being used to run a program or instructions to implement the method described in the first aspect.
In a sixth aspect, an embodiment of this application provides a computer program product, the program product being stored in a storage medium and executed by at least one processor to implement the method described in the first aspect.
In the embodiments of this application, an electronic device can obtain first input information; the first input information includes at least one of voice information and text information; generate voice-text alignment information based on the first input information; determine, based on the voice-text alignment information, N phonemes corresponding to the first input information, the phonemes including phoneme information, N being an integer greater than 1; generate a first driving parameter sequence based on the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes; and drive the face of the virtual image based on the first driving parameter sequence. In this way, since the phoneme information of the N phonemes can accurately express the facial mouth shapes of the virtual image corresponding to the first input information, a more precise first driving parameter sequence can be generated to drive the face of the virtual image, thereby avoiding uncoordinated facial mouth movements of the presented virtual image and improving the final synchronization effect.
Brief Description of the Drawings
FIG. 1 is a schematic flow chart of a virtual image face driving method provided in an embodiment of this application;
FIG. 2 is a schematic structural diagram of a virtual image face driving apparatus provided in an embodiment of this application;
FIG. 3 is a schematic structural diagram of an electronic device provided in an embodiment of this application;
FIG. 4 is a schematic hardware diagram of an electronic device provided in an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are some, rather than all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application fall within the scope of protection of this application.
The terms "first", "second" and the like in the specification and claims of this application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable under appropriate circumstances, so that the embodiments of this application can be implemented in an order other than those illustrated or described here; the objects distinguished by "first", "second" and the like are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. In addition, "and/or" in the specification and claims represents at least one of the connected objects, and the character "/" generally indicates that the objects associated before and after it are in an "or" relationship.
The virtual image face driving method and apparatus, electronic device and medium provided in the embodiments of this application are described in detail below through specific embodiments and application scenarios with reference to the drawings.
In the related art, when the electronic device generates the facial driving data used to drive a virtual image, it first generates voice-text alignment information containing only text information based on the input text or voice information, then obtains the lip movements corresponding to the text information, and finally generates the lip shape driving data used to drive the virtual image. With this scheme, since text information cannot precisely express the lip movements corresponding to the voice segment, the lip shape driving data finally generated is not precise and lip jitter occurs.
In the virtual image face driving method and apparatus, electronic device and medium provided in the embodiments of this application, the electronic device can obtain first input information; the first input information includes at least one of voice information and text information; generate voice-text alignment information based on the first input information; determine, based on the voice-text alignment information, N phonemes corresponding to the first input information, the phonemes including phoneme information, N being an integer greater than 1; generate a first driving parameter sequence based on the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes; and drive the face of the virtual image based on the first driving parameter sequence. In this way, since the phoneme information of the N phonemes can accurately express the facial mouth shapes of the virtual image corresponding to the first input information, a more precise first driving parameter sequence can be generated to drive the face of the virtual image, thereby avoiding uncoordinated facial mouth movements of the presented virtual image and improving the final synchronization effect.
The execution subject of the virtual image face driving method provided in this embodiment may be a virtual image face driving apparatus, and the virtual image face driving apparatus may be an electronic device, or a control module or processing module in the electronic device. The technical solutions provided in the embodiments of this application are described below by taking an electronic device as an example.
An embodiment of this application provides a virtual image face driving method. As shown in FIG. 1, the virtual image face driving method may include the following steps 201 to 205:
Step 201: The electronic device obtains first input information.
In the embodiment of this application, the first input information includes at least one of voice information and text information.
In the embodiment of this application, the first input information is used to indicate the content to be expressed by the virtual image.
In the embodiment of this application, the virtual image may include a virtual character image generated by the electronic device.
Step 202: The electronic device generates voice-text alignment information based on the first input information.
In the embodiment of this application, the electronic device may align the voice information with the text information corresponding to the voice information to generate the voice-text alignment information.
In the embodiment of this application, the voice-text alignment information is used to indicate the start time and end time of each character in the text information.
Step 203: The electronic device determines, based on the voice-text alignment information, N phonemes corresponding to the first input information.
In the embodiment of this application, the phonemes include phoneme information.
Here, N is an integer greater than 1.
In the embodiment of this application, the phoneme information may be the pinyin information corresponding to the characters in the text information.
For example, the pinyin information can be divided into initials and finals.
It should be noted that the finals may include simple finals, complex finals, front nasal finals and back nasal finals.
In the embodiment of this application, the electronic device can divide the N phonemes, according to pinyin type, into simple finals, complex finals, front nasal finals, back nasal finals, syllables read as a whole, and three-part syllables. Then, the three-part syllables and the syllables read as a whole are split into combinations of the first four kinds of finals to generate corresponding phoneme groups, as sketched in the example below.
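The grouping of pinyin syllables into these phoneme groups can be illustrated with a minimal Python sketch. This sketch is not part of the patent; the initial and final tables are abbreviated and would need to be completed for real use.

```python
# Minimal sketch (not from the patent): split a pinyin syllable into initial + final and
# classify the final into the groups used above. Tables are abbreviated for illustration.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
SIMPLE = {"a", "o", "e", "i", "u", "v"}
FRONT_NASAL = {"an", "en", "in", "un", "vn", "ian", "uan"}
BACK_NASAL = {"ang", "eng", "ing", "ong", "iang", "uang", "iong"}

def split_syllable(pinyin: str):
    """Split one pinyin syllable into (initial, final); the initial may be empty."""
    for ini in INITIALS:
        if pinyin.startswith(ini) and len(pinyin) > len(ini):
            return ini, pinyin[len(ini):]
    return "", pinyin

def final_group(final: str) -> str:
    """Classify a final as simple, complex, front nasal or back nasal."""
    if final in FRONT_NASAL:
        return "front_nasal"
    if final in BACK_NASAL:
        return "back_nasal"
    if final in SIMPLE:
        return "simple"
    return "complex"  # e.g. "ai", "ao", "iao", "uai", ...

for syllable in ["zhang", "ni", "hao", "men"]:
    initial, final = split_syllable(syllable)
    print(syllable, "->", initial, final, final_group(final))
```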
Step 204: The electronic device generates a first driving parameter sequence based on the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes.
In the embodiment of this application, a facial viseme may be a part or muscle of the face of the virtual image.
For example, the facial visemes may include chin part visemes, lip part visemes and other part visemes.
It should be noted that the chin visemes and mouth visemes are used to determine lip movements, and the other visemes are used to determine facial expression movements of the eyes, nose, eyebrows and so on.
For example, the chin part visemes may include: front jaw, right lower jaw, left lower jaw, and lower jaw.
For example, the mouth part visemes may include: mouth closed, mouth twisted, mouth twitching, right side of mouth, left side of mouth, left side of mouth smiling, right side of mouth smiling, mouth wrinkled to the left, mouth wrinkled to the right, dimple on the left side of mouth, dimple on the right side of mouth, mouth stretched to the left, mouth stretched to the right, mouth curled down, mouth curled up, lower lip raised, upper lip raised, pressing the left side of mouth, pressing the right side of mouth, lower left mouth, lower right mouth, upper left mouth, upper right mouth.
For example, the other part visemes may include: left eye blinking, left eye looking downward, left eye looking inward, left eye looking outward, left eye looking upward, left eye squinting, left eye wide open, right eye blinking, right eye looking downward, right eye looking inward, right eye looking outward, right eye looking upward, right eye squinting, right eye wide open, left eyebrow downward, right eyebrow downward, inner eyebrow upward, outer left eyebrow upward, outer right eyebrow upward, cheek puff, cheek slanted to the left, cheek slanted to the right, nose moving to the left, nose moving to the right, and sticking out tongue.
In the embodiment of this application, the above mapping relationship may be pre-stored in the electronic device or obtained from the network side.
The following is an exemplary description of how the above mapping relationship may be generated:
For example, the electronic device can first record videos of a real person for each phoneme, count the real-person facial viseme movements corresponding to each phoneme, and record the corresponding driving parameters so that the virtual image is consistent with the facial movement in the real-person video; then, based on the driving parameters corresponding to each phoneme, the electronic device establishes a one-to-one correspondence between phonemes and visemes, that is, the above mapping relationship.
For example, the above mapping relationship may be mapping values from phonemes to visemes. For example, the mapping value of the front jaw is: 0.11426107876499998; the mapping value of the lower jaw is: 0.45334974318700005, etc. A minimal sketch of such a mapping table is given below.
Step 205: The electronic device drives the face of the virtual image based on the first driving parameter sequence.
In the embodiment of this application, after obtaining the first driving parameter sequence, the electronic device can input the first driving parameter sequence into a driving engine, and the lip movement of the virtual image's face can be driven by the first driving parameter sequence.
For example, the driving engine may be a three-dimensional (3 Dimensional, 3D) engine.
In the virtual image face driving method provided in the embodiments of this application, the electronic device can obtain first input information; the first input information includes at least one of voice information and text information; generate voice-text alignment information based on the first input information; determine, based on the voice-text alignment information, N phonemes corresponding to the first input information, the phonemes including phoneme information, N being an integer greater than 1; generate a first driving parameter sequence based on the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes; and drive the face of the virtual image based on the first driving parameter sequence. In this way, since the phoneme information of the N phonemes can accurately express the facial mouth shapes of the virtual image corresponding to the first input information, a more precise first driving parameter sequence can be generated to drive the face of the virtual image, thereby avoiding uncoordinated facial mouth movements of the presented virtual image and improving the final synchronization effect.
Optionally, in the embodiment of this application, "the electronic device generates a first driving parameter sequence based on the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes" in step 204 may include the following steps 204a and 204b:
Step 204a: The electronic device determines, based on the phonemes and the phoneme information, the importance weight and density weight corresponding to the phonemes.
In the embodiment of this application, the importance weight is used to characterize the importance of the phoneme in the facial driving of the virtual image.
In the embodiment of this application, the density weight is used to characterize the density of each phoneme among the N phonemes.
In the embodiment of this application, the electronic device can set a corresponding importance weight for each phoneme group based on the above phoneme groups. For example, for the importance weight, the weights of the initial, the simple final, the complex final, the front nasal final and the back nasal final are set to (1.0, 0.9, 0.6, 0.5, 0.5) respectively.
Step 204b: The electronic device generates the first driving parameter sequence based on the importance weight, the density weight, the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes.
In this way, by using the importance weights and density weights of the phonemes to generate the first driving parameter sequence, phonemes that are densely packed and of low importance can be discarded, which avoids jitter in the motion of the virtual image driven by the generated first driving parameter sequence.
Optionally, in the embodiment of this application, "the electronic device generates the first driving parameter sequence based on the importance weight, the density weight, the phonemes, the phoneme information, and the mapping relationship between the facial visemes in the virtual image and the phonemes" in step 204b may include the following steps 204b1 to 204b3:
Step 204b1: The electronic device obtains a phoneme sequence corresponding to the phonemes.
In the embodiment of this application, the phoneme sequence is used to indicate the order of the N phonemes.
In the embodiment of this application, the electronic device can arrange the N phonemes generated above according to the word order of the input information to obtain the phoneme sequence.
Step 204b2: The electronic device generates a first phoneme sequence based on the phoneme sequence, the importance weight and the density weight.
In the embodiment of this application, the electronic device can discard phonemes with high density and low importance based on the phoneme sequence, the importance weight and the density weight, and generate a new phoneme sequence, that is, the first phoneme sequence.
Step 204b3: The electronic device converts the first phoneme sequence according to the phoneme information and the mapping relationship between the facial visemes in the virtual image and the phonemes to generate the first driving parameter sequence.
For example, the electronic device can calculate the first driving parameter sequence by formula (1), as follows:
v_i = min(S(p_i) * w_1i * w_2i, 1.0)   Formula (1)
Here, w_1i is the above importance weight, w_2i is the above density weight, and S is the above mapping relationship.
In this way, by converting the phoneme sequence into a viseme parameter sequence with temporal features, the electronic device can drive the virtual image based on the viseme parameter sequence, further improving the precision of driving the virtual image. A sketch of steps 204a to 204b3 is given after this paragraph.
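The following Python sketch illustrates steps 204a to 204b3 under stated assumptions: the importance weights follow the example values given above, the density weights are supplied by the caller (see the density-weight sketch further below), formula (1) is applied per phoneme and per viseme, and the pruning thresholds are illustrative only, since the text does not fix them.

```python
# Sketch of steps 204a-204b3 (assumptions noted above). mapping is the phoneme-to-viseme
# table S; groups holds the phoneme-group label of each phoneme in the ordered sequence.
IMPORTANCE = {"initial": 1.0, "simple": 0.9, "complex": 0.6,
              "front_nasal": 0.5, "back_nasal": 0.5}

def first_driving_sequence(phonemes, groups, density_weights, mapping,
                           density_thresh=0.4, importance_thresh=0.6):
    sequence = []
    for p, g, w2 in zip(phonemes, groups, density_weights):
        w1 = IMPORTANCE[g]
        # Drop phonemes that sit in a crowded stretch (low density weight) and are unimportant;
        # the concrete thresholds are an assumption, not taken from the patent.
        if w2 < density_thresh and w1 < importance_thresh:
            continue
        # Formula (1): v_i = min(S(p_i) * w_1i * w_2i, 1.0), applied to every viseme of the phoneme.
        frame = {viseme: min(value * w1 * w2, 1.0)
                 for viseme, value in mapping.get(p, {}).items()}
        sequence.append(frame)
    return sequence
```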
Optionally, in the embodiment of this application, "the electronic device drives the face of the virtual image based on the first driving parameter sequence" in step 205 may include the following steps 205a to 205c:
Step 205a: The electronic device performs time-domain feature smoothing on the driving parameters corresponding to each phoneme in the first driving parameter sequence, respectively, to obtain a processed second driving parameter sequence.
In the embodiment of this application, after obtaining the viseme parameter sequence, the electronic device can smooth the viseme parameters of the different parts separately.
For example, the smoothing may use the Savitzky-Golay (SG) convolution smoothing algorithm.
For example, the electronic device can take each character in the text information as a unit and smooth the driving parameters corresponding to the phonemes of each character, that is, apply the SG algorithm to the driving parameters corresponding to the phonemes of each character, so that the facial visemes corresponding to each character are more natural; finally, the second driving parameter sequence is obtained.
Step 205b: The electronic device performs time-domain feature smoothing on the second driving parameter sequence to obtain a third driving parameter sequence.
In the embodiment of this application, the facial driving data is related to the third driving parameter sequence.
For example, after obtaining the second driving parameter sequence, the electronic device can apply the SG algorithm to the second driving parameter sequence as a whole, so that the facial visemes corresponding to the entire input information are more natural, and the third driving parameter sequence is obtained.
For example, taking the smoothing of the driving parameters corresponding to the chin as an example, the chin driving parameter sequence is obtained by formula (2), as follows:
Vs_chin = SG((SG(v_1), ..., SG(v_m)))   Formula (2)
Here, m represents the number of characters in the above input information.
Further, the electronic device can combine the driving parameter sequences corresponding to the different parts by formula (3) to generate the final third driving parameter sequence, as follows:
Vs = {Vs_chin, Vs_mouth, Vs_other}   Formula (3)
Step 205c: The electronic device drives the face of the virtual image based on the third driving parameter sequence.
In the embodiment of this application, after obtaining the third driving parameter sequence, the electronic device can input the third driving parameter sequence into the 3D engine, and the lip movement of the virtual image's face can be driven by the third driving parameter sequence.
In this way, the driving parameters corresponding to each phoneme are smoothed first, and then the overall driving parameter sequence is smoothed, so that the generated driving parameter sequence is more refined, avoiding the unnaturalness and jitter of the virtual image caused by driving parameter jumps in the transition stages between different phonemes. A sketch of this two-level smoothing is given below.
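A minimal sketch of the two-level Savitzky-Golay smoothing follows, using scipy. The window sizes and polynomial order are assumptions; the text only specifies that SG smoothing is applied first within each character and then to the sequence as a whole.

```python
# Sketch of steps 205a-205b with scipy's Savitzky-Golay filter (window/order are assumptions).
import numpy as np
from scipy.signal import savgol_filter

def smooth_per_character(frames, char_spans, window=5, order=2):
    """frames: (T, V) array of driving parameters; char_spans: (start, end) frame indices per character."""
    out = np.asarray(frames, dtype=float).copy()
    for start, end in char_spans:
        segment = out[start:end]
        if len(segment) >= window:                     # SG needs at least `window` samples
            out[start:end] = savgol_filter(segment, window, order, axis=0)
    return out                                         # the second driving parameter sequence

def smooth_whole_sequence(frames, window=9, order=2):
    """Apply SG smoothing over the whole sequence (step 205b) and clamp to the valid range."""
    frames = np.asarray(frames, dtype=float)
    if len(frames) >= window:
        frames = savgol_filter(frames, window, order, axis=0)
    return np.clip(frames, 0.0, 1.0)                   # the third driving parameter sequence
```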
Optionally, in the embodiment of this application, "the electronic device drives the face of the virtual image based on the first driving parameter sequence" in step 205 may include the following steps 205d to 205g:
Step 205d: The electronic device generates an energy coefficient weight corresponding to each phoneme based on the short-time energy of the first input information.
In the embodiment of this application, the short-time energy includes the unvoiced part and the voiced part of the voice information.
It should be noted that the energy corresponding to the voiced part is greater than the energy corresponding to the unvoiced part.
In the embodiment of this application, the energy coefficient weight is used to characterize the proportion of the unvoiced part and the voiced part in the voice information. In other words, the greater the energy coefficient weight, the greater the volume of the corresponding voice information.
Step 205e: The electronic device obtains an energy coefficient weight sequence corresponding to the phonemes based on the phoneme sequence corresponding to the phonemes in the first input information and the energy coefficient weights.
In the embodiment of this application, the electronic device can process the energy coefficient weights according to the order indicated by the phoneme sequence to obtain the energy coefficient weight sequence.
For example, the electronic device can obtain the above energy coefficient weight sequence through formula (4).
Step 205f: The electronic device generates a fourth driving parameter sequence based on the energy coefficient weight sequence, the intensity parameters of the facial visemes in the virtual image, and the first driving parameter sequence.
In the embodiment of this application, the facial driving data is related to the fourth driving parameter sequence.
In the embodiment of this application, the intensity parameters of the facial visemes are used to characterize the emotion information corresponding to the driving parameter sequence.
For example, the emotion information includes happiness, sadness, anger, calmness, etc.
For example, the electronic device can set different custom intensity parameters for the driving parameter sequences of the different parts. Then, the fourth driving parameter sequence is generated by formula (5), as follows:
V_final = W_E * {w_st1 * Vs_chin, w_st2 * Vs_mouth, w_st3 * Vs_other}   Formula (5)
Step 205g: The electronic device drives the face of the virtual image based on the fourth driving parameter sequence.
In the embodiment of this application, after obtaining the fourth driving parameter sequence, the electronic device can input the fourth driving parameter sequence into the 3D engine, and the lip movement of the virtual image's face can be driven by the fourth driving parameter sequence.
In the embodiments of this application, the electronic device can discard phonemes that contribute little to lip movement through the importance weights and density weights of the phonemes, so as to reduce the problem of lip jitter; a mapping scheme from phonemes to visemes is established, so that facial driving data can be generated directly from phonemes; the driving parameter sequence is then smoothed by smoothing strategies of different granularities to make the lip movement more natural; finally, the driving parameter sequence can also be dynamically adjusted according to the voice information and built-in strategies to achieve different speaking styles.
In this way, by adding parameters that characterize the volume of the voice information and the emotion of the virtual image to the first driving parameter sequence, the final driving effect of the virtual image is more natural. A sketch of steps 205d to 205g is given below.
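Formula (4) is not reproduced in the text, so in the sketch below the energy coefficient weights are simply the per-phoneme short-time energy normalised by its maximum; treat that, the default intensity parameters, and the array layout as assumptions rather than the patent's exact formulation.

```python
# Sketch of steps 205d-205g: energy coefficient weights from short-time energy, then
# V_final = W_E * {w_st1*Vs_chin, w_st2*Vs_mouth, w_st3*Vs_other} (formula (5)).
import numpy as np

def energy_coefficient_weights(samples, phoneme_spans):
    """samples: waveform array; phoneme_spans: (start_sample, end_sample) per phoneme.
    Normalised short-time energy stands in for formula (4), which is not given in the text."""
    energies = np.array([np.sum(np.square(samples[s:e])) for s, e in phoneme_spans])
    return energies / (energies.max() + 1e-8)          # louder (voiced) phonemes get larger weights

def fourth_driving_sequence(weights, vs_chin, vs_mouth, vs_other, intensity=(1.0, 1.0, 1.0)):
    """Combine the smoothed per-part sequences (one row per phoneme) with the energy weight
    sequence W_E and per-part intensity parameters w_st (defaults are an assumption)."""
    w_e = np.asarray(weights)[:, None]                 # broadcast over the viseme dimension
    return np.concatenate([w_e * intensity[0] * vs_chin,
                           w_e * intensity[1] * vs_mouth,
                           w_e * intensity[2] * vs_other], axis=1)
```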
Optionally, in the embodiment of the present application, in the above step 202, "the electronic device generates the voice-text alignment information based on the first input information" may include steps 202a and 202b:
Step 202a: the electronic device extracts acoustic feature information corresponding to first voice information.
In the embodiment of the present application, the above first voice information is the input voice information, or voice information converted from the above text information.
In the embodiment of the present application, converting the text information into voice information may include: passing the text information through a Text To Speech (TTS) interface to generate virtual voice corresponding to the text information.
In the embodiment of the present application, the above acoustic feature information is used to represent the pitch, intensity and timbre of the above first voice information.
In the embodiment of the present application, the electronic device may input the above input information into a feature extraction model to extract the acoustic features of the corresponding voice.
Exemplarily, the above feature extraction model may include linear predictive coding and a mel spectrogram.
Step 202b: the electronic device performs voice-text alignment on the first voice information and the text information corresponding to the first voice information based on the acoustic feature information, to generate the voice-text alignment information.
In the embodiment of the present application, the electronic device may input the above acoustic feature information and text information into a statistical model or a deep learning model for dynamic matching, to generate the voice-text alignment information.
In this way, by extracting the acoustic feature information from the voice information and aligning the voice information with the corresponding text information, the electronic device can obtain the content contained in the input information more accurately.
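As one possible illustration of step 202a, the sketch below extracts a mel spectrogram with librosa (one of the feature types mentioned above); the waveform, sampling rate and frame parameters are toy assumptions, and the alignment model itself is not implemented here.

```python
import numpy as np
import librosa

# Toy waveform standing in for the first voice information (1 s of noise at 16 kHz).
sr = 16000
speech = np.random.default_rng(0).normal(0, 0.1, sr).astype(np.float32)

# Acoustic features: a mel spectrogram, one of the feature types mentioned above.
mel = librosa.feature.melspectrogram(y=speech, sr=sr, n_fft=400, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)

# These frame-level features would then be fed, together with the text, into a
# statistical or deep-learning alignment model (e.g. a forced aligner) to obtain
# phoneme/character time stamps; that model is outside the scope of this sketch.
print(log_mel.shape)  # (n_mels, n_frames)
```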
Optionally, in the embodiment of the present application, the above phoneme information includes the duration of each phoneme, and in the above step 204a, "the electronic device determines the density weight corresponding to the phonemes based on the phonemes and the phoneme information" may include the following steps 204a1 and 204a2:
Step 204a1: the electronic device divides the duration corresponding to the first input information into P time segments based on the duration of each phoneme.
Here, P is an integer greater than 1.
In the embodiment of the present application, the above duration of each phoneme may run from the start time to the end time of that phoneme.
In the embodiment of the present application, the duration corresponding to the above input information may run from the start time to the end time corresponding to the above voice information.
In the embodiment of the present application, the above P time segments may be time segments of equal length.
Step 204a2: the electronic device determines the density weight corresponding to the phonemes based on the density information of each phoneme contained in each of the P time segments.
In the embodiment of the present application, the above density information is used to represent the number of phonemes in each time segment.
Exemplarily, the electronic device may calculate the above density weight through formula (6),
where T denotes the time length corresponding to the above P time segments, $t_i$ denotes the duration of the i-th phoneme among the above N phonemes, $t_{max}$ denotes the duration of the longest phoneme within the time length T, and P is the above number of time segments.
In this way, based on the calculated density weights, the electronic device can discard phonemes that are highly dense but contribute little to the facial visemes, which avoids the lip jitter problem.
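Since formula (6) is not reproduced above, the sketch below uses a simple stand-in: the input duration is split into P equal segments and phonemes in more crowded segments receive a smaller density weight. The start times, the segment count and the 1/count weighting are illustrative assumptions; the disclosed formula additionally involves the phoneme durations $t_i$ and $t_{max}$.

```python
import numpy as np

# Toy phoneme start times (seconds); the overall input lasts 2.0 s.
starts = np.array([0.00, 0.10, 0.15, 0.20, 0.90, 1.40])
total_duration, P = 2.0, 4                      # split the input into P equal segments
segment_len = total_duration / P

# Count phonemes per segment, then give phonemes in crowded segments a smaller weight.
segment_idx = np.minimum((starts // segment_len).astype(int), P - 1)
counts = np.bincount(segment_idx, minlength=P)
density_weights = 1.0 / counts[segment_idx]
print(density_weights)  # phonemes in the first (crowded) segment get weight 0.25
```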
The virtual image face driving method provided in the embodiments of the present application may be executed by a virtual image face driving apparatus. In the embodiments of the present application, the virtual image face driving apparatus provided in the embodiments of the present application is described by taking the case where the virtual image face driving apparatus executes the virtual image face driving method as an example.
An embodiment of the present application provides a virtual image face driving apparatus. As shown in FIG. 2, the virtual image face driving apparatus 400 includes: an obtaining module 401, a generating module 402, a determining module 403 and an executing module 404, wherein: the obtaining module 401 is configured to obtain first input information, the first input information including at least one of voice information and text information; the generating module 402 is configured to generate voice-text alignment information based on the first input information obtained by the obtaining module 401; the determining module 403 is configured to determine, based on the voice-text alignment information generated by the generating module 402, N phonemes corresponding to the first input information, the phonemes including phoneme information, N being an integer greater than 1; the generating module 402 is further configured to generate a first driving parameter sequence based on the phonemes and the phoneme information determined by the determining module 403 and the mapping relationship between the facial visemes in the virtual image and the phonemes; and the executing module 404 is configured to drive the face of the virtual image based on the first driving parameter sequence generated by the generating module 402.
Optionally, in the embodiment of the present application, the determining module 403 is further configured to determine, based on the phonemes and the phoneme information, an importance weight and a density weight corresponding to the phonemes, the importance weight being used to characterize the importance of the phoneme in driving the face of the virtual image, and the density weight being used to characterize the density of each phoneme among the N phonemes; and the generating module 402 is specifically configured to generate the first driving parameter sequence based on the importance weight, the density weight, the phonemes and the phoneme information determined by the determining module 403 and the mapping relationship between the facial visemes in the virtual image and the phonemes.
Optionally, in the embodiment of the present application, the obtaining module 401 is further configured to obtain the phoneme sequence corresponding to the phonemes; and the generating module 402 is specifically configured to: generate a first phoneme sequence based on the phoneme sequence obtained by the obtaining module 401, the importance weight and the density weight; and convert the first phoneme sequence according to the phoneme information and the mapping relationship between the facial visemes in the virtual image and the phonemes, to generate the first driving parameter sequence.
Optionally, in the embodiment of the present application, the executing module 404 is specifically configured to perform temporal feature smoothing on the driving parameters corresponding to each phoneme in the first driving parameter sequence generated by the generating module 402, respectively, to obtain a smoothed second driving parameter sequence; perform temporal feature smoothing on the second driving parameter sequence to obtain a third driving parameter sequence; and drive the face of the virtual image based on the third driving parameter sequence.
Optionally, in the embodiment of the present application, the executing module 404 is specifically configured to generate an energy coefficient weight corresponding to each phoneme based on the short-time energy of the first input information; obtain an energy coefficient weight sequence corresponding to the phonemes based on the phoneme sequence corresponding to the phonemes in the first input information and the energy coefficient weights; generate a fourth driving parameter sequence based on the energy coefficient weight sequence, the intensity parameters of the facial visemes in the virtual image and the first driving parameter sequence; and drive the face of the virtual image based on the fourth driving parameter sequence.
Optionally, in the embodiment of the present application, the virtual image face driving apparatus 400 further includes: an extracting module, wherein: the extracting module is configured to extract acoustic feature information corresponding to first voice information, the first voice information being the input voice information, or voice information converted from the text information; and the generating module 402 is specifically configured to perform, based on the acoustic feature information extracted by the extracting module, voice-text alignment on the first voice information and the text information corresponding to the first voice information, to generate the voice-text alignment information.
Optionally, in the embodiment of the present application, the phoneme information includes the duration of each phoneme; and the determining module 403 is specifically configured to divide the duration corresponding to the first input information into P time segments based on the duration of each phoneme, P being an integer greater than 1, and determine the density weight corresponding to the phonemes based on the density information of each phoneme contained in each of the P time segments.
In the virtual image face driving apparatus provided in the embodiment of the present application, the virtual image face driving apparatus may obtain first input information, the first input information including at least one of voice information and text information; generate voice-text alignment information based on the first input information; determine, based on the voice-text alignment information, N phonemes corresponding to the first input information, the phonemes including phoneme information, N being an integer greater than 1; generate a first driving parameter sequence based on the phonemes, the phoneme information and the mapping relationship between the facial visemes in the virtual image and the phonemes; and drive the face of the virtual image based on the first driving parameter sequence. In this way, since the phoneme information of the N phonemes can accurately express the facial lip shapes of the virtual image corresponding to the first input information, a more accurate first driving parameter sequence can be generated to drive the face of the virtual image. This avoids uncoordinated lip movements of the presented virtual image and improves the final synchronization effect.
The virtual image face driving apparatus in the embodiments of the present application may be an electronic device, or may be a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. Exemplarily, the electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a mobile internet device (Mobile Internet Device, MID), an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, a robot, a wearable device, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook or a personal digital assistant (personal digital assistant, PDA), or may be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (personal computer, PC), a television (television, TV), a teller machine or a self-service machine, which is not specifically limited in the embodiments of the present application.
The virtual image face driving apparatus in the embodiments of the present application may be an apparatus having an operating system. The operating system may be an Android (Android) operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.
The virtual image face driving apparatus provided in the embodiments of the present application can implement the processes implemented by the method embodiment of FIG. 1, and to avoid repetition, details are not described here again.
Optionally, as shown in FIG. 3, an embodiment of the present application further provides an electronic device 600, including a processor 601 and a memory 602, the memory 602 storing a program or instructions executable on the processor 601. When the program or instructions are executed by the processor 601, the steps of the above virtual image face driving method embodiment are implemented, and the same technical effects can be achieved; to avoid repetition, details are not described here again.
It should be noted that the electronic device in the embodiments of the present application includes the mobile electronic device and the non-mobile electronic device described above.
FIG. 4 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 100 includes but is not limited to: a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, a processor 110 and other components.
Those skilled in the art can understand that the electronic device 100 may further include a power supply (such as a battery) that supplies power to the components. The power supply may be logically connected to the processor 110 through a power management system, so that functions such as charge management, discharge management and power consumption management are implemented through the power management system. The electronic device structure shown in FIG. 4 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or have a different component arrangement, which is not described here again.
The processor 110 is configured to obtain first input information, the first input information including at least one of voice information and text information; generate voice-text alignment information based on the first input information; determine, based on the voice-text alignment information, N phonemes corresponding to the first input information, the phonemes including phoneme information, N being an integer greater than 1; generate a first driving parameter sequence based on the phonemes, the phoneme information and the mapping relationship between the facial visemes in the virtual image and the phonemes; and drive the face of the virtual image based on the first driving parameter sequence.
Optionally, in the embodiment of the present application, the processor 110 is further configured to determine, based on the phonemes and the phoneme information, an importance weight and a density weight corresponding to the phonemes, the importance weight being used to characterize the importance of the phoneme in driving the face of the virtual image, and the density weight being used to characterize the density of each phoneme among the N phonemes; and the processor 110 is specifically configured to generate the first driving parameter sequence based on the importance weight, the density weight, the phonemes, the phoneme information and the mapping relationship between the facial visemes in the virtual image and the phonemes.
Optionally, in the embodiment of the present application, the processor 110 is further configured to obtain the phoneme sequence corresponding to the phonemes; and the processor 110 is specifically configured to: generate a first phoneme sequence based on the phoneme sequence, the importance weight and the density weight; and convert the first phoneme sequence according to the phoneme information and the mapping relationship between the facial visemes in the virtual image and the phonemes, to generate the first driving parameter sequence.
Optionally, in the embodiment of the present application, the processor 110 is specifically configured to perform temporal feature smoothing on the driving parameters corresponding to each phoneme in the first driving parameter sequence, respectively, to obtain a smoothed second driving parameter sequence; perform temporal feature smoothing on the second driving parameter sequence to obtain a third driving parameter sequence; and drive the face of the virtual image based on the third driving parameter sequence.
Optionally, in the embodiment of the present application, the processor 110 is specifically configured to generate an energy coefficient weight corresponding to each phoneme based on the short-time energy of the first input information; obtain an energy coefficient weight sequence corresponding to the phonemes based on the phoneme sequence corresponding to the phonemes in the first input information and the energy coefficient weights; generate a fourth driving parameter sequence based on the energy coefficient weight sequence, the intensity parameters of the facial visemes in the virtual image and the first driving parameter sequence; and drive the face of the virtual image based on the fourth driving parameter sequence.
Optionally, in the embodiment of the present application, the processor 110 is further configured to extract acoustic feature information corresponding to first voice information, the first voice information being the input voice information, or voice information converted from the text information; and the processor 110 is specifically configured to perform, based on the acoustic feature information, voice-text alignment on the first voice information and the text information corresponding to the first voice information, to generate the voice-text alignment information.
Optionally, in the embodiment of the present application, the phoneme information includes the duration of each phoneme; and the processor 110 is specifically configured to divide the duration corresponding to the first input information into P time segments based on the duration of each phoneme, P being an integer greater than 1, and determine the density weight corresponding to the phonemes based on the density information of each phoneme contained in each of the P time segments.
In the electronic device provided in the embodiment of the present application, the electronic device may obtain first input information, the first input information including at least one of voice information and text information; generate voice-text alignment information based on the first input information; determine, based on the voice-text alignment information, N phonemes corresponding to the first input information, the phonemes including phoneme information, N being an integer greater than 1; generate a first driving parameter sequence based on the phonemes, the phoneme information and the mapping relationship between the facial visemes in the virtual image and the phonemes; and drive the face of the virtual image based on the first driving parameter sequence. In this way, since the phoneme information of the N phonemes can accurately express the facial lip shapes of the virtual image corresponding to the first input information, a more accurate first driving parameter sequence can be generated to drive the face of the virtual image. This avoids uncoordinated lip movements of the presented virtual image and improves the final synchronization effect.
It should be understood that, in the embodiments of the present application, the input unit 104 may include a graphics processing unit (Graphics Processing Unit, GPU) 1041 and a microphone 1042, and the graphics processing unit 1041 processes image data of still pictures or videos obtained by an image capture apparatus (such as a camera) in a video capture mode or an image capture mode. The display unit 106 may include a display panel 1061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 107 includes at least one of a touch panel 1071 and other input devices 1072. The touch panel 1071 is also called a touch screen. The touch panel 1071 may include two parts: a touch detection apparatus and a touch controller. The other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse and a joystick, which are not described here again.
The memory 109 may be used to store software programs and various data. The memory 109 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, wherein the first storage area may store an operating system, an application program or instructions required by at least one function (such as a sound playing function and an image playing function), and the like. In addition, the memory 109 may include a volatile memory or a non-volatile memory, or the memory 109 may include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM) or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), a static random access memory (Static RAM, SRAM), a dynamic random access memory (Dynamic RAM, DRAM), a synchronous dynamic random access memory (Synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDRSDRAM), an enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), a synch link dynamic random access memory (Synch link DRAM, SLDRAM) or a direct rambus random access memory (Direct Rambus RAM, DRRAM). The memory 109 in the embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
The processor 110 may include one or more processing units; optionally, the processor 110 integrates an application processor and a modem processor, wherein the application processor mainly handles operations involving the operating system, the user interface, application programs and the like, and the modem processor, such as a baseband processor, mainly handles wireless communication signals. It can be understood that the modem processor may alternatively not be integrated into the processor 110.
An embodiment of the present application further provides a readable storage medium, on which a program or instructions are stored. When the program or instructions are executed by a processor, the processes of the above virtual image face driving method embodiment are implemented, and the same technical effects can be achieved; to avoid repetition, details are not described here again.
The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory ROM, a random access memory RAM, a magnetic disk or an optical disc.
An embodiment of the present application further provides a chip. The chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement the processes of the above virtual image face driving method embodiment, and the same technical effects can be achieved; to avoid repetition, details are not described here again.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-on-chip, a system chip, a chip system or a system-on-a-chip.
An embodiment of the present application provides a computer program product. The program product is stored in a storage medium, and the program product is executed by at least one processor to implement the processes of the above virtual image face driving method embodiment, and the same technical effects can be achieved; to avoid repetition, details are not described here again.
It should be noted that, in this document, the terms "comprise", "include" or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or further includes elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that includes the element. In addition, it should be pointed out that the scope of the methods and apparatuses in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, and may also include performing the functions in a substantially simultaneous manner or in a reverse order according to the functions involved; for example, the described methods may be performed in an order different from that described, and various steps may also be added, omitted or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases the former is the better implementation. Based on such an understanding, the technical solution of the present application, in essence or the part contributing to the prior art, can be embodied in the form of a computer software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions to cause a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above specific implementations. The above specific implementations are merely illustrative rather than restrictive. Under the inspiration of the present application, those of ordinary skill in the art can make many other forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

Claims (18)

  1. A virtual image face driving method, the method comprising:
    obtaining first input information, the first input information comprising at least one of voice information and text information;
    generating voice-text alignment information based on the first input information;
    determining, based on the voice-text alignment information, N phonemes corresponding to the first input information, the phonemes comprising phoneme information, N being an integer greater than 1;
    generating a first driving parameter sequence based on the phonemes, the phoneme information, and a mapping relationship between facial visemes in a virtual image and the phonemes; and
    driving the face of the virtual image based on the first driving parameter sequence.
  2. The method according to claim 1, wherein the generating a first driving parameter sequence based on the phonemes, the phoneme information, and a mapping relationship between facial visemes in a virtual image and the phonemes comprises:
    determining, based on the phonemes and the phoneme information, an importance weight and a density weight corresponding to the phonemes, the importance weight being used to characterize the importance of the phoneme in driving the face of the virtual image, and the density weight being used to characterize the density of each phoneme among the N phonemes; and
    generating the first driving parameter sequence based on the importance weight, the density weight, the phonemes, the phoneme information and the mapping relationship between the facial visemes in the virtual image and the phonemes.
  3. The method according to claim 2, wherein the generating the first driving parameter sequence based on the importance weight, the density weight, the phonemes, the phoneme information and the mapping relationship between the facial visemes in the virtual image and the phonemes comprises:
    obtaining a phoneme sequence corresponding to the phonemes;
    generating a first phoneme sequence based on the phoneme sequence, the importance weight and the density weight; and
    converting the first phoneme sequence according to the phoneme information and the mapping relationship between the facial visemes in the virtual image and the phonemes, to generate the first driving parameter sequence.
  4. The method according to claim 1, wherein the driving the face of the virtual image based on the first driving parameter sequence comprises:
    performing temporal feature smoothing on the driving parameters corresponding to each phoneme in the first driving parameter sequence, respectively, to obtain a smoothed second driving parameter sequence;
    performing temporal feature smoothing on the second driving parameter sequence to obtain a third driving parameter sequence; and
    driving the face of the virtual image based on the third driving parameter sequence.
  5. The method according to claim 1, wherein the driving the face of the virtual image based on the first driving parameter sequence comprises:
    generating an energy coefficient weight corresponding to each phoneme based on short-time energy of the first input information;
    obtaining an energy coefficient weight sequence corresponding to the phonemes based on the phoneme sequence corresponding to the phonemes in the first input information and the energy coefficient weights;
    generating a fourth driving parameter sequence based on the energy coefficient weight sequence, intensity parameters of the facial visemes in the virtual image and the first driving parameter sequence; and
    driving the face of the virtual image based on the fourth driving parameter sequence.
  6. The method according to claim 1, wherein the generating voice-text alignment information based on the first input information comprises:
    extracting acoustic feature information corresponding to first voice information, the first voice information being the input voice information, or voice information converted from the text information; and
    performing, based on the acoustic feature information, voice-text alignment on the first voice information and the text information corresponding to the first voice information, to generate the voice-text alignment information.
  7. The method according to claim 2, wherein the phoneme information comprises the duration of each phoneme; and
    the determining, based on the phonemes and the phoneme information, the density weight corresponding to the phonemes comprises:
    dividing the duration corresponding to the first input information into P time segments based on the duration of each phoneme, P being an integer greater than 1; and
    determining the density weight corresponding to the phonemes based on density information of each phoneme contained in each of the P time segments.
  8. A virtual image face driving apparatus, the apparatus comprising: an obtaining module, a generating module, a determining module and an executing module, wherein:
    the obtaining module is configured to obtain first input information, the first input information comprising at least one of voice information and text information;
    the generating module is configured to generate voice-text alignment information based on the first input information obtained by the obtaining module;
    the determining module is configured to determine, based on the voice-text alignment information generated by the generating module, N phonemes corresponding to the first input information, the phonemes comprising phoneme information, N being an integer greater than 1;
    the generating module is further configured to generate a first driving parameter sequence based on the phonemes and the phoneme information determined by the determining module and a mapping relationship between facial visemes in a virtual image and the phonemes; and
    the executing module is configured to drive the face of the virtual image based on the first driving parameter sequence generated by the generating module.
  9. The apparatus according to claim 8, wherein
    the determining module is further configured to determine, based on the phonemes and the phoneme information, an importance weight and a density weight corresponding to the phonemes, the importance weight being used to characterize the importance of the phoneme in driving the face of the virtual image, and the density weight being used to characterize the density of each phoneme among the N phonemes; and
    the generating module is specifically configured to generate the first driving parameter sequence based on the importance weight, the density weight, the phonemes and the phoneme information determined by the determining module and the mapping relationship between the facial visemes in the virtual image and the phonemes.
  10. The apparatus according to claim 9, wherein
    the obtaining module is further configured to obtain a phoneme sequence corresponding to the phonemes; and
    the generating module is specifically configured to:
    generate a first phoneme sequence based on the phoneme sequence obtained by the obtaining module, the importance weight and the density weight; and
    convert the first phoneme sequence according to the phoneme information and the mapping relationship between the facial visemes in the virtual image and the phonemes, to generate the first driving parameter sequence.
  11. The apparatus according to claim 8, wherein
    the executing module is specifically configured to:
    perform temporal feature smoothing on the driving parameters corresponding to each phoneme in the first driving parameter sequence generated by the generating module, respectively, to obtain a smoothed second driving parameter sequence;
    perform temporal feature smoothing on the second driving parameter sequence to obtain a third driving parameter sequence; and
    drive the face of the virtual image based on the third driving parameter sequence.
  12. The apparatus according to claim 8, wherein
    the executing module is specifically configured to:
    generate an energy coefficient weight corresponding to each phoneme based on short-time energy of the first input information;
    obtain an energy coefficient weight sequence corresponding to the phonemes based on the phoneme sequence corresponding to the phonemes in the first input information and the energy coefficient weights;
    generate a fourth driving parameter sequence based on the energy coefficient weight sequence, intensity parameters of the facial visemes in the virtual image and the first driving parameter sequence; and
    drive the face of the virtual image based on the fourth driving parameter sequence.
  13. The apparatus according to claim 8, wherein the apparatus further comprises an extracting module, wherein:
    the extracting module is configured to extract acoustic feature information corresponding to first voice information, the first voice information being the input voice information, or voice information converted from the text information; and
    the generating module is specifically configured to perform, based on the acoustic feature information extracted by the extracting module, voice-text alignment on the first voice information and the text information corresponding to the first voice information, to generate the voice-text alignment information.
  14. The apparatus according to claim 9, wherein the phoneme information comprises the duration of each phoneme; and
    the determining module is specifically configured to:
    divide the duration corresponding to the first input information into P time segments based on the duration of each phoneme, P being an integer greater than 1; and
    determine the density weight corresponding to the phonemes based on density information of each phoneme contained in each of the P time segments.
  15. An electronic device, comprising a processor and a memory, the memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the virtual image face driving method according to any one of claims 1 to 7.
  16. A readable storage medium, on which a program or instructions are stored, wherein the program or instructions, when executed by a processor, implement the steps of the virtual image face driving method according to any one of claims 1 to 7.
  17. A computer program product, wherein the program product is executed by at least one processor to implement the virtual image face driving method according to any one of claims 1 to 7.
  18. An electronic device, wherein the electronic device is configured to execute the virtual image face driving method according to any one of claims 1 to 7.
PCT/CN2023/126582 2022-10-27 2023-10-25 Virtual image facial driving method and apparatus, electronic device, and medium WO2024088321A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211325775.7A CN115662388A (zh) 2022-10-27 2022-10-27 Virtual image facial driving method and apparatus, electronic device, and medium
CN202211325775.7 2022-10-27

Publications (1)

Publication Number Publication Date
WO2024088321A1 true WO2024088321A1 (zh) 2024-05-02

Family

ID=84994320

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/126582 WO2024088321A1 (zh) 2022-10-27 2023-10-25 虚拟形象面部驱动方法、装置、电子设备及介质

Country Status (2)

Country Link
CN (1) CN115662388A (zh)
WO (1) WO2024088321A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115662388A (zh) * 2022-10-27 2023-01-31 维沃移动通信有限公司 Virtual image facial driving method and apparatus, electronic device, and medium
CN116095357B (zh) * 2023-04-07 2023-07-04 世优(北京)科技有限公司 Live streaming method, apparatus and system for a virtual anchor

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020024519A1 (en) * 2000-08-20 2002-02-28 Adamsoft Corporation System and method for producing three-dimensional moving picture authoring tool supporting synthesis of motion, facial expression, lip synchronizing and lip synchronized voice of three-dimensional character
CN109377540A (zh) * 2018-09-30 2019-02-22 网易(杭州)网络有限公司 Facial animation synthesis method and apparatus, storage medium, processor and terminal
CN110853614A (zh) * 2018-08-03 2020-02-28 Tcl集团股份有限公司 Virtual object lip-shape driving method and apparatus, and terminal device
CN111459450A (zh) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method and apparatus, device, and storage medium
CN111460785A (zh) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method and apparatus, device, and storage medium
CN111508064A (zh) * 2020-04-14 2020-08-07 北京世纪好未来教育科技有限公司 Phoneme-driven expression synthesis method and apparatus, and computer storage medium
CN112927712A (zh) * 2021-01-25 2021-06-08 网易(杭州)网络有限公司 Video generation method and apparatus, and electronic device
CN113539240A (zh) * 2021-07-19 2021-10-22 北京沃东天骏信息技术有限公司 Animation generation method and apparatus, electronic device, and storage medium
CN115312030A (zh) * 2022-06-22 2022-11-08 网易(杭州)网络有限公司 Display control method and apparatus for a virtual character, and electronic device
CN115662388A (zh) * 2022-10-27 2023-01-31 维沃移动通信有限公司 Virtual image facial driving method and apparatus, electronic device, and medium

Also Published As

Publication number Publication date
CN115662388A (zh) 2023-01-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23881898

Country of ref document: EP

Kind code of ref document: A1