CN113314104A - Interactive object driving and phoneme processing method, device, equipment and storage medium

Interactive object driving and phoneme processing method, device, equipment and storage medium

Info

Publication number
CN113314104A
CN113314104A (application CN202110604874.8A)
Authority
CN
China
Prior art keywords
phoneme
voice
interactive object
feature extraction
posterior probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110604874.8A
Other languages
Chinese (zh)
Other versions
CN113314104B (en)
Inventor
吴文岩 (Wu Wenyan)
吴潜溢 (Wu Qianyi)
高娜 (Gao Na)
钱晨 (Qian Chen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202110604874.8A priority Critical patent/CN113314104B/en
Publication of CN113314104A publication Critical patent/CN113314104A/en
Priority to PCT/CN2022/089870 priority patent/WO2022252890A1/en
Priority to TW111119388A priority patent/TW202248994A/en
Application granted granted Critical
Publication of CN113314104B publication Critical patent/CN113314104B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 Transforming into visible information
    • G10L 2021/105 Synthesis of the lips movements from speech, e.g. for talking heads
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)
  • Processing Or Creating Images (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed are an interactive object driving and phoneme processing method, apparatus, device, and storage medium. The interactive object driving method includes: acquiring acoustic features of sound driving data of an interactive object; performing feature extraction on the acoustic features with a sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data, the sound feature extraction network being trained with a phoneme table containing multiple languages; obtaining pose parameter values of the interactive object according to the phoneme posterior probability of each speech frame; and controlling the pose of the interactive object according to the pose parameter values.

Description

Interactive object driving and phoneme processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for interactive object driving and phoneme processing.
Background
Digital human technology uses deep learning to match the emitted sound with the presented mouth shape, expression, motion, and the like. With the wide application of digital humans in many fields, there are many scenarios in which a digital human is required to support multiple languages.
At present, digital humans are usually driven by voice features extracted with a speech recognition model or by voice features obtained from phoneme timestamps. However, these features differ across languages, so deep learning has to be carried out separately on data sets of different languages, and the currently available open-source data sets suffer from problems such as low quality, incomplete labeling, and unbalanced data.
How to enable a digital human to support multiple languages is therefore a problem that currently calls for active research.
Disclosure of Invention
The disclosed embodiments provide an interactive object driving and phoneme processing scheme.
According to an aspect of the present disclosure, there is provided a driving method of an interactive object, the method including: acquiring acoustic features of sound driving data of an interactive object; performing feature extraction on the acoustic features with a sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data, the sound feature extraction network being trained with a phoneme table containing multiple languages; obtaining pose parameter values of the interactive object according to the phoneme posterior probability of each speech frame; and controlling the pose of the interactive object according to the pose parameter values.
The embodiments of the present disclosure train the sound feature extraction network with a phoneme table containing multiple languages, which improves the efficiency and quality of training the feature extraction network; the phoneme posterior features that this network extracts from the sound driving data are then used as the voice features that drive the interactive object.
In combination with any one of the embodiments provided by the present disclosure, the acquiring the acoustic characteristics of the sound driving data of the interactive object includes: acquiring a voice frame sequence corresponding to the voice driving data of the interactive object; and obtaining the acoustic characteristics of the sound driving data according to the acoustic characteristic vector of each speech frame in the speech frame sequence.
In combination with any one of the embodiments provided by the present disclosure, the sound feature extraction network includes a first fully-connected network, a coding sub-network, and a second fully-connected network, and performing feature extraction on the acoustic features with the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data includes: inputting the acoustic features into the first fully-connected network to obtain a first acoustic feature sequence output by the first fully-connected network; performing feature coding processing on the first acoustic feature sequence with the coding sub-network; and inputting the coding result into the second fully-connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
In the embodiments of the present disclosure, the acoustic features are converted into a sequence, feature extraction is performed by a coding network suited to extracting sequence features, and the phoneme posterior probability of each speech frame in the sound driving data can then be accurately predicted through the classification performed by the fully-connected network.
In combination with any one of the embodiments provided by the present disclosure, obtaining the pose parameter values of the interactive object according to the phoneme posterior probability of each speech frame includes: inputting the phoneme posterior probability of each speech frame into a time-sequence network and outputting associated feature information; inputting the associated feature information into a third fully-connected network to obtain an associated feature sequence; and activating the associated feature sequence to obtain the pose parameter values of the interactive object matched with the phoneme posterior probability of each speech frame.
The pose parameter values corresponding to the phoneme posterior probabilities of the speech frames in the sound driving data are predicted by a time-sequence network and a fully-connected network, so that historical phoneme posterior probabilities that carry relevant information are fused with the current phoneme posterior probability. Historical pose parameter values thus influence the change of the current pose parameter values, which makes the change of the pose parameter values of the interactive character smoother and more natural.
In combination with any one of the embodiments provided by the present disclosure, the control parameters of the interactive object include facial pose control parameters, and controlling the pose of the interactive object according to the pose parameter values includes: driving the interactive object, according to the facial pose control parameters matched with the phoneme posterior probability of each speech frame, to realize the facial pose matched with each speech frame in the sound driving data.
When speech is output according to the sound driving data, the interactive object is driven to make facial expressions according to the facial pose control parameters corresponding to the sound driving data, so that while outputting the speech the interactive object synchronously makes the mouth shape and expression of uttering it. The target object thus gets the impression that the interactive object is speaking, which improves the interactive experience of the target object.
According to an aspect of the present disclosure, a phoneme processing method is provided, the method including: obtaining a phoneme table containing multiple languages according to phonemes in multiple target languages; and training to obtain a sound feature extraction network based on the multi-language phoneme table, wherein the sound feature extraction network is used for extracting the phoneme posterior probability of the speech frame to be recognized.
The embodiments of the present disclosure train the sound feature extraction network with a phoneme table containing multiple languages, which improves the efficiency and quality of training the feature extraction network; the phoneme posterior features that this network extracts from the sound driving data are then used as the voice features that drive the interactive object.
In combination with any one of the embodiments provided by the present disclosure, obtaining a phoneme table containing multiple languages according to phonemes in multiple target languages includes: acquiring phonemes in a plurality of target languages and splicing them; and merging the phonemes whose pronunciation similarity exceeds a first set threshold in the splicing result to obtain the phoneme table containing multiple languages.
The embodiments of the present disclosure provide a method for constructing a multilingual phoneme table by splicing, with which a phoneme table containing a plurality of target languages can be obtained conveniently and quickly.
In combination with any embodiment provided by the present disclosure, the method further comprises: respectively mapping phonemes in a plurality of target languages into international phonetic symbols with pronunciation similarity meeting a preset similarity condition; and merging the international phonetic symbols with the same pronunciation in the mapping result to obtain the multi-language phoneme table.
In combination with any one of the embodiments provided by the present disclosure, in response to a first phoneme existing in the plurality of target languages whose pronunciation similarity to each international phonetic symbol is smaller than or equal to a second set threshold, the first phoneme is added to the phoneme table containing multiple languages.
The embodiment of the disclosure provides a method for obtaining a phoneme table containing multiple languages by mapping multiple target languages into international phonetic symbols, and the method is suitable for multiple target languages and has universality.
In combination with any embodiment provided by the present disclosure, the method further comprises: acquiring a multilingual voice sample, wherein the language type of the voice sample is the same as the language type contained in the multilingual phoneme table; performing phoneme alignment operation on the voice sample to obtain phonemes contained in the voice sample; and labeling the phonemes in the voice sample by using the phonemes in the multilingual phoneme table.
In the embodiment of the present disclosure, a phoneme table containing multiple languages is used, so that multiple languages of voice samples can be directly labeled, and a high-quality corpus with complete labeling and balanced data can be conveniently and efficiently constructed for training a voice feature extraction network.
In combination with any embodiment provided by the present disclosure, the method further comprises: inputting the acoustic features of the labeled voice samples into the voice feature extraction network to obtain the phoneme posterior probability of each voice frame in the voice samples; and adjusting the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
According to an aspect of the present disclosure, there is provided an apparatus for driving an interactive object, the apparatus including: a first obtaining unit, configured to acquire acoustic features of sound driving data of an interactive object; a second obtaining unit, configured to perform feature extraction on the acoustic features with a sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data, the sound feature extraction network being trained with a phoneme table containing multiple languages; a third obtaining unit, configured to obtain pose parameter values of the interactive object according to the phoneme posterior probability of each speech frame; and a control unit, configured to control the pose of the interactive object according to the pose parameter values.
In combination with any one of the embodiments provided by the present disclosure, the first obtaining unit is specifically configured to: acquiring a voice frame sequence corresponding to the voice driving data of the interactive object; and obtaining the acoustic characteristics of the sound driving data according to the acoustic characteristic vector of each speech frame in the speech frame sequence.
In combination with any embodiment provided by the present disclosure, the sound feature extraction network includes a first fully connected network, a coding sub-network, and a second fully connected network, and the second obtaining unit is specifically configured to: inputting the acoustic features into the first fully-connected network to obtain a first acoustic feature sequence output by the first fully-connected network; performing feature coding processing on the first acoustic feature sequence by using the coding sub-network; and inputting the coding result into the second full-connection network to obtain the phoneme posterior probability of each speech frame in the voice driving data.
In combination with any one of the embodiments provided by the present disclosure, the third obtaining unit is specifically configured to: input the phoneme posterior probability of each speech frame into a time-sequence network and output associated feature information; input the associated feature information into a third fully-connected network to obtain an associated feature sequence; and activate the associated feature sequence to obtain the pose parameter values of the interactive object matched with the phoneme posterior probability of each speech frame.
In combination with any embodiment provided by the present disclosure, the control parameters of the interactive object include facial pose control parameters, and the control unit is specifically configured to: and driving the interactive object to realize the facial pose matched with each voice frame in the voice driving data according to the facial pose control parameter matched with the phoneme posterior probability of each voice frame.
According to an aspect of the present disclosure, there is provided a phoneme processing apparatus, the apparatus including: a phoneme table obtaining unit, configured to obtain a phoneme table containing multiple languages according to phonemes in multiple target languages; and the training unit is used for training to obtain a sound feature extraction network based on the multi-language phoneme table, and the sound feature extraction network is used for extracting the phoneme posterior probability of the speech frame to be recognized.
In combination with any embodiment provided by the present disclosure, the phoneme table obtaining unit is specifically configured to: acquiring phonemes in a plurality of target languages for splicing; combining the phonemes with the pronunciation similarity exceeding a first set threshold value in the splicing result to obtain the phoneme table containing multiple languages; and training to obtain a sound feature extraction network based on the multi-language phoneme table.
In combination with any embodiment provided by the present disclosure, the phoneme table obtaining unit is specifically configured to: respectively mapping phonemes in a plurality of target languages into international phonetic symbols with pronunciation similarity meeting a preset similarity condition; and merging the international phonetic symbols with the same pronunciation in the mapping result to obtain the multi-language phoneme table.
In combination with any one of the embodiments provided by the present disclosure, in response to a first phoneme having a pronunciation similarity with each international phonetic symbol smaller than or equal to the second set threshold exists in the target languages, adding the first phoneme to the multi-language-containing phoneme table.
In combination with any one of the embodiments provided by the present disclosure, the apparatus further includes a labeling unit configured to: acquiring a multilingual voice sample, wherein the language type of the voice sample is the same as the language type contained in the multilingual phoneme table; performing phoneme alignment operation on the voice sample to obtain phonemes contained in the voice sample; and labeling the phonemes in the voice sample by using the phonemes in the multilingual phoneme table.
In combination with any one of the embodiments provided by the present disclosure, the training unit is specifically configured to: inputting the acoustic features of the labeled voice samples into the voice feature extraction network to obtain the phoneme posterior probability of each voice frame in the voice samples; and adjusting the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
According to an aspect of the present disclosure, there is provided an electronic device, the device including a memory for storing computer instructions executable on a processor, and the processor being configured to implement a driving method of an interactive object according to any one of the embodiments provided in the present disclosure when executing the computer instructions.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the driving method of an interactive object according to any one of the embodiments provided in the present disclosure.
According to an aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the driving method of an interactive object according to any one of the embodiments provided in the present disclosure.
Drawings
In order to more clearly illustrate one or more embodiments of the present specification or technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some of the embodiments described in one or more embodiments of the present specification, and that other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a driving method of an interactive object according to at least one embodiment of the present disclosure;
fig. 2 is a schematic diagram of a process for feature coding a phoneme sequence according to at least one embodiment of the present disclosure;
FIG. 3 is a diagram illustrating a process for mapping a posterior probability of a phoneme, in accordance with at least one embodiment of the present disclosure;
fig. 4 is a flowchart of a phoneme processing method proposed by at least one embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a phoneme processing device according to at least one embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
At least one embodiment of the present disclosure provides a driving method for an interactive object, where the driving method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game console, a desktop computer, an advertisement machine, a kiosk, a vehicle-mounted terminal, and the like, and the server includes a local server or a cloud server, and the method may also be implemented by a way that a processor calls a computer-readable instruction stored in a memory.
In the embodiment of the present disclosure, the interactive object may be any interactive object capable of interacting with the target object, and may be a virtual character, a virtual animal, a virtual article, a cartoon image, or other virtual images capable of implementing an interactive function, where the presentation form of the virtual image may be a 2D form or a 3D form, and the present disclosure is not limited thereto. The target object can be a user, a robot or other intelligent equipment.
The interactive object can be displayed through a terminal device, the terminal device can be a television, an all-in-one machine with a display function, a projector, a Virtual Reality (VR) device, an Augmented Reality (AR) device, and the like, and the specific form of the terminal device is not limited in the disclosure.
In some embodiments, the interactive object may emit a specified voice to the target object in response to the terminal device receiving sound driving data for driving the interactive object to output the voice. The voice driving data can be generated according to the action, expression, identity, preference and the like of the target object around the terminal equipment, so that the interactive object is driven to respond by sending out the specified voice, and therefore the anthropomorphic service is provided for the target object. In some scenarios, the interactive object may interact with the target object using different languages, and in order to make the gesture of the interactive object fit the real pronunciation in different languages, at least one embodiment of the present disclosure provides a driving method for the interactive object.
Fig. 1 illustrates a flowchart of a driving method of an interactive object according to at least one embodiment of the present disclosure, and as shown in fig. 1, the method includes steps 101 to 104.
In step 101, acoustic features of sound driving data of the interactive object are acquired.
The sound driving data may include audio data (voice data), text, and the like. In response to the sound driving data being audio data, the audio data may be used directly to drive the interactive object to output speech, that is, the terminal device outputs speech directly from the audio data. In response to the sound driving data being text, corresponding phonemes may be generated from the speech contained in the text, and the interactive object is driven to output speech through the generated phonemes. Taking a Chinese text as an example, the text may first be converted into pinyin, and corresponding phonemes may then be generated from the pinyin. The sound driving data may also be driving data in other forms, which the present disclosure does not limit.
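As an illustration of the text-to-pinyin-to-phoneme path described above, the following is a minimal Python sketch that is not part of the patent: it assumes the pypinyin library and a simple split into initials and tone-marked finals as the phoneme inventory, neither of which is prescribed by this disclosure.

```python
# A minimal sketch of converting Chinese text driving data into a phoneme
# sequence: text -> pinyin -> initials and tone-marked finals.
# The pypinyin library and this initial/final split are assumptions.
from pypinyin import pinyin, Style

def text_to_phonemes(text):
    initials = pinyin(text, style=Style.INITIALS, strict=False)
    finals = pinyin(text, style=Style.FINALS_TONE3, strict=False)
    phonemes = []
    for (ini,), (fin,) in zip(initials, finals):
        if ini:                 # consonant initial, e.g. "n", "h"
            phonemes.append(ini)
        if fin:                 # vowel final with tone digit, e.g. "i3", "ao3"
            phonemes.append(fin)
    return phonemes

print(text_to_phonemes("你好"))   # expected: ['n', 'i3', 'h', 'ao3']
```

The phonemes produced for "你好" match those used in the phoneme alignment example later in this description.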
In the embodiment of the present disclosure, the voice driving data may be driving data generated according to an action, an expression, an identity, a preference, and the like of a target object interacting with an interaction object, or may be voice driving data called by the terminal device from an internal memory. The present disclosure does not limit the manner of acquiring the sound drive data.
In response to the sound driving data being audio data, the audio data may be split into a plurality of speech frames, and the speech frames are combined into phonemes according to their states; the phonemes formed from the audio data then make up a phoneme sequence. A phoneme is the smallest speech unit divided according to the natural attributes of speech, and one pronunciation action of a real person forms one phoneme.
In response to the sound driving data being a text, phonemes included in the morphemes may be obtained according to the morphemes included in the text, thereby obtaining a corresponding phoneme sequence. It should be understood by those skilled in the art that the phoneme sequence corresponding to the voice driving data can also be obtained by other ways, which is not limited by the present disclosure.
In the embodiments of the present disclosure, the acoustic features may be features related to speech emotion, such as fundamental frequency features, formant features, Mel-frequency cepstral coefficients (MFCCs), and so on.
In step 102, a sound feature extraction network is used to perform feature extraction on the acoustic features, so as to obtain a phoneme posterior probability of each speech frame in the sound driving data.
The phoneme posterior probability represents the probability that the speech frame corresponds to each phoneme. The phoneme posterior probability is independent of the speaker and depends only on the content of the speech.
In the embodiment of the present disclosure, the sound feature extraction network for extracting the phoneme posterior probability of each speech frame in the sound driving data is obtained by training according to a phoneme table containing multiple languages.
In some embodiments, a phone list containing multiple languages may be obtained by: acquiring phonemes in a plurality of target languages for splicing; and combining the phonemes with the pronunciation similarity exceeding a first set threshold in the splicing result, so that a phoneme table containing a plurality of target languages can be conveniently and quickly obtained.
For example, the phonemes in chinese (pinyin) and the phonemes in english are spliced, and the phonemes with the same or similar pronunciation, such as "b", "p", "m", "f", etc., in the splicing result are merged, so as to obtain the phoneme table containing chinese and english.
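A minimal sketch of the splice-and-merge construction just described is given below; the pronunciation similarity function and the first set threshold are placeholders (assumptions), since the patent does not specify how similarity is scored.

```python
# A minimal sketch of the "splice then merge" construction of a multilingual
# phoneme table. The similarity function and threshold are placeholders.
def build_phoneme_table(phoneme_sets, similarity, first_threshold=0.9):
    table = []
    for phones in phoneme_sets:          # splice the per-language phoneme lists
        for p in phones:
            # merge p into an existing entry if pronunciation is close enough
            match = next((q for q in table if similarity(p, q) > first_threshold), None)
            if match is None:
                table.append(p)          # keep phonemes with no close counterpart
    return table

# Usage with toy data: identical symbols are treated as identical pronunciations.
toy_similarity = lambda a, b: 1.0 if a == b else 0.0
print(build_phoneme_table([["b", "p", "m", "f", "a1"], ["b", "p", "m", "f", "i"]],
                          toy_similarity))
```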
In some embodiments, a phoneme table containing multiple languages may be obtained as follows. First, the phonemes in a plurality of target languages are each mapped to the International Phonetic Alphabet (IPA) symbol whose pronunciation similarity satisfies a similarity condition, for example, the symbol with the same pronunciation or with the highest similarity. Then, the international phonetic symbols with the same pronunciation in the mapping result are merged to obtain the phoneme table containing multiple languages. This approach is applicable to a wide range of target languages and is therefore general.
For example, all phonemes in chinese may be mapped to an international phonetic symbol with the highest pronunciation similarity, all phonemes in english may be mapped to an international phonetic symbol with the highest pronunciation similarity, the international phonetic symbols mapped in chinese and english may be stored in one phoneme table, and phonemes with the same pronunciation may be combined to obtain a phoneme table supporting both chinese and english.
For example, assume that the Chinese phonemes include a1, a2, a3, b, i1, i2, i3, ii1, ii2, and ii3 (where 1, 2, and 3 represent tones), the English phonemes include a, b, and i, and the IPA table includes a, b, and i. Based on pronunciation, the Chinese and English phonemes are each mapped to the IPA symbol with the highest similarity: the Chinese phonemes are mapped in order to a, a, a, b, i, i, i, i, i, i (since IPA has no "ii" pronunciation and the actual "ii" pronunciation is most similar to "i", ii is mapped to i). The English phonemes are likewise mapped in order to a, b, and i.
In some embodiments, in response to a first phoneme existing in the plurality of target languages whose pronunciation similarity to each international phonetic symbol is less than or equal to a second set threshold, the first phoneme is added to the phoneme table containing multiple languages. For example, the phoneme "ng" is not present in the IPA table and its similarity to every other pronunciation is less than the second set threshold; or a phoneme is composed of other pronunciations and its similarity to the IPA table is less than the second set threshold. Such a phoneme is called a first phoneme, and it is retained and appended behind the IPA table, that is, the finally obtained phoneme table contains the first phoneme in addition to all of the IPA phonemes.
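The IPA-based construction, including the handling of a first phoneme, can be sketched in the same spirit; again the similarity function and the second set threshold below are placeholder assumptions, not values taken from the patent.

```python
# A minimal sketch of the IPA-mapping construction described above: map each
# language's phonemes to the most similar IPA symbol, and keep any "first
# phoneme" that has no sufficiently similar IPA symbol by appending it.
def build_ipa_phoneme_table(phoneme_sets, ipa_symbols, similarity, second_threshold=0.5):
    table = list(ipa_symbols)
    for phones in phoneme_sets:
        for p in phones:
            best = max(ipa_symbols, key=lambda s: similarity(p, s))
            if similarity(p, best) <= second_threshold:
                if p not in table:        # first phoneme: append behind the IPA symbols
                    table.append(p)
            # otherwise p is represented by `best`, which is already in the table
    return table
```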
It should be understood by those skilled in the art that the first set threshold and the second set threshold may be specifically set according to actual needs, and the disclosure does not limit this.
In the embodiment of the present disclosure, a phoneme table containing multiple languages is used, so that multiple languages of voice samples can be directly labeled, and a high-quality corpus with complete labeling and balanced data can be conveniently and efficiently constructed for training a voice feature extraction network.
In step 103, pose parameter values of the interactive object are obtained according to the phoneme posterior probability of each speech frame.
In the embodiments of the present disclosure, the pose parameter values of the interactive object matched with the sound driving data may be obtained according to the phoneme posterior probability of each speech frame in the sound driving data.
The pose parameters are used to control the pose of the interactive object, and different pose parameter values can drive the interactive object to assume corresponding poses. The pose parameters may include facial pose parameters for controlling the facial pose of the interactive object, including expression, mouth shape, facial movement, head pose, and the like. In the embodiments of the present disclosure, a correspondence between phoneme posterior probabilities and pose parameter values of the interactive object may be established in advance, so that when the phoneme posterior probability of each speech frame in the sound driving data is obtained, the corresponding pose parameter values can be obtained. The specific form of the pose parameters may be determined according to the type of the interactive object model.
In step 104, the pose of the interactive object is controlled according to the pose parameter value.
The pose parameter values are matched with the phoneme posterior probability of each speech frame in the sound driving data of the interactive object, and the phoneme posterior probability is independent of the language. Therefore, for speech data and text in different languages, the pose presented by the interactive object, such as mouth shape, expression, and motion, matches the actual pronunciation, giving the target object that interacts with the interactive object the impression that the interactive object is speaking.
In the embodiments of the present disclosure, the acoustic features of the sound driving data of the interactive object are acquired first, feature extraction is performed on the acoustic features with a sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data, then the pose parameter values of the interactive object are obtained according to the phoneme posterior probability of each speech frame, and the pose of the interactive object is controlled according to the pose parameter values. Since phoneme posterior probabilities are speaker-independent sound features capable of supporting multiple languages, the embodiments of the present disclosure train the sound feature extraction network with a phoneme table containing multiple languages, extract the phoneme posterior features of the sound driving data with this network, and use them as the voice features that drive the interactive object, so that the pose of the interactive object fits the real pronunciation in different languages.
In some embodiments, a corpus supporting multiple languages may be constructed according to the following method.
Firstly, a multilingual voice sample is obtained, and the language type of the voice sample is the same as the language type contained in the multilingual phoneme table. For example, in the case where the phoneme table is a phoneme table supporting chinese and english, a speech sample in chinese and a speech sample in english are acquired, respectively.
And then, performing phoneme alignment operation on the voice sample to obtain phonemes contained in the voice sample.
Take as an example a speech sample that is a segment of the Chinese phrase "ni hao" ("hello"). After the phoneme alignment operation is performed on the speech sample, the pronunciation start and stop time of each phoneme in the segment can be obtained: n [0, 0.2], i3 [0.2, 0.4], h [0.5, 0.7], ao3 [0.7, 1.2], where the start and stop times of each phoneme's pronunciation are indicated within [ ] in units of seconds. The phoneme corresponding to each speech frame in the speech sample is then determined according to the pronunciation start and stop times of each phoneme.
And finally, labeling the phonemes in the voice sample by using the phonemes in the multi-language phoneme table.
Taking the multilingual phoneme table as an example of a phoneme table supporting Chinese and English, for both Chinese voice samples and English voice samples, phonemes in the multilingual phoneme table can be directly called for labeling, so that a high-quality corpus with complete labeling and balanced data can be conveniently and efficiently constructed.
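A minimal sketch of turning the alignment result above into per-frame phoneme labels for training is shown below; the 10 ms frame shift and the "sil" filler label are assumptions rather than values taken from the patent.

```python
# Convert phoneme alignment intervals (start/stop times in seconds) into
# per-frame phoneme labels, using the "ni hao" example above.
def intervals_to_frame_labels(intervals, total_dur, frame_shift=0.01, blank="sil"):
    num_frames = int(round(total_dur / frame_shift))
    labels = [blank] * num_frames
    for phone, start, stop in intervals:
        first = int(round(start / frame_shift))
        last = min(int(round(stop / frame_shift)), num_frames)
        for f in range(first, last):
            labels[f] = phone
    return labels

alignment = [("n", 0.0, 0.2), ("i3", 0.2, 0.4), ("h", 0.5, 0.7), ("ao3", 0.7, 1.2)]
frame_labels = intervals_to_frame_labels(alignment, total_dur=1.2)
print(frame_labels[18:22])   # frames around the n -> i3 boundary: ['n', 'n', 'i3', 'i3']
```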
In some embodiments, the voice feature extraction network may be trained by the following method.
First, the acoustic features of the labeled voice samples are input into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the voice samples. Each speech frame in the labeled voice samples is labeled with the true value of its phoneme.
Then, the parameter values of the sound feature extraction network are adjusted according to the difference between the phoneme indicated by the maximum phoneme posterior probability of each speech frame and the labeled true value. When the change of the network loss satisfies the convergence condition, for example, when the change of the network loss is smaller than a set threshold or the number of iterations reaches a set number, the training is finished and the trained sound feature extraction network is obtained.
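A minimal PyTorch-style training sketch consistent with this description follows: per-frame cross-entropy between the network's phoneme predictions and the labeled true phonemes. The model interface, data loader, and hyperparameters are assumptions.

```python
# Per-frame cross-entropy training of the sound feature extraction network.
import torch
import torch.nn as nn

def train_feature_extractor(model, loader, num_epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        for acoustic_feats, phoneme_ids in loader:
            # acoustic_feats: (batch, frames, feat_dim); phoneme_ids: (batch, frames)
            logits = model(acoustic_feats)          # (batch, frames, num_phonemes)
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             phoneme_ids.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```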
In some embodiments, a speech frame sequence corresponding to the sound driving data of the interactive object may be obtained, and the acoustic features of the sound driving data may be obtained according to the acoustic feature vector of each speech frame in the speech frame sequence. Taking MFCCs as an example, an MFCC matrix corresponding to the sound driving data may be obtained from the MFCC coefficients of each speech frame in the speech frame sequence.
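A minimal sketch of assembling such an MFCC matrix with librosa is given below; the 13 coefficients, 16 kHz sampling rate, and 25 ms/10 ms windowing are assumptions, since the patent does not fix these values.

```python
# Build the per-frame MFCC matrix for a piece of sound driving data.
import librosa

def mfcc_matrix(wav_path, n_mfcc=13, sr=16000):
    audio, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),      # 25 ms analysis window
                                hop_length=int(0.010 * sr)) # 10 ms frame shift
    return mfcc.T      # shape: (num_speech_frames, n_mfcc)
```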
Fig. 2 illustrates a schematic diagram of a sound feature extraction process according to at least one embodiment of the present disclosure. As shown in Fig. 2, feature extraction is performed on the acoustic features of the sound driving data with a sound feature extraction network 200 to obtain the phoneme posterior probability of each speech frame in the sound driving data. The sound feature extraction network 200 comprises a first fully-connected network 201, a coding sub-network 202, and a second fully-connected network 203.
First, the acoustic features are input into the first fully-connected network 201 to obtain the first acoustic feature sequence output by the first fully-connected network. Then, the coding sub-network 202 performs feature coding processing on the first acoustic feature sequence to obtain a coding result. The coding sub-network may be, for example, a CBHG network, a Gated Recurrent Unit (GRU), or another network suitable for extracting sequence features. Finally, the coding result is input into the second fully-connected network 203 to obtain the phoneme posterior probability of each speech frame in the sound driving data.
In the embodiments of the present disclosure, the acoustic features are converted into a sequence, feature extraction is performed by a coding network suited to extracting sequence features, and the phoneme posterior probability of each speech frame in the sound driving data can then be accurately predicted through the classification performed by the fully-connected network.
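A minimal PyTorch sketch of the structure shown in Fig. 2 follows: a first fully-connected network, a sequence-coding sub-network, and a second fully-connected network whose softmax output gives the per-frame phoneme posterior probabilities. A bidirectional GRU stands in for the CBHG/GRU coding sub-network, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SoundFeatureExtractor(nn.Module):
    def __init__(self, feat_dim=13, hidden=256, num_phonemes=80):
        super().__init__()
        self.first_fc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.second_fc = nn.Linear(2 * hidden, num_phonemes)

    def forward(self, acoustic_feats):            # (batch, frames, feat_dim)
        x = self.first_fc(acoustic_feats)          # first acoustic feature sequence
        x, _ = self.encoder(x)                     # feature coding of the sequence
        logits = self.second_fc(x)                 # (batch, frames, num_phonemes)
        # torch.softmax(logits, dim=-1) yields the phoneme posterior probability
        # of each speech frame; the raw logits are returned for training.
        return logits
```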
In some embodiments, the pose parameter values corresponding to the phoneme posterior probabilities of the speech frames in the sound driving data may be predicted by a time-sequence network and a fully-connected network, so that historical phoneme posterior probabilities that carry relevant information are fused with the current phoneme posterior probability. Historical pose parameter values thus influence the change of the current pose parameter values, which makes the change of the pose parameter values of the interactive character smoother and more natural.
Fig. 3 illustrates a diagram of a process for mapping phoneme posterior probabilities according to at least one embodiment of the present disclosure. As shown in Fig. 3, the phoneme posterior probabilities of the speech frames are first input into a time-sequence network 301, which outputs associated feature information. The time-sequence network may be a recurrent neural network, such as an LSTM, which can learn the history of the input phoneme posterior probabilities; the associated feature information it outputs therefore includes the influence of the historical information on the current information. Next, the associated feature information is input into the third fully-connected network 302 to obtain an associated feature sequence. Finally, the associated feature sequence is activated by an activation layer 303, and each feature value in the associated feature sequence is converted into a pose parameter value, yielding the pose parameter values of the interactive object matched with the phoneme posterior probability of each speech frame.
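A minimal PyTorch sketch of the mapping shown in Fig. 3 follows: an LSTM as the time-sequence network, a third fully-connected network, and an activation producing the pose parameter values. Layer sizes and the sigmoid activation are assumptions.

```python
import torch
import torch.nn as nn

class PoseParameterPredictor(nn.Module):
    def __init__(self, num_phonemes=80, hidden=128, num_pose_params=37):
        super().__init__()
        self.temporal = nn.LSTM(num_phonemes, hidden, batch_first=True)
        self.third_fc = nn.Linear(hidden, num_pose_params)

    def forward(self, phoneme_posteriors):             # (batch, frames, num_phonemes)
        assoc, _ = self.temporal(phoneme_posteriors)    # associated feature information
        assoc_seq = self.third_fc(assoc)                # associated feature sequence
        return torch.sigmoid(assoc_seq)                 # pose parameter values in [0, 1]
```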
In some embodiments, the pose parameters of the interactive object include facial pose control parameters, and the interactive object may be driven to realize the facial pose matched with each speech frame in the sound driving data according to the facial pose control parameters matched with the phoneme posterior probability of each speech frame. The facial pose parameters may include, for example, facial muscle control coefficients.
From an anatomical point of view, the motion of the face is the result of the coordinated deformation of the muscles of its various parts. Therefore, by dividing the facial muscles of the interactive object to obtain a facial muscle model, and controlling the motion of each muscle (region) obtained by the division with a corresponding facial muscle control coefficient, that is, performing contraction/expansion control on the muscle, the face of the interactive character can be made to show various expressions. For each muscle of the facial muscle model, the motion states corresponding to different muscle control coefficients can be set according to the position of the muscle on the face and its motion characteristics. For example, for the upper lip muscle, the control coefficient has a value range of (0-1); different values in this range correspond to different contraction/expansion states of the upper lip muscle, and changing the value opens and closes the mouth vertically. For the left mouth corner muscle, the control coefficient likewise ranges over (0-1); different values in this range correspond to contraction/expansion states of the left mouth corner muscle, and changing the value moves the mouth horizontally.
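A minimal sketch of driving such a facial muscle model with control coefficients in the (0-1) range is shown below; the Avatar class, muscle names, and rendering call are hypothetical stand-ins for an actual rendering backend.

```python
class Avatar:                              # hypothetical stand-in for the renderer
    def set_muscle_state(self, muscle, value):
        print(f"{muscle} -> {value:.2f}")

def apply_facial_pose(avatar, muscle_coefficients):
    for muscle, value in muscle_coefficients.items():
        clamped = min(max(value, 0.0), 1.0)   # control coefficients live in [0, 1]
        avatar.set_muscle_state(muscle, clamped)

# Example: open the mouth vertically and slightly raise the left mouth corner.
apply_facial_pose(Avatar(), {"upper_lip": 0.6, "left_mouth_corner": 0.2})
```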
When speech is output according to the sound driving data, the interactive object is driven to make facial expressions according to the facial pose control parameters corresponding to the sound driving data, so that while outputting the speech the interactive object synchronously makes the mouth shape and expression of uttering it. The target object thus gets the impression that the interactive object is speaking, which improves the interactive experience of the target object.
Fig. 4 is a flowchart of a phoneme processing method according to at least one embodiment of the present disclosure. As shown in fig. 4, the method includes steps 401 to 402.
In step 401, a phoneme table containing multiple languages is obtained according to phonemes in multiple target languages.
In one example, a phone list containing multiple languages may be obtained by: acquiring phonemes in a plurality of target languages for splicing; and combining the phonemes with the pronunciation similarity exceeding a first set threshold in the splicing result, so that a phoneme table containing a plurality of target languages can be conveniently and quickly obtained.
In another example, a phone list containing multiple languages may be obtained by: firstly, the phonemes in a plurality of target languages are respectively mapped into international phonetic symbols with pronunciation similarity satisfying a similarity condition, wherein the similarity condition is, for example, pronunciation identity or highest similarity. And then, combining the international phonetic symbols with the same pronunciation in the mapping result to obtain the multi-language contained phoneme table. The method is suitable for various target languages and has universality.
In some embodiments, in response to a first phoneme having a pronunciation similarity to each international phonetic symbol less than or equal to the second set threshold existing in the plurality of target languages, the first phoneme is added to the multilingual-containing phoneme table.
It should be understood by those skilled in the art that the first set threshold and the second set threshold may be specifically set according to actual needs, and the disclosure does not limit this.
In step 402, based on the multi-language phoneme table, a sound feature extraction network is trained, and the sound feature extraction network is used for extracting the phoneme posterior probability of the speech frame to be recognized.
The embodiments of the present disclosure train the sound feature extraction network with a phoneme table containing multiple languages, which improves the efficiency and quality of training the feature extraction network; the phoneme posterior features that this network extracts from the sound driving data are then used as the voice features that drive the interactive object.
In some embodiments, a corpus supporting multiple languages may be constructed according to the following method.
Firstly, a multilingual voice sample is obtained, and the language type of the voice sample is the same as the language type contained in the multilingual phoneme table.
And then, performing phoneme alignment operation on the voice sample to obtain phonemes contained in the voice sample.
And finally, labeling the phonemes in the voice sample by using the phonemes in the multi-language phoneme table.
In the embodiment of the disclosure, the phoneme table containing multiple languages is utilized, and the phonemes in the phoneme table containing multiple languages can be directly called for labeling, so that a high-quality corpus with complete labeling and balanced data can be conveniently and efficiently constructed.
In some embodiments, the voice feature extraction network may be trained by the following method.
First, the acoustic features of the labeled voice samples are input into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the voice samples. Each speech frame in the labeled voice samples is labeled with the true value of its phoneme.
Then, the parameter values of the sound feature extraction network are adjusted according to the difference between the phoneme indicated by the maximum phoneme posterior probability of each speech frame and the labeled true value. When the change of the network loss satisfies the convergence condition, for example, when the change of the network loss is smaller than a set threshold or the number of iterations reaches a set number, the training is finished and the trained sound feature extraction network is obtained.
Fig. 5 is a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure. As shown in Fig. 5, the apparatus may include: a first obtaining unit 501, configured to obtain acoustic features of sound driving data of an interactive object; a second obtaining unit 502, configured to perform feature extraction on the acoustic features with a sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data, the sound feature extraction network being trained with a phoneme table containing multiple languages; a third obtaining unit 503, configured to obtain pose parameter values of the interactive object according to the phoneme posterior probability of each speech frame; and a control unit 504, configured to control the pose of the interactive object according to the pose parameter values.
In some embodiments, the first obtaining unit is specifically configured to: acquiring a voice frame sequence corresponding to the voice driving data of the interactive object; and obtaining the acoustic characteristics of the sound driving data according to the acoustic characteristic vector of each speech frame in the speech frame sequence.
In some embodiments, the sound feature extraction network includes a first fully-connected network, a coding sub-network, and a second fully-connected network, and the second obtaining unit is specifically configured to: inputting the sound features into the first fully-connected network to obtain a first acoustic feature sequence output by the first fully-connected network; performing feature coding processing on the first acoustic feature sequence by using the coding sub-network; and inputting the coding result into the second full-connection network to obtain the phoneme posterior probability of each speech frame in the voice driving data.
In some embodiments, the third obtaining unit is specifically configured to: input the phoneme posterior probability of each speech frame into a time-sequence network and output associated feature information; input the associated feature information into a third fully-connected network to obtain an associated feature sequence; and activate the associated feature sequence to obtain the pose parameter values of the interactive object matched with the phoneme posterior probability of each speech frame.
In some embodiments, the control parameters of the interaction object comprise facial pose control parameters, the control unit being specifically configured to: and driving the interactive object to realize the facial pose matched with each voice frame in the voice driving data according to the facial pose control parameter matched with the phoneme posterior probability of each voice frame.
Fig. 6 is a schematic structural diagram of a phoneme processing apparatus according to at least one embodiment of the present disclosure. As shown in Fig. 6, the apparatus may include: a phoneme table obtaining unit 601, configured to obtain a phoneme table containing multiple languages according to phonemes in multiple target languages; and a training unit 602, configured to train a sound feature extraction network based on the phoneme table containing multiple languages, where the sound feature extraction network is used to extract the phoneme posterior probability of a speech frame to be recognized.
In some embodiments, the phoneme table obtaining unit is specifically configured to: acquire the phonemes in the multiple target languages and splice them; and merge the phonemes whose pronunciation similarity exceeds a first set threshold in the splicing result to obtain the phoneme table containing multiple languages, based on which the sound feature extraction network is then trained.
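The splice-then-merge step might look like the following sketch, where the `similarity` callable and the threshold value stand in for a pronunciation-similarity measure that the disclosure does not specify here.

```python
def build_multilingual_phoneme_table(phoneme_sets, similarity, first_threshold=0.9):
    """Splice the phoneme sets of several target languages into one list, then
    merge phonemes whose pronunciation similarity exceeds the first set threshold."""
    spliced = [p for phonemes in phoneme_sets for p in phonemes]   # splicing result
    table = []
    for phoneme in spliced:
        if any(similarity(phoneme, kept) > first_threshold for kept in table):
            continue                                               # merged into an existing entry
        table.append(phoneme)
    return table
```

For example, `build_multilingual_phoneme_table([chinese_phonemes, english_phonemes], similarity)` would return a single table covering both languages; the argument names are, again, hypothetical.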
In some embodiments, the phoneme table obtaining unit is specifically configured to: map the phonemes in the multiple target languages to international phonetic symbols whose pronunciation similarity meets a preset similarity condition, respectively; and merge the international phonetic symbols with the same pronunciation in the mapping result to obtain the multi-language phoneme table.
In some embodiments, in response to the multiple target languages containing a first phoneme whose pronunciation similarity to every international phonetic symbol is less than or equal to a second set threshold, the first phoneme is added to the phoneme table containing multiple languages.
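A sketch covering both the mapping variant and the fallback for phonemes without a sufficiently similar international phonetic symbol; `to_ipa`, which is assumed to return the closest symbol together with a similarity score, and the threshold value are hypothetical.

```python
def build_phoneme_table_via_ipa(phoneme_sets, to_ipa, second_threshold=0.5):
    """Map each phoneme to its closest international phonetic symbol, merge symbols
    that share a pronunciation, and add directly any phoneme whose similarity to
    every symbol is at or below the second set threshold."""
    table, seen_symbols = [], set()
    for phonemes in phoneme_sets:
        for phoneme in phonemes:
            ipa_symbol, score = to_ipa(phoneme)       # hypothetical: closest symbol + similarity
            if score <= second_threshold:
                table.append(phoneme)                 # no sufficiently similar symbol: keep as-is
            elif ipa_symbol not in seen_symbols:
                seen_symbols.add(ipa_symbol)
                table.append(ipa_symbol)              # identical pronunciations collapse to one entry
    return table
```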
In some embodiments, the apparatus further includes a labeling unit configured to: acquire a multilingual voice sample, where the language type of the voice sample is the same as a language type contained in the multilingual phoneme table; perform a phoneme alignment operation on the voice sample to obtain the phonemes contained in the voice sample; and label the phonemes in the voice sample by using the phonemes in the multilingual phoneme table.
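One way the labeling unit could turn an external forced alignment into per-frame labels is sketched below; the interval format, the `phoneme_map` lookup into the multilingual phoneme table, the 'sil' default and the 10 ms hop are all assumptions.

```python
def label_speech_frames(alignment, phoneme_map, num_frames, hop_ms=10):
    """Convert a phoneme alignment, given as (start_s, end_s, phoneme) intervals in
    seconds, into one label per speech frame expressed with the phonemes of the
    multilingual phoneme table via the `phoneme_map` lookup."""
    labels = ["sil"] * num_frames                          # placeholder label for unaligned frames
    for start_s, end_s, phoneme in alignment:
        start = int(start_s * 1000 / hop_ms)
        end = min(num_frames, int(end_s * 1000 / hop_ms) + 1)
        for frame in range(start, end):
            labels[frame] = phoneme_map.get(phoneme, phoneme)   # fall back to the original phoneme
    return labels
```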
In some embodiments, the training unit is specifically configured to: input the acoustic features of the labeled voice samples into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the voice samples; and adjust the parameter values of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of each speech frame and the labeled ground truth.
At least one embodiment of the present disclosure further provides an electronic device. As shown in Fig. 7, the device includes a memory for storing computer instructions executable on a processor, and the processor is configured to implement the method for driving an interactive object according to any embodiment of the present disclosure when executing the computer instructions.
At least one embodiment of the present disclosure also provides a computer-readable storage medium having a computer program stored thereon, where the program, when executed by a processor, implements the method for driving an interactive object according to any embodiment of the present disclosure.
At least one embodiment of the present disclosure also provides a computer program product including a computer program, where the computer program, when executed by a processor, implements the method for driving an interactive object according to any embodiment of the present disclosure.
As will be appreciated by one skilled in the art, one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the data processing apparatus embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, reference may be made to the description of the method embodiments.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (16)

1. A method of driving an interactive object, the method comprising:
acquiring acoustic features of sound driving data of an interactive object;
performing feature extraction on the acoustic features by using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, wherein the sound feature extraction network is obtained by training according to a phoneme table containing multiple languages;
obtaining a pose parameter value of the interactive object according to the phoneme posterior probability of each speech frame;
and controlling the pose of the interactive object according to the pose parameter value.
2. The method of claim 1, wherein the obtaining acoustic features of sound driving data of an interactive object comprises:
acquiring a speech frame sequence corresponding to the sound driving data of the interactive object;
and obtaining the acoustic features of the sound driving data according to the acoustic feature vector of each speech frame in the speech frame sequence.
3. The method according to claim 1 or 2, wherein the sound feature extraction network comprises a first fully-connected network, a coding sub-network, and a second fully-connected network, and the performing feature extraction on the acoustic features by using the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data comprises:
inputting the acoustic features into the first fully-connected network to obtain a first acoustic feature sequence output by the first fully-connected network;
performing feature coding processing on the first acoustic feature sequence by using the coding sub-network;
and inputting the coding result into the second fully-connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
4. The method according to any one of claims 1 to 3, wherein the obtaining a pose parameter value of the interactive object according to the phoneme posterior probability of each speech frame comprises:
inputting the phoneme posterior probability of each speech frame into a time-sequence network, and outputting associated feature information;
inputting the associated feature information into a third fully-connected network to obtain an associated feature sequence;
and activating the associated feature sequence to obtain the pose parameter value of the interactive object matching the phoneme posterior probability of each speech frame.
5. The method of any one of claims 1 to 4, wherein the pose parameters of the interactive object comprise facial pose parameters, and the controlling the pose of the interactive object according to the pose parameter value comprises:
driving the interactive object to realize the facial pose matching each speech frame in the sound driving data according to the facial pose parameters matching the phoneme posterior probability of each speech frame.
6. A method for phoneme processing, the method comprising:
obtaining a phoneme table containing multiple languages according to phonemes in multiple target languages;
and training to obtain a sound feature extraction network based on the multi-language phoneme table, wherein the sound feature extraction network is used for extracting the phoneme posterior probability of the speech frame to be recognized.
7. The method of claim 6, wherein obtaining a phoneme table containing multiple languages according to phonemes in multiple target languages comprises:
splicing phonemes in the target languages;
and combining the phonemes with the pronunciation similarity exceeding a first set threshold value in the splicing result to obtain a phoneme table containing multiple languages.
8. The method of claim 6, wherein obtaining a phoneme table containing multiple languages according to phonemes in multiple target languages comprises:
respectively mapping phonemes in a plurality of target languages into international phonetic symbols with pronunciation similarity meeting a preset similarity condition;
and merging the international phonetic symbols with the same pronunciation in the mapping result to obtain the multi-language phoneme table.
9. The method according to claim 8, wherein, in response to the multiple target languages containing a first phoneme whose pronunciation similarity to every international phonetic symbol is less than or equal to a second set threshold, the first phoneme is added to the phoneme table containing multiple languages.
10. The method according to any one of claims 6 to 9, further comprising:
acquiring a multilingual voice sample, wherein the language type of the voice sample is the same as the language type contained in the multilingual phoneme table;
performing phoneme alignment operation on the voice sample to obtain phonemes contained in the voice sample;
and labeling the phonemes in the voice sample by using the phonemes in the multilingual phoneme table.
11. The method of claim 10, wherein the training to obtain a sound feature extraction network based on the phoneme table containing multiple languages comprises:
inputting the acoustic features of the labeled voice samples into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the voice samples;
and adjusting the parameter values of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of each speech frame and the labeled ground truth.
12. An apparatus for driving an interactive object, the apparatus comprising:
a first obtaining unit, configured to obtain acoustic features of sound driving data of an interactive object;
a second obtaining unit, configured to perform feature extraction on the acoustic features by using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, wherein the sound feature extraction network is obtained by training according to a phoneme table containing multiple languages;
a third obtaining unit, configured to obtain a pose parameter value of the interactive object according to the phoneme posterior probability of each speech frame;
and a control unit, configured to control the pose of the interactive object according to the pose parameter value.
13. A phoneme processing apparatus, the apparatus comprising:
a phoneme table obtaining unit, configured to obtain a phoneme table containing multiple languages according to phonemes in multiple target languages;
and the training unit is used for training to obtain a sound feature extraction network based on the multi-language phoneme table, and the sound feature extraction network is used for extracting the phoneme posterior probability of the speech frame to be recognized.
14. An electronic device, comprising a memory for storing computer instructions executable on a processor, the processor being configured to implement the method of any one of claims 1 to 5 when executing the computer instructions or the processor being configured to implement the method of any one of claims 6 to 11 when executing the computer instructions.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 5, or which, when being executed by a processor, carries out the method of any one of claims 6 to 11.
16. A computer program product comprising a computer program, wherein the program is adapted to perform the method of any one of claims 1 to 5 when executed by a processor or to perform the method of any one of claims 6 to 11 when executed by a processor.
CN202110604874.8A 2021-05-31 2021-05-31 Interactive object driving and phoneme processing method, device, equipment and storage medium Active CN113314104B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202110604874.8A CN113314104B (en) 2021-05-31 2021-05-31 Interactive object driving and phoneme processing method, device, equipment and storage medium
PCT/CN2022/089870 WO2022252890A1 (en) 2021-05-31 2022-04-28 Interaction object driving and phoneme processing methods and apparatus, device and storage medium
TW111119388A TW202248994A (en) 2021-05-31 2022-05-25 Method for driving interactive object and processing phoneme, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110604874.8A CN113314104B (en) 2021-05-31 2021-05-31 Interactive object driving and phoneme processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113314104A true CN113314104A (en) 2021-08-27
CN113314104B CN113314104B (en) 2023-06-20

Family

ID=77376708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110604874.8A Active CN113314104B (en) 2021-05-31 2021-05-31 Interactive object driving and phoneme processing method, device, equipment and storage medium

Country Status (3)

Country Link
CN (1) CN113314104B (en)
TW (1) TW202248994A (en)
WO (1) WO2022252890A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110503942A (en) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 A kind of voice driven animation method and device based on artificial intelligence
CN110880315A (en) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 Personalized voice and video generation system based on phoneme posterior probability
CN112017648A (en) * 2020-08-25 2020-12-01 北京声智科技有限公司 Weighted finite state converter construction method, speech recognition method and device
CN112669841A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Training method and device for multilingual speech generation model and computer equipment
CN113314104B (en) * 2021-05-31 2023-06-20 北京市商汤科技开发有限公司 Interactive object driving and phoneme processing method, device, equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106098059A (en) * 2016-06-23 2016-11-09 上海交通大学 customizable voice awakening method and system
US20180101764A1 (en) * 2016-10-07 2018-04-12 International Business Machines Corporation Transfer of an acoustic knowledge to a neural network
WO2018227780A1 (en) * 2017-06-12 2018-12-20 平安科技(深圳)有限公司 Speech recognition method and device, computer device and storage medium
CN109377986A (en) * 2018-11-29 2019-02-22 四川长虹电器股份有限公司 A kind of non-parallel corpus voice personalization conversion method
CN112259089A (en) * 2019-07-04 2021-01-22 阿里巴巴集团控股有限公司 Voice recognition method and device
CN111459450A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111459454A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111933110A (en) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 Video generation method, generation model training method, device, medium and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王洪海 et al., "A Survey of Research Methods and Development of Automatic Language Identification" (自动语言辨识的研究方法及发展概述), 《电脑与信息技术》 (Computer and Information Technology), No. 02, 15 April 2007, pages 37-39
秦春香 et al., "Application of Pronunciation Features in Uyghur-Chinese Speech Recognition" (发音特征在维汉语音识别中的应用), 《计算机工程》 (Computer Engineering), No. 23, 5 December 2012, pages 176-180

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022252890A1 (en) * 2021-05-31 2022-12-08 上海商汤智能科技有限公司 Interaction object driving and phoneme processing methods and apparatus, device and storage medium
CN113724718A (en) * 2021-09-01 2021-11-30 宿迁硅基智能科技有限公司 Target audio output method, device and system
WO2023030235A1 (en) * 2021-09-01 2023-03-09 南京硅基智能科技有限公司 Target audio output method and system, readable storage medium, and electronic apparatus
US11763801B2 (en) 2021-09-01 2023-09-19 Nanjing Silicon Intelligence Technology Co., Ltd. Method and system for outputting target audio, readable storage medium, and electronic device

Also Published As

Publication number Publication date
CN113314104B (en) 2023-06-20
WO2022252890A1 (en) 2022-12-08
TW202248994A (en) 2022-12-16

Similar Documents

Publication Publication Date Title
WO2021169431A1 (en) Interaction method and apparatus, and electronic device and storage medium
JP7432556B2 (en) Methods, devices, equipment and media for man-machine interaction
WO2021036644A1 (en) Voice-driven animation method and apparatus based on artificial intelligence
TWI766499B (en) Method and apparatus for driving interactive object, device and storage medium
CN112162628A (en) Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
KR102449875B1 (en) Method for translating speech signal and electronic device thereof
CN111459452B (en) Driving method, device and equipment of interaction object and storage medium
CN111459454B (en) Interactive object driving method, device, equipment and storage medium
CN113067953A (en) Customer service method, system, device, server and storage medium
WO2022252890A1 (en) Interaction object driving and phoneme processing methods and apparatus, device and storage medium
CN114401438A (en) Video generation method and device for virtual digital person, storage medium and terminal
WO2021196644A1 (en) Method, apparatus and device for driving interactive object, and storage medium
CN113689879A (en) Method, device, electronic equipment and medium for driving virtual human in real time
CN111415662A (en) Method, apparatus, device and medium for generating video
JP2017182261A (en) Information processing apparatus, information processing method, and program
Iwahashi Interactive learning of spoken words and their meanings through an audio-visual interface
CN115171673A (en) Role portrait based communication auxiliary method and device and storage medium
CN112632262A (en) Conversation method, conversation device, computer equipment and storage medium
KR20210124306A (en) Interactive object driving method, apparatus, device and recording medium
KR102370993B1 (en) Artificial Intelligence sign language service system with real-time translation and communication based on neural network
WO2024113701A1 (en) Voice-based video generation method and apparatus, server, and medium
Gjaci, Culturally Competent Non-Verbal Communication Based on Generative Adversarial Networks
CN117036556A (en) Virtual image driving method and device and robot
CN114781401A (en) Data processing method, device, equipment and storage medium
CN117935807A (en) Method, device, equipment and storage medium for driving mouth shape of digital person

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40049329

Country of ref document: HK

GR01 Patent grant