WO2022252890A1 - Interaction object driving and phoneme processing methods and apparatus, device and storage medium - Google Patents

Interaction object driving and phoneme processing methods and apparatus, device and storage medium

Info

Publication number
WO2022252890A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
sound
interactive object
feature extraction
speech
Prior art date
Application number
PCT/CN2022/089870
Other languages
French (fr)
Chinese (zh)
Inventor
吴文岩
吴潜溢
高娜
钱晨
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2022252890A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • G10L2021/105Synthesis of the lips movements from speech, e.g. for talking heads
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to an interactive object driving and phoneme processing method, apparatus, device, and storage medium.
  • the voice made by the digital human can be matched with the presented mouth shape, expression, movement, etc.
  • digital humans are required to support multiple languages in many scenarios.
  • the voice features extracted by the speech recognition model, or the voice features obtained by phoneme time stamps are usually used to drive the digital human.
  • these features differ across languages, and deep learning requires datasets covering the different languages of the digital human's application scenarios.
  • the current open source datasets have problems such as low quality, incomplete annotation, and unbalanced data.
  • the embodiment of the present disclosure provides an interactive object driving and phoneme processing solution.
  • a method for driving an interactive object, the method comprising: acquiring sound features of sound driving data of the interactive object; performing feature extraction on the sound features using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, wherein the sound feature extraction network is obtained by training with a phoneme table containing multiple languages; obtaining posture parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames; and controlling the posture of the interactive object according to the posture parameter values.
  • the acquiring the sound features of the sound driving data of the interactive object includes: acquiring a speech frame sequence corresponding to the sound driving data of the interactive object; and obtaining the sound features of the sound driving data according to the sound feature vector of each speech frame in the speech frame sequence.
  • the sound feature extraction network includes a first fully connected network, an encoding sub-network, and a second fully connected network, and the performing feature extraction on the sound features using the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data includes: inputting the sound features into the first fully connected network to obtain a first sound feature sequence output by the first fully connected network; performing feature encoding processing on the first sound feature sequence using the encoding sub-network; and inputting the encoding result into the second fully connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  • the obtaining the posture parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames includes: inputting the phoneme posterior probabilities of the respective speech frames into a time series network and outputting associated feature information; inputting the associated feature information into a third fully connected network to obtain an associated feature sequence; and performing activation processing on the associated feature sequence to obtain the posture parameter values of the interactive object matched with the phoneme posterior probabilities of the respective speech frames.
  • the control parameters of the interactive object include facial posture control parameters.
  • the controlling the posture of the interactive object according to the posture parameter values includes: driving the interactive object, according to the facial posture parameter values matched with the phoneme posterior probabilities of the respective speech frames, to realize facial postures matched with the respective speech frames in the sound driving data.
  • a phoneme processing method, comprising: obtaining a multilingual phoneme table based on phonemes in multiple target languages; and training a sound feature extraction network based on the multilingual phoneme table, wherein the sound feature extraction network is used to extract the phoneme posterior probability of a speech frame.
  • the obtaining the multilingual phoneme table according to the phonemes in the multiple target languages includes: acquiring and splicing the phonemes in the multiple target languages; and merging, in the splicing result, the phonemes whose pronunciation similarity exceeds a first set threshold to obtain the multilingual phoneme table.
  • the method further includes: mapping the phonemes in the multiple target languages to the International Phonetic Alphabet symbols whose pronunciation similarity satisfies a preset similarity condition; and merging the International Phonetic Alphabet symbols with the same pronunciation in the mapping result to obtain the multilingual phoneme table.
  • in response to the existence of a first phoneme in the multiple target languages whose pronunciation similarity with each International Phonetic Alphabet symbol is less than or equal to a second set threshold, the first phoneme is added to the multilingual phoneme table.
  • the method further includes: acquiring a multilingual speech sample, wherein the language type of the speech sample is the same as the language type included in the multilingual phoneme table; performing a phoneme alignment operation on the speech samples to obtain the phonemes included in the speech samples; using the phonemes in the multilingual phoneme table to mark the real values of the phonemes in the speech samples.
  • the method further includes: inputting the sound features of the marked speech samples into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples; for For each speech frame in the speech sample, adjust the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
  • a driving device for an interactive object, comprising: a first acquisition unit, configured to acquire sound features of sound driving data of the interactive object; a second acquisition unit, configured to perform feature extraction on the sound features using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, wherein the sound feature extraction network is obtained by training with a phoneme table containing multiple languages; a third acquisition unit, configured to obtain posture parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames; and a control unit, configured to control the posture of the interactive object according to the posture parameter values.
  • the first acquisition unit is specifically configured to: acquire a speech frame sequence corresponding to the sound driving data of the interactive object; and obtain the sound features of the sound driving data according to the sound feature vector of each speech frame in the speech frame sequence.
  • the sound feature extraction network includes a first fully connected network, an encoding sub-network and a second fully connected network.
  • the second acquisition unit is specifically configured to: input the sound features into the first fully connected network to obtain a first sound feature sequence output by the first fully connected network; perform feature encoding processing on the first sound feature sequence using the encoding sub-network; and input the encoding result into the second fully connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  • the third acquisition unit is specifically configured to: input the phoneme posterior probabilities of the respective speech frames into a time series network and output associated feature information; input the associated feature information into a third fully connected network to obtain an associated feature sequence; and perform activation processing on the associated feature sequence to obtain the posture parameter values of the interactive object matched with the phoneme posterior probabilities of the respective speech frames.
  • the control parameters of the interactive object include facial posture control parameters.
  • the control unit is specifically configured to: drive the interactive object, according to the facial posture parameter values matched with the phoneme posterior probabilities of the respective speech frames, to realize facial postures matched with the respective speech frames in the sound driving data.
  • a phoneme processing device, comprising: a phoneme table acquisition unit, configured to obtain a multilingual phoneme table based on phonemes in multiple target languages; and a training unit, configured to train a sound feature extraction network based on the multilingual phoneme table, wherein the sound feature extraction network is used to extract the phoneme posterior probability of a speech frame.
  • the phoneme table acquisition unit is specifically configured to: acquire and splice phonemes in multiple target languages; and merge, in the splicing result, the phonemes whose pronunciation similarity exceeds the first set threshold to obtain the multilingual phoneme table, based on which the sound feature extraction network is obtained through training.
  • the phoneme table acquisition unit is specifically configured to: map phonemes in multiple target languages to the International Phonetic Alphabet symbols whose pronunciation similarity satisfies a preset similarity condition; and merge the International Phonetic Alphabet symbols with the same pronunciation in the mapping result to obtain the multilingual phoneme table.
  • the device further includes a labeling unit, configured to: acquire multilingual speech samples, wherein the language types of the speech samples are the same as the language types contained in the multilingual phoneme table; perform a phoneme alignment operation on the speech samples to obtain the phonemes contained in the speech samples; and annotate the ground-truth values of the phonemes in the speech samples using the phonemes in the multilingual phoneme table.
  • the training unit is specifically configured to: input the sound features of the marked speech samples into the sound feature extraction network, and obtain the phoneme posterior probability of each speech frame in the speech samples ; For each speech frame in the speech sample, adjust the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
  • an electronic device, comprising a memory and a processor, wherein the memory is used for storing computer instructions executable on the processor, and the processor is used for implementing, when executing the computer instructions, the method for driving an interactive object described in any implementation manner provided by the present disclosure.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object described in any implementation manner provided by the present disclosure is implemented.
  • a computer program product including a computer program, and when the program is executed by a processor, the method for driving an interactive object described in any implementation manner provided by the present disclosure is implemented.
  • Fig. 1 is a flow chart of a method for driving an interactive object proposed by at least one embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a feature encoding process for a phoneme sequence proposed by at least one embodiment of the present disclosure
  • Fig. 3 is a schematic diagram of a mapping process of a phoneme posterior probability shown in at least one embodiment of the present disclosure
  • FIG. 4 is a flowchart of a phoneme processing method proposed by at least one embodiment of the present disclosure
  • Fig. 5 is a schematic structural diagram of a driving device for an interactive object proposed by at least one embodiment of the present disclosure
  • Fig. 6 is a schematic structural diagram of a phoneme processing device proposed by at least one embodiment of the present disclosure.
  • Fig. 7 is a schematic structural diagram of an electronic device proposed by at least one embodiment of the present disclosure.
  • At least one embodiment of the present disclosure provides a method for driving an interactive object.
  • the driving method can be executed by an electronic device such as a terminal device or a server.
  • the terminal device can be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game machine, desktop computer, advertising machine, all-in-one machine, vehicle-mounted terminal, etc.
  • the server includes a local server or a cloud server, etc.
  • the method can also be realized by calling a computer-readable instruction stored in a memory by a processor.
  • the interactive object, which can interact with the target object, may be a virtual character, or a virtual animal, virtual item, cartoon image, or any other virtual image capable of realizing interactive functions.
  • the virtual image may be presented in 2D or 3D form, which is not limited in the present disclosure.
  • the target object can be a user, a robot, or other smart devices.
  • the interactive object can be displayed through a terminal device, and the terminal device can be a TV, an all-in-one machine with a display function, a projector, a virtual reality (Virtual Reality, VR) device, an augmented reality (Augmented Reality, AR) device etc., the present disclosure does not limit the specific form of the terminal device.
  • in response to the terminal device receiving sound driving data for driving the interactive object to output speech, the interactive object can emit a specified voice to the target object.
  • sound driving data can be generated to drive the interactive object to respond by issuing a specified voice, thereby providing anthropomorphic services for the target object.
  • the interactive object can interact with the target object in different languages. In order to make the posture of the interactive object match the real pronunciation in different languages, at least one embodiment of the present disclosure proposes a driving method for the interactive object.
  • FIG. 1 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 1 , the method includes steps 101 to 104 .
  • In step 101, the sound features of the sound driving data of the interactive object are acquired.
  • the sound driving data may include audio data (speech data), text, and the like.
  • in response to the sound driving data being audio data, the audio data can be directly used to drive the interactive object to output speech, that is, the terminal device outputs speech directly from the audio data; in response to the sound driving data being text, corresponding phonemes can be generated based on the speech contained in the text, and the interactive object is driven to output speech through the generated phonemes.
  • the text may first be converted into pinyin, and then corresponding phonemes may be generated according to the pinyin.
  • the sound driving data may also be other forms of driving data, which is not limited in the present disclosure.
  • the sound driving data may be the driving data generated according to the action, expression, identity, preference, etc. of the target object interacting with the interactive object, or it may be the sound driving data called by the terminal device from the internal memory .
  • the present disclosure does not limit the acquisition method of the sound driving data.
  • the audio data can be split into a plurality of speech frames, and the speech frames are combined according to their states to form phonemes; the phonemes formed according to the audio data then form a phoneme sequence.
  • a phoneme is the smallest speech unit divided according to the natural attributes of speech, and a pronunciation action of a real person can form a phoneme.
  • the phonemes included in the morphemes may be obtained according to the morphemes included in the text, so as to obtain a corresponding phoneme sequence.
  • the phoneme sequence corresponding to the sound driving data can also be obtained in other ways, which is not limited in the present disclosure.
  • the sound features may be features related to speech emotion, such as fundamental frequency features, formant features, Mel-Frequency Cepstral Coefficients (MFCC), and the like.
  • In step 102, feature extraction is performed on the sound features using a sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  • the above-mentioned sound feature extraction network is obtained by training through a multilingual phoneme table.
  • the phoneme posterior probability represents the posterior probability that the speech frame corresponds to each phoneme in the phoneme table.
  • the phoneme posterior probability has nothing to do with the speaker, but only with the content of the speech.
  • for example, if the phoneme table contains phoneme 1, phoneme 2 and phoneme 3, the phoneme posterior probability of a speech frame includes the probability that the speech frame corresponds to phoneme 1, the probability that it corresponds to phoneme 2, and the probability that it corresponds to phoneme 3.
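  • As a toy illustration only (the four-entry phoneme table below is hypothetical, not one from the disclosure), the phoneme posterior probability of a single speech frame can be pictured as one probability per phoneme-table entry:

```python
# Toy illustration: a per-frame phoneme posterior over a tiny, hypothetical phoneme table.
phoneme_table = ["b", "a", "i", "sil"]
frame_posterior = [0.05, 0.85, 0.05, 0.05]   # one probability per phoneme, summing to 1

# The phoneme this frame most likely corresponds to is the one with the largest posterior.
best = max(range(len(frame_posterior)), key=lambda k: frame_posterior[k])
print(phoneme_table[best])                   # -> "a"
```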
  • the sound feature extraction network used to extract the phoneme posterior probability of each speech frame in the sound driving data is obtained by training through a phoneme table including multiple languages.
  • the multilingual phoneme table can be obtained in the following manner: the phonemes in multiple target languages are acquired and spliced, and the phonemes whose pronunciation similarity exceeds the first set threshold in the splicing result are merged, so that phoneme tables covering multiple target languages can be obtained conveniently and quickly.
  • for example, phonemes in Chinese can be concatenated with phonemes in English, and phonemes with the same or similar pronunciation in the concatenated result, such as "b", "p", "m", "f", are merged, so that a phoneme table covering Chinese and English is obtained.
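  • A minimal sketch of this splice-and-merge construction is given below; the similarity function and threshold are illustrative placeholders, not values prescribed by the disclosure:

```python
# Sketch: build a multilingual phoneme table by splicing two per-language phoneme
# inventories and merging entries whose pronunciations are similar enough.
def build_phoneme_table(phonemes_a, phonemes_b, similarity, threshold=0.9):
    spliced = list(phonemes_a) + list(phonemes_b)      # splice the two inventories
    table = []
    for phoneme in spliced:
        # Merge this phoneme into an existing entry when one sounds alike enough.
        if not any(similarity(existing, phoneme) > threshold for existing in table):
            table.append(phoneme)
    return table

# Toy usage: "b", "p", "m", "f" appear in both inventories and collapse to single entries.
chinese = ["b", "p", "m", "f", "a1", "a2", "a3"]
english = ["b", "p", "m", "f", "i"]
same = lambda x, y: 1.0 if x == y else 0.0             # placeholder pronunciation similarity
print(build_phoneme_table(chinese, english, same))
# ['b', 'p', 'm', 'f', 'a1', 'a2', 'a3', 'i']
```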
  • the multilingual phoneme table can also be obtained in the following manner: first, the phonemes in the multiple target languages are respectively mapped to the International Phonetic Alphabet (IPA) symbols whose pronunciation similarity satisfies the similarity condition, the similarity condition being, for example, identical pronunciation or the highest similarity; next, the IPA symbols with the same pronunciation in the mapping result are merged to obtain the multilingual phoneme table. This method is applicable to a variety of target languages and is universally applicable.
  • all phonemes in Chinese can be mapped to the IPA with the highest pronunciation similarity
  • all phonemes in English can be mapped to the IPA with the highest pronunciation similarity
  • the IPA symbols mapped from Chinese and English can be stored in a phoneme table, the phonemes with the same pronunciation are merged, and a phoneme table supporting Chinese and English is obtained.
  • for example, suppose the Chinese phonemes include a1, a2, a3, b, i1, i2, i3, ii1, ii2, ii3 (where 1, 2, and 3 represent tones), and the English phonemes include a, b, i.
  • the IPA table contains a, b, i.
  • the phonemes in Chinese and English are each mapped to the IPA symbol with the highest similarity: the Chinese sequence is mapped to a, a, a, b, i, i, i, i, i, i (since there is no ii pronunciation in the IPA, and the actual pronunciation of ii is most similar to i, ii is mapped to i).
  • the English phonemes are mapped to a, b, i in turn.
  • the first phonemes are added to the multilingual phoneme table.
  • for example, the phoneme "ng" in Chinese does not exist in the IPA table, and its pronunciation similarity with the other pronunciations is less than the second set threshold; or, a certain phoneme in Chinese is composed of several other pronunciations, so that its similarity with the entries in the IPA table is also less than the second set threshold. Such a phoneme is called a first phoneme. The first phoneme is reserved and appended to the end of the IPA table, that is, the final phoneme table contains the first phonemes in addition to all of the IPA symbols.
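  • A sketch of this IPA-based construction is given below; the similarity function, the second set threshold value, and the toy inventories are illustrative assumptions:

```python
# Sketch: map each language phoneme to its most similar IPA symbol (so phonemes mapped
# to the same symbol are merged), and append any "first phoneme" with no sufficiently
# similar IPA symbol (e.g. the Chinese "ng") to the end of the table.
def build_ipa_phoneme_table(language_phonemes, ipa_symbols, similarity, second_threshold=0.5):
    table = list(ipa_symbols)                       # start from the IPA inventory
    first_phonemes = []                             # phonemes without a close IPA match
    for phoneme in language_phonemes:
        best_score = max(similarity(phoneme, ipa) for ipa in ipa_symbols)
        if best_score <= second_threshold and phoneme not in first_phonemes:
            first_phonemes.append(phoneme)          # keep it as its own entry
        # otherwise the phoneme is represented by its best-matching IPA symbol
    return table + first_phonemes

# Toy usage mirroring the example above: "ii" variants map to "i", "ng" is appended.
ipa = ["a", "b", "i"]
chinese = ["a1", "a2", "a3", "b", "i1", "ii1", "ng"]
sim = lambda p, q: 1.0 if p.rstrip("123").replace("ii", "i") == q else 0.0
print(build_ipa_phoneme_table(chinese, ipa, sim))   # ['a', 'b', 'i', 'ng']
```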
  • the first set threshold and the second set threshold can be set according to actual needs, which is not limited in the present disclosure.
  • the multilingual phoneme table can be used to directly annotate multilingual speech samples, so that a high-quality, completely annotated, and data-balanced corpus can be constructed conveniently and efficiently for training the sound feature extraction network.
  • In step 103, the posture parameter values of the interactive object are obtained according to the phoneme posterior probabilities of the respective speech frames.
  • the pose parameter value of the interactive object matching the sound driving data may be obtained according to the phoneme posterior probability of each speech frame in the sound driving data.
  • the posture parameter is used to control the posture of the interactive object, and the interactive object can be driven to make a corresponding posture by using different posture parameter values.
  • the posture parameters may include facial posture parameters, which are used to control the facial postures of the interactive object, including expressions, mouth shapes, facial features, and head postures; in embodiments of the present disclosure, a correspondence between phoneme posterior probabilities and posture parameter values of the interactive object may be pre-established, so that once the phoneme posterior probability of each speech frame in the sound driving data is obtained, the posture parameter values corresponding to the sound driving data can be obtained.
  • the specific form of the attitude parameter can be determined according to the type of the interactive object model.
  • In step 104, the posture of the interactive object is controlled according to the posture parameter values.
  • since the posture parameter values are matched with the phoneme posterior probabilities of the respective speech frames in the sound driving data of the interactive object, and the phoneme posterior probability is language-independent, for speech data and texts in different languages the postures (such as mouth shapes, expressions, actions, etc.) presented by the interactive object can be matched with the actual pronunciation, giving the target object interacting with the interactive object the feeling that the interactive object is speaking.
  • in the embodiments of the present disclosure, the sound features of the sound driving data of the interactive object are first obtained, and the sound feature extraction network is used to perform feature extraction on the sound features to obtain the phoneme posterior probability of each speech frame in the sound driving data; the posture parameter values of the interactive object are then obtained according to the phoneme posterior probabilities of the respective speech frames, and the posture of the interactive object is controlled according to the posture parameter values. Because the phoneme posterior probability is independent of the speaker and can support multiple languages, training the sound feature extraction network with the multilingual phoneme table and using this network to extract the phoneme posterior probabilities of the sound driving data as the sound feature for driving the interactive object makes the posture of the interactive object match the real pronunciation in different languages.
  • the multilingual corpus can be constructed according to the following method.
  • multilingual speech samples are acquired, and the language types of the speech samples are the same as the language types contained in the multilingual phoneme table.
  • the phoneme table is a phoneme table supporting Chinese and English
  • the Chinese speech samples and the English speech samples are obtained respectively.
  • a phoneme alignment operation is performed on the speech samples to obtain phonemes included in the speech samples.
  • the pronunciation start and end time of each phoneme in the speech segment can be obtained: n[0,0.2] , i3[0.2,0.4], h[0.5,0.7], ao3[0.7,1.2], where [] indicates the start and end time of pronunciation of each phoneme in seconds.
  • the phoneme corresponding to each speech frame in the speech sample can be determined through the pronunciation start and end time of each phoneme.
  • the ground-truth values of the phonemes in the speech samples are annotated with the phonemes in the multilingual phoneme table.
  • regardless of the language of a speech sample, the phonemes in the multilingual phoneme table can be directly used for labeling, so that a high-quality, completely annotated, and data-balanced corpus can be constructed conveniently and efficiently.
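  • The per-frame ground-truth labels can be derived from the alignment result as in the sketch below; the 10 ms frame hop and the "sil" label for unvoiced gaps are assumed values, not values prescribed by the disclosure:

```python
# Sketch: convert a phoneme alignment (as in the example above: n[0,0.2], i3[0.2,0.4],
# h[0.5,0.7], ao3[0.7,1.2], times in seconds) into one ground-truth label per speech frame.
def frame_labels(alignment, num_frames, hop_s=0.01, silence="sil"):
    """alignment: list of (phoneme, start_s, end_s); returns one label per speech frame."""
    labels = []
    for frame in range(num_frames):
        t = frame * hop_s                            # time of this speech frame
        label = silence                              # default for gaps between phonemes
        for phoneme, start, end in alignment:
            if start <= t < end:
                label = phoneme
                break
        labels.append(label)
    return labels

alignment = [("n", 0.0, 0.2), ("i3", 0.2, 0.4), ("h", 0.5, 0.7), ("ao3", 0.7, 1.2)]
print(frame_labels(alignment, num_frames=120)[18:24])  # ['n', 'n', 'i3', 'i3', 'i3', 'i3']
```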
  • the sound feature extraction network can be trained by the following method.
  • the sound features of the marked speech samples are input to the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples.
  • each speech frame in the marked speech sample is marked with a real value of a phoneme.
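  • A minimal PyTorch-style sketch of one training step is given below. It assumes the network outputs per-frame posteriors and uses a negative log-likelihood objective against the annotated phoneme of each speech frame, which is one common way of adjusting the parameters according to the difference between the predicted and labelled phoneme; shapes and names are assumptions, not part of the disclosure:

```python
# Sketch of one training step for the sound feature extraction network.
import torch
import torch.nn.functional as F

def train_step(feature_net, optimizer, mfcc, frame_phoneme_ids):
    """mfcc: (1, T, feat_dim) sound features; frame_phoneme_ids: (T,) ground-truth indices."""
    optimizer.zero_grad()
    posteriors = feature_net(mfcc).squeeze(0)        # (T, num_phonemes) per-frame probabilities
    log_probs = torch.log(posteriors + 1e-8)         # NLL loss expects log-probabilities
    loss = F.nll_loss(log_probs, frame_phoneme_ids)  # penalize mismatch with the labelled phoneme
    loss.backward()
    optimizer.step()                                 # adjust the network parameters
    return loss.item()
```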
  • the voice frame sequence corresponding to the voice driving data of the interactive object may be obtained, and the voice features of the voice driving data may be obtained according to the voice feature vectors of each voice frame in the voice frame sequence.
  • the MFCC matrix corresponding to the sound driving data can be obtained.
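  • A sketch of obtaining such an MFCC matrix with librosa is given below; the sampling rate, window length, hop length and number of coefficients are assumed values:

```python
# Sketch: compute the MFCC matrix of the sound driving data, one feature vector per speech frame.
import librosa

def sound_features(wav_path, sr=16000, n_mfcc=13):
    audio, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)   # 25 ms windows, 10 ms hop
    return mfcc.T                                            # shape (num_frames, n_mfcc)
```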
  • Fig. 2 shows a schematic diagram of a sound feature extraction process shown in at least one embodiment of the present disclosure.
  • the present disclosure utilizes a sound feature extraction network 200 to perform feature extraction on the sound features of the sound driving data, so as to obtain the phoneme posterior probability of each speech frame in the sound feature data.
  • the sound feature extraction network 200 includes a first fully connected network 201 , an encoding sub-network 202 and a second fully connected network 203 .
  • first, the sound features are input into the first fully connected network 201 to obtain the first sound feature sequence output by the first fully connected network; then, the coding sub-network 202 is used to perform feature encoding processing on the first sound feature sequence to obtain the encoding result.
  • the coding sub-network can be, for example, a CBHG network, a Gated Recurrent Unit (GRU), or another network suitable for extracting sequence features.
  • the encoding result is input to the second fully connected network 203 to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  • in this way, the phoneme posterior probability of each speech frame in the sound driving data can be accurately predicted.
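  • A minimal PyTorch sketch of this network structure is given below; a GRU stands in for the encoding sub-network, and all layer sizes are assumptions for illustration:

```python
# Sketch of the sound feature extraction network: first fully connected network,
# encoding sub-network, second fully connected network with softmax output giving
# the per-frame phoneme posterior probability.
import torch
import torch.nn as nn

class SoundFeatureExtractor(nn.Module):
    def __init__(self, feat_dim=13, hidden=256, num_phonemes=100):
        super().__init__()
        self.fc1 = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())  # first FC network
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)           # encoding sub-network
        self.fc2 = nn.Linear(hidden, num_phonemes)                        # second FC network

    def forward(self, mfcc):                       # mfcc: (batch, T, feat_dim)
        x = self.fc1(mfcc)                         # first sound feature sequence
        x, _ = self.encoder(x)                     # feature encoding of the sequence
        return torch.softmax(self.fc2(x), dim=-1)  # (batch, T, num_phonemes) posteriors
```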
  • the posture parameter values corresponding to the phoneme posterior probabilities of the respective speech frames in the sound driving data can be predicted through a time series network and a fully connected network, so that the associated historical phoneme posterior probabilities are fused with the current phoneme posterior probability, and the historical posture parameter values affect the change of the current posture parameter value, making the change of the interactive object's posture gentler and more natural.
  • Fig. 3 shows a schematic diagram of a mapping process of phoneme posterior probabilities shown in at least one embodiment of the present disclosure.
  • the phoneme posterior probability of each speech frame is input into the time series network 301 , and associated feature information is output.
  • the time series network may be a time recursive neural network, such as LSTM.
  • the time series network can learn the historical information of the input phoneme posterior probability, and the output associated feature information includes the influence of the historical information on the current information.
  • the associated feature information is input into the third fully connected network 302 to obtain an associated feature sequence.
  • the associated feature sequence is activated through the activation layer 303, and each feature value in the associated feature sequence is transformed into a posture parameter value, so as to obtain the posture parameter values of the interactive object matched with the phoneme posterior probabilities of the respective speech frames.
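  • A PyTorch sketch of this mapping is given below; the choice of an LSTM as the time series network, the layer sizes, the number of posture parameters and the sigmoid activation are assumptions for illustration:

```python
# Sketch: map per-frame phoneme posteriors to posture parameter values through a
# time series network, a third fully connected network, and an activation layer.
import torch
import torch.nn as nn

class PoseMapper(nn.Module):
    def __init__(self, num_phonemes=100, hidden=128, num_pose_params=37):
        super().__init__()
        self.temporal = nn.LSTM(num_phonemes, hidden, batch_first=True)  # time series network
        self.fc3 = nn.Linear(hidden, num_pose_params)                    # third FC network
        self.activation = nn.Sigmoid()                                   # activation layer

    def forward(self, posteriors):               # posteriors: (batch, T, num_phonemes)
        assoc, _ = self.temporal(posteriors)     # associated feature information
        seq = self.fc3(assoc)                    # associated feature sequence
        return self.activation(seq)              # posture parameter values per frame
```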
  • the posture parameters of the interactive object include facial posture control parameters, and the interactive object can be driven, according to the facial posture control parameters matched with the phoneme posterior probabilities of the respective speech frames, to realize facial postures matched with the respective speech frames in the sound driving data.
  • the facial posture parameters may include facial muscle control coefficients, for example.
  • the movement of the human face, from an anatomical point of view, is the result of the coordinated deformation of the muscles in various parts of the face. Therefore, a facial muscle model is obtained by dividing the facial muscles of the interactive object, and the movement of each divided muscle (region) is controlled by a corresponding facial muscle control coefficient, that is, contraction/expansion control is performed, so that the face of the interactive object makes various expressions.
  • the motion states corresponding to different muscle control coefficients can be set according to the facial position of the muscle and the motion characteristics of the muscle itself. For example, for the upper lip muscle, the value range of its control coefficient is (0~1), and different values in this range correspond to different contraction/expansion states of the upper lip muscle; by changing this value, the longitudinal opening and closing of the mouth can be realized.
  • for the left mouth corner muscle, the value range of its control coefficient is (0~1), and different values in this range correspond to different contraction/expansion states of the left mouth corner muscle; by changing this value, lateral changes of the mouth can be realized.
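  • As an illustration only, such coefficients could be applied as in the sketch below; the muscle names and the set_muscle_coefficients rendering interface are hypothetical placeholders, not an API defined by the disclosure:

```python
# Illustration: drive mouth-related muscle regions with control coefficients in (0, 1).
def apply_mouth_pose(avatar, upper_lip=0.0, left_mouth_corner=0.0):
    """Each coefficient selects a contraction/expansion state of its muscle region."""
    clamp = lambda v: max(0.0, min(1.0, v))
    coefficients = {
        "upper_lip_muscle": clamp(upper_lip),                  # longitudinal mouth opening
        "left_mouth_corner_muscle": clamp(left_mouth_corner),  # lateral mouth changes
    }
    avatar.set_muscle_coefficients(coefficients)               # hypothetical rendering call
```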
  • while outputting sound according to the sound driving data, the interactive object is driven to make facial postures according to the facial posture control parameters corresponding to the sound driving data, so that the interactive object simultaneously presents the mouth shapes and expressions of producing that sound, making the target object feel that the interactive object is speaking and improving the target object's interactive experience.
  • Fig. 4 is a flowchart of a phoneme processing method proposed by at least one embodiment of the present disclosure. As shown in FIG. 4 , the method includes step 401 - step 402 .
  • In step 401, a phoneme table containing multiple languages is obtained according to phonemes in multiple target languages.
  • the multilingual phoneme table can be obtained in the following way: phonemes in multiple target languages are spliced, and the phonemes whose pronunciation similarity exceeds the first set threshold in the splicing result are merged, so that phoneme tables covering multiple target languages can be obtained conveniently and quickly.
  • alternatively, the multilingual phoneme table can be obtained in the following manner: first, the phonemes in the multiple target languages are mapped to the International Phonetic Alphabet symbols whose pronunciation similarity satisfies the similarity condition, the similarity condition being, for example, identical pronunciation or the highest similarity; next, the IPA symbols with the same pronunciation in the mapping result are merged to obtain the multilingual phoneme table. This method is applicable to a variety of target languages and is universally applicable.
  • in response to the existence of first phonemes in the plurality of target languages whose pronunciation similarity with each International Phonetic Alphabet symbol is less than or equal to the second set threshold, the first phonemes are added to the multilingual phoneme table. That is to say, if there is no International Phonetic Alphabet symbol highly similar to the pronunciation of a first phoneme, the first phoneme is directly added to the multilingual phoneme table.
  • the first set threshold and the second set threshold can be set according to actual needs, which is not limited in the present disclosure.
  • In step 402, a sound feature extraction network is trained based on the multilingual phoneme table, wherein the sound feature extraction network is used to extract the phoneme posterior probability of a speech frame.
  • training the sound feature extraction network with the multilingual phoneme table can improve the efficiency and quality of the training, and the network is used to extract the phoneme posterior probabilities of the sound driving data as the sound feature for driving the interactive object. Since the phoneme posterior probability is a speaker-independent sound feature that can support multiple languages, the posture of the interactive object is consistent with the real pronunciation in different languages.
  • the multilingual corpus can be constructed according to the following method.
  • multilingual speech samples are acquired, and the language types of the speech samples are the same as the language types contained in the multilingual phoneme table.
  • a phoneme alignment operation is performed on the speech samples to obtain phonemes included in the speech samples.
  • regardless of the language of a speech sample, the phonemes in the multilingual phoneme table can be directly called to mark the phonemes in the speech sample, so that a high-quality, completely annotated, and data-balanced corpus can be constructed conveniently and efficiently.
  • the sound feature extraction network can be trained through the following specific steps.
  • the sound features of the marked speech samples are input to the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples.
  • each speech frame in the marked speech sample is marked with a real value of a phoneme.
  • Fig. 5 is a schematic structural diagram of a device for driving an interactive object according to at least one embodiment of the present disclosure.
  • the device may include: a first acquiring unit 501, configured to acquire the sound characteristics of the sound driving data of the interactive object;
  • the second acquisition unit 502 is used to extract the features of the sound features using the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data; wherein, the sound feature extraction network is obtained by including multiple The phoneme table training of the language is obtained;
  • the third acquisition unit 503 is used to obtain the posture parameter value of the interactive object according to the phoneme posterior probability of each speech frame;
  • the control unit 504 is used to control the posture of the interactive object according to the posture parameter values.
  • the first acquiring unit is specifically configured to: acquire the voice frame sequence corresponding to the voice driving data of the interactive object; obtain the voice according to the voice feature vector of each voice frame in the voice frame sequence The sonic characteristics of the driving data.
  • the sound feature extraction network includes a first fully connected network, an encoding sub-network, and a second fully connected network
  • the second acquisition unit is specifically configured to: input the sound features into the first fully connected network to obtain the first sound feature sequence output by the first fully connected network; use the encoding sub-network to perform feature encoding processing on the first sound feature sequence; and input the encoding result into the second fully connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  • the third acquisition unit is specifically configured to: input the phoneme posterior probability of each speech frame into a time series network, and output associated feature information; input the associated feature information into a third fully connected network , to obtain an associated feature sequence; performing activation processing on the associated feature sequence to obtain the gesture parameter value of the interactive object matched with the phoneme posterior probability of each speech frame.
  • the control parameters of the interactive object include facial posture control parameters.
  • the control unit is specifically configured to: drive the interactive object, according to the facial posture parameter values matched with the phoneme posterior probabilities of the respective speech frames, to achieve facial postures matching the respective speech frames in the sound driving data.
  • Fig. 6 is a schematic structural diagram of a phoneme processing device proposed by at least one embodiment of the present disclosure.
  • the device may include: a phoneme table acquisition unit 601, configured to obtain a multilingual phoneme table according to phonemes in multiple target languages; and a training unit 602, configured to train a sound feature extraction network based on the multilingual phoneme table, the sound feature extraction network being used to extract the phoneme posterior probabilities of speech frames.
  • the phoneme table acquisition unit is specifically configured to: acquire and splice phonemes in multiple target languages; merge the phonemes whose pronunciation similarity exceeds the first set threshold in the splicing result to obtain the multilingual phoneme table; and train the sound feature extraction network based on the multilingual phoneme table.
  • the phoneme table acquisition unit is specifically configured to: map the phonemes in multiple target languages to the International Phonetic Alphabet symbols whose pronunciation similarity satisfies the preset similarity condition; and merge the International Phonetic Alphabet symbols with the same pronunciation in the mapping result to obtain the multilingual phoneme table.
  • in response to the existence of first phonemes in the plurality of target languages whose pronunciation similarity with each International Phonetic Alphabet symbol is less than or equal to the second set threshold, the first phonemes are added to the multilingual phoneme table.
  • the device further includes a labeling unit, configured to: obtain a multilingual speech sample, wherein the language type of the speech sample is the same as the language type included in the multilingual phoneme table; performing a phoneme alignment operation on the speech samples to obtain the phonemes contained in the speech samples; using the phonemes in the multilingual phoneme table to mark the real values of the phonemes in the speech samples.
  • the training unit is specifically configured to: input the sound features of the marked speech samples into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples; for the For each speech frame in the speech sample, adjust the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
  • At least one embodiment of the present disclosure also provides an electronic device, as shown in FIG. 7 , the device includes a memory and a processor, the memory is used to store computer instructions that can be run on the processor, and the processor is used to execute the described The computer instructions implement the driving method of the interactive object described in any embodiment of the present disclosure.
  • At least one embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object described in any embodiment of the present disclosure is implemented.
  • At least one embodiment of the present disclosure further provides a computer program product, including a computer program, when the program is executed by a processor, the method for driving an interactive object described in any embodiment of the present disclosure is implemented.
  • one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
  • each embodiment in this specification is described in a progressive manner, the same and similar parts of each embodiment can be referred to each other, and each embodiment focuses on the differences from other embodiments.
  • as for the device embodiments, the description is relatively simple, and for relevant parts, reference may be made to the corresponding description of the method embodiments.
  • Embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them.
  • Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, as one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus.
  • Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by data processing apparatus.
  • a computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
  • Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing unit.
  • a central processing unit will receive instructions and data from a read only memory and/or a random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data therefrom, transfer data thereto, or both.
  • a computer is not required to have such a device.
  • a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD ROM and DVD-ROM disks.
  • the processor and memory can be supplemented by, or incorporated in, special purpose logic circuitry.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Processing Or Creating Images (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed are interactive object driving and phoneme processing methods, an apparatus, a device, and a storage medium. The interactive object driving method comprises: acquiring a sound feature of sound driving data of an interactive object; performing feature extraction on the sound feature using a sound feature extraction network to obtain a phoneme posterior probability of each voice frame in the sound driving data, the sound feature extraction network being obtained by means of training of a phoneme table containing multiple languages; according to the phoneme posterior probability of each voice frame, obtaining a posture parameter value of the interactive object; and controlling the posture of the interactive object according to the posture parameter value.

Description

交互对象驱动和音素处理方法、装置、设备以及存储介质Interactive object driving and phoneme processing method, device, device and storage medium
相关申请的交叉引用Cross References to Related Applications
本公开要求于2021年05月31日提交的、申请号为202110604874.8的中国专利申请的优先权,该申请以引用的方式并入本文中。This disclosure claims the priority of the Chinese patent application with application number 202110604874.8 filed on May 31, 2021, which is incorporated herein by reference.
技术领域technical field
本公开涉及计算机技术领域,具体涉及一种交互对象驱动和音素处理方法、装置、设备以及存储介质。The present disclosure relates to the field of computer technology, and in particular to an interactive object driving and phoneme processing method, device, device and storage medium.
Background
Deep learning methods are used to match the voice produced by a digital human with the mouth shapes, expressions, and movements it presents. As digital humans are widely applied in many fields, they are required to support multiple languages in many scenarios.
At present, a digital human is usually driven by sound features extracted by a speech recognition model or by sound features obtained from phoneme timestamps. However, these features differ across languages, and deep learning requires datasets in the different languages of the digital human's application scenarios, while current open-source datasets suffer from low quality, incomplete annotation, and unbalanced data.
How to enable a digital human to support multiple languages is an issue that currently requires active research.
Summary
Embodiments of the present disclosure provide an interactive object driving and phoneme processing solution.
According to one aspect of the present disclosure, a method for driving an interactive object is provided. The method includes: acquiring a sound feature of sound driving data of the interactive object; performing feature extraction on the sound feature using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, where the sound feature extraction network is trained using a phoneme table containing multiple languages; obtaining a posture parameter value of the interactive object according to the phoneme posterior probability of each speech frame; and controlling the posture of the interactive object according to the posture parameter value.
结合本公开提供的任一实施方式,所述获取交互对象的声音驱动数据的声音特征,包括:获取所述交互对象的声音驱动数据对应的语音帧序列;根据所述语音帧序列中各个语音帧的声音特征向量,得到所述声音驱动数据的声音特征。In combination with any of the implementations provided in the present disclosure, the acquiring the sound features of the sound driving data of the interactive object includes: acquiring a sequence of speech frames corresponding to the sound driving data of the interactive object; according to each speech frame in the sequence of speech frames The sound feature vector of the sound feature of the sound driving data is obtained.
结合本公开提供的任一实施方式,所述声音特征提取网络包括第一全连接网络、编码子网络、第二全连接网络,所述利用声音特征提取网络对所述声音特征进行特征提取, 得到所述声音驱动数据中各个语音帧的音素后验概率,包括:将所述声音特征输入至所述第一全连接网络,得到所述第一全连接网络输出的第一声音特征序列;利用所述编码子网络,对所述第一声音特征序列进行特征编码处理;将编码结果输入至所述第二全连接网络,得到所述声音驱动数据中各个语音帧的音素后验概率。In combination with any embodiment provided by the present disclosure, the sound feature extraction network includes a first fully connected network, an encoding sub-network, and a second fully connected network, and the sound feature extraction network is used to perform feature extraction on the sound feature, to obtain The phoneme posterior probability of each speech frame in the sound driving data includes: inputting the sound feature into the first fully connected network to obtain the first sound feature sequence output by the first fully connected network; using the The encoding sub-network performs feature encoding processing on the first sound feature sequence; the encoding result is input to the second fully connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
结合本公开提供的任一实施方式,所述根据所述各个音素的音素后验概率,得到所述交互对象的姿态参数值,包括:将所述各个语音帧的音素后验概率输入至时序网络,输出关联特征信息;将所述关联特征信息输入至第三全连接网络,得到关联特征序列;对所述关联特征序列进行激活处理,得到所述各个语音帧的音素后验概率匹配的所述交互对象的姿态参数值。In combination with any implementation manner provided by the present disclosure, the obtaining the pose parameter value of the interactive object according to the phoneme posterior probability of each phoneme includes: inputting the phoneme posterior probability of each speech frame into a time series network , output associated feature information; input the associated feature information into the third fully connected network to obtain an associated feature sequence; perform activation processing on the associated feature sequence to obtain the phoneme posterior probability matching of each speech frame The pose parameter value of the interactive object.
结合本公开提供的任一实施方式,所述交互对象的控制参数包括面部姿态控制参数,所述根据所述姿态参数值控制所述交互对象的姿态,包括:根据与所述各个语音帧的音素后验概率匹配的面部姿态参数值,驱动所述交互对象实现与所述声音驱动数据中的各个语音帧匹配的面部姿态。In combination with any implementation manner provided by the present disclosure, the control parameters of the interactive object include facial posture control parameters, and the controlling the posture of the interactive object according to the posture parameter value includes: according to the phonemes associated with the respective speech frames The face pose parameter value matched by the posterior probability drives the interactive object to realize the face pose matched with each speech frame in the sound driving data.
根据本公开的一方面,提出一种音素处理方法,所述方法包括:根据多个目标语种中的音素,得到包含多语种的音素表;基于所述包含多语种的音素表,训练得到声音特征提取网络,其中,所述声音特征提取网络用于提取语音帧的音素后验概率。According to one aspect of the present disclosure, a phoneme processing method is proposed, the method comprising: obtaining a multilingual phoneme table based on phonemes in multiple target languages; and training to obtain sound features based on the multilingual phoneme table An extraction network, wherein the sound feature extraction network is used to extract the phoneme posterior probability of the speech frame.
在本公开实施例中,利用包含多语种的音素表结合本公开提供的任一实施方式,所述根据多个目标语种中的音素,得到包含多语种的音素表包括:获取多个目标语种中的音素进行拼接;将拼接结果中发音相似度超过第一设定阈值的音素进行合并,得到所述包含多语种的音素表。In the embodiment of the present disclosure, using the multilingual phoneme table in combination with any implementation method provided by the present disclosure, the obtaining of the multilingual phoneme table according to the phonemes in the multiple target languages includes: obtaining the phonemes in the multiple target languages Splicing the phonemes; merging the phonemes whose pronunciation similarity exceeds the first set threshold in the splicing result to obtain the phoneme table containing multiple languages.
结合本公开提供的任一实施方式,所述方法还包括:将多个目标语种中的音素分别映射为发音相似度满足预设相似度条件的国际音标;将映射结果中具有相同发音的国际音标进行合并,得到所述包含多语种的音素表。In combination with any embodiment provided by the present disclosure, the method further includes: mapping the phonemes in multiple target languages to the International Phonetic Alphabet whose pronunciation similarity satisfies the preset similarity condition; Merging is performed to obtain the multilingual phoneme table.
结合本公开提供的任一实施方式,响应于所述多个目标语种中存在与各个国际音标的发音相似度小于或等于第二设定阈值的第一音素,将所述第一音素添加至所述包含多语种的音素表中。In combination with any of the implementations provided by the present disclosure, in response to the existence of a first phoneme in the multiple target languages whose pronunciation similarity with each International Phonetic Alphabet is less than or equal to a second set threshold, the first phoneme is added to the Described in the phoneme table that contains multiple languages.
结合本公开提供的任一实施方式,所述方法还包括:获取多语种的语音样本,其中,所述语音样本的语种类型与所述包含多语种的音素表所包含的语种类型相同;对所述语音样本进行音素对齐操作,得到所述语音样本所包含的音素;利用所述多语种的音素表 中的音素来标注所述语音样本中的音素的真实值。In combination with any implementation manner provided by the present disclosure, the method further includes: acquiring a multilingual speech sample, wherein the language type of the speech sample is the same as the language type included in the multilingual phoneme table; performing a phoneme alignment operation on the speech samples to obtain the phonemes included in the speech samples; using the phonemes in the multilingual phoneme table to mark the real values of the phonemes in the speech samples.
结合本公开提供的任一实施方式,所述方法还包括:将标注后的语音样本的声音特征输入至所述声音特征提取网络,得到所述语音样本中各个语音帧的音素后验概率;针对所述语音样本中各个语音帧,根据该语音帧的最大音素后验概率指示的音素与所标注的真实值之间的差异,调整所述声音特征提取网络的参数值。In combination with any embodiment provided by the present disclosure, the method further includes: inputting the sound features of the marked speech samples into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples; for For each speech frame in the speech sample, adjust the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
根据本公开的一方面,提供一种交互对象的驱动装置,所述装置包括:第一获取单元,用于获取交互对象的声音驱动数据的声音特征;第二获取单元,用于利用声音特征提取网络对所述声音特征进行特征提取,得到所述声音驱动数据中各个语音帧的音素后验概率;其中,所述声音特征提取网络是通过包含多语种的音素表训练得到的;第三获取单元,用于根据所述各个语音帧的音素后验概率,得到所述交互对象的姿态参数值;控制单元,用于根据所述姿态参数值控制所述交互对象的姿态。According to an aspect of the present disclosure, there is provided a driving device for an interactive object, the device comprising: a first acquisition unit, configured to acquire the sound features of the sound driving data of the interactive object; a second acquisition unit, used to extract sound features The network performs feature extraction on the sound features to obtain the phoneme posterior probability of each speech frame in the sound driving data; wherein, the sound feature extraction network is obtained by training a phoneme table that includes multiple languages; the third acquisition unit , for obtaining the pose parameter value of the interactive object according to the phoneme posterior probability of each speech frame; a control unit, for controlling the pose of the interactive object according to the pose parameter value.
结合本公开提供的任一实施方式,所述第一获取单元具体用于:获取所述交互对象的声音驱动数据对应的语音帧序列;根据所述语音帧序列中各个语音帧的声音特征向量,得到所述声音驱动数据的声音特征。In combination with any implementation manner provided by the present disclosure, the first acquiring unit is specifically configured to: acquire a voice frame sequence corresponding to the voice driving data of the interactive object; according to the voice feature vector of each voice frame in the voice frame sequence, Sound features of the sound driving data are obtained.
结合本公开提供的任一实施方式,所述声音特征提取网络包括第一全连接网络、编码子网络和第二全连接网络,所述第二获取单元具体用于:将所述声音特征输入至所述第一全连接网络,得到所述第一全连接网络输出的第一声音特征序列;利用所述编码子网络,对所述第一声音特征序列进行特征编码处理;将编码结果输入至所述第二全连接网络,得到所述声音驱动数据中各个语音帧的音素后验概率。In combination with any embodiment provided in the present disclosure, the sound feature extraction network includes a first fully connected network, an encoding sub-network and a second fully connected network, and the second acquisition unit is specifically configured to: input the sound feature into The first fully connected network obtains the first sound feature sequence output by the first fully connected network; uses the coding sub-network to perform feature coding processing on the first sound feature sequence; and inputs the coding result to the The second fully connected network is used to obtain the phoneme posterior probability of each speech frame in the sound driving data.
结合本公开提供的任一实施方式,所述第三获取单元具体用于:将所述各个语音帧的音素后验概率输入至时序网络,输出关联特征信息;将所述关联特征信息输入至第三全连接网络,得到关联特征序列;对所述关联特征序列进行激活处理,得到所述各个语音帧的音素后验概率匹配的所述交互对象的姿态参数值。In combination with any implementation manner provided by the present disclosure, the third acquisition unit is specifically configured to: input the phoneme posterior probability of each speech frame into a time series network, and output associated feature information; input the associated feature information into the first Three fully connected networks to obtain an associated feature sequence; performing activation processing on the associated feature sequence to obtain the attitude parameter value of the interactive object matched by the phoneme posterior probability of each speech frame.
结合本公开提供的任一实施方式,所述交互对象的控制参数包括面部姿态控制参数,所述控制单元具体用于:根据与所述各个语音帧的音素后验概率匹配的面部姿态参数值,驱动所述交互对象实现与所述声音驱动数据中的各个语音帧匹配的面部姿态。In combination with any implementation manner provided by the present disclosure, the control parameters of the interactive object include facial gesture control parameters, and the control unit is specifically configured to: according to the facial gesture parameter value matched with the phoneme posterior probability of each speech frame, Driving the interactive object to achieve a facial gesture matching each speech frame in the sound driving data.
根据本公开的一方面,提供一种音素处理装置,所述装置包括:音素表获取单元,用于根据多个目标语种中的音素,得到包含多语种的音素表;训练单元,用于基于所述包含多语种的音素表,训练得到声音特征提取网络,其中,所述声音特征提取网络用于 提取语音帧的音素后验概率。According to an aspect of the present disclosure, there is provided a phoneme processing device, the device comprising: a phoneme table acquisition unit, configured to obtain a multilingual phoneme table based on phonemes in multiple target languages; a training unit, configured to obtain a phoneme table based on the obtained The phoneme table including multiple languages is used to train the sound feature extraction network, wherein the sound feature extraction network is used to extract the phoneme posterior probability of the speech frame.
结合本公开提供的任一实施方式,所述音素表获取单元具体用于:获取多个目标语种中的音素进行拼接;将拼接结果中发音相似度超过第一设定阈值的音素进行合并,得到所述包含多语种的音素表;基于所述包含多语种的音素表,训练得到声音特征提取网络。In combination with any of the implementations provided in the present disclosure, the phoneme table acquisition unit is specifically configured to: acquire phonemes in multiple target languages for splicing; merge phonemes whose pronunciation similarity exceeds the first set threshold in the splicing result to obtain The multilingual phoneme table; based on the multilingual phoneme table, a sound feature extraction network is obtained through training.
结合本公开提供的任一实施方式,所述音素表获取单元具体用于:将多个目标语种中的音素分别映射为发音相似度满足预设相似度条件的国际音标;将映射结果中具有相同发音的国际音标进行合并,得到所述包含多语种的音素表。In combination with any implementation method provided by the present disclosure, the phoneme table acquisition unit is specifically configured to: map phonemes in multiple target languages to the International Phonetic Alphabet whose pronunciation similarity satisfies a preset similarity condition; The international phonetic symbols of pronunciation are merged to obtain the phoneme table containing multiple languages.
结合本公开提供的任一实施方式,响应于所述多个目标语种中存在与各个国际音标的发音相似度小于或等于所述第二设定阈值的第一音素,将所述第一音素添加至所述包含多语种的音素表中。In combination with any implementation manner provided by the present disclosure, in response to the existence of first phonemes in the plurality of target languages whose pronunciation similarity with each International Phonetic Alphabet is less than or equal to the second set threshold, add the first phoneme to the phoneme table containing multiple languages.
结合本公开提供的任一实施方式,所述装置还包括标注单元,用于:获取多语种的语音样本,其中,所述语音样本的语种类型与所述包含多语种的音素表所包含的语种类型相同;对所述语音样本进行音素对齐操作,得到所述语音样本所包含的音素;利用所述包含多语种的音素表中的音素来标注所述语音样本中的音素进行标注的真实值。In combination with any of the implementations provided by the present disclosure, the device further includes a labeling unit, configured to: acquire multilingual speech samples, wherein the language type of the speech samples is the same as the language contained in the multilingual phoneme table The types are the same; the phoneme alignment operation is performed on the speech samples to obtain the phonemes contained in the speech samples; the phonemes in the speech samples are marked by using the phonemes in the multilingual phoneme table to mark the real value.
结合本公开提供的任一实施方式,所述训练单元具体用于:将标注后的语音样本的声音特征输入至所述声音特征提取网络,得到所述语音样本中各个语音帧的音素后验概率;针对所述语音样本中各个语音帧,根据该语音帧的最大音素后验概率指示的音素与所标注的真实值之间的差异,调整所述声音特征提取网络的参数值。In combination with any embodiment provided in the present disclosure, the training unit is specifically configured to: input the sound features of the marked speech samples into the sound feature extraction network, and obtain the phoneme posterior probability of each speech frame in the speech samples ; For each speech frame in the speech sample, adjust the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
根据本公开的一方面,提供一种电子设备,所述设备包括存储器和处理器,所述存储器用于存储可在处理器上运行的计算机指令,所述处理器用于在执行所述计算机指令时实现本公开提供的任一实施方式所述的交互对象的驱动方法。According to an aspect of the present disclosure, there is provided an electronic device, the device includes a memory and a processor, the memory is used for storing computer instructions executable on the processor, and the processor is used for executing the computer instructions Implement the driving method of the interactive object described in any implementation manner provided by the present disclosure.
根据本公开的一方面,提供一种计算机可读存储介质,其上存储有计算机程序,所述程序被处理器执行时实现本公开提供的任一实施方式所述的交互对象的驱动方法。According to one aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object described in any implementation manner provided by the present disclosure is implemented.
根据本公开的一方面,提供一种计算机程序产品,包括计算机程序,所述程序被处理器执行时实现本公开提供的任一实施方式所述的交互对象的驱动方法。According to an aspect of the present disclosure, a computer program product is provided, including a computer program, and when the program is executed by a processor, the method for driving an interactive object described in any implementation manner provided by the present disclosure is implemented.
Brief Description of the Drawings
In order to describe the technical solutions in one or more embodiments of this specification or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in one or more embodiments of this specification, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
Fig. 1 is a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure;
Fig. 2 is a schematic diagram of a feature encoding process for a phoneme sequence according to at least one embodiment of the present disclosure;
Fig. 3 is a schematic diagram of a mapping process for phoneme posterior probabilities according to at least one embodiment of the present disclosure;
Fig. 4 is a flowchart of a phoneme processing method according to at least one embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of a phoneme processing apparatus according to at least one embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description of Embodiments
Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of multiple items or any combination of at least two of them; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
At least one embodiment of the present disclosure provides a method for driving an interactive object. The driving method may be executed by an electronic device such as a terminal device or a server. The terminal device may be a fixed or mobile terminal, such as a mobile phone, a tablet computer, a game console, a desktop computer, an advertising machine, an all-in-one machine, or a vehicle-mounted terminal, and the server includes a local server, a cloud server, or the like. The method may also be implemented by a processor invoking computer-readable instructions stored in a memory.
In the embodiments of the present disclosure, the interactive object can interact with a target object. The interactive object may be a virtual character, or a virtual animal, a virtual item, a cartoon figure, or any other virtual image capable of implementing interactive functions. The virtual image may be presented in 2D or 3D form, which is not limited by the present disclosure. The target object may be a user, a robot, or another smart device.
The interactive object may be displayed through a terminal device, and the terminal device may be a television, an all-in-one machine with a display function, a projector, a virtual reality (VR) device, an augmented reality (AR) device, or the like; the present disclosure does not limit the specific form of the terminal device.
In some embodiments, in response to the terminal device receiving sound driving data for driving the interactive object to output speech, the interactive object may utter a specified speech to the target object. Sound driving data may be generated according to the actions, expressions, identity, preferences, and the like of the target object around the terminal device, so as to drive the interactive object to respond by uttering the specified speech, thereby providing an anthropomorphic service to the target object. In some scenarios, the interactive object may interact with the target object in different languages. In order to make the posture of the interactive object fit the real pronunciation in different languages, at least one embodiment of the present disclosure proposes a method for driving an interactive object.
Fig. 1 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. As shown in Fig. 1, the method includes steps 101 to 104.
In step 101, a sound feature of the sound driving data of the interactive object is acquired.
The sound driving data may include audio data (speech data), text, and the like. In response to the sound driving data being audio data, the audio data may be used directly to drive the interactive object to output speech, that is, the terminal device outputs speech directly from the audio data. In response to the sound driving data being text, corresponding phonemes may be generated according to the speech contained in the text, and the interactive object is driven to output speech through the generated phonemes. Taking Chinese text as an example, the text may first be converted into pinyin, and corresponding phonemes are then generated from the pinyin. The sound driving data may also be driving data in other forms, which is not limited by the present disclosure.
In the embodiments of the present disclosure, the sound driving data may be driving data generated according to the actions, expressions, identity, preferences, and the like of the target object interacting with the interactive object, or may be sound driving data called by the terminal device from its internal memory. The present disclosure does not limit the manner in which the sound driving data is acquired.
In response to the sound driving data being audio data, the audio data may be split into a plurality of speech frames, and the speech frames are combined according to their states to form phonemes; the phonemes formed from the audio data then form a phoneme sequence. A phoneme is the smallest speech unit divided according to the natural attributes of speech; one pronunciation action of a real person can form one phoneme.
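By way of illustration only, the following is a minimal Python sketch of splitting a waveform into fixed-length, overlapping speech frames. The 25 ms frame length and 10 ms hop are assumptions made for the example and are not fixed by the present disclosure.

```python
import numpy as np

def split_into_frames(audio: np.ndarray, sample_rate: int,
                      frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a mono waveform into overlapping speech frames.

    The 25 ms frame length and 10 ms hop are common choices and are
    assumptions here; the disclosure does not fix specific values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(audio) < frame_len:
        audio = np.pad(audio, (0, frame_len - len(audio)))
    n_frames = 1 + (len(audio) - frame_len) // hop_len
    frames = np.stack([audio[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

# Example: 1 second of audio at 16 kHz -> 98 frames of 400 samples each
frames = split_into_frames(np.zeros(16000), 16000)
print(frames.shape)
```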
In response to the sound driving data being text, the phonemes contained in the morphemes of the text may be obtained according to those morphemes, so as to obtain the corresponding phoneme sequence. Those skilled in the art should understand that the phoneme sequence corresponding to the sound driving data may also be obtained in other ways, which is not limited by the present disclosure.
In the embodiments of the present disclosure, the sound feature may be a feature related to speech emotion, such as a fundamental frequency feature, a formant feature, Mel-frequency cepstral coefficients (MFCC), and so on.
In step 102, feature extraction is performed on the sound feature by using a sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data. The sound feature extraction network is trained using a phoneme table containing multiple languages.
The phoneme posterior probabilities (phonetic posteriorgrams, PPG) represent the posterior probabilities that a speech frame corresponds to each phoneme in the phoneme table. The phoneme posterior probability is independent of the speaker and depends only on the content of the speech. In one embodiment, assuming that the phoneme table contains three phonemes, for example phoneme 1, phoneme 2, and phoneme 3, the probability that a speech frame corresponds to phoneme 1, the probability that it corresponds to phoneme 2, and the probability that it corresponds to phoneme 3 can be obtained through the sound feature extraction network. That is, the phoneme posterior probability of the speech frame includes the probabilities that the speech frame corresponds to phoneme 1, phoneme 2, and phoneme 3.
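As an illustration of the three-phoneme example above, the following sketch converts per-frame phoneme scores into phoneme posterior probabilities with a softmax; the phoneme names and scores are made up for the example.

```python
import numpy as np

def phoneme_posteriors(logits: np.ndarray) -> np.ndarray:
    """Convert per-frame phoneme scores into phoneme posterior probabilities (PPG).

    `logits` has shape (n_frames, n_phonemes); each row of the result sums to 1
    and gives the posterior probability of every phoneme in the table for that frame.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy phoneme table with three entries, as in the example above
phoneme_table = ["phoneme_1", "phoneme_2", "phoneme_3"]
logits = np.array([[2.0, 0.5, 0.1],   # frame 1
                   [0.2, 1.7, 0.3]])  # frame 2
ppg = phoneme_posteriors(logits)
for frame_ppg in ppg:
    print({p: round(float(prob), 3) for p, prob in zip(phoneme_table, frame_ppg)})
```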
In the embodiments of the present disclosure, the sound feature extraction network used to extract the phoneme posterior probability of each speech frame in the sound driving data is trained using a phoneme table containing multiple languages.
In some embodiments, the phoneme table containing multiple languages may be obtained as follows: acquiring the phonemes in multiple target languages and concatenating them; and merging the phonemes in the concatenation result whose pronunciation similarity exceeds a first set threshold. In this way, a phoneme table containing multiple target languages can be obtained conveniently and quickly.
For example, phonemes in Chinese (pinyin) may be concatenated with phonemes in English, and phonemes in the concatenation result with the same or similar pronunciation, such as "b", "p", "m", and "f", are merged, so that a phoneme table containing both Chinese and English can be obtained.
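A minimal sketch of this concatenate-and-merge construction is given below. The grouping of similar phonemes is supplied by hand here, standing in for a pronunciation-similarity comparison against the first set threshold.

```python
def build_merged_phoneme_table(language_phonemes: dict, similar_groups: list) -> list:
    """Concatenate the phoneme sets of several target languages and merge
    phonemes whose pronunciations are the same or judged similar enough.

    `similar_groups` is a hand-written stand-in for a pronunciation-similarity
    judgment that, in practice, would compare similarities against a threshold.
    """
    # Map every phoneme to a canonical representative of its similarity group.
    canonical = {}
    for group in similar_groups:
        representative = sorted(group)[0]
        for phoneme in group:
            canonical[phoneme] = representative

    merged, seen = [], set()
    for phonemes in language_phonemes.values():    # concatenation step
        for phoneme in phonemes:
            key = canonical.get(phoneme, phoneme)  # merging step
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged

# Toy example: identical symbols are merged automatically; "ii" is merged with "i"
table = build_merged_phoneme_table(
    {"zh": ["b", "p", "m", "f", "ii"], "en": ["b", "p", "m", "f", "i"]},
    similar_groups=[{"ii", "i"}],
)
print(table)  # ['b', 'p', 'm', 'f', 'i']
```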
In some embodiments, the phoneme table containing multiple languages may be obtained as follows. First, the phonemes in the multiple target languages are each mapped to International Phonetic Alphabet (IPA) symbols whose pronunciation similarity satisfies a similarity condition, for example, having the same pronunciation or the highest similarity. Next, the IPA symbols with the same pronunciation in the mapping result are merged to obtain the phoneme table containing multiple languages. This approach is applicable to a variety of target languages and is therefore universal.
For example, all Chinese phonemes may be mapped to the IPA symbols with the highest pronunciation similarity, all English phonemes may likewise be mapped to the IPA symbols with the highest pronunciation similarity, the IPA symbols to which Chinese and English are mapped are stored in one phoneme table, and symbols with the same pronunciation are merged, so that a phoneme table supporting both Chinese and English is obtained.
For example, suppose the Chinese phonemes include a1, a2, a3, b, i1, i2, i3, ii1, ii2, ii3 (where 1, 2, and 3 denote tones), the English phonemes include a, b, and i, and the IPA table contains a, b, and i. According to pronunciation, the Chinese and English phonemes are each mapped to the most similar IPA symbol: the Chinese phonemes map in order to a, a, a, b, i, i, i, i, i, i (since IPA has no "ii" pronunciation and the actual pronunciation of "ii" is most similar to "i", "ii" is mapped to "i"). Similarly, the English phonemes map to a, b, i.
In some embodiments, in response to the multiple target languages containing a first phoneme whose pronunciation similarity to every IPA symbol is less than or equal to a second set threshold, the first phoneme is added to the phoneme table containing multiple languages. For example, the Chinese phoneme "ng" does not exist in the IPA table, and its similarity to all other pronunciations is less than the second set threshold; likewise, a Chinese phoneme may be composed of several other pronunciations, and its similarity to the IPA symbols is also less than the second set threshold. Such a phoneme is called a first phoneme; it is retained and appended after the IPA table, so that the final table contains this first phoneme in addition to all of the IPA symbols.
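The following sketch illustrates the IPA-based construction under the assumptions of the a1/a2/a3 example above: each language phoneme is mapped to its most similar IPA symbol, symbols with the same pronunciation collapse onto the single IPA entries, and a first phoneme such as "ng" that has no sufficiently similar IPA symbol is appended to the table. The hand-written mapping stands in for a similarity computation.

```python
def build_ipa_phoneme_table(language_phonemes: dict, ipa_mapping: dict,
                            ipa_symbols: list) -> list:
    """Map each language's phonemes onto the most similar IPA symbol; phonemes
    with no sufficiently similar IPA symbol (the "first phonemes", e.g. the
    Mandarin "ng") are kept and appended after the IPA symbols.

    `ipa_mapping` is a hand-written stand-in for a similarity-based mapping.
    """
    appended = []
    for phonemes in language_phonemes.values():
        for phoneme in phonemes:
            if ipa_mapping.get(phoneme) is None and phoneme not in appended:
                appended.append(phoneme)   # keep it and append it after the IPA table
    return list(ipa_symbols) + appended

# Toy example following the a1/a2/a3, b, i1..i3, ii1..ii3 illustration above
mapping = {"a1": "a", "a2": "a", "a3": "a", "b": "b",
           "i1": "i", "i2": "i", "i3": "i",
           "ii1": "i", "ii2": "i", "ii3": "i",
           "a": "a", "i": "i"}            # the English a, b, i map to themselves
table = build_ipa_phoneme_table(
    {"zh": ["a1", "a2", "a3", "b", "i1", "ii1", "ng"], "en": ["a", "b", "i"]},
    ipa_mapping=mapping, ipa_symbols=["a", "b", "i"])
print(table)  # ['a', 'b', 'i', 'ng']
```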
Those skilled in the art should understand that the first set threshold and the second set threshold may be specifically set according to actual needs, which is not limited by the present disclosure.
In the embodiments of the present disclosure, the phoneme table containing multiple languages can be used to directly annotate multilingual speech samples, so that a high-quality, completely annotated, and data-balanced corpus can be built conveniently and efficiently for training the sound feature extraction network.
In step 103, the posture parameter value of the interactive object is obtained according to the phoneme posterior probability of each speech frame.
In the embodiments of the present disclosure, the posture parameter value of the interactive object matching the sound driving data may be obtained according to the phoneme posterior probability of each speech frame in the sound driving data.
The posture parameters are used to control the posture of the interactive object, and different posture parameter values can drive the interactive object to make corresponding postures. The posture parameters may include facial posture parameters, which are used to control the facial posture of the interactive object, including expressions, mouth shapes, movements of the facial features, head posture, and so on. In the embodiments of the present disclosure, a correspondence between phoneme posterior probabilities and posture parameter values of the interactive object may be established in advance, so that once the phoneme posterior probability of each speech frame in the sound driving data is obtained, the posture parameter values corresponding to the sound driving data can be obtained. The specific form of the posture parameters may be determined according to the type of the interactive object model.
In step 104, the posture of the interactive object is controlled according to the posture parameter value.
The posture parameter values are matched with the phoneme posterior probabilities of the speech frames in the sound driving data of the interactive object. Since the phoneme posterior probability is independent of the language, for speech data and text in different languages, the postures presented by the interactive object (such as mouth shapes, expressions, and movements) can all match the actual pronunciation, giving the target object interacting with the interactive object the feeling that the interactive object is speaking.
In the embodiments of the present disclosure, the sound feature of the sound driving data of the interactive object is first acquired; feature extraction is performed on the sound feature using a sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data; the posture parameter value of the interactive object is then obtained according to the phoneme posterior probability of each speech frame, and the posture of the interactive object is controlled according to the posture parameter value. Since the phoneme posterior probability is independent of the speaker and can support multiple languages, the embodiments of the present disclosure train the sound feature extraction network using a phoneme table containing multiple languages and use this network to extract the phoneme posterior probabilities of the sound driving data as the sound feature for driving the interactive object, so that the posture of the interactive object fits the real pronunciation in different languages.
In some embodiments, a multilingual corpus may be constructed as follows.
First, multilingual speech samples are acquired; the language types of the speech samples are the same as the language types contained in the phoneme table containing multiple languages. For example, if the phoneme table supports Chinese and English, Chinese speech samples and English speech samples are acquired respectively.
Next, a phoneme alignment operation is performed on the speech samples to obtain the phonemes contained in the speech samples.
Taking a speech segment of saying "ni hao" (hello) in Chinese as an example, after the alignment operation is performed on the speech sample, the start and end times of each phoneme in the speech segment can be obtained: n[0, 0.2], i3[0.2, 0.4], h[0.5, 0.7], ao3[0.7, 1.2], where the brackets indicate the start and end times of each phoneme in seconds. Through the start and end times of each phoneme, the phoneme corresponding to each speech frame in the speech sample can be determined.
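As an illustration of how such an alignment can be turned into per-frame phoneme labels, the following sketch assumes a fixed 10 ms hop between speech frames and marks unaligned gaps with a hypothetical "sil" placeholder.

```python
def frame_labels_from_alignment(alignment, hop_s=0.01, total_s=None):
    """Turn a phoneme alignment of (phoneme, start, end) tuples in seconds into
    one phoneme label per speech frame, assuming a fixed hop between frames.

    The 10 ms hop is an assumption; unlabelled gaps (e.g. 0.4-0.5 s below)
    are marked with a silence placeholder "sil".
    """
    if total_s is None:
        total_s = max(end for _, _, end in alignment)
    n_frames = int(round(total_s / hop_s))
    labels = ["sil"] * n_frames
    for phoneme, start, end in alignment:
        for i in range(int(start / hop_s), min(int(end / hop_s), n_frames)):
            labels[i] = phoneme
    return labels

# Alignment from the "ni hao" example above
alignment = [("n", 0.0, 0.2), ("i3", 0.2, 0.4), ("h", 0.5, 0.7), ("ao3", 0.7, 1.2)]
labels = frame_labels_from_alignment(alignment)
print(len(labels), labels[:3], labels[45:52])
```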
Finally, the phonemes in the speech samples are annotated using the phonemes in the phoneme table containing multiple languages. In one embodiment, the phonemes in the multilingual phoneme table are used to annotate the true values of the phonemes in the speech samples.
Taking a phoneme table supporting Chinese and English as an example, for both Chinese speech samples and English speech samples, the phonemes in the multilingual phoneme table can be invoked directly for annotation, so that a high-quality, completely annotated, and data-balanced corpus can be constructed conveniently and efficiently.
In some embodiments, the sound feature extraction network may be trained as follows.
First, the sound features of the annotated speech samples are input into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples, where each speech frame in the annotated speech samples is labeled with the true value of its phoneme.
Next, the parameter values of the sound feature extraction network are adjusted according to the difference between the phoneme indicated by the maximum phoneme posterior probability of each speech frame and the annotated true value. Training is completed when the change in the network loss satisfies a convergence condition, for example, when the change in the network loss is smaller than a set threshold or when the number of iterations reaches a set number, and the trained sound feature extraction network is thus obtained.
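A minimal training sketch along these lines is shown below, assuming PyTorch, a model that outputs per-frame phoneme logits, and a data loader of (sound features, frame-level phoneme labels) pairs; the cross-entropy loss, optimizer, and stopping thresholds are assumptions rather than values fixed by the present disclosure.

```python
import torch
import torch.nn as nn

def train_feature_extraction_network(model, dataloader, n_phonemes,
                                     max_epochs=50, loss_delta=1e-4, lr=1e-3):
    """Per-frame cross-entropy between predicted phoneme posteriors and the
    annotated phoneme labels; training stops when the change in loss falls
    below a threshold or the iteration budget is reached."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    previous_loss = None
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for sound_features, frame_labels in dataloader:
            logits = model(sound_features)                  # (batch, frames, n_phonemes)
            loss = criterion(logits.reshape(-1, n_phonemes), frame_labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if previous_loss is not None and abs(previous_loss - epoch_loss) < loss_delta:
            break                                           # convergence condition met
        previous_loss = epoch_loss
    return model
```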
In some embodiments, the speech frame sequence corresponding to the sound driving data of the interactive object may be acquired, and the sound feature of the sound driving data is obtained according to the sound feature vector of each speech frame in the speech frame sequence. Taking MFCC as an example, the MFCC matrix corresponding to the sound driving data can be obtained according to the MFCC coefficients of each speech frame in the speech frame sequence.
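For illustration, the following sketch computes such an MFCC matrix with the librosa library; the frame length, hop length, and number of coefficients are assumptions made for the example.

```python
import numpy as np
import librosa  # used here purely for illustration

def mfcc_matrix(audio: np.ndarray, sample_rate: int, n_mfcc: int = 13) -> np.ndarray:
    """Compute an MFCC matrix for a waveform: one MFCC vector per speech frame.

    The 25 ms frame, 10 ms hop and 13 coefficients are assumptions; any
    per-frame acoustic feature could be substituted.
    """
    mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sample_rate),
                                hop_length=int(0.010 * sample_rate))
    return mfcc.T  # shape: (n_frames, n_mfcc)

features = mfcc_matrix(np.random.randn(16000).astype(np.float32), 16000)
print(features.shape)
```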
Fig. 2 is a schematic diagram of a sound feature extraction process according to at least one embodiment of the present disclosure. As shown in Fig. 2, the present disclosure uses a sound feature extraction network 200 to perform feature extraction on the sound feature of the sound driving data, so as to obtain the phoneme posterior probability of each speech frame in the sound feature data. The sound feature extraction network 200 includes a first fully connected network 201, an encoding sub-network 202, and a second fully connected network 203.
First, the sound feature is input into the first fully connected network 201 to obtain a first sound feature sequence output by the first fully connected network. Then, feature encoding is performed on the first sound feature sequence by the encoding sub-network 202 to obtain an encoding result. The encoding sub-network may be, for example, a CBHG network, a gated recurrent unit (GRU), or another network suitable for extracting sequence features. Finally, the encoding result is input into the second fully connected network 203 to obtain the phoneme posterior probability of each speech frame in the sound driving data.
In the embodiments of the present disclosure, by converting the sound feature into a sequence, performing feature extraction through an encoding network suitable for extracting sequence features, and performing classification through a fully connected network, the phoneme posterior probability of each speech frame in the sound feature data can be predicted accurately.
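A possible sketch of this structure is shown below, with a GRU standing in for the encoding sub-network (a CBHG network could be used instead) and with all layer sizes chosen arbitrarily for the example.

```python
import torch
import torch.nn as nn

class SoundFeatureExtractionNetwork(nn.Module):
    """First fully connected network -> encoding sub-network -> second fully
    connected network. A GRU stands in for the encoding sub-network; the
    feature dimension, hidden size and phoneme count are assumptions."""
    def __init__(self, feature_dim=13, hidden_dim=256, n_phonemes=100):
        super().__init__()
        self.first_fc = nn.Sequential(nn.Linear(feature_dim, hidden_dim), nn.ReLU())
        self.encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.second_fc = nn.Linear(2 * hidden_dim, n_phonemes)

    def forward(self, sound_features):                  # (batch, frames, feature_dim)
        first_sequence = self.first_fc(sound_features)  # first sound feature sequence
        encoded, _ = self.encoder(first_sequence)       # feature encoding of the sequence
        return self.second_fc(encoded)                  # per-frame phoneme logits

model = SoundFeatureExtractionNetwork()
logits = model(torch.randn(1, 120, 13))
ppg = torch.softmax(logits, dim=-1)   # phoneme posterior probabilities per frame
print(ppg.shape)                      # torch.Size([1, 120, 100])
```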
In some embodiments, the posture parameter values corresponding to the phoneme posterior probabilities of the speech frames in the sound driving data may be predicted through a temporal network and a fully connected network, so that the correlated historical phoneme posterior probabilities and the current phoneme posterior probability are fused. In this way, the historical posture parameter values influence the change of the current posture parameter value, making the change of the interactive object's posture smoother and more natural.
Fig. 3 is a schematic diagram of a mapping process for phoneme posterior probabilities according to at least one embodiment of the present disclosure. As shown in Fig. 3, the phoneme posterior probabilities of the speech frames are first input into a temporal network 301, which outputs correlated feature information. The temporal network may be a recurrent neural network over time, such as an LSTM, which can learn the historical information of the input phoneme posterior probabilities; the output correlated feature information contains the influence of the historical information on the current information. Next, the correlated feature information is input into a third fully connected network 302 to obtain a correlated feature sequence. Finally, the correlated feature sequence is activated by an activation layer 303, and each feature value in the correlated feature sequence is transformed into a posture parameter value, so as to obtain the posture parameter values of the interactive object matching the phoneme posterior probabilities of the speech frames.
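The following sketch illustrates this mapping with an LSTM as the temporal network, a fully connected layer, and a sigmoid activation; the hidden size, the number of posture parameters, and the choice of sigmoid are assumptions made for the example.

```python
import torch
import torch.nn as nn

class PoseParameterMapper(nn.Module):
    """Phoneme posteriors -> posture parameter values: an LSTM produces the
    correlated feature information, a fully connected layer turns it into a
    correlated feature sequence, and an activation squashes it into posture
    parameter values. Dimensions and the sigmoid choice are assumptions."""
    def __init__(self, n_phonemes=100, hidden_dim=128, n_pose_params=37):
        super().__init__()
        self.temporal = nn.LSTM(n_phonemes, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_pose_params)
        self.activation = nn.Sigmoid()   # keeps e.g. muscle control coefficients in (0, 1)

    def forward(self, ppg):                     # (batch, frames, n_phonemes)
        correlated, _ = self.temporal(ppg)      # correlated feature information
        sequence = self.fc(correlated)          # correlated feature sequence
        return self.activation(sequence)        # posture parameter values per frame

pose_values = PoseParameterMapper()(torch.rand(1, 120, 100))
print(pose_values.shape)   # torch.Size([1, 120, 37])
```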
In some embodiments, the posture parameters of the interactive object include facial posture control parameters, and the interactive object may be driven, according to the facial posture control parameters matching the phoneme posterior probability of each speech frame, to realize a facial posture matching each speech frame in the sound driving data. The facial posture parameters may include, for example, facial muscle control coefficients.
From an anatomical point of view, the motion of a human face is the result of the coordinated deformation of the muscles of its various parts. Therefore, a facial muscle model is obtained by dividing the facial muscles of the interactive object, and the motion of each divided muscle (region) is controlled by a corresponding facial muscle control coefficient, that is, its contraction/expansion is controlled, so that the face of the interactive character can make various expressions. For each muscle of the facial muscle model, the motion states corresponding to different muscle control coefficients can be set according to the position of the muscle on the face and the motion characteristics of the muscle itself. For example, for the upper lip muscle, the value range of its control coefficient is (0, 1), and different values in this range correspond to different contraction/expansion states of the upper lip muscle; by changing this value, the vertical opening and closing of the mouth can be realized. For the left mouth corner muscle, the value range of its control coefficient is (0, 1), and different values in this range correspond to the contraction/expansion states of the left mouth corner muscle; by changing this value, a lateral change of the mouth can be realized.
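Purely as a toy illustration of such control coefficients, the sketch below clamps two hypothetical coefficients into the (0, 1) range and names the mouth movements they would control; the region names are assumptions, not identifiers from the present disclosure.

```python
def apply_mouth_coefficients(upper_lip: float, left_corner: float) -> dict:
    """Interpret two facial muscle control coefficients: values in (0, 1) stand
    for the contraction/expansion amount of the corresponding muscle region."""
    def clamp(value: float) -> float:
        return min(max(value, 0.0), 1.0)
    return {
        "mouth_vertical_open": clamp(upper_lip),   # vertical opening/closing of the mouth
        "mouth_lateral_left": clamp(left_corner),  # lateral change at the left mouth corner
    }

print(apply_mouth_coefficients(0.8, 1.3))  # the second value is clamped to 1.0
```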
While sound is output according to the sound driving data, the interactive object is driven to make facial expressions according to the facial posture control parameters corresponding to the sound driving data, so that the interactive object synchronously makes the mouth shapes and expressions of uttering that sound while outputting the sound. This gives the target object the feeling that the interactive object is speaking and improves the interactive experience of the target object.
Fig. 4 is a flowchart of a phoneme processing method according to at least one embodiment of the present disclosure. As shown in Fig. 4, the method includes steps 401 and 402.
In step 401, a phoneme table containing multiple languages is obtained according to the phonemes in multiple target languages.
In one example, the phoneme table containing multiple languages may be obtained as follows: concatenating the phonemes in the multiple target languages, and merging the phonemes in the concatenation result whose pronunciation similarity exceeds a first set threshold, so that a phoneme table containing multiple target languages can be obtained conveniently and quickly.
In another example, the phoneme table containing multiple languages may be obtained as follows. First, the phonemes in the multiple target languages are each mapped to IPA symbols whose pronunciation similarity satisfies a similarity condition, for example, having the same pronunciation or the highest similarity. Next, the IPA symbols with the same pronunciation in the mapping result are merged to obtain the phoneme table containing multiple languages. This approach is applicable to a variety of target languages and is therefore universal.
In some embodiments, in response to the multiple target languages containing a first phoneme whose pronunciation similarity to every IPA symbol is less than or equal to the second set threshold, the first phoneme is added to the phoneme table containing multiple languages. That is, when there is no IPA symbol with a sufficiently high similarity to the pronunciation of the first phoneme, the first phoneme is added directly to the phoneme table containing multiple languages.
Those skilled in the art should understand that the first set threshold and the second set threshold may be specifically set according to actual needs, which is not limited by the present disclosure.
In step 402, a sound feature extraction network is obtained by training based on the phoneme table containing multiple languages, where the sound feature extraction network is used to extract the phoneme posterior probability of a speech frame.
The embodiments of the present disclosure train the sound feature extraction network using a phoneme table containing multiple languages, which can improve the efficiency and quality of training the feature extraction network, and use this network to extract the phoneme posterior features of the sound driving data as the sound feature for driving the interactive object. Since the phoneme posterior probability is a speaker-independent sound feature that can support multiple languages, the posture of the interactive object fits the real pronunciation in different languages.
在一些实施例中,可以根据以下方法来构建支持多语种的语料库。In some embodiments, the multilingual corpus can be constructed according to the following method.
首先,获取多语种的语音样本,所述语音样本的语种类型与所述包含多语种的音素表所包含的语种类型相同。First, multilingual speech samples are acquired, and the language types of the speech samples are the same as the language types contained in the multilingual phoneme table.
接下来,对所述语音样本进行音素对齐操作,得到所述语音样本所包含的音素。Next, a phoneme alignment operation is performed on the speech samples to obtain phonemes included in the speech samples.
最后,利用所述包含多语种的音素表中的音素来标注所述语音样本中的音素的真实值。Finally, use the phonemes in the multilingual phoneme table to mark the real values of the phonemes in the speech samples.
在本公开实施例中,可以直接调用所述包含多语种的音素表中的音素来对语音样本中的音素进行标注,从而可以方便、高效地构建高质量、标注完整、数据均衡的语料库。In the embodiment of the present disclosure, the phonemes in the multilingual phoneme table can be directly called to mark the phonemes in the voice sample, so that a high-quality, complete-labeled, and data-balanced corpus can be constructed conveniently and efficiently.
在一些实施例中,可以通过以下具体步骤对所述声音特征提取网络进行训练。In some embodiments, the sound feature extraction network can be trained through the following specific steps.
首先,将标注后的语音样本的声音特征输入至所述声音特征提取网络,得到所述语音样本中各个语音帧的音素后验概率。其中,标注后的语音样本中每个语音帧标注有音素的真实值。Firstly, the sound features of the marked speech samples are input to the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples. Wherein, each speech frame in the marked speech sample is marked with a real value of a phoneme.
接下来,根据所述语音帧的最大音素后验概率指示的音素与所标注的真实值之间的差异,调整所述声音特征提取网络的参数值。在网络损失的变化满足收敛条件时,例如网络损失的变化量小于设定阈值时,或者迭代次数达到设定次数时完成训练,即得到了训练好的声音特征提取网络。Next, adjust the parameter values of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the labeled true value. When the change of the network loss satisfies the convergence condition, for example, when the change of the network loss is less than the set threshold, or when the number of iterations reaches the set number, the training is completed, and the trained sound feature extraction network is obtained.
图5是根据本公开至少一个实施例的交互对象的驱动装置的结构示意图,如图5所示,该装置可以包括:第一获取单元501,用于获取交互对象的声音驱动数据的声音特征;第二获取单元502,用于利用声音特征提取网络对所述声音特征进行特征提取, 得到所述声音驱动数据中各个语音帧的音素后验概率;其中,所述声音特征提取网络是通过包含多语种的音素表训练得到的;第三获取单元503,用于根据所述各个语音帧的音素后验概率,得到所述交互对象的姿态参数值;控制单元504,用于根据所述姿态参数值控制所述交互对象的姿态。Fig. 5 is a schematic structural diagram of a device for driving an interactive object according to at least one embodiment of the present disclosure. As shown in Fig. 5 , the device may include: a first acquiring unit 501, configured to acquire the sound characteristics of the sound driving data of the interactive object; The second acquisition unit 502 is used to extract the features of the sound features using the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data; wherein, the sound feature extraction network is obtained by including multiple The phoneme table training of the language is obtained; the third acquisition unit 503 is used to obtain the posture parameter value of the interactive object according to the phoneme posterior probability of each speech frame; the control unit 504 is used to obtain the posture parameter value according to the posture parameter value Controls the pose of the interactive object.
在一些实施例中,所述第一获取单元具体用于:获取所述交互对象的声音驱动数据对应的语音帧序列;根据所述语音帧序列中各个语音帧的声音特征向量,得到所述声音驱动数据的声音特征。In some embodiments, the first acquiring unit is specifically configured to: acquire the voice frame sequence corresponding to the voice driving data of the interactive object; obtain the voice according to the voice feature vector of each voice frame in the voice frame sequence The sonic characteristics of the driving data.
在一些实施例中,所述声音特征提取网络包括第一全连接网络、编码子网络、第二全连接网络,所述第二获取单元具体用于:将所述声音特征输入至所述第一全连接网络,得到所述第一全连接网络输出的第一声音特征序列;利用所述编码子网络,对所述第一声音特征序列进行特征编码处理;将编码结果输入至所述第二全连接网络,得到所述声音驱动数据中各个语音帧的音素后验概率。In some embodiments, the sound feature extraction network includes a first fully connected network, an encoding sub-network, and a second fully connected network, and the second acquisition unit is specifically configured to: input the sound feature into the first A fully connected network to obtain the first sound feature sequence output by the first fully connected network; use the encoding sub-network to perform feature encoding processing on the first sound feature sequence; input the encoding result to the second fully connected network The network is connected to obtain the phoneme posterior probability of each speech frame in the sound driving data.
在一些实施例中,所述第三获取单元具体用于:将所述各个语音帧的音素后验概率输入至时序网络,输出关联特征信息;将所述关联特征信息输入至第三全连接网络,得到关联特征序列;对所述关联特征序列进行激活处理,得到所述各个语音帧的音素后验概率匹配的所述交互对象的姿态参数值。In some embodiments, the third acquisition unit is specifically configured to: input the phoneme posterior probability of each speech frame into a time series network, and output associated feature information; input the associated feature information into a third fully connected network , to obtain an associated feature sequence; performing activation processing on the associated feature sequence to obtain the gesture parameter value of the interactive object matched with the phoneme posterior probability of each speech frame.
在一些实施例中,所述交互对象的控制参数包括面部姿态控制参数,所述控制单元具体用于:根据与所述各个语音帧的音素后验概率匹配的面部姿态参数值,驱动所述交互对象实现与所述声音驱动数据中的各个语音帧匹配的面部姿态。In some embodiments, the control parameters of the interactive object include facial gesture control parameters, and the control unit is specifically configured to: drive the interaction according to the facial gesture parameter value matched with the phoneme posterior probability of each speech frame. The subject achieves a facial gesture that matches each speech frame in the sound-driven data.
图6是根据本公开至少一个实施例的声音特征提出网络的训练装置的结构示意图,如图6所示,该装置可以包括:音素表获取单元601,用于根据多个目标语种中的音素,得到包含多语种的音素表;训练获取单元602,用于基于所述包含多语种的音素表,训练得到声音特征提取网络,所述声音特征提取网络用于提取语音帧的音素后验概率。Fig. 6 is a schematic structural diagram of a training device for proposing a network of sound features according to at least one embodiment of the present disclosure. As shown in Fig. 6, the device may include: a phoneme table acquisition unit 601, configured to, according to phonemes in multiple target languages, Obtaining a multilingual phoneme table; the training and obtaining unit 602 is configured to train a sound feature extraction network based on the multilingual phoneme table, and the sound feature extraction network is used to extract phoneme posterior probabilities of speech frames.
In some embodiments, the phoneme table acquiring unit is specifically configured to: acquire the phonemes of the multiple target languages and concatenate them; and merge, in the concatenation result, phonemes whose pronunciation similarity exceeds a first set threshold to obtain the multilingual phoneme table.
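A minimal sketch of this concatenate-then-merge step is given below, assuming a pronunciation-similarity function similarity(a, b) that returns a score in [0, 1]; the threshold value is illustrative, since the disclosure does not say how similarity is computed.

```python
def build_merged_phoneme_table(language_phoneme_sets, similarity, first_threshold=0.9):
    """Concatenate phonemes of all target languages, then merge near-duplicates.

    similarity(a, b) is an assumed scoring function in [0, 1]; the threshold
    value is illustrative only.
    """
    concatenated = [p for phonemes in language_phoneme_sets for p in phonemes]
    table = []
    for phoneme in concatenated:
        # Keep this phoneme only if no existing entry already sounds close enough
        if not any(similarity(phoneme, kept) > first_threshold for kept in table):
            table.append(phoneme)
    return table
```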
In some embodiments, the phoneme table acquiring unit is specifically configured to: map the phonemes of the multiple target languages to International Phonetic Alphabet (IPA) symbols whose pronunciation similarity satisfies a preset similarity condition; and merge IPA symbols with the same pronunciation in the mapping result to obtain the multilingual phoneme table.
In some embodiments, in response to there being, among the multiple target languages, a first phoneme whose pronunciation similarity with every IPA symbol is less than or equal to a second set threshold, the first phoneme is added to the multilingual phoneme table.
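The two preceding paragraphs can be pictured together as below. The similarity function, the threshold, the IPA inventory, and the rule that a phoneme above the second threshold maps to its closest IPA symbol are all assumptions made for illustration.

```python
def build_ipa_phoneme_table(language_phoneme_sets, ipa_symbols, similarity,
                            second_threshold=0.5):
    """Map each phoneme to its closest IPA symbol; keep unmatched phonemes as-is.

    similarity, the threshold, and the IPA inventory are illustrative assumptions.
    """
    table = set()
    for phonemes in language_phoneme_sets:
        for phoneme in phonemes:
            best_ipa = max(ipa_symbols, key=lambda s: similarity(phoneme, s))
            if similarity(phoneme, best_ipa) <= second_threshold:
                # First phoneme with no sufficiently similar IPA symbol:
                # add the phoneme itself to the table
                table.add(phoneme)
            else:
                # Pronunciation similarity satisfies the preset condition:
                # map to IPA; identical IPA symbols merge via the set
                table.add(best_ipa)
    return sorted(table)
```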
In some embodiments, the apparatus further includes a labeling unit configured to: acquire multilingual speech samples, where the languages of the speech samples are the same as those covered by the multilingual phoneme table; perform a phoneme alignment operation on the speech samples to obtain the phonemes contained in the speech samples; and label the ground-truth values of the phonemes in the speech samples with the phonemes in the multilingual phoneme table.
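One way to picture the labeling unit is the sketch below, where force_align stands in for any off-the-shelf forced aligner (the disclosure does not name one) and the sample attributes are hypothetical.

```python
def label_speech_samples(speech_samples, phoneme_table, force_align):
    """Produce frame-level ground-truth phoneme labels for training.

    force_align(audio, transcript) is an assumed forced aligner returning, for
    each speech frame, the phoneme uttered in that frame; the sample attributes
    (audio, transcript) are hypothetical.
    """
    labeled = []
    for sample in speech_samples:
        frame_phonemes = force_align(sample.audio, sample.transcript)
        # Ground-truth value: index of each aligned phoneme in the multilingual table
        labels = [phoneme_table.index(p) for p in frame_phonemes]
        labeled.append((sample, labels))
    return labeled
```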
In some embodiments, the training unit is specifically configured to: input the sound features of the labeled speech samples into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples; and, for each speech frame in the speech samples, adjust the parameter values of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the labeled ground-truth value.
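A training-loop sketch consistent with this description is given below. Using a per-frame negative log-likelihood loss and the Adam optimizer is an assumption; the disclosure only requires adjusting the network parameters according to the difference between the phoneme indicated by the maximum posterior probability and the labeled ground truth.

```python
import torch
import torch.nn as nn

def train_feature_extraction_net(net, labeled_samples, epochs=10, lr=1e-4):
    """Sketch of the training step for the sound feature extraction network.

    labeled_samples yields (sound_features, frame_labels) pairs, where
    sound_features has shape [frames, feat_dim] and frame_labels is a LongTensor
    of ground-truth phoneme indices, one per speech frame.  The loss and
    optimizer choices are assumptions, not from the disclosure.
    """
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.NLLLoss()
    for _ in range(epochs):
        for sound_features, frame_labels in labeled_samples:
            posteriors = net(sound_features)                   # [frames, num_phonemes]
            loss = loss_fn(torch.log(posteriors + 1e-9), frame_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return net
```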
At least one embodiment of the present disclosure further provides an electronic device. As shown in Fig. 7, the device includes a memory and a processor, where the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement, when executing the computer instructions, the method for driving an interactive object described in any embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements the method for driving an interactive object described in any embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides a computer program product, including a computer program, and the program, when executed by a processor, implements the method for driving an interactive object described in any embodiment of the present disclosure.
Those skilled in the art should understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
The embodiments in this specification are described in a progressive manner; for identical or similar parts of the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the data processing device embodiments are described relatively briefly since they are substantially similar to the method embodiments; for relevant details, reference may be made to the description of the method embodiments.
The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. Multitasking and parallel processing are also possible, or may be advantageous, in certain implementations.
Embodiments of the subject matter and the functional operations described in this specification may be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them. Embodiments of the subject matter described in this specification may be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information and transmit it to a suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification may be performed by one or more programmable computers executing one or more computer programs, which perform corresponding functions by operating on input data and generating output. The processes and logic flows may also be performed by special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus may also be implemented as special-purpose logic circuitry.
Computers suitable for executing a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other kind of central processing unit. Generally, a central processing unit receives instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes, or is operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from or transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.
Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as describing features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may act in certain combinations as described above and even be initially claimed as such, one or more features from a claimed combination may in some cases be removed from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products.
Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims may be performed in a different order and still achieve the desired results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.
The above descriptions are merely preferred embodiments of one or more embodiments of this specification and are not intended to limit one or more embodiments of this specification. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of this specification shall fall within the scope of protection of one or more embodiments of this specification.

Claims (16)

  1. A method for driving an interactive object, comprising:
    acquiring sound features of sound driving data of an interactive object;
    performing feature extraction on the sound features using a sound feature extraction network to obtain a phoneme posterior probability for each speech frame in the sound driving data, wherein the sound feature extraction network is trained with a phoneme table covering multiple languages;
    obtaining pose parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames; and
    controlling a pose of the interactive object according to the pose parameter values.
  2. The method according to claim 1, wherein the acquiring sound features of sound driving data of an interactive object comprises:
    acquiring a speech frame sequence corresponding to the sound driving data of the interactive object; and
    obtaining the sound features of the sound driving data according to a sound feature vector of each speech frame in the speech frame sequence.
  3. The method according to claim 1 or 2, wherein the sound feature extraction network comprises a first fully connected network, an encoding sub-network, and a second fully connected network, and the performing feature extraction on the sound features using the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data comprises:
    inputting the sound features into the first fully connected network to obtain a first sound feature sequence output by the first fully connected network;
    performing feature encoding on the first sound feature sequence with the encoding sub-network; and
    inputting an encoding result into the second fully connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  4. The method according to any one of claims 1 to 3, wherein the obtaining pose parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames comprises:
    inputting the phoneme posterior probabilities of the respective speech frames into a temporal network to output associated feature information;
    inputting the associated feature information into a third fully connected network to obtain an associated feature sequence; and
    performing activation processing on the associated feature sequence to obtain the pose parameter values of the interactive object matching the phoneme posterior probabilities of the respective speech frames.
  5. The method according to any one of claims 1 to 4, wherein the pose parameters of the interactive object comprise facial pose parameters, and the controlling the pose of the interactive object according to the pose parameter values comprises:
    driving the interactive object to present a facial pose matching each speech frame in the sound driving data, according to facial pose parameter values matching the phoneme posterior probabilities of the respective speech frames.
  6. A phoneme processing method, comprising:
    obtaining a multilingual phoneme table according to phonemes of multiple target languages; and
    training a sound feature extraction network based on the multilingual phoneme table, wherein the sound feature extraction network is used to extract phoneme posterior probabilities of speech frames.
  7. The method according to claim 6, wherein the obtaining a multilingual phoneme table according to phonemes of multiple target languages comprises:
    concatenating the phonemes of the multiple target languages; and
    merging, in the concatenation result, phonemes whose pronunciation similarity exceeds a first set threshold to obtain the multilingual phoneme table.
  8. The method according to claim 6, wherein the obtaining a multilingual phoneme table according to phonemes of multiple target languages comprises:
    mapping the phonemes of the multiple target languages to International Phonetic Alphabet (IPA) symbols whose pronunciation similarity satisfies a preset similarity condition; and
    merging IPA symbols with the same pronunciation in the mapping result to obtain the multilingual phoneme table.
  9. The method according to claim 8, further comprising: in response to there being, among the multiple target languages, a first phoneme whose pronunciation similarity with every IPA symbol is less than or equal to a second set threshold, adding the first phoneme to the multilingual phoneme table.
  10. The method according to any one of claims 6 to 9, further comprising:
    acquiring multilingual speech samples, wherein the languages of the speech samples are the same as those covered by the multilingual phoneme table;
    performing a phoneme alignment operation on the speech samples to obtain the phonemes contained in the speech samples; and
    labeling ground-truth values of the phonemes in the speech samples with the phonemes in the multilingual phoneme table.
  11. The method according to claim 10, wherein the training a sound feature extraction network based on the multilingual phoneme table comprises:
    inputting sound features of the labeled speech samples into the sound feature extraction network to obtain a phoneme posterior probability for each speech frame in the speech samples; and
    for each speech frame in the speech samples, adjusting parameter values of the sound feature extraction network according to a difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the labeled ground-truth value.
  12. An apparatus for driving an interactive object, comprising:
    a first acquiring unit, configured to acquire sound features of sound driving data of an interactive object;
    a second acquiring unit, configured to perform feature extraction on the sound features using a sound feature extraction network to obtain a phoneme posterior probability for each speech frame in the sound driving data, wherein the sound feature extraction network is trained with a phoneme table covering multiple languages;
    a third acquiring unit, configured to obtain pose parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames; and
    a control unit, configured to control a pose of the interactive object according to the pose parameter values.
  13. A phoneme processing apparatus, comprising:
    a phoneme table acquiring unit, configured to obtain a multilingual phoneme table according to phonemes of multiple target languages; and
    a training unit, configured to train a sound feature extraction network based on the multilingual phoneme table, wherein the sound feature extraction network is used to extract phoneme posterior probabilities of speech frames.
  14. An electronic device, comprising a memory and a processor, wherein the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement, when executing the computer instructions, the method according to any one of claims 1 to 5 or the method according to any one of claims 6 to 11.
  15. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 5 or the method according to any one of claims 6 to 11.
  16. A computer program product, comprising a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 5 or the method according to any one of claims 6 to 11.
PCT/CN2022/089870 2021-05-31 2022-04-28 Interaction object driving and phoneme processing methods and apparatus, device and storage medium WO2022252890A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110604874.8A CN113314104B (en) 2021-05-31 2021-05-31 Interactive object driving and phoneme processing method, device, equipment and storage medium
CN202110604874.8 2021-05-31

Publications (1)

Publication Number Publication Date
WO2022252890A1 true WO2022252890A1 (en) 2022-12-08

Family

ID=77376708

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089870 WO2022252890A1 (en) 2021-05-31 2022-04-28 Interaction object driving and phoneme processing methods and apparatus, device and storage medium

Country Status (3)

Country Link
CN (1) CN113314104B (en)
TW (1) TW202248994A (en)
WO (1) WO2022252890A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314104B (en) * 2021-05-31 2023-06-20 北京市商汤科技开发有限公司 Interactive object driving and phoneme processing method, device, equipment and storage medium
CN113724718B (en) 2021-09-01 2022-07-29 宿迁硅基智能科技有限公司 Target audio output method, device and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110503942A (en) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 A kind of voice driven animation method and device based on artificial intelligence
CN110880315A (en) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 Personalized voice and video generation system based on phoneme posterior probability
CN111459450A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111933110A (en) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 Video generation method, generation model training method, device, medium and equipment
CN112017648A (en) * 2020-08-25 2020-12-01 北京声智科技有限公司 Weighted finite state converter construction method, speech recognition method and device
CN112669841A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Training method and device for multilingual speech generation model and computer equipment
CN113314104A (en) * 2021-05-31 2021-08-27 北京市商汤科技开发有限公司 Interactive object driving and phoneme processing method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106098059B (en) * 2016-06-23 2019-06-18 上海交通大学 Customizable voice awakening method and system
US10832129B2 (en) * 2016-10-07 2020-11-10 International Business Machines Corporation Transfer of an acoustic knowledge to a neural network
CN107633842B (en) * 2017-06-12 2018-08-31 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN109377986B (en) * 2018-11-29 2022-02-01 四川长虹电器股份有限公司 Non-parallel corpus voice personalized conversion method
CN113672194A (en) * 2020-03-31 2021-11-19 北京市商汤科技开发有限公司 Method, device and equipment for acquiring acoustic feature sample and storage medium

Also Published As

Publication number Publication date
TW202248994A (en) 2022-12-16
CN113314104A (en) 2021-08-27
CN113314104B (en) 2023-06-20

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22814937; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 22814937; Country of ref document: EP; Kind code of ref document: A1)