CN113314104B - Interactive object driving and phoneme processing method, device, equipment and storage medium - Google Patents

Interactive object driving and phoneme processing method, device, equipment and storage medium

Info

Publication number
CN113314104B
Authority
CN
China
Prior art keywords
voice
interactive object
phoneme
sound
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110604874.8A
Other languages
Chinese (zh)
Other versions
CN113314104A (en)
Inventor
吴文岩
吴潜溢
高娜
钱晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202110604874.8A
Publication of CN113314104A
Priority to PCT/CN2022/089870
Priority to TW111119388A
Application granted
Publication of CN113314104B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 Transforming into visible information
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 2021/105 Synthesis of the lips movements from speech, e.g. for talking heads
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Processing Or Creating Images (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed are an interactive object driving and phoneme processing method, apparatus, device and storage medium. The interactive object driving method comprises: acquiring acoustic features of sound driving data of an interactive object; performing feature extraction on the acoustic features by using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, the sound feature extraction network being trained according to a phoneme table containing multiple languages; obtaining pose parameter values of the interactive object according to the phoneme posterior probability of each speech frame; and controlling the pose of the interactive object according to the pose parameter values.

Description

Interactive object driving and phoneme processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular to an interactive object driving and phoneme processing method, apparatus, device and storage medium.
Background
A digital person uses deep learning to match the sound it produces with the mouth shape, expression, action, and so on that it presents. With the widespread use of digital persons in many fields, they are required to support multiple languages in many scenarios.
At present, a digital person is generally driven by speech features extracted by a speech recognition model or obtained using phoneme time stamps. However, such features differ across languages, so deep learning requires data sets specific to each language, while current open-source data sets suffer from problems such as low quality, incomplete labeling and unbalanced data.
How to enable digital persons to support multiple languages is therefore a problem that currently needs to be actively studied.
Disclosure of Invention
The embodiment of the disclosure provides an interactive object driving and phoneme processing scheme.
According to an aspect of the present disclosure, there is provided a driving method of an interactive object, the method comprising: acquiring acoustic features of sound driving data of an interactive object; performing feature extraction on the acoustic features by using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, the sound feature extraction network being trained according to a phoneme table containing multiple languages; obtaining pose parameter values of the interactive object according to the phoneme posterior probability of each speech frame; and controlling the pose of the interactive object according to the pose parameter values.
According to the embodiments of the present disclosure, training the sound feature extraction network with a phoneme table containing multiple languages can improve the efficiency and quality of training the feature extraction network; the network is then used to extract phoneme posterior features from the sound driving data, and these features are used as sound features to drive the interactive object.
In combination with any one of the embodiments provided in the present disclosure, the acquiring acoustic features of sound driving data of an interactive object includes: acquiring a voice frame sequence corresponding to voice driving data of the interactive object; and obtaining the acoustic characteristics of the sound driving data according to the acoustic characteristic vectors of the voice frames in the voice frame sequence.
In combination with any one of the embodiments provided in the present disclosure, the sound feature extraction network comprises a first fully-connected network, an encoding sub-network and a second fully-connected network, and the performing feature extraction on the acoustic features by using the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data comprises: inputting the acoustic features into the first fully-connected network to obtain a first acoustic feature sequence output by the first fully-connected network; performing feature encoding processing on the first acoustic feature sequence by using the encoding sub-network; and inputting the encoding result into the second fully-connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
In the embodiments of the present disclosure, the sound features are converted into a sequence, features are extracted by an encoding network suitable for extracting sequence features, and the phoneme posterior probability of each speech frame in the sound driving data can then be accurately predicted through the classification performed by the fully-connected network.
In combination with any one of the embodiments provided in the present disclosure, the obtaining pose parameter values of the interactive object according to the phoneme posterior probability of each speech frame comprises: inputting the phoneme posterior probability of each speech frame into a time-sequence network and outputting associated feature information; inputting the associated feature information into a third fully-connected network to obtain an associated feature sequence; and activating the associated feature sequence to obtain the pose parameter values of the interactive object matching the phoneme posterior probability of each speech frame.
The pose parameter values corresponding to the phoneme posterior probabilities of the speech frames in the sound driving data are predicted through a time-sequence network and a fully-connected network, so that historical phoneme posterior probabilities that are relevant to the current phoneme posterior probability are fused with it. In this way the historical pose parameter values influence the change of the current pose parameter values, making the changes in the pose parameter values of the interactive object smoother and more natural.
In combination with any one of the embodiments provided in the present disclosure, the control parameters of the interactive object comprise facial pose control parameters, and the controlling the pose of the interactive object according to the pose parameter values comprises: driving the interactive object to achieve a facial pose matching each speech frame in the sound driving data according to the facial pose control parameters matching the phoneme posterior probability of each speech frame.
When speech is output according to the sound driving data, the interactive object is driven to make facial expressions according to the facial pose control parameters corresponding to the sound driving data, so that the interactive object synchronously makes the mouth shapes and expressions of producing the speech while the speech is output. This gives the target object the sensation that the interactive object is speaking, and improves the interaction experience of the target object.
According to an aspect of the present disclosure, a phoneme processing method is provided, the method including: obtaining a phoneme list containing multiple languages according to phonemes in multiple target languages; and training to obtain a sound feature extraction network based on the phoneme list containing multiple languages, wherein the sound feature extraction network is used for extracting the phoneme posterior probability of the speech frame to be recognized.
According to the embodiments of the present disclosure, training the sound feature extraction network with a phoneme table containing multiple languages can improve the efficiency and quality of training the feature extraction network; the network is then used to extract phoneme posterior features from the sound driving data, and these features are used as sound features to drive the interactive object.
In combination with any one of the implementations provided in the present disclosure, the obtaining a phoneme table containing multiple languages according to phonemes in multiple target languages comprises: acquiring and splicing phonemes in a plurality of target languages; and merging phonemes in the splicing result whose pronunciation similarity exceeds a first set threshold to obtain the phoneme table containing multiple languages.
The embodiments of the present disclosure provide a method for constructing a multilingual phoneme table by splicing, by which a phoneme table containing a plurality of target languages can be obtained conveniently and quickly.
In connection with any one of the embodiments provided by the present disclosure, the method further comprises: mapping phonemes in a plurality of target languages into international phonetic symbols with pronunciation similarity meeting preset similarity conditions respectively; and merging the international phonetic symbols with the same pronunciation in the mapping result to obtain the phoneme list containing multiple languages.
In connection with any of the embodiments provided in this disclosure, in response to the presence of a first phoneme in the plurality of target languages having a pronunciation similarity to each international phonetic symbol that is less than or equal to a second set threshold, the first phoneme is added to the multilingual-containing phonemic table.
The embodiment of the disclosure provides a method for obtaining a phoneme list containing multiple languages by mapping multiple target languages into international phonetic symbols, which is applicable to multiple target languages and has universality.
In connection with any one of the embodiments provided by the present disclosure, the method further comprises: obtaining a multilingual voice sample, wherein the language type of the voice sample is the same as the language type contained in the multilingual phoneme list; performing phoneme alignment operation on the voice sample to obtain phonemes contained in the voice sample; and labeling the phonemes in the voice sample by using the phonemes in the multilingual phonemic table.
In the embodiment of the disclosure, the phoneme list containing multilingual can be utilized to directly label the multilingual voice samples, and a corpus with high quality, complete labeling and balanced data can be conveniently and efficiently constructed for training the voice feature extraction network.
In connection with any one of the embodiments provided by the present disclosure, the method further comprises: inputting the acoustic features of the marked voice samples into the voice feature extraction network to obtain the phoneme posterior probability of each voice frame in the voice samples; and adjusting the parameter value of the voice feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the voice frame and the marked true value.
According to an aspect of the present disclosure, there is provided a driving apparatus of an interactive object, the apparatus including: a first acquisition unit configured to acquire acoustic features of sound driving data of an interactive object; the second acquisition unit is used for carrying out feature extraction on the acoustic features by utilizing a sound feature extraction network to obtain the phoneme posterior probability of each voice frame in the sound driving data; the sound feature extraction network is trained according to a phone table containing multiple languages; the third acquisition unit is used for obtaining the attitude parameter value of the interactive object according to the phoneme posterior probability of each voice frame; and the control unit is used for controlling the gesture of the interactive object according to the gesture parameter value.
In combination with any one of the embodiments provided in the present disclosure, the first obtaining unit is specifically configured to: acquiring a voice frame sequence corresponding to voice driving data of the interactive object; and obtaining the acoustic characteristics of the sound driving data according to the acoustic characteristic vectors of the voice frames in the voice frame sequence.
In combination with any one of the embodiments provided in the present disclosure, the sound feature extraction network includes a first fully-connected network, an encoding sub-network, and a second fully-connected network, where the second obtaining unit is specifically configured to: inputting the acoustic features into the first fully-connected network to obtain a first acoustic feature sequence output by the first fully-connected network; performing feature encoding processing on the first acoustic feature sequence by using the encoding sub-network; and inputting the encoding result into the second fully-connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
In combination with any one of the embodiments provided in the present disclosure, the third obtaining unit is specifically configured to: inputting the phoneme posterior probability of each voice frame to a time sequence network, and outputting associated characteristic information; inputting the associated feature information into a third fully-connected network to obtain an associated feature sequence; and activating the associated feature sequence to obtain the attitude parameter values of the interactive objects matched with the phoneme posterior probability of each voice frame.
In combination with any one of the embodiments provided in the present disclosure, the control parameters of the interactive object include a facial gesture control parameter, and the control unit is specifically configured to: and driving the interactive object to realize the facial gesture matched with each voice frame in the voice driving data according to the facial gesture control parameters matched with the phoneme posterior probability of each voice frame.
According to an aspect of the present disclosure, there is provided a phoneme processing apparatus, the apparatus including: a phoneme list obtaining unit for obtaining a phoneme list containing multiple languages according to phonemes in multiple target languages; and the training unit is used for training to obtain a sound feature extraction network based on the phone table containing multiple languages, and the sound feature extraction network is used for extracting the phone posterior probability of the voice frame to be recognized.
In combination with any one of the embodiments provided in the present disclosure, the phonemic table obtaining unit is specifically configured to: acquiring phonemes in a plurality of target languages for splicing; combining phonemes with pronunciation similarity exceeding a first set threshold value in the splicing result to obtain the phoneme list containing multiple languages; training to obtain a sound feature extraction network based on the phone table containing multiple languages.
In combination with any one of the embodiments provided in the present disclosure, the phonemic table obtaining unit is specifically configured to: mapping phonemes in a plurality of target languages into international phonetic symbols with pronunciation similarity meeting preset similarity conditions respectively; and merging the international phonetic symbols with the same pronunciation in the mapping result to obtain the phoneme list containing multiple languages.
In connection with any of the embodiments provided in this disclosure, in response to the presence of a first phoneme in the plurality of target languages having a pronunciation similarity to each international phonetic symbol that is less than or equal to the second set threshold, the first phoneme is added to the multilingual-containing phonemic table.
In combination with any one of the embodiments provided in the present disclosure, the apparatus further includes an labeling unit configured to: obtaining a multilingual voice sample, wherein the language type of the voice sample is the same as the language type contained in the multilingual phoneme list; performing phoneme alignment operation on the voice sample to obtain phonemes contained in the voice sample; and labeling the phonemes in the voice sample by using the phonemes in the multilingual phonemic table.
In combination with any one of the embodiments provided in the present disclosure, the training unit is specifically configured to: inputting the acoustic features of the marked voice samples into the voice feature extraction network to obtain the phoneme posterior probability of each voice frame in the voice samples; and adjusting the parameter value of the voice feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the voice frame and the marked true value.
According to an aspect of the present disclosure, there is provided an electronic device, the device comprising a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the driving method of an interactive object according to any of the embodiments provided in the present disclosure when executing the computer instructions.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of driving an interactive object according to any of the embodiments provided by the present disclosure.
According to an aspect of the present disclosure, there is provided a computer program product, including a computer program, which when executed by a processor implements the method for driving an interactive object according to any of the embodiments provided in the present disclosure.
Drawings
In order to more clearly illustrate one or more embodiments of the present specification or the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described, it being apparent that the drawings in the following description are only some of the embodiments described in one or more embodiments of the present specification, and that other drawings may be obtained from these drawings without inventive faculty for a person of ordinary skill in the art.
FIG. 1 is a flow chart of a method of driving an interactive object in accordance with at least one embodiment of the present disclosure;
FIG. 2 is a schematic illustration of a process for feature encoding a sequence of phonemes in accordance with at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a mapping process of phoneme posterior probabilities shown in at least one embodiment of the present disclosure;
FIG. 4 is a flow chart of a method of phoneme processing as set forth in at least one embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a phoneme processing device in accordance with at least one embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
At least one embodiment of the present disclosure provides a driving method of an interactive object, where the driving method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a fixed terminal or a mobile terminal, for example, a mobile phone, a tablet computer, a game machine, a desktop computer, an advertisement machine, an all-in-one machine, a vehicle terminal, and the like, and the server includes a local server or a cloud server, and the method may also be implemented by a manner in which a processor invokes computer readable instructions stored in a memory.
In the embodiment of the present disclosure, the interactive object may be any interactive object capable of interacting with the target object, which may be a virtual character, or may be any other avatar capable of implementing an interactive function, such as a virtual animal, a virtual object, a cartoon avatar, or the like, where a presentation form of the avatar may be a 2D form or a 3D form, and the present disclosure is not limited thereto. The target object can be a user, a robot or other intelligent equipment.
The interactive object may be displayed through a terminal device, which may be a television, an integrated machine with a display function, a projector, a Virtual Reality (VR) device, an augmented Reality (Augmented Reality, AR) device, or the like, and the present disclosure is not limited to a specific form of the terminal device.
In some embodiments, the interactive object may emit the specified voice to the target object in response to the terminal device receiving voice-driven data for driving the interactive object to output the voice. According to actions, expressions, identities, preferences and the like of target objects around the terminal equipment, sound driving data can be generated to drive interaction objects to respond by sending out appointed voices, so that personification service is provided for the target objects. In some scenarios, the interactive object may interact with the target object in different languages, and in order to make the gesture of the interactive object fit with the real pronunciation in different languages, at least one embodiment of the present disclosure proposes a driving method of the interactive object.
Fig. 1 shows a flow chart of a method of driving an interactive object according to at least one embodiment of the present disclosure, as shown in fig. 1, the method including steps 101 to 104.
In step 101, acoustic features of sound driving data of the interactive object are acquired.
The sound driving data may include audio data (speech data), text, and the like. In response to the sound driving data being audio data, the audio data can be used directly to drive the interactive object to output speech, that is, the terminal device outputs speech directly from the audio data; in response to the sound driving data being text, corresponding phonemes may be generated from the speech contained in the text, and the generated phonemes may be used to drive the interactive object to output speech. Taking a Chinese text as an example, the text can first be converted into pinyin, and corresponding phonemes can then be generated from the pinyin, as sketched below. The sound driving data may also be driving data in other forms, which is not limited by the present disclosure.
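As a minimal illustration of the text path just described, the sketch below converts a short Chinese text into pinyin and then into phonemes. The tiny LEXICON, the INITIALS list and the split_pinyin helper are hypothetical placeholders standing in for whatever grapheme-to-phoneme front end a real system would use; they are not part of the disclosure.

```python
# Hypothetical sketch: text -> pinyin -> phonemes. LEXICON and split_pinyin
# stand in for a real grapheme-to-phoneme front end.
LEXICON = {"你": "ni3", "好": "hao3"}          # character -> toned pinyin (assumed)
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

def split_pinyin(syllable: str):
    """Split one toned pinyin syllable into (initial, final) phonemes."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return [ini, syllable[len(ini):]]
    return [syllable]                           # syllable with no initial

def text_to_phonemes(text: str):
    phonemes = []
    for ch in text:
        phonemes.extend(split_pinyin(LEXICON[ch]))
    return phonemes

print(text_to_phonemes("你好"))                 # ['n', 'i3', 'h', 'ao3']
```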
In the embodiment of the disclosure, the voice driving data may be driving data generated according to actions, expressions, identities, preferences and the like of the target object interacting with the interaction object, or may be voice driving data called by the terminal device from the internal memory. The present disclosure does not limit the manner of acquiring the sound drive data.
In response to the sound driving data being audio data, the audio data may be split into a plurality of speech frames, and the speech frames may be combined according to their states to form phonemes; the phonemes formed from the audio data form a phoneme sequence. A phoneme is the smallest phonetic unit obtained by dividing speech according to its natural attributes; one pronunciation action of a real person forms one phoneme.
In response to the voice driving data being text, phonemes contained in the morphemes can be obtained according to the morphemes contained in the text, so that corresponding phoneme sequences are obtained. It will be appreciated by those skilled in the art that the phoneme sequence corresponding to the sound driving data may also be obtained by other means, which is not limited by the present disclosure.
In the embodiments of the present disclosure, the acoustic features may be features related to speech emotion, such as fundamental frequency features, formant features, Mel-Frequency Cepstral Coefficients (MFCC), and so on.
In step 102, feature extraction is performed on the acoustic features by using a sound feature extraction network, so as to obtain a phoneme posterior probability of each speech frame in the sound driving data.
Wherein the phoneme posterior probability represents a probability that the speech frame corresponds to each phoneme. The phoneme posterior probability is independent of the speaker and is only dependent on the speaking content.
In an embodiment of the present disclosure, the sound feature extraction network for extracting the phoneme posterior probability of each speech frame in the sound driving data is trained from a phoneme list containing multiple languages.
In some embodiments, a phone table containing multiple languages may be obtained by: acquiring phonemes in a plurality of target languages for splicing; and combining phonemes with pronunciation similarity exceeding a first set threshold in the splicing result, so that a phoneme list containing a plurality of target languages can be conveniently and quickly obtained.
For example, phonemes in Chinese (Pinyin) may be concatenated with phonemes in English, and phonemes of the same or similar pronunciation, e.g., "b", "p", "m", "f", etc., may be combined in the concatenated result, thereby obtaining a phonemic table containing Chinese and English.
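A minimal sketch of this splicing approach is given below. The pronunciation-similarity table SIMILARITY and the value of FIRST_THRESHOLD are illustrative assumptions; the disclosure does not specify how similarity is scored.

```python
# Illustrative sketch of building a multilingual phoneme table by splicing.
# SIMILARITY scores and FIRST_THRESHOLD are assumed placeholders.
SIMILARITY = {("b", "b"): 1.0, ("p", "p"): 1.0, ("i", "i1"): 0.9}  # assumed scores
FIRST_THRESHOLD = 0.8

def similarity(p1: str, p2: str) -> float:
    return SIMILARITY.get((p1, p2), SIMILARITY.get((p2, p1), 0.0))

def build_spliced_table(*language_phoneme_sets):
    spliced = [p for phones in language_phoneme_sets for p in phones]  # splice
    table = []
    for phoneme in spliced:
        # merge with an existing entry if pronunciation similarity exceeds the threshold
        if any(similarity(phoneme, kept) > FIRST_THRESHOLD for kept in table):
            continue
        table.append(phoneme)
    return table

chinese = ["b", "p", "m", "f", "i1"]
english = ["b", "p", "i"]
print(build_spliced_table(chinese, english))   # ['b', 'p', 'm', 'f', 'i1']
```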
In some embodiments, a phone table containing multiple languages may be obtained by: first, phonemes in a plurality of target languages are mapped to international phonetic symbols (International Phonetic Alphabet, IPA) whose pronunciation similarity satisfies a similarity condition, for example, the same pronunciation or the highest similarity. And combining the international phonetic symbols with the same pronunciation in the mapping result to obtain the phone table containing multiple languages. The method is suitable for various target languages and has universality.
For example, all phonemes of chinese may be mapped to international phonetic symbols having the highest pronunciation similarity, all phonemes of english may be mapped to international phonetic symbols having the highest pronunciation similarity, and the international phonetic symbols to which chinese and english are mapped may be stored in one phoneme list, and phonemes having the same pronunciation may be combined, thereby obtaining a phoneme list supporting chinese and english.
For example, assume that the Chinese phonemes include a1, a2, a3, b, i1, i2, i3, ii1, ii2, ii3 (where 1, 2 and 3 denote tones), the English phonemes include a, b, i, and the IPA table includes a, b, i. According to pronunciation, each phoneme is mapped to the IPA symbol with the highest similarity: the Chinese sequence is mapped to a, a, a, b, i, i, i, i, i, i (since there is no "ii" pronunciation in the IPA table and the actual pronunciation of ii is most similar to i, ii is mapped to i), and the English phonemes are mapped to a, b, i in turn.
In some embodiments, in response to the presence of a first phoneme in the plurality of target languages whose pronunciation similarity to each international phonetic symbol is less than or equal to a second set threshold, the first phoneme is added to the phoneme table containing multiple languages. For example, a phoneme such as "ng" may not be present in the IPA table, and its pronunciation similarity to the existing symbols is less than the second set threshold; alternatively, a phoneme may be composed of several other pronunciations, so that its similarity to the symbols in the IPA table is also less than the second set threshold. Such a phoneme is called a first phoneme; it is retained and appended to the IPA table, that is, the resulting table includes the first phonemes in addition to all the IPA symbols, as sketched below.
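The IPA-mapping variant can be sketched in the same style, reusing the a/b/i example from the preceding paragraphs. The IPA_TABLE contents, the SCORES entries and SECOND_THRESHOLD are illustrative assumptions, not values from the disclosure.

```python
# Illustrative sketch: map phonemes to their most similar IPA symbol, merge identical
# mappings, and append "first phonemes" (e.g. "ng") whose best similarity is too low.
IPA_TABLE = ["a", "b", "i"]
SECOND_THRESHOLD = 0.5

# assumed pronunciation-similarity scores between language phonemes and IPA symbols
SCORES = {
    ("a1", "a"): 1.0, ("a2", "a"): 1.0, ("a3", "a"): 1.0, ("b", "b"): 1.0,
    ("i1", "i"): 1.0, ("i2", "i"): 1.0, ("i3", "i"): 1.0,
    ("ii1", "i"): 0.8, ("ii2", "i"): 0.8, ("ii3", "i"): 0.8,
    ("ng", "a"): 0.1, ("ng", "b"): 0.1, ("ng", "i"): 0.1,
}

def build_ipa_based_table(language_phonemes):
    table = list(IPA_TABLE)
    for phoneme in language_phonemes:
        best_ipa = max(IPA_TABLE, key=lambda ipa: SCORES.get((phoneme, ipa), 0.0))
        best_score = SCORES.get((phoneme, best_ipa), 0.0)
        if best_score <= SECOND_THRESHOLD and phoneme not in table:
            table.append(phoneme)   # first phoneme: kept after the IPA symbols
        # otherwise the phoneme is represented by best_ipa, already in the table
    return table

chinese = ["a1", "a2", "a3", "b", "i1", "i2", "i3", "ii1", "ii2", "ii3", "ng"]
print(build_ipa_based_table(chinese))   # ['a', 'b', 'i', 'ng']
```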
It should be understood by those skilled in the art that the first set threshold value and the second set threshold value may be specifically set according to actual needs, which is not limited in this disclosure.
In the embodiment of the disclosure, the phoneme list containing multilingual can be utilized to directly label the multilingual voice samples, and a corpus with high quality, complete labeling and balanced data can be conveniently and efficiently constructed for training the voice feature extraction network.
In step 103, according to the phoneme posterior probability of each speech frame, the posture parameter value of the interactive object is obtained.
In the embodiment of the disclosure, the gesture parameter value of the interaction object matched with the voice driving data can be obtained according to the phoneme posterior probability of each voice frame in the voice driving data.
The pose parameters are used to control the pose of the interactive object, and different pose parameter values can drive the interactive object to make corresponding poses. The pose parameters may include facial pose parameters for controlling the facial pose of the interactive object, including expression, mouth shape, movements of the facial features, head pose, and the like. In the embodiments of the present disclosure, a correspondence between phoneme posterior probabilities and pose parameter values of the interactive object may be established in advance, so that once the posterior probability of each speech frame in the sound driving data is obtained, the pose parameter values corresponding to the sound driving data can be obtained. The specific form of the pose parameters may be determined according to the type of the interactive object model.
In step 104, the pose of the interactive object is controlled according to the pose parameter values.
The pose parameter values match the phoneme posterior probabilities of the speech frames in the sound driving data of the interactive object, and the phoneme posterior probability is independent of language. Speech data and text in different languages can therefore be processed, and the pose presented by the interactive object, such as mouth shape, expression and action, matches the actual pronunciation, giving the target object interacting with the interactive object the sensation that the interactive object is speaking.
In the embodiments of the present disclosure, acoustic features of the sound driving data of the interactive object are first acquired; feature extraction is performed on the acoustic features by the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data; pose parameter values of the interactive object are then obtained according to the phoneme posterior probability of each speech frame; and the pose of the interactive object is controlled according to the pose parameter values.
In some embodiments, a multilingual-supported corpus can be constructed according to the following method.
Firstly, a multilingual voice sample is obtained, wherein the language type of the voice sample is the same as the language type contained in the multilingual phonemic form. For example, in the case where the phonemic form is a phonemic form supporting Chinese and English, a voice sample of Chinese and a voice sample of English are acquired, respectively.
And then, performing phoneme alignment operation on the voice sample to obtain phonemes contained in the voice sample.
Take as an example a speech sample that is a segment of Chinese speech saying "hello" (你好). After the phoneme alignment operation is performed on the speech sample, the pronunciation start and stop time of each phoneme in the segment can be obtained: n [0, 0.2], i3 [0.2, 0.4], h [0.5, 0.7], ao3 [0.7, 1.2], where the start and stop times in brackets are given in seconds. The phoneme corresponding to each speech frame in the speech sample is then determined from the pronunciation start and stop times of the phonemes.
Finally, labeling the phonemes in the speech samples by using the phonemes in the multilingual phonemic table.
Taking a phoneme table supporting Chinese and English as an example, the phonemes in the multilingual phoneme table can be used directly to label both Chinese and English speech samples, so that a corpus with high quality, complete labeling and balanced data can be constructed conveniently and efficiently, as sketched below.
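Continuing the "你好" example, the following sketch turns the per-phoneme start/stop times into one label per speech frame. The 10 ms frame shift and the "sil" filler used for unaligned gaps are assumptions for illustration, not values stated in the disclosure.

```python
# Sketch: convert aligned phoneme intervals into one phoneme label per speech frame.
# The 10 ms frame shift and the "sil" gap filler are assumed.
FRAME_SHIFT = 0.01  # seconds per speech frame (assumed)

def frame_labels(alignment, total_duration):
    """alignment: list of (phoneme, start_sec, end_sec) from the alignment operation."""
    labels = ["sil"] * round(total_duration / FRAME_SHIFT)
    for phoneme, start, end in alignment:
        for frame in range(round(start / FRAME_SHIFT), round(end / FRAME_SHIFT)):
            labels[frame] = phoneme
    return labels

alignment = [("n", 0.0, 0.2), ("i3", 0.2, 0.4), ("h", 0.5, 0.7), ("ao3", 0.7, 1.2)]
labels = frame_labels(alignment, total_duration=1.2)
print(labels[0], labels[25], labels[60], labels[100])   # n i3 h ao3
```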
In some embodiments, the acoustic feature extraction network may be trained by the following method.
Firstly, inputting acoustic features of the marked voice samples into the voice feature extraction network to obtain phoneme posterior probability of each voice frame in the voice samples. Wherein, each voice frame in the labeled voice sample is labeled with a true value of a phoneme.
Next, the parameter values of the sound feature extraction network are adjusted according to the difference between the phoneme indicated by the maximum phoneme posterior probability of each speech frame and the labeled true value. When the change in the network loss satisfies the convergence condition, for example when the change in the network loss is smaller than a set threshold, or when the number of iterations reaches a set number, training is completed and the trained sound feature extraction network is obtained.
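A hedged sketch of one training step along these lines is shown below. It treats the network as any module that outputs per-frame phoneme posterior probabilities, and realizes "adjusting the parameters according to the difference" with a negative log-likelihood loss; the loss choice, the Adam optimizer and the learning rate are assumptions rather than details from the disclosure.

```python
import torch
import torch.nn as nn

def train_step(extraction_net, optimizer, acoustic_feats, frame_phoneme_ids):
    """One parameter update. acoustic_feats: (batch, frames, feat_dim) labeled
    acoustic features; frame_phoneme_ids: (batch, frames) labeled true phoneme ids."""
    optimizer.zero_grad()
    # phoneme posterior probability of each speech frame: (batch, frames, num_phonemes)
    posteriors = extraction_net(acoustic_feats)
    loss = nn.NLLLoss()(torch.log(posteriors + 1e-8).reshape(-1, posteriors.shape[-1]),
                        frame_phoneme_ids.reshape(-1))
    loss.backward()
    optimizer.step()
    # phoneme indicated by the maximum posterior probability vs. the labeled true value
    frame_accuracy = (posteriors.argmax(dim=-1) == frame_phoneme_ids).float().mean()
    return loss.item(), frame_accuracy.item()

# Assumed usage: optimizer = torch.optim.Adam(extraction_net.parameters(), lr=1e-4);
# training stops when the change in loss falls below a set threshold or the
# iteration count reaches a set number, as described above.
```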
In some embodiments, a voice frame sequence corresponding to the voice driving data of the interactive object may be obtained, and according to acoustic feature vectors of each voice frame in the voice frame sequence, acoustic features of the voice driving data may be obtained. Taking MFCC as an example, according to MFCC coefficients of each voice frame in the voice frame sequence, an MFCC matrix corresponding to the voice driving data may be obtained.
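A short sketch of assembling such an MFCC matrix from the speech-frame sequence is given below. The use of librosa, the 16 kHz sample rate, the 13 coefficients and the frame/hop lengths are assumed choices, and the file name is hypothetical.

```python
import librosa

def mfcc_matrix(wav_path: str, n_mfcc: int = 13):
    """Return an (n_frames, n_mfcc) MFCC matrix for the sound driving data."""
    y, sr = librosa.load(wav_path, sr=16000)           # 16 kHz is an assumed sample rate
    # one MFCC vector per speech frame; window/hop lengths are assumed (25 ms / 10 ms)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)
    return mfcc.T                                       # rows correspond to speech frames

feats = mfcc_matrix("sound_driving_data.wav")           # hypothetical file name
print(feats.shape)                                      # (n_frames, 13)
```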
Fig. 2 illustrates a schematic diagram of a sound feature extraction process shown in at least one embodiment of the present disclosure. As shown in Fig. 2, a sound feature extraction network 200 is used to perform feature extraction on the acoustic features of the sound driving data to obtain the phoneme posterior probabilities of the respective speech frames in the sound driving data. The sound feature extraction network 200 comprises a first fully-connected network 201, an encoding sub-network 202 and a second fully-connected network 203.
First, the acoustic features are input into the first fully-connected network 201 to obtain a first acoustic feature sequence output by the first fully-connected network; then, feature encoding processing is performed on the first acoustic feature sequence by the encoding sub-network 202 to obtain an encoding result. The encoding sub-network may be, for example, a CBHG network, a Gated Recurrent Unit (GRU) network, or another network suitable for extracting sequence features. Finally, the encoding result is input into the second fully-connected network 203 to obtain the phoneme posterior probability of each speech frame in the sound driving data, as sketched below.
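The sketch below is a minimal PyTorch rendering of this structure (first fully-connected network, encoding sub-network, second fully-connected network). The choice of a bidirectional GRU rather than CBHG, and all layer sizes including the phoneme-table size of 100, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoundFeatureExtractionNet(nn.Module):
    """First FC network -> encoding sub-network (GRU) -> second FC network."""
    def __init__(self, feat_dim=13, hidden=256, num_phonemes=100):  # sizes are assumed
        super().__init__()
        self.first_fc = nn.Linear(feat_dim, hidden)
        # bidirectional GRU as the encoding sub-network (a CBHG module could be used instead)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.second_fc = nn.Linear(2 * hidden, num_phonemes)

    def forward(self, acoustic_feats):                 # (batch, frames, feat_dim)
        x = torch.relu(self.first_fc(acoustic_feats))  # first acoustic feature sequence
        x, _ = self.encoder(x)                         # feature encoding of the sequence
        logits = self.second_fc(x)                     # (batch, frames, num_phonemes)
        # softmax over phonemes gives the phoneme posterior probability of each frame
        return torch.softmax(logits, dim=-1)

probs = SoundFeatureExtractionNet()(torch.randn(1, 120, 13))
print(probs.shape)                                     # torch.Size([1, 120, 100])
```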
In the embodiments of the present disclosure, the sound features are converted into a sequence, features are extracted by an encoding network suitable for extracting sequence features, and the phoneme posterior probability of each speech frame in the sound driving data can then be accurately predicted through the classification performed by the fully-connected network.
In some embodiments, the pose parameter values corresponding to the phoneme posterior probabilities of the speech frames in the sound driving data may be predicted through a time-sequence network and a fully-connected network, so that historical phoneme posterior probabilities that are relevant to the current phoneme posterior probability are fused with it; the historical pose parameter values thus influence the change of the current pose parameter values, making the changes in the pose parameter values of the interactive object smoother and more natural.
Fig. 3 illustrates a schematic diagram of the mapping process of the phoneme posterior probabilities shown in at least one embodiment of the present disclosure. As shown in Fig. 3, first, the phoneme posterior probability of each speech frame is input into the time-sequence network 301, and associated feature information is output. The time-sequence network may be a recurrent neural network, such as an LSTM, which can learn the history of the input phoneme posterior probabilities, so that the output associated feature information includes the influence of the historical information on the current information. Next, the associated feature information is input into the third fully-connected network 302 to obtain an associated feature sequence. Finally, the associated feature sequence is activated by an activation layer 303, and each feature value in the associated feature sequence is transformed into a pose parameter value, thereby obtaining the pose parameter values of the interactive object matching the phoneme posterior probability of each speech frame, as sketched below.
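A corresponding PyTorch sketch of this mapping network (time-sequence network, third fully-connected network, activation layer) follows. The LSTM size, the number of pose parameters and the sigmoid activation are assumptions for illustration rather than details given in the disclosure.

```python
import torch
import torch.nn as nn

class PoseParameterNet(nn.Module):
    """Time-sequence network (LSTM) -> third FC network -> activation -> pose parameters."""
    def __init__(self, num_phonemes=100, hidden=128, num_pose_params=37):  # sizes assumed
        super().__init__()
        self.temporal = nn.LSTM(num_phonemes, hidden, batch_first=True)
        self.third_fc = nn.Linear(hidden, num_pose_params)

    def forward(self, phoneme_posteriors):              # (batch, frames, num_phonemes)
        assoc, _ = self.temporal(phoneme_posteriors)    # associated feature information
        assoc_seq = self.third_fc(assoc)                # associated feature sequence
        # activation maps each feature value to a pose parameter value (assumed range 0..1)
        return torch.sigmoid(assoc_seq)

pose = PoseParameterNet()(torch.rand(1, 120, 100))
print(pose.shape)                                       # torch.Size([1, 120, 37])
```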
In some embodiments, the pose parameters of the interactive object include facial pose control parameters, and the interactive object may be driven to achieve a facial pose matching each speech frame in the sound driving data according to the facial pose control parameters matching the phoneme posterior probability of each speech frame. The facial pose parameters may include, for example, facial muscle control coefficients.
From an anatomical point of view, the movement of the face is the result of the coordinated deformation of the muscles of its various parts. Therefore, the face of the interactive object is divided into facial muscles to obtain a facial muscle model, and the motion of each muscle (region) obtained by the division is controlled through the corresponding facial muscle control coefficient, that is, contraction/expansion control is performed on the muscle, so that the face of the interactive object can make various expressions. For each muscle of the facial muscle model, the motion states corresponding to different muscle control coefficients can be set according to the facial position of the muscle and its motion characteristics. For example, the control coefficient of the upper lip muscle has a value range of (0 to 1), and different values within this range correspond to different contracted/expanded states of the upper lip muscle, so that longitudinal opening and closing of the mouth can be achieved by changing this value; the control coefficient of the left mouth-corner muscle also has a value range of (0 to 1), and lateral changes of the mouth can be achieved by changing the contracted/expanded state of the left mouth-corner muscle through its control coefficient.
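As a small illustration of applying such per-muscle control coefficients, the sketch below clamps each coefficient to its (0, 1) range and hands it to a rendering callback. The muscle names and the set_muscle_contraction interface are hypothetical placeholders, not part of the disclosure.

```python
# Hypothetical sketch: apply facial muscle control coefficients (each in 0..1).
# Muscle names and the set_muscle_contraction callback are assumed placeholders.
def apply_facial_pose(frame_coefficients, set_muscle_contraction):
    for muscle, value in frame_coefficients.items():
        clamped = min(max(value, 0.0), 1.0)   # keep within the (0, 1) control range
        set_muscle_contraction(muscle, clamped)

frame_coefficients = {"upper_lip": 0.6,        # vertical opening of the mouth
                      "left_mouth_corner": 0.2}
apply_facial_pose(frame_coefficients, lambda muscle, v: print(muscle, v))
```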
When speech is output according to the sound driving data, the interactive object is driven to make facial expressions according to the facial pose control parameters corresponding to the sound driving data, so that the interactive object synchronously makes the mouth shapes and expressions of producing the speech while the speech is output. This gives the target object the sensation that the interactive object is speaking, and improves the interaction experience of the target object.
Fig. 4 is a flow chart of a phoneme processing method as set forth in at least one embodiment of the present disclosure. As shown in fig. 4, the method includes steps 401 to 402.
In step 401, a phone list including multiple languages is obtained based on phones in multiple target languages.
In one example, a phone table containing multiple languages may be obtained by: acquiring phonemes in a plurality of target languages for splicing; and combining phonemes with pronunciation similarity exceeding a first set threshold in the splicing result, so that a phoneme list containing a plurality of target languages can be conveniently and quickly obtained.
In another example, a phone table containing multiple languages may be obtained by: first, phonemes in a plurality of target languages are mapped to international phonetic symbols whose pronunciation similarity satisfies a similarity condition, for example, the same pronunciation or highest similarity. And combining the international phonetic symbols with the same pronunciation in the mapping result to obtain the phone table containing multiple languages. The method is suitable for various target languages and has universality.
In some embodiments, in response to the presence of a first phoneme in the plurality of target languages having a pronunciation similarity to each international phonetic symbol less than or equal to the second set threshold, the first phoneme is added to the multilingual-containing phonemic table.
It should be understood by those skilled in the art that the first set threshold value and the second set threshold value may be specifically set according to actual needs, which is not limited in this disclosure.
In step 402, based on the phone table containing multiple languages, a sound feature extraction network is trained, where the sound feature extraction network is used to extract a phone posterior probability of a speech frame to be recognized.
According to the embodiment of the disclosure, the phoneme list containing multiple languages is utilized to train the sound feature extraction network, so that the efficiency and quality of training the feature extraction network can be improved, the network is utilized to extract the phoneme posterior feature of the sound driving data, and the phoneme posterior feature is used as the sound feature to drive the interactive object.
In some embodiments, a multilingual-supported corpus can be constructed according to the following method.
Firstly, a multilingual voice sample is obtained, wherein the language type of the voice sample is the same as the language type contained in the multilingual phonemic form.
And then, performing phoneme alignment operation on the voice sample to obtain phonemes contained in the voice sample.
Finally, labeling the phonemes in the speech samples by using the phonemes in the multilingual phonemic table.
In the embodiment of the disclosure, the phonemes in the multilingual phoneme list can be directly called for labeling by utilizing the multilingual phoneme list, so that a corpus with high quality, complete labeling and balanced data can be conveniently and efficiently constructed.
In some embodiments, the acoustic feature extraction network may be trained by the following method.
Firstly, inputting acoustic features of the marked voice samples into the voice feature extraction network to obtain phoneme posterior probability of each voice frame in the voice samples. Wherein, each voice frame in the labeled voice sample is labeled with a true value of a phoneme.
Next, the parameter values of the sound feature extraction network are adjusted according to the difference between the phoneme indicated by the maximum phoneme posterior probability of each speech frame and the labeled true value. When the change in the network loss satisfies the convergence condition, for example when the change in the network loss is smaller than a set threshold, or when the number of iterations reaches a set number, training is completed and the trained sound feature extraction network is obtained.
Fig. 5 is a schematic structural view of a driving apparatus of an interactive object according to at least one embodiment of the present disclosure, as shown in fig. 5, the apparatus may include: a first acquiring unit 501 configured to acquire acoustic features of sound driving data of an interactive object; a second obtaining unit 502, configured to perform feature extraction on the acoustic feature by using a sound feature extraction network, so as to obtain a phoneme posterior probability of each speech frame in the sound driving data; the sound feature extraction network is trained according to a phone table containing multiple languages; a third obtaining unit 503, configured to obtain an attitude parameter value of the interactive object according to the phoneme posterior probability of each speech frame; a control unit 504 for controlling the pose of the interactive object according to the pose parameter values.
In some embodiments, the first obtaining unit is specifically configured to: acquiring a voice frame sequence corresponding to voice driving data of the interactive object; and obtaining the acoustic characteristics of the sound driving data according to the acoustic characteristic vectors of the voice frames in the voice frame sequence.
In some embodiments, the sound feature extraction network includes a first fully-connected network, an encoding sub-network, and a second fully-connected network, and the second obtaining unit is specifically configured to: inputting the sound features into the first fully-connected network to obtain a first acoustic feature sequence output by the first fully-connected network; performing feature encoding processing on the first acoustic feature sequence by using the encoding sub-network; and inputting the encoding result into the second fully-connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
In some embodiments, the third obtaining unit is specifically configured to: inputting the phoneme posterior probability of each voice frame to a time sequence network, and outputting associated characteristic information; inputting the associated feature information into a third fully-connected network to obtain an associated feature sequence; and activating the associated feature sequence to obtain the attitude parameter values of the interactive objects matched with the phoneme posterior probability of each voice frame.
In some embodiments, the control parameters of the interactive object include facial pose control parameters, and the control unit is specifically configured to: and driving the interactive object to realize the facial gesture matched with each voice frame in the voice driving data according to the facial gesture control parameters matched with the phoneme posterior probability of each voice frame.
Fig. 6 is a schematic structural diagram of a phoneme processing apparatus according to at least one embodiment of the present disclosure. As shown in fig. 6, the apparatus may include: a phoneme table obtaining unit 601, configured to obtain a phoneme table containing multiple languages according to phonemes in multiple target languages; and a training unit 602, configured to train a sound feature extraction network based on the phoneme table containing multiple languages, where the sound feature extraction network is used to extract the phoneme posterior probability of a speech frame to be recognized.
In some embodiments, the phonemic table acquiring unit is specifically configured to: acquiring phonemes in a plurality of target languages for splicing; combining phonemes with pronunciation similarity exceeding a first set threshold value in the splicing result to obtain the phoneme list containing multiple languages; training to obtain a sound feature extraction network based on the phone table containing multiple languages.
In some embodiments, the phoneme table obtaining unit is specifically configured to: map the phonemes in the plurality of target languages to international phonetic symbols whose pronunciation similarity meets a preset similarity condition; and merge international phonetic symbols with the same pronunciation in the mapping result to obtain the multilingual phoneme table.
In some embodiments, in response to the presence of a first phoneme in the plurality of target languages whose pronunciation similarity to each international phonetic symbol is less than or equal to a second set threshold, the first phoneme is added to the multilingual phoneme table.
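The two preceding embodiments (IPA mapping plus a fallback for unmatched phonemes) could be sketched together as follows; all helper names and thresholds are assumptions.

```python
def map_to_ipa(phonemes_per_language, ipa_symbols, similarity, second_threshold=0.5):
    """Map each phoneme to its most similar IPA symbol when the similarity condition
    is met, merge identical IPA symbols, and keep unmappable phonemes as extra entries."""
    table = set()
    for phonemes in phonemes_per_language:
        for phoneme in phonemes:
            best = max(ipa_symbols, key=lambda s: similarity(phoneme, s))
            if similarity(phoneme, best) > second_threshold:
                table.add(best)       # identical IPA symbols are merged via the set
            else:
                table.add(phoneme)    # first phoneme with no sufficiently close IPA match
    return sorted(table)
```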
In some embodiments, the apparatus further includes a labeling unit configured to: obtain multilingual speech samples, where the languages of the speech samples are the same as the languages contained in the multilingual phoneme table; perform a phoneme alignment operation on the speech samples to obtain the phonemes contained in the speech samples; and label the phonemes in the speech samples with the phonemes in the multilingual phoneme table.
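Assuming a forced aligner has already produced phoneme intervals, the frame-level labeling described above might look like the following sketch; the tuple format, hop size, and the "sil" fallback are assumptions.

```python
def frame_labels_from_alignment(alignment, num_frames, hop_seconds=0.010, table=None):
    """Turn a phoneme alignment into per-frame labels drawn from the multilingual
    phoneme table. 'alignment' is assumed to be (start_sec, end_sec, phoneme) tuples."""
    labels = []
    for i in range(num_frames):
        t = i * hop_seconds
        label = next((ph for start, end, ph in alignment if start <= t < end), "sil")
        if table is not None and label not in table:
            label = "sil"   # fallback for phonemes outside the table (assumption)
        labels.append(label)
    return labels
```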
In some embodiments, the training unit is specifically configured to: input the acoustic features of the labeled speech samples into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples; and adjust the parameter values of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of a speech frame and the labeled ground truth.
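One common way to realize the described parameter adjustment is frame-level cross-entropy training, sketched below; the specific loss and optimizer usage are assumptions rather than requirements of the embodiment.

```python
import torch
import torch.nn as nn

def train_step(net, optimizer, acoustic_features, frame_label_ids):
    """One training step: the gap between the phoneme indicated by the maximum
    posterior and the labeled ground truth is realized here as a frame-level
    negative log-likelihood loss (an illustrative choice)."""
    net.train()
    posteriors = net(acoustic_features)                  # (batch, frames, num_phonemes)
    log_probs = torch.log(posteriors.clamp_min(1e-8))
    # NLLLoss expects (batch, classes, frames) against integer labels (batch, frames).
    loss = nn.NLLLoss()(log_probs.transpose(1, 2), frame_label_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                     # adjust network parameter values
    return loss.item()
```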
At least one embodiment of the present disclosure further provides an electronic device, as shown in fig. 7, where the device includes a memory, and a processor, where the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement, when executing the computer instructions, a method for driving an interactive object according to any embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for driving an interactive object according to any embodiment of the present disclosure.
At least one embodiment of the present disclosure also provides a computer program product, including a computer program, which when executed by a processor implements the method for driving an interactive object according to any embodiment of the present disclosure.
One skilled in the relevant art will recognize that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of this specification can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the description of the data processing apparatus embodiments is relatively brief because they are substantially similar to the method embodiments; for relevant details, reference may be made to the description of the method embodiments.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general purpose and/or special purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit will receive instructions and data from a read only memory and/or a random access memory. The essential elements of a computer include a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks, etc. However, a computer does not have to have such a device. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may act in certain combinations and even be initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description of the preferred embodiments is merely intended to illustrate the embodiments of the present invention, and is not intended to limit the present invention to the particular embodiments described.

Claims (7)

1. A method of driving an interactive object, the method comprising:
acquiring acoustic features of sound driving data of an interactive object;
performing feature extraction on the acoustic features by using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, wherein the sound feature extraction network is trained according to a multilingual phoneme table;
inputting the phoneme posterior probability of each speech frame into a temporal sequence network to output associated feature information;
inputting the associated feature information into a third fully-connected network to obtain an associated feature sequence;
activating the associated feature sequence to obtain pose parameter values of the interactive object matching the phoneme posterior probability of each speech frame; and
controlling a pose of the interactive object according to the pose parameter values.
2. The method according to claim 1, wherein the acquiring acoustic features of the sound driving data of the interactive object comprises:
acquiring a speech frame sequence corresponding to the sound driving data of the interactive object; and
obtaining the acoustic features of the sound driving data according to acoustic feature vectors of speech frames in the speech frame sequence.
3. The method according to claim 1, wherein the sound feature extraction network includes a first fully-connected network, an encoding sub-network, and a second fully-connected network, and performing feature extraction on the acoustic features by using the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data comprises:
inputting the acoustic features into the first fully-connected network to obtain a first acoustic feature sequence output by the first fully-connected network;
performing feature encoding on the first acoustic feature sequence by using the encoding sub-network; and
inputting the encoding result into the second fully-connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
4. The method according to any one of claims 1 to 3, wherein the pose parameters of the interactive object comprise facial pose parameters, and controlling the pose of the interactive object according to the pose parameter values comprises:
driving the interactive object to realize a facial pose matching each speech frame in the sound driving data according to the facial pose parameters matching the phoneme posterior probability of each speech frame.
5. A driving apparatus of an interactive object, the apparatus comprising:
a first obtaining unit, configured to obtain acoustic features of sound driving data of an interactive object;
a second obtaining unit, configured to perform feature extraction on the acoustic features by using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, wherein the sound feature extraction network is trained according to a multilingual phoneme table;
a third obtaining unit, configured to input the phoneme posterior probability of each speech frame into a temporal sequence network to output associated feature information, input the associated feature information into a third fully-connected network to obtain an associated feature sequence, and activate the associated feature sequence to obtain pose parameter values of the interactive object matching the phoneme posterior probability of each speech frame; and
a control unit, configured to control a pose of the interactive object according to the pose parameter values.
6. An electronic device, comprising a memory and a processor, wherein the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the method according to any one of claims 1 to 4 when executing the computer instructions.
7. A computer readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 4.
CN202110604874.8A 2021-05-31 2021-05-31 Interactive object driving and phoneme processing method, device, equipment and storage medium Active CN113314104B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202110604874.8A CN113314104B (en) 2021-05-31 2021-05-31 Interactive object driving and phoneme processing method, device, equipment and storage medium
PCT/CN2022/089870 WO2022252890A1 (en) 2021-05-31 2022-04-28 Interaction object driving and phoneme processing methods and apparatus, device and storage medium
TW111119388A TW202248994A (en) 2021-05-31 2022-05-25 Method for driving interactive object and processing phoneme, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110604874.8A CN113314104B (en) 2021-05-31 2021-05-31 Interactive object driving and phoneme processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113314104A CN113314104A (en) 2021-08-27
CN113314104B true CN113314104B (en) 2023-06-20

Family

ID=77376708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110604874.8A Active CN113314104B (en) 2021-05-31 2021-05-31 Interactive object driving and phoneme processing method, device, equipment and storage medium

Country Status (3)

Country Link
CN (1) CN113314104B (en)
TW (1) TW202248994A (en)
WO (1) WO2022252890A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314104B (en) * 2021-05-31 2023-06-20 北京市商汤科技开发有限公司 Interactive object driving and phoneme processing method, device, equipment and storage medium
CN113724718B (en) 2021-09-01 2022-07-29 宿迁硅基智能科技有限公司 Target audio output method, device and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106098059A (en) * 2016-06-23 2016-11-09 上海交通大学 customizable voice awakening method and system
WO2018227780A1 (en) * 2017-06-12 2018-12-20 平安科技(深圳)有限公司 Speech recognition method and device, computer device and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10832129B2 (en) * 2016-10-07 2020-11-10 International Business Machines Corporation Transfer of an acoustic knowledge to a neural network
CN109377986B (en) * 2018-11-29 2022-02-01 四川长虹电器股份有限公司 Non-parallel corpus voice personalized conversion method
CN112259089B (en) * 2019-07-04 2024-07-02 阿里巴巴集团控股有限公司 Speech recognition method and device
CN110503942A (en) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 A kind of voice driven animation method and device based on artificial intelligence
CN110880315A (en) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 Personalized voice and video generation system based on phoneme posterior probability
CN111459450A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111459454B (en) * 2020-03-31 2021-08-20 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111933110B (en) * 2020-08-12 2021-10-29 北京字节跳动网络技术有限公司 Video generation method, generation model training method, device, medium and equipment
CN112017648A (en) * 2020-08-25 2020-12-01 北京声智科技有限公司 Weighted finite state converter construction method, speech recognition method and device
CN112669841B (en) * 2020-12-18 2024-07-02 平安科技(深圳)有限公司 Training method and device for generating model of multilingual voice and computer equipment
CN113314104B (en) * 2021-05-31 2023-06-20 北京市商汤科技开发有限公司 Interactive object driving and phoneme processing method, device, equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106098059A (en) * 2016-06-23 2016-11-09 上海交通大学 customizable voice awakening method and system
WO2018227780A1 (en) * 2017-06-12 2018-12-20 平安科技(深圳)有限公司 Speech recognition method and device, computer device and storage medium

Also Published As

Publication number Publication date
WO2022252890A1 (en) 2022-12-08
TW202248994A (en) 2022-12-16
CN113314104A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
WO2021169431A1 (en) Interaction method and apparatus, and electronic device and storage medium
CN108962217B (en) Speech synthesis method and related equipment
TWI766499B (en) Method and apparatus for driving interactive object, device and storage medium
JP7432556B2 (en) Methods, devices, equipment and media for man-machine interaction
CN111459452B (en) Driving method, device and equipment of interaction object and storage medium
CN108711421B (en) Speech recognition acoustic model establishing method and device and electronic equipment
US6526395B1 (en) Application of personality models and interaction with synthetic characters in a computing system
KR102449875B1 (en) Method for translating speech signal and electronic device thereof
CN111459454B (en) Interactive object driving method, device, equipment and storage medium
CN113314104B (en) Interactive object driving and phoneme processing method, device, equipment and storage medium
CN113067953A (en) Customer service method, system, device, server and storage medium
CN111460785B (en) Method, device and equipment for driving interactive object and storage medium
CN113689879B (en) Method, device, electronic equipment and medium for driving virtual person in real time
CN116311456A (en) Personalized virtual human expression generating method based on multi-mode interaction information
CN117275485B (en) Audio and video generation method, device, equipment and storage medium
CN115171673A (en) Role portrait based communication auxiliary method and device and storage medium
KR20130137367A (en) System and method for providing book-related service based on image
KR20210124306A (en) Interactive object driving method, apparatus, device and recording medium
CN112632262A (en) Conversation method, conversation device, computer equipment and storage medium
Tripathy Audio to Indian sign language interpreter (AISLI) using machine translation and NLP techniques
KR102370993B1 (en) Artificial Intelligence sign language service system with real-time translation and communication based on neural network
TWI823815B (en) Abstract generation methods and systems and computer program products
van de Loo et al. Towards a self-learning assistive vocal interface: vocabulary and grammar learning
CN117036556A (en) Virtual image driving method and device and robot
CN118410813A (en) Language learning method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40049329; Country of ref document: HK)
GR01 Patent grant