WO2022252890A1 - Interaction object driving and phoneme processing methods and apparatus, device and storage medium - Google Patents

Interaction object driving and phoneme processing methods and apparatus, device and storage medium

Info

Publication number
WO2022252890A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
sound
interactive object
feature extraction
speech
Prior art date
Application number
PCT/CN2022/089870
Other languages
French (fr)
Chinese (zh)
Inventor
吴文岩
吴潜溢
高娜
钱晨
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2022252890A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • G10L2021/105Synthesis of the lips movements from speech, e.g. for talking heads
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to an interactive object driving and phoneme processing method, apparatus, device, and storage medium.
  • the voice made by the digital human can be matched with the presented mouth shape, expression, movement, etc.
  • digital humans are required to support multiple languages in many scenarios.
  • the voice features extracted by the speech recognition model, or the voice features obtained by phoneme time stamps are usually used to drive the digital human.
  • these features differ across languages, and deep learning requires datasets covering the different languages of the digital human's application scenarios.
  • the current open source datasets have problems such as low quality, incomplete annotation, and unbalanced data.
  • the embodiment of the present disclosure provides an interactive object driving and phoneme processing solution.
  • a method for driving an interactive object, the method comprising: acquiring sound features of sound driving data of the interactive object; performing feature extraction on the sound features using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, wherein the sound feature extraction network is obtained by training with a phoneme table containing multiple languages; obtaining posture parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames; and controlling the posture of the interactive object according to the posture parameter values.
  • the acquiring the sound features of the sound driving data of the interactive object includes: acquiring a speech frame sequence corresponding to the sound driving data of the interactive object; and obtaining the sound features of the sound driving data according to the sound feature vector of each speech frame in the speech frame sequence.
  • the sound feature extraction network includes a first fully connected network, an encoding sub-network, and a second fully connected network, and the performing feature extraction on the sound features using the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data includes: inputting the sound features into the first fully connected network to obtain a first sound feature sequence output by the first fully connected network; performing feature encoding processing on the first sound feature sequence using the encoding sub-network; and inputting the encoding result into the second fully connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  • the obtaining the posture parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames includes: inputting the phoneme posterior probabilities of the respective speech frames into a time series network and outputting associated feature information; inputting the associated feature information into a third fully connected network to obtain an associated feature sequence; and performing activation processing on the associated feature sequence to obtain the posture parameter values of the interactive object matched with the phoneme posterior probabilities of the respective speech frames.
  • the control parameters of the interactive object include facial posture control parameters.
  • the controlling the posture of the interactive object according to the posture parameter values includes: driving the interactive object, according to the facial posture parameter values matched with the phoneme posterior probabilities of the respective speech frames, to realize facial postures matched with the respective speech frames in the sound driving data.
  • a phoneme processing method, comprising: obtaining a multilingual phoneme table based on phonemes in multiple target languages; and training a sound feature extraction network based on the multilingual phoneme table, wherein the sound feature extraction network is used to extract the phoneme posterior probability of a speech frame.
  • the obtaining the multilingual phoneme table according to the phonemes in the multiple target languages includes: acquiring and splicing the phonemes in the multiple target languages; and merging, in the splicing result, the phonemes whose pronunciation similarity exceeds a first set threshold to obtain the multilingual phoneme table.
  • the method further includes: mapping the phonemes in the multiple target languages to the International Phonetic Alphabet symbols whose pronunciation similarity satisfies a preset similarity condition; and merging the International Phonetic Alphabet symbols with the same pronunciation in the mapping result to obtain the multilingual phoneme table.
  • in response to the existence of a first phoneme in the multiple target languages whose pronunciation similarity with each International Phonetic Alphabet symbol is less than or equal to a second set threshold, the first phoneme is added to the multilingual phoneme table.
  • the method further includes: acquiring a multilingual speech sample, wherein the language type of the speech sample is the same as the language type included in the multilingual phoneme table; performing a phoneme alignment operation on the speech samples to obtain the phonemes included in the speech samples; using the phonemes in the multilingual phoneme table to mark the real values of the phonemes in the speech samples.
  • the method further includes: inputting the sound features of the marked speech samples into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples; for For each speech frame in the speech sample, adjust the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
  • a driving device for an interactive object, comprising: a first acquisition unit, configured to acquire sound features of sound driving data of the interactive object; a second acquisition unit, configured to perform feature extraction on the sound features using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, wherein the sound feature extraction network is obtained by training with a phoneme table containing multiple languages; a third acquisition unit, configured to obtain posture parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames; and a control unit, configured to control the posture of the interactive object according to the posture parameter values.
  • the first acquisition unit is specifically configured to: acquire a speech frame sequence corresponding to the sound driving data of the interactive object; and obtain the sound features of the sound driving data according to the sound feature vector of each speech frame in the speech frame sequence.
  • the sound feature extraction network includes a first fully connected network, an encoding sub-network and a second fully connected network.
  • the second acquisition unit is specifically configured to: input the sound features into the first fully connected network to obtain a first sound feature sequence output by the first fully connected network; perform feature encoding processing on the first sound feature sequence using the encoding sub-network; and input the encoding result into the second fully connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  • the third acquisition unit is specifically configured to: input the phoneme posterior probabilities of the respective speech frames into a time series network and output associated feature information; input the associated feature information into a third fully connected network to obtain an associated feature sequence; and perform activation processing on the associated feature sequence to obtain the posture parameter values of the interactive object matched with the phoneme posterior probabilities of the respective speech frames.
  • the control parameters of the interactive object include facial posture control parameters.
  • the control unit is specifically configured to: drive the interactive object, according to the facial posture parameter values matched with the phoneme posterior probabilities of the respective speech frames, to realize facial postures matched with the respective speech frames in the sound driving data.
  • a phoneme processing device, comprising: a phoneme table acquisition unit, configured to obtain a multilingual phoneme table based on phonemes in multiple target languages; and a training unit, configured to train a sound feature extraction network based on the multilingual phoneme table, wherein the sound feature extraction network is used to extract the phoneme posterior probability of a speech frame.
  • the phoneme table acquisition unit is specifically configured to: acquire and splice phonemes in multiple target languages; and merge, in the splicing result, the phonemes whose pronunciation similarity exceeds the first set threshold to obtain the multilingual phoneme table, based on which the sound feature extraction network is obtained through training.
  • the phoneme table acquisition unit is specifically configured to: map phonemes in multiple target languages to the International Phonetic Alphabet symbols whose pronunciation similarity satisfies a preset similarity condition; and merge the International Phonetic Alphabet symbols with the same pronunciation in the mapping result to obtain the multilingual phoneme table.
  • the device further includes a labeling unit, configured to: acquire multilingual speech samples, wherein the language types of the speech samples are the same as the language types contained in the multilingual phoneme table; perform a phoneme alignment operation on the speech samples to obtain the phonemes contained in the speech samples; and annotate the ground-truth values of the phonemes in the speech samples using the phonemes in the multilingual phoneme table.
  • the training unit is specifically configured to: input the sound features of the marked speech samples into the sound feature extraction network, and obtain the phoneme posterior probability of each speech frame in the speech samples ; For each speech frame in the speech sample, adjust the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
  • an electronic device, comprising a memory and a processor, wherein the memory is used for storing computer instructions executable on the processor, and the processor is used for implementing, when executing the computer instructions, the method for driving an interactive object described in any implementation manner provided by the present disclosure.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object described in any implementation manner provided by the present disclosure is implemented.
  • a computer program product including a computer program, and when the program is executed by a processor, the method for driving an interactive object described in any implementation manner provided by the present disclosure is implemented.
  • Fig. 1 is a flow chart of a method for driving an interactive object proposed by at least one embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a feature encoding process for a phoneme sequence proposed by at least one embodiment of the present disclosure
  • Fig. 3 is a schematic diagram of a mapping process of a phoneme posterior probability shown in at least one embodiment of the present disclosure
  • FIG. 4 is a flowchart of a phoneme processing method proposed by at least one embodiment of the present disclosure
  • Fig. 5 is a schematic structural diagram of a driving device for an interactive object proposed by at least one embodiment of the present disclosure
  • Fig. 6 is a schematic structural diagram of a phoneme processing device proposed by at least one embodiment of the present disclosure.
  • Fig. 7 is a schematic structural diagram of an electronic device proposed by at least one embodiment of the present disclosure.
  • At least one embodiment of the present disclosure provides a method for driving an interactive object.
  • the driving method can be executed by an electronic device such as a terminal device or a server.
  • the terminal device can be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game machine, desktop computer, advertising machine, all-in-one machine, vehicle-mounted terminal, etc.
  • the server includes a local server or a cloud server, etc.
  • the method can also be realized by calling a computer-readable instruction stored in a memory by a processor.
  • the interactive object, which can interact with the target object, may be a virtual character, or a virtual animal, virtual item, cartoon image, or any other virtual image capable of realizing interactive functions.
  • the virtual image may be presented in 2D or 3D form, which is not limited in the present disclosure.
  • the target object can be a user, a robot, or other smart devices.
  • the interactive object can be displayed through a terminal device, and the terminal device can be a TV, an all-in-one machine with a display function, a projector, a virtual reality (Virtual Reality, VR) device, an augmented reality (Augmented Reality, AR) device etc., the present disclosure does not limit the specific form of the terminal device.
  • in response to the terminal device receiving sound driving data for driving the interactive object to output speech, the interactive object can emit a specified voice to the target object.
  • sound driving data can be generated to drive the interactive object to respond by issuing a specified voice, thereby providing anthropomorphic services for the target object.
  • the interactive object can interact with the target object in different languages. In order to make the posture of the interactive object match the real pronunciation in different languages, at least one embodiment of the present disclosure proposes a driving method for the interactive object.
  • FIG. 1 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 1 , the method includes steps 101 to 104 .
  • In step 101, the sound features of the sound driving data of the interactive object are acquired.
  • the sound driving data may include audio data (speech data), text, and the like.
  • in response to the sound driving data being audio data, the audio data can be directly used to drive the interactive object to output speech, that is, the terminal device outputs speech directly from the audio data; in response to the sound driving data being text, corresponding phonemes can be generated based on the speech contained in the text, and the interactive object is driven to output speech through the generated phonemes.
  • the text may first be converted into pinyin, and then corresponding phonemes may be generated according to the pinyin.
  • the sound driving data may also be other forms of driving data, which is not limited in the present disclosure.
  • the sound driving data may be the driving data generated according to the action, expression, identity, preference, etc. of the target object interacting with the interactive object, or it may be the sound driving data called by the terminal device from the internal memory .
  • the present disclosure does not limit the acquisition method of the sound driving data.
  • the audio data can be split into a plurality of speech frames, and the speech frames are combined according to their states to form phonemes; the phonemes formed according to the audio data then form a phoneme sequence.
  • a phoneme is the smallest speech unit divided according to the natural attributes of speech, and a pronunciation action of a real person can form a phoneme.
  • the phonemes included in the morphemes may be obtained according to the morphemes included in the text, so as to obtain a corresponding phoneme sequence.
  • the phoneme sequence corresponding to the sound driving data can also be obtained in other ways, which is not limited in the present disclosure.
  • the sound features may be features related to speech emotion, such as fundamental frequency features, formant features, Mel-Frequency Cepstral Coefficients (MFCC), and the like.
  • In step 102, feature extraction is performed on the sound features using a sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  • the above-mentioned sound feature extraction network is obtained by training through a multilingual phoneme table.
  • the phoneme posterior probability represents the posterior probability that the speech frame corresponds to each phoneme in the phoneme table.
  • the phoneme posterior probability has nothing to do with the speaker, but only with the content of the speech.
  • for example, if the phoneme table contains phoneme 1, phoneme 2 and phoneme 3, the phoneme posterior probability of a speech frame includes the probability that the speech frame corresponds to phoneme 1, the probability that it corresponds to phoneme 2, and the probability that it corresponds to phoneme 3.
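  • As a toy illustration only (the four-entry phoneme table below is hypothetical, not one from the disclosure), the phoneme posterior probability of a single speech frame can be pictured as one probability per phoneme-table entry:

```python
# Toy illustration: a per-frame phoneme posterior over a tiny, hypothetical phoneme table.
phoneme_table = ["b", "a", "i", "sil"]
frame_posterior = [0.05, 0.85, 0.05, 0.05]   # one probability per phoneme, summing to 1

# The phoneme this frame most likely corresponds to is the one with the largest posterior.
best = max(range(len(frame_posterior)), key=lambda k: frame_posterior[k])
print(phoneme_table[best])                   # -> "a"
```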
  • the sound feature extraction network used to extract the phoneme posterior probability of each speech frame in the sound driving data is obtained by training through a phoneme table including multiple languages.
  • the multilingual phoneme table can be obtained in the following manner: the phonemes in multiple target languages are acquired and spliced, and the phonemes whose pronunciation similarity exceeds the first set threshold in the splicing result are merged, so that phoneme tables covering multiple target languages can be obtained conveniently and quickly.
  • for example, phonemes in Chinese can be concatenated with phonemes in English, and phonemes with the same or similar pronunciation in the concatenated result, such as "b", "p", "m", "f", are merged, so that a phoneme table covering Chinese and English is obtained.
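  • A minimal sketch of this splice-and-merge construction is given below; the similarity function and threshold are illustrative placeholders, not values prescribed by the disclosure:

```python
# Sketch: build a multilingual phoneme table by splicing two per-language phoneme
# inventories and merging entries whose pronunciations are similar enough.
def build_phoneme_table(phonemes_a, phonemes_b, similarity, threshold=0.9):
    spliced = list(phonemes_a) + list(phonemes_b)      # splice the two inventories
    table = []
    for phoneme in spliced:
        # Merge this phoneme into an existing entry when one sounds alike enough.
        if not any(similarity(existing, phoneme) > threshold for existing in table):
            table.append(phoneme)
    return table

# Toy usage: "b", "p", "m", "f" appear in both inventories and collapse to single entries.
chinese = ["b", "p", "m", "f", "a1", "a2", "a3"]
english = ["b", "p", "m", "f", "i"]
same = lambda x, y: 1.0 if x == y else 0.0             # placeholder pronunciation similarity
print(build_phoneme_table(chinese, english, same))
# ['b', 'p', 'm', 'f', 'a1', 'a2', 'a3', 'i']
```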
  • the multilingual phoneme table can also be obtained in the following manner: first, the phonemes in the multiple target languages are respectively mapped to the International Phonetic Alphabet (IPA) symbols whose pronunciation similarity satisfies the similarity condition, the similarity condition being, for example, identical pronunciation or the highest similarity; next, the IPA symbols with the same pronunciation in the mapping result are merged to obtain the multilingual phoneme table. This method is applicable to a variety of target languages and is universally applicable.
  • all phonemes in Chinese can be mapped to the IPA with the highest pronunciation similarity
  • all phonemes in English can be mapped to the IPA with the highest pronunciation similarity
  • the IPA symbols mapped from Chinese and English can be stored in a phoneme table, the phonemes with the same pronunciation are merged, and a phoneme table supporting Chinese and English is obtained.
  • for example, suppose the Chinese phonemes include a1, a2, a3, b, i1, i2, i3, ii1, ii2, ii3 (where 1, 2, and 3 represent tones), and the English phonemes include a, b, i.
  • the IPA table contains a, b, i.
  • the phonemes in Chinese and English are each mapped to the IPA symbol with the highest similarity: the Chinese sequence is mapped to a, a, a, b, i, i, i, i, i, i (since there is no ii pronunciation in the IPA, and the actual pronunciation of ii is most similar to i, ii is mapped to i).
  • the English phonemes are mapped to a, b, i in turn.
  • the first phonemes are added to the multilingual phoneme table.
  • for example, the phoneme "ng" in Chinese does not exist in the IPA table, and its pronunciation similarity with the other pronunciations is less than the second set threshold; or, a certain phoneme in Chinese is composed of several other pronunciations, so that its similarity with the entries in the IPA table is also less than the second set threshold. Such a phoneme is called a first phoneme. The first phoneme is reserved and appended to the end of the IPA table, that is, the final phoneme table contains the first phonemes in addition to all of the IPA symbols.
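  • A sketch of this IPA-based construction is given below; the similarity function, the second set threshold value, and the toy inventories are illustrative assumptions:

```python
# Sketch: map each language phoneme to its most similar IPA symbol (so phonemes mapped
# to the same symbol are merged), and append any "first phoneme" with no sufficiently
# similar IPA symbol (e.g. the Chinese "ng") to the end of the table.
def build_ipa_phoneme_table(language_phonemes, ipa_symbols, similarity, second_threshold=0.5):
    table = list(ipa_symbols)                       # start from the IPA inventory
    first_phonemes = []                             # phonemes without a close IPA match
    for phoneme in language_phonemes:
        best_score = max(similarity(phoneme, ipa) for ipa in ipa_symbols)
        if best_score <= second_threshold and phoneme not in first_phonemes:
            first_phonemes.append(phoneme)          # keep it as its own entry
        # otherwise the phoneme is represented by its best-matching IPA symbol
    return table + first_phonemes

# Toy usage mirroring the example above: "ii" variants map to "i", "ng" is appended.
ipa = ["a", "b", "i"]
chinese = ["a1", "a2", "a3", "b", "i1", "ii1", "ng"]
sim = lambda p, q: 1.0 if p.rstrip("123").replace("ii", "i") == q else 0.0
print(build_ipa_phoneme_table(chinese, ipa, sim))   # ['a', 'b', 'i', 'ng']
```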
  • the first set threshold and the second set threshold can be set according to actual needs, which is not limited in the present disclosure.
  • the multilingual phoneme table can be used to directly annotate multilingual speech samples, so that a high-quality, completely annotated, and data-balanced corpus can be constructed conveniently and efficiently for training the sound feature extraction network.
  • In step 103, the posture parameter values of the interactive object are obtained according to the phoneme posterior probabilities of the respective speech frames.
  • the pose parameter value of the interactive object matching the sound driving data may be obtained according to the phoneme posterior probability of each speech frame in the sound driving data.
  • the posture parameter is used to control the posture of the interactive object, and the interactive object can be driven to make a corresponding posture by using different posture parameter values.
  • the posture parameters may include facial posture parameters, which are used to control the facial postures of the interactive object, including expressions, mouth shapes, facial features, and head postures; in embodiments of the present disclosure, a correspondence between phoneme posterior probabilities and posture parameter values of the interactive object may be pre-established, so that once the phoneme posterior probability of each speech frame in the sound driving data is obtained, the posture parameter values corresponding to the sound driving data can be obtained.
  • the specific form of the attitude parameter can be determined according to the type of the interactive object model.
  • In step 104, the posture of the interactive object is controlled according to the posture parameter values.
  • since the posture parameter values are matched with the phoneme posterior probabilities of the respective speech frames in the sound driving data of the interactive object, and the phoneme posterior probability is language-independent, for speech data and texts in different languages the postures (such as mouth shapes, expressions, actions, etc.) presented by the interactive object can be matched with the actual pronunciation, giving the target object interacting with the interactive object the feeling that the interactive object is speaking.
  • in the embodiments of the present disclosure, the sound features of the sound driving data of the interactive object are first obtained, and the sound feature extraction network is used to perform feature extraction on the sound features to obtain the phoneme posterior probability of each speech frame in the sound driving data; the posture parameter values of the interactive object are then obtained according to the phoneme posterior probabilities of the respective speech frames, and the posture of the interactive object is controlled according to the posture parameter values. Because the phoneme posterior probability is independent of the speaker and can support multiple languages, training the sound feature extraction network with the multilingual phoneme table and using this network to extract the phoneme posterior probabilities of the sound driving data as the sound feature for driving the interactive object makes the posture of the interactive object match the real pronunciation in different languages.
  • the multilingual corpus can be constructed according to the following method.
  • multilingual speech samples are acquired, and the language types of the speech samples are the same as the language types contained in the multilingual phoneme table.
  • the phoneme table is a phoneme table supporting Chinese and English
  • the Chinese speech samples and the English speech samples are obtained respectively.
  • a phoneme alignment operation is performed on the speech samples to obtain phonemes included in the speech samples.
  • the pronunciation start and end time of each phoneme in the speech segment can be obtained: n[0,0.2] , i3[0.2,0.4], h[0.5,0.7], ao3[0.7,1.2], where [] indicates the start and end time of pronunciation of each phoneme in seconds.
  • the phoneme corresponding to each speech frame in the speech sample can be determined through the pronunciation start and end time of each phoneme.
  • the ground-truth values of the phonemes in the speech samples are annotated with the phonemes in the multilingual phoneme table.
  • regardless of the language of a speech sample, the phonemes in the multilingual phoneme table can be directly used for labeling, so that a high-quality, completely annotated, and data-balanced corpus can be constructed conveniently and efficiently.
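  • The per-frame ground-truth labels can be derived from the alignment result as in the sketch below; the 10 ms frame hop and the "sil" label for unvoiced gaps are assumed values, not values prescribed by the disclosure:

```python
# Sketch: convert a phoneme alignment (as in the example above: n[0,0.2], i3[0.2,0.4],
# h[0.5,0.7], ao3[0.7,1.2], times in seconds) into one ground-truth label per speech frame.
def frame_labels(alignment, num_frames, hop_s=0.01, silence="sil"):
    """alignment: list of (phoneme, start_s, end_s); returns one label per speech frame."""
    labels = []
    for frame in range(num_frames):
        t = frame * hop_s                            # time of this speech frame
        label = silence                              # default for gaps between phonemes
        for phoneme, start, end in alignment:
            if start <= t < end:
                label = phoneme
                break
        labels.append(label)
    return labels

alignment = [("n", 0.0, 0.2), ("i3", 0.2, 0.4), ("h", 0.5, 0.7), ("ao3", 0.7, 1.2)]
print(frame_labels(alignment, num_frames=120)[18:24])  # ['n', 'n', 'i3', 'i3', 'i3', 'i3']
```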
  • the sound feature extraction network can be trained by the following method.
  • the sound features of the marked speech samples are input to the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples.
  • each speech frame in the marked speech sample is marked with a real value of a phoneme.
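  • A minimal PyTorch-style sketch of one training step is given below. It assumes the network outputs per-frame posteriors and uses a negative log-likelihood objective against the annotated phoneme of each speech frame, which is one common way of adjusting the parameters according to the difference between the predicted and labelled phoneme; shapes and names are assumptions, not part of the disclosure:

```python
# Sketch of one training step for the sound feature extraction network.
import torch
import torch.nn.functional as F

def train_step(feature_net, optimizer, mfcc, frame_phoneme_ids):
    """mfcc: (1, T, feat_dim) sound features; frame_phoneme_ids: (T,) ground-truth indices."""
    optimizer.zero_grad()
    posteriors = feature_net(mfcc).squeeze(0)        # (T, num_phonemes) per-frame probabilities
    log_probs = torch.log(posteriors + 1e-8)         # NLL loss expects log-probabilities
    loss = F.nll_loss(log_probs, frame_phoneme_ids)  # penalize mismatch with the labelled phoneme
    loss.backward()
    optimizer.step()                                 # adjust the network parameters
    return loss.item()
```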
  • the voice frame sequence corresponding to the voice driving data of the interactive object may be obtained, and the voice features of the voice driving data may be obtained according to the voice feature vectors of each voice frame in the voice frame sequence.
  • the MFCC matrix corresponding to the sound driving data can be obtained.
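  • A sketch of obtaining such an MFCC matrix with librosa is given below; the sampling rate, window length, hop length and number of coefficients are assumed values:

```python
# Sketch: compute the MFCC matrix of the sound driving data, one feature vector per speech frame.
import librosa

def sound_features(wav_path, sr=16000, n_mfcc=13):
    audio, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)   # 25 ms windows, 10 ms hop
    return mfcc.T                                            # shape (num_frames, n_mfcc)
```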
  • Fig. 2 shows a schematic diagram of a sound feature extraction process shown in at least one embodiment of the present disclosure.
  • the present disclosure utilizes a sound feature extraction network 200 to perform feature extraction on the sound features of the sound driving data, so as to obtain the phoneme posterior probability of each speech frame in the sound feature data.
  • the sound feature extraction network 200 includes a first fully connected network 201 , an encoding sub-network 202 and a second fully connected network 203 .
  • first, the sound features are input into the first fully connected network 201 to obtain the first sound feature sequence output by the first fully connected network; then, the coding sub-network 202 is used to perform feature encoding processing on the first sound feature sequence to obtain the encoding result.
  • the coding sub-network can be, for example, a CBHG network, a Gated Recurrent Unit (GRU), or another network suitable for extracting sequence features.
  • the encoding result is input to the second fully connected network 203 to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  • in this way, the phoneme posterior probability of each speech frame in the sound driving data can be accurately predicted.
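  • A minimal PyTorch sketch of this network structure is given below; a GRU stands in for the encoding sub-network, and all layer sizes are assumptions for illustration:

```python
# Sketch of the sound feature extraction network: first fully connected network,
# encoding sub-network, second fully connected network with softmax output giving
# the per-frame phoneme posterior probability.
import torch
import torch.nn as nn

class SoundFeatureExtractor(nn.Module):
    def __init__(self, feat_dim=13, hidden=256, num_phonemes=100):
        super().__init__()
        self.fc1 = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())  # first FC network
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)           # encoding sub-network
        self.fc2 = nn.Linear(hidden, num_phonemes)                        # second FC network

    def forward(self, mfcc):                       # mfcc: (batch, T, feat_dim)
        x = self.fc1(mfcc)                         # first sound feature sequence
        x, _ = self.encoder(x)                     # feature encoding of the sequence
        return torch.softmax(self.fc2(x), dim=-1)  # (batch, T, num_phonemes) posteriors
```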
  • the posture parameter values corresponding to the phoneme posterior probabilities of the respective speech frames in the sound driving data can be predicted through a time series network and a fully connected network, so that the associated historical phoneme posterior probabilities are fused with the current phoneme posterior probability, and the historical posture parameter values affect the change of the current posture parameter value, making the change of the interactive object's posture gentler and more natural.
  • Fig. 3 shows a schematic diagram of a mapping process of phoneme posterior probabilities shown in at least one embodiment of the present disclosure.
  • the phoneme posterior probability of each speech frame is input into the time series network 301 , and associated feature information is output.
  • the time series network may be a time recursive neural network, such as LSTM.
  • the time series network can learn the historical information of the input phoneme posterior probability, and the output associated feature information includes the influence of the historical information on the current information.
  • the associated feature information is input into the third fully connected network 302 to obtain an associated feature sequence.
  • the associated feature sequence is activated through the activation layer 303, and each feature value in the associated feature sequence is transformed into a posture parameter value, so as to obtain the posture parameter values of the interactive object matched with the phoneme posterior probabilities of the respective speech frames.
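  • A PyTorch sketch of this mapping is given below; the choice of an LSTM as the time series network, the layer sizes, the number of posture parameters and the sigmoid activation are assumptions for illustration:

```python
# Sketch: map per-frame phoneme posteriors to posture parameter values through a
# time series network, a third fully connected network, and an activation layer.
import torch
import torch.nn as nn

class PoseMapper(nn.Module):
    def __init__(self, num_phonemes=100, hidden=128, num_pose_params=37):
        super().__init__()
        self.temporal = nn.LSTM(num_phonemes, hidden, batch_first=True)  # time series network
        self.fc3 = nn.Linear(hidden, num_pose_params)                    # third FC network
        self.activation = nn.Sigmoid()                                   # activation layer

    def forward(self, posteriors):               # posteriors: (batch, T, num_phonemes)
        assoc, _ = self.temporal(posteriors)     # associated feature information
        seq = self.fc3(assoc)                    # associated feature sequence
        return self.activation(seq)              # posture parameter values per frame
```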
  • the posture parameters of the interactive object include facial posture control parameters, and the interactive object can be driven, according to the facial posture control parameters matched with the phoneme posterior probabilities of the respective speech frames, to realize facial postures matched with the respective speech frames in the sound driving data.
  • the facial posture parameters may include facial muscle control coefficients, for example.
  • the movement of the human face, from an anatomical point of view, is the result of the coordinated deformation of the muscles in various parts of the face. Therefore, a facial muscle model is obtained by dividing the facial muscles of the interactive object, and the movement of each divided muscle (region) is controlled by a corresponding facial muscle control coefficient, that is, contraction/expansion control is performed, so that the face of the interactive object makes various expressions.
  • the motion states corresponding to different muscle control coefficients can be set according to the facial position of the muscle and the motion characteristics of the muscle itself. For example, for the upper lip muscle, the value range of its control coefficient is (0~1), and different values in this range correspond to different contraction/expansion states of the upper lip muscle; by changing this value, the longitudinal opening and closing of the mouth can be realized.
  • for the left mouth corner muscle, the value range of its control coefficient is (0~1), and different values in this range correspond to different contraction/expansion states of the left mouth corner muscle; by changing this value, lateral changes of the mouth can be realized.
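  • As an illustration only, such coefficients could be applied as in the sketch below; the muscle names and the set_muscle_coefficients rendering interface are hypothetical placeholders, not an API defined by the disclosure:

```python
# Illustration: drive mouth-related muscle regions with control coefficients in (0, 1).
def apply_mouth_pose(avatar, upper_lip=0.0, left_mouth_corner=0.0):
    """Each coefficient selects a contraction/expansion state of its muscle region."""
    clamp = lambda v: max(0.0, min(1.0, v))
    coefficients = {
        "upper_lip_muscle": clamp(upper_lip),                  # longitudinal mouth opening
        "left_mouth_corner_muscle": clamp(left_mouth_corner),  # lateral mouth changes
    }
    avatar.set_muscle_coefficients(coefficients)               # hypothetical rendering call
```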
  • while outputting sound according to the sound driving data, the interactive object is driven to make facial postures according to the facial posture control parameters corresponding to the sound driving data, so that the interactive object simultaneously presents the mouth shapes and expressions of producing that sound, making the target object feel that the interactive object is speaking and improving the target object's interactive experience.
  • Fig. 4 is a flowchart of a phoneme processing method proposed by at least one embodiment of the present disclosure. As shown in FIG. 4 , the method includes step 401 - step 402 .
  • In step 401, a phoneme table containing multiple languages is obtained according to phonemes in multiple target languages.
  • the multilingual phoneme table can be obtained in the following way: phonemes in multiple target languages are spliced, and the phonemes whose pronunciation similarity exceeds the first set threshold in the splicing result are merged, so that phoneme tables covering multiple target languages can be obtained conveniently and quickly.
  • alternatively, the multilingual phoneme table can be obtained in the following manner: first, the phonemes in the multiple target languages are mapped to the International Phonetic Alphabet symbols whose pronunciation similarity satisfies the similarity condition, the similarity condition being, for example, identical pronunciation or the highest similarity; next, the IPA symbols with the same pronunciation in the mapping result are merged to obtain the multilingual phoneme table. This method is applicable to a variety of target languages and is universally applicable.
  • in response to the existence of first phonemes in the plurality of target languages whose pronunciation similarity with each International Phonetic Alphabet symbol is less than or equal to the second set threshold, the first phonemes are added to the multilingual phoneme table. That is to say, if there is no International Phonetic Alphabet symbol highly similar to the pronunciation of a first phoneme, the first phoneme is directly added to the multilingual phoneme table.
  • the first set threshold and the second set threshold can be set according to actual needs, which is not limited in the present disclosure.
  • In step 402, a sound feature extraction network is trained based on the multilingual phoneme table, wherein the sound feature extraction network is used to extract the phoneme posterior probability of a speech frame.
  • training the sound feature extraction network with the multilingual phoneme table can improve the efficiency and quality of the training, and the network is used to extract the phoneme posterior probabilities of the sound driving data as the sound feature for driving the interactive object. Since the phoneme posterior probability is a speaker-independent sound feature that can support multiple languages, the posture of the interactive object is consistent with the real pronunciation in different languages.
  • the multilingual corpus can be constructed according to the following method.
  • multilingual speech samples are acquired, and the language types of the speech samples are the same as the language types contained in the multilingual phoneme table.
  • a phoneme alignment operation is performed on the speech samples to obtain phonemes included in the speech samples.
  • regardless of the language of a speech sample, the phonemes in the multilingual phoneme table can be directly called to mark the phonemes in the speech sample, so that a high-quality, completely annotated, and data-balanced corpus can be constructed conveniently and efficiently.
  • the sound feature extraction network can be trained through the following specific steps.
  • the sound features of the marked speech samples are input to the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples.
  • each speech frame in the marked speech sample is marked with a real value of a phoneme.
  • Fig. 5 is a schematic structural diagram of a device for driving an interactive object according to at least one embodiment of the present disclosure.
  • the device may include: a first acquiring unit 501, configured to acquire the sound characteristics of the sound driving data of the interactive object;
  • the second acquisition unit 502 is used to extract the features of the sound features using the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data; wherein, the sound feature extraction network is obtained by including multiple The phoneme table training of the language is obtained;
  • the third acquisition unit 503 is used to obtain the posture parameter value of the interactive object according to the phoneme posterior probability of each speech frame;
  • the control unit 504 is used to control the posture of the interactive object according to the posture parameter values.
  • the first acquiring unit is specifically configured to: acquire the voice frame sequence corresponding to the voice driving data of the interactive object; obtain the voice according to the voice feature vector of each voice frame in the voice frame sequence The sonic characteristics of the driving data.
  • the sound feature extraction network includes a first fully connected network, an encoding sub-network, and a second fully connected network
  • the second acquisition unit is specifically configured to: input the sound features into the first fully connected network to obtain the first sound feature sequence output by the first fully connected network; use the encoding sub-network to perform feature encoding processing on the first sound feature sequence; and input the encoding result into the second fully connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  • the third acquisition unit is specifically configured to: input the phoneme posterior probability of each speech frame into a time series network, and output associated feature information; input the associated feature information into a third fully connected network , to obtain an associated feature sequence; performing activation processing on the associated feature sequence to obtain the gesture parameter value of the interactive object matched with the phoneme posterior probability of each speech frame.
  • the control parameters of the interactive object include facial posture control parameters.
  • the control unit is specifically configured to: drive the interactive object, according to the facial posture parameter values matched with the phoneme posterior probabilities of the respective speech frames, to achieve facial postures matching the respective speech frames in the sound driving data.
  • Fig. 6 is a schematic structural diagram of a phoneme processing device proposed by at least one embodiment of the present disclosure.
  • the device may include: a phoneme table acquisition unit 601, configured to obtain a multilingual phoneme table according to phonemes in multiple target languages; and a training unit 602, configured to train a sound feature extraction network based on the multilingual phoneme table, the sound feature extraction network being used to extract the phoneme posterior probabilities of speech frames.
  • the phoneme table acquisition unit is specifically configured to: acquire and splice phonemes in multiple target languages; merge the phonemes whose pronunciation similarity exceeds the first set threshold in the splicing result to obtain the multilingual phoneme table; and train the sound feature extraction network based on the multilingual phoneme table.
  • the phoneme table acquisition unit is specifically configured to: map the phonemes in multiple target languages to the International Phonetic Alphabet symbols whose pronunciation similarity satisfies the preset similarity condition; and merge the International Phonetic Alphabet symbols with the same pronunciation in the mapping result to obtain the multilingual phoneme table.
  • in response to the existence of first phonemes in the plurality of target languages whose pronunciation similarity with each International Phonetic Alphabet symbol is less than or equal to the second set threshold, the first phonemes are added to the multilingual phoneme table.
  • the device further includes a labeling unit, configured to: obtain a multilingual speech sample, wherein the language type of the speech sample is the same as the language type included in the multilingual phoneme table; performing a phoneme alignment operation on the speech samples to obtain the phonemes contained in the speech samples; using the phonemes in the multilingual phoneme table to mark the real values of the phonemes in the speech samples.
  • the training unit is specifically configured to: input the sound features of the marked speech samples into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples; for the For each speech frame in the speech sample, adjust the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
  • At least one embodiment of the present disclosure also provides an electronic device, as shown in FIG. 7 , the device includes a memory and a processor, the memory is used to store computer instructions that can be run on the processor, and the processor is used to execute the described The computer instructions implement the driving method of the interactive object described in any embodiment of the present disclosure.
  • At least one embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object described in any embodiment of the present disclosure is implemented.
  • At least one embodiment of the present disclosure further provides a computer program product, including a computer program, when the program is executed by a processor, the method for driving an interactive object described in any embodiment of the present disclosure is implemented.
  • one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
  • each embodiment in this specification is described in a progressive manner, the same and similar parts of each embodiment can be referred to each other, and each embodiment focuses on the differences from other embodiments.
  • as for the device embodiments, the description is relatively simple, and for relevant parts, reference may be made to the corresponding description of the method embodiments.
  • Embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them.
  • Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, as one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus.
  • Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by data processing apparatus.
  • a computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
  • Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing unit.
  • a central processing unit will receive instructions and data from a read only memory and/or a random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data therefrom, transfer data thereto, or both.
  • a computer is not required to have such a device.
  • a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD ROM and DVD-ROM disks.
  • the processor and memory can be supplemented by, or incorporated in, special purpose logic circuitry.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Processing Or Creating Images (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed are interactive object driving and phoneme processing methods, an apparatus, a device, and a storage medium. The interactive object driving method comprises: acquiring a sound feature of sound driving data of an interactive object; performing feature extraction on the sound feature using a sound feature extraction network to obtain a phoneme posterior probability of each voice frame in the sound driving data, the sound feature extraction network being obtained by means of training of a phoneme table containing multiple languages; according to the phoneme posterior probability of each voice frame, obtaining a posture parameter value of the interactive object; and controlling the posture of the interactive object according to the posture parameter value.

Description

交互对象驱动和音素处理方法、装置、设备以及存储介质Interactive object driving and phoneme processing method, device, device and storage medium
相关申请的交叉引用Cross References to Related Applications
本公开要求于2021年05月31日提交的、申请号为202110604874.8的中国专利申请的优先权,该申请以引用的方式并入本文中。This disclosure claims the priority of the Chinese patent application with application number 202110604874.8 filed on May 31, 2021, which is incorporated herein by reference.
技术领域technical field
本公开涉及计算机技术领域,具体涉及一种交互对象驱动和音素处理方法、装置、设备以及存储介质。The present disclosure relates to the field of computer technology, and in particular to an interactive object driving and phoneme processing method, device, device and storage medium.
Background
Deep learning methods are used to match the voice produced by a digital human with the mouth shapes, expressions, and movements it presents. As digital humans are widely applied in many fields, they are required to support multiple languages in many scenarios.
At present, a digital human is usually driven by sound features extracted by a speech recognition model or by sound features obtained from phoneme timestamps. However, these features differ across languages, and deep learning requires datasets in the different languages of the digital human's application scenarios, while current open-source datasets suffer from low quality, incomplete annotation, and unbalanced data.
How to enable a digital human to support multiple languages is an issue that currently requires active research.
Summary
Embodiments of the present disclosure provide an interactive object driving and phoneme processing solution.
According to one aspect of the present disclosure, a method for driving an interactive object is provided. The method includes: acquiring a sound feature of sound driving data of the interactive object; performing feature extraction on the sound feature using a sound feature extraction network to obtain a phoneme posterior probability of each speech frame in the sound driving data, where the sound feature extraction network is trained using a phoneme table containing multiple languages; obtaining a posture parameter value of the interactive object according to the phoneme posterior probability of each speech frame; and controlling the posture of the interactive object according to the posture parameter value.
结合本公开提供的任一实施方式,所述获取交互对象的声音驱动数据的声音特征,包括:获取所述交互对象的声音驱动数据对应的语音帧序列;根据所述语音帧序列中各个语音帧的声音特征向量,得到所述声音驱动数据的声音特征。In combination with any of the implementations provided in the present disclosure, the acquiring the sound features of the sound driving data of the interactive object includes: acquiring a sequence of speech frames corresponding to the sound driving data of the interactive object; according to each speech frame in the sequence of speech frames The sound feature vector of the sound feature of the sound driving data is obtained.
结合本公开提供的任一实施方式,所述声音特征提取网络包括第一全连接网络、编码子网络、第二全连接网络,所述利用声音特征提取网络对所述声音特征进行特征提取, 得到所述声音驱动数据中各个语音帧的音素后验概率,包括:将所述声音特征输入至所述第一全连接网络,得到所述第一全连接网络输出的第一声音特征序列;利用所述编码子网络,对所述第一声音特征序列进行特征编码处理;将编码结果输入至所述第二全连接网络,得到所述声音驱动数据中各个语音帧的音素后验概率。In combination with any embodiment provided by the present disclosure, the sound feature extraction network includes a first fully connected network, an encoding sub-network, and a second fully connected network, and the sound feature extraction network is used to perform feature extraction on the sound feature, to obtain The phoneme posterior probability of each speech frame in the sound driving data includes: inputting the sound feature into the first fully connected network to obtain the first sound feature sequence output by the first fully connected network; using the The encoding sub-network performs feature encoding processing on the first sound feature sequence; the encoding result is input to the second fully connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
结合本公开提供的任一实施方式,所述根据所述各个音素的音素后验概率,得到所述交互对象的姿态参数值,包括:将所述各个语音帧的音素后验概率输入至时序网络,输出关联特征信息;将所述关联特征信息输入至第三全连接网络,得到关联特征序列;对所述关联特征序列进行激活处理,得到所述各个语音帧的音素后验概率匹配的所述交互对象的姿态参数值。In combination with any implementation manner provided by the present disclosure, the obtaining the pose parameter value of the interactive object according to the phoneme posterior probability of each phoneme includes: inputting the phoneme posterior probability of each speech frame into a time series network , output associated feature information; input the associated feature information into the third fully connected network to obtain an associated feature sequence; perform activation processing on the associated feature sequence to obtain the phoneme posterior probability matching of each speech frame The pose parameter value of the interactive object.
结合本公开提供的任一实施方式,所述交互对象的控制参数包括面部姿态控制参数,所述根据所述姿态参数值控制所述交互对象的姿态,包括:根据与所述各个语音帧的音素后验概率匹配的面部姿态参数值,驱动所述交互对象实现与所述声音驱动数据中的各个语音帧匹配的面部姿态。In combination with any implementation manner provided by the present disclosure, the control parameters of the interactive object include facial posture control parameters, and the controlling the posture of the interactive object according to the posture parameter value includes: according to the phonemes associated with the respective speech frames The face pose parameter value matched by the posterior probability drives the interactive object to realize the face pose matched with each speech frame in the sound driving data.
根据本公开的一方面,提出一种音素处理方法,所述方法包括:根据多个目标语种中的音素,得到包含多语种的音素表;基于所述包含多语种的音素表,训练得到声音特征提取网络,其中,所述声音特征提取网络用于提取语音帧的音素后验概率。According to one aspect of the present disclosure, a phoneme processing method is proposed, the method comprising: obtaining a multilingual phoneme table based on phonemes in multiple target languages; and training to obtain sound features based on the multilingual phoneme table An extraction network, wherein the sound feature extraction network is used to extract the phoneme posterior probability of the speech frame.
在本公开实施例中,利用包含多语种的音素表结合本公开提供的任一实施方式,所述根据多个目标语种中的音素,得到包含多语种的音素表包括:获取多个目标语种中的音素进行拼接;将拼接结果中发音相似度超过第一设定阈值的音素进行合并,得到所述包含多语种的音素表。In the embodiment of the present disclosure, using the multilingual phoneme table in combination with any implementation method provided by the present disclosure, the obtaining of the multilingual phoneme table according to the phonemes in the multiple target languages includes: obtaining the phonemes in the multiple target languages Splicing the phonemes; merging the phonemes whose pronunciation similarity exceeds the first set threshold in the splicing result to obtain the phoneme table containing multiple languages.
结合本公开提供的任一实施方式,所述方法还包括:将多个目标语种中的音素分别映射为发音相似度满足预设相似度条件的国际音标;将映射结果中具有相同发音的国际音标进行合并,得到所述包含多语种的音素表。In combination with any embodiment provided by the present disclosure, the method further includes: mapping the phonemes in multiple target languages to the International Phonetic Alphabet whose pronunciation similarity satisfies the preset similarity condition; Merging is performed to obtain the multilingual phoneme table.
结合本公开提供的任一实施方式,响应于所述多个目标语种中存在与各个国际音标的发音相似度小于或等于第二设定阈值的第一音素,将所述第一音素添加至所述包含多语种的音素表中。In combination with any of the implementations provided by the present disclosure, in response to the existence of a first phoneme in the multiple target languages whose pronunciation similarity with each International Phonetic Alphabet is less than or equal to a second set threshold, the first phoneme is added to the Described in the phoneme table that contains multiple languages.
结合本公开提供的任一实施方式,所述方法还包括:获取多语种的语音样本,其中,所述语音样本的语种类型与所述包含多语种的音素表所包含的语种类型相同;对所述语音样本进行音素对齐操作,得到所述语音样本所包含的音素;利用所述多语种的音素表 中的音素来标注所述语音样本中的音素的真实值。In combination with any implementation manner provided by the present disclosure, the method further includes: acquiring a multilingual speech sample, wherein the language type of the speech sample is the same as the language type included in the multilingual phoneme table; performing a phoneme alignment operation on the speech samples to obtain the phonemes included in the speech samples; using the phonemes in the multilingual phoneme table to mark the real values of the phonemes in the speech samples.
结合本公开提供的任一实施方式,所述方法还包括:将标注后的语音样本的声音特征输入至所述声音特征提取网络,得到所述语音样本中各个语音帧的音素后验概率;针对所述语音样本中各个语音帧,根据该语音帧的最大音素后验概率指示的音素与所标注的真实值之间的差异,调整所述声音特征提取网络的参数值。In combination with any embodiment provided by the present disclosure, the method further includes: inputting the sound features of the marked speech samples into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples; for For each speech frame in the speech sample, adjust the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
根据本公开的一方面,提供一种交互对象的驱动装置,所述装置包括:第一获取单元,用于获取交互对象的声音驱动数据的声音特征;第二获取单元,用于利用声音特征提取网络对所述声音特征进行特征提取,得到所述声音驱动数据中各个语音帧的音素后验概率;其中,所述声音特征提取网络是通过包含多语种的音素表训练得到的;第三获取单元,用于根据所述各个语音帧的音素后验概率,得到所述交互对象的姿态参数值;控制单元,用于根据所述姿态参数值控制所述交互对象的姿态。According to an aspect of the present disclosure, there is provided a driving device for an interactive object, the device comprising: a first acquisition unit, configured to acquire the sound features of the sound driving data of the interactive object; a second acquisition unit, used to extract sound features The network performs feature extraction on the sound features to obtain the phoneme posterior probability of each speech frame in the sound driving data; wherein, the sound feature extraction network is obtained by training a phoneme table that includes multiple languages; the third acquisition unit , for obtaining the pose parameter value of the interactive object according to the phoneme posterior probability of each speech frame; a control unit, for controlling the pose of the interactive object according to the pose parameter value.
结合本公开提供的任一实施方式,所述第一获取单元具体用于:获取所述交互对象的声音驱动数据对应的语音帧序列;根据所述语音帧序列中各个语音帧的声音特征向量,得到所述声音驱动数据的声音特征。In combination with any implementation manner provided by the present disclosure, the first acquiring unit is specifically configured to: acquire a voice frame sequence corresponding to the voice driving data of the interactive object; according to the voice feature vector of each voice frame in the voice frame sequence, Sound features of the sound driving data are obtained.
结合本公开提供的任一实施方式,所述声音特征提取网络包括第一全连接网络、编码子网络和第二全连接网络,所述第二获取单元具体用于:将所述声音特征输入至所述第一全连接网络,得到所述第一全连接网络输出的第一声音特征序列;利用所述编码子网络,对所述第一声音特征序列进行特征编码处理;将编码结果输入至所述第二全连接网络,得到所述声音驱动数据中各个语音帧的音素后验概率。In combination with any embodiment provided in the present disclosure, the sound feature extraction network includes a first fully connected network, an encoding sub-network and a second fully connected network, and the second acquisition unit is specifically configured to: input the sound feature into The first fully connected network obtains the first sound feature sequence output by the first fully connected network; uses the coding sub-network to perform feature coding processing on the first sound feature sequence; and inputs the coding result to the The second fully connected network is used to obtain the phoneme posterior probability of each speech frame in the sound driving data.
结合本公开提供的任一实施方式,所述第三获取单元具体用于:将所述各个语音帧的音素后验概率输入至时序网络,输出关联特征信息;将所述关联特征信息输入至第三全连接网络,得到关联特征序列;对所述关联特征序列进行激活处理,得到所述各个语音帧的音素后验概率匹配的所述交互对象的姿态参数值。In combination with any implementation manner provided by the present disclosure, the third acquisition unit is specifically configured to: input the phoneme posterior probability of each speech frame into a time series network, and output associated feature information; input the associated feature information into the first Three fully connected networks to obtain an associated feature sequence; performing activation processing on the associated feature sequence to obtain the attitude parameter value of the interactive object matched by the phoneme posterior probability of each speech frame.
结合本公开提供的任一实施方式,所述交互对象的控制参数包括面部姿态控制参数,所述控制单元具体用于:根据与所述各个语音帧的音素后验概率匹配的面部姿态参数值,驱动所述交互对象实现与所述声音驱动数据中的各个语音帧匹配的面部姿态。In combination with any implementation manner provided by the present disclosure, the control parameters of the interactive object include facial gesture control parameters, and the control unit is specifically configured to: according to the facial gesture parameter value matched with the phoneme posterior probability of each speech frame, Driving the interactive object to achieve a facial gesture matching each speech frame in the sound driving data.
根据本公开的一方面,提供一种音素处理装置,所述装置包括:音素表获取单元,用于根据多个目标语种中的音素,得到包含多语种的音素表;训练单元,用于基于所述包含多语种的音素表,训练得到声音特征提取网络,其中,所述声音特征提取网络用于 提取语音帧的音素后验概率。According to an aspect of the present disclosure, there is provided a phoneme processing device, the device comprising: a phoneme table acquisition unit, configured to obtain a multilingual phoneme table based on phonemes in multiple target languages; a training unit, configured to obtain a phoneme table based on the obtained The phoneme table including multiple languages is used to train the sound feature extraction network, wherein the sound feature extraction network is used to extract the phoneme posterior probability of the speech frame.
结合本公开提供的任一实施方式,所述音素表获取单元具体用于:获取多个目标语种中的音素进行拼接;将拼接结果中发音相似度超过第一设定阈值的音素进行合并,得到所述包含多语种的音素表;基于所述包含多语种的音素表,训练得到声音特征提取网络。In combination with any of the implementations provided in the present disclosure, the phoneme table acquisition unit is specifically configured to: acquire phonemes in multiple target languages for splicing; merge phonemes whose pronunciation similarity exceeds the first set threshold in the splicing result to obtain The multilingual phoneme table; based on the multilingual phoneme table, a sound feature extraction network is obtained through training.
结合本公开提供的任一实施方式,所述音素表获取单元具体用于:将多个目标语种中的音素分别映射为发音相似度满足预设相似度条件的国际音标;将映射结果中具有相同发音的国际音标进行合并,得到所述包含多语种的音素表。In combination with any implementation method provided by the present disclosure, the phoneme table acquisition unit is specifically configured to: map phonemes in multiple target languages to the International Phonetic Alphabet whose pronunciation similarity satisfies a preset similarity condition; The international phonetic symbols of pronunciation are merged to obtain the phoneme table containing multiple languages.
结合本公开提供的任一实施方式,响应于所述多个目标语种中存在与各个国际音标的发音相似度小于或等于所述第二设定阈值的第一音素,将所述第一音素添加至所述包含多语种的音素表中。In combination with any implementation manner provided by the present disclosure, in response to the existence of first phonemes in the plurality of target languages whose pronunciation similarity with each International Phonetic Alphabet is less than or equal to the second set threshold, add the first phoneme to the phoneme table containing multiple languages.
结合本公开提供的任一实施方式,所述装置还包括标注单元,用于:获取多语种的语音样本,其中,所述语音样本的语种类型与所述包含多语种的音素表所包含的语种类型相同;对所述语音样本进行音素对齐操作,得到所述语音样本所包含的音素;利用所述包含多语种的音素表中的音素来标注所述语音样本中的音素进行标注的真实值。In combination with any of the implementations provided by the present disclosure, the device further includes a labeling unit, configured to: acquire multilingual speech samples, wherein the language type of the speech samples is the same as the language contained in the multilingual phoneme table The types are the same; the phoneme alignment operation is performed on the speech samples to obtain the phonemes contained in the speech samples; the phonemes in the speech samples are marked by using the phonemes in the multilingual phoneme table to mark the real value.
结合本公开提供的任一实施方式,所述训练单元具体用于:将标注后的语音样本的声音特征输入至所述声音特征提取网络,得到所述语音样本中各个语音帧的音素后验概率;针对所述语音样本中各个语音帧,根据该语音帧的最大音素后验概率指示的音素与所标注的真实值之间的差异,调整所述声音特征提取网络的参数值。In combination with any embodiment provided in the present disclosure, the training unit is specifically configured to: input the sound features of the marked speech samples into the sound feature extraction network, and obtain the phoneme posterior probability of each speech frame in the speech samples ; For each speech frame in the speech sample, adjust the parameter value of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the marked true value.
根据本公开的一方面,提供一种电子设备,所述设备包括存储器和处理器,所述存储器用于存储可在处理器上运行的计算机指令,所述处理器用于在执行所述计算机指令时实现本公开提供的任一实施方式所述的交互对象的驱动方法。According to an aspect of the present disclosure, there is provided an electronic device, the device includes a memory and a processor, the memory is used for storing computer instructions executable on the processor, and the processor is used for executing the computer instructions Implement the driving method of the interactive object described in any implementation manner provided by the present disclosure.
根据本公开的一方面,提供一种计算机可读存储介质,其上存储有计算机程序,所述程序被处理器执行时实现本公开提供的任一实施方式所述的交互对象的驱动方法。According to one aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object described in any implementation manner provided by the present disclosure is implemented.
根据本公开的一方面,提供一种计算机程序产品,包括计算机程序,所述程序被处理器执行时实现本公开提供的任一实施方式所述的交互对象的驱动方法。According to an aspect of the present disclosure, a computer program product is provided, including a computer program, and when the program is executed by a processor, the method for driving an interactive object described in any implementation manner provided by the present disclosure is implemented.
Brief Description of the Drawings
In order to describe the technical solutions in one or more embodiments of this specification or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in one or more embodiments of this specification, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
Fig. 1 is a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure;
Fig. 2 is a schematic diagram of a feature encoding process for a phoneme sequence according to at least one embodiment of the present disclosure;
Fig. 3 is a schematic diagram of a mapping process for phoneme posterior probabilities according to at least one embodiment of the present disclosure;
Fig. 4 is a flowchart of a phoneme processing method according to at least one embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a driving apparatus for an interactive object according to at least one embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of a phoneme processing apparatus according to at least one embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description of Embodiments
Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of multiple items or any combination of at least two of them; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
At least one embodiment of the present disclosure provides a method for driving an interactive object. The driving method may be executed by an electronic device such as a terminal device or a server. The terminal device may be a fixed or mobile terminal, such as a mobile phone, a tablet computer, a game console, a desktop computer, an advertising machine, an all-in-one machine, or a vehicle-mounted terminal, and the server includes a local server, a cloud server, or the like. The method may also be implemented by a processor invoking computer-readable instructions stored in a memory.
In the embodiments of the present disclosure, the interactive object can interact with a target object. The interactive object may be a virtual character, or a virtual animal, a virtual item, a cartoon figure, or any other virtual image capable of implementing interactive functions. The virtual image may be presented in 2D or 3D form, which is not limited by the present disclosure. The target object may be a user, a robot, or another smart device.
The interactive object may be displayed through a terminal device, and the terminal device may be a television, an all-in-one machine with a display function, a projector, a virtual reality (VR) device, an augmented reality (AR) device, or the like; the present disclosure does not limit the specific form of the terminal device.
In some embodiments, in response to the terminal device receiving sound driving data for driving the interactive object to output speech, the interactive object may utter a specified speech to the target object. Sound driving data may be generated according to the actions, expressions, identity, preferences, and the like of the target object around the terminal device, so as to drive the interactive object to respond by uttering the specified speech, thereby providing an anthropomorphic service to the target object. In some scenarios, the interactive object may interact with the target object in different languages. In order to make the posture of the interactive object fit the real pronunciation in different languages, at least one embodiment of the present disclosure proposes a method for driving an interactive object.
Fig. 1 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. As shown in Fig. 1, the method includes steps 101 to 104.
In step 101, a sound feature of the sound driving data of the interactive object is acquired.
The sound driving data may include audio data (speech data), text, and the like. In response to the sound driving data being audio data, the audio data may be used directly to drive the interactive object to output speech, that is, the terminal device outputs speech directly from the audio data. In response to the sound driving data being text, corresponding phonemes may be generated according to the speech contained in the text, and the interactive object is driven to output speech through the generated phonemes. Taking Chinese text as an example, the text may first be converted into pinyin, and corresponding phonemes are then generated from the pinyin. The sound driving data may also be driving data in other forms, which is not limited by the present disclosure.
In the embodiments of the present disclosure, the sound driving data may be driving data generated according to the actions, expressions, identity, preferences, and the like of the target object interacting with the interactive object, or may be sound driving data called by the terminal device from its internal memory. The present disclosure does not limit the manner in which the sound driving data is acquired.
In response to the sound driving data being audio data, the audio data may be split into a plurality of speech frames, and the speech frames are combined according to their states to form phonemes; the phonemes formed from the audio data then form a phoneme sequence. A phoneme is the smallest speech unit divided according to the natural attributes of speech; one pronunciation action of a real person can form one phoneme.
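By way of illustration only, the following is a minimal Python sketch of splitting a waveform into fixed-length, overlapping speech frames. The 25 ms frame length and 10 ms hop are assumptions made for the example and are not fixed by the present disclosure.

```python
import numpy as np

def split_into_frames(audio: np.ndarray, sample_rate: int,
                      frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a mono waveform into overlapping speech frames.

    The 25 ms frame length and 10 ms hop are common choices and are
    assumptions here; the disclosure does not fix specific values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(audio) < frame_len:
        audio = np.pad(audio, (0, frame_len - len(audio)))
    n_frames = 1 + (len(audio) - frame_len) // hop_len
    frames = np.stack([audio[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

# Example: 1 second of audio at 16 kHz -> 98 frames of 400 samples each
frames = split_into_frames(np.zeros(16000), 16000)
print(frames.shape)
```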
In response to the sound driving data being text, the phonemes contained in the morphemes of the text may be obtained according to those morphemes, so as to obtain the corresponding phoneme sequence. Those skilled in the art should understand that the phoneme sequence corresponding to the sound driving data may also be obtained in other ways, which is not limited by the present disclosure.
In the embodiments of the present disclosure, the sound feature may be a feature related to speech emotion, such as a fundamental frequency feature, a formant feature, Mel-frequency cepstral coefficients (MFCC), and so on.
In step 102, feature extraction is performed on the sound feature by using a sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data. The sound feature extraction network is trained using a phoneme table containing multiple languages.
The phoneme posterior probabilities (phonetic posteriorgrams, PPG) represent the posterior probabilities that a speech frame corresponds to each phoneme in the phoneme table. The phoneme posterior probability is independent of the speaker and depends only on the content of the speech. In one embodiment, assuming that the phoneme table contains three phonemes, for example phoneme 1, phoneme 2, and phoneme 3, the probability that a speech frame corresponds to phoneme 1, the probability that it corresponds to phoneme 2, and the probability that it corresponds to phoneme 3 can be obtained through the sound feature extraction network. That is, the phoneme posterior probability of the speech frame includes the probabilities that the speech frame corresponds to phoneme 1, phoneme 2, and phoneme 3.
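As an illustration of the three-phoneme example above, the following sketch converts per-frame phoneme scores into phoneme posterior probabilities with a softmax; the phoneme names and scores are made up for the example.

```python
import numpy as np

def phoneme_posteriors(logits: np.ndarray) -> np.ndarray:
    """Convert per-frame phoneme scores into phoneme posterior probabilities (PPG).

    `logits` has shape (n_frames, n_phonemes); each row of the result sums to 1
    and gives the posterior probability of every phoneme in the table for that frame.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy phoneme table with three entries, as in the example above
phoneme_table = ["phoneme_1", "phoneme_2", "phoneme_3"]
logits = np.array([[2.0, 0.5, 0.1],   # frame 1
                   [0.2, 1.7, 0.3]])  # frame 2
ppg = phoneme_posteriors(logits)
for frame_ppg in ppg:
    print({p: round(float(prob), 3) for p, prob in zip(phoneme_table, frame_ppg)})
```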
In the embodiments of the present disclosure, the sound feature extraction network used to extract the phoneme posterior probability of each speech frame in the sound driving data is trained using a phoneme table containing multiple languages.
In some embodiments, the phoneme table containing multiple languages may be obtained as follows: acquiring the phonemes in multiple target languages and concatenating them; and merging the phonemes in the concatenation result whose pronunciation similarity exceeds a first set threshold. In this way, a phoneme table containing multiple target languages can be obtained conveniently and quickly.
For example, phonemes in Chinese (pinyin) may be concatenated with phonemes in English, and phonemes in the concatenation result with the same or similar pronunciation, such as "b", "p", "m", and "f", are merged, so that a phoneme table containing both Chinese and English can be obtained.
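A minimal sketch of this concatenate-and-merge construction is given below. The grouping of similar phonemes is supplied by hand here, standing in for a pronunciation-similarity comparison against the first set threshold.

```python
def build_merged_phoneme_table(language_phonemes: dict, similar_groups: list) -> list:
    """Concatenate the phoneme sets of several target languages and merge
    phonemes whose pronunciations are the same or judged similar enough.

    `similar_groups` is a hand-written stand-in for a pronunciation-similarity
    judgment that, in practice, would compare similarities against a threshold.
    """
    # Map every phoneme to a canonical representative of its similarity group.
    canonical = {}
    for group in similar_groups:
        representative = sorted(group)[0]
        for phoneme in group:
            canonical[phoneme] = representative

    merged, seen = [], set()
    for phonemes in language_phonemes.values():    # concatenation step
        for phoneme in phonemes:
            key = canonical.get(phoneme, phoneme)  # merging step
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged

# Toy example: identical symbols are merged automatically; "ii" is merged with "i"
table = build_merged_phoneme_table(
    {"zh": ["b", "p", "m", "f", "ii"], "en": ["b", "p", "m", "f", "i"]},
    similar_groups=[{"ii", "i"}],
)
print(table)  # ['b', 'p', 'm', 'f', 'i']
```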
In some embodiments, the phoneme table containing multiple languages may be obtained as follows. First, the phonemes in the multiple target languages are each mapped to International Phonetic Alphabet (IPA) symbols whose pronunciation similarity satisfies a similarity condition, for example, having the same pronunciation or the highest similarity. Next, the IPA symbols with the same pronunciation in the mapping result are merged to obtain the phoneme table containing multiple languages. This approach is applicable to a variety of target languages and is therefore universal.
For example, all Chinese phonemes may be mapped to the IPA symbols with the highest pronunciation similarity, all English phonemes may likewise be mapped to the IPA symbols with the highest pronunciation similarity, the IPA symbols to which Chinese and English are mapped are stored in one phoneme table, and symbols with the same pronunciation are merged, so that a phoneme table supporting both Chinese and English is obtained.
For example, suppose the Chinese phonemes include a1, a2, a3, b, i1, i2, i3, ii1, ii2, ii3 (where 1, 2, and 3 denote tones), the English phonemes include a, b, and i, and the IPA table contains a, b, and i. According to pronunciation, the Chinese and English phonemes are each mapped to the most similar IPA symbol: the Chinese phonemes map in order to a, a, a, b, i, i, i, i, i, i (since IPA has no "ii" pronunciation and the actual pronunciation of "ii" is most similar to "i", "ii" is mapped to "i"). Similarly, the English phonemes map to a, b, i.
In some embodiments, in response to the multiple target languages containing a first phoneme whose pronunciation similarity to every IPA symbol is less than or equal to a second set threshold, the first phoneme is added to the phoneme table containing multiple languages. For example, the Chinese phoneme "ng" does not exist in the IPA table, and its similarity to all other pronunciations is less than the second set threshold; likewise, a Chinese phoneme may be composed of several other pronunciations, and its similarity to the IPA symbols is also less than the second set threshold. Such a phoneme is called a first phoneme; it is retained and appended after the IPA table, so that the final table contains this first phoneme in addition to all of the IPA symbols.
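The following sketch illustrates the IPA-based construction under the assumptions of the a1/a2/a3 example above: each language phoneme is mapped to its most similar IPA symbol, symbols with the same pronunciation collapse onto the single IPA entries, and a first phoneme such as "ng" that has no sufficiently similar IPA symbol is appended to the table. The hand-written mapping stands in for a similarity computation.

```python
def build_ipa_phoneme_table(language_phonemes: dict, ipa_mapping: dict,
                            ipa_symbols: list) -> list:
    """Map each language's phonemes onto the most similar IPA symbol; phonemes
    with no sufficiently similar IPA symbol (the "first phonemes", e.g. the
    Mandarin "ng") are kept and appended after the IPA symbols.

    `ipa_mapping` is a hand-written stand-in for a similarity-based mapping.
    """
    appended = []
    for phonemes in language_phonemes.values():
        for phoneme in phonemes:
            if ipa_mapping.get(phoneme) is None and phoneme not in appended:
                appended.append(phoneme)   # keep it and append it after the IPA table
    return list(ipa_symbols) + appended

# Toy example following the a1/a2/a3, b, i1..i3, ii1..ii3 illustration above
mapping = {"a1": "a", "a2": "a", "a3": "a", "b": "b",
           "i1": "i", "i2": "i", "i3": "i",
           "ii1": "i", "ii2": "i", "ii3": "i",
           "a": "a", "i": "i"}            # the English a, b, i map to themselves
table = build_ipa_phoneme_table(
    {"zh": ["a1", "a2", "a3", "b", "i1", "ii1", "ng"], "en": ["a", "b", "i"]},
    ipa_mapping=mapping, ipa_symbols=["a", "b", "i"])
print(table)  # ['a', 'b', 'i', 'ng']
```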
Those skilled in the art should understand that the first set threshold and the second set threshold may be specifically set according to actual needs, which is not limited by the present disclosure.
In the embodiments of the present disclosure, the phoneme table containing multiple languages can be used to directly annotate multilingual speech samples, so that a high-quality, completely annotated, and data-balanced corpus can be built conveniently and efficiently for training the sound feature extraction network.
In step 103, the posture parameter value of the interactive object is obtained according to the phoneme posterior probability of each speech frame.
In the embodiments of the present disclosure, the posture parameter value of the interactive object matching the sound driving data may be obtained according to the phoneme posterior probability of each speech frame in the sound driving data.
The posture parameters are used to control the posture of the interactive object, and different posture parameter values can drive the interactive object to make corresponding postures. The posture parameters may include facial posture parameters, which are used to control the facial posture of the interactive object, including expressions, mouth shapes, movements of the facial features, head posture, and so on. In the embodiments of the present disclosure, a correspondence between phoneme posterior probabilities and posture parameter values of the interactive object may be established in advance, so that once the phoneme posterior probability of each speech frame in the sound driving data is obtained, the posture parameter values corresponding to the sound driving data can be obtained. The specific form of the posture parameters may be determined according to the type of the interactive object model.
In step 104, the posture of the interactive object is controlled according to the posture parameter value.
The posture parameter values are matched with the phoneme posterior probabilities of the speech frames in the sound driving data of the interactive object. Since the phoneme posterior probability is independent of the language, for speech data and text in different languages, the postures presented by the interactive object (such as mouth shapes, expressions, and movements) can all match the actual pronunciation, giving the target object interacting with the interactive object the feeling that the interactive object is speaking.
In the embodiments of the present disclosure, the sound feature of the sound driving data of the interactive object is first acquired; feature extraction is performed on the sound feature using a sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data; the posture parameter value of the interactive object is then obtained according to the phoneme posterior probability of each speech frame, and the posture of the interactive object is controlled according to the posture parameter value. Since the phoneme posterior probability is independent of the speaker and can support multiple languages, the embodiments of the present disclosure train the sound feature extraction network using a phoneme table containing multiple languages and use this network to extract the phoneme posterior probabilities of the sound driving data as the sound feature for driving the interactive object, so that the posture of the interactive object fits the real pronunciation in different languages.
In some embodiments, a multilingual corpus may be constructed as follows.
First, multilingual speech samples are acquired; the language types of the speech samples are the same as the language types contained in the phoneme table containing multiple languages. For example, if the phoneme table supports Chinese and English, Chinese speech samples and English speech samples are acquired respectively.
Next, a phoneme alignment operation is performed on the speech samples to obtain the phonemes contained in the speech samples.
Taking a speech segment of saying "ni hao" (hello) in Chinese as an example, after the alignment operation is performed on the speech sample, the start and end times of each phoneme in the speech segment can be obtained: n[0, 0.2], i3[0.2, 0.4], h[0.5, 0.7], ao3[0.7, 1.2], where the brackets indicate the start and end times of each phoneme in seconds. Through the start and end times of each phoneme, the phoneme corresponding to each speech frame in the speech sample can be determined.
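As an illustration of how such an alignment can be turned into per-frame phoneme labels, the following sketch assumes a fixed 10 ms hop between speech frames and marks unaligned gaps with a hypothetical "sil" placeholder.

```python
def frame_labels_from_alignment(alignment, hop_s=0.01, total_s=None):
    """Turn a phoneme alignment of (phoneme, start, end) tuples in seconds into
    one phoneme label per speech frame, assuming a fixed hop between frames.

    The 10 ms hop is an assumption; unlabelled gaps (e.g. 0.4-0.5 s below)
    are marked with a silence placeholder "sil".
    """
    if total_s is None:
        total_s = max(end for _, _, end in alignment)
    n_frames = int(round(total_s / hop_s))
    labels = ["sil"] * n_frames
    for phoneme, start, end in alignment:
        for i in range(int(start / hop_s), min(int(end / hop_s), n_frames)):
            labels[i] = phoneme
    return labels

# Alignment from the "ni hao" example above
alignment = [("n", 0.0, 0.2), ("i3", 0.2, 0.4), ("h", 0.5, 0.7), ("ao3", 0.7, 1.2)]
labels = frame_labels_from_alignment(alignment)
print(len(labels), labels[:3], labels[45:52])
```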
Finally, the phonemes in the speech samples are annotated using the phonemes in the phoneme table containing multiple languages. In one embodiment, the phonemes in the multilingual phoneme table are used to annotate the true values of the phonemes in the speech samples.
Taking a phoneme table supporting Chinese and English as an example, for both Chinese speech samples and English speech samples, the phonemes in the multilingual phoneme table can be invoked directly for annotation, so that a high-quality, completely annotated, and data-balanced corpus can be constructed conveniently and efficiently.
In some embodiments, the sound feature extraction network may be trained as follows.
First, the sound features of the annotated speech samples are input into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples, where each speech frame in the annotated speech samples is labeled with the true value of its phoneme.
Next, the parameter values of the sound feature extraction network are adjusted according to the difference between the phoneme indicated by the maximum phoneme posterior probability of each speech frame and the annotated true value. Training is completed when the change in the network loss satisfies a convergence condition, for example, when the change in the network loss is smaller than a set threshold or when the number of iterations reaches a set number, and the trained sound feature extraction network is thus obtained.
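A minimal training sketch along these lines is shown below, assuming PyTorch, a model that outputs per-frame phoneme logits, and a data loader of (sound features, frame-level phoneme labels) pairs; the cross-entropy loss, optimizer, and stopping thresholds are assumptions rather than values fixed by the present disclosure.

```python
import torch
import torch.nn as nn

def train_feature_extraction_network(model, dataloader, n_phonemes,
                                     max_epochs=50, loss_delta=1e-4, lr=1e-3):
    """Per-frame cross-entropy between predicted phoneme posteriors and the
    annotated phoneme labels; training stops when the change in loss falls
    below a threshold or the iteration budget is reached."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    previous_loss = None
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for sound_features, frame_labels in dataloader:
            logits = model(sound_features)                  # (batch, frames, n_phonemes)
            loss = criterion(logits.reshape(-1, n_phonemes), frame_labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if previous_loss is not None and abs(previous_loss - epoch_loss) < loss_delta:
            break                                           # convergence condition met
        previous_loss = epoch_loss
    return model
```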
In some embodiments, the speech frame sequence corresponding to the sound driving data of the interactive object may be acquired, and the sound feature of the sound driving data is obtained according to the sound feature vector of each speech frame in the speech frame sequence. Taking MFCC as an example, the MFCC matrix corresponding to the sound driving data can be obtained according to the MFCC coefficients of each speech frame in the speech frame sequence.
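For illustration, the following sketch computes such an MFCC matrix with the librosa library; the frame length, hop length, and number of coefficients are assumptions made for the example.

```python
import numpy as np
import librosa  # used here purely for illustration

def mfcc_matrix(audio: np.ndarray, sample_rate: int, n_mfcc: int = 13) -> np.ndarray:
    """Compute an MFCC matrix for a waveform: one MFCC vector per speech frame.

    The 25 ms frame, 10 ms hop and 13 coefficients are assumptions; any
    per-frame acoustic feature could be substituted.
    """
    mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sample_rate),
                                hop_length=int(0.010 * sample_rate))
    return mfcc.T  # shape: (n_frames, n_mfcc)

features = mfcc_matrix(np.random.randn(16000).astype(np.float32), 16000)
print(features.shape)
```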
Fig. 2 is a schematic diagram of a sound feature extraction process according to at least one embodiment of the present disclosure. As shown in Fig. 2, the present disclosure uses a sound feature extraction network 200 to perform feature extraction on the sound feature of the sound driving data, so as to obtain the phoneme posterior probability of each speech frame in the sound feature data. The sound feature extraction network 200 includes a first fully connected network 201, an encoding sub-network 202, and a second fully connected network 203.
First, the sound feature is input into the first fully connected network 201 to obtain a first sound feature sequence output by the first fully connected network. Then, feature encoding is performed on the first sound feature sequence by the encoding sub-network 202 to obtain an encoding result. The encoding sub-network may be, for example, a CBHG network, a gated recurrent unit (GRU), or another network suitable for extracting sequence features. Finally, the encoding result is input into the second fully connected network 203 to obtain the phoneme posterior probability of each speech frame in the sound driving data.
In the embodiments of the present disclosure, by converting the sound feature into a sequence, performing feature extraction through an encoding network suitable for extracting sequence features, and performing classification through a fully connected network, the phoneme posterior probability of each speech frame in the sound feature data can be predicted accurately.
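A possible sketch of this structure is shown below, with a GRU standing in for the encoding sub-network (a CBHG network could be used instead) and with all layer sizes chosen arbitrarily for the example.

```python
import torch
import torch.nn as nn

class SoundFeatureExtractionNetwork(nn.Module):
    """First fully connected network -> encoding sub-network -> second fully
    connected network. A GRU stands in for the encoding sub-network; the
    feature dimension, hidden size and phoneme count are assumptions."""
    def __init__(self, feature_dim=13, hidden_dim=256, n_phonemes=100):
        super().__init__()
        self.first_fc = nn.Sequential(nn.Linear(feature_dim, hidden_dim), nn.ReLU())
        self.encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.second_fc = nn.Linear(2 * hidden_dim, n_phonemes)

    def forward(self, sound_features):                  # (batch, frames, feature_dim)
        first_sequence = self.first_fc(sound_features)  # first sound feature sequence
        encoded, _ = self.encoder(first_sequence)       # feature encoding of the sequence
        return self.second_fc(encoded)                  # per-frame phoneme logits

model = SoundFeatureExtractionNetwork()
logits = model(torch.randn(1, 120, 13))
ppg = torch.softmax(logits, dim=-1)   # phoneme posterior probabilities per frame
print(ppg.shape)                      # torch.Size([1, 120, 100])
```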
In some embodiments, the posture parameter values corresponding to the phoneme posterior probabilities of the speech frames in the sound driving data may be predicted through a temporal network and a fully connected network, so that the correlated historical phoneme posterior probabilities and the current phoneme posterior probability are fused. In this way, the historical posture parameter values influence the change of the current posture parameter value, making the change of the interactive object's posture smoother and more natural.
Fig. 3 is a schematic diagram of a mapping process for phoneme posterior probabilities according to at least one embodiment of the present disclosure. As shown in Fig. 3, the phoneme posterior probabilities of the speech frames are first input into a temporal network 301, which outputs correlated feature information. The temporal network may be a recurrent neural network over time, such as an LSTM, which can learn the historical information of the input phoneme posterior probabilities; the output correlated feature information contains the influence of the historical information on the current information. Next, the correlated feature information is input into a third fully connected network 302 to obtain a correlated feature sequence. Finally, the correlated feature sequence is activated by an activation layer 303, and each feature value in the correlated feature sequence is transformed into a posture parameter value, so as to obtain the posture parameter values of the interactive object matching the phoneme posterior probabilities of the speech frames.
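The following sketch illustrates this mapping with an LSTM as the temporal network, a fully connected layer, and a sigmoid activation; the hidden size, the number of posture parameters, and the choice of sigmoid are assumptions made for the example.

```python
import torch
import torch.nn as nn

class PoseParameterMapper(nn.Module):
    """Phoneme posteriors -> posture parameter values: an LSTM produces the
    correlated feature information, a fully connected layer turns it into a
    correlated feature sequence, and an activation squashes it into posture
    parameter values. Dimensions and the sigmoid choice are assumptions."""
    def __init__(self, n_phonemes=100, hidden_dim=128, n_pose_params=37):
        super().__init__()
        self.temporal = nn.LSTM(n_phonemes, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_pose_params)
        self.activation = nn.Sigmoid()   # keeps e.g. muscle control coefficients in (0, 1)

    def forward(self, ppg):                     # (batch, frames, n_phonemes)
        correlated, _ = self.temporal(ppg)      # correlated feature information
        sequence = self.fc(correlated)          # correlated feature sequence
        return self.activation(sequence)        # posture parameter values per frame

pose_values = PoseParameterMapper()(torch.rand(1, 120, 100))
print(pose_values.shape)   # torch.Size([1, 120, 37])
```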
In some embodiments, the posture parameters of the interactive object include facial posture control parameters, and the interactive object may be driven, according to the facial posture control parameters matching the phoneme posterior probability of each speech frame, to realize a facial posture matching each speech frame in the sound driving data. The facial posture parameters may include, for example, facial muscle control coefficients.
From an anatomical point of view, the motion of a human face is the result of the coordinated deformation of the muscles of its various parts. Therefore, a facial muscle model is obtained by dividing the facial muscles of the interactive object, and the motion of each divided muscle (region) is controlled by a corresponding facial muscle control coefficient, that is, its contraction/expansion is controlled, so that the face of the interactive character can make various expressions. For each muscle of the facial muscle model, the motion states corresponding to different muscle control coefficients can be set according to the position of the muscle on the face and the motion characteristics of the muscle itself. For example, for the upper lip muscle, the value range of its control coefficient is (0, 1), and different values in this range correspond to different contraction/expansion states of the upper lip muscle; by changing this value, the vertical opening and closing of the mouth can be realized. For the left mouth corner muscle, the value range of its control coefficient is (0, 1), and different values in this range correspond to the contraction/expansion states of the left mouth corner muscle; by changing this value, a lateral change of the mouth can be realized.
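Purely as a toy illustration of such control coefficients, the sketch below clamps two hypothetical coefficients into the (0, 1) range and names the mouth movements they would control; the region names are assumptions, not identifiers from the present disclosure.

```python
def apply_mouth_coefficients(upper_lip: float, left_corner: float) -> dict:
    """Interpret two facial muscle control coefficients: values in (0, 1) stand
    for the contraction/expansion amount of the corresponding muscle region."""
    def clamp(value: float) -> float:
        return min(max(value, 0.0), 1.0)
    return {
        "mouth_vertical_open": clamp(upper_lip),   # vertical opening/closing of the mouth
        "mouth_lateral_left": clamp(left_corner),  # lateral change at the left mouth corner
    }

print(apply_mouth_coefficients(0.8, 1.3))  # the second value is clamped to 1.0
```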
While sound is output according to the sound driving data, the interactive object is driven to make facial expressions according to the facial posture control parameters corresponding to the sound driving data, so that the interactive object synchronously makes the mouth shapes and expressions of uttering that sound while outputting the sound. This gives the target object the feeling that the interactive object is speaking and improves the interactive experience of the target object.
Fig. 4 is a flowchart of a phoneme processing method according to at least one embodiment of the present disclosure. As shown in Fig. 4, the method includes steps 401 and 402.
In step 401, a phoneme table containing multiple languages is obtained according to the phonemes in multiple target languages.
In one example, the phoneme table containing multiple languages may be obtained as follows: concatenating the phonemes in the multiple target languages, and merging the phonemes in the concatenation result whose pronunciation similarity exceeds a first set threshold, so that a phoneme table containing multiple target languages can be obtained conveniently and quickly.
In another example, the phoneme table containing multiple languages may be obtained as follows. First, the phonemes in the multiple target languages are each mapped to IPA symbols whose pronunciation similarity satisfies a similarity condition, for example, having the same pronunciation or the highest similarity. Next, the IPA symbols with the same pronunciation in the mapping result are merged to obtain the phoneme table containing multiple languages. This approach is applicable to a variety of target languages and is therefore universal.
In some embodiments, in response to the multiple target languages containing a first phoneme whose pronunciation similarity to every IPA symbol is less than or equal to the second set threshold, the first phoneme is added to the phoneme table containing multiple languages. That is, when there is no IPA symbol with a sufficiently high similarity to the pronunciation of the first phoneme, the first phoneme is added directly to the phoneme table containing multiple languages.
Those skilled in the art should understand that the first set threshold and the second set threshold may be specifically set according to actual needs, which is not limited by the present disclosure.
In step 402, a sound feature extraction network is obtained by training based on the phoneme table containing multiple languages, where the sound feature extraction network is used to extract the phoneme posterior probability of a speech frame.
The embodiments of the present disclosure train the sound feature extraction network using a phoneme table containing multiple languages, which can improve the efficiency and quality of training the feature extraction network, and use this network to extract the phoneme posterior features of the sound driving data as the sound feature for driving the interactive object. Since the phoneme posterior probability is a speaker-independent sound feature that can support multiple languages, the posture of the interactive object fits the real pronunciation in different languages.
在一些实施例中,可以根据以下方法来构建支持多语种的语料库。In some embodiments, the multilingual corpus can be constructed according to the following method.
首先,获取多语种的语音样本,所述语音样本的语种类型与所述包含多语种的音素表所包含的语种类型相同。First, multilingual speech samples are acquired, and the language types of the speech samples are the same as the language types contained in the multilingual phoneme table.
接下来,对所述语音样本进行音素对齐操作,得到所述语音样本所包含的音素。Next, a phoneme alignment operation is performed on the speech samples to obtain phonemes included in the speech samples.
最后,利用所述包含多语种的音素表中的音素来标注所述语音样本中的音素的真实值。Finally, use the phonemes in the multilingual phoneme table to mark the real values of the phonemes in the speech samples.
在本公开实施例中,可以直接调用所述包含多语种的音素表中的音素来对语音样本中的音素进行标注,从而可以方便、高效地构建高质量、标注完整、数据均衡的语料库。In the embodiment of the present disclosure, the phonemes in the multilingual phoneme table can be directly called to mark the phonemes in the voice sample, so that a high-quality, complete-labeled, and data-balanced corpus can be constructed conveniently and efficiently.
在一些实施例中,可以通过以下具体步骤对所述声音特征提取网络进行训练。In some embodiments, the sound feature extraction network can be trained through the following specific steps.
首先,将标注后的语音样本的声音特征输入至所述声音特征提取网络,得到所述语音样本中各个语音帧的音素后验概率。其中,标注后的语音样本中每个语音帧标注有音素的真实值。Firstly, the sound features of the marked speech samples are input to the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples. Wherein, each speech frame in the marked speech sample is marked with a real value of a phoneme.
接下来,根据所述语音帧的最大音素后验概率指示的音素与所标注的真实值之间的差异,调整所述声音特征提取网络的参数值。在网络损失的变化满足收敛条件时,例如网络损失的变化量小于设定阈值时,或者迭代次数达到设定次数时完成训练,即得到了训练好的声音特征提取网络。Next, adjust the parameter values of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the labeled true value. When the change of the network loss satisfies the convergence condition, for example, when the change of the network loss is less than the set threshold, or when the number of iterations reaches the set number, the training is completed, and the trained sound feature extraction network is obtained.
图5是根据本公开至少一个实施例的交互对象的驱动装置的结构示意图,如图5所示,该装置可以包括:第一获取单元501,用于获取交互对象的声音驱动数据的声音特征;第二获取单元502,用于利用声音特征提取网络对所述声音特征进行特征提取, 得到所述声音驱动数据中各个语音帧的音素后验概率;其中,所述声音特征提取网络是通过包含多语种的音素表训练得到的;第三获取单元503,用于根据所述各个语音帧的音素后验概率,得到所述交互对象的姿态参数值;控制单元504,用于根据所述姿态参数值控制所述交互对象的姿态。Fig. 5 is a schematic structural diagram of a device for driving an interactive object according to at least one embodiment of the present disclosure. As shown in Fig. 5 , the device may include: a first acquiring unit 501, configured to acquire the sound characteristics of the sound driving data of the interactive object; The second acquisition unit 502 is used to extract the features of the sound features using the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data; wherein, the sound feature extraction network is obtained by including multiple The phoneme table training of the language is obtained; the third acquisition unit 503 is used to obtain the posture parameter value of the interactive object according to the phoneme posterior probability of each speech frame; the control unit 504 is used to obtain the posture parameter value according to the posture parameter value Controls the pose of the interactive object.
在一些实施例中,所述第一获取单元具体用于:获取所述交互对象的声音驱动数据对应的语音帧序列;根据所述语音帧序列中各个语音帧的声音特征向量,得到所述声音驱动数据的声音特征。In some embodiments, the first acquiring unit is specifically configured to: acquire the voice frame sequence corresponding to the voice driving data of the interactive object; obtain the voice according to the voice feature vector of each voice frame in the voice frame sequence The sonic characteristics of the driving data.
在一些实施例中,所述声音特征提取网络包括第一全连接网络、编码子网络、第二全连接网络,所述第二获取单元具体用于:将所述声音特征输入至所述第一全连接网络,得到所述第一全连接网络输出的第一声音特征序列;利用所述编码子网络,对所述第一声音特征序列进行特征编码处理;将编码结果输入至所述第二全连接网络,得到所述声音驱动数据中各个语音帧的音素后验概率。In some embodiments, the sound feature extraction network includes a first fully connected network, an encoding sub-network, and a second fully connected network, and the second acquisition unit is specifically configured to: input the sound feature into the first A fully connected network to obtain the first sound feature sequence output by the first fully connected network; use the encoding sub-network to perform feature encoding processing on the first sound feature sequence; input the encoding result to the second fully connected network The network is connected to obtain the phoneme posterior probability of each speech frame in the sound driving data.
在一些实施例中,所述第三获取单元具体用于:将所述各个语音帧的音素后验概率输入至时序网络,输出关联特征信息;将所述关联特征信息输入至第三全连接网络,得到关联特征序列;对所述关联特征序列进行激活处理,得到所述各个语音帧的音素后验概率匹配的所述交互对象的姿态参数值。In some embodiments, the third acquisition unit is specifically configured to: input the phoneme posterior probability of each speech frame into a time series network, and output associated feature information; input the associated feature information into a third fully connected network , to obtain an associated feature sequence; performing activation processing on the associated feature sequence to obtain the gesture parameter value of the interactive object matched with the phoneme posterior probability of each speech frame.
在一些实施例中,所述交互对象的控制参数包括面部姿态控制参数,所述控制单元具体用于:根据与所述各个语音帧的音素后验概率匹配的面部姿态参数值,驱动所述交互对象实现与所述声音驱动数据中的各个语音帧匹配的面部姿态。In some embodiments, the control parameters of the interactive object include facial gesture control parameters, and the control unit is specifically configured to: drive the interaction according to the facial gesture parameter value matched with the phoneme posterior probability of each speech frame. The subject achieves a facial gesture that matches each speech frame in the sound-driven data.
图6是根据本公开至少一个实施例的声音特征提出网络的训练装置的结构示意图,如图6所示,该装置可以包括:音素表获取单元601,用于根据多个目标语种中的音素,得到包含多语种的音素表;训练获取单元602,用于基于所述包含多语种的音素表,训练得到声音特征提取网络,所述声音特征提取网络用于提取语音帧的音素后验概率。Fig. 6 is a schematic structural diagram of a training device for proposing a network of sound features according to at least one embodiment of the present disclosure. As shown in Fig. 6, the device may include: a phoneme table acquisition unit 601, configured to, according to phonemes in multiple target languages, Obtaining a multilingual phoneme table; the training and obtaining unit 602 is configured to train a sound feature extraction network based on the multilingual phoneme table, and the sound feature extraction network is used to extract phoneme posterior probabilities of speech frames.
In some embodiments, the phoneme table acquiring unit is specifically configured to: acquire the phonemes of the multiple target languages and concatenate them; and merge, in the concatenation result, phonemes whose pronunciation similarity exceeds a first set threshold to obtain the multilingual phoneme table.
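A minimal sketch of this concatenate-then-merge step is given below, assuming a pronunciation-similarity function similarity(a, b) that returns a score in [0, 1]; the threshold value is illustrative, since the disclosure does not say how similarity is computed.

```python
def build_merged_phoneme_table(language_phoneme_sets, similarity, first_threshold=0.9):
    """Concatenate phonemes of all target languages, then merge near-duplicates.

    similarity(a, b) is an assumed scoring function in [0, 1]; the threshold
    value is illustrative only.
    """
    concatenated = [p for phonemes in language_phoneme_sets for p in phonemes]
    table = []
    for phoneme in concatenated:
        # Keep this phoneme only if no existing entry already sounds close enough
        if not any(similarity(phoneme, kept) > first_threshold for kept in table):
            table.append(phoneme)
    return table
```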
In some embodiments, the phoneme table acquiring unit is specifically configured to: map the phonemes of the multiple target languages to International Phonetic Alphabet (IPA) symbols whose pronunciation similarity satisfies a preset similarity condition; and merge IPA symbols with the same pronunciation in the mapping result to obtain the multilingual phoneme table.
In some embodiments, in response to there being, among the multiple target languages, a first phoneme whose pronunciation similarity with every IPA symbol is less than or equal to a second set threshold, the first phoneme is added to the multilingual phoneme table.
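The two preceding paragraphs can be pictured together as below. The similarity function, the threshold, the IPA inventory, and the rule that a phoneme above the second threshold maps to its closest IPA symbol are all assumptions made for illustration.

```python
def build_ipa_phoneme_table(language_phoneme_sets, ipa_symbols, similarity,
                            second_threshold=0.5):
    """Map each phoneme to its closest IPA symbol; keep unmatched phonemes as-is.

    similarity, the threshold, and the IPA inventory are illustrative assumptions.
    """
    table = set()
    for phonemes in language_phoneme_sets:
        for phoneme in phonemes:
            best_ipa = max(ipa_symbols, key=lambda s: similarity(phoneme, s))
            if similarity(phoneme, best_ipa) <= second_threshold:
                # First phoneme with no sufficiently similar IPA symbol:
                # add the phoneme itself to the table
                table.add(phoneme)
            else:
                # Pronunciation similarity satisfies the preset condition:
                # map to IPA; identical IPA symbols merge via the set
                table.add(best_ipa)
    return sorted(table)
```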
In some embodiments, the apparatus further includes a labeling unit configured to: acquire multilingual speech samples, where the languages of the speech samples are the same as those covered by the multilingual phoneme table; perform a phoneme alignment operation on the speech samples to obtain the phonemes contained in the speech samples; and label the ground-truth values of the phonemes in the speech samples with the phonemes in the multilingual phoneme table.
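One way to picture the labeling unit is the sketch below, where force_align stands in for any off-the-shelf forced aligner (the disclosure does not name one) and the sample attributes are hypothetical.

```python
def label_speech_samples(speech_samples, phoneme_table, force_align):
    """Produce frame-level ground-truth phoneme labels for training.

    force_align(audio, transcript) is an assumed forced aligner returning, for
    each speech frame, the phoneme uttered in that frame; the sample attributes
    (audio, transcript) are hypothetical.
    """
    labeled = []
    for sample in speech_samples:
        frame_phonemes = force_align(sample.audio, sample.transcript)
        # Ground-truth value: index of each aligned phoneme in the multilingual table
        labels = [phoneme_table.index(p) for p in frame_phonemes]
        labeled.append((sample, labels))
    return labeled
```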
In some embodiments, the training unit is specifically configured to: input the sound features of the labeled speech samples into the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the speech samples; and, for each speech frame in the speech samples, adjust the parameter values of the sound feature extraction network according to the difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the labeled ground-truth value.
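A training-loop sketch consistent with this description is given below. Using a per-frame negative log-likelihood loss and the Adam optimizer is an assumption; the disclosure only requires adjusting the network parameters according to the difference between the phoneme indicated by the maximum posterior probability and the labeled ground truth.

```python
import torch
import torch.nn as nn

def train_feature_extraction_net(net, labeled_samples, epochs=10, lr=1e-4):
    """Sketch of the training step for the sound feature extraction network.

    labeled_samples yields (sound_features, frame_labels) pairs, where
    sound_features has shape [frames, feat_dim] and frame_labels is a LongTensor
    of ground-truth phoneme indices, one per speech frame.  The loss and
    optimizer choices are assumptions, not from the disclosure.
    """
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.NLLLoss()
    for _ in range(epochs):
        for sound_features, frame_labels in labeled_samples:
            posteriors = net(sound_features)                   # [frames, num_phonemes]
            loss = loss_fn(torch.log(posteriors + 1e-9), frame_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return net
```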
At least one embodiment of the present disclosure further provides an electronic device. As shown in Fig. 7, the device includes a memory and a processor, where the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement, when executing the computer instructions, the method for driving an interactive object described in any embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements the method for driving an interactive object described in any embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides a computer program product, including a computer program, and the program, when executed by a processor, implements the method for driving an interactive object described in any embodiment of the present disclosure.
Those skilled in the art should understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
The embodiments in this specification are described in a progressive manner; for identical or similar parts of the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the data processing device embodiments are described relatively briefly since they are substantially similar to the method embodiments; for relevant details, reference may be made to the description of the method embodiments.
The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. Multitasking and parallel processing are also possible, or may be advantageous, in certain implementations.
Embodiments of the subject matter and the functional operations described in this specification may be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them. Embodiments of the subject matter described in this specification may be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information and transmit it to a suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification may be performed by one or more programmable computers executing one or more computer programs, which perform corresponding functions by operating on input data and generating output. The processes and logic flows may also be performed by special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus may also be implemented as special-purpose logic circuitry.
Computers suitable for executing a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other kind of central processing unit. Generally, a central processing unit receives instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes, or is operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from or transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.
Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as describing features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may act in certain combinations as described above and even be initially claimed as such, one or more features from a claimed combination may in some cases be removed from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products.
Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims may be performed in a different order and still achieve the desired results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.
The above descriptions are merely preferred embodiments of one or more embodiments of this specification and are not intended to limit one or more embodiments of this specification. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of this specification shall fall within the scope of protection of one or more embodiments of this specification.

Claims (16)

  1. A method for driving an interactive object, comprising:
    acquiring sound features of sound driving data of an interactive object;
    performing feature extraction on the sound features using a sound feature extraction network to obtain a phoneme posterior probability for each speech frame in the sound driving data, wherein the sound feature extraction network is trained with a phoneme table covering multiple languages;
    obtaining pose parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames; and
    controlling a pose of the interactive object according to the pose parameter values.
  2. The method according to claim 1, wherein the acquiring sound features of sound driving data of an interactive object comprises:
    acquiring a speech frame sequence corresponding to the sound driving data of the interactive object; and
    obtaining the sound features of the sound driving data according to a sound feature vector of each speech frame in the speech frame sequence.
  3. The method according to claim 1 or 2, wherein the sound feature extraction network comprises a first fully connected network, an encoding sub-network, and a second fully connected network, and the performing feature extraction on the sound features using the sound feature extraction network to obtain the phoneme posterior probability of each speech frame in the sound driving data comprises:
    inputting the sound features into the first fully connected network to obtain a first sound feature sequence output by the first fully connected network;
    performing feature encoding on the first sound feature sequence with the encoding sub-network; and
    inputting an encoding result into the second fully connected network to obtain the phoneme posterior probability of each speech frame in the sound driving data.
  4. The method according to any one of claims 1 to 3, wherein the obtaining pose parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames comprises:
    inputting the phoneme posterior probabilities of the respective speech frames into a temporal network to output associated feature information;
    inputting the associated feature information into a third fully connected network to obtain an associated feature sequence; and
    performing activation processing on the associated feature sequence to obtain the pose parameter values of the interactive object matching the phoneme posterior probabilities of the respective speech frames.
  5. The method according to any one of claims 1 to 4, wherein the pose parameters of the interactive object comprise facial pose parameters, and the controlling the pose of the interactive object according to the pose parameter values comprises:
    driving the interactive object to present a facial pose matching each speech frame in the sound driving data, according to facial pose parameter values matching the phoneme posterior probabilities of the respective speech frames.
  6. A phoneme processing method, comprising:
    obtaining a multilingual phoneme table according to phonemes of multiple target languages; and
    training a sound feature extraction network based on the multilingual phoneme table, wherein the sound feature extraction network is used to extract phoneme posterior probabilities of speech frames.
  7. The method according to claim 6, wherein the obtaining a multilingual phoneme table according to phonemes of multiple target languages comprises:
    concatenating the phonemes of the multiple target languages; and
    merging, in the concatenation result, phonemes whose pronunciation similarity exceeds a first set threshold to obtain the multilingual phoneme table.
  8. The method according to claim 6, wherein the obtaining a multilingual phoneme table according to phonemes of multiple target languages comprises:
    mapping the phonemes of the multiple target languages to International Phonetic Alphabet (IPA) symbols whose pronunciation similarity satisfies a preset similarity condition; and
    merging IPA symbols with the same pronunciation in the mapping result to obtain the multilingual phoneme table.
  9. The method according to claim 8, further comprising: in response to there being, among the multiple target languages, a first phoneme whose pronunciation similarity with every IPA symbol is less than or equal to a second set threshold, adding the first phoneme to the multilingual phoneme table.
  10. The method according to any one of claims 6 to 9, further comprising:
    acquiring multilingual speech samples, wherein the languages of the speech samples are the same as those covered by the multilingual phoneme table;
    performing a phoneme alignment operation on the speech samples to obtain the phonemes contained in the speech samples; and
    labeling ground-truth values of the phonemes in the speech samples with the phonemes in the multilingual phoneme table.
  11. The method according to claim 10, wherein the training a sound feature extraction network based on the multilingual phoneme table comprises:
    inputting sound features of the labeled speech samples into the sound feature extraction network to obtain a phoneme posterior probability for each speech frame in the speech samples; and
    for each speech frame in the speech samples, adjusting parameter values of the sound feature extraction network according to a difference between the phoneme indicated by the maximum phoneme posterior probability of the speech frame and the labeled ground-truth value.
  12. An apparatus for driving an interactive object, comprising:
    a first acquiring unit, configured to acquire sound features of sound driving data of an interactive object;
    a second acquiring unit, configured to perform feature extraction on the sound features using a sound feature extraction network to obtain a phoneme posterior probability for each speech frame in the sound driving data, wherein the sound feature extraction network is trained with a phoneme table covering multiple languages;
    a third acquiring unit, configured to obtain pose parameter values of the interactive object according to the phoneme posterior probabilities of the respective speech frames; and
    a control unit, configured to control a pose of the interactive object according to the pose parameter values.
  13. A phoneme processing apparatus, comprising:
    a phoneme table acquiring unit, configured to obtain a multilingual phoneme table according to phonemes of multiple target languages; and
    a training unit, configured to train a sound feature extraction network based on the multilingual phoneme table, wherein the sound feature extraction network is used to extract phoneme posterior probabilities of speech frames.
  14. An electronic device, comprising a memory and a processor, wherein the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement, when executing the computer instructions, the method according to any one of claims 1 to 5 or the method according to any one of claims 6 to 11.
  15. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 5 or the method according to any one of claims 6 to 11.
  16. A computer program product, comprising a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 5 or the method according to any one of claims 6 to 11.
PCT/CN2022/089870 2021-05-31 2022-04-28 Interaction object driving and phoneme processing methods and apparatus, device and storage medium WO2022252890A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110604874.8A CN113314104B (en) 2021-05-31 2021-05-31 Interactive object driving and phoneme processing method, device, equipment and storage medium
CN202110604874.8 2021-05-31

Publications (1)

Publication Number Publication Date
WO2022252890A1 true WO2022252890A1 (en) 2022-12-08

Family

ID=77376708

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089870 WO2022252890A1 (en) 2021-05-31 2022-04-28 Interaction object driving and phoneme processing methods and apparatus, device and storage medium

Country Status (3)

Country Link
CN (1) CN113314104B (en)
TW (1) TW202248994A (en)
WO (1) WO2022252890A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314104B (en) * 2021-05-31 2023-06-20 北京市商汤科技开发有限公司 Interactive object driving and phoneme processing method, device, equipment and storage medium
CN113724718B (en) 2021-09-01 2022-07-29 宿迁硅基智能科技有限公司 Target audio output method, device and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110503942A (en) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 A kind of voice driven animation method and device based on artificial intelligence
CN110880315A (en) * 2019-10-17 2020-03-13 深圳市声希科技有限公司 Personalized voice and video generation system based on phoneme posterior probability
CN111459450A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111933110A (en) * 2020-08-12 2020-11-13 北京字节跳动网络技术有限公司 Video generation method, generation model training method, device, medium and equipment
CN112017648A (en) * 2020-08-25 2020-12-01 北京声智科技有限公司 Weighted finite state converter construction method, speech recognition method and device
CN112669841A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Training method and device for multilingual speech generation model and computer equipment
CN113314104A (en) * 2021-05-31 2021-08-27 北京市商汤科技开发有限公司 Interactive object driving and phoneme processing method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106098059B (en) * 2016-06-23 2019-06-18 上海交通大学 Customizable voice awakening method and system
US10832129B2 (en) * 2016-10-07 2020-11-10 International Business Machines Corporation Transfer of an acoustic knowledge to a neural network
CN107633842B (en) * 2017-06-12 2018-08-31 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN109377986B (en) * 2018-11-29 2022-02-01 四川长虹电器股份有限公司 Non-parallel corpus voice personalized conversion method
CN113672194A (en) * 2020-03-31 2021-11-19 北京市商汤科技开发有限公司 Method, device and equipment for acquiring acoustic feature sample and storage medium

Also Published As

Publication number Publication date
TW202248994A (en) 2022-12-16
CN113314104A (en) 2021-08-27
CN113314104B (en) 2023-06-20

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22814937; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 22814937; Country of ref document: EP; Kind code of ref document: A1)