CN112786018A - Speech conversion and related model training method, electronic equipment and storage device - Google Patents

Speech conversion and related model training method, electronic equipment and storage device

Info

Publication number
CN112786018A
Authority
CN
China
Prior art keywords: information, phoneme, network, sample, acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011634065.3A
Other languages
Chinese (zh)
Inventor
刘利娟
胡亚军
江源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202011634065.3A
Publication of CN112786018A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/26: Speech to text systems
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units

Abstract

The application discloses a speech conversion method, a related model training method, an electronic device and a storage device. The method for training the voice conversion model comprises the following steps: acquiring a first sample voice of a target object and a pre-trained voice conversion model, wherein the voice conversion model is obtained by pre-training with second sample voices of sample objects and third sample voices obtained by performing tone conversion on the second sample voices; recognizing first phoneme information of the first sample voice by using a phoneme recognition network, and extracting first actual acoustic information of the first sample voice; predicting the first phoneme information and the first actual acoustic information by using an acoustic prediction network to obtain first predicted acoustic information, and adjusting network parameters of the acoustic prediction network based on the difference between the first actual acoustic information and the first predicted acoustic information; and taking the combination of the phoneme recognition network and the adjusted acoustic prediction network as a voice conversion model matched with the target object. With this scheme, the quality of voice conversion can be improved.

Description

Speech conversion and related model training method, electronic equipment and storage device
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular to a speech conversion method, a related model training method, an electronic device, and a storage device.
Background
Speech conversion is an important branch of speech synthesis, and aims to convert the speech of a source object so that the converted speech has the speaking content of the source object and the speaking timbre of a target object. Therefore, how to improve the quality of voice conversion becomes a topic with great research value.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a speech conversion method, a related model training method, an electronic device and a storage device, which can improve the quality of voice conversion.
In order to solve the above problem, a first aspect of the present application provides a method for training a speech conversion model, including: acquiring a first sample voice of a target object and acquiring a pre-trained voice conversion model, wherein the voice conversion model comprises a phoneme recognition network and an acoustic prediction network and is obtained by pre-training with second sample voices of a plurality of sample objects and third sample voices, the third sample voices being obtained by performing tone conversion on the second sample voices; recognizing first phoneme information of the first sample voice by using the phoneme recognition network, and extracting first actual acoustic information of the first sample voice; predicting the first phoneme information and the first actual acoustic information by using the acoustic prediction network to obtain first predicted acoustic information, and adjusting network parameters of the acoustic prediction network based on the difference between the first actual acoustic information and the first predicted acoustic information; and taking the combination of the phoneme recognition network and the adjusted acoustic prediction network as a voice conversion model matched with the target object.
In order to solve the above problem, a second aspect of the present application provides a speech conversion method, including: acquiring a voice to be converted of a source object, and acquiring a voice conversion model matched with a target object; the voice conversion model is obtained by utilizing the training method of the voice conversion model in the first aspect; recognizing the phoneme information of the voice to be converted by utilizing a phoneme recognition network of the voice conversion model, and extracting the actual acoustic information of the voice to be converted; predicting the phoneme information and the actual acoustic information by using an acoustic prediction network of the voice conversion model to obtain predicted acoustic information; and synthesizing to obtain synthesized voice with the same tone as the target object by utilizing the predicted acoustic information.
In order to solve the above problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, where the memory stores program instructions, and the processor is configured to execute the program instructions to implement the method for training a speech conversion model in the first aspect or to implement the method for speech conversion in the second aspect.
In order to solve the above problem, a fourth aspect of the present application provides a storage device storing program instructions capable of being executed by a processor, the program instructions being used to implement the method for training a speech conversion model in the above first aspect or to implement the method for speech conversion in the above second aspect.
According to the above scheme, a first sample voice of the target object is acquired together with a pre-trained voice conversion model, the voice conversion model comprising a phoneme recognition network and an acoustic prediction network, so that the phoneme recognition network can be used to recognize first phoneme information of the first sample voice and first actual acoustic information of the first sample voice can be extracted. Because the voice conversion model is pre-trained with second sample voices of a plurality of sample objects and with third sample voices obtained by performing tone conversion on the second sample voices, the pre-trained voice conversion model can accurately extract phoneme information from voices of different timbres, which improves the accuracy of the first phoneme information. On this basis, the acoustic prediction network is used to predict the first phoneme information and the first actual acoustic information to obtain first predicted acoustic information, and the network parameters of the acoustic prediction network are adjusted based on the difference between the first actual acoustic information and the first predicted acoustic information; by constraining the first predicted acoustic information, which is predicted from accurate first phoneme information, against the first actual acoustic information, the acoustic prediction network learns the acoustic characteristics of the target object. The combination of the phoneme recognition network and the adjusted acoustic prediction network is then used as the voice conversion model matched with the target object, so that the accuracy of the voice conversion model can be improved and the quality of voice conversion can be improved.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a method for training a speech conversion model according to the present application;
FIG. 2 is a block diagram of an embodiment of an acoustic prediction network;
FIG. 3 is a flowchart illustrating an embodiment of step S11 in FIG. 1;
FIG. 4 is a block diagram of another embodiment of an acoustic prediction network;
FIG. 5 is a schematic flow chart diagram of one embodiment of training a phoneme recognition network;
FIG. 6 is a state diagram of an embodiment of a training phoneme recognition network;
FIG. 7 is a flowchart illustrating a voice conversion method according to an embodiment of the present application;
FIG. 8 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 9 is a block diagram of an embodiment of a storage device according to the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for training a speech conversion model according to an embodiment of the present application. Specifically, the method may include the following steps:
step S11: a first sample voice of a target object is obtained, and a pre-trained voice conversion model is obtained.
In the embodiment of the present disclosure, the speech conversion model includes a phoneme recognition network and an acoustic prediction network, the speech conversion model is obtained by pre-training second sample speech and third sample speech of a plurality of sample objects, and the third sample speech is obtained by performing tone conversion on the second sample speech.
In one implementation scenario, the languages used by the sample objects may not be identical in order to improve the applicability of the speech conversion model to different languages.
In a specific implementation scenario, the languages used by some sample objects may be the same, while the languages used by other sample objects may be different. For example, sample objects 1 to 4 may use English, English, Japanese and Japanese, respectively, which is not limited herein.
In another specific implementation scenario, the languages used by the sample objects may also be completely different, and still taking the sample objects 1 to 4 as examples, the sample object 1 may adopt english, the sample object 2 may adopt russian, the sample object 3 may adopt japanese, and the sample object 4 may adopt french, which is not limited herein.
In yet another specific implementation scenario, to further expand the applicability of the speech conversion model, the languages used by the sample objects may cover the major languages of the real world, which may include, but are not limited to: Chinese, English, Spanish, French, German, Russian, Arabic, Finnish, Japanese, Korean, Georgian, Malay, Vietnamese, Tamil, and the like, without limitation.
In one implementation scenario, a second sample speech of several sample objects may be pre-collected. Furthermore, in order to improve the accuracy of the pre-trained speech conversion model as much as possible, the second sample speech of several sample objects may be collected as much as possible to cover the different phonemes as comprehensively as possible. For example, for each of the above languages, a second sample voice in a different environment (e.g., a quiet environment, an office environment, a conference environment, etc.) for a preset duration (e.g., hundreds of hours, thousands of hours, etc.) may be collected. On the basis, the pre-trained voice conversion model can learn different languages.
In a specific implementation scenario, considering that different languages share pronunciation characteristics, more second sample voices can be collected for widely used languages such as Chinese and English, while the amount of collected data can be reduced appropriately for less widely used languages such as Finnish and Georgian.
In another specific implementation scenario, in order to facilitate uniform modeling of different languages and the sharing of pronunciation similarity between them, the International Phonetic Alphabet (IPA) may be used to label the actual phoneme information of the second sample speeches of the different languages, so that all languages are labeled with a uniform phoneme labeling scheme and pronunciation similarity can be shared across languages.
In another implementation scenario, the first sample speech of the target object may also be pre-collected. Since the second sample voices of the sample objects are collected as extensively as possible, comparatively few first sample voices of the target object need to be collected; for example, at least 100 sentences of the first sample speech of the target object may be collected. In addition, in order to improve the accuracy of the finally trained speech conversion model matched with the target object, the first sample speech of the target object can be acquired in a quiet environment.
In an implementation scenario, the phoneme recognition network of the pre-trained speech conversion model may be specifically trained from the second sample speech and the third sample speech, and the acoustic prediction network of the pre-trained speech conversion model may be specifically trained from the second sample speech. For the specific training process of the pre-training, reference may be made to the following disclosure embodiments, which are not repeated herein.
It should be noted that, in the embodiments of the present disclosure and the following disclosed embodiments, the phoneme recognition network is used to recognize the phoneme information of speech, i.e., to recognize the pronunciation content of the speech. For example, for the speech "how is the weather", the phoneme information expressed in International Phonetic Alphabet symbols can be obtained by recognition with the phoneme recognition network:
[IPA transcription shown as an image in the original]
Other cases may be analogized, and are not exemplified one by one here. Furthermore, the acoustic prediction network is used to predict the acoustic information of speech frames, which may include, but is not limited to, pitch frequency, sound intensity, duration and the like, and is not limited herein. Speech can then be synthesized from the acoustic information of the speech frames, for example by inputting the acoustic information into a vocoder; the specific process of synthesizing speech with a vocoder is not described here again.
In one implementation scenario, the phoneme recognition network may specifically include, but is not limited to: a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN), and the like, which are not limited herein. In addition, the acoustic prediction network may specifically be composed of one or more of a fully-connected network, a cyclic neural network, and a convolutional neural network, which is not limited herein.
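As a purely illustrative aid (not part of the disclosed embodiments), the following minimal PyTorch sketch shows one possible way to realize the two networks with the layer types named above; the class names, layer sizes and phoneme inventory size are assumptions, and the acoustic prediction network is reduced here to a plain fully connected stack (the extraction/prediction subnetwork variant of FIG. 2 is sketched further below).

```python
import torch
import torch.nn as nn

class PhonemeRecognizer(nn.Module):
    """Illustrative phoneme recognition network: a recurrent encoder mapping
    per-frame acoustic features to a posterior over phoneme units."""
    def __init__(self, feat_dim=80, hidden_dim=256, num_phonemes=200):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_phonemes)

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        hidden, _ = self.encoder(feats)
        return self.classifier(hidden).softmax(dim=-1)   # per-frame phoneme posteriors

class SimpleAcousticPredictor(nn.Module):
    """Illustrative acoustic prediction network: a fully connected stack that
    maps per-frame phoneme posteriors to predicted acoustic frames."""
    def __init__(self, num_phonemes=200, hidden_dim=256, acoustic_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_phonemes, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, acoustic_dim))

    def forward(self, phoneme_post):           # (batch, frames, num_phonemes)
        return self.net(phoneme_post)          # (batch, frames, acoustic_dim)
```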
Step S12: first phoneme information of the first sample speech is recognized using a phoneme recognition network, and first actual acoustic information of the first sample speech is extracted.
According to the embodiment of the disclosure, the first phoneme information of the first sample speech can be obtained by recognition with the phoneme recognition network. For example, for the first sample speech "how is the weather today", the first phoneme information expressed in International Phonetic Alphabet symbols may be recognized as:
[IPA transcription shown as an image in the original]
Other cases may be analogized, and no one example is given here.
In one implementation scenario, the first actual acoustic information may include filter bank (FBK) feature information; further, the first actual acoustic information may also include combined acoustic feature information, which may include but is not limited to: spectral features, pitch frequency, voiced/unvoiced information, aperiodic component information, and the like, wherein the spectral features may include, but are not limited to: Mel-frequency cepstral coefficients (MFCC), line spectrum pair features, spectral envelope features, etc., which are not limited herein.
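As an illustrative sketch only (not part of the disclosed embodiments), acoustic information of the kind listed above could be extracted per frame with an off-the-shelf toolkit; the sampling rate, frame parameters and the use of librosa here are assumptions.

```python
import numpy as np
import librosa

def extract_acoustic_features(wav_path, sr=16000, n_mels=80,
                              frame_len=400, hop=160):
    """Per-frame acoustic information of the kind listed above: log filter-bank
    (FBK) features concatenated with the fundamental (pitch) frequency."""
    wav, _ = librosa.load(wav_path, sr=sr)

    # 80-dim log mel filter-bank features, one vector per 10 ms frame.
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=frame_len,
                                         hop_length=hop, n_mels=n_mels)
    fbank = np.log(mel + 1e-6).T                      # (frames, n_mels)

    # Frame-level pitch frequency; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(wav, fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C6'),
                            sr=sr, frame_length=frame_len, hop_length=hop)
    f0 = np.nan_to_num(f0)                            # treat unvoiced frames as 0

    n = min(len(fbank), len(f0))
    return np.concatenate([fbank[:n], f0[:n, None]], axis=1)
```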
Step S13: predicting the first phoneme information and the first actual acoustic information by using an acoustic prediction network to obtain first predicted acoustic information, and adjusting network parameters of the acoustic prediction network based on the difference between the first actual acoustic information and the first predicted acoustic information.
In an implementation scenario, the first actual acoustic information and the first predicted acoustic information may be processed by using a minimum mean square error function to obtain a loss value of the acoustic prediction network, so that the network parameters of the acoustic prediction network may be adjusted by using the loss value, and the acoustic prediction network may learn the acoustic features of the target object.
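A minimal sketch of such an adaptation step is shown below; it is illustrative only, it assumes an acoustic prediction network whose forward pass takes both the phoneme posteriors and the actual acoustic frames (one such structure is sketched after the FIG. 2 discussion below), and the phoneme recognition network is kept frozen.

```python
import torch
import torch.nn.functional as F

def adapt_step(phoneme_net, acoustic_net, optimizer, feats, actual_acoustic):
    """One adaptation step on a batch of the target object's first sample speech.
    Only the acoustic prediction network is updated; the phoneme recognition
    network stays frozen so that it remains speaker-independent."""
    with torch.no_grad():
        phoneme_post = phoneme_net(feats)             # first phoneme information

    # First predicted acoustic information from phoneme info + actual acoustics.
    predicted = acoustic_net(phoneme_post, actual_acoustic)
    loss = F.mse_loss(predicted, actual_acoustic)     # minimum-mean-square-error loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```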
In an implementation scenario, in order to further improve the accuracy of the acoustic prediction network, please refer to FIG. 2, which is a schematic diagram of a framework of an embodiment of the acoustic prediction network. As shown in FIG. 2, the acoustic prediction network may further include an extraction subnetwork and a prediction subnetwork, so that the prosody feature information of the first actual acoustic information can be extracted by the extraction subnetwork, and the prosody feature information and the first phoneme information can be predicted by the prediction subnetwork to obtain the first predicted acoustic information. In this manner, because the acoustic prediction network is set to include an extraction subnetwork and a prediction subnetwork, with the extraction subnetwork extracting the prosodic feature information of the first actual acoustic information and the prediction subnetwork predicting the first predicted acoustic information from the prosodic feature information and the first phoneme information, the acoustic prediction network is prevented from simply copying the input first actual acoustic information during training and thereby failing to learn the acoustic features of the target object, which improves the accuracy of the acoustic prediction network. In addition, predicting the first predicted acoustic information from the first phoneme information together with the extracted prosodic feature information helps to reduce the loss error between the first actual acoustic information and the first predicted acoustic information, which further improves the accuracy of the acoustic prediction network.
In a specific implementation scenario, the extraction sub-network may include, but is not limited to: a recurrent neural network, a convolutional neural network, etc., without limitation.
In another specific implementation scenario, the prosodic feature information may specifically include, but is not limited to: the information such as tone and pitch is not limited herein.
In yet another specific implementation scenario, the prediction subnetworks may specifically include, but are not limited to: a fully connected network, etc., without limitation.
In a further specific implementation scenario, since the input first actual acoustic information includes information such as phonemes, prosody, and the like in the first sample speech, in order to enable the extraction subnetwork to extract key prosody feature information therefrom as much as possible to predict the first predicted acoustic information in combination with the first phoneme information, so as to reduce a loss error between the first actual acoustic information and the first predicted acoustic information, an amount of information extracted by the extraction subnetwork from the first actual acoustic information may also be limited, and specifically, the number of output layer nodes of the extraction subnetwork may not exceed a preset threshold, for example, the preset threshold may be set to 1, 2, and the like, which is not limited herein. In the above manner, by setting the number of the output layer nodes of the extraction sub-network to be not more than the preset threshold, the amount of information extracted from the first actual acoustic information by the extraction sub-network can be limited, so that the loss error between the first actual acoustic information and the first predicted acoustic information can be restricted, the acoustic prediction network can learn the acoustic features of the target object in the training process, and the accuracy of the acoustic prediction network can be improved.
In another specific implementation scenario, in order to further improve the accuracy of extracting the prosodic feature information from the sub-network, an Instance Normalization (Instance Normalization) layer may be connected after the sub-network hidden layer is extracted. By the method, the example normalization layer is connected after the sub-network hidden layer is extracted, so that the speaker related information can be favorably normalized, the speaker related information extracted by the sub-network can be reduced, the accuracy of the sub-network is further improved, and the accuracy of the acoustic prediction network can be favorably improved.
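The following minimal PyTorch sketch, which is illustrative only, shows one way to realize an extraction subnetwork with a narrow output (here 2 nodes, standing in for the preset threshold), an instance normalization layer after its hidden layer, and a fully connected prediction subnetwork; all class names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ExtractionSubnetwork(nn.Module):
    """Recurrent hidden layer + instance normalization, squeezed to a very
    narrow output so the network cannot simply copy its acoustic input."""
    def __init__(self, acoustic_dim=80, hidden_dim=128, bottleneck_dim=2):
        super().__init__()
        self.rnn = nn.GRU(acoustic_dim, hidden_dim, batch_first=True)
        self.inst_norm = nn.InstanceNorm1d(hidden_dim)   # normalizes speaker-related statistics
        self.bottleneck = nn.Linear(hidden_dim, bottleneck_dim)

    def forward(self, acoustic):                      # (batch, frames, acoustic_dim)
        h, _ = self.rnn(acoustic)
        h = self.inst_norm(h.transpose(1, 2)).transpose(1, 2)   # InstanceNorm1d wants (B, C, T)
        return self.bottleneck(h)                     # (batch, frames, bottleneck_dim)

class PredictionSubnetwork(nn.Module):
    """Fully connected stack mapping phoneme posteriors plus the narrow prosody
    code to predicted acoustic frames."""
    def __init__(self, num_phonemes=200, bottleneck_dim=2,
                 hidden_dim=256, acoustic_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_phonemes + bottleneck_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, acoustic_dim))

    def forward(self, phoneme_post, prosody):
        return self.net(torch.cat([phoneme_post, prosody], dim=-1))

class AcousticPredictionNetwork(nn.Module):
    """Extraction subnetwork + prediction subnetwork combined, as in FIG. 2."""
    def __init__(self):
        super().__init__()
        self.extract = ExtractionSubnetwork()
        self.predict = PredictionSubnetwork()

    def forward(self, phoneme_post, actual_acoustic):
        prosody = self.extract(actual_acoustic)       # prosodic feature information
        return self.predict(phoneme_post, prosody)    # predicted acoustic information
```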
Step S14: and combining the phoneme recognition network and the adjusted acoustic prediction network to serve as a voice conversion model matched with the target object.
In the embodiment of the present disclosure, a combination of the phoneme recognition network and the adjusted acoustic prediction network may be finally used as a speech conversion model matched with the target object.
In one implementation scenario, as described above, the languages used by the sample objects may not be identical, so that the phoneme recognition network can recognize phoneme information of speech in different languages and the adaptability of the acoustic prediction network to speech in different languages can be improved. For a specific conversion process, reference may be made to the related disclosed embodiments below, which are not described here in detail.
According to the scheme described above, a first sample voice of the target object is acquired together with a pre-trained voice conversion model, the voice conversion model comprising a phoneme recognition network and an acoustic prediction network, so that the phoneme recognition network can be used to recognize first phoneme information of the first sample voice and first actual acoustic information of the first sample voice can be extracted. Because the voice conversion model is pre-trained with second sample voices of a plurality of sample objects and with third sample voices obtained by performing tone conversion on the second sample voices, the pre-trained voice conversion model can accurately extract phoneme information from voices of different timbres, which improves the accuracy of the first phoneme information. On this basis, the acoustic prediction network is used to predict the first phoneme information and the first actual acoustic information to obtain first predicted acoustic information, and the network parameters of the acoustic prediction network are adjusted based on the difference between the first actual acoustic information and the first predicted acoustic information; by constraining the first predicted acoustic information, which is predicted from accurate first phoneme information, against the first actual acoustic information, the acoustic prediction network learns the acoustic characteristics of the target object. The combination of the phoneme recognition network and the adjusted acoustic prediction network is then used as the voice conversion model matched with the target object, so that the accuracy of the voice conversion model can be improved and the quality of voice conversion can be improved.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating an embodiment of step S11 in fig. 1. Specifically, FIG. 3 is a flow diagram of one embodiment of obtaining a pre-trained speech conversion model. The embodiment of the present disclosure may specifically include the following steps:
step S31: and respectively identifying second phoneme information of the second sample voice and third phoneme information of the third sample voice by using the phoneme identification network.
As described above, in the embodiment of the present disclosure, the third sample voice is obtained by performing tone conversion on the second sample voice. Specifically, in order to improve the efficiency of tone conversion, a preset acoustic prediction network corresponding to a sample object may be trained by using second sample voices of a plurality of sample objects, and the preset acoustic prediction network learns acoustic features of the sample objects in a training process. On the basis, the sample objects can be respectively used as current objects, and the rest of the sample objects can be used as reference objects, so that the tone of the second sample voice of the current object is converted to the reference object by using the phoneme recognition network and the preset acoustic prediction network corresponding to the reference object, and a third sample voice with the same tone as the reference object is obtained.
In an implementation scenario, in order to improve the efficiency of training the phoneme recognition network in the embodiment of the present disclosure, several voices recorded in a quiet environment may be selected from the second sample voices, and for convenience of description, the i-th sample object among the sample objects to which the selected voices belong may be denoted as Si. In this case, i may be given an initial value of 1, the sample object Si is taken as the current object, and the remaining sample objects are taken as reference objects, so that the timbres of the second sample voices of the current object Si are respectively converted to those of the reference objects, obtaining the corresponding third sample voices. After this, i may be incremented by 1, and the step of taking the sample object Si as the current object and the subsequent steps may be re-executed.
In another implementation scenario, the specific structure of the preset acoustic prediction network corresponding to the reference object may refer to fig. 2 and the related description in the foregoing disclosed embodiment, and is not repeated herein. It should be noted that, although the preset acoustic prediction network corresponding to the reference object may share the same network structure as the acoustic prediction network included in the speech conversion model in the foregoing disclosed embodiment, since the preset acoustic prediction network corresponding to the reference object is obtained by the second sample speech training of the reference object and the acoustic prediction network included in the speech conversion model is obtained by the first sample speech training of the target object, the network parameters of the two are not the same, that is, the preset acoustic prediction network corresponding to the reference object learns the acoustic features of the reference object, and the acoustic prediction network included in the speech conversion model learns the acoustic features of the target object.
In another implementation scenario, in order to improve the accuracy of performing tone conversion on the second sample speech, the second sample speech may be first used to perform a first training on the phoneme recognition network, and the specific process may refer to the following related disclosure embodiments, which are not described herein again. On the basis, the trained phoneme recognition network and the second sample speech can be reused for training the preset acoustic prediction network corresponding to the sample object.
In one particular implementation scenario, after training the phoneme recognition network with the second sample speeches, for the i-th sample object Si, the phoneme recognition network may be used to recognize the phoneme information of the second sample speech of the sample object Si, and the actual acoustic information of that second sample speech may be extracted, so that the preset acoustic prediction network corresponding to the sample object Si (for convenience of description, denoted as Wi) can be used to predict the phoneme information and the actual acoustic information to obtain predicted acoustic information, and the network parameters of the preset acoustic prediction network Wi can be adjusted based on the difference between the actual acoustic information and the predicted acoustic information, so that the preset acoustic prediction network Wi learns the acoustic characteristics of the sample object Si. The specific process can refer to the related description in the foregoing disclosed embodiments and is not repeated here. The other sample objects may be treated analogously, and no further examples are given here.
In yet another implementation scenario, after the preset acoustic prediction network corresponding to the sample object is obtained through training, as described above, each sample object may be used as the current object, and the remaining sample objects may be used as the reference objects. On this basis, a preset acoustic prediction network corresponding to the reference object can be acquired, so that the phoneme recognition network can be used for recognizing the phoneme information of the second sample speech of the current object, the actual acoustic information of the second sample speech of the current object is extracted, the phoneme information obtained by recognition and the actual acoustic information can be predicted by using the preset acoustic prediction network of the reference object, the predicted acoustic information is obtained, and a third sample speech with the same tone as the reference object is synthesized by using the predicted acoustic information, namely the third sample speech has the same pronunciation content as the second sample speech and has the tone of the reference object. In the above manner, the preset acoustic prediction networks corresponding to the sample objects are obtained by respectively training the second sample voices of the plurality of sample objects, so that each sample object is respectively used as a current object, the rest sample objects are used as reference objects, the preset acoustic prediction networks corresponding to the reference objects are obtained, the phoneme information of the second sample voice of the current object is identified by using the phoneme identification network, the actual acoustic information of the second sample voice of the current object is extracted, on the basis, the phoneme information and the actual acoustic information are predicted by using the preset acoustic prediction networks of the reference objects, the predicted acoustic information is obtained, and the predicted acoustic information is used to synthesize a third sample voice with the same tone as that of the reference object, which is beneficial to improving the quality and tone conversion efficiency of the third sample voice.
In a specific implementation scenario, a specific process for recognizing the phoneme information of the second sample speech of the current object by using the phoneme recognition network may refer to the steps related to recognizing the first phoneme information of the first sample speech in the foregoing disclosed embodiment, and details are not described herein again. In addition, the specific process of extracting the actual acoustic information of the second sample voice of the current object may refer to the relevant steps of extracting the first actual acoustic information of the first sample voice in the foregoing disclosed embodiment, and details are not described herein again. In addition, the specific process of predicting the phoneme information and the actual acoustic information by using the preset acoustic prediction network of the reference object may refer to the relevant steps of predicting the first phoneme information and the first actual acoustic information by using the acoustic prediction network in the foregoing disclosed embodiment, and details are not described herein again.
In another specific implementation scenario, after the preset acoustic prediction network of the reference object predicts the predicted acoustic information, the predicted acoustic information may be input to a vocoder to synthesize a third sample voice with the same timbre as the reference object. For example, if the second sample voice of the current object is "how is the weather", the third sample voice is "how is the weather" spoken with the timbre of the reference object; other cases may be analogized, and are not exemplified here.
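A minimal, purely illustrative sketch of this conversion step is given below; the function and argument names are assumptions, and `vocoder` stands for any waveform generator that maps acoustic frames to audio.

```python
import torch

def convert_timbre(phoneme_net, ref_acoustic_net, vocoder,
                   current_feats, current_actual_acoustic):
    """Convert a second sample voice of the current object into the reference
    object's timbre, yielding a third sample voice."""
    with torch.no_grad():
        # Phoneme information of the current object's second sample voice.
        phoneme_post = phoneme_net(current_feats)
        # Predicted acoustic information from the reference object's preset
        # acoustic prediction network, so the output carries its timbre.
        predicted_acoustic = ref_acoustic_net(phoneme_post, current_actual_acoustic)
        # Any vocoder that maps acoustic frames to a waveform can be plugged in.
        waveform = vocoder(predicted_acoustic)
    return waveform
```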
In another implementation scenario, in order to improve accuracy of the phoneme recognition network, before the phoneme recognition network is used to recognize the second phoneme information of the second sample speech and the third phoneme information of the third sample speech respectively in the embodiment of the present disclosure, the second sample speech may be used to perform a second training on the phoneme recognition network first.
In an implementation scenario, the second phoneme information identified by the phoneme recognition network may specifically include a phoneme sequence of the second sample speech, where the phoneme sequence includes a plurality of phonemes and a recognition probability value corresponding to each phoneme; for convenience of description, it may be recorded as y′m. Similarly, the third phoneme information identified by the phoneme recognition network may specifically include a phoneme sequence of the third sample speech, where the phoneme sequence may include a plurality of phonemes and a recognition probability value corresponding to each phoneme; for convenience of description, it may be recorded as y′m′.
Step S32: network parameters of the phoneme recognition network are adjusted based on a difference between the second phoneme information and the third phoneme information.
In one implementation scenario, a mathematical expectation of the difference between the second phoneme information and the third phoneme information may be taken as a loss value of the phoneme recognition network, and the loss value may be used to adjust the network parameters of the phoneme recognition network. For convenience of description, this loss value can be recorded as Lconsistent and can specifically be expressed as:

Lconsistent = E(||y′m − y′m′||²) …… (1)

In the above formula (1), E() represents the mathematical expectation, y′m represents the recognition probability values corresponding to the plurality of phonemes obtained by recognizing the m-th second sample voice, and y′m′ represents the recognition probability values corresponding to the plurality of phonemes obtained by recognizing the third sample voice that results from performing tone conversion on the m-th second sample voice.
In addition, it should be noted that, as described above, the languages used by the sample objects may not be completely the same, so that the phoneme recognition network can recognize voices of different languages.
In the above manner, the second phoneme information of the second sample voice and the third phoneme information of the third sample voice are respectively recognized by using the phoneme recognition network, and the network parameters of the phoneme recognition network are adjusted based on the difference between the second phoneme information and the third phoneme information, so that the phoneme information extracted by the phoneme recognition network can be made to be independent of the speaker as much as possible by restricting the loss of consistency between the second phoneme information and the third phoneme information, that is, the phoneme recognition network can reduce the attention of the speaker as much as possible, improve the attention of the phoneme information as much as possible, and is beneficial to improving the accuracy of the phoneme recognition network.
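As an illustrative sketch only, the consistency loss of Equation (1) could be computed as follows, assuming the phoneme recognition network outputs frame-aligned phoneme posteriors for the second sample speech and its timbre-converted counterpart.

```python
import torch

def consistency_loss(phoneme_net, second_feats, third_feats):
    """L_consistent of Equation (1): expected squared difference between the
    phoneme posteriors of a second sample voice and of its timbre-converted
    third sample voice (assumed frame-aligned, since the conversion is done
    frame by frame)."""
    y_second = phoneme_net(second_feats)      # y'_m
    y_third = phoneme_net(third_feats)        # y'_m'
    return ((y_second - y_third) ** 2).sum(dim=-1).mean()
```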
After adjusting the network parameters of the phoneme recognition network, the network parameters of the acoustic prediction network may be further adjusted, which is specifically as follows:
step S33: and recognizing fourth phoneme information of the second sample voice by using the phoneme recognition network, and extracting second actual acoustic information of the second sample voice.
In an implementation scenario, the specific processes of identifying the fourth phoneme information and extracting the second actual acoustic information may refer to the relevant steps of identifying the first phoneme information and the first actual acoustic information in the foregoing disclosed embodiment, and are not described herein again.
In another implementation scenario, please refer to FIG. 4, which is a schematic diagram of a framework of another embodiment of an acoustic prediction network. As shown in FIG. 4, in order to distinguish different sample objects during the training process, the sample objects may be encoded separately to obtain object encoding information, that is, the object encoding information of different sample objects is different; the specific encoding manner is not limited herein. For example, for S sample objects, object encoding information of length S may be encoded for each sample object, where the value of the i-th element in the object encoding information of the i-th sample object is 1 and the values of the other elements are 0. For instance, the 1st sample object may be encoded as the object encoding information [1,0,0,…,0]T of length S, with the 1st element being 1 and the remaining elements being 0; the 2nd sample object may be encoded as the object encoding information [0,1,0,…,0]T of length S, with the 2nd element being 1 and the remaining elements being 0; and so on, no further examples are given here.
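The following tiny sketch, which is illustrative only, builds such one-hot object encoding information; the function name is an assumption.

```python
import torch

def object_encoding(object_index, num_objects):
    """One-hot object encoding information of length S = num_objects:
    the element for the given sample object is 1, all others are 0."""
    code = torch.zeros(num_objects)
    code[object_index] = 1.0
    return code

# e.g. with S = 4 sample objects, the 2nd sample object is encoded as [0, 1, 0, 0]
print(object_encoding(1, 4))
```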
Step S34: and predicting the fourth phoneme information and the second actual acoustic information by using an acoustic prediction network to obtain second predicted acoustic information.
In an implementation scenario, a specific process of obtaining the second predicted acoustic information by using the acoustic prediction network may refer to relevant steps of obtaining the first predicted acoustic information by prediction in the foregoing disclosed embodiment, and details are not described here again.
In another implementation scenario, referring to fig. 4 in a continuing manner, in a case where different sample objects are distinguished by using the object coding information, the second actual acoustic information may be input into an extraction subnetwork of the acoustic prediction network to extract prosody feature information, and the prosody feature information, the fourth phoneme information and the object coding information are input into a prediction subnetwork of the acoustic prediction network together to predict the second predicted acoustic information.
Step S35: adjusting a network parameter of the acoustic prediction network based on a difference between the second actual acoustic information and the second predicted acoustic information.
In an implementation scenario, the second actual acoustic information and the second predicted acoustic information may be processed by using a minimum mean square error function to obtain a loss value of the acoustic prediction network, so that the network parameters of the acoustic prediction network may be adjusted by using the loss value, and further, the ability of the acoustic prediction network to learn acoustic features of different objects may be improved.
In a specific implementation scenario, the number of output layer nodes of an extraction subnetwork in the acoustic prediction network does not exceed a preset threshold, which may specifically refer to relevant descriptions in the foregoing disclosed embodiments, and details are not described here.
In another specific implementation scenario, an example normalization layer may be further connected after a hidden layer of an extraction subnetwork in the acoustic prediction network, which may specifically refer to relevant descriptions in the foregoing disclosed embodiments, and details are not described here again.
In yet another specific implementation scenario, please continue to refer to fig. 4 in combination, after the network parameters of the acoustic prediction network are adjusted, parameters of a network hidden layer connected to the object coding information in the prediction subnetwork may be further removed, so that the acoustic prediction network can reduce the attention of the speaker as much as possible.
In the above manner, after the phoneme recognition network training, the fourth phoneme information of the second sample speech is recognized by using the phoneme recognition network, the second actual acoustic information of the second sample speech is extracted, and the fourth phoneme information and the second actual acoustic information are predicted by using the acoustic prediction network to obtain the second predicted acoustic information, so that the network parameters of the acoustic prediction network are adjusted based on the difference between the second actual acoustic information and the second predicted acoustic information, the speaker feature information included in the fourth phoneme information can be reduced as much as possible, and the accuracy of the acoustic prediction network can be improved.
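As an illustrative sketch only, the pre-training arrangement of FIG. 4 (prosody code plus fourth phoneme information plus object coding information feeding the prediction subnetwork, trained with a minimum-mean-square-error loss, with the speaker-related hidden layer removable afterwards) could look as follows; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionSubnetworkWithSpeaker(nn.Module):
    """Prediction subnetwork variant used during pre-training (FIG. 4): object
    coding information enters through its own hidden layer, which can be
    dropped after pre-training so the network stops attending to speakers."""
    def __init__(self, num_phonemes=200, bottleneck_dim=2, num_objects=4,
                 hidden_dim=256, acoustic_dim=80):
        super().__init__()
        self.speaker_layer = nn.Linear(num_objects, hidden_dim)   # removable afterwards
        self.core = nn.Linear(num_phonemes + bottleneck_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, phoneme_post, prosody, object_code=None):
        h = self.core(torch.cat([phoneme_post, prosody], dim=-1))
        if object_code is not None:                   # only during pre-training
            h = h + self.speaker_layer(object_code).unsqueeze(1)
        return self.out(torch.relu(h))

def pretrain_step(model, optimizer, phoneme_post, prosody, object_code, actual_acoustic):
    """One pre-training step on second sample speech (steps S34 / S35)."""
    predicted = model(phoneme_post, prosody, object_code)    # second predicted acoustic info
    loss = F.mse_loss(predicted, actual_acoustic)            # minimum-mean-square-error loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```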
Referring to fig. 5, fig. 5 is a flowchart illustrating an embodiment of training a phoneme recognition network. Specifically, the embodiment of the present disclosure is a specific process of "first training" and "second training" in the foregoing disclosed embodiment, and specifically may include the following steps:
step S51: and identifying second phoneme information of the second sample voice by using an object identification network to obtain prediction probability distribution that the second sample voice respectively belongs to a plurality of sample objects.
In one implementation scenario, the object recognition network may specifically include, but is not limited to: convolutional neural networks, fully-connected networks, etc., without limitation.
In another implementation scenario, please refer to FIG. 6, which is a state diagram illustrating an embodiment of training the phoneme recognition network. As shown in FIG. 6, the second phoneme information of the second sample speech may be input to the object recognition network to recognize the prediction probability distributions that the second sample speech belongs to the plurality of sample objects; for convenience of description, the prediction probability distribution that the m-th second sample speech belongs to the plurality of sample objects may be denoted as p′m.
Step S52: and obtaining a first recognition loss value based on the second phoneme information and the actual phoneme information, and obtaining a second recognition loss value based on the prediction probability distribution and the sample object to which the second sample speech belongs.
In the embodiment of the present disclosure, the second sample speech may be labeled with actual phoneme information. For example, the second sample speech "how is the weather" may be labeled with its actual phoneme information in International Phonetic Alphabet notation as:
[IPA transcription shown as an image in the original]
Other cases may be analogized and will not be described herein.
In an implementation scenario, as described in the foregoing disclosed embodiment, the second phoneme information identified by the phoneme recognition network may specifically include a phoneme sequence of the second sample speech, where the phoneme sequence includes a plurality of phonemes and a recognition probability value corresponding to each phoneme; for convenience of description, the recognition probability values of the phonemes recognized from the m-th second sample speech may be recorded as y′m. The second phoneme information and the actual phoneme information can then be processed with a cross-entropy loss function to obtain the first recognition loss value Lc(θc), which can specifically be expressed as:
Lc(θc) = (1/M) ∑m CE(y′m, ym) …… (2)
in the above formula (2), M represents the total number of second sample voices, θcNetwork parameter, y, representing a phoneme recognition networkmRepresenting the actual phoneme information of the mth second sample speech, CE () representing the cross-entropy loss function, the specific formula of which is not described herein again.
In another implementation scenario, in order to further constrain the phoneme recognition network to reduce its attention to the speaker as much as possible, the first recognition loss value may be obtained from both the difference between the prediction probability distribution and a preset probability distribution and the difference between the second phoneme information and the actual phoneme information, where the preset probability distribution includes preset probability values that the second sample speech belongs to the plurality of sample objects respectively, and each preset probability value is the same. For convenience of description, for S sample objects, each preset probability value may be represented as 1/S, so the preset probability distribution may be represented as [1/S,…,1/S]T; other cases may be analogized. The first recognition loss value Lc(θc) can then be expressed as:
Lc(θc) = (1/M) ∑m [ CE(y′m, ym) + ||p′m − e||² ] …… (3)
in the formula (3), e represents a preset probability distribution, p'mRepresenting the prediction probability distribution of the mth second sample speech. In the above manner, in the training process, the phoneme recognition network can be further constrained to reduce the attention of the speaker as much as possible and improve the attention of the phoneme information as much as possible by constraining the prediction probability distribution to tend to the preset probability distribution, which is beneficial to further improving the accuracy of the phoneme recognition network.
In an implementation scenario, the prediction probability distribution and the sample object to which the second sample speech belongs may be processed with a cross-entropy loss function to obtain the second recognition loss value Lsc(θsc), which can specifically be expressed as:
Lsc(θsc) = (1/M) ∑m CE(p′m, pm) …… (4)
in the above formula (4), θscNetwork parameters, p, representing an object recognition networkmRepresenting the actual probability distribution of the sample object for the mth sample speech. In particular, the actual probability distribution pmCan be represented by one-hot (one-hot) coding, taking S sample objects as an example, in the case that the mth second sample speech belongs to the ith sample object, pmThe ith element value may be set to 1, the remaining element values may be set to 0, and the rest may be similar, which is not illustrated here.
Step S53: the network parameters of the phoneme recognition network are adjusted using the first recognition loss value, or the network parameters of the object recognition network are adjusted using the second recognition loss value.
In one implementation scenario, in the pre-training process of the phoneme recognition network, the phoneme recognition network may be obtained through several pre-training, and in the case that the training times satisfy the preset condition, the network parameters of the phoneme recognition network are adjusted by using the first recognition loss value, and in the case that the training times do not satisfy the preset condition, the network parameters of the object recognition network are adjusted by using the second recognition loss value. In the above manner, in the training process of the phoneme recognition network, the network parameters of the phoneme recognition network and the network parameters of the object recognition network can be adjusted in turn, and since the attention of the phoneme recognition network to the speaker can be reduced and the accuracy of the recognized phoneme information can be improved by adjusting the network parameters of the phoneme recognition network, and the accuracy of the object recognition network for recognizing the object to which the voice belongs through the phoneme information can be improved by adjusting the network parameters of the object recognition network, the performance of the object recognition network can be improved, the phoneme information recognized by the phoneme recognition network can be promoted to contain speaker-related information as little as possible, and the two are further promoted and supplemented with each other, so that the attention to the phoneme information can be finally improved as much as possible, and the accuracy of the phoneme recognition network can be further improved.
In a specific implementation scenario, the preset conditions may include: the number of training times is odd. For example, in the 1 st training, the network parameters of the phoneme recognition network may be adjusted by using the first recognition loss value, in the 2 nd training, the network parameters of the object recognition network may be adjusted by using the second recognition loss value, in the 3 rd training, the network parameters of the phoneme recognition network may be continuously adjusted by using the first recognition loss value, and so on, which is not illustrated herein.
In another specific implementation scenario, the preset conditions may also include: the number of training times is even. For example, in the 1 st training, the network parameters of the object recognition network may be adjusted by using the second recognition loss value, in the 2 nd training, the network parameters of the phoneme recognition network may be adjusted by using the first recognition loss value, in the 3 rd training, the network parameters of the object recognition network may be adjusted by using the second recognition loss value, and so on, which is not illustrated herein.
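A minimal sketch of this alternating schedule (illustrative only, using the loss helper sketched above and assuming odd steps update the phoneme recognition network) is:

```python
def alternate_training(phoneme_net, object_net, phoneme_opt, object_opt,
                       batches, loss_fn, num_objects):
    """Alternate the two updates by training step: odd steps adjust the phoneme
    recognition network with the first recognition loss value, even steps adjust
    the object recognition network with the second recognition loss value."""
    for step, (feats, actual_phonemes, actual_speaker) in enumerate(batches, start=1):
        phoneme_scores = phoneme_net(feats)            # second phoneme information
        speaker_logits = object_net(phoneme_scores)    # prediction probability distribution
        loss_first, loss_second = loss_fn(phoneme_scores, actual_phonemes,
                                          speaker_logits, actual_speaker, num_objects)
        if step % 2 == 1:          # odd training step: update the phoneme recognition network
            phoneme_opt.zero_grad()
            loss_first.backward()
            phoneme_opt.step()
        else:                      # even training step: update the object recognition network
            object_opt.zero_grad()
            loss_second.backward()
            object_opt.step()
```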
Different from the foregoing embodiments, the second sample speech is labeled with actual phoneme information, and the object recognition network is used to recognize the second phoneme information of the second sample speech, so as to obtain the prediction probability distributions that the second sample speech belongs to the plurality of sample objects respectively. A first recognition loss value is obtained based on the second phoneme information and the actual phoneme information, and a second recognition loss value is obtained based on the prediction probability distribution and the sample object to which the second sample speech belongs; the first recognition loss value is then used to adjust the network parameters of the phoneme recognition network, or the second recognition loss value is used to adjust the network parameters of the object recognition network. In this way, during the training of the phoneme recognition network, the phoneme recognition network can be constrained, from both the phoneme recognition dimension and the object recognition dimension, to reduce its attention to the speaker as much as possible and to increase its attention to the phoneme information as much as possible, which further improves the accuracy of the phoneme recognition network.
In some disclosed embodiments, after a first sample voice of the target object and second sample voices of a plurality of sample objects are obtained, tone conversion is performed on the second sample voices to obtain third sample voices; the voice conversion model is then pre-trained with the second sample voices and the third sample voices; and finally, training with the first sample voice yields a voice conversion model matched with the target object.
In one implementation scenario, first, the phoneme recognition network may be initially trained to have phoneme recognition capability. Specifically, the phoneme information of the second sample speech may be recognized by using the object recognition network to obtain a prediction probability distribution that the second sample speech belongs to a plurality of sample objects respectively, and a first recognition loss value may be obtained based on the phoneme information and the actual phoneme information, and a second recognition loss value may be obtained based on the prediction probability distribution and the sample object to which the second sample speech belongs, so as to adjust the network parameters of the phoneme recognition network by using the first recognition loss value, or adjust the network parameters of the object recognition network by using the second recognition loss value, so that the phoneme recognition network has the capability of accurately recognizing phonemes. Then, the acoustic prediction network can be preliminarily trained to have the acoustic prediction capability. Specifically, the phoneme information of the second sample speech may be identified by using a phoneme identification network, the actual acoustic information of the second sample speech may be extracted, and the phoneme information and the actual acoustic information may be predicted by using an acoustic prediction network to obtain predicted acoustic information, so that the network parameters of the acoustic prediction network may be adjusted based on the difference between the actual acoustic information and the predicted acoustic information, so that the acoustic prediction network has the capability of accurately predicting the acoustic features. Then, a preset acoustic prediction network corresponding to the sample object may be trained. Specifically, the adjusted acoustic prediction networks may be respectively used as preset acoustic prediction networks corresponding to the sample objects, for each sample object, the phoneme recognition network may be used to recognize phoneme information of the second sample speech of the sample object, and extract actual acoustic information of the second sample speech of the sample object, so that the preset acoustic prediction network corresponding to the sample object is used to predict the phoneme information and the actual acoustic information to obtain predicted acoustic information, and based on a difference between the actual acoustic information and the predicted acoustic information, network parameters of the preset acoustic prediction network of the sample object are adjusted, so as to train to obtain the preset acoustic prediction network corresponding to the sample object. Finally, each sample object can be used as a current object, the rest sample objects can be used as reference objects, and the tone of the second sample speech of the current object is converted to the reference object by using the phoneme recognition network and the preset acoustic prediction network corresponding to the reference object, so as to obtain a third sample speech. 
Specifically, a preset acoustic prediction network corresponding to the reference object may be acquired, so that the phoneme recognition network is used to recognize phoneme information of the second sample speech of the current object, and extract actual acoustic information of the second sample speech of the current object, and then the preset acoustic prediction network of the reference object is used to predict the phoneme information and the actual acoustic information to obtain predicted acoustic information, and the predicted acoustic information is used to synthesize a third sample speech having the same tone as that of the reference object.
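As a hedged, non-limiting illustration of the tone conversion procedure just described, the sketch below shows one possible way to fine-tune a preset acoustic prediction network per sample object and then pair each current object with the remaining reference objects to generate third sample speeches. It assumes Python, and every name in it (train_preset_net, phoneme_recognizer, extract_acoustic, vocoder, and so on) is an illustrative assumption rather than an identifier taken from the disclosure.

import copy

def build_third_sample_speeches(second_speeches, base_acoustic_net, phoneme_recognizer,
                                extract_acoustic, vocoder, train_preset_net):
    # second_speeches: dict mapping each sample object to its list of second sample speeches.
    # Step 1: fine-tune a copy of the pre-trained acoustic prediction network per object,
    # giving the preset acoustic prediction network corresponding to that object.
    preset_nets = {}
    for obj, speeches in second_speeches.items():
        preset_nets[obj] = train_preset_net(copy.deepcopy(base_acoustic_net),
                                            phoneme_recognizer, speeches)
    # Step 2: take each object as the current object and the others as reference objects,
    # and convert the current object's second sample speech to each reference tone.
    third_speeches = []
    for cur_obj, speeches in second_speeches.items():
        for ref_obj in second_speeches:
            if ref_obj == cur_obj:
                continue
            for wav in speeches:
                phoneme_info = phoneme_recognizer(wav)       # shared phoneme recognition network
                actual_acoustic = extract_acoustic(wav)      # e.g. a mel spectrogram
                predicted = preset_nets[ref_obj](phoneme_info, actual_acoustic)
                third_speeches.append((cur_obj, ref_obj, vocoder(predicted)))
    return third_speeches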
In another implementation scenario, after obtaining the third sample speech, the phoneme recognition network and the acoustic prediction network may be pre-trained. First, the object recognition network and the phoneme recognition network may be jointly trained. Specifically, the phoneme information of the second sample speech may be recognized by using an object recognition network to obtain prediction probability distributions that the second sample speech belongs to the plurality of sample objects, respectively, so as to obtain a first recognition loss value based on the phoneme information and the actual phoneme information, and obtain a second recognition loss value based on the prediction probability distributions and the sample objects to which the second sample speech belongs, and then adjust the network parameters of the phoneme recognition network by using the first recognition loss value, or adjust the network parameters of the object recognition network by using the second recognition loss value. The phoneme recognition network may then be further trained based on the loss of consistency. Specifically, the phoneme information of the second sample voice and the phoneme information of the third sample voice may be recognized using a phoneme recognition network, respectively, so that network parameters of the phoneme recognition network are adjusted based on a difference between the phoneme information of the second sample voice and the phoneme information of the third sample voice. Finally, the acoustic prediction network may continue to be trained based on the trained phoneme recognition network. Specifically, the phoneme information of the second sample speech may be identified by using a phoneme identification network, and the actual acoustic information of the second sample speech may be extracted, so that the phoneme information and the actual acoustic information may be predicted by using an acoustic prediction network to obtain predicted acoustic information, and then the network parameters of the acoustic prediction network may be adjusted based on a difference between the actual acoustic information and the predicted acoustic information.
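A minimal sketch of the consistency-loss update mentioned above is given below, assuming PyTorch and assuming that the phoneme recognition network returns frame-aligned phoneme posteriors for a second sample speech and its tone-converted third sample speech; the names phoneme_net and consistency_step are hypothetical.

import torch.nn.functional as F

def consistency_step(phoneme_net, optimizer, second_feats, third_feats):
    # Recognize phoneme information of the second sample speech and of the
    # third (tone-converted) sample speech with the same phoneme recognition network.
    p2 = phoneme_net(second_feats)
    p3 = phoneme_net(third_feats)
    # Consistency loss: the recognized phoneme information should not change when
    # only the tone changes, pushing the network to ignore speaker characteristics.
    loss = F.l1_loss(p2, p3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()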
In yet another implementation scenario, after the pre-training is completed, the first sample speech may be used to further train to obtain a speech conversion model matching the target object. Reference may be made specifically to the foregoing disclosure embodiments, which are not described herein again.
Referring to fig. 7, fig. 7 is a flowchart illustrating a voice conversion method according to an embodiment of the present application. The method specifically comprises the following steps:
step S71: and acquiring the voice to be converted of the source object and acquiring a voice conversion model matched with the target object.
In the embodiment of the present disclosure, the speech conversion model is obtained by training with any one of the above training methods of the speech conversion model; reference may be made to the foregoing disclosed embodiments for details, which are not repeated here. In addition, the source object refers to the object to which the speech to be converted belongs, and the source object is not limited to any particular person. For example, in a translation scenario, the speech to be converted may be speech built into a translator, and the source object is the object to which that built-in speech belongs. Other scenarios may be deduced by analogy, and are not exemplified here.
In an implementation scenario, the languages used by the sample objects whose second sample speeches are used to train the speech conversion model may not be completely the same; in this case, the language used by the source object is not limited here. For example, in a translation scenario, the target object may be a Chinese speaker who needs to communicate with others in English. After receiving an English utterance, the translator may translate it into Chinese for the target object's reference; the target object may then input a reply text for the English conversation into the translator, and the translator may synthesize the reply text into English speech, which may on this basis be used as the speech to be converted. Other scenarios may be deduced by analogy, and are not exemplified here.
Step S72: and recognizing the phoneme information of the voice to be converted by utilizing a phoneme recognition network of the voice conversion model, and extracting the actual acoustic information of the voice to be converted.
Reference may be made to the related description in the foregoing embodiments, which is not repeated here. Still taking the above translator scenario as an example, in the case that the above English speech is "what is the weather like", the phoneme information expressed in the International Phonetic Alphabet can be obtained by recognition with the phoneme recognition network:

(The corresponding IPA phoneme sequence is shown as an image in the original publication and is not reproduced here.)

Other cases may be deduced by analogy, and are not exemplified one by one here.
Step S73: and predicting the phoneme information and the actual acoustic information by using an acoustic prediction network of the voice conversion model to obtain predicted acoustic information.
In an implementation scenario, the acoustic prediction network may include an extraction sub-network and a prediction sub-network, so that the prosodic feature information of the actual acoustic information may be extracted by using the extraction sub-network, and the prediction sub-network may then be used to predict the prosodic feature information and the phoneme information to obtain the predicted acoustic information. In this way, the prosodic pronunciation pattern of the speech to be converted can be modeled during acoustic prediction, which is beneficial to improving the prosodic naturalness of the subsequently synthesized speech. Still taking the above translator scenario as an example, after the English speech is taken as the speech to be converted, the phoneme recognition network may be used to recognize the phoneme information of the English speech and the actual acoustic information of the English speech may be extracted, so that the acoustic prediction network can predict, from this phoneme information and actual acoustic information, the predicted acoustic information carrying the timbre features of the target object.
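As a hedged sketch of this two-part structure (one possible arrangement, not the disclosed implementation), the acoustic prediction network could be organized as follows in PyTorch; all class names, layer types and dimensions here are assumptions.

import torch
import torch.nn as nn

class AcousticPredictionNet(nn.Module):
    # Illustrative sizes: 80-dim mel frames, 256-dim phoneme posteriors, 4-dim prosody code.
    def __init__(self, mel_dim=80, phoneme_dim=256, prosody_dim=4, hidden=256):
        super().__init__()
        # Extraction sub-network: compress actual acoustic frames into a few
        # prosody-related values per frame.
        self.extraction = nn.Sequential(
            nn.Linear(mel_dim, hidden), nn.ReLU(), nn.Linear(hidden, prosody_dim))
        # Prediction sub-network: combine phoneme information with the extracted
        # prosodic features and predict acoustic frames carrying the target tone.
        self.prediction = nn.GRU(phoneme_dim + prosody_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, mel_dim)

    def forward(self, phoneme_info, actual_acoustic):
        prosody = self.extraction(actual_acoustic)        # (B, T, prosody_dim)
        x = torch.cat([phoneme_info, prosody], dim=-1)
        h, _ = self.prediction(x)
        return self.out(h)                                # predicted acoustic information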
Reference may be made to the related description in the foregoing embodiments, which are not repeated herein.
Step S74: and synthesizing to obtain synthesized voice with the same tone as the target object by utilizing the predicted acoustic information.
In one implementation scenario, the predicted acoustic information may be input to a vocoder, which synthesizes speech having the same tone as the target object. Reference may be made to the foregoing disclosed embodiments for details, which are not repeated here. Still taking the above translator scenario as an example, after the predicted acoustic information with the tone features of the target object is obtained, the English speech "what is the weather like" with the tone of the target object can be synthesized, so that a target object who cannot converse in English can still communicate in English through synthesized speech in his or her own tone. Other scenarios may be deduced by analogy, and are not exemplified here.
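The conversion flow of steps S71 to S74 mirrors the flow used during training and can be summarized by the short sketch below; the function and attribute names (convert_voice, phoneme_net, acoustic_net, extract_acoustic, vocoder) are illustrative assumptions rather than identifiers from the disclosure.

def convert_voice(speech_to_convert, model, vocoder, extract_acoustic):
    # S72: recognize phoneme information and extract actual acoustic information.
    phoneme_info = model.phoneme_net(speech_to_convert)
    actual_acoustic = extract_acoustic(speech_to_convert)
    # S73: predict acoustic information carrying the target object's tone.
    predicted_acoustic = model.acoustic_net(phoneme_info, actual_acoustic)
    # S74: synthesize speech with the same tone as the target object.
    return vocoder(predicted_acoustic)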
It should be noted that the embodiments disclosed in the present application may also be applied to scenarios such as a story machine, an early education machine, and a reading accompanying machine, and are not limited to specific application scenarios. Taking the story machine as an example, by obtaining a first sample speech of a parent (i.e., the target object), a speech conversion model matched with the parent can be trained. In addition, the story machine can synthesize a story text into story speech, which can be taken as the speech to be converted, and the speech conversion model matched with the parent can then convert it into synthesized speech with the parent's tone, so that when the parent is not available to tell a story, the story machine can tell the story to the child in synthesized speech carrying the parent's tone. Other scenarios may be deduced by analogy, and are not exemplified here.
According to the scheme, the voice to be converted of the source object is obtained, the voice conversion model matched with the target object is obtained, the voice conversion model is obtained by training through any one of the voice conversion model training methods, the phoneme information of the voice to be converted is recognized through the phoneme recognition network of the voice conversion model, the actual acoustic information of the voice to be converted is extracted, the phoneme information and the actual acoustic information are predicted through the acoustic prediction network of the voice conversion model, the predicted acoustic information is obtained, the predicted acoustic information is further used, the synthesized voice with the same tone as that of the target object is obtained through synthesis, and the quality of voice conversion can be improved.
Referring to fig. 8, fig. 8 is a schematic block diagram of an embodiment of an electronic device 80 according to the present application. The electronic device 80 comprises a memory 81 and a processor 82 coupled to each other, the memory 81 stores program instructions, and the processor 82 is configured to execute the program instructions to implement the steps in any of the above-described embodiments of the speech conversion model training method, or to implement the steps in any of the above-described embodiments of the speech conversion method. Specifically, the electronic device 80 includes, but is not limited to: a desktop computer, a notebook computer, a tablet computer, a mobile phone, a translator, a story machine, an early education machine, a reading machine, a server, etc., which are not limited herein.
In particular, the processor 82 is configured to control itself and the memory 81 to implement the steps in any of the above-described embodiments of the speech conversion model training method, or to implement the steps in any of the above-described embodiments of the speech conversion method. The processor 82 may also be referred to as a CPU (Central Processing Unit). The processor 82 may be an integrated circuit chip with signal processing capability. The processor 82 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 82 may be implemented jointly by a plurality of integrated circuit chips.
In some disclosed embodiments, the processor 82 is configured to obtain a first sample speech of the target object and obtain a pre-trained speech conversion model; the voice conversion model comprises a phoneme recognition network and an acoustic prediction network, and is obtained by utilizing second sample voice and third sample voice of a plurality of sample objects to pre-train, wherein the third sample voice is obtained by performing tone conversion on the second sample voice; the processor 82 is configured to recognize first phoneme information of the first sample speech by using the phoneme recognition network, and extract first actual acoustic information of the first sample speech; the processor 82 is configured to predict the first phoneme information and the first actual acoustic information by using an acoustic prediction network to obtain first predicted acoustic information, and adjust a network parameter of the acoustic prediction network based on a difference between the first actual acoustic information and the first predicted acoustic information; the processor 82 is configured to use a combination of the phoneme recognition network and the adjusted acoustic prediction network as a speech conversion model that matches the target object.
According to the above scheme, the first sample voice of the target object and the pre-trained voice conversion model are obtained, where the voice conversion model comprises the phoneme recognition network and the acoustic prediction network, so that the phoneme recognition network can be used to recognize the first phoneme information of the first sample voice and the first actual acoustic information of the first sample voice can be extracted. Because the voice conversion model is pre-trained with the second sample voices of a plurality of sample objects and the third sample voices obtained by performing tone conversion on the second sample voices, the pre-trained voice conversion model can accurately extract phoneme information from voices with different tones, which improves the accuracy of the first phoneme information. On this basis, the acoustic prediction network is used to predict the first phoneme information and the first actual acoustic information to obtain the first predicted acoustic information, and the network parameters of the acoustic prediction network are adjusted based on the difference between the first actual acoustic information and the first predicted acoustic information; by constraining the first actual acoustic information and the first predicted acoustic information predicted from accurate first phoneme information, the acoustic prediction network can learn the acoustic characteristics of the target object. The combination of the phoneme recognition network and the adjusted acoustic prediction network is then used as the voice conversion model matched with the target object, so that the accuracy of the voice conversion model can be improved and the quality of voice conversion can be improved.
In some disclosed embodiments, the processor 82 is configured to identify second phoneme information of the second sample speech and third phoneme information of the third sample speech, respectively, using a phoneme recognition network; the processor 82 is configured to adjust network parameters of the phoneme recognition network based on a difference between the second phoneme information and the third phoneme information.
Different from the foregoing embodiment, by using the phoneme recognition network to recognize the second phoneme information of the second sample speech and the third phoneme information of the third sample speech respectively, and adjusting the network parameters of the phoneme recognition network based on the difference between the second phoneme information and the third phoneme information, it is possible to make the phoneme information extracted by the phoneme recognition network as independent of the speaker as possible by constraining the loss of consistency between the second phoneme information and the third phoneme information, i.e., it is possible to make the phoneme recognition network reduce the attention of the speaker as much as possible, and improve the attention of the phoneme information as much as possible, which is beneficial to improving the accuracy of the phoneme recognition network.
In some disclosed embodiments, the processor 82 is configured to identify fourth phoneme information of the second sample speech using the phoneme recognition network and extract second actual acoustic information of the second sample speech; the processor 82 is configured to predict the fourth phoneme information and the second actual acoustic information by using an acoustic prediction network, so as to obtain second predicted acoustic information; the processor 82 is configured to adjust a network parameter of the acoustic prediction network based on a difference between the second actual acoustic information and the second predicted acoustic information.
Different from the foregoing embodiment, after the phoneme recognition network is trained, the fourth phoneme information of the second sample speech is recognized by using the phoneme recognition network, the second actual acoustic information of the second sample speech is extracted, and the fourth phoneme information and the second actual acoustic information are predicted by using the acoustic prediction network to obtain the second predicted acoustic information, so that the network parameters of the acoustic prediction network are adjusted based on the difference between the second actual acoustic information and the second predicted acoustic information; since the speaker feature information contained in the fourth phoneme information has been reduced as much as possible, the accuracy of the acoustic prediction network can be improved.
In some disclosed embodiments, the second sample speech is labeled with actual phoneme information, and the processor 82 is configured to identify the second phoneme information of the second sample speech by using an object identification network, so as to obtain prediction probability distributions that the second sample speech belongs to a plurality of sample objects respectively; the processor 82 is configured to obtain a first recognition loss value based on the second phoneme information and the actual phoneme information, and obtain a second recognition loss value based on the prediction probability distribution and the sample object to which the second sample speech belongs; the processor 82 is configured to adjust a network parameter of the phoneme recognition network using the first recognition loss value, or adjust a network parameter of the object recognition network using the second recognition loss value.
Different from the foregoing embodiment, the second sample speech is labeled with actual phoneme information, and the object recognition network is utilized to recognize the second phoneme information of the second sample speech, so as to obtain the prediction probability distribution that the second sample speech belongs to a plurality of sample objects respectively, thereby obtaining a first recognition loss value based on the second phoneme information and the actual phoneme information, and obtaining a second recognition loss value based on the prediction probability distribution and the sample object to which the second sample speech belongs, and further utilizing the first recognition loss value to adjust the network parameters of the phoneme recognition network, or utilizing the second recognition loss value to adjust the network parameters of the object recognition network, so that the phoneme recognition network can be constrained, from the two levels of the phoneme recognition dimension and the object recognition dimension, to reduce its attention to the speaker as much as possible and increase its attention to the phoneme information as much as possible in the process of training the phoneme recognition network, so that the accuracy of the phoneme recognition network is further improved.
In some disclosed embodiments, the processor 82 is configured to adjust a network parameter of the phoneme recognition network by using the first recognition loss value if the training times satisfy a preset condition; the processor 82 is configured to adjust a network parameter of the object recognition network by using the second recognition loss value if the training times do not satisfy the preset condition.
Different from the foregoing embodiment, in the training process of the phoneme recognition network, the network parameters of the phoneme recognition network and the network parameters of the object recognition network are adjusted in turn, so that adjusting the network parameters of the phoneme recognition network can reduce its attention to the speaker and improve the accuracy of the recognized phoneme information, while adjusting the network parameters of the object recognition network can improve the accuracy with which the object recognition network recognizes, through the phoneme information, the object to which the voice belongs. Improving the performance of the object recognition network in turn pushes the phoneme information recognized by the phoneme recognition network to contain as little speaker-related information as possible; the two networks thus promote and complement each other, the attention to the phoneme information can be increased as far as possible, and the accuracy of the phoneme recognition network can be further improved.
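One way to realize such turn-taking is to switch on the parity of the training step, as in the hedged sketch below; the specific preset condition (even versus odd steps) and all names are assumptions rather than the disclosed rule.

def alternating_step(step, first_loss, second_loss, phoneme_optimizer, object_optimizer):
    # Hypothetical preset condition: even steps adjust the phoneme recognition network
    # with the first recognition loss, odd steps adjust the object recognition network
    # with the second recognition loss.
    if step % 2 == 0:
        phoneme_optimizer.zero_grad()
        first_loss.backward()
        phoneme_optimizer.step()
    else:
        object_optimizer.zero_grad()
        second_loss.backward()
        object_optimizer.step()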
In some disclosed embodiments, the processor 82 is configured to derive a first recognition loss value using a difference between the second phoneme information and the actual phoneme information, and a difference between the predicted probability distribution and a preset probability distribution; the preset probability distribution comprises preset probability values of the second sample voice belonging to the plurality of sample objects respectively, and the preset probability values are the same.
Different from the embodiment, in the training process, the phoneme recognition network can be further constrained to reduce the attention of speakers as much as possible and improve the attention of phoneme information as much as possible by constraining the prediction probability distribution to tend to preset probability distribution, which is beneficial to further improving the accuracy of the phoneme recognition network.
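Under these assumptions, the first recognition loss can be illustrated as a phoneme-classification term plus a term pulling the object recognition output toward a uniform (equal preset probability) distribution; the PyTorch-style sketch below is one possible formulation, and the choice of cross-entropy, KL divergence and equal weighting is an assumption.

import torch
import torch.nn.functional as F

def first_recognition_loss(phoneme_logits, phoneme_targets, object_logits):
    # Difference between the recognized phoneme information and the actual phoneme labels.
    # phoneme_logits: (B, T, num_phonemes); phoneme_targets: (B, T) integer labels.
    phone_loss = F.cross_entropy(phoneme_logits.transpose(1, 2), phoneme_targets)
    # Difference between the predicted probability distribution over sample objects
    # and a uniform distribution with identical preset probability values.
    num_objects = object_logits.size(-1)
    uniform = torch.full_like(object_logits, 1.0 / num_objects)
    adv_loss = F.kl_div(F.log_softmax(object_logits, dim=-1), uniform, reduction="batchmean")
    return phone_loss + adv_loss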
In some disclosed embodiments, the processor 82 is configured to respectively train to obtain preset acoustic prediction networks corresponding to the sample objects by using second sample voices of the sample objects; the processor 82 is configured to respectively use each sample object as a current object, use the remaining sample objects as reference objects, and obtain a preset acoustic prediction network corresponding to the reference objects; the processor 82 is configured to recognize fifth phoneme information of the second sample speech of the current object by using the phoneme recognition network, and extract third actual acoustic information of the second sample speech of the current object; the processor 82 is configured to predict the fifth phoneme information and the third actual acoustic information by using a preset acoustic prediction network of the reference object, so as to obtain third predicted acoustic information; the processor 82 is configured to synthesize a third sample speech having the same tone as the reference object by using the third prediction acoustic information.
Different from the foregoing embodiment, the preset acoustic prediction networks corresponding to the sample objects are obtained by respectively training the second sample voices of the plurality of sample objects, so that each sample object is respectively used as a current object, the rest sample objects are used as reference objects, the preset acoustic prediction networks corresponding to the reference objects are obtained, the phoneme recognition network is further used for recognizing the phoneme information of the second sample voice of the current object, and the actual acoustic information of the second sample voice of the current object is extracted, on the basis, the phoneme information and the actual acoustic information are predicted by using the preset acoustic prediction networks of the reference objects, the predicted acoustic information is obtained, and the predicted acoustic information is used for synthesizing a third sample voice with the same tone as that of the reference object, which can be beneficial to improving the quality and tone conversion efficiency of the third sample voice.
In some disclosed embodiments, the acoustic prediction network includes an extraction subnetwork and a prediction subnetwork, and the processor 82 is configured to extract prosodic feature information of the first actual acoustic information using the extraction subnetwork; the processor 82 is configured to predict the prosodic feature information and the first phoneme information using a prediction subnetwork to obtain first predicted acoustic information.
Different from the foregoing embodiment, the acoustic prediction network is arranged to include an extraction sub-network and a prediction sub-network; on this basis, the prosodic feature information of the first actual acoustic information is extracted by the extraction sub-network, and the prediction sub-network is used to predict the prosodic feature information and the first phoneme information to obtain the first predicted acoustic information. This prevents the acoustic prediction network from failing to learn the acoustic features of the target object by simply copying the input first actual acoustic information during training, which improves the accuracy of the acoustic prediction network. In addition, predicting the first predicted acoustic information from the first phoneme information and the extracted prosodic feature information helps reduce the loss error between the first actual acoustic information and the first predicted acoustic information, further improving the accuracy of the acoustic prediction network.
In some disclosed embodiments, the number of extraction subnetwork output layer nodes does not exceed a preset threshold; and/or, the instance normalization layer is connected after the hidden layer in the extraction sub-network.
Different from the foregoing embodiment, by setting the number of output layer nodes of the extraction sub-network so as not to exceed the preset threshold, the amount of information extracted from the first actual acoustic information by the extraction sub-network can be limited, so that the loss error between the first actual acoustic information and the first predicted acoustic information is constrained, the acoustic prediction network can learn the acoustic features of the target object during training, and the accuracy of the acoustic prediction network can be improved; by connecting an instance normalization layer after the hidden layer in the extraction sub-network, the speaker-related information can be normalized away, the speaker-related information extracted by the extraction sub-network can be reduced, the accuracy of the extraction sub-network is further improved, and the accuracy of the acoustic prediction network is improved accordingly.
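A small sketch of an extraction sub-network reflecting these two constraints (an output layer whose node count does not exceed a preset threshold, and an instance normalization layer after the hidden layer) is given below; it is a variant of the extraction sub-network sketched earlier, here operating on (B, channels, T) tensors, and the layer types and dimensions are assumed for illustration only.

import torch.nn as nn

class ExtractionSubnetwork(nn.Module):
    # Illustrative bottleneck extractor; "bottleneck" plays the role of the preset threshold.
    def __init__(self, mel_dim=80, hidden=128, bottleneck=4):
        super().__init__()
        self.hidden_layer = nn.Conv1d(mel_dim, hidden, kernel_size=5, padding=2)
        # Instance normalization after the hidden layer removes per-utterance statistics,
        # reducing the speaker-related information carried forward.
        self.inst_norm = nn.InstanceNorm1d(hidden)
        self.relu = nn.ReLU()
        # A small output layer limits how much of the input acoustics can leak through.
        self.output_layer = nn.Conv1d(hidden, bottleneck, kernel_size=1)

    def forward(self, mel):            # mel: (B, mel_dim, T)
        h = self.relu(self.inst_norm(self.hidden_layer(mel)))
        return self.output_layer(h)    # (B, bottleneck, T)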
In some disclosed embodiments, the language employed by the plurality of sample objects is not identical.
Different from the foregoing embodiment, setting the languages adopted by the several sample objects to be not identical enables the phoneme recognition network to recognize phoneme information of speech in different languages, and improves the adaptability of the acoustic prediction network to speech in different languages.
In some disclosed embodiments, the processor 82 is configured to obtain the voice to be converted of the source object, and obtain a voice conversion model matching the target object; wherein, the voice conversion model is obtained by utilizing the steps in the embodiment of the training method of any one voice conversion model; the processor 82 is configured to recognize phoneme information of the speech to be converted by using the phoneme recognition network of the speech conversion model, and extract actual acoustic information of the speech to be converted; the processor 82 is configured to predict the phoneme information and the actual acoustic information by using an acoustic prediction network of the speech conversion model to obtain predicted acoustic information; the processor 82 is configured to synthesize a synthesized speech having the same tone as the target object using the predicted acoustic information.
Different from the foregoing embodiment, the method includes obtaining a to-be-converted speech of a source object, obtaining a speech conversion model matched with a target object, and training the speech conversion model by using any one of the above training methods of the speech conversion model, recognizing phoneme information of the to-be-converted speech by using a phoneme recognition network of the speech conversion model, and extracting actual acoustic information of the to-be-converted speech, so as to predict the phoneme information and the actual acoustic information by using an acoustic prediction network of the speech conversion model, obtain predicted acoustic information, and synthesize a synthesized speech having a tone color identical to that of the target object by using the predicted acoustic information, which can be beneficial to improving the quality of speech conversion.
Referring to fig. 9, fig. 9 is a schematic diagram of a memory device 90 according to an embodiment of the present application. The memory device 90 stores program instructions 91 that can be executed by the processor, the program instructions 91 being adapted to implement the steps in any of the above-described speech conversion model training method embodiments, or to implement the steps in any of the above-described speech conversion method embodiments.
By the scheme, the quality of voice conversion can be improved.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Claims (13)

1. A method for training a speech conversion model, comprising:
acquiring a first sample voice of a target object, and acquiring a pre-trained voice conversion model; the voice conversion model comprises a phoneme recognition network and an acoustic prediction network, and is obtained by utilizing second sample voice and third sample voice of a plurality of sample objects to pre-train, wherein the third sample voice is obtained by performing tone conversion on the second sample voice;
recognizing first phoneme information of the first sample voice by using the phoneme recognition network, and extracting first actual acoustic information of the first sample voice;
predicting the first phoneme information and the first actual acoustic information by using the acoustic prediction network to obtain first predicted acoustic information, and adjusting network parameters of the acoustic prediction network based on the difference between the first actual acoustic information and the first predicted acoustic information;
and combining the phoneme recognition network and the adjusted acoustic prediction network to serve as a voice conversion model matched with the target object.
2. The method of claim 1, wherein obtaining the pre-trained speech conversion model comprises:
identifying second phoneme information of the second sample voice and third phoneme information of the third sample voice respectively by using the phoneme recognition network;
adjusting network parameters of the phoneme recognition network based on a difference between the second phoneme information and the third phoneme information.
3. The method of claim 2, wherein after adjusting the network parameters of the phoneme recognition network, the method further comprises:
recognizing fourth phoneme information of the second sample voice by using the phoneme recognition network, and extracting second actual acoustic information of the second sample voice;
predicting the fourth phoneme information and the second actual acoustic information by using the acoustic prediction network to obtain second predicted acoustic information;
adjusting a network parameter of the acoustic prediction network based on a difference between the second actual acoustic information and the second predicted acoustic information.
4. The method of claim 2, wherein the second sample speech is labeled with actual phoneme information; before the recognizing, by the phoneme recognition network, the second phoneme information of the second sample speech and the third phoneme information of the third sample speech, respectively, the method further includes:
identifying second phoneme information of the second sample voice by using an object identification network to obtain prediction probability distribution that the second sample voice respectively belongs to the plurality of sample objects;
obtaining a first recognition loss value based on the second phoneme information and the actual phoneme information, and obtaining a second recognition loss value based on the prediction probability distribution and a sample object to which the second sample voice belongs;
and adjusting the network parameters of the phoneme recognition network by using the first recognition loss value, or adjusting the network parameters of the object recognition network by using the second recognition loss value.
5. The method of claim 4, wherein the phoneme recognition network is pre-trained a number of times, the method further comprising:
under the condition that the training times meet a preset condition, adjusting network parameters of the phoneme recognition network by using the first recognition loss value;
and under the condition that the training times do not meet the preset condition, adjusting the network parameters of the object recognition network by using the second recognition loss value.
6. The method of claim 4, wherein obtaining a first recognition loss value based on the second phoneme information and the actual phoneme information comprises:
obtaining the first recognition loss value by using the difference between the second phoneme information and the actual phoneme information and the difference between the prediction probability distribution and a preset probability distribution;
the preset probability distribution comprises preset probability values of the second sample voice belonging to the plurality of sample objects respectively, and the preset probability values are the same.
7. The method according to claim 1, wherein the step of obtaining the third sample speech comprises:
respectively training by using second sample voices of the plurality of sample objects to obtain preset acoustic prediction networks corresponding to the sample objects;
respectively taking each sample object as a current object, taking the rest sample objects as reference objects, and acquiring a preset acoustic prediction network corresponding to the reference objects;
identifying fifth phoneme information of the second sample voice of the current object by using the phoneme recognition network, and extracting third actual acoustic information of the second sample voice of the current object;
predicting the fifth phoneme information and the third actual acoustic information by using a preset acoustic prediction network of the reference object to obtain third predicted acoustic information;
and synthesizing to obtain a third sample voice with the same tone as the reference object by using the third prediction acoustic information.
8. The method of claim 1, wherein the acoustic prediction network comprises an extraction subnetwork and a prediction subnetwork; the predicting the first phoneme information and the first actual acoustic information by using the acoustic prediction network to obtain first predicted acoustic information includes:
extracting prosodic feature information of the first actual acoustic information using the extraction subnetwork;
and predicting the prosodic feature information and the first phoneme information by using the prediction subnetwork to obtain the first predicted acoustic information.
9. The method of claim 8, wherein the number of extraction subnetwork output layer nodes does not exceed a preset threshold;
and/or an instance normalization layer is connected behind a hidden layer in the extraction sub-network.
10. The method of claim 1, wherein the plurality of sample objects are not identical in language.
11. A method of speech conversion, comprising:
acquiring a voice to be converted of a source object, and acquiring a voice conversion model matched with a target object; wherein the speech conversion model is obtained by using the training method of the speech conversion model according to any one of claims 1 to 10;
recognizing the phoneme information of the voice to be converted by utilizing a phoneme recognition network of the voice conversion model, and extracting the actual acoustic information of the voice to be converted;
predicting the phoneme information and the actual acoustic information by using an acoustic prediction network of the voice conversion model to obtain predicted acoustic information;
and synthesizing to obtain synthesized voice with the same tone as the target object by utilizing the predicted acoustic information.
12. An electronic device, comprising a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions to implement the method for training a speech conversion model according to any one of claims 1 to 10, or to implement the method for speech conversion according to claim 11.
13. A storage device storing program instructions executable by a processor to implement a method of training a speech conversion model according to any one of claims 1 to 10 or to implement a method of speech conversion according to claim 11.
CN202011634065.3A 2020-12-31 2020-12-31 Speech conversion and related model training method, electronic equipment and storage device Pending CN112786018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011634065.3A CN112786018A (en) 2020-12-31 2020-12-31 Speech conversion and related model training method, electronic equipment and storage device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011634065.3A CN112786018A (en) 2020-12-31 2020-12-31 Speech conversion and related model training method, electronic equipment and storage device

Publications (1)

Publication Number Publication Date
CN112786018A true CN112786018A (en) 2021-05-11

Family

ID=75754832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011634065.3A Pending CN112786018A (en) 2020-12-31 2020-12-31 Speech conversion and related model training method, electronic equipment and storage device

Country Status (1)

Country Link
CN (1) CN112786018A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409761A (en) * 2021-07-12 2021-09-17 上海喜马拉雅科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN113470617A (en) * 2021-06-28 2021-10-01 科大讯飞股份有限公司 Speech recognition method, electronic device and storage device
CN113793598A (en) * 2021-09-15 2021-12-14 北京百度网讯科技有限公司 Training method of voice processing model, data enhancement method, device and equipment
CN114495898A (en) * 2022-04-15 2022-05-13 中国科学院自动化研究所 Training method and system for unified speech synthesis and speech conversion

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6961704B1 (en) * 2003-01-31 2005-11-01 Speechworks International, Inc. Linguistic prosodic model-based text to speech
CN1842702A (en) * 2004-10-13 2006-10-04 松下电器产业株式会社 Speech synthesis apparatus and speech synthesis method
CN101004910A (en) * 2006-01-19 2007-07-25 株式会社东芝 Apparatus and method for voice conversion
JP2008203543A (en) * 2007-02-20 2008-09-04 Toshiba Corp Voice quality conversion apparatus and voice synthesizer
CN105206257A (en) * 2015-10-14 2015-12-30 科大讯飞股份有限公司 Voice conversion method and device
CN109147758A (en) * 2018-09-12 2019-01-04 科大讯飞股份有限公司 A kind of speaker's sound converting method and device
CN111583944A (en) * 2019-01-30 2020-08-25 北京搜狗科技发展有限公司 Sound changing method and device
US20200380952A1 (en) * 2019-05-31 2020-12-03 Google Llc Multilingual speech synthesis and cross-language voice cloning
CN110970014A (en) * 2019-10-31 2020-04-07 阿里巴巴集团控股有限公司 Voice conversion, file generation, broadcast, voice processing method, device and medium
CN111489734A (en) * 2020-04-03 2020-08-04 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers
CN111564152A (en) * 2020-07-16 2020-08-21 北京声智科技有限公司 Voice conversion method and device, electronic equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHENG-CHIEH YEH ET AL: "Rhythm-Flexible Voice Conversion Without Parallel Data Using Cycle-GAN Over Phoneme Posteriorgram Sequences", 2018 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), pages 1 - 8 *
KANG YONGGUO; SHUANG ZHIWEI; TAO JIANHUA; ZHANG WEI: "Research on a voice conversion algorithm based on a hybrid mapping model", Acta Acustica (Chinese edition), no. 06 *
JIAN ZHIHUA; YANG ZHEN: "Development and prospects of voice conversion technology", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), no. 06 *
ZHAO WEI; TANG TANG: "Research on timbre conversion based on an average phoneme model", Journal of Communication University of China (Natural Science Edition), no. 01 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470617A (en) * 2021-06-28 2021-10-01 科大讯飞股份有限公司 Speech recognition method, electronic device and storage device
CN113409761A (en) * 2021-07-12 2021-09-17 上海喜马拉雅科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN113409761B (en) * 2021-07-12 2022-11-01 上海喜马拉雅科技有限公司 Speech synthesis method, speech synthesis device, electronic device, and computer-readable storage medium
CN113793598A (en) * 2021-09-15 2021-12-14 北京百度网讯科技有限公司 Training method of voice processing model, data enhancement method, device and equipment
CN113793598B (en) * 2021-09-15 2023-10-27 北京百度网讯科技有限公司 Training method of voice processing model, data enhancement method, device and equipment
CN114495898A (en) * 2022-04-15 2022-05-13 中国科学院自动化研究所 Training method and system for unified speech synthesis and speech conversion
CN114495898B (en) * 2022-04-15 2022-07-01 中国科学院自动化研究所 Unified speech synthesis and speech conversion training method and system

Similar Documents

Publication Publication Date Title
US11837216B2 (en) Speech recognition using unspoken text and speech synthesis
Wang et al. Tacotron: A fully end-to-end text-to-speech synthesis model
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
CN108766414B (en) Method, apparatus, device and computer-readable storage medium for speech translation
WO2020118521A1 (en) Multi-speaker neural text-to-speech synthesis
JP2022169714A (en) Speech translation method and system using multilingual text-to-speech synthesis model
CN112786018A (en) Speech conversion and related model training method, electronic equipment and storage device
CN106971709A (en) Statistic parameter model method for building up and device, phoneme synthesizing method and device
CN115485766A (en) Speech synthesis prosody using BERT models
CN113707125A (en) Training method and device for multi-language voice synthesis model
CN112581963A (en) Voice intention recognition method and system
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Amrouche et al. Dnn-based arabic speech synthesis
JP2024505076A (en) Generate diverse, natural-looking text-to-speech samples
Li et al. End-to-end mongolian text-to-speech system
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
WO2021134591A1 (en) Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium
CN115762466A (en) Method and device for synthesizing different emotion audios
Sangeetha et al. Syllable based text to speech synthesis system using auto associative neural network prosody prediction
CN115762471A (en) Voice synthesis method, device, equipment and storage medium
Sinha et al. Fusion of multi-stream speech features for dialect classification
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium
Reddy et al. Speech-to-Text and Text-to-Speech Recognition Using Deep Learning
CN113192483B (en) Method, device, storage medium and equipment for converting text into voice

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230519

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui

Applicant before: IFLYTEK Co.,Ltd.