CN115273831A - Voice conversion model training method, voice conversion method and device

Voice conversion model training method, voice conversion method and device

Info

Publication number
CN115273831A
Authority
CN
China
Prior art keywords
audio data
training
text
target
conversion model
Prior art date
Legal status
Pending
Application number
CN202210916630.8A
Other languages
Chinese (zh)
Inventor
张颖 (Zhang Ying)
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210916630.8A
Publication of CN115273831A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing

Abstract

The disclosure relates to a voice conversion model training method, a voice conversion method and a voice conversion device, and relates to the field of computer technology. The voice conversion model training method comprises the following steps: acquiring a training audio data set; inputting sample audio data into a pre-trained text encoder in a voice conversion model and performing text encoding processing to generate a text encoding vector; inputting the text encoding vector and the sample timbre into a timbre decoder in the voice conversion model and performing audio reconstruction processing to generate reconstructed sample audio data; calculating a loss value from the sample audio data and the reconstructed sample audio data; and updating the parameters of the voice conversion model according to the loss value. The trained voice conversion model obtained through this joint training can therefore perform voice conversion end to end, performs audio reconstruction according to the text encoding vector, and has better intonation-variation and emotion-following capability.

Description

Voice conversion model training method, voice conversion method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a speech conversion model training method, a speech conversion method, and a speech conversion device.
Background
With the continuous development of science and technology, people can record and play sound through electronic equipment (such as mobile phones, notebook computers, tablet computers, smart homes and the like).
In the related art, there is a need for voice conversion in movie dubbing, short-video dubbing, virtual humans, and the like. Voice conversion refers to transferring the timbre of the original speaker's audio data to the timbre of a target speaker, obtaining audio data in the target speaker's voice while keeping the linguistic content of the original speaker's audio data unchanged.
Voice conversion is typically implemented with two models: one model converts the audio data of the original speaker into text data, and the other model converts the text data into the audio data of the target speaker. However, this requires passing through two models, and the intonation of the converted speech is limited by the conversion to text data, so the intonation of the converted speech is poor.
Disclosure of Invention
The present disclosure provides a voice conversion model training method, a voice conversion method and a voice conversion device, in which a trained voice conversion model obtained through joint training performs voice conversion end to end. The technical solution of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a method for training a speech conversion model, including: acquiring a training audio data set; wherein the training audio data set comprises: sample audio data of at least one target object and a sample tone corresponding to the sample audio data; inputting the sample audio data into a pre-training text encoder in a voice conversion model, and performing text encoding processing to generate a text encoding vector; inputting the text coding vector and the sample tone to a tone decoder in a voice conversion model, and performing audio reconstruction processing to generate reconstructed sample audio data; calculating a loss value from the sample audio data and the reconstructed sample audio data; and updating parameters of the voice conversion model according to the loss value.
In some embodiments, the loss value comprises a first loss value and/or a second loss value, the calculating a loss value from the sample audio data and the reconstructed sample audio data comprising: calculating a first loss value between a sample timbre acoustic feature of the sample audio data and a reconstructed timbre acoustic feature of the reconstructed sample audio data, and/or inputting the reconstructed sample audio data to the pre-training text encoder, generating a reconstructed text encoding vector, and calculating a second loss value between the text encoding vector and the reconstructed text encoding vector.
In some embodiments, the updating of the parameters of the speech conversion model according to the loss value comprises at least one of:
in response to at least one of the loss values over a preset number of consecutive iterations being greater than a first preset value, performing parameter updating on the timbre decoder in the voice conversion model;
in response to the range of the loss values over a preset number of consecutive iterations being greater than a second preset value, performing parameter updating on the timbre decoder in the voice conversion model;
in response to the loss values over a preset number of consecutive iterations all being less than the first preset value, performing parameter updating on the pre-trained text encoder and the timbre decoder in the voice conversion model;
and in response to the range of the loss values over a preset number of consecutive iterations being less than the second preset value, performing parameter updating on the pre-trained text encoder and the timbre decoder in the voice conversion model.
In some embodiments, the method further comprises: acquiring a training text data set; wherein the training text data set comprises: pre-training audio data of at least one object and pre-training text data corresponding to the pre-training audio data; inputting the pre-training audio data into a text encoder and performing text encoding to generate a pre-training text encoding vector; inputting the pre-training text encoding vector into a text decoder and performing decoding processing to generate target text data; calculating a first pre-training loss value according to the target text data and the pre-training text data; and updating parameters of the text encoder according to the first pre-training loss value to generate the pre-trained text encoder.
In some embodiments, the training text data set further includes pre-training monophone data corresponding to the pre-training audio data, and the method further comprises: inputting the pre-training text encoding vector into a phoneme decoder and performing phoneme decoding to generate target monophone data; calculating a second pre-training loss value according to the target monophone data and the pre-training monophone data; and updating parameters of the text encoder according to the first pre-training loss value and the second pre-training loss value to generate the pre-trained text encoder.
According to a second aspect of the embodiments of the present disclosure, there is provided a voice conversion method including: acquiring audio data of an original object; determining a target tone of a target object to be converted; inputting the audio data and the target tone to a trained voice conversion model, and performing voice conversion processing to generate target audio data of the target object; wherein the trained speech conversion model is obtained by training according to the method described in some embodiments above.
In some embodiments, the trained speech conversion model comprises: a trained text encoder and a trained tone decoder, wherein the inputting the audio data and the target tone to a trained speech conversion model to generate target audio data of the target object comprises: inputting the audio data to the trained text encoder, and performing text encoding processing to generate a target text encoding vector; and inputting the target text coding vector and the target tone to the trained tone decoder for decoding to generate the target audio data of the target object.
According to a third aspect of the embodiments of the present disclosure, there is provided a speech conversion model training apparatus, including: a data set acquisition unit for acquiring a training audio data set; wherein the training audio data set comprises: sample audio data of at least one target object and a sample tone corresponding to the sample audio data; the coding module is used for inputting the sample audio data into a pre-training text coder in a voice conversion model, and performing text coding processing to generate a text coding vector; the voice reconstruction unit is used for inputting the text coding vector and the sample tone into a tone decoder in the voice conversion model, performing audio reconstruction processing and generating reconstructed sample audio data; a loss calculation unit for calculating a loss value from the sample audio data and the reconstructed sample audio data; and the model updating unit is used for updating parameters of the voice conversion model according to the loss value.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a voice conversion apparatus including: an audio data acquisition unit for acquiring audio data of an original object; a target tone color determination unit for determining a target tone color of a target object to be converted; the target audio acquisition unit is used for inputting the audio data and the target tone into a trained voice conversion model, performing voice conversion processing and generating target audio data of the target object; wherein the trained speech conversion model is obtained by training according to the method of some embodiments.
According to a fifth aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the speech conversion model training method according to the first aspect, or the processor is configured to execute the instructions to implement the speech conversion method according to the second aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein the instructions of the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the speech conversion model training method according to the first aspect, or, when executed by the processor of the electronic device, enable the electronic device to perform the speech conversion method according to the second aspect.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method for training a speech conversion model according to the first aspect above, or which, when executed by a processor, implements the method for speech conversion according to the second aspect above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the voice conversion model training method provided by the embodiment of the disclosure acquires a training audio data set; wherein training the audio data set comprises: sample audio data of at least one target object and sample timbre corresponding to the sample audio data; inputting sample audio data into a pre-training text encoder in a voice conversion model, and performing text encoding processing to generate a text encoding vector; inputting the text coding vector and the sample tone into a tone decoder in the voice conversion model, and performing audio reconstruction processing to generate reconstructed sample audio data; calculating a loss value from the sample audio data and the reconstructed sample audio data; and updating parameters of the voice conversion model according to the loss value. Therefore, the trained voice conversion model obtained by adopting the joint training mode can perform voice conversion end to end, performs audio reconstruction processing according to the text coding vector, and has better variable intonation and emotion following capability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow chart of a method for training a speech conversion model according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of calculating a loss value provided by an embodiment of the present disclosure;
FIG. 3 is a flow chart of another method for training a speech conversion model according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a voice conversion method provided by an embodiment of the present disclosure;
fig. 5 is a flowchart of S30 in the voice conversion method provided by the embodiment of the present disclosure;
FIG. 6 is a block diagram of a device for training a speech conversion model according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of another apparatus for training speech conversion models provided in an embodiment of the present disclosure;
FIG. 8 is a block diagram of another apparatus for training speech conversion models provided in accordance with an embodiment of the present disclosure;
fig. 9 is a structural diagram of a voice conversion apparatus according to an embodiment of the present disclosure;
fig. 10 is a structural diagram of a target audio acquiring unit in the speech conversion apparatus according to the embodiment of the disclosure;
fig. 11 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
Throughout the specification and claims, the term "comprising" is to be interpreted in an open, inclusive sense, i.e., as "including, but not limited to," unless the context requires otherwise. In the description of the specification, the terms "some embodiments" and the like are intended to indicate that a particular feature, structure, material, or characteristic described in connection with the embodiments or examples is included in at least one embodiment or example of the disclosure. The schematic representations of the above terms are not necessarily referring to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be included in any suitable manner in any one or more embodiments or examples.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Voice conversion technology: a technology for converting source speech into target speech while keeping the semantic content unchanged, where the source speech is speech uttered by a first person and the target speech is speech uttered by a second person; that is, the source speech uttered by the first person is converted, through voice conversion technology, into target speech with the same semantics in the second person's voice.
Timbre (tone color): the "color" or character of a sound, i.e., the quality that gives a voice its individuality. Timbre and its differences arise from the way the different components of an object's vibration combine, as perceived by the human sense of hearing.
It should be noted that the speech conversion model training method according to the embodiment of the present disclosure may be executed by the speech conversion model training apparatus according to the embodiment of the present disclosure, and the speech conversion model training apparatus may be implemented in software and/or hardware, and the speech conversion model training apparatus may be configured in an electronic device, where the electronic device may install and run a speech conversion model training program. The electronic device may include, but is not limited to, a hardware device with various operating systems, such as a smart phone, a tablet computer, and a computer.
It should be noted that the voice conversion method according to the embodiment of the present disclosure may be executed by a voice conversion apparatus according to the embodiment of the present disclosure, where the voice conversion apparatus may be implemented in software and/or hardware, and the voice conversion apparatus may be configured in an electronic device, where the electronic device may install and run a voice conversion program. The electronic device may include, but is not limited to, a hardware device with various operating systems, such as a smart phone, a tablet computer, and a computer.
Fig. 1 is a flowchart of a method for training a speech conversion model according to an embodiment of the present disclosure.
As shown in fig. 1, the method for training a speech conversion model provided in the embodiment of the present disclosure includes, but is not limited to, the following steps:
s1: acquiring a training audio data set; wherein training the audio data set comprises: sample audio data of at least one target object, and a sample timbre corresponding to the sample audio data.
In an embodiment of the present disclosure, a training audio data set is obtained, where the training audio data set includes: sample audio data of at least one target object, and a sample timbre corresponding to the sample audio data.
The target object may be a user with a specific timbre, for example: film and television actors, celebrities, animated characters, and the like. Of course, the target object may also be an ordinary user other than the above examples; as long as the ordinary user has a specific timbre and their audio data can be acquired, they may also serve as the target object, which is not specifically limited in the embodiments of the present disclosure.
In an embodiment of the present disclosure, a training audio data set is obtained, the training audio data set comprising sample audio data of one or more target objects. The sample audio data of a target object may include one or more pieces of speech data.
On the basis of acquiring the sample audio data of the target object, the sample audio data may be marked with a corresponding sample tone. In the embodiment of the present disclosure, a uniform marking rule may be adopted to mark the sample tone color for the sample audio data.
Illustratively, acoustic features of the sample audio data of a target object may be extracted and used as the sample timbre; or the sample audio data of each target object may be numbered, with sample audio data belonging to the same target object sharing the same number and sample audio data of different target objects receiving different numbers, and so on.
In the embodiment of the present disclosure, sample audio data of a target object (for example, in a case where the target object is a general user) may be acquired by a dedicated audio acquisition device, or sample audio data (for example, in a case where the target object is a movie or television drama actor, celebrity, or animated character) may be acquired from audio data disclosed by the target object. Different audio acquisition modes can be adopted according to different target objects to obtain sample audio data.
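To make the data preparation in S1 concrete, the following is a minimal Python/PyTorch sketch of assembling a training audio data set in which each sample audio clip is paired with a sample timbre label, either a speaker number (same target object, same number) or a simple utterance-level acoustic statistic. All names (TrainingSample, build_training_audio_set) and the placeholder statistic are hypothetical and not taken from the patent.

    # Hypothetical sketch of S1: pairing sample audio data with a sample timbre label.
    from dataclasses import dataclass
    from typing import List
    import torch

    @dataclass
    class TrainingSample:
        audio: torch.Tensor    # one utterance of a target object (e.g. a waveform)
        timbre: torch.Tensor   # sample timbre label for that utterance

    def build_training_audio_set(utterances: List[torch.Tensor],
                                 speaker_ids: List[int],
                                 use_acoustic_timbre: bool = False) -> List[TrainingSample]:
        samples = []
        for wav, spk in zip(utterances, speaker_ids):
            if use_acoustic_timbre:
                # Placeholder statistic standing in for an extracted acoustic timbre feature.
                timbre = torch.stack([wav.mean(), wav.std()])
            else:
                # Same target object -> same number; different target objects -> different numbers.
                timbre = torch.tensor(spk, dtype=torch.long)
            samples.append(TrainingSample(audio=wav, timbre=timbre))
        return samples

    # Example: build_training_audio_set([torch.randn(16000)], [0])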
S2: and inputting the sample audio data into a pre-training text encoder in the voice conversion model, and performing text encoding processing to generate a text encoding vector.
In an embodiment of the present disclosure, the speech conversion model includes a pre-trained text encoder, that is, a text encoder that has been trained in advance.
The pre-training text encoder can generate a text encoding vector according to the sample audio data, and extract text information in the sample audio data.
In an embodiment of the disclosure, the pre-trained text encoder may be a pre-trained Conformer encoder. The pre-trained text encoder may be the part of a pre-trained speech recognition model before its classification layer, and the pre-trained speech recognition model may be a pre-trained Conformer encoder-decoder.
It should be noted that, the method in the related art may be adopted to obtain the pre-trained speech recognition model, and the embodiment of the present disclosure does not specifically limit this.
In the embodiment of the present disclosure, in the case of obtaining a pre-trained speech recognition model, a pre-trained text encoder of the pre-trained speech recognition model is used to obtain a text encoding vector in sample audio data.
It can be understood that the pre-trained text encoder is part of the pre-trained speech recognition model, the pre-trained speech recognition model can well recognize text information in the sample audio data, and the generated text encoding vector is more accurate.
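As an illustration of how the pre-trained text encoder can be taken from a pre-trained speech recognition model, here is a minimal PyTorch sketch. The class and attribute names (PretrainedASR, frontend, encoder, classifier) are hypothetical, and a standard Transformer encoder stack stands in for the Conformer encoder only to keep the sketch self-contained.

    # Hypothetical sketch: the pre-trained text encoder is the part of a pre-trained
    # speech recognition model before its classification layer.
    import torch
    import torch.nn as nn

    class PretrainedASR(nn.Module):
        """Stand-in for a pre-trained speech recognition model (encoder + classification layer)."""
        def __init__(self, feat_dim: int = 80, hidden: int = 256, vocab: int = 5000):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
            self.frontend = nn.Linear(feat_dim, hidden)     # acoustic-feature projection
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            self.classifier = nn.Linear(hidden, vocab)      # the part *after* the encoder

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.encoder(self.frontend(feats)))

    def extract_pretrained_text_encoder(asr: PretrainedASR) -> nn.Module:
        # Keep everything before the classification layer and reuse it as the
        # pre-trained text encoder of the voice conversion model.
        return nn.Sequential(asr.frontend, asr.encoder)

    # Example: extract_pretrained_text_encoder(PretrainedASR())(torch.randn(1, 120, 80))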
S3: and inputting the text coding vector and the sample tone into a tone decoder in the voice conversion model, and performing audio reconstruction processing to generate reconstructed sample audio data.
In the embodiment of the present disclosure, once the sample audio data has been input to the pre-trained text encoder of the speech conversion model to generate the text encoding vector, the text encoding vector and the sample timbre are input to the timbre decoder of the speech conversion model for audio reconstruction processing, generating the reconstructed sample audio data.
According to the method provided by the embodiment of the disclosure, the audio data of the original speaker does not need to be converted into text data; instead, the sample audio data is input to the pre-trained text encoder of the voice conversion model to generate a text encoding vector, and the text encoding vector and the sample timbre are input to the timbre decoder of the voice conversion model to generate the reconstructed sample audio data. Therefore, the generated reconstructed sample audio data is no longer limited by conversion through text data and can retain good intonation variation.
The timbre decoder (Decoder) generates the reconstructed sample audio data from the text encoding vector and the sample timbre; the reconstructed sample audio data is audio data carrying the sample timbre.
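The patent does not fix a concrete architecture for the timbre decoder; the following is a minimal sketch assuming the sample timbre is given as a speaker index and the output is a frame-level acoustic feature such as a mel spectrogram. The GRU-based design and all names are illustrative assumptions, not the patented implementation.

    # Hypothetical timbre decoder: reconstructs frame-level acoustic features from
    # the text encoding vector conditioned on a timbre embedding.
    import torch
    import torch.nn as nn

    class TimbreDecoder(nn.Module):
        def __init__(self, enc_dim: int = 256, timbre_dim: int = 64,
                     num_speakers: int = 100, mel_dim: int = 80):
            super().__init__()
            self.timbre_table = nn.Embedding(num_speakers, timbre_dim)
            self.rnn = nn.GRU(enc_dim + timbre_dim, 256, batch_first=True)
            self.proj = nn.Linear(256, mel_dim)

        def forward(self, text_encoding: torch.Tensor, timbre_id: torch.Tensor) -> torch.Tensor:
            # Broadcast the timbre embedding across all frames of the text encoding,
            # then decode the timbre-conditioned frames back into acoustic features.
            timbre = self.timbre_table(timbre_id)                              # (B, timbre_dim)
            timbre = timbre.unsqueeze(1).expand(-1, text_encoding.size(1), -1)
            hidden, _ = self.rnn(torch.cat([text_encoding, timbre], dim=-1))
            return self.proj(hidden)                                           # reconstructed features

    # Example: TimbreDecoder()(torch.randn(2, 120, 256), torch.tensor([3, 7]))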
S4: a loss value is calculated from the sample audio data and the reconstructed sample audio data.
In the embodiments of the present disclosure, in the case of obtaining reconstructed sample audio data, a loss value may be calculated from the sample audio data and the reconstructed sample audio data.
The sample tone acoustic features of the sample audio data and the reconstructed tone acoustic features of the reconstructed sample audio data can be extracted, and then loss values between the sample tone acoustic features and the reconstructed tone acoustic features can be calculated. Alternatively, the loss value may also be calculated according to other parameter characteristics of the sample audio data and the reconstructed sample audio data, or the loss value may also be calculated according to a characterization obtained after further processing the sample audio data and the reconstructed sample audio data, which is not limited in this disclosure.
S5: and updating parameters of the voice conversion model according to the loss value.
In the embodiment of the present disclosure, a loss value is calculated according to the sample audio data and the reconstructed sample audio data, and the speech conversion model is subjected to parameter updating according to the loss value.
It should be noted that, in the embodiment of the present disclosure, updating the parameters of the speech conversion model according to the loss value may mean updating the parameters of the pre-trained text encoder of the speech conversion model, or updating the parameters of the timbre decoder of the speech conversion model, or updating the parameters of the pre-trained text encoder and the timbre decoder of the speech conversion model at the same time.
Under the condition that the loss value is small enough and stable, the voice conversion model can be judged to meet the requirement at the moment, and the trained voice conversion model can be obtained.
In the embodiment of the disclosure, the sample audio data is input to the pre-trained text encoder of the speech conversion model to generate a text encoding vector, and the text encoding vector and the sample timbre are input to the timbre decoder of the speech conversion model to generate reconstructed sample audio data; with this joint training approach each module can contribute information, so the performance of voice conversion can be improved. In addition, the trained voice conversion model obtained through the joint training is an end-to-end voice conversion system and has better intonation-variation accuracy and emotion-following capability.
By implementing the embodiment of the disclosure, a training audio data set is acquired, the training audio data set comprising sample audio data of at least one target object and a sample timbre corresponding to the sample audio data; the sample audio data is input to the pre-trained text encoder of the voice conversion model to generate a text encoding vector; the text encoding vector and the sample timbre are input to the timbre decoder of the voice conversion model to generate reconstructed sample audio data; a loss value is calculated from the sample audio data and the reconstructed sample audio data; and the parameters of the voice conversion model are updated according to the loss value. Therefore, the trained voice conversion model obtained through joint training can perform voice conversion end to end, performs audio reconstruction according to the text encoding vector, and has better intonation-variation and emotion-following capability.
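A minimal sketch of the joint training loop of steps S1 to S5 is given below, assuming the pre-trained text encoder and the timbre decoder are PyTorch modules and that the data loader yields (acoustic features, sample timbre) pairs. The simple L1 reconstruction loss and the single shared optimizer are assumptions; the loss variants and the staged update strategy are described in the following embodiments.

    # Hypothetical joint training loop for S1-S5.
    import torch
    import torch.nn.functional as F

    def train_voice_conversion_model(text_encoder, timbre_decoder, data_loader,
                                     num_steps: int = 10000, lr: float = 1e-4):
        optimizer = torch.optim.Adam(
            list(text_encoder.parameters()) + list(timbre_decoder.parameters()), lr=lr)
        step = 0
        for features, timbre in data_loader:                       # S1: training audio data set
            text_encoding = text_encoder(features)                 # S2: text encoding vector
            reconstructed = timbre_decoder(text_encoding, timbre)  # S3: audio reconstruction
            loss = F.l1_loss(reconstructed, features)              # S4: loss value (assumed L1)
            optimizer.zero_grad()
            loss.backward()                                        # S5: parameter update
            optimizer.step()
            step += 1
            if step >= num_steps:
                break
        return text_encoder, timbre_decoder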
In some embodiments, the loss value comprises a first loss value and/or a second loss value, and calculating the loss value from the sample audio data and the reconstructed sample audio data comprises: calculating a first loss value between a sample timbre acoustic feature of the sample audio data and a reconstructed timbre acoustic feature of the reconstructed sample audio data, and/or inputting the reconstructed sample audio data into the pre-trained text encoder to generate a reconstructed text encoding vector and calculating a second loss value between the text encoding vector and the reconstructed text encoding vector.
In the embodiment of the present disclosure, a loss value is calculated according to sample audio data and reconstructed sample audio data, a sample tone acoustic feature of the sample audio data may be extracted, a reconstructed tone acoustic feature of the reconstructed sample audio data may be extracted, and then a first loss value between the sample tone acoustic feature and the reconstructed tone acoustic feature may be calculated.
In the embodiment of the present disclosure, the loss value is calculated according to the sample audio data and the reconstructed sample audio data, the reconstructed sample audio data may be input to a pre-training text encoder, a reconstructed text encoding vector is generated, and a second loss value between the text encoding vector and the reconstructed text encoding vector is calculated.
For example, as shown in FIG. 2, in the embodiment of the present disclosure, the loss value is calculated from the sample audio data and the reconstructed sample audio data and includes a first loss value and a second loss value: the sample timbre acoustic feature of the sample audio data and the reconstructed timbre acoustic feature of the reconstructed sample audio data may be extracted, and the first loss value between the sample timbre acoustic feature and the reconstructed timbre acoustic feature is calculated; and the reconstructed sample audio data is input to the pre-trained text encoder to generate a reconstructed text encoding vector, and the second loss value between the text encoding vector and the reconstructed text encoding vector is calculated.
In some embodiments, calculating a first loss value between a sample timbre acoustic feature of the sample audio data and a reconstructed timbre acoustic feature of the reconstructed sample audio data comprises:
obtaining the sample timbre acoustic feature Y_i and the reconstructed timbre acoustic feature Ŷ_i of the i-th frame, wherein i is a positive integer;
calculating the first loss value L_reconst from the sample timbre acoustic feature Y_i and the reconstructed timbre acoustic feature Ŷ_i of the i-th frame, wherein the first loss value L_reconst satisfies the following relationship:

$$L_{\mathrm{reconst}} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert Y_i - \hat{Y}_i \right\rVert$$

wherein N is a positive integer.
In the embodiment of the present disclosure, N may be the total number of frames of the sample audio data; the differences between the sample timbre acoustic feature Y_i and the reconstructed timbre acoustic feature Ŷ_i of each frame are summed and averaged to obtain the first loss value L_reconst.
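A minimal sketch of the first and second loss values follows, assuming mel-spectrogram-like tensors as the timbre acoustic features and a callable pre-trained text encoder that accepts the reconstructed features. The L1 per-frame distance and the MSE distance between text encoding vectors are assumptions; the description above only specifies summing and averaging the per-frame differences.

    # Hypothetical loss computation: first loss (reconstruction) and/or second loss
    # (consistency between the original and re-encoded text encoding vectors).
    import torch
    import torch.nn.functional as F

    def first_loss(sample_features: torch.Tensor, reconstructed_features: torch.Tensor) -> torch.Tensor:
        # L_reconst: per-frame differences summed over all N frames and averaged.
        return F.l1_loss(reconstructed_features, sample_features)

    def second_loss(text_encoder, text_encoding: torch.Tensor,
                    reconstructed_features: torch.Tensor) -> torch.Tensor:
        # Re-encode the reconstructed sample audio data and compare the two text encodings.
        reconstructed_encoding = text_encoder(reconstructed_features)
        return F.mse_loss(reconstructed_encoding, text_encoding)

    def total_loss(text_encoder, sample_features, reconstructed_features, text_encoding,
                   use_first: bool = True, use_second: bool = True) -> torch.Tensor:
        terms = []
        if use_first:
            terms.append(first_loss(sample_features, reconstructed_features))
        if use_second:
            terms.append(second_loss(text_encoder, text_encoding, reconstructed_features))
        return torch.stack(terms).sum()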
In some embodiments, the parameter updating of the speech conversion model according to the loss value includes at least one of the following:
in response to at least one of the loss values over a preset number of consecutive iterations being greater than a first preset value, updating the parameters of the timbre decoder in the voice conversion model;
in response to the range of the loss values over a preset number of consecutive iterations being greater than a second preset value, updating the parameters of the timbre decoder in the voice conversion model;
in response to the loss values over a preset number of consecutive iterations all being less than the first preset value, updating the parameters of the pre-trained text encoder and the timbre decoder in the voice conversion model;
and in response to the range of the loss values over a preset number of consecutive iterations being less than the second preset value, updating the parameters of the pre-trained text encoder and the timbre decoder in the speech conversion model.
In the embodiment of the present disclosure, updating the parameters of the speech conversion model according to the loss value includes: in the case where a first loss value between the sample timbre acoustic features of the sample audio data and the reconstructed timbre acoustic features of the reconstructed sample audio data has been calculated, and at least one of the loss values over a preset number of consecutive iterations is greater than the first preset value, updating the parameters of the timbre decoder in the speech conversion model.
The preset number of iterations may be 50, 100, or the like, and the first preset value may be set as needed, for example 0.2 or 0.3, which is not specifically limited in the embodiments of the present disclosure.
In the embodiment of the present disclosure, updating the parameters of the speech conversion model according to the loss value includes: in the case where a first loss value between the sample timbre acoustic features of the sample audio data and the reconstructed timbre acoustic features of the reconstructed sample audio data has been calculated, and the loss values over a preset number of consecutive iterations are all less than the first preset value, updating the parameters of the pre-trained text encoder and the timbre decoder in the speech conversion model.
In the embodiment of the present disclosure, updating the parameters of the speech conversion model according to the loss value includes: in the case where the range of the loss values over a preset number of consecutive iterations is greater than the second preset value, updating the parameters of the timbre decoder in the speech conversion model.
The preset number of iterations may be 50, 100, or the like, and the second preset value may be set as needed, for example 0.05 or 0.01, which is not specifically limited in the embodiments of the present disclosure.
In the embodiment of the present disclosure, updating the parameters of the speech conversion model according to the loss value includes: in the case where the range of the loss values over a preset number of consecutive iterations is less than the second preset value, updating the parameters of the pre-trained text encoder and the timbre decoder in the speech conversion model.
It can be understood that, because the parameters of the pre-trained text encoder (Conformer encoder) of the speech conversion model have already converged to a stable state on the speech recognition task, voice conversion for a specific speaker requires only small-scale updates to them, whereas the parameters of the timbre decoder of the speech conversion model are randomly initialized with no prior information to rely on; therefore, different update strategies are designed for the parameters of the two modules during joint training.
During the first part of training, the parameters of the Conformer encoder module of the pre-trained text encoder of the voice conversion model are not updated, and only the parameters of the timbre decoder are updated; after that, gradient back-propagation through the Conformer encoder module of the pre-trained text encoder is turned on, but with a smaller learning rate for its updates.
Specifically, the timbre decoder in the voice conversion model is updated while at least one of the loss values over a preset number of consecutive iterations is greater than the first preset value, and the pre-trained text encoder and the timbre decoder are updated once the loss values over a preset number of consecutive iterations are all less than the first preset value; and/or the timbre decoder in the voice conversion model is updated while the range of the loss values over a preset number of consecutive iterations is greater than the second preset value, and the pre-trained text encoder and the timbre decoder are updated once the range of the loss values over a preset number of consecutive iterations is less than the second preset value.
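The staged update strategy above can be sketched as follows, assuming the loss values of the most recent iterations are tracked in a queue and that the range means the maximum minus the minimum of those losses. The thresholds, window length and optimizer setup are example assumptions; the smaller learning rate for the pre-trained text encoder follows the description above.

    # Hypothetical staged parameter update: update only the timbre decoder while the
    # recent losses are large or unstable; once they are all below the first preset
    # value (or their range is below the second preset value), also update the
    # pre-trained text encoder, with a smaller learning rate.
    from collections import deque
    import torch

    def make_optimizers(text_encoder, timbre_decoder, decoder_lr=1e-4, encoder_lr=1e-5):
        decoder_opt = torch.optim.Adam(timbre_decoder.parameters(), lr=decoder_lr)
        encoder_opt = torch.optim.Adam(text_encoder.parameters(), lr=encoder_lr)  # smaller lr
        return decoder_opt, encoder_opt

    def update_parameters(loss, decoder_opt, encoder_opt, recent_losses: deque,
                          first_preset=0.2, second_preset=0.01, preset_times=100):
        recent_losses.append(float(loss))
        if len(recent_losses) > preset_times:
            recent_losses.popleft()
        stable = (len(recent_losses) == preset_times and
                  (max(recent_losses) < first_preset or                       # all losses small
                   max(recent_losses) - min(recent_losses) < second_preset))  # range small
        loss.backward()
        decoder_opt.step()                 # the timbre decoder is always updated
        if stable:
            encoder_opt.step()             # the pre-trained text encoder joins the update
        decoder_opt.zero_grad()
        encoder_opt.zero_grad()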
As shown in fig. 3, in some embodiments, the method for training a speech conversion model provided in the embodiments of the present disclosure further includes:
s11: acquiring a training text data set; wherein the training text data set comprises: the pre-training audio data of the at least one object and the pre-training text data corresponding to the pre-training audio data.
In the embodiment of the disclosure, a training text data set is obtained; wherein the training text data set comprises: the pre-training audio data of the at least one object and the pre-training text data corresponding to the pre-training audio data.
Wherein, the training text data set includes pre-training audio data of one or more objects, and the object may be: a general user, or a user with a particular timbre. It is understood that common users may include the elderly, middle aged, young, children, etc.
In the embodiment of the present disclosure, the pre-training audio data of one or more objects and the pre-training text data corresponding to the pre-training audio data are obtained; this process may be implemented by manual labeling or by using mature techniques in the related art, which is not specifically limited in the embodiments of the present disclosure.
The pre-training audio data of a subject may include one or more pieces of speech data.
S12: and inputting the pre-training audio data into a text encoder, performing text encoding, and generating a pre-training text encoding vector.
In the embodiment of the disclosure, in the case of acquiring pre-training audio data of one or more objects, the pre-training audio data is input to a text encoder to generate a pre-training text encoding vector. The text encoder may be a Conformer encoder.
In the embodiment of the disclosure, the pre-training audio data is input to the Conformer encoder to generate the pre-training text encoding vector.
S13: and inputting the pre-training text coding vector into a text decoder, and performing decoding processing to generate target text data.
In the embodiment of the present disclosure, in the case where the pre-training audio data has been input to the text encoder to generate the pre-training text encoding vector, the pre-training text encoding vector is further input to a text decoder to generate the target text data. The text decoder (Decoder) decodes the pre-training text encoding vector to generate the target text data.
S14: a first pre-training loss value is calculated based on the target text data and the pre-training text data.
In the embodiment of the disclosure, the target text data is obtained by inputting the pre-training audio data to the text encoder to generate the pre-training text encoding vector and then inputting the pre-training text encoding vector to the text decoder.
On this basis, a first pre-training loss value may be calculated from the target text data and the pre-training text data.
It can be understood that the target text data and the pre-training text data are both texts, and the first pre-training loss value may be calculated by comparing differences between the texts, or a text vector corresponding to the target text data and the pre-training text data may be further obtained to calculate the first pre-training loss value, or a method in the related art may also be adopted, and the like, which is not specifically limited in the embodiment of the present disclosure.
S15: and updating parameters of the text encoder according to the first pre-training loss value to obtain a pre-training text encoder.
In the embodiment of the disclosure, after the first pre-training loss value is calculated according to the target text data and the pre-training text data, the parameters of the text encoder are updated according to the first pre-training loss value to obtain the pre-trained text encoder. The parameters of the text encoder and the text decoder may be updated simultaneously according to the first pre-training loss value.
Under the condition that the first pre-training loss value is small enough and stable, the text encoder and the text decoder can be judged to meet the requirements at the moment, the trained text encoder and the trained text decoder can be obtained, and the pre-training text encoder is generated.
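A minimal sketch of the pre-training procedure S11 to S15 follows, assuming the data loader yields acoustic features with frame-aligned reference text tokens and that a simple cross-entropy objective stands in for the first pre-training loss; a real system would typically use a sequence-to-sequence or CTC criterion, so this is an illustrative simplification.

    # Hypothetical pre-training of the text encoder with a text decoder (S11-S15).
    import torch
    import torch.nn.functional as F

    def pretrain_text_encoder(text_encoder, text_decoder, data_loader,
                              num_steps: int = 10000, lr: float = 1e-4):
        optimizer = torch.optim.Adam(
            list(text_encoder.parameters()) + list(text_decoder.parameters()), lr=lr)
        step = 0
        for features, text_tokens in data_loader:     # S11: pre-training audio + text data
            encoding = text_encoder(features)         # S12: pre-training text encoding vector
            logits = text_decoder(encoding)           # S13: decode to target text, (B, T, vocab)
            # S14: first pre-training loss between target text and pre-training text
            loss = F.cross_entropy(logits.transpose(1, 2), text_tokens)
            optimizer.zero_grad()
            loss.backward()                           # S15: update the encoder (and decoder)
            optimizer.step()
            step += 1
            if step >= num_steps:
                break
        return text_encoder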
In some embodiments, the training text data set further includes pre-training monophone data corresponding to the pre-training audio data, and the method further comprises: inputting the pre-training text encoding vector into a phoneme decoder and performing phoneme decoding to generate target monophone data; calculating a second pre-training loss value according to the target monophone data and the pre-training monophone data; and updating the parameters of the text encoder according to the first pre-training loss value and the second pre-training loss value to generate the pre-trained text encoder.
In the embodiment of the present disclosure, the training text data set further includes pre-training monophone data corresponding to the pre-training audio data, for example the monophones a, o, e, etc. in Chinese pinyin.
In the embodiment of the disclosure, after the pre-training audio data is input to the text encoder to generate the pre-training text encoding vector, several linear layers are added after the text encoder, and the pre-training text encoding vector is mapped to monophones to generate the target monophone data.
On this basis, in the case where the target monophone data and the pre-training monophone data are obtained, a second pre-training loss value may be calculated from the target monophone data and the pre-training monophone data, using a monophone loss function as the optimization target.
The parameters of the text encoder may then be updated according to the second pre-training loss value to obtain the pre-trained text encoder.
The parameters of the text encoder and the text decoder may also be updated simultaneously according to the first pre-training loss value and the second pre-training loss value. In the case where the first pre-training loss value and the second pre-training loss value are small enough and stable, it can be judged that the text encoder and the text decoder meet the requirements, the trained text encoder and text decoder are obtained, and the pre-trained text encoder is generated.
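A sketch of the monophone multi-task variant follows, assuming the phoneme decoder is a small stack of linear layers placed after the text encoder and that frame-aligned monophone labels are available; the layer sizes, the cross-entropy objectives and the loss weighting are assumptions.

    # Hypothetical phoneme decoder and combined pre-training loss.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PhonemeDecoder(nn.Module):
        """Linear layers added after the text encoder, mapping encodings to monophones."""
        def __init__(self, enc_dim: int = 256, num_phonemes: int = 60):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Linear(enc_dim, enc_dim), nn.ReLU(), nn.Linear(enc_dim, num_phonemes))

        def forward(self, encoding: torch.Tensor) -> torch.Tensor:
            return self.layers(encoding)              # (B, T, num_phonemes)

    def combined_pretraining_loss(text_logits, text_tokens,
                                  phoneme_logits, phoneme_tokens,
                                  phoneme_weight: float = 1.0) -> torch.Tensor:
        first = F.cross_entropy(text_logits.transpose(1, 2), text_tokens)         # text loss
        second = F.cross_entropy(phoneme_logits.transpose(1, 2), phoneme_tokens)  # monophone loss
        return first + phoneme_weight * second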
Fig. 4 is a flowchart of a voice conversion method according to an embodiment of the present disclosure.
As shown in fig. 4, the speech conversion method provided by the embodiment of the present disclosure includes, but is not limited to, the following steps:
s10: audio data of an original object is acquired.
In the embodiment of the disclosure, the audio data of the original object is obtained, and the audio data of the original object may be a piece of recording data uploaded by a user, or a piece of recording data recorded by the user in real time.
It can be understood that the voice conversion method provided by the embodiment of the present disclosure may be executed by a terminal device, for example: smart phones, computers, etc. The user can record a segment of his own voice by using the terminal device, so that the terminal device obtains the audio data of the original object.
S20: and determining the target tone of the target object to be converted.
In the embodiment of the present disclosure, a reference table of the target tone of the target object may be provided, so that the user may select the target tone of the target object according to the reference table; alternatively, only the target object may be provided, and after the user selects the target object, the target timbre may be determined according to the selected target object.
It is understood that after the user uses the terminal device to record a segment of his/her own voice, the user can select the target tone of the target object to be converted.
S30: inputting the audio data and the target tone into a trained voice conversion model, and performing voice conversion processing to generate target audio data of a target object; wherein the trained speech conversion model is obtained by training according to the method of some embodiments above.
In the embodiment of the present disclosure, in the case of acquiring audio data of an original object and determining a target tone of a target object to be converted, the audio data and the target tone are input to a trained speech conversion model, and target audio data of the target object may be generated. Thereby realizing voice conversion.
The trained speech conversion model is obtained by training according to the methods in the above embodiments, and reference may be specifically made to the relevant description in the above embodiments, which is not described herein again.
To facilitate understanding, the disclosed embodiments provide an exemplary embodiment.
In the embodiment of the present disclosure, the terminal device can execute the voice conversion method. A user can record a segment of their own speech with the terminal device and select a target object, for example Crayon Shin-chan, on the basis of which the target timbre corresponding to Crayon Shin-chan is determined.
Then, the speech recorded by the user and the target timbre corresponding to Crayon Shin-chan are input to the trained voice conversion model, and target audio data in Crayon Shin-chan's voice can be generated. For example, the speech recorded by the user is "Hello, I am Crayon Shin-chan"; it can be understood that the recorded speech carries the user's own timbre, which is different from Crayon Shin-chan's timbre. After finishing the recording, the user can choose to convert the speech into Crayon Shin-chan's timbre; the speech is input to the trained voice conversion model, audio data in Crayon Shin-chan's timbre can be generated, and the resulting target audio data has Crayon Shin-chan's timbre.
It should be noted that the user may also first select the target timbre of the target object and then record the speech; the order of these steps may be adjusted as needed.
By implementing the embodiment of the disclosure, audio data of an original object is acquired; a target timbre of a target object to be converted is determined; and the audio data and the target timbre are input to a trained voice conversion model to generate target audio data of the target object, wherein the trained voice conversion model is obtained by training according to the method of some embodiments above. Voice conversion can thus be achieved, with good intonation-variation accuracy and emotion-following capability.
As shown in FIG. 5, in some embodiments, the trained speech conversion model includes: a trained text encoder and a trained tone decoder, wherein, S30: inputting the audio data and the target tone color into a trained voice conversion model to generate target audio data of a target object, wherein the method comprises the following steps:
s301: and inputting the audio data into a trained text encoder, and performing text encoding processing to generate a target text encoding vector.
S302: and inputting the target text coding vector and the target tone into a trained tone decoder, and decoding to generate target audio data of the target object.
In the embodiment of the present disclosure, the trained speech conversion model includes: a trained text encoder and a trained timbre decoder. The trained text encoder may be a trained Conformer encoder, and the trained timbre decoder may be a trained Decoder.
In the embodiment of the present disclosure, audio data is input to a trained text encoder to generate a target text encoding vector, and the target text encoding vector and a target timbre are input to a trained timbre decoder to generate target audio data of a target object.
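A minimal inference sketch of S301 and S302 follows, assuming the trained text encoder and trained timbre decoder have the same tensor interfaces as in the training sketches above; extracting acoustic features from the recording and synthesising a waveform from the decoder output (for example with a separate vocoder) are outside the sketch and are assumptions.

    # Hypothetical voice conversion inference (S301-S302).
    import torch

    @torch.no_grad()
    def convert_voice(trained_text_encoder, trained_timbre_decoder,
                      original_features: torch.Tensor,
                      target_timbre: torch.Tensor) -> torch.Tensor:
        target_text_encoding = trained_text_encoder(original_features)                       # S301
        target_audio_features = trained_timbre_decoder(target_text_encoding, target_timbre)  # S302
        return target_audio_features   # acoustic features of the target object's audio data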
Fig. 6 is a block diagram of a speech conversion model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the speech conversion model training apparatus 1 includes: a data set acquisition unit 11, an encoding module 12, a speech reconstruction unit 13, a loss calculation unit 14 and a model update unit 15.
A data set acquisition unit 11 for acquiring a training audio data set; wherein training the audio data set comprises: sample audio data of at least one target object, and a sample tone color corresponding to the sample audio data.
And the encoding module 12 is configured to input the sample audio data to a pre-training text encoder in the speech conversion model, perform text encoding processing, and generate a text encoding vector.
And the voice reconstruction unit 13 is configured to input the text coding vector and the sample tone to a tone decoder in the voice conversion model, perform audio reconstruction processing, and generate reconstructed sample audio data.
A loss calculation unit 14 for calculating a loss value based on the sample audio data and the reconstructed sample audio data.
And the model updating unit 15 is used for updating parameters of the voice conversion model according to the loss value.
By implementing the disclosed embodiment, the data set acquisition unit 11 acquires a training audio data set, the training audio data set comprising sample audio data of at least one target object and a sample timbre corresponding to the sample audio data; the encoding module 12 inputs the sample audio data to the pre-trained text encoder in the speech conversion model and performs text encoding processing to generate a text encoding vector; the speech reconstruction unit 13 inputs the text encoding vector and the sample timbre to the timbre decoder in the speech conversion model and performs audio reconstruction processing to generate reconstructed sample audio data; the loss calculation unit 14 calculates a loss value from the sample audio data and the reconstructed sample audio data; and the model updating unit 15 updates the parameters of the speech conversion model according to the loss value. Therefore, the trained voice conversion model obtained through joint training can perform voice conversion end to end, performs audio reconstruction according to the text encoding vector, and has better intonation-variation and emotion-following capability.
In some embodiments, the loss value comprises a first loss value and/or a second loss value, and the loss calculation unit 14 is specifically configured to: calculating a first loss value between the sample tone acoustic features of the sample audio data and the reconstructed tone acoustic features of the reconstructed sample audio data, and/or inputting the reconstructed sample audio data into a pre-training text encoder, generating a reconstructed text encoding vector, and calculating a second loss value between the text encoding vector and the reconstructed text encoding vector.
In some embodiments, the model updating unit 15 is specifically configured to:
in response to at least one of the loss values over a preset number of consecutive iterations being not less than a first preset value, update the parameters of the timbre decoder in the speech conversion model;
in response to the range of the loss values over a preset number of consecutive iterations being not less than a second preset value, update the parameters of the timbre decoder in the voice conversion model;
in response to the loss values over a preset number of consecutive iterations all being less than the first preset value, update the parameters of the pre-trained text encoder and the timbre decoder in the voice conversion model;
and in response to the range of the loss values over a preset number of consecutive iterations being less than the second preset value, update the parameters of the pre-trained text encoder and the timbre decoder in the voice conversion model.
As shown in fig. 7, in some embodiments, the speech conversion model training apparatus 1 further includes:
a training text acquisition unit 61, configured to acquire a training text data set; wherein the training text data set comprises: pre-training audio data of at least one object and pre-training text data corresponding to the pre-training audio data.
A text encoding unit 62, configured to input the pre-training audio data to a text encoder and perform text encoding to generate a pre-training text encoding vector.
A text decoding unit 63, configured to input the pre-training text encoding vector to a text decoder and perform decoding processing to generate target text data.
A first training loss calculating unit 64, configured to calculate a first pre-training loss value according to the target text data and the pre-training text data.
A first encoder updating unit 65, configured to update the parameters of the text encoder according to the first pre-training loss value to obtain the pre-training text encoder.
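For illustration, one pre-training step of the text encoder together with the text decoder might look as follows; the token-level cross-entropy loss and the tensor shapes are assumptions of this sketch rather than requirements of the embodiment.

import torch.nn.functional as F

def pretrain_step(text_encoder, text_decoder, optimizer, audio, text_ids):
    # Encode the pre-training audio into a pre-training text encoding vector sequence.
    enc = text_encoder(audio)                  # assumed shape (batch, time, dim)
    # Decode the encoding into per-step distributions over text tokens.
    logits = text_decoder(enc)                 # assumed shape (batch, time, vocab)
    # First pre-training loss: cross-entropy between decoded text and reference text.
    loss = F.cross_entropy(logits.transpose(1, 2), text_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()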
As shown in fig. 8, in some embodiments, the training text data set further includes pre-training monophone data corresponding to the pre-training audio data, and the speech conversion model training apparatus 1 further includes:
a phoneme decoding unit 71, configured to input the pre-training text encoding vector to a phoneme decoder and perform phoneme decoding to generate target monophone data.
A second training loss calculating unit 72, configured to calculate a second pre-training loss value according to the target monophone data and the pre-training monophone data.
A second encoder updating unit 73, configured to update the parameters of the text encoder according to the first pre-training loss value and the second pre-training loss value to generate the pre-training text encoder.
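A sketch of how the first and second pre-training loss values could be combined is given below; the equal weighting of the two losses and the cross-entropy form are assumptions for the example.

import torch.nn.functional as F

def pretrain_losses(text_encoder, text_decoder, phoneme_decoder, audio, text_ids, phoneme_ids):
    enc = text_encoder(audio)
    # First pre-training loss: decoded text versus the pre-training text data.
    text_loss = F.cross_entropy(text_decoder(enc).transpose(1, 2), text_ids)
    # Second pre-training loss: decoded monophones versus the pre-training monophone data.
    phoneme_loss = F.cross_entropy(phoneme_decoder(enc).transpose(1, 2), phoneme_ids)
    # Equal weighting is an assumption; the embodiment only states that both losses are used.
    return text_loss + phoneme_loss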
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The beneficial effects obtained by the speech conversion model training apparatus in the embodiments of the present disclosure are the same as those obtained by the above example speech conversion model training method, and are not described herein again.
Fig. 9 is a structural diagram of a voice conversion apparatus according to an embodiment of the present disclosure.
As shown in fig. 9, the speech conversion apparatus 80 includes: an audio data acquisition unit 801, a target tone color determination unit 802, and a target audio acquisition unit 803.
An audio data acquisition unit 801 for acquiring audio data of an original object.
A target tone color determination unit 802, configured to determine a target tone color of a target object to be converted.
A target audio acquisition unit 803, configured to input the audio data and the target timbre to the trained speech conversion model, perform voice conversion processing, and generate target audio data of the target object; wherein the trained speech conversion model is obtained by training according to the method of some embodiments above.
By implementing the embodiment of the present disclosure, the audio data acquisition unit 801 acquires audio data of an original object; the target tone color determination unit 802 determines the target timbre of the target object to be converted; and the target audio acquisition unit 803 inputs the audio data and the target timbre to the trained speech conversion model, performs voice conversion processing, and generates target audio data of the target object, where the trained speech conversion model is obtained by training according to the method of some embodiments above. Therefore, voice conversion can be realized with good intonation-change and emotion-following capability.
As shown in fig. 10, in some embodiments, the trained speech conversion model includes a trained text encoder and a trained timbre decoder, and the target audio acquisition unit 803 includes: a target text encoding module 8031 and a target audio decoding module 8032.
The target text encoding module 8031 is configured to input the audio data to the trained text encoder and perform text encoding processing to generate a target text encoding vector.
The target audio decoding module 8032 is configured to input the target text encoding vector and the target timbre to the trained timbre decoder and perform decoding processing to generate the target audio data of the target object.
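For illustration, inference with the trained text encoder and trained timbre decoder can be sketched as follows; the function and argument names are assumptions, and the sketch simply chains the two modules as described above.

import torch

@torch.no_grad()
def convert_voice(trained_text_encoder, trained_timbre_decoder, source_audio, target_timbre):
    # Text encoding processing on the source audio yields the target text encoding vector.
    target_text_vec = trained_text_encoder(source_audio)
    # Decoding with the target timbre yields the target audio data of the target object.
    return trained_timbre_decoder(target_text_vec, target_timbre)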
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The beneficial effects obtained by the voice conversion apparatus in the embodiment of the present disclosure are the same as those obtained by the above-mentioned exemplary voice conversion method, and are not described herein again.
Fig. 11 is a block diagram of an electronic device 100 for executing a speech conversion model training method or a speech conversion method according to an embodiment of the present disclosure.
Illustratively, the electronic device 100 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
As shown in fig. 11, electronic device 100 may include one or more of the following components: processing component 101, memory 102, power component 103, multimedia component 104, audio component 105, interface to input/output (I/O) 106, sensor component 107, and communication component 108.
The processing component 101 generally controls overall operation of the electronic device 100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 101 may include one or more processors 1011 to execute instructions to perform all or part of the steps of the method described above. Further, the processing component 101 may include one or more modules that facilitate interaction between the processing component 101 and other components. For example, the processing component 101 may include a multimedia module to facilitate interaction between the multimedia component 104 and the processing component 101.
The memory 102 is configured to store various types of data to support operations at the electronic device 100. Examples of such data include instructions for any application or method operating on the electronic device 100, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 102 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as SRAM (Static Random-Access Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), ROM (Read-Only Memory), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The power supply component 103 provides power to the various components of the electronic device 100. Power components 103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 100.
The multimedia component 104 includes a touch-sensitive display screen that provides an output interface between the electronic device 100 and a user. In some embodiments, the Touch Display screen may include an LCD (Liquid Crystal Display) and a TP (Touch Panel). The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 104 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 100 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 105 is configured to output and/or input audio signals. For example, the audio component 105 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 100 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 102 or transmitted via the communication component 108. In some embodiments, audio component 105 also includes a speaker for outputting audio signals.
The I/O interface 106 provides an interface between the processing component 101 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 107 includes one or more sensors for providing status assessments of various aspects of the electronic device 100. For example, the sensor component 107 may detect an open/closed state of the electronic device 100 and the relative positioning of components such as the display and keypad of the electronic device 100, and may also detect a change in the position of the electronic device 100 or of one of its components, the presence or absence of user contact with the electronic device 100, the orientation or acceleration/deceleration of the electronic device 100, and a change in the temperature of the electronic device 100. The sensor component 107 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 107 may also include a light sensor, such as a CMOS (Complementary Metal Oxide Semiconductor) or CCD (Charge-Coupled Device) image sensor, for use in imaging applications. In some embodiments, the sensor component 107 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 108 is configured to facilitate wired or wireless communication between the electronic device 100 and other devices. The electronic device 100 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 108 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 108 further includes an NFC (Near Field Communication) module to facilitate short-range communications. For example, the NFC module may be implemented based on RFID (Radio Frequency Identification) technology, IrDA (Infrared Data Association) technology, UWB (Ultra Wide Band) technology, BT (Bluetooth) technology, and other technologies.
In an exemplary embodiment, the electronic device 100 may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described voice conversion model training method or voice conversion method. It should be noted that, for the implementation process and the technical principle of the electronic device of this embodiment, reference is made to the foregoing explanation of the speech conversion model training method or the speech conversion method in the embodiments of the present disclosure, and details are not repeated here.
The electronic device provided in the embodiments of the present disclosure may execute the speech conversion model training method or the speech conversion method described in some embodiments above, and the beneficial effects thereof are the same as those of the speech conversion model training method or the speech conversion method described above, and are not described herein again.
In order to implement the above embodiments, the present disclosure also provides a storage medium.
Wherein the instructions in the storage medium, when executed by a processor of the electronic device, enable the electronic device to perform a speech conversion model training method, or a speech conversion method, as previously described. For example, the storage medium may be a ROM (Read Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
To implement the above embodiments, the present disclosure also provides a computer program product, which when executed by a processor of an electronic device, enables the electronic device to perform the speech conversion model training method, or the speech conversion method, as described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A method for training a speech conversion model, comprising:
acquiring a training audio data set; wherein the training audio data set comprises: sample audio data of at least one target object and a sample tone corresponding to the sample audio data;
inputting the sample audio data into a pre-training text encoder in a voice conversion model, and performing text encoding processing to generate a text encoding vector;
inputting the text encoding vector and the sample tone to a tone decoder in the voice conversion model, and performing audio reconstruction processing to generate reconstructed sample audio data;
calculating a loss value from the sample audio data and the reconstructed sample audio data;
and updating parameters of the voice conversion model according to the loss value.
2. The method of claim 1, wherein the loss value comprises a first loss value and/or a second loss value, and wherein the calculating a loss value from the sample audio data and the reconstructed sample audio data comprises:
calculating a first loss value between a sample timbre acoustic feature of the sample audio data and a reconstructed timbre acoustic feature of the reconstructed sample audio data; and/or
inputting the reconstructed sample audio data into the pre-training text encoder to generate a reconstructed text encoding vector, and calculating a second loss value between the text encoding vector and the reconstructed text encoding vector.
3. The method of claim 2, wherein the updating parameters of the voice conversion model according to the loss value comprises at least one of:
in response to at least one of the loss values for a preset number of consecutive times being greater than a first preset value, updating parameters of the tone decoder in the voice conversion model;
in response to a range of the loss values for the preset number of consecutive times being greater than a second preset value, updating parameters of the tone decoder in the voice conversion model;
in response to each of the loss values for the preset number of consecutive times being smaller than the first preset value, updating parameters of the pre-training text encoder and the tone decoder in the voice conversion model;
and in response to the range of the loss values for the preset number of consecutive times being smaller than the second preset value, updating parameters of the pre-training text encoder and the tone decoder in the voice conversion model.
4. The method of any one of claims 1 to 3, further comprising:
acquiring a training text data set; wherein the training text data set comprises: pre-training audio data of at least one object and pre-training text data corresponding to the pre-training audio data;
inputting the pre-training audio data into a text encoder, and performing text encoding to generate a pre-training text encoding vector;
inputting the pre-training text encoding vector to a text decoder, and performing decoding processing to generate target text data;
calculating a first pre-training loss value according to the target text data and the pre-training text data;
and updating parameters of the text encoder according to the first pre-training loss value to generate the pre-training text encoder.
5. The method of claim 4, wherein the training text data set further comprises pre-training monophone data corresponding to the pre-training audio data, and wherein the method further comprises:
inputting the pre-training text encoding vector to a phoneme decoder, and performing phoneme decoding to generate target monophone data;
calculating a second pre-training loss value according to the target monophone data and the pre-training monophone data;
and updating parameters of the text encoder according to the first pre-training loss value and the second pre-training loss value to generate the pre-training text encoder.
6. A method of speech conversion, comprising:
acquiring audio data of an original object;
determining a target tone of a target object to be converted;
inputting the audio data and the target tone to a trained voice conversion model, and performing voice conversion processing to generate target audio data of the target object; wherein the trained voice conversion model is obtained by training according to the method of any one of claims 1 to 5.
7. The method of claim 6, wherein the trained voice conversion model comprises: a trained text encoder and a trained timbre decoder, and wherein the inputting the audio data and the target tone to the trained voice conversion model to generate target audio data of the target object comprises:
inputting the audio data to the trained text encoder, and performing text encoding processing to generate a target text encoding vector;
and inputting the target text encoding vector and the target tone to the trained timbre decoder, and performing decoding processing to generate the target audio data of the target object.
8. A speech conversion model training apparatus, comprising:
a data set acquisition unit for acquiring a training audio data set; wherein the training audio data set comprises: sample audio data of at least one target object and a sample tone corresponding to the sample audio data;
an encoding module for inputting the sample audio data into a pre-training text encoder in a voice conversion model, performing text encoding processing, and generating a text encoding vector;
a voice reconstruction unit for inputting the text encoding vector and the sample tone into a tone decoder in the voice conversion model, performing audio reconstruction processing, and generating reconstructed sample audio data;
a loss calculation unit for calculating a loss value from the sample audio data and the reconstructed sample audio data;
and a model updating unit for updating parameters of the voice conversion model according to the loss value.
9. A speech conversion apparatus, comprising:
an audio data acquisition unit for acquiring audio data of an original object;
a target tone color determination unit for determining a target tone color of a target object to be converted;
a target audio acquisition unit for inputting the audio data and the target tone color into a trained voice conversion model, performing voice conversion processing, and generating target audio data of the target object; wherein the trained voice conversion model is obtained by training according to the method of any one of claims 1 to 5.
10. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 5 or the processor is configured to execute the instructions to implement the method of claim 6 or 7.
11. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1 to 5, or wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of claim 6 or 7.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 5, or which, when executed by a processor, implements the method according to claim 6 or 7.
CN202210916630.8A 2022-08-01 2022-08-01 Voice conversion model training method, voice conversion method and device Pending CN115273831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210916630.8A CN115273831A (en) 2022-08-01 2022-08-01 Voice conversion model training method, voice conversion method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210916630.8A CN115273831A (en) 2022-08-01 2022-08-01 Voice conversion model training method, voice conversion method and device

Publications (1)

Publication Number Publication Date
CN115273831A true CN115273831A (en) 2022-11-01

Family

ID=83746585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210916630.8A Pending CN115273831A (en) 2022-08-01 2022-08-01 Voice conversion model training method, voice conversion method and device

Country Status (1)

Country Link
CN (1) CN115273831A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115602182A (en) * 2022-12-13 2023-01-13 广州感音科技有限公司(Cn) Sound conversion method, system, computer device and storage medium
CN116386602A (en) * 2023-05-30 2023-07-04 中国科学院自动化研究所 Training method of feature extraction model and voice identification method integrating pronunciation features


Similar Documents

Publication Publication Date Title
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN107705783B (en) Voice synthesis method and device
WO2020237937A1 (en) Image processing method and apparatus, electronic device and storage medium
CN110634483A (en) Man-machine interaction method and device, electronic equipment and storage medium
CN111583944A (en) Sound changing method and device
CN115273831A (en) Voice conversion model training method, voice conversion method and device
CN111640424B (en) Voice recognition method and device and electronic equipment
CN110415702A (en) Training method and device, conversion method and device
CN113362812A (en) Voice recognition method and device and electronic equipment
US11354520B2 (en) Data processing method and apparatus providing translation based on acoustic model, and storage medium
CN109670025B (en) Dialogue management method and device
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN113691833B (en) Virtual anchor face changing method and device, electronic equipment and storage medium
CN114356068B (en) Data processing method and device and electronic equipment
CN113115104B (en) Video processing method and device, electronic equipment and storage medium
CN114155849A (en) Virtual object processing method, device and medium
CN113923517A (en) Background music generation method and device and electronic equipment
CN113674731A (en) Speech synthesis processing method, apparatus and medium
CN113409765B (en) Speech synthesis method and device for speech synthesis
CN113345451B (en) Sound changing method and device and electronic equipment
CN113409764B (en) Speech synthesis method and device for speech synthesis
CN113345456B (en) Echo separation method, device and storage medium
CN113704457B (en) Method and device for generating abstract and storage medium
CN110580910B (en) Audio processing method, device, equipment and readable storage medium
CN110970015B (en) Voice processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination