CN113689867B - A training method, device, electronic device and medium for a speech conversion model - Google Patents
A training method, device, electronic device and medium for a speech conversion model
- Publication number
- CN113689867B CN202110950483.1A
- Authority
- CN
- China
- Prior art keywords
- model
- voice
- speech
- conversion model
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Electrically Operated Instructional Devices (AREA)
- Telephonic Communication Services (AREA)
Abstract
The present disclosure provides a training method and apparatus for a speech conversion model, an electronic device, and a medium, relating to the field of artificial intelligence and, in particular, to speech and deep learning technologies. The specific implementation scheme is as follows: the original acoustic features of a speech sample are input into a pre-trained model to obtain the latent features output by the pre-trained model; the latent features are input into the speech conversion model to obtain the predicted acoustic features output by the speech conversion model; and the speech conversion model to be trained is trained based on the original acoustic features and the predicted acoustic features. In the embodiments of the present application, the latent features can be used as the input of the speech conversion model to predict the target acoustic features, so that the model learns more fully and the solution is applicable to a wide range of scenarios.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, further relates to speech and deep learning technologies, and in particular to a training method and apparatus for a speech conversion model, an electronic device, and a medium.
Background
Voice conversion is attracting increasing attention in the market. Its purpose is to convert the voice of a source speaker into the timbre of a target speaker while keeping the spoken content unchanged. According to the corpus required by the model, voice conversion can be divided into parallel-corpus voice conversion and non-parallel-corpus voice conversion. Parallel-corpus voice conversion requires the source speaker and the target speaker to record audio of the same text when preparing the corpus. Non-parallel-corpus voice conversion only requires several recordings of the target speaker, and the source speaker's voice is not needed during training; common approaches include methods based on phoneme probability maps (PPG features) and self-reconstruction methods.
In the PPG-based approach, a speech recognition model first extracts PPG features representing the spoken content from the target speaker's audio, and a model is then trained to capture the relationship between the PPG features and the audio's mel features. At test time, PPG features are extracted from the source speaker's speech through the speech recognition model and fed into the trained conversion model to obtain the converted features. In the self-reconstruction approach, the general idea is to use an encoder to decouple the content information and the timbre information in the features during training, and then use a decoder to restore the information for self-reconstruction training.
Human speech consists of many speech frames. Because human pronunciation is inherently continuous, adjacent speech frames should be correlated with each other. However, the existing PPG features, or the mel feature frames input to the model, are independent of one another, so the information they carry is mutually independent. It is therefore difficult for a neural network model to learn the correlation between frames, which limits the learning ability of the model.
Summary of the Invention
The present disclosure provides a training method and apparatus for a speech conversion model, an electronic device, and a medium.
In a first aspect, the present application provides a training method for a speech conversion model, the method comprising:
inputting the original acoustic features of a speech sample into a pre-trained model to obtain the latent features output by the pre-trained model;
inputting the latent features into a speech conversion model to obtain the predicted acoustic features output by the speech conversion model; and
training the speech conversion model to be trained based on the original acoustic features and the predicted acoustic features.
In a second aspect, the present application provides a training apparatus for a speech conversion model, the apparatus comprising a pre-training module, a speech conversion module, and a training module, wherein:
the pre-training module is configured to input the original acoustic features of a speech sample into a pre-trained model to obtain the latent features output by the pre-trained model;
the speech conversion module is configured to input the latent features into a speech conversion model to obtain the predicted acoustic features output by the speech conversion model; and
the training module is configured to train the speech conversion model to be trained based on the original acoustic features and the predicted acoustic features.
In a third aspect, an embodiment of the present application provides an electronic device, comprising:
one or more processors; and
a memory for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the training method for a speech conversion model described in any embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides a storage medium on which a computer program is stored, and when the program is executed by a processor, the training method for a speech conversion model described in any embodiment of the present application is implemented.
In a fifth aspect, a computer program product is provided which, when executed by a computer device, implements the training method for a speech conversion model described in any embodiment of the present application.
The technology of the present application solves the technical problem in the prior art that, because the PPG features or the mel feature frames input to the model are independent of one another, the information they carry is mutually independent, making it difficult for a neural network model to learn the correlation between frames and thus limiting the model's learning ability. With the technical solution provided in the present application, a self-supervised model can be pre-trained with acoustic features, and the frame-level latent features can then be extracted from it as new acoustic features. Such latent features contain the interrelated information of the acoustic features; using them as the input of the conversion model to predict the target acoustic features allows the model to learn more fully and makes the solution applicable to a wide range of scenarios.
It should be understood that the content described in this section is not intended to identify key or critical features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Brief Description of the Drawings
The accompanying drawings are provided for a better understanding of the present solution and do not constitute a limitation of the present disclosure. In the drawings:
FIG. 1 is a first schematic flowchart of the training method for a speech conversion model provided by an embodiment of the present application;
FIG. 2 is a second schematic flowchart of the training method for a speech conversion model provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a training system for the speech conversion model provided by an embodiment of the present application;
FIG. 4 is a third schematic flowchart of the training method for a speech conversion model provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a prediction system for the speech conversion model provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a training apparatus for the speech conversion model provided by an embodiment of the present application;
FIG. 7 is a block diagram of an electronic device used to implement the training method for the speech conversion model according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments are included to facilitate understanding; they should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
Embodiment 1
FIG. 1 is a first schematic flowchart of the training method for a speech conversion model provided by an embodiment of the present application. The method may be performed by a training apparatus for a speech conversion model or by an electronic device; the apparatus or electronic device may be implemented in software and/or hardware and may be integrated into any smart device with a network communication function. As shown in FIG. 1, the training method for a speech conversion model may include the following steps:
S101. Input the original acoustic features of a speech sample into a pre-trained model to obtain the latent features output by the pre-trained model.
In this step, the electronic device may input the original acoustic features of the speech sample into the pre-trained model to obtain the latent features output by the pre-trained model. Specifically, when the speech conversion model to be trained does not satisfy a preset convergence condition, the electronic device may input the original acoustic features of the speech sample into the pre-trained model to obtain the latent features output by the pre-trained model. Further, the electronic device may input the original acoustic features into a neural network model to obtain a feature sequence output by the neural network model, and use this feature sequence as the latent features output by the pre-trained model. Further, the electronic device may first divide the original acoustic features into N acoustic feature units, where N is a natural number greater than 1; then mask one or more of the N acoustic feature units to obtain masked acoustic feature units; and then input the masked acoustic feature units into the neural network model to obtain the feature sequence output by the neural network model.
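The disclosure does not fix a particular architecture for the pre-trained model. As a minimal illustration only, the sketch below assumes the pre-trained model is a small Transformer encoder over mel-spectrogram frames; the class name, layer sizes, and checkpoint handling are assumptions introduced here for illustration and are not part of the disclosure.

```python
# Minimal sketch (an assumption, not the claimed architecture): extract
# frame-level latent features from a pre-trained encoder given the original
# acoustic features of one speech sample.
import torch
import torch.nn as nn

class PretrainedEncoder(nn.Module):
    """Stand-in for a self-supervised pre-trained model over mel frames."""
    def __init__(self, n_mels: int = 80, d_hidden: int = 256, n_layers: int = 4):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_hidden)
        layer = nn.TransformerEncoderLayer(d_model=d_hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: [batch, frames, n_mels] -> latent: [batch, frames, d_hidden]
        return self.encoder(self.proj(mel))

pretrained = PretrainedEncoder()
# pretrained.load_state_dict(torch.load("pretrained.pt"))  # hypothetical checkpoint
pretrained.eval()
with torch.no_grad():
    mel = torch.randn(1, 200, 80)    # original acoustic features of one utterance
    latent = pretrained(mel)         # frame-level latent features, shape [1, 200, 256]
```

In practice, any self-supervised model that produces frame-level latent features carrying cross-frame context could take this role.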
S102. Input the latent features into the speech conversion model to obtain the predicted acoustic features output by the speech conversion model.
In this step, the electronic device may input the latent features into the speech conversion model to obtain the predicted acoustic features output by the speech conversion model. Specifically, the electronic device may input the latent features into any model with a speech conversion function. For example, the speech conversion model may include a content encoder, a timbre encoder, and a decoder.
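As a rough sketch of a conversion model with the content encoder, timbre encoder, and decoder mentioned above, one possible arrangement follows; all layer types and dimensions are illustrative assumptions rather than the claimed architecture.

```python
# Minimal sketch of a speech conversion model with three sub-modules; the GRU
# and linear layer sizes below are assumptions made for illustration only.
import torch
import torch.nn as nn

class VoiceConversionModel(nn.Module):
    def __init__(self, d_latent: int = 256, n_mels: int = 80, d_spk: int = 64):
        super().__init__()
        # Content encoder: consumes the latent features from the pre-trained model.
        self.content_encoder = nn.GRU(d_latent, 128, batch_first=True, bidirectional=True)
        # Timbre encoder: summarizes the reference speaker's acoustic features.
        self.timbre_encoder = nn.GRU(n_mels, d_spk, batch_first=True)
        # Decoder: predicts acoustic features frame by frame.
        self.decoder = nn.Sequential(
            nn.Linear(256 + d_spk, 256), nn.ReLU(), nn.Linear(256, n_mels)
        )

    def forward(self, latent: torch.Tensor, ref_mel: torch.Tensor) -> torch.Tensor:
        content, _ = self.content_encoder(latent)                 # [B, T, 256]
        _, spk = self.timbre_encoder(ref_mel)                     # [1, B, d_spk]
        spk = spk[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        return self.decoder(torch.cat([content, spk], dim=-1))    # predicted mel [B, T, 80]
```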
S103. Train the speech conversion model to be trained based on the original acoustic features and the predicted acoustic features.
In this step, the electronic device may train the speech conversion model to be trained based on the original acoustic features and the predicted acoustic features. The electronic device may then select another speech sample and continue training the speech conversion model to be trained until it satisfies the preset convergence condition. The newly selected speech sample may be adjacent to the previous speech sample or not adjacent to it; no limitation is imposed here. Further, the electronic device may calculate the loss value of the speech conversion model to be trained for the speech sample based on the original acoustic features and the predicted acoustic features, and then adjust the model parameters of the speech conversion model to be trained according to this loss value.
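A minimal training-step sketch, reusing the PretrainedEncoder and VoiceConversionModel classes from the illustrative sketches above, might look as follows; the L1 reconstruction loss, the optimizer, the learning rate, and the convergence threshold are all assumptions, since the disclosure does not specify them.

```python
# Minimal training-loop sketch under the assumptions of the snippets above:
# latent features predict the original acoustic features, and training repeats
# until an assumed convergence condition is met.
import torch

pretrained = PretrainedEncoder().eval()    # pre-trained model, kept frozen here
model = VoiceConversionModel()             # speech conversion model to be trained
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.L1Loss()                # assumed reconstruction loss

def train_step(mel: torch.Tensor) -> float:
    with torch.no_grad():
        latent = pretrained(mel)           # latent features from the pre-trained model
    pred_mel = model(latent, mel)          # predicted acoustic features
    loss = loss_fn(pred_mel, mel)          # compared with the original acoustic features
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

for mel in (torch.randn(1, 200, 80) for _ in range(100)):   # placeholder speech samples
    if train_step(mel) < 0.05:                               # assumed convergence condition
        break
```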
In the training method for a speech conversion model proposed in the embodiments of the present application, the original acoustic features of a speech sample are first input into the pre-trained model to obtain the latent features output by the pre-trained model; the latent features are then input into the speech conversion model to obtain the predicted acoustic features output by the speech conversion model; and the speech conversion model to be trained is then trained based on the original acoustic features and the predicted acoustic features. In other words, when training the speech conversion model, the present application uses the pre-trained model to extract interrelated information from the original acoustic features. In existing training methods for speech conversion models, the PPG features or the mel feature frames input to the model are independent of one another, so the information they carry is mutually independent; it is difficult for a neural network model to learn the correlation between frames, and the learning ability of the model is limited. Because the present application extracts the correlated information through the pre-trained model, it overcomes this technical problem of the prior art. With the technical solution provided in the present application, a self-supervised model can be pre-trained with acoustic features, and the frame-level latent features extracted from it can serve as new acoustic features; such latent features contain the interrelated information of the acoustic features, and using them as the input of the speech conversion model to predict the target acoustic features allows the model to learn more fully and makes the solution applicable to a wide range of scenarios. In addition, the technical solution of the embodiments of the present application is simple and convenient to implement, easy to popularize, and has a wide scope of application.
Embodiment 2
FIG. 2 is a second schematic flowchart of the training method for a speech conversion model provided by an embodiment of the present application. This embodiment is further optimized and expanded on the basis of the above technical solution, and can be combined with each of the optional implementations described above. As shown in FIG. 2, the training method for a speech conversion model may include the following steps:
S201. Input the original acoustic features into a neural network model to obtain a feature sequence output by the neural network model, and use the feature sequence output by the neural network model as the latent features output by the pre-trained model.
In this step, the electronic device may input the original acoustic features into the neural network model to obtain a feature sequence output by the neural network model, and use this feature sequence as the latent features output by the pre-trained model. Specifically, the electronic device may first divide the original acoustic features into N acoustic feature units, where N is a natural number greater than 1; then mask one or more of the N acoustic feature units to obtain masked acoustic feature units; and then input the masked acoustic feature units into the neural network model to obtain the feature sequence output by the neural network model. For example, the original acoustic features may be divided into 11 acoustic feature units A, B, C, D, E, F, G, H, I, J and K; the four units C, F, G and I are masked to obtain the masked acoustic feature units; and the masked acoustic feature units are then input into the neural network model to obtain the feature sequence output by the neural network model, which may include the 11 acoustic feature units A, B, C, D, E, F, G, H, I, J and K.
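A minimal sketch of the unit-masking step described in this example is given below; the unit length, the number of masked units, and the use of zero as the mask value are assumptions made for illustration and are not specified by the disclosure.

```python
# Minimal sketch of the masking step: split the original acoustic features into
# N units, mask a few whole units, and feed the masked sequence to the network.
import torch

def mask_units(mel: torch.Tensor, unit_len: int = 10, n_masked: int = 4) -> torch.Tensor:
    """mel: [frames, n_mels]; returns a copy with n_masked whole units zeroed out."""
    masked = mel.clone()
    n_units = (mel.size(0) + unit_len - 1) // unit_len      # the N acoustic feature units
    picks = torch.randperm(n_units)[:n_masked]              # e.g. the units C, F, G and I
    for u in picks.tolist():
        masked[u * unit_len:(u + 1) * unit_len] = 0.0       # mask the whole unit
    return masked

mel = torch.randn(110, 80)        # 11 units of 10 frames each, matching the A..K example
masked_mel = mask_units(mel)      # input to the neural network of the pre-trained model
```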
S202. Input the latent features into the speech conversion model to obtain the predicted acoustic features output by the speech conversion model.
In this step, the electronic device may input the latent features into the speech conversion model to obtain the predicted acoustic features output by the speech conversion model. Specifically, the electronic device may input the original acoustic features into the neural network model to obtain the feature sequence output by the neural network model, and use this feature sequence as the latent features output by the pre-trained model.
S203. Calculate the loss value of the speech conversion model to be trained for the speech sample based on the original acoustic features and the predicted acoustic features.
In this step, the electronic device may calculate the loss value of the speech conversion model to be trained for the speech sample based on the original acoustic features and the predicted acoustic features. Specifically, the electronic device may input the original acoustic features and the predicted acoustic features into a pre-built loss function, and obtain the loss value of the speech conversion model to be trained for the speech sample through this loss function.
S204. Adjust the model parameters of the speech conversion model to be trained according to the loss value of the speech conversion model to be trained for the speech sample.
In this step, the electronic device may adjust the model parameters of the speech conversion model to be trained according to its loss value for the speech sample. Specifically, the speech conversion model to be trained may be a neural network that includes a convolutional layer, a pooling layer, and a fully connected layer, and the electronic device may adjust the model parameters in each of these layers according to the loss value of the speech conversion model to be trained for the speech sample.
FIG. 3 is a schematic structural diagram of a training system for the speech conversion model provided by an embodiment of the present application. As shown in FIG. 3, the training system of the speech conversion model may include a pre-trained model and a speech conversion model. The original acoustic features of a speech sample are first input into the pre-trained model to obtain the latent features output by the pre-trained model; the latent features are then input into the speech conversion model to obtain the predicted acoustic features output by the speech conversion model; and the speech conversion model to be trained is then trained based on the original acoustic features and the predicted acoustic features.
In the training method for a speech conversion model proposed in the embodiments of the present application, the original acoustic features of a speech sample are first input into the pre-trained model to obtain the latent features output by the pre-trained model; the latent features are then input into the speech conversion model to obtain the predicted acoustic features output by the speech conversion model; and the speech conversion model to be trained is then trained based on the original acoustic features and the predicted acoustic features. In other words, when training the speech conversion model, the present application uses the pre-trained model to extract interrelated information from the original acoustic features. In existing training methods for speech conversion models, the PPG features or the mel feature frames input to the model are independent of one another, so the information they carry is mutually independent; it is difficult for a neural network model to learn the correlation between frames, and the learning ability of the model is limited. Because the present application extracts the correlated information through the pre-trained model, it overcomes this technical problem of the prior art. With the technical solution provided in the present application, a self-supervised model can be pre-trained with acoustic features, and the frame-level latent features extracted from it can serve as new acoustic features; such latent features contain the interrelated information of the acoustic features, and using them as the input of the speech conversion model to predict the target acoustic features allows the model to learn more fully and makes the solution applicable to a wide range of scenarios. In addition, the technical solution of the embodiments of the present application is simple and convenient to implement, easy to popularize, and has a wide scope of application.
Embodiment 3
FIG. 4 is a third schematic flowchart of the training method for a speech conversion model provided by an embodiment of the present application. This embodiment is further optimized and expanded on the basis of the above technical solution, and can be combined with each of the optional implementations described above. As shown in FIG. 4, the training method for a speech conversion model may include the following steps:
S401. Input the original acoustic features into a neural network model to obtain a feature sequence output by the neural network model, and use the feature sequence output by the neural network model as the latent features output by the pre-trained model.
S402. Input the latent features into the speech conversion model to obtain the predicted acoustic features output by the speech conversion model.
S403. Calculate the loss value of the speech conversion model to be trained for the speech sample based on the original acoustic features and the predicted acoustic features.
S404. Adjust the model parameters of the speech conversion model to be trained according to the loss value of the speech conversion model to be trained for the speech sample.
S405. Input the original acoustic features of a first user for a first speech and the original acoustic features of a second user for a second speech into the trained speech conversion model respectively, and obtain, through the speech conversion model, a target speech converted from the first speech and the second speech, where the target speech includes the content information of the first speech and the timbre information of the second speech.
In this step, the electronic device may input the original acoustic features of the first user for the first speech and the original acoustic features of the second user for the second speech into the trained speech conversion model respectively, and obtain, through the speech conversion model, the target speech converted from the first speech and the second speech, where the target speech includes the content information of the first speech and the timbre information of the second speech. Specifically, the electronic device may first input the original acoustic features of the first user for the first speech into the trained pre-trained model to obtain the latent features output by the pre-trained model; then input the latent features output by the pre-trained model and the original acoustic features of the second user for the second speech into the trained speech conversion model respectively to obtain the predicted acoustic features output by the speech conversion model; and then input the predicted acoustic features output by the speech conversion model into a vocoder to obtain the target speech output by the vocoder.
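The prediction pipeline of this step could be sketched as follows, again reusing the illustrative PretrainedEncoder and VoiceConversionModel classes from Embodiment 1. The random tensors stand in for real acoustic features, and the vocoder stage is only indicated in a comment because the disclosure does not specify a particular vocoder.

```python
# Minimal inference sketch under the assumptions of the earlier snippets:
# user A supplies the content, user B supplies the timbre.
import torch

pretrained = PretrainedEncoder().eval()       # trained pre-trained model
conversion = VoiceConversionModel().eval()    # trained speech conversion model

mel_a = torch.randn(1, 300, 80)   # user A's original acoustic features (first speech)
mel_b = torch.randn(1, 250, 80)   # user B's original acoustic features (second speech)

with torch.no_grad():
    latent_a = pretrained(mel_a)              # latent features of the first speech
    pred_mel = conversion(latent_a, mel_b)    # predicted features: content of A, timbre of B

# pred_mel would then be fed to a vocoder (for example Griffin-Lim or a neural
# vocoder, not shown here) to synthesize the target speech waveform.
```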
FIG. 5 is a schematic structural diagram of a prediction system for the speech conversion model provided by an embodiment of the present application. As shown in FIG. 5, the prediction system of the speech conversion model may include a pre-trained model, a speech conversion model, and a vocoder. The original acoustic features of the first user (user A) for the first speech (user A's acoustic features) are first input into the trained pre-trained model to obtain the latent features output by the pre-trained model; the latent features output by the pre-trained model and the original acoustic features of the second user (user B) for the second speech (user B's acoustic features) are then input into the trained speech conversion model respectively to obtain the predicted acoustic features output by the speech conversion model; and the predicted acoustic features output by the speech conversion model are then input into the vocoder to obtain the target speech output by the vocoder.
In the training method for a speech conversion model proposed in the embodiments of the present application, the original acoustic features of a speech sample are first input into the pre-trained model to obtain the latent features output by the pre-trained model; the latent features are then input into the speech conversion model to obtain the predicted acoustic features output by the speech conversion model; and the speech conversion model to be trained is then trained based on the original acoustic features and the predicted acoustic features. In other words, when training the speech conversion model, the present application uses the pre-trained model to extract interrelated information from the original acoustic features. In existing training methods for speech conversion models, the PPG features or the mel feature frames input to the model are independent of one another, so the information they carry is mutually independent; it is difficult for a neural network model to learn the correlation between frames, and the learning ability of the model is limited. Because the present application extracts the correlated information through the pre-trained model, it overcomes this technical problem of the prior art. With the technical solution provided in the present application, a self-supervised model can be pre-trained with acoustic features, and the frame-level latent features extracted from it can serve as new acoustic features; such latent features contain the interrelated information of the acoustic features, and using them as the input of the speech conversion model to predict the target acoustic features allows the model to learn more fully and makes the solution applicable to a wide range of scenarios. In addition, the technical solution of the embodiments of the present application is simple and convenient to implement, easy to popularize, and has a wide scope of application.
Embodiment 4
FIG. 6 is a schematic structural diagram of a training apparatus for the speech conversion model provided by an embodiment of the present application. As shown in FIG. 6, the apparatus 600 includes a pre-training module 601, a speech conversion module 602, and a training module 603, wherein:
the pre-training module 601 is configured to input the original acoustic features of a speech sample into a pre-trained model to obtain the latent features output by the pre-trained model;
the speech conversion module 602 is configured to input the latent features into a speech conversion model to obtain the predicted acoustic features output by the speech conversion model; and
the training module 603 is configured to train the speech conversion model to be trained based on the original acoustic features and the predicted acoustic features.
Further, the pre-training module 601 is specifically configured to input the original acoustic features into a neural network model to obtain a feature sequence output by the neural network model, and to use the feature sequence output by the neural network model as the latent features output by the pre-trained model.
Further, the pre-training module 601 is specifically configured to divide the original acoustic features into N acoustic feature units, where N is a natural number greater than 1; to mask one or more of the N acoustic feature units to obtain masked acoustic feature units; and to input the masked acoustic feature units into the neural network model to obtain the feature sequence output by the neural network model.
Further, the training module 603 is specifically configured to calculate the loss value of the speech conversion model to be trained for the speech sample based on the original acoustic features and the predicted acoustic features, and to adjust the model parameters of the speech conversion model to be trained according to this loss value.
Further, the apparatus further includes a prediction module 604 (not shown in the figure), configured to input the original acoustic features of a first user for a first speech and the original acoustic features of a second user for a second speech into the trained speech conversion model respectively, and to obtain, through the speech conversion model, a target speech converted from the first speech and the second speech, where the target speech includes the content information of the first speech and the timbre information of the second speech.
Further, the prediction module 604 is specifically configured to input the original acoustic features of the first user for the first speech into the trained pre-trained model to obtain the latent features output by the pre-trained model; to input the latent features output by the pre-trained model and the original acoustic features of the second user for the second speech into the trained speech conversion model respectively to obtain the predicted acoustic features output by the speech conversion model; and to input the predicted acoustic features output by the speech conversion model into a vocoder to obtain the target speech output by the vocoder.
The above training apparatus for a speech conversion model can perform the method provided by any embodiment of the present application and has the functional modules and beneficial effects corresponding to performing the method. For technical details not described in detail in this embodiment, reference may be made to the training method for a speech conversion model provided by any embodiment of the present application.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user's personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
Embodiment 5
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
FIG. 7 shows a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 7, the device 700 includes a computing unit 701, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. Various programs and data required for the operation of the device 700 can also be stored in the RAM 703. The computing unit 701, the ROM 702, and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Multiple components in the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays or speakers; a storage unit 708, such as a magnetic disk or an optical disc; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver. The communication unit 709 allows the device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like. The computing unit 701 performs the methods and processes described above, such as the training method for a speech conversion model. For example, in some embodiments, the training method for a speech conversion model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method for a speech conversion model described above can be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the training method for a speech conversion model in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described herein above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs which are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor, and may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
The program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that, when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may be executed entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, as a data server), a computing system that includes a middleware component (for example, an application server), a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The relationship between the client and the server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The specific implementations described above do not constitute a limitation on the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present disclosure shall be included in the protection scope of the present disclosure.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110950483.1A CN113689867B (en) | 2021-08-18 | 2021-08-18 | A training method, device, electronic device and medium for a speech conversion model |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110950483.1A CN113689867B (en) | 2021-08-18 | 2021-08-18 | A training method, device, electronic device and medium for a speech conversion model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113689867A (en) | 2021-11-23 |
| CN113689867B (en) | 2022-06-28 |
Family
ID=78580481
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110950483.1A (Active) | A training method, device, electronic device and medium for a speech conversion model | 2021-08-18 | 2021-08-18 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113689867B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TWI820704B (en) | 2022-05-12 | 2023-11-01 | 財團法人工業技術研究院 | Method and device for voice signal analyzation, method and device for chip design |
| CN115100325B (en) * | 2022-07-12 | 2025-09-26 | 平安科技(深圳)有限公司 | Video generation method, device, computer equipment and storage medium |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8224648B2 (en) * | 2007-12-28 | 2012-07-17 | Nokia Corporation | Hybrid approach in voice conversion |
| CN107680582B (en) * | 2017-07-28 | 2021-03-26 | 平安科技(深圳)有限公司 | Acoustic model training method, voice recognition method, device, equipment and medium |
| US10672388B2 (en) * | 2017-12-15 | 2020-06-02 | Mitsubishi Electric Research Laboratories, Inc. | Method and apparatus for open-vocabulary end-to-end speech recognition |
| JP6876641B2 (en) * | 2018-02-20 | 2021-05-26 | 日本電信電話株式会社 | Speech conversion learning device, speech conversion device, method, and program |
| CN110930978A (en) * | 2019-11-08 | 2020-03-27 | 北京搜狗科技发展有限公司 | Language identification method and device and language identification device |
| WO2021127985A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Voice conversion method, system and device, and storage medium |
| CN112185342A (en) * | 2020-09-29 | 2021-01-05 | 标贝(北京)科技有限公司 | Voice conversion and model training method, device and system and storage medium |
| CN112634920B (en) * | 2020-12-18 | 2024-01-02 | 平安科技(深圳)有限公司 | Training method and device for speech conversion model based on domain separation |
- 2021-08-18: CN application CN202110950483.1A granted as patent CN113689867B (status: Active)
Also Published As
| Publication number | Publication date |
|---|---|
| CN113689867A (en) | 2021-11-23 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20230127787A1 (en) | Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium | |
| CN113571039B (en) | Speech conversion method, system, electronic device and readable storage medium | |
| US20220293092A1 (en) | Method and apparatus of training natural language processing model, and method and apparatus of processing natural language | |
| US20230178067A1 (en) | Method of training speech synthesis model and method of synthesizing speech | |
| CN111862942B (en) | Training method and system for hybrid speech recognition model of Mandarin and Sichuan dialect | |
| US20230004798A1 (en) | Intent recognition model training and intent recognition method and apparatus | |
| CN114416934B (en) | Multi-modal dialog generation model training method and device and electronic equipment | |
| CN107437417B (en) | Voice data enhancement method and device based on recurrent neural network voice recognition | |
| CN113689866B (en) | Training method and device of voice conversion model, electronic equipment and medium | |
| CN114023342B (en) | Voice conversion method, device, storage medium and electronic equipment | |
| CN113450759A (en) | Voice generation method, device, electronic equipment and storage medium | |
| CN112259089A (en) | Voice recognition method and device | |
| CN113689868B (en) | Training method and device of voice conversion model, electronic equipment and medium | |
| CN113963679A (en) | Voice style migration method and device, electronic equipment and storage medium | |
| CN111862961A (en) | Method and device for recognizing voice | |
| CN114550692B (en) | Text processing and model training method, device, equipment and storage medium | |
| WO2024008215A2 (en) | Speech emotion recognition method and apparatus | |
| CN113763937A (en) | Method, device and equipment for generating voice processing model and storage medium | |
| JP7730402B2 (en) | Method for generating a facial image based on mouth shape, method and device for training a model | |
| CN113689867B (en) | A training method, device, electronic device and medium for a speech conversion model | |
| CN117153142A (en) | A speech signal synthesis method, device, electronic equipment and storage medium | |
| CN114373445B (en) | Voice generation method and device, electronic equipment and storage medium | |
| CN113838451B (en) | Voice processing and model training method, device, equipment and storage medium | |
| CN113793598A (en) | Training method and data enhancement method, device and equipment for speech processing model | |
| WO2023193442A1 (en) | Speech recognition method and apparatus, and device and medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |