CN113689867A - Training method and device of voice conversion model, electronic equipment and medium

Training method and device of voice conversion model, electronic equipment and medium

Info

Publication number
CN113689867A
Authority
CN (China)
Prior art keywords
voice, model, training, acoustic features, conversion model
Legal status
Granted
Application number
CN202110950483.1A
Other languages
Chinese (zh)
Other versions
CN113689867B (en)
Inventors
王俊超 (Wang Junchao)
陈怿翔 (Chen Yixiang)
康永国 (Kang Yongguo)
Current Assignee / Original Assignee
Beijing Baidu Netcom Science and Technology Co., Ltd.
Priority date / Filing date: 2021-08-18
Application filed by Beijing Baidu Netcom Science and Technology Co., Ltd.
Priority to CN202110950483.1A
Publication of CN113689867A: 2021-11-23
Application granted; publication of CN113689867B: 2022-06-28
Legal status: Active


Classifications

    • G10L 21/007 — Processing of the speech or voice signal to modify its quality or intelligibility; changing voice quality, e.g. pitch or formants, characterised by the process used
    • G06N 3/045 — Computing arrangements based on biological models; neural networks; architecture: combinations of networks
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G10L 15/02 — Speech recognition; feature extraction; selection of recognition unit
    • G10L 15/063 — Speech recognition; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/16 — Speech recognition; speech classification or search using artificial neural networks
    • G10L 19/04 — Speech or audio analysis-synthesis techniques for redundancy reduction, e.g. in vocoders, using predictive techniques
    • G10L 25/30 — Speech or voice analysis techniques characterised by the use of neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure provides a training method and apparatus for a voice conversion model, an electronic device, and a medium, relating to the field of artificial intelligence and in particular to speech and deep learning technology. The specific implementation scheme is as follows: input the original acoustic features of a speech sample into a pre-training model to obtain the hidden features output by the pre-training model; input the hidden features into the voice conversion model to obtain the predicted acoustic features output by the voice conversion model; and train the voice conversion model to be trained based on the original acoustic features and the predicted acoustic features. Because the hidden features serve as the conversion model's input for predicting the target acoustic features, the model learns more thoroughly and the method applies to a wide range of scenarios.

Description

Training method and device of voice conversion model, electronic equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence, further relates to speech and deep learning technologies, and in particular relates to a method and an apparatus for training a speech conversion model, an electronic device, and a medium.
Background
Voice conversion, the purpose of which is to convert the speech of a source speaker into the timbre of a target speaker while keeping the spoken content unchanged, is attracting growing interest in the market. Depending on the corpus the model requires, voice conversion can be divided into parallel-corpus voice conversion and non-parallel-corpus voice conversion. Parallel-corpus voice conversion requires the source speaker and the target speaker to record audio of the same text when the required corpus is recorded; non-parallel-corpus voice conversion requires several recordings of the target speaker's speech but no source-speaker speech during training.
In the method based on the phoneme probability graph, a PPG (phonetic posteriorgram) feature expressing the spoken content is first extracted from the target speaker's audio by a speech recognition model, and the mapping between the PPG feature and the audio's Mel feature is then modeled. At test time, a PPG feature is extracted from the source speaker's audio by the speech recognition model and input into the trained conversion model to obtain the converted feature. A related idea is to decouple the content information and the timbre information in the features with an encoder during training and to restore them with a decoder, performing self-reconstruction training.
Human speech is composed of many speech frames, and because human speech is inherently continuous, adjacent speech frames should be correlated. However, the frames of the PPG features or Mel features input into existing models are mutually independent, so the information in them is independent across frames; the neural network model can hardly learn the correlation between frames, and the model's learning capability is therefore limited.
Disclosure of Invention
The disclosure provides a method and an apparatus for training a voice conversion model, an electronic device, and a medium.
In a first aspect, the present application provides a method for training a speech conversion model, the method including:
inputting original acoustic features of voice into a pre-training model to obtain hidden features output by the pre-training model;
inputting the hidden features into a voice conversion model to obtain predicted acoustic features output by the voice conversion model;
and training a speech conversion model to be trained based on the original acoustic features and the predicted acoustic features.
In a second aspect, the present application provides an apparatus for training a speech conversion model, the apparatus comprising a pre-training module, a voice conversion module, and a training module, wherein:
the pre-training module is used for inputting the original acoustic features of the voice into a pre-training model to obtain the hidden features output by the pre-training model;
the voice conversion module is used for inputting the hidden features into a voice conversion model to obtain predicted acoustic features output by the voice conversion model;
and the training module is used for training a speech conversion model to be trained on the basis of the original acoustic features and the predicted acoustic features.
In a third aspect, an embodiment of the present application provides an electronic device, including:
one or more processors;
a memory for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the method for training a speech conversion model according to any embodiment of the present application.
In a fourth aspect, the present application provides a storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the method for training a speech conversion model according to any embodiment of the present application.
In a fifth aspect, a computer program product is provided, which when executed by a computer device implements the method for training a speech conversion model according to any of the embodiments of the present application.
According to the technical scheme provided by the present application, the acoustic features are first used to train a self-supervised pre-training model, from which frame-level hidden features are then extracted as new acoustic features. Because these hidden features contain the information of the acoustic features and serve as the conversion model's input for predicting the target acoustic features, the model learns more thoroughly and the scheme applies to a wide range of scenarios.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:
FIG. 1 is a first flowchart of a method for training a speech conversion model according to an embodiment of the present disclosure;
FIG. 2 is a second flowchart of a method for training a speech conversion model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a training system for a speech conversion model according to an embodiment of the present application;
FIG. 4 is a third flowchart of a training method of a speech conversion model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a prediction system of a speech conversion model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus for training a speech conversion model according to an embodiment of the present application;
FIG. 7 is a block diagram of an electronic device for implementing a method for training a speech conversion model according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Embodiment 1
Fig. 1 is a first flowchart of a method for training a speech conversion model according to an embodiment of the present application. The method may be performed by an apparatus for training a speech conversion model or by an electronic device; the apparatus or electronic device may be implemented in software and/or hardware and may be integrated into any intelligent device with a network communication function. As shown in Fig. 1, the training method of the speech conversion model may include the following steps:
s101, inputting the original acoustic features of the voice into a pre-training model to obtain the hidden features output by the pre-training model.
In this step, the electronic device inputs the original acoustic features of the speech into the pre-training model to obtain the hidden features output by the pre-training model. Specifically, when the voice conversion model to be trained does not yet satisfy a preset convergence condition, the electronic device inputs the original acoustic features of the speech into the pre-training model. Further, the electronic device may input the original acoustic features into a neural network model to obtain a feature sequence output by the neural network model, and take this feature sequence as the hidden features output by the pre-training model. To do so, the electronic device may first divide the original acoustic features into N acoustic feature units, where N is a natural number greater than 1; then mask one or more of the N acoustic feature units to obtain masked acoustic feature units; and finally input the masked acoustic feature units into the neural network model to obtain the feature sequence, as sketched below.
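The disclosure does not fix a concrete network architecture for the pre-training model. As a minimal, hedged illustration only, the following PyTorch sketch shows a masked-reconstruction encoder in the spirit described above; the class name PretrainEncoder, the GRU layers, and all dimensions are assumptions rather than anything specified by the patent.

```python
import torch
import torch.nn as nn

class PretrainEncoder(nn.Module):
    """Self-supervised pre-training model (sketch): reconstructing masked
    frames forces the hidden states to capture inter-frame correlations."""
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels, hidden)
        self.encoder = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_mels)  # reconstruction head used in pre-training

    def forward(self, mels: torch.Tensor, mask: torch.Tensor):
        # mels: (batch, frames, n_mels); mask: (batch, frames), True = masked frame
        x = mels.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out the masked frames
        h, _ = self.encoder(self.proj(x))              # frame-level hidden features
        return h, self.head(h)                         # hidden features, reconstruction
```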
S102, inputting the hidden features into the voice conversion model to obtain the predicted acoustic features output by the voice conversion model.
In this step, the electronic device inputs the hidden features into the voice conversion model to obtain the predicted acoustic features output by the conversion model. The hidden features may be input into any model having a voice conversion function; for example, the voice conversion model may include a content encoder, a timbre encoder, and a decoder, as the sketch below illustrates.
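As a hedged sketch of such a three-component model (reusing the imports and dimensions assumed above; the layer types are illustrative choices, since the disclosure only names the components):

```python
class VoiceConversionModel(nn.Module):
    """Voice conversion model (sketch): content encoder + timbre encoder + decoder."""
    def __init__(self, hidden: int = 256, n_mels: int = 80, spk_dim: int = 64):
        super().__init__()
        self.content_enc = nn.GRU(hidden, hidden, batch_first=True)
        self.timbre_enc = nn.GRU(n_mels, spk_dim, batch_first=True)
        self.decoder = nn.GRU(hidden + spk_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, hidden_feats: torch.Tensor, ref_mels: torch.Tensor):
        c, _ = self.content_enc(hidden_feats)             # content from hidden features
        _, s = self.timbre_enc(ref_mels)                  # timbre from reference audio
        s = s[-1].unsqueeze(1).expand(-1, c.size(1), -1)  # broadcast speaker embedding
        d, _ = self.decoder(torch.cat([c, s], dim=-1))
        return self.out(d)                                # predicted acoustic features
```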
S103, training a speech conversion model to be trained based on the original acoustic features and the predicted acoustic features.
In this step, the electronic device trains the voice conversion model to be trained based on the original acoustic features and the predicted acoustic features, then selects another speech sample and continues training until the model satisfies the preset convergence condition. The newly selected speech may or may not be adjacent to the previous speech; this is not limited here. Further, the electronic device may calculate a loss value of the voice conversion model for the speech based on the original acoustic features and the predicted acoustic features, and then adjust the model parameters of the voice conversion model to be trained according to that loss value.
In the training method provided by this embodiment of the present application, the original acoustic features of a speech sample are first input into a pre-training model to obtain hidden features; the hidden features are then input into the voice conversion model to obtain predicted acoustic features; and the voice conversion model to be trained is trained based on the original and predicted acoustic features. In other words, by the time the voice conversion model is trained, the pre-training model has already extracted the inter-frame correlation information from the original acoustic features. In existing training methods, the frames of the input PPG or Mel features are mutually independent, so the neural network model can hardly learn the correlation between frames, and the model's learning capability is limited. Extracting the correlation information with a pre-training model overcomes this problem; moreover, the technical scheme of this embodiment is simple to implement, easy to popularize, and widely applicable.
Embodiment 2
Fig. 2 is a second flowchart of a method for training a speech conversion model according to an embodiment of the present application. This embodiment further optimizes and expands the above technical scheme and can be combined with each of the optional implementations above. As shown in Fig. 2, the training method of the speech conversion model may include the following steps:
s201, inputting original acoustic features into a neural network model to obtain a feature sequence output by the neural network model; and taking the characteristic sequence output by the neural network model as the hidden characteristic output by the pre-training model.
In this step, the electronic device inputs the original acoustic features into the neural network model and takes the output feature sequence as the hidden features of the pre-training model. Specifically, the electronic device first divides the original acoustic features into N acoustic feature units, where N is a natural number greater than 1; masks one or more of them to obtain masked acoustic feature units; and inputs the masked units into the neural network model to obtain the feature sequence. For example, the original acoustic features may be divided into eleven units A, B, C, D, E, F, G, H, I, J, K; the four units C, F, G, and I are masked; and the masked sequence is input into the neural network model, whose output feature sequence again covers all the units A through K, so that the model must reconstruct the masked units from their unmasked neighbors (a sketch follows this paragraph).
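A minimal sketch of this unit masking, assuming Mel frames as the acoustic features and the 0-indexed positions 2, 5, 6, 8 for the masked units C, F, G, I (the disclosure fixes neither the unit size nor the selection rule):

```python
import torch

def mask_units(mels: torch.Tensor, n_units: int = 11, masked=(2, 5, 6, 8)):
    """Divide the frame axis into n_units contiguous units and zero out the
    chosen ones (here C, F, G, I of the A..K example, 0-indexed)."""
    batch, frames, _ = mels.shape
    mask = torch.zeros(batch, frames, dtype=torch.bool)
    bounds = torch.linspace(0, frames, n_units + 1).long()  # unit boundaries
    for i in masked:
        mask[:, bounds[i]:bounds[i + 1]] = True
    return mels.masked_fill(mask.unsqueeze(-1), 0.0), mask
```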
S202, inputting the hidden features into the voice conversion model to obtain the predicted acoustic features output by the voice conversion model.
In this step, the electronic device inputs the hidden features obtained in S201 into the voice conversion model and takes the model's output as the predicted acoustic features.
S203, calculating a loss value of the voice conversion model to be trained for the speech based on the original acoustic features and the predicted acoustic features.
In this step, the electronic device calculates a loss value of the voice conversion model to be trained for the speech based on the original acoustic features and the predicted acoustic features. Specifically, the electronic device inputs the original acoustic features and the predicted acoustic features into a pre-constructed loss function and obtains the loss value through that function.
S204, adjusting the model parameters of the voice conversion model to be trained according to the loss value of the model for the speech.
In this step, the electronic device adjusts the model parameters of the voice conversion model to be trained according to the loss value computed for the speech. In particular, the voice conversion model to be trained may be a neural network comprising convolutional layers, pooling layers, and fully connected layers, and the electronic device adjusts the parameters of these layers according to the loss value, as sketched below.
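A hedged sketch of a single training step, reusing the PretrainEncoder and VoiceConversionModel sketches above; the L1 loss and the Adam optimizer are assumptions, since the disclosure speaks only of a pre-constructed loss function:

```python
import torch
import torch.nn.functional as F

pretrain = PretrainEncoder()              # assumed already pre-trained; kept frozen
conversion = VoiceConversionModel()
optimizer = torch.optim.Adam(conversion.parameters(), lr=1e-4)

mels = torch.randn(4, 200, 80)            # a batch of original acoustic features
no_mask = torch.zeros(4, 200, dtype=torch.bool)
with torch.no_grad():
    hidden, _ = pretrain(mels, no_mask)   # hidden features from the pre-training model
pred = conversion(hidden, mels)           # self-reconstruction: reference = same speech
loss = F.l1_loss(pred, mels)              # loss between predicted and original features
optimizer.zero_grad()
loss.backward()
optimizer.step()
```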
Fig. 3 is a schematic structural diagram of a training system for a voice conversion model according to an embodiment of the present application. As shown in Fig. 3, the training system may include a pre-training model and a voice conversion model. The original acoustic features of the speech are input into the pre-training model to obtain the hidden features; the hidden features are then input into the voice conversion model to obtain the predicted acoustic features; and the voice conversion model to be trained is trained based on the original and predicted acoustic features.
As in Embodiment 1, the pre-training model has already extracted the inter-frame correlation information from the original acoustic features by the time the voice conversion model is trained. This overcomes the limitation of existing training methods, in which the mutually independent frames of the input PPG or Mel features prevent the neural network model from learning the correlation between frames; the scheme is moreover simple to implement, easy to popularize, and widely applicable.
Embodiment 3
Fig. 4 is a third flowchart of a method for training a speech conversion model according to an embodiment of the present application. This embodiment further optimizes and expands the above technical scheme and can be combined with each of the optional implementations above. As shown in Fig. 4, the training method of the speech conversion model may include the following steps:
s401, inputting original acoustic features into a neural network model to obtain a feature sequence output by the neural network model; and taking the characteristic sequence output by the neural network model as the hidden characteristic output by the pre-training model.
S402, inputting the hidden features into the voice conversion model to obtain the predicted acoustic features output by the voice conversion model.
S403, calculating a loss value of the voice conversion model to be trained for the speech based on the original acoustic features and the predicted acoustic features.
S404, adjusting the model parameters of the voice conversion model to be trained according to the loss value of the model for the speech.
S405, inputting the original acoustic features of a first user's first speech and the original acoustic features of a second user's second speech into the trained voice conversion model, respectively, and obtaining through the conversion model a target speech converted from the first and second speech, wherein the target speech carries the content information of the first speech and the timbre information of the second speech.
In this step, the electronic device inputs the original acoustic features of the first user's first speech into the trained pre-training model to obtain the hidden features output by the pre-training model; it then inputs these hidden features, together with the original acoustic features of the second user's second speech, into the trained voice conversion model to obtain the predicted acoustic features output by the conversion model; finally, it inputs the predicted acoustic features into a vocoder to obtain the target speech output by the vocoder, which carries the content information of the first speech and the timbre information of the second speech (see the sketch below).
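A hedged inference sketch reusing the classes above; the vocoder is assumed to be any callable mapping predicted Mel features to a waveform (for example a pretrained neural vocoder), which the disclosure does not specify:

```python
import torch

@torch.no_grad()
def convert(src_mels, ref_mels, pretrain, conversion, vocoder):
    """Target speech: content of the first user's speech, timbre of the second's."""
    no_mask = torch.zeros(src_mels.shape[:2], dtype=torch.bool)
    hidden, _ = pretrain(src_mels, no_mask)    # hidden features of user A's speech
    pred_mels = conversion(hidden, ref_mels)   # condition on user B's acoustic features
    return vocoder(pred_mels)                  # waveform with A's content and B's timbre
```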
Fig. 5 is a schematic structural diagram of a prediction system for the voice conversion model according to an embodiment of the present application. As shown in Fig. 5, the prediction system may include the pre-training model, the voice conversion model, and a vocoder. The original acoustic features of the first user (user A) are first input into the trained pre-training model to obtain the hidden features; the hidden features and the original acoustic features of the second user (user B) are then input into the trained voice conversion model to obtain the predicted acoustic features; and the predicted acoustic features are input into the vocoder to obtain the target speech output by the vocoder.
As in the preceding embodiments, extracting the inter-frame correlation information with the pre-training model removes the limitation of existing methods, whose frame-independent PPG or Mel inputs keep the neural network model from learning the correlation between frames; the scheme remains simple to implement, easy to popularize, and widely applicable.
Embodiment 4
Fig. 6 is a schematic structural diagram of a training apparatus for a voice conversion model according to an embodiment of the present application. As shown in Fig. 6, the apparatus 600 includes a pre-training module 601, a voice conversion module 602, and a training module 603, wherein:
the pre-training module 601 is configured to input an original acoustic feature of a voice to a pre-training model to obtain a hidden feature output by the pre-training model;
the voice conversion module 602 is configured to input the hidden feature to a voice conversion model, so as to obtain a predicted acoustic feature output by the voice conversion model;
the training module 603 is configured to train a to-be-trained speech conversion model based on the original acoustic features and the predicted acoustic features.
Further, the pre-training module 601 is specifically configured to input the original acoustic features into a neural network model to obtain a feature sequence output by the neural network model, and to take the feature sequence output by the neural network model as the hidden features output by the pre-training model.
Further, the pre-training module 601 is specifically configured to divide the original acoustic features into N acoustic feature units; wherein N is a natural number greater than 1; masking one or more acoustic features in the N acoustic feature units to obtain masked acoustic feature units; and inputting the masked acoustic feature unit to the neural network model to obtain a feature sequence output by the neural network model.
Further, the training module 603 is specifically configured to calculate a loss value of the speech conversion model to be trained for the speech based on the original acoustic features and the predicted acoustic features, and to adjust the model parameters of the speech conversion model to be trained according to the loss value.
Further, the apparatus may also include a prediction module 604 (not shown in the figure), configured to input the original acoustic features of a first user's first speech and the original acoustic features of a second user's second speech into the trained voice conversion model, respectively, and to obtain through the trained model a target speech converted from the first and second speech; the target speech includes the content information of the first speech and the timbre information of the second speech.
Further, the prediction module 604 is specifically configured to input the original acoustic features of the first user's first speech into the trained pre-training model to obtain the hidden features output by the pre-training model; to input the hidden features and the original acoustic features of the second user's second speech into the trained voice conversion model to obtain the predicted acoustic features output by the conversion model; and to input the predicted acoustic features into a vocoder to obtain the target speech output by the vocoder.
The training apparatus for a voice conversion model can execute the method provided by any embodiment of the present application and has the functional modules and beneficial effects corresponding to that method. For technical details not described in this embodiment, reference may be made to the training method provided in any embodiment of the present application.
In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
Embodiment 5
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 7, the device 700 includes a computing unit 701, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to one another by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 701 performs the methods and processes described above, such as the training method of the speech conversion model. For example, in some embodiments, the training method of the speech conversion model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the speech conversion model.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A method of training a speech conversion model, the method comprising:
inputting original acoustic features of voice into a pre-training model to obtain hidden features output by the pre-training model;
inputting the hidden features into a voice conversion model to obtain predicted acoustic features output by the voice conversion model;
and training a speech conversion model to be trained based on the original acoustic features and the predicted acoustic features.
2. The method of claim 1, wherein the inputting original acoustic features of the speech into a pre-trained model to obtain hidden features output by the pre-trained model comprises:
inputting the original acoustic features into a neural network model to obtain a feature sequence output by the neural network model; and taking the feature sequence output by the neural network model as the hidden features output by the pre-training model.
3. The method of claim 2, wherein the inputting the raw acoustic features into a neural network model to obtain a feature sequence output by the neural network model comprises:
dividing the original acoustic features into N acoustic feature units; wherein N is a natural number greater than 1;
masking one or more acoustic features in the N acoustic feature units to obtain masked acoustic feature units;
and inputting the masked acoustic feature unit to the neural network model to obtain a feature sequence output by the neural network model.
4. The method of claim 1, wherein the training a speech conversion model to be trained based on the original acoustic features and the predicted acoustic features comprises:
calculating a loss value of the speech conversion model to be trained for the speech based on the original acoustic features and the predicted acoustic features;
and adjusting the model parameters of the speech conversion model to be trained according to the loss value of the speech conversion model to be trained for the speech.
5. The method of claim 1, further comprising:
respectively inputting an original acoustic feature of a first user for first voice and an original acoustic feature of a second user for second voice into a trained voice conversion model, and obtaining target voice converted from the first voice and the second voice through the voice conversion model; wherein the target voice includes content information of the first voice and tone information of the second voice.
6. The method of claim 5, wherein the inputting original acoustic features of a first user for a first voice and original acoustic features of a second user for a second voice into a trained voice conversion model respectively, and obtaining target voice converted from the first voice and the second voice through the trained voice conversion model comprises:
inputting the original acoustic features of the first user aiming at the first voice into a trained pre-training model to obtain hidden features output by the pre-training model;
respectively inputting the hidden features output by the pre-training model and the original acoustic features of the second voice of the second user into a trained voice conversion model to obtain predicted acoustic features output by the voice conversion model;
and inputting the predicted acoustic features output by the voice conversion model into a vocoder to obtain the target voice output by the vocoder.
7. An apparatus for training a speech conversion model, the apparatus comprising a pre-training module, a voice conversion module, and a training module, wherein:
the pre-training module is used for inputting the original acoustic features of the voice into a pre-training model to obtain the hidden features output by the pre-training model;
the voice conversion module is used for inputting the hidden features into a voice conversion model to obtain predicted acoustic features output by the voice conversion model;
and the training module is used for training a speech conversion model to be trained on the basis of the original acoustic features and the predicted acoustic features.
8. The apparatus according to claim 7, wherein the pre-training module is specifically configured to input the original acoustic features into a neural network model to obtain a feature sequence output by the neural network model, and to take the feature sequence output by the neural network model as the hidden features output by the pre-training model.
9. The apparatus according to claim 8, wherein the pre-training module is specifically configured to: divide the original acoustic features into N acoustic feature units, wherein N is a natural number greater than 1; mask one or more acoustic features in the N acoustic feature units to obtain masked acoustic feature units; and input the masked acoustic feature units to the neural network model to obtain a feature sequence output by the neural network model.
10. The apparatus according to claim 7, wherein the training module is specifically configured to: calculate a loss value of the speech conversion model to be trained for the speech based on the original acoustic features and the predicted acoustic features; and adjust the model parameters of the speech conversion model to be trained according to the loss value.
11. The apparatus of claim 7, further comprising: the prediction module is used for respectively inputting the original acoustic features of a first user aiming at a first voice and the original acoustic features of a second user aiming at a second voice into a trained voice conversion model, and obtaining target voice converted from the first voice and the second voice through the voice conversion model; wherein the target voice includes content information of the first voice and tone information of the second voice.
12. The apparatus according to claim 11, wherein the prediction module is specifically configured to input an original acoustic feature of the first user for the first speech into a trained pre-training model, so as to obtain a hidden feature output by the pre-training model; respectively inputting the hidden features output by the pre-training model and the original acoustic features of the second voice of the second user into a trained voice conversion model to obtain predicted acoustic features output by the voice conversion model; and inputting the predicted acoustic features output by the voice conversion model into a vocoder to obtain the target voice output by the vocoder.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the steps of the method of claim 1.
CN202110950483.1A | Priority date 2021-08-18 | Filed 2021-08-18 | Training method and device of voice conversion model, electronic equipment and medium | Active | Granted as CN113689867B (en)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110950483.1A | 2021-08-18 | 2021-08-18 | Training method and device of voice conversion model, electronic equipment and medium

Publications (2)

Publication Number | Publication Date
CN113689867A | 2021-11-23
CN113689867B (en) | 2022-06-28

Family

ID=78580481

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110950483.1A (Active; granted as CN113689867B) | Training method and device of voice conversion model, electronic equipment and medium | 2021-08-18 | 2021-08-18

Country Status (1)

Country Link
CN (1) CN113689867B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090171657A1 (en) * 2007-12-28 2009-07-02 Nokia Corporation Hybrid Approach in Voice Conversion
CN107680582A (en) * 2017-07-28 2018-02-09 平安科技(深圳)有限公司 Acoustic training model method, audio recognition method, device, equipment and medium
US20210125603A1 (en) * 2017-07-28 2021-04-29 Ping An Technology (Shenzhen) Co., Ltd. Acoustic model training method, speech recognition method, apparatus, device and medium
WO2019116604A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Corporation Speech recognition system
JP2019144402A (en) * 2018-02-20 2019-08-29 日本電信電話株式会社 Voice conversion learning device, voice conversion device, method and program
CN110930978A (en) * 2019-11-08 2020-03-27 北京搜狗科技发展有限公司 Language identification method and device and language identification device
CN111247584A (en) * 2019-12-24 2020-06-05 深圳市优必选科技股份有限公司 Voice conversion method, system, device and storage medium
CN112185342A (en) * 2020-09-29 2021-01-05 标贝(北京)科技有限公司 Voice conversion and model training method, device and system and storage medium
CN112634920A (en) * 2020-12-18 2021-04-09 平安科技(深圳)有限公司 Method and device for training voice conversion model based on domain separation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张凯 (ZHANG Kai) et al.: "Voice conversion method based on retrained Gaussian mixture models" (基于重训练高斯混合模型的语音转换方法), Technical Acoustics (《声学技术》) *
王民 (WANG Min) et al.: "Voice conversion method using deep belief networks" (采用深度信念网络的语音转换方法), Computer Engineering and Applications (《计算机工程与应用》) *

Also Published As

Publication number Publication date
CN113689867B (en) 2022-06-28

Similar Documents

Publication Publication Date Title
CN113239705B (en) Pre-training method and device of semantic representation model, electronic equipment and storage medium
CN112466288B (en) Voice recognition method and device, electronic equipment and storage medium
CN114416934B (en) Multi-modal dialog generation model training method and device and electronic equipment
CN113674732B (en) Voice confidence detection method and device, electronic equipment and storage medium
CN112861548B (en) Training method, device, equipment and storage medium for natural language generation and model
CN112259089A (en) Voice recognition method and device
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN113689868B (en) Training method and device of voice conversion model, electronic equipment and medium
US20220238098A1 (en) Voice recognition method and device
CN114495977B (en) Speech translation and model training method, device, electronic equipment and storage medium
CN114023342B (en) Voice conversion method, device, storage medium and electronic equipment
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN113468857B (en) Training method and device for style conversion model, electronic equipment and storage medium
US20230410794A1 (en) Audio recognition method, method of training audio recognition model, and electronic device
CN113689866B (en) Training method and device of voice conversion model, electronic equipment and medium
CN114399998B (en) Voice processing method, device, equipment, storage medium and program product
CN113689867B (en) Training method and device of voice conversion model, electronic equipment and medium
CN114171043B (en) Echo determination method, device, equipment and storage medium
CN115292467A (en) Information processing and model training method, apparatus, device, medium, and program product
CN114550692A (en) Text processing and training method, device, equipment and storage medium of model thereof
CN114999440A (en) Avatar generation method, apparatus, device, storage medium, and program product
CN114783428A (en) Voice translation method, voice translation device, voice translation model training method, voice translation model training device, voice translation equipment and storage medium
CN113553413A (en) Dialog state generation method and device, electronic equipment and storage medium
CN113889088A (en) Method and device for training speech recognition model, electronic equipment and storage medium
CN113838450B (en) Audio synthesis and corresponding model training method, device, equipment and storage medium

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant