Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Descriptions of well-known functions and constructions are likewise omitted for clarity and conciseness.
Example One
Fig. 1 is a first flowchart of a method for training a speech conversion model according to an embodiment of the present disclosure. The method may be performed by an apparatus or an electronic device for training a speech conversion model; the apparatus or electronic device may be implemented in software and/or hardware and may be integrated into any intelligent device with a network communication function. As shown in Fig. 1, the training method of the speech conversion model may include the following steps:
S101, input the original acoustic features of a speech into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder.
In this step, the electronic device may input the original acoustic features of the speech into the content encoder and the timbre encoder, respectively, to obtain the content sequence output by the content encoder and the timbre vector output by the timbre encoder. Specifically, when the speech conversion model to be trained does not satisfy a preset convergence condition, the electronic device may perform this input to obtain the content sequence and the timbre vector. The electronic device can extract content-related information from the original acoustic features through the content encoder to obtain the content sequence corresponding to the original acoustic features, and extract timbre-related information from the original acoustic features through the timbre encoder to obtain the timbre vector corresponding to the original acoustic features.
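The two encoders can be sketched as follows. This is a minimal illustrative example with toy linear projections and made-up dimensions (80-dimensional mel frames, 16-dimensional content, 8-dimensional timbre), not the architecture of this disclosure; the only point it shows is that the content encoder produces one vector per frame while the timbre encoder pools the whole utterance into a single vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy projection weights; a real model would learn these (e.g. with
# convolutional or recurrent layers).
W_content = rng.normal(size=(80, 16))   # 80-dim mel frame -> 16-dim content
W_timbre = rng.normal(size=(80, 8))     # 80-dim mel frame -> 8-dim timbre

def content_encoder(mels):
    """Frame-level content sequence: one vector per frame, shape [T, 16]."""
    return np.tanh(mels @ W_content)

def timbre_encoder(mels):
    """Utterance-level timbre vector: mean-pool over frames, shape [8]."""
    return np.tanh(mels @ W_timbre).mean(axis=0)

mels = rng.normal(size=(120, 80))       # 120 frames of 80-dim acoustic features
content_seq = content_encoder(mels)     # [120, 16]
timbre_vec = timbre_encoder(mels)       # [8]
```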
S102, input the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network, and train the content encoder based on the supervision sequence and the content sequence.
In this step, the electronic device may input the content sequence into the content supervision network to obtain the supervision sequence output by the content supervision network, and train the content encoder based on the supervision sequence and the content sequence. Specifically, the electronic device may extract text information from the content sequence through the content supervision network and then derive the supervision sequence from that text information; alternatively, the electronic device may input the content sequence into a speech recognition acoustic model of the content supervision network, output a phoneme probability sequence through the speech recognition acoustic model, and then derive the supervision sequence from the phoneme probability sequence.
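The second route, in which a speech recognition acoustic model turns each content frame into a phoneme probability distribution, can be sketched as follows. The projection weights, content dimension, and phoneme inventory size are hypothetical stand-ins chosen only so the shapes line up:

```python
import numpy as np

rng = np.random.default_rng(1)
N_PHONEMES = 40                           # hypothetical phoneme inventory size
W_asr = rng.normal(size=(16, N_PHONEMES)) # toy ASR acoustic-model weights

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def supervision_network(content_seq):
    """Map each content frame to a phoneme probability distribution [T, 40]."""
    return softmax(content_seq @ W_asr)

content_seq = rng.normal(size=(120, 16))        # content sequence, 120 frames
supervision_seq = supervision_network(content_seq)
```

Each row of `supervision_seq` is a probability distribution over phonemes for one frame; the supervision sequence is then derived from these per-frame distributions.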
S103, input the content sequence and the timbre vector into a decoder, respectively, to obtain predicted acoustic features output by the decoder.
In this step, the electronic device may input the content sequence and the timbre vector into the decoder to obtain the predicted acoustic features output by the decoder. Specifically, the decoder may fuse the content-related information and the timbre-related information of the original acoustic features to obtain the predicted acoustic features.
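One simple way a decoder can fuse a frame-level content sequence with an utterance-level timbre vector is to broadcast the timbre vector to every frame, concatenate, and project back to acoustic-feature space. This fusion scheme and the linear projection are assumptions for illustration, not necessarily what the disclosure uses:

```python
import numpy as np

rng = np.random.default_rng(2)
W_dec = rng.normal(size=(16 + 8, 80)) * 0.1  # fused features -> 80-dim mel frame

def decoder(content_seq, timbre_vec):
    """Broadcast the utterance-level timbre vector to every frame,
    concatenate it with the content sequence, and project the result
    back to acoustic-feature space."""
    T = content_seq.shape[0]
    timbre_tiled = np.tile(timbre_vec, (T, 1))                   # [T, 8]
    fused = np.concatenate([content_seq, timbre_tiled], axis=1)  # [T, 24]
    return fused @ W_dec                                         # [T, 80]

content_seq = rng.normal(size=(120, 16))
timbre_vec = rng.normal(size=(8,))
predicted_mels = decoder(content_seq, timbre_vec)
```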
S104, train the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features.
In this step, the electronic device may train the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features. The electronic device may then select another speech sample and continue training until the speech conversion model to be trained satisfies the preset convergence condition. The reselected speech may or may not be adjacent to the previous speech; this is not limited here. Further, the electronic device may compute a loss value between the predicted acoustic features and the original acoustic features through a pre-constructed loss function, and then train the speech conversion model to be trained based on that loss value.
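The disclosure does not fix the loss function or the convergence condition; a common choice for acoustic-feature reconstruction is mean squared error, and a convergence test might compare the loss against a threshold or check that it has stopped improving. The sketch below uses those assumed choices:

```python
import numpy as np

def reconstruction_loss(predicted, original):
    """Mean squared error between predicted and original acoustic features."""
    return float(np.mean((predicted - original) ** 2))

def converged(loss_history, threshold=1e-3, patience=3):
    """Hypothetical convergence rule: loss below a threshold, or no
    improvement over the last `patience` evaluations."""
    if loss_history and loss_history[-1] < threshold:
        return True
    if len(loss_history) > patience:
        recent = loss_history[-patience:]
        return max(recent) - min(recent) < 1e-6
    return False

rng = np.random.default_rng(3)
original = rng.normal(size=(120, 80))
loss = reconstruction_loss(original + 0.01, original)  # ~0.0001
```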
In the training method of the speech conversion model provided by this embodiment, the original acoustic features of a speech are first input into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder; the content sequence is then input into a content supervision network to obtain a supervision sequence output by the content supervision network, and the content encoder is trained based on the supervision sequence and the content sequence; the content sequence and the timbre vector are input into a decoder to obtain predicted acoustic features output by the decoder; finally, the speech conversion model to be trained is trained based on the predicted acoustic features and the original acoustic features. That is, during training the content sequence output by the content encoder is fed not only to the decoder but also to the content supervision network, which is used to train the content encoder specifically. In existing training methods for speech conversion models, the content sequence is fed only to the decoder, and no auxiliary network is trained specifically for the content encoder. By adding a content supervision network to the speech conversion model and training the content encoder with it, this embodiment solves the prior-art problem that the content encoder may discard part of the content information when encoding a speaker's content, which causes more errors in the converted speech. Moreover, the technical solution of this embodiment is simple to implement, easy to popularize, and widely applicable.
Example Two
Fig. 2 is a second flowchart of a method for training a speech conversion model according to an embodiment of the present application. This embodiment further optimizes and extends the foregoing technical solution and may be combined with any of the optional embodiments above. As shown in Fig. 2, the training method of the speech conversion model may include the following steps:
S201, input the original acoustic features of a speech into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder.
S202, input the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network.
In this step, the electronic device may input the content sequence into the content supervision network to obtain the supervision sequence output by the content supervision network. Specifically, the electronic device may extract text information from the content sequence through the content supervision network and then derive the supervision sequence from that text information; alternatively, the electronic device may input the content sequence into a speech recognition acoustic model of the content supervision network, output a phoneme probability sequence through the speech recognition acoustic model, and then derive the supervision sequence from the phoneme probability sequence. The content supervision network may also use other supervision methods to obtain the supervision sequence; this is not limited here.
S203, calculate a loss value of the content encoder for the original acoustic features based on the supervision sequence and the content sequence.
In this step, the electronic device may calculate the loss value of the content encoder for the original acoustic features based on the supervision sequence and the content sequence. Specifically, the electronic device may calculate this loss value through a pre-constructed loss function.
S204, adjust the model parameters in the content encoder according to the loss value of the content encoder for the original acoustic features.
In this step, the electronic device may adjust the model parameters in the content encoder according to the loss value of the content encoder for the original acoustic features. Specifically, the content encoder may be a neural network whose model parameters are adjusted according to this loss value.
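As an illustration of S203 and S204, the sketch below computes a cross-entropy loss between the supervision sequence (treated as target phoneme distributions) and predictions derived from the content sequence, then takes one gradient-descent step on a toy linear stand-in for the content encoder's trainable parameters. The loss function, learning rate, and all shapes are assumptions; the disclosure only requires that some pre-constructed loss drive the parameter adjustment:

```python
import numpy as np

rng = np.random.default_rng(4)
T, D, P = 50, 16, 40            # frames, content dim, phonemes (toy sizes)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

content_seq = rng.normal(size=(T, D))
W = rng.normal(size=(D, P)) * 0.1            # stand-in trainable parameters
supervision_seq = softmax(rng.normal(size=(T, P)))  # target distributions

def ce_loss(W):
    """Frame-averaged cross-entropy against the supervision sequence."""
    probs = softmax(content_seq @ W)
    return float(-np.mean(np.sum(supervision_seq * np.log(probs + 1e-12),
                                 axis=1)))

def step(W, lr=0.1):
    """One gradient-descent update; for softmax + cross-entropy the
    gradient with respect to W is X^T (p - y) / T."""
    probs = softmax(content_seq @ W)
    grad = content_seq.T @ (probs - supervision_seq) / T
    return W - lr * grad

loss_before = ce_loss(W)
W_updated = step(W)
loss_after = ce_loss(W_updated)
```

Because this loss is convex in `W` and the learning rate is small, a single step reduces the loss, which is the parameter adjustment S204 describes.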
S205, input the content sequence and the timbre vector into a decoder, respectively, to obtain predicted acoustic features output by the decoder.
S206, train the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features.
Fig. 3 is a schematic structural diagram of a training system for a speech conversion model according to an embodiment of the present application. As shown in Fig. 3, the training system of the speech conversion model may include a content supervision network, a content encoder, a timbre encoder, and a decoder. When the speech conversion model is trained, the original acoustic features of a speech are first input into the content encoder and the timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder. At the same time, the content sequence may be input into the content supervision network to obtain a supervision sequence output by the content supervision network, and the content encoder is trained based on the supervision sequence and the content sequence. The content sequence and the timbre vector are then input into the decoder to obtain predicted acoustic features output by the decoder, and the speech conversion model is trained based on the predicted acoustic features and the original acoustic features.
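The data flow of the Fig. 3 training system can be condensed into one self-contained sketch. All four modules are toy linear stand-ins with made-up dimensions; only the wiring between them (encode, supervise, decode, score) follows the figure:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy linear stand-ins for the four modules; shapes only, not real networks.
Wc = rng.normal(size=(80, 16)) * 0.1   # content encoder
Wt = rng.normal(size=(80, 8)) * 0.1    # timbre encoder
Ws = rng.normal(size=(16, 40)) * 0.1   # content supervision network
Wd = rng.normal(size=(24, 80)) * 0.1   # decoder

def training_step(mels):
    """One pass of the Fig. 3 system: encode, supervise, decode, score."""
    content = np.tanh(mels @ Wc)                  # [T, 16] content sequence
    timbre = np.tanh(mels @ Wt).mean(axis=0)      # [8] timbre vector
    supervision = content @ Ws                    # [T, 40] supervision signal
    fused = np.concatenate(
        [content, np.tile(timbre, (mels.shape[0], 1))], axis=1)
    predicted = fused @ Wd                        # [T, 80] predicted features
    recon_loss = float(np.mean((predicted - mels) ** 2))
    return predicted, supervision, recon_loss

mels = rng.normal(size=(100, 80))
predicted, supervision, recon_loss = training_step(mels)
```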
As in Example One, the training method provided by this embodiment feeds the content sequence output by the content encoder not only to the decoder but also to a content supervision network that is used to train the content encoder specifically, whereas existing training methods feed the content sequence only to the decoder. This solves the prior-art problem that the content encoder may discard part of the content information when encoding a speaker's content, which causes more errors in the converted speech; the technical solution is also simple to implement, easy to popularize, and widely applicable.
Example Three
Fig. 4 is a third flowchart of a method for training a speech conversion model according to an embodiment of the present application. This embodiment further optimizes and extends the foregoing technical solution and may be combined with any of the optional embodiments above. As shown in Fig. 4, the training method of the speech conversion model may include the following steps:
S401, input the original acoustic features of a speech into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder.
S402, input the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network, and train the content encoder based on the supervision sequence and the content sequence.
S403, input the content sequence and the timbre vector into a decoder, respectively, to obtain predicted acoustic features output by the decoder.
S404, train the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features.
S405, input the original acoustic features of a first user for a first speech and the original acoustic features of a second user for a second speech into the trained speech conversion model, respectively, and obtain, through the speech conversion model, a target speech converted from the first speech and the second speech; the target speech includes the content information of the first speech and the timbre information of the second speech.
After the trained speech conversion model is obtained through the above steps, the electronic device may input the original acoustic features of the first user for the first speech and the original acoustic features of the second user for the second speech into the trained speech conversion model, respectively, and obtain, through the speech conversion model, the target speech converted from the first speech and the second speech; the target speech includes the content information of the first speech and the timbre information of the second speech. Specifically, the electronic device may input the original acoustic features of the first user for the first speech into the content encoder to obtain a content sequence of the first speech output by the content encoder, and input the original acoustic features of the second user for the second speech into the trained timbre encoder to obtain a timbre vector of the second speech output by the timbre encoder. The content sequence of the first speech and the timbre vector of the second speech are then input into the trained decoder, which outputs predicted fused acoustic features; the predicted fused acoustic features are input into the trained vocoder to obtain the target speech output by the vocoder.
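The conversion step can be sketched as follows: speaker A's utterance supplies the content, speaker B's utterance supplies the timbre, and the decoder fuses the two. The "trained" weights here are random toy stand-ins, and the final vocoder stage (fused acoustic features to waveform) is deliberately left out since its internals are not specified:

```python
import numpy as np

rng = np.random.default_rng(6)
Wc = rng.normal(size=(80, 16)) * 0.1   # trained content encoder (toy weights)
Wt = rng.normal(size=(80, 8)) * 0.1    # trained timbre encoder
Wd = rng.normal(size=(24, 80)) * 0.1   # trained decoder

def convert(mels_a, mels_b):
    """Combine speaker A's content with speaker B's timbre (Fig. 5 data flow)."""
    content_a = np.tanh(mels_a @ Wc)               # content of the first speech
    timbre_b = np.tanh(mels_b @ Wt).mean(axis=0)   # timbre of the second speech
    fused = np.concatenate(
        [content_a, np.tile(timbre_b, (mels_a.shape[0], 1))], axis=1)
    return fused @ Wd     # fused acoustic features; a vocoder would then
                          # render these into the target-speech waveform

mels_a = rng.normal(size=(90, 80))     # speaker A utterance, 90 frames
mels_b = rng.normal(size=(140, 80))    # speaker B utterance (length may differ)
fused_features = convert(mels_a, mels_b)
```

Note that the output length follows the content utterance (speaker A), while speaker B contributes only a single utterance-level timbre vector, so the two inputs need not have the same number of frames.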
Fig. 5 is a schematic structural diagram of a prediction system of a speech conversion model according to an embodiment of the present application. As shown in Fig. 5, the prediction system of the speech conversion model may include a content encoder, a timbre encoder, a decoder, and a vocoder. Suppose the speech of user A is to be converted into the timbre of user B. First, the original acoustic features of user A for the first speech (user A's acoustic features) are input into the content encoder to obtain a content sequence of the first speech output by the content encoder; at the same time, the original acoustic features of user B for the second speech (user B's acoustic features) are input into the trained timbre encoder to obtain a timbre vector of the second speech output by the timbre encoder. The content sequence of the first speech and the timbre vector of the second speech are then input into the trained decoder, which outputs predicted fused acoustic features; the predicted fused acoustic features are input into the trained vocoder to obtain the target speech output by the vocoder.
As in the foregoing examples, the training method provided by this embodiment feeds the content sequence output by the content encoder not only to the decoder but also to a content supervision network used to train the content encoder specifically, solving the prior-art problem that the content encoder may discard part of the content information when encoding a speaker's content, which causes more errors in the converted speech. The technical solution of this embodiment is simple to implement, easy to popularize, and widely applicable.
Example Four
Fig. 6 is a schematic structural diagram of a training apparatus for a speech conversion model according to an embodiment of the present application. As shown in fig. 6, the apparatus 600 includes: an encoding module 601, a supervision module 602, a decoding module 603 and a training module 604; wherein,
the encoding module 601 is configured to input original acoustic features of a speech into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder;
the supervision module 602 is configured to input the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network, and to train the content encoder based on the supervision sequence and the content sequence;
the decoding module 603 is configured to input the content sequence and the timbre vector to a decoder, respectively, so as to obtain predicted acoustic features output by the decoder;
the training module 604 is configured to train a to-be-trained speech conversion model based on the predicted acoustic features and the original acoustic features.
Further, the supervision module 602 is specifically configured to extract text information from the content sequence through the content supervision network and to derive the supervision sequence based on the text information.
Further, the supervision module 602 is specifically configured to input the content sequence into a speech recognition acoustic model of the content supervision network, output a phoneme probability sequence through the speech recognition acoustic model, and derive the supervision sequence based on the phoneme probability sequence.
Further, the supervision module 602 is specifically configured to calculate a loss value of the content encoder for the original acoustic feature based on the supervision sequence and the content sequence; and adjusting model parameters in the content encoder according to the loss value of the content encoder for the original acoustic features.
Further, the apparatus includes a prediction module 605 (not shown in the figure), configured to input the original acoustic features of a first user for a first speech and the original acoustic features of a second user for a second speech into the trained speech conversion model, respectively, and to obtain, through the speech conversion model, a target speech converted from the first speech and the second speech; the target speech includes the content information of the first speech and the timbre information of the second speech.
Further, the prediction module 605 is specifically configured to input the original acoustic features of the first user for the first speech into the content encoder to obtain a content sequence of the first speech output by the content encoder; input the original acoustic features of the second user for the second speech into the trained timbre encoder to obtain a timbre vector of the second speech output by the timbre encoder; input the content sequence of the first speech and the timbre vector of the second speech into the trained decoder, which outputs predicted fused acoustic features; and input the predicted fused acoustic features into a trained vocoder to obtain the target speech output by the vocoder.
The training apparatus for a speech conversion model can execute the method provided by any embodiment of the present application and has the corresponding functional modules and beneficial effects of the executed method. For technical details not described in this embodiment, reference may be made to the training method of the speech conversion model provided in any embodiment of the present application.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
Example Five
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read-Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 701 performs the respective methods and processes described above, such as the training method of the speech conversion model. For example, in some embodiments, the training method of the speech conversion model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method of the speech conversion model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the speech conversion model.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combining a blockchain.
It should be understood that the various flows shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; this is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.