Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Descriptions of well-known functions and constructions are likewise omitted for clarity and conciseness.
Example One
Fig. 1 is a first flowchart of a method for training a speech conversion model according to an embodiment of the present disclosure. The method may be performed by an apparatus or an electronic device for training a speech conversion model; the apparatus or electronic device may be implemented in software and/or hardware and may be integrated into any intelligent device with a network communication function. As shown in Fig. 1, the training method of the speech conversion model may include the following steps:
S101, input the original acoustic features of a speech into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder.
In this step, the electronic device may input the original acoustic features of the speech into the content encoder and the timbre encoder, respectively, to obtain the content sequence output by the content encoder and the timbre vector output by the timbre encoder. Specifically, when the speech conversion model to be trained does not satisfy a preset convergence condition, the electronic device may perform this input to obtain the content sequence and the timbre vector. The electronic device can extract content-related information from the original acoustic features through the content encoder to obtain the content sequence corresponding to the original acoustic features, and extract timbre-related information from the original acoustic features through the timbre encoder to obtain the timbre vector corresponding to the original acoustic features.
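The two encoders can be sketched as follows. This is a minimal illustrative example with toy linear projections and made-up dimensions (80-dimensional mel frames, 16-dimensional content, 8-dimensional timbre), not the architecture of this disclosure; the only point it shows is that the content encoder produces one vector per frame while the timbre encoder pools the whole utterance into a single vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy projection weights; a real model would learn these (e.g. with
# convolutional or recurrent layers).
W_content = rng.normal(size=(80, 16))   # 80-dim mel frame -> 16-dim content
W_timbre = rng.normal(size=(80, 8))     # 80-dim mel frame -> 8-dim timbre

def content_encoder(mels):
    """Frame-level content sequence: one vector per frame, shape [T, 16]."""
    return np.tanh(mels @ W_content)

def timbre_encoder(mels):
    """Utterance-level timbre vector: mean-pool over frames, shape [8]."""
    return np.tanh(mels @ W_timbre).mean(axis=0)

mels = rng.normal(size=(120, 80))       # 120 frames of 80-dim acoustic features
content_seq = content_encoder(mels)     # [120, 16]
timbre_vec = timbre_encoder(mels)       # [8]
```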
S102, input the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network, and train the content encoder based on the supervision sequence and the content sequence.
In this step, the electronic device may input the content sequence into the content supervision network to obtain the supervision sequence output by the content supervision network, and train the content encoder based on the supervision sequence and the content sequence. Specifically, the electronic device may extract text information from the content sequence through the content supervision network and then derive the supervision sequence from that text information; alternatively, the electronic device may input the content sequence into a speech recognition acoustic model of the content supervision network, output a phoneme probability sequence through the speech recognition acoustic model, and then derive the supervision sequence from the phoneme probability sequence.
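The second route, in which a speech recognition acoustic model turns each content frame into a phoneme probability distribution, can be sketched as follows. The projection weights, content dimension, and phoneme inventory size are hypothetical stand-ins chosen only so the shapes line up:

```python
import numpy as np

rng = np.random.default_rng(1)
N_PHONEMES = 40                           # hypothetical phoneme inventory size
W_asr = rng.normal(size=(16, N_PHONEMES)) # toy ASR acoustic-model weights

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def supervision_network(content_seq):
    """Map each content frame to a phoneme probability distribution [T, 40]."""
    return softmax(content_seq @ W_asr)

content_seq = rng.normal(size=(120, 16))        # content sequence, 120 frames
supervision_seq = supervision_network(content_seq)
```

Each row of `supervision_seq` is a probability distribution over phonemes for one frame; the supervision sequence is then derived from these per-frame distributions.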
S103, input the content sequence and the timbre vector into a decoder, respectively, to obtain predicted acoustic features output by the decoder.
In this step, the electronic device may input the content sequence and the timbre vector into the decoder to obtain the predicted acoustic features output by the decoder. Specifically, the decoder may fuse the content-related information and the timbre-related information of the original acoustic features to obtain the predicted acoustic features.
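One simple way a decoder can fuse a frame-level content sequence with an utterance-level timbre vector is to broadcast the timbre vector to every frame, concatenate, and project back to acoustic-feature space. This fusion scheme and the linear projection are assumptions for illustration, not necessarily what the disclosure uses:

```python
import numpy as np

rng = np.random.default_rng(2)
W_dec = rng.normal(size=(16 + 8, 80)) * 0.1  # fused features -> 80-dim mel frame

def decoder(content_seq, timbre_vec):
    """Broadcast the utterance-level timbre vector to every frame,
    concatenate it with the content sequence, and project the result
    back to acoustic-feature space."""
    T = content_seq.shape[0]
    timbre_tiled = np.tile(timbre_vec, (T, 1))                   # [T, 8]
    fused = np.concatenate([content_seq, timbre_tiled], axis=1)  # [T, 24]
    return fused @ W_dec                                         # [T, 80]

content_seq = rng.normal(size=(120, 16))
timbre_vec = rng.normal(size=(8,))
predicted_mels = decoder(content_seq, timbre_vec)
```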
S104, train the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features.
In this step, the electronic device may train the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features. The electronic device may then select another speech sample and continue training until the speech conversion model to be trained satisfies the preset convergence condition. The reselected speech may or may not be adjacent to the previous speech; this is not limited here. Further, the electronic device may compute a loss value between the predicted acoustic features and the original acoustic features through a pre-constructed loss function, and then train the speech conversion model to be trained based on that loss value.
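The disclosure does not fix the loss function or the convergence condition; a common choice for acoustic-feature reconstruction is mean squared error, and a convergence test might compare the loss against a threshold or check that it has stopped improving. The sketch below uses those assumed choices:

```python
import numpy as np

def reconstruction_loss(predicted, original):
    """Mean squared error between predicted and original acoustic features."""
    return float(np.mean((predicted - original) ** 2))

def converged(loss_history, threshold=1e-3, patience=3):
    """Hypothetical convergence rule: loss below a threshold, or no
    improvement over the last `patience` evaluations."""
    if loss_history and loss_history[-1] < threshold:
        return True
    if len(loss_history) > patience:
        recent = loss_history[-patience:]
        return max(recent) - min(recent) < 1e-6
    return False

rng = np.random.default_rng(3)
original = rng.normal(size=(120, 80))
loss = reconstruction_loss(original + 0.01, original)  # ~0.0001
```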
In the training method of the speech conversion model provided by this embodiment, the original acoustic features of a speech are first input into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder; the content sequence is then input into a content supervision network to obtain a supervision sequence output by the content supervision network, and the content encoder is trained based on the supervision sequence and the content sequence; the content sequence and the timbre vector are input into a decoder to obtain predicted acoustic features output by the decoder; finally, the speech conversion model to be trained is trained based on the predicted acoustic features and the original acoustic features. That is, during training the content sequence output by the content encoder is fed not only to the decoder but also to the content supervision network, which is used to train the content encoder specifically. In existing training methods for speech conversion models, the content sequence is fed only to the decoder, and no auxiliary network is trained specifically for the content encoder. By adding a content supervision network to the speech conversion model and training the content encoder with it, this embodiment solves the prior-art problem that the content encoder may discard part of the content information when encoding a speaker's content, which causes more errors in the converted speech. Moreover, the technical solution of this embodiment is simple to implement, easy to popularize, and widely applicable.
Example Two
Fig. 2 is a second flowchart of a method for training a speech conversion model according to an embodiment of the present application. This embodiment further optimizes and extends the foregoing technical solution and may be combined with any of the optional embodiments above. As shown in Fig. 2, the training method of the speech conversion model may include the following steps:
S201, input the original acoustic features of a speech into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder.
S202, input the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network.
In this step, the electronic device may input the content sequence into the content supervision network to obtain the supervision sequence output by the content supervision network. Specifically, the electronic device may extract text information from the content sequence through the content supervision network and then derive the supervision sequence from that text information; alternatively, the electronic device may input the content sequence into a speech recognition acoustic model of the content supervision network, output a phoneme probability sequence through the speech recognition acoustic model, and then derive the supervision sequence from the phoneme probability sequence. The content supervision network may also use other supervision methods to obtain the supervision sequence; this is not limited here.
S203, calculate a loss value of the content encoder for the original acoustic features based on the supervision sequence and the content sequence.
In this step, the electronic device may calculate the loss value of the content encoder for the original acoustic features based on the supervision sequence and the content sequence. Specifically, the electronic device may calculate this loss value through a pre-constructed loss function.
S204, adjust the model parameters in the content encoder according to the loss value of the content encoder for the original acoustic features.
In this step, the electronic device may adjust the model parameters in the content encoder according to the loss value of the content encoder for the original acoustic features. Specifically, the content encoder may be a neural network whose model parameters are adjusted according to this loss value.
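As an illustration of S203 and S204, the sketch below computes a cross-entropy loss between the supervision sequence (treated as target phoneme distributions) and predictions derived from the content sequence, then takes one gradient-descent step on a toy linear stand-in for the content encoder's trainable parameters. The loss function, learning rate, and all shapes are assumptions; the disclosure only requires that some pre-constructed loss drive the parameter adjustment:

```python
import numpy as np

rng = np.random.default_rng(4)
T, D, P = 50, 16, 40            # frames, content dim, phonemes (toy sizes)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

content_seq = rng.normal(size=(T, D))
W = rng.normal(size=(D, P)) * 0.1            # stand-in trainable parameters
supervision_seq = softmax(rng.normal(size=(T, P)))  # target distributions

def ce_loss(W):
    """Frame-averaged cross-entropy against the supervision sequence."""
    probs = softmax(content_seq @ W)
    return float(-np.mean(np.sum(supervision_seq * np.log(probs + 1e-12),
                                 axis=1)))

def step(W, lr=0.1):
    """One gradient-descent update; for softmax + cross-entropy the
    gradient with respect to W is X^T (p - y) / T."""
    probs = softmax(content_seq @ W)
    grad = content_seq.T @ (probs - supervision_seq) / T
    return W - lr * grad

loss_before = ce_loss(W)
W_updated = step(W)
loss_after = ce_loss(W_updated)
```

Because this loss is convex in `W` and the learning rate is small, a single step reduces the loss, which is the parameter adjustment S204 describes.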
S205, input the content sequence and the timbre vector into a decoder, respectively, to obtain predicted acoustic features output by the decoder.
S206, train the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features.
Fig. 3 is a schematic structural diagram of a training system for a speech conversion model according to an embodiment of the present application. As shown in Fig. 3, the training system of the speech conversion model may include a content supervision network, a content encoder, a timbre encoder, and a decoder. When the speech conversion model is trained, the original acoustic features of a speech are first input into the content encoder and the timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder. At the same time, the content sequence may be input into the content supervision network to obtain a supervision sequence output by the content supervision network, and the content encoder is trained based on the supervision sequence and the content sequence. The content sequence and the timbre vector are then input into the decoder to obtain predicted acoustic features output by the decoder, and the speech conversion model is trained based on the predicted acoustic features and the original acoustic features.
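The data flow of the Fig. 3 training system can be condensed into one self-contained sketch. All four modules are toy linear stand-ins with made-up dimensions; only the wiring between them (encode, supervise, decode, score) follows the figure:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy linear stand-ins for the four modules; shapes only, not real networks.
Wc = rng.normal(size=(80, 16)) * 0.1   # content encoder
Wt = rng.normal(size=(80, 8)) * 0.1    # timbre encoder
Ws = rng.normal(size=(16, 40)) * 0.1   # content supervision network
Wd = rng.normal(size=(24, 80)) * 0.1   # decoder

def training_step(mels):
    """One pass of the Fig. 3 system: encode, supervise, decode, score."""
    content = np.tanh(mels @ Wc)                  # [T, 16] content sequence
    timbre = np.tanh(mels @ Wt).mean(axis=0)      # [8] timbre vector
    supervision = content @ Ws                    # [T, 40] supervision signal
    fused = np.concatenate(
        [content, np.tile(timbre, (mels.shape[0], 1))], axis=1)
    predicted = fused @ Wd                        # [T, 80] predicted features
    recon_loss = float(np.mean((predicted - mels) ** 2))
    return predicted, supervision, recon_loss

mels = rng.normal(size=(100, 80))
predicted, supervision, recon_loss = training_step(mels)
```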
As in Example One, the training method provided by this embodiment feeds the content sequence output by the content encoder not only to the decoder but also to a content supervision network that is used to train the content encoder specifically, whereas existing training methods feed the content sequence only to the decoder. This solves the prior-art problem that the content encoder may discard part of the content information when encoding a speaker's content, which causes more errors in the converted speech; the technical solution is also simple to implement, easy to popularize, and widely applicable.
Example Three
Fig. 4 is a third flowchart of a method for training a speech conversion model according to an embodiment of the present application. This embodiment further optimizes and extends the foregoing technical solution and may be combined with any of the optional embodiments above. As shown in Fig. 4, the training method of the speech conversion model may include the following steps:
S401, input the original acoustic features of a speech into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder.
S402, input the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network, and train the content encoder based on the supervision sequence and the content sequence.
S403, input the content sequence and the timbre vector into a decoder, respectively, to obtain predicted acoustic features output by the decoder.
S404, train the speech conversion model to be trained based on the predicted acoustic features and the original acoustic features.
S405, input the original acoustic features of a first user for a first speech and the original acoustic features of a second user for a second speech into the trained speech conversion model, respectively, and obtain, through the speech conversion model, a target speech converted from the first speech and the second speech; the target speech includes the content information of the first speech and the timbre information of the second speech.
After the trained speech conversion model is obtained through the above steps, the electronic device may input the original acoustic features of the first user for the first speech and the original acoustic features of the second user for the second speech into the trained speech conversion model, respectively, and obtain, through the speech conversion model, the target speech converted from the first speech and the second speech; the target speech includes the content information of the first speech and the timbre information of the second speech. Specifically, the electronic device may input the original acoustic features of the first user for the first speech into the content encoder to obtain a content sequence of the first speech output by the content encoder, and input the original acoustic features of the second user for the second speech into the trained timbre encoder to obtain a timbre vector of the second speech output by the timbre encoder. The content sequence of the first speech and the timbre vector of the second speech are then input into the trained decoder, which outputs predicted fused acoustic features; the predicted fused acoustic features are input into the trained vocoder to obtain the target speech output by the vocoder.
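The conversion step can be sketched as follows: speaker A's utterance supplies the content, speaker B's utterance supplies the timbre, and the decoder fuses the two. The "trained" weights here are random toy stand-ins, and the final vocoder stage (fused acoustic features to waveform) is deliberately left out since its internals are not specified:

```python
import numpy as np

rng = np.random.default_rng(6)
Wc = rng.normal(size=(80, 16)) * 0.1   # trained content encoder (toy weights)
Wt = rng.normal(size=(80, 8)) * 0.1    # trained timbre encoder
Wd = rng.normal(size=(24, 80)) * 0.1   # trained decoder

def convert(mels_a, mels_b):
    """Combine speaker A's content with speaker B's timbre (Fig. 5 data flow)."""
    content_a = np.tanh(mels_a @ Wc)               # content of the first speech
    timbre_b = np.tanh(mels_b @ Wt).mean(axis=0)   # timbre of the second speech
    fused = np.concatenate(
        [content_a, np.tile(timbre_b, (mels_a.shape[0], 1))], axis=1)
    return fused @ Wd     # fused acoustic features; a vocoder would then
                          # render these into the target-speech waveform

mels_a = rng.normal(size=(90, 80))     # speaker A utterance, 90 frames
mels_b = rng.normal(size=(140, 80))    # speaker B utterance (length may differ)
fused_features = convert(mels_a, mels_b)
```

Note that the output length follows the content utterance (speaker A), while speaker B contributes only a single utterance-level timbre vector, so the two inputs need not have the same number of frames.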
Fig. 5 is a schematic structural diagram of a prediction system of a speech conversion model according to an embodiment of the present application. As shown in Fig. 5, the prediction system of the speech conversion model may include a content encoder, a timbre encoder, a decoder, and a vocoder. Suppose the speech of user A is to be converted into the timbre of user B. First, the original acoustic features of user A for the first speech (user A's acoustic features) are input into the content encoder to obtain a content sequence of the first speech output by the content encoder; at the same time, the original acoustic features of user B for the second speech (user B's acoustic features) are input into the trained timbre encoder to obtain a timbre vector of the second speech output by the timbre encoder. The content sequence of the first speech and the timbre vector of the second speech are then input into the trained decoder, which outputs predicted fused acoustic features; the predicted fused acoustic features are input into the trained vocoder to obtain the target speech output by the vocoder.
As in the foregoing examples, the training method provided by this embodiment feeds the content sequence output by the content encoder not only to the decoder but also to a content supervision network used to train the content encoder specifically, solving the prior-art problem that the content encoder may discard part of the content information when encoding a speaker's content, which causes more errors in the converted speech. The technical solution of this embodiment is simple to implement, easy to popularize, and widely applicable.
Example Four
Fig. 6 is a schematic structural diagram of a training apparatus for a speech conversion model according to an embodiment of the present application. As shown in fig. 6, the apparatus 600 includes: an encoding module 601, a supervision module 602, a decoding module 603 and a training module 604; wherein,
the encoding module 601 is configured to input original acoustic features of a speech into a content encoder and a timbre encoder, respectively, to obtain a content sequence output by the content encoder and a timbre vector output by the timbre encoder;
the supervision module 602 is configured to input the content sequence into a content supervision network to obtain a supervision sequence output by the content supervision network, and to train the content encoder based on the supervision sequence and the content sequence;
the decoding module 603 is configured to input the content sequence and the timbre vector to a decoder, respectively, so as to obtain predicted acoustic features output by the decoder;
the training module 604 is configured to train a to-be-trained speech conversion model based on the predicted acoustic features and the original acoustic features.
Further, the supervision module 602 is specifically configured to extract text information from the content sequence through the content supervision network and to derive the supervision sequence based on the text information.
Further, the supervision module 602 is specifically configured to input the content sequence into a speech recognition acoustic model of the content supervision network, output a phoneme probability sequence through the speech recognition acoustic model, and derive the supervision sequence based on the phoneme probability sequence.
Further, the supervision module 602 is specifically configured to calculate a loss value of the content encoder for the original acoustic feature based on the supervision sequence and the content sequence; and adjusting model parameters in the content encoder according to the loss value of the content encoder for the original acoustic features.
Further, the apparatus includes a prediction module 605 (not shown in the figure), configured to input the original acoustic features of a first user for a first speech and the original acoustic features of a second user for a second speech into the trained speech conversion model, respectively, and to obtain, through the speech conversion model, a target speech converted from the first speech and the second speech; the target speech includes the content information of the first speech and the timbre information of the second speech.
Further, the prediction module 605 is specifically configured to input the original acoustic features of the first user for the first speech into the content encoder to obtain a content sequence of the first speech output by the content encoder; input the original acoustic features of the second user for the second speech into the trained timbre encoder to obtain a timbre vector of the second speech output by the timbre encoder; input the content sequence of the first speech and the timbre vector of the second speech into the trained decoder, which outputs predicted fused acoustic features; and input the predicted fused acoustic features into a trained vocoder to obtain the target speech output by the vocoder.
The training apparatus for a speech conversion model can execute the method provided by any embodiment of the present application and has the corresponding functional modules and beneficial effects of the executed method. For technical details not described in this embodiment, reference may be made to the training method of the speech conversion model provided in any embodiment of the present application.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
Example Five
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read-Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 701 performs the respective methods and processes described above, such as the training method of the speech conversion model. For example, in some embodiments, the training method of the speech conversion model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method of the speech conversion model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the speech conversion model.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combining a blockchain.
It should be understood that the various flows shown above may be used with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; this is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.