CN114023342A - Voice conversion method and device, storage medium and electronic equipment - Google Patents

Voice conversion method and device, storage medium and electronic equipment Download PDF

Info

Publication number
CN114023342A
CN114023342A (application CN202111118347.2A)
Authority
CN
China
Prior art keywords
audio
characteristic
target
feature
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111118347.2A
Other languages
Chinese (zh)
Other versions
CN114023342B (en)
Inventor
聂志朋
王俊超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111118347.2A priority Critical patent/CN114023342B/en
Publication of CN114023342A publication Critical patent/CN114023342A/en
Application granted granted Critical
Publication of CN114023342B publication Critical patent/CN114023342B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present disclosure relates to the field of artificial intelligence technologies such as natural language processing, speech technology and deep learning, and provides a voice conversion method, apparatus, storage medium and electronic device. The method includes: receiving source audio to be converted; encoding content information of the source audio to obtain a first feature; acquiring designated audio of a target speaker; performing voice recognition on the designated audio to obtain a second feature; and inputting the first feature and the second feature into a voice conversion model to obtain target audio. Because end-to-end voice conversion training is adopted, the cumbersome process of training a vocoder separately is effectively avoided, and a large amount of target-speaker audio does not need to be collected for separate vocoder model training. Timbre conversion from any source audio to the target speaker can therefore be achieved without parallel data. Moreover, the acoustic model and the vocoder are modeled cooperatively, which greatly reduces the overall model size for voice conversion, clearly saves storage and computing resources, and effectively improves voice conversion efficiency.

Description

Voice conversion method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technologies such as natural language processing, speech technology and deep learning, and in particular, to a voice conversion method and apparatus, a storage medium, and an electronic device.
Background
Speech conversion refers to converting source audio into speech that has the timbre characteristics of a target speaker. It has important applications in fields such as voice changing, dubbing, and voice simulation, and is a leading and important branch of current speech technology. However, the current speech conversion process first requires acquiring a large corpus of the target speaker to train a speech coding model and a vocoder, and the final target speech is obtained only after the vocoder, whose model training process is cumbersome, converts the acoustic features into speech.
Disclosure of Invention
The present disclosure provides a method, apparatus, device and storage medium for voice conversion.
According to an aspect of the present disclosure, there is provided a voice conversion method including: receiving source audio to be converted; encoding content information of the source audio to obtain a first feature; acquiring designated audio of a target speaker; performing voice recognition on the designated audio to obtain a second feature; and inputting the first feature and the second feature into a voice conversion model to obtain target audio.
According to an embodiment of the present disclosure, the inputting the first feature and the second feature into a speech conversion model to obtain a target audio includes: inputting the first feature and the second feature into a voice conversion model, and adding the second feature to each frame of the first feature, based on the frames of the source audio, to obtain a joint code; performing feature fusion on the joint code to obtain a fusion feature; and performing signal conversion on the fusion feature to obtain the target audio.
According to an embodiment of the present disclosure, the method further comprises: performing fundamental frequency extraction on the source audio and the designated audio to obtain fundamental frequency information; correspondingly, inputting the first feature and the second feature into a speech conversion model to obtain target audio includes: inputting the fundamental frequency information, the first feature and the second feature into a voice conversion model to obtain the target audio.
According to an embodiment of the present disclosure, the method further comprises: encoding content information of the target audio to obtain a content feature; and performing first loss determination on the target audio according to the content feature and the first feature.
According to an embodiment of the present disclosure, the method further comprises: acquiring sample audio of the target speaker; performing discrimination model training based on the sample audio and the target audio; and performing second loss determination on the target audio by using the discrimination model.
According to an embodiment of the present disclosure, the second feature includes a timbre feature of the target speaker; the first feature includes content feature information of the source audio.
According to a second aspect of the present disclosure, there is provided a voice conversion apparatus including: a receiving module for receiving source audio to be converted; a content information encoding module for encoding content information of the source audio to obtain a first feature; a designated audio acquisition module for acquiring designated audio of a target speaker; a recognition module for performing voice recognition on the designated audio to obtain a second feature; and a conversion module for inputting the first feature and the second feature into a voice conversion model to obtain target audio.
According to an embodiment of the present disclosure, the conversion module includes: an encoding submodule for inputting the first feature and the second feature into a voice conversion model and adding the second feature to each frame of the first feature, based on the frames of the source audio, to obtain a joint code; a feature fusion submodule for performing feature fusion on the joint code to obtain a fusion feature; and a vocoding submodule for performing signal conversion on the fusion feature to obtain the target audio.
According to an embodiment of the present disclosure, the apparatus further comprises: a fundamental frequency extraction module for performing fundamental frequency extraction on the source audio and the designated audio to obtain fundamental frequency information; correspondingly, the conversion module inputting the first feature and the second feature into a voice conversion model to obtain target audio includes: the conversion module inputting the fundamental frequency information, the first feature and the second feature into the voice conversion model to obtain the target audio.
According to an embodiment of the present disclosure, the apparatus further comprises: a target content encoding module for encoding content information of the target audio to obtain a content feature; and a first loss determination module for performing first loss determination on the target audio according to the content feature and the first feature.
According to an embodiment of the present disclosure, the apparatus further comprises: a sample acquisition module for acquiring sample audio of the target speaker; a discrimination model training module for performing discrimination model training based on the sample audio and the target audio; and a second loss determination module for performing second loss determination on the target audio by using the discrimination model.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described voice conversion method.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the above-described speech conversion method.
According to a fifth aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described speech conversion method.
In the voice conversion method of the embodiment of the present disclosure, content information of received source audio to be converted is encoded to obtain a first feature, designated audio of a target speaker is acquired and voice recognition is performed on it to obtain a second feature, and finally the first feature and the second feature are input into a voice conversion model to obtain the target audio. Because end-to-end voice conversion training is adopted, the cumbersome process of training a vocoder separately is effectively avoided, and a large amount of target-speaker audio does not need to be collected for separate vocoder model training. Timbre conversion from any source audio to the target speaker can therefore be achieved without parallel data. Moreover, the acoustic model and the vocoder are modeled cooperatively, so that the overall model size across the whole voice conversion process is greatly reduced, which clearly saves storage and computing resources and effectively improves voice conversion efficiency.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of an implementation of a voice conversion method according to a first embodiment of the present disclosure;
fig. 2 is a schematic flow chart of an implementation of a voice conversion method according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a model architecture of a specific application example of the speech conversion method according to the present disclosure;
FIG. 4 is a schematic diagram of an alternative construction of a voice conversion apparatus according to the present disclosure;
FIG. 5 illustrates a schematic block diagram of an example electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic flow chart of an implementation of a voice conversion method according to a first embodiment of the present disclosure. Referring to fig. 1, a speech conversion method provided by a first embodiment of the present disclosure includes at least the following operations:
s101, receiving source audio to be converted.
In this embodiment of the present disclosure, the source audio to be converted may be any audio that needs voice conversion. For example, when producing dubbed audio or applying voice change, a sentence Y spoken by speaker A needs to be converted into target audio having the timbre characteristics of target speaker B; the source audio here is the audio of sentence Y spoken by speaker A. This example does not limit the present disclosure's description of source audio.
S102, content information coding is carried out on the source audio to obtain a first characteristic.
In this embodiment of the present disclosure, the first feature may include content feature information of the source audio. The content feature information of the source audio may include all acoustic characteristics of the source audio other than its timbre characteristics, for example: language content characteristics, prosodic rhythm characteristics, mood characteristics, and the like. Encoding the content information of the source audio accordingly includes encoding characteristics such as the language content, prosodic rhythm, and mood in the source audio.
Here, the content of the source audio may be encoded by a speech content encoder. The content encoder's processing of the source audio converts the sound into a machine-recognizable digital vector. This can be realized with a general-purpose speech content encoder from natural language processing technology, and the details are not repeated here.
S103, acquiring the designated audio of the target speaker.
In this embodiment of the present disclosure, the designated audio of the target speaker may be any sentence of audio of the target speaker.
And S104, performing voice recognition on the designated audio to obtain a second characteristic.
In this embodiment of the present disclosure, a general GST (Global Style Token) technique may be adopted to perform speech recognition on the designated audio to obtain the second feature, where the second feature refers to the timbre feature of the target speaker.
Similarly, other suitable speech recognition techniques may be used for speech recognition of the designated audio, and the disclosure is not limited thereto.
And S105, inputting the first characteristic and the second characteristic into a voice conversion model to obtain target audio.
In this embodiment of the present disclosure, inputting the first feature and the second feature into the speech conversion model to obtain the target audio is implemented by: inputting the first feature and the second feature into the voice conversion model, and adding the second feature to each frame of the first feature, based on the frames of the source audio, to obtain a joint code; performing feature fusion on the joint code to obtain a fusion feature; and performing signal conversion on the fusion feature to obtain the target audio.
In this embodiment of the present disclosure, the second feature is the timbre feature recognized from the designated audio of the target speaker, and the first feature comprises the features other than timbre, such as the content feature, mood feature, and prosodic rhythm feature, extracted from the source audio. The speech conversion model has the functions of both a joint network and a vocoder: in the model training process, the joint network that generates the acoustic features and the vocoder that converts the acoustic features into the target audio are trained together as one complete speech conversion model. This effectively avoids collecting a large amount of target-speaker audio for separate vocoder training, and timbre conversion from any source audio to the target speaker can be realized without parallel data.
Here, the first feature is an acoustic feature representing the source audio. The first feature and the second feature are input into the speech conversion model; first, based on the frames of the source audio, the second feature representing the target speaker's timbre can be added to each frame of the first feature by vector addition or vector concatenation to obtain the joint code. Then, feature fusion is performed through multi-layer stacked neural networks to obtain a fusion feature. The neural network algorithms usable here include, but are not limited to, MLP (Multi-Layer Perceptron), convolutional networks, RNN (Recurrent Neural Network), and the like. Finally, the target audio is generated by DDSP (Differentiable Digital Signal Processing).
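As a minimal sketch of this joint coding and fusion step, the following PyTorch-style module broadcasts the timbre embedding onto every content frame, fuses the result with a stacked MLP, and emits parameters for a DDSP-style synthesis head. All dimensions, module names, and the parameter count of the synthesis head are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class JointDecoder(nn.Module):
    """Sketch of the joint network: per-frame timbre injection, stacked
    feature fusion, and a stand-in head emitting DDSP synthesis parameters."""

    def __init__(self, content_dim=256, spk_dim=256, hidden=512, n_params=65):
        super().__init__()
        # Feature-fusion stack; the disclosure allows MLP, CNN, or RNN here.
        self.fusion = nn.Sequential(
            nn.Linear(content_dim + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Stand-in for DDSP synthesis parameters (e.g. harmonic amplitudes).
        self.synth_params = nn.Linear(hidden, n_params)

    def forward(self, content, spk_emb):
        # content: (batch, frames, content_dim); spk_emb: (batch, spk_dim)
        frames = content.size(1)
        # Broadcast the timbre embedding onto every frame (the concatenation
        # variant; vector addition would require matching dimensions).
        spk = spk_emb.unsqueeze(1).expand(-1, frames, -1)
        joint = torch.cat([content, spk], dim=-1)   # joint code
        fused = self.fusion(joint)                  # fusion feature
        return self.synth_params(fused)             # per-frame DDSP inputs
```

In a full system, a DDSP module would combine these per-frame parameters with the fundamental frequency f0 to synthesize the waveform through differentiable oscillators and filtered noise.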
In this embodiment of the present disclosure, for a better conversion effect and as input to the DDSP, fundamental frequency extraction is also performed on the source audio and the designated audio to obtain fundamental frequency information; correspondingly, the fundamental frequency information, the first feature and the second feature are simultaneously input into the voice conversion model to obtain the target audio.
For example, the fundamental frequency information f0 can be extracted from the source audio and the designated audio by a fundamental frequency extractor, and f0 can be input into the voice conversion model architecture as the input of the DDSP process, where the fundamental frequency information f0 includes the fundamental frequency information f0_s of the source audio and the fundamental frequency information f0_t of the target speaker.
In another embodiment of the present disclosure, the fundamental frequency information f0 is also normalized before the fundamental frequency information f0 is input into the speech conversion model.
Here, the fundamental frequency extractor may adopt a mature technical solution such as CREPE (Convolutional Representation for Pitch Estimation), SPICE (Self-supervised Pitch Estimation), or SWIPE (Sawtooth Waveform Inspired Pitch Estimator). Specifically, for any audio in the source audio, the fundamental frequency extractor is used to obtain the fundamental frequency information f0_s of the source audio. Then, the fundamental frequency information f0_t of the target speaker is obtained by the following Equation 1:
f0_t = (f0_s - mean_s) / var_s * var_t + mean_t    (Equation 1)
where mean_s and var_s respectively denote the fundamental frequency mean and variance of the source audio's speaker, which can be obtained and calculated in advance by the fundamental frequency extractor of the present disclosure;
and mean_t and var_t respectively denote the fundamental frequency mean and variance of the target speaker, which can likewise be obtained and calculated in advance by the fundamental frequency extractor of the present disclosure.
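A minimal sketch of this fundamental frequency conversion, assuming Equation 1 is the mean-variance transform written above and that the speaker statistics have been precomputed; the concrete numbers are hypothetical, for illustration only:

```python
import numpy as np

def convert_f0(f0_s, mean_s, var_s, mean_t, var_t):
    """Map source-speaker f0 onto the target speaker's statistics
    (the mean-variance transform of Equation 1 as reconstructed above)."""
    return (f0_s - mean_s) / var_s * var_t + mean_t

# Hypothetical frame-level f0 values; 0.0 marks unvoiced frames.
f0_s = np.array([110.0, 123.5, 0.0, 131.0])
voiced = f0_s > 0
f0_t = np.where(voiced, convert_f0(f0_s, 115.0, 20.0, 210.0, 35.0), 0.0)
```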
Fig. 2 is a schematic flow chart of an implementation of a voice conversion method according to a second embodiment of the present disclosure.
Referring to fig. 2, a speech conversion method provided by a second embodiment of the present disclosure includes at least the following operations:
s201, receiving source audio to be converted.
In this embodiment of the present disclosure, the source audio to be converted may be any audio that needs voice conversion, for example audio produced during dubbing or voice-change processing, as described for the first embodiment.
s202, content information coding is carried out on the source audio to obtain a first characteristic.
S203, acquiring the designated audio of the target speaker.
And S204, performing voice recognition on the designated audio to obtain a second characteristic.
And S205, inputting the first characteristic and the second characteristic into a voice conversion model to obtain target audio.
And S206, carrying out content information coding on the target audio to obtain content characteristics.
In the embodiment of the present disclosure, in order to optimize the training of the speech conversion model, the model optimization of S206 to S210 below is performed. Operations S206 to S207 form one group and operations S208 to S210 form another group; in actual application, either group may be executed alone, or both groups may be executed simultaneously.
In this embodiment of the present disclosure, content information encoding is performed on the target audio generated by the speech conversion model by using a content encoder, so as to obtain the content characteristics of the target audio. Here, the content information encoding of the target audio includes encoding of characteristics such as a language content characteristic, a prosodic rhythm characteristic, and a mood characteristic in the target audio.
S207, performing first loss determination on the target audio according to the content feature and the first feature.
In this embodiment of the present disclosure, a first loss is determined for the target audio based on the content feature of the target audio obtained in operation S206 and the first feature of the source audio obtained in operation S202. Here, the loss may be computed point by point from the content feature and the first feature; for example, the loss computation may include, but is not limited to, the mean squared error MSE_loss and the mean absolute error L1_loss.
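A hedged sketch of this point-by-point first loss, under the assumption that the two feature tensors are frame-aligned; the function name and tensor shapes are illustrative, not prescribed by the disclosure:

```python
import torch
import torch.nn.functional as F

def content_loss(target_content, source_content, use_l1=False):
    """First loss: point-by-point distance between the re-extracted content
    feature of the generated target audio and the source's first feature.
    Both tensors: (batch, frames, feature_dim), assumed frame-aligned."""
    if use_l1:
        return F.l1_loss(target_content, source_content)   # L1_loss
    return F.mse_loss(target_content, source_content)      # MSE_loss
```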
And S208, acquiring the sample audio of the target speaker.
In this embodiment of the present disclosure, to enhance the naturalness of the generated audio, a GAN (Generative Adversarial Network) is also used to perform adversarial training on the voice conversion model. Here, it is necessary to acquire sample audio of the target speaker, which may be any real audio of the target speaker other than the designated audio.
S209, performing discriminant model training based on the sample audio and the target audio.
In this embodiment of the present disclosure, the real loss and fake loss between the sample audio and the target audio may be calculated based on the sample audio and the target audio; the closer the real loss is to 0 and the fake loss is to 1, the better the precision and accuracy of the GAN model and the more optimized its parameters. Accordingly, a discrimination model, which may be a GAN model, is trained.
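The following is a minimal sketch of one discriminator update under common GAN conventions (binary real/fake targets, a generic discriminator network); it is an assumption-laden illustration, not the patent's exact training recipe:

```python
import torch
import torch.nn.functional as F

def discriminator_step(disc, optimizer, sample_audio, target_audio):
    """One discriminator update: sample_audio is real speech of the target
    speaker; target_audio is the converted (generated) audio."""
    optimizer.zero_grad()
    real_logits = disc(sample_audio)
    fake_logits = disc(target_audio.detach())  # keep generator gradients out
    real_loss = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    fake_loss = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    (real_loss + fake_loss).backward()
    optimizer.step()
    return real_loss.item(), fake_loss.item()
```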
S210, performing second loss judgment on the target audio by using the judgment model.
In this embodiment of the present disclosure, the discrimination model may be used to perform second loss determination on the target audio, so that the model parameters of the voice conversion model are adjusted according to the second loss, optimizing the voice conversion model and improving the precision and accuracy of voice conversion.
The specific implementation process of S201 to S205 is similar to the specific implementation process of operations S101 to S105 in the embodiment shown in fig. 1, and is not described here again.
Fig. 3 is a model architecture diagram of a specific application example of the speech conversion method according to the present disclosure. Referring to fig. 3, a model architecture of a specific application example of the speech conversion method of the present disclosure may include: the system comprises a content encoder, a target speaker encoder and a Decoder (Decoder).
The content encoder is used to encode the content information of the source audio to obtain the content information (content) of the source audio. The target speaker encoder is used to perform timbre feature recognition on any audio of the target speaker; since other audio of the target speaker will be used in subsequent operations, for the purpose of distinction, this audio is shown as target speaker audio 1 in fig. 3. The Decoder includes the functions of both the joint network and the DDSP, but here the Decoder does not need to train the joint network and the DDSP separately; instead, they are trained cooperatively as one Decoder model that realizes the voice conversion operation.
The user inputs arbitrary source audio (source_wav1) into the content encoder and an arbitrary sentence of the target speaker's audio (target_wav2) into the target speaker encoder.
The content encoder is configured to compress and content-encode the source audio (source_wav1) to obtain content information (content). The content information here refers to information such as language content, rhythm, and mood, other than the speaker's timbre.
The target speaker encoder is used to perform voice recognition on any sentence of the target speaker's audio to obtain an encoding vector (target_emb) representing the target speaker's unique timbre; mature techniques such as GST can be adopted for this recognition.
The Decoder is used to add the target speaker's timbre feature (target_emb) to the content information (content) of the source audio and generate target audio (target_wav1) that retains the source audio's content information. Structurally, the Decoder may include two parts: the joint network and the DDSP. The joint network adds the target speaker's timbre encoding to each frame of the content encoding by vector addition or concatenation. Then, feature fusion is performed through multi-layer stacked neural networks; the neural networks here include, but are not limited to, MLP, convolutional networks, RNN, and the like. Finally, target audio (target_wav1) with the target speaker's timbre characteristics and content consistent with the source audio is generated by the DDSP.
In order to achieve a better voice conversion effect and meet the input requirement of the DDSP, in this specific application example of the present disclosure, fundamental frequency information f0 is also extracted from the source audio and the designated audio by the fundamental frequency extractor, and f0 is input into the speech conversion model architecture as input to the DDSP process. The fundamental frequency information f0 includes the fundamental frequency information f0_s of the source audio and the fundamental frequency information f0_t of the target speaker. In the optimized embodiment of the present disclosure, the fundamental frequency information f0 is also normalized.
Here, the fundamental frequency extractor may adopt a mature technical solution such as CREPE (Convolutional Representation for Pitch Estimation), SPICE (Self-supervised Pitch Estimation), or SWIPE (Sawtooth Waveform Inspired Pitch Estimator). Specifically, for any audio in the source audio, the fundamental frequency extractor is used to obtain the fundamental frequency information f0_s of the source audio. Then, the fundamental frequency information f0_t of the target speaker is obtained by the following Equation 1:
f0_t = (f0_s - mean_s) / var_s * var_t + mean_t    (Equation 1)
where mean_s and var_s respectively denote the fundamental frequency mean and variance of the source audio's speaker, which can be obtained and calculated in advance by the fundamental frequency extractor of the present disclosure;
and mean_t and var_t respectively denote the fundamental frequency mean and variance of the target speaker, which can likewise be obtained and calculated in advance by the fundamental frequency extractor of the present disclosure.
In addition, the training phase of the voice conversion model may employ the first loss function content_loss and GAN adversarial training. Here, for content_loss, the content information of the generated target audio may be re-extracted through the content encoder, and the loss calculated point by point from the content information of the target audio and the content information of the source audio; the parameters of the Decoder are then optimized according to the calculated loss.
Adopting the GAN model for adversarial training can comprise the following specific operations: other real audio of the target speaker (target_wav2) may be obtained, shown in fig. 3 as target speaker audio 2 to distinguish it from the target speaker's audio used earlier for timbre feature encoding. The target speaker audio 2 and the generated target audio (target_wav1) are input into a generative adversarial network (GAN) model, and second loss determination is performed to obtain a second loss determination result. Specifically, the real/fake loss between the target audio obtained through the Decoder and target speaker audio 2 can be calculated, and the GAN model trained accordingly. Then, the iteratively trained GAN model is used to perform second loss determination on the generated target audio, and the parameters of the Decoder are further optimized according to the loss determination result. The naturalness of the target audio is thereby further enhanced.
FIG. 4 is a schematic diagram of an alternative construction of a voice conversion apparatus according to the present disclosure. As shown in fig. 4, the speech conversion apparatus 40 in the embodiment of the present disclosure includes: a receiving module 401, configured to receive source audio to be converted; a content information encoding module 402, configured to encode content information of the source audio to obtain a first feature; a designated audio acquisition module 403, configured to acquire designated audio of a target speaker; a recognition module 404, configured to perform voice recognition on the designated audio to obtain a second feature; and a conversion module 405, configured to input the first feature and the second feature into a speech conversion model to obtain target audio.
In this embodiment of the present disclosure, the conversion module 405 includes: an encoding submodule for inputting the first feature and the second feature into the voice conversion model and adding the second feature to each frame of the first feature, based on the frames of the source audio, to obtain a joint code; a feature fusion submodule for performing feature fusion on the joint code to obtain a fusion feature; and a vocoding submodule for performing signal conversion on the fusion feature to obtain the target audio.
In this embodiment of the present disclosure, the apparatus 40 further comprises: a fundamental frequency extraction module for performing fundamental frequency extraction on the source audio and the designated audio to obtain fundamental frequency information; correspondingly, the conversion module inputting the first feature and the second feature into a voice conversion model to obtain target audio includes: the conversion module inputting the fundamental frequency information, the first feature and the second feature into the voice conversion model to obtain the target audio.
In this embodiment of the present disclosure, the apparatus 40 further comprises: a target content encoding module for encoding content information of the target audio to obtain a content feature; and a first loss determination module for performing first loss determination on the target audio according to the content feature and the first feature.
In this embodiment of the present disclosure, the apparatus 40 further comprises: a sample acquisition module for acquiring sample audio of the target speaker; a discrimination model training module for performing discrimination model training based on the sample audio and the target audio; and a second loss determination module for performing second loss determination on the target audio by using the discrimination model.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501, which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 501 performs the respective methods and processes described above, such as a voice conversion method. For example, in some embodiments, the speech conversion method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the speech conversion method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the speech conversion method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (10)

1. A method of speech conversion, comprising:
receiving source audio to be converted;
carrying out content information coding on the source audio to obtain a first characteristic;
acquiring designated audio of a target speaker;
carrying out voice recognition on the designated audio to obtain a second characteristic;
and inputting the first characteristic and the second characteristic into a voice conversion model to obtain target audio.
2. The method of claim 1, wherein the inputting the first feature and the second feature into a speech conversion model to obtain target audio comprises:
inputting the first feature and the second feature into a voice conversion model, and adding the second feature to each frame of the first feature, based on the frames of the source audio, to obtain a joint code;
performing feature fusion on the joint codes to obtain fusion features;
and performing signal conversion on the fusion characteristics to obtain the target audio.
3. The method of claim 1, further comprising:
performing fundamental frequency extraction on the source audio and the designated audio to obtain fundamental frequency information; correspondingly,
the inputting the first feature and the second feature into a speech conversion model to obtain target audio includes:
and inputting the fundamental frequency information, the first characteristic and the second characteristic into a voice conversion model to obtain target audio.
4. The method of claim 1, further comprising:
carrying out content information coding on the target audio to obtain content characteristics;
and performing first loss determination on the target audio according to the content characteristic and the first characteristic.
5. The method of claim 1, further comprising:
acquiring sample audio of a target speaker;
performing discriminant model training based on the sample audio and the target audio;
and performing second loss determination on the target audio by using the discrimination model.
6. The method of any of claims 1-5, wherein the second feature comprises a timbre feature of the target speaker;
the first feature includes content feature information of the source audio.
7. A speech conversion apparatus comprising:
the receiving module is used for receiving source audio to be converted;
the content information coding module is used for coding the content information of the source audio to obtain a first characteristic;
the appointed audio acquisition module is used for acquiring appointed audio of the target speaker;
the recognition module is used for carrying out voice recognition on the specified audio to obtain a second characteristic;
and the conversion module is used for inputting the first characteristic and the second characteristic into a voice conversion model to obtain target audio.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of speech conversion according to any of claims 1-6.
9. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the speech conversion method according to any one of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements a method of speech conversion according to any one of claims 1-6.
CN202111118347.2A 2021-09-23 2021-09-23 Voice conversion method, device, storage medium and electronic equipment Active CN114023342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111118347.2A CN114023342B (en) 2021-09-23 2021-09-23 Voice conversion method, device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111118347.2A CN114023342B (en) 2021-09-23 2021-09-23 Voice conversion method, device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN114023342A true CN114023342A (en) 2022-02-08
CN114023342B CN114023342B (en) 2022-11-11

Family

ID=80054866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111118347.2A Active CN114023342B (en) 2021-09-23 2021-09-23 Voice conversion method, device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114023342B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114678032A (en) * 2022-04-24 2022-06-28 北京世纪好未来教育科技有限公司 Training method, voice conversion method and device and electronic equipment
CN115062678A (en) * 2022-08-19 2022-09-16 山东能源数智云科技有限公司 Training method of equipment fault detection model, fault detection method and device

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130317732A1 (en) * 2008-02-22 2013-11-28 BorgSolutions, Inc. Method and System for Monitoring a Mobile Equipment Fleet
CN106057192A (en) * 2016-07-07 2016-10-26 Tcl集团股份有限公司 Real-time voice conversion method and apparatus
CN106504741A (en) * 2016-09-18 2017-03-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of phonetics transfer method based on deep neural network phoneme information
CN108847249A (en) * 2018-05-30 2018-11-20 苏州思必驰信息科技有限公司 Sound converts optimization method and system
CN111247584A (en) * 2019-12-24 2020-06-05 深圳市优必选科技股份有限公司 Voice conversion method, system, device and storage medium
CN111833843A (en) * 2020-07-21 2020-10-27 苏州思必驰信息科技有限公司 Speech synthesis method and system
CN112116904A (en) * 2020-11-20 2020-12-22 北京声智科技有限公司 Voice conversion method, device, equipment and storage medium
CN112466275A (en) * 2020-11-30 2021-03-09 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium
CN112687262A (en) * 2019-10-17 2021-04-20 北京三星通信技术研究有限公司 Voice conversion method and device, electronic equipment and computer readable storage medium
CN113066511A (en) * 2021-03-16 2021-07-02 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113327580A (en) * 2021-06-01 2021-08-31 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
WO2021176102A1 (en) * 2020-03-06 2021-09-10 Algoriddim Gmbh Ai based remixing of music: timbre transformation and matching of mixed audio data

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130317732A1 (en) * 2008-02-22 2013-11-28 BorgSolutions, Inc. Method and System for Monitoring a Mobile Equipment Fleet
CN106057192A (en) * 2016-07-07 2016-10-26 Tcl集团股份有限公司 Real-time voice conversion method and apparatus
CN106504741A (en) * 2016-09-18 2017-03-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of phonetics transfer method based on deep neural network phoneme information
CN108847249A (en) * 2018-05-30 2018-11-20 苏州思必驰信息科技有限公司 Sound converts optimization method and system
CN112687262A (en) * 2019-10-17 2021-04-20 北京三星通信技术研究有限公司 Voice conversion method and device, electronic equipment and computer readable storage medium
CN111247584A (en) * 2019-12-24 2020-06-05 深圳市优必选科技股份有限公司 Voice conversion method, system, device and storage medium
WO2021176102A1 (en) * 2020-03-06 2021-09-10 Algoriddim Gmbh Ai based remixing of music: timbre transformation and matching of mixed audio data
CN111833843A (en) * 2020-07-21 2020-10-27 苏州思必驰信息科技有限公司 Speech synthesis method and system
CN112116904A (en) * 2020-11-20 2020-12-22 北京声智科技有限公司 Voice conversion method, device, equipment and storage medium
CN112466275A (en) * 2020-11-30 2021-03-09 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium
CN113066511A (en) * 2021-03-16 2021-07-02 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113327580A (en) * 2021-06-01 2021-08-31 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
G. Fabbro, et al.: "Speech synthesis and control using differentiable DSP", https://arxiv.org/abs/2010.15084v1 *
Ai Yang: "Research on Neural Network Vocoders for Speech Synthesis", China Excellent Doctoral and Master's Dissertations Full-text Database (Doctoral), Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114678032A (en) * 2022-04-24 2022-06-28 北京世纪好未来教育科技有限公司 Training method, voice conversion method and device and electronic equipment
CN114678032B (en) * 2022-04-24 2022-09-27 北京世纪好未来教育科技有限公司 Training method, voice conversion method and device and electronic equipment
CN115062678A (en) * 2022-08-19 2022-09-16 山东能源数智云科技有限公司 Training method of equipment fault detection model, fault detection method and device

Also Published As

Publication number Publication date
CN114023342B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN112466288B (en) Voice recognition method and device, electronic equipment and storage medium
CN114023342B (en) Voice conversion method, device, storage medium and electronic equipment
CN112599141B (en) Neural network vocoder training method and device, electronic equipment and storage medium
JP2022058775A (en) Target object generating method, apparatus therefor, electronic device, and storage medium
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN114141228B (en) Training method of speech synthesis model, speech synthesis method and device
CN114203154A (en) Training method and device of voice style migration model and voice style migration method and device
CN113689868B (en) Training method and device of voice conversion model, electronic equipment and medium
CN114267375B (en) Phoneme detection method and device, training method and device, equipment and medium
CN114495977A (en) Speech translation and model training method, device, electronic equipment and storage medium
CN115906987A (en) Deep learning model training method, virtual image driving method and device
CN113408305B (en) Model training method, device, equipment and storage medium
CN113689866B (en) Training method and device of voice conversion model, electronic equipment and medium
CN113129869B (en) Method and device for training and recognizing voice recognition model
CN114187892A (en) Style migration synthesis method and device and electronic equipment
CN114783409A (en) Training method of speech synthesis model, speech synthesis method and device
CN113553413A (en) Dialog state generation method and device, electronic equipment and storage medium
CN114171043A (en) Echo determination method, device, equipment and storage medium
CN113808572A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113689867B (en) Training method and device of voice conversion model, electronic equipment and medium
CN114360558B (en) Voice conversion method, voice conversion model generation method and device
CN116229214B (en) Model training method and device and electronic equipment
CN113838450B (en) Audio synthesis and corresponding model training method, device, equipment and storage medium
JP7318161B2 (en) SOUND PROCESSING METHOD, APPARATUS, DEVICE, AND COMPUTER STORAGE MEDIUM
CN115662386A (en) Voice conversion method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant