CN113178200A - Voice conversion method, device, server and storage medium - Google Patents

Voice conversion method, device, server and storage medium

Info

Publication number
CN113178200A
Authority
CN
China
Prior art keywords
voice
emotion
data
conversion
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110470020.5A
Other languages
Chinese (zh)
Other versions
CN113178200B (en)
Inventor
孙奥兰
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110470020.5A priority Critical patent/CN113178200B/en
Publication of CN113178200A publication Critical patent/CN113178200A/en
Application granted granted Critical
Publication of CN113178200B publication Critical patent/CN113178200B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The application relates to voice processing in artificial intelligence, and provides a voice conversion method, a device, a server and a storage medium, wherein the method comprises the following steps: acquiring training sample data, wherein the training sample data comprises a first sample pair or a second sample pair; if the training sample data is a first sample pair, inputting the first voice data into a voice encoder to obtain a voice characteristic vector, and inputting the second voice data into an emotion encoder to obtain a first emotion characteristic vector; inputting the voice feature vector and the first emotion feature vector into a feature conversion layer to obtain a first linear spectrogram and a first Mel spectrogram; updating model parameters of the voice conversion model according to the first linear spectrogram and the first Mel spectrogram until the voice conversion model is converged; and inputting target voice data to be converted and reference voice data representing target emotion into the converged voice conversion model to obtain a target voice signal. The method and the device can improve the accuracy of voice conversion.

Description

Voice conversion method, device, server and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech conversion method, apparatus, server, and storage medium.
Background
Personalized voice generation and human-computer interaction have long been subjects of interest, and voice conversion (VC), as a branch of personalized voice generation, has likewise drawn continued attention. Voice conversion changes a person's voice into a different style while keeping the linguistic content unchanged; it can be used to conceal a speaker's identity in special environments, and also to dub film and television works.
Currently, voice conversion is performed either frame by frame, by forcing an alignment between source speech and target speech and converting the acoustic features of the source speech into those of the target speech, or with a sequence-to-sequence conversion model. Both approaches are prone to conveying the linguistic information incorrectly, leading to skipped words, missing words, repetitions and the like, so the accuracy of voice conversion is not high. How to improve the accuracy of voice conversion has therefore become an urgent problem to be solved.
Disclosure of Invention
The present application mainly aims to provide a voice conversion method, device, server and storage medium, aiming to improve the accuracy of voice conversion.
In a first aspect, the present application provides a method for voice conversion, including:
acquiring training sample data, wherein the training sample data comprises a first sample pair or a second sample pair, the first sample pair comprises first voice data and second voice data, an emotion label corresponding to the first voice data is different from an emotion label corresponding to the second voice data, and the second sample pair comprises third voice data and text information corresponding to the third voice data;
calling a preset voice conversion model, wherein the voice conversion model comprises a text encoder, a voice encoder, an emotion encoder and a feature conversion layer;
if the training sample data is the first sample pair, inputting the first voice data into the voice encoder for encoding operation to obtain a voice feature vector, and inputting the second voice data into the emotion encoder for encoding operation to obtain a first emotion feature vector;
inputting the voice feature vector and the first emotion feature vector into the feature conversion layer for processing to obtain a first linear spectrogram and a first Mel spectrogram;
determining whether the voice conversion model converges according to the first linear spectrogram and the first Mel spectrogram; and
if the training sample data is the second sample pair, inputting the third voice data into the emotion encoder for encoding operation to obtain a second emotion characteristic vector, and inputting the text information into the text encoder for encoding operation to obtain a text characteristic vector;
inputting the text feature vector and the second emotion feature vector into the feature conversion layer for processing to obtain a second linear spectrogram and a second Mel spectrogram;
determining whether the voice conversion model converges according to the second linear spectrogram and the second Mel spectrogram;
if the voice conversion model is not converged, updating model parameters of the voice conversion model, and executing the step of obtaining training sample data until the voice conversion model is converged;
acquiring target voice data to be converted and acquiring reference voice data representing target emotion;
and inputting the target voice data and the reference voice data into the converged voice conversion model to obtain a target voice signal representing the target emotion.
In a second aspect, the present application further provides a voice conversion apparatus, including:
the acquisition module is used for acquiring training sample data, wherein the training sample data comprises a first sample pair or a second sample pair, the first sample pair comprises first voice data and second voice data, emotion labels corresponding to the first voice data are different from emotion labels corresponding to the second voice data, and the second sample pair comprises third voice data and text information corresponding to the third voice data;
the calling module is used for calling a preset voice conversion model, and the voice conversion model comprises a text encoder, a voice encoder, an emotion encoder and a feature conversion layer;
a first encoding module, configured to, if it is determined that the training sample data is the first sample pair, input the first speech data into the speech encoder for encoding operation to obtain a speech feature vector, and input the second speech data into the emotion encoder for encoding operation to obtain a first emotion feature vector;
the first conversion module is used for inputting the voice feature vector and the first emotion feature vector into the feature conversion layer for processing to obtain a first linear spectrogram and a first Mel spectrogram;
a first determining module, configured to determine whether the speech conversion model converges according to the first linear spectrogram and the first mel spectrogram; and
a second encoding module, configured to, if it is determined that the training sample data is the second sample pair, input the third speech data into the emotion encoder for encoding operation to obtain a second emotion feature vector, and input the text information into the text encoder for encoding operation to obtain a text feature vector;
the second conversion module is used for inputting the text feature vector and the second emotion feature vector into the feature conversion layer for processing to obtain a second linear spectrogram and a second Mel spectrogram;
a second determining module, configured to determine whether the speech conversion model converges according to the second linear spectrogram and the second mel spectrogram;
the updating module is used for updating the model parameters of the voice conversion model if the voice conversion model is not converged, and executing the step of acquiring training sample data until the voice conversion model is converged;
the acquisition module is also used for acquiring target voice data to be converted and acquiring reference voice data representing target emotion;
and the input module is used for inputting the target voice data and the reference voice data into the converged voice conversion model to obtain a target voice signal representing the target emotion.
In a third aspect, the present application further provides a server comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the voice conversion method as described above.
In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps of the speech conversion method as described above.
The application provides a voice conversion method, a device, a server and a storage medium, wherein training sample data is obtained, the training sample data comprises a first sample pair or a second sample pair, the first sample pair comprises first voice data and second voice data, and the second sample pair comprises third voice data and text information corresponding to the third voice data; calling a preset voice conversion model, wherein the preset voice conversion model comprises a text encoder, a voice encoder, an emotion encoder and a feature conversion layer; if the training sample data is a first sample pair, inputting the first voice data into a voice encoder to obtain a voice characteristic vector, and inputting the second voice data into an emotion encoder to obtain a first emotion characteristic vector; inputting the voice feature vector and the first emotion feature vector into a feature conversion layer to obtain a first linear spectrogram and a first Mel spectrogram; determining whether the voice conversion model converges according to the first linear spectrogram and the first Mel spectrogram; if the training sample data is a second sample pair, inputting third voice data into an emotion encoder to obtain a second emotion characteristic vector, and inputting text information into a text encoder to obtain a text characteristic vector; inputting the text feature vector and the second emotion feature vector into a feature conversion layer to obtain a second linear spectrogram and a second Mel spectrogram; determining whether the voice conversion model converges according to the second linear spectrogram and the second Mel spectrogram; if the voice conversion model is not converged, updating model parameters of the voice conversion model, and executing the step of acquiring training sample data until the voice conversion model is converged; and inputting target voice data to be converted and reference voice data representing target emotion into the converged voice conversion model to obtain a target voice signal. According to the method and the device, multitask learning is carried out through the text encoder, the voice encoder and the emotion encoder, so that the network parameters of the voice conversion model are fitted towards text content information, the voice conversion model is helped to learn language content information, the performance of the voice conversion model can be improved, the stability of model training is kept, and the accuracy of voice conversion is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart illustrating steps of a speech conversion method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a sub-step of the speech conversion method of FIG. 1;
FIG. 3 is a flow chart illustrating another sub-step of the speech conversion method of FIG. 1;
fig. 4 is a schematic block diagram of a speech conversion apparatus according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of a sub-module of the speech conversion apparatus of FIG. 4;
FIG. 6 is a schematic block diagram of another sub-module of the speech conversion apparatus of FIG. 4;
fig. 7 is a block diagram schematically illustrating a structure of a server according to an embodiment of the present disclosure.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation. In addition, although the division of the functional blocks is made in the device diagram, in some cases, it may be divided in blocks different from those in the device diagram.
The embodiment of the application provides a voice conversion method, a voice conversion device, a server and a storage medium. The voice conversion method can be applied to a server, wherein the server stores a voice conversion model, and the voice conversion model comprises a text encoder, a voice encoder, an emotion encoder and a feature conversion layer. The server may be a single server or a server cluster including a plurality of servers.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a flowchart illustrating a voice conversion method according to an embodiment of the present application.
As shown in fig. 1, the voice conversion method includes steps S101 to S108.
Step S101, obtaining training sample data, wherein the training sample data comprises a first sample pair or a second sample pair.
The first sample pair comprises first voice data and second voice data, emotion labels corresponding to the first voice data are different from emotion labels corresponding to the second voice data, and the second sample pair comprises third voice data and text information corresponding to the third voice data. Wherein the emotion labels include calm, joy, sadness, anger, fear, surprise and confusion, etc., the first sample pair may be one or more, and the second sample pair may also be one or more.
In an embodiment, the emotion tag corresponding to the third voice data is the same as the emotion tag corresponding to the first voice data. When the two tags are the same, in the subsequent model training for Text-to-Speech (TTS), the model parameters of the voice conversion model are not affected by emotion differences and fit towards the embedding space of the text content information being learned, which facilitates the learning of language content information by the voice conversion model.
In one embodiment, as shown in fig. 2, step S101 includes: substep S1011 to substep S1013.
And a substep S1011, obtaining a plurality of training samples, wherein the training samples comprise voice data, text information corresponding to the voice data and emotion labels.
The voice data are utterances recorded by a user with different emotions. For example, user A records 3000 utterances in each of seven emotions (calm, happy, sad, angry, fear, surprise and confusion) to obtain 21000 voice data items, and each item is labelled with its corresponding text information and emotion label, yielding 21000 training samples.
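As an illustrative sketch only (the field names and values below are assumptions for illustration, not part of the patent), such a labelled corpus can be organized as follows:

from dataclasses import dataclass

# Hypothetical container for one training sample; names are illustrative only.
@dataclass
class TrainingSample:
    wav_path: str   # path to the recorded utterance
    text: str       # text information (transcript) of the utterance
    emotion: str    # emotion label of the utterance
    speaker: str    # identity of the recording user, e.g. "A"

# e.g. 7 emotions x 3000 utterances per emotion = 21000 labelled samples
EMOTIONS = ["calm", "happy", "sad", "angry", "fear", "surprise", "confusion"]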
Substep S1012, constructing a first sample pair and a second sample pair based on the plurality of training samples.
The first sample pair comprises first Voice data and second Voice data which represent different emotions, the first sample pair is used for carrying out model training of Voice-to-Voice Conversion (VC), the first Voice data is input into a Voice coder, the second Voice data is input into an emotion coder, and the Voice conversion model is made to learn language content information of different emotions. The second sample pair includes third Speech data and Text information corresponding to the third Speech data, for performing model training of Text-to-Speech (TTS), inputting the third Speech data into the emotion coder, and inputting the Text information into the Text coder, so that the Speech conversion model is fitted toward an embedding space of the learning Text content information, thereby assisting the Speech conversion model in learning the language content information.
In one embodiment, constructing the first sample pair and the second sample pair from a plurality of training samples comprises: determining a first emotion label and a second emotion label to be selected; selecting first voice data corresponding to a first emotion label from a plurality of training samples and combining with second voice data corresponding to a second emotion label to obtain a first sample pair; and selecting third voice data from the plurality of training samples and combining the third voice data with the text information corresponding to the third voice data to obtain a plurality of second sample pairs.
It should be noted that the first emotion tag and the second emotion tag may be flexibly set or randomly selected, and the emotion tag corresponding to the third voice data may be the same as the first emotion tag or the second emotion tag. The first voice data corresponding to the first emotion tag and the second voice data corresponding to the second emotion tag are combined into a sample pair, and the third voice data and the text information corresponding to the third voice data are combined into a sample pair, so that multitask learning of voice-to-voice conversion VC and text-to-voice conversion TTS is facilitated, and the performance of a voice conversion model is improved.
In one embodiment, constructing a plurality of first sample pairs and a plurality of second sample pairs comprises: determining a first emotion label to be selected; selecting first voice data corresponding to a first emotion label from a plurality of training samples for a plurality of times according to a set batch size (batch size) and combining the first voice data with second voice data corresponding to other emotion labels except the first emotion label to obtain a plurality of first sample pairs; and selecting third voice data corresponding to the first emotion label from the plurality of training samples according to the set batch size (batch size) and combining the third voice data with the text information corresponding to the third voice data to obtain a plurality of second sample pairs. The batch size can be flexibly set, for example, the batch size is 30, the emotion tags corresponding to the third voice data can be the same as the emotion tags corresponding to the first voice data, the batch first sample pair is used for voice conversion VC, and the batch second sample pair is used for text-to-voice conversion TTS, so that the stability of a model training process of the voice conversion model can be kept.
Illustratively, the batch size is 30, the first emotion label is calm, and the corresponding 30 first sample pairs and 30 second sample pairs are generated through a plurality of training samples, wherein the emotion label of each first voice data of the first sample pairs is calm, the emotion label of the second voice data can be other than calm, such as happy, sad, angry, fear, etc., and the emotion label of the third voice data is calm.
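A minimal sketch of this batch-wise pair construction, assuming the TrainingSample container sketched above and a simple random pairing strategy (the pairing strategy is an assumption, not the patented procedure itself):

import random

def build_sample_pairs(samples, first_emotion, batch_size=30):
    """Build one batch of first sample pairs (VC) and second sample pairs (TTS)."""
    firsts = [s for s in samples if s.emotion == first_emotion]
    others = [s for s in samples if s.emotion != first_emotion]
    first_pairs = []   # (first voice data, second voice data): different emotion labels
    second_pairs = []  # (third voice data, its text information): emotion == first_emotion
    for _ in range(batch_size):
        first_voice = random.choice(firsts)
        second_voice = random.choice(others)
        first_pairs.append((first_voice, second_voice))
        third_voice = random.choice(firsts)
        second_pairs.append((third_voice, third_voice.text))
    return first_pairs, second_pairs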
And a substep S1013 of selecting the first sample pair or the second sample pair as training sample data.
And selecting the first sample pair or the second sample pair as training sample data according to specific conditions for training the voice conversion model. It can be understood that the model training process of the speech conversion model of the present application includes a speech-to-speech conversion VC and a text-to-speech conversion TTS, and performs multitask learning by performing the speech-to-speech conversion VC or the text-to-speech conversion TTS in turn, and selects a first sample pair when speech-to-speech conversion is required, and selects a second sample pair when text-to-speech conversion is required.
In one embodiment, model training tasks are determined, wherein the model training tasks comprise a first training task and a second training task, the first training task is used for realizing model training of voice-to-voice conversion, and the second training task is used for realizing model training of text-to-voice conversion; if the model training task is a first training task, determining that the first sample pair is used as training sample data; and if the model training task is a second training task, determining the second sample pair as training sample data. It should be noted that, when the model training task is the first training task, the first sample pair is selected as the training sample data, which is beneficial to implementing the model training from speech to speech. When the model training task is the second training task, the second sample pair is selected as training sample data, which is beneficial to realizing the model training from text to voice conversion, thereby improving the performance of the voice conversion model.
Wherein, determining a model training task comprises: determining an output result of a preset function, wherein the preset function comprises a function for randomly generating a first element and a second element; when the output result is the first element, determining the model training task as the first training task; and when the output result is the second element, determining the model training task as the second training task. It should be noted that the preset function is, for example, a random function that uniformly generates an integer in {0, 1}, the first element is, for example, 0, and the second element is, for example, 1. It is to be understood that the preset function may also be another function capable of generating two or more elements, and this embodiment is not particularly limited.
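For example, the task selection can be sketched as follows (assuming a uniform 0/1 draw as the preset function):

import random

def pick_training_task():
    # Preset function: uniformly outputs the first element (0) or the second element (1).
    output = random.randint(0, 1)
    # First training task = voice-to-voice conversion (VC); second = text-to-speech (TTS).
    return "VC" if output == 0 else "TTS"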
And step S102, calling a preset voice conversion model, wherein the voice conversion model comprises a text coder, a voice coder, an emotion coder and a feature conversion layer.
The voice conversion model can be stored in the server in advance, and the voice encoder is used for encoding the first voice data to obtain a voice characteristic vector; the emotion encoder is used for encoding the second voice data to obtain a first emotion characteristic vector, or encoding the third voice data to obtain a second emotion characteristic vector; the text encoder is used for encoding the text information of the third voice data to obtain a text characteristic vector; the feature conversion layer is used for processing the voice feature vector and the first emotion feature vector to obtain a first linear spectrogram and a first Mel spectrogram, or inputting the text feature vector and the second emotion feature vector into the feature conversion layer to be processed to obtain a second linear spectrogram and a second Mel spectrogram.
Illustratively, the speech encoder is formed by stacking LSTM networks, the emotion encoder is composed of LSTM layers and fully connected layers, the feature conversion layer includes an attention layer, a decoder and a post-processing network, and the text encoder is composed of a character embedding layer, a Pre-Net (pre-processing network) and a CBHG network for extracting sequence features. It is understood that the text encoder, the speech encoder, the emotion encoder, and the feature conversion layer may also be composed of other convolutional neural networks, recurrent neural networks, and the like, and this embodiment is not particularly limited.
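For illustration only, the two speech-side encoders described above can be sketched in PyTorch as follows; the input and hidden dimensions are assumptions, and the text encoder (character embedding + Pre-Net + CBHG) is omitted for brevity:

import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Stacked LSTMs mapping an acoustic feature sequence to speech feature vectors."""
    def __init__(self, in_dim=80, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers, batch_first=True)

    def forward(self, x):            # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)
        return h                     # speech feature vectors h_c

class EmotionEncoder(nn.Module):
    """LSTM followed by a fully connected layer; outputs a fixed-size emotion vector."""
    def __init__(self, in_dim=80, hidden=128, emb_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, emb_dim)

    def forward(self, x):            # x: (batch, frames, in_dim)
        _, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1])      # emotion feature vector h_s; content is discarded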
Step S103a, if the training sample data is the first sample pair, inputting the first voice data into a voice coder for coding operation to obtain a voice feature vector, and inputting the second voice data into an emotion coder for coding operation to obtain a first emotion feature vector.
The speech coder is used for coding the input speech data to extract the speech characteristic information of the speech data to obtain a speech characteristic vector, and the emotion coder is used for coding the input speech data to extract the emotion characteristic information of the speech data to obtain an emotion characteristic vector. The emotion encoder only extracts emotion characteristic information and removes language content information. The voice characteristic vector and the emotion characteristic vector are used for realizing model training of voice-to-voice conversion, so that the voice conversion model is helped to learn language content information, the performance of the voice conversion model can be improved, and the accuracy of voice conversion is improved.
Illustratively, the speech encoder is formed by stacking LSTM networks and the emotion encoder is composed of an LSTM layer and a fully connected layer. Through the speech encoder, features are extracted from the first speech data x_c to obtain the speech feature vector h_c = ContentEncoder(x_c); through the emotion encoder, features are extracted from the second speech data x_s to obtain the emotion feature vector h_s = StyleEncoder(x_s).
Step S104a, inputting the voice feature vector and the first emotion feature vector into a feature conversion layer for processing to obtain a first linear spectrogram and a first Mel spectrogram.
The voice conversion model further comprises a feature conversion layer, and the feature conversion layer is used for performing feature conversion processing on the voice feature vector and the first emotion feature vector to generate a first linear spectrogram and a first Mel spectrogram of the first voice data.
In one embodiment, the feature conversion layer includes an attention layer, a decoder, and a post-processing network; the speech feature vector and the first emotion feature vector are mapped through the attention layer to obtain a target feature vector; the target feature vector is decoded through the decoder to obtain a first Mel spectrogram; and the first Mel spectrogram is processed through the post-processing network to obtain a first linear spectrogram. The decoder includes, for example, a one-layer attention RNN with 256 GRU units and a two-layer residual GRU network. It should be noted that the attention layer maps the speech feature vector and the first emotion feature vector into the same space to obtain the target feature vector, which is then passed to the decoder and decoded into the first Mel spectrogram; the first Mel spectrogram is input into the post-processing network to obtain the corresponding first linear spectrogram, so that the loss value of the speech conversion model can be calculated from the first linear spectrogram and the first Mel spectrogram.
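A schematic data-flow sketch of such a feature conversion layer (the three sub-modules are treated as black boxes here; their interfaces are assumptions, not the exact patented network):

def feature_conversion(attention_layer, decoder, post_net, content_vectors, emotion_vector):
    """Illustrative data flow only; the callables are assumed to be trained sub-networks."""
    # 1. The attention layer maps the content vectors and the emotion vector into one space.
    target_vectors = attention_layer(content_vectors, emotion_vector)
    # 2. The decoder (e.g. attention RNN plus residual GRUs) produces the Mel spectrogram.
    mel_spectrogram = decoder(target_vectors)
    # 3. The post-processing network converts the Mel spectrogram into a linear spectrogram.
    linear_spectrogram = post_net(mel_spectrogram)
    return linear_spectrogram, mel_spectrogram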
And step S105a, determining whether the voice conversion model converges according to the first linear spectrogram and the first Mel spectrogram.
After the first linear spectrogram and the first mel spectrogram are obtained, the loss value of the voice conversion model can be calculated through the first linear spectrogram and the first mel spectrogram, and whether the voice conversion model is converged or not is determined according to the loss value. If the voice conversion model is not converged, the model parameters of the voice conversion model can be updated according to the loss value of the voice conversion model.
In one embodiment, a real linear spectrogram and a real Mel spectrogram of the first voice data are obtained; a first loss value of the voice conversion model is calculated according to the real Mel spectrogram and the first Mel spectrogram; a second loss value of the voice conversion model is calculated according to the real linear spectrogram and the first linear spectrogram; the first loss value and the second loss value are added to obtain a total loss value of the voice conversion model; whether the total loss value is less than or equal to a preset loss value is determined; if the total loss value is less than or equal to the preset loss value, the voice conversion model is determined to have converged; and if the total loss value is greater than the preset loss value, the voice conversion model is determined not to have converged. The real Mel spectrogram is obtained by applying a Mel filtering transformation to the spectrogram of the first voice data; for example, a Fourier transform is performed on the voice data to obtain its spectrogram, the spectrogram is passed through a Mel-scale filter bank to obtain the real Mel spectrogram, and the real linear spectrogram is obtained by processing the real Mel spectrogram. It should be noted that, if the speech conversion model is determined not to have converged, it needs to be trained further, so as to ensure the performance of the speech conversion model.
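A minimal sketch of this loss computation and convergence check, assuming an L1 reconstruction loss and an arbitrary preset threshold (both are assumptions; the patent does not fix the loss type):

import torch.nn.functional as F

def total_loss(mel_pred, mel_true, linear_pred, linear_true):
    loss_mel = F.l1_loss(mel_pred, mel_true)           # first loss value (Mel spectrograms)
    loss_linear = F.l1_loss(linear_pred, linear_true)  # second loss value (linear spectrograms)
    return loss_mel + loss_linear                      # total loss value

def is_converged(total_loss_value, preset_loss=0.05):
    # Converged when the total loss value is less than or equal to the preset loss value.
    return total_loss_value <= preset_loss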
Step S103b, if the training sample data is the second sample pair, inputting the third voice data into an emotion encoder for encoding operation to obtain a second emotion characteristic vector, and inputting the text information into a text encoder for encoding operation to obtain a text characteristic vector.
The text encoder is used for encoding the input text information to extract a text feature vector of the text information, and the emotion encoder is used for encoding input voice data and can extract second emotion feature information from the input third voice data. Model training of text-to-speech conversion is thus performed with the text feature vector and the second emotion feature information, which helps the voice conversion model learn text content information and capture an embedded representation of the source speech content, so that the network parameters are fitted towards the text content information, helping to improve the performance of the voice conversion model.
Illustratively, the text encoder consists of a character embedding layer, a Pre-Net (pre-processing network), which may consist of two fully connected layers activated with Rectified Linear Units (ReLU), and a CBHG network for extracting sequence features, and the emotion encoder consists of an LSTM layer and a fully connected layer.
Step S104b, inputting the text feature vector and the second emotion feature vector into a feature conversion layer for processing to obtain a second linear spectrogram and a second Mel spectrogram.
The voice conversion model further comprises a feature conversion layer, and the feature conversion layer can be used for performing feature conversion processing on the text feature vector and the second emotion feature vector to generate a second linear spectrogram and a second Mel spectrogram of third voice data.
In one embodiment, the feature conversion layer includes an attention layer, a decoder, and a post-processing network; the text feature vector and the second emotion feature vector are mapped through the attention layer to obtain a target feature vector; the target feature vector is decoded through the decoder to obtain a second Mel spectrogram; and the second Mel spectrogram is processed through the post-processing network to obtain a second linear spectrogram. It should be noted that the attention layer maps the text feature vector and the second emotion feature vector into the same space to obtain the target feature vector, which is then passed to the decoder and decoded into the second Mel spectrogram; the second Mel spectrogram is input into the post-processing network to obtain the corresponding second linear spectrogram, so that the loss value of the speech conversion model can be calculated from the second linear spectrogram and the second Mel spectrogram.
And step S105b, determining whether the voice conversion model converges according to the second linear spectrogram and the second Mel spectrogram.
After the second linear spectrogram and the second mel spectrogram are obtained, the loss value of the voice conversion model can be calculated through the second linear spectrogram and the second mel spectrogram, and whether the voice conversion model is converged or not is determined according to the loss value. If the voice conversion model is not converged, the model parameters of the voice conversion model can be updated according to the loss value of the voice conversion model.
In one embodiment, a real linear spectrogram and a real Mel spectrogram of the third voice data are obtained; a third loss value of the voice conversion model is calculated according to the real Mel spectrogram and the second Mel spectrogram; a fourth loss value of the voice conversion model is calculated according to the real linear spectrogram and the second linear spectrogram; the third loss value and the fourth loss value are added to obtain a total loss value of the voice conversion model; whether the total loss value is less than or equal to a preset loss value is determined; if the total loss value is less than or equal to the preset loss value, the voice conversion model is determined to have converged; and if the total loss value is greater than the preset loss value, the voice conversion model is determined not to have converged. If the voice conversion model is determined not to have converged, it needs to be trained continuously, so as to ensure the performance of the voice conversion model. If the voice conversion model is determined to have converged, the training ends and the trained voice conversion model is obtained; having undergone multi-task learning, its network parameters are fitted towards text content information, the performance of the voice conversion model is better, and the accuracy of voice conversion can be improved.
And S106, if the voice conversion model is not converged, updating model parameters of the voice conversion model, and executing the step of acquiring training sample data until the voice conversion model is converged.
If the voice conversion model is not converged, the voice conversion model needs to be trained continuously, so that the performance of the voice conversion model is ensured. And adjusting model parameters of the voice conversion model through the total loss value of the voice conversion model, and executing the step of obtaining training sample data, namely obtaining the training sample data again, wherein the training sample data comprises a first sample pair or a second sample pair, and the training is carried out on the voice conversion model with the adjusted model parameters through the first sample pair or the second sample pair until the voice conversion model is converged to obtain the trained voice conversion model. Through the multitask learning of voice-to-voice conversion VC and text-to-voice conversion TTS, the performance of a voice conversion model is improved, word skipping, word missing and repetition are effectively avoided, and the stability of the training process of the voice conversion model is better.
In an embodiment, training the speech conversion model with the first sample pair includes steps S103a to S105a, and when the model parameters of the speech conversion model are updated, they are updated according to the loss value determined from the first linear spectrogram and the first Mel spectrogram. Training the speech conversion model with the second sample pair includes steps S103b to S105b, and when the model parameters of the speech conversion model are updated, they are updated according to the loss value determined from the second linear spectrogram and the second Mel spectrogram.
In one embodiment, determining whether the iteration times of the voice conversion model reach a preset iteration time, and if the iteration times of the voice conversion model reach the preset iteration times, determining that the voice conversion model is in a convergence state; if the iteration times of the voice conversion model do not reach the preset iteration times, determining that the voice conversion model is not in a convergence state; or determining whether the iteration time of the voice conversion model is greater than or equal to the preset iteration time, and if the iteration time of the voice conversion model is greater than or equal to the preset iteration time, determining that the voice conversion model is in a convergence state; and if the iteration time of the voice conversion model is less than the preset iteration time, determining that the voice conversion model is not in a convergence state. The preset iteration time and the preset iteration times can be flexibly set by a user, and the embodiment of the application is not particularly limited.
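Putting the alternating tasks, the parameter updates and the convergence criteria together, a high-level training loop might look as follows (a sketch; the optimizer, the iteration budget and the model's batch interface are assumptions, and pick_training_task / total_loss / is_converged refer to the sketches above):

def train(model, get_training_batch, optimizer, preset_loss=0.05, max_iterations=200000):
    for step in range(max_iterations):                 # iteration-count convergence criterion
        task = pick_training_task()                    # "VC" (first pairs) or "TTS" (second pairs)
        batch = get_training_batch(task)               # re-acquire training sample data
        linear_pred, mel_pred, linear_true, mel_true = model(batch, task=task)
        loss = total_loss(mel_pred, mel_true, linear_pred, linear_true)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                               # update model parameters

        if is_converged(loss.item(), preset_loss):     # loss-based convergence criterion
            break
    return model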
And S107, acquiring target voice data to be converted and acquiring reference voice data representing target emotion.
The reference voice data representing the target emotion can be specified by the user, for example reference voice data expressing emotions such as anger, sadness, fear, happiness, surprise or disgust; the emotion feature information of the target emotion can be obtained from the reference voice data.
And S108, inputting the target voice data and the reference voice data into a converged voice conversion model to obtain a target voice signal representing the target emotion.
The target voice data and the reference voice data are input into the converged voice conversion model to generate a target voice signal, and the target voice signal can represent the target emotion, which is equivalent to transferring the target emotion of the reference voice data into the target voice signal. The language information is fully retained after the voice conversion, word skipping, word missing and repetition are avoided, and a specific emotion can be fused in during the conversion, so the accuracy of voice conversion can be effectively improved and a more accurate target voice signal can be generated.
In one embodiment, as shown in fig. 3, step S108 includes: substeps 1081 to substep S1083.
And a substep S1081 of inputting the target speech data into a speech encoder for encoding operation to extract a speech feature vector of the target speech data, and inputting the reference speech data into an emotion encoder for encoding operation to extract an emotion feature vector of the reference speech data.
The speech encoder is used for extracting the speech feature information of the target voice data, encoding the input target voice data to obtain a speech feature vector of the target voice data, and the emotion encoder is used for extracting the emotion feature information of the reference voice data, encoding the input reference voice data to obtain an emotion feature vector of the reference voice data. The emotion encoder extracts only the emotion feature information of the reference voice data and removes its language content information. Because the voice conversion model has learned language content information through multi-task learning, the accuracy of voice conversion is higher.
And a substep S1082 of inputting the voice feature vector of the target voice data and the emotion feature vector of the reference voice data into a feature conversion layer for processing to obtain a target linear spectrogram.
The feature conversion layer is used for performing feature conversion processing on the voice feature vector of the target voice data and the emotion feature vector of the reference voice data to generate a target linear spectrogram of the target voice data. The specific processing procedure may refer to the embodiment described in step S104a or step S104 b.
Illustratively, the feature conversion layer includes an attention layer, a decoder, and a post-processing network; mapping the voice feature vector of the target voice data and the emotion feature vector of the reference voice data through an attention layer to obtain a target feature vector; decoding the target characteristic vector through a decoder to obtain a target Mel spectrogram; and processing the target Mel frequency spectrum through a post-processing network to obtain a target linear spectrogram. It should be noted that the attention layer maps the speech feature vector and the emotion feature vector to the same space to obtain a target feature vector, and then is connected to a decoder to decode the target feature vector into a target mel-frequency spectrum, and the target mel-frequency spectrum is input to the post-processing layer to obtain a corresponding target linear spectrogram, so that a target speech signal can be output according to the target linear spectrogram.
And a substep S1083 of inputting the target linear spectrogram into the vocoder of the converged voice conversion model for voice code conversion to obtain a target voice signal representing the target emotion.
Vocoders may be classified into channel vocoders, formant vocoders, pattern vocoders, linear prediction vocoders, correlation vocoders, and orthogonal function vocoders. The target linear spectrogram is input into the vocoder to obtain the target voice signal; at this point the voice conversion is complete, and the linear spectrogram has been converted into a playable wav file that can represent the target emotion.
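An end-to-end inference sketch; the helper extract_acoustic_features, the model attribute names and the use of Griffin-Lim as a stand-in vocoder are all assumptions (the patent leaves the vocoder type open):

import numpy as np
import librosa
import soundfile as sf

def convert_voice(model, target_wav, reference_wav, sample_rate=16000, out_path="converted.wav"):
    """Convert target speech so that it carries the emotion of the reference speech."""
    target_features = extract_acoustic_features(target_wav, sample_rate)        # assumed helper
    reference_features = extract_acoustic_features(reference_wav, sample_rate)  # assumed helper

    content = model.speech_encoder(target_features)        # speech feature vector of target speech
    emotion = model.emotion_encoder(reference_features)    # emotion feature vector of reference speech
    linear_spectrogram, _ = model.feature_conversion(content, emotion)

    # Vocoder step: Griffin-Lim reconstructs a waveform from the magnitude linear spectrogram.
    waveform = librosa.griffinlim(np.asarray(linear_spectrogram).T)
    sf.write(out_path, waveform, sample_rate)               # playable wav file
    return waveform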
It should be noted that the multitask learning of the voice conversion model is completed in the training stage, and the network parameters of the voice conversion model are fitted towards the text content information in the training process of performing text-to-voice conversion, which is beneficial to the learning of the language content information by the voice conversion model, so that the model performance is greatly improved, the accuracy of voice conversion can be greatly improved, and a more accurate target voice signal can be generated.
In the voice conversion method provided in the foregoing embodiment, training sample data is obtained, where the training sample data includes a first sample pair or a second sample pair, the first sample pair includes first voice data and second voice data, and the second sample pair includes third voice data and text information corresponding to the third voice data; if the training sample data is a first sample pair, inputting the first voice data into a voice encoder to obtain a voice characteristic vector, and inputting the second voice data into an emotion encoder to obtain a first emotion characteristic vector; inputting the voice feature vector and the first emotion feature vector into a feature conversion layer to obtain a first linear spectrogram and a first Mel spectrogram; determining whether the voice conversion model converges according to the first linear spectrogram and the first Mel spectrogram; if the training sample data is a second sample pair, inputting third voice data into an emotion encoder to obtain a second emotion characteristic vector, and inputting text information into a text encoder to obtain a text characteristic vector; inputting the text feature vector and the second emotion feature vector into a feature conversion layer to obtain a second linear spectrogram and a second Mel spectrogram; determining whether the voice conversion model converges according to the second linear spectrogram and the second Mel spectrogram; if the voice conversion model is not converged, updating model parameters of the voice conversion model, and executing the step of acquiring training sample data until the voice conversion model is converged; and inputting target voice data to be converted and reference voice data representing target emotion into the converged voice conversion model to obtain a target voice signal. According to the method and the device, multitask learning is carried out through the text encoder, the voice encoder and the emotion encoder, so that the network parameters of the voice conversion model are fitted towards text content information, the voice conversion model is helped to learn language content information, the performance of the voice conversion model can be improved, the stability of model training is kept, and the accuracy of voice conversion is improved.
Referring to fig. 4, fig. 4 is a schematic block diagram of a speech conversion apparatus according to an embodiment of the present application.
As shown in fig. 4, the speech conversion apparatus 300 includes: an obtaining module 301, a calling module 302, a first encoding module 303a, a first converting module 304a, a first determining module 305a, a second encoding module 303b, a second converting module 304b, a second determining module 305b, an updating module 306 and an input module 307.
An obtaining module 301, configured to obtain training sample data, where the training sample data includes a first sample pair or a second sample pair, the first sample pair includes first voice data and second voice data, an emotion tag corresponding to the first voice data is different from an emotion tag corresponding to the second voice data, and the second sample pair includes third voice data and text information corresponding to the third voice data;
the calling module 302 is configured to call a preset voice conversion model, where the voice conversion model includes a text encoder, a voice encoder, an emotion encoder, and a feature conversion layer;
the first encoding module 303a is configured to, if it is determined that the training sample data is a first sample pair, input the first speech data into a speech encoder for encoding operation to obtain a speech feature vector, and input the second speech data into an emotion encoder for encoding operation to obtain a first emotion feature vector;
a first conversion module 304a, configured to input the speech feature vector and the first emotion feature vector into a feature conversion layer for processing, so as to obtain a first linear spectrogram and a first mel spectrogram;
a first determining module 305a, configured to determine whether the speech conversion model converges according to the first linear spectrogram and the first mel spectrogram; and
the second encoding module 303b is configured to, if it is determined that the training sample data is a second sample pair, input the third speech data into the emotion encoder for encoding operation to obtain a second emotion feature vector, and input the text information into the text encoder for encoding operation to obtain a text feature vector;
the second conversion module 304b is configured to input the text feature vector and the second emotion feature vector into the feature conversion layer for processing, so as to obtain a second linear spectrogram and a second mel spectrogram;
a second determining module 305b, configured to determine whether the speech conversion model converges according to the second linear spectrogram and the second mel spectrogram;
and the updating module 306 is configured to update the model parameters of the voice conversion model if the voice conversion model is not converged, and perform the step of obtaining training sample data until the voice conversion model is converged.
The obtaining module 301 is further configured to obtain target speech data to be converted and obtain reference speech data representing a target emotion;
an input module 307, configured to input the target speech data and the reference speech data into the converged speech conversion model, so as to obtain a target speech signal representing the target emotion.
In one embodiment, as shown in fig. 5, the obtaining module 301 includes:
the obtaining sub-module 3011 is configured to obtain multiple training samples, where the training samples include voice data, text information corresponding to the voice data, and emotion labels.
A construction sub-module 3012 for constructing the first sample pair and the second sample pair according to the plurality of training samples.
The selecting sub-module 3013 is configured to select the first sample pair or the second sample pair as training sample data.
In one embodiment, the construction submodule 3012 is further configured to:
determining a first emotion label and a second emotion label to be selected;
selecting first voice data corresponding to the first emotion label from the training samples and combining the first voice data with second voice data corresponding to the second emotion label to obtain a first sample pair;
and selecting third voice data from the training samples and combining the third voice data with the text information corresponding to the third voice data to obtain a plurality of second sample pairs.
In one embodiment, the selection submodule 3013 is further configured to:
determining a model training task, wherein the model training task comprises a first training task and a second training task, the first training task is used for realizing model training of voice-to-voice conversion, and the second training task is used for realizing model training of text-to-voice conversion;
if the model training task is a first training task, determining the first sample pair as training sample data;
and if the model training task is a second training task, determining the second sample pair as training sample data.
In one embodiment, the selection submodule 3013 is further configured to:
determining an output result of a preset function, wherein the preset function comprises a function for randomly outputting a first element and a second element;
when the output result is the first element, determining that the model training task is a first training task;
and when the output result is the second element, determining that the model training task is a second training task.
In one embodiment, the feature conversion layer includes an attention layer, a decoder, and a post-processing network; the first conversion module 304a is further configured to:
inputting the voice feature vector and the first emotion feature vector into the attention layer for mapping to obtain a target feature vector;
inputting the target feature vector into the decoder for decoding to obtain a first Mel spectrogram;
and inputting the first Mel frequency spectrum into the post-processing network for processing to obtain a first linear spectrogram.
In one embodiment, as shown in fig. 6, the input module 307 comprises:
a first input module 3071, configured to input the target speech data into the speech encoder for encoding operation to extract a speech feature vector of the target speech data, and input the reference speech data into the emotion encoder for encoding operation to extract an emotion feature vector of the reference speech data;
the second input module 3072 is configured to input the voice feature vector of the target voice data and the emotion feature vector of the reference voice data into the feature conversion layer for processing, so as to obtain a target linear spectrogram;
the third input module 3073 is configured to perform vocoder conversion on the vocoder of the voice conversion model with the converged target linear spectrogram conversion input, so as to obtain a target voice signal representing the target emotion.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the modules and units described above may refer to the corresponding processes in the foregoing voice conversion method embodiment, and are not described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, which may run on a server as shown in fig. 7.
Referring to fig. 7, fig. 7 is a schematic block diagram of a server according to an embodiment of the present disclosure. The server stores a voice conversion model, and the voice conversion model comprises a text encoder, a voice encoder, an emotion encoder and a feature conversion layer.
As shown in fig. 7, the server includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any of the speech conversion methods.
The processor is used for providing calculation and control capacity and supporting the operation of the whole server.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by a processor, causes the processor to perform any of the speech conversion methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of the portion of the architecture related to the solution of the present application and does not limit the servers to which the present application applies; a particular server may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring training sample data, wherein the training sample data comprises a first sample pair or a second sample pair, the first sample pair comprises first voice data and second voice data, an emotion label corresponding to the first voice data is different from an emotion label corresponding to the second voice data, and the second sample pair comprises third voice data and text information corresponding to the third voice data;
calling a preset voice conversion model, wherein the voice conversion model comprises a text encoder, a voice encoder, an emotion encoder and a feature conversion layer;
if the training sample data is the first sample pair, inputting the first voice data into the voice encoder for encoding operation to obtain a voice feature vector, and inputting the second voice data into the emotion encoder for encoding operation to obtain a first emotion feature vector;
inputting the voice feature vector and the first emotion feature vector into the feature conversion layer for processing to obtain a first linear spectrogram and a first Mel spectrogram;
determining whether the voice conversion model converges according to the first linear spectrogram and the first Mel spectrogram; and
if the training sample data is the second sample pair, inputting the third voice data into the emotion encoder for encoding operation to obtain a second emotion characteristic vector, and inputting the text information into the text encoder for encoding operation to obtain a text characteristic vector;
inputting the text feature vector and the second emotion feature vector into the feature conversion layer for processing to obtain a second linear spectrogram and a second Mel spectrogram;
determining whether the voice conversion model converges according to the second linear spectrogram and the second Mel spectrogram;
if the voice conversion model is not converged, updating model parameters of the voice conversion model, and executing the step of obtaining training sample data until the voice conversion model is converged;
acquiring target voice data to be converted and acquiring reference voice data representing target emotion;
and inputting the target voice data and the reference voice data into the converged voice conversion model to obtain a target voice signal representing the target emotion.
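Purely as an illustration of how the two training branches above might be driven, the following sketch shows one possible training step; the model attribute names, the L1 spectrogram reconstruction loss and the optimizer usage are assumptions of this sketch and are not prescribed by this embodiment.

import torch.nn.functional as F

def train_step(model, optimizer, sample, sample_type):
    """One parameter update; sample_type is 'first' or 'second' (assumed labels)."""
    if sample_type == "first":
        # First sample pair: first and second voice data with different emotion labels.
        content_vec = model.speech_encoder(sample["first_voice"])
        emotion_vec = model.emotion_encoder(sample["second_voice"])
    else:
        # Second sample pair: third voice data and its corresponding text information.
        content_vec = model.text_encoder(sample["text"])
        emotion_vec = model.emotion_encoder(sample["third_voice"])

    linear_pred, mel_pred = model.feature_conversion_layer(content_vec, emotion_vec)

    # Convergence is judged from both predicted spectrograms; an L1 reconstruction
    # loss against ground-truth spectrograms is assumed here.
    loss = (F.l1_loss(mel_pred, sample["mel_target"])
            + F.l1_loss(linear_pred, sample["linear_target"]))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()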
In one embodiment, the processor, when implementing the obtaining training sample data, is configured to implement:
acquiring a plurality of training samples, wherein the training samples comprise voice data, text information corresponding to the voice data and emotion labels;
constructing the first sample pair and the second sample pair from the plurality of training samples;
and selecting the first sample pair or the second sample pair as training sample data.
In one embodiment, the processor, in implementing the constructing the first and second sample pairs from the plurality of training samples, is configured to implement:
determining a first emotion label and a second emotion label to be selected;
selecting first voice data corresponding to the first emotion label from the training samples and combining the first voice data with second voice data corresponding to the second emotion label to obtain a first sample pair;
and selecting third voice data from the training samples and combining the third voice data with the text information corresponding to the third voice data to obtain a plurality of second sample pairs.
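For illustration, a possible way of assembling the two kinds of sample pairs from a list of training samples is sketched below; each training sample is assumed to be a dict with 'voice', 'text' and 'emotion' fields, and the exhaustive pairing of the two emotion groups is an assumption of this sketch.

def build_sample_pairs(training_samples, first_emotion, second_emotion):
    first_group = [s for s in training_samples if s["emotion"] == first_emotion]
    second_group = [s for s in training_samples if s["emotion"] == second_emotion]

    # First sample pairs: (first voice data, second voice data) with different emotion labels.
    first_pairs = [(a["voice"], b["voice"]) for a in first_group for b in second_group]

    # Second sample pairs: (third voice data, its corresponding text information).
    second_pairs = [(s["voice"], s["text"]) for s in training_samples]

    return first_pairs, second_pairs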
In one embodiment, when the selecting the first sample pair or the second sample pair as training sample data is implemented, the processor is configured to:
determining a model training task, wherein the model training task comprises a first training task and a second training task, the first training task is used for realizing model training of voice-to-voice conversion, and the second training task is used for realizing model training of text-to-voice conversion;
if the model training task is a first training task, determining the first sample pair as training sample data;
and if the model training task is a second training task, determining the second sample pair as training sample data.
In one embodiment, the processor, when implementing the determining a model training task, is configured to implement:
determining an output result of a preset function, wherein the preset function comprises a function for randomly outputting a first element and a second element;
when the output result is the first element, determining that the model training task is a first training task;
and when the output result is the second element, determining that the model training task is a second training task.
In one embodiment, the feature conversion layer includes an attention layer, a decoder, and a post-processing network; the processor is configured to, when the speech feature vector and the first emotion feature vector are input to the feature conversion layer to be processed to obtain a first linear spectrogram and a first mel spectrogram, implement:
inputting the voice feature vector and the first emotion feature vector into the attention layer for mapping to obtain a target feature vector;
inputting the target feature vector into the decoder for decoding to obtain a first Mel spectrogram;
and inputting the first Mel spectrogram into the post-processing network for processing to obtain a first linear spectrogram.
In one embodiment, the processor, when implementing the inputting of the target voice data and the reference voice data into the converged voice conversion model to obtain a target voice signal representing the target emotion, is configured to implement:
inputting the target voice data into the voice coder for coding operation so as to extract a voice feature vector of the target voice data, and inputting the reference voice data into the emotion coder for coding operation so as to extract an emotion feature vector of the reference voice data;
inputting the voice feature vector of the target voice data and the emotion feature vector of the reference voice data into the feature conversion layer for processing to obtain a target linear spectrogram;
and inputting the target linear spectrogram into the vocoder of the converged voice conversion model for conversion to obtain a target voice signal representing the target emotion.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the server described above may refer to the corresponding process in the foregoing voice conversion method embodiment, and is not described herein again.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program, where the computer program includes program instructions, and the method implemented when the program instructions are executed may refer to the embodiments of the voice conversion method of the present application.
The computer-readable storage medium may be an internal storage unit of the server according to the foregoing embodiment, for example, a hard disk or a memory of the server. The computer readable storage medium may also be an external storage device of the server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the server.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a/an …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of speech conversion, comprising:
acquiring training sample data, wherein the training sample data comprises a first sample pair or a second sample pair, the first sample pair comprises first voice data and second voice data, an emotion label corresponding to the first voice data is different from an emotion label corresponding to the second voice data, and the second sample pair comprises third voice data and text information corresponding to the third voice data;
calling a preset voice conversion model, wherein the voice conversion model comprises a text encoder, a voice encoder, an emotion encoder and a feature conversion layer;
if the training sample data is the first sample pair, inputting the first voice data into the voice encoder for encoding operation to obtain a voice feature vector, and inputting the second voice data into the emotion encoder for encoding operation to obtain a first emotion feature vector;
inputting the voice feature vector and the first emotion feature vector into the feature conversion layer for processing to obtain a first linear spectrogram and a first Mel spectrogram;
determining whether the voice conversion model converges according to the first linear spectrogram and the first Mel spectrogram; and
if the training sample data is the second sample pair, inputting the third voice data into the emotion encoder for encoding operation to obtain a second emotion characteristic vector, and inputting the text information into the text encoder for encoding operation to obtain a text characteristic vector;
inputting the text feature vector and the second emotion feature vector into the feature conversion layer for processing to obtain a second linear spectrogram and a second Mel spectrogram;
determining whether the voice conversion model converges according to the second linear spectrogram and the second Mel spectrogram;
if the voice conversion model is not converged, updating model parameters of the voice conversion model, and executing the step of obtaining training sample data until the voice conversion model is converged;
acquiring target voice data to be converted and acquiring reference voice data representing target emotion;
and inputting the target voice data and the reference voice data into the converged voice conversion model to obtain a target voice signal representing the target emotion.
2. The method of speech conversion according to claim 1, wherein said obtaining training sample data comprises:
acquiring a plurality of training samples, wherein the training samples comprise voice data, text information corresponding to the voice data and emotion labels;
constructing the first sample pair and the second sample pair from the plurality of training samples;
and selecting the first sample pair or the second sample pair as training sample data.
3. The method of speech conversion according to claim 2, wherein said constructing the first sample pair and the second sample pair from the plurality of training samples comprises:
determining a first emotion label and a second emotion label to be selected;
selecting first voice data corresponding to the first emotion label from the training samples and combining the first voice data with second voice data corresponding to the second emotion label to obtain a first sample pair;
and selecting third voice data from the training samples and combining the third voice data with the text information corresponding to the third voice data to obtain a plurality of second sample pairs.
4. The method of claim 2, wherein the selecting the first sample pair or the second sample pair as training sample data comprises:
determining a model training task, wherein the model training task comprises a first training task and a second training task, the first training task is used for realizing model training of voice-to-voice conversion, and the second training task is used for realizing model training of text-to-voice conversion;
if the model training task is a first training task, taking the first sample pair as training sample data;
and if the model training task is a second training task, taking the second sample pair as training sample data.
5. The method of speech conversion according to claim 4, wherein said determining a model training task comprises:
determining an output result of a preset function, wherein the preset function comprises a function for randomly outputting a first element and a second element;
when the output result is the first element, determining that the model training task is a first training task;
and when the output result is the second element, determining that the model training task is a second training task.
6. The speech conversion method of any one of claims 1-5, wherein the feature conversion layer comprises an attention layer, a decoder, and a post-processing network; the inputting the voice feature vector and the first emotion feature vector into the feature conversion layer for processing to obtain a first linear spectrogram and a first mel spectrogram, and the method comprises the following steps:
inputting the voice feature vector and the first emotion feature vector into the attention layer for mapping to obtain a target feature vector;
inputting the target feature vector into the decoder for decoding to obtain a first Mel spectrogram;
and inputting the first Mel spectrogram into the post-processing network for processing to obtain a first linear spectrogram.
7. The speech conversion method of any one of claims 1-5, wherein said inputting the target speech data and the reference speech data into the converged speech conversion model, resulting in a target speech signal characterizing the target emotion, comprises:
inputting the target voice data into the voice coder for coding operation so as to extract a voice feature vector of the target voice data, and inputting the reference voice data into the emotion coder for coding operation so as to extract an emotion feature vector of the reference voice data;
inputting the voice feature vector of the target voice data and the emotion feature vector of the reference voice data into the feature conversion layer for processing to obtain a target linear spectrogram;
and inputting the target linear spectrogram into the vocoder of the converged voice conversion model for conversion to obtain a target voice signal representing the target emotion.
8. A speech conversion apparatus, characterized in that the speech conversion apparatus comprises:
an acquisition module, configured to acquire training sample data, wherein the training sample data comprises a first sample pair or a second sample pair, the first sample pair comprises first voice data and second voice data, an emotion label corresponding to the first voice data is different from an emotion label corresponding to the second voice data, and the second sample pair comprises third voice data and text information corresponding to the third voice data;
the calling module is used for calling a preset voice conversion model, and the voice conversion model comprises a text encoder, a voice encoder, an emotion encoder and a feature conversion layer;
a first encoding module, configured to, if it is determined that the training sample data is the first sample pair, input the first speech data into the speech encoder for encoding operation to obtain a speech feature vector, and input the second speech data into the emotion encoder for encoding operation to obtain a first emotion feature vector;
the first conversion module is used for inputting the voice feature vector and the first emotion feature vector into the feature conversion layer for processing to obtain a first linear spectrogram and a first Mel spectrogram;
a first determining module, configured to determine whether the speech conversion model converges according to the first linear spectrogram and the first mel spectrogram; and
a second encoding module, configured to, if it is determined that the training sample data is the second sample pair, input the third speech data into the emotion encoder for encoding operation to obtain a second emotion feature vector, and input the text information into the text encoder for encoding operation to obtain a text feature vector;
the second conversion module is used for inputting the text feature vector and the second emotion feature vector into the feature conversion layer for processing to obtain a second linear spectrogram and a second Mel spectrogram;
a second determining module, configured to determine whether the voice conversion model converges according to the second linear spectrogram and the second Mel spectrogram;
the updating module is used for updating the model parameters of the voice conversion model if the voice conversion model is not converged, and executing the step of acquiring training sample data until the voice conversion model is converged;
the acquisition module is also used for acquiring target voice data to be converted and acquiring reference voice data representing target emotion;
and the input module is used for inputting the target voice data and the reference voice data into the converged voice conversion model to obtain a target voice signal representing the target emotion.
9. A server, characterized in that the server comprises a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the voice conversion method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the voice conversion method according to any one of claims 1 to 7.
CN202110470020.5A 2021-04-28 2021-04-28 Voice conversion method, device, server and storage medium Active CN113178200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110470020.5A CN113178200B (en) 2021-04-28 2021-04-28 Voice conversion method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110470020.5A CN113178200B (en) 2021-04-28 2021-04-28 Voice conversion method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN113178200A true CN113178200A (en) 2021-07-27
CN113178200B CN113178200B (en) 2024-03-01

Family

ID=76925670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110470020.5A Active CN113178200B (en) 2021-04-28 2021-04-28 Voice conversion method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN113178200B (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110379409A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing
CN111247585A (en) * 2019-12-27 2020-06-05 深圳市优必选科技股份有限公司 Voice conversion method, device, equipment and storage medium
KR20200111609A (en) * 2019-12-16 2020-09-29 휴멜로 주식회사 Apparatus for synthesizing speech and method thereof
WO2020232860A1 (en) * 2019-05-22 2020-11-26 平安科技(深圳)有限公司 Speech synthesis method and apparatus, and computer readable storage medium
CN112309365A (en) * 2020-10-21 2021-02-02 北京大米科技有限公司 Training method and device of speech synthesis model, storage medium and electronic equipment
CN112489621A (en) * 2020-11-20 2021-03-12 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN112562634A (en) * 2020-12-02 2021-03-26 平安科技(深圳)有限公司 Multi-style audio synthesis method, device, equipment and storage medium
CN112634919A (en) * 2020-12-18 2021-04-09 平安科技(深圳)有限公司 Voice conversion method and device, computer equipment and storage medium
CN112634867A (en) * 2020-12-11 2021-04-09 平安科技(深圳)有限公司 Model training method, dialect recognition method, device, server and storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113707125A (en) * 2021-08-30 2021-11-26 中国科学院声学研究所 Training method and device for multi-language voice synthesis model
CN113707125B (en) * 2021-08-30 2024-02-27 中国科学院声学研究所 Training method and device for multi-language speech synthesis model
CN113782052A (en) * 2021-11-15 2021-12-10 北京远鉴信息技术有限公司 Tone conversion method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113178200B (en) 2024-03-01

Similar Documents

Publication Publication Date Title
CN107885756B (en) Deep learning-based dialogue method, device and equipment
CN109785824B (en) Training method and device of voice translation model
CN112289342A (en) Generating audio using neural networks
CN112786009A (en) Speech synthesis method, apparatus, device and storage medium
CN111914076B (en) User image construction method, system, terminal and storage medium based on man-machine conversation
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN113096634B (en) Speech synthesis method, device, server and storage medium
CN113178200B (en) Voice conversion method, device, server and storage medium
CN111950275B (en) Emotion recognition method and device based on recurrent neural network and storage medium
CN112837669B (en) Speech synthesis method, device and server
CN113886643A (en) Digital human video generation method and device, electronic equipment and storage medium
CN112632244A (en) Man-machine conversation optimization method and device, computer equipment and storage medium
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
KR20220070826A (en) Utterance manipulation apparatus for retrieval-based response selection and method thereof
CN112580669A (en) Training method and device for voice information
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN112150103B (en) Schedule setting method, schedule setting device and storage medium
CN112818688B (en) Text processing method, device, equipment and storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
KR20230151157A (en) A method of an avatar speech service providing device using TTS and STF technology based on artificial intelligence neural network learning
CN113889130A (en) Voice conversion method, device, equipment and medium
CN114170997A (en) Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment
CN113724690A (en) PPG feature output method, target audio output method and device
CN114333772A (en) Speech recognition method, device, equipment, readable storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant