CN118173079A - Noise robust personalized speech synthesis method and device - Google Patents

Noise robust personalized speech synthesis method and device

Info

Publication number
CN118173079A
CN118173079A
Authority
CN
China
Prior art keywords
module
speaker
target
training
bn2mel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410326279.6A
Other languages
Chinese (zh)
Inventor
成明
吴志勇
雷舜
肖龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202410326279.6A priority Critical patent/CN118173079A/en
Publication of CN118173079A publication Critical patent/CN118173079A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a noise robust personalized speech synthesis method and device. The method comprises the following steps. Decoupling training: multi-speaker speech data without text labels are used as training data; the Mel spectra of the speech data are input into a Mel2BN module, which decouples the timbre information to obtain speaker-independent bottleneck features; the speaker-independent bottleneck features and the speaker identity label are then input into a BN2Mel module, which models the speaker's timbre and recovers the corresponding Mel spectrum. Speech synthesis: the decoupling-trained Mel2BN module is used to convert high-quality speech data with Text labels into paired Text and high-quality bottleneck feature data for training a Text2BN module; when synthesizing speech for a target speaker, the BN2Mel module is first fine-tuned with the target speaker's identity label so that it models the target speaker's timbre information; the target Text is then input into the trained Text2BN module to obtain target bottleneck features, which are converted by the fine-tuned BN2Mel module into a target Mel spectrum containing the target speaker's timbre information.

Description

Noise robust personalized speech synthesis method and device
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a noise robust personalized speech synthesis method and device.
Background
Speech synthesis (text-to-speech, TTS) technology has been widely used in products such as voice assistants, intelligent navigation, and electronic books. However, conventional speech synthesis models require a large amount of high-quality, text-annotated speech data from one or more speakers, and are limited to synthesizing the voices of the speakers in the training dataset. Whether for a single-speaker or a multi-speaker speech synthesis system, a large amount of high-quality data is required per speaker. To customize a speech synthesis system for an arbitrary speaker, a large amount of high-quality, text-annotated speech data must be obtained for that speaker, which is costly and often impractical in real-world applications. Personalized speech synthesis is a special form of speech synthesis, whose technical framework is shown in fig. 1; it uses limited, and possibly lower-quality, speech data to produce natural speech in the voice of a target speaker.
Current personalized speech synthesis technology mainly faces the following two problems: (1) usually only a small amount of speech data from the target speaker is available, so a personalized speech synthesis system must be customized using this limited data; (2) users often cannot record in a quiet environment such as a recording studio with professional recording equipment, so the speech data frequently contains noise. Thus, the speech data available in personalized speech synthesis scenarios is typically small in quantity and low in quality, which makes personalized speech synthesis more challenging. Most current personalized voice cloning technologies depend on high-quality speech data from the target speaker, so the cost of the technology remains high. Achieving noise robust personalized speech synthesis can effectively expand the application scenarios of the technology and promote its practical deployment.
For noise robust speech synthesis tasks, some existing methods learn separate characterizations of speaker identity and background noise in a multi-speaker speech synthesis model through data augmentation and adversarial factorization, thereby decoupling the related speaker and noise attributes, which helps to synthesize clean speech; other schemes obtain a noise characterization with a noise extraction module and use it as a conditional input to the acoustic model to achieve noise robustness. However, it is difficult to effectively model the many different types of noise by explicitly modeling noise, so robustness to out-of-distribution noise is poor; in addition, such approaches suffer from speech information leakage, which reduces the generalization performance of the model.
At present, personalized speech synthesis mainly follows two technical approaches: speaker adaptation (Speaker Adaptation) and speaker embedding (Speaker Embedding). Speaker-adaptation-based personalized speech synthesis fine-tunes a pre-trained multi-speaker speech synthesis model on the limited data of the target speaker, so that the model can adapt to the voice characteristics of the target speaker. Personalized speech synthesis based on speaker embeddings extracts speaker embeddings from reference speech of the target speaker using a speaker encoder, and then inputs these embeddings, together with the intermediate representation of the text, into the decoder of a multi-speaker speech synthesis system to generate speech matching the target speaker's voice. Although speaker-embedding-based methods can achieve zero-shot synthesis, their speech quality and similarity are generally inferior to those of speaker-adaptation-based methods, especially when the reference audio contains noise, in which case the synthesis results are often poor. Unlike these traditional personalized speech synthesis approaches, some methods effectively clone the voice of the target speaker based on a large language model (Large Language Model, LLM), but they require large amounts of corpus data and significant computing resources; in addition, when the target speaker's reference speech contains background noise, the synthesized speech may retain this noise.
For the noise robust personalized speech synthesis task, one direct method is to use a speech enhancement model to remove the noise from noisy audio before voice cloning, and then use an existing personalized speech synthesis method for timbre cloning. However, this may introduce distortion into the denoised audio, degrading the quality of the synthesized speech. A noise-independent intermediate representation can be obtained by extracting content and timbre information from speech through domain adversarial training, mapping speech information from noisy and clean audio into the same domain space; however, this approach essentially denoises the noisy speech in a noise-independent domain space, similar to voice cloning that uses speech enhancement to obtain clean speech. Because the variety and quantity of noise and training data are limited compared with those available to speech enhancement models, this method has weak robustness to out-of-distribution noise and suffers from poor generalization performance. Furthermore, the mismatch between a pre-training step that relies on paired clean/noisy speech and an adaptation step that uses only unpaired noisy speech may lead to domain drift. Still other methods use a self-supervised learning framework for personalized speech synthesis: based on speaker embeddings, they employ data augmentation to obtain a pre-trained model capable of extracting noise robust speaker embeddings, which are then used as acoustic conditions to train a speech synthesis system dedicated to voice cloning. However, as with most personalized speech synthesis methods that rely on speaker embeddings, this technique faces the challenge of insufficient generalization to unseen speakers.
In summary, in existing noise robust personalized speech synthesis technologies based on speech enhancement, distortion of the denoised speech can significantly reduce the quality and similarity of the synthesized speech. Second, for models that operate at the noise level (works based on noise modeling, noise-independent spaces, etc.), the diversity of noise makes it difficult to cover the various types of noise found in real scenes, resulting in poor robustness to out-of-domain noise. Meanwhile, such methods can cause speech information leakage or loss, which reduces the generalization performance of the model.
Disclosure of Invention
In view of this, the present invention proposes a noise robust personalized speech synthesis scheme based on speaker-independent bottleneck features to solve the above-mentioned problems of the existing noise robust personalized speech synthesis technology.
According to one aspect of the present invention, a noise robust personalized speech synthesis method is provided, comprising: (I) a decoupling training stage, comprising the following steps: adopting multi-speaker speech data without text labels as training data; first inputting the Mel spectra corresponding to the speech data into a Mel2BN module, decoupling the timbre information, retaining all other information, and obtaining speaker-independent bottleneck features; then inputting the speaker-independent bottleneck features and the speaker identity label together into a BN2Mel module, and modeling the speaker's timbre so as to recover the corresponding Mel spectrum; and (II) a speech synthesis stage, comprising the following steps: converting high-quality speech data with Text labels into paired Text and high-quality bottleneck feature data using the Mel2BN module trained in the decoupling training stage, for training a Text2BN module; connecting the BN2Mel module trained in the decoupling training stage in series to the output of the trained Text2BN module; when synthesizing speech for a target speaker, first fine-tuning the BN2Mel module with the target speaker's identity label while freezing the parameters of the other modules, so that the BN2Mel module models the timbre information of the target speaker; inputting the target Text into the trained Text2BN module to obtain target bottleneck features, and converting the target bottleneck features through the fine-tuned BN2Mel module into a target Mel spectrum containing the timbre information of the target speaker, thereby realizing speech synthesis for the target speaker; wherein the Mel2BN module is a Mel-spectrum-to-bottleneck-feature conversion module, the BN2Mel module is a bottleneck-feature-to-Mel-spectrum conversion module, and the Text2BN module is a Text-to-bottleneck-feature conversion module.
Further, in the decoupling training stage, a timbre decoupling method based on domain adversarial training, random cycle loss, and speaker consistency loss is adopted to decouple the timbre while retaining all other information, including noise information.
Still further, the timbre decoupling method based on domain adversarial training, random cycle loss, and speaker consistency loss includes: mapping the speech data of different speakers into the same domain space through domain adversarial training to obtain the speaker-independent bottleneck features; introducing a cycle round into the decoupling training process, and introducing a random cycle loss in the cycle round to ensure that, after the speech data are converted into corresponding Mel spectra and re-encoded by the Mel2BN module, their original bottleneck features can be reconstructed; when the BN2Mel module decodes, performing random factor substitution to shuffle the speaker identity labels and decode timbre-converted audio; and introducing a speaker consistency loss to help the BN2Mel module preserve the identity characteristics of the target speaker during the conversion.
Still further, the domain adversarial training includes: a domain adversarial training module comprising a first speaker classifier connected to the Mel2BN module through a gradient reversal layer; in the domain adversarial training, the first speaker classifier classifies speakers according to the bottleneck features output by the Mel2BN module; during gradient back-propagation, the gradient reversal forces the decoupling training model to update in a direction that makes the bottleneck features harder to classify, thereby performing adversarial training; through this adversarial training, the speech data of different speakers can be mapped into the same domain space, realizing speaker-independent bottleneck features.
Still further, the speaker consistency loss is introduced by a pre-trained second speaker classifier that takes as input the Mel spectrum decoded and output by the BN2Mel module in the cycle round.
Still further, the total loss function of the decoupling training stage is:
L_total = L_recon + λ_cyc·L_cyc + λ_adv·L_adv + λ_sc·L_sc
where L_recon represents the Mel spectrum reconstruction loss, L_cyc represents the random cycle loss, L_adv represents the domain adversarial loss, L_sc represents the speaker consistency loss, and λ_cyc, λ_adv, λ_sc are hyper-parameters.
Still further, the Mel spectrum reconstruction loss and the random cycle loss use the mean square error, and the domain adversarial loss and the speaker consistency loss use the cross entropy loss, as follows:
L_recon = MSE(X, X̂), L_cyc = MSE(B_i, B̃_i)
where X represents the real Mel spectrum and X̂ represents the Mel spectrum reconstructed by the BN2Mel module; B_i represents the bottleneck features obtained by passing the Mel spectrum of speaker i through the Mel2BN module; B̃_i represents the bottleneck features obtained by replacing the speaker identity label of speaker i, through random factor substitution, with that of speaker j, inputting it together with B_i into the BN2Mel module to obtain a timbre-converted Mel spectrum, and passing that spectrum through the Mel2BN module again;
L_adv and L_sc both use the cross entropy loss of a multi-class classification task, namely:
L_CE = −Σ_{k=1}^{C} y_k·log(p_k)
where y_k and p_k correspond respectively to the ground-truth label of the k-th category to which the input data belongs and the probability predicted by the speaker classifier, and C represents the number of speaker categories.
Still further, the decoupling training stage further includes dynamic noise enhancement, the dynamic noise enhancement including: during the decoupling training process, adding, with a predetermined probability, noise that has been randomly cropped and scaled to a randomly sampled signal-to-noise ratio to the speech data used for training.
Further, in the speech synthesis stage, inputting the target Text into the trained Text2BN module to obtain target bottleneck features includes:
converting the input target text into a corresponding phoneme sequence through a grapheme-to-phoneme front end;
passing the phoneme sequence through a vector embedding layer to obtain phoneme embeddings; passing the phoneme embeddings through N layers of feed-forward Transformer blocks to obtain phoneme intermediate representations; meanwhile, using a duration predictor to predict the duration, in frames, of each phoneme intermediate representation, and copying the corresponding phoneme intermediate representation to the predicted frame length; and inputting the length-adjusted intermediate representations, with positional encoding, into N layers of Transformer blocks to obtain clean bottleneck features corresponding to the input text sequence.
According to another aspect of the present invention, there is also provided a noise robust personalized speech synthesis apparatus, including a decoupling training model and a speech synthesis model. The decoupling training model comprises a Mel2BN module and a BN2Mel module connected in series to the output of the Mel2BN module; the Mel2BN module is used for decoupling timbre information from the Mel spectra corresponding to the training data, retaining all other information, and obtaining speaker-independent bottleneck features; the BN2Mel module is used for modeling the speaker's timbre, taking the speaker-independent bottleneck features and the speaker identity label as input, so as to recover the corresponding Mel spectrum; wherein the training data are multi-speaker speech data without text labels. The speech synthesis model comprises a Text2BN module and the BN2Mel module trained by the decoupling training model, the trained BN2Mel module being connected in series to the output of the Text2BN module; the Text2BN module is used for converting the target Text into target bottleneck features; the trained BN2Mel module, after being fine-tuned with the target speaker identity label, is used for converting the target bottleneck features into a target Mel spectrum containing the timbre information of the target speaker, so as to realize speech synthesis for the target speaker. The Mel2BN module is a Mel-spectrum-to-bottleneck-feature conversion module, the BN2Mel module is a bottleneck-feature-to-Mel-spectrum conversion module, and the Text2BN module is a Text-to-bottleneck-feature conversion module.
Compared with the prior art, the technical solution of the invention has the following beneficial effects. In the decoupling training stage, the Mel2BN module and the BN2Mel module are trained together on a large amount of multi-speaker speech data without text labels, so that the Mel2BN module acquires sufficient timbre decoupling capability, while the BN2Mel module can reconstruct a Mel spectrum containing the timbre information of a target speaker from bottleneck features that carry no timbre information and a target speaker identity label that carries the timbre information, thereby synthesizing the voice of the target speaker and realizing personalized speech synthesis. When personalized speech synthesis is performed, the user first records a few sentences of speech (with or without noise), the system assigns a new speaker ID to the new user, and the BN2Mel module is fine-tuned. Because the BN2Mel module has already acquired the ability to model different speaker timbres through the decoupling training stage, it can model the timbre of the target speaker through this fine-tuning; at the same time, it can take the clean bottleneck features of the target text as input and output a clean Mel spectrum that corresponds to the target text and contains the timbre information of the target speaker, and the clean speech of the target speaker is finally obtained through a vocoder, realizing personalized speech synthesis of the target text. Because the Text2BN module is trained on high-quality data, clean bottleneck features without noise information can be obtained, and because the BN2Mel module models the speaker timbre information, the noise-free voice of the target speaker can still be synthesized even if the target speaker's speech data contains noise. It can be seen that the present invention has the following advantages:
1) A speech fragment recorded by a user in a noisy environment can be accepted for customized personalized speech synthesis; after the system has adapted to the user's voice, the user provides a piece of text as input. No large amount of speech data from the target speaker is required for training in advance: the target speaker only needs to provide a short segment of speech with no quality requirement, and the trained BN2Mel module is fine-tuned with this speech and the corresponding identity label, after which high-quality synthesized speech can be reconstructed. Establishing a noise robust personalized speech synthesis system makes low-cost personalized speech synthesis possible, expands the application scenarios of personalized speech synthesis, and allows cloning of voices recorded by the target speaker in a variety of recording environments, including but not limited to quiet environments, environments with various kinds of noise, and environments with a single noise type;
2) Although the invention aims to improve the noise robustness of the personalized speech synthesis model, it does not focus on the noise itself and does not operate directly at the noise level; instead, it takes a new approach and operates on speaker timbre, which follows a specific distribution, thereby avoiding the problem of insufficient robustness to out-of-distribution noise;
3) The timbre decoupling training stage does not depend on manually created clean/noisy paired data or text labels, which allows training on the large number of untranscribed noisy speech datasets that exist in practice and improves the decoupling and generalization capabilities;
4) Personalized speech synthesis in a single-noise environment is difficult because the noise is highly coupled with the target speaker's voice. The invention adopts a dynamic noise enhancement method to flatten the speaker-dependent noise distribution, thereby improving robustness in single-noise environments.
Drawings
FIG. 1 is an architecture diagram of a TTS system;
fig. 2 is a flow chart of a personalized speech synthesis method according to an embodiment of the invention.
Fig. 3 is a schematic diagram of a personalized speech synthesis process according to the personalized speech synthesis method of the embodiment of the invention.
Fig. 4 is a schematic diagram of an exemplary tone decoupling method in the personalized speech synthesis method according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of an exemplary Text2BN module in a personalized speech synthesis method according to an embodiment of the invention.
Detailed Description
The invention will be further described below with reference to the drawings and specific embodiments. It should be understood that the examples are provided for illustrative purposes only and are not intended to limit the scope of the present invention in any way.
Abbreviations and key term definitions:
① TTS: text-to-speech conversion, also known as speech synthesis, refers to converting Text information into standard smooth pronunciation;
② Text2BN: text-to-bottleneck, text-bottleneck feature conversion;
③ BN2Mel: bottleneck-to-mel-spectrogram, bottleneck feature-mel spectral transformation; ④ Mel2BN: mel-spectrogram-to-bottleneck, mel spectrum-bottleneck feature transformation; ⑤ ID: identity, identity;
⑥ Personalized speech synthesis: also known as voice cloning. The TTS technology is used for converting given text into voice similar to voice characteristics of a target speaker by utilizing certain voice fragments provided by the target speaker, and is a special scene;
⑦ Robust: robust, transliterated terms, are used in a Robust and strong sense, generally referring to a model or system that has the property of maintaining its performance against abnormal conditions. The invention mainly discusses noise robustness, namely, the system can still normally perform personalized speech synthesis task aiming at the target speaker data with noise.
The embodiment of the invention provides a noise robust personalized speech synthesis method, which comprises two parts: first, timbre decoupling training of the model, and then personalized speech synthesis based on the trained model. Referring to fig. 2, the personalized speech synthesis method according to the embodiment of the invention includes: (I) a decoupling training stage, and (II) a speech synthesis stage. The decoupling training stage mainly trains the Mel2BN module and the BN2Mel module, and specifically comprises the following steps: a large amount of multi-speaker speech data without text labels is used as training data; the Mel spectra corresponding to the speech data are first input into the Mel2BN module, which decouples the timbre information and retains all other information, obtaining speaker-independent bottleneck features; the speaker-independent bottleneck features and the speaker identity label are then input together into the BN2Mel module, which models the speaker's timbre and recovers the corresponding Mel spectrum. The speech synthesis stage mainly uses the trained BN2Mel module connected in series with a Text2BN module to form a TTS system, synthesizing a target Text into speech highly similar to the voice of the target speaker, and specifically comprises the following steps: converting high-quality speech data with Text labels into paired Text and high-quality bottleneck feature data using the decoupling-trained Mel2BN module, and training the Text2BN module; connecting the decoupling-trained BN2Mel module in series to the output of the trained Text2BN module; when synthesizing speech for the target speaker, first fine-tuning the BN2Mel module with the target speaker's identity label while freezing the parameters of the other modules, so that the BN2Mel module models the timbre information of the target speaker; inputting the target Text into the trained Text2BN module to obtain target bottleneck features, and converting the target bottleneck features through the fine-tuned BN2Mel module into a target Mel spectrum containing the timbre information of the target speaker, realizing speech synthesis for the target speaker.
The decoupling training stage imposes no strict requirements on the training data: the speech data may be clean or noisy, and no text annotation is required. The Mel2BN module decouples the timbre information and retains all speech information other than timbre, obtaining speaker-independent bottleneck features. The BN2Mel module models the speaker timbre according to the bottleneck features and the speaker identity (ID) label so as to recover the corresponding Mel spectrum.
Through the training of the decoupling training stage, the BN2Mel module is able to reconstruct a Mel spectrum containing the target speaker's timbre information from the input target speaker identity label (together with a segment of the target speaker's speech data) and speaker-independent bottleneck features. Thus, as shown in fig. 3, when personalized speech synthesis is required, for example when a user wants a text to be synthesized in his or her own voice, the user only needs to record a few sentences of speech (with or without noise). The system generates a new speaker ID for the new user; the speaker ID is a label that distinguishes different speakers and is a unique label assigned to a particular speaker, from which a speaker embedding (a vector) is obtained through a speaker embedding table. In the speech synthesis system, the speech of the same speaker is assigned the same speaker ID, and the speaker embedding obtained is also the same. For a new target speaker, a new speaker ID is assigned; alternatively, the ID of the speaker in the pre-training dataset whose timbre is most similar can be assigned as its ID by the speaker classifier. The timbre information is essentially embedded in the model parameters of the BN2Mel module: given the embedding of a certain speaker and other speech information without timbre, different speaker embeddings activate the timbre-related parameters in the BN2Mel model, thereby recovering the corresponding timbre information and reconstructing the audio. In the fine-tuning stage, the BN2Mel module adjusts the timbre information it models under the target speaker's embedding, so that the timbre modeled under that embedding increasingly resembles the target speaker. The fine-tuning only updates the parameters of the BN2Mel module and freezes the parameters of the other modules, so that the BN2Mel module gains the ability to model the user's timbre. The Text input by the user is passed through the Text2BN module, which outputs the clean bottleneck features corresponding to the Text; the fine-tuned BN2Mel module converts these clean bottleneck features into a Mel spectrum containing the user's timbre; after passing through the vocoder, noise-free speech of the text is output, and the synthesized speech has the same timbre as the user.
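For illustration only, the following is a minimal PyTorch-style sketch of this adaptation step, assuming mel2bn and bn2mel are callables with the interfaces described above; the function name, optimizer, learning rate and step count are illustrative assumptions and not part of the disclosed embodiment.

    import torch
    import torch.nn.functional as F

    def finetune_bn2mel(bn2mel, mel2bn, target_mels, target_speaker_id,
                        steps=500, lr=1e-4):
        # Freeze the timbre-decoupling encoder; only BN2Mel (including its
        # speaker embedding table) is adapted to the new speaker ID.
        for p in mel2bn.parameters():
            p.requires_grad = False
        optimizer = torch.optim.Adam(bn2mel.parameters(), lr=lr)
        spk = torch.tensor([target_speaker_id])
        for _ in range(steps):
            for mel in target_mels:                  # mel: (1, T, n_mels)
                with torch.no_grad():
                    bn = mel2bn(mel)                 # speaker-independent bottleneck
                mel_hat = bn2mel(bn, spk)            # reconstruct with target timbre
                loss = F.mse_loss(mel_hat, mel)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return bn2mel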
In some embodiments, the decoupling training stage may employ a timbre decoupling method based on domain adversarial training, random cycle loss, and speaker consistency loss to decouple the timbre while retaining all other information, including noise information. Referring to fig. 4, to ensure that the bottleneck features are completely decoupled from speaker information, a domain adversarial training module is introduced in the decoupling training stage; it includes a speaker classifier connected to the Mel2BN module of the main model through a gradient reversal layer (GRL). When training the Mel2BN module, the speaker classifier attempts to accurately classify the speaker based on the bottleneck features; however, as its gradient is back-propagated to the main model, the gradient is reversed by the GRL, and the reversed gradient direction forces the main model to update in a direction that makes the bottleneck features harder to classify. With such adversarial training, the speech data of different speakers can be mapped into the same domain space, thereby achieving speaker-independent bottleneck features.
The speaker classifier is existing technology and can be constructed with different neural network layers. The goal of the domain adversarial training module is to use deep learning to classify speech by speaker through a neural network, i.e., different utterances of the same speaker are classified into the same category, and utterances of different speakers are classified into different categories.
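As an illustrative sketch of the gradient reversal layer described above (a standard construction, shown here in PyTorch; the class and function names are assumptions chosen for this illustration):

    import torch

    class GradReverse(torch.autograd.Function):
        # Identity in the forward pass; multiplies the incoming gradient by
        # -lambda in the backward pass, so the encoder updates in the direction
        # that makes the bottleneck features harder to classify.
        @staticmethod
        def forward(ctx, x, lambd=1.0):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)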
Random cycle loss (Random Cycle Loss) is a speech decoupling method used in voice conversion tasks; it introduces the concept of a random cycle loss in combination with random factor substitution. By enforcing consistency between analysis and reconstruction, the random cycle loss helps to achieve independent and separate encodings, which is critical for effective voice conversion. This is achieved by minimizing the difference between the original encoding and the re-encoding of the speech signal reconstructed from that encoding.
In addition to the conventional reconstruction pass, a cycle pass is introduced into the timbre decoupling training, as in fig. 4. When decoding with the BN2Mel module, random factor substitution (Random Factor Substitution, RFS) is performed to shuffle the speaker IDs, and timbre-converted audio is decoded. However, because corresponding ground-truth data are not available, the random cycle loss is employed to ensure that the data, after conversion and re-encoding by the Mel2BN module, can reconstruct their original bottleneck features. This loss helps to learn a stronger and more independent decoupled representation, ensuring the integrity and reversibility of the information during the conversion process. In the cycle pass, if the timbre decoupling is good enough, timbre-converted audio can be obtained through RFS. Therefore, a pre-trained speaker classifier is introduced in the cycle pass; it takes as input the Mel spectrum decoded and output by the BN2Mel module and provides the speaker consistency loss for the timbre decoupling model, so that the model preserves the identity characteristics of the target speaker during conversion and the conversion accuracy and naturalness are improved. By combining the random cycle loss with the speaker consistency loss, the integrity and reversibility of information during conversion is ensured, while the converted speech is kept consistent with the timbre of the target speaker. The random cycle loss helps the model learn a robust mapping from the source speaker to the target speaker and back to the source speaker, thereby reducing information loss during conversion. Meanwhile, the speaker consistency loss ensures that the conversion respects the unique attributes of the target speaker, so that the generated speech sounds more natural and carries the characteristics of the target speaker. Through these strategies, the timbre can be effectively decoupled from the speech signal, and bottleneck features that are completely decoupled from and independent of timbre are obtained, which facilitates subsequent fine-tuning with a new target speaker.
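A minimal sketch of one such cycle pass is given below for illustration, assuming mel2bn, bn2mel and a frozen pre-trained spk_classifier with the interfaces described above; detaching the first encoding as the cycle target is one possible design choice for this sketch, not necessarily that of the embodiment.

    import torch
    import torch.nn.functional as F

    def cycle_round(mel, spk_ids, mel2bn, bn2mel, spk_classifier):
        # One cycle pass: encode, convert timbre via shuffled speaker IDs (RFS),
        # then re-encode the converted audio and compare the bottlenecks.
        bn = mel2bn(mel)                                 # (B, T, D) speaker-independent
        perm = torch.randperm(spk_ids.size(0))
        spk_shuffled = spk_ids[perm]                     # random factor substitution
        mel_conv = bn2mel(bn, spk_shuffled)              # timbre-converted Mel spectrum
        bn_cycle = mel2bn(mel_conv)                      # re-encode the converted audio
        loss_cyc = F.mse_loss(bn_cycle, bn.detach())     # random cycle loss
        logits = spk_classifier(mel_conv)                # pre-trained, kept frozen
        loss_sc = F.cross_entropy(logits, spk_shuffled)  # speaker consistency loss
        return loss_cyc, loss_sc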
As a specific implementation of the speaker classifier for the speaker consistency loss and the speaker classifier for domain adversarial training, the former may consist of an LSTM layer, a linear layer, and a Softmax layer, while the latter adds a gradient reversal layer in front of the former for adversarial training.
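For illustration, a possible sketch of such a classifier is shown below; the dimensions and layer sizes are assumptions, grad_reverse refers to the gradient reversal sketch above, and the Softmax layer is folded into the cross-entropy loss in this sketch.

    import torch
    import torch.nn as nn

    class SpeakerClassifier(nn.Module):
        # LSTM + linear classifier over frame sequences (Mel or bottleneck frames).
        # With use_grl=True the input first passes through the gradient reversal
        # layer, turning it into the adversarial (domain) classifier.
        def __init__(self, input_dim, hidden_dim, num_speakers, use_grl=False):
            super().__init__()
            self.use_grl = use_grl
            self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
            self.proj = nn.Linear(hidden_dim, num_speakers)

        def forward(self, x):              # x: (B, T, input_dim)
            if self.use_grl:
                x = grad_reverse(x)        # from the GRL sketch above
            _, (h, _) = self.lstm(x)       # h: (num_layers, B, hidden_dim)
            logits = self.proj(h[-1])      # utterance-level logits
            return logits                  # Softmax is applied inside cross_entropy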
A dynamic noise enhancement strategy is used to augment the data, enabling the timbre decoupling module to more accurately separate the information related to timbre. In the training of the timbre decoupling module, noise is added to the speech data with a predetermined probability, for example 50%. The model is therefore trained on clean and noisy data simultaneously, which makes it noise robust to some extent. When noise needs to be added to the speech data, a noise sample is randomly selected from the noise dataset, randomly cropped, scaled to a randomly sampled signal-to-noise ratio, and mixed with the speech data. This process introduces multiple layers of randomness and in effect provides almost unlimited augmentation of the original dataset.
Because the speech dataset inevitably contains some speaker-dependent noise or reverberation, such noise can easily be mistaken for part of the speaker's characteristics when speaker ID information is used for speech decoupling, thereby affecting the decoupling process. Therefore, by mixing each utterance with randomly selected background noise at a randomly sampled signal-to-noise ratio while keeping the speaker label unchanged, the speaker-dependent signal-to-noise ratio distribution is substantially flattened. This reduces the correlation between signal-to-noise ratio and speaker and weakens the influence of speaker-dependent background conditions, thereby promoting more effective timbre decoupling and improving the robustness of the model in single-noise environments.
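A minimal sketch of this dynamic noise enhancement is given below for illustration; the mixing probability, SNR range and helper name are illustrative assumptions rather than values from the embodiment.

    import random
    import torch

    def dynamic_noise_augment(speech, noise_bank, p=0.5, snr_range=(0.0, 20.0)):
        # With probability p, mix a randomly cropped noise clip into the utterance
        # at a randomly sampled SNR; the speaker label is left unchanged.
        if random.random() > p:
            return speech                                  # keep clean with prob 1-p
        noise = random.choice(noise_bank)                  # 1-D waveform tensor
        if noise.size(-1) < speech.size(-1):               # loop short noise clips
            reps = speech.size(-1) // noise.size(-1) + 1
            noise = noise.repeat(reps)[: speech.size(-1)]
        else:                                              # random crop
            start = random.randint(0, noise.size(-1) - speech.size(-1))
            noise = noise[start: start + speech.size(-1)]
        snr_db = random.uniform(*snr_range)
        speech_power = speech.pow(2).mean()
        noise_power = noise.pow(2).mean().clamp_min(1e-8)
        scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + scale * noise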
The total loss function of the timbre decoupling training objective is as follows:
L_total = L_recon + λ_cyc·L_cyc + λ_adv·L_adv + λ_sc·L_sc
where L_recon represents the Mel spectrum reconstruction loss, L_cyc represents the random cycle loss, L_adv represents the domain adversarial loss, L_sc represents the speaker consistency loss, and λ_cyc, λ_adv, λ_sc are hyper-parameters.
In some embodiments of the invention, the Mel spectrum reconstruction loss and the random cycle loss use the mean square error, and the domain adversarial loss and the speaker consistency loss use the cross entropy loss, as follows:
L_recon = MSE(X, X̂), L_cyc = MSE(B_i, B̃_i)
where X represents the real Mel spectrum and X̂ represents the Mel spectrum reconstructed by the BN2Mel module; B_i represents the bottleneck features obtained by passing the Mel spectrum of speaker i through the Mel2BN module; B̃_i represents the bottleneck features obtained by replacing the speaker identity label of speaker i, through random factor substitution, with that of speaker j, inputting it together with B_i into the BN2Mel module to obtain a timbre-converted Mel spectrum, and passing that spectrum through the Mel2BN module again;
L_adv and L_sc both use the cross entropy loss of a multi-class classification task, namely:
L_CE = −Σ_{k=1}^{C} y_k·log(p_k)
where y_k and p_k correspond respectively to the ground-truth label of the k-th category to which the input data belongs and the probability predicted by the speaker classifier, and C represents the number of speaker categories.
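For illustration, the total objective can be assembled as in the following sketch, reusing the cycle_round helper from the earlier sketch; the λ values shown are placeholders rather than the hyper-parameters of the embodiment, and the adversarial classifier is assumed to contain the gradient reversal layer internally.

    import torch.nn.functional as F

    def decoupling_loss(mel, spk_ids, mel2bn, bn2mel, adv_classifier, spk_classifier,
                        lambda_cyc=1.0, lambda_adv=0.1, lambda_sc=1.0):
        # L_total = L_recon + lambda_cyc*L_cyc + lambda_adv*L_adv + lambda_sc*L_sc
        bn = mel2bn(mel)                                       # speaker-independent bottleneck
        mel_hat = bn2mel(bn, spk_ids)                          # reconstruction pass
        l_recon = F.mse_loss(mel_hat, mel)                     # Mel reconstruction (MSE)
        l_adv = F.cross_entropy(adv_classifier(bn), spk_ids)   # GRL inside adv_classifier
        l_cyc, l_sc = cycle_round(mel, spk_ids, mel2bn, bn2mel, spk_classifier)
        return l_recon + lambda_cyc * l_cyc + lambda_adv * l_adv + lambda_sc * l_sc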
For the pre-trained timbre decoupling model, the key point is that the timbre information can be accurately decoupled from the data during pre-training while all speech information other than timbre, including the recording environment (i.e., noise information), is retained. Therefore, the specific decoupling constraints can take many forms, and concrete implementations can be based on methods such as adversarial training, mutual-information decoupling, forced decoupling by constraining the bottleneck feature dimension, and bias induction. A pre-trained voice conversion model can also be used for data augmentation to obtain timbre-independent bottleneck features.
As for the network architecture of the Mel2BN module and the BN2Mel module, some embodiments of the invention implement a speaker-independent-bottleneck-feature decoupling network based on a convolutional neural network (CNN) and long short-term memory network (LSTM) autoencoder (Auto-Encoder) framework, in which the Mel2BN module consists of a set of convolutional layers with 5×1 convolution kernels and a set of bidirectional long short-term memory network layers (BiLSTM), with the output finally normalized to the [-1, 1] range using the tanh activation function; the BN2Mel module consists of a set of BiLSTM layers and a linear layer. However, this is only an example; those skilled in the art will appreciate that only an encoder-decoder architecture is required, and that the simplest autoencoder, a variational autoencoder (VAE), a vector-quantized variational autoencoder (VQ-VAE), etc. may also be used.
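A possible sketch of this CNN + BiLSTM autoencoder arrangement is given below for illustration; the channel counts, bottleneck dimension and speaker-embedding size are assumptions, and only the overall structure (5×1 convolutions plus BiLSTM with tanh output for Mel2BN, BiLSTM plus linear layer with a speaker embedding for BN2Mel) follows the description.

    import torch
    import torch.nn as nn

    class Mel2BN(nn.Module):
        # Conv(5x1) stack + BiLSTM encoder; tanh keeps the bottleneck in [-1, 1].
        def __init__(self, n_mels=80, hidden=256, bn_dim=64, n_conv=3):
            super().__init__()
            convs, in_ch = [], n_mels
            for _ in range(n_conv):
                convs += [nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2),
                          nn.BatchNorm1d(hidden), nn.ReLU()]
                in_ch = hidden
            self.convs = nn.Sequential(*convs)
            self.blstm = nn.LSTM(hidden, bn_dim // 2, batch_first=True,
                                 bidirectional=True)

        def forward(self, mel):                   # mel: (B, T, n_mels)
            x = self.convs(mel.transpose(1, 2)).transpose(1, 2)
            x, _ = self.blstm(x)
            return torch.tanh(x)                  # (B, T, bn_dim)

    class BN2Mel(nn.Module):
        # BiLSTM + linear decoder conditioned on a learned speaker embedding.
        def __init__(self, bn_dim=64, spk_dim=128, hidden=256, n_mels=80,
                     n_speakers=1000):
            super().__init__()
            self.spk_emb = nn.Embedding(n_speakers, spk_dim)
            self.blstm = nn.LSTM(bn_dim + spk_dim, hidden // 2,
                                 batch_first=True, bidirectional=True)
            self.proj = nn.Linear(hidden, n_mels)

        def forward(self, bn, spk_ids):           # bn: (B, T, bn_dim)
            spk = self.spk_emb(spk_ids).unsqueeze(1).expand(-1, bn.size(1), -1)
            x, _ = self.blstm(torch.cat([bn, spk], dim=-1))
            return self.proj(x)                   # (B, T, n_mels)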
The speech synthesis part of the hierarchical framework consists of the BN2Mel module from the timbre decoupling part and a Text2BN module trained using high-quality speech data; that is, the Mel2BN module and the BN2Mel module are trained in the aforementioned decoupling training stage and then used in the speech synthesis stage. The Text2BN module converts the text sequence into a corresponding phoneme sequence using an existing grapheme-to-phoneme (G2P) front end. The Text2BN module first obtains phoneme embeddings of the phoneme sequence through an embedding layer, and finally outputs the corresponding bottleneck features through a series of network structures.
The embodiment of the invention provides an implementation of the Text2BN module based on the FastSpeech model, as shown in fig. 5, where FFT denotes a feed-forward Transformer block and LR denotes length regulation. The phoneme embeddings first pass through N layers of feed-forward Transformer blocks to obtain phoneme intermediate representations; the duration predictor predicts the duration (frame length) of each phoneme intermediate representation and copies the corresponding intermediate representation to the predicted frame length (for a concrete implementation, reference may be made to Ren Y, Ruan Y, Tan X, et al. FastSpeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems, 2019, 32). The length-adjusted intermediate representations are input, with positional encoding, into N layers of Transformer blocks, thereby obtaining the clean bottleneck features corresponding to the input text sequence.
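For illustration, the length regulation (LR) step can be sketched as follows; the function name and tensor shapes are assumptions made for this sketch.

    import torch
    import torch.nn as nn

    def length_regulate(phoneme_hidden, durations):
        # Repeat each phoneme-level hidden vector by its predicted frame duration.
        # phoneme_hidden: (B, L, D); durations: (B, L) integer frame counts
        expanded = []
        for hid, dur in zip(phoneme_hidden, durations):
            expanded.append(torch.repeat_interleave(hid, dur, dim=0))  # (sum(dur), D)
        return nn.utils.rnn.pad_sequence(expanded, batch_first=True)   # (B, T_max, D)

    # Toy usage: two phonemes with durations [2, 3] expand to 5 frames.
    h = torch.randn(1, 2, 8)
    d = torch.tensor([[2, 3]])
    frames = length_regulate(h, d)    # frames.shape == (1, 5, 8)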
It should be appreciated that the Text2BN module may also be implemented using the off-the-shelf network structures of other speech synthesis models; the key is the ability to model sequence data so as to obtain bottleneck-feature output from text input. For example, it can be constructed using an autoregressive model based on RNNs or LSTMs, or a non-autoregressive model built from stacked Transformer modules.
The trained Text2BN module can convert any given text into the corresponding clean bottleneck features, and connecting it in series with the BN2Mel module that has been fine-tuned and adapted to the target speaker allows the corresponding clean speech to be synthesized. Experiments show that the technical solution provided by the invention can effectively use noisy speech data of the target speaker for personalized speech synthesis and obtain clean synthesized speech with the timbre of the target speaker.
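For illustration, the end-to-end synthesis path described above can be sketched as follows, assuming g2p, text2bn, the fine-tuned bn2mel and a vocoder are available as callables; all names are illustrative assumptions.

    import torch

    @torch.no_grad()
    def synthesize(text, g2p, text2bn, bn2mel_finetuned, vocoder, target_speaker_id):
        # Target text -> phonemes -> clean bottleneck -> target-timbre Mel -> waveform.
        phonemes = g2p(text)                      # grapheme-to-phoneme front end
        bn = text2bn(phonemes)                    # clean, speaker-independent bottleneck
        spk = torch.tensor([target_speaker_id])
        mel = bn2mel_finetuned(bn, spk)           # adds the target speaker's timbre
        return vocoder(mel)                       # clean waveform in the target voice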
Another embodiment of the present invention provides a noise robust personalized speech synthesis apparatus, including a decoupling training model and a speech synthesis model. The decoupling training model comprises a Mel2BN module and a BN2Mel module connected in series to the output of the Mel2BN module; the Mel2BN module is used for decoupling timbre information from the Mel spectra corresponding to the training data, retaining all other information, and obtaining speaker-independent bottleneck features; the BN2Mel module is used for modeling the speaker's timbre, taking the speaker-independent bottleneck features and the speaker identity label as input, so as to recover the corresponding Mel spectrum; wherein the training data are multi-speaker speech data without text labels. The speech synthesis model comprises a Text2BN module and the BN2Mel module trained by the decoupling training model, the trained BN2Mel module being connected in series to the output of the Text2BN module; the Text2BN module is used for converting the target Text into target bottleneck features; the trained BN2Mel module, after being fine-tuned with the target speaker identity label, is used for converting the target bottleneck features into a target Mel spectrum containing the timbre information of the target speaker, so as to realize speech synthesis for the target speaker.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several equivalent substitutions and obvious modifications can be made without departing from the spirit of the invention, and the same should be considered to be within the scope of the invention.

Claims (10)

1. A noise robust personalized speech synthesis method, comprising:
(I) a decoupling training stage, comprising the following steps:
adopting multi-speaker speech data without text labels as training data; first inputting the Mel spectra corresponding to the speech data into a Mel2BN module, decoupling the timbre information, retaining all other information, and obtaining speaker-independent bottleneck features; then inputting the speaker-independent bottleneck features and the speaker identity label together into a BN2Mel module, and modeling the speaker's timbre so as to recover the corresponding Mel spectrum;
(II) a speech synthesis stage, comprising the following steps:
converting high-quality speech data with Text labels into paired Text and high-quality bottleneck feature data using the Mel2BN module trained in the decoupling training stage, for training a Text2BN module; connecting the BN2Mel module trained in the decoupling training stage in series to the output of the trained Text2BN module; when synthesizing speech for a target speaker, first fine-tuning the BN2Mel module with the target speaker's identity label while freezing the parameters of the other modules, so that the BN2Mel module models the timbre information of the target speaker; inputting the target Text into the trained Text2BN module to obtain target bottleneck features, and converting the target bottleneck features through the fine-tuned BN2Mel module into a target Mel spectrum containing the timbre information of the target speaker, realizing speech synthesis for the target speaker;
wherein the Mel2BN module is a Mel-spectrum-to-bottleneck-feature conversion module, the BN2Mel module is a bottleneck-feature-to-Mel-spectrum conversion module, and the Text2BN module is a Text-to-bottleneck-feature conversion module.
2. The noise robust personalized speech synthesis method according to claim 1, wherein: in the decoupling training stage, a timbre decoupling method based on domain adversarial training, random cycle loss, and speaker consistency loss is adopted to decouple the timbre while retaining all other information, including noise information.
3. The noise robust personalized speech synthesis method according to claim 2, wherein the timbre decoupling method based on domain adversarial training, random cycle loss, and speaker consistency loss comprises:
mapping the speech data of different speakers into the same domain space through domain adversarial training to obtain the speaker-independent bottleneck features;
introducing a cycle round into the decoupling training process, and introducing a random cycle loss in the cycle round to ensure that, after the speech data are converted into corresponding Mel spectra and re-encoded by the Mel2BN module, their original bottleneck features can be reconstructed; when the BN2Mel module decodes, performing random factor substitution to shuffle the speaker identity labels and decode timbre-converted audio; and introducing a speaker consistency loss to help the BN2Mel module preserve the identity characteristics of the target speaker during the conversion.
4. The noise robust personalized speech synthesis method according to claim 3, wherein the domain adversarial training comprises: a domain adversarial training module comprising a first speaker classifier connected to the Mel2BN module through a gradient reversal layer; in the domain adversarial training, the first speaker classifier classifies speakers according to the bottleneck features output by the Mel2BN module; during gradient back-propagation, the gradient reversal forces the decoupling training model to update in a direction that makes the bottleneck features harder to classify, thereby performing adversarial training, by which the speech data of different speakers can be mapped into the same domain space, realizing speaker-independent bottleneck features.
5. The noise robust personalized speech synthesis method according to claim 3, wherein the speaker consistency loss is introduced by a pre-trained second speaker classifier that takes as input the Mel spectrum decoded and output by the BN2Mel module in the cycle round.
6. The noise robust personalized speech synthesis method according to claim 3, wherein the total loss function of the decoupling training stage is:
L_total = L_recon + λ_cyc·L_cyc + λ_adv·L_adv + λ_sc·L_sc
where L_recon represents the Mel spectrum reconstruction loss, L_cyc represents the random cycle loss, L_adv represents the domain adversarial loss, L_sc represents the speaker consistency loss, and λ_cyc, λ_adv, λ_sc are hyper-parameters.
7. The noise robust personalized speech synthesis method according to claim 6, wherein the Mel spectrum reconstruction loss and the random cycle loss use the mean square error, and the domain adversarial loss and the speaker consistency loss use the cross entropy loss, as follows:
L_recon = MSE(X, X̂), L_cyc = MSE(B_i, B̃_i)
where X represents the real Mel spectrum and X̂ represents the Mel spectrum reconstructed by the BN2Mel module; B_i represents the bottleneck features obtained by passing the Mel spectrum of speaker i through the Mel2BN module; B̃_i represents the bottleneck features obtained by replacing the speaker identity label of speaker i, through random factor substitution, with that of speaker j, inputting it together with B_i into the BN2Mel module to obtain a timbre-converted Mel spectrum, and passing that spectrum through the Mel2BN module again;
L_adv and L_sc both use the cross entropy loss of a multi-class classification task, namely:
L_CE = −Σ_{k=1}^{C} y_k·log(p_k)
where y_k and p_k correspond respectively to the ground-truth label of the k-th category to which the input data belongs and the probability predicted by the speaker classifier, and C represents the number of speaker categories.
8. The noise robust personalized speech synthesis method according to claim 1, wherein the decoupling training stage further comprises dynamic noise enhancement, the dynamic noise enhancement comprising: during the decoupling training process, adding, with a predetermined probability, noise that has been randomly cropped and scaled to a randomly sampled signal-to-noise ratio to the speech data used for training.
9. The noise robust personalized speech synthesis method according to claim 1, wherein, in the speech synthesis stage, inputting the target Text into the trained Text2BN module to obtain target bottleneck features comprises:
converting the input target text into a corresponding phoneme sequence through a grapheme-to-phoneme front end;
passing the phoneme sequence through a vector embedding layer to obtain phoneme embeddings; passing the phoneme embeddings through N layers of feed-forward Transformer blocks to obtain phoneme intermediate representations; meanwhile, using a duration predictor to predict the duration, in frames, of each phoneme intermediate representation, and copying the corresponding phoneme intermediate representation to the predicted frame length; and inputting the length-adjusted intermediate representations, with positional encoding, into N layers of Transformer blocks to obtain clean bottleneck features corresponding to the input text sequence.
10. A noise robust personalized speech synthesis apparatus, characterized in that it comprises a decoupling training model and a speech synthesis model;
the decoupling training model comprises a Mel2BN module and a BN2Mel module connected in series to the output of the Mel2BN module; the Mel2BN module is used for decoupling timbre information from the Mel spectra corresponding to the training data, retaining all other information, and obtaining speaker-independent bottleneck features; the BN2Mel module is used for modeling the speaker's timbre, taking the speaker-independent bottleneck features and the speaker identity label as input, so as to recover the corresponding Mel spectrum; wherein the training data are multi-speaker speech data without text labels;
the speech synthesis model comprises a Text2BN module and the BN2Mel module trained by the decoupling training model, the trained BN2Mel module being connected in series to the output of the Text2BN module; the Text2BN module is used for converting the target Text into target bottleneck features; the trained BN2Mel module, after being fine-tuned with the target speaker identity label, is used for converting the target bottleneck features into a target Mel spectrum containing the timbre information of the target speaker, so as to realize speech synthesis for the target speaker;
wherein the Mel2BN module is a Mel-spectrum-to-bottleneck-feature conversion module, the BN2Mel module is a bottleneck-feature-to-Mel-spectrum conversion module, and the Text2BN module is a Text-to-bottleneck-feature conversion module.
CN202410326279.6A 2024-03-21 2024-03-21 Noise robust personalized speech synthesis method and device Pending CN118173079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410326279.6A CN118173079A (en) 2024-03-21 2024-03-21 Noise robust personalized speech synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410326279.6A CN118173079A (en) 2024-03-21 2024-03-21 Noise robust personalized speech synthesis method and device

Publications (1)

Publication Number Publication Date
CN118173079A true CN118173079A (en) 2024-06-11

Family

ID=91348426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410326279.6A Pending CN118173079A (en) 2024-03-21 2024-03-21 Noise robust personalized speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN118173079A (en)

Similar Documents

Publication Publication Date Title
Liu et al. Audioldm: Text-to-audio generation with latent diffusion models
CN111954903B (en) Multi-speaker neuro-text-to-speech synthesis
CN112802448B (en) Speech synthesis method and system for generating new tone
CN112802450B (en) Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof
CN108899009B (en) Chinese speech synthesis system based on phoneme
WO2021225829A1 (en) Speech recognition using unspoken text and speech synthesis
CN113470662A (en) Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
WO2021002967A1 (en) Multilingual neural text-to-speech synthesis
Du et al. Speaker augmentation for low resource speech recognition
US11475874B2 (en) Generating diverse and natural text-to-speech samples
CN113470622B (en) Conversion method and device capable of converting any voice into multiple voices
CN113053357A (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN113539232A (en) Muslim class voice data set-based voice synthesis method
CN101178895A (en) Model self-adapting method based on generating parameter listen-feel error minimize
CN112908293B (en) Method and device for correcting pronunciations of polyphones based on semantic attention mechanism
KR102473685B1 (en) Style speech synthesis apparatus and speech synthesis method using style encoding network
CN116312476A (en) Speech synthesis method and device, storage medium and electronic equipment
Nazir et al. Deep learning end to end speech synthesis: A review
WO2023197206A1 (en) Personalized and dynamic text to speech voice cloning using incompletely trained text to speech models
US11915714B2 (en) Neural pitch-shifting and time-stretching
CN118173079A (en) Noise robust personalized speech synthesis method and device
CN114446278A (en) Speech synthesis method and apparatus, device and storage medium
Ngoc et al. Adapt-Tts: High-Quality Zero-Shot Multi-Speaker Text-to-Speech Adaptive-Based for Vietnamese
CN115240630B (en) Method and system for converting Chinese text into personalized voice
Hono et al. PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a Diffusion Probabilistic Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination