CN114842825A - Emotion migration voice synthesis method and system - Google Patents
- Publication number
- CN114842825A (application number CN202210414220.3A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- vector
- text
- module
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The invention relates to the technical field of voice synthesis, and in particular to an emotion migration voice synthesis method and system. The method comprises the following steps: obtaining a text encoding vector; obtaining an emotion style vector; obtaining a text-speech alignment sequence; and inputting the speaker identity ID into a speech frame decoder, which processes the text-speech alignment sequence and decodes it to obtain Mel-spectrum features. The emotion information extraction module provided by the invention completely decouples the speaker information and the emotion information in the audio features, so that the emotion encoding vector contains only the emotion information in the audio, which improves the similarity between the encoding vector and the emotion information it represents. The emotion encoding vector and the speaker information can be freely combined, realizing the task of transferring emotion information from the audio data of a source speaker to a target speaker who lacks emotional data; and by weighting the emotion encoding vector, the strength of the emotion can be controlled more easily.
Description
Technical Field
The invention relates to the technical field of voice synthesis, in particular to an emotion migration voice synthesis method and system.
Background
A deep-learning-based speech synthesis (TTS) model, typically consisting of an encoder-decoder neural network, is trained to map a given text sequence to a sequence of speech frames. Such a model can also effectively model and control voice style (e.g., speaker identity, speaking style, emotion, and prosody).
Many current TTS application scenarios, such as audiobook narration, news broadcasting, and conversational assistants, require single-speaker, multi-style speech synthesis. For example, a single synthesized speaker may need to speak with a variety of emotions such as happiness, sadness, or fear. However, acquiring multi-style voice data from a single speaker is difficult, so performance in emotional voice synthesis is correspondingly insufficient.
Currently, most neural TTS models are trained using single-emotional-style corpora with little expressiveness (e.g., flat speech data without any emotion). Obtaining large amounts of single-speaker speech data in a variety of styles would help train a good TTS system, but such data collection is expensive and time-consuming. Speech emotion synthesis must therefore solve two key problems. First, the synthesized speech should convey emotion correctly and accurately; in particular, the emotion it conveys should be easily perceived by the listener without confusion. Second, emotional expression should be controllable in a flexible and reliable manner.
An effective solution is to adopt a voice style migration method, which allows the target speaker to accurately learn the desired emotional information from voice data of the same style recorded by other speakers while preserving the target speaker's timbre. In this way, the various emotional styles and timbres in a multi-speaker corpus can be combined to generate expressive speech.
In the area of multi-emotional-style synthesis, the following solutions are typical:
First, the most direct scheme is to add an emotion style label to the TTS model; the label serves as a model input that directly participates in speech synthesis and controls the emotion category of the synthesized speech.
Second, cross-speaker style transfer models based on encoding vectors usually rely on several general encoding-embedding methods: the reference encoder ([1] "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron", [2] "Controllable emotion transfer for end-to-end speech synthesis"), Global Style Tokens (GST) ([3] "Style Tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis"), and the variational autoencoder (VAE) ([4] "Learning latent representations for style control and transfer in end-to-end speech synthesis"). These methods extract an encoding vector expressing emotion information from emotional speech and add it to the TTS system through a simple or complex splicing method, which can satisfy the requirement of emotional style transfer.
However, the prior art has the following disadvantages:
(1) Adding an emotional style label: this approach relies heavily on training data and requires each speaker to have ample data for every emotion. Moreover, it cannot combine the voices and styles of different speakers to create new synthesized speech, and it cannot directly control the strength of the emotion.
(2) Cross-speaker models based on encoding vectors: first, the style encoding vectors obtained by these methods often contain too much entangled information (speaker identity, emotion, speaking style, speech rate, etc.) and therefore fail to represent the emotion information by itself.
Second, the emotion expressed in the synthesized audio is often an averaged emotion whose expression is not distinct enough, and selecting a suitable emotional utterance from which to extract the encoding vector is itself a difficult problem. In addition, control over certain speech characteristics, such as speaker identity and speaking style, is not effective enough.
Finally, in emotion style migration, these methods often have to migrate all of the source speaker's styles (speaking style, speech rate, emotional expression, etc.), which does not match the requirement of migrating emotion alone and seriously harms the generalization ability of the model. When performing style control, it is difficult to find a direct relationship between a target style and the parameters of its encoding vector through which style variation could be controlled.
Aiming at these defects of conventional emotional voice synthesis, the invention provides an emotion migration voice synthesis method and system.
Disclosure of Invention
The invention aims to provide an emotion migration voice synthesis system to solve the following problems in the prior art: most existing methods represent emotional information by extracting an emotional feature sequence from audio with an emotional feature extraction module; such methods often entangle too much information and lack clear, direct, and independent control over specific voice features.
In order to achieve the purpose, the invention adopts the following technical scheme:
the emotion migration voice synthesis method comprises the following steps:
s1, inputting a text input sequence with a phoneme level into a text encoder to obtain a text encoding vector;
s2, obtaining a corresponding emotion category number according to the emotion category required by the voice, and inputting the number into an emotion extraction module to obtain an emotion style vector;
s3, passing the text coding vector and the emotion style vector through a text voice frame pair module to obtain a text-voice alignment sequence;
and S4, inputting the speaker identity ID into a voice frame decoder, processing the text-voice alignment sequence through the voice frame decoder, and decoding to obtain the Mel-frequency spectrum characteristics.
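The four steps S1-S4 can be sketched at the level of tensor shapes. The following is a minimal NumPy sketch; all module internals, function names, and dimensions (e.g. `D_MODEL`, the token bank) are illustrative assumptions, not the patent's actual implementation.

```python
import numpy as np

D_MODEL, N_MELS = 256, 80                     # invented model width; 80 mel bins

def text_encoder(phoneme_ids):
    # S1: phoneme IDs -> fixed-dimension text encoding vectors (stand-in network)
    return np.random.randn(len(phoneme_ids), D_MODEL)

def emotion_extractor(emotion_id, n_tokens=8):
    # S2: emotion category number -> emotion style vector via a learned token bank
    token_bank = np.random.randn(n_tokens, D_MODEL)
    weights = np.eye(n_tokens)[emotion_id]    # one-hot selection at inference time
    return weights @ token_bank               # (D_MODEL,)

def align_text_to_frames(text_enc, style_vec, durations):
    # S3: add the style vector, then expand each phoneme to its frame count
    return np.repeat(text_enc + style_vec, durations, axis=0)

def frame_decoder(aligned, speaker_id):
    # S4: decode the aligned sequence, conditioned on speaker ID, to mel frames
    return np.random.randn(aligned.shape[0], N_MELS)

phonemes = [3, 17, 9, 4]
aligned = align_text_to_frames(text_encoder(phonemes), emotion_extractor(2),
                               durations=np.array([3, 5, 4, 2]))
mel = frame_decoder(aligned, speaker_id=0)
print(mel.shape)                              # (14, 80): 3+5+4+2 frames, 80 mel bins
```

The sketch only demonstrates how the shapes flow between the four modules; the real networks replace every stand-in function.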
Further preferably, wherein S1 includes the following steps:
processing the text input sequence at the phoneme level through a character embedding layer, and coding the text input sequence into a sequence vector with fixed dimensionality;
constructing a position vector from the text input sequence at the phoneme level through a position coding layer;
adding the sequence vector and the position vector with fixed dimensionality, sending the sum into a multilayer FFTBlock module, and converting the sum into a text coding vector with fixed length.
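As a shape-level illustration of these three steps (character embedding, position encoding, element-wise addition before the FFTBlock stack), here is a hedged NumPy sketch using the standard Transformer-style sinusoidal position encoding; the vocabulary size and dimensions are invented for the example.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Standard sinusoidal position encoding: sin at even dims, cos at odd dims.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

VOCAB, D_MODEL = 100, 8                       # invented sizes for the example
embedding_table = np.random.randn(VOCAB, D_MODEL) * 0.1

phoneme_ids = np.array([5, 42, 7])
seq_vec = embedding_table[phoneme_ids]        # character-embedding step
pos_vec = sinusoidal_positions(len(phoneme_ids), D_MODEL)
encoder_input = seq_vec + pos_vec             # sum fed into the FFTBlock stack
print(encoder_input.shape)                    # (3, 8)
```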
Further preferably, the extraction process of the emotion extraction module in S2 includes:
acquiring Mel sound spectrum characteristics of emotional voice, inputting the Mel sound spectrum characteristics into a reference encoder layer, encoding into a vector with fixed length, and recording as a reference embedded vector;
inputting the reference embedded vector into an attention module and computing a plurality of emotion expression vectors; the emotion expression vectors are labeled with emotion category numbers, and during the extraction process each emotion expression vector comes to represent one emotion category;
and forming an emotion style vector by weighted combination of a plurality of emotion expression vectors.
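The weighted combination can be sketched as attention over a bank of emotion expression vectors, in the spirit of GST-style token attention; the dimensions and the scaled dot-product scoring below are assumptions for illustration, not the patent's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

D, N_TOKENS = 16, 4                           # invented sizes
token_bank = np.random.randn(N_TOKENS, D)     # one emotion expression vector per category

ref_embed = np.random.randn(D)                # fixed-length reference embedding
weights = softmax(token_bank @ ref_embed / np.sqrt(D))  # attention over the tokens
emotion_style = weights @ token_bank          # weighted combination -> style vector
print(emotion_style.shape)                    # (16,)
```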
Further preferably, wherein S2 includes the following steps:
converting the emotion category numbers through an attention module to obtain emotion expression vectors;
and forming the emotion style vector by weighted combination of the emotion expression vectors.
Further preferably, wherein S3 includes the following steps:
correspondingly adding the emotion style vector and the text coding vector to obtain a combined text coding vector;
predicting, by a duration predictor, the duration of each text phoneme in the combined text encoding vector; the duration predictor is a 2-layer 1-dimensional convolutional neural network with a fully connected layer attached to the last layer; it takes the combined text encoding vector as input and, through multi-layer convolution, predicts a speech duration value for each text encoding in the vector;
aligning each phoneme to the length of a speech frame one by one according to the predicted duration through an LR module to obtain a speech frame length coding vector;
sequentially passing the speech-frame-length encoding vector through a fundamental frequency predictor and an energy predictor to obtain a fundamental frequency prediction and an energy prediction; the duration predictor, fundamental frequency predictor, and energy predictor share the same structure: a 2-layer 1-dimensional convolutional neural network with a fully connected layer attached to the last layer. The fundamental frequency predictor and energy predictor each receive the speech-frame-length encoding vector and predict, respectively, the fundamental frequency value and the energy value of each frame;
and adding the fundamental frequency predicted value, the energy predicted value and the voice frame length coding vector to obtain a text-voice alignment sequence.
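The LR expansion step, followed by adding the pitch and energy information onto the frame-level sequence, can be sketched as follows. This is a hypothetical NumPy illustration: in the real system the predictors supply the durations and the pitch/energy contributions, and the toy dimensions are invented.

```python
import numpy as np

def length_regulate(phoneme_enc, durations):
    # LR module: expand each phoneme encoding to `durations[i]` identical frame slots.
    return np.repeat(phoneme_enc, durations, axis=0)

enc = np.arange(6).reshape(3, 2)                  # 3 phonemes, dim 2 (toy values)
frames = length_regulate(enc, np.array([2, 1, 3]))
print(frames.shape)                               # (6, 2): 2+1+3 speech frames

# Predicted pitch/energy values, mapped to the model dimension, are added on top.
f0_embed = np.random.randn(frames.shape[0], 2)
energy_embed = np.random.randn(frames.shape[0], 2)
aligned = frames + f0_embed + energy_embed        # text-speech alignment sequence
```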
Further preferably, wherein S4 includes the following steps:
constructing a position vector code by the text-voice alignment sequence through a position vector layer;
adding the position vector code to the corresponding positions of the alignment encoding sequence, concatenating the speaker ID vector with the sequence front-to-back, and decoding the spliced result through a multi-layer FFTBlock module;
passing the obtained decoding feature sequence through a post-processing module to obtain the Mel-spectrum features; the post-processing module is a stack of five 1-dimensional convolutional layers that processes the decoding feature sequence through repeated convolutions, and the resulting 80-dimensional encoding vector is the Mel-spectrum feature.
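A stack of five 1-D convolutions mapping a decoding feature sequence to 80 mel bins might look like the following NumPy sketch. Only the five-layer/80-dimensional structure comes from the text; the layer widths, kernel size 5, and the tanh nonlinearity are assumptions.

```python
import numpy as np

def conv1d_same(x, kernel):
    # 'same'-padded 1-D convolution; x: (T, C_in), kernel: (K, C_in, C_out).
    k = kernel.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    return np.stack([np.einsum('kc,kco->o', xp[t:t + k], kernel)
                     for t in range(x.shape[0])])

T, D, N_MELS = 12, 32, 80                     # T and D invented; 80 mel bins from the text
h = np.random.randn(T, D)                     # decoder feature sequence
hidden_layers = [np.random.randn(5, D, D) * 0.01 for _ in range(4)]
final_layer = np.random.randn(5, D, N_MELS) * 0.01  # 5th layer maps to 80 dims

for w in hidden_layers:
    h = np.tanh(conv1d_same(h, w))            # conv + nonlinearity (tanh assumed)
mel = conv1d_same(h, final_layer)             # 80-dimensional Mel-spectrum frames
print(mel.shape)                              # (12, 80)
```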
An emotion migration speech synthesis system comprising:
the emotion recognition system comprises a text encoder, an emotion extraction module, a text voice frame pair module and a voice frame decoder.
Further, the text encoder includes: the device comprises a character embedding layer, a position coding layer and an FFTBlock module;
the emotion extraction module comprises: a reference encoder layer, attention module;
the text-speech frame alignment module comprises: a duration predictor, an LR module, a fundamental frequency predictor, and an energy predictor;
the speech frame decoder comprises: the device comprises a position coding layer, an FFTBlock module and a post-processing module.
The invention has at least the following beneficial effects:
the emotion information extraction module provided by the invention can completely decouple speaker information and emotion information in audio features, and the emotion encoding vector only contains the emotion information in the audio, so that the similarity between the encoding vector and the emotion information represented by the vector is improved; secondly, the emotion encoding vector and the speaker information can be freely combined, so that the task of migrating the emotion information to a target speaker without emotion from the audio data of a source speaker is realized; by giving weight to the emotion encoding vector, the strength of emotion can be controlled more easily.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is an overall architecture diagram;
FIG. 2 is a text encoder workflow;
FIG. 3 is a flow of text-to-speech frame versus module work;
FIG. 4 is a speech frame decoder workflow;
FIG. 5 is an extraction and production flow of emotion extraction module;
fig. 6 is a system overall execution flow.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
Referring to fig. 1, the overall architecture of the model of the present invention is shown.
The emotion migration TTS system is based on a sequence-to-sequence structure built from multi-layer neural networks; it takes as input a phoneme-level text sequence, a specified speaker number, and a specified emotion category number, and outputs speech-frame-level sequence information.
The system mainly comprises four modules: a text encoder module, a text-speech frame alignment module, a speech frame decoder module, and an emotion information extraction module.
In the model architecture diagram of fig. 1, the outputs of the text encoder module and the emotion information extraction module are spliced together and fed into the alignment module; the output of the alignment module then enters the speech frame decoder module to obtain the finally decoded speech frames. The four modules are described in detail below.
(1) Text encoder, referring to fig. 2, fig. 2 illustrates a text encoder workflow.
The text coder is used for converting the phoneme-level text sequence into a machine-readable text coding vector through a neural network.
The text encoder is composed of a character embedding layer, a position coding layer and an FFTBlock module (the module is composed of a plurality of convolution layers and a self-attention mechanism layer), and the specific operation flow is as follows:
a. processing the text input sequence at the phoneme level through a character embedding layer, and coding the text input sequence into a sequence vector with fixed dimensionality;
b. constructing a position vector from the text input sequence at the phoneme level through a position coding layer;
c. adding the coded text sequence vector and the corresponding position of the position vector, and sending the added result into a multilayer FFTBlock module to convert the result into a text coding vector with fixed length.
(2) Text-speech frame alignment module; see fig. 3, which shows the workflow of the alignment fusion module.
The alignment module changes the text encoding vector from phoneme length to speech-frame length, yielding the text-speech alignment sequence. The module contains three predictors: a duration predictor, a fundamental frequency predictor, and an energy predictor, which respectively predict, for the phoneme encoding sequence, the duration of each phoneme, the average fundamental frequency of each phoneme, and the average energy of each phoneme. The LR module aligns the phoneme encoding vectors one by one to speech-frame length according to the predicted duration values, and the output of the alignment module is the sum of the outputs of the three predictors. All three predictors adopt a 2-layer 1-dimensional convolutional neural network with a fully connected layer attached to the last layer. The duration predictor predicts a speech duration value for each text encoding in the vector through multi-layer convolution; the fundamental frequency predictor and energy predictor each receive the speech-frame-length encoding vector and predict the fundamental frequency value and energy value of each frame, respectively. The alignment module uses three predictors because they extract and predict features that determine both speech details and overall perception: phoneme duration affects pronunciation length and overall prosody; pitch (fundamental frequency) affects emotion and rhythm; and the energy value determines the volume of the audio.
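The shared predictor structure (2-layer 1-D convolution plus a final fully connected layer producing one scalar per position) can be sketched as follows; the weights are random placeholders, and the ReLU choice and dimensions are assumptions for illustration.

```python
import numpy as np

D, K = 16, 3                                  # invented model dim and kernel size
W1 = np.random.randn(K, D, D) * 0.1
W2 = np.random.randn(K, D, D) * 0.1
W_fc = np.random.randn(D, 1) * 0.1

def conv1d_same(x, kernel):
    # 'same'-padded 1-D convolution; x: (T, C_in), kernel: (K, C_in, C_out).
    k = kernel.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    return np.stack([np.einsum('kc,kco->o', xp[t:t + k], kernel)
                     for t in range(x.shape[0])])

def variance_predictor(phoneme_enc):
    h = np.maximum(conv1d_same(phoneme_enc, W1), 0)   # conv + ReLU (layer 1)
    h = np.maximum(conv1d_same(h, W2), 0)             # conv + ReLU (layer 2)
    return (h @ W_fc).squeeze(-1)                     # fully connected: one value per position

durations = variance_predictor(np.random.randn(5, D))
print(durations.shape)                                # (5,): one duration per phoneme
```

The same structure, fed with the frame-length sequence instead, would yield per-frame pitch or energy values.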
Because the emotion style vector contains the emotional detail information of the speech, it must first be fused with the text encoding vector before the alignment module performs duration alignment and fundamental-frequency/energy prediction. The alignment module therefore first adds the emotion style vector and the text encoding vector.
The process of aligning the text sequence and the voice frame is as follows:
a. correspondingly adding the emotion style vector and the text encoding vector to obtain a combined text encoding vector; this vector combines each text phoneme's information with the local emotional style information.
b. The combined text encoding vector predicts the duration of each text phoneme through a Duration Predictor;
c. aligning each phoneme one by one to speech-frame length according to the predicted duration through the LR module, obtaining the speech-frame-length encoding vector;
d. passing the speech-frame-length encoding vector sequentially through a fundamental frequency predictor (Pitch Predictor) and an energy predictor (Energy Predictor) to obtain the fundamental frequency prediction and the energy prediction;
e. adding the fundamental frequency prediction and the energy prediction to the speech-frame-length encoding vector in turn; the final result is the text-speech alignment sequence.
(3) Speech frame decoder; see fig. 4, which shows the speech frame decoder workflow.
The function of this part is to decode the speech-frame-level encoded data produced by the alignment fusion module into speech-frame feature data (Mel-spectrum features). The speech-frame-level encoded data of the previous step is not the final result of speech synthesis: the text-speech alignment encoding fuses the text and emotional style information, and it must be decoded by a decoder into Mel-spectrum features before an audio file audible to human ears can be generated.
The structure of the decoder is essentially the same as that of the text encoder, except that a post-processing module is attached at the end to synthesize more accurate speech-frame-level features. The post-processing module is a stack of five 1-dimensional convolutional layers; it processes the decoding feature sequence through repeated convolutions, and the resulting 80-dimensional encoding vector is the Mel-spectrum feature. The post-processing module models the speech-frame decoding sequence in finer detail and reduces information loss in the Mel-spectrum features generated by the model.
Figure 4 shows the work flow of a speech frame decoder. The decoding flow of the decoder is as follows:
a. the speech-frame-level alignment encoding sequence, which contains the fused emotion style and text information, constructs its position vector encoding through a position vector layer;
b. adding the position vector code to the corresponding positions of the alignment encoding sequence, concatenating the speaker ID vector with the sequence front-to-back, and decoding the spliced result through a multi-layer FFTBlock module; the obtained decoding feature sequence then passes through the post-processing module to yield the final speech-frame feature sequence, i.e., the Mel-spectrum features.
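The front-to-back splicing of the speaker identity with the frame-level sequence might look like this minimal sketch. Prepending a speaker embedding vector is one plausible reading of the "front-back" splicing; the dimensions and embedding table are invented for illustration.

```python
import numpy as np

D = 8                                         # invented model dimension
aligned = np.random.randn(10, D)              # frame-level text-speech alignment sequence
pos = np.random.randn(10, D)                  # position vector encoding (placeholder)
speaker_table = np.random.randn(4, D)         # one embedding per speaker ID

x = aligned + pos                             # add position code at corresponding positions
x = np.concatenate([speaker_table[2][None, :], x], axis=0)  # splice speaker vector in front
print(x.shape)                                # (11, 8); result then enters the FFTBlocks
```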
(4) Emotion information extraction module; see fig. 5, which shows the extraction and production flows of the emotion extraction module.
The emotion information extraction module extracts emotion style vectors from the voice data; an emotion style vector uniquely represents one emotion style and can be completely decoupled from the speaker information, speaking style, and other information in the voice data. Therefore, speaker information and emotion style vectors can be freely combined to synthesize emotional speech that satisfies the demand.
The emotion style vector is spliced with the text encoding vector as the input data of the alignment module, and the decoder then produces the final Mel-spectrum features carrying the emotion information. The module has two processing flows: an extraction process and a production process.
The extraction process is as follows:
a. obtaining the Mel-spectrum features of emotional speech and inputting them into a Reference Encoder layer, which encodes the Mel-spectrum features into a fixed-length vector called the reference embedded vector;
b. computing, from the reference embedded vector, a plurality of emotion expression vectors through an attention module (Attention Model); the emotion expression vectors are labeled with emotion category numbers, so that during the extraction process each emotion expression vector comes to represent one emotion category;
c. the emotional style vector is formed by the weighted combination of the plurality of emotional expression vectors.
The extraction process is executed during TTS training and finally yields a plurality of emotion expression vectors; during voice production, the selected emotion category number is used to weight the corresponding emotion expression vectors to obtain the emotion style vector. The production process is as follows:
a. specifying the required emotion category and converting the emotion category number into a fixed-length selection vector containing only 0s and 1s;
b. this selection vector passes through the attention module (Attention Model) and is combined, by weighting, with the plurality of emotion expression vectors extracted in advance, finally yielding the required emotion style vector.
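The production-time path, where a 0/1 selection vector picks out pre-trained emotion expression vectors, reduces to a matrix product; the token count, dimension, and the scalar strength weight below are illustrative assumptions.

```python
import numpy as np

N_TOKENS, D = 4, 16
token_bank = np.random.randn(N_TOKENS, D)  # emotion expression vectors learned during training

emotion_id = 1
selection = np.eye(N_TOKENS)[emotion_id]   # fixed-length vector of 0s and 1s
style = selection @ token_bank             # the required emotion style vector
scaled = 0.5 * style                       # scaling the weight controls emotion strength
print(style.shape)                         # (16,)
```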
The overall workflow of the model is as follows (see fig. 6, the overall execution flow of the emotion synthesis TTS system):
The TTS synthesis system works in two modes: a training process and a production process. During training, every parameter in the network is adjusted automatically using a labeled, aligned dataset of text phoneme sequences, the speech frame data corresponding to each text, and speaker information; after training is completed, the network parameters no longer change. During production, the network model with fixed parameters obtained from training takes the target text sequence and target speaker as input and produces the target synthesized Mel-spectrum features.
The inputs of the TTS system are the speaker ID, the phoneme-level text sequence, and the emotion audio feature/emotion tag ID (during training this part is "emotion audio feature + emotion tag ID"; during production it is the "emotion tag ID" only). The overall working process of the system is as follows:
a. the phoneme-level text sequence is input into a text encoder, and a text encoding vector is generated through the processing of a text encoder network;
b. obtaining the emotion style vector, decoupled from the audio features, through the emotion extraction module using the "emotional audio feature / emotion tag ID" information;
c. correspondingly adding the text encoding vector and the emotion style vector as the input of the text-speech frame alignment module, which expands the input sequence to speech-frame length;
d. combining the speaker ID information, decoding the aligned speech-frame-length sequence into Mel-spectrum features through the speech frame decoder.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (8)
1. An emotion migration voice synthesis method, characterized by comprising the following steps:
S1, inputting a phoneme-level text input sequence into a text encoder to obtain a text encoding vector;
S2, obtaining the emotion category number corresponding to the emotion required for the speech, and inputting the number into an emotion extraction module to obtain an emotion style vector;
S3, passing the text encoding vector and the emotion style vector through a text-to-speech-frame alignment module to obtain a text-speech alignment sequence;
S4, inputting the speaker ID into a speech frame decoder, processing the text-speech alignment sequence through the decoder, and decoding it into Mel-spectrum features.
2. The emotion migration voice synthesis method according to claim 1, wherein S1 comprises the following steps:
processing the phoneme-level text input sequence through a character embedding layer to encode it into a fixed-dimension sequence vector;
constructing a position vector from the phoneme-level text input sequence through a position coding layer;
adding the fixed-dimension sequence vector and the position vector, feeding the sum into a multi-layer FFTBlock module, and converting it into a fixed-length text encoding vector.
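The embedding-plus-position step of claim 2 can be sketched as follows. The sinusoidal formula is an assumption (the claim only names a "position coding layer"), and the FFTBlock stack that would consume the sum is omitted.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, dim):
    # Standard Transformer-style position encoding -- an illustrative
    # assumption for the claim's "position coding layer".
    pos = np.arange(seq_len)[:, None].astype(float)
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def encode_text(phoneme_ids, embedding_table):
    # Character embedding lookup gives the fixed-dimension sequence
    # vector; the element-wise add with the position vector would next
    # pass through the multi-layer FFTBlock module (not shown).
    emb = embedding_table[phoneme_ids]
    return emb + sinusoidal_position_encoding(len(phoneme_ids), emb.shape[1])
```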
3. The emotion migration voice synthesis method according to claim 1, wherein the extraction process of the emotion extraction module in S2 comprises the following steps:
acquiring the Mel-spectrum features of emotional speech, inputting them into a reference encoder layer, and encoding them into a fixed-length vector, recorded as the reference embedding vector;
inputting the reference embedding vector into an attention module and computing a plurality of emotion expression vectors; the emotion expression vectors are labeled with emotion category numbers, and during extraction each emotion expression vector represents one emotion category;
forming the emotion style vector as a weighted combination of the plurality of emotion expression vectors.
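The weighted combination in claim 3 resembles a style-token attention: the reference embedding scores each emotion expression vector, and the style vector is the resulting convex combination. Dot-product attention is an illustrative assumption, since the claim only specifies an "attention module".

```python
import numpy as np

def emotion_style_vector(reference_embedding, style_tokens):
    """Weighted combination of emotion expression vectors (style_tokens,
    shape (K, d)), with weights computed by attending the reference
    embedding (shape (d,)) over the tokens."""
    scores = style_tokens @ reference_embedding        # (K,) similarity
    scores = scores - scores.max()                     # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over tokens
    return weights @ style_tokens                      # (d,) style vector
```

In production (claim 4), the reference embedding would be replaced by a lookup keyed on the emotion category number, so no emotional audio is needed at synthesis time.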
4. The emotion migration voice synthesis method according to claim 3, wherein S2 comprises the following steps:
converting the emotion category number through the attention module to obtain the emotion expression vectors;
forming the emotion style vector as a weighted combination of the emotion expression vectors.
5. The emotion migration voice synthesis method according to claim 1, wherein S3 comprises the following steps:
adding the emotion style vector and the text encoding vector element-wise to obtain a combined text encoding vector;
predicting, by a duration predictor, the duration of each phoneme of the combined text encoding vector;
aligning each phoneme to speech-frame length one by one according to the predicted durations through an LR (length regulator) module to obtain a speech-frame-length encoding vector;
passing the speech-frame-length encoding vector sequentially through a fundamental frequency predictor and an energy predictor to obtain a fundamental frequency prediction and an energy prediction;
adding the fundamental frequency prediction, the energy prediction, and the speech-frame-length encoding vector to obtain the text-speech alignment sequence.
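The LR expansion and variance addition of claim 5 can be sketched directly. The broadcast of the scalar pitch/energy predictions across the feature dimension is an assumption; the duration, pitch, and energy predictor networks themselves are omitted.

```python
import numpy as np

def length_regulate(phoneme_encodings, durations):
    # LR module: repeat each phoneme vector by its predicted frame
    # count, expanding the sequence to speech-frame length.
    return np.repeat(phoneme_encodings, durations, axis=0)

def add_variance(frame_encodings, f0_pred, energy_pred):
    # Add the per-frame fundamental-frequency and energy predictions
    # onto the speech-frame-length encoding vector (broadcast over the
    # feature dimension) to form the text-speech alignment sequence.
    return frame_encodings + f0_pred[:, None] + energy_pred[:, None]
```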
6. The emotion migration voice synthesis method according to claim 1, wherein S4 comprises the following steps:
constructing a position vector encoding, through a position vector layer, from the text-speech alignment sequence containing the emotion style vector and text encoding vector information;
adding the position vector encoding to the corresponding positions of the alignment encoding sequence, splicing the speaker ID embedding with the resulting vector front-to-back, and decoding the spliced result through a multi-layer FFTBlock module;
passing the resulting decoded feature sequence through a post-processing module to obtain the Mel-spectrum features.
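The decoder input assembly of claim 6 can be sketched as follows. The simple linear position ramp and the choice to prepend (rather than append) the speaker embedding are illustrative assumptions; the output would feed the FFTBlock decoder stack and then the post-processing module.

```python
import numpy as np

def assemble_decoder_input(aligned_seq, speaker_embedding):
    """Add a position signal to the text-speech alignment sequence
    (shape (T, d)) and splice the speaker embedding (shape (d,)) onto
    the front -- the 'front-back' splice of claim 6."""
    t = aligned_seq.shape[0]
    position = (np.arange(t, dtype=float) / max(t, 1))[:, None]
    x = aligned_seq + position                      # position vector add
    return np.concatenate([speaker_embedding[None, :], x], axis=0)
```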
7. An emotion migration speech synthesis system, characterized by comprising:
a text encoder, an emotion extraction module, a text-to-speech-frame alignment module, and a speech frame decoder.
8. The emotion migration speech synthesis system of claim 7, wherein the text encoder comprises: a character embedding layer, a position coding layer, and an FFTBlock module;
the emotion extraction module comprises: a reference encoder layer and an attention module;
the text-to-speech-frame alignment module comprises: a duration predictor, an LR module, a fundamental frequency predictor, and an energy predictor;
the speech frame decoder comprises: a position coding layer, an FFTBlock module, and a post-processing module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210414220.3A CN114842825A (en) | 2022-04-20 | 2022-04-20 | Emotion migration voice synthesis method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210414220.3A CN114842825A (en) | 2022-04-20 | 2022-04-20 | Emotion migration voice synthesis method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114842825A true CN114842825A (en) | 2022-08-02 |
Family
ID=82565700
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210414220.3A Pending CN114842825A (en) | 2022-04-20 | 2022-04-20 | Emotion migration voice synthesis method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114842825A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115311731A (en) * | 2022-10-10 | 2022-11-08 | 之江实验室 | Expression generation method and device for sign language digital person |
CN115311731B (en) * | 2022-10-10 | 2023-01-31 | 之江实验室 | Expression generation method and device for sign language digital person |
CN116705058A (en) * | 2023-08-04 | 2023-09-05 | 贝壳找房(北京)科技有限公司 | Processing method of multimode voice task, electronic equipment and readable storage medium |
CN116705058B (en) * | 2023-08-04 | 2023-10-27 | 贝壳找房(北京)科技有限公司 | Processing method of multimode voice task, electronic equipment and readable storage medium |
CN117636842A (en) * | 2024-01-23 | 2024-03-01 | 北京天翔睿翼科技有限公司 | Voice synthesis system and method based on prosody emotion migration |
CN117636842B (en) * | 2024-01-23 | 2024-04-02 | 北京天翔睿翼科技有限公司 | Voice synthesis system and method based on prosody emotion migration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||