CN115762466A - Method and device for synthesizing different emotion audios

Method and device for synthesizing different emotion audios

Info

Publication number
CN115762466A
CN115762466A (application number CN202211454821.3A)
Authority
CN
China
Prior art keywords
emotion
voice
training
text
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211454821.3A
Other languages
Chinese (zh)
Inventor
周琳岷
王昆
朱海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Qiruike Technology Co Ltd
Sichuan Changhong Electronic Holding Group Co Ltd
Original Assignee
Sichuan Qiruike Technology Co Ltd
Sichuan Changhong Electronic Holding Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Qiruike Technology Co Ltd, Sichuan Changhong Electronic Holding Group Co Ltd filed Critical Sichuan Qiruike Technology Co Ltd
Priority to CN202211454821.3A priority Critical patent/CN115762466A/en
Publication of CN115762466A publication Critical patent/CN115762466A/en
Pending legal-status Critical Current


Abstract

The invention provides a method and a device for synthesizing different emotion audios, comprising a training stage and an inference stage. The training stage comprises the following steps: S11, collecting training corpora, including the audio of different speakers, the corresponding texts and emotion labels, and extracting the spectral features of the corresponding speech; S12, training an emotion speech feature extraction model from the speech spectral features and the corresponding emotion labels; S13, extracting the emotion feature vectors of the training corpus and the corresponding text coding vectors; S14, combining the text coding vectors with the emotion feature vectors of the speech, and training a speech synthesis model with the acoustic features of the corresponding speech; S15, taking the emotion feature vectors and the text coding vectors of the speech as input, and training an emotion feature prediction model; S16, training a vocoder with the acoustic features of the speech and the corresponding speech. The invention solves the problems of flat intonation and unclear emotion in traditional speech synthesis.

Description

Method and device for synthesizing different emotion audios
Technical Field
The invention relates to the technical field of voice synthesis, in particular to a method and a device for synthesizing different emotion audios.
Background
With the continuous development of speech technology, expectations for speech synthesis quality keep rising. The mechanical pronunciation of traditional speech synthesis is no longer satisfactory, and users hope to add more emotional expressiveness to synthesized speech.
Existing speech synthesis either synthesizes directly or converts emotion labels into fixed vector codes. This approach usually depends on preset results, does not take different contexts into account, and its coding scheme is too simple: the labels are relatively fixed and cannot reflect subtle variations. A dynamic coding scheme, by contrast, better reflects the actual emotional state and makes it easier to adjust the emotion intensity.
Disclosure of Invention
The invention aims to provide a method and a device for synthesizing different emotion audios so as to solve the problems described in the background above, namely that current speech synthesis lacks emotion and that the synthesized emotions are difficult to distinguish.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method and a device for synthesizing different emotion audios comprise a training stage and an inference stage, wherein the training stage comprises the following steps:
s11, collecting training corpora including audios of different speakers, corresponding texts and emotion labels and extracting frequency spectrum characteristics of corresponding voices;
s12, training an emotional voice feature extraction model according to the voice frequency spectrum features and the corresponding emotional labels;
s13, extracting emotion characteristic vectors of the training corpus and text coding vectors corresponding to the training corpus;
s14, combining the text coding vector with the emotion characteristic vector of the voice, and training a voice synthesis model through acoustic characteristics of the corresponding voice
S15, taking the emotion feature vector and the text coding vector of the voice as input, and training an emotion feature prediction model;
s16, training vocoder through acoustic characteristics and corresponding voice of voice
The inference phase comprises the following steps:
s21, processing the text to obtain a text coding vector, and generating emotional feature information according to the text coding vector through an emotional feature prediction model
S22, combining the text coding vector with the emotional characteristic information, and generating the voice acoustic characteristic through a voice synthesis model
S23, synthesizing audio frequency through a vocoder according to the acoustic characteristics of the voice
Further, in order to extract the spectral features of the corresponding speech, S11 further includes:
the training corpus includes, but is not limited to, an open speech synthesis training data set or a self-recorded speech synthesis training data set, and the extracted speech spectral features include, but are not limited to, linear spectral features and mel spectral features.
Further, in order to train the emotion speech feature extraction model, S12 further includes:
performing speech emotion recognition with the spectral features and labels of the training speech as the input of the emotion feature extraction network, wherein the network structure of the emotion feature recognition model includes, but is not limited to, structures such as CNN and GRU.
Further, in order to extract the emotion feature vectors of the training corpus and the corresponding text coding vectors, S13 further includes:
extracting the emotion feature vector of the training corpus with the trained emotion feature extraction network; normalizing the corresponding text and, from the text information, extracting its text coding vector through a text coding network; and extracting multi-scale emotion features of the training corpus through a convolutional neural network to obtain a feature matrix.
Further, in order to train the speech synthesis model, S14 further includes:
during training of the speech synthesis model, multi-scale emotion features are extracted from the generated speech feature vectors with the emotion extraction model trained in S12, compared against the multi-scale emotion features extracted in S13, and the error is fed back through a loss function to adjust the network;
the text coding vector and the emotion feature vector extracted in S13 are added and encoded, after which the emotion feature vector is expanded frame by frame according to the alignment between the spectral features and the text and fed into the decoding layer of the speech synthesis model;
the alignment between the spectral features and the text may include, but is not limited to, forced alignment and monotonic alignment search.
Further, in order to train the emotion feature prediction model, S15 further includes:
taking the text coding vector as input, the deep learning network outputs an emotion feature prediction vector, which is compared with the emotion feature vector of the training corpus extracted in S13, and the error is fed back through a loss function to adjust the network; the network structure of the emotion feature prediction model includes, but is not limited to, RNN and Transformer structures.
Further, in order to train the vocoder, S16 further includes:
training the vocoder with an adversarial neural network, using the speech features generated by the speech synthesis model trained in S14 and the corresponding speech signal as inputs; vocoders used include, but are not limited to, WaveNet, WaveRNN and MelGAN.
Further, for emotion speech synthesis, S21-S23 further include:
the speech synthesis model parameters used in the inference stage are obtained in the training stage, and the network structures are identical; the text is processed in the same way as in the training stage; the emotion feature vector generated in the inference stage is expanded frame by frame according to the length of the text and combined with the text; the emotion feature extraction model is not used in the inference stage, and the emotion feature vector is instead predicted from the text by the emotion feature prediction model; the emotion intensity is amplified or weakened by multiplying the emotion feature vector by a feature coefficient.
The invention provides a device for synthesizing different emotion audios, which comprises:
the speaker audio feature extraction unit is used for training a voice feature extraction model and extracting a voice feature vector;
the emotion voice feature extraction unit is used for extracting the emotion features of the target;
an emotion feature prediction unit for predicting an emotion feature vector from the input text information;
The speech synthesis unit is used for synthesizing the input text information and the emotional characteristic information into speech information;
and a vocoder unit for converting the generated voice feature into an audio signal.
The advantages of the method and the device for synthesizing different emotion audios include, but are not limited to, the following:
the method and the device can synthesize emotional speech for a speaker's sentences in real time, solving the problem that traditional synthesis methods cannot express the speaker's emotion. The invention is applied to, but not limited to, the field of emotional speech synthesis. By predicting the emotion feature vector and adding it to the synthesized sentence, the synthesized audio carries richer emotion and stronger expressiveness.
Drawings
FIG. 1 is a flowchart illustrating a method for synthesizing different emotion audios according to an embodiment of the present invention
Fig. 2 is a training flowchart according to an embodiment of the present invention.
FIG. 3 is a flow chart illustrating inference according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
On the contrary, this application is intended to cover any alternatives, modifications, and equivalents that may be included within the spirit and scope of the application as defined by the appended claims. Furthermore, in the following detailed description of the present application, certain specific details are set forth in order to provide a thorough understanding of the present application. It will be apparent to one skilled in the art that the present application may be practiced without these specific details.
A method and an apparatus for synthesizing different emotion audios according to the embodiments of the present application will be described in detail with reference to fig. 1 to 3. It should be noted that the following examples are merely illustrative of the present application and are not to be construed as limiting the present application.
Example 1:
A method for synthesizing different emotion audios comprises the following steps in the training stage:
S11, collecting training corpora, including the audio of different speakers, the corresponding texts and emotion labels, and extracting the spectral features of the corresponding speech;
Optionally, the training corpus includes, but is not limited to, an open speech synthesis training data set or a self-recorded speech synthesis training data set, and the extracted speech spectral features include, but are not limited to, linear spectral features and mel spectral features.
For example, 80-dimensional mel spectral features are extracted from the collected audio using a window length of 0.05 s and a sliding distance (hop) of 0.015 s, or 513-dimensional linear spectral features are extracted through the short-time Fourier transform.
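As an illustration only, the following sketch shows how these spectral features could be computed; the use of librosa, the 16 kHz sample rate and the 1024-point FFT are assumptions not fixed by this disclosure.

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    """Return (frames, 513) linear and (frames, 80) log-mel spectral features."""
    y, _ = librosa.load(wav_path, sr=sr)
    win = int(0.05 * sr)    # 0.05 s analysis window (800 samples at 16 kHz)
    hop = int(0.015 * sr)   # 0.015 s sliding distance (240 samples)
    n_fft = 1024            # 1024-point FFT -> 513 linear-spectrum bins

    # 513-dimensional linear (magnitude) spectrum via the short-time Fourier transform
    linear = np.abs(librosa.stft(y, n_fft=n_fft, win_length=win, hop_length=hop))

    # 80-dimensional mel spectrogram derived from the same analysis settings
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, win_length=win, hop_length=hop, n_mels=80)
    log_mel = np.log(np.clip(mel, 1e-5, None))

    return linear.T, log_mel.T
```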
S12, training an emotional voice feature extraction model according to the voice frequency spectrum features and the corresponding emotional labels;
Specifically, speech emotion recognition is performed with the spectral features and labels of the training speech as the input of the emotion feature extraction network. The network structure of the emotion feature recognition model includes, but is not limited to, structures such as CNN and GRU.
for example, linear spectrum features are subjected to feature vector extraction through a 2D convolution network and GRU, pass through a full connection layer and are classified through softmax.
S13, extracting emotion characteristic vectors of the training corpus and text coding vectors corresponding to the training corpus;
Specifically, the emotion feature vector of the training corpus is extracted with the trained emotion feature extraction network; the corresponding text is normalized and, from the text information, its text coding vector is extracted through a text coding network; and the multi-scale emotion features of the training corpus are extracted through a convolutional neural network to obtain a feature matrix.
For example, the emotion feature vector is taken from the bottleneck layer of the trained classification model. The Chinese text is normalized, illegal syllables are filtered out, and the valid input is segmented into words and tagged with parts of speech; the extracted linguistic features are fed into a prosody prediction model to obtain pause-level tags. The Chinese characters are then converted into the corresponding pinyin phonemes, and the text information is encoded to obtain the text coding vectors.
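The sketch below covers only the grapheme-to-phoneme and text-encoding part of this front end; word segmentation, part-of-speech tagging and the prosody (pause-level) model are omitted, and the use of pypinyin as well as the embedding and GRU sizes are assumptions.

```python
import torch
import torch.nn as nn
from pypinyin import Style, lazy_pinyin

def text_to_phonemes(text):
    # Tone-numbered pinyin syllables, e.g. "今天" -> ["jin1", "tian1"]
    return lazy_pinyin(text, style=Style.TONE3)

class TextEncoder(nn.Module):
    """Maps phoneme id sequences to per-phoneme text coding vectors."""
    def __init__(self, vocab_size, emb_dim=256, hidden=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):              # (batch, phoneme_len), ids from a phoneme table
        x = self.embedding(phoneme_ids)
        out, _ = self.encoder(x)                 # (batch, phoneme_len, 2 * hidden)
        return out                               # text coding vectors
```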
S14, combining the text coding vector with the emotion feature vector of the speech, and training a speech synthesis model with the acoustic features of the corresponding speech.
Specifically, during training of the speech synthesis model, multi-scale emotion features are extracted from the generated speech feature vectors with the emotion extraction model trained in S12, compared against the multi-scale emotion features extracted in S13, and the error is fed back through a loss function to adjust the network.
The text coding vector and the emotion feature vector extracted in S13 are added and encoded, after which the emotion feature vector is expanded frame by frame according to the alignment between the spectral features and the text and fed into the decoding layer of the speech synthesis model.
The alignment between the spectral features and the text may include, but is not limited to, forced alignment and monotonic alignment search.
S15, taking the speech emotion feature vector and the text coding vector as input, and training an emotion feature prediction model.
The text coding vector is taken as input, the deep learning network outputs an emotion feature prediction vector, which is compared with the emotion feature vector of the training corpus extracted in S13, and the error is fed back through a loss function to adjust the network.
Optionally, the network structure of the emotion feature prediction model includes, but is not limited to, Transformer and RNN structures.
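One possible training step for the emotion feature prediction model is sketched below, using a small Transformer encoder and a mean-squared-error loss against the emotion feature vector extracted in S13; the layer sizes, mean pooling and loss choice are assumptions.

```python
import torch
import torch.nn as nn

class EmotionPredictor(nn.Module):
    """Predicts an emotion feature vector from the text coding vectors."""
    def __init__(self, text_dim=512, emo_dim=128, n_layers=2, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(text_dim, emo_dim)

    def forward(self, text_codes):               # (batch, phoneme_len, text_dim)
        h = self.encoder(text_codes)
        return self.proj(h.mean(dim=1))          # (batch, emo_dim)

predictor = EmotionPredictor()
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)
criterion = nn.MSELoss()

# text_codes from the trained text encoder, target_emotion from the network trained in S12.
text_codes = torch.randn(8, 40, 512)
target_emotion = torch.randn(8, 128)
loss = criterion(predictor(text_codes), target_emotion)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```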
S16, training a vocoder with the acoustic features of the speech and the corresponding speech.
Specifically, the vocoder is trained with an adversarial neural network, using the speech features generated by the speech synthesis model trained in S14 and the corresponding speech information as inputs.
Optionally, the vocoder used includes, but is not limited to, WaveNet, WaveRNN and MelGAN.
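A heavily simplified sketch of one adversarial training step for the vocoder follows; the generator and discriminator definitions are omitted, and the hinge-style losses stand in for the multi-scale discriminators and auxiliary losses used by actual MelGAN-style vocoders.

```python
import torch.nn.functional as F

def vocoder_gan_step(generator, discriminator, g_opt, d_opt, mel, real_audio):
    # Discriminator update: score real audio high, generated audio low.
    fake_audio = generator(mel).detach()
    d_loss = (F.relu(1.0 - discriminator(real_audio)).mean()
              + F.relu(1.0 + discriminator(fake_audio)).mean())
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: fool the discriminator.
    g_loss = -discriminator(generator(mel)).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```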
The inference phase comprises the following steps:
S21, processing the text to obtain a text coding vector, and generating emotion feature information from the text coding vector through the emotion feature prediction model;
S22, combining the text coding vector with the emotion feature information, and generating speech acoustic features through the speech synthesis model;
S23, synthesizing audio through a vocoder from the speech acoustic features.
Understandably, the speech synthesis model parameters used in the inference stage are obtained in the training stage, and the network structures are identical; the text is processed in the same way as in the training stage; and the emotion feature vector generated in the inference stage is expanded frame by frame according to the length of the text and combined with the text. The emotion feature extraction model is not used in the inference stage; instead, the emotion feature vector is predicted from the text by the emotion feature prediction model. The emotion feature vector is multiplied by a feature coefficient, which can be set to amplify or attenuate the emotion intensity.
For example, multiplying the emotion feature vector by 2 doubles the emotion intensity, while multiplying by 0.5 halves it; the coefficient is typically set between 0 and 2.
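At inference time this intensity control amounts to scaling the predicted emotion feature vector before it is combined with the text encoding, as in the sketch below; clamping to the 0-2 range follows the recommendation above.

```python
import torch

def scale_emotion(emotion_vec, intensity=1.0):
    intensity = max(0.0, min(2.0, intensity))   # keep the coefficient within [0, 2]
    return emotion_vec * intensity

emotion_vec = torch.randn(128)                  # predicted by the emotion feature prediction model
stronger = scale_emotion(emotion_vec, 2.0)      # doubled emotion intensity
weaker = scale_emotion(emotion_vec, 0.5)        # halved emotion intensity
```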
Example 2:
in this embodiment, an apparatus for synthesizing different emotion audios includes:
the speaker audio feature extraction unit is used for training a voice feature extraction model and extracting a voice feature vector;
the emotion voice feature extraction unit is used for extracting emotion features of the target;
an emotion feature prediction unit for predicting an emotion feature vector from the input text information;
The speech synthesis unit is used for synthesizing the input text information and the emotional characteristic information into speech information;
a vocoder unit which converts the generated speech features into an audio signal.
With the apparatus for synthesizing different emotion audios provided in Embodiment 2 of the invention, the emotion feature vectors are generated from the text and combined with the text feature vectors for speech synthesis, so that the synthesized audio carries richer emotion and stronger expressiveness.
It should be noted that, in this embodiment, each module (or unit) is in a logical sense, and in particular, when the embodiment is implemented, a plurality of modules (or units) may be combined into one module (or unit), and one module (or unit) may also be split into a plurality of modules (or units).
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above description is intended to be illustrative of the preferred embodiment of the present invention and should not be taken as limiting the invention, but rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (9)

1. A method for synthesizing different emotion audios is characterized by comprising a training phase and an inference phase, wherein the training phase comprises the following steps:
S11, collecting training corpora, including the audio of different speakers, the corresponding texts and emotion labels, and extracting the spectral features of the corresponding speech;
S12, training an emotion speech feature extraction model according to the speech spectral features and the corresponding emotion labels;
S13, extracting emotion feature vectors of the training corpus and the corresponding text coding vectors;
S14, combining the text coding vector with the emotion feature vector of the speech, and training a speech synthesis model through the acoustic features of the corresponding speech;
S15, taking the emotion feature vector and the text coding vector of the speech as input, and training an emotion feature prediction model;
S16, training a vocoder according to the acoustic features of the speech and the corresponding speech;
the inference phase comprises the following steps:
S21, processing the text to obtain a text coding vector, and generating emotion feature information from the text coding vector through an emotion feature prediction model;
S22, combining the text coding vector with the emotion feature information, and generating speech acoustic features through a speech synthesis model;
and S23, synthesizing audio through a vocoder according to the acoustic features of the speech.
2. The method of claim 1, wherein the step S11 further comprises:
the training corpus includes, but is not limited to, an open speech synthesis training data set or a self-recorded speech synthesis training data set, and the extracted speech spectral features include, but are not limited to, linear spectral features, mel spectral features.
3. The method of claim 1, wherein the step S12 further comprises:
and performing emotion recognition of the speech with the spectral features and labels of the training speech as the input of an emotion feature extraction network, wherein the network structure of the emotion feature recognition model includes, but is not limited to, CNN (convolutional neural network) and GRU (gated recurrent unit) structures.
4. The method of claim 1, wherein the step S13 further comprises:
extracting the emotion feature vector of the training corpus with the trained emotion feature extraction network; normalizing the corresponding text and, from the text information, extracting its text coding vector through a text coding network; and extracting multi-scale emotion features of the training corpus through a convolutional neural network to obtain a feature matrix.
5. The method of claim 1, wherein the step S14 further comprises:
in the process of training the speech synthesis model, extracting multi-scale emotion features from the generated speech feature vectors through the emotion extraction model trained in S12, comparing them with the multi-scale emotion features of the training corpus extracted in S13, and feeding back the error through a loss function to adjust the network;
adding the text coding vector and the emotion feature vector extracted in S13 for encoding, then expanding the emotion feature vector frame by frame according to the alignment between the spectral features and the text, and inputting the frames into the decoding layer of the speech synthesis model;
the alignment between the spectral features and the text may include, but is not limited to, forced alignment and monotonic alignment search.
6. The method of claim 1, wherein the step S15 further comprises:
taking the text coding vector as input, outputting an emotion feature prediction vector from the network through a deep learning network, comparing it with the emotion feature vector of the training corpus extracted in S13, and adjusting the network by feeding back the error through a loss function, wherein the network structure of the emotion feature prediction model includes, but is not limited to, RNN and Transformer structures.
7. The method of claim 1, wherein the step S16 further comprises:
training a vocoder through an adversarial neural network with the speech features generated by the speech synthesis model trained in S14 and the corresponding speech information as inputs; vocoders used include, but are not limited to, WaveNet, WaveRNN and MelGAN.
8. The method of claim 1, wherein the steps S21-S23 further comprise:
the parameters of the speech synthesis model in the inference stage are obtained in the training stage, and the network structures are identical; the text is processed in the same way as in the training stage; the emotion feature vector generated in the inference stage is expanded frame by frame according to the length of the text and combined with the text; the emotion feature extraction model is not used in the inference stage, the emotion feature vector is instead predicted from the text by the emotion feature prediction model, and the emotion intensity is amplified or weakened by multiplying the emotion feature vector by a feature coefficient.
9. An apparatus for synthesizing different emotion audios, comprising:
the speaker audio feature extraction unit is used for training a voice feature extraction model and extracting a voice feature vector;
the emotion voice feature extraction unit is used for extracting emotion features of the target;
an emotion feature prediction unit for predicting an emotion feature vector from the input text information;
The speech synthesis unit is used for synthesizing the input text information and the emotional characteristic information into speech information;
and a vocoder unit for converting the generated voice feature into an audio signal.
CN202211454821.3A 2022-11-21 2022-11-21 Method and device for synthesizing different emotion audios Pending CN115762466A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211454821.3A CN115762466A (en) 2022-11-21 2022-11-21 Method and device for synthesizing different emotion audios

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211454821.3A CN115762466A (en) 2022-11-21 2022-11-21 Method and device for synthesizing different emotion audios

Publications (1)

Publication Number Publication Date
CN115762466A true CN115762466A (en) 2023-03-07

Family

ID=85333482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211454821.3A Pending CN115762466A (en) 2022-11-21 2022-11-21 Method and device for synthesizing different emotion audios

Country Status (1)

Country Link
CN (1) CN115762466A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778967A (en) * 2023-08-28 2023-09-19 清华大学 Multi-mode emotion recognition method and device based on pre-training model
CN116778967B (en) * 2023-08-28 2023-11-28 清华大学 Multi-mode emotion recognition method and device based on pre-training model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination