CN115762466A - Method and device for synthesizing different emotion audios

Method and device for synthesizing different emotion audios

Info

Publication number
CN115762466A
CN115762466A (application number CN202211454821.3A)
Authority
CN
China
Prior art keywords
emotion
voice
training
text
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211454821.3A
Other languages
Chinese (zh)
Inventor
周琳岷
王昆
朱海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Qiruike Technology Co Ltd
Sichuan Changhong Electronic Holding Group Co Ltd
Original Assignee
Sichuan Qiruike Technology Co Ltd
Sichuan Changhong Electronic Holding Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Qiruike Technology Co Ltd, Sichuan Changhong Electronic Holding Group Co Ltd filed Critical Sichuan Qiruike Technology Co Ltd
Priority to CN202211454821.3A priority Critical patent/CN115762466A/en
Publication of CN115762466A publication Critical patent/CN115762466A/en
Pending legal-status Critical Current


Abstract

The invention provides a method and a device for synthesizing different emotion audios, comprising a training stage and an inference stage. The training stage comprises the following steps: S11, collecting training corpora, including the audio of different speakers, the corresponding texts and emotion labels, and extracting the spectral features of the corresponding speech; S12, training an emotion speech feature extraction model from the speech spectral features and the corresponding emotion labels; S13, extracting the emotion feature vectors of the training corpus and the corresponding text coding vectors; S14, combining the text coding vectors with the emotion feature vectors of the speech, and training a speech synthesis model with the acoustic features of the corresponding speech; S15, taking the emotion feature vectors and the text coding vectors of the speech as input, and training an emotion feature prediction model; S16, training a vocoder with the acoustic features of the speech and the corresponding speech. The invention solves the problems of flat intonation and unclear emotion in traditional speech synthesis.

Description

Method and device for synthesizing different emotion audios
Technical Field
The invention relates to the technical field of voice synthesis, in particular to a method and a device for synthesizing different emotion audios.
Background
With the continuous development of speech technology, expectations for speech synthesis quality keep rising. The mechanical pronunciation of traditional speech synthesis is no longer satisfactory, and users hope to add more emotional expressiveness to synthesized speech.
Existing speech synthesis either synthesizes directly or converts emotion labels into fixed vector codes. This approach usually depends on preset results, does not take different contexts into account, and its coding scheme is too simple: the labels are relatively fixed and cannot reflect subtle variations. A dynamic coding scheme, by contrast, better reflects the actual emotional state and makes it easier to adjust the emotion intensity.
Disclosure of Invention
The invention aims to provide a method and a device for synthesizing different emotion audios so as to solve the problems described in the background above, namely that current speech synthesis lacks emotion and that the synthesized emotions are difficult to distinguish.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method and a device for synthesizing different emotion audios comprise a training stage and an inference stage, wherein the training stage comprises the following steps:
s11, collecting training corpora including audios of different speakers, corresponding texts and emotion labels and extracting frequency spectrum characteristics of corresponding voices;
s12, training an emotional voice feature extraction model according to the voice frequency spectrum features and the corresponding emotional labels;
s13, extracting emotion characteristic vectors of the training corpus and text coding vectors corresponding to the training corpus;
s14, combining the text coding vector with the emotion characteristic vector of the voice, and training a voice synthesis model through acoustic characteristics of the corresponding voice
S15, taking the emotion feature vector and the text coding vector of the voice as input, and training an emotion feature prediction model;
s16, training vocoder through acoustic characteristics and corresponding voice of voice
The inference phase comprises the following steps:
s21, processing the text to obtain a text coding vector, and generating emotional feature information according to the text coding vector through an emotional feature prediction model
S22, combining the text coding vector with the emotional characteristic information, and generating the voice acoustic characteristic through a voice synthesis model
S23, synthesizing audio frequency through a vocoder according to the acoustic characteristics of the voice
Further, in order to extract the spectral features of the corresponding speech, S11 further includes:
the training corpus includes, but is not limited to, an open speech synthesis training data set or a self-recorded speech synthesis training data set, and the extracted speech spectral features include, but are not limited to, linear spectral features and mel spectral features.
Further, in order to train the emotion speech feature extraction model, S12 further includes:
performing speech emotion recognition with the spectral features and labels of the training speech as the input of the emotion feature extraction network, wherein the network structure of the emotion feature recognition model includes, but is not limited to, structures such as CNN and GRU.
Further, in order to extract the emotion feature vectors of the training corpus and the corresponding text coding vectors, S13 further includes:
extracting the emotion feature vector of the training corpus with the trained emotion feature extraction network; normalizing the corresponding text and, from the text information, extracting its text coding vector through a text coding network; and extracting multi-scale emotion features of the training corpus through a convolutional neural network to obtain a feature matrix.
Further, in order to train the speech synthesis model, S14 further includes:
during training of the speech synthesis model, multi-scale emotion features are extracted from the generated speech feature vectors with the emotion extraction model trained in S12, compared against the multi-scale emotion features extracted in S13, and the error is fed back through a loss function to adjust the network;
the text coding vector and the emotion feature vector extracted in S13 are added and encoded, after which the emotion feature vector is expanded frame by frame according to the alignment between the spectral features and the text and fed into the decoding layer of the speech synthesis model;
the alignment between the spectral features and the text may include, but is not limited to, forced alignment and monotonic alignment search.
Further, in order to train the emotion feature prediction model, S15 further includes:
taking the text coding vector as input, the deep learning network outputs an emotion feature prediction vector, which is compared with the emotion feature vector of the training corpus extracted in S13, and the error is fed back through a loss function to adjust the network; the network structure of the emotion feature prediction model includes, but is not limited to, RNN and Transformer structures.
Further, in order to train the vocoder, S16 further includes:
training the vocoder with an adversarial neural network, using the speech features generated by the speech synthesis model trained in S14 and the corresponding speech signal as inputs; vocoders used include, but are not limited to, WaveNet, WaveRNN and MelGAN.
Further, for emotion speech synthesis, S21-S23 further include:
the speech synthesis model parameters used in the inference stage are obtained in the training stage, and the network structures are identical; the text is processed in the same way as in the training stage; the emotion feature vector generated in the inference stage is expanded frame by frame according to the length of the text and combined with the text; the emotion feature extraction model is not used in the inference stage, and the emotion feature vector is instead predicted from the text by the emotion feature prediction model; the emotion intensity is amplified or weakened by multiplying the emotion feature vector by a feature coefficient.
The invention provides a device for synthesizing different emotion audios, which comprises:
the speaker audio feature extraction unit is used for training a voice feature extraction model and extracting a voice feature vector;
the emotion voice feature extraction unit is used for extracting the emotion features of the target;
an emotion feature prediction unit for predicting an emotion feature vector from the input text information;
The speech synthesis unit is used for synthesizing the input text information and the emotional characteristic information into speech information;
and a vocoder unit for converting the generated voice feature into an audio signal.
The advantages of the method and the device for synthesizing different emotion audios include, but are not limited to, the following:
the method and the device can synthesize emotional speech for a speaker's sentences in real time, solving the problem that traditional synthesis methods cannot express the speaker's emotion. The invention is applied to, but not limited to, the field of emotional speech synthesis. By predicting the emotion feature vector and adding it to the synthesized sentence, the synthesized audio carries richer emotion and stronger expressiveness.
Drawings
FIG. 1 is a flowchart illustrating a method for synthesizing different emotion audios according to an embodiment of the present invention
Fig. 2 is a training flowchart according to an embodiment of the present invention.
FIG. 3 is a flow chart illustrating inference according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
On the contrary, this application is intended to cover any alternatives, modifications, and equivalents that may be included within the spirit and scope of the application as defined by the appended claims. Furthermore, in the following detailed description of the present application, certain specific details are set forth in order to provide a thorough understanding of the present application. It will be apparent to one skilled in the art that the present application may be practiced without these specific details.
A method and an apparatus for synthesizing different emotion audios according to the embodiments of the present application will be described in detail with reference to fig. 1 to 3. It should be noted that the following examples are merely illustrative of the present application and are not to be construed as limiting the present application.
Example 1:
A method for synthesizing different emotion audios comprises the following steps in the training stage:
S11, collecting training corpora, including the audio of different speakers, the corresponding texts and emotion labels, and extracting the spectral features of the corresponding speech;
Optionally, the training corpus includes, but is not limited to, an open speech synthesis training data set or a self-recorded speech synthesis training data set, and the extracted speech spectral features include, but are not limited to, linear spectral features and mel spectral features.
For example, 80-dimensional mel spectral features are extracted from the collected audio using a window length of 0.05 s and a sliding distance (hop) of 0.015 s, or 513-dimensional linear spectral features are extracted through the short-time Fourier transform.
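As an illustration only, the following sketch shows how these spectral features could be computed; the use of librosa, the 16 kHz sample rate and the 1024-point FFT are assumptions not fixed by this disclosure.

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    """Return (frames, 513) linear and (frames, 80) log-mel spectral features."""
    y, _ = librosa.load(wav_path, sr=sr)
    win = int(0.05 * sr)    # 0.05 s analysis window (800 samples at 16 kHz)
    hop = int(0.015 * sr)   # 0.015 s sliding distance (240 samples)
    n_fft = 1024            # 1024-point FFT -> 513 linear-spectrum bins

    # 513-dimensional linear (magnitude) spectrum via the short-time Fourier transform
    linear = np.abs(librosa.stft(y, n_fft=n_fft, win_length=win, hop_length=hop))

    # 80-dimensional mel spectrogram derived from the same analysis settings
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, win_length=win, hop_length=hop, n_mels=80)
    log_mel = np.log(np.clip(mel, 1e-5, None))

    return linear.T, log_mel.T
```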
S12, training an emotional voice feature extraction model according to the voice frequency spectrum features and the corresponding emotional labels;
Specifically, speech emotion recognition is performed with the spectral features and labels of the training speech as the input of the emotion feature extraction network. The network structure of the emotion feature recognition model includes, but is not limited to, structures such as CNN and GRU.
for example, linear spectrum features are subjected to feature vector extraction through a 2D convolution network and GRU, pass through a full connection layer and are classified through softmax.
S13, extracting emotion characteristic vectors of the training corpus and text coding vectors corresponding to the training corpus;
Specifically, the emotion feature vector of the training corpus is extracted with the trained emotion feature extraction network; the corresponding text is normalized and, from the text information, its text coding vector is extracted through a text coding network; and the multi-scale emotion features of the training corpus are extracted through a convolutional neural network to obtain a feature matrix.
For example, the emotion feature vector is taken from the bottleneck layer of the trained classification model. The Chinese text is normalized, illegal syllables are filtered out, and the valid input is segmented into words and tagged with parts of speech; the extracted linguistic features are fed into a prosody prediction model to obtain pause-level tags. The Chinese characters are then converted into the corresponding pinyin phonemes, and the text information is encoded to obtain the text coding vectors.
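The sketch below covers only the grapheme-to-phoneme and text-encoding part of this front end; word segmentation, part-of-speech tagging and the prosody (pause-level) model are omitted, and the use of pypinyin as well as the embedding and GRU sizes are assumptions.

```python
import torch
import torch.nn as nn
from pypinyin import Style, lazy_pinyin

def text_to_phonemes(text):
    # Tone-numbered pinyin syllables, e.g. "今天" -> ["jin1", "tian1"]
    return lazy_pinyin(text, style=Style.TONE3)

class TextEncoder(nn.Module):
    """Maps phoneme id sequences to per-phoneme text coding vectors."""
    def __init__(self, vocab_size, emb_dim=256, hidden=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):              # (batch, phoneme_len), ids from a phoneme table
        x = self.embedding(phoneme_ids)
        out, _ = self.encoder(x)                 # (batch, phoneme_len, 2 * hidden)
        return out                               # text coding vectors
```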
S14, combining the text coding vector with the emotion feature vector of the speech, and training a speech synthesis model with the acoustic features of the corresponding speech.
Specifically, during training of the speech synthesis model, multi-scale emotion features are extracted from the generated speech feature vectors with the emotion extraction model trained in S12, compared against the multi-scale emotion features extracted in S13, and the error is fed back through a loss function to adjust the network.
The text coding vector and the emotion feature vector extracted in S13 are added and encoded, after which the emotion feature vector is expanded frame by frame according to the alignment between the spectral features and the text and fed into the decoding layer of the speech synthesis model.
The alignment between the spectral features and the text may include, but is not limited to, forced alignment and monotonic alignment search.
S15, taking the speech emotion feature vector and the text coding vector as input, and training an emotion feature prediction model.
The text coding vector is taken as input, the deep learning network outputs an emotion feature prediction vector, which is compared with the emotion feature vector of the training corpus extracted in S13, and the error is fed back through a loss function to adjust the network.
Optionally, the network structure of the emotion feature prediction model includes, but is not limited to, Transformer and RNN structures.
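One possible training step for the emotion feature prediction model is sketched below, using a small Transformer encoder and a mean-squared-error loss against the emotion feature vector extracted in S13; the layer sizes, mean pooling and loss choice are assumptions.

```python
import torch
import torch.nn as nn

class EmotionPredictor(nn.Module):
    """Predicts an emotion feature vector from the text coding vectors."""
    def __init__(self, text_dim=512, emo_dim=128, n_layers=2, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(text_dim, emo_dim)

    def forward(self, text_codes):               # (batch, phoneme_len, text_dim)
        h = self.encoder(text_codes)
        return self.proj(h.mean(dim=1))          # (batch, emo_dim)

predictor = EmotionPredictor()
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)
criterion = nn.MSELoss()

# text_codes from the trained text encoder, target_emotion from the network trained in S12.
text_codes = torch.randn(8, 40, 512)
target_emotion = torch.randn(8, 128)
loss = criterion(predictor(text_codes), target_emotion)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```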
S16, training a vocoder with the acoustic features of the speech and the corresponding speech.
Specifically, the vocoder is trained with an adversarial neural network, using the speech features generated by the speech synthesis model trained in S14 and the corresponding speech information as inputs.
Optionally, the vocoder used includes, but is not limited to, WaveNet, WaveRNN and MelGAN.
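A heavily simplified sketch of one adversarial training step for the vocoder follows; the generator and discriminator definitions are omitted, and the hinge-style losses stand in for the multi-scale discriminators and auxiliary losses used by actual MelGAN-style vocoders.

```python
import torch.nn.functional as F

def vocoder_gan_step(generator, discriminator, g_opt, d_opt, mel, real_audio):
    # Discriminator update: score real audio high, generated audio low.
    fake_audio = generator(mel).detach()
    d_loss = (F.relu(1.0 - discriminator(real_audio)).mean()
              + F.relu(1.0 + discriminator(fake_audio)).mean())
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: fool the discriminator.
    g_loss = -discriminator(generator(mel)).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```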
The inference phase comprises the following steps:
S21, processing the text to obtain a text coding vector, and generating emotion feature information from the text coding vector through the emotion feature prediction model;
S22, combining the text coding vector with the emotion feature information, and generating speech acoustic features through the speech synthesis model;
S23, synthesizing audio through a vocoder from the speech acoustic features.
Understandably, the speech synthesis model parameters used in the inference stage are obtained in the training stage, and the network structures are identical; the text is processed in the same way as in the training stage; and the emotion feature vector generated in the inference stage is expanded frame by frame according to the length of the text and combined with the text. The emotion feature extraction model is not used in the inference stage; instead, the emotion feature vector is predicted from the text by the emotion feature prediction model. The emotion feature vector is multiplied by a feature coefficient, which can be set to amplify or attenuate the emotion intensity.
For example, multiplying the emotion feature vector by 2 doubles the emotion intensity, while multiplying by 0.5 halves it; the coefficient is typically set between 0 and 2.
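At inference time this intensity control amounts to scaling the predicted emotion feature vector before it is combined with the text encoding, as in the sketch below; clamping to the 0-2 range follows the recommendation above.

```python
import torch

def scale_emotion(emotion_vec, intensity=1.0):
    intensity = max(0.0, min(2.0, intensity))   # keep the coefficient within [0, 2]
    return emotion_vec * intensity

emotion_vec = torch.randn(128)                  # predicted by the emotion feature prediction model
stronger = scale_emotion(emotion_vec, 2.0)      # doubled emotion intensity
weaker = scale_emotion(emotion_vec, 0.5)        # halved emotion intensity
```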
Example 2:
in this embodiment, an apparatus for synthesizing different emotion audios includes:
the speaker audio feature extraction unit is used for training a voice feature extraction model and extracting a voice feature vector;
the emotion voice feature extraction unit is used for extracting emotion features of the target;
an emotion feature prediction unit for predicting an emotion feature vector from the input text information;
The speech synthesis unit is used for synthesizing the input text information and the emotional characteristic information into speech information;
a vocoder unit which converts the generated speech features into an audio signal.
With the apparatus for synthesizing different emotion audios provided in Embodiment 2 of the invention, the emotion feature vectors are generated from the text and combined with the text feature vectors for speech synthesis, so that the synthesized audio carries richer emotion and stronger expressiveness.
It should be noted that, in this embodiment, each module (or unit) is in a logical sense, and in particular, when the embodiment is implemented, a plurality of modules (or units) may be combined into one module (or unit), and one module (or unit) may also be split into a plurality of modules (or units).
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above description is intended to be illustrative of the preferred embodiment of the present invention and should not be taken as limiting the invention, but rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (9)

1. A method for synthesizing different emotion audios is characterized by comprising a training phase and an inference phase, wherein the training phase comprises the following steps:
S11, collecting training corpora, including the audio of different speakers, the corresponding texts and emotion labels, and extracting the spectral features of the corresponding speech;
S12, training an emotion speech feature extraction model according to the speech spectral features and the corresponding emotion labels;
S13, extracting emotion feature vectors of the training corpus and the corresponding text coding vectors;
S14, combining the text coding vector with the emotion feature vector of the speech, and training a speech synthesis model through the acoustic features of the corresponding speech;
S15, taking the emotion feature vector and the text coding vector of the speech as input, and training an emotion feature prediction model;
S16, training a vocoder according to the acoustic features of the speech and the corresponding speech;
the inference phase comprises the following steps:
S21, processing the text to obtain a text coding vector, and generating emotion feature information from the text coding vector through an emotion feature prediction model;
S22, combining the text coding vector with the emotion feature information, and generating speech acoustic features through a speech synthesis model;
and S23, synthesizing audio through a vocoder according to the acoustic features of the speech.
2. The method of claim 1, wherein the step S11 further comprises:
the training corpus includes, but is not limited to, an open speech synthesis training data set or a self-recorded speech synthesis training data set, and the extracted speech spectral features include, but are not limited to, linear spectral features, mel spectral features.
3. The method of claim 1, wherein the step S12 further comprises:
and performing emotion recognition of the speech with the spectral features and labels of the training speech as the input of an emotion feature extraction network, wherein the network structure of the emotion feature recognition model includes, but is not limited to, CNN (convolutional neural network) and GRU (gated recurrent unit) structures.
4. The method of claim 1, wherein the step S13 further comprises:
extracting the emotion feature vector of the training corpus with the trained emotion feature extraction network; normalizing the corresponding text and, from the text information, extracting its text coding vector through a text coding network; and extracting multi-scale emotion features of the training corpus through a convolutional neural network to obtain a feature matrix.
5. The method of claim 1, wherein the step S14 further comprises:
in the process of training the speech synthesis model, extracting multi-scale emotion features from the generated speech feature vectors through the emotion extraction model trained in S12, comparing them with the multi-scale emotion features of the training corpus extracted in S13, and feeding back the error through a loss function to adjust the network;
adding the text coding vector and the emotion feature vector extracted in S13 for encoding, then expanding the emotion feature vector frame by frame according to the alignment between the spectral features and the text, and inputting the frames into the decoding layer of the speech synthesis model;
the alignment between the spectral features and the text may include, but is not limited to, forced alignment and monotonic alignment search.
6. The method of claim 1, wherein the step S15 further comprises:
taking the text coding vector as input, outputting an emotion feature prediction vector from the network through a deep learning network, comparing it with the emotion feature vector of the training corpus extracted in S13, and adjusting the network by feeding back the error through a loss function, wherein the network structure of the emotion feature prediction model includes, but is not limited to, RNN and Transformer structures.
7. The method of claim 1, wherein the step S16 further comprises:
training a vocoder through an adversarial neural network with the speech features generated by the speech synthesis model trained in S14 and the corresponding speech information as inputs; vocoders used include, but are not limited to, WaveNet, WaveRNN and MelGAN.
8. The method of claim 1, wherein the steps S21-S23 further comprise:
the parameters of the speech synthesis model in the inference stage are obtained in the training stage, and the network structures are identical; the text is processed in the same way as in the training stage; the emotion feature vector generated in the inference stage is expanded frame by frame according to the length of the text and combined with the text; the emotion feature extraction model is not used in the inference stage, the emotion feature vector is instead predicted from the text by the emotion feature prediction model, and the emotion intensity is amplified or weakened by multiplying the emotion feature vector by a feature coefficient.
9. An apparatus for synthesizing different emotion audios, comprising:
the speaker audio feature extraction unit is used for training a voice feature extraction model and extracting a voice feature vector;
the emotion voice feature extraction unit is used for extracting emotion features of the target;
an emotion feature prediction unit for predicting an emotion feature vector from the input text information;
The speech synthesis unit is used for synthesizing the input text information and the emotional characteristic information into speech information;
and a vocoder unit for converting the generated voice feature into an audio signal.
CN202211454821.3A 2022-11-21 2022-11-21 Method and device for synthesizing different emotion audios Pending CN115762466A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211454821.3A CN115762466A (en) 2022-11-21 2022-11-21 Method and device for synthesizing different emotion audios

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211454821.3A CN115762466A (en) 2022-11-21 2022-11-21 Method and device for synthesizing different emotion audios

Publications (1)

Publication Number Publication Date
CN115762466A true CN115762466A (en) 2023-03-07

Family

ID=85333482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211454821.3A Pending CN115762466A (en) 2022-11-21 2022-11-21 Method and device for synthesizing different emotion audios

Country Status (1)

Country Link
CN (1) CN115762466A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778967A (en) * 2023-08-28 2023-09-19 清华大学 Multi-mode emotion recognition method and device based on pre-training model
CN116778967B (en) * 2023-08-28 2023-11-28 清华大学 Multi-mode emotion recognition method and device based on pre-training model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination