WO2021047233A1 - Deep learning-based emotional speech synthesis method and device - Google Patents
Deep learning-based emotional speech synthesis method and device
- Publication number: WO2021047233A1 (PCT/CN2020/096998)
- Authority: WO (WIPO (PCT))
- Prior art keywords: model, information, sub, text information, sample
- Prior art date: 2019-09-10
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for estimating an emotional state
Definitions
- the present invention relates to the field of speech synthesis, in particular to an emotional speech synthesis method and device based on deep learning.
- the present invention provides an emotional speech synthesis method and device based on deep learning, which can realize emotional speech synthesis without manually labeling emotions one by one.
- an emotional speech synthesis method based on deep learning is provided, the method at least including the following steps:
- an emotional speech is synthesized through a pre-trained second model.
- the first model includes a first sub-model, a second sub-model, and a third sub-model that are sequentially connected; taking the to-be-processed text information and the preceding information as input and generating emotional feature information through the pre-built first model specifically includes the following sub-steps:
- feature extraction is performed through a pre-trained third sub-model to obtain emotional feature information.
- when the preceding information also includes preceding speech information, the first model includes a fourth sub-model, a fifth sub-model, and a sixth sub-model that are sequentially connected; taking the text information to be processed and the preceding information as input and generating the emotional feature information through the pre-built first model specifically includes the following sub-steps:
- feature extraction is performed through a pre-trained sixth sub-model to obtain emotional feature information.
- when the second model is pre-trained, the training specifically includes the following sub-steps:
- the second model is trained.
- when the first model is pre-trained, the training specifically includes the following sub-steps:
- extracting the current text information sample and the preceding information sample of the video sample, the preceding information sample including the preceding text information sample;
- the first sub-model is obtained by training, and the first intermediate output of the first sub-model is extracted.
- the third sub-model is obtained through training.
- when the first model is pre-trained, the training specifically includes the following sub-steps:
- the fourth sub-model is obtained by training, and the fourth intermediate output of the fourth sub-model is extracted.
- the sixth sub-model is obtained through training.
- the pre-training of the first model further includes video sample preprocessing, which at least includes:
- the video image sample is divided into several video image sub-samples, the text in any time interval is used as the current text information sample, and the text before any time interval is used as the preceding text information sample.
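- The time-interval splitting described above can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the subtitle track is a time-sorted list of (timestamp, text) pairs, and the function name and interval handling are assumptions rather than the patent's implementation.

```python
from typing import List, Tuple


def split_by_interval(subtitles: List[Tuple[float, str]],
                      interval: float) -> List[Tuple[str, str]]:
    """Group subtitle lines into fixed time intervals and pair each interval's
    text (current text information sample) with all text that came before it
    (preceding text information sample). `subtitles` is assumed time-sorted."""
    samples = []
    if not subtitles:
        return samples
    end_time = subtitles[-1][0]
    t = 0.0
    while t <= end_time:
        current = " ".join(txt for ts, txt in subtitles if t <= ts < t + interval)
        preceding = " ".join(txt for ts, txt in subtitles if ts < t)
        if current:
            samples.append((current, preceding))
        t += interval
    return samples


# Example with a 10-second interval over a short subtitle track.
subs = [(1.0, "Hello."), (12.0, "How are you?"), (23.0, "I'm fine, thanks.")]
print(split_by_interval(subs, interval=10.0))
```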
- the present invention also provides a deep learning-based emotional speech synthesis device for executing the above method, the device at least comprising:
- Extraction module: used to extract the to-be-processed text information and the preceding information of the to-be-processed text information, the preceding information including the preceding text information;
- Emotional feature information generation module: used to generate emotional feature information through a pre-built first model, taking the text information to be processed and the preceding information as input;
- Emotional speech synthesis module: used to synthesize emotional speech through a pre-trained second model, taking the emotional feature information and the text information to be processed as input.
- the first model includes a first sub-model, a second sub-model, and a third sub-model that are sequentially connected
- the emotional feature information generation module includes at least:
- the first feature extraction unit used to take the to-be-processed text information and the preceding information as input, and perform feature extraction through the pre-trained first sub-model to obtain a first intermediate output;
- the second feature extraction unit used to take the first intermediate output and the text information to be processed as input, and perform feature extraction through the pre-trained second sub-model to obtain the emotion type and the second intermediate output;
- the third feature extraction unit used to take the second intermediate output, the text information to be processed, the emotion type or the received user-specified emotion type as input, and perform feature extraction through a pre-trained third sub-model to obtain emotional feature information.
- when the preceding information also includes the preceding speech information, the first model includes a fourth sub-model, a fifth sub-model, and a sixth sub-model that are sequentially connected, and the emotional feature information acquisition module includes at least:
- the fourth feature extraction unit used to take the to-be-processed text information and the preceding information as input, and perform feature extraction through a pre-trained fourth sub-model to obtain a fourth intermediate output;
- the fifth feature extraction unit for taking the fourth intermediate output and the text information to be processed as input, performing feature extraction through a pre-trained fifth sub-model to obtain a fifth intermediate output;
- the sixth feature extraction unit used to take the fifth intermediate output, the text information to be processed, the emotion type, or the received user-specified emotion type as input, and perform feature extraction through a pre-trained sixth sub-model to obtain emotional feature information.
- the device further includes a model training module, the model training module includes at least a second model training unit for training the second model, and the second model training unit includes at least:
- the first extraction subunit used to extract video image samples, text information samples and dialogue information samples of the video samples;
- Emotion labeling subunit: used to label the video image samples according to a preset emotion classification to obtain emotion labeling information samples;
- the first training subunit used to take the video image sample as input and the emotion annotation information sample as output, train a third model, and extract the third intermediate output of the third model as the emotional information of the video image sample;
- the model training module further includes a first model training unit for training the first model, and the first model training unit at least includes:
- the second extraction subunit used to extract the current text information sample and the preceding information sample of the video sample, the preceding information sample including the preceding text information sample;
- the second training subunit used to take the current text information sample and the preceding information sample as input, and whether the emotion of the current text information sample has changed relative to the preceding information sample as the output, train to obtain the first sub-model, and extract the first intermediate output from the intermediate outputs of the first sub-model;
- the model training module further includes a third model training unit for training another first model, and the third model training unit at least includes:
- the third extraction subunit used to extract the current text information sample and the preceding information sample of the video sample, the preceding information sample including the preceding text information sample and the preceding speech information sample;
- the third training subunit used to take the current text information sample and the preceding information sample as input, and whether the emotion of the current text information sample has changed relative to the preceding information sample as the output, train to obtain the fourth sub-model, and extract the fourth intermediate output from the intermediate outputs of the fourth sub-model;
- the second model training unit further includes:
- the preprocessing subunit is used to divide the video image sample into a number of video image subsamples according to a preset time interval, using the text in any time interval as the current text information sample and the text before that interval as the preceding text information sample.
- the present invention discloses an emotional speech synthesis method based on deep learning: based on the extracted text information to be processed and its preceding information, emotional feature information is generated through a pre-built first model, and then, according to the emotional feature information and the text information to be processed, emotional speech is synthesized through a second model pre-trained on video samples.
- This method can realize deep-learning-based emotional speech synthesis from text information alone, without manually labeling text and emotion for each acoustic pronunciation in advance. It can therefore reduce labeling errors while lowering labor costs, improve the fit of the emotional information, enrich the emotions of dialogue speech, improve the naturalness and fluency of the synthesized speech, and improve the human-machine communication experience;
- when training the models, the present invention first obtains emotional information from video images according to the corresponding video image information, text information, and speech information in a piece of video, constructs an emotional speech generation module based on the video images, and then builds, based on the text information, an emotional speech generation module that targets that emotional information, thereby achieving the purpose of generating emotional speech from text information. The method is therefore suitable for video communication scenarios, voice communication scenarios, and even communication scenarios with only text information; it is widely adaptable and further improves the human-machine communication experience;
- because the speech synthesis model (second model) is constructed from the video image samples extracted from the video together with the corresponding text information samples and dialogue information samples, the obtained emotion is more appropriate and the emotion of the synthesized speech is more accurate and natural.
- FIG. 1 is a flowchart of an emotional speech synthesis method based on deep learning in Embodiment 1 of the present invention
- FIG. 2 is a logical schematic diagram of an emotional speech synthesis method based on deep learning in Embodiment 1 of the present invention
- FIG. 3 is a schematic diagram of logic when training a second model in Embodiment 1 of the present invention.
- Fig. 5 is a schematic structural diagram of an emotional speech synthesis device based on deep learning in the second embodiment of the present invention.
- this embodiment provides an emotional speech synthesis method based on deep learning, which belongs to the field of speech synthesis.
- emotional speech can be synthesized without the need to manually label emotions, and the naturalness of the synthesized speech emotion can be effectively improved.
- the method includes the following steps:
- the preceding information includes the preceding text information; alternatively, the preceding information includes the preceding text information and the preceding speech information.
- extracting text information from text objects, extracting text information and speech information from speech objects, and extracting text information and speech information from video objects can all be implemented by different extractors.
- these manners are conventional technical means in the field and will not be elaborated here.
- the emotional feature information is generated through the first model constructed in advance.
- step S2 specifically includes the following sub-steps:
- S213 Take the second intermediate output, the text information to be processed, the emotion type or the received user-specified emotion type as input, and perform feature extraction through a pre-trained third sub-model to obtain emotion feature information.
- one of the input ports of the third sub-model is an emotion control port, which can receive either the emotion type output by the second sub-model or an emotion type manually set by the user. The emotional feature information can therefore be obtained entirely from the model, and when the model output is not accurate enough it can be adjusted by human intervention, further improving the accuracy and reliability of the obtained emotional feature information.
- the first intermediate output is the output feature vector of the layer preceding the logical judgment layer of the first sub-model; its content includes the current dialogue tone and the emotional features of the current text extracted by the first sub-model.
- the second intermediate output is the output feature vector of the layer preceding the classification layer of the second sub-model; its content includes the emotional features of the current text extracted by the second sub-model in combination with the first intermediate output.
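- The wiring of these three sub-models at inference time can be sketched as follows. This is a minimal, non-authoritative PyTorch sketch: sub1, sub2 and sub3 stand for the pre-trained first, second and third sub-models and are assumed to return both their prediction and their penultimate-layer feature vector (the intermediate outputs described above); the embedding shapes, the batch size of one, and all names are illustrative assumptions, not the patent's implementation.

```python
from typing import Optional, Tuple

import torch


def generate_emotion_features(sub1, sub2, sub3,
                              text_emb: torch.Tensor,
                              preceding_emb: torch.Tensor,
                              user_emotion: Optional[int] = None
                              ) -> Tuple[torch.Tensor, int]:
    """Sketch of step S2 for the text-only path (sub-models 1-3)."""
    # First sub-model: has the emotion changed relative to the preceding text?
    # Returns (logits, first intermediate output).
    _changed_logits, first_intermediate = sub1(text_emb, preceding_emb)

    # Second sub-model: emotion type plus the second intermediate output.
    emotion_logits, second_intermediate = sub2(first_intermediate, text_emb)
    emotion_type = int(emotion_logits.argmax(dim=-1))  # single example assumed

    # Emotion control port: a user-specified emotion type overrides the prediction.
    emotion = user_emotion if user_emotion is not None else emotion_type

    # Third sub-model: produces the emotional feature information for the synthesizer.
    emotion_features = sub3(second_intermediate, text_emb, torch.tensor([emotion]))
    return emotion_features, emotion
```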
- step S2 specifically includes the following sub-steps:
- S223 Take the fifth intermediate output, the text information to be processed, the emotion type or the received user-specified emotion type as input, and perform feature extraction through a pre-trained sixth sub-model to obtain emotion feature information.
- the fourth intermediate output is the output feature vector of the layer preceding the logical judgment layer of the fourth sub-model; its content includes the current dialogue tone and the emotional features of the current text that the fourth sub-model extracts from the incoming dialogue speech or video frames.
- the fifth intermediate output is the output feature vector of the layer preceding the classification layer of the fifth sub-model; its content is the emotional features of the current text extracted by the fifth sub-model in combination with the fourth intermediate output.
- in this case the preceding information includes both the preceding text information and the preceding speech information, so the obtained emotional speech feature information is more reliable.
- an emotional speech is synthesized through a pre-trained second model.
- the deep-learning-based emotional speech synthesis method generates emotional feature information through a pre-built first model based on the extracted text information to be processed and the preceding information of that text, and then synthesizes emotional speech through the pre-trained second model based on the emotional feature information and the text information to be processed.
- This method can realize deep-learning-based emotional speech synthesis from text information alone, without manually labeling the words and emotions of each acoustic pronunciation in advance; it can therefore further reduce labeling errors while reducing labor costs, improve the fit of the emotional information, enrich dialogue speech emotions, improve the naturalness and fluency of the synthesized speech, and improve the human-computer communication experience.
- the processing object can be text only, or a combination of text and speech; therefore, this method can synthesize emotional speech from any of text, speech or video and is applicable to a wide range of scenarios, selecting the appropriate first model as sketched below.
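- A minimal dispatch sketch of this selection follows; the extractor, the two first-model variants and the second model are assumed to be pre-trained callables, and all names are illustrative rather than the patent's actual interfaces.

```python
def synthesize(extractor, first_model_text, first_model_speech, second_model, source):
    """Pick the text-only first model (sub-models 1-3) or the text-plus-speech
    first model (sub-models 4-6) depending on what the preceding information of
    the source object contains, then hand the result to the second model."""
    text, preceding_text, preceding_speech = extractor(source)
    if preceding_speech is None:
        # Text-only preceding information: sub-models 1-3.
        emotion_features = first_model_text(text, preceding_text)
    else:
        # Preceding information also contains speech: sub-models 4-6.
        emotion_features = first_model_speech(text, preceding_text, preceding_speech)
    # Second model: synthesize emotional speech from the text plus emotional features.
    return second_model(text, emotion_features)
```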
- the method further includes a model pre-training step for pre-training the first model and the second model.
- the training process of the second model specifically includes the following sub-steps:
- Sa1. Extract the video image samples, text information samples and dialogue information samples of the video samples;
- Sa2. According to the preset emotion classification, annotate the video image samples to obtain the emotion annotation information samples;
- Sa3. Take the video image samples as input and the emotion annotation information samples as output, train the third model, and extract the third intermediate output of the third model as the emotional information of the video image samples; take the emotional information and text information samples as input and the dialogue information samples as output, and train the second model.
- the third model is constructed on the basis of ResNet-50 and carries the cross-entropy loss function
- the second model is constructed on the basis of Tacotron2 and carries the average variance loss function and the L2 distance loss function.
- during training, the third model and the second model are connected in series and trained together.
- the video image sample is sent to the input terminal of the third model (I3),
- the third intermediate output (O31) is sent to an input terminal of the second model (I51),
- the second model takes the text information samples as another input (I52),
- and the third model and the second model take the emotion annotation information samples (O32) and the dialogue information samples (O5) as their respective targets and are trained together,
- so as to obtain a second model that takes the intercepted third intermediate output (O31) as input and the dialogue information samples (O5) as output, with the intercepted third intermediate output (O31) serving as the emotion information.
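- A condensed training-step sketch of this joint setup is given below. It assumes PyTorch and torchvision; the "second model" is left abstract (any Tacotron2-style synthesizer accepting text tokens and an emotion embedding), NUM_EMOTIONS and EMO_DIM are assumed sizes, the "average variance" loss is interpreted here as mean squared error, and none of the names are the patent's own.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_EMOTIONS = 8   # assumed size of the preset emotion classification
EMO_DIM = 256      # assumed dimensionality of the intercepted O31 feature


class ThirdModel(nn.Module):
    """ResNet-50 backbone; the projected pooled feature plays the role of O31
    (emotion information), and the classifier head plays the role of O32,
    trained against the emotion annotation with cross-entropy."""

    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)     # older torchvision: pretrained=False
        backbone.fc = nn.Identity()           # keep the 2048-d pooled feature
        self.backbone = backbone
        self.project = nn.Linear(2048, EMO_DIM)
        self.classifier = nn.Linear(EMO_DIM, NUM_EMOTIONS)

    def forward(self, frames: torch.Tensor):
        o31 = self.project(self.backbone(frames))   # intercepted emotion information
        o32 = self.classifier(o31)                   # emotion annotation logits
        return o31, o32


def joint_training_step(third_model, second_model, optimizer,
                        frames, labels, text_tokens, target_mel):
    """One optimization step over both models; `optimizer` is assumed to cover
    the parameters of the third model and the second model together."""
    o31, o32 = third_model(frames)
    mel_pred = second_model(text_tokens, o31)               # Tacotron2-style synthesizer
    loss = (nn.functional.cross_entropy(o32, labels)        # third-model target (O32)
            + nn.functional.mse_loss(mel_pred, target_mel)  # 'average variance' term
            + torch.dist(mel_pred, target_mel, p=2))        # L2 distance term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```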
- multiple groups of the first model can be constructed for different applicable objects; for example, the models used for text content differ from those used for speech or video content. After receiving the object to be processed, the system can automatically determine its type and select the applicable first model.
- the training process of the first model specifically includes the following sub-steps:
- the first sub-model, the second sub-model, and the third sub-model are sequentially connected; after the current text information sample and the preceding information sample are extracted, the three sub-models are trained simultaneously.
- the first sub-model is constructed on the basis of Transformer-XL, with its Decoder part replaced by an LSTM+CNN structure that serves as the logical judgment output of the first sub-model; its output carries the cross-entropy loss function;
- the second sub-model is built on the basis of Transformer, with an LSTM+CNN structure replacing its Decoder part and serving as the output of the second sub-model's classifier; its output carries a cross-entropy loss function;
- the third sub-model is built on the basis of StarGAN, with Conv1D network layers replacing the Conv2D layers in its structure; its output carries the average variance loss function and the L2 distance loss function.
- the preceding text information sample and the current text information sample are used as the two inputs of the first model (I11, I12), and the current text information sample is used as one input of each sub-model (I11, I21, I42); a structural sketch follows.
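- The LSTM+CNN replacement head and its use in the first sub-model can be sketched as below. This is only an illustrative PyTorch sketch: a stock nn.TransformerEncoder stands in for the Transformer-XL encoder, layer sizes are assumed, and the mean-pooled penultimate vector plays the role of the first intermediate output.

```python
import torch
import torch.nn as nn


class LstmCnnHead(nn.Module):
    """LSTM+CNN structure used in place of the original decoder; its logits are
    the logical-judgment/classification output (trained with cross-entropy) and
    the pooled penultimate vector is the exposed intermediate output."""

    def __init__(self, d_model: int = 256, hidden: int = 256, n_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(d_model, hidden, batch_first=True)
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor):              # x: (batch, seq, d_model)
        h, _ = self.lstm(x)
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)
        penultimate = h.mean(dim=1)                   # "intermediate output" vector
        return self.fc(penultimate), penultimate


class FirstSubModel(nn.Module):
    """Encoder (standing in for Transformer-XL) followed by the LSTM+CNN head;
    the two-class logits answer 'has the emotion changed relative to the
    preceding text?'."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = LstmCnnHead(d_model, n_classes=2)

    def forward(self, current_emb: torch.Tensor, preceding_emb: torch.Tensor):
        x = torch.cat([preceding_emb, current_emb], dim=1)   # (batch, seq, d_model)
        return self.head(self.encoder(x))
```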
- the training process specifically includes the following sub-steps:
- the previous information sample includes the previous text information sample and the previous voice information sample
- the fourth sub-model is constructed on the basis of ResNet-50 and Transformer-XL: the Dense layer of ResNet-50 is discarded, ConvLSTM2D-structured network layers replace the Conv2D layers in ResNet-50, the ResNet-50 pooling-layer output is merged into the Encoder output of Transformer-XL, and the Decoder part of Transformer-XL is replaced with an LSTM+CNN structure that serves as the logical judgment output of the fourth sub-model and carries the cross-entropy loss function;
- the fifth sub-model is built on the basis of Transformer, with an LSTM+CNN structure replacing its Decoder part and serving as the output of the fifth sub-model's classifier; its output carries a cross-entropy loss function;
- the sixth sub-model is built on the basis of StarGAN, with Conv1D network layers replacing its Conv2D network layers; its output carries the average variance loss function and the L2 distance loss function. A fusion sketch of the fourth sub-model follows.
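- The fusion in the fourth sub-model can be sketched as follows, reusing the LstmCnnHead from the previous sketch. Note the simplifications and assumptions: a stock ResNet-50 stands in for the ConvLSTM2D-modified variant described above, a stock nn.TransformerEncoder stands in for the Transformer-XL encoder, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class FourthSubModel(nn.Module):
    """Visual/speech branch whose pooled feature is merged into the text-encoder
    output before the LSTM+CNN judgment head ('emotion changed or not')."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        vision = resnet50(weights=None)          # older torchvision: pretrained=False
        vision.fc = nn.Identity()                # drop the Dense layer, keep the pooled feature
        self.vision = vision
        self.vision_proj = nn.Linear(2048, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = LstmCnnHead(d_model, n_classes=2)   # from the previous sketch

    def forward(self, frames, current_emb, preceding_emb):
        v = self.vision_proj(self.vision(frames)).unsqueeze(1)          # (B, 1, d)
        t = self.text_encoder(torch.cat([preceding_emb, current_emb], dim=1))
        fused = torch.cat([v, t], dim=1)        # merge into the encoder output
        return self.head(fused)                 # logits + fourth intermediate output
```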
- the processes of the two training methods of the first model are otherwise the same;
- the specific difference is that in the second training method the fourth sub-model additionally takes the preceding speech information sample as input.
- when training the models, the present invention first obtains emotional information from video images according to the corresponding video image information, text information, and speech information in a piece of video, constructs an emotional speech generation module based on the video images, and then builds, based on the text information, an emotional speech generation module that targets that emotional information, thereby achieving the purpose of generating emotional speech from text information. The method is therefore suitable for video communication scenarios, voice communication scenarios, and even communication scenarios with only text information; it is widely adaptable and further improves the human-machine communication experience.
- because the speech synthesis model (second model) is constructed from the video image samples extracted from the video together with the corresponding text information samples and dialogue information samples, the obtained emotion is more appropriate and the emotion of the synthesized speech is more accurate and natural.
- this embodiment provides an emotional speech synthesis device 100 based on deep learning.
- FIG. 5 is a schematic structural diagram of the device 100 for emotional speech synthesis based on deep learning. As shown in FIG. 5, the device 100 at least includes:
- Extraction module 1: used to extract the text information to be processed and the preceding information of the text information to be processed, the preceding information including the preceding text information;
- Emotional feature information generation module 2: used to generate emotional feature information through the pre-built first model, taking the text information to be processed and the preceding information as input;
- Emotional speech synthesis module 3: used to synthesize emotional speech through the pre-trained second model, taking the emotional feature information and the text information to be processed as input.
- the first model includes a first sub-model, a second sub-model, and a third sub-model that are sequentially connected
- the emotional feature information generating module 2 includes at least:
- the first feature extraction unit 21 used to take the text information to be processed and the preceding information as input, and perform feature extraction through the pre-trained first sub-model to obtain the first intermediate output;
- the second feature extraction unit 22 used to take the first intermediate output and the to-be-processed text information as input, and perform feature extraction through the pre-trained second sub-model to obtain the emotion type and the second intermediate output;
- the third feature extraction unit 23 is used to take the second intermediate output, the text information to be processed, the emotion type, or the received user-specified emotion type as input, and perform feature extraction through a pre-trained third sub-model to obtain emotional feature information.
- when the preceding information also includes the preceding speech information, the first model includes the fourth, fifth, and sixth sub-models that are sequentially connected, and the emotional feature information acquisition module 2 further includes:
- the fourth feature extraction unit 21' used to take the text information to be processed and the preceding information as input, and perform feature extraction through a pre-trained fourth sub-model to obtain a fourth intermediate output;
- the fifth feature extraction unit 22' used to take the fourth intermediate output and the to-be-processed text information as input, and perform feature extraction through the pre-trained fifth sub-model to obtain the fifth intermediate output;
- the sixth feature extraction unit 23' used to take the fifth intermediate output, the text information to be processed, the emotion type, or the received user-specified emotion type as input, and perform feature extraction through a pre-trained sixth sub-model to obtain emotional feature information.
- the device further includes a model training module 4, which at least includes a second model training unit 41 for training a second model, and the second model training unit 41 includes at least:
- the first extraction subunit 411 used to extract video image samples, text information samples, and dialogue information samples of the video samples;
- Emotion labeling subunit 412: configured to label the video image samples according to a preset emotion classification to obtain emotion labeling information samples;
- the first training subunit 413: used to take the video image samples as input and the emotion annotation information samples as output, train a third model, and extract the third intermediate output of the third model as the emotional information of the video image samples; it is also used to train the second model with the emotional information and text information samples as input and the dialogue information samples as output.
- the model training module further includes a first model training unit 42 for training a first model, and the first model training unit 42 at least includes:
- Second extraction subunit 421: used to extract the current text information samples and the preceding information samples of the video samples, where the preceding information samples include the preceding text information samples;
- the second training subunit 422: used to take the current text information sample and the preceding information sample as input, and whether the emotion of the current text information sample has changed relative to the preceding information sample as the output, train to obtain the first sub-model, and extract the first intermediate output from the intermediate outputs of the first sub-model;
- the model training module 4 further includes a third model training unit 43 for training another first model, and the third model training unit 43 at least includes:
- the third extraction subunit 431: used to extract the current text information sample and the preceding information sample of the video sample, the preceding information sample including the preceding text information sample and the preceding speech information sample;
- the third training subunit 432: configured to take the current text information sample and the preceding information sample as input, and whether the emotion of the current text information sample has changed relative to the preceding information sample as the output, train to obtain the fourth sub-model, and extract the fourth intermediate output from the intermediate outputs of the fourth sub-model;
- the second model training unit 41 further includes:
- the preprocessing subunit 414 is configured to divide the video image sample into several video image subsamples according to a preset time interval, using the text in any time interval as the current text information sample and the text before that interval as the preceding text information sample.
- it should be noted that when the deep-learning-based emotional speech synthesis device provided above triggers an emotional speech synthesis service, the division into the above functional modules is used only as an example for illustration; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
- the deep-learning-based emotional speech synthesis device provided by the above embodiment and the method embodiment belong to the same concept, that is, the device is based on the method; for the specific implementation process, please refer to the method embodiment, which will not be repeated here.
- the program can be stored in a computer-readable storage medium.
- the storage medium can be read-only memory, magnetic disk or optical disk, etc.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- General Health & Medical Sciences (AREA)
- Child & Adolescent Psychology (AREA)
- Machine Translation (AREA)
- User Interface Of Digital Computer (AREA)
Claims (10)
- A deep learning-based emotional speech synthesis method, characterized in that the method at least comprises the following steps: extracting text information to be processed and preceding information of the text information to be processed, the preceding information comprising preceding text information; taking the text information to be processed and the preceding information as input, generating emotional feature information through a pre-built first model; taking the emotional feature information and the text information to be processed as input, synthesizing emotional speech through a pre-trained second model.
- The deep learning-based emotional speech synthesis method according to claim 1, characterized in that the first model comprises a first sub-model, a second sub-model and a third sub-model that are sequentially connected, and taking the text information to be processed and the preceding information as input and generating emotional feature information through the pre-built first model specifically comprises the following sub-steps: taking the text information to be processed and the preceding information as input, performing feature extraction through the pre-trained first sub-model to obtain a first intermediate output; taking the first intermediate output and the text information to be processed as input, performing feature extraction through the pre-trained second sub-model to obtain an emotion type and a second intermediate output; taking the second intermediate output, the text information to be processed, and the emotion type or a received user-specified emotion type as input, performing feature extraction through the pre-trained third sub-model to obtain emotional feature information.
- The deep learning-based emotional speech synthesis method according to claim 1, characterized in that, when the preceding information further comprises preceding speech information, the first model comprises a fourth sub-model, a fifth sub-model and a sixth sub-model that are sequentially connected, and taking the text information to be processed and the preceding information as input and generating emotional feature information through the pre-built first model specifically comprises the following sub-steps: taking the text information to be processed and the preceding information as input, performing feature extraction through the pre-trained fourth sub-model to obtain a fourth intermediate output; taking the fourth intermediate output and the text information to be processed as input, performing feature extraction through the pre-trained fifth sub-model to obtain a fifth intermediate output; taking the fifth intermediate output, the text information to be processed, and the emotion type or a received user-specified emotion type as input, performing feature extraction through the pre-trained sixth sub-model to obtain emotional feature information.
- The deep learning-based emotional speech synthesis method according to claim 2 or 3, characterized in that pre-training the second model specifically comprises the following sub-steps: extracting video image samples, text information samples and dialogue information samples of video samples; according to a preset emotion classification, annotating the video image samples to obtain emotion annotation information samples; taking the video image samples as input and the emotion annotation information samples as output, training a third model, and extracting a third intermediate output of the third model as the emotion information of the video image samples; taking the emotion information and the text information samples as input and the dialogue information samples as output, training the second model.
- The deep learning-based emotional speech synthesis method according to claim 4, characterized in that pre-training the first model specifically comprises the following sub-steps: extracting a current text information sample and a preceding information sample of the video samples, the preceding information sample comprising a preceding text information sample; taking the current text information sample and the preceding information sample as input, and taking whether the emotion of the current text information sample has changed relative to the preceding information sample as output, training to obtain the first sub-model, and extracting a first intermediate output from the intermediate outputs of the first sub-model; taking the first intermediate output and the current text information sample as input, and taking the emotion type as output, training to obtain the second sub-model, and extracting a second intermediate output from the intermediate outputs of the second sub-model; taking the second intermediate output, the current text information sample, and the emotion type or a received user-specified emotion type as input, and taking the emotion information obtained by the third model as output, training to obtain the third sub-model.
- The deep learning-based emotional speech synthesis method according to claim 4, characterized in that pre-training the first model specifically comprises the following sub-steps: extracting a current text information sample and a preceding information sample of the video samples, the preceding information sample comprising a preceding text information sample and a preceding speech information sample; taking the current text information sample and the preceding information sample as input, and taking whether the emotion of the current text information sample has changed relative to the preceding information sample as output, training to obtain the fourth sub-model, and extracting a fourth intermediate output from the intermediate outputs of the fourth sub-model; taking the fourth intermediate output and the current text information sample as input, and taking the emotion type as output, training to obtain the fifth sub-model, and extracting a fifth intermediate output and the emotion type from the intermediate outputs of the fifth sub-model; taking the fifth intermediate output, the current text information sample, and the emotion type or a received user-specified emotion type as input, and taking the emotion information obtained by the third model as output, training to obtain the sixth sub-model.
- The deep learning-based emotional speech synthesis method according to claim 5 or 6, characterized in that pre-training the second model further comprises video sample preprocessing, which at least comprises: dividing the video image samples into several video image sub-samples according to a preset time interval, taking the text within any time interval as the current text information sample, and taking the text before said any time interval as the preceding text information sample.
- A deep learning-based emotional speech synthesis device based on the method according to any one of claims 1 to 7, characterized in that the device at least comprises: an extraction module: used to extract text information to be processed and preceding information of the text information to be processed, the preceding information comprising preceding text information; an emotional feature information generation module: used to take the text information to be processed and the preceding information as input and generate emotional feature information through a pre-built first model; an emotional speech synthesis module: used to take the emotional feature information and the text information to be processed as input and synthesize emotional speech through a pre-trained second model.
- The deep learning-based emotional speech synthesis device according to claim 8, characterized in that the first model comprises a first sub-model, a second sub-model and a third sub-model that are sequentially connected, and the emotional feature information generation module at least comprises: a first feature extraction unit: used to take the text information to be processed and the preceding information as input and perform feature extraction through the pre-trained first sub-model to obtain a first intermediate output; a second feature extraction unit: used to take the first intermediate output and the text information to be processed as input and perform feature extraction through the pre-trained second sub-model to obtain an emotion type and a second intermediate output; a third feature extraction unit: used to take the second intermediate output, the text information to be processed, and the emotion type or a received user-specified emotion type as input and perform feature extraction through the pre-trained third sub-model to obtain emotional feature information.
- The deep learning-based emotional speech synthesis device according to claim 8, characterized in that, when the preceding information further comprises preceding speech information, the first model comprises a fourth sub-model, a fifth sub-model and a sixth sub-model that are sequentially connected, and the emotional feature information acquisition module at least comprises: a fourth feature extraction unit: used to take the text information to be processed and the preceding information as input and perform feature extraction through the pre-trained fourth sub-model to obtain a fourth intermediate output; a fifth feature extraction unit: used to take the fourth intermediate output and the text information to be processed as input and perform feature extraction through the pre-trained fifth sub-model to obtain a fifth intermediate output; a sixth feature extraction unit: used to take the fifth intermediate output, the text information to be processed, and the emotion type or a received user-specified emotion type as input and perform feature extraction through the pre-trained sixth sub-model to obtain emotional feature information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA3154029A CA3154029A1 (en) | 2019-09-10 | 2020-06-19 | Deep learning-based emotional speech synthesis method and device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910850474.8 | 2019-09-10 | ||
CN201910850474.8A CN110675853B (zh) | 2019-09-10 | 2019-09-10 | Deep learning-based emotional speech synthesis method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021047233A1 true WO2021047233A1 (zh) | 2021-03-18 |
Family
ID=69077740
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/096998 WO2021047233A1 (zh) | 2019-09-10 | 2020-06-19 | 一种基于深度学习的情感语音合成方法及装置 |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN110675853B (zh) |
CA (1) | CA3154029A1 (zh) |
WO (1) | WO2021047233A1 (zh) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113421576A (zh) * | 2021-06-29 | 2021-09-21 | 平安科技(深圳)有限公司 | Voice conversion method, apparatus, device and storage medium |
CN113421576B (zh) * | 2021-06-29 | 2024-05-24 | 平安科技(深圳)有限公司 | Voice conversion method, apparatus, device and storage medium |
CN114005446A (zh) * | 2021-11-01 | 2022-02-01 | 科大讯飞股份有限公司 | Sentiment analysis method, related device and readable storage medium |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110675853B (zh) * | 2019-09-10 | 2022-07-05 | 苏宁云计算有限公司 | Deep learning-based emotional speech synthesis method and device |
CN113223493A (zh) * | 2020-01-20 | 2021-08-06 | Tcl集团股份有限公司 | Voice care method, device, system and storage medium |
CN111816212B (zh) * | 2020-06-19 | 2022-10-11 | 杭州电子科技大学 | Speech emotion recognition and evaluation method based on feature-set fusion |
CN112489620B (zh) * | 2020-11-20 | 2022-09-09 | 北京有竹居网络技术有限公司 | Speech synthesis method and apparatus, readable medium and electronic device |
CN113192483B (zh) * | 2021-03-22 | 2024-02-27 | 联想(北京)有限公司 | Method, apparatus, storage medium and device for converting text to speech |
CN114783406B (zh) * | 2022-06-16 | 2022-10-21 | 深圳比特微电子科技有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN116825088B (zh) * | 2023-08-25 | 2023-11-07 | 深圳市国硕宏电子有限公司 | Deep learning-based conference speech detection method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106599998A (zh) * | 2016-12-01 | 2017-04-26 | 竹间智能科技(上海)有限公司 | Method and system for adjusting robot responses based on emotional features |
CN108172209A (zh) * | 2018-01-09 | 2018-06-15 | 上海大学 | Method for constructing a voice idol |
US20180286383A1 (en) * | 2017-03-31 | 2018-10-04 | Wipro Limited | System and method for rendering textual messages using customized natural voice |
CN109003624A (zh) * | 2018-06-29 | 2018-12-14 | 北京百度网讯科技有限公司 | Emotion recognition method and apparatus, computer device and storage medium |
CN110211563A (zh) * | 2019-06-19 | 2019-09-06 | 平安科技(深圳)有限公司 | Scene- and emotion-oriented Chinese speech synthesis method, apparatus and storage medium |
CN110675853A (zh) * | 2019-09-10 | 2020-01-10 | 苏宁云计算有限公司 | Deep learning-based emotional speech synthesis method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105355193B (zh) * | 2015-10-30 | 2020-09-25 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and device |
CN109523989B (zh) * | 2019-01-29 | 2022-01-11 | 网易有道信息技术(北京)有限公司 | Speech synthesis method, speech synthesis apparatus, storage medium and electronic device |
Application events
- 2019-09-10: CN application CN201910850474.8A filed (CN110675853B, status: Active)
- 2020-06-19: PCT application PCT/CN2020/096998 filed (WO2021047233A1, status: Application Filing)
- 2020-06-19: CA application CA3154029A filed (CA3154029A1, status: Pending)
Also Published As
Publication number | Publication date |
---|---|
CA3154029A1 (en) | 2021-03-18 |
CN110675853A (zh) | 2020-01-10 |
CN110675853B (zh) | 2022-07-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20863605; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 3154029; Country of ref document: CA |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20863605; Country of ref document: EP; Kind code of ref document: A1 |