WO2021212954A1 - Method and device for emotional speech synthesis of a specific speaker under extremely low resources - Google Patents

Method and device for emotional speech synthesis of a specific speaker under extremely low resources

Info

Publication number
WO2021212954A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
specific
phoneme sequence
text
model
Prior art date
Application number
PCT/CN2021/074826
Other languages
English (en)
French (fr)
Inventor
袁熹
Original Assignee
升智信息科技(南京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 升智信息科技(南京)有限公司
Publication of WO2021212954A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • The present invention relates to the technical field of speech signal processing, and in particular to a method, device, computer device, and storage medium for emotional speech synthesis of a specific speaker under extremely low resources.
  • Speech synthesis technology gives a computer (or various terminal devices) the ability to speak like a person; it is a typical interdisciplinary field.
  • TTS technology, also known as text-to-speech technology, belongs to speech synthesis: it converts text generated by the computer itself or entered from the outside into intelligible, fluent speech output.
  • Emotional speech synthesis is a research field that has emerged only in the past ten-odd years. Compared with traditional speech synthesis, emotional speech synthesis takes the speaker's emotional state and speaking style into account, making the synthesized speech more intelligent and human-like and giving it broader application value.
  • In intelligent voice assistants such as Microsoft Xiaoice, Xiaona, and Siri, the corresponding emotional speech is synthesized in different conversation situations, so that the assistant can behave like a real housekeeper and improve the user experience.
  • In automatic voice service systems, speech with different emotions is synthesized according to the user's conversation state, letting users enjoy better service quality.
  • In online education, the voice state can change according to the student's performance: when the student is distracted, a strict tone is used to correct them, and when the student scores well on a test, praise in an appreciative tone is used, which can improve the quality of education, and so on.
  • Emotional speech synthesis methods can include: emotional speech synthesis based on waveform splicing, emotional speech synthesis based on voice conversion, statistical-parameter emotional speech synthesis, and neural-network emotional speech synthesis.
  • Emotional speech synthesis based on waveform splicing rests on building a huge emotional speech database.
  • The database uses phonemes as pronunciation units and establishes a mapping from phonemes to speech fragments.
  • At synthesis time, a logistic regression model selects the speech fragments corresponding to the text phonemes, which are then spliced and smoothed.
  • The advantages of this method are simple implementation and fast synthesis; its disadvantages are that building the database is extremely time-consuming, quality is hard to guarantee, the database may be unattainable in some application scenarios, and spliced-and-smoothed speech can hardly achieve a truly human-like effect.
  • Emotional speech synthesis based on voice conversion describes the change of speech emotion as continuous jumps between neutral emotion and other emotions, and statistically derives the acoustic characteristics of the jumps from neutral speech to the other emotions.
  • At synthesis time, the acoustic features are adjusted by transformation rules along the dimensions from neutral emotion to the other emotions.
  • The rules of this method are not universal; different speakers require different rules.
  • Statistical-parameter emotional speech synthesis uses HMMs (Hidden Markov Models) to build acoustic models for multiple emotions.
  • It requires less modeling data than the waveform-splicing method and is a simple, convenient modeling approach for emotional speech synthesis, but the modeling capability of HMMs is limited and the synthesized sound quality is not high.
  • Thanks to the excellent modeling capability of deep neural networks, more and more acoustic modeling methods are based on deep learning. Such synthesis models are generally divided into two parts: an acoustic model, which expresses the mapping from text to acoustic features, and a vocoder, which is responsible for inverting the acoustic features back into waveforms.
  • The advantage of the neural-network synthesis method is that processing is simple and the synthesized result sounds very natural.
  • The disadvantage is that a large data set is required to train the model, and annotating emotion data in particular is too costly. It can be seen that traditional emotional speech synthesis schemes often have limitations and high costs.
  • In view of the above problems, the present invention proposes a method, device, computer device, and storage medium for emotional speech synthesis of a specific speaker under extremely low resources.
  • A method for emotional speech synthesis of a specific speaker under extremely low resources includes the following steps:
  • S10 Obtain training text and audio corresponding to the training text, convert the training text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data;
  • S20 Input the initial training data into a deep learning model for training to obtain a basic model;
  • S30 Obtain specific text and specific audio that characterize the emotion data of a specific speaker, convert the specific text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data;
  • S40 Input the specific training data into the basic model for training to obtain a speech synthesis model;
  • S50 Convert the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, fill the emotion slot of the phoneme sequence to be synthesized to obtain synthesis input data, and input the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion.
  • Inputting the initial training data into a deep learning model for training to obtain the basic model includes:
  • The initial training data is fed into the deep learning model; it passes through an encoder, an attention mechanism, and a decoder, the output is a mel spectrogram, and a back-propagation algorithm is applied to the mel spectrogram to train the basic model.
  • The deep learning model is an end-to-end model.
  • An emotional speech synthesis device for a specific speaker under extremely low resources includes:
  • a first acquisition module, configured to acquire training text and audio corresponding to the training text, convert the training text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data;
  • a first training module, configured to input the initial training data into a deep learning model for training to obtain a basic model;
  • a second acquisition module, configured to acquire specific text and specific audio that characterize the emotion data of a specific speaker, convert the specific text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data;
  • a second training module, configured to input the specific training data into the basic model for training to obtain a speech synthesis model;
  • a conversion module, configured to convert the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, fill the emotion slot of the phoneme sequence to be synthesized to obtain synthesis input data, and input the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion.
  • The first training module is further configured to:
  • feed the initial training data into the deep learning model, where it passes through an encoder, an attention mechanism, and a decoder; the output is a mel spectrogram, and a back-propagation algorithm is applied to the mel spectrogram to train the basic model.
  • The deep learning model is an end-to-end model.
  • A computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor.
  • When the processor executes the computer program, the steps of the method for emotional speech synthesis of a specific speaker under extremely low resources in any of the above embodiments are implemented.
  • A computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, the steps of the method for emotional speech synthesis of a specific speaker under extremely low resources in any of the above embodiments are implemented.
  • With the above method, device, computer device, and storage medium for emotional speech synthesis of a specific speaker under extremely low resources, training text and the audio corresponding to the training text are obtained, the training text is converted into a corresponding phoneme sequence, and a slot with an emotion vector is embedded in the obtained phoneme sequence to obtain initial training data. The initial training data is input into a deep learning model for training to obtain a basic model. Specific text and specific audio that characterize the emotion data of a specific speaker are obtained, the specific text is converted into a corresponding phoneme sequence, and a slot with an emotion vector is embedded in the obtained phoneme sequence to obtain specific training data. The specific training data is input into the basic model for training to obtain a speech synthesis model. The text to be synthesized is converted into a phoneme sequence to obtain the phoneme sequence to be synthesized, its emotion slot is filled to obtain synthesis input data, and the synthesis input data is input into the speech synthesis model to obtain speech audio with the specific emotion, thereby reducing the cost of obtaining speech audio with a specific emotion and improving the flexibility of the corresponding emotional speech synthesis scheme.
  • Further, the process does not require a large amount of emotion-annotated data, which greatly reduces data dependence; the model is obtained by transfer learning from a pre-trained model, which preserves the naturalness of the synthesized speech to the greatest extent while embedding emotion, ensuring the accuracy of the resulting speech audio with a specific emotion.
  • Fig. 1 is a flowchart of a method for emotional speech synthesis of a specific speaker under extremely low resources according to an embodiment;
  • Fig. 2 is a schematic diagram of a basic model according to an embodiment;
  • Fig. 3 is a schematic diagram of a speech synthesis model according to an embodiment;
  • Fig. 4 is a schematic structural diagram of a device for emotional speech synthesis of a specific speaker under extremely low resources according to an embodiment;
  • Fig. 5 is a schematic diagram of a computer device according to an embodiment.
  • The method for emotional speech synthesis of a specific speaker under extremely low resources can be applied to an emotional speech synthesis system for a specific speaker.
  • The aforementioned system can obtain training text and audio corresponding to the training text, convert the training text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data; input the initial training data into a deep learning model for training to obtain a basic model; obtain specific text and specific audio that characterize the emotion data of a specific speaker, convert the specific text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data; input the specific training data into the basic model for training to obtain a speech synthesis model; convert the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, fill its emotion slot to obtain synthesis input data, and input the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion, thereby reducing the cost of obtaining such audio and improving the flexibility of the emotional speech synthesis scheme. The emotional speech synthesis system for a specific speaker may be, but is not limited to, various intelligent processing devices such as personal computers and notebook computers.
  • In one embodiment, as shown in Fig. 1, a method for emotional speech synthesis of a specific speaker under extremely low resources is provided.
  • The method is described by taking its application to an emotional speech synthesis system for a specific speaker as an example, and includes the following steps:
  • S10 Obtain training text and audio corresponding to the training text, convert the training text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data.
  • The above step performs text processing on the training text to obtain the initial training data.
  • Text processing here refers to phoneme processing of text with emotion slots.
  • The initial training data of the speech synthesis model includes the text and the corresponding audio.
  • When the text data is processed, it is first converted into the corresponding phoneme sequence, and a slot with an emotion vector is embedded in the processed sequence so that the emotion information of the sentence is carried into the phoneme sequence; the result is a phoneme sequence with an emotion slot.
  • This sequence is used as the input when training the synthesis model.
  • The slot with an emotion vector can reserve positions for a variety of emotions such as happiness, surprise, fear, sadness, anger, disgust, and neutral, as sketched below.
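  • As an illustrative sketch only (not part of the original disclosure), the emotion-slot phoneme processing of step S10 might be prepared along the following lines in Python; the emotion inventory, the slot token, and the stand-in grapheme-to-phoneme function are assumptions made for this example rather than the patent's actual front end.

```python
# Illustrative sketch of step S10: build a phoneme sequence that carries an emotion slot.
EMOTIONS = ["neutral", "happy", "surprised", "afraid", "sad", "angry", "disgusted"]
SLOT_TOKEN = "<EMO>"  # reserved slot position, assumed here to be prepended to the phoneme sequence

def text_to_phonemes(text: str) -> list[str]:
    """Stand-in grapheme-to-phoneme front end: treats each non-space character as a
    pseudo-phoneme for demonstration only; a real system would use a proper G2P converter."""
    return [ch for ch in text if not ch.isspace()]

def emotion_vector(emotion: str) -> list[float]:
    """One-hot emotion vector that will later be filled into the reserved slot."""
    vec = [0.0] * len(EMOTIONS)
    vec[EMOTIONS.index(emotion)] = 1.0
    return vec

def build_training_example(text: str, emotion: str) -> dict:
    """One training example: phoneme sequence with an emotion slot plus its emotion vector."""
    return {
        "phonemes": [SLOT_TOKEN] + text_to_phonemes(text),
        "emotion_vector": emotion_vector(emotion),
    }
```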
  • S20 Input the initial training data into a deep learning model for training to obtain a basic model.
  • Specifically, the deep learning model is an end-to-end model.
  • In one embodiment, inputting the initial training data into a deep learning model for training to obtain the basic model includes:
  • The initial training data is fed into the deep learning model; it passes through an encoder, an attention mechanism, and a decoder, the output is a mel spectrogram, and a back-propagation algorithm is applied to the mel spectrogram to train the basic model.
  • The model training in this embodiment uses an end-to-end model, and the training input is the emotion-slot-embedded phoneme sequence obtained in step S10.
  • The output, after the encoder, the attention mechanism, and the decoder, is a mel spectrogram; through the back-propagation algorithm, the basic model is finally obtained by training.
  • The above basic model can be seen in FIG. 2.
  • The text 101 is processed by the emotion-slot phoneme processing module 102 to form a phoneme sequence with slots, which is encoded by the encoder 103 into high-dimensional features of phonemes with emotion slots and decoded by the decoder 105.
  • The mel features 106 are obtained, the loss 107 is computed, gradient back-propagation 108 is applied, and the parameters of 103, 104, and 105 are trained, as sketched below.
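  • The end-to-end structure described above (encoder, attention mechanism, and decoder producing a mel spectrogram, trained by back-propagating the loss) could be sketched roughly as follows in PyTorch. This is a deliberately simplified, hypothetical skeleton: the layer sizes, the dot-product attention, the frame-by-frame decoder, and the dummy training step are illustrative assumptions, not the architecture actually used in the patent.

```python
import torch
import torch.nn as nn

class EmotionTTS(nn.Module):
    """Toy encoder-attention-decoder that maps a phoneme sequence with an emotion slot
    to a mel spectrogram; all sizes are illustrative."""
    def __init__(self, n_phonemes, n_emotions=7, emb=256, hidden=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb)
        self.emotion_proj = nn.Linear(n_emotions, emb)   # emotion vector filled into the slot
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder_cell = nn.GRUCell(n_mels + hidden, hidden)
        self.mel_out = nn.Linear(hidden, n_mels)

    def forward(self, phonemes, emotion_vec, n_frames):
        x = self.phoneme_emb(phonemes)                          # (B, T, emb)
        slot = self.emotion_proj(emotion_vec).unsqueeze(1)      # (B, 1, emb): the emotion slot
        enc_out, _ = self.encoder(torch.cat([slot, x], dim=1))  # high-dimensional features
        h = enc_out.new_zeros(enc_out.size(0), enc_out.size(2))
        prev = enc_out.new_zeros(enc_out.size(0), self.mel_out.out_features)
        frames = []
        for _ in range(n_frames):
            # simple dot-product attention over the encoder outputs
            scores = torch.bmm(enc_out, h.unsqueeze(2)).squeeze(2)
            context = torch.bmm(scores.softmax(dim=1).unsqueeze(1), enc_out).squeeze(1)
            h = self.decoder_cell(torch.cat([prev, context], dim=1), h)
            prev = self.mel_out(h)
            frames.append(prev)
        return torch.stack(frames, dim=1)                        # (B, n_frames, n_mels)

# One illustrative training step with dummy data: back-propagate the mel-spectrogram loss.
model = EmotionTTS(n_phonemes=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
phonemes = torch.randint(0, 100, (2, 20))   # dummy batch of phoneme IDs
emotion = torch.eye(7)[[1, 4]]              # one-hot emotion vectors for the slot
target_mel = torch.randn(2, 50, 80)         # dummy ground-truth mel spectrogram
loss = nn.functional.mse_loss(model(phonemes, emotion, n_frames=50), target_mel)
loss.backward()
optimizer.step()
```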
  • S30 Obtain specific text and specific audio representing the emotion data of a specific speaker, convert the specific text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data.
  • The above specific speaker may also be referred to as the target speaker; through the above method for emotional speech synthesis of a specific speaker, speech audio matching the emotion of this kind of speaker needs to be obtained.
  • In this step, the text of the specific speaker's emotion data is processed in the same way as in step S10, but the text data comes from the specific speaker; the purpose is to use it for the subsequent transfer-learning training.
  • S40 Input the specific training data into the basic model for training to obtain a speech synthesis model.
  • The above step performs transfer learning based on emotion-slot filling for the specific speaker.
  • The training data comes from the data in step S30.
  • Only about 200 sentences (roughly 15 minutes) of speech data are needed for each emotion.
  • The pre-trained model is the basic model obtained by training in step S20. On this basic model, its encoding module is fixed (see the component 203 shown in FIG. 3) and its decoding module continues to be trained, finally yielding a speech synthesis model for the specific speaker.
  • The speech synthesis model can be seen in FIG. 3.
  • The text 201 is processed by the emotion-slot phoneme processing module 202 to form a phoneme sequence with slots, which is encoded by the encoder 203 (the encoder 103 of the basic model trained in FIG. 2, with fixed parameters) into high-dimensional features of phonemes with emotion slots and decoded by the decoder 205 to obtain the mel features 206.
  • During transfer-learning training, the loss 207 is computed, gradient back-propagation 208 is applied, and the parameters of 204 and 205 are trained, as sketched below;
  • at inference time, 206 passes directly through the vocoder 210 to generate emotional speech.
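  • The transfer-learning stage of step S40 (fixing the encoding module of the pre-trained basic model and continuing to train the decoding side on the small speaker-specific emotion set) might be sketched as follows, reusing the hypothetical EmotionTTS skeleton from the previous sketch; the checkpoint name, the dummy data loader, and the choice of which submodules count as the "encoding module" are assumptions for illustration only.

```python
# Illustrative sketch of step S40: transfer learning with a frozen encoder.
import torch
import torch.nn as nn

model = EmotionTTS(n_phonemes=100)            # same skeleton as the basic-model sketch above
# model.load_state_dict(torch.load("basic_model.pt"))  # hypothetical checkpoint from step S20

# Fix the encoding module (component 203 in Fig. 3): its parameters are not updated.
for module in (model.phoneme_emb, model.encoder):     # assumed to form the "encoding module"
    for p in module.parameters():
        p.requires_grad = False

# Continue training only the remaining (decoding-side) parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

# Dummy stand-in for the specific speaker's data (about 200 sentences per emotion in the patent).
specific_loader = [(torch.randint(0, 100, (2, 20)), torch.eye(7)[[2, 5]], torch.randn(2, 50, 80))]

for phonemes, emotion, target_mel in specific_loader:
    pred = model(phonemes, emotion, n_frames=target_mel.size(1))
    loss = nn.functional.mse_loss(pred, target_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```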
  • S50 Convert the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, fill the emotion slot of the phoneme sequence to be synthesized to obtain synthesis input data, and input the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion.
  • After the text to be synthesized is processed into a phoneme sequence, the corresponding emotion slot value is filled into the sequence, and through inference with the model from step S40, speech audio with the specified emotion is obtained.
  • To ensure that the emotion is correct and the interaction experience is not affected, the emotion slot here can be filled manually, as sketched below.
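  • A corresponding sketch of step S50, reusing build_training_example and the fine-tuned model from the previous sketches, might look like the following; the toy phoneme-ID mapping and the vocoder placeholder are assumptions, and any mel-to-waveform vocoder could stand in for component 210.

```python
# Illustrative sketch of step S50: fill the emotion slot manually and synthesize.
import torch

def synthesize(model, text: str, emotion: str, n_frames: int = 200):
    example = build_training_example(text, emotion)   # phoneme sequence plus manually chosen emotion
    # Toy phoneme-ID mapping for the example; a real system keeps the fixed phoneme
    # vocabulary that was used during training.
    vocab = {p: i for i, p in enumerate(dict.fromkeys(example["phonemes"]))}
    phoneme_ids = torch.tensor([[vocab[p] for p in example["phonemes"]]])
    emotion_vec = torch.tensor([example["emotion_vector"]])
    with torch.no_grad():
        mel = model(phoneme_ids, emotion_vec, n_frames)   # (1, n_frames, 80) mel spectrogram
    return mel

mel = synthesize(model, "how are you today", emotion="happy")
# waveform = vocoder(mel)  # placeholder for component 210 in Fig. 3 (e.g. Griffin-Lim or a neural vocoder)
```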
  • With the above method for emotional speech synthesis of a specific speaker under extremely low resources, training text and the audio corresponding to the training text are obtained, the training text is converted into a corresponding phoneme sequence, and a slot with an emotion vector is embedded in the obtained phoneme sequence to obtain initial training data. The initial training data is input into a deep learning model for training to obtain a basic model. Specific text and specific audio that characterize the emotion data of a specific speaker are obtained, the specific text is converted into a corresponding phoneme sequence, and a slot with an emotion vector is embedded in the obtained phoneme sequence to obtain specific training data. The specific training data is input into the basic model for training to obtain a speech synthesis model. The text to be synthesized is converted into a phoneme sequence to obtain the phoneme sequence to be synthesized, the emotion slot is filled to obtain synthesis input data, and the synthesis input data is input into the speech synthesis model to obtain speech audio with the specific emotion, thereby reducing the cost of obtaining speech audio with a specific emotion and improving the flexibility of the corresponding emotional speech synthesis scheme.
  • Further, in the process of emotional speech synthesis for a specific speaker, there is no need to obtain a large amount of emotion-annotated data, which greatly reduces data dependence.
  • The model is obtained by transfer learning from a pre-trained model, which preserves the naturalness of the synthesized speech to the greatest extent while embedding emotion, ensuring the accuracy of the resulting speech audio with a specific emotion.
  • FIG. 4 is a schematic structural diagram of an emotional speech synthesis device for a specific speaker under extremely low resources according to an embodiment, including:
  • the first obtaining module 10 is configured to obtain training text and audio corresponding to the training text, convert the training text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data;
  • the first training module 20 is configured to input the initial training data into the deep learning model for training to obtain a basic model
  • the second obtaining module 30 is used to obtain specific text and specific audio representing the emotional data of a specific speaker, convert the specific text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data;
  • the second training module 40 is configured to input the specific training data into the basic model for training to obtain a speech synthesis model;
  • the conversion module 50 is configured to convert the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, fill the emotion slot of the phoneme sequence to be synthesized to obtain synthesis input data, and input the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion.
  • the first training module is further configured to:
  • feed the initial training data into the deep learning model, where it passes through an encoder, an attention mechanism, and a decoder; the output is a mel spectrogram, and a back-propagation algorithm is applied to the mel spectrogram to train the basic model.
  • the deep learning model is an end-to-end model.
  • the modules of the above device for emotional speech synthesis of a specific speaker under extremely low resources can be implemented in whole or in part by software, hardware, or a combination thereof.
  • the above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 5.
  • the computer equipment includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the display screen of the computer device can be a liquid crystal display or an electronic ink display;
  • the input device of the computer device can be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
  • FIG. 5 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • the specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device is further provided.
  • the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the method for emotional speech synthesis of a specific speaker under extremely low resources in any of the above embodiments is implemented.
  • all or part of the processes in the methods of the above embodiments can be completed by instructing relevant hardware through a computer program, and the program can be stored in a non-volatile computer-readable storage medium.
  • the program can be stored in the storage medium of the computer system and executed by at least one processor in the computer system to implement the above method for emotional speech synthesis of a specific speaker under extremely low resources.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • a computer-readable storage medium has a computer program stored thereon, wherein when the program is executed by a processor, the method for emotional speech synthesis of a specific speaker under extremely low resources in any of the above embodiments is implemented.
  • the terms "first/second/third" involved in the embodiments of this application merely distinguish similar objects and do not represent a specific ordering of the objects. It can be understood that "first/second/third" may be interchanged in a specific order or sequence where permitted. It should be understood that the objects distinguished by "first/second/third" may be interchanged where appropriate, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.

Abstract

A method, device, computer device, and storage medium for emotional speech synthesis of a specific speaker under extremely low resources. The method includes: obtaining training text and audio corresponding to the training text, converting the training text into a phoneme sequence, and embedding a slot with an emotion vector to obtain initial training data (S10); inputting the initial training data into a deep learning model for training to obtain a basic model (S20); obtaining specific text and specific audio, converting the specific text into a corresponding phoneme sequence, and embedding a slot with an emotion vector in the phoneme sequence to obtain specific training data (S30); inputting the specific training data into the basic model for training to obtain a speech synthesis model (S40); converting the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, filling its emotion slot to obtain synthesis input data, and inputting the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion (S50). The method can reduce the cost of obtaining speech audio with a specific emotion and improve the flexibility of the emotional speech synthesis scheme.

Description

Method and device for emotional speech synthesis of a specific speaker under extremely low resources
This application claims priority to the Chinese patent application No. 202010317018.X, filed with the Chinese Patent Office on April 21, 2020 and entitled "Method and device for emotional speech synthesis of a specific speaker under extremely low resources", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of speech signal processing, and in particular to a method, device, computer device, and storage medium for emotional speech synthesis of a specific speaker under extremely low resources.
Background
Speech synthesis technology gives a computer (or various terminal devices) the ability to speak like a person; it is a typical interdisciplinary field. TTS (text-to-speech) technology belongs to speech synthesis: it converts text generated by the computer itself or entered from the outside into intelligible, fluent speech output. Emotional speech synthesis is a research field that has emerged only in the past ten-odd years. Compared with traditional speech synthesis, emotional speech synthesis takes the speaker's emotional state and speaking style into account, making the synthesized speech more intelligent and human-like and giving it broader application value. In intelligent voice assistants such as Microsoft Xiaoice, Xiaona, and Siri, synthesizing the corresponding emotional speech in different conversation situations lets the assistant behave like a real housekeeper and improves the user experience; in automatic voice service systems, synthesizing speech with different emotions according to the user's conversation state lets users enjoy better service quality; in online education, the voice state can change according to the student's performance: when the student is distracted, a strict tone is used to correct them, and when the student scores well on a test, praise in an appreciative tone is used, which can improve the quality of education, and so on.
Emotional speech synthesis methods can include: emotional speech synthesis based on waveform splicing, emotional speech synthesis based on voice conversion, statistical-parameter emotional speech synthesis, and neural-network emotional speech synthesis. Emotional speech synthesis based on waveform splicing rests on building a huge emotional speech database; the database uses phonemes as pronunciation units and establishes a mapping from phonemes to speech fragments. At synthesis time, a logistic regression model selects the speech fragments corresponding to the text phonemes, which are then spliced and smoothed. The advantages of this method are simple implementation and fast synthesis; its disadvantages are that building the database is extremely time-consuming, quality is hard to guarantee, the database may be unattainable in some application scenarios, and spliced-and-smoothed speech can hardly achieve a truly human-like effect. Emotional speech synthesis based on voice conversion describes the change of speech emotion as continuous jumps between neutral emotion and other emotions and statistically derives the acoustic characteristics of the jumps from neutral speech to the other emotions; at synthesis time, the acoustic features are adjusted by transformation rules along the dimensions from neutral emotion to the other emotions. The rules of this method are not universal: different speakers require different rules. Statistical-parameter emotional speech synthesis uses HMMs (Hidden Markov Models) to build acoustic models for multiple emotions; it requires less modeling data than the waveform method and is a simple, convenient modeling approach for emotional speech synthesis, but the modeling capability of HMMs is limited and the synthesized sound quality is not high. Thanks to the excellent modeling capability of deep neural networks, more and more acoustic modeling methods are based on deep learning; such synthesis models are generally divided into two parts: an acoustic model, which expresses the mapping from text to acoustic features, and a vocoder, which is responsible for inverting the acoustic features back into waveforms. The advantage of the neural-network synthesis method is that processing is simple and the synthesized result sounds very natural; the disadvantage is that a large data set is required to train the model, and annotating emotion data in particular is too costly. It can be seen that traditional emotional speech synthesis schemes often have limitations and high costs.
Summary of the Invention
In view of the above problems, the present invention proposes a method, device, computer device, and storage medium for emotional speech synthesis of a specific speaker under extremely low resources.
To achieve the purpose of the present invention, a method for emotional speech synthesis of a specific speaker under extremely low resources is provided, including the following steps:
S10: obtaining training text and audio corresponding to the training text, converting the training text into a corresponding phoneme sequence, and embedding a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data;
S20: inputting the initial training data into a deep learning model for training to obtain a basic model;
S30: obtaining specific text and specific audio that characterize the emotion data of a specific speaker, converting the specific text into a corresponding phoneme sequence, and embedding a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data;
S40: inputting the specific training data into the basic model for training to obtain a speech synthesis model;
S50: converting the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, filling the emotion slot of the phoneme sequence to be synthesized to obtain synthesis input data, and inputting the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion.
In one embodiment, inputting the initial training data into a deep learning model for training to obtain the basic model includes:
inputting the initial training data into the deep learning model, where it passes through an encoder, an attention mechanism, and a decoder, the output being a mel spectrogram; a back-propagation algorithm is applied to the mel spectrogram, and the basic model is obtained by training.
In one embodiment, the deep learning model is an end-to-end model.
A device for emotional speech synthesis of a specific speaker under extremely low resources includes:
a first acquisition module, configured to acquire training text and audio corresponding to the training text, convert the training text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data;
a first training module, configured to input the initial training data into a deep learning model for training to obtain a basic model;
a second acquisition module, configured to acquire specific text and specific audio that characterize the emotion data of a specific speaker, convert the specific text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data;
a second training module, configured to input the specific training data into the basic model for training to obtain a speech synthesis model;
a conversion module, configured to convert the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, fill the emotion slot of the phoneme sequence to be synthesized to obtain synthesis input data, and input the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion.
In one embodiment, the first training module is further configured to:
input the initial training data into the deep learning model, where it passes through an encoder, an attention mechanism, and a decoder, the output being a mel spectrogram; a back-propagation algorithm is applied to the mel spectrogram, and the basic model is obtained by training.
In one embodiment, the deep learning model is an end-to-end model.
A computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the steps of the method for emotional speech synthesis of a specific speaker under extremely low resources in any of the above embodiments are implemented.
A computer-readable storage medium has a computer program stored thereon; when the computer program is executed by a processor, the steps of the method for emotional speech synthesis of a specific speaker under extremely low resources in any of the above embodiments are implemented.
With the above method, device, computer device, and storage medium for emotional speech synthesis of a specific speaker under extremely low resources, training text and the audio corresponding to the training text are obtained, the training text is converted into a corresponding phoneme sequence, and a slot with an emotion vector is embedded in the obtained phoneme sequence to obtain initial training data; the initial training data is input into a deep learning model for training to obtain a basic model; specific text and specific audio that characterize the emotion data of a specific speaker are obtained, the specific text is converted into a corresponding phoneme sequence, and a slot with an emotion vector is embedded in the obtained phoneme sequence to obtain specific training data; the specific training data is input into the basic model for training to obtain a speech synthesis model; the text to be synthesized is converted into a phoneme sequence to obtain the phoneme sequence to be synthesized, the emotion slot of the phoneme sequence to be synthesized is filled to obtain synthesis input data, and the synthesis input data is input into the speech synthesis model to obtain speech audio with the specific emotion, thereby reducing the cost of obtaining speech audio with a specific emotion and improving the flexibility of the corresponding emotional speech synthesis scheme. Further, in the process of emotional speech synthesis for a specific speaker, there is no need to obtain a large amount of emotion-annotated data, which greatly reduces data dependence; the model is obtained by transfer learning from a pre-trained model, which preserves the naturalness of the synthesized speech to the greatest extent while embedding emotion, ensuring the accuracy of the resulting speech audio with a specific emotion.
Brief Description of the Drawings
Fig. 1 is a flowchart of a method for emotional speech synthesis of a specific speaker under extremely low resources according to an embodiment;
Fig. 2 is a schematic diagram of a basic model according to an embodiment;
Fig. 3 is a schematic diagram of a speech synthesis model according to an embodiment;
Fig. 4 is a schematic structural diagram of a device for emotional speech synthesis of a specific speaker under extremely low resources according to an embodiment;
Fig. 5 is a schematic diagram of a computer device according to an embodiment.
Detailed Description of the Embodiments
To make the purpose, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain this application and are not intended to limit it.
Reference to an "embodiment" herein means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art can understand that the embodiments described herein can be combined with other embodiments.
The method for emotional speech synthesis of a specific speaker under extremely low resources provided by this application can be applied to an emotional speech synthesis system for a specific speaker. The system can obtain training text and audio corresponding to the training text, convert the training text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data; input the initial training data into a deep learning model for training to obtain a basic model; obtain specific text and specific audio that characterize the emotion data of a specific speaker, convert the specific text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data; input the specific training data into the basic model for training to obtain a speech synthesis model; convert the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, fill the emotion slot of the phoneme sequence to be synthesized to obtain synthesis input data, and input the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion, thereby reducing the cost of obtaining speech audio with a specific emotion and improving the flexibility of the corresponding emotional speech synthesis scheme. The emotional speech synthesis system for a specific speaker may be, but is not limited to, various intelligent processing devices such as personal computers and notebook computers.
In one embodiment, as shown in Fig. 1, a method for emotional speech synthesis of a specific speaker under extremely low resources is provided. The method is described by taking its application to an emotional speech synthesis system for a specific speaker as an example, and includes the following steps:
S10: obtaining training text and audio corresponding to the training text, converting the training text into a corresponding phoneme sequence, and embedding a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data.
Specifically, the above step may perform text processing on the training text to obtain the initial training data. Text processing refers to phoneme processing of text with emotion slots. The initial training data of the speech synthesis model includes the text and the corresponding audio; when the text data is processed, it is first converted into the corresponding phoneme sequence, and a slot with an emotion vector is embedded in the processed sequence so that the emotion information of the sentence is carried into the phoneme sequence. The result is a phoneme sequence with an emotion slot, which serves as the input when training the synthesis model. The slot with an emotion vector can reserve positions for a variety of emotions such as happiness, surprise, fear, sadness, anger, disgust, and neutral.
S20: inputting the initial training data into a deep learning model for training to obtain a basic model.
Specifically, the deep learning model is an end-to-end model.
In one embodiment, inputting the initial training data into a deep learning model for training to obtain the basic model includes:
inputting the initial training data into the deep learning model, where it passes through an encoder, an attention mechanism, and a decoder, the output being a mel spectrogram; a back-propagation algorithm is applied to the mel spectrogram, and the basic model is obtained by training.
In this embodiment, the corresponding model training uses an end-to-end model. The training input is the emotion-slot-embedded phoneme sequence obtained in step S10; it passes through an encoder, an attention mechanism, and a decoder, the output is a mel spectrogram, and through the back-propagation algorithm the basic model is finally obtained by training.
In one example, the above basic model can be seen in Fig. 2. In Fig. 2, the text 101 is processed by the emotion-slot phoneme processing module 102 to form a phoneme sequence with slots, which is encoded by the encoder 103 into high-dimensional features of phonemes with emotion slots and decoded by the decoder 105 to obtain the mel features 106; the loss 107 is computed, gradient back-propagation 108 is applied, and the parameters of 103, 104, and 105 are trained.
S30: obtaining specific text and specific audio that characterize the emotion data of a specific speaker, converting the specific text into a corresponding phoneme sequence, and embedding a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data.
The above specific speaker may also be called the target speaker; through the above method for emotional speech synthesis of a specific speaker, speech audio matching the emotion of this kind of speaker needs to be obtained.
In the above step, the text of the specific speaker's emotion data is processed in the same way as in step S10, but the text data comes from the specific speaker; the purpose is to use it for the subsequent transfer-learning training.
S40: inputting the specific training data into the basic model for training to obtain a speech synthesis model.
The above step performs transfer learning based on emotion-slot filling for the specific speaker. The training data comes from the data in step S30, and only about 200 sentences (roughly 15 minutes) of speech data are needed for each emotion. The pre-trained model is the basic model obtained by training in step S20; on this basic model, its encoding module is fixed (see the component 203 shown in Fig. 3) and its decoding module continues to be trained, finally yielding a speech synthesis model for the specific speaker.
In one example, the speech synthesis model can be seen in Fig. 3. In Fig. 3, the text 201 is processed by the emotion-slot phoneme processing module 202 to form a phoneme sequence with slots, which is encoded by the encoder 203 (the encoder 103 of the basic model trained in Fig. 2, with fixed parameters) into high-dimensional features of phonemes with emotion slots and decoded by the decoder 205 to obtain the mel features 206. During transfer-learning training, the loss 207 is computed, gradient back-propagation 208 is applied, and the parameters of 204 and 205 are trained; at inference time, 206 passes directly through the vocoder 210 to generate emotional speech.
S50: converting the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, filling the emotion slot of the phoneme sequence to be synthesized to obtain synthesis input data, and inputting the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion.
In the above step, after the text to be synthesized is processed into a phoneme sequence, the corresponding emotion slot value is filled into the sequence, and through inference with the model from step S40, speech audio with the specified emotion is obtained. To ensure that the emotion is correct and the interaction experience is not affected, the emotion slot here can be filled manually.
With the above method for emotional speech synthesis of a specific speaker under extremely low resources, training text and the audio corresponding to the training text are obtained, the training text is converted into a corresponding phoneme sequence, and a slot with an emotion vector is embedded in the obtained phoneme sequence to obtain initial training data; the initial training data is input into a deep learning model for training to obtain a basic model; specific text and specific audio that characterize the emotion data of a specific speaker are obtained, the specific text is converted into a corresponding phoneme sequence, and a slot with an emotion vector is embedded in the obtained phoneme sequence to obtain specific training data; the specific training data is input into the basic model for training to obtain a speech synthesis model; the text to be synthesized is converted into a phoneme sequence to obtain the phoneme sequence to be synthesized, the emotion slot is filled to obtain synthesis input data, and the synthesis input data is input into the speech synthesis model to obtain speech audio with the specific emotion, thereby reducing the cost of obtaining speech audio with a specific emotion and improving the flexibility of the corresponding emotional speech synthesis scheme. Further, in the process of emotional speech synthesis for a specific speaker, there is no need to obtain a large amount of emotion-annotated data, which greatly reduces data dependence; the model is obtained by transfer learning from a pre-trained model, which preserves the naturalness of the synthesized speech to the greatest extent while embedding emotion, ensuring the accuracy of the resulting speech audio with a specific emotion.
Please refer to Fig. 4, which is a schematic structural diagram of a device for emotional speech synthesis of a specific speaker under extremely low resources according to an embodiment, including:
a first acquisition module 10, configured to acquire training text and audio corresponding to the training text, convert the training text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data;
a first training module 20, configured to input the initial training data into a deep learning model for training to obtain a basic model;
a second acquisition module 30, configured to acquire specific text and specific audio that characterize the emotion data of a specific speaker, convert the specific text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data;
a second training module 40, configured to input the specific training data into the basic model for training to obtain a speech synthesis model;
a conversion module 50, configured to convert the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, fill the emotion slot of the phoneme sequence to be synthesized to obtain synthesis input data, and input the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion.
In one embodiment, the first training module is further configured to:
input the initial training data into the deep learning model, where it passes through an encoder, an attention mechanism, and a decoder, the output being a mel spectrogram; a back-propagation algorithm is applied to the mel spectrogram, and the basic model is obtained by training.
In one embodiment, the deep learning model is an end-to-end model.
For the specific limitations of the device for emotional speech synthesis of a specific speaker under extremely low resources, reference may be made to the limitations of the corresponding method above, which are not repeated here. Each module of the above device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in Fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a method for emotional speech synthesis of a specific speaker under extremely low resources is implemented. The display screen of the computer device may be a liquid crystal display or an electronic ink display; the input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art can understand that the structure shown in Fig. 5 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device to which the solution of this application is applied; the specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
Based on the above examples, in one embodiment a computer device is further provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein when the processor executes the program, the method for emotional speech synthesis of a specific speaker under extremely low resources in any of the above embodiments is implemented.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing relevant hardware through a computer program. The program can be stored in a non-volatile computer-readable storage medium; as in the embodiments of the present invention, the program can be stored in the storage medium of a computer system and executed by at least one processor in the computer system to implement the processes of the embodiments of the above method for emotional speech synthesis of a specific speaker under extremely low resources. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Accordingly, in one embodiment a computer-readable storage medium is further provided, on which a computer program is stored, wherein when the program is executed by a processor, the method for emotional speech synthesis of a specific speaker under extremely low resources in any of the above embodiments is implemented.
The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described; however, as long as the combinations of these technical features are not contradictory, they should all be considered within the scope of this specification.
It should be noted that the terms "first/second/third" involved in the embodiments of this application merely distinguish similar objects and do not represent a specific ordering of the objects. It can be understood that "first/second/third" may be interchanged in a specific order or sequence where permitted. It should be understood that the objects distinguished by "first/second/third" may be interchanged where appropriate, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.
The terms "include" and "have" in the embodiments of this application, and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, apparatus, product, or device that includes a series of steps or modules is not limited to the listed steps or modules, but optionally also includes steps or modules that are not listed, or optionally also includes other steps or modules inherent to the process, method, product, or device.
The above embodiments express only several implementations of this application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be pointed out that, for those of ordinary skill in the art, several variations and improvements can be made without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (8)

  1. A method for emotional speech synthesis of a specific speaker under extremely low resources, characterized in that it includes the following steps:
    S10: obtaining training text and audio corresponding to the training text, converting the training text into a corresponding phoneme sequence, and embedding a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data;
    S20: inputting the initial training data into a deep learning model for training to obtain a basic model;
    S30: obtaining specific text and specific audio that characterize the emotion data of a specific speaker, converting the specific text into a corresponding phoneme sequence, and embedding a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data;
    S40: inputting the specific training data into the basic model for training to obtain a speech synthesis model;
    S50: converting the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, filling the emotion slot of the phoneme sequence to be synthesized to obtain synthesis input data, and inputting the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion.
  2. The method for emotional speech synthesis of a specific speaker under extremely low resources according to claim 1, characterized in that inputting the initial training data into a deep learning model for training to obtain the basic model includes:
    inputting the initial training data into the deep learning model, where it passes through an encoder, an attention mechanism, and a decoder, the output being a mel spectrogram; a back-propagation algorithm is applied to the mel spectrogram, and the basic model is obtained by training.
  3. The method for emotional speech synthesis of a specific speaker under extremely low resources according to claim 1, characterized in that the deep learning model is an end-to-end model.
  4. A device for emotional speech synthesis of a specific speaker under extremely low resources, characterized in that it includes:
    a first acquisition module, configured to acquire training text and audio corresponding to the training text, convert the training text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain initial training data;
    a first training module, configured to input the initial training data into a deep learning model for training to obtain a basic model;
    a second acquisition module, configured to acquire specific text and specific audio that characterize the emotion data of a specific speaker, convert the specific text into a corresponding phoneme sequence, and embed a slot with an emotion vector in the obtained phoneme sequence to obtain specific training data;
    a second training module, configured to input the specific training data into the basic model for training to obtain a speech synthesis model;
    a conversion module, configured to convert the text to be synthesized into a phoneme sequence to obtain the phoneme sequence to be synthesized, fill the emotion slot of the phoneme sequence to be synthesized to obtain synthesis input data, and input the synthesis input data into the speech synthesis model to obtain speech audio with the specific emotion.
  5. The device for emotional speech synthesis of a specific speaker under extremely low resources according to claim 4, characterized in that the first training module is further configured to:
    input the initial training data into the deep learning model, where it passes through an encoder, an attention mechanism, and a decoder, the output being a mel spectrogram; a back-propagation algorithm is applied to the mel spectrogram, and the basic model is obtained by training.
  6. The device for emotional speech synthesis of a specific speaker under extremely low resources according to claim 4, characterized in that the deep learning model is an end-to-end model.
  7. A computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that when the processor executes the computer program, the steps of the method according to any one of claims 1 to 3 are implemented.
  8. A computer-readable storage medium on which a computer program is stored, characterized in that when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 3 are implemented.
PCT/CN2021/074826 2020-04-21 2021-02-02 Method and device for emotional speech synthesis of a specific speaker under extremely low resources WO2021212954A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010317018.X 2020-04-21
CN202010317018.XA CN111627420B (zh) 2020-04-21 2020-04-21 Method and device for emotional speech synthesis of a specific speaker under extremely low resources

Publications (1)

Publication Number Publication Date
WO2021212954A1 true WO2021212954A1 (zh) 2021-10-28

Family

ID=72258984

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/074826 WO2021212954A1 (zh) 2020-04-21 2021-02-02 极低资源下的特定发音人情感语音合成方法及装置

Country Status (2)

Country Link
CN (1) CN111627420B (zh)
WO (1) WO2021212954A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111627420B (zh) * 2020-04-21 2023-12-08 升智信息科技(南京)有限公司 Method and device for emotional speech synthesis of a specific speaker under extremely low resources
CN113053357B (zh) * 2021-01-29 2024-03-12 网易(杭州)网络有限公司 Speech synthesis method, apparatus, device, and computer-readable storage medium
CN113257225B (zh) * 2021-05-31 2021-11-02 之江实验室 Emotional speech synthesis method and system fusing lexical and phoneme pronunciation features
CN113450760A (zh) * 2021-06-07 2021-09-28 北京一起教育科技有限责任公司 Text-to-speech method, apparatus, and electronic device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108777140A (zh) * 2018-04-27 2018-11-09 南京邮电大学 VAE-based voice conversion method under non-parallel corpus training
CN108831435A (zh) * 2018-06-06 2018-11-16 安徽继远软件有限公司 Emotional speech synthesis method based on multi-emotion speaker adaptation
CN109036370A (zh) * 2018-06-06 2018-12-18 安徽继远软件有限公司 Speaker voice adaptive training method
WO2019161011A1 (en) * 2018-02-16 2019-08-22 Dolby Laboratories Licensing Corporation Speech style transfer
CN110189742A (zh) * 2019-05-30 2019-08-30 芋头科技(杭州)有限公司 Method and related apparatus for determining emotional audio, displaying emotion, and converting text to speech
CN110335587A (zh) * 2019-06-14 2019-10-15 平安科技(深圳)有限公司 Speech synthesis method, system, terminal device, and readable storage medium
WO2019231638A1 (en) * 2018-05-31 2019-12-05 Microsoft Technology Licensing, Llc A highly empathetic tts processing
CN111627420A (zh) * 2020-04-21 2020-09-04 升智信息科技(南京)有限公司 Method and device for emotional speech synthesis of a specific speaker under extremely low resources

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0113570D0 (en) * 2001-06-04 2001-07-25 Hewlett Packard Co Audio-form presentation of text messages
CN102385858B (zh) * 2010-08-31 2013-06-05 国际商业机器公司 Emotional speech synthesis method and system
CN109754779A (zh) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 Controllable emotional speech synthesis method and apparatus, electronic device, and readable storage medium
KR102057927B1 (ko) * 2019-03-19 2019-12-20 휴멜로 주식회사 Speech synthesis apparatus and method therefor
CN110148398A (zh) * 2019-05-16 2019-08-20 平安科技(深圳)有限公司 Training method, apparatus, device, and storage medium for a speech synthesis model
CN110264991B (zh) * 2019-05-20 2023-12-22 平安科技(深圳)有限公司 Training method for a speech synthesis model, speech synthesis method, apparatus, device, and storage medium
CN110379409B (zh) * 2019-06-14 2024-04-16 平安科技(深圳)有限公司 Speech synthesis method, system, terminal device, and readable storage medium
CN110211563A (zh) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Scenario- and emotion-oriented Chinese speech synthesis method, apparatus, and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019161011A1 (en) * 2018-02-16 2019-08-22 Dolby Laboratories Licensing Corporation Speech style transfer
CN108777140A (zh) * 2018-04-27 2018-11-09 南京邮电大学 VAE-based voice conversion method under non-parallel corpus training
WO2019231638A1 (en) * 2018-05-31 2019-12-05 Microsoft Technology Licensing, Llc A highly empathetic tts processing
CN108831435A (zh) * 2018-06-06 2018-11-16 安徽继远软件有限公司 Emotional speech synthesis method based on multi-emotion speaker adaptation
CN109036370A (zh) * 2018-06-06 2018-12-18 安徽继远软件有限公司 Speaker voice adaptive training method
CN110189742A (zh) * 2019-05-30 2019-08-30 芋头科技(杭州)有限公司 Method and related apparatus for determining emotional audio, displaying emotion, and converting text to speech
CN110335587A (zh) * 2019-06-14 2019-10-15 平安科技(深圳)有限公司 Speech synthesis method, system, terminal device, and readable storage medium
CN111627420A (zh) * 2020-04-21 2020-09-04 升智信息科技(南京)有限公司 Method and device for emotional speech synthesis of a specific speaker under extremely low resources

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEE YOUNGGUN, RABIEE AZAM, LEE SOO-YOUNG: "Emotional End-to-End Neural Speech Synthesizer", 28 November 2017 (2017-11-28), XP055829334, Retrieved from the Internet <URL:https://arxiv.org/pdf/1711.05447.pdf> [retrieved on 20210802] *
ZHI PENGPENG, YANG HONGWU, SONG NAN: "DNN-based emotional speech synthesis by speaker adaptation", JOURNAL OF CHONGQING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS (NATURAL SCIENCE EDITION), vol. 30, no. 5, 30 October 2018 (2018-10-30), pages 673-679, XP055860097, ISSN: 1673-825X, DOI: 10.3979/j.issn.1673-825X.2018.05.013 *

Also Published As

Publication number Publication date
CN111627420A (zh) 2020-09-04
CN111627420B (zh) 2023-12-08

Similar Documents

Publication Publication Date Title
JP7280386B2 (ja) Multilingual speech synthesis and cross-language voice cloning
JP7395792B2 (ja) Two-level speech prosody transfer
JP7436709B2 (ja) Speech recognition using unspoken text and speech synthesis
WO2021212954A1 (zh) Method and device for emotional speech synthesis of a specific speaker under extremely low resources
US20220392430A1 (en) System Providing Expressive and Emotive Text-to-Speech
JP2020034883A (ja) Speech synthesis apparatus and program
US20230111824A1 (en) Computing system for unsupervised emotional text to speech training
CN113421550A (zh) Speech synthesis method and apparatus, readable medium, and electronic device
CN114242033A (zh) Speech synthesis method, apparatus, device, storage medium, and program product
CN112102811A (zh) Optimization method and apparatus for synthesized speech, and electronic device
CN116798405B Speech synthesis method, apparatus, storage medium, and electronic device
CN112242134A (zh) Speech synthesis method and apparatus
CN116312471A Voice transfer and voice interaction method, apparatus, electronic device, and storage medium
JP7357518B2 (ja) Speech synthesis apparatus and program
CN116825090B Training method and apparatus for a speech synthesis model, and speech synthesis method and apparatus
CN116863909B Factor-graph-based speech synthesis method, apparatus, and system
Belonozhko et al. Features of the Implementation of Real-Time Text-to-Speech Systems With Data Variability
KR20240035548A (ko) Two-level text-to-speech system using synthetic training data
Nitisaroj et al. The Lessac Technologies system for Blizzard Challenge 2010
TWM621764U Customized voice service system
CN115346510A Speech synthesis method and apparatus, electronic device, and storage medium
CN113192484A Method, device, and storage medium for generating audio based on text
Jeffrey You Don’t Say? Enriching Human-Computer Interactions through Voice Synthesis
CN117672179A Speech synthesis method and system supporting intelligent processing
CN116778907A Multimodality-based speech synthesis method, apparatus, device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21791607

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21791607

Country of ref document: EP

Kind code of ref document: A1