WO2021127811A1 - Speech synthesis method and apparatus, intelligent terminal, and readable medium - Google Patents

Speech synthesis method and apparatus, intelligent terminal, and readable medium

Info

Publication number
WO2021127811A1
WO2021127811A1 PCT/CN2019/127327 CN2019127327W WO2021127811A1 WO 2021127811 A1 WO2021127811 A1 WO 2021127811A1 CN 2019127327 W CN2019127327 W CN 2019127327W WO 2021127811 A1 WO2021127811 A1 WO 2021127811A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing
sampling
feature
module
processed
Prior art date
Application number
PCT/CN2019/127327
Other languages
English (en)
French (fr)
Inventor
黄东延
盛乐园
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to PCT/CN2019/127327 priority Critical patent/WO2021127811A1/zh
Priority to CN201980003174.4A priority patent/CN111133507B/zh
Priority to US17/115,729 priority patent/US11417316B2/en
Publication of WO2021127811A1 publication Critical patent/WO2021127811A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047Architecture of speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a speech synthesis method, device, smart terminal, and readable medium.
  • Speech synthesis can convert text into natural speech output.
  • Generally, a speech synthesis system includes a text analysis stage and a speech synthesis stage.
  • Deep learning can integrate the text analysis stage and the speech synthesis stage into a single end-to-end model.
  • Such an end-to-end model mainly involves two steps: the first maps text to speech features, and the second converts the speech features into synthesized speech.
  • Mel spectrum features can serve as the intermediate feature variable for conversion between text and speech, effectively realizing the text-to-speech synthesis process.
  • However, the mel spectrum features obtained by analyzing text lack much of the rich information present in real mel spectrum features, and a certain gap exists between the two. As a result, speech synthesized from such mel spectrum features does not sound natural enough.
  • a method of speech synthesis is proposed, including: obtaining text to be synthesized and extracting its to-be-processed mel spectrum features according to a preset speech feature extraction algorithm; inputting the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature; performing average pooling and first down-sampling to obtain a second intermediate feature and, taking the two intermediate features as input, performing deconvolution and first up-sampling to obtain the target mel spectrum feature; and converting the target mel spectrum feature into speech to generate target speech corresponding to the text to be synthesized.
  • a speech synthesis device includes:
  • the feature extraction module is used to obtain the text to be synthesized and extract the to-be-processed mel spectrum features of the text according to a preset speech feature extraction algorithm;
  • the ResUnet module is used to input the to-be-processed mel spectrum features into the preset ResUnet network model to obtain the first intermediate feature;
  • the post-processing module is used to perform average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature and, taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, to perform deconvolution and first up-sampling to obtain the target mel spectrum feature corresponding to the to-be-processed mel spectrum features;
  • the speech synthesis module is used to convert the target mel spectrum feature into speech and generate target speech corresponding to the text to be synthesized.
  • an intelligent terminal is proposed.
  • An intelligent terminal includes a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above speech synthesis method: extracting the to-be-processed mel spectrum features, obtaining the first and second intermediate features, obtaining the target mel spectrum feature, and converting the target mel spectrum feature into speech to generate target speech corresponding to the text to be synthesized.
  • a computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the same steps: extracting the to-be-processed mel spectrum features, obtaining the first and second intermediate features, obtaining the target mel spectrum feature, and converting the target mel spectrum feature into speech to generate target speech corresponding to the text to be synthesized.
  • In the speech synthesis process, the mel spectrum features of the text to be synthesized are first extracted; the extracted features are then passed through the ResUnet network model for down-sampling, residual connection, and up-sampling to obtain the corresponding first intermediate feature.
  • In the post-processing, average pooling and down-sampling are performed on the extracted mel spectrum features; the result is skip-added with the first intermediate feature; multiple rounds of deconvolution and up-sampling follow, with each result skip-added to the corresponding down-sampled result; the final target mel spectrum feature is obtained, and speech is then synthesized from it.
  • Processing the mel spectrum features through the ResUnet network model and the post-processing gives them both high-resolution detail and global low-resolution characteristics, which improves the accuracy of mel spectrum feature extraction and thereby the accuracy of subsequent speech synthesis.
  • FIG. 1 is an application environment diagram of a speech synthesis method according to an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of a speech synthesis method according to an embodiment of the present application;
  • FIG. 3 is a schematic diagram of mel spectrum features in an embodiment of the present application;
  • FIG. 4 is a schematic structural diagram of the ResUnet network model in an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of the ResUnet network model in an embodiment of the present application;
  • FIG. 6 is a schematic flowchart of the data processing performed by the ResUnet network model in an embodiment of the present application;
  • FIG. 7 is a schematic flowchart of the post-processing procedure in an embodiment of the present application;
  • FIG. 8 is a schematic flowchart of the post-processing procedure in an embodiment of the present application;
  • FIG. 9 is a schematic comparison diagram of mel spectrum features in an embodiment of the present application;
  • FIG. 10 is a schematic structural diagram of a speech synthesis apparatus in an embodiment of the present application;
  • FIG. 11 is a schematic structural diagram of a computer device running the above speech synthesis method according to an embodiment of the present application.
  • Fig. 1 is an application environment diagram of a speech synthesis method in an embodiment.
  • the speech synthesis method can be applied to a speech synthesis system.
  • the speech synthesis system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a robot, a mobile phone, a tablet computer, and a notebook computer.
  • the server 120 may be implemented as an independent server or a server cluster composed of multiple servers.
  • the terminal 110 is used to analyze and process the text to be synthesized, and the server 120 is used for model training and prediction.
  • the speech synthesis system to which the aforementioned speech synthesis method is applied may also be implemented based on the terminal 110.
  • the terminal is used for model training and prediction, and converts the text to be synthesized into speech.
  • a speech synthesis method is provided.
  • the method can be applied to either a terminal or a server; this embodiment takes application to a terminal as an example.
  • the speech synthesis method specifically includes the following steps:
  • Step S102 Obtain the text to be synthesized, and extract the Mel spectrum feature of the text to be synthesized according to a preset speech feature extraction algorithm.
  • The text to be synthesized is text information that requires speech synthesis, for example, text that needs to be converted into speech in scenarios such as voice chatbots and voice news reading.
  • For example, the text to be synthesized could be "From that moment on, she no longer belittled herself."
  • the text to be synthesized is analyzed, and the corresponding Mel spectrum feature is extracted according to the preset speech feature extraction algorithm as the Mel spectrum feature to be processed.
  • Mel spectrum features (Mel Bank Features) can be used to characterize the acoustic features of sounds or sentences.
  • the Mel spectrum feature is used as an intermediate feature between text information and speech.
  • Step S104 Input the mel spectrum feature to be processed into a preset ResUnet network model to obtain a first intermediate feature.
  • The ResUnet network model can perform down-sampling, residual connection, and up-sampling on the to-be-processed mel spectrum features to obtain the corresponding first intermediate feature, which is used in the subsequent computation.
  • the second down-sampling process, the residual connection process, and the second up-sampling process are performed on the to-be-processed Mel spectrum feature through the ResUnet network model to obtain the first intermediate feature.
  • the ResUnet network model is used to first perform the second down-sampling process on the Mel spectrum features to be processed, then perform the residual connection process on the down-sampled features, and then perform the second up-sampling process.
  • During this process, the number of data channels goes small → large → small, while the data dimensions go large → small → large.
  • As the number of channels grows, the abstract semantic information contained in the features gradually increases.
  • As the number of channels shrinks again, the features not only contain rich semantic information but also, by means of up-sampling and data addition, retain enough spatial detail to be restored to the same resolution as the input to-be-processed mel spectrum features.
  • the ResUnet network model includes an up-sampling module, a residual connection module, and a down-sampling module.
  • As shown in Figure 4, a schematic diagram of the structure of the three modules included in the ResUnet network model is given.
  • The down-sampling module (UNet ConvBlock) contains 2 groups of (Conv2d, BatchNorm2d, Relu) structures, where Conv2d denotes a two-dimensional convolutional layer, BatchNorm2d denotes two-dimensional batch normalization, and Relu denotes a rectified linear unit.
  • The residual connection module (Residual Unit) includes the down-sampling module on the left and a group of (Conv2d, BatchNorm2d, Relu) structures on the right.
  • The input of the residual connection module is processed by the down-sampling module and by the (Conv2d, BatchNorm2d, Relu) structure respectively, and the two results are skip-added; this implements a skip connection that compensates for the information lost during down-sampling.
  • The up-sampling module (Unet-Up ResBlock) contains a left branch and a right branch. The left branch leaves the input unchanged; in the right branch, Residual Unit denotes the residual connection module, whose output passes through MaxPool2d, Dropout2d, ConvTranspose2d, BatchNorm2d, and Relu before being skip-added with the left branch.
  • MaxPool2d denotes a two-dimensional max pooling layer;
  • Dropout2d denotes a two-dimensional dropout layer;
  • ConvTranspose2d denotes a two-dimensional deconvolution (transposed convolution) layer;
  • BatchNorm2d denotes two-dimensional batch normalization;
  • Relu denotes a rectified linear unit.
  • The to-be-processed mel spectrum features are input into the down-sampling module (UNet ConvBlock) of the ResUnet network model, then pass through 5 residual connection modules (Residual Unit) and finally through 5 up-sampling modules (Unet-Up ResBlock); the result after each up-sampling module is skip-added with the output of the corresponding residual connection module or down-sampling module on the left.
  • The mel spectrum features input to the down-sampling module have 3 data channels and the output has 64 data channels; passing through the residual connection modules raises the number of data channels from 64 to 128, 256, 512, 1024, and 2048; passing through the up-sampling modules lowers it from 2048 to 1024, 512, 256, 128, and 64. That is, in the embodiment shown in FIG. 5, the final output features have 64 data channels.
  • In the process of passing through the down-sampling module and the residual connection modules on the left, the features become smaller while the channels become more numerous, allowing more global semantic information to be captured.
  • The repeated down-sampling and convolution in these modules steadily increase the channel count and shrink the features, i.e. reduce the resolution; in this process the features become more efficient and abstract, but also lose considerable spatial detail.
  • In the process of passing through the up-sampling modules on the right, up-sampling enlarges the features and deconvolution reduces the channel count; after each up-sampling, the features are skip-added with the features obtained from the corresponding down-sampling or residual connection module.
  • After the above process, the features possess both high-resolution detail and abstract low-resolution characteristics; that is, the final features combine information at different scales and retain enough spatial detail to make the prediction more accurate.
  • the numbers of residual connection modules and up-sampling modules are the same.
  • the ResUnet network model includes a down-sampling module, at least one residual connection module, and at least one up-sampling module, and the number of residual connection modules is the same as the number of up-sampling modules.
  • The foregoing step of inputting the to-be-processed mel spectrum features into the preset ResUnet network model to obtain the first intermediate feature includes steps S1041-S1043, as shown in FIG. 6:
  • Step S1041 Perform a second down-sampling process on the to-be-processed Mel spectrum feature by the down-sampling module;
  • Step S1042 Perform second down-sampling processing and residual connection processing on the result output by the down-sampling module through at least one of the residual connection modules;
  • Step S1043: Perform a second up-sampling process on the result output by the residual connection modules through at least one of the up-sampling modules, and add the output after the second up-sampling to the result output by the residual connection module to obtain the first intermediate feature.
  • Step S106: Perform average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature; taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, perform deconvolution and first up-sampling to obtain the target mel spectrum feature corresponding to the to-be-processed mel spectrum features.
  • Bottom-up average pooling and down-sampling are performed on the to-be-processed mel spectrum features extracted from the text to be synthesized, yielding the second intermediate feature.
  • The first intermediate feature output by the ResUnet network model is then skip-added with the second intermediate feature obtained from average pooling and down-sampling, followed by deconvolution and first up-sampling; after each up-sampling, the result is skip-added with the result of the corresponding first down-sampling, yielding the final target mel spectrum feature.
  • The first down-sampling is performed at least once, the corresponding first up-sampling is likewise performed at least once, and the number of first down-sampling operations equals the number of first up-sampling operations.
  • step S106 may be referred to as a post-processing process, which specifically includes steps S1061-S1065 as shown in FIG. 7:
  • Step S1061: Perform average pooling on the to-be-processed mel spectrum features at least once;
  • Step S1062: After each average pooling, perform first down-sampling on the result to obtain the second intermediate feature.
  • Step S1063: Perform deconvolution on the first intermediate feature and the second intermediate feature;
  • Step S1064: Perform first up-sampling on the result at least once;
  • Step S1065: Add the result of the first up-sampling to the result of the corresponding first down-sampling and perform deconvolution, obtaining the target mel spectrum feature.
  • See FIG. 8 for a schematic diagram of the above post-processing flow for the mel spectrum features.
  • Suppose the to-be-processed mel spectrum feature is 512*512 in size. As shown in Figure 8, average pooling (for example, two-dimensional average pooling) is performed first, followed by the first first-down-sampling, yielding a 256*256 mel spectrum feature; average pooling followed by the second first-down-sampling yields a 128*128 feature; and average pooling followed by the third first-down-sampling yields a 64*64 feature, i.e. the second intermediate feature.
  • the first intermediate feature output by the ResUnet network model can also be a mel spectrum feature with a size of 64*64.
  • The first intermediate feature is skip-added with the second intermediate feature, followed by deconvolution (two-dimensional deconvolution) and the first first-up-sampling (to 128*128); the result is skip-added with the result of the second first-down-sampling, followed by deconvolution and the second first-up-sampling (to 256*256); that result is skip-added with the result of the first first-down-sampling, followed by deconvolution and the third first-up-sampling (to 512*512); finally, the result is skip-added with the 512*512 to-be-processed mel spectrum to obtain the final target mel spectrum feature, whose size is 512*512.
  • Average pooling and first down-sampling enrich the global semantic information contained in the features, while deconvolution, first up-sampling, and skip addition with the first-down-sampled results give the features not only rich semantic information but also sufficient spatial detail.
  • As a result, the features retain high resolution while yielding more accurate predictions.
  • FIG. 9 shows a comparison of the mel spectrum features before and after the above processing of the to-be-processed mel spectrum features.
  • Step S108 Convert the target Mel spectrum feature into speech, and generate a target speech corresponding to the text to be synthesized.
  • In the speech synthesis step, the target mel spectrum feature is taken as input; speech synthesis is performed on the target mel spectrum feature corresponding to the text to be synthesized through a preset acoustic encoder, and the corresponding target speech is output.
  • a speech synthesis device is provided.
  • the above-mentioned speech synthesis device includes:
  • the feature extraction module 202 is configured to obtain the text to be synthesized, and extract the Mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm;
  • the ResUnet module 204 is configured to input the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature;
  • the post-processing module 206 is configured to perform average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature and, taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, to perform deconvolution and first up-sampling to obtain the target mel spectrum feature corresponding to the to-be-processed mel spectrum features;
  • the speech synthesis module 208 is configured to convert the target Mel spectrum characteristics into speech, and generate a target speech corresponding to the text to be synthesized.
  • the ResUnet module 204 is further configured to perform a second down-sampling process, a residual connection process, and a second up-sampling process on the to-be-processed Mel spectrum feature through the ResUnet network model to obtain the The first intermediate feature.
  • the ResUnet network model includes an up-sampling module, at least one residual connection module, and at least one down-sampling module;
  • the ResUnet module 204 is also used for:
  • the post-processing module 206 is further used for:
  • the post-processing module 206 is further used for:
  • Fig. 11 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device may specifically be a terminal or a server.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and may also store a computer program.
  • the processor can realize the speech synthesis method.
  • a computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor can execute the speech synthesis method.
  • Those skilled in the art can understand that the structure shown in FIG. 11 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
  • an intelligent terminal is proposed, which includes a memory and a processor; the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the above speech synthesis method, culminating in converting the target mel spectrum feature into speech and generating target speech corresponding to the text to be synthesized.
  • a computer-readable storage medium is proposed, which stores a computer program which, when executed by a processor, causes the processor to perform the steps of the above speech synthesis method, culminating in converting the target mel spectrum feature into speech and generating target speech corresponding to the text to be synthesized.
  • After adopting the above speech synthesis method, device, smart terminal, and computer-readable storage medium, in the speech synthesis process the mel spectrum features of the text to be synthesized are first extracted; the extracted features are then passed through the ResUnet network model for down-sampling, residual connection, and up-sampling to obtain the corresponding first intermediate feature.
  • In the post-processing, average pooling and down-sampling are performed on the extracted mel spectrum features; the result is skip-added with the first intermediate feature; multiple rounds of deconvolution and up-sampling follow, with each result skip-added to the corresponding down-sampled result; the final target mel spectrum feature is obtained, and speech is then synthesized from it.
  • Processing the mel spectrum features through the ResUnet network model and the post-processing gives them both high-resolution detail and global low-resolution characteristics, which improves the accuracy of mel spectrum feature extraction and thereby the accuracy of subsequent speech synthesis.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

A speech synthesis method, comprising: obtaining text to be synthesized, and extracting to-be-processed mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm (S102); inputting the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature (S104); performing average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature, and, taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, performing deconvolution and first up-sampling to obtain a target mel spectrum feature corresponding to the to-be-processed mel spectrum features (S106); and converting the target mel spectrum feature into speech to generate target speech corresponding to the text to be synthesized (S108). A speech synthesis apparatus, an intelligent terminal, and a computer-readable storage medium are also disclosed. The present application improves the accuracy of predicting the mel spectrum features of text, thereby improving the accuracy of speech synthesis.

Description

Speech synthesis method and apparatus, intelligent terminal, and readable medium

Technical Field

This application relates to the field of artificial intelligence technology, and in particular to a speech synthesis method and apparatus, an intelligent terminal, and a readable medium.

Background

With the rapid development of the mobile Internet and artificial intelligence technology, speech synthesis scenarios such as voice broadcasting, audiobook listening, news listening, and intelligent interaction are becoming increasingly common. Speech synthesis can convert text into natural speech output.

Generally speaking, a speech synthesis system includes a text analysis stage and a speech synthesis stage, and deep learning can integrate the two stages into a single end-to-end model. Such an end-to-end model mainly involves two steps: the first maps text to speech features, and the second converts the speech features into synthesized speech. Among the various speech synthesis and speech feature extraction methods, mel spectrum features can serve as the intermediate feature variable for conversion between text and speech, effectively realizing the text-to-speech synthesis process.

However, in related technical solutions, the mel spectrum features obtained by analyzing text lack much of the rich information present in the mel spectrum features of real speech, and a certain gap exists between the two; as a result, the speech synthesized from such mel spectrum features does not sound natural enough.

In other words, in the above speech synthesis solutions, the difference between the extracted mel spectrum features and the real mel spectrum features leads to insufficient accuracy of the synthesized speech.
Summary

In view of this, it is necessary to address the above problems by providing a speech synthesis method and apparatus, an intelligent terminal, and a computer-readable storage medium.

In a first aspect of the present application, a speech synthesis method is provided.

A speech synthesis method includes:

obtaining text to be synthesized, and extracting to-be-processed mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm;

inputting the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature;

performing average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature, and, taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, performing deconvolution and first up-sampling to obtain a target mel spectrum feature corresponding to the to-be-processed mel spectrum features;

converting the target mel spectrum feature into speech, and generating target speech corresponding to the text to be synthesized.
In a second aspect of the present application, a speech synthesis apparatus is provided.

A speech synthesis apparatus includes:

a feature extraction module, configured to obtain text to be synthesized and extract to-be-processed mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm;

a ResUnet module, configured to input the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature;

a post-processing module, configured to perform average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature and, taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, perform deconvolution and first up-sampling to obtain a target mel spectrum feature corresponding to the to-be-processed mel spectrum features;

a speech synthesis module, configured to convert the target mel spectrum feature into speech and generate target speech corresponding to the text to be synthesized.
In a third aspect of the present application, an intelligent terminal is provided.

An intelligent terminal includes a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:

obtaining text to be synthesized, and extracting to-be-processed mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm;

inputting the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature;

performing average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature, and, taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, performing deconvolution and first up-sampling to obtain a target mel spectrum feature corresponding to the to-be-processed mel spectrum features;

converting the target mel spectrum feature into speech, and generating target speech corresponding to the text to be synthesized.

In a fourth aspect of the present application, a computer-readable storage medium is provided.

A computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps:

obtaining text to be synthesized, and extracting to-be-processed mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm;

inputting the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature;

performing average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature, and, taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, performing deconvolution and first up-sampling to obtain a target mel spectrum feature corresponding to the to-be-processed mel spectrum features;

converting the target mel spectrum feature into speech, and generating target speech corresponding to the text to be synthesized.
Implementing the embodiments of the present application has the following beneficial effects:

With the above speech synthesis method and apparatus, intelligent terminal, and computer-readable storage medium, in the speech synthesis process the mel spectrum features of the text to be synthesized are first extracted; the extracted features are then passed through the ResUnet network model for down-sampling, residual connection, and up-sampling to obtain the corresponding first intermediate feature. In the subsequent post-processing, average pooling and down-sampling are performed on the extracted mel spectrum features, the result is skip-added with the first intermediate feature, multiple rounds of deconvolution and up-sampling are then performed with each result skip-added to the corresponding down-sampled result, the final target mel spectrum feature is obtained, and speech is synthesized from this target mel spectrum feature.

That is, in this embodiment, processing the mel spectrum features through the ResUnet network model and the post-processing gives them both high-resolution detail and global low-resolution characteristics, which improves the accuracy of mel spectrum feature extraction and thereby the accuracy of subsequent speech synthesis.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present application, and those of ordinary skill in the art may derive other drawings from them without creative effort.

In the drawings:
FIG. 1 is an application environment diagram of a speech synthesis method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a speech synthesis method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of mel spectrum features in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of the ResUnet network model in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of the ResUnet network model in an embodiment of the present application;
FIG. 6 is a schematic flowchart of the data processing performed by the ResUnet network model in an embodiment of the present application;
FIG. 7 is a schematic flowchart of the post-processing procedure in an embodiment of the present application;
FIG. 8 is a schematic flowchart of the post-processing procedure in an embodiment of the present application;
FIG. 9 is a schematic comparison diagram of mel spectrum features in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a speech synthesis apparatus in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a computer device running the above speech synthesis method according to an embodiment of the present application.
Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
FIG. 1 is an application environment diagram of a speech synthesis method in an embodiment. Referring to FIG. 1, the speech synthesis method can be applied to a speech synthesis system. The speech synthesis system includes a terminal 110 and a server 120 connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a robot, a mobile phone, a tablet computer, and a notebook computer. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers. The terminal 110 is used to analyze and process the text to be synthesized, and the server 120 is used for model training and prediction.

In another embodiment, the speech synthesis system to which the above speech synthesis method is applied may also be implemented on the terminal 110 alone. The terminal is then used for model training and prediction and converts the text to be synthesized into speech.
As shown in FIG. 2, in one embodiment, a speech synthesis method is provided. The method can be applied to either a terminal or a server; this embodiment takes application to a terminal as an example. The speech synthesis method specifically includes the following steps:

Step S102: Obtain text to be synthesized, and extract to-be-processed mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm.

The text to be synthesized is text information that requires speech synthesis, for example, text that needs to be converted into speech in scenarios such as voice chatbots and voice news reading.

For example, the text to be synthesized could be "From that moment on, she no longer belittled herself."

The text to be synthesized is analyzed, and the corresponding mel spectrum features are extracted according to the preset speech feature extraction algorithm as the to-be-processed mel spectrum features. Mel spectrum features (Mel Bank Features) can be used to characterize the acoustic features of sounds or sentences. In this embodiment, mel spectrum features serve as the intermediate features between text information and speech.

In one embodiment, FIG. 3 shows an example of the extracted mel spectrum features.
Step S104: Input the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature.

The ResUnet network model can perform down-sampling, residual connection, and up-sampling on the to-be-processed mel spectrum features to obtain the corresponding first intermediate feature, which is used in the subsequent computation.

Specifically, the ResUnet network model performs second down-sampling, residual connection processing, and second up-sampling on the to-be-processed mel spectrum features to obtain the first intermediate feature. In this embodiment, the ResUnet network model first performs second down-sampling on the to-be-processed mel spectrum features, then applies residual connection processing to the down-sampled features, and finally performs second up-sampling. During this process, the number of data channels of the features goes small → large → small, while the data dimensions go large → small → large. As the number of channels grows, the abstract semantic information contained in the features gradually increases; as the number of channels shrinks again, the features not only retain rich semantic information but also, through up-sampling and data addition, regain enough spatial detail to be restored to the same resolution as the input to-be-processed mel spectrum features.
For example, the ResUnet network model includes an up-sampling module, a residual connection module, and a down-sampling module. FIG. 4 shows a schematic structural diagram of the three modules contained in the ResUnet network model.

The down-sampling module (UNet ConvBlock) contains two groups of (Conv2d, BatchNorm2d, Relu) structures, where Conv2d denotes a two-dimensional convolutional layer, BatchNorm2d denotes two-dimensional batch normalization, and Relu denotes a rectified linear unit.
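The patent does not give layer hyperparameters, so the following PyTorch sketch is only a minimal reading of this block; the kernel size, padding, and the optional stride argument are assumptions:

```python
import torch.nn as nn

class UNetConvBlock(nn.Module):
    """Down-sampling module: two (Conv2d, BatchNorm2d, ReLU) groups."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            # First (Conv2d, BatchNorm2d, ReLU) group; the stride controls
            # whether the block also shrinks the feature spatially.
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      stride=stride, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            # Second (Conv2d, BatchNorm2d, ReLU) group.
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```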
The residual connection module (Residual Unit) includes the down-sampling module on the left and a group of (Conv2d, BatchNorm2d, Relu) structures on the right. The input of the residual connection module is processed by the down-sampling module and by the (Conv2d, BatchNorm2d, Relu) structure respectively, and the two results are skip-added; this implements a skip connection that compensates for the information lost during down-sampling.

The up-sampling module (Unet-Up ResBlock) contains a left branch and a right branch. The left branch leaves the input unchanged; in the right branch, Residual Unit denotes the residual connection module, whose output passes through MaxPool2d, Dropout2d, ConvTranspose2d, BatchNorm2d, and Relu before being skip-added with the left branch. Here, MaxPool2d denotes a two-dimensional max pooling layer, Dropout2d denotes a two-dimensional dropout layer, ConvTranspose2d denotes a two-dimensional deconvolution (transposed convolution) layer, BatchNorm2d denotes two-dimensional batch normalization, and Relu denotes a rectified linear unit.
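Continuing the sketch, one plausible reading of these two modules is below. The stride values, the size-preserving max pooling, the dropout probability, and treating the untouched left branch as the matching encoder feature passed in separately are all assumptions; the patent fixes only the layer order and the skip additions:

```python
class ResidualUnit(nn.Module):
    """Residual connection module: the input goes through the UNetConvBlock
    on one path and one (Conv2d, BatchNorm2d, ReLU) group on the other;
    the two results are skip-added."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.left = UNetConvBlock(in_channels, out_channels, stride=stride)
        self.right = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      stride=stride, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.left(x) + self.right(x)  # skip addition

class UnetUpResBlock(nn.Module):
    """Up-sampling module: Residual Unit -> MaxPool2d -> Dropout2d ->
    ConvTranspose2d -> BatchNorm2d -> ReLU, then skip-added with the
    untouched left branch (the matching encoder feature)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            ResidualUnit(in_channels, in_channels, stride=1),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),  # size-preserving
            nn.Dropout2d(p=0.5),
            # A stride-2 transposed convolution doubles the spatial size
            # and halves the channel count (e.g. 2048 -> 1024).
            nn.ConvTranspose2d(in_channels, out_channels,
                               kernel_size=2, stride=2),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        return self.body(x) + skip  # skip addition with the left branch
```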
FIG. 5 shows an example of how the ResUnet network model is assembled.

As shown in FIG. 5, the to-be-processed mel spectrum features are input into the down-sampling module (UNet ConvBlock) of the ResUnet network model, then pass through 5 residual connection modules (Residual Unit) and finally through 5 up-sampling modules (Unet-Up ResBlock); the result after each up-sampling module is skip-added with the output of the corresponding residual connection module or down-sampling module on the left.

In the embodiment given in FIG. 5, the mel spectrum features input to the down-sampling module have 3 data channels and the output has 64; passing through the residual connection modules raises the number of data channels from 64 to 128, 256, 512, 1024, and 2048; passing through the up-sampling modules lowers it from 2048 to 1024, 512, 256, 128, and 64. That is, in the embodiment shown in FIG. 5, the final output features have 64 data channels.
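Assembling the assumed building blocks above into the FIG. 5 topology might look like the following; the encoder stride of 2 per stage (so the input size must be divisible by 32) is again an assumption:

```python
class ResUnet(nn.Module):
    """Encoder: UNetConvBlock (3 -> 64) plus five ResidualUnits
    (64 -> 128 -> 256 -> 512 -> 1024 -> 2048).
    Decoder: five Unet-Up ResBlocks (2048 -> ... -> 64), each output
    skip-added with the matching left-side feature."""
    def __init__(self):
        super().__init__()
        chs = [64, 128, 256, 512, 1024, 2048]
        self.inc = UNetConvBlock(3, chs[0])
        self.down = nn.ModuleList(
            [ResidualUnit(chs[i], chs[i + 1]) for i in range(5)])
        self.up = nn.ModuleList(
            [UnetUpResBlock(chs[5 - i], chs[4 - i]) for i in range(5)])

    def forward(self, x):
        feats = [self.inc(x)]          # left-side features kept for the skips
        for down in self.down:
            feats.append(down(feats[-1]))
        y = feats[-1]                  # 2048 channels at the bottleneck
        for i, up in enumerate(self.up):
            y = up(y, feats[4 - i])    # skip-add with the left side
        return y                       # 64 channels, input resolution

# Example (assuming `import torch`):
# ResUnet()(torch.randn(1, 3, 256, 256)).shape -> torch.Size([1, 64, 256, 256])
```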
As shown in FIG. 5, while passing through the down-sampling module and the residual connection modules on the left, the features become smaller while the channels become more numerous, so more global semantic information can be captured. The repeated down-sampling and convolution in these modules steadily increase the channel count and shrink the features, i.e. reduce the resolution; in this process the features become more efficient and abstract, but also lose considerable spatial detail.

While passing through the up-sampling modules on the right, up-sampling enlarges the features and deconvolution reduces the channel count; after each up-sampling, the features are skip-added with the features obtained from the corresponding down-sampling or residual connection module. After this process, the features possess both high-resolution detail and abstract low-resolution characteristics; that is, the final features combine information at different scales and retain enough spatial detail to make the prediction more accurate.

It should be noted that in this embodiment the ResUnet network model contains the same number of residual connection modules and up-sampling modules. That is, the ResUnet network model includes a down-sampling module, at least one residual connection module, and at least one up-sampling module, with the number of residual connection modules equal to the number of up-sampling modules.
Specifically, the above step of inputting the to-be-processed mel spectrum features into the preset ResUnet network model to obtain the first intermediate feature includes steps S1041-S1043, as shown in FIG. 6:

Step S1041: Perform second down-sampling on the to-be-processed mel spectrum features through the down-sampling module;

Step S1042: Perform second down-sampling and residual connection processing on the output of the down-sampling module through at least one residual connection module;

Step S1043: Perform second up-sampling on the output of the residual connection modules through at least one up-sampling module, and add the output after each second up-sampling to the output of the corresponding residual connection module or down-sampling module to obtain the first intermediate feature.

The to-be-processed mel spectrum features are input into the down-sampling module of the ResUnet network model for second down-sampling, then pass through at least one residual connection module for second down-sampling and residual connection processing, and finally through at least one up-sampling module for up-sampling; the result after each up-sampling module is skip-added with the output of the corresponding residual connection module or down-sampling module, yielding the final first intermediate feature.
Step S106: Perform average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature; taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, perform deconvolution and first up-sampling to obtain the target mel spectrum feature corresponding to the to-be-processed mel spectrum features.

To improve the quality of the mel spectrum features and compensate for the lost information, in this embodiment bottom-up average pooling and down-sampling are performed on the to-be-processed mel spectrum features extracted from the text to be synthesized, yielding the second intermediate feature.

The first intermediate feature output by the ResUnet network model is then skip-added with the second intermediate feature obtained from average pooling and down-sampling, followed by deconvolution and first up-sampling; after each up-sampling, the result is skip-added with the result of the corresponding first down-sampling, yielding the final target mel spectrum feature.

In a specific embodiment, the first down-sampling is performed at least once, the corresponding first up-sampling is likewise performed at least once, and the number of first down-sampling operations equals the number of first up-sampling operations.
In a specific embodiment, the above step S106 may be called the post-processing procedure, which specifically includes steps S1061-S1065 shown in FIG. 7:

Step S1061: Perform average pooling on the to-be-processed mel spectrum features at least once;

Step S1062: After each average pooling, perform first down-sampling on the result to obtain the second intermediate feature.

Step S1063: Perform deconvolution on the first intermediate feature and the second intermediate feature;

Step S1064: Perform first up-sampling on the result at least once;

Step S1065: Add the result of the first up-sampling to the result of the corresponding first down-sampling and perform deconvolution, obtaining the target mel spectrum feature.
FIG. 8 shows a schematic flowchart of the above post-processing of the mel spectrum features.

Suppose the to-be-processed mel spectrum feature is 512*512 in size. As shown in FIG. 8, average pooling (for example, two-dimensional average pooling) is performed first, followed by the first first-down-sampling, yielding a 256*256 mel spectrum feature; average pooling followed by the second first-down-sampling yields a 128*128 feature; and average pooling followed by the third first-down-sampling yields a 64*64 feature, i.e. the second intermediate feature.
As shown in FIG. 8, the first intermediate feature output by the ResUnet network model may also be a 64*64 mel spectrum feature. The first intermediate feature is skip-added with the second intermediate feature, followed by deconvolution (two-dimensional deconvolution) and the first first-up-sampling (to 128*128); the result is skip-added with the result of the second first-down-sampling, followed by deconvolution and the second first-up-sampling (to 256*256); that result is skip-added with the result of the first first-down-sampling, followed by deconvolution and the third first-up-sampling (to 512*512); finally, the result is skip-added with the 512*512 to-be-processed mel spectrum to obtain the final target mel spectrum feature, whose size is 512*512.
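A minimal sketch of this FIG. 8 flow is given below. The patent fixes only the feature sizes and the order of operations; the size-preserving average pooling, the use of interpolation for the first down-/up-sampling, the size-preserving transposed convolutions, and the channel count are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class PostProcess(nn.Module):
    """Post-processing: three rounds of average pooling + first
    down-sampling (512 -> 256 -> 128 -> 64), then three rounds of
    deconvolution + first up-sampling with skip additions back to 512."""
    def __init__(self, channels=1):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.deconv = nn.ModuleList(
            [nn.ConvTranspose2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(3)])

    def forward(self, mel, first_intermediate):
        # Bottom-up branch: average pooling followed by first down-sampling.
        d1 = F.interpolate(self.pool(mel), scale_factor=0.5)  # 256*256
        d2 = F.interpolate(self.pool(d1), scale_factor=0.5)   # 128*128
        d3 = F.interpolate(self.pool(d2), scale_factor=0.5)   # 64*64: second intermediate feature
        # Top-down branch: skip-add, deconvolve, first up-sample.
        y = F.interpolate(self.deconv[0](first_intermediate + d3),
                          scale_factor=2.0)                          # 128*128
        y = F.interpolate(self.deconv[1](y + d2), scale_factor=2.0)  # 256*256
        y = F.interpolate(self.deconv[2](y + d1), scale_factor=2.0)  # 512*512
        return y + mel  # target mel spectrum feature, 512*512
```

Note that `first_intermediate` must have the same channel count as `mel` for the additions to hold; matching the 64-channel ResUnet output to the mel feature would require an extra projection, which the patent does not describe.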
Average pooling and first down-sampling enrich the global semantic information contained in the features, while deconvolution, first up-sampling, and skip addition with the first-down-sampled results give the features not only rich semantic information but also sufficient spatial detail, so that the high-resolution features yield more accurate predictions. FIG. 9 shows a comparison of the mel spectrum features before and after the above processing.

Step S108: Convert the target mel spectrum feature into speech, and generate target speech corresponding to the text to be synthesized.

In the speech synthesis step, the target mel spectrum feature is taken as input; speech synthesis is performed on the target mel spectrum feature corresponding to the text to be synthesized through a preset acoustic encoder, and the corresponding target speech is output.
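The patent does not identify the preset acoustic encoder. As a stand-in, the target mel spectrum feature could be inverted with Griffin-Lim-based mel inversion from librosa; the sample rate, n_fft, and hop_length below are assumptions:

```python
import numpy as np
import librosa

def mel_to_wav(target_mel: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Invert a (n_mels, frames) mel spectrogram to a waveform.
    Griffin-Lim stands in for the patent's unspecified acoustic encoder."""
    return librosa.feature.inverse.mel_to_audio(
        target_mel, sr=sr, n_fft=1024, hop_length=256)

# Hypothetical usage:
# target_mel = np.abs(np.random.randn(80, 400)).astype(np.float32)
# wav = mel_to_wav(target_mel)
```

In practice, a neural vocoder such as WaveNet or WaveGlow would typically replace Griffin-Lim for more natural output.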
In another optional embodiment, as shown in FIG. 10, a speech synthesis apparatus is provided.

As shown in FIG. 10, the above speech synthesis apparatus includes:

a feature extraction module 202, configured to obtain text to be synthesized and extract to-be-processed mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm;

a ResUnet module 204, configured to input the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature;

a post-processing module 206, configured to perform average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature and, taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, perform deconvolution and first up-sampling to obtain the target mel spectrum feature corresponding to the to-be-processed mel spectrum features;

a speech synthesis module 208, configured to convert the target mel spectrum feature into speech and generate target speech corresponding to the text to be synthesized.
In one embodiment, the ResUnet module 204 is further configured to perform second down-sampling, residual connection processing, and second up-sampling on the to-be-processed mel spectrum features through the ResUnet network model to obtain the first intermediate feature.

In one embodiment, the ResUnet network model includes an up-sampling module, at least one residual connection module, and at least one down-sampling module;

the ResUnet module 204 is further configured to:

perform second down-sampling on the to-be-processed mel spectrum features through the down-sampling module;

perform second down-sampling and residual connection processing on the output of the down-sampling module through at least one residual connection module;

perform second up-sampling on the output of the residual connection module through at least one up-sampling module, and add the output after the second up-sampling to the result output by the residual connection module to obtain the first intermediate feature.
In one embodiment, the post-processing module 206 is further configured to:

perform average pooling on the to-be-processed mel spectrum features at least once;

after each average pooling, perform first down-sampling on the result;

obtain the second intermediate feature.

In one embodiment, the post-processing module 206 is further configured to:

perform deconvolution on the first intermediate feature and the second intermediate feature;

perform first up-sampling on the result at least once;

add the result of the first up-sampling to the result after the first down-sampling, and perform deconvolution;

obtain the target mel spectrum feature.
FIG. 11 shows the internal structure of a computer device in an embodiment. The computer device may specifically be a terminal or a server. As shown in FIG. 11, the computer device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, enables the processor to implement the speech synthesis method. The internal memory may also store a computer program which, when executed by the processor, enables the processor to perform the speech synthesis method. Those skilled in the art can understand that the structure shown in FIG. 11 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
In one embodiment, an intelligent terminal is provided, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:

obtaining text to be synthesized, and extracting to-be-processed mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm;

inputting the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature;

performing average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature, and, taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, performing deconvolution and first up-sampling to obtain a target mel spectrum feature corresponding to the to-be-processed mel spectrum features;

converting the target mel spectrum feature into speech, and generating target speech corresponding to the text to be synthesized.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:

obtaining text to be synthesized, and extracting to-be-processed mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm;

inputting the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature;

performing average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature, and, taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, performing deconvolution and first up-sampling to obtain a target mel spectrum feature corresponding to the to-be-processed mel spectrum features;

converting the target mel spectrum feature into speech, and generating target speech corresponding to the text to be synthesized.
With the above speech synthesis method and apparatus, intelligent terminal, and computer-readable storage medium, in the speech synthesis process the mel spectrum features of the text to be synthesized are first extracted; the extracted features are then passed through the ResUnet network model for down-sampling, residual connection, and up-sampling to obtain the corresponding first intermediate feature. In the subsequent post-processing, average pooling and down-sampling are performed on the extracted mel spectrum features, the result is skip-added with the first intermediate feature, multiple rounds of deconvolution and up-sampling are then performed with each result skip-added to the corresponding down-sampled result, the final target mel spectrum feature is obtained, and speech is synthesized from this target mel spectrum feature.

That is, in this embodiment, processing the mel spectrum features through the ResUnet network model and the post-processing gives them both high-resolution detail and global low-resolution characteristics, which improves the accuracy of mel spectrum feature extraction and thereby the accuracy of subsequent speech synthesis.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the related hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as these combinations are not contradictory, they should be considered within the scope of this specification.

The above embodiments only express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent application. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (12)

  1. A speech synthesis method, comprising:
    obtaining text to be synthesized, and extracting to-be-processed mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm;
    inputting the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature;
    performing average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature; taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, performing deconvolution and first up-sampling to obtain a target mel spectrum feature corresponding to the to-be-processed mel spectrum features;
    converting the target mel spectrum feature into speech, and generating target speech corresponding to the text to be synthesized.
  2. The method according to claim 1, wherein the step of inputting the to-be-processed mel spectrum features into the preset ResUnet network model to obtain the first intermediate feature further comprises:
    performing second down-sampling, residual connection processing, and second up-sampling on the to-be-processed mel spectrum features through the ResUnet network model to obtain the first intermediate feature.
  3. The method according to claim 2, wherein the ResUnet network model comprises an up-sampling module, at least one residual connection module, and at least one down-sampling module;
    the step of inputting the to-be-processed mel spectrum features into the preset ResUnet network model to obtain the first intermediate feature further comprises:
    performing second down-sampling on the to-be-processed mel spectrum features through the down-sampling module;
    performing second down-sampling and residual connection processing on the output of the down-sampling module through at least one residual connection module;
    performing second up-sampling on the output of the residual connection module through at least one up-sampling module, and adding the output after the second up-sampling to the result output by the residual connection module to obtain the first intermediate feature.
  4. The method according to claim 1, wherein the step of performing average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain the second intermediate feature further comprises:
    performing average pooling on the to-be-processed mel spectrum features at least once;
    performing first down-sampling on the result after each average pooling;
    obtaining the second intermediate feature.
  5. The method according to claim 4, wherein the step of taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, performing deconvolution and first up-sampling, and obtaining the target mel spectrum feature corresponding to the to-be-processed mel spectrum features further comprises:
    performing deconvolution on the first intermediate feature and the second intermediate feature;
    performing first up-sampling on the result at least once;
    adding the result of the first up-sampling to the result after the first down-sampling, and performing deconvolution;
    obtaining the target mel spectrum feature.
  6. A speech synthesis apparatus, comprising:
    a feature extraction module, configured to obtain text to be synthesized and extract to-be-processed mel spectrum features of the text to be synthesized according to a preset speech feature extraction algorithm;
    a ResUnet module, configured to input the to-be-processed mel spectrum features into a preset ResUnet network model to obtain a first intermediate feature;
    a post-processing module, configured to perform average pooling and first down-sampling on the to-be-processed mel spectrum features to obtain a second intermediate feature and, taking the second intermediate feature and the first intermediate feature output by the ResUnet network model as input, perform deconvolution and first up-sampling to obtain a target mel spectrum feature corresponding to the to-be-processed mel spectrum features;
    a speech synthesis module, configured to convert the target mel spectrum feature into speech and generate target speech corresponding to the text to be synthesized.
  7. The apparatus according to claim 6, wherein the ResUnet module is further configured to perform second down-sampling, residual connection processing, and second up-sampling on the to-be-processed mel spectrum features through the ResUnet network model to obtain the first intermediate feature.
  8. The apparatus according to claim 7, wherein the ResUnet network model comprises an up-sampling module, at least one residual connection module, and at least one down-sampling module;
    the ResUnet module is further configured to:
    perform second down-sampling on the to-be-processed mel spectrum features through the down-sampling module;
    perform second down-sampling and residual connection processing on the output of the down-sampling module through at least one residual connection module;
    perform second up-sampling on the output of the residual connection module through at least one up-sampling module, and add the output after the second up-sampling to the result output by the residual connection module to obtain the first intermediate feature.
  9. The apparatus according to claim 6, wherein the post-processing module is further configured to:
    perform average pooling on the to-be-processed mel spectrum features at least once;
    perform first down-sampling on the result after each average pooling;
    obtain the second intermediate feature.
  10. The apparatus according to claim 9, wherein the post-processing module is further configured to: perform deconvolution on the first intermediate feature and the second intermediate feature;
    perform first up-sampling on the result at least once;
    add the result of the first up-sampling to the result after the first down-sampling, and perform deconvolution;
    obtain the target mel spectrum feature.
  11. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 5.
  12. An intelligent terminal, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 5.
PCT/CN2019/127327 2019-12-23 2019-12-23 Speech synthesis method and apparatus, intelligent terminal, and readable medium WO2021127811A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2019/127327 WO2021127811A1 (zh) 2019-12-23 2019-12-23 Speech synthesis method and apparatus, intelligent terminal, and readable medium
CN201980003174.4A CN111133507B (zh) 2019-12-23 2019-12-23 Speech synthesis method and apparatus, intelligent terminal, and readable medium
US17/115,729 US11417316B2 (en) 2019-12-23 2020-12-08 Speech synthesis method and apparatus and computer readable storage medium using the same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/127327 WO2021127811A1 (zh) 2019-12-23 2019-12-23 Speech synthesis method and apparatus, intelligent terminal, and readable medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/115,729 Continuation US11417316B2 (en) 2019-12-23 2020-12-08 Speech synthesis method and apparatus and computer readable storage medium using the same

Publications (1)

Publication Number Publication Date
WO2021127811A1 true WO2021127811A1 (zh) 2021-07-01

Family

ID=70507768

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127327 WO2021127811A1 (zh) 2019-12-23 2019-12-23 一种语音合成方法、装置、智能终端及可读介质

Country Status (3)

Country Link
US (1) US11417316B2 (zh)
CN (1) CN111133507B (zh)
WO (1) WO2021127811A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112599141B (zh) * 2020-11-26 2022-02-25 北京百度网讯科技有限公司 Neural network vocoder training method and apparatus, electronic device, and storage medium
CN112489629A (zh) * 2020-12-02 2021-03-12 北京捷通华声科技股份有限公司 Speech transcription model, method, medium, and electronic device
CN113436608B (zh) * 2021-06-25 2023-11-28 平安科技(深圳)有限公司 Dual-stream voice conversion method, apparatus, device, and storage medium
CN113421544B (zh) * 2021-06-30 2024-05-10 平安科技(深圳)有限公司 Singing voice synthesis method, apparatus, computer device, and storage medium
CN113470616B (zh) * 2021-07-14 2024-02-23 北京达佳互联信息技术有限公司 Speech processing method and apparatus, vocoder, and vocoder training method
CN113781995B (zh) * 2021-09-17 2024-04-05 上海喜马拉雅科技有限公司 Speech synthesis method and apparatus, electronic device, and readable storage medium
CN115116470A (zh) * 2022-06-10 2022-09-27 腾讯科技(深圳)有限公司 Audio processing method and apparatus, computer device, and storage medium
CN116189654B (zh) * 2023-02-23 2024-06-18 京东科技信息技术有限公司 Speech editing method and apparatus, electronic device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090138272A1 (en) * 2007-10-17 2009-05-28 Gwangju Institute Of Science And Technology Wideband audio signal coding/decoding device and method
WO2014168591A1 (en) * 2013-04-11 2014-10-16 Cetinturk Cetin Relative excitation features for speech recognition
CN109523989A (zh) * 2019-01-29 2019-03-26 网易有道信息技术(北京)有限公司 语音合成方法、语音合成装置、存储介质及电子设备
CN109754778A (zh) * 2019-01-17 2019-05-14 平安科技(深圳)有限公司 文本的语音合成方法、装置和计算机设备
CN110136690A (zh) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 语音合成方法、装置及计算机可读存储介质
CN110211604A (zh) * 2019-06-17 2019-09-06 广东技术师范大学 一种用于语音变形检测的深度残差网络结构

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6011758B2 (ja) * 2011-09-09 2016-10-19 国立研究開発法人情報通信研究機構 Speech synthesis system, speech synthesis method, and program
WO2019139430A1 (ko) * 2018-01-11 2019-07-18 네오사피엔스 주식회사 Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
CN108847249B (zh) * 2018-05-30 2020-06-05 苏州思必驰信息科技有限公司 Sound conversion optimization method and system
CN108766462B (zh) * 2018-06-21 2021-06-08 浙江中点人工智能科技有限公司 Speech signal feature learning method based on the first-order derivative of the mel spectrum
CN109859736B (zh) * 2019-01-23 2021-05-25 北京光年无限科技有限公司 Speech synthesis method and system
CN110070852B (zh) * 2019-04-26 2023-06-16 平安科技(深圳)有限公司 Method, apparatus, device, and storage medium for synthesizing Chinese speech
CN110232932B (zh) * 2019-05-09 2023-11-03 平安科技(深圳)有限公司 Speaker verification method, apparatus, device, and medium based on a residual time-delay network
EP4052251A1 (en) * 2019-12-13 2022-09-07 Google LLC Training speech synthesis to generate distinct speech sounds

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090138272A1 (en) * 2007-10-17 2009-05-28 Gwangju Institute Of Science And Technology Wideband audio signal coding/decoding device and method
WO2014168591A1 (en) * 2013-04-11 2014-10-16 Cetinturk Cetin Relative excitation features for speech recognition
CN109754778A (zh) * 2019-01-17 2019-05-14 平安科技(深圳)有限公司 文本的语音合成方法、装置和计算机设备
CN109523989A (zh) * 2019-01-29 2019-03-26 网易有道信息技术(北京)有限公司 语音合成方法、语音合成装置、存储介质及电子设备
CN110136690A (zh) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 语音合成方法、装置及计算机可读存储介质
CN110211604A (zh) * 2019-06-17 2019-09-06 广东技术师范大学 一种用于语音变形检测的深度残差网络结构

Also Published As

Publication number Publication date
CN111133507B (zh) 2023-05-23
CN111133507A (zh) 2020-05-08
US20210193113A1 (en) 2021-06-24
US11417316B2 (en) 2022-08-16

Similar Documents

Publication Publication Date Title
WO2021127811A1 (zh) Speech synthesis method and apparatus, intelligent terminal, and readable medium
US11042968B2 (en) Method and apparatus for enhancing vehicle damage image on the basis of a generative adversarial network
WO2021128256A1 (zh) Voice conversion method, apparatus, device, and storage medium
US11763796B2 (en) Computer-implemented method for speech synthesis, computer device, and non-transitory computer readable storage medium
US10810993B2 (en) Sample-efficient adaptive text-to-speech
WO2022141868A1 (zh) Method, apparatus, terminal, and storage medium for extracting speech features
CN110265032A (zh) Conference data analysis and processing method, apparatus, computer device, and storage medium
CN111226275A (zh) Speech synthesis method, apparatus, terminal, and medium based on prosodic feature prediction
CN116434741A (zh) Speech recognition model training method, apparatus, computer device, and storage medium
CN111814534A (zh) Visual task processing method, apparatus, and electronic system
CN113362804B (zh) Method, apparatus, terminal, and storage medium for synthesizing speech
WO2020057014A1 (zh) Dialogue analysis and evaluation method, apparatus, computer device, and storage medium
CN111914068B (zh) Method for extracting knowledge points from test questions
CN111108549B (zh) Speech synthesis method and apparatus, computer device, and computer-readable storage medium
WO2020057023A1 (zh) Natural language semantic parsing method, apparatus, computer device, and storage medium
US11367456B2 (en) Streaming voice conversion method and apparatus and computer readable storage medium using the same
US11704585B2 (en) System and method to determine outcome probability of an event based on videos
CN115798453A (zh) Speech reconstruction method, apparatus, computer device, and storage medium
Basir et al. U-NET: A Supervised Approach for Monaural Source Separation
CN111108558B (zh) Voice conversion method, apparatus, computer device, and computer-readable storage medium
WO2020133291A1 (zh) Text entity recognition method, apparatus, computer device, and storage medium
CN115440198B (zh) Method and apparatus for converting mixed audio signals, computer device, and storage medium
Prathipati et al. Single channel speech enhancement using time-frequency attention mechanism based nested U-net model
CN114882896A (zh) Voice conversion model training and voice conversion method, apparatus, and related devices
CN114155868A (zh) Speech enhancement method, apparatus, device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19957095

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19957095

Country of ref document: EP

Kind code of ref document: A1