WO2020215551A1 - Method, apparatus, device and storage medium for synthesizing Chinese speech - Google Patents

Method, apparatus, device and storage medium for synthesizing Chinese speech

Info

Publication number
WO2020215551A1
WO2020215551A1 (PCT/CN2019/102247, CN2019102247W)
Authority
WO
WIPO (PCT)
Prior art keywords
module
target
mel spectrum
vector
sequence
Prior art date
Application number
PCT/CN2019/102247
Other languages
English (en)
French (fr)
Inventor
陈闽川
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020215551A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • This application relates to the field of language signal processing, and in particular to a method, device, equipment and storage medium for synthesizing Chinese speech.
  • LSTM: long short-term memory network
  • RNN: recurrent neural network
  • This application provides a method, apparatus, device, and storage medium for synthesizing Chinese speech, which reduce training time while enhancing model expressiveness and generalization ability, thereby further improving synthesized speech quality.
  • A first aspect of the embodiments of the present application provides a method for synthesizing Chinese speech, including: obtaining an initial mel spectrum and a target vector; processing the target vector to obtain a first sequence, where the first sequence is a two-dimensional tensor; processing the initial mel spectrum to obtain a target mel spectrum; determining the target correspondence between the first sequence and the target mel spectrum in each subspace; and performing speech synthesis according to the self-attention mechanism and the target correspondence to obtain the target speech.
  • Processing the target vector to obtain a first sequence includes: calling each module of an encoder to process the target vector, where the output of the previous module in the encoder is used as the input of the next module, and the encoder is composed of multiple modules connected in series; and taking the output vector of the last module in the encoder as the first sequence, where the first sequence is a two-dimensional tensor.
  • Calling each module of the encoder to process the target vector includes: calling the multi-head attention sub-module of the first module in the encoder to process the target vector to obtain a first intermediate vector; calling the forward network sub-module of the first module to process the first intermediate vector to obtain a first output vector; calling the multi-head attention sub-module of the next module in the encoder to process the first output vector to obtain a second intermediate vector; calling the forward network sub-module of the next module to process the second intermediate vector to obtain a second output vector; and calling the remaining modules in the encoder in turn to process the second output vector, until the vector output by the last module is obtained.
  • Processing the initial mel spectrum to obtain the target mel spectrum includes: calling each module of a decoder to process the initial mel spectrum, where the output of the previous module in the decoder is used as the input of the next module, and the decoder is composed of multiple modules connected in series; and taking the mel spectrum output by the last module in the decoder as the target mel spectrum.
  • Calling each module of the decoder to process the initial mel spectrum includes: calling the masked multi-head attention sub-module of the first module in the decoder to mask the illegal information in the initial mel spectrum to obtain a masked mel spectrum; calling the multi-head attention sub-module of the first module to process the masked mel spectrum to obtain a preprocessed mel spectrum; calling the forward network sub-module of the first module to process the preprocessed mel spectrum to obtain a first mel spectrum; and calling the remaining modules in the decoder in turn to process the first mel spectrum, until the mel spectrum output by the last module is obtained.
  • Determining the target correspondence between the first sequence and the target mel spectrum in each subspace includes: mapping the first sequence and the target mel spectrum to the same multiple subspaces; using the first sequence as the query and key of multi-head attention; using the target mel spectrum as the value of multi-head attention; and calculating the target correspondence according to the query, the key, and the value, where the target correspondence is the mapping relationship between the first sequence and the target mel spectrum in each subspace.
  • Before obtaining the initial mel spectrum and the target vector, the method further includes: obtaining a target text, where the target text is the text to be synthesized; converting the target text into a word embedding vector; combining the word embedding vector with a preset position coding vector; and generating the target vector.
  • A second aspect of the embodiments of the present application provides an apparatus for synthesizing Chinese speech, including: a first acquiring unit, configured to acquire an initial mel spectrum and a target vector; a first processing unit, configured to process the target vector to obtain a first sequence, where the first sequence is a two-dimensional tensor; a second processing unit, configured to process the initial mel spectrum to obtain a target mel spectrum; a determining unit, configured to determine the target correspondence between the first sequence and the target mel spectrum in each subspace; and a synthesis unit, configured to perform speech synthesis according to the self-attention mechanism and the target correspondence to obtain the target speech.
  • The first processing unit is specifically configured to: call each module of the encoder to process the target vector, where the output of the previous module in the encoder is used as the input of the next module, and the encoder is composed of multiple modules connected in series; and take the output vector of the last module in the encoder as the first sequence, where the first sequence is a two-dimensional tensor.
  • The first processing unit is further specifically configured to: call the multi-head attention sub-module of the first module in the encoder to process the target vector to obtain a first intermediate vector; call the forward network sub-module of the first module to process the first intermediate vector to obtain a first output vector; call the multi-head attention sub-module of the next module in the encoder to process the first output vector to obtain a second intermediate vector; call the forward network sub-module of the next module to process the second intermediate vector to obtain a second output vector; and call the remaining modules in the encoder in turn to process the second output vector, until the vector output by the last module is obtained.
  • The second processing unit is specifically configured to: call each module of the decoder to process the initial mel spectrum, where the output of the previous module in the decoder is used as the input of the next module, and the decoder is composed of multiple modules connected in series; and take the mel spectrum output by the last module in the decoder as the target mel spectrum.
  • The second processing unit is specifically configured to: call the masked multi-head attention sub-module of the first module in the decoder to mask the illegal information in the initial mel spectrum to obtain a masked mel spectrum; call the multi-head attention sub-module of the first module to process the masked mel spectrum to obtain a preprocessed mel spectrum; call the forward network sub-module of the first module to process the preprocessed mel spectrum to obtain a first mel spectrum; and call the remaining modules in the decoder in turn to process the first mel spectrum, until the mel spectrum output by the last module is obtained.
  • the determining unit is specifically configured to: map the first sequence and the target mel spectrum to the same multiple subspaces;
  • the first sequence is used as the query and key of multi-head attention;
  • the target Mel spectrum is used as the value of multi-head attention;
  • The target correspondence is calculated according to the query, the key, and the value, where the target correspondence is the mapping relationship between the first sequence and the target mel spectrum in each subspace.
  • the apparatus for synthesizing Chinese speech further includes: a second acquiring unit configured to acquire target text, where the target text is text that needs to be synthesized;
  • the conversion unit is used to convert the target text into a word embedding vector;
  • the combination unit is used to combine the word embedding vector and a preset position coding vector; and
  • the generation unit is used to generate the target vector.
  • A third aspect of the embodiments of the present application provides a device for synthesizing Chinese speech, which includes a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the method for synthesizing Chinese speech described in any of the above embodiments.
  • A fourth aspect of the embodiments of the present application provides a non-volatile computer-readable storage medium, including instructions which, when run on a computer, cause the computer to execute the steps of the method for synthesizing Chinese speech described in any of the above embodiments.
  • the initial Mel spectrum and the target vector are obtained; the target vector is processed to obtain a first sequence, which is a two-dimensional tensor; the initial Mel spectrum is processed to obtain the target Mel spectrum; determine the target correspondence between the first sequence and the target Mel spectrum in each subspace; perform speech synthesis according to the self-attention mechanism and the target correspondence to obtain the target speech.
  • The recurrent neural network in multi-head attention is replaced with self-attention, which speeds up model training, reduces training time, and at the same time enhances model expressiveness and generalization ability, further improving synthesized speech quality.
  • FIG. 1 is a schematic diagram of an embodiment of a method for synthesizing Chinese speech in an embodiment of the present application;
  • FIG. 2 is a schematic diagram of an embodiment of an apparatus for synthesizing Chinese speech in an embodiment of the present application;
  • FIG. 3 is a schematic diagram of another embodiment of an apparatus for synthesizing Chinese speech in an embodiment of the present application;
  • FIG. 4 is a schematic diagram of an embodiment of a device for synthesizing Chinese speech in an embodiment of the present application.
  • This application provides a method, apparatus, device, and storage medium for synthesizing Chinese speech, which reduce training time while enhancing model expressiveness and generalization ability, thereby further improving synthesized speech quality.
  • FIG. 1 is a flowchart of a method for synthesizing Chinese speech provided by an embodiment of the present application, which specifically includes:
  • the device for synthesizing Chinese speech obtains the initial Mel spectrum and target vector.
  • The target vector is obtained by the encoder; it represents the content to be converted into speech, in a vector form that the encoder can recognize.
  • For example, the target vector may represent "I love China", "I come from Beijing", "Beijing welcomes you", and so on.
  • The initial mel spectrum is obtained by the decoder; the initial mel spectrum is audio with the phase information removed.
  • The initial mel spectrum is lossy audio obtained by processing the original audio, so if the mel spectrum needs to be converted back into a waveform, a vocoder such as the Griffin-Lim algorithm or the WaveNet algorithm can be used; this is not specifically limited here.
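As a hedged illustration of the Griffin-Lim idea mentioned above, the sketch below reconstructs phase for a linear magnitude spectrogram with scipy. It is a minimal toy, not this application's implementation; inverting a true mel spectrum would additionally require a mel-to-linear mapping, and all parameters here are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=512, n_iter=32, seed=0):
    """Iteratively re-estimate the phase discarded from a magnitude spectrogram."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=n_fft)        # current waveform guess
        _, _, spec = stft(x, nperseg=n_fft)             # re-analyze the guess
        if spec.shape[1] < mag.shape[1]:                # align frame counts
            spec = np.pad(spec, ((0, 0), (0, mag.shape[1] - spec.shape[1])))
        phase = np.exp(1j * np.angle(spec[:, :mag.shape[1]]))  # keep phase only
    _, x = istft(mag * phase, nperseg=n_fft)
    return x

# toy input: magnitude spectrogram of a 440 Hz tone sampled at 16 kHz
sr, n_fft = 16000, 512
t = np.arange(sr) / sr
_, _, S = stft(np.sin(2 * np.pi * 440 * t), nperseg=n_fft)
audio = griffin_lim(np.abs(S), n_fft=n_fft)
```

Because only the magnitude is kept fixed, the recovered waveform preserves the spectral content (here, the 440 Hz tone) even though the original phase is gone.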
  • The initial mel spectrum and the target vector may be acquired at the same time, or the mel spectrum may be acquired first and then the target vector, or the target vector first and then the mel spectrum; this is not specifically limited here.
  • The device for synthesizing Chinese speech processes the target vector to obtain a first sequence, and the first sequence is a two-dimensional tensor. Specifically, the device for synthesizing Chinese speech calls the sub-modules of the encoder to process the target vector.
  • The encoder is composed of multiple modules connected in series, and the result of the previous module is sent to the next module for processing. Each module includes a multi-head attention (Multi-Head Attention) sub-module and a feed-forward network (FFN) sub-module.
  • Multi-Head Attention: multi-head attention
  • FFN: feed-forward network
  • Multi-head attention mainly captures relationships in the subspaces of the sequence; for example, the device for synthesizing Chinese speech may learn sentence-pause relationships in one subspace and dependency relationships in another, similar to the superposition of multiple convolution kernels in a convolution.
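The encoder structure described above (modules in series, each pairing a multi-head attention sub-module with a feed-forward sub-module, the output of one module feeding the next) can be sketched in numpy. This is a toy single-head version with residual connections; layer normalization, dropout, and per-head projections are omitted, and all weight matrices are random stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # simplified attention with identity Q/K/V projections
    return softmax(x @ x.T / np.sqrt(x.shape[-1])) @ x

def encoder_module(x, w1, w2):
    # one module: self-attention then a position-wise FFN, each with a residual
    x = x + self_attention(x)
    return x + np.maximum(0.0, x @ w1) @ w2   # ReLU feed-forward

def encoder(x, n_modules=3, seed=0):
    # modules in series: the previous module's output is the next module's input
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    for _ in range(n_modules):
        w1 = rng.normal(0, 0.1, (d, 4 * d))
        w2 = rng.normal(0, 0.1, (4 * d, d))
        x = encoder_module(x, w1, w2)
    return x   # the "first sequence": a two-dimensional tensor (length, d)

first_sequence = encoder(np.random.default_rng(1).normal(size=(5, 8)))
```

The output keeps the (sequence length, feature dimension) shape of the input, matching the claim that the first sequence is a two-dimensional tensor.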
  • The device for synthesizing Chinese speech preprocesses the initial mel spectrum to obtain the target mel spectrum. Specifically, it uses masked multi-head attention (Masked Multi-Head Attention) to mask information that should not be visible when the sequence is generated (i.e., illegal information).
  • Masked Multi-Head Attention: masked multi-head attention
  • Masked multi-head attention mainly keeps training and inference consistent. For example, during training you want to predict the pronunciation of "w", but when entering the network the entire initial mel spectrum would be fed in; the portion of the mel spectrum after "w" is therefore shielded from the network, preventing the network from seeing information it will later need to predict, because that information is invisible during inference.
  • Multi-head attention consists of several self-attention heads; for example, 4-head attention is essentially four self-attention operations applied to the sequence.
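The masking just described can be sketched as follows: before the softmax, scores for "future" (illegal) positions are set to negative infinity, so each frame can only attend to itself and earlier frames. A minimal numpy illustration, not the application's exact implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention_weights(q, k):
    """Causal attention weights: position i may only see positions <= i."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # strictly upper triangle
    scores = np.where(future, -np.inf, scores)                # shield future frames
    return softmax(scores)

frames = np.random.default_rng(0).normal(size=(4, 8))  # 4 mel-spectrum frames
weights = masked_attention_weights(frames, frames)
```

After the softmax, the masked entries become exactly zero, so the prediction at each position depends only on the known outputs before it.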
  • The decoder output used to predict the target mel spectrum additionally passes through a normalization network composed of multiple convolutional layers with residual connections, which refines and purifies the decoder's output.
  • The device for synthesizing Chinese speech determines the target correspondence between the first sequence and the target mel spectrum in each subspace. Specifically, the device maps the first sequence and the target mel spectrum to the same multiple subspaces; uses the first sequence as the query and key of multi-head attention; uses the target mel spectrum as the value of multi-head attention; and calculates the target correspondence according to the query, key, and value, where the target correspondence is the mapping relationship between the first sequence and the target mel spectrum in each subspace.
  • The introduced multi-head attention mechanism can train more parameters and attend to different positions: attention is assigned multiple subspaces, and different subspaces express different association relationships. For example, one subspace may represent dependency relationships and another sentence-pause relationships; integrating information from the various positions (subspaces) improves the performance of attention. For instance, in one subspace there is a dependency relationship between the first sequence and the target mel spectrum, while in another subspace there is a sentence-pause relationship between them.
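A hedged numpy sketch of this step, following the formulation in this application (the first sequence supplies both query and key; the target mel spectrum supplies the value). The per-subspace projection matrices are random placeholders standing in for learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
first_sequence = rng.normal(size=(5, 8))   # encoder output (query and key)
target_mel = rng.normal(size=(5, 8))       # decoder output (value)

correspondences = []
for _ in range(4):                          # 4 subspaces (heads)
    wq, wk, wv = (rng.normal(0, 0.1, (8, 2)) for _ in range(3))
    q, k = first_sequence @ wq, first_sequence @ wk   # query, key: first sequence
    v = target_mel @ wv                               # value: target mel spectrum
    w = softmax(q @ k.T / np.sqrt(2))
    correspondences.append(w @ v)           # mapping relationship in this subspace

target_correspondence = np.concatenate(correspondences, axis=-1)
```

Each element of `correspondences` is the mapping between the first sequence and the mel frames in one subspace; concatenating them integrates the per-subspace associations.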
  • the device for synthesizing Chinese speech performs speech synthesis according to the self-attention mechanism and the target correspondence to obtain the target speech.
  • the essence of the attention function can be described as a mapping from a query to a series of (key, value) pairs.
  • The calculation of attention is divided into three steps: first, calculate the similarity between the query and each key to obtain weights (commonly used similarity functions include the dot product, concatenation, and perceptron functions); second, normalize these weights with a softmax function; finally, compute the weighted sum of the weights and the corresponding values to obtain the final attention.
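The three steps above can be written out directly; this sketch uses the dot product as the similarity function:

```python
import numpy as np

def attention(query, keys, values):
    scores = query @ keys.T                          # step 1: dot-product similarity
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)      # step 2: softmax normalization
    return weights @ values                          # step 3: weighted sum of values

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))     # 2 queries
k = rng.normal(size=(6, 4))     # 6 (key, value) pairs
v = rng.normal(size=(6, 4))
out = attention(q, k, v)        # shape (2, 4)
```

One sanity check of the formula: if every key is identical, all weights become uniform and each output row is simply the mean of the values.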
  • NLP: natural language processing
  • Multi-head attention can take three parameters: query, key, and value.
  • The three parameters first undergo a linear transformation and are then input to scaled dot-product attention. This is done h times, which is the so-called multi-head (h heads); each pass counts as one head.
  • The parameter matrix W used for the linear transformation of Q, K, and V is different each time.
  • The h scaled dot-product attention results are concatenated, and the value obtained by performing one more linear transformation is used as the result of multi-head attention. It can be seen that the distinguishing feature of multi-head attention is that the calculation is performed h times rather than once, which allows the model to learn relevant information in different representation subspaces.
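Putting these pieces together, here is a minimal numpy sketch of multi-head attention as described (h linear transformations, scaled dot-product attention per head, concatenation, final linear transformation). The weight matrices are random placeholders, not learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(q, k, v, h=4, seed=0):
    rng = np.random.default_rng(seed)
    d = q.shape[-1]
    dk = d // h
    heads = []
    for _ in range(h):
        # a different parameter matrix W for Q, K, V each time
        wq, wk, wv = (rng.normal(0, 0.1, (d, dk)) for _ in range(3))
        scores = (q @ wq) @ (k @ wk).T / np.sqrt(dk)   # scaled dot product
        heads.append(softmax(scores) @ (v @ wv))        # one head's result
    concat = np.concatenate(heads, axis=-1)             # splice the h results
    wo = rng.normal(0, 0.1, (d, d))                     # final linear transformation
    return concat @ wo

x = np.random.default_rng(1).normal(size=(5, 8))
out = multi_head_attention(x, x, x, h=4)
```

Running the attention h times with different projections is what lets each head specialize in a different representation subspace.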
  • The recurrent neural network in multi-head attention is replaced with self-attention, which speeds up model training, reduces training time, and at the same time enhances model expressiveness and generalization ability, further improving synthesized speech quality.
  • Processing the target vector to obtain the first sequence, where the first sequence is a two-dimensional tensor, includes:
  • the output vector of the last module in the encoder is taken as the first sequence, and the first sequence is a two-dimensional tensor.
  • The embodiment of the present application refines the processing procedure of the target vector and adds an implementable manner of the present application.
  • the invoking each module of the encoder to process the target vector includes:
  • The embodiment of the present application refines the processing procedure of the target vector, and uses the self-attention mechanism in the sub-modules to extract the correlations between words within the sequence, which improves the naturalness of pauses in the synthesized speech.
  • Processing the initial mel spectrum to obtain the target mel spectrum includes:
  • the output of the previous module in the decoder is used as the input of the next module, and the decoder is composed of multiple modules in series;
  • the mel spectrum output by the last module in the decoder is used as the target mel spectrum.
  • the embodiment of the present application refines the processing procedure of the initial Mel spectrum, and adds the implementation manner of the present application.
  • Calling each module of the decoder to process the initial mel spectrum includes:
  • The illegal information in the mel spectrum is masked through the masked multi-head attention mechanism to prevent the current position from attending to information at subsequent positions, ensuring that the prediction for the current position depends only on the known outputs before it.
  • Determining the target correspondence between the first sequence and the target mel spectrum in each subspace includes:
  • A target correspondence is calculated according to the query, the key, and the value, where the target correspondence is the mapping relationship between the first sequence and the target mel spectrum in each subspace.
  • the embodiment of this application refines the process of determining the target correspondence between the first sequence and the target Mel spectrum in each subspace.
  • The introduced multi-head attention mechanism assigns attention to multiple subspaces, with different subspaces expressing different association relationships; integrating the associated information of each position improves the performance of attention.
  • Before acquiring the initial mel spectrum and the target vector, the method further includes:
  • Target text is the text that needs to be synthesized
  • The target vector corresponds to a sequence, but the target text itself cannot be computed on, so the text is generally mapped to numbers.
  • For example, the sequence "I love China" is converted into the pinyin sequence "wo3 ai4 zhong1 guo2", which is then mapped into a sequence of numbers, such as "163 123 111 123", where one character corresponds to one number.
  • the word embedding vector is obtained.
  • The word embedding vector and the position coding vector are added element-wise. For example, if the word embedding vector is [1, 2] and the corresponding position coding vector is [0.1, 0.9], the vector entering the deep network is [1.1, 2.9]. The position coding vector is a tensor with the same shape as the word embedding vector.
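The element-wise addition in the example above, together with one common sinusoidal way of producing position codes of the same shape as the embeddings (an assumption here; the application only requires a preset position coding vector), can be sketched as:

```python
import numpy as np

def positional_encoding(length, d):
    """Sinusoidal position codes: same shape as the word-embedding sequence."""
    pos = np.arange(length)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# the element-wise addition from the example above
word_embedding = np.array([1.0, 2.0])
position_code = np.array([0.1, 0.9])
network_input = word_embedding + position_code   # [1.1, 2.9]

pe = positional_encoding(length=10, d=16)        # one code vector per position
```

Adding a distinct code per position is what restores the order information that self-attention alone would lose.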
  • The embodiment of the present application refines the target vector acquisition process and solves the problem that self-attention loses order information when extracting features from a sequence.
  • an embodiment of the device for synthesizing Chinese speech in the embodiment of the application includes:
  • the first obtaining unit 201 is configured to obtain the initial Mel spectrum and the target vector
  • the first processing unit 202 is configured to process the target vector to obtain a first sequence, where the first sequence is a two-dimensional tensor;
  • the second processing unit 203 is configured to process the initial mel spectrum to obtain a target mel spectrum
  • the determining unit 204 is configured to determine the target correspondence between the first sequence and the target Mel spectrum in each subspace;
  • the synthesis unit 205 is configured to perform speech synthesis according to the self-attention mechanism and the target correspondence to obtain the target speech.
  • The recurrent neural network in multi-head attention is replaced with self-attention, which speeds up model training, reduces training time, and at the same time enhances model expressiveness and generalization ability, further improving synthesized speech quality.
  • Referring to FIG. 3, another embodiment of the device for synthesizing Chinese speech in the embodiment of the present application includes:
  • the first obtaining unit 201 is configured to obtain the initial Mel spectrum and the target vector
  • the first processing unit 202 is configured to process the target vector to obtain a first sequence, where the first sequence is a two-dimensional tensor;
  • the second processing unit 203 is configured to process the initial mel spectrum to obtain a target mel spectrum
  • the determining unit 204 is configured to determine the target correspondence between the first sequence and the target Mel spectrum in each subspace;
  • the synthesis unit 205 is configured to perform speech synthesis according to the self-attention mechanism and the target correspondence to obtain the target speech.
  • the first processing unit 202 is specifically configured to:
  • Each module of the encoder is called to process the target vector, where the output of the previous module in the encoder is used as the input of the next module, and the encoder is composed of multiple modules connected in series; the output vector of the last module in the encoder is used as the first sequence, and the first sequence is a two-dimensional tensor.
  • The first processing unit 202 is further specifically configured to: call the multi-head attention sub-module of the first module in the encoder to process the target vector to obtain a first intermediate vector; call the forward network sub-module of the first module to process the first intermediate vector to obtain a first output vector; call the multi-head attention sub-module of the next module to process the first output vector to obtain a second intermediate vector; call the forward network sub-module of the next module to process the second intermediate vector to obtain a second output vector; and call the other modules in the encoder in turn to process the second output vector, until the vector output by the last module is obtained.
  • the second processing unit 203 is specifically configured to:
  • Each module of the decoder is called to process the initial mel spectrum, where the output of the previous module in the decoder is used as the input of the next module, and the decoder is composed of multiple modules connected in series; the mel spectrum output by the last module in the decoder is used as the target mel spectrum.
  • The second processing unit 203 is specifically configured to: call the masked multi-head attention sub-module of the first module in the decoder to mask the illegal information in the initial mel spectrum to obtain a masked mel spectrum; call the multi-head attention sub-module of the first module to process the masked mel spectrum to obtain a preprocessed mel spectrum; call the forward network sub-module of the first module to process the preprocessed mel spectrum to obtain a first mel spectrum; and call the other modules in the decoder in turn to process the first mel spectrum, until the mel spectrum output by the last module is obtained.
  • The determining unit 204 is specifically configured to: map the first sequence and the target mel spectrum to the same multiple subspaces; use the first sequence as the query and key of multi-head attention; use the target mel spectrum as the value of multi-head attention; and calculate the target correspondence according to the query, the key, and the value, where the target correspondence is the mapping relationship between the first sequence and the target mel spectrum in each subspace.
  • the device for synthesizing Chinese speech further includes:
  • the second obtaining unit 206 is configured to obtain target text, where the target text is text that needs to be synthesized;
  • a conversion unit 207 configured to convert the target text into a word embedding vector
  • a combining unit 208 configured to combine the word embedding vector with a preset position coding vector
  • the generating unit 209 is configured to generate the target vector.
  • the initial Mel spectrum and the target vector are obtained; the target vector is processed to obtain a first sequence, which is a two-dimensional tensor; the initial Mel spectrum is processed to obtain the target Mel spectrum; determine the target correspondence between the first sequence and the target Mel spectrum in each subspace; perform speech synthesis according to the self-attention mechanism and the target correspondence to obtain the target speech.
  • The recurrent neural network in multi-head attention is replaced with self-attention, which speeds up model training, reduces training time, and at the same time enhances model expressiveness and generalization ability, further improving synthesized speech quality.
  • FIGS. 2 and 3 above describe the apparatus for synthesizing Chinese speech in the embodiments of the present application in detail from the perspective of modular functional entities; the following describes the device for synthesizing Chinese speech in the embodiments of the present application in detail from the perspective of hardware processing.
  • FIG. 4 is a schematic structural diagram of a device for synthesizing Chinese speech provided by an embodiment of the present application.
  • The device 400 for synthesizing Chinese speech may differ considerably depending on configuration or performance, and may include one or more processors (central processing units, CPUs) 401, memory 409, and one or more storage media 408 (for example, one or more mass-storage devices) storing application programs 407 or data 406.
  • the memory 409 and the storage medium 408 may be short-term storage or persistent storage.
  • the program stored in the storage medium 408 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the device for synthesizing Chinese speech.
  • the processor 401 may be configured to communicate with the storage medium 408 and execute a series of instruction operations in the storage medium 408 on the device 400 for synthesizing Chinese speech.
  • The device 400 for synthesizing Chinese speech may also include one or more power supplies 402, one or more wired or wireless network interfaces 403, one or more input/output interfaces 404, and/or one or more operating systems 405, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
  • the processor 401 can execute the first acquisition unit 201, the first processing unit 202, the second processing unit 203, the determination unit 204, the synthesis unit 205, the second acquisition unit 206, the conversion unit 207, the combination unit 208, and the generation in the above embodiments. Function of unit 209.
  • the processor 401 is the control center of the device for synthesizing Chinese speech, and can perform processing according to the set method of synthesizing Chinese speech.
  • The processor 401 uses various interfaces and lines to connect the parts of the entire device for synthesizing Chinese speech, and performs the various functions of the device and processes data by running or executing software programs and/or modules stored in the memory 409 and calling data stored in the memory 409, thereby realizing the synthesis of Chinese speech.
  • the storage medium 408 and the memory 409 are both carriers for storing data.
  • The storage medium 408 may refer to internal memory with a small capacity but high speed, while the memory 409 may be external memory with a large capacity but lower speed.
  • the memory 409 may be used to store software programs and modules.
  • the processor 401 executes various functional applications and data processing of the device 400 for synthesizing Chinese speech by running the software programs and modules stored in the memory 409.
  • the memory 409 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system and the application program required by at least one function (such as determining the target correspondences between the first sequence and the target mel spectrum in each subspace); the storage data area may store data (such as target vectors) created according to the use of the device for synthesizing Chinese speech.
  • the memory 409 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the application also provides a non-volatile computer-readable storage medium, including instructions which, when run on a computer, cause the computer to execute the steps of the method for synthesizing Chinese speech.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center via wired (such as coaxial cable, optical fiber, or twisted pair) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, an optical disc), or a semiconductor medium (for example, a solid state disk (SSD)).
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of this application, in essence the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed, in the field of speech signal processing within artificial intelligence, are a method, apparatus, device, and storage medium for synthesizing Chinese speech, used to reduce training time while enhancing model expressiveness and generalization, further improving the quality of the synthesized speech. The method for synthesizing Chinese speech comprises: acquiring an initial mel spectrum and a target vector (101); processing the target vector to obtain a first sequence, the first sequence being a two-dimensional tensor (102); processing the initial mel spectrum to obtain a target mel spectrum (103); determining target correspondences between the first sequence and the target mel spectrum in each subspace (104); and performing speech synthesis according to a self-attention mechanism and the target correspondences to obtain target speech (105).

Description

Method, apparatus, device, and storage medium for synthesizing Chinese speech
This application claims priority to Chinese patent application No. 201910342344.3, filed with the China National Intellectual Property Administration on April 26, 2019, and entitled "Method, apparatus, device, and storage medium for synthesizing Chinese speech", the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the field of speech signal processing, and in particular to a method, apparatus, device, and storage medium for synthesizing Chinese speech.
Background
At present, most speech synthesis research at home and abroad targets text-to-speech systems, which can only convert written language into spoken output in a single reading style, lacking variation in age, gender, tone, and speaking rate, let alone personal emotional coloring. As the demands of the information society evolve, higher requirements are placed on human-computer interaction, and research on spoken human-machine dialogue systems has been put on the agenda.
Speech synthesis research has begun to develop from the text-to-speech stage toward the concept-to-speech stage. This not only places higher demands on speech synthesis technology, but also involves computer language generation and the higher-level neural activity of the human brain. For speech synthesis itself, however, the problem remains one of enriching the expressiveness of synthesized speech. In current Chinese speech synthesis, phrasing and sentence breaks are unnatural, the voice sounds dull, and the sense of prosody is poor, which degrades synthesis quality and leaves an obvious gap between synthesized and real human voices.
The inventors realized that recurrent neural network (RNN) structures such as long short-term memory (LSTM) networks are widely used in current speech synthesis; as a result, training must depend on the result of the previous time step, parallelization is difficult, and training takes too long.
Summary
This application provides a method, apparatus, device, and storage medium for synthesizing Chinese speech, used to reduce training time while enhancing model expressiveness and generalization, further improving the quality of the synthesized speech.
A first aspect of the embodiments of this application provides a method for synthesizing Chinese speech, comprising: acquiring an initial mel spectrum and a target vector; processing the target vector to obtain a first sequence, the first sequence being a two-dimensional tensor; processing the initial mel spectrum to obtain a target mel spectrum; determining target correspondences between the first sequence and the target mel spectrum in each subspace; and performing speech synthesis according to a self-attention mechanism and the target correspondences to obtain target speech.
Optionally, in a first implementation of the first aspect of the embodiments of this application, processing the target vector to obtain the first sequence, the first sequence being a two-dimensional tensor, comprises: invoking each module of an encoder to process the target vector, wherein the output of the previous module in the encoder serves as the input of the next module, and the encoder consists of multiple modules connected in series; and taking the output vector of the last module in the encoder as the first sequence, the first sequence being a two-dimensional tensor.
Optionally, in a second implementation of the first aspect of the embodiments of this application, invoking each module of the encoder to process the target vector comprises: invoking the multi-head attention sub-module of the first module in the encoder to process the target vector to obtain a first intermediate vector; invoking the feed-forward network sub-module of the first module to process the first intermediate vector to obtain a first output vector; invoking the multi-head attention sub-module of the next module in the encoder to process the first output vector to obtain a second intermediate vector; invoking the feed-forward network sub-module of the next module to process the second intermediate vector to obtain a second output vector; and invoking the remaining modules in the encoder in turn to perform processing based on the second output vector, until the vector output by the last module is obtained.
Optionally, in a third implementation of the first aspect of the embodiments of this application, processing the initial mel spectrum to obtain the target mel spectrum comprises: invoking each module of a decoder to process the initial mel spectrum, wherein the output of the previous module in the decoder serves as the input of the next module, and the decoder consists of multiple modules connected in series; and taking the mel spectrum output by the last module in the decoder as the target mel spectrum.
Optionally, in a fourth implementation of the first aspect of the embodiments of this application, invoking each module of the decoder to process the initial mel spectrum comprises: invoking the masked multi-head attention sub-module of the first module in the decoder to mask the illegal information in the initial mel spectrum to obtain a masked mel spectrum; invoking the multi-head attention sub-module of the first module to process the masked mel spectrum to obtain a preprocessed mel spectrum; invoking the feed-forward network sub-module of the first module to process the preprocessed mel spectrum to obtain a first mel spectrum; and invoking the remaining modules in the decoder in turn to perform processing based on the first mel spectrum, until the mel spectrum output by the last module is obtained.
Optionally, in a fifth implementation of the first aspect of the embodiments of this application, determining the target correspondences between the first sequence and the target mel spectrum in each subspace comprises: mapping the first sequence and the target mel spectrum into the same multiple subspaces; using the first sequence as the query and key of multi-head attention; using the target mel spectrum as the value of multi-head attention; and computing the target correspondences from the query, the key, and the value, the target correspondences being the mapping relationships between the first sequence and the target mel spectrum in each subspace.
Optionally, in a sixth implementation of the first aspect of the embodiments of this application, before acquiring the initial mel spectrum and the target vector, the method further comprises: acquiring target text, the target text being the characters to be synthesized; converting the target text into a word embedding vector; combining the word embedding vector with a preset positional encoding vector; and generating the target vector.
A second aspect of the embodiments of this application provides an apparatus for synthesizing Chinese speech, comprising: a first acquisition unit, configured to acquire an initial mel spectrum and a target vector; a first processing unit, configured to process the target vector to obtain a first sequence, the first sequence being a two-dimensional tensor; a second processing unit, configured to process the initial mel spectrum to obtain a target mel spectrum; a determination unit, configured to determine target correspondences between the first sequence and the target mel spectrum in each subspace; and a synthesis unit, configured to perform speech synthesis according to a self-attention mechanism and the target correspondences to obtain target speech.
Optionally, in a first implementation of the second aspect of the embodiments of this application, the first processing unit is specifically configured to: invoke each module of an encoder to process the target vector, wherein the output of the previous module in the encoder serves as the input of the next module, and the encoder consists of multiple modules connected in series; and take the output vector of the last module in the encoder as the first sequence, the first sequence being a two-dimensional tensor.
Optionally, in a second implementation of the second aspect of the embodiments of this application, the first processing unit is further specifically configured to: invoke the multi-head attention sub-module of the first module in the encoder to process the target vector to obtain a first intermediate vector; invoke the feed-forward network sub-module of the first module to process the first intermediate vector to obtain a first output vector; invoke the multi-head attention sub-module of the next module in the encoder to process the first output vector to obtain a second intermediate vector; invoke the feed-forward network sub-module of the next module to process the second intermediate vector to obtain a second output vector; and invoke the remaining modules in the encoder in turn to perform processing based on the second output vector, until the vector output by the last module is obtained.
Optionally, in a third implementation of the second aspect of the embodiments of this application, the second processing unit is specifically configured to: invoke each module of a decoder to process the initial mel spectrum, wherein the output of the previous module in the decoder serves as the input of the next module, and the decoder consists of multiple modules connected in series; and take the mel spectrum output by the last module in the decoder as the target mel spectrum.
Optionally, in a fourth implementation of the second aspect of the embodiments of this application, the second processing unit is further specifically configured to: invoke the masked multi-head attention sub-module of the first module in the decoder to mask the illegal information in the initial mel spectrum to obtain a masked mel spectrum; invoke the multi-head attention sub-module of the first module to process the masked mel spectrum to obtain a preprocessed mel spectrum; invoke the feed-forward network sub-module of the first module to process the preprocessed mel spectrum to obtain a first mel spectrum; and invoke the remaining modules in the decoder in turn to perform processing based on the first mel spectrum, until the mel spectrum output by the last module is obtained.
Optionally, in a fifth implementation of the second aspect of the embodiments of this application, the determination unit is specifically configured to: map the first sequence and the target mel spectrum into the same multiple subspaces; use the first sequence as the query and key of multi-head attention; use the target mel spectrum as the value of multi-head attention; and compute the target correspondences from the query, the key, and the value, the target correspondences being the mapping relationships between the first sequence and the target mel spectrum in each subspace.
Optionally, in a sixth implementation of the second aspect of the embodiments of this application, the apparatus for synthesizing Chinese speech further comprises: a second acquisition unit, configured to acquire target text, the target text being the characters to be synthesized; a conversion unit, configured to convert the target text into a word embedding vector; a combination unit, configured to combine the word embedding vector with a preset positional encoding vector; and a generation unit, configured to generate the target vector.
A third aspect of the embodiments of this application provides a device for synthesizing Chinese speech, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for synthesizing Chinese speech described in any of the above implementations.
A fourth aspect of the embodiments of this application provides a non-volatile computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the steps of the method for synthesizing Chinese speech described in any of the above implementations.
In the technical solution provided by the embodiments of this application, an initial mel spectrum and a target vector are acquired; the target vector is processed to obtain a first sequence, the first sequence being a two-dimensional tensor; the initial mel spectrum is processed to obtain a target mel spectrum; target correspondences between the first sequence and the target mel spectrum in each subspace are determined; and speech synthesis is performed according to a self-attention mechanism and the target correspondences to obtain target speech. In the embodiments of this application, replacing the recurrent neural network in multi-head attention with self-attention speeds up model training, reduces training time, and at the same time enhances model expressiveness and generalization, further improving the quality of the synthesized speech.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an embodiment of the method for synthesizing Chinese speech in the embodiments of this application;
FIG. 2 is a schematic diagram of an embodiment of the apparatus for synthesizing Chinese speech in the embodiments of this application;
FIG. 3 is a schematic diagram of another embodiment of the apparatus for synthesizing Chinese speech in the embodiments of this application;
FIG. 4 is a schematic diagram of an embodiment of the device for synthesizing Chinese speech in the embodiments of this application.
Detailed Description
This application provides a method, apparatus, device, and storage medium for synthesizing Chinese speech, used to reduce training time while enhancing model expressiveness and generalization, further improving the quality of the synthesized speech.
To enable those skilled in the art to better understand the solution of this application, the embodiments of this application are described below with reference to the accompanying drawings.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification, claims, and the above drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. Moreover, the terms "comprise" and "have", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such process, method, product, or device.
Referring to FIG. 1, a flowchart of the method for synthesizing Chinese speech provided by an embodiment of this application specifically comprises:
101. Acquire an initial mel spectrum and a target vector.
The apparatus for synthesizing Chinese speech acquires an initial mel spectrum and a target vector. Specifically, the target vector is acquired via the encoder; the target vector is the content to be converted into speech, in a vector form recognizable by the encoder. For example, the target vector may indicate content such as "我爱中国" ("I love China"), "我来自北京" ("I am from Beijing"), or "北京欢迎你" ("Beijing welcomes you"). The initial mel spectrum is acquired via the decoder, where the initial mel spectrum is the audio with phase information removed.
It should be noted that the initial mel spectrum is lossy audio obtained by processing the original audio; therefore, converting it back to the original audio requires a vocoder, which may be implemented using the Griffin-Lim algorithm, the WaveNet algorithm, or the like, and is not specifically limited here.
It can be understood that the initial mel spectrum and the target vector may be acquired at the same time, or the mel spectrum may be acquired first and then the target vector, or the target vector first and then the mel spectrum, which is not specifically limited here.
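The mel-spectrum extraction just described, where taking the magnitude discards the phase so a vocoder such as Griffin-Lim is needed to recover a waveform, can be sketched in plain numpy. This is a minimal illustration and not the patent's implementation; the sample rate, frame size, hop, and filter count are arbitrary assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(audio, sr=16000, n_fft=512, hop=128, n_mels=20):
    # Magnitude STFT: |.| drops the phase, which is why a vocoder
    # (e.g. Griffin-Lim) is needed to go back to a waveform.
    frames = [audio[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(audio) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(np.array(frames), axis=1))   # (T, n_fft//2+1)
    return mag @ mel_filterbank(n_mels, n_fft, sr).T      # (T, n_mels)

audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
mel = mel_spectrogram(audio)
print(mel.shape)
```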
102. Process the target vector to obtain a first sequence, the first sequence being a two-dimensional tensor.
The apparatus for synthesizing Chinese speech processes the target vector to obtain the first sequence, which is a two-dimensional tensor. Specifically, the apparatus invokes the sub-modules of the encoder to process the target vector; the encoder consists of multiple modules connected in series, and the result of the previous module is fed into the next module for processing. Each module comprises a multi-head attention sub-module and a feed-forward network (FFN) sub-module, where multi-head attention mainly captures relationships in the subspaces of the sequence. For example, the apparatus may learn phrasing relationships in one subspace and dependency relationships in another, similar to stacking multiple convolution kernels in a convolution.
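The chaining of encoder modules, each an attention sub-module followed by a feed-forward network with the output of one module feeding the next, can be sketched as below. For brevity the attention here is single-head and unparameterized, and the random weights are placeholder assumptions; only the series structure is being illustrated.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # Single-head self-attention: query = key = value = x.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def ffn(x, w1, w2):
    # Position-wise feed-forward network with a ReLU in between.
    return np.maximum(x @ w1, 0) @ w2

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))          # target vector: 5 tokens, d dims
modules = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
           for _ in range(3)]        # 3 modules connected in series
for w1, w2 in modules:               # output of one module feeds the next
    x = ffn(self_attention(x), w1, w2)
print(x.shape)  # the "first sequence" is a two-dimensional tensor
```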
103. Process the initial mel spectrum to obtain a target mel spectrum.
The apparatus for synthesizing Chinese speech preprocesses the initial mel spectrum to obtain the target mel spectrum. Specifically, the apparatus uses masked multi-head attention to mask the information that should not be known during sequence generation (i.e., illegal information).
Masked multi-head attention mainly ensures consistency between training and inference. For example, during training, when predicting the pronunciation "w", the entire initial mel spectrum actually enters the network; the part of the mel spectrum after "w" in the sequence must be masked from the network to prevent it from seeing information that is to be predicted in the future, because that information is not visible at inference time.
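The masking idea can be illustrated with a small numpy example: frames after the current position receive a score of negative infinity, so their softmax weight becomes exactly zero and the network cannot see the future part of the spectrum. The frame count is an arbitrary assumption.

```python
import numpy as np

T = 5  # number of mel-spectrum frames (illustrative)
# Strictly upper-triangular mask: position i may not attend to frames j > i.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.zeros((T, T))
scores[mask] = -np.inf      # "illegal" future frames get weight exp(-inf) = 0
e = np.exp(scores)
weights = e / e.sum(axis=-1, keepdims=True)
print(weights[0])  # frame 0 can only attend to itself
```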
It should be noted that multi-head attention consists of several self-attentions; for example, 4-head attention essentially performs self-attention on the sequence 4 times.
It can be understood that, to improve the generation quality of the target mel spectrum, the output of the decoder used to predict the target mel spectrum additionally passes through a normalization network composed of multi-layer convolutional layers with residual connections, which optimizes and refines the decoder output.
104. Determine target correspondences between the first sequence and the target mel spectrum in each subspace.
The apparatus for synthesizing Chinese speech determines the target correspondences between the first sequence and the target mel spectrum in each subspace. Specifically, the apparatus maps the first sequence and the target mel spectrum into the same multiple subspaces; uses the first sequence as the query and key of multi-head attention; uses the target mel spectrum as the value of multi-head attention; and computes the target correspondences from the query, the key, and the value, the target correspondences being the mapping relationships between the first sequence and the target mel spectrum in each subspace.
It can be understood that the introduced multi-head attention mechanism can train more parameters and take into account attention at different positions. Attention is assigned multiple subspaces, and different subspaces can represent different association relationships, for example one subspace representing dependency relationships and another representing phrasing relationships. Integrating information from various positions (subspaces) improves the expressive power of attention. For example, in one subspace the relationship between the first sequence and the target mel spectrum is a dependency relationship, while in another subspace it is a phrasing relationship.
It should be noted that attention is a concept in sequence generation and is essentially a correlation matrix. For example, in machine translation, the matrix values corresponding to "我" and "I" in this two-dimensional matrix are relatively large.
105. Perform speech synthesis according to the self-attention mechanism and the target correspondences to obtain target speech.
The apparatus for synthesizing Chinese speech performs speech synthesis according to the self-attention mechanism and the target correspondences to obtain the target speech. The essence of an attention function can be described as a mapping from a query to a series of (key, value) pairs. Computing attention mainly involves three steps: first, compute the similarity between the query and each key to obtain weights, where commonly used similarity functions include the dot-product, concatenation, and perceptron functions; second, normalize these weights with a softmax function; finally, compute the weighted sum of the weights and the corresponding values to obtain the final attention. In current natural language processing (NLP) research, key and value are often the same, i.e., key = value. It should be noted that in the self-attention mechanism, key = value = query.
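The three-step attention computation just described (dot-product similarity between query and keys, softmax normalization, weighted sum of the values) might look like this in numpy; the shapes and random data are illustrative assumptions only.

```python
import numpy as np

def attention(query, key, value):
    # Step 1: dot-product similarity between the query and each key.
    scores = query @ key.T / np.sqrt(query.shape[-1])
    # Step 2: normalize the weights with a softmax function.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    # Step 3: weighted sum of the corresponding values.
    return weights @ value

rng = np.random.default_rng(1)
first_seq = rng.normal(size=(4, 8))   # query = key, as in step 104
mel = rng.normal(size=(4, 8))         # value = target mel spectrum
out = attention(first_seq, first_seq, mel)
print(out.shape)
```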
For example, multi-head attention may contain three parameters: query, key, and value. The three parameters first pass through a linear transformation and are then fed into scaled dot-product attention; this is done h times, which is what "multi-head" (h heads) means, computing one head each time. Moreover, the parameters W of each linear transformation of Q, K, and V are different each time. The results of the h scaled dot-product attentions are then concatenated, and the value obtained from one more linear transformation is taken as the result of multi-head attention. As can be seen, the distinctive feature of multi-head attention is that the computation is performed h times rather than just once, which allows the model to learn relevant information in different representation subspaces.
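A sketch of the h-head computation just described: each head applies its own linear transformations W to Q, K, and V, runs scaled dot-product attention, and the h results are concatenated and passed through one final linear transformation. All weights here are random placeholder assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(q, k, v, head_params, wo):
    outs = []
    for wq, wk, wv in head_params:          # a different W per head for Q, K, V
        qi, ki, vi = q @ wq, k @ wk, v @ wv
        scores = qi @ ki.T / np.sqrt(qi.shape[-1])
        outs.append(softmax(scores) @ vi)   # one scaled dot-product attention per head
    concat = np.concatenate(outs, axis=-1)  # concatenate the h head results
    return concat @ wo                      # one final linear transformation

rng = np.random.default_rng(2)
d, h, dk = 8, 4, 2                  # model dim 8, h = 4 heads of size 2
q = k = rng.normal(size=(5, d))     # first sequence as query and key
v = rng.normal(size=(5, d))         # target mel spectrum as value
head_params = [tuple(rng.normal(size=(d, dk)) for _ in range(3)) for _ in range(h)]
wo = rng.normal(size=(h * dk, d))
out = multi_head_attention(q, k, v, head_params, wo)
print(out.shape)  # (5, 8)
```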
In the embodiments of this application, replacing the recurrent neural network in multi-head attention with self-attention speeds up model training, reduces training time, and at the same time enhances model expressiveness and generalization, further improving the quality of the synthesized speech.
Optionally, on the basis of the embodiment corresponding to FIG. 1 above, in an optional embodiment of the method for synthesizing Chinese speech provided by the embodiments of this application, processing the target vector to obtain the first sequence, the first sequence being a two-dimensional tensor, comprises:
invoking each module of an encoder to process the target vector, wherein the output of the previous module in the encoder serves as the input of the next module, and the encoder consists of multiple modules connected in series; and
taking the output vector of the last module in the encoder as the first sequence, the first sequence being a two-dimensional tensor.
This embodiment of the application refines the processing of the target vector, adding an implementable manner to this application.
Optionally, on the basis of the embodiment corresponding to FIG. 1 above, in an optional embodiment of the method for synthesizing Chinese speech provided by the embodiments of this application, invoking each module of the encoder to process the target vector comprises:
invoking the multi-head attention sub-module of the first module in the encoder to process the target vector to obtain a first intermediate vector;
invoking the feed-forward network sub-module of the first module to process the first intermediate vector to obtain a first output vector;
invoking the multi-head attention sub-module of the next module in the encoder to process the first output vector to obtain a second intermediate vector;
invoking the feed-forward network sub-module of the next module to process the second intermediate vector to obtain a second output vector; and
invoking the remaining modules in the encoder in turn to perform processing based on the second output vector, until the vector output by the last module is obtained.
This embodiment of the application refines the processing of the target vector; the self-attention mechanism is used in the sub-modules to extract the association relationships between words within the sequence, improving the naturalness of phrasing in the synthesized speech.
Optionally, on the basis of the embodiment corresponding to FIG. 1 above, in an optional embodiment of the method for synthesizing Chinese speech provided by the embodiments of this application, processing the initial mel spectrum to obtain the target mel spectrum comprises:
invoking each module of a decoder to process the initial mel spectrum, wherein the output of the previous module in the decoder serves as the input of the next module, and the decoder consists of multiple modules connected in series; and
taking the mel spectrum output by the last module in the decoder as the target mel spectrum.
This embodiment of the application refines the processing of the initial mel spectrum, adding an implementable manner to this application.
Optionally, on the basis of the embodiment corresponding to FIG. 1 above, in an optional embodiment of the method for synthesizing Chinese speech provided by the embodiments of this application, invoking each module of the decoder to process the initial mel spectrum comprises:
invoking the masked multi-head attention sub-module of the first module in the decoder to mask the illegal information in the initial mel spectrum to obtain a masked mel spectrum;
invoking the multi-head attention sub-module of the first module to process the masked mel spectrum to obtain a preprocessed mel spectrum;
invoking the feed-forward network sub-module of the first module to process the preprocessed mel spectrum to obtain a first mel spectrum; and
invoking the remaining modules in the decoder in turn to perform processing based on the first mel spectrum, until the mel spectrum output by the last module is obtained.
In this embodiment of the application, through the masked multi-head attention mechanism, the illegal information in the mel spectrum is masked, preventing the current position from attending to information at later positions and ensuring that the prediction at the current position depends only on the known outputs before the current position.
Optionally, on the basis of the embodiment corresponding to FIG. 1 above, in an optional embodiment of the method for synthesizing Chinese speech provided by the embodiments of this application, determining the target correspondences between the first sequence and the target mel spectrum in each subspace comprises:
mapping the first sequence and the target mel spectrum into the same multiple subspaces;
using the first sequence as the query and key of multi-head attention;
using the target mel spectrum as the value of multi-head attention; and
computing the target correspondences from the query, the key, and the value, the target correspondences being the mapping relationships between the first sequence and the target mel spectrum in each subspace.
This embodiment of the application refines the process of determining the target correspondences between the first sequence and the target mel spectrum in each subspace. The introduced multi-head attention mechanism assigns multiple subspaces to attention; different subspaces can represent different association relationships, and integrating association information from each position improves the expressive power of attention.
Optionally, on the basis of the embodiment corresponding to FIG. 1 above, in an optional embodiment of the method for synthesizing Chinese speech provided by the embodiments of this application, before acquiring the initial mel spectrum and the target vector, the method further comprises:
acquiring target text, the target text being the characters to be synthesized;
converting the target text into a word embedding vector;
combining the word embedding vector with a preset positional encoding vector; and
generating the target vector.
For example, the target vector corresponds to a sequence, but the target text cannot be computed on directly, so the text is generally mapped to numbers. In this speech synthesis framework, a sequence such as "我爱中国" is converted into the pinyin sequence "wo3 ai4 zhong1 guo2" and then mapped into a numeric sequence, say "163 123 111 123…", with one character corresponding to one number. The target text passes through a preprocessing network to obtain the word embedding vector, and the word embedding vector and the positional encoding vector are added element-wise; for example, if the word embedding vector is [1, 2] and the corresponding positional encoding vector is [0.1, 0.9], then [1.1, 2.9] is fed into the subsequent deep network. The positional encoding vector is a tensor of the same size as the word embedding vector.
Suppose the target text to be synthesized is "我爱中国"; after being converted to pinyin and then word-embedded, it becomes a two-dimensional tensor (sequence), such as [[0.2, 0.4], [0.1, 0.5], [0.3, 0.3], [0.9, 0.7], …]. After summation with the positional encoding vector, every character within a sequence is computed against every other character.
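The embedding-plus-positional-encoding step can be sketched as follows. The character-to-id table and the embedding values are hypothetical, and the sinusoidal encoding is one common choice (the patent does not specify an encoding formula); the key point is the element-wise addition of two tensors of equal size.

```python
import numpy as np

def positional_encoding(seq_len, d):
    # Sinusoidal positions; the result has the same size as the embeddings.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# Hypothetical pinyin-symbol-to-id table; real systems use a full vocabulary.
vocab = {"wo3": 0, "ai4": 1, "zhong1": 2, "guo2": 3}
ids = [vocab[c] for c in ["wo3", "ai4", "zhong1", "guo2"]]
d = 4
rng = np.random.default_rng(3)
embedding_table = rng.normal(size=(len(vocab), d))
emb = embedding_table[ids]                              # word embeddings: a 2-D tensor
target_vector = emb + positional_encoding(len(ids), d)  # element-wise addition
print(target_vector.shape)
```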
This embodiment of the application refines the acquisition process of the target vector and solves the problem of order being lost when self-attention extracts features from a sequence.
The method for synthesizing Chinese speech in the embodiments of this application has been described above; the apparatus for synthesizing Chinese speech in the embodiments of this application is described below. Referring to FIG. 2, an embodiment of the apparatus for synthesizing Chinese speech in the embodiments of this application comprises:
a first acquisition unit 201, configured to acquire an initial mel spectrum and a target vector;
a first processing unit 202, configured to process the target vector to obtain a first sequence, the first sequence being a two-dimensional tensor;
a second processing unit 203, configured to process the initial mel spectrum to obtain a target mel spectrum;
a determination unit 204, configured to determine target correspondences between the first sequence and the target mel spectrum in each subspace; and
a synthesis unit 205, configured to perform speech synthesis according to a self-attention mechanism and the target correspondences to obtain target speech.
In the embodiments of this application, replacing the recurrent neural network in multi-head attention with self-attention speeds up model training, reduces training time, and at the same time enhances model expressiveness and generalization, further improving the quality of the synthesized speech.
Referring to FIG. 3, another embodiment of the apparatus for synthesizing Chinese speech in the embodiments of this application comprises:
a first acquisition unit 201, configured to acquire an initial mel spectrum and a target vector;
a first processing unit 202, configured to process the target vector to obtain a first sequence, the first sequence being a two-dimensional tensor;
a second processing unit 203, configured to process the initial mel spectrum to obtain a target mel spectrum;
a determination unit 204, configured to determine target correspondences between the first sequence and the target mel spectrum in each subspace; and
a synthesis unit 205, configured to perform speech synthesis according to a self-attention mechanism and the target correspondences to obtain target speech.
Optionally, the first processing unit 202 is specifically configured to:
invoke each module of an encoder to process the target vector, wherein the output of the previous module in the encoder serves as the input of the next module, and the encoder consists of multiple modules connected in series; and take the output vector of the last module in the encoder as the first sequence, the first sequence being a two-dimensional tensor.
Optionally, the first processing unit 202 is further specifically configured to:
invoke the multi-head attention sub-module of the first module in the encoder to process the target vector to obtain a first intermediate vector; invoke the feed-forward network sub-module of the first module to process the first intermediate vector to obtain a first output vector; invoke the multi-head attention sub-module of the next module in the encoder to process the first output vector to obtain a second intermediate vector; invoke the feed-forward network sub-module of the next module to process the second intermediate vector to obtain a second output vector; and invoke the remaining modules in the encoder in turn to perform processing based on the second output vector, until the vector output by the last module is obtained.
Optionally, the second processing unit 203 is specifically configured to:
invoke each module of a decoder to process the initial mel spectrum, wherein the output of the previous module in the decoder serves as the input of the next module, and the decoder consists of multiple modules connected in series; and take the mel spectrum output by the last module in the decoder as the target mel spectrum.
Optionally, the second processing unit 203 is further specifically configured to:
invoke the masked multi-head attention sub-module of the first module in the decoder to mask the illegal information in the initial mel spectrum to obtain a masked mel spectrum; invoke the multi-head attention sub-module of the first module to process the masked mel spectrum to obtain a preprocessed mel spectrum; invoke the feed-forward network sub-module of the first module to process the preprocessed mel spectrum to obtain a first mel spectrum; and invoke the remaining modules in the decoder in turn to perform processing based on the first mel spectrum, until the mel spectrum output by the last module is obtained.
Optionally, the determination unit 204 is specifically configured to:
map the first sequence and the target mel spectrum into the same multiple subspaces; use the first sequence as the query and key of multi-head attention; use the target mel spectrum as the value of multi-head attention; and compute the target correspondences from the query, the key, and the value, the target correspondences being the mapping relationships between the first sequence and the target mel spectrum in each subspace.
Optionally, the apparatus for synthesizing Chinese speech further comprises:
a second acquisition unit 206, configured to acquire target text, the target text being the characters to be synthesized;
a conversion unit 207, configured to convert the target text into a word embedding vector;
a combination unit 208, configured to combine the word embedding vector with a preset positional encoding vector; and
a generation unit 209, configured to generate the target vector.
In the technical solution provided by the embodiments of this application, an initial mel spectrum and a target vector are acquired; the target vector is processed to obtain a first sequence, the first sequence being a two-dimensional tensor; the initial mel spectrum is processed to obtain a target mel spectrum; target correspondences between the first sequence and the target mel spectrum in each subspace are determined; and speech synthesis is performed according to a self-attention mechanism and the target correspondences to obtain target speech. In the embodiments of this application, replacing the recurrent neural network in multi-head attention with self-attention speeds up model training, reduces training time, and at the same time enhances model expressiveness and generalization, further improving the quality of the synthesized speech.
FIGS. 2 to 3 above describe the apparatus for synthesizing Chinese speech in the embodiments of this application in detail from the perspective of modular functional entities; the device for synthesizing Chinese speech in the embodiments of this application is described in detail below from the perspective of hardware processing.
FIG. 4 is a schematic structural diagram of a device for synthesizing Chinese speech provided by an embodiment of this application. The device 400 for synthesizing Chinese speech may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 401 (for example, one or more processors), a memory 409, and one or more storage media 408 (for example, one or more mass storage devices) storing application programs 407 or data 406. The memory 409 and the storage medium 408 may provide transient or persistent storage. A program stored in the storage medium 408 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the device for synthesizing Chinese speech. Further, the processor 401 may be configured to communicate with the storage medium 408 and execute, on the device 400 for synthesizing Chinese speech, the series of instruction operations in the storage medium 408.
The device 400 for synthesizing Chinese speech may also include one or more power supplies 402, one or more wired or wireless network interfaces 403, one or more input/output interfaces 404, and/or one or more operating systems 405, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art can understand that the device structure shown in FIG. 4 does not constitute a limitation on the device for synthesizing Chinese speech, which may include more or fewer components than shown, combine certain components, or arrange the components differently. The processor 401 can perform the functions of the first acquisition unit 201, the first processing unit 202, the second processing unit 203, the determination unit 204, the synthesis unit 205, the second acquisition unit 206, the conversion unit 207, the combination unit 208, and the generation unit 209 in the above embodiments.
The components of the device for synthesizing Chinese speech are specifically introduced below with reference to FIG. 4:
The processor 401 is the control center of the device for synthesizing Chinese speech and can perform processing according to the configured method for synthesizing Chinese speech. The processor 401 uses various interfaces and lines to connect the parts of the entire device, and, by running or executing the software programs and/or modules stored in the memory 409 and calling the data stored in the memory 409, executes the various functions of the device and processes data, thereby realizing the synthesis of Chinese speech. The storage medium 408 and the memory 409 are both carriers for storing data; in the embodiments of this application, the storage medium 408 may refer to internal memory with a small storage capacity but high speed, while the memory 409 may be external storage with a large storage capacity but a slower access speed.
The memory 409 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing of the device 400 for synthesizing Chinese speech by running the software programs and modules stored in the memory 409. The memory 409 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the application program required by at least one function (for example, determining the target correspondences between the first sequence and the target mel spectrum in each subspace), and the data storage area may store data created according to the use of the device for synthesizing Chinese speech (for example, target vectors). In addition, the memory 409 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The program of the method for synthesizing Chinese speech provided in the embodiments of this application and the received data stream are stored in the memory, and the processor 401 calls them from the memory 409 when needed.
This application also provides a non-volatile computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the following steps of the method for synthesizing Chinese speech:
acquiring an initial mel spectrum and a target vector;
processing the target vector to obtain a first sequence, the first sequence being a two-dimensional tensor;
processing the initial mel spectrum to obtain a target mel spectrum;
determining target correspondences between the first sequence and the target mel spectrum in each subspace; and
performing speech synthesis according to a self-attention mechanism and the target correspondences to obtain target speech.
When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center via wired (such as coaxial cable, optical fiber, or twisted pair) or wireless (such as infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, an optical disc), or a semiconductor medium (for example, a solid state disk (SSD)).
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only intended to illustrate the technical solution of this application, not to limit it. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements for some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solution to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (20)

  1. A method for synthesizing Chinese speech, comprising:
    acquiring an initial mel spectrum and a target vector;
    processing the target vector to obtain a first sequence, the first sequence being a two-dimensional tensor;
    processing the initial mel spectrum to obtain a target mel spectrum;
    determining target correspondences between the first sequence and the target mel spectrum in each subspace; and
    performing speech synthesis according to a self-attention mechanism and the target correspondences to obtain target speech.
  2. The method for synthesizing Chinese speech according to claim 1, wherein processing the target vector to obtain the first sequence, the first sequence being a two-dimensional tensor, comprises:
    invoking each module of an encoder to process the target vector, wherein the output of the previous module in the encoder serves as the input of the next module, and the encoder consists of multiple modules connected in series; and
    taking the output vector of the last module in the encoder as the first sequence, the first sequence being a two-dimensional tensor.
  3. The method for synthesizing Chinese speech according to claim 2, wherein invoking each module of the encoder to process the target vector comprises:
    invoking the multi-head attention sub-module of the first module in the encoder to process the target vector to obtain a first intermediate vector;
    invoking the feed-forward network sub-module of the first module to process the first intermediate vector to obtain a first output vector;
    invoking the multi-head attention sub-module of the next module in the encoder to process the first output vector to obtain a second intermediate vector;
    invoking the feed-forward network sub-module of the next module to process the second intermediate vector to obtain a second output vector; and
    invoking the remaining modules in the encoder in turn to perform processing based on the second output vector, until the vector output by the last module is obtained.
  4. The method for synthesizing Chinese speech according to claim 1, wherein processing the initial mel spectrum to obtain the target mel spectrum comprises:
    invoking each module of a decoder to process the initial mel spectrum, wherein the output of the previous module in the decoder serves as the input of the next module, and the decoder consists of multiple modules connected in series; and
    taking the mel spectrum output by the last module in the decoder as the target mel spectrum.
  5. The method for synthesizing Chinese speech according to claim 4, wherein invoking each module of the decoder to process the initial mel spectrum comprises:
    invoking the masked multi-head attention sub-module of the first module in the decoder to mask the illegal information in the initial mel spectrum to obtain a masked mel spectrum;
    invoking the multi-head attention sub-module of the first module to process the masked mel spectrum to obtain a preprocessed mel spectrum;
    invoking the feed-forward network sub-module of the first module to process the preprocessed mel spectrum to obtain a first mel spectrum; and
    invoking the remaining modules in the decoder in turn to perform processing based on the first mel spectrum, until the mel spectrum output by the last module is obtained.
  6. The method for synthesizing Chinese speech according to claim 1, wherein determining the target correspondences between the first sequence and the target mel spectrum in each subspace comprises:
    mapping the first sequence and the target mel spectrum into the same multiple subspaces;
    using the first sequence as the query and key of multi-head attention;
    using the target mel spectrum as the value of multi-head attention; and
    computing the target correspondences from the query, the key, and the value, the target correspondences being the mapping relationships between the first sequence and the target mel spectrum in each subspace.
  7. The method for synthesizing Chinese speech according to any one of claims 1-6, wherein before acquiring the initial mel spectrum and the target vector, the method further comprises:
    acquiring target text, the target text being the characters to be synthesized;
    converting the target text into a word embedding vector;
    combining the word embedding vector with a preset positional encoding vector; and
    generating the target vector.
  8. An apparatus for synthesizing Chinese speech, comprising:
    a first acquisition unit, configured to acquire an initial mel spectrum and a target vector;
    a first processing unit, configured to process the target vector to obtain a first sequence, the first sequence being a two-dimensional tensor;
    a second processing unit, configured to process the initial mel spectrum to obtain a target mel spectrum;
    a determination unit, configured to determine target correspondences between the first sequence and the target mel spectrum in each subspace; and
    a synthesis unit, configured to perform speech synthesis according to a self-attention mechanism and the target correspondences to obtain target speech.
  9. The apparatus for synthesizing Chinese speech according to claim 8, wherein the first processing unit is specifically configured to:
    invoke each module of an encoder to process the target vector, wherein the output of the previous module in the encoder serves as the input of the next module, and the encoder consists of multiple modules connected in series; and
    take the output vector of the last module in the encoder as the first sequence, the first sequence being a two-dimensional tensor.
  10. The apparatus for synthesizing Chinese speech according to claim 9, wherein the first processing unit is further specifically configured to:
    invoke the multi-head attention sub-module of the first module in the encoder to process the target vector to obtain a first intermediate vector;
    invoke the feed-forward network sub-module of the first module to process the first intermediate vector to obtain a first output vector;
    invoke the multi-head attention sub-module of the next module in the encoder to process the first output vector to obtain a second intermediate vector;
    invoke the feed-forward network sub-module of the next module to process the second intermediate vector to obtain a second output vector; and
    invoke the remaining modules in the encoder in turn to perform processing based on the second output vector, until the vector output by the last module is obtained.
  11. The apparatus for synthesizing Chinese speech according to claim 8, wherein the second processing unit is specifically configured to:
    invoke each module of a decoder to process the initial mel spectrum, wherein the output of the previous module in the decoder serves as the input of the next module, and the decoder consists of multiple modules connected in series; and
    take the mel spectrum output by the last module in the decoder as the target mel spectrum.
  12. The apparatus for synthesizing Chinese speech according to claim 11, wherein the second processing unit is further specifically configured to:
    invoke the masked multi-head attention sub-module of the first module in the decoder to mask the illegal information in the initial mel spectrum to obtain a masked mel spectrum;
    invoke the multi-head attention sub-module of the first module to process the masked mel spectrum to obtain a preprocessed mel spectrum;
    invoke the feed-forward network sub-module of the first module to process the preprocessed mel spectrum to obtain a first mel spectrum; and
    invoke the remaining modules in the decoder in turn to perform processing based on the first mel spectrum, until the mel spectrum output by the last module is obtained.
  13. The apparatus for synthesizing Chinese speech according to claim 8, wherein the determination unit is specifically configured to:
    map the first sequence and the target mel spectrum into the same multiple subspaces;
    use the first sequence as the query and key of multi-head attention;
    use the target mel spectrum as the value of multi-head attention; and
    compute the target correspondences from the query, the key, and the value, the target correspondences being the mapping relationships between the first sequence and the target mel spectrum in each subspace.
  14. The apparatus for synthesizing Chinese speech according to any one of claims 8-13, further comprising:
    a second acquisition unit, configured to acquire target text, the target text being the characters to be synthesized;
    a conversion unit, configured to convert the target text into a word embedding vector;
    a combination unit, configured to combine the word embedding vector with a preset positional encoding vector; and
    a generation unit, configured to generate the target vector.
  15. A device for synthesizing Chinese speech, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the following method for synthesizing Chinese speech:
    acquiring an initial mel spectrum and a target vector;
    processing the target vector to obtain a first sequence, the first sequence being a two-dimensional tensor;
    processing the initial mel spectrum to obtain a target mel spectrum;
    determining target correspondences between the first sequence and the target mel spectrum in each subspace; and
    performing speech synthesis according to a self-attention mechanism and the target correspondences to obtain target speech.
  16. The device for synthesizing Chinese speech according to claim 15, wherein the processor is specifically configured to execute the following steps:
    invoking each module of an encoder to process the target vector, wherein the output of the previous module in the encoder serves as the input of the next module, and the encoder consists of multiple modules connected in series; and
    taking the output vector of the last module in the encoder as the first sequence, the first sequence being a two-dimensional tensor.
  17. The device for synthesizing Chinese speech according to claim 16, wherein the processor is specifically configured to execute the following steps:
    invoking the multi-head attention sub-module of the first module in the encoder to process the target vector to obtain a first intermediate vector;
    invoking the feed-forward network sub-module of the first module to process the first intermediate vector to obtain a first output vector;
    invoking the multi-head attention sub-module of the next module in the encoder to process the first output vector to obtain a second intermediate vector;
    invoking the feed-forward network sub-module of the next module to process the second intermediate vector to obtain a second output vector; and
    invoking the remaining modules in the encoder in turn to perform processing based on the second output vector, until the vector output by the last module is obtained.
  18. The device for synthesizing Chinese speech according to claim 15, wherein the processor is specifically configured to execute the following steps:
    invoking each module of a decoder to process the initial mel spectrum, wherein the output of the previous module in the decoder serves as the input of the next module, and the decoder consists of multiple modules connected in series; and
    taking the mel spectrum output by the last module in the decoder as the target mel spectrum.
  19. The device for synthesizing Chinese speech according to claim 18, wherein the processor is specifically configured to execute the following steps:
    invoking the masked multi-head attention sub-module of the first module in the decoder to mask the illegal information in the initial mel spectrum to obtain a masked mel spectrum;
    invoking the multi-head attention sub-module of the first module to process the masked mel spectrum to obtain a preprocessed mel spectrum;
    invoking the feed-forward network sub-module of the first module to process the preprocessed mel spectrum to obtain a first mel spectrum; and
    invoking the remaining modules in the decoder in turn to perform processing based on the first mel spectrum, until the mel spectrum output by the last module is obtained.
  20. A non-volatile computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the steps of the following method for synthesizing Chinese speech:
    acquiring an initial mel spectrum and a target vector;
    processing the target vector to obtain a first sequence, the first sequence being a two-dimensional tensor;
    processing the initial mel spectrum to obtain a target mel spectrum;
    determining target correspondences between the first sequence and the target mel spectrum in each subspace; and
    performing speech synthesis according to a self-attention mechanism and the target correspondences to obtain target speech.
PCT/CN2019/102247 2019-04-26 2019-08-23 Method, apparatus, device, and storage medium for synthesizing Chinese speech WO2020215551A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910342344.3 2019-04-26
CN201910342344.3A CN110070852B (zh) 2019-04-26 2019-04-26 Method, apparatus, device, and storage medium for synthesizing Chinese speech

Publications (1)

Publication Number Publication Date
WO2020215551A1 true WO2020215551A1 (zh) 2020-10-29

Family

ID=67369058

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102247 WO2020215551A1 (zh) 2019-04-26 2019-08-23 合成中文语音的方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN110070852B (zh)
WO (1) WO2020215551A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488022A (zh) * 2021-07-07 2021-10-08 北京搜狗科技发展有限公司 Speech synthesis method and apparatus

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070852B (zh) 2019-04-26 2023-06-16 平安科技(深圳)有限公司 Method, apparatus, device, and storage medium for synthesizing Chinese speech
CN110808027B (zh) * 2019-11-05 2020-12-08 腾讯科技(深圳)有限公司 Speech synthesis method and apparatus, and news broadcasting method and system
CN112786000B (zh) * 2019-11-11 2022-06-03 亿度慧达教育科技(北京)有限公司 Speech synthesis method, system, device, and storage medium
CN111161702B (zh) * 2019-12-23 2022-08-26 爱驰汽车有限公司 Personalized speech synthesis method and apparatus, electronic device, and storage medium
CN111133507B (zh) * 2019-12-23 2023-05-23 深圳市优必选科技股份有限公司 Speech synthesis method and apparatus, intelligent terminal, and readable medium
CN111369968B (zh) * 2020-03-19 2023-10-13 北京字节跳动网络技术有限公司 Speech synthesis method and apparatus, readable medium, and electronic device
CN111462735B (zh) * 2020-04-10 2023-11-28 杭州网易智企科技有限公司 Voice detection method and apparatus, electronic device, and storage medium
CN111859994B (zh) * 2020-06-08 2024-01-23 北京百度网讯科技有限公司 Machine translation model acquisition and text translation method and apparatus, and storage medium
CN112002305A (zh) * 2020-07-29 2020-11-27 北京大米科技有限公司 Speech synthesis method and apparatus, storage medium, and electronic device
CN112382273A (zh) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method, apparatus, device, and medium for generating audio
CN112687259B (zh) * 2021-03-11 2021-06-18 腾讯科技(深圳)有限公司 Speech synthesis method and apparatus, and readable storage medium
CN113192484A (zh) * 2021-05-26 2021-07-30 腾讯音乐娱乐科技(深圳)有限公司 Method, device, and storage medium for generating audio based on text
CN113792540B (zh) * 2021-09-18 2024-03-22 平安科技(深圳)有限公司 Intention recognition model updating method and related device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050182629A1 (en) * 2004-01-16 2005-08-18 Geert Coorman Corpus-based speech synthesis based on segment recombination
CN104392717A (zh) * 2014-12-08 2015-03-04 常州工学院 Fast voice conversion system and method based on Gaussian mixture modeling of the vocal tract spectrum
CN104485099A (zh) * 2014-12-26 2015-04-01 中国科学技术大学 Method for improving the naturalness of synthesized speech
CN107545903A (zh) * 2017-07-19 2018-01-05 南京邮电大学 Voice conversion method based on deep learning
CN109036377A (zh) * 2018-07-26 2018-12-18 中国银联股份有限公司 Speech synthesis method and apparatus
CN110070852A (zh) * 2019-04-26 2019-07-30 平安科技(深圳)有限公司 Method, apparatus, device, and storage medium for synthesizing Chinese speech

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962216B (zh) * 2018-06-12 2021-02-02 北京市商汤科技开发有限公司 Speaking video processing method and apparatus, device, and storage medium
CN109036375B (zh) * 2018-07-25 2023-03-24 腾讯科技(深圳)有限公司 Speech synthesis method, model training method, apparatus, and computer device
CN109616127A (zh) * 2018-11-15 2019-04-12 建湖云飞数据科技有限公司 Audio data fusion method
CN109616093B (zh) * 2018-12-05 2024-02-27 平安科技(深圳)有限公司 End-to-end speech synthesis method, apparatus, device, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050182629A1 (en) * 2004-01-16 2005-08-18 Geert Coorman Corpus-based speech synthesis based on segment recombination
CN104392717A (zh) * 2014-12-08 2015-03-04 常州工学院 Fast voice conversion system and method based on Gaussian mixture modeling of the vocal tract spectrum
CN104485099A (zh) * 2014-12-26 2015-04-01 中国科学技术大学 Method for improving the naturalness of synthesized speech
CN107545903A (zh) * 2017-07-19 2018-01-05 南京邮电大学 Voice conversion method based on deep learning
CN109036377A (zh) * 2018-07-26 2018-12-18 中国银联股份有限公司 Speech synthesis method and apparatus
CN110070852A (zh) * 2019-04-26 2019-07-30 平安科技(深圳)有限公司 Method, apparatus, device, and storage medium for synthesizing Chinese speech

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488022A (zh) * 2021-07-07 2021-10-08 北京搜狗科技发展有限公司 Speech synthesis method and apparatus
CN113488022B (zh) * 2021-07-07 2024-05-10 北京搜狗科技发展有限公司 Speech synthesis method and apparatus

Also Published As

Publication number Publication date
CN110070852B (zh) 2023-06-16
CN110070852A (zh) 2019-07-30

Similar Documents

Publication Publication Date Title
WO2020215551A1 (zh) Method, apparatus, device, and storage medium for synthesizing Chinese speech
US20220180202A1 (en) Text processing model training method, and text processing method and apparatus
CN111276120B (zh) Speech synthesis method, apparatus, and computer-readable storage medium
CN112687259B (zh) Speech synthesis method and apparatus, and readable storage medium
WO2018058994A1 (zh) Deep-learning-based dialogue method, apparatus, and device
CN111312245B (zh) Voice response method, apparatus, and storage medium
CN109887484A (zh) Speech recognition and speech synthesis method and apparatus based on dual learning
CN106971709A (zh) Statistical parametric model building method and apparatus, and speech synthesis method and apparatus
WO2020248393A1 (zh) Speech synthesis method and system, terminal device, and readable storage medium
CN112837669B (zh) Speech synthesis method, apparatus, and server
CN113421547B (zh) Speech processing method and related device
CN113539232B (zh) Speech synthesis method based on a MOOC speech dataset
CN114895817B (zh) Interaction information processing method, and network model training method and apparatus
US20220157329A1 (en) Method of converting voice feature of voice
CN113761841B (zh) Method for converting text data into acoustic features
CN113886643A (zh) Digital human video generation method, apparatus, electronic device, and storage medium
CN114882862A (zh) Speech processing method and related device
CN113450765A (zh) Speech synthesis method, apparatus, device, and storage medium
KR20190016889A (ko) Text-to-speech conversion method and system
CN116611459B (zh) Translation model training method and apparatus, electronic device, and storage medium
JP2023169230A (ja) Computer program, server apparatus, terminal apparatus, trained model, program generation method, and method
WO2023116243A1 (zh) Data conversion method and computer storage medium
CN116913244A (zh) Speech synthesis method, device, and medium
Chen et al. Speaker-independent emotional voice conversion via disentangled representations
CN114464163A (zh) Speech synthesis model training method, apparatus, device, storage medium, and product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19926281

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19926281

Country of ref document: EP

Kind code of ref document: A1