WO2022257454A1 - Method, apparatus, terminal, and storage medium for synthesizing speech - Google Patents
Method, apparatus, terminal, and storage medium for synthesizing speech
- Publication number
- WO2022257454A1 · PCT/CN2022/071430 · CN2022071430W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sample
- text
- layer
- vector
- decoder
- Prior art date
Links
- 238000001308 synthesis method Methods 0.000 title abstract description 6
- 238000001228 spectrum Methods 0.000 claims abstract description 143
- 238000000034 method Methods 0.000 claims abstract description 113
- 230000008569 process Effects 0.000 claims abstract description 79
- 238000012549 training Methods 0.000 claims abstract description 78
- 238000012545 processing Methods 0.000 claims abstract description 74
- 238000004821 distillation Methods 0.000 claims abstract description 21
- 239000013598 vector Substances 0.000 claims description 142
- 230000003595 spectral effect Effects 0.000 claims description 35
- 230000002194 synthesizing effect Effects 0.000 claims description 32
- 230000008878 coupling Effects 0.000 claims description 30
- 238000010168 coupling process Methods 0.000 claims description 30
- 238000005859 coupling reaction Methods 0.000 claims description 30
- 238000005315 distribution function Methods 0.000 claims description 20
- 230000006870 function Effects 0.000 claims description 20
- 238000004590 computer program Methods 0.000 claims description 19
- 230000015654 memory Effects 0.000 claims description 14
- 230000008521 reorganization Effects 0.000 claims description 7
- 230000001537 neural effect Effects 0.000 claims description 5
- 230000015572 biosynthetic process Effects 0.000 abstract description 25
- 238000003786 synthesis reaction Methods 0.000 abstract description 25
- 238000013527 convolutional neural network Methods 0.000 description 13
- 230000009466 transformation Effects 0.000 description 12
- 238000010586 diagram Methods 0.000 description 8
- 230000002441 reversible effect Effects 0.000 description 4
- 238000009825 accumulation Methods 0.000 description 3
- 239000011159 matrix material Substances 0.000 description 3
- 238000000844 transformation Methods 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000004422 calculation algorithm Methods 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 238000013461 design Methods 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000005855 radiation Effects 0.000 description 2
- 230000009467 reduction Effects 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 230000001143 conditioned effect Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 230000001960 triggered effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- the present application belongs to the technical field of speech synthesis, and in particular relates to a method, device, terminal and storage medium for synthesizing speech.
- the current end-to-end speech synthesis technology can already produce high-quality speech.
- compared with the two traditional approaches of parametric speech synthesis and concatenative speech synthesis, the end-to-end speech synthesis method does not require complex modeling of speech and can produce more natural speech.
- the inventor realizes that the existing end-to-end speech synthesis models are generally divided into autoregressive models and non-autoregressive models.
- an autoregressive model produces its output step by step, that is, each step's output depends on the previous outputs. Such a model is therefore very time-consuming both during training and in actual use, and its speech synthesis efficiency is low.
- a non-autoregressive model produces its output fully in parallel. Although this kind of model synthesizes speech quickly, it needs to be distilled, which makes the quality of the finally synthesized speech very low.
- One of the purposes of the embodiments of the present application is to provide a method, device, terminal, and storage medium for synthesizing speech, so as to solve the technical problem that speech synthesis models in the prior art are very time-consuming both during training and in actual use, and synthesize speech with low efficiency and low quality.
- the embodiment of the present application provides a method for synthesizing speech, wherein the method includes:
- acquiring text information; inputting the text information into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information, wherein the spectrum generation model is a non-autoregressive model that requires no distillation and includes an encoder, a length prediction network, and a decoder, and the training process and the actual use process of the decoder are inverse operations; and generating, based on the mel-spectrogram, speech information corresponding to the text information.
- the embodiment of the present application provides a device for synthesizing speech, wherein the device includes:
- an acquisition unit configured to acquire text information
- a processing unit configured to input the text information into a trained spectrum generation model for processing to obtain a Mel spectrogram corresponding to the text information, the spectrum generation model is a non-autoregressive model that does not require distillation,
- the spectrum generation model includes an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations;
- a generating unit configured to generate voice information corresponding to the text information based on the mel spectrogram.
- an embodiment of the present application provides a terminal for synthesizing speech, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements: acquiring text information; inputting the text information into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information, wherein the spectrum generation model is a non-autoregressive model that requires no distillation and includes an encoder, a length prediction network, and a decoder, and the training process and the actual use process of the decoder are inverse operations; and generating, based on the mel-spectrogram, speech information corresponding to the text information.
- the embodiment of the present application provides a computer-readable storage medium.
- the computer-readable storage medium may be non-volatile or volatile.
- the computer-readable storage medium stores a computer program that, when executed by a processor, implements: acquiring text information; inputting the text information into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information, wherein the spectrum generation model is a non-autoregressive model that requires no distillation and includes an encoder, a length prediction network, and a decoder, and the training process and the actual use process of the decoder are inverse operations; and generating, based on the mel-spectrogram, speech information corresponding to the text information.
- the embodiments of the present application have the following beneficial effects: text information is acquired; the text information is input into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information, wherein the spectrum generation model is a non-autoregressive model that requires no distillation and includes an encoder, a length prediction network, and a decoder, and the training process and the actual use process of the decoder are inverse operations; and speech information corresponding to the text information is generated based on the mel-spectrogram.
- in the above solution, the acquired text information is input into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information. Because the spectrum generation model is a non-autoregressive model that requires no distillation, that is, its output is fully parallel, the rate at which the model generates the mel-spectrogram is increased, thereby increasing the speed of speech synthesis. Furthermore, because the training process and the actual use process of the decoder in the spectrum generation model are inverse operations, the decoder learns how to extract the mel-spectrogram corresponding to the text information more accurately and quickly, so that the speech generated based on the mel-spectrogram is of high quality.
- Fig. 1 is a schematic flowchart of a method for synthesizing speech provided by an exemplary embodiment of the present application
- FIG. 2 is a schematic diagram of a network structure showing a spectrum generation model according to an exemplary embodiment of the present application
- FIG. 3 is a specific flowchart of step S102 of the method for synthesizing speech shown in an exemplary embodiment of the present application;
- Fig. 4 is a schematic diagram of a coupling layer network structure shown in an exemplary embodiment of the present application.
- FIG. 5 is a schematic flowchart of a method for synthesizing speech provided by another embodiment of the present application.
- Fig. 6 is a schematic diagram of a device for synthesizing speech provided by an embodiment of the present application.
- Fig. 7 is a schematic diagram of a terminal for synthesizing speech provided by another embodiment of the present application.
- "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone.
- plural refers to two or more than two.
- first and second are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the quantity of indicated technical features. Thus, a feature defined as “first” and “second” may explicitly or implicitly include one or more of these features. In the description of this embodiment, unless otherwise specified, “plurality” means two or more.
- the current end-to-end speech synthesis technology can already produce high-quality speech.
- compared with the two traditional approaches of parametric speech synthesis and concatenative speech synthesis, the end-to-end speech synthesis method does not require complex modeling of speech and can produce more natural speech.
- an autoregressive model produces its output step by step, that is, each step's output depends on the previous outputs; such a model is therefore very time-consuming both during training and in actual use, and its speech synthesis efficiency is low.
- a non-autoregressive model produces its output fully in parallel; although this kind of model synthesizes speech quickly, it needs to be distilled, which makes the quality of the finally synthesized speech very low.
- in view of this, the present application provides a method for synthesizing speech: acquiring text information; inputting the text information into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information, wherein the spectrum generation model is a non-autoregressive model that requires no distillation and includes an encoder, a length prediction network, and a decoder, and the training process and the actual use process of the decoder are inverse operations; and generating, based on the mel-spectrogram, speech information corresponding to the text information.
- in the above solution, the acquired text information is input into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information. Because the generation model is a non-autoregressive model that requires no distillation, that is, the output of the spectrum generation model is fully parallel, the rate at which the spectrum generation model generates the mel-spectrogram is increased, thereby increasing the speed of speech synthesis. Furthermore, because the training process and the actual use process of the decoder in the spectrum generation model are inverse operations, the decoder learns how to extract the mel-spectrogram corresponding to the text information more accurately and quickly, so that the speech generated based on the mel-spectrogram is of high quality.
- FIG. 1 is a schematic flowchart of a method for synthesizing speech provided by an exemplary embodiment of the present application.
- the execution subject of the method for synthesizing speech provided by this application is a terminal for synthesizing speech, where the terminal includes but is not limited to mobile terminals such as smart phones, tablet computers, computers, personal digital assistants (PDA), and desktop computers, and may also include various types of servers.
- the method for synthesizing speech as shown in Figure 1 may include S101 to S103, specifically as follows:
- S101: Acquire text information.
- the text information is text information to be converted into voice.
- the text information may include text data; for example, the text information may be a character, a word, a phrase, an expression, a sentence, or a combination of multiple sentences.
- the foregoing is only an exemplary description, and does not limit the format and content of the text information.
- the terminal processing the text information acquires the text information when detecting the speech synthesis instruction.
- the speech synthesis instruction refers to an instruction for instructing the terminal to perform speech synthesis according to the acquired text information.
- the speech synthesis instruction may be triggered by the user, for example, the user clicks a speech synthesis option in the terminal.
- the acquired text information may be the text information uploaded by the user to the terminal, or the terminal may obtain the text file corresponding to the file identification according to the file identification included in the speech synthesis instruction, and extract the text information in the text file.
- S102: Input the text information into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information, wherein the spectrum generation model is a non-autoregressive model that requires no distillation and includes an encoder, a length prediction network, and a decoder, and the training process and the actual use process of the decoder are inverse operations.
- a pre-trained spectrum generation model is pre-stored in the terminal.
- the spectrum generation model is trained based on a sample training set using a machine learning algorithm.
- the sample training set includes a plurality of sample texts, a sample mel-spectrogram corresponding to each sample text, and a sample spectrum length corresponding to each sample mel-spectrogram.
- the spectrum generation model may be pre-trained by the terminal, or the file corresponding to the spectrum generation model may be transplanted to the terminal after being pre-trained by other devices. That is to say, the execution subject for training the spectrum generation model may be the same as or different from the execution subject for performing speech synthesis using the spectrum generation model.
- This spectrum generation model is a non-autoregressive model that does not require distillation.
- An autoregressive model means that the output of the model is a step-by-step output, that is, the output of each step will depend on the previous output.
- the non-autoregressive model means that the output of the model is a fully parallel output. Compared with the autoregressive model, the non-autoregressive model greatly improves the speed of data processing.
- the traditional non-autoregressive model needs distillation during the training process, which leads to the low quality of the final speech synthesized by the traditional non-autoregressive model.
- the spectrum generation model in this scheme does not need distillation, and the training process and actual use process of the decoder in the spectrum generation model are inverse operations, so that the decoder learns how to extract the mel spectrum corresponding to the text information more accurately and quickly , so that the speech quality generated based on the mel spectrogram is high.
- FIG. 2 is a schematic diagram showing a network structure of a spectrum generation model according to an exemplary embodiment of the present application.
- Encoder represents the encoder in the spectrum generation model
- Length Predictor represents the length prediction network in the spectrum generation model
- each network layer in the entire box on the right of Figure 2 constitutes the decoder in the spectrum generation model.
- the decoder includes a normal distribution function layer, a split layer, a coupling block layer, an affine xform layer, an invertible linear layer, and a reshape layer connected in sequence.
- the bottom Z in the box represents the normal distribution function layer
- the split layer represents the separation layer in the decoder
- the coupling block layer represents the coupling layer in the decoder
- the affine xform layer represents the transformation layer in the decoder
- the invertible linear layer represents the invertible linear transformation layer in the decoder
- the reshape layer represents the data reorganization layer in the decoder.
- Mel-spectrogram represents the output mel-spectrogram.
- the training process and the actual use process of the decoder are inverse operations.
- exemplarily, during training, the data is input into the reshape layer and, after being processed by the invertible linear layer, the affine xform layer, the coupling block layer, and the split layer, is output by the normal distribution function layer.
- in actual use, the data is input into the normal distribution function layer and, after being processed by the split layer, the coupling block layer, the affine xform layer, and the invertible linear layer, is output by the reshape layer.
- the structure of a traditional decoder differs from that of the decoder designed in this solution: a traditional decoder feeds data into the same network layer in both the training process and the actual use process, and the data is output by the last network layer after passing through several network layers. This requires the decoder to reason step by step from the data fed into the first network layer, so reasoning errors occur easily, and once one network layer makes an error, the subsequent layers perpetuate it, which makes the accuracy of the output data low.
- in this solution, the structure of the decoder is special, and the training process and the actual use process are inverse operations, which effectively avoids this problem in the prior art; the decoder can accurately learn the speech features corresponding to the text information, and the trained decoder can then extract the mel-spectrogram corresponding to the text information accurately and quickly.
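- the following is a minimal, illustrative sketch (not the patent's own code) of one invertible affine coupling step, showing how the same parameters can be run in the training direction (mel-spectrogram to normally distributed latent) and in the exactly inverse synthesis direction; the layer sizes and network shape are assumptions.

```python
# Sketch of an invertible affine coupling step: forward() is the training
# direction (y -> z), inverse() is the synthesis direction (z -> y).
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        # small network NN() that predicts (log s, t) from one half of the input
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),            # outputs [log s, t]
        )

    def forward(self, y):                      # training direction: y -> z
        y_a, y_b = y.chunk(2, dim=-1)
        log_s, t = self.net(y_b).chunk(2, dim=-1)
        z_a = (y_a - t) * torch.exp(-log_s)    # matches z_a = (y_a - t) / s
        return torch.cat([z_a, y_b], dim=-1), log_s.sum()  # sum(log s) feeds the log-det term

    def inverse(self, z):                      # synthesis direction: z -> y
        z_a, z_b = z.chunk(2, dim=-1)
        log_s, t = self.net(z_b).chunk(2, dim=-1)
        y_a = z_a * torch.exp(log_s) + t       # exact inverse of forward()
        return torch.cat([y_a, z_b], dim=-1)

coupling = AffineCoupling(dim=8)
y = torch.randn(2, 10, 8)                      # stand-in for a mel-spectrogram slice
z, _ = coupling(y)
assert torch.allclose(coupling.inverse(z), y, atol=1e-5)   # the inverse recovers y
```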
- text information is processed based on an encoder, a length prediction network, and a decoder in the spectrum generation model to obtain a mel-spectrogram corresponding to the text information.
- the mel spectrogram corresponding to the text information is converted into the speech information corresponding to the text information.
- the mel spectrogram is input into a trained neural vocoder for processing to obtain the speech information corresponding to the text information.
- the voice information includes the audio corresponding to the text information.
- the trained neural vocoder may be a trained WaveGlow model.
- WaveGlow is a generative model that generates audio by sampling from a distribution.
- the WaveGlow model is modeled based on the distribution of audio samples conditioned on the Mel spectrogram, that is, multiple sample Mel spectrograms and the sample speech corresponding to each sample Mel spectrogram are used as the training set for training.
- during training, the network parameters corresponding to the WaveGlow model are adjusted until the trained WaveGlow model is obtained.
- the mel-spectrogram is input into the trained WaveGlow model and processed by its affine coupling layers and convolutional layers, and the speech information corresponding to the mel-spectrogram, that is, the speech information corresponding to the text information, is output.
- the description here is only for illustration and not for limitation.
- using the WaveGlow model to convert the mel-spectrogram into the final audio improves the quality of the finally synthesized speech.
- the WaveGlow model is also a fully parallel network, so this solution realizes fully parallel end-to-end speech synthesis.
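- as a hedged sketch of this last step, a pretrained WaveGlow-style neural vocoder could be called as follows; the `load_pretrained_waveglow` helper and the `infer(mel, sigma)` interface are assumptions for illustration, not an API documented by this patent.

```python
import torch

def mel_to_audio(mel: torch.Tensor) -> torch.Tensor:
    """mel: (batch, n_mels, frames) -> waveform: (batch, samples)."""
    waveglow = load_pretrained_waveglow()        # hypothetical loader for a trained vocoder
    waveglow.eval()
    with torch.no_grad():
        audio = waveglow.infer(mel, sigma=0.9)   # sample a latent and run the vocoder's inverse flow
    return audio
```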
- the terminal obtains text information; the text information is input into the trained spectrum generation model for processing, and the Mel spectrogram corresponding to the text information is obtained.
- the spectrum generation model is a non-autoregressive model that requires no distillation and includes an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations; based on the mel-spectrogram, the speech information corresponding to the text information is generated.
- the acquired text information is input into the trained spectrum generation model for processing, and the mel spectrum corresponding to the text information is obtained.
- the generation model is a non-autoregressive model without distillation, that is, the The output of the spectrum generation model is a fully parallel output, which improves the rate at which the spectrum generation model generates the mel-spectrogram, thereby increasing the speed of speech synthesis. Furthermore, since the training process and the actual use process of the decoder in the spectrum generation model are inverse operations, the decoder learns how to extract the mel-spectrogram corresponding to the text information more accurately and quickly, so that based on the The speech generated by the mel spectrogram is of high quality.
- FIG. 3 is a specific flow chart of step S102 of the method for synthesizing speech shown in an exemplary embodiment of the present application; in some possible implementations of the present application, the above S102 may include S1021 to S1023, specifically as follows:
- S1021 Use the encoder to encode the text information to obtain a text vector corresponding to the text information.
- the encoder may include three layers of convolutional neural networks (CNN) and one layer of long short-term memory network (LSTM). Each CNN layer is followed by a ReLU activation layer, a batch normalization layer, and a dropout layer. The role of the LSTM network is to capture the contextual relationships in the text information.
- the Input Text shown in Figure 2 is the input entry, through which the text information is first input into the encoder; the CNN layers and the LSTM layer in the encoder encode the text information to generate the text vector corresponding to the text information.
- the text vector is used to characterize and summarize the content of the text information. The description here is only for illustration and not for limitation.
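- as an illustrative sketch only (the embedding layer, dimensions, and hyperparameters below are assumptions, not the patent's exact configuration), such an encoder could look as follows:

```python
# Minimal sketch of the encoder: three Conv1d blocks, each followed by ReLU,
# batch normalization, and dropout, then an LSTM that captures context.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256, dropout: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                nn.ReLU(), nn.BatchNorm1d(dim), nn.Dropout(dropout),
            ) for _ in range(3)
        ])
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                  # (B, T1) integer tokens
        x = self.embed(token_ids).transpose(1, 2)  # (B, dim, T1) for Conv1d
        for conv in self.convs:
            x = conv(x)
        x, _ = self.lstm(x.transpose(1, 2))        # (B, T1, dim) text vectors
        return x
```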
- S1022 Predict the text vector through the length prediction network to obtain the spectral length of the speech corresponding to the text vector.
- the length prediction network is used to generate the spectral length corresponding to the phoneme of each character in the text information; the spectral lengths corresponding to all characters are added, and the resulting sum is the spectral length of the speech corresponding to the text vector, that is, the spectral length of the speech corresponding to the text information.
- the text vector may include a text vector corresponding to each character in the text information.
- the text vector may also include a text vector corresponding to each word segment in the text information.
- the text information includes 10 characters, and the spectrum length of the speech corresponding to the 10 characters may be 50 frames. Since the length of the speech is not the same as the length of the text, only the length of the text can be obtained during inference, but it is not known what length of speech should be generated. Therefore, a length prediction network is needed to predict the length of the final synthesized speech.
- the length prediction network includes two CNN layers and one accumulation layer.
- each CNN layer is connected to a ReLU layer, a batch regularization layer and a dropout layer.
- the function of the accumulation layer is to accumulate the lengths corresponding to all the phonemes to obtain the final spectral length of the speech corresponding to the text vector.
- the text vector corresponding to the text information is input into the length prediction network, the CNN layer in the length prediction network processes the text vector, and outputs the spectrum length corresponding to the text vector corresponding to each word segment in the text information.
- the spectrum length corresponding to the text vector corresponding to each word segment can be represented in a sequence.
- for example, if the output is [1, 2.1, 3.2], the accumulation layer sums the sequence, and the sum is the spectral length of the speech corresponding to the text vector. It is worth noting that, because the spectral length cannot be a decimal at inference time, it may be rounded up or down. For example, when the output is [1, 2.1, 3.2], the sum 1 + 2.1 + 3.2 = 6.3 is rounded up, and the spectral length of the speech corresponding to the text vector is 7. The description here is only for illustration and not for limitation.
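- the sketch below illustrates such a length prediction network (two Conv1d blocks, each followed by ReLU, batch normalization, and dropout, plus an accumulation step and rounding at inference time); the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LengthPredictor(nn.Module):
    def __init__(self, dim: int = 256, dropout: float = 0.1):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(), nn.BatchNorm1d(dim), nn.Dropout(dropout),
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(), nn.BatchNorm1d(dim), nn.Dropout(dropout),
        )
        self.proj = nn.Linear(dim, 1)

    def forward(self, text_vec):                          # (B, T1, dim) text vectors
        h = self.convs(text_vec.transpose(1, 2)).transpose(1, 2)
        per_token = torch.relu(self.proj(h)).squeeze(-1)  # per-token lengths, e.g. [1, 2.1, 3.2]
        return per_token, per_token.sum(dim=-1)           # accumulation layer: total spectral length

predictor = LengthPredictor()
per_token, total = predictor(torch.randn(2, 10, 256))
T2 = torch.ceil(total).long()                             # round up at inference time (6.3 -> 7)
```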
- the network of the decoder is a generation network based on the flow (Flow) model.
- Flow can transform a simple distribution into a complex distribution through a series of invertible transformations, for example converting a normally distributed feature vector into a mel-spectrogram. Because the transformations are invertible, the training process converts the mel-spectrogram into a normally distributed feature vector.
- the text vector obtained by encoding the text information based on the encoder, and the spectrum length obtained by predicting the text vector based on the length prediction network are both used as the input of the decoder.
- the decoder includes a normal distribution function layer, a split layer, a coupling block layer, an affine xform layer, an invertible linear layer, and a reshape layer connected in sequence.
- the coupling block layer includes multiple network layers. Please refer to FIG. 4 .
- FIG. 4 is a schematic diagram of a coupling layer network structure shown in an exemplary embodiment of the present application.
- the coupling block layer includes a three-layer convolutional neural network (ConvNet, Conv) and an attention layer (attention).
- in actual use, the text vector and the spectrum length are input into the normal distribution function layer for processing; finally, the reshape layer performs the last processing step and outputs the final processing result, that is, the output mel-spectrogram.
- the above S1023 may include S10231-S10234, specifically as follows:
- S10231: Process the spectrum length through the normal distribution function layer to obtain a first feature vector conforming to a normal distribution.
- the spectrum length is input into the normal distribution function layer for processing, and the normal distribution function layer outputs a first feature vector conforming to a normal distribution.
- exemplarily, Z is initialized as a feature vector conforming to a normal distribution, for example a normal distribution of shape (B, T2*10, 8), where B represents the batch size, T2 represents the spectrum length predicted by the length prediction network, and 8 represents the dimension on the channel. Substituting the values of B and T2 yields the first feature vector conforming to a normal distribution; the channel dimension may be set and adjusted according to the actual situation.
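- a minimal sketch of this initialization (the batch size and spectral length below are hypothetical values):

```python
import torch

B, T2 = 2, 50                    # hypothetical batch size and predicted spectral length
z = torch.randn(B, T2 * 10, 8)   # first feature vector: z ~ N(0, I) with shape (B, T2*10, 8)
```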
- S10232 Input the first feature vector into the split layer for processing to obtain a second feature vector.
- the split layer divides the first feature vector output by the normal distribution function layer into two equal parts along the channel dimension, for example y_a and y_b: if the original dimension is 2D, the first D dimensions are y_a and the last D dimensions are y_b.
- the split layer processes y_a to obtain the second feature vector; optionally, y_b is input into the subsequent coupling block layer for processing.
- S10233 Based on the coupling block layer, the affine xform layer, and the invertible linear layer, perform an invertible transformation on the text vector and the second feature vector to obtain a third feature vector.
- the output of the coupling block layer is also equally divided into two parts in the channel dimension, recorded as log s and t.
- the output of the coupling block layer is input into the affine xform layer for processing, and the obtained processing result is input into the invertible linear layer for invertible transformation to obtain the third feature vector.
- exemplarily, y_b is input into the coupling block layer for processing, yielding (log s, t) = NN(y_b). The text vector is input into the coupling block layer; its dimension is (B, T1, D), where T1 represents the length of the input text information.
- the second feature vector is input into the coupling block layer; its dimension is (B, T2, D), where T2 represents the spectrum length.
- the text vector is processed by a layer of CNN to output its tensor in the channel dimension.
- the second feature vector is processed by two layers of CNN, and the tensor of the second feature vector in the channel dimension is output.
- the tensor output for the text vector has the same channel dimension as the tensor corresponding to the second feature vector.
- because the two tensors have different lengths in the time dimension, the attention layer is used here to align them.
- the attention layer uses the tensor corresponding to the text vector as key and value, and uses the tensor corresponding to the second feature vector as query. After that, the output of the attention layer is sent to a 1x1 CNN network, and the output of the CNN network is used as the output of the coupling block layer.
- the output of the coupling block layer is passed to the affine xform layer.
- the specific processing of the affine xform layer is realized by the following formulas:
s = exp(log s),   (1)
z_a = (y_a - t) / s,   (2)
z_b = y_b,   (3)
z = concat(z_a, z_b),   (4)
- the z in the above (4) is the output of the affine xform layer. The output of the affine xform layer is input into the invertible linear layer for an invertible transformation. Exemplarily, the z input to the invertible linear layer is multiplied by an invertible matrix, that is, an invertible transformation is performed, to obtain the third feature vector.
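- the sketch below is a hedged illustration of one such coupling-block step: the text tensor provides key/value, the spectrum-side tensor provides the query, an attention layer aligns the two time axes, a 1x1 convolution produces (log s, t), the affine transform applies equations (1) to (4), and an invertible linear layer mixes channels. All sizes, and the use of a plain linear layer as the "invertible matrix," are assumptions.

```python
import torch
import torch.nn as nn

class CouplingStep(nn.Module):
    def __init__(self, channels: int = 8, text_dim: int = 256, d_model: int = 64):
        super().__init__()
        half = channels // 2
        self.text_conv = nn.Conv1d(text_dim, d_model, 1)          # one CNN layer for the text vector
        self.spec_convs = nn.Sequential(nn.Conv1d(half, d_model, 1), nn.ReLU(),
                                        nn.Conv1d(d_model, d_model, 1))  # two CNN layers for y_b
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out_conv = nn.Conv1d(d_model, channels, 1)            # 1x1 CNN -> (log s, t)
        self.inv_linear = nn.Linear(channels, channels, bias=False)  # channel mixing (assumed invertible weight)

    def forward(self, y, text_vec):            # y: (B, T2*10, 8), text_vec: (B, T1, text_dim)
        y_a, y_b = y.chunk(2, dim=-1)
        kv = self.text_conv(text_vec.transpose(1, 2)).transpose(1, 2)   # key/value from the text tensor
        q = self.spec_convs(y_b.transpose(1, 2)).transpose(1, 2)        # query from the spectrum-side tensor
        aligned, _ = self.attn(q, kv, kv)                               # align the two time axes
        log_s, t = self.out_conv(aligned.transpose(1, 2)).transpose(1, 2).chunk(2, dim=-1)
        z_a = (y_a - t) / torch.exp(log_s)     # (1)-(2): s = exp(log s), z_a = (y_a - t) / s
        z = torch.cat([z_a, y_b], dim=-1)      # (3)-(4): z_b = y_b, z = concat(z_a, z_b)
        return self.inv_linear(z)              # invertible linear transformation
```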
- S10234 Perform data reorganization on the third feature vector through the reshape layer to obtain the mel spectrum.
- performing data reorganization on the third feature vector means transforming the dimension corresponding to the third feature vector.
- for example, the dimension of the third feature vector is transformed from (B, T2*10, 8) to (B, T2, 80).
- 80 indicates that the third feature vector is 80-dimensional. Since what is obtained after the dimension transformation is the mel-spectrogram, the dimension of the third feature vector is in fact the early-stage dimension of the mel-spectrogram. It is worth noting that the values of the dimensions can be set and adjusted according to the actual situation, which is not limited here.
- in the above manner, the dimension at initialization is small, and the reshape layer finally increases the dimension, that is, the small dimension is adjusted to a large one, in order to reduce the amount of computation of the invertible linear layer.
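- a one-line sketch of the data reorganization described above (the values are illustrative):

```python
import torch

B, T2 = 2, 50
x = torch.randn(B, T2 * 10, 8)   # output of the invertible layers, small channel dimension
mel = x.reshape(B, T2, 80)       # data reorganization: (B, T2*10, 8) -> (B, T2, 80)
```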
- FIG. 5 is a schematic flowchart of a method for synthesizing speech provided by another embodiment of the present application. It mainly involves the process of obtaining a spectrum generation model before performing the process of synthesizing speech as shown in FIG. 1 .
- the method includes:
- S201 Obtain a sample training set, where the sample training set includes multiple sample texts, a sample mel-spectrogram corresponding to each sample text, and a sample spectrum length corresponding to each sample mel-spectrogram.
- a plurality of sample texts and a sample mel-spectrogram corresponding to each sample text may be collected in the network, and a sample spectrum length of the sample mel-spectrogram corresponding to each sample text may be determined.
- a preset number of sample texts in the sample training set, together with the sample mel-spectrogram corresponding to each sample text and the sample spectrum length corresponding to each sample mel-spectrogram, are used as the training set, and the remaining data in the sample training set, excluding the training set, are used as the test set.
- S202 Encode each sample text with an initial encoder to obtain a sample text vector corresponding to each sample text.
- each sample text in the training set is encoded by an initial encoder to obtain a sample text vector corresponding to each sample text in the training set.
- the network structure of the initial encoder is the same as that of the encoder in the trained spectral generation model.
- S203 Predict each sample text vector through the initial length prediction network to obtain the actual spectrum length of the speech corresponding to each sample text vector.
- the network structure of the initial length prediction network is the same as that of the length prediction network in the trained spectrum generation model.
- For the specific process of processing the sample text vector by the initial length prediction network refer to the specific process in S1022 above, which will not be repeated here.
- the decoder mainly learns how to convert the Mel spectrogram into a normally distributed feature vector.
- the logarithm of the probability density function of the mel-spectrogram can be obtained from the specific invertible transformations f_i, as shown in the following formula:
log p_Y(y) = log φ(z) + Σ_{i=1}^{L} log |det(J_{f_i})|,   (5)
- p_Y(y) refers to the probability density function of the mel-spectrogram, φ(z) refers to the probability density function of the Gaussian distribution, L refers to the number of transformations, and J_{f_i} is the Jacobian matrix of the i-th transformation f_i.
- Flow-based models usually design each transformation's Jacobian matrix to be triangular, so that its determinant is easy to compute. The entire model can then be trained by maximizing the above log-likelihood of the mel-spectrogram.
- the part enclosed by the dotted line in Fig. 3 represents each reversible transformation f_i, and there are K transformations in total. Exemplarily, K may be 12. The description here is only for illustration and not for limitation.
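- a minimal sketch of the training objective implied by formula (5): the mel-spectrogram is mapped to a latent z through the invertible layers and the negative log-likelihood is minimized. The `flow_forward` function (returning z and the accumulated log-determinant per example) is a hypothetical stand-in for the decoder's training-direction pass.

```python
import math
import torch

def negative_log_likelihood(mel, flow_forward):
    z, log_det_sum = flow_forward(mel)                  # training direction: mel -> z
    log_phi = -0.5 * (z ** 2 + math.log(2 * math.pi))   # log density of a standard normal, elementwise
    return -(log_phi.sum(dim=(1, 2)) + log_det_sum).mean()   # this is the "second loss value"
```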
- the network structure of the initial decoder is the same as that of the decoder in the trained spectrum generation model. It’s just that in the training process, each sample text vector, the actual spectral length of the speech corresponding to each sample text vector, and the sample mel spectrogram corresponding to each sample text are first input into the reshape layer in the initial decoder.
- the reshape layer performs dimensionality reduction processing on the sample mel-spectrogram.
- the results of the dimensionality reduction processing and the sample text vector are then processed in sequence by the remaining layers, and the obtained results are input to the normal distribution function layer.
- the normal distribution function layer determines the sample feature vector of the normal distribution corresponding to the sample text based on the result and the actual spectrum length of the speech corresponding to the sample text vector.
- the specific processing process of the split layer, coupling block layer, affine xform layer and invertible linear layer can refer to the specific description in S102, and will not be repeated here.
- the loss value may include a first loss value and a second loss value. The first loss value is the loss value between the actual spectral length of the speech corresponding to each sample text vector and the sample spectral length corresponding to that sample text vector; the second loss value is determined based on the normally distributed sample feature vector corresponding to each sample text.
- the sample spectrum length corresponding to each sample mel-spectrogram is the sample spectrum length of the sample text vector corresponding to the sample mel-spectrogram.
- a loss value between the actual spectral length of speech corresponding to each sample text vector and the sample spectral length corresponding to each sample text vector is calculated based on a preset loss function, and the loss value is recorded as a first loss value.
- the second loss value is the negative of the value of the above formula (5), that is, the second loss value is the negative of log p_Y(y).
- for the first loss value, when the first loss value does not satisfy the first preset condition, the weights and parameters of the initial length prediction network are adjusted, and training of the initial length prediction network continues.
- when the first loss value satisfies the first preset condition, training of the initial length prediction network is stopped, and the trained initial length prediction network is used as the length prediction network in the finally trained spectrum generation model.
- the first preset condition is that the first loss value is less than or equal to a preset first loss value threshold.
- when the first loss value is greater than the first loss value threshold, the weights and parameters of the initial length prediction network are adjusted, and training of the initial length prediction network continues.
- when the first loss value is less than or equal to the first loss value threshold, training of the initial length prediction network is stopped, and the trained initial length prediction network is used as the length prediction network in the finally trained spectrum generation model.
- the description here is only for illustration and not for limitation.
- for the second loss value, when the second loss value does not satisfy the second preset condition, the weights and parameters of the initial decoder are adjusted, and training of the initial decoder continues.
- when the second loss value satisfies the second preset condition, training of the initial decoder is stopped, and the trained initial decoder is used as the decoder in the finally trained spectrum generation model.
- the second preset condition is that the second loss value is less than or equal to a preset second loss value threshold.
- when the second loss value is greater than the second loss value threshold, the weights and parameters of the initial decoder are adjusted, and training of the initial decoder continues.
- when the second loss value is less than or equal to the second loss value threshold, training of the initial decoder is stopped, and the trained initial decoder is used as the decoder in the finally trained spectrum generation model.
- the description here is only for illustration and not for limitation.
- when the first loss value and the second loss value both satisfy their preset conditions, training is stopped, and the spectrum generation model is generated based on the trained initial encoder, the trained initial length prediction network, and the trained initial decoder.
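- a hedged sketch of one training step combining the two losses described above; the MSE length loss, the helper `negative_log_likelihood_of`, and the threshold names are illustrative assumptions, not the patent's prescribed choices.

```python
import torch
import torch.nn.functional as F

def train_step(batch, encoder, length_predictor, decoder, optimizer):
    text, sample_mel, sample_len = batch
    text_vec = encoder(text)
    _, pred_len = length_predictor(text_vec)
    loss_1 = F.mse_loss(pred_len, sample_len.float())                   # first loss value (length loss)
    loss_2 = negative_log_likelihood_of(decoder, sample_mel, text_vec)  # second loss value (hypothetical helper)
    optimizer.zero_grad()
    (loss_1 + loss_2).backward()
    optimizer.step()
    return loss_1.item(), loss_2.item()

# Training continues until loss_1 <= LEN_LOSS_THRESHOLD and loss_2 <= NLL_LOSS_THRESHOLD
# (both thresholds preset in advance), after which the trained modules form the spectrum generation model.
```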
- the structure of the traditional decoder is different from the structure of the decoder designed in this solution: the traditional decoder inputs data into the same network layer during the training process and the actual use process, and after being processed by several network layers, the data is output by the last network layer.
- the structure of the decoder is special, and the training process and the actual use process are inverse operations, which effectively avoids such problems in the prior art, and can accurately learn the speech features corresponding to the text information. Furthermore, based on the trained decoder, the mel-spectrogram corresponding to the text information can be extracted accurately and quickly.
- FIG. 6 is a schematic diagram of an apparatus for synthesizing speech provided by an embodiment of the present application.
- the units included in the device are used to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 5 .
- for details not described here, please refer to the relevant descriptions in the embodiments corresponding to FIG. 1 to FIG. 5.
- for ease of description, only the parts related to this embodiment are shown in Figure 6. See Figure 6, which includes:
- An acquisition unit 310 configured to acquire text information
- the processing unit 320 is configured to input the text information into the trained spectrum generation model for processing to obtain the Mel spectrogram corresponding to the text information, and the spectrum generation model is a non-autoregressive model without distillation , the spectrum generation model includes an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations;
- the generating unit 330 is configured to generate voice information corresponding to the text information based on the mel spectrogram.
- the decoder includes a normal distribution function layer, a split layer, a coupling block layer, an affine xform layer, an invertible linear layer, and a reshape layer connected in sequence.
- the processing unit 320 includes:
- An encoding unit configured to encode the text information by the encoder to obtain a text vector corresponding to the text information
- a prediction unit configured to predict the text vector through the length prediction network to obtain the spectral length of the speech corresponding to the text vector
- a decoding unit configured to input the text vector and the spectrum length into the decoder for decoding processing to obtain the mel spectrogram.
- the decoding unit is specifically configured to: process the spectrum length through the normal distribution function layer to obtain a first feature vector of a normal distribution; input the first feature vector into the split layer for processing to obtain a second feature vector; perform an invertible transformation on the text vector and the second feature vector based on the coupling block layer, the affine xform layer, and the invertible linear layer to obtain a third feature vector; and perform data reorganization on the third feature vector through the reshape layer to obtain the mel-spectrogram.
- the device further includes a training unit, the training unit is specifically used for:
- the sample training set includes a plurality of sample texts, a sample mel spectrogram corresponding to each sample text, and a sample spectrum length corresponding to each sample mel spectrogram;
- inputting each sample text vector, the actual spectral length of the speech corresponding to each sample text vector, and the sample mel-spectrogram corresponding to each sample text into the initial decoder for processing, to obtain the normally distributed sample feature vector corresponding to each sample text;
- stopping training when the loss value satisfies the preset condition, and generating the spectrum generation model based on the trained initial encoder, the trained initial length prediction network, and the trained initial decoder.
- the loss value includes a first loss value and a second loss value
- the first loss value is the difference between the actual spectral length of the speech corresponding to each sample text vector and the sample spectral length corresponding to each sample text vector
- the second loss value is determined based on the sample feature vector of the normal distribution corresponding to each sample text.
- the generating unit 330 is specifically configured to:
- the mel spectrogram is input into a trained neural vocoder for processing to obtain the speech information.
- FIG. 7 is a schematic diagram of a terminal for synthesizing speech provided by another embodiment of the present application.
- the terminal 4 of this embodiment includes: a processor 40 , a memory 41 and a computer program 42 stored in the memory 41 and operable on the processor 40 .
- when the processor 40 executes the computer program 42, the steps in the above embodiments of the methods for synthesizing speech are implemented, such as S101 to S103 shown in FIG. 1; alternatively, when the processor 40 executes the computer program 42, the functions of the units in the above embodiments are implemented, for example, the functions of the units 310 to 330 shown in FIG. 6.
- the computer program 42 can be divided into one or more units, and the one or more units are stored in the memory 41 and executed by the processor 40 to complete the present application.
- the one or more units may be a series of computer instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 42 in the terminal 4 .
- the computer program 42 may be divided into an acquisition unit, a processing unit, and a generation unit, and the specific functions of each unit are as described above.
- the terminal may include, but not limited to, a processor 40 and a memory 41 .
- FIG. 7 is only an example of the terminal 4 and does not constitute a limitation on the terminal; the terminal may include more or fewer components than shown in the figure, or combine certain components, or have different components.
- for example, the terminal may also include an input and output device, a network access device, a bus, and the like.
- the so-called processor 40 can be a central processing unit (Central Processing Unit, CPU), and can also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
- a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
- the storage 41 may be an internal storage unit of the terminal, such as a hard disk or memory of the terminal.
- the memory 41 can also be an external storage terminal of the terminal, such as a plug-in hard disk equipped on the terminal, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash memory card (Flash Card) etc.
- the memory 41 may also include both an internal storage unit of the terminal and an external storage terminal.
- the memory 41 is used to store the computer instructions and other programs and data required by the terminal.
- the memory 41 can also be used to temporarily store data that has been output or will be output.
- the embodiment of the present application also provides a computer storage medium.
- the computer storage medium may be non-volatile or volatile.
- the computer storage medium stores a computer program; when the computer program is executed by a processor, the steps in the above embodiments of the method for synthesizing speech are implemented.
- the present application also provides a computer program product, which, when the computer program product runs on a terminal, causes the terminal to execute the steps in the above embodiments of the method for synthesizing speech.
- the embodiment of the present application also provides a chip or integrated circuit, the chip or integrated circuit including a processor configured to call and run a computer program from a memory, so that a terminal in which the chip or integrated circuit is installed executes the steps in the above embodiments of the method for synthesizing speech.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The present application is applicable to the technical field of speech synthesis and provides a method, apparatus, terminal, and storage medium for synthesizing speech. The method includes: acquiring text information; inputting the text information into a trained spectrum generation model for processing to obtain a mel-spectrogram corresponding to the text information, the spectrum generation model being a non-autoregressive model that requires no distillation and including an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations; and generating, based on the mel-spectrogram, speech information corresponding to the text information. In the above solution, because the generation model is a non-autoregressive model that requires no distillation, the rate at which the spectrum generation model generates the mel-spectrogram is increased, thereby increasing the speed of speech synthesis; moreover, the spectrum generation model can extract the mel-spectrogram corresponding to the text information accurately and quickly, so that the speech generated based on the mel-spectrogram is of high quality.
Description
This application claims priority to the Chinese patent application No. 202110641868.X, entitled "Method, apparatus, terminal, and storage medium for synthesizing speech", filed with the Patent Office of the China National Intellectual Property Administration on June 9, 2021, the entire contents of which are incorporated herein by reference.
The present application belongs to the technical field of speech synthesis, and in particular relates to a method, apparatus, terminal, and storage medium for synthesizing speech.
With the development of deep learning, current end-to-end speech synthesis technology can already produce high-quality speech. Compared with the two traditional speech synthesis approaches of parametric synthesis and concatenative synthesis, the end-to-end method does not require complex modeling of speech and can produce more natural speech.
The inventor realized that existing end-to-end speech synthesis models are generally divided into autoregressive models and non-autoregressive models. An autoregressive model produces its output step by step, that is, each step's output depends on the previous outputs; such a model is therefore very time-consuming both during training and in actual use, and its speech synthesis efficiency is low. A non-autoregressive model produces its output fully in parallel; although such a model synthesizes speech quickly, it needs to be distilled, which makes the quality of the finally synthesized speech very low.
Therefore, an end-to-end speech synthesis model that is fast both during training and in actual use, synthesizes speech efficiently, and produces high-quality speech is urgently needed.
One of the purposes of the embodiments of the present application is to provide a method, apparatus, terminal, and storage medium for synthesizing speech, so as to solve the technical problem that speech synthesis models in the prior art are very time-consuming both during training and in actual use, and synthesize speech with low efficiency and low quality.
In a first aspect, an embodiment of the present application provides a method for synthesizing speech, the method including:
acquiring text information;
inputting the text information into a trained spectrum generation model for processing to obtain a mel-spectrogram corresponding to the text information, the spectrum generation model being a non-autoregressive model that requires no distillation and including an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations; and
generating, based on the mel-spectrogram, speech information corresponding to the text information.
In a second aspect, an embodiment of the present application provides an apparatus for synthesizing speech, the apparatus including:
an acquisition unit configured to acquire text information;
a processing unit configured to input the text information into a trained spectrum generation model for processing to obtain a mel-spectrogram corresponding to the text information, the spectrum generation model being a non-autoregressive model that requires no distillation and including an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations; and
a generating unit configured to generate, based on the mel-spectrogram, speech information corresponding to the text information.
In a third aspect, an embodiment of the present application provides a terminal for synthesizing speech, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
acquiring text information;
inputting the text information into a trained spectrum generation model for processing to obtain a mel-spectrogram corresponding to the text information, the spectrum generation model being a non-autoregressive model that requires no distillation and including an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations; and
generating, based on the mel-spectrogram, speech information corresponding to the text information.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile, and which stores a computer program that, when executed by a processor, implements:
acquiring text information;
inputting the text information into a trained spectrum generation model for processing to obtain a mel-spectrogram corresponding to the text information, the spectrum generation model being a non-autoregressive model that requires no distillation and including an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations; and
generating, based on the mel-spectrogram, speech information corresponding to the text information.
Compared with the prior art, the embodiments of the present application have the following beneficial effects: text information is acquired; the text information is input into a trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information, the spectrum generation model being a non-autoregressive model that requires no distillation and including an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations; and speech information corresponding to the text information is generated based on the mel-spectrogram. In the above solution, the acquired text information is input into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information. Because the generation model is a non-autoregressive model that requires no distillation, that is, the output of the spectrum generation model is fully parallel, the rate at which the spectrum generation model generates the mel-spectrogram is increased, thereby increasing the speed of speech synthesis. Furthermore, because the training process and the actual use process of the decoder in the spectrum generation model are inverse operations, the decoder learns how to extract the mel-spectrogram corresponding to the text information more accurately and quickly, so that the speech generated based on the mel-spectrogram is of high quality.
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; a person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a method for synthesizing speech provided by an exemplary embodiment of the present application;
Fig. 2 is a schematic diagram of the network structure of a spectrum generation model according to an exemplary embodiment of the present application;
Fig. 3 is a specific flowchart of step S102 of the method for synthesizing speech shown in an exemplary embodiment of the present application;
Fig. 4 is a schematic diagram of the coupling-layer network structure shown in an exemplary embodiment of the present application;
Fig. 5 is a schematic flowchart of a method for synthesizing speech provided by another embodiment of the present application;
Fig. 6 is a schematic diagram of an apparatus for synthesizing speech provided by an embodiment of the present application;
Fig. 7 is a schematic diagram of a terminal for synthesizing speech provided by another embodiment of the present application.
To make the purposes, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, in the description of the embodiments of the present application, "a plurality of" means two or more than two.
Hereinafter, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly specifying the quantity of the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. In the description of this embodiment, unless otherwise specified, "a plurality of" means two or more.
With the development of deep learning, current end-to-end speech synthesis technology can already produce high-quality speech. Compared with the two traditional speech synthesis approaches of parametric synthesis and concatenative synthesis, the end-to-end method does not require complex modeling of speech and can produce more natural speech.
However, existing end-to-end speech synthesis models are generally divided into autoregressive models and non-autoregressive models. An autoregressive model produces its output step by step, that is, each step's output depends on the previous outputs; such a model is therefore very time-consuming both during training and in actual use, and its speech synthesis efficiency is low. A non-autoregressive model produces its output fully in parallel; although such a model synthesizes speech quickly, it needs to be distilled, which makes the quality of the finally synthesized speech very low.
Therefore, an end-to-end speech synthesis model that is fast both during training and in actual use, synthesizes speech efficiently, and produces high-quality speech is urgently needed.
In view of this, the present application provides a method for synthesizing speech: acquiring text information; inputting the text information into a trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information, the spectrum generation model being a non-autoregressive model that requires no distillation and including an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations; and generating, based on the mel-spectrogram, speech information corresponding to the text information. In the above solution, the acquired text information is input into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information. Because the generation model is a non-autoregressive model that requires no distillation, that is, the output of the spectrum generation model is fully parallel, the rate at which the spectrum generation model generates the mel-spectrogram is increased, thereby increasing the speed of speech synthesis. Furthermore, because the training process and the actual use process of the decoder in the spectrum generation model are inverse operations, the decoder learns how to extract the mel-spectrogram corresponding to the text information more accurately and quickly, so that the speech generated based on the mel-spectrogram is of high quality.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a method for synthesizing speech provided by an exemplary embodiment of the present application. The execution subject of the method for synthesizing speech provided by the present application is a terminal for synthesizing speech, where the terminal includes but is not limited to mobile terminals such as smart phones, tablet computers, computers, personal digital assistants (PDA), and desktop computers, and may also include various types of servers. The method for synthesizing speech shown in Fig. 1 may include S101 to S103, specifically as follows:
S101: Acquire text information.
The text information is text to be converted into speech. The text information may include text data; for example, it may be a character, a word, a phrase, an expression, a sentence, or a combination of multiple sentences. The foregoing is only an exemplary description and does not limit the format or content of the text information.
The terminal that processes the text information acquires the text information when it detects a speech synthesis instruction. The speech synthesis instruction is an instruction that instructs the terminal to perform speech synthesis according to the acquired text information, and may be triggered by a user, for example, by the user clicking a speech synthesis option on the terminal. The acquired text information may be text information uploaded by the user to the terminal, or the terminal may obtain, according to a file identifier contained in the speech synthesis instruction, the text file corresponding to the file identifier and extract the text information from the text file.
S102: Input the text information into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information, the spectrum generation model being a non-autoregressive model that requires no distillation and including an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations.
In this embodiment, a pre-trained spectrum generation model is stored in the terminal in advance. The spectrum generation model is trained based on a sample training set using a machine learning algorithm, where the sample training set includes a plurality of sample texts, a sample mel-spectrogram corresponding to each sample text, and a sample spectral length corresponding to each sample mel-spectrogram.
It can be understood that the spectrum generation model may be trained in advance by the terminal, or may be trained in advance by another device, after which the file corresponding to the spectrum generation model is transplanted to the terminal. That is, the execution subject that trains the spectrum generation model may be the same as or different from the execution subject that uses the spectrum generation model to synthesize speech.
The spectrum generation model is a non-autoregressive model that requires no distillation. An autoregressive model produces its output step by step, that is, each step's output depends on the previous outputs, whereas a non-autoregressive model produces its output fully in parallel; compared with an autoregressive model, a non-autoregressive model greatly increases the speed of data processing. However, a traditional non-autoregressive model needs distillation during training, which makes the quality of the speech it finally synthesizes very low. The spectrum generation model in this solution requires no distillation, and the training process and the actual use process of its decoder are inverse operations, so the decoder learns how to extract the mel-spectrogram corresponding to the text information more accurately and quickly, and the speech generated based on the mel-spectrogram is of high quality.
Referring to Fig. 2, Fig. 2 is a schematic diagram of the network structure of the spectrum generation model according to an exemplary embodiment of the present application. As shown in Fig. 2, Encoder denotes the encoder in the spectrum generation model, Length Predictor denotes the length prediction network in the spectrum generation model, and the network layers inside the box on the right of Fig. 2 constitute the decoder in the spectrum generation model.
The decoder includes a normal distribution function layer, a split layer, a coupling block layer, an affine xform layer, an invertible linear layer, and a reshape layer connected in sequence. As shown in Fig. 2, Z at the bottom of the box denotes the normal distribution function layer, the split layer denotes the separation layer in the decoder, the coupling block layer denotes the coupling layer, the affine xform layer denotes the transformation layer, the invertible linear layer denotes the invertible linear transformation layer, and the reshape layer denotes the data reorganization layer. Mel-spectrogram denotes the output mel-spectrogram.
The training process and the actual use process of the decoder are inverse operations. Exemplarily, during training, data is input into the reshape layer and, after being processed by the invertible linear layer, the affine xform layer, the coupling block layer, and the split layer, is output by the normal distribution function layer. In actual use, data is input into the normal distribution function layer and, after being processed by the split layer, the coupling block layer, the affine xform layer, and the invertible linear layer, is output by the reshape layer.
The structure of a traditional decoder differs from that of the decoder designed in this solution: a traditional decoder feeds data into the same network layer in both the training process and the actual use process, and the data is output by the last network layer after passing through several network layers. This requires the decoder to reason step by step from the data fed into the first network layer until the last network layer produces the output, so reasoning errors easily occur; moreover, once one network layer reasons incorrectly, the subsequent layers perpetuate the error, and the accuracy of the output data is low. In this solution, the decoder has a special structure and its training process and actual use process are inverse operations, which effectively avoids this problem in the prior art; the decoder can accurately learn the speech features corresponding to the text information, and the trained decoder can then extract the mel-spectrogram corresponding to the text information accurately and quickly.
The text information is input into the trained spectrum generation model for processing. Exemplarily, the text information is processed by the encoder, the length prediction network, and the decoder in the spectrum generation model to obtain the mel-spectrogram corresponding to the text information.
S103: Generate, based on the mel-spectrogram, the speech information corresponding to the text information.
The mel-spectrogram corresponding to the text information is converted into the speech information corresponding to the text information. Exemplarily, the mel-spectrogram is input into a trained neural vocoder for processing to obtain the speech information corresponding to the text information, where the speech information includes the audio corresponding to the text information.
Exemplarily, the trained neural vocoder may be a trained WaveGlow model. WaveGlow is a generative model that generates audio by sampling from a distribution. During training, the WaveGlow model models the distribution of audio samples conditioned on mel-spectrograms, that is, it is trained on a training set of multiple sample mel-spectrograms and the sample speech corresponding to each sample mel-spectrogram, and the network parameters of the WaveGlow model are adjusted during training until a trained WaveGlow model is obtained.
The mel-spectrogram is input into the trained WaveGlow model; the affine coupling layers and convolutional layers in the WaveGlow model process the mel-spectrogram, the vectors produced by each of these layers are concatenated, and the speech information corresponding to the mel-spectrogram, that is, the speech information corresponding to the text information, is finally output. The description here is only illustrative and not limiting.
Using the WaveGlow model to convert the mel-spectrogram into the final audio improves the quality of the finally synthesized speech. Moreover, the WaveGlow model is also a fully parallel network, so this solution realizes fully parallel end-to-end speech synthesis.
In the above embodiment, the terminal acquires text information; the text information is input into the trained spectrum generation model for processing to obtain the mel-spectrogram corresponding to the text information, the spectrum generation model being a non-autoregressive model that requires no distillation and including an encoder, a length prediction network, and a decoder, wherein the training process and the actual use process of the decoder are inverse operations; and the speech information corresponding to the text information is generated based on the mel-spectrogram. In the above solution, because the generation model is a non-autoregressive model that requires no distillation, that is, its output is fully parallel, the rate at which the spectrum generation model generates the mel-spectrogram is increased, thereby increasing the speed of speech synthesis. Furthermore, because the training process and the actual use process of the decoder in the spectrum generation model are inverse operations, the decoder learns how to extract the mel-spectrogram corresponding to the text information more accurately and quickly, so that the speech generated based on the mel-spectrogram is of high quality.
请参见图3,图3是本申请一示例性实施例示出的合成语音的方法的步骤S102的具体流程图;在本申请一些可能的实现方式中,上述S102可包括S1021~S1023,具体如下:
S1021:通过该编码器对该文本信息进行编码,得到该文本信息对应的文本向量。
该编码器可以包括三层卷积神经网络(Convolutional Neural Networks,CNN)、一层长短期记忆网络(Long Short Term Memory networks,LSTM)。其中,每层CNN网络连接了一层激活函数层(Rectified Linear Units,ReLU)、一层批正则化层以及一层dropout层。其中,LSTM网络的作用在于捕捉文本信息中上下文之间的关系。
如图2中所示的Input Text即为输入的入口,将文本信息通过该入口先输入至编码器中,编码器中的CNN层、LSTM层对文本信息进行编码,生成该文本信息对应的文本向量。该文本向量用于表征、概括该文本信息的内容。此处仅为示例性说明,对此不做限定。
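下面给出编码器结构的一段示意性PyTorch实现,其中通道数、卷积核大小、dropout比例等超参数均为假设值,仅用于说明"三层CNN(每层后接ReLU、批正则化、dropout)加一层LSTM"的组织方式,并非本方案的确切配置。
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    def __init__(self, vocab_size=100, dim=256, kernel_size=5, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        convs = []
        for _ in range(3):                         # 三层CNN,每层后接ReLU、批正则化、dropout
            convs += [nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
                      nn.ReLU(),
                      nn.BatchNorm1d(dim),
                      nn.Dropout(dropout)]
        self.convs = nn.Sequential(*convs)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)   # 捕捉上下文之间的关系

    def forward(self, token_ids):                  # token_ids: (B, T1)
        x = self.embed(token_ids)                  # (B, T1, dim)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)  # Conv1d作用在时间维上
        text_vec, _ = self.lstm(x)                 # (B, T1, dim) 即文本向量
        return text_vec

out = EncoderSketch()(torch.randint(0, 100, (2, 7)))
print(out.shape)  # torch.Size([2, 7, 256])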
S1022:通过该长度预测网络对该文本向量进行预测,得到该文本向量对应的语音的频谱长度。
长度预测网络用于生成文本信息中每个文字对应的音素所对应的频谱长度,将每个文字对应的频谱长度相加,得到的和即为文本向量对应的语音的频谱长度,也就是文本信息对应的语音的频谱长度。
可选地,文本向量可以包括文本信息中每个文字对应的文本向量。文本向量也可以包括文本信息中每个分词对应的文本向量。
例如,文本信息包括10个文字,这10个文字对应的语音的频谱长度可能是50帧。由于语音的长度和文字的长度不是等同的,在推断的时候只能获取到文本的长度,却不知道应该生成什么长度的语音,因此需要一个长度预测网络来预测最终合成的语音的长度。
示例性地,长度预测网络包括两层CNN层和一层累加层。其中,每一层CNN层都连接着一层ReLU层、一层批正则化层以及一层dropout层。其中,累加层的作用在于将所有音素对应的长度累加起来,得到最终该文本向量对应的语音的频谱长度。
例如,将文本信息对应的文本向量输入长度预测网络中,长度预测网络中的CNN层对文本向量进行处理,输出文本信息中每个分词对应的文本向量所对应的频谱长度。可以用序列的方式表示每个分词对应的文本向量所对应的频谱长度。
例如,输出为[1,2.1,3.2],通过累加层进行求和,序列之和6.3即为最终该文本向量对应的语音的频谱长度。值得说明的是,在推断的时候由于频谱长度不可能是小数,因此可对求和结果向上或向下取整。例如,输出为[1,2.1,3.2]时,求和得6.3,向上取整后该文本向量对应的语音的频谱长度为7。此处仅为示例性说明,对此不做限定。
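上述累加与取整过程可以用几行示意代码说明,数值沿用上文的示例输出:
import math

per_token_len = [1.0, 2.1, 3.2]      # 长度预测网络对每个分词输出的频谱长度(示例数据)
total = sum(per_token_len)           # 累加层求和,约为6.3
spec_len = math.ceil(total)          # 推断时向上取整,得到频谱长度7
print(round(total, 1), spec_len)     # 6.3 7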
S1023:将该文本向量以及该频谱长度输入该解码器进行解码处理,得到该梅尔谱图。
解码器的网络是一个基于流(Flow)模型的生成网络。Flow可以通过一系列可逆变换将一个简单的分布转换为一个复杂的分布,例如将正态分布的特征向量转换为梅尔谱图。由于它是可逆变换,因此在训练过程中,就是把梅尔谱图转换成正态分布的特征向量。
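下面用一个最简单的仿射变换示意"可逆"的含义:推理方向把正态分布的特征变换为目标特征,训练方向则用其逆变换把目标特征还原为正态分布的特征,两个方向互为逆运算。其中s、t为假设的固定参数,仅作示意,并非本方案中实际的变换。
import torch

s, t = torch.tensor(2.0), torch.tensor(0.5)
z = torch.randn(4)                 # 正态分布的特征向量
y = s * z + t                      # 正向(实际使用):由z得到目标特征
z_rec = (y - t) / s                # 逆向(训练):由目标特征还原z
print(torch.allclose(z, z_rec))    # True,说明两个方向互为逆运算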
基于编码器对文本信息进行编码得到的文本向量,以及基于长度预测网络对文本向量进行预测得到的频谱长度,都作为该解码器的输入。
解码器包括依次连接的正态分布函数层、split层、coupling block层、affine xform层、invertible linear层、reshape层。其中,coupling block层又包括多个网络层。请参见图4,图4是本申请一示例性实施例示出的耦合层网络结构示意图。如图4所示,coupling block层包括三层卷积神经网络(ConvNet,Conv)和一层注意力层(attention)。
在实际使用过程中,将文本向量以及该频谱长度输入正态分布函数层进行处理,最终由reshape层进行最后一次处理后,输出最后的处理结果,即输出梅尔谱图。此处仅为示例性说明,对此不做限定。
可选地,在本申请一些可能的实现方式中,上述S1023可包括S10231~S10234,具体如下:
S10231:通过该正态分布函数层对该频谱长度进行处理,得到正态分布的第一特征向量。
将频谱长度输入正态分布函数层进行处理,该正态分布函数层输出符合正态分布的第一特征向量。
示例性地,初始化Z为一个符合正态分布的特征向量,例如正态分布(B,T_2*10,8)。其中,B表示批尺寸(batch size,B),T_2表示频谱长度,即长度预测网络预测出的频谱长度,8表示通道(channel)上的维度。将B与T_2的值代入即可得到符合正态分布的第一特征向量。值得说明的是,通道上的维度可根据实际情况进行设置、调整,此处仅为示例性说明,对此不做限定。
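以批尺寸B=2、频谱长度T_2=7为例(均为假设的示例值),正态分布函数层的采样可示意如下:
import torch

B, T2 = 2, 7                          # 批尺寸与长度预测网络给出的频谱长度(示例值)
z = torch.randn(B, T2 * 10, 8)        # 正态分布函数层:按(B, T2*10, 8)采样第一特征向量
print(z.shape)                        # torch.Size([2, 70, 8])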
S10232:将该第一特征向量输入该split层进行处理,得到第二特征向量。
split层将正态分布函数层输出的第一特征向量沿通道维度平均分为两部分,例如平均分为y_a和y_b。若原有维度是2D,那么前D维是y_a,后D维是y_b。split层对y_a进行处理,得到第二特征向量。可选地,将y_b输入后续的coupling block层进行处理。
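split层沿通道维的划分可示意如下,张量形状沿用上文假设的示例值:
import torch

z = torch.randn(2, 70, 8)                     # 第一特征向量,最后一维为通道维(假设)
y_a, y_b = torch.chunk(z, 2, dim=-1)          # 沿通道维平均分为两半:前D维为y_a,后D维为y_b
print(y_a.shape, y_b.shape)                   # torch.Size([2, 70, 4]) torch.Size([2, 70, 4])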
S10233:基于该coupling block层、该affine xform层以及所述invertible linear层,对文本向量以及第二特征向量进行可逆变换,得到第三特征向量。
coupling block层的输出也在通道维度上平均分为两部分,记为log s和t。将coupling block层的输出输入affine xform层进行处理,将得到的处理结果输入invertible linear层进行可逆变换,得到第三特征向量。
示例性地,将y_b输入coupling block层进行处理,得到(log s,t)=NN(y_b)。将文本向量输入coupling block层,该文本向量的维度为(B,T_1,D)。其中,T_1表示输入的文本信息的长度。将第二特征向量输入coupling block层,该第二特征向量的维度为(B,T_2,D)。其中,T_2表示频谱长度。在coupling block层中,文本向量经过一层CNN处理,输出其在通道维度上的张量。第二特征向量经过两层CNN处理,输出第二特征向量在通道维度上的张量。输出的文本向量对应的张量的维度与第二特征向量对应的张量维度相同。
由于两个张量在时间维度上长度不同,这里采用注意力层即attention层来使它们对齐。注意力层将文本向量对应的张量作为key和value,将第二特征向量对应的张量作为query。之后,将注意力层的输出传送至一个1x1的CNN网络中,CNN网络的输出作为该coupling block层的输出。
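下面的示意代码演示耦合层中利用注意力对齐两个时间长度不同的张量的思路,其中通道数、注意力头数等均为假设值,且直接使用PyTorch内置的多头注意力代替图4中的具体实现,仅作示意。
import torch
import torch.nn as nn

B, T1, T2, D = 2, 7, 70, 4
text = torch.randn(B, T1, D)          # 文本向量对应的张量(经一层CNN后,示意)
query_feat = torch.randn(B, T2, D)    # 第二特征向量对应的张量(经两层CNN后,示意)

attn = nn.MultiheadAttention(embed_dim=D, num_heads=1, batch_first=True)
aligned, _ = attn(query=query_feat, key=text, value=text)   # 注意力在时间维上对齐两个长度不同的张量
out = nn.Conv1d(D, D, kernel_size=1)(aligned.transpose(1, 2)).transpose(1, 2)  # 1x1的CNN作为耦合层输出
print(out.shape)                      # torch.Size([2, 70, 4])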
coupling block层的输出传入affine xform层。affine xform层具体的处理由下述公式实现:
s=exp(log s),(1)
z_a=(y_a-t)/s,(2)
z_b=y_b,(3)
z=concat(z_a,z_b),(4)
上述(4)中的z即为affine xform层的输出。将affine xform层的输出输入invertible linear层进行可逆变换。示例性地,对输入该invertible linear层的z乘以一个可逆矩阵,即进行一次可逆变换,得到第三特征向量。
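式(1)至式(4)以及invertible linear层的处理可用如下示意代码表示。其中log s与t在实际模型中由coupling block层根据y_b与文本向量计算得到,这里用随机张量代替;可逆矩阵也用随机正交矩阵示意,并非本方案的具体取值。
import torch

B, T2, half = 2, 70, 4
y_a = torch.randn(B, T2, half)
y_b = torch.randn(B, T2, half)
log_s = torch.randn(B, T2, half)          # 实际由coupling block层给出,此处为假设值
t = torch.randn(B, T2, half)

s = torch.exp(log_s)                      # 式(1)
z_a = (y_a - t) / s                       # 式(2)
z_b = y_b                                 # 式(3)
z = torch.cat([z_a, z_b], dim=-1)         # 式(4) concat

W = torch.linalg.qr(torch.randn(2 * half, 2 * half))[0]   # 正交矩阵必然可逆,示意可逆线性变换
z3 = z @ W                                # 第三特征向量
print(z3.shape)                           # torch.Size([2, 70, 8])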
S10234:通过该reshape层对该第三特征向量进行数据重组,得到该梅尔谱图。
示例性地,对第三特征向量进行数据重组,就是将第三特征向量对应的维度进行转换。例如,将第三特征向量的维度由(B,T_2*10,8)变换为(B,T_2,80)。其中,80表示梅尔谱图的特征维度是80维。由于转换维度后得到的就是梅尔谱图,因此第三特征向量的维度实际上就是梅尔谱图在数据重组前的维度。值得说明的是,维度的值可以根据实际情况进行设置、调整,对此不做限定。
上述方式中,初始化时的维度小,最后经过reshape层进行升维处理,即将小维度调整为大维度,是为了减小invertible linear层的计算量。
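reshape层的升维处理可示意如下,B、T_2沿用上文假设的示例值:
import torch

B, T2 = 2, 7
z3 = torch.randn(B, T2 * 10, 8)           # 第三特征向量,初始化时维度较小以减小invertible linear层的计算量
mel = z3.reshape(B, T2, 80)               # reshape层:升维得到(B, T2, 80)的梅尔谱图
print(mel.shape)                          # torch.Size([2, 7, 80])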
请参见图5,图5为本申请的另一个实施例提供的一种合成语音的方法的示意流程图。主要涉及在执行如图1所示的合成语音的过程之前,获得频谱生成模型的过程。该方法包括:
S201:获取样本训练集,该样本训练集包括多个样本文本、每个样本文本对应的样本梅尔谱图以及每个样本梅尔谱图对应的样本频谱长度。
可在网络中采集多个样本文本以及每个样本文本对应的样本梅尔谱图,并确定每个样本文本对应的样本梅尔谱图的样本频谱长度。将样本训练集中预设数量的样本文本、每个样本文本对应的样本梅尔谱图以及每个样本梅尔谱图对应的样本频谱长度,作为训练集,样本训练集中除训练集之外剩余的数据作为测试集。
S202:通过初始编码器对每个样本文本进行编码,得到每个样本文本对应的样本文本向量。
示例性地,通过初始编码器对训练集中的每个样本文本进行编码,得到训练集中的每个样本文本对应的样本文本向量。
初始编码器的网络结构与已训练好的频谱生成模型中的编码器的网络结构相同。初始编码器对样本文本进行编码的具体过程,可参考上述S1021中的具体过程,此处不再赘述。
S203:通过初始长度预测网络对每个样本文本向量进行预测,得到每个样本文本向量对应的语音的实际频谱长度。
初始长度预测网络的网络结构与已训练好的频谱生成模型中的长度预测网络的网络结构相同。初始长度预测网络对样本文本向量进行处理的具体过程,可参考上述S1022中的具体过程,此处不再赘述。
S204:将每个样本文本向量、每个样本文本向量对应的语音的实际频谱长度以及每个样本文本对应的样本梅尔谱图,输入到初始解码器中进行处理,得到每个样本文本对应的正态分布的样本特征向量。
解码器在训练的过程中,主要是学习如何将梅尔谱图转换成正态分布的特征向量。
可选地,可以根据具体的可逆变换f_i来得到梅尔谱图的概率密度函数的对数,如下式所示:
log p_Y(y)=log p_Z(z)+Σ_{i=1}^{K}log|det(J(f_i^{-1}))|,(5)
其中,y为样本梅尔谱图,z为y经各次逆变换映射得到的正态分布的特征向量,J(f_i^{-1})表示第i次逆变换的雅可比矩阵。为了减少计算,基于Flow的模型通常将变换的雅可比矩阵设计为一个三角阵。之后,就可通过最大化上式来最大化梅尔谱图的概率密度函数,这样就可以训练整个模型。图3虚线框住的地方代表每一次的可逆变换f_i,总共有K次变换。示例性地,K可以为12。此处仅为示例性说明,对此不做限定。
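下面的示意代码按式(5)的思路,用逐元素仿射变换演示对数似然的计算:先沿训练方向把(降维后的)样本梅尔谱图逐步逆变换回正态分布的特征,同时累加各次变换的对数雅可比行列式,最后加上先验对数概率得到log p_Y(y)。其中K、log s、t以及输入张量均为假设值,仅说明计算方式。
import math
import torch

torch.manual_seed(0)
K = 12
y = torch.randn(2, 70, 8)                      # 用随机张量代替训练用的(降维后的)样本梅尔谱图
log_det_sum = torch.zeros(())
z = y
for _ in range(K):
    log_s = torch.randn_like(z) * 0.1          # 实际由耦合层网络给出,此处为假设值
    t = torch.randn_like(z) * 0.1
    z = (z - t) * torch.exp(-log_s)             # 训练方向:逐次逆变换
    log_det_sum = log_det_sum + (-log_s).sum()  # 逐元素仿射的雅可比为对角(三角)阵,累加其对数行列式
prior_logp = (-0.5 * (z ** 2 + math.log(2 * math.pi))).sum()  # 标准正态先验的对数概率
log_py = prior_logp + log_det_sum               # 式(5):最大化它即可训练解码器
print(log_py.item())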
示例性地,初始解码器的网络结构与已训练好的频谱生成模型中的解码器的网络结构相同。只是在训练过程中,是将每个样本文本向量、每个样本文本向量对应的语音的实际频谱长度以及每个样本文本对应的样本梅尔谱图,首先输入到初始解码器中的reshape层,reshape层对样本梅尔谱图进行降维处理,基于invertible linear层、affine xform层、coupling block层以及split层依次对降维处理的结果以及样本文本向量处理后,得到的结果输入至正态分布函数层。正态分布函数层基于该结果以及样本文本向量对应的语音的实际频谱长度,确定样本文本对应的正态分布的样本特征向量。其中,split层、coupling block层、affine xform层以及invertible linear层的具体处理过程可参考S102中的具体描述,此处不再赘述。
S205:根据预设的损失函数计算损失值。
该损失值可以包括第一损失值和第二损失值,该第一损失值为每个样本文本向量对应的语音的实际频谱长度与每个样本文本向量对应的样本频谱长度之间的损失值,该第二损失值基于每个样本文本对应的正态分布的样本特征向量确定。
示例性地,每个样本梅尔谱图对应的样本频谱长度,即为该样本梅尔谱图对应的样本文本向量的样本频谱长度。基于预设的损失函数计算每个样本文本向量对应的语音的实际频谱长度与每个样本文本向量对应的样本频谱长度之间的损失值,并将该损失值记为第一损失值。第二损失值为上述(5)式的值的负数,即第二损失值为log p_Y(y)的负数。
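损失值的组成可示意如下。其中第一损失值用均方误差计算仅为假设的一种具体选择(本申请只限定其为预设的损失函数),第二损失值即上文log p_Y(y)的负数,数值均为示例。
import torch
import torch.nn.functional as F

pred_len = torch.tensor([49.2, 60.7])            # 长度预测网络输出的实际频谱长度(示例)
target_len = torch.tensor([50.0, 60.0])          # 样本频谱长度(示例)
loss1 = F.mse_loss(pred_len, target_len)         # 第一损失值:长度之间的损失(假设用MSE)

log_py = torch.tensor(-1234.5)                   # 按式(5)计算得到的对数似然(示例值)
loss2 = -log_py                                  # 第二损失值:log p_Y(y)的负数
print(loss1.item(), loss2.item())                # 分别用于长度预测网络与解码器的训练判断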
S206:当该损失值不满足预设条件时,调整该初始长度预测网络和/或该初始解码器的参数,并基于该样本训练集继续训练。
示例性地,当第一损失值不满足第一预设条件时,调整初始长度预测网络的权值和参数,并继续训练该初始长度预测网络。当第一损失值满足第一预设条件时,停止训练该初始长度预测网络,并将训练后的该初始长度预测网络作为最终训练好的频谱生成模型中的长度预测网络。假设第一预设条件为第一损失值小于或等于预设的第一损失值阈值。那么,当第一损失值大于第一损失值阈值时,调整初始长度预测网络的权值和参数,并继续训练该初始长度预测网络。当第一损失值小于或等于第一损失值阈值时,停止训练该初始长度预测网络,并将训练后的该初始长度预测网络作为最终训练好的频谱生成模型中的长度预测网络。此处仅为示例性说明,对此不做限定。
示例性地,当第二损失值不满足第二预设条件时,调整初始解码器的权值和参数,并继续训练该初始解码器。当第二损失值满足第二预设条件时,停止训练该初始解码器,并将训练后的该初始解码器作为最终训练好的频谱生成模型中的解码器。例如,假设第二预设条件为第二损失值小于或等于预设的第二损失值阈值。那么,当第二损失值大于第二损失值阈值时,调整初始解码器的权值和参数,并继续训练该初始解码器。当第二损失值小于或等于第二损失值阈值时,停止训练该初始解码器,并将训练后的该初始解码器作为最终训练好的频谱生成模型中的解码器。此处仅为示例性说明,对此不做限定。
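S206与S207描述的"按预设条件决定继续训练或停止训练"的控制流程,可用如下示意代码说明。其中的模型、阈值与优化器均为假设,仅演示"损失值大于阈值则调整参数继续训练、小于等于阈值则停止训练"的判断逻辑。
import torch

model = torch.nn.Linear(4, 1)                            # 用一个线性层代替待训练的网络(示意)
opt = torch.optim.Adam(model.parameters(), lr=0.1)
x, y = torch.randn(16, 4), torch.randn(16, 1)
threshold = 0.5                                          # 预设的损失值阈值(假设值)

for step in range(200):
    loss = torch.nn.functional.mse_loss(model(x), y)
    if loss.item() <= threshold:                         # 损失值满足预设条件:停止训练
        break
    opt.zero_grad()
    loss.backward()                                      # 不满足预设条件:调整参数并继续训练
    opt.step()
print("final loss:", loss.item())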
S207:当该损失值满足该预设条件时,停止训练,并基于训练后的初始编码器、训练后的初始长度预测网络以及训练后的初始解码器生成该频谱生成模型。
当第一损失值满足第一预设条件,且第二损失值满足第二预设条件时,停止训练,并基于训练后的初始编码器、训练后的初始长度预测网络以及训练后的初始解码器生成该频谱生成模型。
上述实施例中,传统的解码器的结构与本方案中设计的解码器的结构并不相同,且传统的解码器在训练过程中与实际使用过程中,都是将数据输入同一个网络层,经过若干个网络层处理后,都由最后一个网络层输出数据。这样需要解码器基于第一个网络层输入的数据去一步一步推理,最终由最后一个网络层输出,很容易出现推理错误,而且当一个网络层推理错误时,后面的网络层也会延续这种错误,最终导致输出的数据准确率很低。本方案中,解码器的结构特殊,且训练过程和实际使用过程是逆运算的过程,有效地避免了现有技术中存在的这种问题,可以准确地学习到文本信息对应的语音特征,进而基于训练好的解码器可以准确、快速地提取文本信息对应的梅尔谱图。
请参见图6,图6是本申请一实施例提供的一种合成语音的装置的示意图。该装置包括的各单元用于执行图1~图5对应的实施例中的各步骤。具体请参阅图1~图5各自对应的实施例中的相关描述。为了便于说明,仅示出了与本实施例相关的部分。参见图6,包括:
获取单元310,用于获取文本信息;
处理单元320,用于将所述文本信息输入到已训练的频谱生成模型中进行处理,得到所述文本信息对应的梅尔谱图,所述频谱生成模型为无需蒸馏的非自回归式的模型,所述频谱生成模型包括编码器、长度预测网络以及解码器,其中,所述解码器的训练过程和实际使用过程是逆运算的过程;
生成单元330,用于基于所述梅尔谱图,生成所述文本信息对应的语音信息。
可选地,所述解码器包括依次连接的正态分布函数层、split层、coupling block层、affine xform层、invertible linear层、reshape层。
可选地,所述处理单元320包括:
编码单元,用于通过所述编码器对所述文本信息进行编码,得到所述文本信息对应的文本向量;
预测单元,用于通过所述长度预测网络对所述文本向量进行预测,得到所述文本向量对应的语音的频谱长度;
解码单元,用于将所述文本向量以及所述频谱长度输入所述解码器进行解码处理,得到所述梅尔谱图。
可选地,所述解码单元具体用于:
通过所述正态分布函数层对所述频谱长度进行处理,得到正态分布的第一特征向量;
将所述第一特征向量输入所述split层进行处理,得到第二特征向量;
基于所述coupling block层、所述affine xform层以及所述invertible linear层,对所述文本向量以及所述第二特征向量进行可逆变换,得到第三特征向量;
通过所述reshape层对所述第三特征向量进行数据重组,得到所述梅尔谱图。
可选地,所述装置还包括训练单元,所述训练单元具体用于:
获取样本训练集,所述样本训练集包括多个样本文本、每个样本文本对应的样本梅尔谱图以及每个样本梅尔谱图对应的样本频谱长度;
通过初始编码器对每个样本文本进行编码,得到每个样本文本对应的样本文本向量;
通过初始长度预测网络对每个样本文本向量进行预测,得到每个样本文本向量对应的语音的实际频谱长度;
将所述每个样本文本向量、所述每个样本文本向量对应的语音的实际频谱长度以及所述每个样本文本对应的样本梅尔谱图,输入到初始解码器中进行处理,得到每个样本文本对应的正态分布的样本特征向量;
根据预设的损失函数计算损失值;
当所述损失值不满足预设条件时,调整所述初始长度预测网络和/或所述初始解码器的参数,并基于所述样本训练集继续训练;
当所述损失值满足所述预设条件时,停止训练,并基于训练后的初始编码器、训练后的初始长度预测网络以及训练后的初始解码器生成所述频谱生成模型。
可选地,所述损失值包括第一损失值和第二损失值,所述第一损失值为每个样本文本向量对应的语音的实际频谱长度与每个样本文本向量对应的样本频谱长度之间的损失值,所述第二损失值基于每个样本文本对应的正态分布的样本特征向量确定。
可选地,所述生成单元330具体用于:
将所述梅尔谱图输入到已训练的神经声码器中进行处理,得到所述语音信息。
请参见图7,图7是本申请另一实施例提供的合成语音的终端的示意图。如图7所示,该实施例的终端4包括:处理器40、存储器41以及存储在所述存储器41中并可在所述处理器40上运行的计算机程序42。所述处理器40执行所述计算机程序42时实现上述各个合成语音的方法实施例中的步骤,例如图1所示的S101至S103。或者,所述处理器40执行所述计算机程序42时实现上述各实施例中各单元的功能,例如图6所示单元310至330功能。
示例性地,所述计算机程序42可以被分割成一个或多个单元,所述一个或者多个单元被存储在所述存储器41中,并由所述处理器40执行,以完成本申请。所述一个或多个单元可以是能够完成特定功能的一系列计算机指令段,该指令段用于描述所述计算机程序42在所述终端4中的执行过程。例如,所述计算机程序42可以被分割为获取单元、处理单元以及生成单元,各单元具体功能如上所述。
所述终端可包括,但不仅限于,处理器40、存储器41。本领域技术人员可以理解,图7仅仅是终端4的示例,并不构成对终端的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如所述终端还可以包括输入输出终端、网络接入终端、总线等。
所称处理器40可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
所述存储器41可以是所述终端的内部存储单元,例如终端的硬盘或内存。所述存储器41也可以是所述终端的外部存储终端,例如所述终端上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,所述存储器41还可以既包括所述终端的内部存储单元也包括外部存储终端。所述存储器41用于存储所述计算机指令以及所述终端所需的其他程序和数据。所述存储器41还可以用于暂时地存储已经输出或者将要输出的数据。
本申请实施例还提供了一种计算机存储介质,计算机存储介质可以是非易失性的,也可以是易失性的,该计算机存储介质存储有计算机程序,该计算机程序被处理器执行时实现上述各个合成语音的方法实施例中的步骤。
本申请还提供了一种计算机程序产品,当计算机程序产品在终端上运行时,使得该终端执行上述各个合成语音的方法实施例中的步骤。
本申请实施例还提供了一种芯片或者集成电路,该芯片或者集成电路包括:处理器,用于从存储器中调用并运行计算机程序,使得安装有该芯片或者集成电路的终端执行上述各个合成语音的方法实施例中的步骤。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中,上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。另外,各功能单元、模块的具体名称也只是为了便于相互区分,并不用于限制本申请的保护范围。上述系统中单元、模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神范围,均应包含在本申请的保护范围之内。
Claims (20)
- 一种合成语音的方法,其中,包括:获取文本信息;将所述文本信息输入到已训练的频谱生成模型中进行处理,得到所述文本信息对应的梅尔谱图,所述频谱生成模型为无需蒸馏的非自回归式的模型,所述频谱生成模型包括编码器、长度预测网络以及解码器,其中,所述解码器的训练过程和实际使用过程是逆运算的过程;基于所述梅尔谱图,生成所述文本信息对应的语音信息。
- 如权利要求1所述的方法,其中,所述解码器包括依次连接的正态分布函数层、split层、coupling block层、affine xform层、invertible linear层、reshape层。
- 如权利要求2所述的方法,其中,所述将所述文本信息输入到已训练的频谱生成模型中进行处理,得到所述文本信息对应的梅尔谱图,包括:通过所述编码器对所述文本信息进行编码,得到所述文本信息对应的文本向量;通过所述长度预测网络对所述文本向量进行预测,得到所述文本向量对应的语音的频谱长度;将所述文本向量以及所述频谱长度输入所述解码器进行解码处理,得到所述梅尔谱图。
- 如权利要求3所述的方法,其中,所述将所述文本向量以及所述频谱长度输入所述解码器进行解码处理,得到所述梅尔谱图,包括:通过所述正态分布函数层对所述频谱长度进行处理,得到正态分布的第一特征向量;将所述第一特征向量输入所述split层进行处理,得到第二特征向量;基于所述coupling block层、所述affine xform层以及所述invertible linear层,对所述文本向量以及所述第二特征向量进行可逆变换,得到第三特征向量;通过所述reshape层对所述第三特征向量进行数据重组,得到所述梅尔谱图。
- 如权利要求1所述的方法,其中,所述将所述文本信息输入到已训练的频谱生成模型中进行处理,得到所述文本信息对应的梅尔谱图之前,所述方法还包括:获取样本训练集,所述样本训练集包括多个样本文本、每个样本文本对应的样本梅尔谱图以及每个样本梅尔谱图对应的样本频谱长度;通过初始编码器对每个样本文本进行编码,得到每个样本文本对应的样本文本向量;通过初始长度预测网络对每个样本文本向量进行预测,得到每个样本文本向量对应的语音的实际频谱长度;将所述每个样本文本向量、所述每个样本文本向量对应的语音的实际频谱长度以及所述每个样本文本对应的样本梅尔谱图,输入到初始解码器中进行处理,得到每个样本文本对应的正态分布的样本特征向量;根据预设的损失函数计算损失值;当所述损失值不满足预设条件时,调整所述初始长度预测网络和/或所述初始解码器的参数,并基于所述样本训练集继续训练;当所述损失值满足所述预设条件时,停止训练,并基于训练后的初始编码器、训练后的初始长度预测网络以及训练后的初始解码器生成所述频谱生成模型。
- 如权利要求5所述的方法,其中,所述损失值包括第一损失值和第二损失值,所述第一损失值为每个样本文本向量对应的语音的实际频谱长度与每个样本文本向量对应的样本频谱长度之间的损失值,所述第二损失值基于每个样本文本对应的正态分布的样本特征向量确定。
- 如权利要求1所述的方法,其中,所述基于所述梅尔谱图,生成所述文本信息对应的语音信息,包括:将所述梅尔谱图输入到已训练的神经声码器中进行处理,得到所述语音信息。
- 一种合成语音的装置,其中,包括:获取单元,用于获取文本信息;处理单元,用于将所述文本信息输入到已训练的频谱生成模型中进行处理,得到所述文本信息对应的梅尔谱图,所述频谱生成模型为无需蒸馏的非自回归式的模型,所述频谱生成模型包括编码器、长度预测网络以及解码器,其中,所述解码器的训练过程和实际使用过程是逆运算的过程;生成单元,用于基于所述梅尔谱图,生成所述文本信息对应的语音信息。
- 一种合成语音的终端,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其中,所述处理器执行所述计算机程序时实现:获取文本信息;将所述文本信息输入到已训练的频谱生成模型中进行处理,得到所述文本信息对应的梅尔谱图,所述频谱生成模型为无需蒸馏的非自回归式的模型,所述频谱生成模型包括编码器、长度预测网络以及解码器,其中,所述解码器的训练过程和实际使用过程是逆运算的过程;基于所述梅尔谱图,生成所述文本信息对应的语音信息。
- 如权利要求9所述的终端,其中,所述解码器包括依次连接的正态分布函数层、split层、coupling block层、affine xform层、invertible linear层、reshape层。
- 如权利要求10所述的终端,其中,所述将所述文本信息输入到已训练的频谱生成模型中进行处理,得到所述文本信息对应的梅尔谱图,包括:通过所述编码器对所述文本信息进行编码,得到所述文本信息对应的文本向量;通过所述长度预测网络对所述文本向量进行预测,得到所述文本向量对应的语音的频谱长度;将所述文本向量以及所述频谱长度输入所述解码器进行解码处理,得到所述梅尔谱图。
- 如权利要求11所述的终端,其中,所述将所述文本向量以及所述频谱长度输入所述解码器进行解码处理,得到所述梅尔谱图,包括:通过所述正态分布函数层对所述频谱长度进行处理,得到正态分布的第一特征向量;将所述第一特征向量输入所述split层进行处理,得到第二特征向量;基于所述coupling block层、所述affine xform层以及所述invertible linear层,对所述文本向量以及所述第二特征向量进行可逆变换,得到第三特征向量;通过所述reshape层对所述第三特征向量进行数据重组,得到所述梅尔谱图。
- 如权利要求9所述的终端,其中,所述将所述文本信息输入到已训练的频谱生成模型中进行处理,得到所述文本信息对应的梅尔谱图之前,所述方法还包括:获取样本训练集,所述样本训练集包括多个样本文本、每个样本文本对应的样本梅尔谱图以及每个样本梅尔谱图对应的样本频谱长度;通过初始编码器对每个样本文本进行编码,得到每个样本文本对应的样本文本向量;通过初始长度预测网络对每个样本文本向量进行预测,得到每个样本文本向量对应的语音的实际频谱长度;将所述每个样本文本向量、所述每个样本文本向量对应的语音的实际频谱长度以及所述每个样本文本对应的样本梅尔谱图,输入到初始解码器中进行处理,得到每个样本文本对应的正态分布的样本特征向量;根据预设的损失函数计算损失值;当所述损失值不满足预设条件时,调整所述初始长度预测网络和/或所述初始解码器的参数,并基于所述样本训练集继续训练;当所述损失值满足所述预设条件时,停止训练,并基于训练后的初始编码器、训练后的初始长度预测网络以及训练后的初始解码器生成所述频谱生成模型。
- 如权利要求13所述的终端,其中,所述损失值包括第一损失值和第二损失值,所述第一损失值为每个样本文本向量对应的语音的实际频谱长度与每个样本文本向量对应的样本频谱长度之间的损失值,所述第二损失值基于每个样本文本对应的正态分布的样本特征向量确定。
- 如权利要求9所述的终端,其中,所述基于所述梅尔谱图,生成所述文本信息对应的语音信息,包括:将所述梅尔谱图输入到已训练的神经声码器中进行处理,得到所述语音信息。
- 一种计算机可读存储介质,其中,所述计算机可读存储介质上存储有计算机程序,所述计算机程序被处理器执行时实现:获取文本信息;将所述文本信息输入到已训练的频谱生成模型中进行处理,得到所述文本信息对应的梅尔谱图,所述频谱生成模型为无需蒸馏的非自回归式的模型,所述频谱生成模型包括编码器、长度预测网络以及解码器,其中,所述解码器的训练过程和实际使用过程是逆运算的过程;基于所述梅尔谱图,生成所述文本信息对应的语音信息。
- 如权利要求16所述的计算机可读存储介质,其中,所述解码器包括依次连接的正态分布函数层、split层、coupling block层、affine xform层、invertible linear层、reshape层。
- 如权利要求17所述的计算机可读存储介质,其中,所述将所述文本信息输入到已训练的频谱生成模型中进行处理,得到所述文本信息对应的梅尔谱图,包括:通过所述编码器对所述文本信息进行编码,得到所述文本信息对应的文本向量;通过所述长度预测网络对所述文本向量进行预测,得到所述文本向量对应的语音的频谱长度;将所述文本向量以及所述频谱长度输入所述解码器进行解码处理,得到所述梅尔谱图。
- 如权利要求18所述的计算机可读存储介质,其中,所述将所述文本向量以及所述频谱长度输入所述解码器进行解码处理,得到所述梅尔谱图,包括:通过所述正态分布函数层对所述频谱长度进行处理,得到正态分布的第一特征向量;将所述第一特征向量输入所述split层进行处理,得到第二特征向量;基于所述coupling block层、所述affine xform层以及所述invertible linear层,对所述文本向量以及所述第二特征向量进行可逆变换,得到第三特征向量;通过所述reshape层对所述第三特征向量进行数据重组,得到所述梅尔谱图。
- 如权利要求16所述的计算机可读存储介质,其中,所述将所述文本信息输入到已训练的频谱生成模型中进行处理,得到所述文本信息对应的梅尔谱图之前,所述方法还包括:获取样本训练集,所述样本训练集包括多个样本文本、每个样本文本对应的样本梅尔谱图以及每个样本梅尔谱图对应的样本频谱长度;通过初始编码器对每个样本文本进行编码,得到每个样本文本对应的样本文本向量;通过初始长度预测网络对每个样本文本向量进行预测,得到每个样本文本向量对应的语音的实际频谱长度;将所述每个样本文本向量、所述每个样本文本向量对应的语音的实际频谱长度以及所述每个样本文本对应的样本梅尔谱图,输入到初始解码器中进行处理,得到每个样本文本对应的正态分布的样本特征向量;根据预设的损失函数计算损失值;当所述损失值不满足预设条件时,调整所述初始长度预测网络和/或所述初始解码器的参数,并基于所述样本训练集继续训练;当所述损失值满足所述预设条件时,停止训练,并基于训练后的初始编码器、训练后的初始长度预测网络以及训练后的初始解码器生成所述频谱生成模型。