WO2021127978A1 - Speech synthesis method and apparatus, computer device, and storage medium - Google Patents

Speech synthesis method and apparatus, computer device, and storage medium

Info

Publication number
WO2021127978A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
synthesized
part information
spectrum
imaginary part
Prior art date
Application number
PCT/CN2019/127911
Other languages
English (en)
French (fr)
Inventor
黄东延
盛乐园
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司
Priority to CN201980003188.6A (granted as CN111316352B)
Priority to PCT/CN2019/127911 (WO2021127978A1)
Priority to US17/117,148 (granted as US11763796B2)
Publication of WO2021127978A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324 Details of processing therefor
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • This application relates to the field of speech synthesis technology, and in particular to a speech synthesis method, device, computer equipment and storage medium.
  • Speech synthesis technology refers to the process of obtaining synthesized speech based on the speech text to be synthesized.
  • In the process of speech synthesis, deep generative models have greatly improved the quality of the synthesized speech; for example, WaveNet shows excellent performance compared with traditional speech synthesizers.
  • However, WaveNet needs to generate speech sampling points during synthesis, and WaveNet is an autoregressive model. Its autoregressive nature makes speech synthesis slow, and the need to generate a large number of speech sampling points further slows down synthesis and complicates the process.
  • a speech synthesis method, comprising: obtaining a speech text to be synthesized; obtaining a Mel spectrum corresponding to the speech text to be synthesized according to the speech text to be synthesized; inputting the Mel spectrum into a complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized, where the complex spectrum includes real part information and imaginary part information; and obtaining the synthesized speech corresponding to the speech text to be synthesized according to the complex spectrum.
  • a speech synthesis device, comprising: a text acquisition module, configured to acquire a speech text to be synthesized; a first spectrum module, configured to obtain a Mel spectrum corresponding to the speech text to be synthesized according to the speech text to be synthesized; a second spectrum module, used to input the Mel spectrum into the complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and a speech synthesis module, used to obtain the synthesized speech corresponding to the speech text to be synthesized according to the complex spectrum.
  • a computer device includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the following steps: acquiring the speech text to be synthesized; obtaining the Mel spectrum corresponding to the speech text to be synthesized according to the speech text to be synthesized; inputting the Mel spectrum into the complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and obtaining the synthesized speech corresponding to the speech text to be synthesized according to the complex spectrum.
  • a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the following steps: obtain the speech text to be synthesized; obtain the Mel spectrum corresponding to the speech text to be synthesized according to the speech text to be synthesized; input the Mel spectrum into a complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and obtain the synthesized speech corresponding to the speech text to be synthesized according to the complex spectrum.
  • With the above speech synthesis method, device, computer equipment, and computer-readable storage medium, the speech text to be synthesized is first obtained; the Mel spectrum corresponding to the speech text to be synthesized is then obtained according to the speech text to be synthesized; the Mel spectrum is input into a complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and finally the synthesized speech corresponding to the speech text to be synthesized is obtained according to the complex spectrum.
  • Figure 1 is an implementation flowchart of a speech synthesis method in an embodiment
  • Figure 2 is an implementation flowchart of step 106 in an embodiment
  • Figure 3 is an implementation flowchart of a speech synthesis method in an embodiment
  • Figure 4 is an implementation flowchart of step 304 in an embodiment
  • Figure 5 is an implementation flowchart of step 312 in an embodiment
  • Figure 6 is a schematic diagram of training a complex neural network in an embodiment
  • Figure 7 is a block diagram of a speech synthesis device in another embodiment
  • Figure 8 is a block diagram of the second spectrum module 706 in an embodiment
  • Figure 9 is a block diagram of a speech synthesis device in an embodiment
  • Figure 10 is a structural block diagram of a computer device in an embodiment.
  • a speech synthesis method is provided.
  • the execution subject of the speech synthesis method described in the embodiment of the present invention is a device capable of implementing the speech synthesis method described in the embodiment of the present invention.
  • Devices can include but are not limited to terminals and servers.
  • Terminals include mobile terminals and desktop terminals.
  • Mobile terminals include but are not limited to mobile phones, tablets and laptops.
  • Desktop terminals include but are not limited to desktop computers and vehicle-mounted computers.
  • Servers include high-performance computers and high-performance computer clusters.
  • the speech synthesis method specifically includes the following steps:
  • Step 102 Acquire the speech text to be synthesized.
  • the speech text to be synthesized is the text corresponding to the speech to be synthesized.
  • In this embodiment of the present invention, speech is synthesized from the speech text to be synthesized, thereby achieving the purpose of speech synthesis.
  • Step 104 Obtain the Mel spectrum corresponding to the speech text to be synthesized according to the speech text to be synthesized.
  • The Mel spectrum is one way of representing a speech spectrum: an ordinary speech spectrum is a large spectrogram; filtering the speech frequencies with a Mel filter yields a relatively small spectrogram, and that smaller spectrogram is the Mel spectrum.
  • The speech text to be synthesized is input into a sound spectrum network. The sound spectrum network includes an encoder and a decoder: the encoder obtains hidden-layer features from the speech text to be synthesized, and the decoder predicts the Mel spectrum from the hidden-layer features corresponding to the speech text to be synthesized.
  • the encoder includes a character vector unit, a convolution unit, and a bidirectional LSTM unit.
  • the speech text to be synthesized is encoded by the character vector unit into a character vector of fixed dimension (for example, 512 dimensions); the character vector is input into the convolution unit (for example, 3 convolution layers), which extracts the context features of the character vector; the context features extracted by the convolution unit are input into the bidirectional LSTM unit to obtain the encoding features.
  • the decoder can be an autoregressive recurrent neural network; the decoder predicts the Mel spectrum based on the encoding features output by the bidirectional LSTM unit.
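The encoder just described resembles a Tacotron 2-style text encoder. Below is a minimal PyTorch sketch of such an encoder, assuming the example dimensions given in the text (512-dimensional character vectors, 3 convolution layers); the class name, vocabulary size, and kernel size are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Character vector unit -> convolution unit -> bidirectional LSTM unit."""
    def __init__(self, vocab_size=100, dim=512):
        super().__init__()
        self.char_embedding = nn.Embedding(vocab_size, dim)   # 512-dim character vectors
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(dim),
                nn.ReLU(),
            ) for _ in range(3)                               # 3 convolution layers
        ])
        # Half the width per direction so the concatenated output stays 512-dim.
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                  # char_ids: (batch, text_len)
        x = self.char_embedding(char_ids)         # (batch, text_len, dim)
        x = self.convs(x.transpose(1, 2))         # context features, (batch, dim, text_len)
        encoded, _ = self.bilstm(x.transpose(1, 2))
        return encoded                            # encoding features for the decoder
```

An autoregressive decoder (not sketched here) would then consume these encoding features to predict the Mel spectrum frame by frame.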
  • Step 106 Input the Mel spectrum into a complex neural network to obtain a complex spectrum corresponding to the speech text to be synthesized, where the complex spectrum includes real part information and imaginary part information.
  • the complex neural network takes Mel spectrum as input and complex spectrum as output.
  • the network structure of the complex neural network includes the U-net network structure.
  • the real part information and the imaginary part information of the complex spectrum can be regarded as two images, that is, the output of the complex neural network is regarded as two spectrum images.
  • Step 108 Obtain the synthesized speech corresponding to the speech text to be synthesized according to the complex frequency spectrum.
  • The synthesized speech corresponding to the speech text to be synthesized can be obtained from the complex spectrum. It should be noted that, since the complex spectrum includes real part information and imaginary part information, the final speech is synthesized from both the real part information and the imaginary part information; compared with methods that synthesize speech only from the real part information, the speech synthesized by the method of this embodiment of the present invention retains more speech information and will therefore be more realistic.
  • obtaining the synthesized speech corresponding to the speech text to be synthesized according to the complex spectrum in step 108 includes: processing the complex spectrum using an inverse short-time Fourier transform to obtain the synthesized speech corresponding to the speech text to be synthesized.
  • Speech itself is a one-dimensional time-domain signal, and it is difficult to see the frequency variation of the speech from the time-domain signal.
  • A Fourier transform can change the speech from the time domain to the frequency domain. Although the frequency distribution of the speech can then be seen, the time-domain information is lost, and it is difficult to see the time-domain information of the speech from its frequency-domain distribution.
  • To solve this problem, many time-frequency analysis methods have emerged.
  • the short-time Fourier transform is a very common time-frequency domain analysis method, and the inverse short-time Fourier transform is the inverse process of the short-time Fourier transform.
  • the short-time Fourier transform can change the speech from the time domain to the frequency domain, and the inverse short-time Fourier transform can restore the speech in the frequency domain to the time domain.
  • Using an inverse short-time Fourier transform (function) to restore speech in the frequency domain to the time domain is simpler than using an autoregressive model to synthesize speech.
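A minimal sketch of this restoration step, assuming SciPy and illustrative STFT settings (the patent does not specify the sample rate, window, or hop length): the two "spectrum images" output by the network are assembled into one complex array and inverted in a single call.

```python
import numpy as np
from scipy.signal import istft

# Stand-ins for the real and imaginary outputs of the complex neural network
# (for nperseg=1024 the one-sided spectrum has 513 frequency bins).
real_part = np.random.randn(513, 200).astype(np.float32)
imag_part = np.random.randn(513, 200).astype(np.float32)

complex_spec = real_part + 1j * imag_part
# The iSTFT parameters must match the forward STFT used during training.
_, waveform = istft(complex_spec, fs=22050, nperseg=1024, noverlap=768)
```

A single vectorized inverse transform like this replaces the sample-by-sample generation loop of an autoregressive model, which is the simplicity argument made above.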
  • In the method above, the speech text to be synthesized is first obtained; then the Mel spectrum corresponding to the speech text to be synthesized is obtained according to the speech text to be synthesized; the Mel spectrum is input into the complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and finally the synthesized speech corresponding to the speech text to be synthesized is obtained according to the complex spectrum. Compared with autoregressive waveform generation, this has lower complexity and higher synthesis efficiency.
  • the complex neural network includes a down-sampling network and an up-sampling network
  • the up-sampling network includes a real part deconvolution kernel and an imaginary part deconvolution kernel.
  • Step 106A Input the Mel spectrum into a down-sampling network in the complex neural network, and obtain the spectrum characteristics corresponding to the Mel spectrum output by the down-sampling network.
  • The down-sampling network includes multiple layers, and each layer is provided with a convolution kernel. The convolution kernel of each layer extracts features from the input of that layer, continuously mining deeper features and realizing the transformation from a large size to a small size.
  • Step 106B Input the spectral characteristics corresponding to the Mel spectrum into the up-sampling network.
  • the obtained spectral characteristics are input to the up-sampling network in the complex neural network, so that the up-sampling network obtains the complex frequency spectrum according to the spectral characteristics.
  • Step 106C the real part deconvolution kernel in the upsampling network processes the spectral features corresponding to the Mel spectrum to obtain the real part information corresponding to the speech text to be synthesized.
  • a deconvolution kernel is provided in the upsampling network, and the deconvolution kernel performs a deconvolution operation.
  • the deconvolution is transposed convolution to realize the transformation from a small size to a large size.
  • Step 106D the imaginary part deconvolution kernel in the upsampling network processes the spectral features corresponding to the Mel spectrum to obtain the imaginary part information corresponding to the speech text to be synthesized.
  • two deconvolution kernels are set in the up-sampling network, specifically the real part deconvolution kernel and the imaginary part deconvolution kernel. The real part deconvolution kernel processes the spectral features to obtain the real part information corresponding to the speech text to be synthesized, and the imaginary part deconvolution kernel processes the spectral features to obtain the imaginary part information corresponding to the speech text to be synthesized.
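As a rough PyTorch sketch of this structure: a convolutional down-sampling network followed by an up-sampling network with two transposed-convolution heads, one for the real part and one for the imaginary part. Depths, channel counts, and the omission of U-net skip connections are simplifications; only the two-headed deconvolution layout comes from the text, and a real implementation would also map Mel bins to the larger number of linear-frequency bins.

```python
import torch
import torch.nn as nn

class ComplexSpectrumNet(nn.Module):
    """Down-sampling conv stack plus real/imaginary deconvolution heads (sketch)."""
    def __init__(self, base=32):
        super().__init__()
        # Down-sampling network: each convolution halves the spatial size.
        self.down = nn.Sequential(
            nn.Conv2d(1, base, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        def head():
            # Up-sampling path: transposed convolutions restore the size.
            return nn.Sequential(
                nn.ConvTranspose2d(base * 2, base, kernel_size=4, stride=2, padding=1),
                nn.ReLU(),
                nn.ConvTranspose2d(base, 1, kernel_size=4, stride=2, padding=1),
            )
        self.real_head = head()   # real part deconvolution kernels
        self.imag_head = head()   # imaginary part deconvolution kernels

    def forward(self, mel):      # mel: (batch, 1, bins, frames), sizes divisible by 4
        feats = self.down(mel)   # spectral features of the Mel spectrum
        return self.real_head(feats), self.imag_head(feats)
```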
  • a method for training the complex neural network is provided. As shown in FIG. 3, before obtaining the speech text to be synthesized in step 314, the method further includes:
  • Step 302 Obtain training voice.
  • the training speech is the speech used to train the complex neural network.
  • Step 304 Obtain the Mel spectrum corresponding to the training voice according to the training voice.
  • the complex neural network uses the Mel spectrum as input. Therefore, it is necessary to obtain the Mel spectrum corresponding to the training speech first, and then use the obtained Mel spectrum to train the complex neural network.
  • obtaining the Mel spectrum corresponding to the training voice according to the training voice in step 304 includes:
  • Step 304A Use short-time Fourier transform to process the training speech to obtain a complex frequency spectrum corresponding to the training speech.
  • the short-time Fourier transform refers to the function transformation that transforms the time domain signal into the frequency domain.
  • Using the short-time Fourier transform to process the training speech yields the complex spectrum corresponding to the training speech, and that complex spectrum includes a real part and an imaginary part.
  • Step 304B Calculate the amplitude spectrum and the phase spectrum corresponding to the training voice according to the complex frequency spectrum corresponding to the training voice.
  • Step 304C Use a Mel filter to filter the amplitude spectrum corresponding to the training speech to obtain the Mel spectrum corresponding to the training speech.
  • The dimensionality of the amplitude spectrum is reduced (filtered) by the Mel filter to obtain the Mel spectrum.
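A minimal sketch of steps 304A to 304C using librosa (the sample rate, FFT size, hop length, and number of Mel bands are illustrative assumptions; a random signal stands in for a training utterance):

```python
import numpy as np
import librosa

sr = 22050
y = np.random.randn(sr * 2).astype(np.float32)   # stand-in for one training utterance

# Step 304A: short-time Fourier transform -> complex spectrum (real + imaginary parts).
complex_spec = librosa.stft(y, n_fft=1024, hop_length=256)

# Step 304B: amplitude spectrum and phase spectrum from the complex spectrum.
amplitude = np.abs(complex_spec)
phase = np.angle(complex_spec)

# Step 304C: the Mel filter reduces the frequency dimension of the amplitude spectrum.
mel_fb = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=80)   # (80, 513)
mel_spec = mel_fb @ amplitude                                # (80, frames)
```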
  • Step 306 Input the Mel spectrum corresponding to the training speech to the complex neural network to obtain first real part information and first imaginary part information corresponding to the training speech.
  • Step 308 Obtain a synthesized speech corresponding to the training speech according to the first real part information and the first imaginary part information.
  • the inverse short-time Fourier transform is used to process the first real part information and the first imaginary part information corresponding to the training speech output by the complex neural network (that is, the complex frequency spectrum corresponding to the training speech is obtained) to generate synthesized speech.
  • Step 310 Obtain second real part information and second imaginary part information corresponding to the training speech according to the training speech.
  • the short-time Fourier transform is used to process the training speech, and the second real part information and the second imaginary part information (ie, complex frequency spectrum) corresponding to the training speech can be obtained.
  • Step 312 According to the training speech, the synthesized speech corresponding to the training speech, the first real part information, the first imaginary part information, the second real part information, and the second imaginary part information, Obtain the network loss parameter, so as to update the complex neural network according to the network loss parameter.
  • step 312 includes:
  • Step 312A Obtain a first loss parameter according to the training speech and the synthesized speech corresponding to the training speech.
  • The discriminator compares the training speech with the synthesized speech and outputs the first loss parameter according to the comparison result. Specifically, the greater the difference between the training speech and the synthesized speech, the greater the first loss parameter; conversely, the smaller the difference between the training speech and the synthesized speech, the smaller the first loss parameter.
  • Further, the discriminator outputs a third loss parameter according to the training speech and the synthesized speech. The third loss parameter is used to judge whether the synthesized speech is real or fake relative to the training speech: the more real the synthesized speech (the closer it is to the training speech), the smaller the third loss parameter; the more fake the synthesized speech, the larger the third loss parameter. Gradient descent is then performed on the third loss parameter to update the discriminator.
  • Compared with the third loss parameter, the first loss parameter makes a more detailed judgment.
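One way this third-loss discriminator update could look, assuming a standard binary GAN objective; `disc`, `disc_opt`, and the waveform tensors are hypothetical names, and the patent only states that the third loss is small for realistic synthesized speech and that gradient descent is performed on it:

```python
import torch
import torch.nn.functional as F

def discriminator_step(disc, disc_opt, train_wave, synth_wave):
    """Third loss: score the training speech as real and the synthesized speech as fake."""
    real_score = disc(train_wave)
    fake_score = disc(synth_wave.detach())   # do not backpropagate into the generator
    third_loss = (
        F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score))
        + F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score))
    )
    disc_opt.zero_grad()
    third_loss.backward()                    # gradient descent on the third loss
    disc_opt.step()
    return third_loss.item()
```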
  • Step 312B Perform a sampling operation on the first real part information and the first imaginary part information to obtain a first real-imaginary part set, the first real-imaginary part set including a preset number of real part information and imaginary part information with different dimensions.
  • The first real part information and the first imaginary part information output by the complex neural network are sampled multiple times. Each sampling yields real part information and imaginary part information of a lower dimension, which are then sampled again; finally, after multiple samplings, a preset number of real part information and imaginary part information with different dimensions are obtained.
  • For example, the size before sampling is 512×512, the size after sampling is 256×256, and the size after sampling again is 128×128.
  • Step 312C Perform a sampling operation on the second real part information and the second imaginary part information to obtain a second real-imaginary part set, the second real-imaginary part set including a preset number of real part information and imaginary part information with different dimensions.
  • Similarly, the second real part information and the second imaginary part information corresponding to the training speech are sampled multiple times. Each sampling yields real part information and imaginary part information of a lower dimension, which are then sampled again; finally, after multiple samplings, a preset number of real part information and imaginary part information with different dimensions are obtained.
  • In the sampling of the second real part information and the second imaginary part information, the sampling parameters of each pass are kept consistent with the sampling parameters used for the corresponding pass over the first real part information and the first imaginary part information.
  • Step 312D Obtain a second loss parameter according to the first real-imaginary part set and the second real-imaginary part set.
  • The first real part information and first imaginary part information in the first real-imaginary part set are compared with the corresponding second real part information and second imaginary part information in the second real-imaginary part set to obtain loss sub-parameters; the multiple loss sub-parameters are added to obtain the second loss parameter.
  • Step 312E Use the sum of the first loss parameter and the second loss parameter as the network loss parameter.
  • The sum of the first loss parameter and the second loss parameter is used as the network loss parameter, so that the complex neural network is updated according to the network loss parameter. Since the update of the complex neural network simultaneously takes into account the synthesized speech, the training speech, and the first real part information and first imaginary part information output by the complex neural network, it can improve the network update speed, accelerate the training of the complex neural network, and yield a high-quality complex neural network.
  • gradient descent is performed on the network loss parameter, so as to realize the update of the complex neural network.
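A sketch of steps 312B through 312E in one function. Average pooling and L1 distance are assumptions: the patent only says "sampling operation" and "compare", without naming the down-scaling or the distance measure. Tensors are assumed to be (batch, 1, H, W).

```python
import torch.nn.functional as F

def network_loss(first_loss, real1, imag1, real2, imag2, num_scales=3):
    """Build both real-imaginary part sets by repeated down-sampling, compare them
    scale by scale, sum the loss sub-parameters into the second loss parameter,
    and add the first loss parameter (step 312E)."""
    second_loss = 0.0
    for _ in range(num_scales):
        # Loss sub-parameter at the current dimension.
        second_loss = second_loss + F.l1_loss(real1, real2) + F.l1_loss(imag1, imag2)
        # Sample to a lower dimension, e.g. 512x512 -> 256x256 -> 128x128.
        real1, imag1 = F.avg_pool2d(real1, 2), F.avg_pool2d(imag1, 2)
        real2, imag2 = F.avg_pool2d(real2, 2), F.avg_pool2d(imag2, 2)
    return first_loss + second_loss
```

Gradient descent on the returned value then updates the complex neural network, as described above.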
  • a speech synthesis device 700 includes: a text acquisition module 702 for acquiring a speech text to be synthesized.
  • the first frequency spectrum module 704 is configured to obtain the Mel spectrum corresponding to the speech text to be synthesized according to the speech text to be synthesized.
  • the second spectrum module 706 is configured to input the Mel spectrum into a complex neural network to obtain a complex spectrum corresponding to the speech text to be synthesized, and the complex spectrum includes real part information and imaginary part information.
  • the speech synthesis module 708 is configured to obtain the synthesized speech corresponding to the speech text to be synthesized according to the complex frequency spectrum.
  • The aforementioned speech synthesis device first obtains the speech text to be synthesized; then obtains the Mel spectrum corresponding to the speech text to be synthesized according to the speech text to be synthesized; inputs the Mel spectrum into the complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized; and finally obtains the synthesized speech corresponding to the speech text to be synthesized according to the complex spectrum.
  • the speech synthesis module 708 includes: an inverse transform module, configured to process the complex frequency spectrum using an inverse short-time Fourier transform to obtain the synthesized speech corresponding to the speech text to be synthesized.
  • the complex neural network includes a down-sampling network and an up-sampling network
  • the up-sampling network includes a real part deconvolution kernel and an imaginary part deconvolution kernel
  • the second spectrum module 706 includes: a down-sampling module 7062, configured to input the Mel spectrum into the down-sampling network in the complex neural network to obtain the spectral features corresponding to the Mel spectrum output by the down-sampling network; an up-sampling input module 7064, used to input the spectral features corresponding to the Mel spectrum into the up-sampling network; a real part module 7066, used for the real part deconvolution kernel in the up-sampling network to process the spectral features corresponding to the Mel spectrum to obtain the real part information corresponding to the speech text to be synthesized; and an imaginary part module 7068, used for the imaginary part deconvolution kernel in the up-sampling network to process the spectral features corresponding to the Mel spectrum to obtain the imaginary part information corresponding to the speech text to be synthesized.
  • the device 700 further includes: a training speech acquisition module 710, configured to acquire training speech; a training speech mel module 712, configured to obtain the Mel spectrum corresponding to the training speech according to the training speech; a training speech input module 714, used to input the Mel spectrum corresponding to the training speech into the complex neural network to obtain the first real part information and first imaginary part information corresponding to the training speech; a training synthesis module 716, used to obtain the synthesized speech corresponding to the training speech according to the first real part information and the first imaginary part information; a training speech spectrum module 718, used to obtain the second real part information and second imaginary part information corresponding to the training speech according to the training speech; and a network update module 720, configured to obtain a network loss parameter according to the training speech, the synthesized speech corresponding to the training speech, the first real part information, the first imaginary part information, the second real part information, and the second imaginary part information, so as to update the complex neural network according to the network loss parameter.
  • the network update module includes: a first loss module, configured to obtain a first loss parameter according to the training speech and the synthesized speech corresponding to the training speech; a first sampling module, configured to sample the first real part information and the first imaginary part information to obtain a first real-imaginary part set, the first real-imaginary part set including a preset number of real part information and imaginary part information with different dimensions; a second sampling module, for sampling the second real part information and the second imaginary part information to obtain a second real-imaginary part set, the second real-imaginary part set including a preset number of real part information and imaginary part information with different dimensions; a sampling loss module, for obtaining a second loss parameter according to the first real-imaginary part set and the second real-imaginary part set; and a loss summation module, configured to use the sum of the first loss parameter and the second loss parameter as the network loss parameter.
  • the training speech mel module includes: a short-time Fourier module, configured to process the training speech using the short-time Fourier transform to obtain the complex spectrum corresponding to the training speech; a spectrum calculation module, used to calculate the amplitude spectrum and the phase spectrum corresponding to the training speech according to the complex spectrum corresponding to the training speech; and a mel filter module, used to filter the amplitude spectrum corresponding to the training speech with a mel filter to obtain the Mel spectrum corresponding to the training speech.
  • Fig. 10 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device may specifically be a server or a terminal.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, enables the processor to implement the speech synthesis method.
  • a computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor can execute the speech synthesis method.
  • FIG. 10 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; the specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the speech synthesis method provided in this application can be implemented in the form of a computer program, and the computer program can be run on the computer device as shown in FIG. 10.
  • the memory of the computer equipment can store the various program modules that make up the speech synthesis device, for example, the text acquisition module 702, the first spectrum module 704, the second spectrum module 706, and the speech synthesis module 708.
  • a computer device includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and the processor implements the following steps when the processor executes the computer program: acquiring speech text to be synthesized; Obtain the Mel spectrum corresponding to the speech text to be synthesized according to the speech text to be synthesized; input the Mel spectrum into a complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized, and the complex spectrum includes a real part Information and imaginary part information; the synthesized speech corresponding to the speech text to be synthesized is obtained according to the complex frequency spectrum.
  • the obtaining the synthesized speech corresponding to the speech text to be synthesized according to the complex frequency spectrum includes: using an inverse short-time Fourier transform to process the complex frequency spectrum to obtain the speech text to be synthesized The corresponding synthesized speech.
  • the complex neural network includes a down-sampling network and an up-sampling network
  • the up-sampling network includes a real part deconvolution kernel and an imaginary part deconvolution kernel
  • the inputting of the Mel spectrum into the complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized includes: inputting the Mel spectrum into the down-sampling network in the complex neural network to obtain the spectral features corresponding to the Mel spectrum output by the down-sampling network; inputting the spectral features corresponding to the Mel spectrum into the up-sampling network; processing, by the real part deconvolution kernel in the up-sampling network, the spectral features corresponding to the Mel spectrum to obtain the real part information corresponding to the speech text to be synthesized; and processing, by the imaginary part deconvolution kernel in the up-sampling network, the spectral features corresponding to the Mel spectrum to obtain the imaginary part information corresponding to the speech text to be synthesized.
  • when the computer program is executed by the processor, it is further used to: before the obtaining of the speech text to be synthesized, obtain training speech; obtain the Mel spectrum corresponding to the training speech according to the training speech; input the Mel spectrum corresponding to the training speech into the complex neural network to obtain the first real part information and first imaginary part information corresponding to the training speech; obtain the synthesized speech corresponding to the training speech according to the first real part information and the first imaginary part information; obtain the second real part information and second imaginary part information corresponding to the training speech according to the training speech; and obtain a network loss parameter according to the training speech, the synthesized speech corresponding to the training speech, the first real part information, the first imaginary part information, the second real part information, and the second imaginary part information, so as to update the complex neural network according to the network loss parameter.
  • the obtaining of a network loss parameter according to the training speech, the synthesized speech corresponding to the training speech, the first real part information, the first imaginary part information, the second real part information, and the second imaginary part information includes: obtaining a first loss parameter according to the training speech and the synthesized speech corresponding to the training speech; performing a sampling operation on the first real part information and the first imaginary part information to obtain a first real-imaginary part set, the first real-imaginary part set including a preset number of real part information and imaginary part information with different dimensions; performing a sampling operation on the second real part information and the second imaginary part information to obtain a second real-imaginary part set, the second real-imaginary part set including a preset number of real part information and imaginary part information with different dimensions; obtaining a second loss parameter according to the first real-imaginary part set and the second real-imaginary part set; and using the sum of the first loss parameter and the second loss parameter as the network loss parameter.
  • the obtaining of the Mel spectrum corresponding to the training speech according to the training speech includes: processing the training speech using the short-time Fourier transform to obtain the complex spectrum corresponding to the training speech; calculating the amplitude spectrum and phase spectrum corresponding to the training speech according to the complex spectrum corresponding to the training speech; and filtering the amplitude spectrum corresponding to the training speech with a mel filter to obtain the Mel spectrum corresponding to the training speech.
  • a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the following steps: obtain the speech text to be synthesized; obtain the Mel spectrum corresponding to the speech text to be synthesized according to the speech text to be synthesized; input the Mel spectrum into the complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and obtain the synthesized speech corresponding to the speech text to be synthesized according to the complex spectrum.
  • the obtaining of the synthesized speech corresponding to the speech text to be synthesized according to the complex spectrum corresponding to the speech text to be synthesized includes: processing the complex spectrum using an inverse short-time Fourier transform to obtain the synthesized speech corresponding to the speech text to be synthesized.
  • the complex neural network includes a down-sampling network and an up-sampling network
  • the up-sampling network includes a real part deconvolution kernel and an imaginary part deconvolution kernel
  • the inputting of the Mel spectrum into the complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized includes: inputting the Mel spectrum into the down-sampling network in the complex neural network to obtain the spectral features corresponding to the Mel spectrum output by the down-sampling network; inputting the spectral features corresponding to the Mel spectrum into the up-sampling network; processing, by the real part deconvolution kernel in the up-sampling network, the spectral features corresponding to the Mel spectrum to obtain the real part information corresponding to the speech text to be synthesized; and processing, by the imaginary part deconvolution kernel in the up-sampling network, the spectral features corresponding to the Mel spectrum to obtain the imaginary part information corresponding to the speech text to be synthesized.
  • when the computer program is executed by the processor, it is further used to: before the obtaining of the speech text to be synthesized, obtain training speech; obtain the Mel spectrum corresponding to the training speech according to the training speech; input the Mel spectrum corresponding to the training speech into the complex neural network to obtain the first real part information and first imaginary part information corresponding to the training speech; obtain the synthesized speech corresponding to the training speech according to the first real part information and the first imaginary part information; obtain the second real part information and second imaginary part information corresponding to the training speech according to the training speech; and obtain a network loss parameter according to the training speech, the synthesized speech corresponding to the training speech, the first real part information, the first imaginary part information, the second real part information, and the second imaginary part information, so as to update the complex neural network according to the network loss parameter.
  • the obtaining of a network loss parameter according to the training speech, the synthesized speech corresponding to the training speech, the first real part information, the first imaginary part information, the second real part information, and the second imaginary part information includes: obtaining a first loss parameter according to the training speech and the synthesized speech corresponding to the training speech; performing a sampling operation on the first real part information and the first imaginary part information to obtain a first real-imaginary part set, the first real-imaginary part set including a preset number of real part information and imaginary part information with different dimensions; performing a sampling operation on the second real part information and the second imaginary part information to obtain a second real-imaginary part set, the second real-imaginary part set including a preset number of real part information and imaginary part information with different dimensions; obtaining a second loss parameter according to the first real-imaginary part set and the second real-imaginary part set; and using the sum of the first loss parameter and the second loss parameter as the network loss parameter.
  • the obtaining of the Mel spectrum corresponding to the training speech according to the training speech includes: processing the training speech using the short-time Fourier transform to obtain the complex spectrum corresponding to the training speech; calculating the amplitude spectrum and phase spectrum corresponding to the training speech according to the complex spectrum corresponding to the training speech; and filtering the amplitude spectrum corresponding to the training speech with a mel filter to obtain the Mel spectrum corresponding to the training speech.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Abstract

A speech synthesis method and apparatus, a computer device, and a computer-readable storage medium. The method includes: obtaining a speech text to be synthesized (102); obtaining, according to the speech text to be synthesized, a Mel spectrum corresponding to the speech text to be synthesized (104); inputting the Mel spectrum into a complex neural network to obtain a complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information (106); and obtaining, according to the complex spectrum, synthesized speech corresponding to the speech text to be synthesized (108). The method makes speech synthesis more efficient and simpler.

Description

Speech synthesis method and apparatus, computer device, and storage medium
Technical Field
This application relates to the field of speech synthesis technology, and in particular to a speech synthesis method and apparatus, a computer device, and a storage medium.
Background
Speech synthesis technology refers to the process of obtaining synthesized speech from the speech text to be synthesized. In the process of speech synthesis, deep generative models have greatly improved the quality of the synthesized speech; for example, WaveNet shows excellent performance compared with traditional speech synthesizers.
Technical Problem
However, WaveNet needs to generate speech sampling points during speech synthesis, and WaveNet is an autoregressive model. Its autoregressive nature makes speech synthesis slow, and the need to generate a large number of speech sampling points further slows down synthesis and complicates the process.
Technical Solution
On this basis, it is necessary to address the above problem by providing an efficient and simple speech synthesis method and apparatus, computer device, and storage medium.
A speech synthesis method, the method including: obtaining a speech text to be synthesized; obtaining, according to the speech text to be synthesized, a Mel spectrum corresponding to the speech text to be synthesized; inputting the Mel spectrum into a complex neural network to obtain a complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and obtaining, according to the complex spectrum, synthesized speech corresponding to the speech text to be synthesized.
A speech synthesis apparatus, the apparatus including: a text acquisition module, configured to obtain a speech text to be synthesized; a first spectrum module, configured to obtain, according to the speech text to be synthesized, a Mel spectrum corresponding to the speech text to be synthesized; a second spectrum module, configured to input the Mel spectrum into a complex neural network to obtain a complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and a speech synthesis module, configured to obtain, according to the complex spectrum, synthesized speech corresponding to the speech text to be synthesized.
A computer device including a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the following steps: obtaining a speech text to be synthesized; obtaining, according to the speech text to be synthesized, a Mel spectrum corresponding to the speech text to be synthesized; inputting the Mel spectrum into a complex neural network to obtain a complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and obtaining, according to the complex spectrum, synthesized speech corresponding to the speech text to be synthesized.
A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the following steps: obtaining a speech text to be synthesized; obtaining, according to the speech text to be synthesized, a Mel spectrum corresponding to the speech text to be synthesized; inputting the Mel spectrum into a complex neural network to obtain a complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and obtaining, according to the complex spectrum, synthesized speech corresponding to the speech text to be synthesized.
Beneficial Effects
Implementing the embodiments of the present application will have the following beneficial effects:
With the above speech synthesis method and apparatus, computer device, and computer-readable storage medium, the speech text to be synthesized is first obtained; the Mel spectrum corresponding to the speech text to be synthesized is then obtained according to the speech text to be synthesized; the Mel spectrum is input into a complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and finally the synthesized speech corresponding to the speech text to be synthesized is obtained according to the complex spectrum. It can be seen that, because the complex spectrum of the speech text is obtained from the Mel spectrum corresponding to the speech text, and the complex spectrum contains real part information and imaginary part information that can be regarded as two images, the number of pixels needed to generate two images is far smaller than the number of sampling points needed to generate speech; therefore, compared with the autoregressive approach of WaveNet, the method has lower complexity and higher synthesis efficiency.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
In the drawings:
Figure 1 is an implementation flowchart of a speech synthesis method in an embodiment;
Figure 2 is an implementation flowchart of step 106 in an embodiment;
Figure 3 is an implementation flowchart of a speech synthesis method in an embodiment;
Figure 4 is an implementation flowchart of step 304 in an embodiment;
Figure 5 is an implementation flowchart of step 312 in an embodiment;
Figure 6 is a schematic diagram of training a complex neural network in an embodiment;
Figure 7 is a block diagram of a speech synthesis apparatus in another embodiment;
Figure 8 is a block diagram of the second spectrum module 706 in an embodiment;
Figure 9 is a block diagram of a speech synthesis apparatus in an embodiment;
Figure 10 is a structural block diagram of a computer device in an embodiment.
Embodiments of the Present Invention
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
As shown in Figure 1, in an embodiment, a speech synthesis method is provided. The execution subject of the speech synthesis method described in the embodiments of the present invention is a device capable of implementing the method. Such devices may include, but are not limited to, terminals and servers, where terminals include mobile terminals and desktop terminals; mobile terminals include but are not limited to mobile phones, tablets, and laptops; desktop terminals include but are not limited to desktop computers and vehicle-mounted computers; and servers include high-performance computers and high-performance computer clusters. The speech synthesis method specifically includes the following steps:
Step 102: Obtain the speech text to be synthesized.
The speech text to be synthesized is the text corresponding to the speech to be synthesized. In this embodiment of the present invention, speech is synthesized from the speech text to be synthesized, thereby achieving the purpose of speech synthesis.
Step 104: Obtain, according to the speech text to be synthesized, the Mel spectrum corresponding to the speech text to be synthesized.
The Mel spectrum is one way of representing a speech spectrum. An ordinary speech spectrum is a large spectrogram; filtering the speech frequencies with a Mel filter yields a relatively small spectrogram, and that relatively small spectrogram is the Mel spectrum.
The speech text to be synthesized is input into a sound spectrum network. The sound spectrum network includes an encoder and a decoder: the encoder obtains hidden-layer features from the speech text to be synthesized, and the decoder predicts the Mel spectrum from the hidden-layer features corresponding to the speech text to be synthesized.
Specifically, the encoder includes a character vector unit, a convolution unit, and a bidirectional LSTM unit. The speech text to be synthesized is encoded by the character vector unit into character vectors of a fixed dimension (for example, 512 dimensions); the character vectors are input into the convolution unit (for example, 3 convolution layers), which extracts the context features of the character vectors; the context features extracted by the convolution unit are input into the bidirectional LSTM unit to obtain the encoding features. The decoder may be an autoregressive recurrent neural network that predicts the Mel spectrum from the encoding features output by the bidirectional LSTM unit.
Step 106: Input the Mel spectrum into a complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information.
The complex neural network takes the Mel spectrum as input and the complex spectrum as output. In this embodiment of the present invention, the network structure of the complex neural network includes the U-net network structure.
The real part information and the imaginary part information of the complex spectrum can be regarded as two images; that is, the output of the complex neural network is regarded as two spectrum images.
Step 108: Obtain, according to the complex spectrum, the synthesized speech corresponding to the speech text to be synthesized.
The synthesized speech corresponding to the speech text to be synthesized can be obtained from the complex spectrum corresponding to that text. It should be noted that, since the complex spectrum includes real part information and imaginary part information, the final speech is synthesized from both; compared with methods that synthesize speech from the real part information alone, the speech synthesized by the method of this embodiment of the present invention retains more speech information and is therefore more realistic.
In an embodiment, obtaining the synthesized speech corresponding to the speech text to be synthesized according to the complex spectrum in step 108 includes: processing the complex spectrum using the inverse short-time Fourier transform to obtain the synthesized speech corresponding to the speech text to be synthesized.
Speech itself is a one-dimensional time-domain signal, and it is difficult to see the frequency variation of the speech from the time-domain signal. A Fourier transform can move the speech from the time domain to the frequency domain; although the frequency distribution of the speech can then be seen, the time-domain information is lost, and it is difficult to see the time-domain information of the speech from its frequency-domain distribution. To solve this problem, many time-frequency analysis methods have emerged; the short-time Fourier transform is a very commonly used time-frequency analysis method, and the inverse short-time Fourier transform is its inverse process.
Specifically, the short-time Fourier transform can move speech from the time domain to the frequency domain, and the inverse short-time Fourier transform can restore frequency-domain speech to the time domain. Using the inverse short-time Fourier transform (function) to restore frequency-domain speech to the time domain is simpler than synthesizing speech with an autoregressive model.
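Putting steps 102 through 108 together, a minimal end-to-end inference sketch could look as follows. The interfaces are assumptions: `spectrum_net` stands for a trained sound spectrum network (text to Mel spectrum), `complex_net` for the trained complex neural network, and its real/imaginary outputs are assumed to be linear-frequency spectrograms sized to match the iSTFT settings.

```python
import torch
from scipy.signal import istft

def synthesize(text_ids, spectrum_net, complex_net):
    """Text -> Mel spectrum -> complex spectrum -> inverse STFT -> waveform."""
    with torch.no_grad():
        mel = spectrum_net(text_ids)      # step 104: Mel spectrum
        real, imag = complex_net(mel)     # step 106: real and imaginary parts
    spec = real.squeeze().numpy() + 1j * imag.squeeze().numpy()
    # Step 108: the inverse STFT restores the frequency-domain speech to the time domain.
    _, waveform = istft(spec, fs=22050, nperseg=1024, noverlap=768)
    return waveform
```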
With the above speech synthesis method, the speech text to be synthesized is first obtained; the Mel spectrum corresponding to the speech text to be synthesized is then obtained according to the speech text to be synthesized; the Mel spectrum is input into a complex neural network to obtain the complex spectrum corresponding to the speech text to be synthesized, the complex spectrum including real part information and imaginary part information; and finally the synthesized speech corresponding to the speech text to be synthesized is obtained according to the complex spectrum. It can be seen that, because the complex spectrum of the speech text is obtained from the Mel spectrum corresponding to the speech text, and the complex spectrum contains real part information and imaginary part information that can be regarded as two images, the number of pixels needed to generate two images is far smaller than the number of sampling points needed to generate speech; therefore, compared with the autoregressive approach of WaveNet, the method has lower complexity and higher synthesis efficiency.
In an embodiment, the complex neural network includes a down-sampling network and an up-sampling network, the up-sampling network including a real part deconvolution kernel and an imaginary part deconvolution kernel. As shown in Figure 2, inputting the Mel spectrum into the complex neural network in step 106 to obtain the complex spectrum corresponding to the speech text to be synthesized includes:
Step 106A: Input the Mel spectrum into the down-sampling network of the complex neural network to obtain the spectral features corresponding to the Mel spectrum output by the down-sampling network.
The down-sampling network includes multiple layers, each provided with a convolution kernel. The convolution kernel of each layer extracts features from that layer's input, continuously mining deeper features and transforming a large size into a small size. The Mel spectrum is input into the down-sampling network, and feature extraction through the multiple layers of convolution kernels yields the spectral features corresponding to the Mel spectrum.
Step 106B: Input the spectral features corresponding to the Mel spectrum into the up-sampling network.
After the spectral features corresponding to the Mel spectrum are obtained, they are input into the up-sampling network of the complex neural network so that the up-sampling network obtains the complex spectrum from the spectral features.
Step 106C: The real part deconvolution kernel in the up-sampling network processes the spectral features corresponding to the Mel spectrum to obtain the real part information corresponding to the speech text to be synthesized.
The up-sampling network is provided with deconvolution kernels, which perform the deconvolution operation; deconvolution is transposed convolution, transforming a small size into a large size.
Step 106D: The imaginary part deconvolution kernel in the up-sampling network processes the spectral features corresponding to the Mel spectrum to obtain the imaginary part information corresponding to the speech text to be synthesized.
In this embodiment of the present invention, two kinds of deconvolution kernels are set in the up-sampling network, namely the real part deconvolution kernel and the imaginary part deconvolution kernel. The real part deconvolution kernel processes the spectral features to obtain the real part information corresponding to the speech text to be synthesized, and the imaginary part deconvolution kernel processes the spectral features to obtain the imaginary part information corresponding to the speech text to be synthesized.
In an embodiment, a training method is provided. As shown in Figure 3, before the speech text to be synthesized is obtained in step 314, the method further includes:
Step 302: Obtain training speech.
The training speech is the speech used to train the complex neural network.
Step 304: Obtain, according to the training speech, the Mel spectrum corresponding to the training speech.
In this embodiment of the present invention, the complex neural network takes the Mel spectrum as input; therefore, the Mel spectrum corresponding to the training speech must first be obtained, and the obtained Mel spectrum is then used to train the complex neural network.
In an embodiment, as shown in Figure 4, obtaining the Mel spectrum corresponding to the training speech according to the training speech in step 304 includes:
Step 304A: Process the training speech using the short-time Fourier transform to obtain the complex spectrum corresponding to the training speech.
The short-time Fourier transform is the function transform that converts a time-domain signal into the frequency domain. Processing the training speech with the short-time Fourier transform yields the complex spectrum corresponding to the training speech, and that complex spectrum includes a real part and an imaginary part.
Step 304B: Calculate, according to the complex spectrum corresponding to the training speech, the amplitude spectrum and the phase spectrum corresponding to the training speech.
The calculation formula from complex spectrum to amplitude spectrum is obtained, and the amplitude spectrum corresponding to the training speech is calculated according to it; the calculation formula from complex spectrum to phase spectrum is obtained, and the phase spectrum corresponding to the training speech is calculated according to it.
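For reference, the standard relations (assumed here, since the patent does not write the formulas out) express the amplitude spectrum and phase spectrum of each complex spectrum bin X = a + jb as:

|X| = sqrt(a^2 + b^2)    (amplitude spectrum)
angle(X) = arctan2(b, a)    (phase spectrum)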
Step 304C: Filter the amplitude spectrum corresponding to the training speech with a Mel filter to obtain the Mel spectrum corresponding to the training speech.
Applying the Mel filter to reduce the dimensionality of (filter) the amplitude spectrum yields the Mel spectrum.
Step 306: Input the Mel spectrum corresponding to the training speech into the complex neural network to obtain the first real part information and the first imaginary part information corresponding to the training speech.
Step 308: Obtain, according to the first real part information and the first imaginary part information, the synthesized speech corresponding to the training speech.
Applying the inverse short-time Fourier transform to the first real part information and first imaginary part information output by the complex neural network for the training speech (that is, to the complex spectrum corresponding to the training speech) generates the synthesized speech. The complex neural network is subsequently updated according to the synthesized speech, the training speech, and so on, so that through continuous updates the first real part information and first imaginary part information output by the network come ever closer to the real part information and imaginary part information of real speech, improving the quality of the finally synthesized speech.
Step 310: Obtain, according to the training speech, the second real part information and the second imaginary part information corresponding to the training speech.
Processing the training speech with the short-time Fourier transform yields the second real part information and second imaginary part information (that is, the complex spectrum) corresponding to the training speech.
Step 312: Obtain a network loss parameter according to the training speech, the synthesized speech corresponding to the training speech, the first real part information, the first imaginary part information, the second real part information, and the second imaginary part information, so as to update the complex neural network according to the network loss parameter.
In an embodiment, as shown in Figure 5, step 312 includes:
Step 312A: Obtain a first loss parameter according to the training speech and the synthesized speech corresponding to the training speech.
As shown in Figure 6, a discriminator compares the training speech with the synthesized speech and outputs the first loss parameter according to the comparison result. Specifically, the greater the difference between the training speech and the synthesized speech, the greater the first loss parameter; conversely, the smaller the difference between the training speech and the synthesized speech, the smaller the first loss parameter.
Further, the discriminator outputs a third loss parameter according to the training speech and the synthesized speech. The third loss parameter is used to judge whether the synthesized speech is real or fake relative to the training speech: the more real the synthesized speech (the closer it is to the training speech), the smaller the third loss parameter; the more fake the synthesized speech, the larger the third loss parameter. Gradient descent is then performed on the third loss parameter to update the discriminator.
Compared with the third loss parameter, the first loss parameter makes a more detailed judgment.
Step 312B: Perform a sampling operation on the first real part information and the first imaginary part information to obtain a first real-imaginary part set, the first real-imaginary part set including a preset number of real part information and imaginary part information with different dimensions.
As shown in Figure 6, the first real part information and first imaginary part information output by the complex neural network are sampled multiple times. Each sampling yields real part information and imaginary part information of a lower dimension, which are then sampled again; finally, after multiple samplings, a preset number of real part information and imaginary part information with different dimensions are obtained. For example, the size before sampling is 512×512, the size after sampling is 256×256, and the size after sampling again is 128×128.
Step 312C: Perform a sampling operation on the second real part information and the second imaginary part information to obtain a second real-imaginary part set, the second real-imaginary part set including a preset number of real part information and imaginary part information with different dimensions.
Similarly, the second real part information and second imaginary part information corresponding to the training speech are sampled multiple times. Each sampling yields real part information and imaginary part information of a lower dimension, which are then sampled again; finally, after multiple samplings, a preset number of real part information and imaginary part information with different dimensions are obtained. In the sampling of the second real part information and second imaginary part information, the sampling parameters of each pass are kept consistent with those used for the corresponding pass over the first real part information and first imaginary part information.
Step 312D: Obtain a second loss parameter according to the first real-imaginary part set and the second real-imaginary part set.
As shown in Figure 6, the first real part information and first imaginary part information in the first real-imaginary part set are compared with the corresponding second real part information and second imaginary part information in the second real-imaginary part set to obtain loss sub-parameters; adding the multiple loss sub-parameters yields the second loss parameter.
Step 312E: Use the sum of the first loss parameter and the second loss parameter as the network loss parameter.
The sum of the first loss parameter and the second loss parameter is used as the network loss parameter so that the complex neural network can be updated according to it. Since the update of the complex neural network simultaneously takes into account the synthesized speech, the training speech, and the first real part information and first imaginary part information output by the complex neural network, it can increase the network update speed, accelerate the training of the complex neural network, and yield a high-quality complex neural network.
Specifically, gradient descent is performed on the network loss parameter to update the complex neural network.
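A minimal sketch of this update step; the optimizer (plain SGD over the network's parameters, matching the "gradient descent" wording) and the learning rate are assumptions, and `network_loss` is the loss tensor described above.

```python
import torch

def update_complex_network(optimizer, network_loss):
    """One update of the complex neural network by gradient descent on the
    network loss parameter. `optimizer` is assumed to be, e.g.,
    torch.optim.SGD(complex_net.parameters(), lr=1e-4)."""
    optimizer.zero_grad()
    network_loss.backward()   # gradients of the first + second loss
    optimizer.step()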
如图7所示,在一个实施例中,提出了一种语音合成装置700,该装置700包括:文本获取模块702,用于获取待合成语音文本。第一频谱模块704,用于根据所述待合成语音文本得到所述待合成语音文本对应的梅尔频谱。第二频谱模块706,用于将所述梅尔频谱输入复数神经网络,得到所述待合成语音文本对应的复数频谱,所述复数频谱包括实部信息和虚部信息。语音合成模块708,用于根据所述复数频谱得到所述待合成语音文本对应的合成语音。
上述语音合成装置,首先获取待合成语音文本;然后根据所述待合成语音文本得到所述待合成语音文本对应的梅尔频谱;并且将所述梅尔频谱输入复数神经网络,得到所述待合成语音文本对应的复数频谱,所述复数频谱包括实部信息和虚部信息;最后根据所述复数频谱得到所述待合成语音文本对应的合成语音。可见,通过上述装置,由于是根据语音文本对应的梅尔频谱得到语音文本的复数频谱,复数频谱包含实部信息和虚部信息,该实部信息和虚部信息可以看做是两张图像,生成两张图像所需的像素点远小于生成语音所需的采样点,因此,相较于WaveNet自回归的方式具有更低的复杂度,并且具有更高的合成效率。
在一个实施例中,所述语音合成模块708,包括:逆变换模块,用于使用逆短时傅里叶变换对所述复数频谱进行处理,得到所述待合成语音文本对应的合成语音。
在一个实施例中,所述复数神经网络包括下采样网络和上采样网络,所述上采样网络包括实部反卷积核和虚部反卷积核;如图8所示,所述第二频谱模块706,包括:下采样模块7062,用于将所述梅尔频谱输入所述复数神经网络中的下采样网络,得到所述下采样网络输出的所述梅尔频谱对应的频谱特征;上采样输入模块7064,用于将所述梅尔频谱对应的频谱特征输入所述上采样网络;实部模块7066,用于所述上采样网络中的实部反卷积核对所述梅尔频谱对应的频谱特征进行处理得到所述待合成语音文本对应的实部信息;虚部模块7068,用于所述上采样网络中的虚部反卷积核对所述梅尔频谱对应的频谱特征进行处理得到所述待合成语音文本对应的虚部信息。
在一个实施例中,如图9所示,所述装置700,还包括:训练语音获取模块710,用于获取训练语音;训练语音梅尔模块712,用于根据所述训练语音得到所述训练语音对应的梅尔频谱;训练语音输入模块714,用于将所述训练语音对应的梅尔频谱输入所述复数神经网络,得到所述训练语音对应的第一实部信息和第一虚部信息;训练合成模块716,用于根据所述第一实部信息和所述第一虚部信息得到所述训练语音对应的合成语音;训练语音频谱模块718,用于根据所述训练语音得到所述训练语音对应的第二实部信息和第二虚部信息;网络更新模块720,用于根据所述训练语音、所述训练语音对应的合成语音、所述第一实部信息、所述第一虚部信息、所述第二实部信息和所述第二虚部信息,得到网络损失参数,以便根据所述网络损失参数更新所述复数神经网络。
在一个实施例中,所述网络更新模块,包括:第一损失模块,用于根据所述训练语音和所述训练语音对应的合成语音得到第一损失参数;第一采样模块,用于对所述第一实部信息和所述第一虚部信息进行采样操作,得到第一实部虚部集,所述第一实部虚部集中包括预设个数的维度不同的实部信息和虚部信息;第二采样模块,用于对所述第二实部信息和所述第二虚部信息进行采样操作,得到第二实部虚部集,所述第二实部虚部集中包括预设个数的维度不同的实部信息和虚部信息;采样损失模块,用于根据所述第一实部虚部集和所述第二实部虚部集得到第二损失参数;损失求和模块,用于将所述第一损失参数和第二损失参数的和作为所述网络损失参数。
在一个实施例中,所述训练语音梅尔模块,包括:短时傅里叶模块,用于使用短时傅里叶变换对所述训练语音进行处理,得到所述训练语音对应的复数频谱;谱计算模块,用于根据所述训练语音对应的复数频谱计算得到所述训练语音对应的幅度谱和相位谱;梅尔滤波模块,用于采用梅尔滤波器对所述训练语音对应的幅度谱进行滤波,得到所述训练语音对应的梅尔频谱。
图10示出了一个实施例中计算机设备的内部结构图。该计算机设备具体可以是服务器和终端。如图10所示,该计算机设备包括通过系统总线连接的处理器、存储器和网络接口。其中,存储器包括非易失性存储介质和内存储器。该计算机设备的非易失性存储介质存储有操作系统,还可存储有计算机程序,该计算机程序被处理器执行时,可使得处理器实现语音合成方法。该内存储器中也可储存有计算机程序,该计算机程序被处理器执行时,可使得处理器执行语音合成方法。本领域技术人员可以理解,图10中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,本申请提供的语音合成方法可以实现为一种计算机程序的形式,计算机程序可在如图10所示的计算机设备上运行。计算机设备的存储器中可存储组成语音合成装置的各个程序模板。比如,文本获取模块702,第一频谱模块704,第二频谱模块706,语音合成模块708。
一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如下步骤:获取待合成语音文本;根据所述待合成语音文本得到所述待合成语音文本对应的梅尔频谱;将所述梅尔频谱输入复数神经网络,得到所述待合成语音文本对应的复数频谱,所述复数频谱包括实部信息和虚部信息;根据所述复数频谱得到所述待合成语音文本对应的合成语音。
在一个实施例中,所述根据所述复数频谱得到所述待合成语音文本对应的合成语音,包括:使用逆短时傅里叶变换对所述复数频谱进行处理,得到所述待合成语音文本对应的合成语音。
在一个实施例中,所述复数神经网络包括下采样网络和上采样网络,所述上采样网络包括实部反卷积核和虚部反卷积核;所述将所述梅尔频谱输入复数神经网络,得到所述待合成语音文本对应的复数频谱,包括:将所述梅尔频谱输入所述复数神经网络中的下采样网络,得到所述下采样网络输出的所述梅尔频谱对应的频谱特征;将所述梅尔频谱对应的频谱特征输入所述上采样网络;所述上采样网络中的实部反卷积核对所述梅尔频谱对应的频谱特征进行处理得到所述待合成语音文本对应的实部信息;所述上采样网络中的虚部反卷积核对所述梅尔频谱对应的频谱特征进行处理得到所述待合成语音文本对应的虚部信息。
在一个实施例中,所述计算机程序被处理器执行时,还用于:在所述获取待合成语音文本之前,获取训练语音;根据所述训练语音得到所述训练语音对应的梅尔频谱;将所述训练语音对应的梅尔频谱输入所述复数神经网络,得到所述训练语音对应的第一实部信息和第一虚部信息;根据所述第一实部信息和所述第一虚部信息得到所述训练语音对应的合成语音;根据所述训练语音得到所述训练语音对应的第二实部信息和第二虚部信息;根据所述训练语音、所述训练语音对应的合成语音、所述第一实部信息、所述第一虚部信息、所述第二实部信息和所述第二虚部信息,得到网络损失参数,以便根据所述网络损失参数更新所述复数神经网络。
在一个实施例中,所述根据所述训练语音、所述训练语音对应的合成语音、所述第一实部信息、所述第一虚部信息、所述第二实部信息和所述第二虚部信息,得到网络损失参数,包括:根据所述训练语音和所述训练语音对应的合成语音得到第一损失参数;对所述第一实部信息和所述第一虚部信息进行采样操作,得到第一实部虚部集,所述第一实部虚部集中包括预设个数的维度不同的实部信息和虚部信息;对所述第二实部信息和所述第二虚部信息进行采样操作,得到第二实部虚部集,所述第二实部虚部集中包括预设个数的维度不同的实部信息和虚部信息;根据所述第一实部虚部集和所述第二实部虚部集得到第二损失参数;将所述第一损失参数和第二损失参数的和作为所述网络损失参数。
在一个实施例中,所述根据所述训练语音得到所述训练语音对应的梅尔频谱,包括:使用短时傅里叶变换对所述训练语音进行处理,得到所述训练语音对应的复数频谱;根据所述训练语音对应的复数频谱计算得到所述训练语音对应的幅度谱和相位谱;采用梅尔滤波器对所述训练语音对应的幅度谱进行滤波,得到所述训练语音对应的梅尔频谱。
一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如下步骤:获取待合成语音文本;根据所述待合成语音文本得到所述待合成语音文本对应的梅尔频谱;将所述梅尔频谱输入复数神经网络,得到所述待合成语音文本对应的复数频谱,所述复数频谱包括实部信息和虚部信息;根据所述复数频谱得到所述待合成语音文本对应的合成语音。
在一个实施例中,所述根据所述待合成语音文本对应的复数频谱得到所述待合成语音文本对应的合成语音,包括:使用逆短时傅里叶变换对所述复数频谱进行处理,得到所述待合成语音文本对应的合成语音。
在一个实施例中,所述复数神经网络包括下采样网络和上采样网络,所述上采样网络包括实部反卷积核和虚部反卷积核;所述将所述梅尔频谱输入复数神经网络,得到所述待合成语音文本对应的复数频谱,包括:将所述梅尔频谱输入所述复数神经网络中的下采样网络,得到所述下采样网络输出的所述梅尔频谱对应的频谱特征;将所述梅尔频谱对应的频谱特征输入所述上采样网络;所述上采样网络中的实部反卷积核对所述梅尔频谱对应的频谱特征进行处理得到所述待合成语音文本对应的实部信息;所述上采样网络中的虚部反卷积核对所述梅尔频谱对应的频谱特征进行处理得到所述待合成语音文本对应的虚部信息。
在一个实施例中,所述计算机程序被处理器执行时,还用于:在所述获取待合成语音文本之前,获取训练语音;根据所述训练语音得到所述训练语音对应的梅尔频谱;将所述训练语音对应的梅尔频谱输入所述复数神经网络,得到所述训练语音对应的第一实部信息和第一虚部信息;根据所述第一实部信息和所述第一虚部信息得到所述训练语音对应的合成语音;根据所述训练语音得到所述训练语音对应的第二实部信息和第二虚部信息;根据所述训练语音、所述训练语音对应的合成语音、所述第一实部信息、所述第一虚部信息、所述第二实部信息和所述第二虚部信息,得到网络损失参数,以便根据所述网络损失参数更新所述复数神经网络。
在一个实施例中,所述根据所述训练语音、所述训练语音对应的合成语音、所述第一实部信息、所述第一虚部信息、所述第二实部信息和所述第二虚部信息,得到网络损失参数,包括:根据所述训练语音和所述训练语音对应的合成语音得到第一损失参数;对所述第一实部信息和所述第一虚部信息进行采样操作,得到第一实部虚部集,所述第一实部虚部集中包括预设个数的维度不同的实部信息和虚部信息;对所述第二实部信息和所述第二虚部信息进行采样操作,得到第二实部虚部集,所述第二实部虚部集中包括预设个数的维度不同的实部信息和虚部信息;根据所述第一实部虚部集和所述第二实部虚部集得到第二损失参数;将所述第一损失参数和第二损失参数的和作为所述网络损失参数。
In one embodiment, the obtaining a Mel spectrum corresponding to the training speech according to the training speech includes: processing the training speech using a short-time Fourier transform to obtain a complex spectrum corresponding to the training speech; calculating a magnitude spectrum and a phase spectrum corresponding to the training speech according to the complex spectrum corresponding to the training speech; and filtering the magnitude spectrum corresponding to the training speech using a Mel filter to obtain the Mel spectrum corresponding to the training speech.
It should be noted that the above speech synthesis method, speech synthesis apparatus, computer device, and computer-readable storage medium belong to one general inventive concept, and the contents in the embodiments of the speech synthesis method, speech synthesis apparatus, computer device, and computer-readable storage medium are mutually applicable.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments have been described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the patent scope of the present application. It should be pointed out that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (11)

  1. A speech synthesis method, characterized in that the method comprises:
    acquiring text to be synthesized;
    obtaining a Mel spectrum corresponding to the text to be synthesized according to the text to be synthesized;
    inputting the Mel spectrum into a complex neural network to obtain a complex spectrum corresponding to the text to be synthesized, the complex spectrum comprising real part information and imaginary part information; and
    obtaining synthesized speech corresponding to the text to be synthesized according to the complex spectrum.
  2. The method according to claim 1, characterized in that the obtaining synthesized speech corresponding to the text to be synthesized according to the complex spectrum comprises:
    processing the complex spectrum using an inverse short-time Fourier transform to obtain the synthesized speech corresponding to the text to be synthesized.
  3. The method according to claim 1, characterized in that the complex neural network comprises a downsampling network and an upsampling network, the upsampling network comprising a real part deconvolution kernel and an imaginary part deconvolution kernel; and the inputting the Mel spectrum into the complex neural network to obtain the complex spectrum corresponding to the text to be synthesized comprises:
    inputting the Mel spectrum into the downsampling network of the complex neural network to obtain spectral features corresponding to the Mel spectrum output by the downsampling network;
    inputting the spectral features corresponding to the Mel spectrum into the upsampling network;
    processing, by the real part deconvolution kernel in the upsampling network, the spectral features corresponding to the Mel spectrum to obtain the real part information corresponding to the text to be synthesized; and
    processing, by the imaginary part deconvolution kernel in the upsampling network, the spectral features corresponding to the Mel spectrum to obtain the imaginary part information corresponding to the text to be synthesized.
  4. The method according to claim 1, characterized by, before the acquiring text to be synthesized, further comprising:
    acquiring training speech;
    obtaining a Mel spectrum corresponding to the training speech according to the training speech;
    inputting the Mel spectrum corresponding to the training speech into the complex neural network to obtain first real part information and first imaginary part information corresponding to the training speech;
    obtaining synthesized speech corresponding to the training speech according to the first real part information and the first imaginary part information;
    obtaining second real part information and second imaginary part information corresponding to the training speech according to the training speech; and
    obtaining a network loss parameter according to the training speech, the synthesized speech corresponding to the training speech, the first real part information, the first imaginary part information, the second real part information, and the second imaginary part information, so that the complex neural network is updated according to the network loss parameter.
  5. The method according to claim 4, characterized in that the obtaining a network loss parameter according to the training speech, the synthesized speech corresponding to the training speech, the first real part information, the first imaginary part information, the second real part information, and the second imaginary part information comprises:
    obtaining a first loss parameter according to the training speech and the synthesized speech corresponding to the training speech;
    performing a sampling operation on the first real part information and the first imaginary part information to obtain a first real-imaginary set, the first real-imaginary set including a preset number of pieces of real part information and imaginary part information of different dimensions;
    performing a sampling operation on the second real part information and the second imaginary part information to obtain a second real-imaginary set, the second real-imaginary set including a preset number of pieces of real part information and imaginary part information of different dimensions;
    obtaining a second loss parameter according to the first real-imaginary set and the second real-imaginary set; and
    taking the sum of the first loss parameter and the second loss parameter as the network loss parameter.
  6. The method according to claim 4, characterized in that the obtaining a Mel spectrum corresponding to the training speech according to the training speech comprises:
    processing the training speech using a short-time Fourier transform to obtain a complex spectrum corresponding to the training speech;
    calculating a magnitude spectrum and a phase spectrum corresponding to the training speech according to the complex spectrum corresponding to the training speech; and
    filtering the magnitude spectrum corresponding to the training speech using a Mel filter to obtain the Mel spectrum corresponding to the training speech.
  7. A speech synthesis apparatus, characterized in that the apparatus comprises:
    a text acquisition module, configured to acquire text to be synthesized;
    a first spectrum module, configured to obtain a Mel spectrum corresponding to the text to be synthesized according to the text to be synthesized;
    a second spectrum module, configured to input the Mel spectrum into a complex neural network to obtain a complex spectrum corresponding to the text to be synthesized, the complex spectrum comprising real part information and imaginary part information; and
    a speech synthesis module, configured to obtain synthesized speech corresponding to the text to be synthesized according to the complex spectrum.
  8. The apparatus according to claim 7, characterized in that the speech synthesis module comprises:
    an inverse transform module, configured to process the complex spectrum using an inverse short-time Fourier transform to obtain the synthesized speech corresponding to the text to be synthesized.
  9. The apparatus according to claim 7, characterized in that the complex neural network comprises a downsampling network and an upsampling network, the upsampling network comprising a real part deconvolution kernel and an imaginary part deconvolution kernel; and the second spectrum module comprises:
    a downsampling module, configured to input the Mel spectrum into the downsampling network of the complex neural network to obtain spectral features corresponding to the Mel spectrum output by the downsampling network;
    an upsampling input module, configured to input the spectral features corresponding to the Mel spectrum into the upsampling network;
    a real part module, configured for the real part deconvolution kernel in the upsampling network to process the spectral features corresponding to the Mel spectrum to obtain the real part information corresponding to the text to be synthesized; and
    an imaginary part module, configured for the imaginary part deconvolution kernel in the upsampling network to process the spectral features corresponding to the Mel spectrum to obtain the imaginary part information corresponding to the text to be synthesized.
  10. A computer device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the speech synthesis method according to any one of claims 1 to 6.
  11. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the speech synthesis method according to any one of claims 1 to 6.
PCT/CN2019/127911 2019-12-24 2019-12-24 Speech synthesis method and apparatus, computer device and storage medium WO2021127978A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201980003188.6A CN111316352B (zh) 2019-12-24 2019-12-24 Speech synthesis method and apparatus, computer device and storage medium
PCT/CN2019/127911 WO2021127978A1 (zh) 2019-12-24 2019-12-24 Speech synthesis method and apparatus, computer device and storage medium
US17/117,148 US11763796B2 (en) 2019-12-24 2020-12-10 Computer-implemented method for speech synthesis, computer device, and non-transitory computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/127911 WO2021127978A1 (zh) 2019-12-24 2019-12-24 Speech synthesis method and apparatus, computer device and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/117,148 Continuation US11763796B2 (en) 2019-12-24 2020-12-10 Computer-implemented method for speech synthesis, computer device, and non-transitory computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2021127978A1 true WO2021127978A1 (zh) 2021-07-01

Family

ID=71147678

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127911 WO2021127978A1 (zh) 2019-12-24 2019-12-24 Speech synthesis method and apparatus, computer device and storage medium

Country Status (3)

Country Link
US (1) US11763796B2 (zh)
CN (1) CN111316352B (zh)
WO (1) WO2021127978A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037760B (zh) 2020-08-24 2022-01-07 北京百度网讯科技有限公司 Training method and apparatus for a speech spectrum generation model, and electronic device
CN112382271B (zh) * 2020-11-30 2024-03-26 北京百度网讯科技有限公司 Speech processing method and apparatus, electronic device and storage medium
CN112634914B (zh) * 2020-12-15 2024-03-29 中国科学技术大学 Neural network vocoder training method based on short-time spectral consistency
WO2022133630A1 (zh) * 2020-12-21 2022-06-30 深圳市优必选科技股份有限公司 Cross-lingual audio conversion method, computer device and storage medium
CN112712812B (zh) * 2020-12-24 2024-04-26 腾讯音乐娱乐科技(深圳)有限公司 Audio signal generation method, apparatus, device and storage medium
CN113470616B (zh) * 2021-07-14 2024-02-23 北京达佳互联信息技术有限公司 Speech processing method and apparatus, vocoder, and vocoder training method
CN114265373A (zh) * 2021-11-22 2022-04-01 煤炭科学研究总院 Integrated console control system for a fully mechanized coal mining face

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011026247A1 (en) * 2009-09-04 2011-03-10 Svox Ag Speech enhancement techniques on the power spectrum
JP5085700B2 (ja) * 2010-08-30 2012-11-28 株式会社東芝 Speech synthesis apparatus, speech synthesis method and program
US10872596B2 (en) * 2017-10-19 2020-12-22 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech
US11017761B2 (en) * 2017-10-19 2021-05-25 Baidu Usa Llc Parallel neural text-to-speech
TWI651927B (zh) * 2018-02-14 2019-02-21 National Central University Signal source separation method and signal source separation apparatus
US11462209B2 (en) * 2018-05-18 2022-10-04 Baidu Usa Llc Spectrogram to waveform synthesis using convolutional networks
CN109065067B (zh) * 2018-08-16 2022-12-06 福建星网智慧科技有限公司 Conference terminal speech noise reduction method based on a neural network model
CN109817198B (zh) * 2019-03-06 2021-03-02 广州多益网络股份有限公司 Speech synthesis method, apparatus and storage medium
CN110310621A (zh) * 2019-05-16 2019-10-08 平安科技(深圳)有限公司 Singing synthesis method, apparatus, device and computer-readable storage medium
US20220165247A1 (en) * 2020-11-24 2022-05-26 Xinapse Co., Ltd. Method for generating synthetic speech and speech synthesis system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014168591A1 (en) * 2013-04-11 2014-10-16 Cetinturk Cetin Relative excitation features for speech recognition
CN109754778A (zh) * 2019-01-17 2019-05-14 平安科技(深圳)有限公司 文本的语音合成方法、装置和计算机设备
CN109523989A (zh) * 2019-01-29 2019-03-26 网易有道信息技术(北京)有限公司 语音合成方法、语音合成装置、存储介质及电子设备
CN110136690A (zh) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 语音合成方法、装置及计算机可读存储介质
CN110211604A (zh) * 2019-06-17 2019-09-06 广东技术师范大学 一种用于语音变形检测的深度残差网络结构

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHENG NAIJUN: "SIGNAL ENHANCEMENT BASED ON COMPLEX-VALUED NEURAL NETWORKS", XIDIAN UNIVERSITY MASTER'S THESES, 1 January 2018 (2018-01-01), XP055827314 *

Also Published As

Publication number Publication date
CN111316352B (zh) 2023-10-10
US11763796B2 (en) 2023-09-19
CN111316352A (zh) 2020-06-19
US20220189454A1 (en) 2022-06-16

Similar Documents

Publication Publication Date Title
WO2021127978A1 (zh) Speech synthesis method and apparatus, computer device and storage medium
JP7427723B2 (ja) Text-to-speech synthesis in a target speaker's voice using neural networks
CN111081268A (zh) Phase-correlated shared deep convolutional neural network speech enhancement method
WO2021128256A1 (zh) Voice conversion method, apparatus, device and storage medium
DE112014003337T5 (de) Speech signal separation and synthesis based on auditory scene analysis and speech modeling
US20220253700A1 (en) Audio signal time sequence processing method, apparatus and system based on neural network, and computer-readable storage medium
WO2019233364A1 (zh) Deep-learning-based audio quality enhancement
WO2021127811A1 (zh) Speech synthesis method and apparatus, intelligent terminal and readable medium
EP4099316A1 (en) Speech synthesis method and system
WO2023283823A1 (zh) Speech adversarial sample detection method, apparatus, device and computer-readable storage medium
WO2020015270A1 (zh) Speech signal separation method and apparatus, computer device and storage medium
WO2022141868A1 (zh) Method and apparatus for extracting speech features, terminal and storage medium
CN108922561A (zh) Speech differentiation method and apparatus, computer device and storage medium
CN111261177A (zh) Voice conversion method, electronic apparatus and computer-readable storage medium
CN113823308B (zh) Method for speech denoising using a single noisy speech sample
CN113470684A (zh) Audio noise reduction method, apparatus, device and storage medium
Sheng et al. High-quality speech synthesis using super-resolution mel-spectrogram
CN108172214A (zh) Mel-domain wavelet feature parameter extraction method for speech recognition
CN113421584B (zh) Audio noise reduction method and apparatus, computer device and storage medium
CN116705056A (zh) Audio generation method, vocoder, electronic device and storage medium
CN116959465A (zh) Voice conversion model training method, voice conversion method, apparatus and medium
CN115798453A (zh) Speech reconstruction method and apparatus, computer device and storage medium
CN111108558B (zh) Voice conversion method and apparatus, computer device and computer-readable storage medium
KR102400598B1 (ko) Machine-learning-based noise removal method and apparatus therefor
CN114141259A (zh) Voice conversion method, apparatus, device, storage medium and program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19957144

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19957144

Country of ref document: EP

Kind code of ref document: A1