CN113744714A - Speech synthesis method, speech synthesis device, computer equipment and storage medium - Google Patents
Speech synthesis method, speech synthesis device, computer equipment and storage medium
- Publication number
- CN113744714A (application number CN202111136538.1A)
- Authority
- CN
- China
- Prior art keywords
- sample data
- audio
- mel
- text
- speech synthesis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The embodiments of the invention disclose a speech synthesis method, a speech synthesis device, computer equipment and a storage medium. The method comprises the following steps: acquiring a text file to be synthesized; inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum; and inputting the Mel frequency spectrum into a Mel-GAN vocoder improved based on the static discrete wavelet transform for speech synthesis to obtain speech audio. By implementing the method of the embodiments of the invention, the final representation of the generated waveform can be improved quickly and effectively, and the sound quality of the synthesized speech audio is improved.
Description
Technical Field
The present invention relates to speech synthesis, and more particularly to a speech synthesis method, apparatus, computer device, and storage medium.
Background
With the rapid development of speech synthesis technology in recent years, research on neural-network vocoders based on GANs (Generative Adversarial Networks) has advanced greatly and improved the sound quality of synthesized audio. Neural-network-based speech synthesis can synthesize audio resembling natural human speech in real time. However, differences between the generated audio and real audio remain in the frequency domain, so the synthesized audio contains noise and its sound quality is degraded.
Therefore, it is necessary to design a new method that quickly and effectively improves the final representation of the generated waveform and improves the sound quality of the synthesized speech audio.
Disclosure of Invention
The present invention is directed to overcoming the drawbacks of the prior art and providing a speech synthesis method, apparatus, computer device and storage medium.
To achieve this object, the invention adopts the following technical scheme: a speech synthesis method, comprising:
acquiring a text file to be synthesized;
inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum;
and inputting the Mel frequency spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for voice synthesis to obtain voice audio.
The further technical scheme is as follows: the acoustic spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
The further technical scheme is as follows: the Mel-GAN vocoder improved based on the static discrete wavelet transform is obtained by training a GAN model through audio sample data with a Mel frequency spectrum; and the GAN model performs downsampling on the audio sample data with the Mel frequency spectrum by using one-dimensional static discrete wavelet transform.
The further technical scheme is as follows: the acoustic spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data, and comprises the following steps:
acquiring text sample data and corresponding audio sample data;
carrying out regularization processing on the text sample data to obtain a standard format text;
performing noise reduction and denoising processing on the audio sample data to obtain processed audio data;
aligning the standard format text and the processed audio data to obtain data to be trained;
constructing a deep learning network;
and training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
The further technical scheme is as follows: the GAN model employs a multi-scale discriminator, with multiple discriminators operating at different audio resolutions.
The further technical scheme is as follows: the Mel-GAN vocoder improved based on the static discrete wavelet transform is obtained by training a GAN model through audio sample data with Mel frequency spectrum, and comprises the following steps:
and audio sample data with a Mel frequency spectrum is subjected to one-dimensional static discrete wavelet transform of the GAN model to obtain sub-band signals of multiple frequencies, and the sub-band signals of the multiple frequencies are subjected to convolution through a convolution layer of the GAN model to perform speech synthesis.
The present invention also provides a speech synthesis apparatus comprising:
the file acquisition unit is used for acquiring a text file to be synthesized;
the acoustic feature extraction unit is used for inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum;
and the voice synthesis unit is used for inputting the Mel frequency spectrum into a Mel-GAN vocoder improved based on static discrete wavelet transform for voice synthesis so as to obtain voice audio.
The further technical scheme is as follows: further comprising:
and the prediction network generation unit is used for training the deep learning network after preprocessing the text sample data and the corresponding audio sample data to obtain the acoustic spectrum prediction network.
The invention also provides computer equipment which comprises a memory and a processor, wherein the memory is stored with a computer program, and the processor realizes the method when executing the computer program.
The invention also provides a storage medium storing a computer program which, when executed by a processor, is operable to carry out the method as described above.
Compared with the prior art, the invention has the following beneficial effects. According to the invention, the text file to be synthesized is obtained, the acoustic features are extracted by means of the sound spectrum prediction network, and the extracted Mel frequency spectrum is input into the Mel-GAN vocoder improved based on the static discrete wavelet transform for speech synthesis. The static discrete wavelet transform extracts the frequency-domain information of the signal more completely and avoids losing the high-frequency detail parts, so the quality of the generated audio can be effectively improved, the final representation of the generated waveform is improved quickly and effectively, and the sound quality of the synthesized speech audio is improved.
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a speech synthesis method according to an embodiment of the present invention;
FIG. 3 is a sub-flow diagram of a speech synthesis method according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a speech synthesis method according to an embodiment of the present invention, and fig. 2 is a schematic flow chart of the speech synthesis method. The speech synthesis method is applied to a server: the server obtains a text file to be synthesized from a terminal, inputs the text file into a sound spectrum prediction network for Mel spectrum extraction, and inputs the extracted Mel spectrum into a Mel-GAN vocoder improved based on the static discrete wavelet transform for speech synthesis, thereby obtaining speech audio.
Fig. 2 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention. As shown in fig. 2, the method includes the following steps S110 to S130.
And S110, acquiring a text file to be synthesized.
In the present embodiment, the text file to be synthesized refers to text characters to be synthesized.
And S120, inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum.
In this embodiment, the mel spectrum refers to the acoustic features corresponding to the text file to be synthesized.
Specifically, the sound spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
In an embodiment, referring to fig. 3, the step S120 may include steps S121 to S126.
And S121, acquiring text sample data and corresponding audio sample data.
In this embodiment, the text sample data refers to data with a text format, and the audio sample data refers to audio content.
And S122, performing regularization processing on the text sample data to obtain a standard format text.
In this embodiment, the standard format text refers to text content with a format meeting the requirement.
The collected text sample data are regularized in several passes. Specifically, in Chinese/English text, special symbols such as Chinese title marks (《》), Roman numerals, the dollar sign $, the pound sign £ and the renminbi sign ¥ are replaced with their corresponding Chinese-character or English-word forms; ordinal numbers and Arabic numerals are replaced with the corresponding Chinese or English words; meaningless special symbols are removed; and finally the Chinese-character text is converted into pinyin-form text, thereby forming the standard format text.
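As an illustrative sketch only, the regularization step described above could be approximated in Python as follows; the replacement tables, the helper name normalize_text and the use of the pypinyin package are assumptions made for this example, not part of the patent.

```python
import re
from pypinyin import lazy_pinyin  # assumed third-party package for Chinese-to-pinyin conversion

# Illustrative (not exhaustive) replacement tables for special symbols and digits.
SYMBOL_MAP = {"$": "美元", "£": "英镑", "¥": "人民币", "《": "", "》": ""}
DIGIT_MAP = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
             "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def normalize_text(text: str) -> str:
    """Regularize raw text into a standard-format pinyin text."""
    for symbol, word in SYMBOL_MAP.items():                 # special symbols -> Chinese word forms
        text = text.replace(symbol, word)
    text = "".join(DIGIT_MAP.get(ch, ch) for ch in text)    # Arabic digits -> Chinese words
    text = re.sub(r"[^0-9A-Za-z\u4e00-\u9fff]", "", text)   # drop meaningless special symbols
    return " ".join(lazy_pinyin(text))                       # Chinese characters -> pinyin

print(normalize_text("价格是$5《测试》!"))  # e.g. "jia ge shi mei yuan wu ce shi"
```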
And S123, performing noise reduction and denoising on the audio sample data to obtain processed audio data.
In this embodiment, noise reduction and denoising are performed on the audio sample data to remove meaningless audio, so that high-quality, low-noise and clear audio data are obtained for training the sound spectrum prediction network.
And S124, aligning the standard format text and the processed audio data to obtain the data to be trained.
In this embodiment, the data to be trained refers to data formed by aligning standard format text and processed audio data.
S125, constructing a deep learning network;
and S126, training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
Specifically, the standard format text and the processed audio data are aligned and input into the sound spectrum prediction network for training, so that the Mel frequency spectrum corresponding to the audio is generated. The standard format text and the processed audio data are placed in one-to-one correspondence, each audio clip corresponding to a unique text. The audio and text data are input into the sound spectrum prediction network, the acoustic features of the audio are extracted and the text is encoded, and the network is trained to obtain the weight parameters of the model; the network then outputs the corresponding acoustic features, namely the Mel frequency spectrum.
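As a hedged illustration of this pairing step (the patent does not name a specific toolkit; librosa and the parameter values below are assumptions for the example), each audio clip can be turned into a Mel-spectrum training target aligned with its normalized text:

```python
import librosa
import numpy as np

def make_training_pair(wav_path: str, normalized_text: str,
                       sr: int = 22050, n_mels: int = 80):
    """Build one (text, Mel-spectrum) pair for training the sound spectrum prediction network."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    log_mel = np.log(np.clip(mel, 1e-5, None))  # log compression, as is common for acoustic features
    return normalized_text, log_mel             # each audio clip corresponds to exactly one text
```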
And S130, inputting the Mel frequency spectrum into a Mel-GAN vocoder improved based on static discrete wavelet transform for voice synthesis to obtain voice audio.
In the present embodiment, the voice audio refers to the voice content formed by the text file to be synthesized.
Specifically, the Mel-GAN vocoder improved based on the static discrete wavelet transform is obtained by training a GAN model through audio sample data with a Mel frequency spectrum; and the GAN model performs downsampling on the audio sample data with the Mel frequency spectrum by using one-dimensional static discrete wavelet transform.
Specifically, the Mel-GAN vocoder improved based on the static discrete wavelet transform is obtained by training a GAN model through audio sample data with Mel frequency spectrum, and comprises the following steps:
and audio sample data with a Mel frequency spectrum is subjected to one-dimensional static discrete wavelet transform of the GAN model to obtain sub-band signals of multiple frequencies, and the sub-band signals of the multiple frequencies are subjected to convolution through a convolution layer of the GAN model to perform speech synthesis.
The one-dimensional static discrete wavelet transform (stationary wavelet transform) can effectively decompose a signal into a low-frequency part and a high-frequency part, so the low-frequency envelope information and the high-frequency detail information of the signal are retained during downsampling. The discriminators at different scales can capture audio with fine-grained structure at different levels, and each discriminator learns the features of a different audio frequency range; the multi-scale discriminators share the same network structure and operate on different audio scales in the frequency domain. The static discrete wavelet transform can more completely extract the information of the signal in the frequency domain, avoid losing the high-frequency detail parts, and effectively improve the quality of the generated audio.
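A minimal sketch of such a one-dimensional static (stationary) discrete wavelet decomposition, assuming the PyWavelets package; the wavelet family, level and signal length are illustrative choices, not values taken from the patent:

```python
import numpy as np
import pywt

audio = np.random.randn(16384)  # stand-in waveform; swt needs a length divisible by 2**level
coeffs = pywt.swt(audio, wavelet="db1", level=2)  # stationary DWT: no decimation is applied

for approx, detail in coeffs:
    # approx carries the low-frequency envelope, detail the high-frequency detail of that level;
    # both keep the original length, so no sampling information is thrown away
    print(approx.shape, detail.shape)
```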
The Mel-GAN vocoder improved based on the static discrete wavelet transform is trained, and the weight parameters of the vocoder model are obtained to be used for generating audio by utilizing Mel spectrum reasoning in the voice synthesis process.
By training a GAN model, the Mel-GAN vocoder can generate high-quality, coherent waveforms and achieves good results in speech synthesis, music translation, music synthesis and similar tasks. The Mel-GAN vocoder is a non-autoregressive model with few parameters, so inferring the waveform from the spectrum is fast. Mel-GAN consists of a generator and a discriminator; the generator is a non-autoregressive feed-forward convolutional structure that takes the Mel frequency spectrum as input and outputs the audio waveform. A multi-scale discriminator is employed, with multiple discriminators operating at different audio resolutions obtained by progressively downsampling the original audio with average pooling. Average-pooling downsampling does provide frequency-domain information at different resolutions, but it ignores the Nyquist sampling theorem, namely that the sampling frequency fs must be at least twice the highest frequency of the signal for the sampled signal to retain all the information of the original signal. Because this condition is violated, the high-frequency content of the downsampled signal is aliased and the high-frequency spectrum is distorted and becomes invalid, so the frequency content of the generated audio differs from that of real audio; this difference produces noise that is heard as hissing or reverberation in the audio. Therefore, a Mel-GAN vocoder based on the static discrete wavelet transform is proposed: the one-dimensional static discrete wavelet transform replaces the average-pooling operation for downsampling, because it effectively decomposes the signal into a low-frequency part and a high-frequency part and retains the low-frequency envelope information and the high-frequency detail information of the signal during sampling.
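The PyTorch sketch below illustrates one way to realize this idea: the average-pooling downsampling of the multi-scale discriminator is replaced by a fixed Haar wavelet analysis that keeps both sub-bands on the channel axis. The class name, the Haar filter taps and the stride-2 (decimated) formulation are assumptions made for illustration; a strictly stationary (undecimated) transform would omit the decimation. This is a sketch of the general technique, not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveletDownsample1d(nn.Module):
    """Split a waveform into low- and high-frequency sub-bands with fixed Haar filters.

    Both sub-bands are kept and stacked on the channel axis, so the low-frequency
    envelope and the high-frequency detail both survive the downsampling, unlike
    with nn.AvgPool1d, which discards the high-frequency content.
    """
    def __init__(self):
        super().__init__()
        inv_sqrt2 = 2 ** -0.5
        self.register_buffer("g", torch.tensor([[[inv_sqrt2, inv_sqrt2]]]))   # low-pass filter g
        self.register_buffer("h", torch.tensor([[[inv_sqrt2, -inv_sqrt2]]]))  # high-pass filter h

    def forward(self, x):                         # x: (batch, 1, time)
        y_low = F.conv1d(x, self.g, stride=2)
        y_high = F.conv1d(x, self.h, stride=2)
        return torch.cat([y_low, y_high], dim=1)  # (batch, 2, time // 2)

# Usage: feed each sub-discriminator a progressively downsampled view of the audio.
x = torch.randn(4, 1, 8192)
print(WaveletDownsample1d()(x).shape)  # torch.Size([4, 2, 4096])
```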
The training discriminator loss is defined as:

$$\min_{D_k}\ \mathbb{E}_{x}\big[\min\big(0,\,1-D_k(x)\big)\big]+\mathbb{E}_{s,z}\big[\min\big(0,\,1+D_k(G(s,z))\big)\big],\quad k=1,2,3$$

The training generator loss is defined as:

$$\min_{G}\ \mathbb{E}_{s,z}\Big[\sum_{k=1}^{3}-D_k\big(G(s,z)\big)\Big]$$

where x represents the input signal, s represents the Mel spectrum, z represents Gaussian noise, k indexes the three sub-discriminators, $\mathbb{E}$ represents the expectation, $D_k$ represents the k-th sub-discriminator, and G represents the generator.
An input signal x is subjected to the one-dimensional static discrete wavelet transform to obtain a low-frequency part and a high-frequency part, which are obtained mainly through a low-pass filter g and a high-pass filter h:

$$y_{low}[n]=\sum_{m=0}^{N-1}x[m]\,g_k[n-m],\qquad y_{high}[n]=\sum_{m=0}^{N-1}x[m]\,h_k[n-m]$$

where $y_{low}[n]$ and $y_{high}[n]$ respectively represent the output of the signal after passing through the low-pass filter and the high-pass filter, N represents the length of the signal x, and k represents the decomposition level of the static wavelet transform. Since the static discrete wavelet transform has the biorthogonal property, the low-frequency and high-frequency parts of the signal can be safely decomposed, and after each level of wavelet decomposition the sub-band signals of all frequencies can be passed to the convolution layer through the channel connection.
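A small NumPy sketch of this undecimated filtering (circular convolution is assumed here purely so that the outputs keep the full length N; the Haar taps are illustrative, not prescribed by the patent):

```python
import numpy as np

def circular_filter(x, taps):
    """Undecimated filtering y[n] = sum_m x[m] * taps[(n - m) mod N], via FFT circular convolution."""
    kernel = np.zeros(len(x))
    kernel[:len(taps)] = taps
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(kernel)))

x = np.random.randn(1024)
g = np.array([2 ** -0.5, 2 ** -0.5])    # illustrative low-pass filter g
h = np.array([2 ** -0.5, -2 ** -0.5])   # illustrative high-pass filter h
y_low, y_high = circular_filter(x, g), circular_filter(x, h)
print(y_low.shape, y_high.shape)         # both keep the full length N, so no detail is discarded
```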
The discriminator loss is defined as:

$$L_D=\sum_{k}\Big\|D_k\big(\Phi_k(x)\big)-D_k\big(\Phi_k(\hat{x})\big)\Big\|_2$$

where $L_D$ represents the discriminator loss, x and $\hat{x}$ respectively represent the real audio and the generated audio, $\Phi_k$ represents the k-th level of the static discrete wavelet transform, k represents the decomposition level of the static wavelet transform, $\|\cdot\|_2$ represents the L2 norm, and $D_k$ represents the discriminator at the k-th decomposition level.
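Read this way, the loss can be sketched in a few lines of PyTorch; the function name and the list-based calling convention are placeholders for illustration, and the formula itself is only a plausible reading of the definition above.

```python
import torch

def discriminator_loss(real_responses, fake_responses):
    """Sum over decomposition levels k of || D_k(Phi_k(x)) - D_k(Phi_k(x_hat)) ||_2.

    real_responses / fake_responses: lists whose k-th entry is the k-th sub-discriminator's
    output for the real audio x and the generated audio x_hat, each already passed through
    the level-k static discrete wavelet analysis Phi_k.
    """
    return sum(torch.norm(r - f, p=2) for r, f in zip(real_responses, fake_responses))
```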
The static discrete wavelet transform can more completely extract the information of the signal in the frequency domain, avoid losing high-frequency detail parts, and effectively improve the quality of the generated audio.
The training data are organized, including text sample data and corresponding audio sample data. The audio sample data are first de-duplicated and de-noised, and the audio is then clipped into segments of about 8 seconds each. The text sample data are organized and split: in Chinese/English text, special symbols such as Chinese title marks (《》), Roman numerals, the dollar sign $, the pound sign £ and the renminbi sign ¥ are replaced with their corresponding Chinese-character or English-word forms; ordinal numbers and Arabic numerals are replaced with the corresponding Chinese or English words; meaningless special symbols are removed; and finally the Chinese-character text is converted into pinyin text.
The clipped clean audio sample data with their Mel frequency spectra are input into the Mel-GAN vocoder model improved based on the static discrete wavelet transform for training, and the trained vocoder model weight parameters are obtained for generating audio from the Mel spectrum at inference time during speech synthesis. A Mel-spectrum loss is added to improve the training efficiency of the generator and the fidelity of the generated audio. Without reducing the model training and inference speed, the difference between the generated audio and the real audio is reduced, more sampling information is retained, and the quality of the generated speech is greatly improved.
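One way the added Mel-spectrum loss can be realized is sketched below with torchaudio; the transform parameters and the choice of an L1 distance are assumptions for the example, not values specified by the patent.

```python
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

def mel_spectrum_loss(real_wav: torch.Tensor, fake_wav: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss pulling the Mel spectrum of the generated audio toward that of the real audio."""
    return torch.nn.functional.l1_loss(mel_transform(fake_wav), mel_transform(real_wav))
```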
According to the speech synthesis method, the text file to be synthesized is obtained, the acoustic features are extracted by means of the sound spectrum prediction network, and the extracted Mel frequency spectrum is input into the Mel-GAN vocoder improved based on the static discrete wavelet transform for speech synthesis. The static discrete wavelet transform extracts the frequency-domain information of the signal more completely and avoids losing the high-frequency detail parts, so the quality of the generated audio can be effectively improved, the final representation of the generated waveform is improved quickly and effectively, and the sound quality of the synthesized speech audio is improved.
Fig. 4 is a schematic block diagram of a speech synthesis apparatus 300 according to an embodiment of the present invention. As shown in fig. 4, the present invention also provides a speech synthesis apparatus 300 corresponding to the above speech synthesis method. The speech synthesis apparatus 300 includes a unit for performing the above-described speech synthesis method, and the apparatus may be configured in a server. Specifically, referring to fig. 4, the speech synthesis apparatus 300 includes a file obtaining unit 301, an acoustic feature extracting unit 302, and a speech synthesis unit 303.
A file acquiring unit 301 configured to acquire a text file to be synthesized; an acoustic feature extraction unit 302, configured to input the text file to be synthesized into a sound spectrum prediction network to extract an acoustic feature, so as to obtain a mel-frequency spectrum; a speech synthesis unit 303, configured to input the Mel spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for speech synthesis, so as to obtain speech audio.
In one embodiment, the speech synthesis apparatus 300 further comprises: and the prediction network generation unit is used for training the deep learning network after preprocessing the text sample data and the corresponding audio sample data to obtain the acoustic spectrum prediction network.
In an embodiment, the speech synthesis apparatus 300 further comprises a vocoder generation unit for training the GAN model with audio sample data with Mel spectrum to obtain Mel-GAN vocoder improved based on static discrete wavelet transform, and the GAN model down-samples the audio sample data with Mel spectrum by using one-dimensional static discrete wavelet transform.
In an embodiment, the vocoder generating unit is configured to obtain subband signals of multiple frequencies through one-dimensional static discrete wavelet transform of the GAN model by using audio sample data with mel spectrum, and perform convolution on the subband signals of multiple frequencies through the convolution layer of the GAN model to perform speech synthesis.
In an embodiment, the prediction network generation unit includes a data acquisition subunit, a regularization subunit, a noise reduction subunit, an alignment subunit, a construction subunit, and a training subunit.
The data acquisition subunit is used for acquiring text sample data and corresponding audio sample data; the regularization subunit is used for regularizing the text sample data to obtain a standard format text; the noise reduction subunit is used for carrying out noise reduction and denoising processing on the audio sample data to obtain processed audio data; the aligning subunit is used for aligning the standard format text and the processed audio data to obtain data to be trained; the building subunit is used for building a deep learning network; and the training subunit is used for training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
It should be noted that, as can be clearly understood by those skilled in the art, the specific implementation processes of the speech synthesis apparatus 300 and each unit may refer to the corresponding descriptions in the foregoing method embodiments, and for convenience and brevity of description, no further description is provided herein.
The speech synthesis apparatus 300 may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a server, wherein the server may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 5, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032 comprises program instructions that, when executed, cause the processor 502 to perform a speech synthesis method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to execute a speech synthesis method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will appreciate that the configuration shown in fig. 5 is a block diagram of only a portion of the configuration associated with the present application and does not constitute a limitation of the computer device 500 to which the present application may be applied, and that a particular computer device 500 may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
Wherein the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
acquiring a text file to be synthesized; inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum; and inputting the Mel frequency spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for voice synthesis to obtain voice audio.
The acoustic spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
The Mel-GAN vocoder improved based on the static discrete wavelet transform is obtained by training a GAN model through audio sample data with a Mel frequency spectrum; and the GAN model performs downsampling on the audio sample data with the Mel frequency spectrum by using one-dimensional static discrete wavelet transform.
The GAN model employs a multi-scale discriminator, with multiple discriminators operating at different audio resolutions.
In an embodiment, when implementing the step that the audio spectrum prediction network is obtained by training the deep learning network after preprocessing text sample data and corresponding audio sample data, the processor 502 specifically implements the following steps:
acquiring text sample data and corresponding audio sample data; carrying out regularization processing on the text sample data to obtain a standard format text; performing noise reduction and denoising processing on the audio sample data to obtain processed audio data; aligning the standard format text and the processed audio data to obtain data to be trained; constructing a deep learning network; and training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
In an embodiment, when the Mel-GAN vocoder modified based on the static discrete wavelet transform is implemented by the processor 502, the following steps are specifically implemented when the step of training the GAN model by using the audio sample data with the Mel frequency spectrum is implemented:
and audio sample data with a Mel frequency spectrum is subjected to one-dimensional static discrete wavelet transform of the GAN model to obtain sub-band signals of multiple frequencies, and the sub-band signals of the multiple frequencies are subjected to convolution through a convolution layer of the GAN model to perform speech synthesis.
It should be understood that in the embodiments of the present application, the processor 502 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It will be understood by those skilled in the art that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program instructing associated hardware. The computer program includes program instructions, and the computer program may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program, wherein the computer program, when executed by a processor, causes the processor to perform the steps of:
acquiring a text file to be synthesized; inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum; and inputting the Mel frequency spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for voice synthesis to obtain voice audio.
The acoustic spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
The Mel-GAN vocoder improved based on the static discrete wavelet transform is obtained by training a GAN model through audio sample data with a Mel frequency spectrum; and the GAN model performs downsampling on the audio sample data with the Mel frequency spectrum by using one-dimensional static discrete wavelet transform.
The GAN model employs a multi-scale discriminator, with multiple discriminators operating at different audio resolutions.
In an embodiment, when the processor executes the computer program to implement the step that the audio spectrum prediction network is obtained by training the deep learning network after preprocessing text sample data and corresponding audio sample data, the following steps are specifically implemented:
acquiring text sample data and corresponding audio sample data; carrying out regularization processing on the text sample data to obtain a standard format text; performing noise reduction and denoising processing on the audio sample data to obtain processed audio data; aligning the standard format text and the processed audio data to obtain data to be trained; constructing a deep learning network; and training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
In an embodiment, when the processor executes the computer program to implement the step of training the Mel-GAN vocoder based on the static discrete wavelet transform improvement by using audio sample data with Mel frequency spectrum, the processor implements the following steps:
and audio sample data with a Mel frequency spectrum is subjected to one-dimensional static discrete wavelet transform of the GAN model to obtain sub-band signals of multiple frequencies, and the sub-band signals of the multiple frequencies are subjected to convolution through a convolution layer of the GAN model to perform speech synthesis.
The storage medium may be a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, an optical disk, or any other computer-readable medium that can store a computer program.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that the components and steps of the examples have been described above in general functional terms to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and another division manner may be used in actual implementation; elements or components may be combined or integrated into another system, or some features may be omitted or not implemented.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be merged, divided and deleted according to actual needs. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A speech synthesis method, comprising:
acquiring a text file to be synthesized;
inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum;
and inputting the Mel frequency spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for voice synthesis to obtain voice audio.
2. The speech synthesis method according to claim 1, wherein the audio spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
3. The speech synthesis method of claim 1, wherein the Mel-GAN vocoder based on the static discrete wavelet transform improvement is obtained by training a GAN model with audio sample data with Mel frequency spectrum; and the GAN model performs downsampling on the audio sample data with the Mel frequency spectrum by using one-dimensional static discrete wavelet transform.
4. The speech synthesis method according to claim 2, wherein the audio spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data, and comprises:
acquiring text sample data and corresponding audio sample data;
carrying out regularization processing on the text sample data to obtain a standard format text;
performing noise reduction and denoising processing on the audio sample data to obtain processed audio data;
aligning the standard format text and the processed audio data to obtain data to be trained;
constructing a deep learning network;
and training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
5. The method of speech synthesis of claim 3 wherein the GAN model employs a multi-scale discriminator, the plurality of discriminators operating at different audio resolutions.
6. The method of claim 5, wherein the Mel-GAN vocoder based on the static discrete wavelet transform improvement is obtained by training a GAN model with audio sample data with Mel frequency spectrum, comprising:
and audio sample data with a Mel frequency spectrum is subjected to one-dimensional static discrete wavelet transform of the GAN model to obtain sub-band signals of multiple frequencies, and the sub-band signals of the multiple frequencies are subjected to convolution through a convolution layer of the GAN model to perform speech synthesis.
7. A speech synthesis apparatus, comprising:
the file acquisition unit is used for acquiring a text file to be synthesized;
the acoustic feature extraction unit is used for inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum;
and the voice synthesis unit is used for inputting the Mel frequency spectrum into a Mel-GAN vocoder improved based on static discrete wavelet transform for voice synthesis so as to obtain voice audio.
8. The speech synthesis apparatus according to claim 7, further comprising:
and the prediction network generation unit is used for training the deep learning network after preprocessing the text sample data and the corresponding audio sample data to obtain the acoustic spectrum prediction network.
9. A computer device, characterized in that the computer device comprises a memory, on which a computer program is stored, and a processor, which when executing the computer program implements the method according to any of claims 1 to 6.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111136538.1A CN113744714B (en) | 2021-09-27 | 2021-09-27 | Speech synthesis method, device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111136538.1A CN113744714B (en) | 2021-09-27 | 2021-09-27 | Speech synthesis method, device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113744714A true CN113744714A (en) | 2021-12-03 |
CN113744714B CN113744714B (en) | 2024-04-05 |
Family
ID=78741401
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111136538.1A Active CN113744714B (en) | 2021-09-27 | 2021-09-27 | Speech synthesis method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113744714B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113744715A (en) * | 2021-09-27 | 2021-12-03 | 深圳市木愚科技有限公司 | Vocoder speech synthesis method, device, computer equipment and storage medium |
CN117676185A (en) * | 2023-12-05 | 2024-03-08 | 无锡中感微电子股份有限公司 | Packet loss compensation method and device for audio data and related equipment |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5826232A (en) * | 1991-06-18 | 1998-10-20 | Sextant Avionique | Method for voice analysis and synthesis using wavelets |
CN107480471A (en) * | 2017-07-19 | 2017-12-15 | 福建师范大学 | The method for the sequence similarity analysis being characterized based on wavelet transformation |
CN108280416A (en) * | 2018-01-17 | 2018-07-13 | 国家海洋局第三海洋研究所 | A kind of broadband underwater acoustic signal processing method of small echo across scale correlation filtering |
CN109255113A (en) * | 2018-09-04 | 2019-01-22 | 郑州信大壹密科技有限公司 | Intelligent critique system |
CN111402855A (en) * | 2020-03-06 | 2020-07-10 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111627418A (en) * | 2020-05-27 | 2020-09-04 | 携程计算机技术(上海)有限公司 | Training method, synthesizing method, system, device and medium for speech synthesis model |
CN111968618A (en) * | 2020-08-27 | 2020-11-20 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device |
CN112652291A (en) * | 2020-12-15 | 2021-04-13 | 携程旅游网络技术(上海)有限公司 | Speech synthesis method, system, device and storage medium based on neural network |
CN113053354A (en) * | 2021-03-12 | 2021-06-29 | 云知声智能科技股份有限公司 | Method and equipment for improving voice synthesis effect |
CN113112995A (en) * | 2021-05-28 | 2021-07-13 | 思必驰科技股份有限公司 | Word acoustic feature system, and training method and system of word acoustic feature system |
CN113362801A (en) * | 2021-06-10 | 2021-09-07 | 携程旅游信息技术(上海)有限公司 | Audio synthesis method, system, device and storage medium based on Mel spectrum alignment |
CN113409759A (en) * | 2021-07-07 | 2021-09-17 | 浙江工业大学 | End-to-end real-time speech synthesis method |
-
2021
- 2021-09-27 CN CN202111136538.1A patent/CN113744714B/en active Active
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5826232A (en) * | 1991-06-18 | 1998-10-20 | Sextant Avionique | Method for voice analysis and synthesis using wavelets |
CN107480471A (en) * | 2017-07-19 | 2017-12-15 | 福建师范大学 | The method for the sequence similarity analysis being characterized based on wavelet transformation |
CN108280416A (en) * | 2018-01-17 | 2018-07-13 | 国家海洋局第三海洋研究所 | A kind of broadband underwater acoustic signal processing method of small echo across scale correlation filtering |
CN109255113A (en) * | 2018-09-04 | 2019-01-22 | 郑州信大壹密科技有限公司 | Intelligent critique system |
CN111402855A (en) * | 2020-03-06 | 2020-07-10 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111627418A (en) * | 2020-05-27 | 2020-09-04 | 携程计算机技术(上海)有限公司 | Training method, synthesizing method, system, device and medium for speech synthesis model |
CN111968618A (en) * | 2020-08-27 | 2020-11-20 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device |
CN112652291A (en) * | 2020-12-15 | 2021-04-13 | 携程旅游网络技术(上海)有限公司 | Speech synthesis method, system, device and storage medium based on neural network |
CN113053354A (en) * | 2021-03-12 | 2021-06-29 | 云知声智能科技股份有限公司 | Method and equipment for improving voice synthesis effect |
CN113112995A (en) * | 2021-05-28 | 2021-07-13 | 思必驰科技股份有限公司 | Word acoustic feature system, and training method and system of word acoustic feature system |
CN113362801A (en) * | 2021-06-10 | 2021-09-07 | 携程旅游信息技术(上海)有限公司 | Audio synthesis method, system, device and storage medium based on Mel spectrum alignment |
CN113409759A (en) * | 2021-07-07 | 2021-09-17 | 浙江工业大学 | End-to-end real-time speech synthesis method |
Non-Patent Citations (1)
Title |
---|
苏健民 (Su Jianmin): "Application of the wavelet transform in a species identification system based on animal calls" (in Chinese), 《自动化技术与应用》 (Techniques of Automation and Applications), vol. 27, no. 8, pages 77-80 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113744715A (en) * | 2021-09-27 | 2021-12-03 | 深圳市木愚科技有限公司 | Vocoder speech synthesis method, device, computer equipment and storage medium |
CN117676185A (en) * | 2023-12-05 | 2024-03-08 | 无锡中感微电子股份有限公司 | Packet loss compensation method and device for audio data and related equipment |
Also Published As
Publication number | Publication date |
---|---|
CN113744714B (en) | 2024-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10580430B2 (en) | Noise reduction using machine learning | |
US10319390B2 (en) | Method and system for multi-talker babble noise reduction | |
CN113436643B (en) | Training and application method, device and equipment of voice enhancement model and storage medium | |
CN113744714A (en) | Speech synthesis method, speech synthesis device, computer equipment and storage medium | |
CN112712812A (en) | Audio signal generation method, device, equipment and storage medium | |
CN111508518B (en) | Single-channel speech enhancement method based on joint dictionary learning and sparse representation | |
Zhang et al. | Birdsoundsdenoising: Deep visual audio denoising for bird sounds | |
CN113744715A (en) | Vocoder speech synthesis method, device, computer equipment and storage medium | |
Guimarães et al. | Monaural speech enhancement through deep wave-U-net | |
CN108665054A (en) | Based on the Mallat algorithms of genetic algorithm optimization threshold value cardiechema signals noise reduction application | |
US20230186943A1 (en) | Voice activity detection method and apparatus, and storage medium | |
CN113178201A (en) | Unsupervised voice conversion method, unsupervised voice conversion device, unsupervised voice conversion equipment and unsupervised voice conversion medium | |
CN113593590A (en) | Method for suppressing transient noise in voice | |
CN113129919A (en) | Air control voice noise reduction method based on deep learning | |
CN113782044B (en) | Voice enhancement method and device | |
Li et al. | Deeplabv3+ vision transformer for visual bird sound denoising | |
Girirajan et al. | Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network. | |
CN110211598A (en) | Intelligent sound noise reduction communication means and device | |
CN116705056A (en) | Audio generation method, vocoder, electronic device and storage medium | |
CN117496990A (en) | Speech denoising method, device, computer equipment and storage medium | |
CN115762540A (en) | Multidimensional RNN voice noise reduction method, device, equipment and medium | |
CN115440240A (en) | Training method for voice noise reduction, voice noise reduction system and voice noise reduction method | |
Yang et al. | A speech enhancement algorithm combining spectral subtraction and wavelet transform | |
Ullah et al. | Semi-supervised transient noise suppression using OMLSA and SNMF algorithms | |
CN114626424A (en) | Data enhancement-based silent speech recognition method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |