CN113744714B - Speech synthesis method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN113744714B
Authority
CN
China
Prior art keywords
sample data
audio
mel
gan
text
Prior art date
Legal status
Active
Application number
CN202111136538.1A
Other languages
Chinese (zh)
Other versions
CN113744714A (en)
Inventor
黄元忠 (Huang Yuanzhong)
魏静 (Wei Jing)
卢庆华 (Lu Qinghua)
Current Assignee
Shenzhen Muyu Technology Co., Ltd.
Original Assignee
Shenzhen Muyu Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Muyu Technology Co., Ltd.
Priority to CN202111136538.1A
Publication of CN113744714A
Application granted
Publication of CN113744714B
Legal status: Active

Classifications

    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • Y02T10/40: Engine management systems

Abstract

Embodiments of the invention disclose a speech synthesis method, apparatus, computer device and storage medium. The method comprises the following steps: acquiring a text file to be synthesized; inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features, so as to obtain a Mel spectrum; and inputting the Mel spectrum into a Mel-GAN vocoder improved based on the static discrete wavelet transform for speech synthesis, so as to obtain speech audio. By implementing the method of the embodiments of the invention, the final quality of the whole generated waveform can be improved quickly and effectively, and the sound quality of the synthesized speech can be improved.

Description

Speech synthesis method, device, computer equipment and storage medium
Technical Field
The present invention relates to speech synthesis, and more particularly to a speech synthesis method, apparatus, computer device and storage medium.
Background
With the development of speech synthesis technology, neural vocoders based on GANs (Generative Adversarial Networks) have made great progress in recent years and have improved the quality of synthesized audio. Neural speech synthesis can produce audio resembling natural human pronunciation in real time. However, in the frequency domain there is still a gap between the generated audio and real audio, which introduces noise into the synthesized audio and reduces its quality.
Therefore, it is necessary to design a new method that quickly and effectively improves the final quality of the whole generated waveform and the sound quality of synthesized speech.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a speech synthesis method, apparatus, computer device and storage medium.
To achieve the above purpose, the invention adopts the following technical solution: a speech synthesis method, comprising:
acquiring a text file to be synthesized;
inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic characteristics so as to obtain a Mel frequency spectrum;
and inputting the Mel frequency spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for speech synthesis to obtain speech audio.
A further technical solution is as follows: the sound spectrum prediction network is obtained by preprocessing text sample data and corresponding audio sample data and then training a deep learning network.
A further technical solution is as follows: the Mel-GAN vocoder improved based on the static discrete wavelet transform is obtained by training a GAN model on audio sample data with Mel spectra, and the GAN model downsamples the audio sample data with Mel spectra using a one-dimensional static discrete wavelet transform.
A further technical solution is as follows: obtaining the sound spectrum prediction network by preprocessing text sample data and corresponding audio sample data and then training a deep learning network comprises the following steps:
acquiring text sample data and corresponding audio sample data;
regularizing the text sample data to obtain a standard format text;
performing denoising and noise-reduction processing on the audio sample data to obtain processed audio data;
aligning the standard format text and the processed audio data to obtain data to be trained;
constructing a deep learning network;
and training the deep learning network by utilizing data to be trained to determine a sound spectrum prediction network.
A further technical solution is as follows: the GAN model employs multi-scale discriminators, with multiple discriminators operating at different audio resolutions.
A further technical solution is as follows: obtaining the Mel-GAN vocoder improved based on the static discrete wavelet transform by training a GAN model on audio sample data with Mel spectra comprises:
decomposing the audio sample data with Mel spectra, through the one-dimensional static discrete wavelet transform of the GAN model, into subband signals at multiple frequencies, and convolving the subband signals through the convolution layers of the GAN model to perform speech synthesis.
The invention also provides a speech synthesis apparatus, comprising:
the file acquisition unit is used for acquiring a text file to be synthesized;
the acoustic feature extraction unit is used for inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum;
and the voice synthesis unit is used for inputting the Mel frequency spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform to perform voice synthesis so as to obtain voice audio.
A further technical solution is as follows: the apparatus further comprises:
the prediction network generation unit is used for training the deep learning network after preprocessing the text sample data and the corresponding audio sample data to obtain the sound spectrum prediction network.
The invention also provides a computer device which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the method when executing the computer program.
The present invention also provides a storage medium storing a computer program which, when executed by a processor, performs the above-described method.
Compared with the prior art, the invention has the following beneficial effects. The text file to be synthesized is obtained, its acoustic features are extracted by the sound spectrum prediction network, and the extracted Mel spectrum is input into the Mel-GAN vocoder improved based on the static discrete wavelet transform for speech synthesis. The static discrete wavelet transform extracts the frequency-domain information of the signal more completely and avoids losing high-frequency detail, so the quality of the generated audio is effectively improved, the final quality of the whole generated waveform is improved quickly and effectively, and the sound quality of the synthesized speech is improved.
The invention is further described below with reference to the drawings and specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application scenario of a speech synthesis method according to an embodiment of the present invention;
fig. 2 is a flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 3 is a schematic sub-flowchart of a speech synthesis method according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
fig. 5 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a speech synthesis method according to an embodiment of the present invention. The speech synthesis method is applied to a server: the server obtains a text file to be synthesized from a terminal, inputs the text file into a sound spectrum prediction network to extract a Mel spectrum, and inputs the extracted Mel spectrum into a Mel-GAN vocoder improved based on the static discrete wavelet transform for speech synthesis to obtain speech audio.
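To make this flow concrete, the following minimal Python sketch traces the two-stage pipeline; the predictor and vocoder modules and the toy text encoding are hypothetical placeholders rather than components named in this disclosure:

import torch

def synthesize(text: str, predictor: torch.nn.Module, vocoder: torch.nn.Module) -> torch.Tensor:
    # Stage 1 (S120): the sound spectrum prediction network maps encoded text
    # to a Mel spectrum of shape (1, n_mels, num_frames).
    text_ids = torch.tensor([[ord(c) % 256 for c in text]])  # toy character encoding
    with torch.no_grad():
        mel = predictor(text_ids)
        # Stage 2 (S130): the improved Mel-GAN vocoder renders the Mel spectrum
        # into a waveform tensor.
        audio = vocoder(mel)
    return audio.squeeze()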
Fig. 2 is a flow chart of a speech synthesis method according to an embodiment of the present invention. As shown in fig. 2, the method includes the following steps S110 to S130.
S110, acquiring a text file to be synthesized.
In this embodiment, the text file to be synthesized refers to text characters to be synthesized.
S120, inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum.
In this embodiment, the mel spectrum refers to the acoustic features corresponding to the text file to be synthesized.
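For orientation, such a Mel spectrum can be computed from a reference waveform with librosa; the frame parameters below are common TTS defaults rather than values fixed by this disclosure:

import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=22050)            # load audio at 22.05 kHz
mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                     n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))              # log-compressed Mel spectrum
print(log_mel.shape)                                    # (80, num_frames)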
Specifically, the sound spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
In one embodiment, referring to fig. 3, the step S120 may include steps S121 to S126.
S121, acquiring text sample data and corresponding audio sample data.
In this embodiment, text sample data refers to data with a text format, and audio sample data refers to audio content.
S122, regularizing the text sample data to obtain a standard format text.
In this embodiment, the standard format text refers to text content having a format that meets the requirements.
The collected text sample data are subjected to multiple regularization treatments. Specifically, in Chinese/English text, special symbols such as title marks (《》), Roman numerals, the dollar sign $, the pound sign £ and the renminbi sign ¥ are replaced with the corresponding Chinese characters/words; serial numbers and Arabic numerals are replaced with the corresponding Chinese or English words; meaningless special symbols are removed; and finally the Chinese-character text is converted into pinyin-form text, forming the standard format text.
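A simplified sketch of this regularization pass follows; the replacement table is illustrative rather than exhaustive, and pypinyin's lazy_pinyin is assumed as one possible converter from Chinese characters to pinyin:

import re
from pypinyin import lazy_pinyin

SYMBOL_WORDS = {"$": "美元", "£": "英镑", "¥": "人民币"}   # symbol -> Chinese word
DIGIT_WORDS = "零一二三四五六七八九"

def normalize_chinese(text: str) -> str:
    for sym, word in SYMBOL_WORDS.items():
        text = text.replace(sym, word)
    # digit-by-digit replacement; a real system would read whole numbers
    text = re.sub(r"\d", lambda m: DIGIT_WORDS[int(m.group())], text)
    text = re.sub(r"[《》#*&]", "", text)                  # drop meaningless symbols
    return " ".join(lazy_pinyin(text))                    # hanzi -> pinyin tokens

print(normalize_chinese("价格是$5"))                       # e.g. "jia ge shi mei yuan wu"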
S123, performing denoising and noise-reduction processing on the audio sample data to obtain processed audio data.
In this embodiment, denoising and noise-reduction processing are performed on the audio sample data to remove meaningless audio, so that high-quality, low-noise and clear audio data are obtained for training the sound spectrum prediction network.
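One possible form of this preprocessing pass is sketched below, assuming the librosa and noisereduce packages; the disclosure does not name a specific denoising tool:

import librosa
import noisereduce as nr

y, sr = librosa.load("raw.wav", sr=22050)
y = nr.reduce_noise(y=y, sr=sr)             # spectral-gating noise reduction
y, _ = librosa.effects.trim(y, top_db=30)   # strip leading/trailing silence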
S124, aligning the standard format text and the processed audio data to obtain data to be trained.
In this embodiment, the data to be trained is data formed by aligning text in a standard format and processed audio data.
S125, constructing a deep learning network;
and S126, training the deep learning network by utilizing the data to be trained so as to determine a sound spectrum prediction network.
Specifically, the standard format text and the processed audio data are aligned and input into the speech synthesis sound spectrum prediction network to generate the Mel spectrum corresponding to the audio. The standard format text and the processed audio data are placed in one-to-one correspondence, so that each piece of audio corresponds to exactly one text. The audio and text data are input into the sound spectrum prediction network, the acoustic features of the audio are extracted and the text is encoded, and the network is trained to obtain the weight parameters of the model, which outputs the corresponding acoustic features, namely the Mel spectrum.
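A schematic training loop for this stage might look as follows; the model and data loader stand in for whatever encoder-decoder network and aligned text/Mel pairs are actually used, so all names here are illustrative:

import torch
import torch.nn.functional as F

def train_predictor(model, loader, epochs=100, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for text_ids, mel_target in loader:     # one aligned (text, Mel) pair per item
            mel_pred = model(text_ids)          # encode text, predict acoustic features
            loss = F.l1_loss(mel_pred, mel_target)
            opt.zero_grad()
            loss.backward()
            opt.step()                          # updates the network's weight parameters
    return model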
S130, inputting the Mel frequency spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for speech synthesis to obtain speech audio.
In the present embodiment, the voice audio refers to voice content formed by a text file to be synthesized.
Specifically, the Mel-GAN vocoder improved based on the static discrete wavelet transform is obtained by training a GAN model on audio sample data with Mel spectra, and the GAN model downsamples the audio sample data with Mel spectra using a one-dimensional static discrete wavelet transform.
Specifically, obtaining this vocoder by training the GAN model comprises:
decomposing the audio sample data with Mel spectra, through the one-dimensional static discrete wavelet transform of the GAN model, into subband signals at multiple frequencies, and convolving the subband signals through the convolution layers of the GAN model to perform speech synthesis.
Because the one-dimensional static discrete wavelet transform effectively decomposes a signal into a low-frequency part and a high-frequency part, both the low-frequency envelope information and the high-frequency detail information of the signal are preserved during downsampling. Multiple discriminators at different scales can capture fine-grained audio structure at different levels, and each discriminator learns the features of a different frequency range. The multi-scale discriminators share the same network structure and operate on different audio scales in the frequency domain. The static discrete wavelet transform extracts the frequency-domain information of the signal more completely, avoids losing the high-frequency detail, and can effectively improve the quality of the generated audio.
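The following PyTorch sketch illustrates such a multi-scale discriminator with wavelet downsampling. A decimated Haar filter pair stands in for the wavelet, which is not fixed by this disclosure, and make_sub is an assumed factory that builds one sub-discriminator for a given number of input channels:

import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarDWT(nn.Module):
    """One level of decimated 1-D wavelet analysis (Haar filter pair)."""
    def __init__(self):
        super().__init__()
        s = 2 ** -0.5
        self.register_buffer("low", torch.tensor([[[s, s]]]))    # scaling filter g
        self.register_buffer("high", torch.tensor([[[s, -s]]]))  # wavelet filter h

    def forward(self, x):                        # x: (batch, 1, time)
        lo = F.conv1d(x, self.low, stride=2)     # low-frequency envelope band
        hi = F.conv1d(x, self.high, stride=2)    # high-frequency detail band
        return torch.cat([lo, hi], dim=1)        # subbands stacked on channels

class MultiScaleDWTDiscriminator(nn.Module):
    """Three sub-discriminators fed by successive wavelet levels instead of
    average pooling; channel count doubles at every level (1, 2, 4)."""
    def __init__(self, make_sub):
        super().__init__()
        self.dwt = HaarDWT()
        self.subs = nn.ModuleList([make_sub(2 ** k) for k in range(3)])

    def forward(self, x):                        # x: (batch, 1, time)
        outs = []
        for k, sub in enumerate(self.subs):
            outs.append(sub(x))
            if k < len(self.subs) - 1:
                # decompose every current subband before the next scale
                x = torch.cat([self.dwt(x[:, i:i + 1])
                               for i in range(x.size(1))], dim=1)
        return outs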
The Mel-GAN vocoder improved based on the static discrete wavelet transform is trained to obtain the weight parameters of the vocoder model, which are used to generate audio from Mel spectra by inference during speech synthesis.
The Mel-GAN vocoder achieves good results in speech synthesis, music domain translation, music synthesis and similar tasks by training a GAN model into a vocoder that generates high-quality coherent waveforms. The Mel-GAN vocoder is a non-autoregressive model with few parameters, so inferring waveforms from spectra is fast. The Mel-GAN generator is a non-autoregressive feed-forward convolutional structure; the model comprises a generator and discriminators, takes a Mel spectrum as input, and outputs the audio waveform. It employs multi-scale discriminators, with multiple discriminators running at different audio resolutions and the original audio downsampled by strided average pooling. Average-pooled downsampling obtains audio frequency-domain information at different resolutions, but it ignores the Nyquist sampling theorem, namely that the sampled signal can completely retain the information of the original signal only if the sampling frequency fs is at least twice the highest frequency of the signal. Because average-pooled downsampling violates this condition, the high-frequency content of the sampled signal is aliased and the high-band spectrum is distorted and becomes invalid, so the frequency content of the generated audio differs from that of the real audio; this difference produces noise, which appears as hissing or reverberation in the audio. Therefore a Mel-GAN vocoder based on the static discrete wavelet transform is proposed: the one-dimensional static discrete wavelet transform replaces the average pooling operation for downsampling, because it effectively decomposes the signal into low-frequency and high-frequency parts, preserving both the low-frequency envelope information and the high-frequency detail information during sampling.
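The aliasing argument can be checked numerically: average-pooling a tone that lies above the Nyquist frequency of the downsampled rate largely cancels and folds it, while the Haar detail band keeps its energy. A small numpy illustration:

import numpy as np

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 7000 * t)        # 7 kHz tone; above Nyquist after 2x downsampling

pooled = x.reshape(-1, 2).mean(axis=1)  # stride-2 average pooling
s = 2 ** -0.5
low = s * (x[0::2] + x[1::2])           # Haar approximation (low) band
high = s * (x[0::2] - x[1::2])          # Haar detail (high) band

print(np.sum(pooled ** 2))              # small: the tone is attenuated and aliased
print(np.sum(high ** 2))                # large: the detail band keeps the tone's energy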
The discriminator training loss is defined as $\mathcal{L}_{D_k}=\mathbb{E}_x\big[\max(0,\,1-D_k(x))\big]+\mathbb{E}_{s,z}\big[\max(0,\,1+D_k(G(s,z)))\big]$ for $k=1,2,3$, and the generator training loss as $\mathcal{L}_G=\mathbb{E}_{s,z}\big[\sum_{k=1}^{3}-D_k(G(s,z))\big]$, where x represents the input signal, s represents the Mel spectrum, z represents Gaussian noise, k indexes the three sub-discriminators, E represents the expectation, $D_k$ represents the kth sub-discriminator, and G represents the generator.
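For one batch, these objectives can be written out as below; the hinge-style form follows the original MelGAN formulation, which this sketch assumes:

import torch
import torch.nn.functional as F

def discriminator_loss(real_scores, fake_scores):
    # real_scores / fake_scores: one score tensor per sub-discriminator D_k
    loss = 0.0
    for dr, df in zip(real_scores, fake_scores):
        loss = loss + F.relu(1.0 - dr).mean() + F.relu(1.0 + df).mean()
    return loss

def generator_loss(fake_scores):
    # the generator pushes every D_k(G(s, z)) upward
    return sum(-df.mean() for df in fake_scores)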
The input signal x is decomposed by the one-dimensional static discrete wavelet transform into a low-frequency part and a high-frequency part, obtained through a low-pass filter g and a high-pass filter h: $y_{low}[n]=\sum_{k=0}^{N-1}x[k]\,g[2n-k]$ and $y_{high}[n]=\sum_{k=0}^{N-1}x[k]\,h[2n-k]$, where $y_{low}[n]$ and $y_{high}[n]$ respectively represent the output of the signal after the low-pass filter and the high-pass filter, N represents the length of the signal x, and the filtering is applied recursively up to the chosen decomposition level of the static wavelet transform. Since the static discrete wavelet transform has a biorthogonal characteristic, the low- and high-frequency parts of the signal can be decomposed safely, and after each level of wavelet decomposition the subband signals of all frequencies are passed to the convolution layer through the channel connection.
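The two analysis equations can be implemented literally with numpy.convolve; the Haar filter pair below is only a stand-in, since no particular wavelet is fixed here:

import numpy as np

s = 2 ** -0.5
g = np.array([s, s])     # low-pass (scaling) filter
h = np.array([s, -s])    # high-pass (wavelet) filter

def dwt_level(x, g, h):
    y_low = np.convolve(x, g)[::2]     # y_low[n]  = sum_k x[k] * g[2n - k]
    y_high = np.convolve(x, h)[::2]    # y_high[n] = sum_k x[k] * h[2n - k]
    return y_low, y_high

low, high = dwt_level(np.random.randn(16), g, h)
print(low.shape, high.shape)           # each subband at half the original rate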
The discriminator loss is defined as $L_D=\sum_{k}\big\|D_k(\phi^{k}(x))-D_k(\phi^{k}(\hat{x}))\big\|_2$, where $L_D$ represents the discriminator loss, x and $\hat{x}$ represent the real audio and the generated audio respectively, $\phi^{k}$ represents the kth-level static discrete wavelet transform, k represents the number of decomposition levels of the static wavelet transform, $\|\cdot\|_2$ represents the L2 norm, and $D_k$ represents the discriminator at the kth decomposition level.
Training data are collated, including text sample data and corresponding audio sample data. The audio sample data are first de-duplicated and denoised, and the audio is then clipped into segments of about 8 seconds each. The text sample data are processed, sorted and split: in Chinese/English text, special symbols such as title marks (《》), Roman numerals, the dollar sign $, the pound sign £ and the renminbi sign ¥ are replaced with the corresponding Chinese characters/words; serial numbers and Arabic numerals are replaced with the corresponding Chinese or English words; meaningless special symbols are removed; and finally the Chinese-character text is converted into pinyin-form text.
The clipped audio sample data with Mel spectra are input into the Mel-GAN vocoder model improved based on the static discrete wavelet transform for training, and the weight parameters of the trained vocoder model are obtained for generating audio from Mel spectra by inference during speech synthesis. A Mel-spectrum loss is added to improve the training efficiency of the generator and the fidelity of the generated audio, so that, without reducing the training and inference speed of the model, the gap between the generated audio and the real audio is narrowed, more sampling information is retained, and the quality of the generated speech is greatly improved.
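The exact form of the added Mel-spectrum loss is not given in the description; an L1 distance between log-Mel spectrograms, computed here with torchaudio, is a common choice and is sketched under that assumption:

import torch
import torchaudio

mel_fn = torchaudio.transforms.MelSpectrogram(sample_rate=22050, n_fft=1024,
                                              hop_length=256, n_mels=80)

def mel_loss(real_audio: torch.Tensor, fake_audio: torch.Tensor) -> torch.Tensor:
    real_mel = torch.log(mel_fn(real_audio).clamp(min=1e-5))
    fake_mel = torch.log(mel_fn(fake_audio).clamp(min=1e-5))
    return torch.nn.functional.l1_loss(fake_mel, real_mel)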
According to the speech synthesis method, the text file to be synthesized is obtained, its acoustic features are extracted by the sound spectrum prediction network, and the extracted Mel spectrum is input into the Mel-GAN vocoder improved based on the static discrete wavelet transform for speech synthesis. The static discrete wavelet transform extracts the frequency-domain information of the signal more completely and avoids losing high-frequency detail, so the quality of the generated audio is effectively improved, the final quality of the whole generated waveform is improved quickly and effectively, and the sound quality of the synthesized speech is improved.
Fig. 4 is a schematic block diagram of a speech synthesis apparatus 300 according to an embodiment of the present invention. As shown in fig. 4, the present invention also provides a voice synthesis apparatus 300 corresponding to the above voice synthesis method. The speech synthesis apparatus 300 includes a unit for performing the above-described speech synthesis method, and may be configured in a server. Specifically, referring to fig. 4, the speech synthesis apparatus 300 includes a file acquisition unit 301, an acoustic feature extraction unit 302, and a speech synthesis unit 303.
A file obtaining unit 301, configured to obtain a text file to be synthesized; the acoustic feature extraction unit 302 is configured to input the text file to be synthesized into a sound spectrum prediction network to extract acoustic features, so as to obtain a mel frequency spectrum; a speech synthesis unit 303, configured to input the Mel spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for speech synthesis, so as to obtain speech audio.
In an embodiment, the speech synthesis apparatus 300 further comprises: the prediction network generation unit is used for training the deep learning network after preprocessing the text sample data and the corresponding audio sample data to obtain the sound spectrum prediction network.
In an embodiment, the speech synthesis apparatus 300 further comprises a vocoder generation unit for training the GAN model from audio sample data with Mel-frequency spectrum to obtain a Mel-GAN vocoder modified based on static discrete wavelet transform, and the GAN model downsamples the audio sample data with Mel-frequency spectrum using one-dimensional static discrete wavelet transform.
In one embodiment, the vocoder generating unit is configured to obtain subband signals with multiple frequencies by one-dimensional static discrete wavelet transform of the GAN model from audio sample data with mel frequency spectrum, and convolve the subband signals with multiple frequencies by a convolution layer of the GAN model to perform speech synthesis.
In an embodiment, the prediction network generation unit includes a data acquisition subunit, a regularization subunit, a noise reduction subunit, an alignment subunit, a construction subunit, and a training subunit.
The data acquisition subunit is used for acquiring text sample data and corresponding audio sample data; the regularization subunit is used for regularizing the text sample data to obtain a standard format text; the noise reduction subunit is used for performing denoising and noise-reduction processing on the audio sample data to obtain processed audio data; the alignment subunit is used for aligning the standard format text and the processed audio data to obtain data to be trained; the construction subunit is used for constructing a deep learning network; and the training subunit is used for training the deep learning network by utilizing the data to be trained to determine a sound spectrum prediction network.
It should be noted that, as will be clearly understood by those skilled in the art, the specific implementation process of the above-mentioned speech synthesis apparatus 300 and each unit may refer to the corresponding description in the foregoing method embodiments, and for convenience and brevity of description, the description is omitted here.
The above-described speech synthesis apparatus 300 may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a server, where the server may be a stand-alone server or may be a server cluster formed by a plurality of servers.
With reference to FIG. 5, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions that, when executed, cause the processor 502 to perform a speech synthesis method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of a computer program 5032 in the non-volatile storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform a speech synthesis method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of a portion of the architecture in connection with the present application and is not intended to limit the computer device 500 to which the present application is applied, and that a particular computer device 500 may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
Wherein the processor 502 is configured to execute a computer program 5032 stored in a memory to implement the steps of:
acquiring a text file to be synthesized; inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic characteristics so as to obtain a Mel frequency spectrum; and inputting the Mel frequency spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for speech synthesis to obtain speech audio.
The sound spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
The Mel-GAN vocoder based on the static discrete wavelet transformation improvement is obtained by training a GAN model through audio sample data with Mel frequency spectrum; and the GAN model downsamples audio sample data with mel-frequency spectrum using a one-dimensional static discrete wavelet transform.
The GAN model employs a multi-scale discriminator, with multiple discriminators operating at different audio resolutions.
In one embodiment, when implementing the spectrum prediction network by training the deep learning network after preprocessing the text sample data and the corresponding audio sample data, the processor 502 specifically implements the following steps:
acquiring text sample data and corresponding audio sample data; regularizing the text sample data to obtain a standard format text; performing denoising and noise-reduction processing on the audio sample data to obtain processed audio data; aligning the standard format text and the processed audio data to obtain data to be trained; constructing a deep learning network; and training the deep learning network by utilizing the data to be trained to determine a sound spectrum prediction network.
In one embodiment, when implementing the Mel-GAN vocoder modified based on the static discrete wavelet transform, the processor 502 implements the following steps in training the GAN model through audio sample data with Mel spectrum:
the audio sample data with the Mel frequency spectrum is transformed by the one-dimensional static discrete wavelet of the GAN model to obtain subband signals with a plurality of frequencies, and the subband signals with a plurality of frequencies are convolved by the convolution layer of the GAN model to perform voice synthesis.
It should be appreciated that in embodiments of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), the processor 502 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSPs), application specific integrated circuits (Application Specific Integrated Circuit, ASICs), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Those skilled in the art will appreciate that all or part of the flow in a method embodying the above described embodiments may be accomplished by computer programs instructing the relevant hardware. The computer program comprises program instructions, and the computer program can be stored in a storage medium, which is a computer readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring a text file to be synthesized; inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic characteristics so as to obtain a Mel frequency spectrum; and inputting the Mel frequency spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for speech synthesis to obtain speech audio.
The sound spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
The Mel-GAN vocoder based on the static discrete wavelet transformation improvement is obtained by training a GAN model through audio sample data with Mel frequency spectrum; and the GAN model downsamples audio sample data with mel-frequency spectrum using a one-dimensional static discrete wavelet transform.
The GAN model employs a multi-scale discriminator, with multiple discriminators operating at different audio resolutions.
In one embodiment, when the processor executes the computer program to implement the sound spectrum prediction network as a step of training a deep learning network after preprocessing text sample data and corresponding audio sample data, the method specifically includes the following steps:
acquiring text sample data and corresponding audio sample data; regularizing the text sample data to obtain a standard format text; performing denoising and noise-reduction processing on the audio sample data to obtain processed audio data; aligning the standard format text and the processed audio data to obtain data to be trained; constructing a deep learning network; and training the deep learning network by utilizing the data to be trained to determine a sound spectrum prediction network.
In one embodiment, when the processor executes the computer program to implement the Mel-GAN vocoder based on the static discrete wavelet transform improvement as a result of training GAN model with audio sample data with Mel spectrum, the processor implements the following steps:
the audio sample data with the Mel frequency spectrum is transformed by the one-dimensional static discrete wavelet of the GAN model to obtain subband signals with a plurality of frequencies, and the subband signals with a plurality of frequencies are convolved by the convolution layer of the GAN model to perform voice synthesis.
The storage medium may be a U-disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk, or other various computer-readable storage media that can store program codes.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit may be stored in a storage medium if implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (8)

1. A method of speech synthesis, comprising:
acquiring a text file to be synthesized;
inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic characteristics so as to obtain a Mel frequency spectrum;
inputting the Mel frequency spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for speech synthesis to obtain speech audio;
the Mel-GAN vocoder based on the static discrete wavelet transformation improvement is obtained by training a GAN model through audio sample data with Mel frequency spectrum; the GAN model downsamples audio sample data with a Mel frequency spectrum by utilizing one-dimensional static discrete wavelet transform;
the Mel-GAN vocoder based on static discrete wavelet transformation improvement is obtained by training GAN model through audio sample data with Mel frequency spectrum, and comprises:
the audio sample data with the Mel frequency spectrum is transformed by the one-dimensional static discrete wavelet of the GAN model to obtain subband signals with a plurality of frequencies, and the subband signals with a plurality of frequencies are convolved by the convolution layer of the GAN model to perform voice synthesis.
2. The method of claim 1, wherein the spectrum prediction network is obtained by pre-processing text sample data and corresponding audio sample data and training a deep learning network.
3. The method of claim 2, wherein the spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data, comprising:
acquiring text sample data and corresponding audio sample data;
regularizing the text sample data to obtain a standard format text;
performing denoising and noise-reduction processing on the audio sample data to obtain processed audio data;
aligning the standard format text and the processed audio data to obtain data to be trained;
constructing a deep learning network;
and training the deep learning network by utilizing data to be trained to determine a sound spectrum prediction network.
4. The method of claim 1, wherein the GAN model employs a multi-scale discriminator, the plurality of discriminators operating at different audio resolutions.
5. A speech synthesis apparatus, comprising:
the file acquisition unit is used for acquiring a text file to be synthesized;
the acoustic feature extraction unit is used for inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum;
a voice synthesis unit for inputting the Mel frequency spectrum into a Mel-GAN vocoder modified based on static discrete wavelet transform for voice synthesis to obtain voice audio;
the device also comprises a vocoder generating unit, a sampling unit and a sampling unit, wherein the vocoder generating unit is used for training a GAN model through audio sample data with a Mel frequency spectrum to obtain a Mel-GAN vocoder improved based on static discrete wavelet transform, and the GAN model is used for downsampling the audio sample data with the Mel frequency spectrum through one-dimensional static discrete wavelet transform;
and the vocoder generation unit is used for obtaining subband signals with a plurality of frequencies by one-dimensional static discrete wavelet transformation of the GAN model through audio sample data with a Mel frequency spectrum, and convoluting the subband signals with the plurality of frequencies through a convolution layer of the GAN model so as to perform voice synthesis.
6. The speech synthesis apparatus of claim 5, further comprising:
the prediction network generation unit is used for training the deep learning network after preprocessing the text sample data and the corresponding audio sample data to obtain the sound spectrum prediction network.
7. A computer device, characterized in that it comprises a memory on which a computer program is stored and a processor which, when executing the computer program, implements the method according to any of claims 1-4.
8. A storage medium storing a computer program which, when executed by a processor, performs the method of any one of claims 1 to 4.
CN202111136538.1A 2021-09-27 2021-09-27 Speech synthesis method, device, computer equipment and storage medium Active CN113744714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111136538.1A CN113744714B (en) 2021-09-27 2021-09-27 Speech synthesis method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111136538.1A CN113744714B (en) 2021-09-27 2021-09-27 Speech synthesis method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113744714A (en) 2021-12-03
CN113744714B (en) 2024-04-05

Family

ID=78741401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111136538.1A Active CN113744714B (en) 2021-09-27 2021-09-27 Speech synthesis method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113744714B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744715A (en) * 2021-09-27 2021-12-03 深圳市木愚科技有限公司 Vocoder speech synthesis method, device, computer equipment and storage medium
CN117676185A (en) * 2023-12-05 2024-03-08 无锡中感微电子股份有限公司 Packet loss compensation method and device for audio data and related equipment

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5826232A (en) * 1991-06-18 1998-10-20 Sextant Avionique Method for voice analysis and synthesis using wavelets
CN107480471A (en) * 2017-07-19 2017-12-15 福建师范大学 The method for the sequence similarity analysis being characterized based on wavelet transformation
CN108280416A (en) * 2018-01-17 2018-07-13 国家海洋局第三海洋研究所 A kind of broadband underwater acoustic signal processing method of small echo across scale correlation filtering
CN109255113A (en) * 2018-09-04 2019-01-22 郑州信大壹密科技有限公司 Intelligent critique system
CN111402855A (en) * 2020-03-06 2020-07-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111627418A (en) * 2020-05-27 2020-09-04 携程计算机技术(上海)有限公司 Training method, synthesizing method, system, device and medium for speech synthesis model
CN111968618A (en) * 2020-08-27 2020-11-20 腾讯科技(深圳)有限公司 Speech synthesis method and device
CN112652291A (en) * 2020-12-15 2021-04-13 携程旅游网络技术(上海)有限公司 Speech synthesis method, system, device and storage medium based on neural network
CN113053354A (en) * 2021-03-12 2021-06-29 云知声智能科技股份有限公司 Method and equipment for improving voice synthesis effect
CN113112995A (en) * 2021-05-28 2021-07-13 思必驰科技股份有限公司 Word acoustic feature system, and training method and system of word acoustic feature system
CN113362801A (en) * 2021-06-10 2021-09-07 携程旅游信息技术(上海)有限公司 Audio synthesis method, system, device and storage medium based on Mel spectrum alignment
CN113409759A (en) * 2021-07-07 2021-09-17 浙江工业大学 End-to-end real-time speech synthesis method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Application of Wavelet Transform in a Species Identification System Based on Animal Calls; Su Jianmin; 《自动化技术与应用》 (Techniques of Automation and Applications); Vol. 27, No. 8; pp. 77-80 *

Also Published As

Publication number Publication date
CN113744714A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
US10580430B2 (en) Noise reduction using machine learning
CN113744714B (en) Speech synthesis method, device, computer equipment and storage medium
CN110598166A (en) Wavelet denoising method for adaptively determining wavelet hierarchical level
CN111508518B (en) Single-channel speech enhancement method based on joint dictionary learning and sparse representation
CN113094993B (en) Modulation signal denoising method based on self-coding neural network
CN112735460B (en) Beam forming method and system based on time-frequency masking value estimation
Zhang et al. Birdsoundsdenoising: Deep visual audio denoising for bird sounds
CN113782044B (en) Voice enhancement method and device
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
Do et al. Speech Separation in the Frequency Domain with Autoencoder.
Bae et al. On enhancement signal using non-uniform sampling in clipped signals for LTE smart phones
Girirajan et al. Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network.
Li et al. Deeplabv3+ vision transformer for visual bird sound denoising
CN113593590A (en) Method for suppressing transient noise in voice
Krishnan et al. Features of wavelet packet decomposition and discrete wavelet transform for malayalam speech recognition
CN116705056A (en) Audio generation method, vocoder, electronic device and storage medium
CN117037824A (en) Data enhancement method and system for acoustic scene classification
CN115440240A (en) Training method for voice noise reduction, voice noise reduction system and voice noise reduction method
CN115762540A (en) Multidimensional RNN voice noise reduction method, device, equipment and medium
CN115691535A (en) RNN-based high signal-to-noise ratio voice noise reduction method, device, equipment and medium
Ullah et al. Semi-supervised transient noise suppression using OMLSA and SNMF algorithms
Yang et al. A speech enhancement algorithm combining spectral subtraction and wavelet transform
CN113948088A (en) Voice recognition method and device based on waveform simulation
CN110070874B (en) Voice noise reduction method and device for voiceprint recognition
Kar et al. Convolutional Neural Network for Removal of Environmental Noises from Acoustic Signal

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant