CN113744715A - Vocoder speech synthesis method, device, computer equipment and storage medium - Google Patents

Info

Publication number
CN113744715A
CN113744715A (application CN202111139651.5A)
Authority
CN
China
Prior art keywords
text
vocoder
audio
scale
sample data
Prior art date
Legal status
Pending
Application number
CN202111139651.5A
Other languages
Chinese (zh)
Inventor
黄元忠
魏静
卢庆华
Current Assignee
Shenzhen Muyu Technology Co ltd
Original Assignee
Shenzhen Muyu Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Muyu Technology Co ltd
Priority to CN202111139651.5A
Publication of CN113744715A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Abstract

The embodiment of the invention discloses a vocoder speech synthesis method and device, computer equipment and a storage medium. The method comprises the following steps: acquiring a text file to be synthesized; inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features and obtain a Mel spectrum; and inputting the Mel spectrum into an improved multi-scale HiFi-GAN vocoder model for speech synthesis to obtain speech audio. By implementing the method of the embodiment of the invention, the final quality of the whole generated waveform can be quickly and effectively improved, and the sound quality of the synthesized speech audio is improved.

Description

Vocoder speech synthesis method, device, computer equipment and storage medium
Technical Field
The present invention relates to speech synthesis, and more specifically to a vocoder speech synthesis method and device, a computer device and a storage medium.
Background
With the rapid development of artificial intelligence technology, the demand for human-computer dialogue is growing day by day. Advances in neural network technology have also greatly improved speech synthesis, which is now widely used in fields such as smart homes, audiobook novels, virtual customer service and smart vehicles.
Neural-network-based speech synthesis can synthesize audio close to natural human speech in real time. Most end-to-end speech synthesis currently consists of two stages: the first predicts an intermediate representation of the audio, such as a Mel spectrum, a linear spectrum or other speech features, from the text; the second infers the synthesized speech audio from that intermediate representation. The second stage, which converts the acoustic features into a speech waveform, is usually performed by a vocoder. Neural vocoder technology has developed rapidly in recent years, yet the vocoder both limits the sound quality of the finally synthesized speech and is the computational bottleneck of the whole speech synthesis model. Improving vocoder performance and synthesizing highly natural speech audio remains a very challenging task. In particular, existing vocoder speech synthesis methods suffer from noise in the synthesized audio and low sound quality.
Therefore, it is necessary to design a new method that quickly and effectively improves the final quality of the whole generated waveform and the sound quality of the synthesized speech audio.
Disclosure of Invention
The present invention is directed to overcoming the deficiencies of the prior art and providing a method, apparatus, computer device and storage medium for vocoder speech synthesis.
In order to achieve the purpose, the invention adopts the following technical scheme: the vocoder speech synthesis method comprises the following steps:
acquiring a text file to be synthesized;
inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum;
and inputting the Mel frequency spectrum into a modified multi-scale HiFi-GAN vocoder model for voice synthesis to obtain voice audio.
The further technical scheme is as follows: the acoustic spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
The further technical scheme is as follows: the improved multi-scale HiFi-GAN vocoder model is a voice synthesis vocoder model formed by obtaining parameters of a network model through training audio data; and the feature matching loss in the improved multi-scale HiFi-GAN vocoder model is replaced by multi-scale short-time Fourier transform.
The further technical scheme is as follows: the inputting the Mel frequency spectrum into a modified multi-scale HiFi-GAN vocoder model for voice synthesis to obtain voice audio comprises:
the improved multi-scale HiFi-GAN vocoder model adopts a multi-band processing method, takes a Mel frequency spectrum as input, generates sub-band signals, adds the sub-band signals as a full-band signal and inputs the full-band signal into a discriminator to carry out voice synthesis.
The further technical scheme is as follows: the acoustic spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data, and comprises the following steps:
acquiring text sample data and corresponding audio sample data;
editing the text sample data, removing special symbols, converting Arabic numerals into corresponding text words, and converting Chinese text data in the text sample data into a Pinyin format or a phoneme format to obtain a standard format text;
performing noise reduction and cleaning on the audio sample data to obtain processed audio data;
aligning the standard format text and the processed audio data to obtain data to be trained;
constructing a deep learning network;
and training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
The further technical scheme is as follows: the improved multi-scale HiFi-GAN vocoder model comprises a generator and a discriminator and carries out optimization of the network through multi-scale short-time Fourier transform loss, Mel frequency spectrum loss and GAN network loss.
The invention also provides a vocoder speech synthesis device, comprising:
the file acquisition unit is used for acquiring a text file to be synthesized;
the acoustic feature extraction unit is used for inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum;
and the voice synthesis unit is used for inputting the Mel frequency spectrum into the improved multi-scale HiFi-GAN vocoder model for voice synthesis so as to obtain voice audio.
The further technical scheme is as follows: further comprising:
and the prediction network generation unit is used for training the deep learning network after preprocessing the text sample data and the corresponding audio sample data to obtain the acoustic spectrum prediction network.
The invention also provides computer equipment which comprises a memory and a processor, wherein the memory is stored with a computer program, and the processor realizes the method when executing the computer program.
The invention also provides a storage medium storing a computer program which, when executed by a processor, is operable to carry out the method as described above.
Compared with the prior art, the invention has the following beneficial effects: the text file to be synthesized is obtained, acoustic features are extracted by means of the sound spectrum prediction network, and the extracted Mel spectrum is input into the improved multi-scale HiFi-GAN vocoder model for speech synthesis. The multi-scale short-time Fourier transform measures the difference between the real audio and the generated audio more effectively, so the final quality of the whole generated waveform can be quickly and effectively improved and the sound quality of the synthesized speech audio is improved.
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a vocoder speech synthesis method according to an embodiment of the present invention;
fig. 2 is a flow chart of a vocoder speech synthesis method according to an embodiment of the present invention;
FIG. 3 is a sub-flowchart of a vocoder speech synthesis method according to an embodiment of the present invention;
fig. 4 is a schematic block diagram of a vocoder speech synthesis apparatus provided by an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a vocoder speech synthesis method according to an embodiment of the present invention, and fig. 2 is a schematic flow chart of the method. The vocoder speech synthesis method is applied to a server. The server obtains a text file to be synthesized from a terminal, inputs the text file into a sound spectrum prediction network to extract a Mel spectrum, and inputs the extracted Mel spectrum into an improved multi-scale HiFi-GAN vocoder model for speech synthesis to obtain speech audio.
Fig. 2 is a flow chart of a vocoder speech synthesis method according to an embodiment of the present invention. As shown in fig. 2, the method includes the following steps S110 to S130.
And S110, acquiring a text file to be synthesized.
In the present embodiment, the text file to be synthesized refers to text characters to be synthesized.
And S120, inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum.
In this embodiment, the mel spectrum refers to the acoustic features corresponding to the text file to be synthesized.
Specifically, the sound spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
In an embodiment, referring to fig. 3, the step S120 may include steps S121 to S126.
And S121, acquiring text sample data and corresponding audio sample data.
In this embodiment, the text sample data refers to data with a text format, and the audio sample data refers to audio content.
And S122, editing the text sample data, removing special symbols, converting Arabic numerals into corresponding text words, and converting the Chinese text data in the text sample data into a Pinyin format or a phoneme format to obtain a standard format text.
In this embodiment, the standard format text refers to text content with a format meeting the requirement.
Specifically, the Chinese text data is converted into a pinyin or phoneme format; the audio sample data is de-duplicated, noise-reduced and clipped to obtain clean audio data for training the speech synthesis model; and the text sample data is edited to remove special symbols and to convert Arabic numerals into the corresponding text words.
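For illustration only, the following is a minimal Python text-normalization sketch. It assumes the third-party pypinyin package for the pinyin conversion and uses a toy single-digit number mapping; the concrete normalization rules, number expansion and phoneme inventory used by the invention are not specified here.

```python
# Minimal text-normalization sketch (illustrative; not the patent's exact pipeline).
# Assumes the third-party pypinyin package: pip install pypinyin
import re
from pypinyin import lazy_pinyin, Style

# Toy digit-to-Chinese mapping for illustration; a real system would also handle
# multi-digit numbers, dates, currency, and so on.
_DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
           "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def normalize_text(text: str) -> str:
    """Remove special symbols, convert Arabic numerals to text words,
    and convert the Chinese text into tone-numbered pinyin."""
    text = re.sub(r"\d", lambda m: _DIGITS[m.group()], text)   # digits -> Chinese words
    text = re.sub(r"[^\u4e00-\u9fff]", "", text)               # strip special symbols
    return " ".join(lazy_pinyin(text, style=Style.TONE3))      # pinyin with tone numbers

print(normalize_text("今天气温是25度!"))  # -> "jin1 tian1 qi4 wen1 shi4 er4 wu3 du4"
```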
And S123, performing noise reduction and cleaning on the audio sample data to obtain processed audio data.
In this embodiment, noise reduction and cleaning are performed on the audio sample data to remove meaningless audio, yielding clear, low-noise, high-quality audio data for training the sound spectrum prediction network.
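A minimal clean-up sketch of this step, assuming the librosa and soundfile packages; it only trims leading/trailing silence and peak-normalizes each clip, whereas a production pipeline might also apply spectral denoising. The file paths are hypothetical.

```python
# Illustrative audio clean-up sketch (assumes librosa and soundfile; not the patent's exact pipeline).
import librosa
import soundfile as sf

def clean_audio(in_path: str, out_path: str, sr: int = 22050, top_db: int = 30) -> None:
    """Resample, trim leading/trailing silence, and peak-normalize one clip."""
    y, _ = librosa.load(in_path, sr=sr)            # load and resample to a fixed rate
    y, _ = librosa.effects.trim(y, top_db=top_db)  # drop silence below top_db of the peak
    peak = max(abs(float(y.max())), abs(float(y.min())), 1e-9)
    sf.write(out_path, 0.95 * y / peak, sr)        # peak-normalize and save

clean_audio("raw/sample_0001.wav", "clean/sample_0001.wav")
```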
And S124, aligning the standard format text and the processed audio data to obtain the data to be trained.
In this embodiment, the data to be trained refers to data formed by aligning standard format text and processed audio data.
S125, constructing a deep learning network;
and S126, training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
Specifically, the standard format text and the processed audio data are aligned so that each audio clip corresponds to exactly one text. The aligned audio and text data are then fed into the sound spectrum prediction network: acoustic features are extracted from the audio, the text is encoded, and the network is trained to obtain the weight parameters of the model, so that it outputs the corresponding acoustic features, namely the Mel spectrum.
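The target Mel spectrum that supervises the sound spectrum prediction network is the ordinary Mel spectrogram of the recorded audio. A minimal extraction sketch, assuming librosa and typical TTS analysis parameters (the patent does not fix the exact values):

```python
# Extracting a target Mel spectrogram from a waveform (illustrative parameters).
import numpy as np
import librosa

def wav_to_mel(path: str, sr: int = 22050, n_fft: int = 1024,
               hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    """Return a log-Mel spectrogram of shape (n_mels, frames)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log compression, clipped for stability

mel = wav_to_mel("clean/sample_0001.wav")
print(mel.shape)  # (80, number_of_frames)
```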
And S130, inputting the Mel frequency spectrum into an improved multi-scale HiFi-GAN vocoder model for voice synthesis to obtain voice audio.
In the present embodiment, the voice audio refers to the voice content formed by the text file to be synthesized.
Specifically, the improved multi-scale HiFi-GAN vocoder model is a speech synthesis vocoder model whose network model parameters are obtained by training on audio data; and the feature matching loss in the improved multi-scale HiFi-GAN vocoder model is replaced by a multi-scale short-time Fourier transform loss.
In an embodiment, the step S130 may include the following specific steps:
the improved multi-scale HiFi-GAN vocoder model adopts a multi-band processing method, takes a Mel frequency spectrum as input, generates sub-band signals, adds the sub-band signals as a full-band signal and inputs the full-band signal into a discriminator to carry out voice synthesis.
The improved multi-scale HiFi-GAN vocoder model comprises a generator and a discriminator and carries out optimization on a network through multi-scale short-time Fourier transform loss, Mel frequency spectrum loss and GAN network loss.
Specifically, an improved multi-scale HiFi-GAN vocoder model is trained, and model parameters are obtained to be used for generating voice audio by utilizing Mel spectrum reasoning in a voice synthesis process.
The HiFi-GAN vocoder is a deep neural network model that adopts an end-to-end feed-forward network structure and trains multi-scale discriminators, enabling efficient, high-quality speech synthesis. HiFi-GAN has two kinds of discriminators, a multi-scale discriminator and a multi-period discriminator, which judge the speech from two different perspectives. HiFi-GAN uses a feature matching loss as an additional loss for training the generator: each intermediate feature of the discriminator is extracted, and the L1 distance between the real and generated samples is computed in each feature space. Although the feature matching loss helps stabilize GAN (Generative Adversarial Network) training, it has difficulty measuring the difference between the latent features of real and fake speech, which slows network convergence. To solve this problem, the multi-scale short-time Fourier transform loss from Parallel WaveGAN is used to replace the feature matching loss in the HiFi-GAN vocoder; experiments show that the multi-scale short-time Fourier transform measures the difference between real and fake audio more effectively. Meanwhile, the multi-band processing method of Multi-Band MelGAN is adopted: the Mel spectrum is taken as input, sub-band signals are generated, and the sub-band signals are combined into a full-band signal that is fed into the discriminator.
The window width of an ordinary short-time Fourier transform (STFT) is fixed, so it cannot meet the frequency-band requirements of non-stationary signal variation. The multi-scale short-time Fourier transform therefore computes the STFT loss under multiple sets of analysis parameters, such as the FFT size, window size and frame shift.
Single STFT loss definition:

$$L_s(x, \hat{x}) = \mathbb{E}_{x,\hat{x}}\left[ L_{sc}(x, \hat{x}) + L_{mag}(x, \hat{x}) \right]$$

where $x$ denotes the target waveform, $\hat{x}$ denotes the audio predicted by the generator, $L_{sc}$ denotes the spectral convergence loss, and $L_{mag}$ denotes the log short-time Fourier transform magnitude loss.

Spectral convergence loss definition:

$$L_{sc}(x, \hat{x}) = \frac{\left\| \, |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \, \right\|_F}{\left\| \, |\mathrm{STFT}(x)| \, \right\|_F}$$

Logarithmic short-time Fourier transform magnitude loss definition:

$$L_{mag}(x, \hat{x}) = \frac{1}{N} \left\| \log|\mathrm{STFT}(x)| - \log|\mathrm{STFT}(\hat{x})| \right\|_1$$

where $\| \cdot \|_F$ denotes the Frobenius norm, $\| \cdot \|_1$ denotes the L1 norm, $|\mathrm{STFT}(\cdot)|$ denotes the STFT magnitude, and $N$ is the number of elements in the magnitude spectrogram.

Assuming that there are M multi-scale short-time Fourier transforms (i.e., M sets of analysis parameters), the auxiliary loss of the multi-scale short-time Fourier transform is defined as:

$$L_{mr\_stft}(G) = \frac{1}{M} \sum_{m=1}^{M} L_s^{(m)}(x, \hat{x})$$
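A minimal PyTorch sketch of this multi-scale (multi-resolution) STFT auxiliary loss is given below. The three resolutions shown are commonly used Parallel WaveGAN settings and are an assumption, not values fixed by the patent.

```python
# Multi-resolution STFT auxiliary loss sketch (PyTorch; resolutions are illustrative).
import torch

def stft_mag(x: torch.Tensor, fft_size: int, hop: int, win: int) -> torch.Tensor:
    """Magnitude STFT of a batch of waveforms, shape (batch, frames, fft_size // 2 + 1)."""
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft=fft_size, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return torch.clamp(spec.abs(), min=1e-7).transpose(1, 2)

def single_stft_loss(x: torch.Tensor, x_hat: torch.Tensor,
                     fft_size: int, hop: int, win: int) -> torch.Tensor:
    """Spectral convergence loss + log STFT magnitude loss for one resolution."""
    mag_x = stft_mag(x, fft_size, hop, win)
    mag_y = stft_mag(x_hat, fft_size, hop, win)
    sc = torch.norm(mag_x - mag_y, p="fro") / torch.norm(mag_x, p="fro")
    mag = torch.nn.functional.l1_loss(torch.log(mag_y), torch.log(mag_x))
    return sc + mag

def multi_resolution_stft_loss(x: torch.Tensor, x_hat: torch.Tensor,
                               resolutions=((1024, 120, 600), (2048, 240, 1200), (512, 50, 240))):
    """Average the single-resolution losses over M sets of analysis parameters."""
    losses = [single_stft_loss(x, x_hat, f, h, w) for f, h, w in resolutions]
    return sum(losses) / len(losses)

# Usage: x is real audio, x_hat is the generator output, both shaped (batch, samples).
x, x_hat = torch.randn(2, 16000), torch.randn(2, 16000)
print(multi_resolution_stft_loss(x, x_hat))
```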
HiFi-GAN has two types of discriminators, the MPD (Multi-Period Discriminator) and the MSD (Multi-Scale Discriminator). The GAN losses are defined as:

$$L_{adv}(D; G) = \mathbb{E}_{(x,s)}\left[ (D(x) - 1)^2 + (D(G(s)))^2 \right]$$

$$L_{adv}(G; D) = \mathbb{E}_{s}\left[ (D(G(s)) - 1)^2 \right]$$

where $x$ denotes the real audio and $s$ denotes the Mel spectrum of the real audio.
The Mel spectrum loss is added to improve the training efficiency of the generator and the fidelity of the generated audio. It is the L1 distance between the Mel spectrum of the real audio and the Mel spectrum of the waveform produced by the generator, and is defined as:

$$L_{mel}(G) = \mathbb{E}_{(x,s)}\left[ \left\| \phi(x) - \phi(G(s)) \right\|_1 \right]$$

where $\phi$ denotes the function that converts a waveform into its Mel spectrum.
Since the discriminators MPD and MSD are composed of a series of sub-discriminators, the final multi-scale HiFi-GAN generator loss function is defined as:

$$L_G = \sum_{k=1}^{K} L_{adv}(G; D_k) + \lambda_{mr\_stft} L_{mr\_stft}(G) + \lambda_{mel} L_{mel}(G)$$

and the discriminator loss is defined as:

$$L_D = \sum_{k=1}^{K} L_{adv}(D_k; G)$$

where $D_k$ denotes the k-th sub-discriminator of the MPD and MSD, and $\lambda_{mr\_stft}$ and $\lambda_{mel}$ are the weights of the multi-scale short-time Fourier transform loss and the Mel spectrum loss.
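For completeness, a sketch of how the adversarial, multi-scale STFT and Mel spectrum terms could be combined follows. It assumes that `discriminators` is a list of MPD/MSD sub-discriminators that each return a score tensor, that `multi_resolution_stft_loss` is the function sketched above, and that the loss weights are illustrative rather than values taken from the patent.

```python
# Combining the losses of the improved multi-scale HiFi-GAN (sketch; weights are assumptions).
import torch
import torchaudio

mel_fn = torchaudio.transforms.MelSpectrogram(sample_rate=22050, n_fft=1024,
                                              hop_length=256, n_mels=80)

def mel_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """L1 distance between the Mel spectra of real and generated waveforms."""
    return torch.nn.functional.l1_loss(mel_fn(x_hat), mel_fn(x))

def generator_loss(discriminators, x, x_hat, lam_stft=2.5, lam_mel=45.0):
    """LSGAN-style generator loss over all sub-discriminators plus auxiliary terms."""
    adv = sum(((d(x_hat) - 1.0) ** 2).mean() for d in discriminators)
    return adv + lam_stft * multi_resolution_stft_loss(x, x_hat) + lam_mel * mel_loss(x, x_hat)

def discriminator_loss(discriminators, x, x_hat):
    """LSGAN-style discriminator loss: real scores pushed toward 1, fake scores toward 0."""
    return sum(((d(x) - 1.0) ** 2).mean() + (d(x_hat.detach()) ** 2).mean()
               for d in discriminators)
```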
The Fourier transform of a signal does not retain the signal's time-domain information, and the window width of an ordinary short-time Fourier transform is fixed, so it cannot meet the frequency-band requirements of non-stationary signal variation; a speech signal, however, is non-stationary and time-varying. To suit these characteristics of speech signals, the multi-scale short-time Fourier transform auxiliary loss is proposed: by combining several short-time Fourier transform losses computed under different analysis parameters, the time-frequency characteristics of speech can be learned. Furthermore, it prevents the generator from overfitting to the STFT representation of a single fixed window width, which effectively improves the final quality of the whole generated waveform. The audio waveform is thus generated more efficiently and quickly without reducing the sound quality of the synthesized speech audio.
The text sample data and the corresponding audio sample data are organized: the audio data is first de-duplicated and noise-reduced, and then cut into clips of about 8 seconds each. The text sample data is organized and split accordingly, the text corresponding to each audio clip is checked, special symbols are removed, and Arabic numerals are converted into the corresponding text words. Chinese text data is converted into a pinyin format with tones. The aligned audio and text data are then fed into the sound spectrum prediction network, which is trained to generate the acoustic features corresponding to the audio, namely the Mel spectrum.
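A minimal sketch of cutting the cleaned recordings into clips of roughly 8 seconds, again assuming librosa and soundfile; the file layout and the handling of the final shorter remainder are illustrative choices.

```python
# Splitting cleaned audio into ~8 second training clips (illustrative sketch).
import librosa
import soundfile as sf

def split_into_clips(in_path: str, out_prefix: str, sr: int = 22050, clip_seconds: float = 8.0) -> int:
    """Write consecutive ~8 s clips; returns the number of clips written."""
    y, _ = librosa.load(in_path, sr=sr)
    clip_len = int(clip_seconds * sr)
    count = 0
    for start in range(0, len(y) - clip_len + 1, clip_len):
        sf.write(f"{out_prefix}_{count:04d}.wav", y[start:start + clip_len], sr)
        count += 1
    return count

n = split_into_clips("clean/long_recording.wav", "clips/long_recording")
print(f"wrote {n} clips")
```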
The clipped clean audio sample data is then fed into the improved multi-scale HiFi-GAN vocoder model for training. The resulting vocoder model parameters are used during speech synthesis to generate audio from the Mel spectrum, and the Mel spectrum loss is added to improve the training efficiency of the generator and the fidelity of the generated audio.
The multi-scale short-time Fourier transform loss replaces the feature matching loss in HiFi-GAN. The window width of an ordinary short-time Fourier transform is fixed, while a speech signal is non-stationary and time-varying, so a fixed-window STFT cannot meet the frequency-band requirements of non-stationary signal variation. The multi-scale STFT computes the STFT loss under multiple sets of analysis parameters, such as the FFT size, window size and frame shift, and thus measures the difference between the real audio and the generated audio more effectively. Meanwhile, the multi-band processing method of Multi-Band MelGAN is adopted: the Mel spectrum is taken as input, sub-band signals are generated, and the sub-band signals are combined into a full-band signal that is fed into the discriminator.
According to the vocoder speech synthesis method, the text file to be synthesized is obtained, acoustic features are extracted by means of the sound spectrum prediction network, and the extracted Mel spectrum is input into the improved multi-scale HiFi-GAN vocoder model for speech synthesis. The multi-scale short-time Fourier transform measures the difference between the real audio and the generated audio more effectively, so the final quality of the whole generated waveform can be quickly and effectively improved and the sound quality of the synthesized speech audio is improved.
Fig. 4 is a schematic block diagram of a vocoder speech synthesis apparatus 300 according to an embodiment of the present invention. As shown in fig. 4, the present invention also provides a vocoder speech synthesis apparatus 300 corresponding to the above vocoder speech synthesis method. The vocoder speech synthesis apparatus 300 includes a unit for performing the above-described vocoder speech synthesis method, and the apparatus may be configured in a server. Specifically, referring to fig. 4, the vocoder speech synthesis apparatus 300 includes a file acquiring unit 301, an acoustic feature extracting unit 302, and a speech synthesizing unit 303.
A file acquiring unit 301 configured to acquire a text file to be synthesized; an acoustic feature extraction unit 302, configured to input the text file to be synthesized into a sound spectrum prediction network to extract an acoustic feature, so as to obtain a mel-frequency spectrum; and a speech synthesis unit 303, configured to input the mel spectrum into a modified multi-scale HiFi-GAN vocoder model for speech synthesis to obtain a speech audio.
In one embodiment, the speech synthesis apparatus further comprises: and the prediction network generation unit is used for training the deep learning network after preprocessing the text sample data and the corresponding audio sample data to obtain the acoustic spectrum prediction network.
In an embodiment, the speech synthesis apparatus further comprises a vocoder generation unit for training the HiFi-GAN vocoder with audio sample data with mel spectrum to obtain an improved multi-scale HiFi-GAN vocoder model, and the feature matching loss in the HiFi-GAN vocoder is replaced by multi-scale short-time fourier transform.
In an embodiment, the speech synthesis unit 303 is configured to generate subband signals by using a multi-band processing method according to the modified multi-scale HiFi-GAN vocoder model, and add the subband signals as a full-band signal to be input to the discriminator for speech synthesis.
In one embodiment, the prediction network generation unit includes a data acquisition subunit, a text processing subunit, a noise reduction subunit, an alignment subunit, a construction subunit, and a training subunit.
The data acquisition subunit is used for acquiring text sample data and corresponding audio sample data; a text processing subunit, configured to clip the text sample data, remove a special symbol, convert an arabic number into a corresponding text word, and convert the chinese text data in the text sample data into a pinyin format or a phoneme format to obtain a standard format text; the noise reduction subunit is used for carrying out noise reduction and denoising processing on the audio sample data to obtain processed audio data; the aligning subunit is used for aligning the standard format text and the processed audio data to obtain data to be trained; the building subunit is used for building a deep learning network; and the training subunit is used for training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
It should be noted that, as will be clear to those skilled in the art, the specific implementation process of the vocoder speech synthesis apparatus 300 and each unit can refer to the corresponding description in the foregoing method embodiments, and for convenience and brevity of description, no further description is provided herein.
The vocoder speech synthesis apparatus 300 described above may be implemented in the form of a computer program that may be run on a computer device such as that shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a server, wherein the server may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 5, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer programs 5032 comprise program instructions that, when executed, cause the processor 502 to perform a vocoder speech synthesis method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 may perform a vocoder speech synthesis method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will appreciate that the configuration shown in fig. 5 is a block diagram of only a portion of the configuration associated with the present application and does not constitute a limitation of the computer device 500 to which the present application may be applied, and that a particular computer device 500 may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
Wherein the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
acquiring a text file to be synthesized; inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum; and inputting the Mel frequency spectrum into a modified multi-scale HiFi-GAN vocoder model for voice synthesis to obtain voice audio.
The acoustic spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
The improved multi-scale HiFi-GAN vocoder model is a speech synthesis vocoder model whose network model parameters are obtained by training on audio data; and the feature matching loss in the improved multi-scale HiFi-GAN vocoder model is replaced by a multi-scale short-time Fourier transform loss.
The improved multi-scale HiFi-GAN vocoder model comprises a generator and a discriminator and carries out optimization of the network through multi-scale short-time Fourier transform loss, Mel frequency spectrum loss and GAN network loss.
In an embodiment, when the step of inputting the mel spectrum into the improved multi-scale HiFi-GAN vocoder model for speech synthesis to obtain the speech audio is implemented by the processor 502, the following steps are specifically implemented:
the improved multi-scale HiFi-GAN vocoder model adopts a multi-band processing method, takes a Mel frequency spectrum as input, generates sub-band signals, adds the sub-band signals as a full-band signal and inputs the full-band signal into a discriminator to carry out voice synthesis.
In an embodiment, when implementing the step that the audio spectrum prediction network is obtained by training the deep learning network after preprocessing text sample data and corresponding audio sample data, the processor 502 specifically implements the following steps:
acquiring text sample data and corresponding audio sample data; editing the text sample data, removing special symbols, converting Arabic numerals into corresponding text words, and converting Chinese text data in the text sample data into a Pinyin format or a phoneme format to obtain a standard format text; performing noise reduction and cleaning on the audio sample data to obtain processed audio data; aligning the standard format text and the processed audio data to obtain data to be trained; constructing a deep learning network; and training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
It should be understood that in the embodiment of the present Application, the Processor 502 may be a Central Processing Unit (CPU), and the Processor 502 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will be understood by those skilled in the art that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program instructing associated hardware. The computer program includes program instructions, and the computer program may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program, wherein the computer program, when executed by a processor, causes the processor to perform the steps of:
acquiring a text file to be synthesized; inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum; and inputting the Mel frequency spectrum into a modified multi-scale HiFi-GAN vocoder model for voice synthesis to obtain voice audio.
The acoustic spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
The improved multi-scale HiFi-GAN vocoder model is a speech synthesis vocoder model whose network model parameters are obtained by training on audio data; and the feature matching loss in the improved multi-scale HiFi-GAN vocoder model is replaced by a multi-scale short-time Fourier transform loss.
The improved multi-scale HiFi-GAN vocoder model comprises a generator and a discriminator and carries out optimization of the network through multi-scale short-time Fourier transform loss, Mel frequency spectrum loss and GAN network loss.
In an embodiment, when the processor executes the computer program to implement the step of inputting the Mel spectrum into the improved multi-scale HiFi-GAN vocoder model for speech synthesis to obtain speech audio, the processor specifically implements the following steps:
the improved multi-scale HiFi-GAN vocoder model adopts a multi-band processing method, takes a Mel frequency spectrum as input, generates sub-band signals, adds the sub-band signals as a full-band signal and inputs the full-band signal into a discriminator to carry out voice synthesis.
In an embodiment, when the processor executes the computer program to implement the step that the audio spectrum prediction network is obtained by training the deep learning network after preprocessing text sample data and corresponding audio sample data, the following steps are specifically implemented:
acquiring text sample data and corresponding audio sample data; editing the text sample data, removing special symbols, converting Arabic numerals into corresponding text words, and converting Chinese text data in the text sample data into a Pinyin format or a phoneme format to obtain a standard format text; performing noise reduction and cleaning on the audio sample data to obtain processed audio data; aligning the standard format text and the processed audio data to obtain data to be trained; constructing a deep learning network; and training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
The storage medium may be a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk, which can store various computer readable storage media.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described above in general functional terms in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, various elements or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be merged, divided and deleted according to actual needs. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for synthesizing speech from a vocoder, comprising:
acquiring a text file to be synthesized;
inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum;
and inputting the Mel frequency spectrum into a modified multi-scale HiFi-GAN vocoder model for voice synthesis to obtain voice audio.
2. The method of claim 1, wherein the voice spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data.
3. The method of claim 1, wherein the improved multi-scale HiFi-GAN vocoder model is a speech synthesis vocoder model whose network model parameters are obtained by training on audio data; and the feature matching loss in the improved multi-scale HiFi-GAN vocoder model is replaced by a multi-scale short-time Fourier transform loss.
4. The method as claimed in claim 1, wherein the inputting the mel spectrum into a modified multi-scale HiFi-GAN vocoder model for speech synthesis to obtain speech audio comprises:
the improved multi-scale HiFi-GAN vocoder model adopts a multi-band processing method, takes a Mel frequency spectrum as input, generates sub-band signals, adds the sub-band signals as a full-band signal and inputs the full-band signal into a discriminator to carry out voice synthesis.
5. The method of claim 2, wherein the voice spectrum prediction network is obtained by training a deep learning network after preprocessing text sample data and corresponding audio sample data, and comprises:
acquiring text sample data and corresponding audio sample data;
editing the text sample data, removing special symbols, converting Arabic numerals into corresponding text words, and converting Chinese text data in the text sample data into a Pinyin format or a phoneme format to obtain a standard format text;
performing noise reduction and cleaning on the audio sample data to obtain processed audio data;
aligning the standard format text and the processed audio data to obtain data to be trained;
constructing a deep learning network;
and training the deep learning network by using the data to be trained so as to determine the sound spectrum prediction network.
6. The method as claimed in claim 3, wherein the improved multi-scale HiFi-GAN vocoder model comprises a generator and a discriminator, and the optimization of the network is performed by multi-scale short-time fourier transform loss, mel-frequency spectrum loss, GAN network loss.
7. A vocoder speech synthesis apparatus, comprising:
the file acquisition unit is used for acquiring a text file to be synthesized;
the acoustic feature extraction unit is used for inputting the text file to be synthesized into a sound spectrum prediction network to extract acoustic features so as to obtain a Mel frequency spectrum;
and the voice synthesis unit is used for inputting the Mel frequency spectrum into the improved multi-scale HiFi-GAN vocoder model for voice synthesis so as to obtain voice audio.
8. The vocoder speech synthesis apparatus of claim 7, further comprising:
and the prediction network generation unit is used for training the deep learning network after preprocessing the text sample data and the corresponding audio sample data to obtain the acoustic spectrum prediction network.
9. A computer device, characterized in that the computer device comprises a memory, on which a computer program is stored, and a processor, which when executing the computer program implements the method according to any of claims 1 to 6.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202111139651.5A 2021-09-27 2021-09-27 Vocoder speech synthesis method, device, computer equipment and storage medium Pending CN113744715A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111139651.5A CN113744715A (en) 2021-09-27 2021-09-27 Vocoder speech synthesis method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111139651.5A CN113744715A (en) 2021-09-27 2021-09-27 Vocoder speech synthesis method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113744715A true CN113744715A (en) 2021-12-03

Family

ID=78741465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111139651.5A Pending CN113744715A (en) 2021-09-27 2021-09-27 Vocoder speech synthesis method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113744715A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704999A (en) * 2022-09-15 2023-09-05 荣耀终端有限公司 Audio data processing method and device, storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200234690A1 (en) * 2019-01-18 2020-07-23 Snap Inc. Text and audio-based real-time face reenactment
CN112466274A (en) * 2020-10-29 2021-03-09 中科上声(苏州)电子有限公司 In-vehicle active sounding method and system of electric automobile
CN113066475A (en) * 2021-06-03 2021-07-02 成都启英泰伦科技有限公司 Speech synthesis method based on generating type countermeasure network
CN113327576A (en) * 2021-06-03 2021-08-31 多益网络有限公司 Speech synthesis method, apparatus, device and storage medium
CN113409759A (en) * 2021-07-07 2021-09-17 浙江工业大学 End-to-end real-time speech synthesis method
CN113744714A (en) * 2021-09-27 2021-12-03 深圳市木愚科技有限公司 Speech synthesis method, speech synthesis device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200234690A1 (en) * 2019-01-18 2020-07-23 Snap Inc. Text and audio-based real-time face reenactment
CN112466274A (en) * 2020-10-29 2021-03-09 中科上声(苏州)电子有限公司 In-vehicle active sounding method and system of electric automobile
CN113066475A (en) * 2021-06-03 2021-07-02 成都启英泰伦科技有限公司 Speech synthesis method based on generating type countermeasure network
CN113327576A (en) * 2021-06-03 2021-08-31 多益网络有限公司 Speech synthesis method, apparatus, device and storage medium
CN113409759A (en) * 2021-07-07 2021-09-17 浙江工业大学 End-to-end real-time speech synthesis method
CN113744714A (en) * 2021-09-27 2021-12-03 深圳市木愚科技有限公司 Speech synthesis method, speech synthesis device, computer equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704999A (en) * 2022-09-15 2023-09-05 荣耀终端有限公司 Audio data processing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
Hu et al. Pitch‐based gender identification with two‐stage classification
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
JP2007523374A (en) Method and system for generating training data for an automatic speech recognizer
CN111508498A (en) Conversational speech recognition method, system, electronic device and storage medium
CA3195582A1 (en) Audio generator and methods for generating an audio signal and training an audio generator
Rixen et al. Sfsrnet: Super-resolution for single-channel audio source separation
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111724809A (en) Vocoder implementation method and device based on variational self-encoder
AU2015411306A1 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN114495969A (en) Voice recognition method integrating voice enhancement
CN113593588A (en) Multi-singer singing voice synthesis method and system based on generation countermeasure network
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
Xie et al. Noisy-to-noisy voice conversion framework with denoising model
KR100766170B1 (en) Music summarization apparatus and method using multi-level vector quantization
Kamble et al. Teager energy subband filtered features for near and far-field automatic speech recognition
CN115035904A (en) High-quality vocoder model based on generative antagonistic neural network
Hussain et al. A Novel Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
CN114550741A (en) Semantic recognition method and system
Skariah et al. Review of speech enhancement methods using generative adversarial networks
Doumanidis et al. Rnnoise-ex: Hybrid speech enhancement system based on rnn and spectral features
Fahmeeda et al. Voice Based Gender Recognition Using Deep Learning
CN117153196B (en) PCM voice signal processing method, device, equipment and medium
Song et al. AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination