CN116453501A - Speech synthesis method based on neural network and related equipment - Google Patents
Speech synthesis method based on neural network and related equipment
- Publication number
- CN116453501A (application number CN202310420753.7A)
- Authority
- CN
- China
- Prior art keywords
- voice
- hifi
- model
- text
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Telephonic Communication Services (AREA)
Abstract
The method comprises the steps of: obtaining a speaking audio sample set; training a preset voice separation model based on the speaking audio sample set; determining base vectors of a plurality of waveforms from the model parameters of the trained voice separation model; obtaining a text to be synthesized; obtaining spectral features corresponding to the text to be synthesized through a trained sound spectrum prediction model; and finally inputting the spectral features into a trained Hifi-Gan generator to obtain weight parameters corresponding to the base vectors through the Hifi-Gan generator, and synthesizing the voice audio of the text to be synthesized according to the base vectors and the weight parameters. According to the embodiment of the application, the audio waveform is represented by the base vectors of the waveforms and the weight parameters, so that the dimension and the number of layers of the up-sampling network of the Hifi-Gan generator are reduced, and the speech synthesis speed can be improved while the quality of the generated speech waveform is ensured.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and related device for synthesizing speech based on a neural network.
Background
With the development of neural networks, speech synthesis technology has made tremendous progress. Most neural-network-based speech synthesis techniques can be divided into two stages: predicting an intermediate representation of the speech, such as spectral features; and constructing a vocoder that synthesizes the original audio from the intermediate representation.
However, because related vocoders have too many model parameters, their computational efficiency is low, and real-time speech synthesis cannot be performed on low-end devices such as smartphones.
Disclosure of Invention
The main objective of the embodiments of the present application is to provide a neural-network-based speech synthesis method, apparatus, electronic device and computer-readable storage medium, which can improve the speech synthesis speed while ensuring the quality of the generated speech waveform.
To achieve the above object, a first aspect of an embodiment of the present application proposes a method for synthesizing speech based on a neural network, the method including:
acquiring a speaking audio sample set;
training a preset initial voice separation model based on the speaking audio sample set to obtain a target voice separation model;
determining basis vectors of a plurality of waveforms according to model parameters of the target voice separation model;
obtaining a text to be synthesized;
performing feature extraction processing on the text to be synthesized to obtain language feature information corresponding to the text to be synthesized;
inputting the language characteristic information into a trained sound spectrum prediction model to generate a first frequency spectrum characteristic which accords with a preset speaker and speaks the text to be synthesized through the sound spectrum prediction model;
inputting the first frequency spectrum characteristic to a trained Hifi-Gan generator to obtain a first weight parameter corresponding to the base vector through the Hifi-Gan generator;
and synthesizing the target voice audio of the text to be synthesized according to the base vector and the first weight parameter.
According to some embodiments of the present invention, the initial speech separation model includes an encoder, a separator, and a decoder;
the basis vector is a filter parameter in the decoder.
According to some embodiments of the present invention, the speaking audio sample set includes a plurality of audio samples, the audio samples being obtained by mixing random noise with the original speaking audio of a single speaker;
training a preset initial voice separation model based on the speaking audio sample set to obtain a target voice separation model, wherein the training comprises the following steps:
inputting acoustic wave information of a plurality of the audio samples to an encoder of the initial voice separation model, so as to obtain, through the encoder, mixed weight parameters of the audio samples corresponding to the base vectors of a plurality of waveforms;
inputting the mixed weight parameters to a separator of the initial voice separation model to obtain second weight parameters corresponding to the random noise and third weight parameters of the original speaking audio through the separator;
inputting the second weight parameter and the third weight parameter to a decoder of the initial speech separation model to obtain a first predicted waveform of the random noise and a second predicted waveform of the original speaking audio through the decoder;
determining a first voice separation loss value according to the acoustic wave information of the random noise, the first predicted waveform and a preset voice separation loss function;
determining a second speech separation loss value according to the acoustic information of the original speech audio, the second predicted waveform and the speech separation loss function;
updating model parameters of the initial voice separation model based on the first voice separation loss value and the second voice separation loss value to obtain a target voice separation model.
According to some embodiments of the present invention, the training process of the Hifi-Gan generator includes:
acquiring a second frequency spectrum characteristic of the training text set;
inputting the second spectral features of the training text set to the Hifi-Gan generator to obtain a fourth weight parameter corresponding to the basis vector through the Hifi-Gan generator;
determining a first generation loss value according to the fourth weight parameter, a preset weight threshold and a first generation loss function;
updating model parameters of the Hifi-Gan generator based on the first generation loss value to obtain a trained Hifi-Gan generator.
According to some embodiments of the present invention, the first generation loss function is determined by the following formula:
L_weight = ||W - Ŵ||_1;
wherein L_weight is the first generation loss value, W is the fourth weight parameter, Ŵ is the weight threshold, and ||·||_1 denotes the first norm (L1 norm).
According to the voice synthesis method based on the neural network provided by some embodiments of the present invention, after the second spectral feature of the training text set is input to the Hifi-Gan generator to obtain a fourth weight parameter corresponding to the basis vector through the Hifi-Gan generator, the method further includes:
synthesizing training voice audio corresponding to the training text set according to the base vector and the fourth weight parameter;
determining a probability value of the training voice audio as a real sample, and determining a second generation loss value according to the probability value and a preset second generation loss function;
updating model parameters of the Hifi-Gan generator based on the first generation loss value to obtain a trained Hifi-Gan generator, wherein the method comprises the following steps:
updating model parameters of the Hifi-Gan generator based on the first generated loss value and the second generated loss value to obtain a trained Hifi-Gan generator.
According to some embodiments of the present invention, the second generation loss function is determined by the following formula:
L_adv = (D(G(S)) - 1)^2;
wherein L_adv is the second generation loss value, S is the second spectral feature, G(S) is the training voice audio, and D(G(S)) is the probability value that the training voice audio is a real sample.
To achieve the above object, a second aspect of the embodiments of the present application proposes a voice synthesis apparatus based on a neural network, the apparatus comprising:
the first acquisition module is used for acquiring a speaking audio sample set;
the model training module is used for training a preset initial voice separation model based on the speaking audio sample set to obtain a target voice separation model;
the base vector acquisition module is used for determining base vectors of a plurality of waveforms according to model parameters of the target voice separation model;
the second acquisition module is used for acquiring a text to be synthesized;
the voice feature extraction module is used for carrying out feature extraction processing on the text to be synthesized to obtain language feature information corresponding to the text to be synthesized;
the spectrum feature extraction module is used for inputting the language feature information into a trained sound spectrum prediction model so as to generate a first spectrum feature which accords with a preset speaker and speaks the text to be synthesized through the sound spectrum prediction model;
the weight acquisition module is used for inputting the first frequency spectrum characteristics to a trained Hifi-Gan generator so as to acquire first weight parameters corresponding to the base vector through the Hifi-Gan generator;
and the voice synthesis module is used for synthesizing the target voice audio of the text to be synthesized according to the base vector and the first weight parameter.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, the electronic device comprising a memory, a processor, a computer program stored on the memory and executable on the processor, the computer program implementing the method of the first aspect when executed by the processor.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium, for computer-readable storage, the storage medium storing one or more computer programs executable by one or more processors to implement the method described in the first aspect.
The method comprises the steps of: obtaining a speaking audio sample set; training a preset initial voice separation model based on the speaking audio sample set to obtain a target voice separation model; determining base vectors of a plurality of waveforms according to the model parameters of the target voice separation model; obtaining a text to be synthesized; performing feature extraction processing on the text to be synthesized to obtain language feature information corresponding to the text to be synthesized; inputting the language feature information into a trained sound spectrum prediction model to generate, through the sound spectrum prediction model, a first spectral feature of a preset speaker speaking the text to be synthesized; and finally inputting the first spectral feature into a trained Hifi-Gan generator to obtain a first weight parameter corresponding to the base vectors through the Hifi-Gan generator, and synthesizing the target voice audio of the text to be synthesized according to the base vectors and the first weight parameter. According to the embodiment of the application, the audio waveform is represented by the base vectors of the waveforms and the weight parameters, so that the dimension and the number of layers of the up-sampling network of the Hifi-Gan generator are reduced, and the speech synthesis speed can be improved while the quality of the generated speech waveform is ensured.
Drawings
Fig. 1 is a schematic flow chart of a voice synthesis method based on a neural network according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of the substeps of step S120 in FIG. 1;
FIG. 3 is a schematic flow chart of a method for synthesizing speech based on a neural network according to another embodiment of the present application;
FIG. 4 is a schematic flow chart of a method for synthesizing speech based on a neural network according to another embodiment of the present application;
FIG. 5 is a schematic diagram of a model training process of a speech separation model according to an embodiment of the present application;
FIG. 6 is a flow chart of a method for synthesizing speech based on a neural network according to another embodiment of the present application;
fig. 7 is a schematic structural diagram of a voice synthesis device based on a neural network according to an embodiment of the present application;
fig. 8 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It is noted that unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
With the development of neural networks, speech synthesis technology has made tremendous progress. Most neural-network-based speech synthesis techniques can be divided into two stages: predicting an intermediate representation of the speech, such as spectral features; and constructing a vocoder that synthesizes the original audio from the intermediate representation. Assuming the intermediate representation is a mel spectrum, there are three common methods in the related art for converting the mel spectrum into a speech waveform:
1. pure signal processing methods, such as the Griffin-Lim algorithm, whose disadvantage is that the generated speech waveform contains serious artifacts;
2. autoregressive neural network models, such as the WaveNet neural vocoder, whose disadvantages are long synthesis time and unsuitability for real-time applications;
3. non-autoregressive neural network models, such as the WaveGlow neural vocoder.
Meanwhile, with the wide application of generative adversarial networks (GANs) in the field of image generation, GAN-based vocoders, such as the Hifi-Gan neural vocoder, have also gradually emerged.
Although vocoders based on generative adversarial networks are computationally more efficient than autoregressive vocoders, real-time speech audio synthesis on a processor remains a challenging task. For example, the Hifi-Gan vocoder can produce relatively high-quality audio, but its main computational cost lies in the up-sampling layers of its generator, which is designed to directly generate and output a time series at waveform resolution; the complexity of this process is closely related to the number of samples corresponding to each frame, and real-time speech synthesis is still not possible on some low-end devices such as mobile phones. It therefore becomes extremely important to further reduce the parameters of the model to increase the computational efficiency and thus the speech synthesis speed.
Based on the above, the embodiment of the application provides a voice synthesis method, a voice synthesis device, an electronic device and a computer readable storage medium based on a neural network, which can improve the voice synthesis speed while guaranteeing the voice waveform generation quality.
The embodiment of the application provides a voice synthesis method based on a neural network, a device, an electronic device and a computer readable storage medium, and specifically, the following embodiment is used to describe the voice synthesis method based on the neural network in the embodiment of the application.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
The neural-network-based speech synthesis method provided by the embodiments of the present application may be applied to a terminal, a server, or software running on the terminal or the server. In some embodiments, the terminal may be a smartphone, tablet, notebook, desktop computer, etc.; the server may be configured as an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms; the software may be an application that implements the neural-network-based speech synthesis method, but is not limited to the above forms.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Referring to fig. 1, fig. 1 shows a flow chart of a voice synthesis method based on a neural network according to an embodiment of the present application. As shown in fig. 1, the neural network-based voice synthesis method includes, but is not limited to, steps S110 to S180:
Step S110, acquiring a speaking audio sample set;
step S120, training a preset initial voice separation model based on the speaking audio sample set to obtain a target voice separation model;
step S130, determining base vectors of a plurality of waveforms according to model parameters of the target voice separation model;
step S140, obtaining a text to be synthesized;
step S150, carrying out feature extraction processing on the text to be synthesized to obtain language feature information corresponding to the text to be synthesized;
step S160, inputting the language characteristic information into a trained sound spectrum prediction model to generate a first frequency spectrum characteristic which accords with a preset speaker and speaks the text to be synthesized through the sound spectrum prediction model;
step S170, inputting the first frequency spectrum characteristic to a trained Hifi-Gan generator to obtain a first weight parameter corresponding to the base vector through the Hifi-Gan generator;
step S180, synthesizing the target voice audio of the text to be synthesized according to the base vector and the first weight parameter.
It can be understood that the neural network-based speech synthesis method first trains a preset initial speech separation model to obtain a target speech separation model. And obtaining base vectors of a plurality of waveforms from model parameters of the target voice separation model, predicting weights of the base vectors corresponding to input frequency spectrum features through a trained Hifi-Gan generator, and reconstructing voice audio waveforms based on the base vectors and the weights, namely, representing the audio waveforms by using the base vectors of the waveforms and the corresponding weights.
Because the base vectors of the waveforms are fixed, the dimension and the number of layers of the up-sampling network in the Hifi-Gan generator can be reduced to match the output weights, and the voice audio can then be synthesized from the base vectors and the weights; the output dimension of the generator is far smaller than that of the audio waveform.
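By way of illustration only, the reconstruction described above reduces to a matrix multiplication of fixed base vectors and predicted weights. The following minimal Python/NumPy sketch uses assumed toy dimensions and random values; it is not the claimed implementation:

```python
import numpy as np

# Toy dimensions (assumed for illustration): N base vectors, each a 1-D
# waveform segment of L samples; T frames of predicted weights.
N, L, T = 256, 20, 100

# Basis matrix B: each row is one waveform base vector (a learned 1-D filter
# of the speech separation model's decoder).
B = np.random.randn(N, L)

# Weight matrix W: one N-dimensional weight vector per frame, as predicted by
# the reduced Hifi-Gan generator.  N is far smaller than the number of raw
# waveform samples the generator would otherwise have to output per frame.
W = np.random.randn(T, N)

# Reconstruction: each frame's waveform segment is a weighted sum of the base
# vectors (the patent's Y = B.W, written here with the frame axis first).
segments = W @ B              # shape (T, L)
audio = segments.reshape(-1)  # concatenate non-overlapping frames
print(audio.shape)            # (T * L,)
```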
In some embodiments, the initial speech separation model includes an encoder, a separator, and a decoder, the basis vector being a filter parameter in the decoder.
Referring to fig. 5, fig. 5 is a schematic diagram of a model training flow of a speech separation model according to an embodiment of the present application, and as shown in fig. 5, the initial speech separation model includes an encoder, a separator, and a decoder, where filter parameters in the decoder are basis vectors of waveforms.
In some embodiments, the initial speech separation model is a time-domain audio separation network (TasNet, Time-domain Audio Separation Network).
In some embodiments, the set of speaking audio samples includes a plurality of audio samples derived from a mixture of random noise and the original speaking audio of the individual speaker;
referring to fig. 2, fig. 2 is a schematic flow chart of a substep of step S120 in fig. 1, and as shown in fig. 2, the training of the preset initial speech separation model based on the speaking audio sample set to obtain a target speech separation model includes:
Step S210, inputting acoustic wave information of a plurality of the audio samples to an encoder of the initial voice separation model so as to obtain mixed weight parameters of base vectors of a plurality of waveforms corresponding to the audio samples through the encoder;
step S220, the mixed weight parameter is input to a separator of the initial speech separation model, so as to obtain a second weight parameter corresponding to the random noise and a third weight parameter of the original speaking audio through the separator;
step S230, inputting the second weight parameter and the third weight parameter to a decoder of the initial speech separation model to obtain a first predicted waveform of the random noise and a second predicted waveform of the original speaking audio through the decoder;
step S240, determining a first speech separation loss value according to the acoustic wave information of the random noise, the first predicted waveform and a preset speech separation loss function;
step S250, determining a second speech separation loss value according to the acoustic information of the original speaking audio, the second predicted waveform and the speech separation loss function;
step S260, updating the model parameters of the initial speech separation model based on the first speech separation loss value and the second speech separation loss value to obtain a target speech separation model.
As shown in fig. 5, the random noise and the original speech audio of a single speaker are mixed to obtain an audio sample, that is, the random noise and the original waveform are mixed to obtain a mixed waveform, and the acoustic information (mixed waveform) of the audio sample is input into an encoder in the initial speech separation model, and the encoder performs feature extraction to estimate mixing weight parameters.
The mixed weight parameters are split by the separator to obtain the weight W_n corresponding to the random noise and the weight W_s corresponding to the original speaking audio, wherein the separator generates a mask matrix for the random noise and the original waveform from the mixed weight parameters, and obtains the weights W_n and W_s based on the mixed weight parameters and the mask matrix.
Finally, the decoder synthesizes the reconstructed waveform of the single speaker and the random noise by matrix multiplication of the base signals with the weights W_n and W_s, where the base signals are represented as a basis matrix B (Basis Matrix), and each row of the basis matrix B corresponds to a 1-D filter in the decoder (jointly learned with the other sub-modules of the initial speech separation model) and serves as the base vector of a waveform. Thus, the output waveform of the initial speech separation model can be expressed as Y = B·W.
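The interaction of encoder, separator and decoder can be sketched as follows. The module sizes, layer choices and shapes below are assumptions for illustration and do not reproduce the TasNet architecture in detail; the point is that the decoder's transposed-convolution filters play the role of the basis matrix B described above:

```python
import torch
import torch.nn as nn

class ToySeparationModel(nn.Module):
    """Minimal TasNet-style sketch: encoder -> masks -> basis-matrix decoder."""
    def __init__(self, n_basis=256, win=20, n_sources=2):
        super().__init__()
        self.n_basis, self.n_sources = n_basis, n_sources
        # Encoder: 1-D conv turning each win-sample frame into N mixture weights.
        self.encoder = nn.Conv1d(1, n_basis, kernel_size=win, stride=win, bias=False)
        # Separator: predicts one mask per source over the mixture weights.
        self.separator = nn.Sequential(
            nn.Conv1d(n_basis, n_basis * n_sources, kernel_size=1), nn.Sigmoid())
        # Decoder: transposed conv whose filters act as the waveform base vectors.
        self.decoder = nn.ConvTranspose1d(n_basis, 1, kernel_size=win, stride=win, bias=False)

    def forward(self, mix):                      # mix: (batch, 1, samples)
        w_mix = torch.relu(self.encoder(mix))    # mixture weights (B, N, T)
        masks = self.separator(w_mix)            # (B, N * n_sources, T)
        masks = masks.view(-1, self.n_sources, self.n_basis, w_mix.size(-1))
        outs = []
        for k in range(self.n_sources):
            w_k = masks[:, k] * w_mix            # per-source weights (W_s, W_n)
            outs.append(self.decoder(w_k))       # waveform = basis matrix applied to weights
        return outs                              # [speech estimate, noise estimate]

model = ToySeparationModel()
mixture = torch.randn(1, 1, 2000)                # mixed waveform (speech + noise)
speech_hat, noise_hat = model(mixture)
print(speech_hat.shape)                          # torch.Size([1, 1, 2000])
```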
It will also be appreciated that after training the initial speech separation model, the basis vectors for the plurality of waveforms may be determined from the trained decoder.
Then, a first speech separation loss value is determined according to the acoustic information of the random noise, the first predicted waveform and a preset speech separation loss function; a second speech separation loss value is determined according to the acoustic information of the original speaking audio, the second predicted waveform and the speech separation loss function; and the model parameters of the initial speech separation model are updated based on the first speech separation loss value and the second speech separation loss value to obtain the target speech separation model.
In one particular embodiment, the initial speech separation model uses a scale-invariant source-to-noise ratio (SI-SNR) as the training target, where SI-SNR is defined as:
SI-SNR = 10·log10(||s_target||² / ||e_noise||²), where s_target = (⟨ŝ, s⟩ / ||s||²)·s and e_noise = ŝ - s_target;
wherein ŝ denotes the reconstructed waveform of the single speaker and s denotes the original waveform of the single speaker.
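As an illustrative sketch (variable names are assumptions), the SI-SNR training target can be computed as follows:

```python
import numpy as np

def si_snr(s_hat, s, eps=1e-8):
    """Scale-invariant source-to-noise ratio in dB (zero-mean signals assumed)."""
    s_hat = s_hat - s_hat.mean()
    s = s - s.mean()
    # Project the estimate onto the target to obtain the scaled target component.
    s_target = (np.dot(s_hat, s) / (np.dot(s, s) + eps)) * s
    e_noise = s_hat - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

s = np.random.randn(16000)                 # original waveform of a single speaker
s_hat = s + 0.1 * np.random.randn(16000)   # reconstructed waveform
print(round(si_snr(s_hat, s), 2))          # higher is better; its negative is minimized
```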
In a specific embodiment, the random noise is Gaussian noise, i.e., random noise subject to a Gaussian distribution. Mixing the original speaking audio of a single speaker with Gaussian noise to obtain audio samples and performing model training based on these audio samples can improve the robustness of the speech separation model.
It will be appreciated that the Hifi-Gan vocoder comprises a generator and a discriminator, wherein the input to the Hifi-Gan generator is the mel spectrum. Therefore, before speech synthesis with the Hifi-Gan generator, the intermediate representation of the speech, i.e. the spectral features corresponding to the speech to be synthesized, is predicted by the sound spectrum prediction model.
In some embodiments, the sound spectrum prediction model is a DurIAN model, which includes a duration prediction network and a spectrum prediction network.
The training process of the DurIAN model comprises: obtaining an audio signal recorded while a speaker speaks, text information expressing the content of the audio signal, and the spectral features converted from the audio signal; extracting linguistic information from the text information as the linguistic feature information; and training the duration prediction network with the linguistic feature information as samples and the duration of the audio signal as labels. After the duration prediction network is trained, the spectrum prediction network is trained with the linguistic feature information as samples and the spectral features as labels.
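A schematic sketch of this two-stage training is given below; the linear stand-in networks, random data and mean-squared-error loss are assumptions for illustration and do not reflect the actual DurIAN components:

```python
import torch
import torch.nn as nn

feat_dim, n_frames, n_mels = 64, 100, 80
ling_feats = torch.randn(32, feat_dim)                  # linguistic feature information (samples)
durations = torch.rand(32, 1)                           # label: duration of each audio signal
mels = torch.randn(32, n_frames, n_mels)                # label: spectral features of the audio

duration_net = nn.Linear(feat_dim, 1)                   # stage 1: duration prediction network
spectrum_net = nn.Linear(feat_dim, n_frames * n_mels)   # stage 2: spectrum prediction network

# Stage 1: linguistic features as samples, audio durations as labels.
opt1 = torch.optim.Adam(duration_net.parameters(), lr=1e-3)
for _ in range(10):
    loss = nn.functional.mse_loss(duration_net(ling_feats), durations)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2 (after stage 1): same samples, spectral features as labels.
opt2 = torch.optim.Adam(spectrum_net.parameters(), lr=1e-3)
for _ in range(10):
    pred = spectrum_net(ling_feats).view(-1, n_frames, n_mels)
    loss = nn.functional.mse_loss(pred, mels)
    opt2.zero_grad(); loss.backward(); opt2.step()
```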
It should be appreciated that the sound spectrum prediction model is used to predict the spectral features corresponding to the speech to be synthesized, and the specific model is not limited to the DurIAN model described above, as long as the spectral features of the speech can be predicted by the model.
In some embodiments, referring to fig. 3, the training process of the Hifi-Gan generator comprises:
step S310, obtaining a second frequency spectrum characteristic of the training text set;
step S320, inputting the second spectral feature of the training text set to the Hifi-Gan generator to obtain a fourth weight parameter corresponding to the base vector through the Hifi-Gan generator;
step S330, determining a first generation loss value according to the fourth weight parameter, a preset weight threshold and a first generation loss function;
and step S340, updating model parameters of the Hifi-Gan generator based on the first generation loss value to obtain a trained Hifi-Gan generator.
Referring to fig. 6, fig. 6 is a flow chart of a voice synthesis method based on a neural network according to another embodiment of the present application. As shown in fig. 6, the Hifi-Gan generator is a fully convolutional neural network that takes the mel spectrum as input and up-samples it by transposed convolutions, typically until the length of the output sequence matches the time resolution of the original waveform. The Hifi-Gan generator specifically comprises N stacked transposed convolutional layers (ConvTranspose) and Multi-Receptive Field Fusion (MRF) modules.
For example, as shown in FIG. 6, the up-sampling network in the Hifi-Gan generator is reduced, i.e., the dimension and the number of layers of the convolutional neural network are reduced, to match the output weight W_s; the target voice audio is then synthesized from the weight W_s and a basis matrix (Basis Matrix) comprising a plurality of base vectors.
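An illustrative sketch of such a reduced generator is given below; the layer sizes and the shallow up-sampling stack are assumptions (the real Hifi-Gan generator stacks more transposed convolutions and MRF blocks), and the basis matrix is a random stand-in for the one learned by the speech separation model:

```python
import torch
import torch.nn as nn

class WeightGenerator(nn.Module):
    """Sketch of a reduced generator: mel frames in, base-vector weights out.

    Because the output per frame is an N-dimensional weight vector rather than
    raw waveform samples, only a shallow up-sampling stack is kept here.
    """
    def __init__(self, n_mels=80, n_basis=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=7, padding=3),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(256, 256, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv1d(256, n_basis, kernel_size=7, padding=3),
        )

    def forward(self, mel):          # mel: (batch, n_mels, frames)
        return self.net(mel)         # weights W_s: (batch, n_basis, 2 * frames)

gen = WeightGenerator()
mel = torch.randn(1, 80, 100)                    # spectral feature (mel spectrum)
weights = gen(mel)                               # (1, 256, 200)
basis = torch.randn(256, 20)                     # basis matrix from the separation model's decoder
waveform = (weights.transpose(1, 2) @ basis).reshape(1, -1)  # frame-wise B.W reconstruction
print(waveform.shape)                            # torch.Size([1, 4000])
```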
It should be appreciated that a second spectral feature of the training text set is obtained, wherein each training sample in the training text set comprises an audio signal recorded when the speaker speaks text information, the spectral feature converted by the audio signal, and a weight threshold corresponding to the audio signal. It will be appreciated that the weight threshold corresponding to the audio signal may be obtained by a trained speech separation model. Thus, the Hifi-Gan generator is trained with the second spectral feature of the training text set as a sample and the weight threshold as a label.
Specifically, the second spectral feature is input into the Hifi-Gan generator, which performs feature extraction on the second spectral feature and predicts a fourth weight parameter, i.e., the weight parameter corresponding to the audio signal recorded while the speaker speaks the text information. A first generation loss value is then determined according to the predicted fourth weight parameter, the preset weight threshold and the first generation loss function, and the model parameters of the Hifi-Gan generator are updated based on the first generation loss value to obtain the trained Hifi-Gan generator.
In some embodiments, the first generation loss function is determined by the following formula:
L_weight = ||W - Ŵ||_1;
wherein L_weight is the first generation loss value, W is the fourth weight parameter, Ŵ is the weight threshold, and ||·||_1 denotes the first norm (L1 norm).
Replacing the feature loss of the original Hifi-Gan generator with this weight loss speeds up the loss calculation.
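A minimal sketch of this weight loss follows; the tensor shapes and the batch-mean reduction are assumptions for illustration:

```python
import torch

def weight_loss(w_pred, w_ref):
    """L_weight = ||w_pred - w_ref||_1, mean-reduced over the batch.

    w_pred: fourth weight parameter predicted by the generator.
    w_ref:  weight threshold obtained from the trained speech separation model.
    """
    return torch.mean(torch.abs(w_pred - w_ref))

w_pred = torch.randn(1, 256, 200, requires_grad=True)
w_ref = torch.randn(1, 256, 200)
loss = weight_loss(w_pred, w_ref)
loss.backward()          # gradients flow back into the generator parameters
print(float(loss))
```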
It is further understood that the first generation loss function may also use the second norm (L2 norm), or other loss functions, to measure the weight loss, which is not limited by the embodiments of the present application.
In some embodiments, referring to fig. 4, fig. 4 is a voice synthesis method based on a neural network according to another embodiment of the present application, as shown in fig. 4, after the second spectral feature of the training text set is input to the Hifi-Gan generator to obtain a fourth weight parameter corresponding to the basis vector through the Hifi-Gan generator, the method further includes:
step S410, synthesizing training voice audio corresponding to the training text according to the base vector and the fourth weight vector;
step S420, determining a probability value of the training voice audio as a real sample, and determining a second generation loss value according to the probability value and a preset second generation loss function;
Updating model parameters of the Hifi-Gan generator based on the first generation loss value to obtain a trained Hifi-Gan generator, wherein the method comprises the following steps:
and step S430, updating model parameters of the Hifi-Gan generator based on the first generation loss value and the second generation loss value to obtain a trained Hifi-Gan generator.
It should be understood that, in addition to the weight loss mentioned in the foregoing embodiment, after the training voice audio corresponding to the training text set is predicted according to the base vector and the fourth weight parameter, the probability value that the training voice audio is a real sample is determined by the discriminator in the Hifi-Gan vocoder, the second generation loss value is determined according to the probability value and the preset second generation loss function, and the model parameters of the Hifi-Gan generator are updated according to the first generation loss value obtained from the weight loss and the second generation loss value until the first generation loss value meets a preset condition, thereby obtaining the trained Hifi-Gan generator.
In some embodiments, the second generation loss function is determined by the following formula:
L_adv = (D(G(S)) - 1)^2;
wherein L_adv is the second generation loss value, S is the second spectral feature, G(S) is the training voice audio, and D(G(S)) is the probability value that the training voice audio is a real sample.
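A hedged sketch of the second generation loss and the combined update is given below; the stand-in discriminator and the equal weighting of the two loss values are assumptions for illustration, not the configuration of the embodiments:

```python
import torch
import torch.nn as nn

# Placeholder discriminator standing in for the Hifi-Gan multi-period /
# multi-scale discriminators: waveform in, "probability of being real" out.
D = nn.Sequential(nn.Conv1d(1, 16, 15, stride=4, padding=7),
                  nn.LeakyReLU(0.1),
                  nn.Conv1d(16, 1, 3, padding=1),
                  nn.AdaptiveAvgPool1d(1),
                  nn.Flatten(),
                  nn.Sigmoid())

g_s = torch.randn(1, 1, 4000, requires_grad=True)   # training voice audio G(S)
d_out = D(g_s)                                       # D(G(S))
l_adv = torch.mean((d_out - 1.0) ** 2)               # L_adv = (D(G(S)) - 1)^2

l_weight = torch.tensor(0.3)      # placeholder first generation loss value (weight loss)
total = l_adv + l_weight          # equal weighting of the two losses is an assumption
total.backward()
print(float(l_adv))
```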
In one embodiment, the multiple multi-period discriminators and multi-scale discriminators in the Hifi-Gan vocoder are employed to improve the discrimination of the waveform output by the generator.
The following describes, by a specific embodiment, a voice synthesis method based on a neural network provided in the embodiment of the present application:
according to the method, the base vectors and the weights are used to represent the audio waveform. First, a basis matrix (Basis Matrix) for separating audio is learned through a TasNet model from the field of speech separation; the basis matrix comprises base vectors of a plurality of different waveforms, and each row of the basis matrix corresponds to a 1-D filter of the decoder in the TasNet model. Then, during speech synthesis, the Hifi-Gan generator, with the dimension and the number of layers of its up-sampling network reduced, outputs the matching weight parameters, and the waveform is reconstructed from the basis matrix and the weight parameters to realize speech synthesis. Compared with the original Hifi-Gan vocoder, the generation speed can be improved while the waveform generation quality is ensured.
Specifically, referring to FIG. 5, in the training process of the TasNet model, the original waveform of a single speaker speaking is mixed with random noise to obtain a mixed waveform, and the mixed waveform is input into the encoder of the TasNet model. The encoder performs feature extraction on the mixed waveform to obtain the mixed weight parameters, and the separator in the TasNet model separates the mixed weight parameters to obtain the weight W_s corresponding to the original waveform and the weight W_n of the random noise. The decoder then performs waveform reconstruction according to the weights and the basis matrix to obtain the reconstructed waveform of the single speaker speaking, and finally the model parameters of the TasNet model are updated based on the reconstructed waveform and the original waveform. The base vectors of the plurality of waveforms, i.e. the trained basis matrix, are obtained from the trained decoder.
Then, the text to be synthesized is acquired, a linguistic representation of the text to be synthesized is extracted to obtain the language feature information, and the intermediate representation of the text to be synthesized, namely the spectral features corresponding to the text to be synthesized, is predicted by the trained sound spectrum prediction model.
Referring to fig. 6, during speech synthesis, the spectral features corresponding to the text to be synthesized are input into the Hifi-Gan generator, in which the dimension and the number of layers of the convolutional neural network are reduced to match the output weight parameters; finally, the reconstructed waveform is obtained according to the weight parameters and the basis matrix obtained from the trained TasNet model, thereby realizing speech synthesis.
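Putting the inference path together, the data flow can be sketched as follows; every component here is a stub with assumed shapes, standing in for the trained sound spectrum prediction model, the reduced Hifi-Gan generator and the learned basis matrix:

```python
import torch

def synthesize(text, extract_features, spectrum_model, generator, basis):
    """Stub pipeline: text -> linguistic features -> spectrum -> weights -> audio."""
    ling = extract_features(text)             # language feature information
    mel = spectrum_model(ling)                # first spectral feature (B, n_mels, T)
    weights = generator(mel)                  # first weight parameters (B, n_basis, T')
    frames = weights.transpose(1, 2) @ basis  # reconstruct frame waveforms from B.W
    return frames.reshape(frames.size(0), -1)

# Stub components with assumed shapes, just to show the data flow.
extract = lambda text: torch.randn(1, 64)
spec_model = lambda ling: torch.randn(1, 80, 100)
gen = lambda mel: torch.randn(1, 256, 200)
basis = torch.randn(256, 20)

audio = synthesize("text to be synthesized", extract, spec_model, gen, basis)
print(audio.shape)   # torch.Size([1, 4000])
```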
The method comprises the steps of: obtaining a speaking audio sample set; training a preset initial voice separation model based on the speaking audio sample set to obtain a target voice separation model; determining base vectors of a plurality of waveforms according to the model parameters of the target voice separation model; obtaining a text to be synthesized; performing feature extraction processing on the text to be synthesized to obtain language feature information corresponding to the text to be synthesized; inputting the language feature information into a trained sound spectrum prediction model to generate, through the sound spectrum prediction model, a first spectral feature of a preset speaker speaking the text to be synthesized; and finally inputting the first spectral feature into a trained Hifi-Gan generator to obtain a first weight parameter corresponding to the base vectors through the Hifi-Gan generator, and synthesizing the target voice audio of the text to be synthesized according to the base vectors and the first weight parameter. According to the embodiment of the application, the audio waveform is represented by the base vectors of the waveforms and the weight parameters, so that the dimension and the number of layers of the up-sampling network of the Hifi-Gan generator are reduced, and the speech synthesis speed can be improved while the quality of the generated speech waveform is ensured.
Referring to fig. 7, the embodiment of the present application further provides a voice synthesis apparatus 100 based on a neural network, where the voice synthesis apparatus 100 based on the neural network includes:
a first acquisition module 110 for acquiring a set of speaking audio samples;
the model training module 120 is configured to train a preset initial speech separation model based on the speaking audio sample set to obtain a target speech separation model;
a base vector acquisition module 130, configured to determine base vectors of a plurality of waveforms according to model parameters of the target speech separation model;
a second obtaining module 140, configured to obtain a text to be synthesized;
the voice feature extraction module 150 is configured to perform feature extraction processing on the text to be synthesized to obtain language feature information corresponding to the text to be synthesized;
the spectral feature extraction module 160 is configured to input the language feature information into a trained sound spectrum prediction model, so as to generate a first spectral feature according with a preset speaker speaking the text to be synthesized through the sound spectrum prediction model;
the weight obtaining module 170 is configured to input the first spectral feature to a trained Hifi-Gan generator, so as to obtain a first weight parameter corresponding to the base vector through the Hifi-Gan generator;
The speech synthesis module 180 is configured to synthesize the target speech audio of the text to be synthesized according to the base vector and the first weight vector.
The apparatus obtains a speaking audio sample set, trains a preset initial voice separation model based on the speaking audio sample set to obtain a target voice separation model, and then determines base vectors of a plurality of waveforms according to the model parameters of the target voice separation model. The apparatus then obtains a text to be synthesized, performs feature extraction processing on the text to be synthesized to obtain the language feature information corresponding to the text to be synthesized, and inputs the language feature information into a trained sound spectrum prediction model to generate, through the sound spectrum prediction model, a first spectral feature of a preset speaker speaking the text to be synthesized. Finally, the first spectral feature is input into a trained Hifi-Gan generator to obtain a first weight parameter corresponding to the base vectors through the Hifi-Gan generator, and the target voice audio of the text to be synthesized is synthesized according to the base vectors and the first weight parameter. According to the embodiment of the application, the audio waveform is represented by the base vectors of the waveforms and the weight parameters, so that the dimension and the number of layers of the up-sampling network of the Hifi-Gan generator are reduced, and the speech synthesis speed can be improved while the quality of the generated speech waveform is ensured.
It should be noted that, because the content of information interaction and execution process between modules of the above apparatus is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and details are not repeated herein.
Referring to fig. 8, fig. 8 shows a hardware structure of an electronic device provided in an embodiment of the present application, where the electronic device includes:
the processor 210 may be implemented by a general-purpose CPU (Central Processing Unit ), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc., for executing a relevant computer program to implement the technical solutions provided in the embodiments of the present application;
the Memory 220 may be implemented in the form of a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 220 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program codes are stored in the memory 220 and invoked by the processor 210 to execute the neural-network-based speech synthesis method of the embodiments of the present application;
An input/output interface 230 for implementing information input and output;
the communication interface 240 is configured to implement communication interaction between the present device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.); and a bus 250 for transferring information between each of the components of the device (e.g., processor 210, memory 220, input/output interface 230, and communication interface 240);
wherein processor 210, memory 220, input/output interface 230, and communication interface 240 are communicatively coupled to each other within the device via bus 250.
The embodiment of the application also provides a storage medium, which is a computer readable storage medium and is used for computer readable storage, the storage medium stores one or more computer programs, and the one or more computer programs can be executed by one or more processors to realize the voice synthesis method based on the neural network.
The memory is a computer-readable storage medium that can be used to store software programs as well as computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the above units is merely a logical function division, and there may be other manners of division in actual implementation, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes multiple instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of each embodiment of the present application. The aforementioned storage medium includes various media capable of storing a program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings; this description does not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.
Claims (10)
1. A method of neural network-based speech synthesis, the method comprising:
acquiring a speaking audio sample set;
training a preset initial voice separation model based on the speaking audio sample set to obtain a target voice separation model;
determining basis vectors of a plurality of waveforms according to model parameters of the target voice separation model;
obtaining a text to be synthesized;
performing feature extraction processing on the text to be synthesized to obtain language feature information corresponding to the text to be synthesized;
inputting the language characteristic information into a trained sound spectrum prediction model to generate a first frequency spectrum characteristic which accords with a preset speaker and speaks the text to be synthesized through the sound spectrum prediction model;
inputting the first frequency spectrum characteristic to a trained Hifi-Gan generator to obtain a first weight parameter corresponding to the base vector through the Hifi-Gan generator;
and synthesizing the target voice audio of the text to be synthesized according to the base vector and the first weight parameter.
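As an illustrative sketch only (the claim does not fix any architecture, frame size, or hop length), the final synthesis step of claim 1 can be read as an overlap-add of basis vectors scaled by the predicted weights; the Python below uses hypothetical shapes.

```python
import numpy as np

def synthesize_waveform(weights, basis_vectors, hop_length):
    """Overlap-add synthesis: each output frame is a weighted sum of waveform basis vectors."""
    n_frames = weights.shape[0]
    basis_len = basis_vectors.shape[1]
    frames = weights @ basis_vectors                      # (n_frames, basis_len)
    audio = np.zeros((n_frames - 1) * hop_length + basis_len)
    for i, frame in enumerate(frames):
        start = i * hop_length
        audio[start:start + basis_len] += frame           # overlap-add consecutive frames
    return audio

# Hypothetical shapes: 200 frames, 512 basis vectors of 32 samples each, hop of 16 samples.
weights = np.random.rand(200, 512)        # stands in for the first weight parameter
basis = np.random.randn(512, 32)          # stands in for the basis vectors of the waveforms
audio = synthesize_waveform(weights, basis, hop_length=16)
```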
2. The method of claim 1, wherein the initial speech separation model comprises an encoder, a separator, and a decoder;
and the basis vectors are filter parameters in the decoder.
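For illustration, if the decoder is taken to be a 1-D transposed convolution (an assumption; claim 2 does not fix the layer type or sizes), its filter parameters can be read out directly as the basis vectors:

```python
import torch
import torch.nn as nn

# Assumed decoder: a 1-D transposed convolution whose filters act as waveform basis vectors.
n_basis, basis_len, hop = 512, 32, 16     # hypothetical sizes
decoder = nn.ConvTranspose1d(n_basis, 1, kernel_size=basis_len, stride=hop, bias=False)

# After the separation model is trained, the decoder filter parameters give the basis vectors.
basis_vectors = decoder.weight.detach().squeeze(1)   # shape: (n_basis, basis_len) = (512, 32)
```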
3. The method of claim 2, wherein the speaking audio sample set comprises a plurality of audio samples, each obtained by mixing random noise with the original speaking audio of a single speaker;
and the training of a preset initial voice separation model based on the speaking audio sample set to obtain a target voice separation model comprises:
inputting acoustic wave information of the plurality of audio samples to an encoder of the initial voice separation model, so as to obtain, through the encoder, mixed weight parameters of the audio samples corresponding to the base vectors of the plurality of waveforms;
inputting the mixed weight parameters to a separator of the initial voice separation model to obtain, through the separator, a second weight parameter corresponding to the random noise and a third weight parameter corresponding to the original speaking audio;
inputting the second weight parameter and the third weight parameter to a decoder of the initial speech separation model to obtain a first predicted waveform of the random noise and a second predicted waveform of the original speaking audio through the decoder;
determining a first voice separation loss value according to the acoustic wave information of the random noise, the first predicted waveform, and a preset voice separation loss function;
determining a second voice separation loss value according to the acoustic wave information of the original speaking audio, the second predicted waveform, and the voice separation loss function;
updating model parameters of the initial voice separation model based on the first voice separation loss value and the second voice separation loss value to obtain a target voice separation model.
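A minimal sketch of one training step in the spirit of claim 3, with a toy encoder/separator/decoder and an L1 waveform loss standing in for the unspecified "preset voice separation loss function"; all architectures, sizes, and data below are assumptions:

```python
import torch
import torch.nn as nn

class TinySeparationModel(nn.Module):
    """Toy encoder/separator/decoder; not the patent's exact architecture."""
    def __init__(self, n_basis=512, basis_len=32, hop=16):
        super().__init__()
        self.encoder = nn.Conv1d(1, n_basis, basis_len, stride=hop, bias=False)
        self.separator = nn.Sequential(nn.Conv1d(n_basis, n_basis * 2, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(n_basis, 1, basis_len, stride=hop, bias=False)

    def forward(self, mixture):                        # mixture: (batch, 1, samples)
        mix_w = torch.relu(self.encoder(mixture))      # mixed weight parameters
        masks = self.separator(mix_w).chunk(2, dim=1)  # weights for noise and for clean speech
        return [self.decoder(mix_w * m) for m in masks]

model = TinySeparationModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()                                  # stand-in separation loss

mixture = torch.randn(4, 1, 16000)                     # toy audio samples (noise + clean speech)
noise_ref = torch.randn(4, 1, 16000)                   # acoustic wave information of the random noise
speech_ref = torch.randn(4, 1, 16000)                  # acoustic wave information of the original audio

noise_pred, speech_pred = model(mixture)               # first and second predicted waveforms
loss = loss_fn(noise_pred, noise_ref) + loss_fn(speech_pred, speech_ref)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```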
4. The method of claim 1, wherein the training process of the Hifi-Gan generator comprises:
acquiring a second frequency spectrum characteristic of the training text set;
inputting the second spectral features of the training text set to the Hifi-Gan generator to obtain a fourth weight parameter corresponding to the basis vector through the Hifi-Gan generator;
determining a first generation loss value according to the fourth weight parameter, a preset weight threshold and a first generation loss function;
updating model parameters of the Hifi-Gan generator based on the first generation loss value to obtain a trained Hifi-Gan generator.
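For concreteness, the generator of claim 4 can be pictured as a network mapping spectral-feature frames to weights over the basis vectors; the stand-in below uses assumed sizes (80 mel bins, 512 basis vectors) that the claim does not specify:

```python
import torch
import torch.nn as nn

# Toy stand-in for the Hifi-Gan generator: spectral-feature frames -> basis-vector weights.
n_mels, n_basis = 80, 512                  # hypothetical sizes
generator = nn.Sequential(
    nn.Conv1d(n_mels, 256, kernel_size=3, padding=1),
    nn.LeakyReLU(0.1),
    nn.Conv1d(256, n_basis, kernel_size=3, padding=1),
    nn.ReLU(),                             # non-negative weights over the waveform basis vectors
)

mel = torch.randn(1, n_mels, 200)          # second spectral feature of a training text (toy data)
weights = generator(mel)                   # fourth weight parameter, shape (1, n_basis, 200)
```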
5. The method of claim 4, wherein the first generation loss function is determined by the following equation:
L_weight = ‖W − W̄‖₁;
wherein L_weight is the first generation loss value, W is the fourth weight parameter, W̄ is the weight threshold, and ‖·‖₁ denotes the L1 norm.
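Reading the reconstructed formula above as an L1 distance between the predicted weights and the preset threshold (an assumption, since only the symbols are recoverable from the original text), the loss can be computed as:

```python
import torch

def weight_loss(weights, weight_threshold):
    """First generation loss, read as the L1 norm of (W - W_bar)."""
    return (weights - weight_threshold).abs().sum()

w = torch.rand(1, 512, 200)                # fourth weight parameter from the generator
w_bar = torch.full_like(w, 0.5)            # hypothetical preset weight threshold
l_weight = weight_loss(w, w_bar)           # scalar first generation loss value
```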
6. The method of claim 4, wherein after the inputting the second spectral feature of the training text set to the Hifi-Gan generator to obtain a fourth weight parameter corresponding to the basis vector through the Hifi-Gan generator, the method further comprises:
synthesizing training voice audio corresponding to the training text set according to the base vector and the fourth weight parameter;
determining a probability value of the training voice audio as a real sample, and determining a second generation loss value according to the probability value and a preset second generation loss function;
and the updating of the model parameters of the Hifi-Gan generator based on the first generation loss value to obtain a trained Hifi-Gan generator comprises:
updating the model parameters of the Hifi-Gan generator based on the first generation loss value and the second generation loss value to obtain the trained Hifi-Gan generator.
7. The method of claim 6, wherein the second generation loss function is determined by the following equation:
L_adv = (D(G(S)) − 1)²;
wherein L_adv is the second generation loss value, S is the second frequency spectrum characteristic, G(S) is the training voice audio, and D(G(S)) is the probability value that the training voice audio is a real sample.
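To show how the two losses fit together, the toy code below computes L_adv = (D(G(S)) − 1)² with a stand-in discriminator and combines it with the first generation loss as described in claim 6; the discriminator architecture, sizes, and data are assumptions, not part of the claims.

```python
import torch
import torch.nn as nn

# Toy discriminator D; the claims do not specify its architecture.
discriminator = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=15, stride=4), nn.LeakyReLU(0.1),
    nn.Conv1d(16, 1, kernel_size=3), nn.Sigmoid(),
)

train_audio = torch.randn(1, 1, 16000, requires_grad=True)  # G(S): synthesized training voice audio
prob_real = discriminator(train_audio).mean()               # D(G(S)): probability of being a real sample
l_adv = (prob_real - 1.0) ** 2                              # second generation loss value

l_weight = torch.tensor(0.3)                                # first generation loss value (claim 5)
total_loss = l_adv + l_weight                               # combined loss of claim 6
total_loss.backward()                                       # backpropagate to update the generator
```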
8. A neural network-based speech synthesis apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a speaking audio sample set;
the model training module is used for training a preset initial voice separation model based on the speaking audio sample set to obtain a target voice separation model;
the base vector acquisition module is used for determining base vectors of a plurality of waveforms according to model parameters of the target voice separation model;
the second acquisition module is used for acquiring a text to be synthesized;
the voice feature extraction module is used for carrying out feature extraction processing on the text to be synthesized to obtain language feature information corresponding to the text to be synthesized;
the spectrum feature extraction module is used for inputting the language feature information into a trained sound spectrum prediction model so as to generate a first spectrum feature which accords with a preset speaker and speaks the text to be synthesized through the sound spectrum prediction model;
the weight acquisition module is used for inputting the first frequency spectrum characteristics to a trained Hifi-Gan generator so as to acquire first weight parameters corresponding to the base vector through the Hifi-Gan generator;
and the voice synthesis module is used for synthesizing the target voice audio of the text to be synthesized according to the base vector and the first weight parameter.
9. An electronic device, comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program that is executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310420753.7A CN116453501A (en) | 2023-04-12 | 2023-04-12 | Speech synthesis method based on neural network and related equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116453501A (en) | 2023-07-18
Family
ID=87135319
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310420753.7A Pending CN116453501A (en) | 2023-04-12 | 2023-04-12 | Speech synthesis method based on neural network and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116453501A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118571238A (en) * | 2024-08-02 | 2024-08-30 | 北京远鉴信息技术有限公司 | Audio processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |