CN117116249B - Training method of audio generation model, audio generation method, device and equipment - Google Patents

Training method of audio generation model, audio generation method, device and equipment

Info

Publication number
CN117116249B
CN117116249B (application CN202311351363.5A)
Authority
CN
China
Prior art keywords
audio
convolution
distribution
phoneme
feature
Prior art date
Legal status
Active
Application number
CN202311351363.5A
Other languages
Chinese (zh)
Other versions
CN117116249A (en)
Inventor
Zheng Yibin (郑艺斌)
Li Xinhui (李新辉)
Lu Li (卢鲤)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311351363.5A priority Critical patent/CN117116249B/en
Publication of CN117116249A publication Critical patent/CN117116249A/en
Application granted granted Critical
Publication of CN117116249B publication Critical patent/CN117116249B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The application discloses a training method of an audio generation model, an audio generation method, an audio generation device and equipment, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring a first text and a first audio, wherein the content of the first text is the same as that of the first audio; determining distribution characteristics of each first phoneme included in the first text through a first network model; determining distribution characteristics of each first audio frame included in the first audio through the first network model; determining a first feature loss from the distribution characteristics of each first phoneme to the distribution characteristics of each first audio frame and a second feature loss from the distribution characteristics of each first audio frame to the distribution characteristics of each first phoneme; and training the first network model based on the first feature loss and the second feature loss to obtain an audio generation model. This reduces pronunciation errors in the generated audio signal and improves pronunciation stability and audio quality.

Description

Training method of audio generation model, audio generation method, device and equipment
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a training method of an audio generation model, an audio generation method, an audio generation device and equipment.
Background
Audio generation is an important application in the field of artificial intelligence: a neural network model learns from a large amount of text data and audio data to obtain an audio generation model, and the audio generation model then automatically generates audio data from text data, realizing Text To Speech (TTS) generation. However, the audio generated by such an audio generation model often contains pronunciation errors, so the audio quality is poor.
Disclosure of Invention
The application provides a training method of an audio generation model, an audio generation method, an audio generation device and equipment, which can improve the stability of pronunciation in audio and improve the audio quality.
In a first aspect, there is provided a training method of an audio generation model, the method comprising: acquiring a first text and a first audio, wherein the first text and the first audio have the same content, the first text comprises a plurality of first phonemes, and the first audio comprises a plurality of first audio frames; determining distribution characteristics of each first phoneme included in the first text through a first network model, wherein the distribution characteristics of the first phonemes are used for describing the first phonemes and accord with reference statistical distribution; determining, by the first network model, a distribution characteristic of each first audio frame included in the first audio, the distribution characteristic of the first audio frame being used to describe the first audio frame and conforming to the reference statistical distribution; determining a first feature loss from the distribution feature of the respective first phones to the distribution feature of the respective first audio frames and a second feature loss from the distribution feature of the respective first audio frames to the distribution feature of the respective first phones; training the first network model based on the first characteristic loss and the second characteristic loss to obtain an audio generation model, wherein the audio generation model is used for generating a reference audio signal based on a reference text.
In a second aspect, there is provided an audio generation method, the method comprising: acquiring a reference text, wherein the reference text comprises a plurality of reference phonemes; determining the distribution characteristics of each reference phoneme included in the reference text through an audio generation model, wherein the distribution characteristics of the reference phonemes are used for describing the reference phonemes and accord with reference statistical distribution, and the audio generation model is trained according to the training method of the audio generation model; determining, by the audio generation model, distribution characteristics of each reference audio frame based on the distribution characteristics of each reference phoneme, the distribution characteristics of the reference audio frame being used to describe the reference audio frame and conforming to the reference statistical distribution; and generating a reference audio signal based on the distribution characteristics of each reference audio frame through the audio generation model, wherein the reference audio corresponding to the reference audio signal is identical to the content of the reference text, and the reference audio comprises each reference audio frame.
In a third aspect, there is provided a training apparatus for an audio generation model, the apparatus comprising: the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a first text and a first audio, the contents of the first text and the first audio are the same, the first text comprises a plurality of first phonemes, and the first audio comprises a plurality of first audio frames; a determining module, configured to determine, through a first network model, a distribution feature of each first phoneme included in the first text, where the distribution feature of the first phoneme is used to describe the first phoneme and accords with a reference statistical distribution; the determining module is further configured to determine, through the first network model, a distribution characteristic of each first audio frame included in the first audio, where the distribution characteristic of the first audio frame is used to describe the first audio frame and accords with the reference statistical distribution; the determining module is further configured to determine a first feature loss from the distribution feature of each first phoneme to the distribution feature of each first audio frame and a second feature loss from the distribution feature of each first audio frame to the distribution feature of each first phoneme; the training module is used for training the first network model based on the first characteristic loss and the second characteristic loss to obtain an audio generation model, and the audio generation model is used for generating a reference audio signal based on a reference text.
In a possible implementation manner, the determining module is configured to encode the first text through the first network model to obtain text features of the first phonemes; and mapping the text characteristics of each first phoneme through the first network model to obtain the distribution characteristics of each first phoneme.
In one possible implementation, the apparatus further includes: the alignment module is used for aligning each first phoneme and each first audio frame based on the distribution characteristics of each first phoneme and the distribution characteristics of each first audio frame to obtain a first number of first audio frames corresponding to each first phoneme; the determining module is further configured to determine, based on the text features of the first phonemes, a second number of first audio frames corresponding to the first phonemes; the determining module is further configured to determine a number loss between a first number and a second number of the first audio frames corresponding to the first phonemes; the training module is configured to train the first network model based on the number loss, the first feature loss, and the second feature loss, and obtain an audio generation model.
In a possible implementation manner, the determining module is configured to encode the first audio to obtain audio features of the respective first audio frames; and mapping the audio characteristics of each first audio frame through the first network model to obtain the distribution characteristics of each first audio frame.
In one possible implementation, the first audio is a first sample audio signal or a spectrogram of the first sample audio signal, the first network model comprising a first decoder; the apparatus further comprises: the decoding module is used for decoding the audio characteristics of each first audio frame through the first decoder to obtain a first reconstructed audio signal; the determining module is further configured to determine a first signal loss between the first sample audio signal and the first reconstructed audio signal; the training module is configured to train the first network model based on the first signal loss, the first feature loss, and the second feature loss, to obtain an audio generation model.
In one possible implementation, the first decoder includes a first input layer, at least two first convolution layers, and a first output layer, where any one of the first convolution layers includes at least two convolution kernels of the same hole coefficient and different convolution sizes, the convolution kernels of the different first convolution layers corresponding to different hole coefficients; the decoding module is used for converting the audio characteristics of each first audio frame into input characteristics of a first channel number through the first input layer; carrying out cavity convolution on the input features of the first channel number through each convolution kernel included in a first convolution layer to obtain convolution results corresponding to each convolution kernel, and adding the convolution results corresponding to each convolution kernel to obtain the output features of the first convolution layer; for any one of the first convolution layers except the first one, carrying out cavity convolution on the output features of the last first convolution layer through each convolution kernel included in the any one of the first convolution layers to obtain convolution results corresponding to each convolution kernel, and adding the convolution results corresponding to each convolution kernel to obtain the output features of the any one of the first convolution layers; the output characteristics of the last first convolution layer are converted into the first reconstructed audio signal by the first output layer.
In a possible implementation manner, the training module is configured to adjust parameters of the first network model based on the first feature loss and the second feature loss to obtain a second network model, where the second network model includes a feature processing network and a second decoder; acquiring second audio, wherein the second audio is a second sample audio signal or a spectrogram of the second sample audio signal, and the second audio comprises a plurality of second audio frames; encoding the second audio to obtain audio characteristics of each second audio frame; decoding the audio features of each second audio frame through the second decoder to obtain a second reconstructed audio signal; determining a second signal loss between the second sample audio signal and the second reconstructed audio signal; adjusting parameters of the second decoder based on the second signal loss to obtain a third decoder; the audio generation model is determined based on the feature processing network and the third decoder.
In a possible implementation manner, the obtaining module is further configured to obtain a second text, where the second text is the same as the content of the second audio, and the second text includes a plurality of second phonemes; the determining module is further configured to determine, through the feature processing network, a distribution feature of each second phoneme included in the second text, and map an audio feature of each second audio frame to obtain a distribution feature of each second audio frame; the determining module is further configured to determine a third feature loss from the distribution feature of each second phoneme to the distribution feature of each second audio frame and a fourth feature loss from the distribution feature of each second audio frame to the distribution feature of each second phoneme; the training module is configured to adjust parameters of the second decoder based on the third feature loss, the fourth feature loss, and the second signal loss, to obtain a third decoder.
In a possible implementation manner, the training module is configured to obtain third audio, where the third audio is a third sample audio signal or a spectrogram of the third sample audio signal, and the third audio includes a plurality of third audio frames; encoding the third audio to obtain audio characteristics of each third audio frame; decoding the audio features of each third audio frame through the third decoder to obtain a third reconstructed audio signal; decoding the audio features of each third audio frame through a fourth decoder to obtain a fourth reconstructed audio signal, wherein the number of parameters of the fourth decoder is smaller than that of the third decoder; determining a third signal loss between the third reconstructed audio signal and the fourth reconstructed audio signal; adjusting parameters of the fourth decoder based on the third signal loss to obtain a reference decoder; the audio generation model is determined based on the feature processing network and the reference decoder.
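As an illustration only, the following PyTorch-style sketch shows one way the third signal loss of this implementation could be computed and used to adjust the fourth decoder. The function and variable names, the use of an L1 distance, and the optimizer handling are assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn.functional as F

def distillation_step(third_decoder, fourth_decoder, audio_features, optimizer):
    """One distillation step: the smaller fourth decoder is trained to
    reproduce the waveform produced by the larger third decoder.
    Both decoders are assumed to map per-frame audio features to a waveform;
    names are illustrative, not taken from the patent."""
    with torch.no_grad():                        # the larger decoder is frozen
        third_signal = third_decoder(audio_features)
    fourth_signal = fourth_decoder(audio_features)

    # Third signal loss between the two reconstructed signals; an L1 distance
    # is one common choice (the patent does not fix the metric).
    third_signal_loss = F.l1_loss(fourth_signal, third_signal)

    optimizer.zero_grad()
    third_signal_loss.backward()
    optimizer.step()
    return third_signal_loss.item()
```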
In one possible implementation, the reference decoder includes a reference input layer, at least two reference convolution layers, and a reference output layer, where any one reference convolution layer includes at least two convolution kernels of the same hole coefficient and different convolution sizes, and the convolution kernels of different reference convolution layers correspond to different hole coefficients; the training module is configured to fuse, for the any one of the reference convolution layers, each convolution kernel included in the any one of the reference convolution layers into a fused convolution kernel, so as to obtain a reconstructed convolution layer, where the fused convolution kernel has the same hole coefficient as the convolution kernel included in the any one of the reference convolution layers, and the convolution size of the fused convolution kernel is not smaller than the convolution size of each convolution kernel included in the any one of the reference convolution layers; splicing the reference input layer, at least two reconstruction convolution layers and the reference output layer to obtain a target decoder; the audio generation model is determined based on the feature processing network and the target decoder.
In a possible implementation manner, the training module is configured to fill, for a first convolution kernel included in the any one reference convolution layer and having a convolution size smaller than that of the fusion convolution kernel, a parameter of the first convolution kernel to obtain a filled first convolution kernel, where the convolution size of the filled first convolution kernel is the same as that of the fusion convolution kernel; and determining the reconstruction convolution layer based on the parameters of the first convolution kernel after filling and the parameters of a second convolution kernel, wherein the convolution size of the second convolution kernel is the same as that of the fusion convolution kernel.
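For illustration, the sketch below shows the kernel-filling idea of this implementation, assuming 1-D convolutions with odd kernel sizes, shared channel counts, bias terms and centred "same" padding. Because convolution is linear in its kernel, zero-padding the smaller kernels to the fusion size and summing the parameters yields a single equivalent convolution; the helper name and PyTorch types are illustrative.

```python
import torch
import torch.nn as nn

def fuse_conv_kernels(convs):
    """Fuse parallel 1-D convolutions that share the same dilation (hole
    coefficient) but have different kernel sizes into one convolution whose
    output equals the sum of the original outputs. Assumes odd kernel sizes,
    identical in/out channels, groups=1 and bias terms."""
    max_k = max(c.kernel_size[0] for c in convs)
    dilation = convs[0].dilation[0]
    fused = nn.Conv1d(convs[0].in_channels, convs[0].out_channels,
                      kernel_size=max_k, dilation=dilation,
                      padding=dilation * (max_k - 1) // 2)
    weight = torch.zeros_like(fused.weight)
    bias = torch.zeros_like(fused.bias)
    for c in convs:
        k = c.kernel_size[0]
        pad = (max_k - k) // 2                   # centre the smaller kernel
        weight[:, :, pad:pad + k] += c.weight.data
        bias += c.bias.data
    fused.weight.data.copy_(weight)
    fused.bias.data.copy_(bias)
    return fused
```

Under these assumptions, applying the fused convolution to an input gives the same result as summing the outputs of the original kernels, which is the reconstruction of the reference convolution layer described above.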
In a fourth aspect, there is provided an audio generating apparatus, the apparatus comprising: the acquisition module is used for acquiring a reference text, wherein the reference text comprises a plurality of reference phonemes; the determining module is used for determining the distribution characteristics of each reference phoneme included in the reference text through an audio generation model, wherein the distribution characteristics of the reference phonemes are used for describing the reference phonemes and accord with reference statistical distribution, and the audio generation model is trained according to the method shown in the first aspect; the determining module is further configured to determine, by using the audio generating model, a distribution characteristic of each reference audio frame based on the distribution characteristic of each reference phoneme, where the distribution characteristic of each reference audio frame is used to describe the reference audio frame and conforms to the reference statistical distribution; and the generation module is used for generating a reference audio signal based on the distribution characteristics of each reference audio frame through the audio generation model, wherein the reference audio corresponding to the reference audio signal is the same as the content of the reference text, and the reference audio comprises each reference audio frame.
In a possible implementation manner, the determining module is configured to encode the reference text through the audio generating model to obtain text features of each reference phoneme; and mapping the text characteristics of each reference phoneme through the audio generation model to obtain the distribution characteristics of each reference phoneme.
In a possible implementation manner, the determining module is further configured to determine, based on text features of the respective reference phonemes, a number of reference audio frames corresponding to the respective reference phonemes; the determining module is configured to expand, by using the audio generating model, the distribution characteristics of each reference phoneme based on the number of reference audio frames corresponding to each reference phoneme, so as to obtain the distribution characteristics of each reference audio frame.
In a possible implementation manner, the generating module is configured to map, through the audio generating model, distribution features of the respective reference audio frames to obtain audio features of the respective reference audio frames; and decoding the audio features of each reference audio frame through the audio generation model to obtain a reference audio signal.
In a fifth aspect, there is provided an electronic device, including a processor and a memory, where at least one computer program is stored in the memory, where the at least one computer program is loaded and executed by the processor, so that the electronic device implements the training method of the audio generation model described in the first aspect or implements the audio generation method described in the second aspect.
In a sixth aspect, there is also provided a computer readable storage medium, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor, to cause an electronic device to implement the training method of the audio generation model described in the first aspect or implement the audio generation method described in the second aspect.
In a seventh aspect, there is further provided a computer program, where the computer program is loaded and executed by a processor, so that an electronic device implements the training method of the audio generation model described in the first aspect or implements the audio generation method described in the second aspect.
In an eighth aspect, there is also provided a computer program product having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor to cause an electronic device to implement the training method of the audio generation model described in the first aspect or to implement the audio generation method described in the second aspect.
The technical scheme provided by the application at least brings the following beneficial effects.
In the technical scheme provided by the application, the distribution characteristics of each first phoneme in the first text and the distribution characteristics of each first audio frame in the first audio are determined through the first network model, and then the first characteristic loss and the second characteristic loss between the distribution characteristics of each first phoneme and the distribution characteristics of each first audio frame are determined. The distance between the distribution feature of each first phoneme and the distribution feature of each first audio frame in the case of fitting the distribution feature of each first audio frame using the distribution feature of each first phoneme is measured by the first feature loss. The distance between the distribution feature of each first phoneme and the distribution feature of each first audio frame in the case of fitting the distribution feature of each first phoneme using the distribution feature of each first audio frame is measured by the second feature loss. That is, the distance between the distribution feature of each first phoneme and the distribution feature of each first audio frame can be measured more accurately and more stably through the first feature loss and the second feature loss, so that after the first network model is trained based on the first feature loss and the second feature loss, the model can reduce the difference between the distribution feature of the phoneme and the distribution feature of the audio frame as much as possible, and the distribution feature of the audio frame can be determined according to the distribution feature of the phoneme, thereby realizing the generation of the audio signal. Because the distribution characteristics of the phonemes and the distribution characteristics of the audio frames are closer, the phenomenon that the audio signals have pronunciation errors is less, and the pronunciation stability and the audio quality are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an implementation environment of a training method of an audio generation model or an audio generation method according to an embodiment of the present application.
Fig. 2 is a flowchart of a training method of an audio generation model according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a convolutional layer according to an embodiment of the present application.
Fig. 4 is a schematic diagram of reconstruction of a convolutional layer according to an embodiment of the present application.
Fig. 5 is a flowchart of an audio generation method according to an embodiment of the present application.
Fig. 6 is a training schematic diagram of an audio generation model according to an embodiment of the present application.
Fig. 7 is a training frame diagram of an audio generation model according to an embodiment of the present application.
Fig. 8 is an application framework diagram of an audio generation model provided in an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a training device for an audio generation model according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of an audio generating apparatus according to an embodiment of the present application.
Fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
In the technical field of artificial intelligence, an audio generation model can be obtained by enabling a neural network model to learn a large amount of text data and audio data, and audio data is then automatically generated from text data through the audio generation model, realizing Text To Speech (TTS) generation. However, the audio generated by the audio generation model often contains pronunciation errors, so the audio quality is poor. Based on this, the embodiments of the present application provide a training method of an audio generation model and an audio generation method, which can reduce pronunciation errors in the audio signal and improve pronunciation stability and audio quality.
Fig. 1 is a schematic diagram of an implementation environment of an audio generation model training method or an audio generation method according to an embodiment of the present application, where the implementation environment includes a terminal device 101 and a server 102 as shown in fig. 1. The training method or the audio generating method of the audio generating model in the embodiment of the present application may be performed by the terminal device 101, may be performed by the server 102, or may be performed by both the terminal device 101 and the server 102.
The terminal device 101 may be a smart phone, a game console, a desktop computer, a tablet computer, a laptop computer, a smart television, a smart car device, a smart voice interaction device, a smart home appliance, etc. The server 102 may be one server, or a server cluster formed by a plurality of servers, or any one of a cloud computing platform and a virtualization center, which is not limited in this embodiment of the present application. The server 102 may be in communication connection with the terminal device 101 via a communication network, which may be a wired network or a wireless network. The server 102 may have functions of data processing, data storage, data transceiving, and the like, which are not limited in the embodiments of the present application. The number of terminal devices 101 and servers 102 is not limited, and may be one or more.
In the embodiment of the present application, the terminal device 101 acquires the reference text, and sends the reference text to the server 102 through the communication network. The server 102 may generate a reference audio signal based on the reference text by training the resulting audio generation model, and transmit the reference audio signal to the terminal device 101 through the communication network.
Alternatively, the server 102 may train the first network model to obtain the second network model, and the training process may be described in steps 201 to 211, which are not described herein. Next, the server 102 fixes the feature processing network in the second network model, trains the second decoder in the second network model to obtain the third decoder, and the training process can be described in steps 212 to 213, which is not described herein. Then, the server 102 distills the fourth decoder with the third decoder to obtain the reference decoder, and the process may be described in steps 214 to 218, which are not repeated herein. Next, the server 102 reconstructs the reference convolutional layer in the reference decoder to obtain the target decoder, and the process can be described in steps 2181 to 2182, which is not described herein. The server 102 then splices the feature processing network with the target decoder to obtain an audio generation model.
From here on, the server 102 trains to get an audio generation model. After that, each time the server 102 acquires the reference text transmitted by the terminal device 101, a reference audio signal can be generated based on the reference text by the audio generation model, and the reference audio signal is transmitted to the terminal device 101 through the communication network. This part of the content can be seen from the description of steps 501 to 504, which is not repeated here.
Alternative embodiments of the present application may be implemented based on artificial intelligence techniques. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, enabling the machines to perceive, reason and make decisions.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include, for example, sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. A pre-training model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Fig. 2 is a flowchart of a training method of an audio generation model according to an embodiment of the present application, where the method may be applied in the above implementation environment, so as to reduce a phenomenon that an audio signal has a pronunciation error, and improve pronunciation stability and audio quality. The terminal device 101 or the server 102 performs the training method of the audio generation model in the embodiment of the present application, and the terminal device 101 or the server 102 may be referred to as an electronic device, and the method is performed by the electronic device. As shown in fig. 2, the method includes the following steps.
In step 201, a first text and a first audio are acquired, wherein the content of the first text is the same as that of the first audio, the first text includes a plurality of first phonemes, and the first audio includes a plurality of first audio frames.
The embodiment of the present application does not limit the way in which the electronic device obtains the first text. The electronic device may obtain the first text entered by the user, or the electronic device may read the first text from another device. The first text is character-type data; for example, the first text may be "xiatian" (the pinyin of the Chinese word for "summer").
The first text includes a plurality of first phonemes. The phonemes are the smallest phonetic units determined from the pronunciation actions, one pronunciation action constituting one phoneme, e.g., one phoneme may be f, a, etc. The first phoneme is also character-type data corresponding to the first text.
Likewise, the embodiment of the present application does not limit the manner in which the electronic device obtains the first audio. The electronic device may obtain the first audio input by the user, or the electronic device may read the first audio from another device. The first audio may be a first sample audio signal, where the first sample audio signal is an analog signal. The first audio may also be a spectrogram obtained by converting the first sample audio signal; the type of spectrogram is not limited in this embodiment of the present application and may be a linear spectrogram, a mel spectrogram, or the like.
The first audio includes a plurality of first audio frames. The duration of a first audio frame is obtained by dividing the number of sampling points included in the first audio frame by the sampling frequency. Corresponding to the first audio, a first audio frame may be an analog signal, a spectrogram, or the like.
In step 202, the distribution characteristics of each first phoneme included in the first text are determined through the first network model, where the distribution characteristics of the first phonemes are used to describe the first phonemes and conform to the reference statistical distribution.
The embodiment of the present application does not limit the structure, size, etc. of the first network model. Illustratively, the first network model includes at least one of a first text encoder, a first mapping layer, a first standard flow network, a first time length predictor and a first decoder; the function and structure of each module are described in the corresponding places below and are not repeated here.
The first text may be converted by a conversion (transducer) layer to obtain a first text representation. The first text representation is then input into the first network model, and the distribution characteristics of each first phoneme are determined based on the first text representation by the first network model. The distribution characteristics of a first phoneme are used to describe the semantics, content, pronunciation mode, etc. of the first phoneme, and conform to a reference statistical distribution. The embodiment of the present application does not limit the reference statistical distribution; the reference statistical distribution is at least one of a normal distribution, a binomial distribution, a Poisson distribution, and the like.
In one possible implementation, step 202 includes steps 2021 to 2022 (not shown in the figures).
In step 2021, the first text is encoded by the first network model, resulting in text features of each first phoneme.
In this embodiment, the first network model includes a first text encoder, which is also referred to as an a priori encoder, including at least one network layer of a convolution layer, an embedding layer, a conversion layer, a mapping layer, an activation layer, and the like. The first text representation obtained by converting the first text comprises representations of a plurality of first phonemes, the representations of the first phonemes are input into a text encoder, and the representations of the first phonemes are encoded through the text encoder to obtain text features of the first phonemes.
Illustratively, the text encoder comprises two encoding layers in series, one encoding layer comprising a convolution layer, a conversion layer and a convolution layer in series in order, the convolution layer being used for performing a convolution process, the conversion layer being used for feature extraction based on an attention mechanism.
In brief, a representation of each first phoneme is input to a text encoder. For the first coding layer, the first convolution layer is used for convolving the characterization of each first phoneme to obtain the convolution result of each first phoneme; then, for each convolution result of the first phonemes, performing attention processing on the convolution result of the first phonemes by using the convolution result of at least one first phoneme adjacent to the first phonemes through a conversion layer to obtain an attention processing result of the first phonemes; and then, convoluting the attention processing results of the first phonemes again through the second convolution layer to obtain convolution results of the first phonemes, namely obtaining the output information of the first coding layer. And inputting the output information of the first coding layer into a second coding layer, and similarly processing the output information of the first coding layer through the second coding layer according to the processing principle of the first coding layer to obtain the output information of the second coding layer. The output information of the second coding layer is the text characteristics of each first phoneme.
In the embodiment of the present application, the text feature of the first phoneme is determined by the representation of at least one first phoneme adjacent to the first phoneme, so that the text feature of the first phoneme can reflect the semantics, the content, the pronunciation mode and the like of the first phoneme in the global environment of the first text, and the accuracy is high and the representation capability is strong.
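A minimal sketch of one such coding layer (convolution, attention-based conversion layer, convolution) is given below, assuming PyTorch; the channel width, kernel size and number of attention heads are illustrative assumptions, not values taken from the patent. Two such layers in series would form the text encoder described above.

```python
import torch
import torch.nn as nn

class PhonemeEncodingLayer(nn.Module):
    """One coding layer of the prior text encoder: a convolution, an
    attention-based conversion layer, then another convolution.
    Channel size, kernel size and head count are illustrative guesses."""
    def __init__(self, channels=192, kernel_size=3, num_heads=2):
        super().__init__()
        self.conv_in = nn.Conv1d(channels, channels, kernel_size,
                                 padding=kernel_size // 2)
        self.attn = nn.MultiheadAttention(channels, num_heads,
                                          batch_first=True)
        self.conv_out = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)

    def forward(self, phoneme_repr):             # (batch, channels, phonemes)
        h = self.conv_in(phoneme_repr)
        h_t = h.transpose(1, 2)                  # attention expects (B, T, C)
        h_t, _ = self.attn(h_t, h_t, h_t)        # each phoneme attends to its neighbours
        h = h_t.transpose(1, 2)
        return self.conv_out(h)                  # text features of each phoneme
```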
In step 2022, the text features of each first phoneme are mapped by the first network model to obtain the distribution features of each first phoneme.
In this embodiment, the text encoder of the first network model is spliced with the first mapping layer. Optionally, the first mapping layer is a linear mapping layer such as a projection layer, and the text features of each first phoneme are mapped linearly by the linear mapping layer to obtain the distribution features of each first phoneme. Or the mapping layer is a nonlinear mapping layer, and nonlinear mapping is carried out on the text features of each first phoneme through the nonlinear mapping layer, so as to obtain the distribution features of each first phoneme.
The distribution characteristics of the first phoneme are used to describe the reference statistical distribution to which the text features of the first phoneme conform. Taking the case where the reference statistical distribution is a normal distribution as an example, the distribution characteristics of the first phoneme may include a mean and a variance: the mean is the mean of the text features of the first phoneme, the variance is the variance of the text features of the first phoneme, and together the mean and the variance describe the normal distribution that the text features of the first phoneme conform to.
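For illustration, the projection below maps per-phoneme text features to a mean and a log-variance, i.e. distribution features under a normal reference distribution; the layer type and dimensions are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class PhonemeDistributionProjection(nn.Module):
    """First mapping (projection) layer sketch: maps the text feature of
    each phoneme to the parameters of a normal distribution, i.e. a mean
    and a log-variance per phoneme. The feature dimension is an assumption."""
    def __init__(self, feature_dim=192):
        super().__init__()
        self.proj = nn.Conv1d(feature_dim, 2 * feature_dim, kernel_size=1)

    def forward(self, text_features):            # (batch, feature_dim, phonemes)
        stats = self.proj(text_features)
        mean, log_var = stats.chunk(2, dim=1)    # distribution features per phoneme
        return mean, log_var
```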
In step 203, the distribution characteristics of each first audio frame included in the first audio are determined through the first network model, where the distribution characteristics of the first audio frame are used to describe the first audio frame and conform to the reference statistical distribution.
In this embodiment of the present application, when the first audio is a first sample audio signal, spectrum conversion may be performed on the first sample audio signal first, so as to obtain a spectrum of the first sample audio signal. Then, a spectrogram of the first sample audio signal is input into a first network model, and distribution characteristics of each first audio frame are determined based on the spectrogram of the first sample audio signal through the first network model. When the first audio is a spectrogram of the first sample audio signal, the distribution characteristics of the respective first audio frames may be determined directly by the first network model based on the spectrogram of the first sample audio signal. The distribution characteristics of the first audio frame are used for describing the semantics, the content, the pronunciation mode and the like of the first audio frame, and conform to the reference statistical distribution.
In one possible implementation, step 203 includes steps 2031 to 2032 (not shown in the figures).
In step 2031, the first audio is encoded to obtain audio characteristics for each first audio frame.
In the embodiment of the application, the first network model is spliced with the audio encoder. The audio encoder, also referred to as a posterior encoder, comprises at least one network layer of convolutional layers, embedded layers, transform layers, map layers, active layers, etc. The first sample audio signal may be converted into a spectrogram of the first sample audio signal if the first audio is the first sample audio signal, although the first audio may also be a spectrogram of the first sample audio signal. The spectrogram of the first sample audio signal comprises spectrograms of a plurality of first audio frames, the spectrograms of the first audio frames are input into an audio encoder, and the spectrograms of the first audio frames are encoded through the audio encoder to obtain audio characteristics of the first audio frames.
Illustratively, the audio encoder is a variational autoencoder (Variational Auto-Encoder, VAE) encoder. Alternatively, the audio encoder includes a plurality of convolution layers for performing convolution processing. That is, for each first audio frame, the spectrogram of the first audio frame is convolved, together with the spectrogram of at least one adjacent first audio frame, by the plurality of convolution layers included in the audio encoder, to obtain the audio characteristics of each first audio frame.
The audio characteristics of the first audio frame are determined through a spectrogram of at least one first audio frame adjacent to the first audio frame, so that the audio characteristics of the first audio frame can reflect the semantics, the content, the pronunciation mode and the like of the first audio frame in the global environment of the first audio, the accuracy is high, and the characterization capability is high.
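A hedged sketch of such a posterior encoder is shown below: a stack of 1-D convolutions over the spectrogram, so that each frame's audio feature also depends on its neighbouring frames. The number of layers, kernel size, spectrogram size and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class PosteriorAudioEncoder(nn.Module):
    """Posterior (audio) encoder sketch: stacked 1-D convolutions over the
    spectrogram, so the audio feature of each frame reflects adjacent frames
    as well. All sizes are illustrative assumptions."""
    def __init__(self, spec_bins=80, channels=192, num_layers=4, kernel_size=5):
        super().__init__()
        layers = [nn.Conv1d(spec_bins, channels, kernel_size,
                            padding=kernel_size // 2), nn.ReLU()]
        for _ in range(num_layers - 1):
            layers += [nn.Conv1d(channels, channels, kernel_size,
                                 padding=kernel_size // 2), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, spectrogram):              # (batch, spec_bins, frames)
        return self.net(spectrogram)             # audio features per frame
```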
In step 2032, the audio features of each first audio frame are mapped by the first network model to obtain the distribution features of each first audio frame.
In this embodiment of the present application, the audio encoder is then connected to a first standard flow (Flow) network in the first network model, and the audio features of each first audio frame are mapped to the distribution features of each first audio frame through the first standard flow network. It should be noted that the first standard flow network is a reversible network: it may take the audio features of each first audio frame as input and the distribution features of each first audio frame as output, or take the distribution features of each first audio frame as input and the audio features of each first audio frame as output, thereby mapping the input to the output in either direction.
The distribution characteristics of the first audio frame are used to describe the reference statistical distribution to which the audio features of the first audio frame conform. Taking the case where the reference statistical distribution is a normal distribution as an example, the distribution characteristics of the first audio frame may include a mean and a variance: the mean is the mean of the audio features of the first audio frame, the variance is the variance of the audio features of the first audio frame, and the normal distribution that the audio features of the first audio frame conform to is reflected through the mean and the variance. Alternatively, the distribution characteristics of the first audio frame may be expressed as a distribution over a latent variable z, where z characterizes the audio features of the first audio frame.
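The following sketch illustrates the kind of reversible mapping described above with a generic affine coupling block that can be run forward (audio features to distribution-space variables) and inverted exactly. It is offered only as an illustration of a reversible flow, not as the specific first standard flow network of the patent; all sizes are assumptions.

```python
import torch
import torch.nn as nn

class AffineCouplingFlow(nn.Module):
    """Minimal invertible flow block: half of the channels are transformed
    by a scale/shift predicted from the other half, so the mapping can be
    evaluated in both directions. Channel counts are illustrative."""
    def __init__(self, channels=192, hidden=256):
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(
            nn.Conv1d(half, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1))

    def forward(self, x):                        # audio features -> flow output
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(xa).chunk(2, dim=1)
        yb = xb * torch.exp(log_s) + t
        return torch.cat([xa, yb], dim=1)

    def inverse(self, y):                        # flow output -> audio features
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self.net(ya).chunk(2, dim=1)
        xb = (yb - t) * torch.exp(-log_s)
        return torch.cat([ya, xb], dim=1)
```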
In step 204, a first feature loss from the distribution feature of each first phoneme to the distribution feature of each first audio frame and a second feature loss from the distribution feature of each first audio frame to the distribution feature of each first phoneme are determined.
In the embodiment of the application, the forward KL divergence (Kullback-Leibler Divergence) and the reverse KL divergence between the distribution characteristics of each first phoneme and the distribution characteristics of each first audio frame can be calculated. The KL divergence, also known as relative entropy, has asymmetry, that is, the forward KL divergence is not equal to the reverse KL divergence. Wherein the first feature loss is a forward KL divergence, and is used for representing a distance between the distribution feature of each first phoneme and the distribution feature of each first audio frame in the case of fitting the distribution feature of each first audio frame with the distribution feature of each first phoneme, and is a loss from the distribution feature of each first phoneme to the distribution feature of each first audio frame. The second feature loss is an inverse KL divergence, which is used to characterize a distance between the distribution feature of each first audio frame and the distribution feature of each first phoneme in the case of fitting the distribution feature of each first phoneme using the distribution feature of each first audio frame, from the distribution feature of each first audio frame to the distribution feature of each first phoneme.
It has been mentioned above that the first text encoder is also called the prior encoder. On this basis, the text feature of each first phoneme determined by the prior encoder based on the first text representation c may be referred to as a prior feature, and the reference statistical distribution satisfied by the text features of the first phonemes may be referred to as the prior distribution, which may be expressed as p(z|c). Similarly, the audio encoder is referred to as the posterior encoder; the audio features of the respective first audio frames determined by the posterior encoder based on the spectrogram x of the first sample audio signal may be referred to as posterior features, and the reference statistical distribution satisfied by the audio features of the first audio frames may be referred to as the posterior distribution, which may be expressed as q(z|x). On this basis, the first feature loss may be expressed as the forward KL divergence KL(p(z|c) || q(z|x)), and the second feature loss may be expressed as the reverse KL divergence KL(q(z|x) || p(z|c)).
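Assuming both distributions are diagonal normal distributions whose parameters have already been aligned to the same length (e.g. one prior entry per audio frame), the two feature losses can be computed as in the sketch below; the use of torch.distributions and the mean reduction are illustrative choices.

```python
import torch
from torch.distributions import Normal, kl_divergence

def bidirectional_feature_losses(prior_mean, prior_std, post_mean, post_std):
    """Forward KL from the phoneme (prior) distribution to the audio-frame
    (posterior) distribution, and the reverse KL, for diagonal normals.
    The two parameter sets are assumed to have the same shape."""
    prior = Normal(prior_mean, prior_std)         # distribution features of phonemes
    posterior = Normal(post_mean, post_std)       # distribution features of audio frames
    first_feature_loss = kl_divergence(prior, posterior).mean()   # forward KL
    second_feature_loss = kl_divergence(posterior, prior).mean()  # reverse KL
    return first_feature_loss, second_feature_loss
```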
In step 205, training the first network model based on the first feature loss and the second feature loss to obtain an audio generation model, where the audio generation model is used to generate a reference audio signal based on the reference text.
In the embodiment of the application, the loss of the first network model may be determined based on the first feature loss and the second feature loss. The embodiment of the application does not limit the determination manner, and for example, the first characteristic loss and the second characteristic loss may be calculated by weighted summation, weighted averaging, and the like, so as to obtain the loss of the first network model. Then, training the first network model based on the loss of the first network model to obtain an audio generation model.
It will be appreciated that training the first network model corresponds to adjusting parameters of the first network model. In short, parameters of the first network model can be adjusted based on the loss of the first network model, so that the first network model is trained once, and the adjusted first network model is obtained. And if the adjusted first network model meets the first ending condition, taking the adjusted first network model as a second network model, and if the adjusted first network model does not meet the first ending condition, taking the adjusted first network model as the first network model, and training the first network model again according to the mode from step 201 to step 205 until the adjusted first network model meets the first ending condition, and taking the adjusted first network model as the second network model. And then, taking the second network model as an audio generation model, or training the second network model again to obtain the audio generation model.
The embodiment of the present application does not limit the first end condition. Illustratively, the adjusted first network model satisfying the first end condition means that the number of training iterations of the adjusted first network model reaches a set number, or that the loss of the adjusted first network model falls within a set range, and so on.
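A minimal training-loop sketch for steps 201 to 205 is given below; the hypothetical model.feature_losses helper, the equal-weight summation of the two feature losses, and the concrete end-condition values are assumptions used only for illustration.

```python
def train_first_network_model(model, data_loader, optimizer,
                              max_steps=100_000, loss_threshold=0.05):
    """Training loop sketch for steps 201-205. `model.feature_losses` is a
    hypothetical helper returning the first and second feature losses for a
    (first text, first audio) pair; weights and thresholds are illustrative."""
    for step, (first_text, first_audio) in enumerate(data_loader, start=1):
        first_loss, second_loss = model.feature_losses(first_text, first_audio)
        loss = first_loss + second_loss           # e.g. equal-weight summation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # First end condition: a set number of steps, or a loss in a set range.
        if step >= max_steps or loss.item() < loss_threshold:
            break
    return model                                   # the second network model
```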
By calculating the first feature loss and the second feature loss and training the first network model based on them, the model can not only determine the distribution features of the first phonemes based on the distribution features of the first audio frames, but also determine the distribution features of the first audio frames based on the distribution features of the first phonemes. This pushes the distribution features of the first phonemes and the distribution features of the first audio frames as close together as possible, so that the two can be matched accurately. The accuracy and stability of the model are therefore improved, and the audio generated by the audio generation model is of high quality and high stability.
In a possible implementation A1, the method according to the embodiment of the present application further includes steps 206 to 208 (not shown in the figure), and steps 206 to 208 may be performed after step 203.
And step 206, aligning each first phoneme with each first audio frame based on the distribution characteristics of each first phoneme and the distribution characteristics of each first audio frame to obtain a first number of first audio frames corresponding to each first phoneme.
In this embodiment of the present application, a Monotonic Alignment Search (MAS) algorithm may be used to align the distribution features of the first phonemes with the distribution features of the first audio frames in order, based on the property that the alignment is monotonic and non-skipping, so as to align each first phoneme with the first audio frames and obtain the first number of first audio frames corresponding to each first phoneme. The MAS algorithm itself is not described in detail in this embodiment of the present application.
Optionally, one first phoneme corresponds to n first audio frames, where n is a positive integer greater than or equal to 1; that is, the first number is a positive integer greater than or equal to 1. For example, one first phoneme may correspond to 2 first audio frames and another to 1.
In step 207, a second number of first audio frames corresponding to each first phoneme is determined based on the text features of each first phoneme.
In this embodiment of the present application, the first network model includes a first time length predictor, and the embodiment of the present application does not limit the structure, the size, and the like of the first time length predictor. The text characteristics of each first phoneme may be input into a first time length predictor by which a second number of corresponding first audio frames for each first phoneme is predicted. Optionally, one first phoneme corresponds to m first audio frames, m is a positive number, that is, the second number is a positive number greater than 0, for example, the second numbers of three first phonemes corresponding to the first audio frames are 1.8, 1.9, and 0.9, respectively.
In step 208, a number penalty between the first number and the second number of corresponding first audio frames for each first phoneme is determined.
In this embodiment of the present application, the number loss may be determined according to a calculation formula of any one loss, such as a mean square error loss, a cross entropy loss, a relative entropy loss, and the like, based on a first number of first audio frames corresponding to each first phoneme and a second number of first audio frames corresponding to each first phoneme.
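For example, using a mean square error (one of the losses named above) between the predicted and aligned frame counts, the number loss might be computed as follows; the log-domain comparison and the epsilon are illustrative choices, not requirements of the patent.

```python
import torch
import torch.nn.functional as F

def duration_number_loss(first_number, second_number):
    """Number loss sketch: compare the aligned number of audio frames per
    phoneme (step 206) with the number predicted by the duration predictor
    (step 207). MSE in the log domain is one possible choice."""
    first_number = first_number.float()           # frames per phoneme from alignment
    second_number = second_number.float()         # predicted frames per phoneme
    return F.mse_loss(torch.log(second_number + 1e-6),
                      torch.log(first_number + 1e-6))
```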
In implementation A1, step 205 includes: training the first network model based on the number loss, the first feature loss and the second feature loss to obtain an audio generation model.
In the embodiment of the present application, the number loss, the first feature loss, and the second feature loss may be calculated by performing weighted summation, weighted averaging, and the like, to obtain the loss of the first network model. Alternatively, the loss of the first network model may be calculated based on the number loss, the first feature loss, the second feature loss, and other losses. At least one training is performed on the first network model through the loss of the first network model to obtain a second network model, and an audio generation model is determined based on the second network model, wherein the determination mode of the audio generation model is described above and is not repeated here.
In a possible implementation A2, the first audio is a first sample audio signal or a spectrogram of the first sample audio signal, and the first network model comprises a first decoder. The method of the embodiment of the present application further includes steps 209 to 210 (not shown in the figure), and steps 209 to 210 may be performed after step 2031.
In step 209, the audio features of each first audio frame are decoded by the first decoder to obtain a first reconstructed audio signal.
In this embodiment of the present application, the audio features of each first audio frame may be input to a first decoder, and the audio features of each first audio frame may be decoded by the first decoder to obtain a first reconstructed audio signal. The embodiment of the present application does not limit the structure of the first decoder, and it can be understood that the first decoders with different structures correspond to different decoding modes. Optionally, the first decoder comprises a plurality of convolution layers, and the first reconstructed audio signal is obtained by convolving the audio features of each first audio frame a plurality of times.
Optionally, the first decoder includes a first input layer, at least two first convolution layers, and a first output layer, where any one of the first convolution layers includes at least two convolution kernels of the same hole coefficient and different convolution sizes, and the convolution kernels of different first convolution layers correspond to different hole coefficients. Step 209 comprises: converting the audio features of each first audio frame into input features of a first channel number through a first input layer; carrying out cavity convolution on the input features of the first channel number through each convolution kernel included in the first convolution layer to obtain convolution results corresponding to each convolution kernel, and adding the convolution results corresponding to each convolution kernel to obtain the output features of the first convolution layer; for any one of the first convolution layers except the first one, carrying out hole convolution on the output characteristics of the previous first convolution layer through each convolution kernel included in any one of the first convolution layers to obtain convolution results corresponding to each convolution kernel, and adding the convolution results corresponding to each convolution kernel to obtain the output characteristics of any one of the first convolution layers; the output characteristics of the last first convolution layer are converted into a first reconstructed audio signal by the first output layer.
In this embodiment of the present application, the first decoder includes a first input layer, at least two first convolution layers, and a first output layer that are sequentially connected in series, and the first input layer, the first convolution layers, and the first output layer are sequentially described below.
The embodiment of the application does not limit the structure, the size, and the like of the first input layer. Optionally, the audio features of the first audio include the audio features of the respective first audio frames; after the audio features of the first audio are input to the first input layer, they are converted into input features with the first channel number by the first input layer. The embodiment of the application does not limit the first channel number; for example, the first channel number may be 512.
In the embodiment of the application, the structures of the first convolution layers are similar. Each first convolution layer comprises at least two convolution kernels with the same hole coefficient and different convolution sizes, and the convolution kernels of different first convolution layers correspond to different hole coefficients. For example, the number of first convolution layers is two; the first of them comprises three convolution kernels with a hole coefficient of 1 and convolution sizes of 11×1, 7×1, and 3×1, respectively, and the second comprises three convolution kernels with a hole coefficient of 3 and convolution sizes of 11×1, 7×1, and 3×1, respectively. The hole coefficient, also called the expansion (dilation) coefficient, controls the size of the receptive field: the larger the hole coefficient, the larger the receptive field. Hole coefficients of different sizes capture features of different levels of detail and improve the representational capacity of the features. The convolution size controls the scale of the extracted features; convolution kernels of different sizes extract features of different levels of detail and likewise improve the representational capacity of the features.
Optionally, for the first of the first convolution layers, cavity convolution with the same hole coefficient and different convolution sizes is carried out on the input features with the first channel number through each convolution kernel included in that layer to obtain the convolution result corresponding to each convolution kernel, and the convolution results corresponding to the convolution kernels are added to obtain the output features of the first of the first convolution layers. For the second of the first convolution layers, cavity convolution with the same hole coefficient and different convolution sizes is carried out on the output features of the first of the first convolution layers through each convolution kernel included in the second layer to obtain the convolution result corresponding to each convolution kernel, and the convolution results are added to obtain the output features of the second of the first convolution layers. And so on, until the output features of the last first convolution layer are obtained.
It will be appreciated that the number of channels of the features may change during the cavity convolution process. The first channel number is the number of channels of the features input to the first of the first convolution layers (that is, of the audio features of the first audio), and the number of channels of the output features of the last first convolution layer may differ from the first channel number.
The embodiment of the application does not limit the structure, the size, and the like of the first output layer. Optionally, after the output features of the last first convolution layer are input to the first output layer, they are converted into the first reconstructed audio signal by the first output layer.
It should be noted that, in addition to the convolution kernels, the first convolution layer may further include a normalization layer, which may be a batch normalization (Batch Normalization, BN) layer or a layer normalization (Layer Normalization, LN) layer.
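As a rough PyTorch sketch of the decoder structure described above (not the exact structure of this embodiment): an input layer, a stack of convolution layers in which all kernels of one layer share a hole (dilation) coefficient but differ in convolution size and their outputs are added, each kernel followed by a batch normalization layer, and an output layer. The channel numbers, kernel sizes, number of layers, and the omission of any upsampling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiKernelDilatedLayer(nn.Module):
    """One 'first convolution layer': kernels sharing a hole (dilation)
    coefficient but differing in convolution size; branch outputs are added."""
    def __init__(self, channels: int, kernel_sizes=(11, 7, 3), dilation=1):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, channels, k, dilation=dilation,
                          padding=(k - 1) * dilation // 2),
                nn.BatchNorm1d(channels),  # normalization layer mentioned above
            )
            for k in kernel_sizes
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

class FirstDecoderSketch(nn.Module):
    """Input layer -> stacked multi-kernel dilated layers -> output layer."""
    def __init__(self, feature_dim=192, channels=512, dilations=(1, 3, 5)):
        super().__init__()
        self.input_layer = nn.Conv1d(feature_dim, channels, kernel_size=7, padding=3)
        self.conv_layers = nn.ModuleList(
            MultiKernelDilatedLayer(channels, dilation=d) for d in dilations
        )
        self.output_layer = nn.Conv1d(channels, 1, kernel_size=7, padding=3)

    def forward(self, frame_features):            # (batch, feature_dim, num_frames)
        x = self.input_layer(frame_features)      # -> first channel number
        for layer in self.conv_layers:
            x = layer(x)
        # Returns (batch, 1, num_frames); a real decoder would also upsample
        # the frame-rate features to the waveform sample rate.
        return torch.tanh(self.output_layer(x))
```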
Referring to fig. 3, fig. 3 is a schematic structural diagram of a convolution layer according to an embodiment of the present disclosure. Wherein (a) in fig. 3 shows three convolution layers in the related art. Briefly, the first convolution layer includes three convolution kernels each having a convolution size of 3×1, and hole coefficients d=1, d=3, and d=5, respectively, and the output features of the three convolution kernels are added. The second convolution layer includes three convolution kernels each having a convolution size of 5×1, and hole coefficients d=1, d=3, and d=5, respectively, and the output features of the three convolution kernels are added. The third convolution layer includes three convolution kernels each having a convolution size of 11×1, and hole coefficients d=1, d=3, and d=5, respectively, and the output features of the three convolution kernels are added.
Fig. 3 (b) shows three convolutional layers included in the decoder in the embodiment of the present application. Briefly, the first convolution layer includes three convolution kernels with hole coefficients d=1, convolution sizes 11×1, 5×1, and 3×1, respectively, and the output features of the three convolution kernels are added. The second convolution layer includes three convolution kernels with hole coefficients d=3, convolution sizes 11×1, 5×1, and 3×1, respectively, and the output features of the three convolution kernels are added. The third convolution layer includes three convolution kernels with hole coefficients d=5 and convolution sizes 11×1, 5×1, and 3×1, respectively, and the output features of the three convolution kernels are added.
In the embodiment of the application, by using the convolution layers shown in (b) in fig. 3, cavity convolutions with different convolution sizes and different hole coefficients are performed on the features during training, which improves the training effect and enables the model to output high-quality audio. At the same time, each convolution layer can be reconstructed after training, which reduces the computation cost and improves the audio generation efficiency while maintaining the audio quality. The convolution layer shown in (b) in fig. 3 may be reconstructed into the convolution layer shown in (c) in fig. 3; the reconstruction process is described below and is not repeated here.
Step 210 determines a first signal loss between the first sample audio signal and the first reconstructed audio signal.
In the embodiment of the present application, the first signal loss may be calculated based on the first sample audio signal and the first reconstructed audio signal according to a calculation formula of any one of the loss such as the mean square error loss, the cross entropy loss, the relative entropy loss, and the like. Alternatively, the first signal loss may be calculated based on a spectrogram of the first sample audio signal and a spectrogram of the first reconstructed audio signal according to a calculation formula of any one of a mean square error loss, a cross entropy loss, a relative entropy loss, and the like. The first signal loss is used to measure the difference between the first sample audio signal and the first reconstructed audio signal.
In implementation A2, step 205 includes: and training the first network model based on the first signal loss, the first characteristic loss and the second characteristic loss to obtain an audio generation model.
In the embodiment of the present application, the first signal loss, the first feature loss, and the second feature loss may be calculated by performing weighted summation, weighted averaging, and the like, to obtain a loss of the first network model. Alternatively, the loss of the first network model may be calculated based on the first signal loss, the first feature loss, the second feature loss, and other losses. At least one training is performed on the first network model through the loss of the first network model to obtain a second network model, and an audio generation model is determined based on the second network model, wherein the determination mode of the audio generation model is described above and is not repeated here.
Optionally, the number loss, the first signal loss, the first feature loss, and the second feature loss may be combined by weighted summation, weighted averaging, or the like to obtain the loss of the first network model, and the first network model is trained through the loss of the first network model to obtain the audio generation model.
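A minimal sketch of such a weighted combination; the weight values are hypothetical, since the text only states that the losses may be weighted and summed or averaged.

```python
def total_loss(first_feature_loss, second_feature_loss, signal_loss=None,
               num_loss=None, w_fwd=1.0, w_rev=1.0, w_sig=1.0, w_dur=1.0):
    """Weighted sum of the losses of the first network model.
    signal_loss and num_loss are optional, matching implementations A1/A2."""
    loss = w_fwd * first_feature_loss + w_rev * second_feature_loss
    if signal_loss is not None:
        loss = loss + w_sig * signal_loss
    if num_loss is not None:
        loss = loss + w_dur * num_loss
    return loss
```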
In one possible implementation, step 205 includes steps 211 to 213 (not shown).
Step 211, adjusting parameters of the first network model based on the first feature loss and the second feature loss to obtain a second network model, wherein the second network model comprises a feature processing network and a second decoder.
The process of adjusting the parameters of the first network model to obtain the second network model has been described above, and will not be described herein. Since the second network model is obtained by training the first network model, the structure, function, and the like of the second network model are similar to those of the first network model.
Optionally, the first network model includes at least one of a first text encoder, a first mapping layer, a first standard stream network, a first time length predictor, and a first decoder. In the process of training the first network model to obtain the second network model, in the first aspect, training the first text encoder to obtain the reference text encoder, in the second aspect, training the first mapping layer to obtain the reference mapping layer, in the third aspect, training the first standard stream network to obtain the reference standard stream network, in the fourth aspect, training the first time length predictor to obtain the reference time length predictor, in the fifth aspect, training the first decoder to obtain the second decoder. The feature processing network includes at least one of a reference text encoder, a reference mapping layer, a reference standard stream network, and a reference duration predictor.
Step 212, obtaining second audio, wherein the second audio is a second sample audio signal or a spectrogram of the second sample audio signal, and the second audio includes a plurality of second audio frames; encoding the second audio to obtain audio characteristics of each second audio frame; and decoding the audio features of each second audio frame through a second decoder to obtain a second reconstructed audio signal.
The embodiment of the application does not limit the manner in which the electronic device obtains the second audio. The electronic device may obtain the second audio input by the user, or the electronic device may read the second audio from another device. The second audio is similar to the first audio, and a description of step 201 will be omitted herein. Wherein the second audio is the same as or different from the first audio.
In the embodiment of the application, the second network model may be spliced with the audio encoder. If the second audio is a second sample audio signal, the second sample audio signal may be converted into a spectrogram of the second sample audio signal, although the second sample audio may also be a spectrogram of the second sample audio signal. The spectrogram of the second sample audio signal comprises spectrograms of a plurality of second audio frames, the spectrograms of the second audio frames are input into an audio encoder, and the spectrograms of the second audio frames are encoded through the audio encoder to obtain the audio characteristics of the second audio frames. The method for determining the audio features of the second audio frame is similar to the method for determining the audio features of the first audio frame, and will not be described herein.
The audio encoder is followed by a second decoder. The audio characteristics of each second audio frame may be input to a second decoder, and the audio characteristics of each second audio frame may be decoded by the second decoder to obtain a second reconstructed audio signal. It will be appreciated that the structure of the second decoder is similar to that of the first decoder. Optionally, the second decoder includes a second input layer, at least two second convolution layers, and a second output layer, where any one of the second convolution layers includes at least two convolution kernels of the same hole coefficient and different convolution sizes, and the convolution kernels of different second convolution layers correspond to different hole coefficients. The second reconstructed audio signal is determined in a similar manner to the first reconstructed audio signal, and a description of step 209 will be omitted herein.
Step 213, determining a second signal loss between the second sample audio signal and the second reconstructed audio signal; adjusting parameters of the second decoder based on the second signal loss to obtain a third decoder; an audio generation model is determined based on the feature processing network and the third decoder.
In this embodiment of the present application, the second signal loss may be calculated based on the second sample audio signal and the second reconstructed audio signal according to a calculation formula of any one of the loss such as the mean square error loss, the cross entropy loss, and the relative entropy loss. Alternatively, the second signal loss may be calculated based on a spectrogram of the second sample audio signal and a spectrogram of the second reconstructed audio signal according to a calculation formula of any one of the loss such as the mean square error loss, the cross entropy loss, and the relative entropy loss. The second signal loss is used to measure the difference between the second sample audio signal and the second reconstructed audio signal.
Next, the second signal loss is determined as a loss of the second decoder, or the loss of the second decoder is determined based on the second signal loss and other losses. Training the second decoder based on the loss of the second decoder to obtain a third decoder.
It will be appreciated that training the second decoder corresponds to adjusting parameters of the second decoder. In short, parameters of the second decoder may be adjusted based on the loss of the second decoder, so as to implement training of the second decoder once, and obtain an adjusted second decoder. And if the adjusted second decoder meets the second ending condition, taking the adjusted second decoder as a third decoder, and if the adjusted second decoder does not meet the second ending condition, taking the adjusted second decoder as the second decoder, training the second decoder again according to the mode from step 212 to step 213 until the adjusted second decoder meets the second ending condition, and taking the adjusted second decoder as the third decoder. And then splicing the third decoder on the feature processing network to form a third network model, and taking the third network model as an audio generation model, or training the third network model again to obtain the audio generation model.
The embodiment of the application does not limit that the adjusted second decoder meets the second ending condition. Illustratively, the adjusted second decoder satisfying the second end condition means: the training times corresponding to the adjusted second decoder reach the set times, or the loss of the adjusted second decoder is within the set range, and so on.
After the first network model is globally trained to obtain the second network model, the second network model is locally trained by training the second decoder included in the second network model. Overall convergence of the model does not mean that every local network of the model has converged, and once the model as a whole has converged, further global training can hardly improve its performance. Therefore, the feature processing network is fixed and only the parameters of the second decoder are adjusted, which improves the decoding capability, so that the decoder can reconstruct a clearer speech signal based on the audio features of each audio frame and the quality of the speech signal is improved.
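A hedged sketch of this local training stage, assuming PyTorch modules and a mean-square-error second signal loss on the waveform; the module and loader names are hypothetical stand-ins for the components described above, and the feature losses and number loss that may also enter the decoder loss are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def finetune_second_decoder(second_decoder, feature_network, audio_encoder,
                            dataloader, num_steps=10000, lr=2e-4):
    """Freeze the feature processing network; update only the second decoder."""
    for module in (feature_network, audio_encoder):
        for p in module.parameters():
            p.requires_grad_(False)          # fixed feature processing network / encoder
    optimizer = torch.optim.AdamW(second_decoder.parameters(), lr=lr)

    step = 0
    for spectrogram, waveform in dataloader:              # second audio samples
        frame_features = audio_encoder(spectrogram)       # audio features per second audio frame
        reconstructed = second_decoder(frame_features)    # second reconstructed audio signal
        loss = F.mse_loss(reconstructed, waveform)        # second signal loss (one possible form)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= num_steps:                             # a simple second end condition
            break
    return second_decoder                                 # serves as the third decoder afterwards
```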
Optionally, the method of the embodiment of the present application further includes steps 214 to 215 (not shown in the figure), and steps 214 to 215 are performed after step 212.
Step 214, obtaining a second text, wherein the second text and the second audio have the same content, and the second text comprises a plurality of second phonemes; and determining the distribution characteristics of each second phoneme included in the second text through a characteristic processing network, and mapping the audio characteristics of each second audio frame to obtain the distribution characteristics of each second audio frame.
The embodiment of the application does not limit the manner in which the electronic device obtains the second text. The electronic device may obtain the second text entered by the user, or the electronic device may read the second text from another device. The second text is similar to the first text, and the description of step 201 may be found, which is not repeated here. Wherein the second text is the same as or different from the first text.
The feature processing network includes a reference text encoder operable to convert the second text into a second text representation, the second text representation including representations of a plurality of second phones, the representations of each second phone being input to the reference text encoder, the representations of each second phone being encoded by the reference text encoder to obtain text features of each second phone. The text feature of the second phoneme is determined in a similar manner to that of the first phoneme, and will not be described herein.
The feature processing network further includes a reference mapping layer, after which the reference text encoder is spliced. And carrying out linear mapping or nonlinear mapping on the text features of each second phoneme through a reference mapping layer to obtain the distribution features of each second phoneme, wherein the distribution features of the second phonemes are used for describing the reference statistical distribution which is met by the text features of the second phonemes. The determining manner of the distribution characteristics of the second phoneme is similar to that of the first phoneme, and may be described in step 202, which is not repeated here.
In addition, the feature processing network further comprises a reference standard stream network, and the audio encoder is spliced with the reference standard stream network. And mapping the audio characteristics of each second audio frame into the distribution characteristics of each second audio frame by referring to the standard stream network, wherein the distribution characteristics of the second audio frame are used for describing that the audio characteristics of the second audio frame accord with the reference statistical distribution. The determining manner of the distribution characteristics of the second audio frame is similar to the determining manner of the distribution characteristics of the first audio frame, and thus, the description of step 2032 will not be repeated here.
Step 215 determines a third feature loss from the distribution feature of each second phoneme to the distribution feature of each second audio frame and a fourth feature loss from the distribution feature of each second audio frame to the distribution feature of each second phoneme.
In the embodiment of the present application, the forward KL divergence and the reverse KL divergence between the distribution characteristics of each second phoneme and the distribution characteristics of each second audio frame may be calculated. The third feature loss is a forward KL divergence for characterizing a distance between the distribution feature of each second phoneme and the distribution feature of each second audio frame in the case of fitting the distribution feature of each second audio frame using the distribution feature of each second phoneme, as a loss from the distribution feature of each second phoneme to the distribution feature of each second audio frame. The fourth feature loss is an inverse KL divergence, which is used to characterize a distance between the distribution feature of each second audio frame and the distribution feature of each second phoneme in the case of fitting the distribution feature of each second phoneme using the distribution feature of each second audio frame, from the distribution feature of each second audio frame to the distribution feature of each second phoneme. The implementation of step 215 may be described in step 204, and will not be described herein.
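Assuming, as is common for such models, that the distribution features are the mean and log-variance of a diagonal Gaussian, the forward and reverse KL divergences could be computed as in the sketch below; the parameterization and the argument order of the two directions are assumptions for illustration, not taken from this embodiment.

```python
import torch

def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL(p || q) between diagonal Gaussians, summed over all elements.
    Shapes: (num_frames, feature_dim), with the phoneme-side statistics
    already expanded to frame level by the alignment."""
    var_p, var_q = logvar_p.exp(), logvar_q.exp()
    return 0.5 * (logvar_q - logvar_p
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0).sum()

def feature_losses(mu_phone, logvar_phone, mu_frame, logvar_frame):
    # Third feature loss: forward KL (phoneme side fitting the frame side).
    forward_kl = gaussian_kl(mu_phone, logvar_phone, mu_frame, logvar_frame)
    # Fourth feature loss: reverse KL (frame side fitting the phoneme side).
    reverse_kl = gaussian_kl(mu_frame, logvar_frame, mu_phone, logvar_phone)
    return forward_kl, reverse_kl
```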
In step 213, "adjust parameters of the second decoder based on the second signal loss to obtain a third decoder", comprising: and adjusting parameters of the second decoder based on the third characteristic loss, the fourth characteristic loss and the second signal loss to obtain a third decoder.
The third characteristic loss, the fourth characteristic loss, and the second signal loss may be weighted summed, weighted averaged, etc. to obtain a loss for the second decoder. Then, the second decoder is trained based on the loss of the second decoder, resulting in a third decoder.
In addition, the alignment of each second phoneme and each second audio frame can be performed based on the distribution characteristics of each second phoneme and the distribution characteristics of each second audio frame, so as to obtain a third number of second audio frames corresponding to each second phoneme; determining a fourth number of second audio frames corresponding to each second phone based on the text features of each second phone; a number penalty between the third number and the fourth number of corresponding second audio frames for each second phoneme is determined. The determining manner of the number loss between the third number and the fourth number is similar to the determining manner of the number loss between the first number and the second number, and may be described in step 208, which is not repeated herein.
A loss of the second decoder is determined based on at least one of the third feature loss, the fourth feature loss, the second signal loss, a number loss between the third number and the fourth number. Then, the second decoder is trained based on the loss of the second decoder, resulting in a third decoder.
In one possible implementation, the determining the audio generation model in step 213 "based on the feature processing network and the third decoder" includes steps 216 to 218 (not shown in the figure).
Step 216, obtaining third audio, wherein the third audio is a third sample audio signal or a spectrogram of the third sample audio signal, and the third audio includes a plurality of third audio frames; encoding the third audio to obtain audio characteristics of each third audio frame; and decoding the audio characteristics of each third audio frame through a third decoder to obtain a third reconstructed audio signal.
In this embodiment, the implementation manner of step 216 is similar to that of step 212, and thus, the description of step 212 will be omitted herein. Wherein the third audio is the same as or different from the first audio and the second audio.
Optionally, the third decoder comprises a third input layer, at least two third convolution layers and a third output layer. The third input layer is similar in structure and function to the first input layer for converting audio features of each third audio frame to input features of the first number of channels. The third convolution layer has similar structure and function with the first convolution layer, and is used for carrying out the cavity convolution of the same cavity coefficient and different convolution sizes on the characteristics input into the third convolution layer to obtain convolution results corresponding to all convolution kernels, and adding the convolution results corresponding to all convolution kernels to obtain the output characteristics of the third convolution layer. The third output layer is similar in structure and function to the first output layer for converting the output characteristics of the last third convolution layer into a third reconstructed audio signal.
In step 217, the audio features of each third audio frame are decoded by a fourth decoder, so as to obtain a fourth reconstructed audio signal, where the number of parameters of the fourth decoder is smaller than the number of parameters of the third decoder.
In the embodiment of the application, the fourth decoder may be constructed based on the third decoder. A fourth input layer is generated by random parameter initialization and is spliced before the third convolution layers and the third output layer of the third decoder to obtain the fourth decoder.
The fourth decoder includes a fourth input layer, at least two third convolutional layers, and a third output layer. The fourth input layer and the third input layer are similar in structure and function for converting the audio characteristics of each third audio frame to the input characteristics of the second number of channels. Wherein the second channel number is smaller than the first channel number, for example, the first channel number is 512 and the second channel number is 256. The third convolution layer is used for carrying out cavity convolution with the same cavity coefficient and different convolution sizes on the characteristics input into the third convolution layer to obtain convolution results corresponding to all convolution kernels, and adding the convolution results corresponding to all the convolution kernels to obtain the output characteristics of the third convolution layer. The third output layer is for converting the output characteristics of the last third convolution layer into a fourth reconstructed audio signal.
By reconstructing the input layer in this way, the original first channel number is reduced to the second channel number, which reduces the amount of computation. In general, the number of channels is positively correlated with the number of neurons, that is, the greater the number of channels, the greater the number of neurons. Compared with the third input layer, the number of neurons in the fourth input layer is significantly reduced, so that the number of parameters of the fourth decoder is smaller than that of the third decoder, which compresses the model structure of the decoder and reduces the occupation of storage resources.
Step 218, determining a third signal loss between the third reconstructed audio signal and the fourth reconstructed audio signal; adjusting parameters of the fourth decoder based on the third signal loss to obtain a reference decoder; and determining an audio generation model based on the feature processing network and the reference decoder.
In the embodiment of the present application, the third signal loss may be calculated based on the third reconstructed audio signal and the fourth reconstructed audio signal according to a calculation formula of any one of the loss such as the mean square error loss, the cross entropy loss, the relative entropy loss, and the like. Alternatively, the third signal loss may be calculated based on a spectrogram of the third reconstructed audio signal and a spectrogram of the fourth reconstructed audio signal according to a calculation formula of any one of the loss such as the mean square error loss, the cross entropy loss, and the relative entropy loss. The third signal loss is used to measure the difference between the third reconstructed audio signal and the fourth reconstructed audio signal.
Next, the third signal loss is determined as the loss of the fourth decoder, or the loss of the fourth decoder is determined based on the third signal loss and other losses. The fourth decoder is trained based on the loss of the fourth decoder to obtain a reference decoder.
It will be appreciated that training the fourth decoder corresponds to adjusting parameters of the fourth decoder. In short, parameters of the fourth decoder may be adjusted based on the loss of the fourth decoder, so as to implement training of the fourth decoder once, and obtain the adjusted fourth decoder. And if the adjusted fourth decoder meets the third ending condition, taking the adjusted fourth decoder as a reference decoder, and if the adjusted fourth decoder does not meet the third ending condition, taking the adjusted fourth decoder as the fourth decoder, training the fourth decoder again according to the mode from step 216 to step 218 until the adjusted fourth decoder meets the third ending condition, and taking the adjusted fourth decoder as the reference decoder. And then splicing the reference decoder after the feature processing network to form an audio generation model, or reconstructing the reference decoder to obtain a target decoder, and splicing the target decoder after the feature processing network to form the audio generation model.
The embodiment of the application does not limit that the adjusted fourth decoder meets the third ending condition. Illustratively, the adjusted fourth decoder satisfying the third end condition means: the training times corresponding to the adjusted fourth decoder reach the set times, or the loss of the adjusted fourth decoder is within the set range, and so on.
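A hedged sketch of one distillation step as described above: the third decoder acts as a frozen teacher and the fourth decoder as the student, with the third signal loss taken here as the mean square error between the two reconstructed signals; the function and module names are hypothetical.

```python
import torch
import torch.nn.functional as F

def distillation_step(third_decoder, fourth_decoder, optimizer, frame_features):
    """One training step of the fourth decoder against the frozen third decoder."""
    with torch.no_grad():
        teacher_signal = third_decoder(frame_features)   # third reconstructed audio signal
    student_signal = fourth_decoder(frame_features)      # fourth reconstructed audio signal
    third_signal_loss = F.mse_loss(student_signal, teacher_signal)
    optimizer.zero_grad()
    third_signal_loss.backward()
    optimizer.step()
    return third_signal_loss
```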
Optionally, the electronic device may further obtain a third text, where the third text has the same content as the third audio, and the third text includes a plurality of third phonemes; determine the distribution features of each third phoneme included in the third text through the feature processing network, and map the audio features of each third audio frame to obtain the distribution features of each third audio frame; and determine a fifth feature loss from the distribution feature of each third phoneme to the distribution feature of each third audio frame and a sixth feature loss from the distribution feature of each third audio frame to the distribution feature of each third phoneme.
In addition, the electronic device may further align each third phoneme with each third audio frame based on the distribution feature of each third phoneme and the distribution feature of each third audio frame, to obtain a fifth number of third audio frames corresponding to each third phoneme; determining a sixth number of third audio frames corresponding to each third phone based on the text features of each third phone; a number penalty between a fifth number and a sixth number of corresponding third audio frames for each third phoneme is determined.
Thereafter, a loss of the fourth decoder is determined based on at least one of the third signal loss, the fifth feature loss, the sixth feature loss, the number loss between the fifth number and the sixth number. The fourth decoder is trained based on the loss of the fourth decoder to obtain a reference decoder.
It will be appreciated that, following the same principle as distilling the fourth decoder based on the third decoder to obtain the reference decoder, as shown in steps 216 to 218, the embodiment of the present application may also distill the fourth decoder based on the second decoder to obtain a distillation decoder. The distillation decoder is then spliced after the feature processing network to form an audio generation model, or the distillation decoder is reconstructed and the decoder obtained by reconstruction is spliced after the feature processing network to form the audio generation model; the reconstruction manner is described below and is not repeated here.
In the embodiment of the application, the first network model is globally trained to obtain the second network model, and then the second decoder included in the second network model is locally trained, so that the third decoder with the first channel number is obtained by training. On the basis of the third decoder, the fourth decoder is trained in a knowledge distillation mode, so that the fourth decoder continuously learns the third decoder, and the quality of generated voice signals is guaranteed while the model structure is reduced.
The process of reconstructing the reference decoder to obtain the target decoder, and determining the audio generation model based on the target decoder is described below.
Illustratively, the reference decoder includes a reference input layer, at least two reference convolutional layers, and a reference output layer, any one of the reference convolutional layers including at least two convolution kernels of different convolution sizes for the same hole coefficient, the convolution kernels of different reference convolutional layers corresponding to different hole coefficients. The "determining an audio generation model based on the feature processing network and the reference decoder" in step 218 includes steps 2181 to 2183 (not shown in the figure).
Step 2181, for any one of the reference convolution layers, fusing each convolution kernel included in any one of the reference convolution layers into a fused convolution kernel, thereby obtaining a reconstructed convolution layer, wherein the fused convolution kernel has the same cavity coefficient as the convolution kernel included in any one of the reference convolution layers, and the convolution size of the fused convolution kernel is not smaller than the convolution size of each convolution kernel included in any one of the reference convolution layers.
In this embodiment of the present application, since the hole coefficients of the convolution kernels included in any one of the reference convolution layers are the same, the convolution kernels included in the reference convolution layer may be fused into one convolution kernel, which is referred to as a fusion convolution kernel. The hole coefficient of the fusion convolution kernel is the same as that of each convolution kernel included in the reference convolution layer, and the convolution size of the fusion convolution kernel is not smaller than the convolution size of each convolution kernel included in the reference convolution layer. For example, if the reference convolution layer includes three convolution kernels with a hole coefficient d=1 and convolution sizes of 11×1, 7×1, and 3×1, respectively, the fusion convolution kernel obtained by fusing the three convolution kernels has a hole coefficient d=1 and a convolution size of not less than 11×1.
Optionally, step 2181 includes: for a first convolution kernel of which the convolution size is smaller than that of the fusion convolution kernel, filling parameters of the first convolution kernel to obtain a filled first convolution kernel, wherein the convolution size of the filled first convolution kernel is the same as that of the fusion convolution kernel; and determining a reconstruction convolution layer based on the parameters of the first convolution kernel and the parameters of the second convolution kernel after filling, wherein the convolution size of the second convolution kernel is the same as that of the fusion convolution kernel.
In this embodiment of the present application, a specified character may be used to fill a parameter of the first convolution kernel, so as to obtain a filled first convolution kernel. The embodiment of the application does not limit the parameters of the designated characters and the first convolution kernel. Optionally, the parameters of the first convolution kernel include at least one of a weight term and a bias term, and the specified character is a number such as 0, 1, or the like, or a null character, or the like.
The convolution size of the first convolution kernel after filling is the same as that of the fusion convolution kernel, and the convolution size of the second convolution kernel is the same as that of the fusion convolution kernel. And carrying out operations such as weighting and averaging on the parameters of the first convolution kernel and the parameters of the second convolution kernel after filling to obtain the parameters of the fusion convolution kernel, thereby obtaining the reconstruction convolution layer.
Optionally, the parameters of the fused convolution kernel include at least one of a weight term and a bias term. And performing operations such as weighting and averaging on the weight items of the first convolution kernel and the weight items of the second convolution kernel after filling to obtain the weight items of the fusion convolution kernel, and performing operations such as weighting and averaging on the offset items of the first convolution kernel and the offset items of the second convolution kernel after filling to obtain the offset items of the fusion convolution kernel. In this way, determining the parameters of the fused convolution kernel is achieved, thereby determining the reconstructed convolution layer.
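As an illustration of the padding-and-combining fusion described above, the following sketch (in PyTorch, with assumed 1D kernels and odd convolution sizes) fills the smaller kernels with zeros up to the largest convolution size and sums the weight and bias terms; the check at the end confirms that, for a shared hole (dilation) coefficient and matching padding, the fused kernel reproduces the sum of the branch outputs.

```python
import torch
import torch.nn.functional as F

def fuse_same_dilation_kernels(weights, biases):
    """Fuse conv kernels sharing a hole coefficient but differing in size.
    weights: list of tensors of shape (out_ch, in_ch, k_i); biases: list of (out_ch,).
    Assumes odd kernel sizes so the centres line up after symmetric zero filling."""
    k_max = max(w.shape[-1] for w in weights)
    out_ch, in_ch, _ = weights[0].shape
    fused_w = weights[0].new_zeros(out_ch, in_ch, k_max)
    for w in weights:
        pad = (k_max - w.shape[-1]) // 2
        fused_w = fused_w + F.pad(w, (pad, pad))   # fill the smaller kernel with 0s
    fused_b = sum(biases)
    return fused_w, fused_b

# Quick equivalence check with dilation d and 'same' padding on each branch.
x = torch.randn(1, 4, 50)
w3, w7 = torch.randn(4, 4, 3), torch.randn(4, 4, 7)
b3, b7 = torch.randn(4), torch.randn(4)
d = 3
branch_sum = (F.conv1d(x, w3, b3, padding=1 * d, dilation=d)
              + F.conv1d(x, w7, b7, padding=3 * d, dilation=d))
fw, fb = fuse_same_dilation_kernels([w3, w7], [b3, b7])
fused_out = F.conv1d(x, fw, fb, padding=3 * d, dilation=d)
assert torch.allclose(branch_sum, fused_out, atol=1e-5)
```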
The fusion process is described below by taking the fusion of two convolution kernels with identical hole coefficients and convolution sizes of 3×3 and 1×1, respectively, into one fusion convolution kernel as an example. Optionally, W^(3) denotes the convolution kernel with a convolution size of 3×3, whose input features have C channels and whose output features have D channels; W^(1) denotes the convolution kernel with a convolution size of 1×1, likewise with C input channels and D output channels. Each convolution kernel is followed by a batch normalization layer. μ^(3), σ^(3), γ^(3), β^(3) sequentially denote the mean, variance, scaling factor, and bias of the batch normalization layer spliced after the 3×3 convolution kernel, and μ^(1), σ^(1), γ^(1), β^(1) sequentially denote the mean, variance, scaling factor, and bias of the batch normalization layer spliced after the 1×1 convolution kernel (in the equations below, σ_i denotes the standard deviation, that is, the square root of the i-th accumulated variance). M denotes the input feature of the two convolution kernels and O denotes the output feature. Since the outputs of the two batch normalization layers are added, equation (1) shown below can be obtained:

O = bn(M ∗ W^(3), μ^(3), σ^(3), γ^(3), β^(3)) + bn(M ∗ W^(1), μ^(1), σ^(1), γ^(1), β^(1))    Equation (1)

where bn represents the batch normalization function and ∗ represents the (hole) convolution. Assume that, for each output channel i with 1 ≤ i ≤ D, the batch normalization function satisfies equation (2) shown below:

bn(M, μ, σ, γ, β)_{:,i,:,:} = γ_i · (M_{:,i,:,:} − μ_i) / σ_i + β_i    Equation (2)

Let W′ and b′ be the weight term and the bias term determined based on W, μ, σ, γ, β, as shown in equation (3) below:

W′_i = (γ_i / σ_i) · W_i,   b′_i = β_i − (γ_i · μ_i) / σ_i    Equation (3)

Based on the above equations (1) to (3), it can be verified that, for 1 ≤ i ≤ D, equation (4) shown below holds:

bn(M ∗ W, μ, σ, γ, β)_{:,i,:,:} = (M ∗ W′)_{:,i,:,:} + b′_i    Equation (4)

The above equation (4) shows that performing hole convolution and batch normalization on the input features with a convolution kernel of arbitrary convolution size is equivalent to weighting the input features with the weight term W′ and then adding the bias term b′. On this basis, the W^(3) convolution kernel and the W^(1) convolution kernel can be fused into one fusion convolution kernel whose convolution size is not smaller than 3×3; for example, the convolution size of the fusion convolution kernel is 3×3.
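The following minimal PyTorch sketch checks the equivalence stated by equations (3) and (4): folding a batch normalization layer (with fixed evaluation statistics) into the preceding bias-free convolution kernel yields the same output as applying the convolution and the normalization separately. The tensor shapes and the use of 1D convolution are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def fold_bn_into_conv(w, bn_mean, bn_var, bn_gamma, bn_beta, eps=1e-5):
    """Equations (3)/(4): W'_i = (gamma_i / sigma_i) * W_i,
    b'_i = beta_i - mu_i * gamma_i / sigma_i, with sigma_i = sqrt(var_i + eps)."""
    scale = bn_gamma / torch.sqrt(bn_var + eps)   # one factor per output channel
    w_folded = w * scale.view(-1, 1, 1)
    b_folded = bn_beta - bn_mean * scale
    return w_folded, b_folded

# Sanity check: conv followed by BN equals conv with folded weight and bias.
x = torch.randn(1, 4, 64)
w = torch.randn(8, 4, 3)
mean, var = torch.randn(8), torch.rand(8) + 0.5
gamma, beta = torch.randn(8), torch.randn(8)
y_bn = F.batch_norm(F.conv1d(x, w, padding=1), mean, var, gamma, beta,
                    training=False, eps=1e-5)
wf, bf = fold_bn_into_conv(w, mean, var, gamma, beta)
y_folded = F.conv1d(x, wf, bf, padding=1)
assert torch.allclose(y_bn, y_folded, atol=1e-5)
```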
Referring to fig. 4, fig. 4 is a schematic diagram illustrating reconstruction of a convolution layer according to an embodiment of the present application. Fig. 4 (a) shows the process of fusing two convolution kernels with identical hole coefficients and convolution sizes of 3×3 and 1×1, respectively, to obtain a fusion convolution kernel with a convolution size of 3×3. In this embodiment of the present application, the structure of the convolution layer is: the 3×3 convolution kernel and the 1×1 convolution kernel are each followed by a normalization layer, and the output results of the two normalization layers are added. From the conversion principle of equations (1) to (4), a convolution kernel followed by a normalization layer is equivalent to a single convolution kernel with a weight term and a bias term. Based on this, the above convolution layer can be regarded as adding the output features of the 3×3 convolution kernel and the 1×1 convolution kernel, and the fusion convolution kernel with a convolution size of 3×3 is obtained by fusing the 3×3 convolution kernel and the 1×1 convolution kernel.
Fig. 4 (b) shows how two convolution kernels with identical hole coefficients and convolution sizes of 3×3 and 1×1, respectively, are fused to obtain a fusion convolution kernel with a convolution size of 3×3. According to the conversion principle of equations (1) to (4), the parameters of each convolution kernel and of its normalization layer are first converted into a weight term and a bias term of an equivalent convolution kernel. The weight term of the 1×1 convolution kernel is then filled with the specified number 0, so that the filled weight term has the same shape and structure as the weight term of the 3×3 convolution kernel, that is, the convolution size of the filled 1×1 convolution kernel is the same as that of the 3×3 convolution kernel. Finally, the filled weight term of the 1×1 convolution kernel is added to the weight term of the 3×3 convolution kernel to obtain the weight term of the 3×3 fusion convolution kernel, and the bias term of the 1×1 convolution kernel is added to the bias term of the 3×3 convolution kernel to obtain the bias term of the 3×3 fusion convolution kernel.
According to the above principle of fusing the 3×3 convolution kernel and the 1×1 convolution kernel, the approach can easily be extended to the fusion of at least two convolution kernels with identical hole coefficients and different convolution sizes, for example, to the fusion of the three convolution kernels with identical hole coefficients and convolution sizes of 11×1, 5×1, and 3×1 shown in fig. 3.
As can be seen from comparing (b) and (c) in fig. 3, the three convolution kernels with hole coefficient d=1 and convolution sizes of 11×1, 5×1, and 3×1 are fused into one fusion convolution kernel with hole coefficient d=1 and convolution size 11×1, and a normalization layer is spliced after the fusion convolution kernel to form the first reconstructed convolution layer. The three convolution kernels with hole coefficient d=3 and convolution sizes of 11×1, 5×1, and 3×1 are fused into one fusion convolution kernel with hole coefficient d=3 and convolution size 11×1, and a normalization layer is spliced after it to form the second reconstructed convolution layer. The three convolution kernels with hole coefficient d=5 and convolution sizes of 11×1, 5×1, and 3×1 are fused into one fusion convolution kernel with hole coefficient d=5 and convolution size 11×1, and a normalization layer is spliced after it to form the third reconstructed convolution layer.
By fusing convolution kernels with the same cavity coefficient and different convolution sizes, a plurality of cavity convolution branches are fused to one cavity convolution branch, extra calculation overhead caused by the fact that multiple branches calculate and sum respectively is avoided, and audio generation efficiency is improved. In addition, multi-branch calculation is adopted in the training process, and after the training is finished, the multi-branches are combined into one branch, so that the training effect is ensured, the model can output high-quality audio, and the audio generation efficiency can be accelerated.
And 2182, splicing the reference input layer, the at least two reconstruction convolution layers and the reference output layer to obtain the target decoder.
In this embodiment of the present application, at least two reconstruction convolution layers may be spliced after the reference input layer, and the reference output layer is spliced after the at least two reconstruction convolution layers, so as to obtain a target decoder, where functions of the target decoder are described correspondingly below, and are not described herein again.
Step 2183 determines an audio generation model based on the feature processing network and the target decoder. That is, after splicing the target decoder to the feature processing network, an audio generation model is formed.
It will be appreciated that the second decoder, the third decoder, and the reference decoder are similar in structure, feature processing, and the like. Therefore, the convolution layers of the second decoder or of the third decoder may also be reconstructed according to the principle of reconstructing the convolution layers of the reference decoder shown in steps 2181 to 2182, so as to obtain a reconstruction decoder; the reconstruction process is not repeated here. The reconstruction decoder is then spliced after the feature processing network to obtain an audio generation model.
It should be noted that, information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of the relevant region. For example, reference herein to first audio, first text, second audio, second text, third audio, third text, etc. is taken with sufficient authorization.
In the method, the distribution characteristics of each first phoneme in the first text and the distribution characteristics of each first audio frame in the first audio are determined through the first network model, and then the first characteristic loss and the second characteristic loss between the distribution characteristics of each first phoneme and the distribution characteristics of each first audio frame are determined. The distance between the distribution feature of each first phoneme and the distribution feature of each first audio frame in the case of fitting the distribution feature of each first audio frame using the distribution feature of each first phoneme is measured by the first feature loss. The distance between the distribution feature of each first phoneme and the distribution feature of each first audio frame in the case of fitting the distribution feature of each first phoneme using the distribution feature of each first audio frame is measured by the second feature loss. That is, the distance between the distribution feature of each first phoneme and the distribution feature of each first audio frame can be measured more accurately and more stably through the first feature loss and the second feature loss, so that after the first network model is trained based on the first feature loss and the second feature loss, the model can reduce the difference between the distribution feature of the phoneme and the distribution feature of the audio frame as much as possible, and the distribution feature of the audio frame can be determined according to the distribution feature of the phoneme, thereby realizing the generation of the audio signal. Because the distribution characteristics of the phonemes and the distribution characteristics of the audio frames are closer, the phenomenon that the audio signals have pronunciation errors is less, and the pronunciation stability and the audio quality are improved.
Fig. 5 is a flowchart of an audio generation method according to an embodiment of the present application. The method may be applied in the above implementation environment and can generate an audio signal with high pronunciation stability and high audio quality. The audio generation method in the embodiment of the present application is performed by the terminal device 101 or the server 102, which is referred to below as an electronic device. As shown in fig. 5, the method includes the following steps.
In step 501, a reference text is obtained, the reference text comprising a plurality of reference phonemes.
The embodiment of the application does not limit the way in which the electronic device obtains the reference text. The electronic device may obtain the reference text entered by the user, or the electronic device may read the reference text from another device. The implementation of step 501 is similar to that of step 201, and the description of step 201 may be seen, which is not repeated herein.
In step 502, the distribution characteristics of each reference phoneme included in the reference text are determined by the audio generating model, and the distribution characteristics of the reference phonemes are used to describe the reference phonemes and conform to the reference statistical distribution.
Wherein the audio generation model is trained according to the training method of the audio generation model related to fig. 2. The reference text can be converted through the conversion layer to obtain the representation of the reference text. Thereafter, the reference text tokens are input into an audio generation model by which the distribution characteristics of the individual reference phonemes are determined based on the reference text tokens. The implementation of step 502 is similar to that of step 202, and the description of step 202 may be seen, which is not repeated here.
In one possible implementation, step 502 includes steps 5021 to 5022 (not shown in the figures).
In step 5021, the reference text is encoded through the audio generation model, so as to obtain the text characteristics of each reference phoneme.
In the embodiment of the application, the audio generation model comprises a reference text encoder, the reference text representation comprises representations of a plurality of reference phonemes, the representations of the reference phonemes are input into the reference text encoder, and the representations of the reference phonemes are encoded through the reference text encoder to obtain text characteristics of the reference phonemes. The implementation of step 5021 is similar to that of step 2021, and the description of step 2021 may be found, which is not repeated herein.
And 5022, mapping the text characteristics of each reference phoneme through an audio generation model to obtain the distribution characteristics of each reference phoneme.
In this embodiment, a reference mapping layer is spliced after a reference text encoder of an audio generation model, and text features of each reference phoneme are mapped linearly or non-linearly by the reference mapping layer to obtain distribution features of each reference phoneme, where the distribution features of the reference phonemes are used to describe a reference statistical distribution that is consistent with the text features of the reference phonemes. The implementation of step 5022 is similar to that of step 2022, and a description of step 2022 may be found, which is not repeated herein.
In step 503, the distribution characteristics of each reference audio frame are determined by the audio generation model based on the distribution characteristics of each reference phoneme, and the distribution characteristics of the reference audio frame are used to describe the reference audio frame and conform to the reference statistical distribution.
Since a phoneme is a minimum phonetic unit determined according to a pronunciation action, one pronunciation action constitutes one phoneme, and thus, when a sound corresponding to the phoneme is uttered, the sound occupies at least one audio frame. Based on this, the audio generation model may determine, based on the distribution characteristics of any one of the reference phonemes, the distribution characteristics of at least one reference audio frame corresponding to the reference phonemes, the distribution characteristics of the reference audio frame being used to describe semantics, content, pronunciation patterns, etc. of the reference audio frame and conforming to the reference statistical distribution.
Optionally, the method of the embodiment of the present application further includes step 505 (not shown in the figure).
In step 505, the number of reference audio frames corresponding to each reference phoneme is determined based on the text characteristics of each reference phoneme.
In an embodiment of the present application, the audio generation model includes a reference duration predictor. The text features of each reference phoneme may be input into the reference duration predictor, and the number of reference audio frames corresponding to each reference phoneme is predicted by the reference duration predictor. Optionally, one reference phoneme corresponds to m reference audio frames, where m is a positive number greater than 0; for example, the numbers of reference audio frames corresponding to three reference phonemes are 1.8, 1.9, and 0.9, respectively. The implementation of step 505 is similar to that of step 207, and the description of step 207 may be found, which is not repeated here.
On the basis of step 505, step 503 includes: and expanding the distribution characteristics of each reference phoneme based on the number of the corresponding reference audio frames of each reference phoneme through an audio generation model to obtain the distribution characteristics of each reference audio frame.
For any one reference phoneme, the copy number is determined by the number of reference audio frames corresponding to that reference phoneme. Optionally, if the number of reference audio frames corresponding to the reference phoneme is a positive integer, the copy number is that number. If the number of reference audio frames corresponding to the reference phoneme is a positive decimal, an operation such as rounding up or rounding off is performed on the number to obtain a copy number in the form of a positive integer. Then, the distribution characteristics of the reference phoneme are copied according to the copy number, and each copy is the distribution characteristics of one reference audio frame corresponding to the reference phoneme. For example, if the copy number is 2, the distribution characteristics of one reference phoneme are copied into 2 copies, which are the distribution characteristics of 2 reference audio frames. In this way, the distribution characteristics of each reference phoneme are expanded to obtain the distribution characteristics of each reference audio frame corresponding to that reference phoneme.
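The expansion described above can be sketched as the hypothetical helper below, assuming the distribution features are the Gaussian mean and log-variance produced by the mapping layer and that fractional frame counts are rounded up; the function name and tensor layout are illustrative.

```python
import math
import torch

def expand_distribution_features(mean, log_var, frame_counts):
    """Copies each phoneme's distribution feature once per corresponding
    reference audio frame (copy number = rounded-up frame count)."""
    # mean, log_var: [latent_dim, num_phonemes]; frame_counts: [num_phonemes]
    copies = torch.tensor([max(1, math.ceil(c)) for c in frame_counts.tolist()])
    idx = torch.repeat_interleave(torch.arange(mean.size(1)), copies)
    # Each column is now the distribution feature of one reference audio frame.
    return mean[:, idx], log_var[:, idx]
```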
In step 504, a reference audio signal is generated based on the distribution characteristics of each reference audio frame by using the audio generation model, and the reference audio corresponding to the reference audio signal is identical to the content of the reference text, and the reference audio includes each reference audio frame.
Since the distribution characteristics of a reference audio frame are used to describe the semantics, content, pronunciation manner, etc. of the reference audio frame, the reference audio signal can be determined by the audio generation model based on the distribution characteristics of the respective reference audio frames. The reference audio signal carries the semantics, content, etc. of the reference text and corresponds to the sound produced when a pronunciation object utters the reference text.
In one possible implementation, step 504 includes steps 5041 to 5042 (not shown).
In step 5041, the distribution characteristics of each reference audio frame are mapped by the audio generation model, so as to obtain the audio characteristics of each reference audio frame.
In an embodiment of the present application, the audio generation model includes a reference standard stream network. Because the reference standard stream network is a reversible network, it can not only take the audio characteristics of audio frames as input and output their distribution characteristics, but also take the distribution characteristics of audio frames as input and output their audio characteristics, thereby realizing the mapping from input to output in either direction. Thus, through the reference standard stream network, the distribution characteristics of each reference audio frame can be mapped to the audio characteristics of that reference audio frame. The implementation of step 5041 is similar to that of step 2032; refer to the description of step 2032, which is not repeated here.
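As an illustration of why such a network can be run in both directions, the affine coupling step below is invertible by construction; this is only one common way of building a reversible mapping and is not asserted to be the exact structure of the standard stream network in the embodiment (it also assumes an even channel count).

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible step: forward maps audio features toward distribution
    features; reverse maps distribution features back to audio features."""

    def __init__(self, dim: int, hidden_dim: int = 256):
        super().__init__()
        self.half = dim // 2  # assumes dim is even
        self.net = nn.Sequential(
            nn.Conv1d(self.half, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, self.half * 2, kernel_size=1),
        )

    def forward(self, x: torch.Tensor, reverse: bool = False) -> torch.Tensor:
        # x: [batch, dim, num_frames]; the first half conditions the second half.
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(xa).chunk(2, dim=1)
        if not reverse:
            yb = xb * torch.exp(log_s) + t      # audio features -> distribution space
        else:
            yb = (xb - t) * torch.exp(-log_s)   # distribution space -> audio features
        return torch.cat([xa, yb], dim=1)
```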
In step 5042, the audio features of each reference audio frame are decoded by the audio generation model to obtain a reference audio signal.
The audio generation model includes any one of the second decoder, the third decoder, the reference decoder, the target decoder, and the reconstruction decoder, and the structure, function, etc. of each decoder are described above, and are not described herein. The audio features of each reference audio frame are decoded by a decoder to obtain a reference audio signal. The implementation of step 5042 is similar to that of step 209, step 212, step 216, and step 217, and will not be described herein.
Optionally, if the audio generation model includes any one of the second decoder, the third decoder, and the reference decoder, then, because the decoder includes multiple convolution layers and each convolution layer includes at least two convolution kernels with the same hole coefficient and different convolution sizes, each convolution layer needs to perform hole convolutions of different convolution sizes on the input feature to obtain the hole convolution result of each convolution kernel, and then add the hole convolution results of the convolution kernels to obtain the output feature of the convolution layer.
If the audio generation model includes the target decoder or the reconstruction decoder, the structure is simpler, because the target decoder or the reconstruction decoder is obtained by fusing the convolution kernels included in each convolution layer into one fused convolution kernel. Each convolution layer then only needs to perform hole convolution on the input feature once, which reduces the amount of calculation and improves the audio generation efficiency.
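The difference between the two decoder variants can be illustrated with the sketch below: a layer with two convolution kernels of the same hole (dilation) coefficient but different sizes whose results are summed, and a fusion step that zero-pads the smaller kernel to the larger size and adds the weights, so that a single hole convolution reproduces the same output. The specific kernel sizes (3 and 5) and channel layout are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiKernelHoleConv(nn.Module):
    """One decoder convolution layer: two kernels with the same hole
    coefficient but different convolution sizes, results summed."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.small = nn.Conv1d(channels, channels, 3, dilation=dilation, padding=dilation)
        self.large = nn.Conv1d(channels, channels, 5, dilation=dilation, padding=2 * dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.small(x) + self.large(x)   # two hole convolutions per layer

def fuse(layer: MultiKernelHoleConv) -> nn.Conv1d:
    """Builds a fused kernel: pad the size-3 weights to size 5 and add them
    to the size-5 weights, keeping the same hole coefficient."""
    fused = nn.Conv1d(layer.large.in_channels, layer.large.out_channels, 5,
                      dilation=layer.large.dilation[0], padding=layer.large.padding[0])
    with torch.no_grad():
        fused.weight.copy_(layer.large.weight + F.pad(layer.small.weight, (1, 1)))
        fused.bias.copy_(layer.large.bias + layer.small.bias)
    return fused   # one hole convolution now replaces two
```

For any input, the fused layer's output matches the two-kernel layer's output up to floating-point error, which is why the fused decoder needs only one hole convolution per layer.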
It can be understood that, in practical applications, the embodiments of the present application improve the decoder structure on the basis of the audio generation model, so that audio generation for any language, any dialect, and any pronunciation object can be realized, supporting adaptive speech generation tasks. The speech generation tasks include a speech synthesis task, a voice conversion task, a singing synthesis task, a music generation task, and the like, which are widely applied in fields such as reading, news, games, and intelligent robots, and the scheme has good extensibility.
It should be noted that, information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of the relevant region. For example, references and the like referred to in this application are all obtained with sufficient authorization.
The audio generation model used in the method has the property that the distribution characteristics of phonemes are closer to the distribution characteristics of audio frames. Therefore, the distribution characteristics of each reference audio frame can be determined more accurately based on the distribution characteristics of each reference phoneme through the audio generation model, so that fewer pronunciation errors occur in the reference audio signal generated based on the distribution characteristics of each reference audio frame, and the pronunciation stability and the audio quality are improved.
The foregoing describes the training method and the audio generation method of the audio generation model according to the embodiments of the present application from the perspective of method steps; an overall description is given below. Referring to fig. 6, fig. 6 is a training schematic diagram of an audio generation model according to an embodiment of the present application. In the embodiment of the application, training data is first prepared, and then the training data is used to train the first network model to obtain the audio generation model.
Optionally, the training data includes text data and audio data, where the text data corresponds to the first text, the second text, the third text, etc. mentioned above, and the audio data corresponds to the first audio, the second audio, the third audio, etc. mentioned above. The text data is text in a certain language, and the audio data is an audio signal obtained by recording the sound produced when a pronunciation object utters the text data. The audio data totals about 300 hours; for each training run, about 20 hours of audio data are extracted for training, and the training framework is shown in fig. 7. The first network model includes a decoder, a linear mapping layer, a text encoder, a standard stream network, and a duration predictor.
For audio data, first, a linear spectrogram of the audio data is determined. Then, the linear spectrogram is input into an audio encoder to obtain the audio characteristics of each audio frame. In one aspect, the audio features of each audio frame are decoded by a decoder to obtain a reconstructed audio signal. On the other hand, the audio features of each audio frame are converted into the distribution features of each audio frame through the standard streaming network.
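For illustration, a linear spectrogram can be obtained with a short-time Fourier transform as sketched below; the FFT size and hop length are assumed values, not parameters from the embodiment.

```python
import torch

def linear_spectrogram(waveform: torch.Tensor, n_fft: int = 1024, hop_length: int = 256):
    """Returns the magnitude (linear) spectrogram that is fed to the audio encoder."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      win_length=n_fft, window=window, return_complex=True)
    return spec.abs()   # [freq_bins, num_audio_frames]
```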
For text data, first, a phoneme representation sequence of the text data is determined, the phoneme representation sequence corresponding to the first text representation, the second text representation, the third text representation, etc. mentioned above. Next, the phoneme characterization sequence is encoded by a text encoder to obtain the text features of each phoneme. On the one hand, the text features of each phoneme are subjected to linear mapping through a linear mapping layer to obtain the distribution features of each phoneme, and then the distribution features of each audio frame and the distribution features of each phoneme are aligned by using a monotonic alignment search algorithm to obtain the first number of audio frames corresponding to each phoneme. On the other hand, a second number of audio frames corresponding to each phoneme is determined by the duration predictor based on the text features of each phoneme.
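A minimal NumPy sketch of a monotonic alignment search of this general kind is given below; it assumes the log-likelihood of each audio frame's distribution feature under each phoneme's Gaussian has already been computed, and it only illustrates the dynamic-programming idea rather than reproducing the exact algorithm of the embodiment.

```python
import numpy as np

def monotonic_alignment_search(log_lik: np.ndarray) -> np.ndarray:
    """log_lik[i, j]: log-likelihood of audio frame j under phoneme i.
    Returns, for each phoneme, the first number of corresponding audio frames,
    under a monotonic alignment that skips no phoneme."""
    num_phonemes, num_frames = log_lik.shape
    q = np.full((num_phonemes, num_frames), -np.inf)
    q[0, 0] = log_lik[0, 0]
    for j in range(1, num_frames):
        for i in range(min(j + 1, num_phonemes)):
            stay = q[i, j - 1]
            advance = q[i - 1, j - 1] if i > 0 else -np.inf
            q[i, j] = log_lik[i, j] + max(stay, advance)
    # Backtrack the best path and count frames per phoneme.
    counts = np.zeros(num_phonemes, dtype=int)
    i = num_phonemes - 1
    for j in range(num_frames - 1, -1, -1):
        counts[i] += 1
        if i > 0 and (j == i or q[i - 1, j - 1] > q[i, j - 1]):
            i -= 1
    return counts
```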
Then, the parameters of the decoder, the linear mapping layer, the text encoder, the standard stream network, and the duration predictor in the first network model are adjusted based on the reconstructed audio signal, the audio data, the first number, the second number, the distribution features of each phoneme, and the distribution features of each audio frame, to obtain a second network model.
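As stated in claim 1 below, the first feature loss is a forward KL divergence and the second feature loss is a reverse KL divergence between the two sets of distribution features. A minimal sketch for diagonal Gaussians is given below; it assumes the phoneme-level features have already been expanded or aligned to the same frame grid as the audio-frame features, and which distribution plays the role of the first argument in each direction is an assumption of this sketch.

```python
import torch

def gaussian_kl(mean_p, log_var_p, mean_q, log_var_q):
    """Element-wise KL( N(mean_p, var_p) || N(mean_q, var_q) ) for diagonal Gaussians."""
    return 0.5 * (log_var_q - log_var_p
                  + (log_var_p.exp() + (mean_p - mean_q) ** 2) / log_var_q.exp()
                  - 1.0)

def feature_losses(mean_ph, log_var_ph, mean_fr, log_var_fr):
    """First feature loss: KL taken from the phoneme distributions toward the
    audio-frame distributions; second feature loss: the reverse direction."""
    first_loss = gaussian_kl(mean_ph, log_var_ph, mean_fr, log_var_fr).mean()
    second_loss = gaussian_kl(mean_fr, log_var_fr, mean_ph, log_var_ph).mean()
    return first_loss, second_loss
```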
Then, following the same processing of the audio data and the text data, the reconstructed audio signal, the audio data, the first number, the second number, the distribution features of each phoneme, and the distribution features of each audio frame are determined through the second network model, and based on this determined information, the parameters of the decoder are adjusted while the linear mapping layer, the text encoder, the standard stream network, and the duration predictor are kept fixed, to obtain a third network model.
The decoder in the third network model corresponds to the third decoder mentioned above. A fourth decoder with fewer parameters can be constructed based on the third decoder; following the same processing of the audio data and the text data, the fourth decoder is distilled using the third decoder to obtain the reference decoder, and the target decoder is obtained by reconstructing the reference decoder. The audio generation model includes the target decoder, the linear mapping layer, the text encoder, the standard stream network, and the duration predictor, and can be used to generate an audio signal; the application framework is shown in fig. 8.
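A minimal sketch of the distillation idea is shown below; the use of an L1 loss as the third signal loss and the optimizer details are assumptions, and `third_decoder` / `fourth_decoder` are assumed to be callable modules mapping audio features to waveforms.

```python
import torch
import torch.nn.functional as F

def distillation_step(third_decoder, fourth_decoder, audio_feats, optimizer):
    """One distillation step: only the smaller fourth decoder is updated so that
    its reconstructed signal approaches that of the frozen third decoder."""
    with torch.no_grad():
        teacher_signal = third_decoder(audio_feats)      # third reconstructed audio signal
    student_signal = fourth_decoder(audio_feats)         # fourth reconstructed audio signal
    loss = F.l1_loss(student_signal, teacher_signal)     # assumed form of the third signal loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```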
First, a phoneme representation sequence of the text data is determined, which corresponds to the above-mentioned reference text representation or the like. Next, the phoneme characterization sequence is encoded by a text encoder to obtain the text features of each phoneme. On the one hand, the text features of each phoneme are subjected to linear mapping through a linear mapping layer, so that the distribution features of each phoneme are obtained. On the other hand, the number of audio frames corresponding to each phoneme is determined by the duration predictor based on the text characteristics of each phoneme, for example, the number of audio frames corresponding to three phonemes is 1.8, 1.2 and 0.9, respectively.
Then, the number of audio frames corresponding to each phoneme is rounded up to obtain the copy number corresponding to each phoneme. For example, the numbers of audio frames 1.8, 1.2, and 0.9 corresponding to the three phonemes are rounded up to obtain the copy numbers 2, 2, and 1 corresponding to the three phonemes. The distribution features of each phoneme are then expanded according to the copy number corresponding to each phoneme to obtain the distribution features of each audio frame. Then, the distribution features of each audio frame are mapped into the audio features of each audio frame through the standard stream network, and the audio features of each audio frame are decoded through the target decoder to obtain the audio signal.
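Putting the pieces together, the inference path of fig. 8 can be sketched as follows. All module arguments are assumed callables matching the earlier sketches, and the sampling of per-frame features from the predicted Gaussians before the inverse flow is an assumption rather than a statement of the embodiment.

```python
import math
import torch

def generate_audio(text_encoder, mapping_layer, duration_predictor, flow, target_decoder, phonemes):
    """Sketch of the application framework: phoneme representations in, waveform out."""
    text_feats = text_encoder(phonemes)                      # [1, text_dim, num_phonemes]
    mean, log_var = mapping_layer(text_feats)                # per-phoneme distribution features
    counts = duration_predictor(text_feats)[0]               # e.g. tensor([1.8, 1.2, 0.9])
    copies = torch.tensor([max(1, math.ceil(c)) for c in counts.tolist()])  # 2, 2, 1
    idx = torch.repeat_interleave(torch.arange(mean.size(2)), copies)
    mean_f, log_var_f = mean[0, :, idx], log_var[0, :, idx]  # per-frame distribution features
    z = mean_f + torch.randn_like(mean_f) * (0.5 * log_var_f).exp()
    audio_feats = flow(z.unsqueeze(0), reverse=True)         # distribution -> audio features
    return target_decoder(audio_feats)                       # the generated audio signal
```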
The embodiment of the present application also tests an audio generation model (denoted as model 1) of the related art and an audio generation model (denoted as model 2) shown in fig. 8, and results of the tests are shown in table 1 below.
TABLE 1
Model      MOS     Pronunciation error rate     Real-time rate
Model 1    4.22    0.2%                         0.88
Model 2    4.20    0.001%                       0.15
Table 1 compares the performance of model 1 and model 2 in terms of MOS (Mean Opinion Score), pronunciation error rate, and real-time rate, where a larger MOS value indicates better model performance, while smaller pronunciation error rate and real-time rate values indicate better model performance. From the test results, although the MOS values of model 1 and model 2 are close, the pronunciation error rate and the real-time rate of model 2 are significantly lower than those of model 1, which indicates that the audio generation model shown in fig. 8 generates audio with higher stability and, with a MOS similar to that of model 1, improves the audio generation efficiency by about 6 times.
Fig. 9 is a schematic structural diagram of an audio generation model training device according to an embodiment of the present application. As shown in fig. 9, the device includes the following modules.
The obtaining module 901 is configured to obtain a first text and a first audio, where the content of the first text is the same as that of the first audio, the first text includes a plurality of first phonemes, and the first audio includes a plurality of first audio frames.
A determining module 902, configured to determine, through a first network model, a distribution characteristic of each first phoneme included in the first text, where the distribution characteristic of the first phoneme is used to describe the first phoneme and accords with a reference statistical distribution.
The determining module 902 is further configured to determine, by using the first network model, a distribution characteristic of each first audio frame included in the first audio, where the distribution characteristic of the first audio frame is used to describe the first audio frame and conforms to the reference statistical distribution.
The determining module 902 is further configured to determine a first feature loss from the distribution feature of each first phoneme to the distribution feature of each first audio frame and a second feature loss from the distribution feature of each first audio frame to the distribution feature of each first phoneme.
The training module 903 is configured to train the first network model based on the first feature loss and the second feature loss to obtain an audio generation model, where the audio generation model is configured to generate a reference audio signal based on the reference text.
In a possible implementation manner, the determining module 902 is configured to encode the first text through the first network model to obtain text features of each first phoneme; and mapping the text features of each first phoneme through the first network model to obtain the distribution features of each first phoneme.
In one possible implementation, the apparatus further includes the following.
And the alignment module is used for aligning each first phoneme with each first audio frame based on the distribution characteristics of each first phoneme and the distribution characteristics of each first audio frame to obtain a first number of the first audio frames corresponding to each first phoneme.
The determining module 902 is further configured to determine, based on the text feature of each first phoneme, a second number of first audio frames corresponding to each first phoneme.
The determining module 902 is further configured to determine a number penalty between the first number and the second number of corresponding first audio frames for each first phoneme.
The training module 903 is configured to train the first network model based on the number loss, the first feature loss, and the second feature loss, to obtain an audio generation model.
In one possible implementation, the determining module 902 is configured to encode the first audio to obtain audio features of each first audio frame; and mapping the audio characteristics of each first audio frame through the first network model to obtain the distribution characteristics of each first audio frame.
In one possible implementation, the first audio is a first sample audio signal or a spectrogram of the first sample audio signal, the first network model comprising a first decoder; the apparatus further comprises the following.
And the decoding module is used for decoding the audio characteristics of each first audio frame through the first decoder to obtain a first reconstructed audio signal.
The determining module 902 is further configured to determine a first signal loss between the first sample audio signal and the first reconstructed audio signal.
The training module 903 is configured to train the first network model based on the first signal loss, the first feature loss, and the second feature loss, to obtain an audio generation model.
In one possible implementation, the first decoder includes a first input layer, at least two first convolution layers, and a first output layer, any one of the first convolution layers including at least two convolution kernels of different convolution sizes for the same hole coefficient, the convolution kernels of different first convolution layers corresponding to different hole coefficients.
The decoding module is used for converting the audio characteristics of each first audio frame into the input characteristics of the first channel number through the first input layer; carrying out cavity convolution on the input features of the first channel number through each convolution kernel included in the first convolution layer to obtain convolution results corresponding to each convolution kernel, and adding the convolution results corresponding to each convolution kernel to obtain the output features of the first convolution layer; for any one of the first convolution layers except the first one, carrying out hole convolution on the output characteristics of the previous first convolution layer through each convolution kernel included in any one of the first convolution layers to obtain convolution results corresponding to each convolution kernel, and adding the convolution results corresponding to each convolution kernel to obtain the output characteristics of any one of the first convolution layers; the output characteristics of the last first convolution layer are converted into a first reconstructed audio signal by the first output layer.
In one possible implementation, the training module 903 is configured to adjust parameters of the first network model based on the first feature loss and the second feature loss to obtain a second network model, where the second network model includes a feature processing network and a second decoder; acquiring second audio, wherein the second audio is a second sample audio signal or a spectrogram of the second sample audio signal, and the second audio comprises a plurality of second audio frames; encoding the second audio to obtain audio characteristics of each second audio frame; decoding the audio features of each second audio frame by a second decoder to obtain a second reconstructed audio signal; determining a second signal loss between the second sample audio signal and the second reconstructed audio signal; adjusting parameters of the second decoder based on the second signal loss to obtain a third decoder; an audio generation model is determined based on the feature processing network and the third decoder.
In a possible implementation manner, the obtaining module 901 is further configured to obtain a second text, where the second text and the second audio have the same content, and the second text includes a plurality of second phonemes.
The determining module 902 is further configured to determine, through the feature processing network, a distribution feature of each second phoneme included in the second text, and map an audio feature of each second audio frame to obtain a distribution feature of each second audio frame.
The determining module 902 is further configured to determine a third feature loss from the distribution feature of each second phoneme to the distribution feature of each second audio frame and a fourth feature loss from the distribution feature of each second audio frame to the distribution feature of each second phoneme.
The training module 903 is configured to adjust parameters of the second decoder based on the third feature loss, the fourth feature loss, and the second signal loss, to obtain a third decoder.
In one possible implementation, the training module 903 is configured to obtain third audio, where the third audio is a third sample audio signal or a spectrogram of the third sample audio signal, and the third audio includes a plurality of third audio frames; encode the third audio to obtain audio characteristics of each third audio frame; decode the audio features of each third audio frame through the third decoder to obtain a third reconstructed audio signal; decode the audio features of each third audio frame through a fourth decoder to obtain a fourth reconstructed audio signal, where the number of parameters of the fourth decoder is smaller than that of the third decoder; determine a third signal loss between the third reconstructed audio signal and the fourth reconstructed audio signal; adjust parameters of the fourth decoder based on the third signal loss to obtain a reference decoder; and determine an audio generation model based on the feature processing network and the reference decoder.
In one possible implementation, the reference decoder includes a reference input layer, at least two reference convolution layers, and a reference output layer, any one of the reference convolution layers including at least two convolution kernels of different convolution sizes for the same hole coefficient, the convolution kernels of different reference convolution layers corresponding to different hole coefficients.
The training module 903 is configured to fuse, for any one of the reference convolution layers, each convolution kernel included in any one of the reference convolution layers into a fused convolution kernel, to obtain a reconstructed convolution layer, where the fused convolution kernel has the same hole coefficient as the convolution kernel included in any one of the reference convolution layers, and the convolution size of the fused convolution kernel is not smaller than the convolution size of each convolution kernel included in any one of the reference convolution layers; splicing the reference input layer, the at least two reconstruction convolution layers and the reference output layer to obtain a target decoder; an audio generation model is determined based on the feature processing network and the target decoder.
In one possible implementation manner, the training module 903 is configured to fill parameters of a first convolution kernel for a first convolution kernel that includes a convolution size smaller than a convolution size of the fusion convolution kernel in any one of the reference convolution layers, so as to obtain a filled first convolution kernel, where the convolution size of the filled first convolution kernel is the same as the convolution size of the fusion convolution kernel; and determining a reconstruction convolution layer based on the parameters of the first convolution kernel and the parameters of the second convolution kernel after filling, wherein the convolution size of the second convolution kernel is the same as that of the fusion convolution kernel.
In the device, the distribution characteristics of each first phoneme in the first text and the distribution characteristics of each first audio frame in the first audio are determined through the first network model, and then the first characteristic loss and the second characteristic loss between the distribution characteristics of each first phoneme and the distribution characteristics of each first audio frame are determined. The distance between the distribution feature of each first phoneme and the distribution feature of each first audio frame in the case of fitting the distribution feature of each first audio frame using the distribution feature of each first phoneme is measured by the first feature loss. The distance between the distribution feature of each first phoneme and the distribution feature of each first audio frame in the case of fitting the distribution feature of each first phoneme using the distribution feature of each first audio frame is measured by the second feature loss. That is, the distance between the distribution feature of each first phoneme and the distribution feature of each first audio frame can be measured more accurately and more stably through the first feature loss and the second feature loss, so that after the first network model is trained based on the first feature loss and the second feature loss, the model can reduce the difference between the distribution feature of the phoneme and the distribution feature of the audio frame as much as possible, and the distribution feature of the audio frame can be determined according to the distribution feature of the phoneme, thereby realizing the generation of the audio signal. Because the distribution characteristics of the phonemes and the distribution characteristics of the audio frames are closer, the phenomenon that the audio signals have pronunciation errors is less, and the pronunciation stability and the audio quality are improved.
Fig. 10 is a schematic structural diagram of an audio generating device according to an embodiment of the present application. As shown in fig. 10, the device includes the following modules.
An obtaining module 1001 is configured to obtain a reference text, where the reference text includes a plurality of reference phonemes.
A determining module 1002, configured to determine distribution characteristics of each reference phoneme included in the reference text by using an audio generating model, where the distribution characteristics of the reference phonemes are used to describe the reference phonemes and conform to the reference statistical distribution, and the audio generating model is trained according to the method shown in the first aspect.
The determining module 1002 is further configured to determine, by using the audio generation model, a distribution characteristic of each reference audio frame based on a distribution characteristic of each reference phoneme, where the distribution characteristic of each reference audio frame is used to describe the reference audio frame and conforms to a reference statistical distribution.
And a generating module 1003, configured to generate, by using the audio generating model, a reference audio signal based on the distribution characteristics of each reference audio frame, where the reference audio corresponding to the reference audio signal is the same as the content of the reference text, and the reference audio includes each reference audio frame.
In one possible implementation, the determining module 1002 is configured to encode the reference text by using an audio generating model to obtain text features of each reference phoneme; and mapping the text characteristics of each reference phoneme through an audio generation model to obtain the distribution characteristics of each reference phoneme.
In a possible implementation manner, the determining module 1002 is further configured to determine, based on the text features of each reference phoneme, a number of reference audio frames corresponding to each reference phoneme.
The determining module 1002 is configured to expand, by using the audio generating model, the distribution characteristics of each reference phoneme based on the number of reference audio frames corresponding to each reference phoneme, so as to obtain the distribution characteristics of each reference audio frame.
In a possible implementation manner, the generating module 1003 is configured to map, through an audio generating model, distribution features of each reference audio frame to obtain audio features of each reference audio frame; and decoding the audio features of each reference audio frame through the audio generation model to obtain a reference audio signal.
The audio generation model used in the device has the property that the distribution characteristics of phonemes are closer to the distribution characteristics of audio frames. Therefore, the distribution characteristics of each reference audio frame can be determined more accurately based on the distribution characteristics of each reference phoneme through the audio generation model, so that fewer pronunciation errors occur in the reference audio signal generated based on the distribution characteristics of each reference audio frame, and the pronunciation stability and the audio quality are improved.
It should be understood that, when the apparatus provided in fig. 9 and fig. 10 implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functional modules may be allocated to different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Fig. 11 shows a block diagram of a terminal device 1100 according to an exemplary embodiment of the present application. The terminal device 1100 includes: a processor 1101 and a memory 1102.
The processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1101 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor; the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1101 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one computer program for execution by processor 1101 to implement the training method or audio generation method of the audio generation model provided by the method embodiments in the present application.
In some embodiments, the terminal device 1100 may further optionally include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102, and peripheral interface 1103 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1103 by buses, signal lines or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, a display screen 1105, a camera assembly 1106, audio circuitry 1107, and a power supply 1108.
A peripheral interface 1103 may be used to connect I/O (Input/Output) related at least one peripheral device to the processor 1101 and memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or both of the processor 1101, memory 1102, and peripheral interface 1103 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1104 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1104 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1104 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1104 may also include NFC (Near Field Communication) related circuitry, which is not limited in this application.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1105 is a touch display, the display 1105 also has the ability to collect touch signals at or above the surface of the display 1105. The touch signal may be input to the processor 1101 as a control signal for processing. At this time, the display screen 1105 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, the display 1105 may be one and disposed on the front panel of the terminal device 1100; in other embodiments, the display 1105 may be at least two, and disposed on different surfaces of the terminal device 1100 or in a folded design; in other embodiments, the display 1105 may be a flexible display disposed on a curved surface or a folded surface of the terminal device 1100. Even more, the display 1105 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display 1105 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 1106 is used to capture images or video. Optionally, the camera assembly 1106 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and Virtual Reality (VR) shooting functions or other fusion shooting functions. In some embodiments, the camera assembly 1106 may also include a flash. The flash can be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1101 for processing, or inputting the electric signals to the radio frequency circuit 1104 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be provided at different portions of the terminal device 1100, respectively. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1107 may also include a headphone jack.
A power supply 1108 is used to power the various components in terminal device 1100. The power supply 1108 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power source 1108 comprises a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal device 1100 also includes one or more sensors 1109. The one or more sensors 1109 include, but are not limited to: acceleration sensor 1111, gyroscope sensor 1112, pressure sensor 1113, optical sensor 1114, and proximity sensor 1115.
The acceleration sensor 1111 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established in the terminal apparatus 1100. For example, the acceleration sensor 1111 may be configured to detect components of gravitational acceleration in three coordinate axes. The processor 1101 may control the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 1111. Acceleration sensor 1111 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 1112 may detect a body direction and a rotation angle of the terminal device 1100, and the gyro sensor 1112 may collect a 3D motion of the user on the terminal device 1100 in cooperation with the acceleration sensor 1111. The processor 1101 may implement the following functions based on the data collected by the gyro sensor 1112: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 1113 may be disposed at a side frame of the terminal device 1100 and/or at a lower layer of the display screen 1105. When the pressure sensor 1113 is provided at a side frame of the terminal apparatus 1100, a grip signal of the terminal apparatus 1100 by a user can be detected, and the processor 1101 performs left-right hand recognition or quick operation based on the grip signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed at the lower layer of the display screen 1105, the processor 1101 realizes control of the operability control on the UI interface according to the pressure operation of the user on the display screen 1105. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1114 is used to collect the ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 based on the intensity of ambient light collected by the optical sensor 1114. Specifically, when the intensity of the ambient light is high, the display luminance of the display screen 1105 is turned up; when the ambient light intensity is low, the display luminance of the display screen 1105 is turned down. In another embodiment, the processor 1101 may also dynamically adjust the shooting parameters of the camera assembly 1106 based on the intensity of ambient light collected by the optical sensor 1114.
A proximity sensor 1115, also referred to as a distance sensor, is typically provided on the front panel of the terminal device 1100. The proximity sensor 1115 is used to collect the distance between the user and the front surface of the terminal device 1100. In one embodiment, when the proximity sensor 1115 detects that the distance between the user and the front surface of the terminal device 1100 gradually decreases, the processor 1101 controls the display 1105 to switch from the bright screen state to the off screen state; when the proximity sensor 1115 detects that the distance between the user and the front surface of the terminal apparatus 1100 gradually increases, the processor 1101 controls the display screen 1105 to switch from the off-screen state to the on-screen state.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is not limiting and that terminal device 1100 may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 12 is a schematic structural diagram of a server provided in an embodiment of the present application. The server 1200 may vary greatly in configuration or performance, and may include one or more processors 1201 (for example, CPUs) and one or more memories 1202, where the one or more memories 1202 store at least one computer program, and the at least one computer program is loaded and executed by the one or more processors 1201 to implement the training method or the audio generation method of the audio generation model provided by the foregoing method embodiments. Of course, the server 1200 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input/output, and the server 1200 may also include other components for implementing device functions, which are not described herein.
In an exemplary embodiment, there is also provided a computer-readable storage medium having stored therein at least one computer program loaded and executed by a processor to cause an electronic device to implement a training method or an audio generation method of any of the above-described audio generation models.
Alternatively, the above-mentioned computer readable storage medium may be a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Read-Only optical disk (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program, where the computer program is loaded and executed by a processor to cause an electronic device to implement the training method or the audio generation method of any of the above-mentioned audio generation models.
In an exemplary embodiment, there is also provided a computer program product in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to cause an electronic device to implement a training method or an audio generation method of any of the above-mentioned audio generation models.
It should be understood that references herein to "a plurality" mean two or more. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
The foregoing embodiment numbers of the present application are for description only, and do not represent the advantages or disadvantages of the embodiments.
The foregoing description is merely of exemplary embodiments of the present application and is not intended to limit the present application; any modification, equivalent replacement, or improvement made within the principles of the present application shall fall within the protection scope of the present application.

Claims (32)

1. A method of training an audio generation model, the method comprising:
acquiring a first text and a first audio, wherein the first text and the first audio have the same content, the first text comprises a plurality of first phonemes, and the first audio comprises a plurality of first audio frames;
determining distribution characteristics of each first phoneme included in the first text through a first network model, wherein the distribution characteristics of the first phonemes are used for describing the first phonemes and accord with reference statistical distribution, and the distribution characteristics of the first phonemes comprise a mean value of the text characteristics of the first phonemes and a variance of the text characteristics of the first phonemes;
determining, by the first network model, a distribution characteristic of each first audio frame included in the first audio, the distribution characteristic of the first audio frame being used to describe the first audio frame and conforming to the reference statistical distribution, the distribution characteristic of the first audio frame including a mean value of the audio characteristics of the first audio frame and a variance of the audio characteristics of the first audio frame;
determining a first feature loss from the distribution feature of the respective first phones to the distribution feature of the respective first audio frames and a second feature loss from the distribution feature of the respective first audio frames to the distribution feature of the respective first phones, the first feature loss being a forward KL divergence and the second feature loss being a reverse KL divergence;
training the first network model based on the first characteristic loss and the second characteristic loss to obtain an audio generation model, wherein the audio generation model is used for generating a reference audio signal based on a reference text.
2. The method of claim 1, wherein determining, by the first network model, a distribution characteristic of each first phoneme included in the first text comprises:
encoding the first text through the first network model to obtain text characteristics of each first phoneme;
and mapping the text characteristics of each first phoneme through the first network model to obtain the distribution characteristics of each first phoneme.
3. The method according to claim 2, wherein the method further comprises:
aligning each first phoneme with each first audio frame based on the distribution characteristics of each first phoneme and the distribution characteristics of each first audio frame to obtain a first number of first audio frames corresponding to each first phoneme;
determining a second number of corresponding first audio frames of each first phoneme based on the text features of each first phoneme;
determining a number penalty between a first number and a second number of corresponding first audio frames for each first phoneme;
the training the first network model based on the first feature loss and the second feature loss to obtain an audio generation model includes:
training the first network model based on the number loss, the first feature loss and the second feature loss to obtain an audio generation model.
4. The method of claim 1, wherein said determining, by the first network model, a distribution characteristic of each first audio frame comprised by the first audio comprises:
encoding the first audio to obtain audio characteristics of each first audio frame;
and mapping the audio characteristics of each first audio frame through the first network model to obtain the distribution characteristics of each first audio frame.
5. The method of claim 4, wherein the first audio is a first sample audio signal or a spectrogram of the first sample audio signal, the first network model comprising a first decoder; the method further comprises the steps of:
decoding the audio features of each first audio frame through the first decoder to obtain a first reconstructed audio signal;
determining a first signal loss between the first sample audio signal and the first reconstructed audio signal;
the training the first network model based on the first feature loss and the second feature loss to obtain an audio generation model includes:
and training the first network model based on the first signal loss, the first characteristic loss and the second characteristic loss to obtain an audio generation model.
6. The method of claim 5, wherein the first decoder comprises a first input layer, at least two first convolution layers, and a first output layer, any one of the first convolution layers comprising at least two convolution kernels of the same hole coefficient and different convolution sizes, the convolution kernels of the different first convolution layers corresponding to different hole coefficients;
the decoding, by the first decoder, the audio features of the first audio frames to obtain a first reconstructed audio signal, including:
converting audio features of the first audio frames into input features of a first channel number through the first input layer;
carrying out cavity convolution on the input features of the first channel number through each convolution kernel included in a first convolution layer to obtain convolution results corresponding to each convolution kernel, and adding the convolution results corresponding to each convolution kernel to obtain the output features of the first convolution layer;
for any one of the first convolution layers except the first one, carrying out cavity convolution on the output features of the last first convolution layer through each convolution kernel included in the any one of the first convolution layers to obtain convolution results corresponding to each convolution kernel, and adding the convolution results corresponding to each convolution kernel to obtain the output features of the any one of the first convolution layers;
the output characteristics of the last first convolution layer are converted into the first reconstructed audio signal by the first output layer.
7. The method according to any one of claims 1 to 6, wherein the training the first network model based on the first feature loss and the second feature loss to obtain an audio generation model comprises:
adjusting parameters of the first network model based on the first characteristic loss and the second characteristic loss to obtain a second network model, wherein the second network model comprises a characteristic processing network and a second decoder;
acquiring second audio, wherein the second audio is a second sample audio signal or a spectrogram of the second sample audio signal, and the second audio comprises a plurality of second audio frames;
encoding the second audio to obtain audio characteristics of each second audio frame;
decoding the audio features of each second audio frame through the second decoder to obtain a second reconstructed audio signal;
determining a second signal loss between the second sample audio signal and the second reconstructed audio signal;
adjusting parameters of the second decoder based on the second signal loss to obtain a third decoder;
the audio generation model is determined based on the feature processing network and the third decoder.
8. The method of claim 7, wherein the method further comprises:
acquiring a second text, wherein the second text and the second audio have the same content, and the second text comprises a plurality of second phonemes;
determining the distribution characteristics of each second phoneme included in the second text through the characteristic processing network, and mapping the audio characteristics of each second audio frame to obtain the distribution characteristics of each second audio frame;
determining a third feature loss from the distribution feature of the respective second phones to the distribution feature of the respective second audio frames and a fourth feature loss from the distribution feature of the respective second audio frames to the distribution feature of the respective second phones;
said adjusting parameters of said second decoder based on said second signal loss to obtain a third decoder comprising:
and adjusting parameters of the second decoder based on the third characteristic loss, the fourth characteristic loss and the second signal loss to obtain a third decoder.
9. The method of claim 7, wherein the determining the audio generation model based on the feature processing network and the third decoder comprises:
acquiring third audio, wherein the third audio is a third sample audio signal or a spectrogram of the third sample audio signal, and the third audio comprises a plurality of third audio frames;
encoding the third audio to obtain audio characteristics of each third audio frame;
decoding the audio features of each third audio frame through the third decoder to obtain a third reconstructed audio signal;
decoding the audio features of each third audio frame through a fourth decoder to obtain a fourth reconstructed audio signal, wherein the number of parameters of the fourth decoder is smaller than that of the third decoder;
determining a third signal loss between the third reconstructed audio signal and the fourth reconstructed audio signal;
adjusting parameters of the fourth decoder based on the third signal loss to obtain a reference decoder;
the audio generation model is determined based on the feature processing network and the reference decoder.
10. The method of claim 9, wherein the reference decoder comprises a reference input layer, at least two reference convolutional layers, and a reference output layer, any one of the reference convolutional layers comprising at least two convolution kernels of different convolutional sizes for the same hole coefficient, the convolution kernels of different reference convolutional layers corresponding to different hole coefficients;
the determining the audio generation model based on the feature processing network and the reference decoder comprises:
for any one of the reference convolution layers, fusing each convolution kernel included in the any one of the reference convolution layers into a fusion convolution kernel to obtain a reconstruction convolution layer, wherein the fusion convolution kernel has the same cavity coefficient as the convolution kernel included in the any one of the reference convolution layers, and the convolution size of the fusion convolution kernel is not smaller than that of each convolution kernel included in the any one of the reference convolution layers;
splicing the reference input layer, at least two reconstruction convolution layers and the reference output layer to obtain a target decoder;
the audio generation model is determined based on the feature processing network and the target decoder.
11. The method of claim 10, wherein the fusing each convolution kernel included in the any one of the reference convolution layers into a fusion convolution kernel to obtain a reconstruction convolution layer comprises:
for a first convolution kernel which is included in the any one of the reference convolution layers and whose convolution size is smaller than that of the fusion convolution kernel, filling parameters of the first convolution kernel to obtain a filled first convolution kernel, wherein the convolution size of the filled first convolution kernel is the same as that of the fusion convolution kernel;
and determining the reconstruction convolution layer based on the parameters of the filled first convolution kernel and the parameters of a second convolution kernel, wherein the convolution size of the second convolution kernel is the same as that of the fusion convolution kernel.
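For illustration, claims 10 and 11 amount to a structural re-parameterisation: parallel kernels that share a hole (dilation) coefficient but differ in convolution size are merged into one fusion convolution kernel by zero-padding the smaller kernel to the larger size and adding parameters, biases included. The sketch below assumes 1-D convolutions whose branch outputs are summed, odd kernel sizes, and symmetric "same" padding so the fusion is exact; the helper name and sizes are illustrative. The final assertion checks that the reconstruction convolution layer reproduces the two-branch output on random input.

```python
import torch
import torch.nn as nn

def fuse_same_dilation(conv_small: nn.Conv1d, conv_large: nn.Conv1d) -> nn.Conv1d:
    """Fuse two parallel Conv1d branches whose outputs are summed: zero-pad the
    smaller kernel to the larger kernel size, then add weights and biases.
    Assumes both branches share dilation, stride, and channel counts, and use
    symmetric 'same'-style padding so their outputs align sample-for-sample."""
    k_small, k_large = conv_small.kernel_size[0], conv_large.kernel_size[0]
    d = conv_large.dilation[0]
    fused = nn.Conv1d(conv_large.in_channels, conv_large.out_channels, k_large,
                      dilation=d, padding=(k_large - 1) // 2 * d)
    pad = (k_large - k_small) // 2            # centre the small kernel in the large one
    padded_small = torch.nn.functional.pad(conv_small.weight, (pad, pad))
    with torch.no_grad():
        fused.weight.copy_(conv_large.weight + padded_small)
        fused.bias.copy_(conv_large.bias + conv_small.bias)
    return fused

# Check the fusion on random input: two kernels, same dilation, sizes 3 and 7.
x = torch.randn(1, 32, 200)
d = 3
small = nn.Conv1d(32, 32, 3, dilation=d, padding=1 * d)
large = nn.Conv1d(32, 32, 7, dilation=d, padding=3 * d)
fused = fuse_same_dilation(small, large)
assert torch.allclose(small(x) + large(x), fused(x), atol=1e-4)
```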
12. A method of audio generation, the method comprising:
acquiring a reference text, wherein the reference text comprises a plurality of reference phonemes;
determining distribution characteristics of each reference phoneme included in the reference text through an audio generation model, wherein the distribution characteristics of the reference phonemes are used for describing the reference phonemes and accord with reference statistical distribution, the distribution characteristics of the reference phonemes comprise a mean value of text characteristics of the reference phonemes and a variance of the text characteristics of the reference phonemes, and the audio generation model is trained according to the method of any one of claims 1 to 11;
determining, by the audio generation model, distribution characteristics of each reference audio frame based on the distribution characteristics of each reference phoneme, the distribution characteristics of the reference audio frame being used to describe the reference audio frame and conforming to the reference statistical distribution, the distribution characteristics of the reference audio frame including a mean of the audio characteristics of the reference audio frame and a variance of the audio characteristics of the reference audio frame;
and generating a reference audio signal based on the distribution characteristics of each reference audio frame through the audio generation model, wherein the content of the reference audio corresponding to the reference audio signal is identical to that of the reference text, and the reference audio comprises each reference audio frame.
13. The method of claim 12, wherein the determining, by an audio generation model, the distribution characteristics of each reference phoneme included in the reference text comprises:
coding the reference text through the audio generation model to obtain text characteristics of each reference phoneme;
and mapping the text characteristics of each reference phoneme through the audio generation model to obtain the distribution characteristics of each reference phoneme.
14. The method of claim 13, wherein the method further comprises:
determining the number of the reference audio frames corresponding to each reference phoneme based on the text characteristics of each reference phoneme;
the determining, by the audio generation model, the distribution characteristics of each reference audio frame based on the distribution characteristics of each reference phoneme includes:
for any one reference phoneme, determining the number of copies based on the number of reference audio frames corresponding to the any one reference phoneme through the audio generation model, and copying the distribution characteristics of the any one reference phoneme according to the number of copies, wherein the distribution characteristics of each copied reference phoneme are the distribution characteristics of one reference audio frame corresponding to the any one reference phoneme.
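For illustration, claim 14 expands phoneme-level distribution features to frame level by copying each phoneme's feature as many times as its predicted frame count. A minimal length-regulator sketch under that reading follows; the tensor layout and the toy durations are assumptions.

```python
import torch

def length_regulate(phoneme_dist, durations):
    """Repeat each reference phoneme's distribution feature 'duration' times so the
    copies become per-frame distribution features.
    phoneme_dist: (num_phonemes, feat_dim); durations: int tensor (num_phonemes,)."""
    return torch.repeat_interleave(phoneme_dist, durations, dim=0)

# Toy usage: 4 phonemes, 2-D distribution features (e.g. a mean and a variance entry).
phoneme_dist = torch.tensor([[0.1, 1.0],
                             [0.2, 1.5],
                             [0.3, 0.8],
                             [0.4, 1.2]])
durations = torch.tensor([2, 3, 1, 4])     # frames per phoneme from the duration predictor
frame_dist = length_regulate(phoneme_dist, durations)
print(frame_dist.shape)                    # torch.Size([10, 2]) -> 10 reference audio frames
```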
15. The method of claim 12, wherein generating, by the audio generation model, a reference audio signal based on the distribution characteristics of the respective reference audio frames comprises:
mapping the distribution characteristics of each reference audio frame through the audio generation model to obtain the audio characteristics of each reference audio frame;
and decoding the audio features of each reference audio frame through the audio generation model to obtain a reference audio signal.
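For illustration, claim 15 maps per-frame distribution features to audio features and then decodes them into the reference audio signal. One plausible reading of the mapping step, shown below, is reparameterised sampling from each frame's Gaussian followed by a stand-in decoder; the temperature, feature dimension, and upsampling factor are assumptions rather than values taken from the patent.

```python
import torch

def sample_audio_features(frame_mu, frame_logvar, temperature=0.667):
    """Draw per-frame audio features from the frame-level Gaussian via the
    reparameterisation trick; the temperature is an assumed inference-time knob."""
    std = (0.5 * frame_logvar).exp()
    return frame_mu + temperature * std * torch.randn_like(frame_mu)

# Toy usage with a placeholder decoder; the patent's decoder is the dilated
# (cavity) convolution stack described in claims 10 and 21.
T, D = 10, 80
frame_mu, frame_logvar = torch.randn(T, D), torch.full((T, D), -2.0)
features = sample_audio_features(frame_mu, frame_logvar)      # (frames, feat_dim)
decoder = torch.nn.ConvTranspose1d(D, 1, kernel_size=256, stride=256)
reference_audio = decoder(features.T.unsqueeze(0))            # (1, 1, samples)
print(reference_audio.shape)                                  # torch.Size([1, 1, 2560])
```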
16. A training device for an audio generation model, the device comprising:
the device comprises an acquisition module, a determining module and a training module, wherein the acquisition module is used for acquiring a first text and a first audio, the contents of the first text and the first audio are the same, the first text comprises a plurality of first phonemes, and the first audio comprises a plurality of first audio frames;
a determining module, configured to determine, through a first network model, a distribution feature of each first phoneme included in the first text, where the distribution feature of the first phoneme is used to describe the first phoneme and conforms to a reference statistical distribution, and the distribution feature of the first phoneme includes a mean value of text features of the first phoneme and a variance of the text features of the first phoneme;
The determining module is further configured to determine, through the first network model, a distribution characteristic of each first audio frame included in the first audio, where the distribution characteristic of the first audio frame is used to describe the first audio frame and conforms to the reference statistical distribution, and the distribution characteristic of the first audio frame includes a mean value of audio characteristics of the first audio frame and a variance of audio characteristics of the first audio frame;
the determining module is further configured to determine a first feature loss from the distribution feature of each first phoneme to the distribution feature of each first audio frame and a second feature loss from the distribution feature of each first audio frame to the distribution feature of each first phoneme, where the first feature loss is a forward KL divergence and the second feature loss is a reverse KL divergence;
the training module is used for training the first network model based on the first feature loss and the second feature loss to obtain an audio generation model, and the audio generation model is used for generating a reference audio signal based on a reference text.
17. The apparatus of claim 16, wherein the determining module is configured to encode the first text through the first network model to obtain text features of the respective first phonemes; and mapping the text features of each first phoneme through the first network model to obtain the distribution features of each first phoneme.
18. The apparatus of claim 17, wherein the apparatus further comprises:
the alignment module is used for aligning each first phoneme and each first audio frame based on the distribution characteristics of each first phoneme and the distribution characteristics of each first audio frame to obtain a first number of first audio frames corresponding to each first phoneme;
the determining module is further configured to determine, based on the text features of the first phonemes, a second number of first audio frames corresponding to the first phonemes;
the determining module is further configured to determine a number loss between a first number and a second number of the first audio frames corresponding to the first phonemes;
the training module is configured to train the first network model based on the number loss, the first feature loss, and the second feature loss, and obtain an audio generation model.
19. The apparatus of claim 16, wherein the determining module is configured to encode the first audio to obtain audio characteristics of the respective first audio frames; and mapping the audio characteristics of each first audio frame through the first network model to obtain the distribution characteristics of each first audio frame.
20. The apparatus of claim 19, wherein the first audio is a first sample audio signal or a spectrogram of the first sample audio signal, the first network model comprising a first decoder; the apparatus further comprises:
the decoding module is used for decoding the audio characteristics of each first audio frame through the first decoder to obtain a first reconstructed audio signal;
the determining module is further configured to determine a first signal loss between the first sample audio signal and the first reconstructed audio signal;
the training module is configured to train the first network model based on the first signal loss, the first feature loss, and the second feature loss, to obtain an audio generation model.
21. The apparatus of claim 20, wherein the first decoder comprises a first input layer, at least two first convolution layers, and a first output layer, any one of the first convolution layers comprising at least two convolution kernels of the same hole coefficient and different convolution sizes, the convolution kernels of the different first convolution layers corresponding to different hole coefficients;
the decoding module is used for converting the audio characteristics of each first audio frame into input features of a first channel number through the first input layer; carrying out cavity convolution on the input features of the first channel number through each convolution kernel included in the first one of the first convolution layers to obtain convolution results corresponding to each convolution kernel, and adding the convolution results corresponding to each convolution kernel to obtain the output features of the first one of the first convolution layers; for any one of the first convolution layers except the first one, carrying out cavity convolution on the output features of the preceding first convolution layer through each convolution kernel included in the any one of the first convolution layers to obtain convolution results corresponding to each convolution kernel, and adding the convolution results corresponding to each convolution kernel to obtain the output features of the any one of the first convolution layers; and converting the output features of the last first convolution layer into the first reconstructed audio signal through the first output layer.
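For illustration, claim 21 walks through the first decoder: an input layer that changes the channel count, stacked convolution layers whose parallel cavity (dilated) convolution kernels share a hole coefficient but differ in size and whose results are added, and an output layer that produces the reconstructed signal. The sketch below mirrors that flow at frame resolution only; the channel counts, kernel sizes, dilations, tanh output, and the omission of upsampling and in-layer activations are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiKernelDilatedLayer(nn.Module):
    """One 'first convolution layer': parallel dilated (cavity) convolutions that
    share a hole coefficient but differ in kernel size, with outputs summed."""
    def __init__(self, channels, kernel_sizes=(3, 7), dilation=1):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, k, dilation=dilation,
                      padding=(k - 1) // 2 * dilation)
            for k in kernel_sizes])

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

class SketchDecoder(nn.Module):
    """Input layer -> stacked multi-kernel layers with different hole coefficients
    -> output layer. A real decoder would also upsample frames to waveform samples."""
    def __init__(self, feat_dim=80, channels=64, dilations=(1, 3, 9)):
        super().__init__()
        self.input_layer = nn.Conv1d(feat_dim, channels, 7, padding=3)
        self.conv_layers = nn.ModuleList([
            MultiKernelDilatedLayer(channels, dilation=d) for d in dilations])
        self.output_layer = nn.Conv1d(channels, 1, 7, padding=3)

    def forward(self, audio_features):              # (batch, feat_dim, frames)
        x = self.input_layer(audio_features)
        for layer in self.conv_layers:
            x = layer(x)                             # each layer feeds the next
        return torch.tanh(self.output_layer(x))     # (batch, 1, frames)

print(SketchDecoder()(torch.randn(2, 80, 120)).shape)   # torch.Size([2, 1, 120])
```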
22. The apparatus of any one of claims 16 to 21, wherein the training module is configured to adjust parameters of the first network model based on the first feature loss and the second feature loss to obtain a second network model, the second network model including a feature processing network and a second decoder; acquiring second audio, wherein the second audio is a second sample audio signal or a spectrogram of the second sample audio signal, and the second audio comprises a plurality of second audio frames; encoding the second audio to obtain audio characteristics of each second audio frame; decoding the audio features of each second audio frame through the second decoder to obtain a second reconstructed audio signal; determining a second signal loss between the second sample audio signal and the second reconstructed audio signal; adjusting parameters of the second decoder based on the second signal loss to obtain a third decoder; the audio generation model is determined based on the feature processing network and the third decoder.
23. The apparatus of claim 22, wherein the obtaining module is further configured to obtain a second text, the content of the second text being the same as that of the second audio, the second text comprising a plurality of second phonemes;
The determining module is further configured to determine, through the feature processing network, a distribution feature of each second phoneme included in the second text, and map an audio feature of each second audio frame to obtain a distribution feature of each second audio frame;
the determining module is further configured to determine a third feature loss from the distribution feature of each second phoneme to the distribution feature of each second audio frame and a fourth feature loss from the distribution feature of each second audio frame to the distribution feature of each second phoneme;
the training module is configured to adjust parameters of the second decoder based on the third feature loss, the fourth feature loss, and the second signal loss, to obtain a third decoder.
24. The apparatus of claim 22, wherein the training module is configured to obtain a third audio, the third audio being a third sample audio signal or a spectrogram of the third sample audio signal, the third audio comprising a plurality of third audio frames; encoding the third audio to obtain audio characteristics of each third audio frame; decoding the audio features of each third audio frame through the third decoder to obtain a third reconstructed audio signal; decoding the audio features of each third audio frame through a fourth decoder to obtain a fourth reconstructed audio signal, wherein the number of parameters of the fourth decoder is smaller than that of the third decoder; determining a third signal loss between the third reconstructed audio signal and the fourth reconstructed audio signal; adjusting parameters of the fourth decoder based on the third signal loss to obtain a reference decoder; the audio generation model is determined based on the feature processing network and the reference decoder.
25. The apparatus of claim 24, wherein the reference decoder comprises a reference input layer, at least two reference convolutional layers, and a reference output layer, any one of the reference convolutional layers comprising at least two convolution kernels of different convolution sizes for the same hole coefficient, the convolution kernels of different reference convolutional layers corresponding to different hole coefficients;
the training module is configured to fuse, for the any one of the reference convolution layers, each convolution kernel included in the any one of the reference convolution layers into a fused convolution kernel, so as to obtain a reconstructed convolution layer, where the fused convolution kernel has the same hole coefficient as the convolution kernel included in the any one of the reference convolution layers, and the convolution size of the fused convolution kernel is not smaller than the convolution size of each convolution kernel included in the any one of the reference convolution layers; splicing the reference input layer, at least two reconstruction convolution layers and the reference output layer to obtain a target decoder; the audio generation model is determined based on the feature processing network and the target decoder.
26. The apparatus of claim 25, wherein the training module is configured to, for a first convolution kernel included in the any one of the reference convolution layers and having a convolution size smaller than a convolution size of the fused convolution kernel, fill parameters of the first convolution kernel to obtain a filled first convolution kernel, where the convolution size of the filled first convolution kernel is the same as the convolution size of the fused convolution kernel; and determining the reconstruction convolution layer based on the parameters of the first convolution kernel after filling and the parameters of a second convolution kernel, wherein the convolution size of the second convolution kernel is the same as that of the fusion convolution kernel.
27. An audio generating apparatus, the apparatus comprising:
the acquisition module is used for acquiring a reference text, wherein the reference text comprises a plurality of reference phonemes;
a determining module, configured to determine, by using an audio generation model, distribution characteristics of each reference phoneme included in the reference text, where the distribution characteristics of the reference phonemes are used to describe the reference phonemes and conform to a reference statistical distribution, the distribution characteristics of the reference phonemes include a mean value of text characteristics of the reference phonemes and a variance of text characteristics of the reference phonemes, and the audio generation model is trained according to the method of any one of claims 1 to 11;
the determining module is further configured to determine, by using the audio generation model, a distribution characteristic of each reference audio frame based on a distribution characteristic of each reference phoneme, where the distribution characteristic of each reference audio frame is used to describe the reference audio frame and conforms to the reference statistical distribution, and the distribution characteristic of each reference audio frame includes a mean value of audio characteristics of the reference audio frame and a variance of audio characteristics of the reference audio frame;
and the generation module is used for generating a reference audio signal based on the distribution characteristics of each reference audio frame through the audio generation model, wherein the content of the reference audio corresponding to the reference audio signal is the same as that of the reference text, and the reference audio comprises each reference audio frame.
28. The apparatus of claim 27, wherein the determining module is configured to encode the reference text by the audio generation model to obtain text features of the respective reference phonemes; and mapping the text characteristics of each reference phoneme through the audio generation model to obtain the distribution characteristics of each reference phoneme.
29. The apparatus of claim 28, wherein the determining module is further configured to determine the number of reference audio frames corresponding to each reference phoneme based on the text characteristics of each reference phoneme;
the determining module is configured to determine, for any reference phoneme, a number of copies based on a number of reference audio frames corresponding to the any reference phoneme through the audio generating model, copy a distribution feature of the any reference phoneme according to the number of copies, where the distribution feature of each reference phoneme after copying is a distribution feature of one reference audio frame corresponding to the any reference phoneme.
30. The apparatus of claim 27, wherein the generating module is configured to map, by using the audio generating model, the distribution characteristics of the respective reference audio frames to obtain audio characteristics of the respective reference audio frames; and decoding the audio features of each reference audio frame through the audio generation model to obtain a reference audio signal.
31. An electronic device comprising a processor and a memory, wherein the memory stores at least one computer program that is loaded and executed by the processor to cause the electronic device to implement the training method of the audio generation model of any one of claims 1 to 11 or to implement the audio generation method of any one of claims 12 to 15.
32. A computer readable storage medium, wherein at least one computer program is stored in the computer readable storage medium, the at least one computer program being loaded and executed by a processor to cause an electronic device to implement the method of training an audio generation model according to any one of claims 1 to 11 or to implement the method of generating audio according to any one of claims 12 to 15.
CN202311351363.5A 2023-10-18 2023-10-18 Training method of audio generation model, audio generation method, device and equipment Active CN117116249B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311351363.5A CN117116249B (en) 2023-10-18 2023-10-18 Training method of audio generation model, audio generation method, device and equipment

Publications (2)

Publication Number Publication Date
CN117116249A CN117116249A (en) 2023-11-24
CN117116249B true CN117116249B (en) 2024-01-23

Family

ID=88796847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311351363.5A Active CN117116249B (en) 2023-10-18 2023-10-18 Training method of audio generation model, audio generation method, device and equipment

Country Status (1)

Country Link
CN (1) CN117116249B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409757A (en) * 2020-12-23 2021-09-17 腾讯科技(深圳)有限公司 Audio generation method, device, equipment and storage medium based on artificial intelligence
CN113870826A (en) * 2021-09-28 2021-12-31 平安科技(深圳)有限公司 Pronunciation duration prediction method based on duration prediction model and related equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10872596B2 (en) * 2017-10-19 2020-12-22 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech
US11727913B2 (en) * 2019-12-23 2023-08-15 Adobe Inc. Automatically associating context-based sounds with text
US11769481B2 (en) * 2021-10-07 2023-09-26 Nvidia Corporation Unsupervised alignment for text to speech synthesis using neural networks

Also Published As

Publication number Publication date
CN117116249A (en) 2023-11-24

Similar Documents

Publication Publication Date Title
CN111179961B (en) Audio signal processing method and device, electronic equipment and storage medium
CN108615526B (en) Method, device, terminal and storage medium for detecting keywords in voice signal
CN109978989B (en) Three-dimensional face model generation method, three-dimensional face model generation device, computer equipment and storage medium
CN111179962B (en) Training method of voice separation model, voice separation method and device
CN110263131B (en) Reply information generation method, device and storage medium
CN111680123B (en) Training method and device for dialogue model, computer equipment and storage medium
CN113763933B (en) Speech recognition method, training method, device and equipment of speech recognition model
CN111324699A (en) Semantic matching method and device, electronic equipment and storage medium
CN111223475B (en) Voice data generation method and device, electronic equipment and storage medium
CN111696570A (en) Voice signal processing method, device, equipment and storage medium
CN113823298B (en) Voice data processing method, device, computer equipment and storage medium
CN111048109A (en) Acoustic feature determination method and apparatus, computer device, and storage medium
CN114333774A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN117116249B (en) Training method of audio generation model, audio generation method, device and equipment
CN113920979B (en) Voice data acquisition method, device, equipment and computer readable storage medium
CN112151017B (en) Voice processing method, device, system, equipment and storage medium
CN112750425B (en) Speech recognition method, device, computer equipment and computer readable storage medium
CN116863042A (en) Motion generation method of virtual object and training method of motion generation model
CN114283825A (en) Voice processing method and device, electronic equipment and storage medium
CN111091807B (en) Speech synthesis method, device, computer equipment and storage medium
CN111310701B (en) Gesture recognition method, device, equipment and storage medium
CN114328815A (en) Text mapping model processing method and device, computer equipment and storage medium
CN111797754A (en) Image detection method, device, electronic equipment and medium
CN114866856B (en) Audio signal processing method, audio generation model training method and device
CN116364051A (en) Method, apparatus and storage medium for generating audio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant