CN115188362A - Speech synthesis model generation method and device, equipment, medium and product thereof


Info

Publication number
CN115188362A
Authority
CN
China
Prior art keywords
vocoder
training
synthesis model
speech
network
Prior art date
Legal status
Pending
Application number
CN202210803325.8A
Other languages
Chinese (zh)
Inventor
王汉超
林伟
Current Assignee
Bigo Technology Pte Ltd
Original Assignee
Bigo Technology Pte Ltd
Priority date
Filing date
Publication date
Application filed by Bigo Technology Pte Ltd filed Critical Bigo Technology Pte Ltd
Priority to CN202210803325.8A priority Critical patent/CN115188362A/en
Publication of CN115188362A publication Critical patent/CN115188362A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders, using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a method for generating a speech synthesis model, together with a corresponding apparatus, device, medium and product. The method comprises the following steps: calling a controller, the controller generating a structural code of a vocoder; constructing a vocoder according to the structural code, the vocoder comprising a conditional network and an autoregressive network generated from the structural code; iteratively training the vocoder to a convergence state using a training set and, while the controller has not yet converged, performing a gradient update on the controller according to the performance score obtained by the vocoder on a test set and iteratively generating a new vocoder; after the controller converges, selecting a vocoder as the speech synthesis model according to the performance scores. By means of the controller, vocoders are generated and screened automatically, so that the resulting speech synthesis model satisfies the constraints of mobile-terminal deployment, performs well after deployment, and meets the demands for model miniaturization and high real-time performance in speech synthesis scenarios.

Description

Speech synthesis model generation method and device, equipment, medium and product thereof
Technical Field
The present application relates to the field of voice communication technologies, and in particular, to a method, an apparatus, a device, a medium, and a product for generating a voice synthesis model.
Background
As the number of online audio users continues to grow, users place ever higher demands on network audio content, entertainment features, transmission quality and the like. Deep speech synthesis models deployed in different application scenarios and network environments are therefore required to deliver good inference results, which places higher requirements on the real-time performance of model execution.
Deep speech synthesis models have sufficient representational capacity, but deploying them on a mobile terminal raises several problems. On the one hand, such models generally have large parameter counts: the WaveNet model has about 4.6M parameters, the WaveGlow model about 87.9M, and the FloWaveNet model about 182.6M, far exceeding the computing capability of a mobile terminal. On the other hand, some speech synthesis based application scenarios, such as packet loss compensation, require the model to reach or exceed real-time synthesis speed.
To achieve real-time operation on the mobile terminal, traditional mobile speech synthesis methods generally use concatenative synthesis, parametric synthesis or shallow learning. These methods are usually limited by the number of model parameters and by the available floating point operations per second (FLOPS); in practice they cannot meet the requirements of miniaturization and real-time performance, and often require manual compression and pruning, or large-scale computation deployed in the background.
Disclosure of Invention
The present application aims to solve the above problems and provides a speech synthesis model generation method, together with a corresponding apparatus, device, non-volatile readable storage medium and computer program product.
According to an aspect of the present application, there is provided a method for generating a speech synthesis model, comprising the steps of:
calling a controller, the controller generating a structural code of a vocoder;
constructing a vocoder according to the structural code, the vocoder comprising a conditional network and an autoregressive network generated from the structural code;
iteratively training the vocoder to a convergence state using a training set and, while the controller has not yet converged, performing a gradient update on the controller according to the performance score obtained by the vocoder on a test set and iteratively generating a new vocoder;
after the controller converges, selecting a vocoder as the speech synthesis model according to the performance scores.
According to another aspect of the present application, there is provided a speech synthesis model generation apparatus including:
a code generation module configured to call a controller, the controller generating a structural code of a vocoder;
a vocoder construction module configured to construct a vocoder according to the structural code, the vocoder comprising a conditional network and an autoregressive network generated from the structural code;
an iterative decision module configured to iteratively train the vocoder to a convergence state using a training set and, while the controller has not yet converged, perform a gradient update on the controller according to the performance score obtained by the vocoder on a test set and iteratively generate a new vocoder;
a model output module configured to select a vocoder as the speech synthesis model according to the performance scores after the controller converges.
According to another aspect of the present application, there is provided a speech synthesis model generation device comprising a central processing unit and a memory, the central processing unit being configured to call and execute a computer program stored in the memory so as to perform the steps of the speech synthesis model generation method described herein.
According to another aspect of the present application, there is provided a non-transitory readable storage medium storing, in the form of computer readable instructions, a computer program implemented according to the speech synthesis model generation method; when the computer program is invoked by a computer, it performs the steps included in the method.
According to another aspect of the present application, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method described in any one of the embodiments of the present application.
Compared with the prior art, the present application uses a controller to generate structural codes, constructs vocoders according to the structural codes, trains and tests the vocoders to obtain corresponding performance scores, and controls the controller's iterative process according to those scores. Multiple vocoders are generated during the controller's training, and the vocoder with the best measured performance is selected as the speech synthesis model. Automatic generation and screening of vocoders is thus achieved by means of the controller, so that the resulting speech synthesis model satisfies the constraints of mobile-terminal deployment, performs well after deployment, and meets the demands for model miniaturization and high real-time performance in speech synthesis scenarios.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic diagram of a network architecture corresponding to a voice call service applied in the present application;
FIG. 2 is a schematic block diagram of the topology of the vocoder of the present application;
FIG. 3 is a schematic flow chart diagram of an embodiment of a speech synthesis model generation method of the present application;
FIG. 4 is a flow diagram of an exemplary training process for a vocoder produced by the controller of the present application;
FIG. 5 is a schematic diagram of the processing prior to deployment of a speech synthesis model in an embodiment of the present application;
FIG. 6 is a flow diagram illustrating a process for performing multi-stage training of a speech synthesis model according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating a third stage of training a speech synthesis model according to an embodiment of the present application;
FIG. 8 is a schematic block diagram of a speech synthesis model generation apparatus according to the present application;
fig. 9 is a schematic structural diagram of a speech synthesis model generation apparatus used in the present application.
Detailed Description
Unless explicitly stated otherwise, the models referred to, or possibly referred to, in the present application include traditional machine learning models or deep learning models; they may be deployed on a remote server and called remotely from a client, or deployed directly on a client whose device capability is sufficient.
Those skilled in the art will appreciate that, although the various methods of the present application are described based on the same concept so that they are common to each other, they may be performed independently unless otherwise specified. Likewise, each embodiment disclosed in the present application is proposed based on the same inventive concept, so concepts expressed identically, and concepts whose expressions differ only for convenience, should be understood in the same way.
Unless a mutual exclusion between related technical features is explicitly stated, the embodiments disclosed herein can be flexibly constructed by combining the related technical features of different embodiments, as long as the combination does not depart from the inventive spirit of the present application and meets the needs of, or remedies deficiencies in, the prior art. Those skilled in the art will appreciate such variations.
Referring to fig. 1, the network architecture adopted in an exemplary application scenario of the present application may be used to deploy a voice call service that supports real-time voice communication; during encoding and decoding of the voice streams of this service, packet loss compensation may be implemented by running a speech synthesis model generated in any embodiment of the present application. The application server 81 shown in fig. 1 may be used to support the operation of the voice call service, and the media server 82 may be used to encode and decode the voice streams pushed by each user in order to relay them. Terminal devices such as the computer 83 and the mobile phone 84 are generally provided as clients for end users and may be used to send or receive voice streams. In addition, when the voice stream needs to be encoded and decoded on the terminal device, the speech synthesis model obtained in the embodiments of the present application can also be deployed on the terminal device to compensate for packet loss of the received or transmitted voice stream. The application scenario disclosed above is only an example; the speech synthesis model can also be used, for instance, for packet loss compensation of a voice stream within a live stream in a network live broadcast service scenario.
Fig. 2 shows a schematic block diagram of a vocoder produced by the present application. It provides a fixed topology according to which the vocoders of the present application are constructed, and at least one of these vocoders is preferably selected as the speech synthesis model.
According to the topology shown in fig. 2, the vocoder includes a conditional network and an autoregressive network; the conditional network is mainly used to characterize the voice data in a voice stream, and the autoregressive network is mainly used to generate, for that voice data, the subsequent speech frames required for packet loss compensation.
In the conditional network, a residual network extracts global feature information of the voice data, an upsampling network uses several different scaling coefficients to extract local feature information of the voice data at multiple scales, and a concatenation layer then splices the global feature information and the local feature information into comprehensive feature information, thereby extracting deep semantic information of the voice data. The comprehensive feature information thus represents the slowly varying information in the acoustic features, such as phonemes and prosody. The global feature information obtained by the residual network is further split into multiple paths to provide reference information while the autoregressive network processes the comprehensive feature information.
The autoregressive network is implemented with a recurrent neural network (RNN). Internally it uses two unidirectional gated recurrent units (GRUs), and the number of gated recurrent units can be set flexibly as required. Each gated recurrent unit processes the comprehensive feature information, obtains the corresponding deep semantic information, splices it with the global feature information output by the residual network, and passes the result to the next node for processing. At its end, the autoregressive network is provided with a classification network, which performs classification mapping on the combination of the gated recurrent unit output and the global feature information so as to restore the subsequent speech frame.
In one embodiment, the vocoder can be obtained by modifying WaveRNN (Wave Recurrent Neural Network), or a variant model thereof such as SC-WaveRNN (Speaker Conditional WaveRNN), as the base model. WaveRNN is an autoregressive network model suitable for processing audio data in sequence form; its initial design goal was to keep sequence generation fast, and its authors used techniques such as model simplification, sparsification and parallel sequence generation to significantly increase the generation speed, so that a well-performing vocoder can even achieve real-time speech synthesis on a CPU.
The speech synthesis model determined through the processing of any embodiment of the present application can be trained to a convergence state in advance before being put into use. The overall complexity of its network structure is reduced and its parameter count is small; by using the global feature information to provide temporary context, the model is easier to train to convergence and can make more accurate predictions.
Referring to fig. 3, a method for generating a speech synthesis model according to an aspect of the present application, in one embodiment, includes the following steps:
step S1100, calling a controller, and generating the structure code of the vocoder by the controller;
the method adopts the controller suitable for realizing Network Architecture Search (NAS) for iteratively producing the structure coding of the vocoder, carries out strategy gradient optimization on the controller by utilizing the actually measured performance of the vocoder produced by the controller, trains the controller to a convergence state, and ensures that the performance of the vocoder produced gradually becomes more and more excellent in the iterative training process of the controller.
The controller is constructed by a Recurrent Neural Network (RNN), the input of the RNN is random numbers, a structural code represented by a high-dimensional vector is generated, and the structural code represents the code information corresponding to the conditional Network and the autoregressive Network of the vocoder in a sequence form. When the vocoder is constructed, according to the structural codes complying with the preset specification of the topological structure of the vocoder, the coding information corresponding to the conditional network and the autoregressive network is selected from the structural codes and can be respectively used for constructing the conditional network and the autoregressive network.
In one embodiment, the structural encoding includes encoded information corresponding to an upsampling network within the conditional network of the vocoder. In an exemplary embodiment, the upsampling network includes three convolutional layers matched with different scaling coefficients for performing different scale sampling, and for this purpose, the coding information corresponding to the conditional network in the structural coding may provide corresponding design parameters for the three convolutional layers. The design parameters may be layer type and number of channels of the convolutional layer.
In one embodiment, the layer type of a convolutional layer of the upsampling network may be any one of the following:
the first: convolution with a 1 x 1 kernel, layer type denoted 0;
the second: convolution with a 3 x 3 kernel, layer type denoted 1;
the third: convolution with a 5 x 5 kernel, layer type denoted 2;
the fourth: deconvolution with a 1 x 1 kernel and stride 2, layer type denoted 3;
the fifth: deconvolution with a 3 x 3 kernel and stride 2, layer type denoted 4;
the sixth: deconvolution with a 5 x 5 kernel and stride 2, layer type denoted 5.
In one embodiment, the channel number of a convolutional layer may be any one of the following: 32, 64, 96 or 128; correspondingly, the type codes of these channel numbers can be represented by 0 to 3 respectively.
According to the above example, the coding information corresponding to the upsampling network of the conditional network may be a subsequence formed by the layer type and channel number provided for each convolutional layer, in the following form:
{ first convolutional layer type; first convolutional layer channel number; second convolutional layer type; second convolutional layer channel number; third convolutional layer type; third convolutional layer channel number }
According to this example, each convolutional layer of the upsampling network occupies two dimensions in the structural code, corresponding respectively to its layer type and channel number, which is very simple.
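As an illustration of this encoding scheme, the following Python sketch maps the layer-type and channel-number codes listed above to concrete layer specifications. The dictionary names and the tuple layout are assumptions made for this example, not notation defined by the patent.

```python
# Hypothetical lookup tables for decoding the conditional-network sub-code.
# Each layer-type code maps to (kernel_size, stride, is_deconvolution),
# following the six layer types listed above.
LAYER_TYPES = {
    0: (1, 1, False),  # 1x1 convolution
    1: (3, 1, False),  # 3x3 convolution
    2: (5, 1, False),  # 5x5 convolution
    3: (1, 2, True),   # 1x1 deconvolution, stride 2
    4: (3, 2, True),   # 3x3 deconvolution, stride 2
    5: (5, 2, True),   # 5x5 deconvolution, stride 2
}

# Channel-number codes 0..3 map to the candidate channel counts.
CHANNELS = {0: 32, 1: 64, 2: 96, 3: 128}

def decode_upsampling_code(sub_code):
    """Turn a 6-element sub-code, (type, channels) x 3, into layer specs."""
    specs = []
    for i in range(0, len(sub_code), 2):
        kernel, stride, is_deconv = LAYER_TYPES[sub_code[i]]
        specs.append({
            "kernel": kernel,
            "stride": stride,
            "deconv": is_deconv,
            "channels": CHANNELS[sub_code[i + 1]],
        })
    return specs

# Example: the sub-code {0, 0, 1, 1, 5, 3} used later in the description.
print(decode_upsampling_code([0, 0, 1, 1, 5, 3]))
```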
In one embodiment, the structural code includes coding information corresponding to the gated recurrent units in the autoregressive network of the vocoder. Suppose the input feature of the gated recurrent unit at step t is x_t and its hidden state is h_t; the neuron structure of the gated recurrent unit can then be regarded as the functional relation:
h_t = f(h_{t-1}, x_t)
Specifically, the neuron structure of the vocoder's gated recurrent unit generally follows the architecture expressed by the following description formula:
h_t = (1 - Z_t) * h_{t-1} + Z_t * C_t
where Z_t is the update gate, which controls how fast the hidden state information is updated, and C_t is the candidate hidden state. According to this description formula, the structural search of the gated recurrent unit focuses on searching the architectures of Z_t and C_t.
Thus, taking four execution units as an example, the coding information of a gated recurrent unit may be expressed in the following form:
{ first binary operation type; activation function of the first binary operation; second binary operation type; activation function of the second binary operation; third binary operation type; activation function of the third binary operation; fourth binary operation type; activation function of the fourth binary operation }
As can be seen from the above example, a gated recurrent unit consists of 4 units, each unit including one binary operation and one activation function. The 1st unit decides the update gate Z_t, and the last 3 units determine the candidate hidden state C_t.
Here, each binary operation may take one of the following types:
addition, with type code 0;
element-wise multiplication, with type code 1;
concatenation, with type code 2.
The activation function may take one of the following types:
no activation function, with type code 0;
ReLU, with type code 1;
Sigmoid, with type code 2;
Tanh, with type code 3.
According to the above example, the coding information of the conditional network occupies 6 dimensions of the structural code and the coding information of the autoregressive network occupies 8 dimensions, so the structural code generated by the controller is a 14-dimensional vector. Of course, the dimensions occupied by the conditional network and the autoregressive network may be set flexibly according to the preset topology of the vocoder to be constructed, and their respective shares of the structural code's dimensions vary accordingly; they are not limited to the above examples.
Step S1200, constructing a vocoder according to the structural coding, wherein the vocoder comprises a conditional network and an autoregressive network which are generated according to the structural coding;
It will be appreciated that a corresponding vocoder can be constructed by using the structural code according to the above principles. In one embodiment, the vocoder may be constructed as follows:
First, the upsampling network in the conditional network of the vocoder is constructed according to first coding information in the structural code, the first coding information comprising the layer types of the several convolutional layers of the upsampling network and the channel numbers corresponding to those layer types.
Following the example of the previous step, the sub-code corresponding to the conditional network, i.e. the values of the first 6 dimensions, is taken to construct the conditional network, specifically the upsampling network within it. Similarly, the sub-code corresponding to the autoregressive network, i.e. the values of the last 8 dimensions, is taken to construct the autoregressive network, specifically the gated recurrent units within it.
An exemplary structural code is as follows:
{0,0,1,1,5,3,0,2,0,2,1,0,0,3}
taking the first 6 bits in the exemplary structure code as the first coding information corresponding to the upsampling network, that is:
{0,0,1,1,5,3}
The upsampling network constructed from this first coding information consists of three convolutional layers, whose structures and functions are as follows: the first convolutional layer takes the input acoustic features, repeats them 2 times, performs a convolution with a 1 x 1 kernel, and outputs 32-channel data; the second convolutional layer repeats the output of the first layer 2 times, performs a convolution with a 3 x 3 kernel, and outputs 64-channel data; the third convolutional layer performs the operation specified in the exemplary structural code, a 5 x 5 deconvolution with stride 2, and outputs 128-channel data, which is then fed into the autoregressive network.
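A minimal PyTorch sketch of how such an upsampling network could be assembled is given below. The module names, the input feature dimension of 80, the ReLU activations, and the use of nearest-neighbour repetition for the "repeat 2 times" operation are all illustrative assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn as nn

def build_upsampling_network(specs, in_channels=80):
    """Assemble conv / deconv layers from decoded (kernel, stride, deconv, channels) specs."""
    layers = []
    channels = in_channels
    for spec in specs:
        k, pad = spec["kernel"], spec["kernel"] // 2
        if spec["deconv"]:
            # Transposed convolution with stride 2 doubles the time resolution.
            layers.append(nn.ConvTranspose1d(channels, spec["channels"], k,
                                             stride=spec["stride"],
                                             padding=pad, output_padding=1))
        else:
            # Plain convolution preceded by x2 nearest-neighbour repetition.
            layers.append(nn.Upsample(scale_factor=2, mode="nearest"))
            layers.append(nn.Conv1d(channels, spec["channels"], k, padding=pad))
        layers.append(nn.ReLU())
        channels = spec["channels"]
    return nn.Sequential(*layers)

# Decoded form of the exemplary sub-code {0, 0, 1, 1, 5, 3}.
specs = [
    {"kernel": 1, "stride": 1, "deconv": False, "channels": 32},   # code {0, 0}
    {"kernel": 3, "stride": 1, "deconv": False, "channels": 64},   # code {1, 1}
    {"kernel": 5, "stride": 2, "deconv": True,  "channels": 128},  # code {5, 3}
]
net = build_upsampling_network(specs)
x = torch.randn(1, 80, 40)          # (batch, feature_dim, frames)
print(net(x).shape)
```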
Secondly, the gated recurrent unit in the autoregressive network of the vocoder is constructed according to second coding information in the structural code, the second coding information comprising the operation types corresponding to the structural nodes of the gated recurrent unit and the activation types corresponding to those operation types.
The last 8 bits of the exemplary structural code are taken as the second coding information corresponding to the gated recurrent unit, that is:
{0,2,0,2,1,0,0,3}
The gated recurrent unit is constructed according to the second coding information, and the following heuristic rules are followed when the binary operations are applied:
1. For an addition operation, each operand is multiplied by a learnable matrix before the addition; the output dimension of the learnable matrix is the smaller of the two operand dimensions.
2. For an element-wise multiplication, if the two operands have the same dimension they are multiplied directly without a learnable matrix; if their dimensions differ, the higher-dimensional operand is multiplied by a learnable matrix to match the lower-dimensional operand before the multiplication.
3. The concatenation operation concatenates directly without using a learnable matrix.
According to the above rule, the second encoded information obtained in the previous example is taken as an example, wherein the first two bit codes {0,2} corresponding to the gates are updated, and the binary operation thereof is designated as an addition operation and is directly applied to h t-1 ,x t Concretely, it is shown that t-1 And x t Matching learnable matrix addition is carried out, then a sigmoid function is used for activation output, and the structural description formula is expressed as follows:
Z t =σ(ω 1 *h t-12 *x t )
The last six codes {0,2,1,0,0,3} correspond to the candidate hidden state: the first operation is applied to h_{t-1} and x_t, its result is then combined with h_{t-1}, and that result is combined with x_t, corresponding in turn to the functions f_1, f_2 and f_3. The principle formulas are as follows:
i_1 = f_1(h_{t-1}, x_t)
i_2 = f_2(i_1, h_{t-1})
C_t = f_3(i_2, x_t)
Thus, according to the above principle formulas, the process formulas corresponding to the second coding information are expressed as follows:
i_1 = σ(ω_3 * h_{t-1} + ω_4 * x_t)
i_2 = i_1 ⊙ h_{t-1}
C_t = tanh(ω_5 * i_2 + ω_6 * x_t)
According to the architecture described above, the gated recurrent unit combines the two parts and outputs the result, with the formula:
h_t = (1 - Z_t) * h_{t-1} + Z_t * C_t
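The searched cell described by these formulas can be written down directly. The following PyTorch sketch is one possible rendering, in which the ω matrices become small bias-free linear layers and the layer sizes are chosen purely for illustration.

```python
import torch
import torch.nn as nn

class SearchedGRUCell(nn.Module):
    """One possible implementation of the cell found for the code {0,2,0,2,1,0,0,3}."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Update gate: Z_t = sigmoid(w1 * h_{t-1} + w2 * x_t)
        self.w1 = nn.Linear(hidden_size, hidden_size, bias=False)
        self.w2 = nn.Linear(input_size, hidden_size, bias=False)
        # Candidate state, part 1: i1 = sigmoid(w3 * h_{t-1} + w4 * x_t)
        self.w3 = nn.Linear(hidden_size, hidden_size, bias=False)
        self.w4 = nn.Linear(input_size, hidden_size, bias=False)
        # Candidate state, part 2: C_t = tanh(w5 * i2 + w6 * x_t), with i2 = i1 ⊙ h_{t-1}
        self.w5 = nn.Linear(hidden_size, hidden_size, bias=False)
        self.w6 = nn.Linear(input_size, hidden_size, bias=False)

    def forward(self, x_t, h_prev):
        z_t = torch.sigmoid(self.w1(h_prev) + self.w2(x_t))
        i1 = torch.sigmoid(self.w3(h_prev) + self.w4(x_t))
        i2 = i1 * h_prev                      # element-wise multiplication
        c_t = torch.tanh(self.w5(i2) + self.w6(x_t))
        return (1.0 - z_t) * h_prev + z_t * c_t

cell = SearchedGRUCell(input_size=128, hidden_size=256)
h = cell(torch.randn(4, 128), torch.randn(4, 256))
print(h.shape)  # torch.Size([4, 256])
```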
Through the above process, the gated recurrent units of the autoregressive network are obtained; when the preset topology of the vocoder to be constructed defines several gated recurrent units connected in series, each gated recurrent unit can adopt the same structure.
Finally, the conditional network and the autoregressive network are assembled into a vocoder according to the preset topology.
Since the other structures of the vocoder are defined by the preset topology, once the upsampling network and the gated recurrent units have been constructed from the vocoder's coding information, the corresponding vocoder is obtained by fitting them into the preset topology.
Obviously, the preset topology is known from the prior knowledge provided by the vocoder prototype. On this basis, the controller generates the coding information of the vocoder, the upsampling network and the gated recurrent units are generated, and the corresponding vocoder is constructed. Unlike a general network architecture search, this generation process does not need to rely on a directed acyclic graph: the complex branch structures obtained from a directed acyclic graph keep the forward-propagation cost high when the vocoder is deployed on a terminal device, even if the scale of the weight parameters is reduced; moreover, when relying on a directed acyclic graph, if the data dimensions change during upsampling, the branch structures must handle the problem of keeping the data dimensions consistent, which often degrades performance.
Accordingly, in terms of network architecture search, constructing the vocoder under the control of the controller as in the present application has the following advantages:
1. Prior knowledge about vocoders is exploited: the space complexity of the topology search is sacrificed in exchange for jointly searching the key modules in which the vocoder's weights are dense, namely the upsampling network and the gated recurrent unit.
2. The search space is designed with the goal of compressing the network's FLOPS; operators with fewer weights are used and heuristic rules are applied, which improves efficiency.
3. Combined with the subsequent soft-constraint objective function, a speech synthesis model with optimal resource allocation is searched for, so that deployment on the mobile terminal meets the real-time requirement.
Step S1300, iteratively training the vocoder to a convergence state using a training set and, while the controller has not yet converged, performing a gradient update on the controller according to the performance score obtained by the vocoder on the test set and iteratively generating a new vocoder;
The measured performance of the vocoder generated by the controller must be obtained. To this end a training set is prepared, and a sufficient number of its training samples are used to iteratively train the controller-generated vocoder to a convergence state. A training sample may be audio data of the kind the vocoder takes as input, each piece of audio data containing multiple speech frames. A sequence of speech frames that is continuous in timestamp order is taken, its acoustic features are extracted, and the sequence is input into the vocoder: the conditional network of the vocoder characterizes the acoustic features, and on the basis of that representation the autoregressive network predicts the subsequent speech frame of the sequence. The vocoder then calculates the model loss of the predicted subsequent frame against the frame that actually follows the sequence in the training sample, and the gradient update and iteration of the vocoder are controlled according to this loss value. Through continuous iterative training the vocoder finally reaches a convergence state, which completes its training.
After the vocoder has been trained to the convergence state, the controller may be updated as follows:
First, the vocoder is tested with the test samples in a test set to obtain a performance score, the performance score including the quality scores obtained after the vocoder processes the test samples.
The vocoder that has reached the convergence state is tested with the test samples of a correspondingly prepared test set, and the performance score over the whole test set is obtained according to a preset statistical method, so as to evaluate the performance achieved by the vocoder on the test set.
In one embodiment, the performance score may be an aggregate statistic of the quality scores obtained for the individual test samples, where a quality score may be a subjective quality score, namely the Mean Opinion Score (MOS), or an objective quality score such as a peak signal-to-noise ratio or a structural similarity score. The quality scores over all test samples are combined by a corresponding statistical algorithm, generally by averaging. According to this embodiment, the controller may calculate its model loss value using the following hard-constrained multi-objective function:
max_{a ∈ A} MOS(a)
s.t. lat(a, h) ≤ B
where A is the architecture search space of the vocoder, MOS(a) is the mean opinion score obtained by architecture a of the vocoder on the test set, lat(a, h) is the inference delay of architecture a under hardware condition h, and B is the maximum inference delay permitted for the vocoder over the whole test set; this restricts the controller to sampling only in the part of the search space whose inference delay does not exceed B.
In another embodiment, the performance score may be obtained by first computing an aggregate statistic of the quality scores of the individual test samples and then superimposing the inference delay incurred by the vocoder when testing the whole test set, thereby jointly evaluating speech quality and machine execution performance. According to this embodiment, the controller may calculate its model loss value using the following soft-constraint objective function:
max_{a ∈ A} MOS(a) · [lat(a, h)]^c
where A is the architecture search space of the vocoder, MOS(a) is the mean opinion score obtained by architecture a on the test set, lat(a, h) is the inference delay of architecture a under hardware condition h, and c is a constant controlling the degree of architecture compression.
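To make the two objectives concrete, the following sketch computes a hard-constrained reward and a soft-constrained reward. The original publication gives the objective functions only as images, so both expressions here (a hard latency filter, and a latency factor raised to the power c) should be read as reconstructions and assumptions rather than the patent's exact formulas.

```python
def hard_constrained_reward(mos, latency, max_latency):
    """Hard constraint: candidates whose inference delay exceeds the budget are
    rejected; otherwise the reward is simply the MOS on the test set (assumed form)."""
    if latency > max_latency:
        return None  # outside the admissible search space
    return mos

def soft_constrained_reward(mos, latency, c=-0.07):
    """Soft constraint (assumed form): the MOS is scaled by latency ** c, where a
    negative exponent c trades speech quality against inference delay."""
    return mos * (latency ** c)

print(hard_constrained_reward(mos=4.1, latency=18.0, max_latency=20.0))
print(soft_constrained_reward(mos=4.1, latency=18.0))
```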
Then, policy gradient optimization is performed on the controller according to the performance score, and, while the controller has not reached a convergence state, the controller continues to be called iteratively to generate a new vocoder.
It can be seen that, after the vocoder obtains its corresponding performance score, whether the controller has converged can be judged according to the performance score, and hence whether to continue iteratively generating new vocoders. Specifically, if the controller has not converged, policy gradient optimization is applied to it according to the performance score, a new vocoder is generated starting again from step S1100, and steps S1100 to S1300 are executed in a loop; if the controller has converged, the flow jumps to the next step and the training of the controller is terminated.
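Putting steps S1100 to S1400 together, the overall search loop can be summarised as below. Every callable here is a placeholder standing for an operation described in the text, not an API defined by the patent; the sketch only illustrates the control flow.

```python
def architecture_search(controller, build_vocoder, train_to_convergence,
                        evaluate, max_rounds=100):
    """High-level sketch of the controller / vocoder loop of steps S1100-S1400."""
    scored = []
    for _ in range(max_rounds):
        code = controller.sample_structural_code()        # step S1100
        vocoder = build_vocoder(code)                      # step S1200
        train_to_convergence(vocoder)                      # step S1300, inner loop
        score = evaluate(vocoder)                          # performance score on the test set
        controller.policy_gradient_update(score)           # gradient update of the controller
        scored.append((score, vocoder))
        if controller.has_converged():
            break
    # step S1400: keep the best-scoring vocoder as the speech synthesis model
    return max(scored, key=lambda pair: pair[0])[1]
```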
In an embodiment suitable for a controller using the soft-constraint objective function, the policy gradient update applied to the controller may adopt the following policy update function:
∇_θ J(θ) = Σ_{t=1}^{T} E[ ∇_θ log π_θ(a_t | a_{1:t-1}) · R ]
where π_θ is the current policy probability distribution (parameterized by the controller parameters θ), a_t is the structural choice output at step t by the controller's gated recurrent unit, and R is the reward value calculated under the soft-constraint objective function for the architecture determined by a_{1:T}.
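A compact sketch of such a policy-gradient (REINFORCE-style) update for an RNN controller is given below. The controller layout (an embedding, a GRU cell, a single per-step classifier over a fixed number of choices) and the moving-average baseline are assumptions added for illustration; the patent does not specify these details.

```python
import torch
import torch.nn as nn

class ControllerRNN(nn.Module):
    """Illustrative RNN controller emitting one token of the structural code per step."""

    def __init__(self, num_choices=6, steps=14, hidden=64):
        super().__init__()
        self.steps = steps
        self.embed = nn.Embedding(num_choices, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.head = nn.Linear(hidden, num_choices)

    def sample(self):
        h = torch.zeros(1, self.embed.embedding_dim)
        token = torch.zeros(1, dtype=torch.long)
        actions, log_probs = [], []
        for _ in range(self.steps):
            h = self.rnn(self.embed(token), h)
            dist = torch.distributions.Categorical(logits=self.head(h))
            token = dist.sample()
            actions.append(token.item())
            log_probs.append(dist.log_prob(token))
        return actions, torch.stack(log_probs).sum()

controller = ControllerRNN()
optimizer = torch.optim.Adam(controller.parameters(), lr=3e-4)
baseline = 0.0

def update_controller(reward, log_prob_sum):
    """REINFORCE step: ascend E[log pi(a) * (R - baseline)]."""
    global baseline
    baseline = 0.9 * baseline + 0.1 * reward
    loss = -(reward - baseline) * log_prob_sum
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

actions, log_prob_sum = controller.sample()
update_controller(reward=3.9, log_prob_sum=log_prob_sum)
```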
In another embodiment, the policy gradient update to the controller may apply the strategy of a deep reinforcement learning method, adopting the following update function:
Q(a_t | a_{1:t-1}) ← (1 - α) · Q(a_t | a_{1:t-1}) + α · (R + γ · max_{a'} Q(a' | a_{1:t}))
where Q(a_t | a_{1:t-1}) is the Q function, implemented by an RNN sharing parameters with the controller, α is the learning rate of Q-learning, and γ is the discount factor.
It will be appreciated that, by applying gradient optimization to the controller based on the performance scores obtained by its vocoders on the test set, the controller's ability to produce vocoders with superior performance can be continually improved.
Step S1400, after the controller converges, selecting a vocoder as the speech synthesis model according to the performance scores.
During training the controller generates multiple vocoders through continuous iteration, and each is trained and tested to obtain a corresponding performance score. The performance score reflects the performance of the corresponding vocoder: the higher the score, the better the real-time behavior of the vocoder during speech synthesis, and the closer its network structure usually is to optimal. Therefore, one of the vocoders generated during the controller's iterative training can be selected as the speech synthesis model according to the performance scores. Generally, the one or more vocoders with the highest performance scores serve as the speech synthesis model. Of course, in other embodiments a vocoder with a slightly lower performance score may be selected instead, according to the actual hardware capability of the terminal device to be deployed; the principle is essentially the same.
The speech synthesis model selected in this step can be further trained as needed to optimize its inference capability before being deployed to the terminal device for use.
According to this embodiment, the controller generates structural codes, vocoders are constructed according to the codes and are trained and tested to obtain corresponding performance scores, and the controller's iterative process is controlled according to those scores. Multiple vocoders are produced during the controller's training, and the one with the best measured performance is selected as the speech synthesis model. The controller thus automates the production and screening of vocoders, so that the resulting speech synthesis model satisfies the constraints of mobile-terminal deployment, performs well after deployment, and meets the demands for model miniaturization and high real-time performance in speech synthesis scenarios.
Based on any of the above embodiments, and referring to fig. 4, iteratively training the vocoder to a convergence state using a training set comprises:
Step S1310, calling a single training sample in the training set, acquiring a plurality of consecutive speech frames of a preset duration to construct a speech frame sequence, and extracting the acoustic features corresponding to the speech frame sequence;
The training set may use a public data set or online user data. The public data set is a speech data set covering multiple languages, assembled from audio data provided by tens of thousands of contributors, and each piece of speech data can serve as a training sample of the first type. The online user data can be collected independently and comprises sampled audio segments from tens of thousands of online users; after background noise is removed from an original segment, a clean speech segment is extracted by voice activity detection (VAD), finally forming a training sample of 15-30 s. During vocoder training, one training sample is called for each training iteration.
For each training sample, a plurality of temporally consecutive speech frames of the preset duration are taken to form a speech frame sequence, and the acoustic features of that sequence are then extracted.
The acoustic features describe the relatively stable characteristics of the speech frames, such as phonemes and prosody, and may be any one of log-mel spectrum information, time-frequency spectrum information, and CQT filtering information.
Those skilled in the art will appreciate that the above acoustic features can be computed with a corresponding algorithm. Before time-domain or frequency-domain analysis, the speech signal undergoes conventional processing such as pre-emphasis, framing and windowing. The purpose of pre-emphasis is to boost the high-frequency part of the speech signal and flatten the spectrum; pre-emphasis is typically implemented with a first-order high-pass filter. Before analysis the speech signal must be framed; the length of each frame is usually set to 20 ms and, considering frame shift, adjacent frames may overlap by 10 ms. Framing is accomplished by windowing the speech signal. Different window choices affect the analysis result, and a window function corresponding to the Hamming window is commonly used for the windowing operation.
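The conventional front-end processing described here can be sketched in a few lines of NumPy; the pre-emphasis coefficient 0.97 and the 20 ms / 10 ms frame settings are typical values assumed for this example.

```python
import numpy as np

def preemphasis(signal, coeff=0.97):
    """First-order high-pass filter that boosts the high-frequency part."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_and_window(signal, sample_rate, frame_ms=20, hop_ms=10):
    """Split the signal into 20 ms frames with 10 ms overlap and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop_len: i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    return frames  # shape: (num_frames, frame_len)

speech = np.random.randn(16000)               # 1 s of audio at 16 kHz (placeholder signal)
frames = frame_and_window(preemphasis(speech), sample_rate=16000)
print(frames.shape)
```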
In one embodiment, for the time-frequency spectrum information, the voice data of each piece of speech information in the time domain is pre-emphasized, framed and windowed, and then transformed into the frequency domain by the short-time Fourier transform (STFT) to obtain the data corresponding to a spectrogram, thereby forming the time-frequency spectrum information.
In another embodiment, the log-mel spectrum may be obtained by filtering the time-frequency spectrum information with a mel-scale filter bank and then taking the logarithm of the filtered result.
In another embodiment, for the CQT filtering information, the constant-Q transform (CQT) refers to a filter bank whose center frequencies are exponentially distributed and whose filter bandwidths differ, but whose ratio of center frequency to bandwidth is a constant Q. Unlike the Fourier transform, the frequency axis of its spectrum is not linear but based on log2, and the filter window length can be varied according to the spectral line frequency for better performance.
Any of the above specific acoustic features can be used as input to the vocoder of the present application. In one embodiment, to facilitate processing by the vocoder, the acoustic features can be organized in a predetermined format: the acoustic features of each speech frame are arranged as a row vector, and for the whole encoded speech frame sequence the row vectors of the frames are stacked vertically in time order to obtain a two-dimensional matrix serving as the acoustic features of the whole sequence.
Step S1320, inputting the acoustic features into the conditional network of the vocoder, obtaining global feature information of the acoustic features through its residual network, obtaining local feature information of the acoustic features at multiple scales through its upsampling network, and obtaining comprehensive feature information composed of the global feature information and the local feature information;
According to the principles of the vocoder of the present application, inputting the acoustic features into the conditional network of the vocoder produces output as follows:
First, the acoustic features are processed by the residual network in the conditional network to obtain global feature information of the acoustic features.
in this embodiment, taking the specific architecture of the vocoder in the present application obtained by using SC-WaveRNN as a prototype topology as an example, a Speaker Encoder in a prototype network may be omitted, as compared with the prototype network provided by the original author, and of course, in another embodiment, the Encoder may also be used.
The speaker coder in the prototype network of SC-WaveRNN is not necessary from the point of view of the present application implementing speech synthesis. The speaker coder is an important contribution of the SC-WaveRNN paper, and the authors measure the positive gain of the speaker coder in all cases by PESQ (Perceptial evaluation of speed quality); the same index is used for measuring, and the contribution of a speaker encoder is not obvious aiming at the task of executing packet loss compensation. The reason for this is that SC-WaveRNN aims at TTS (Text to Speech, from Text to Speech), the model input contains a complete mel spectrum, and it is important that the speaker encoder maps the mel spectrum to the speaker characteristics; for the application scenario of packet loss concealment in the application, the speaker characteristics can only affect the first frame of the compensated voice, the added speaker encoder comprises an LSTM (Long Short-Term Memory), the calculation complexity is high, and the benefit is not obvious. Thus, those skilled in the art may implement the configuration of the vocoder with or without a speaker encoder in accordance with the principles disclosed herein.
The acoustic signature of the sequence of speech frames obtained above, from which the subsequent speech frame is generated, is input into the conditional network of the vocoder, one of which is input into the residual network of the conditional network. The residual error network is responsible for performing residual error convolution operation on the acoustic features, and deep semantic information in the acoustic features is extracted on the global scale of the voice frame sequence, so that corresponding global feature information is obtained, and global representation of the acoustic features of the voice frame sequence is realized.
Then, multi-scale sampling is performed on the acoustic features by the upsampling network in the conditional network to obtain local feature information of the acoustic features.
The acoustic features of the speech frame sequence are fed from a second path into the upsampling network of the conditional network. Under the constraint of the first coding information provided by the structural code, the upsampling network is built with several scaling scales, for example three; it extracts deep semantic information from the acoustic features at these different scales with progressively finer information granularity, thereby obtaining local feature information at the different granularities and a local representation of the acoustic features of the speech frame sequence.
Then, the global feature information and the local feature information are concatenated by the concatenation layer in the conditional network to obtain the comprehensive feature information.
The concatenation layer in the conditional network splices the global feature information obtained by the residual network with the local feature information to construct the comprehensive feature information. The comprehensive feature information contains both the global information of the acoustic features and their local information at different fine scales, so it can characterize the important features of the acoustic features comprehensively and completely, which helps guide the autoregressive network to generate valid subsequent speech frames.
From the above process it can be understood that the vocoder achieves an effective feature representation of the speech frame sequence by synthesizing the important global and local characteristics of the acoustic features, which is the basis for generating the subsequent speech frames; adopting a vocoder with a preferred network structure improves the working efficiency of the vocoder and yields good returns.
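For concreteness, the two-branch conditional network described above can be sketched as follows. The residual block layout, the feature sizes, and the alignment of the global branch to the upsampled time axis are assumptions made for this illustration; any upsampling stack built from the structural code, as in the earlier sketch, can be plugged in.

```python
import torch
import torch.nn as nn

class ConditionalNetwork(nn.Module):
    """Sketch: residual branch for global features, upsampling branch for local
    features, and a concatenation producing the comprehensive feature information."""

    def __init__(self, feat_dim=80, global_dim=128, upsampler=None):
        super().__init__()
        self.res_in = nn.Conv1d(feat_dim, global_dim, kernel_size=1)
        self.res_block = nn.Sequential(
            nn.Conv1d(global_dim, global_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(global_dim, global_dim, kernel_size=3, padding=1),
        )
        # Any upsampling stack (e.g. built from the structural code) can be used here.
        self.upsampler = upsampler or nn.Sequential(
            nn.Upsample(scale_factor=4, mode="nearest"),
            nn.Conv1d(feat_dim, global_dim, kernel_size=3, padding=1),
        )

    def forward(self, acoustic_feats):
        g = self.res_in(acoustic_feats)
        g = g + self.res_block(g)                     # global feature information
        local = self.upsampler(acoustic_feats)        # local feature information
        # Align the global branch to the upsampled time axis before concatenation.
        g_up = nn.functional.interpolate(g, size=local.shape[-1], mode="nearest")
        comprehensive = torch.cat([g_up, local], dim=1)
        return comprehensive, g                       # g is reused by the autoregressive net

net = ConditionalNetwork()
comp, glob = net(torch.randn(1, 80, 40))
print(comp.shape, glob.shape)
```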
Step S1330, extracting, through the gated recurrent units of the autoregressive network in the vocoder, the relatively stable style features of the speech frame sequence from the comprehensive feature information to obtain predicted feature information;
First, the comprehensive feature information output by the conditional network passes through the first fully connected layer of the autoregressive network for further feature synthesis.
The fully connected comprehensive feature information is then input into the first gated recurrent unit, generated according to the second coding information in the structural code output by the controller, for feature extraction, so that the important features are selected and first gating feature information is obtained. After the first gating feature information is further spliced with the global feature information obtained by the residual network, it is input into the second gated recurrent unit, likewise generated according to the second coding information in the structural code output by the controller. The second gated recurrent unit extracts features from its input in the same way to obtain second gating feature information, which is spliced with the global feature information obtained by the residual network and then output.
In one embodiment, the output obtained by splicing the second gating feature information with the global feature information may serve directly as the predicted feature information. In another embodiment, that spliced feature information is further passed through a fully connected layer and then spliced once more with the global feature information obtained by the residual network to obtain the predicted feature information. At each step, repeatedly referencing the global feature information from the residual network provides contextual reference, which helps extract the important features of the acoustic features accurately and makes the subsequent speech frames generated by the autoregressive network more valid.
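The data flow just described (fully connected layer, two GRUs, repeated concatenation with the global features) can be summarised in a short sketch. The dimensions, and the use of standard nn.GRUCell modules in place of the searched cells, are simplifications assumed purely for illustration.

```python
import torch
import torch.nn as nn

class AutoregressiveNetwork(nn.Module):
    """Sketch of step S1330: FC -> GRU1 -> concat(global) -> GRU2 -> concat(global) -> FC."""

    def __init__(self, comp_dim=256, global_dim=128, hidden=256):
        super().__init__()
        self.fc_in = nn.Linear(comp_dim, hidden)
        self.gru1 = nn.GRUCell(hidden, hidden)
        self.gru2 = nn.GRUCell(hidden + global_dim, hidden)
        self.fc_out = nn.Linear(hidden + global_dim, hidden)

    def forward(self, comp_t, global_t, h1, h2):
        x = torch.relu(self.fc_in(comp_t))
        h1 = self.gru1(x, h1)
        h2 = self.gru2(torch.cat([h1, global_t], dim=-1), h2)
        pred = torch.relu(self.fc_out(torch.cat([h2, global_t], dim=-1)))
        predicted_features = torch.cat([pred, global_t], dim=-1)
        return predicted_features, h1, h2

net = AutoregressiveNetwork()
h1 = h2 = torch.zeros(1, 256)
feats, h1, h2 = net(torch.randn(1, 256), torch.randn(1, 128), h1, h2)
print(feats.shape)  # torch.Size([1, 384])
```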
Step S1340, generating, through the classification network in the vocoder, the subsequent speech frame of the speech frame sequence according to the predicted feature information;
The predicted feature information is input into the classification network provided at the end of the autoregressive network, and the probability of each bit needed to construct the subsequent speech frame is determined through the classification mapping of the classification network, so that the subsequent speech frame is constructed.
In one embodiment, when constructing the subsequent speech frame from the predicted feature information, the classification network performs audio sampling using the following temperature-based formula:
P_i = exp(y_i / T) / Σ_j exp(y_j / T)
where T is the sampling temperature, y_i is the predicted value for label i, and P_i is the probability of the i-th bit of the subsequent speech frame.
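A minimal sketch of this temperature-controlled sampling is shown below: a softmax over logits divided by T, followed by categorical sampling. The logit values and the choice of 256 output classes are placeholders.

```python
import torch

def sample_with_temperature(logits, temperature=1.0):
    """Sample an output class from temperature-scaled logits:
    P_i = exp(y_i / T) / sum_j exp(y_j / T)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1), probs

logits = torch.randn(256)            # e.g. one logit per 8-bit amplitude class
idx, probs = sample_with_temperature(logits, temperature=0.8)
print(idx.item(), probs.sum().item())
```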
According to the above process, it can be understood that the autoregressive network can effectively generate a subsequent speech frame of the current speech frame under the guidance of the global feature and the local feature of the acoustic feature according to the output of the conditional network, so as to implement effective packet loss compensation on the speech stream.
Step S1350, calculating the loss value of the subsequent speech frame by using the speech frame after the time sequence of the speech frame sequence in the training sample, and controlling the iterative training of the vocoder according to the loss value.
When a subsequent speech frame is generated aiming at a speech frame sequence in a training sample, the next speech frame of the last speech frame of the speech frame sequence is used as a supervision label, the loss value of the subsequent speech frame relative to the next speech frame is calculated, then whether the vocoder has converged is determined according to the loss value, when the vocoder has not converged, the vocoder is subjected to back propagation according to the loss value, the weight parameters of a conditional network and an autoregressive network are updated in a gradient mode, and the next training sample is called from a training set to continue to carry out iterative training on the vocoder until the vocoder reaches the convergence.
It can be seen that, during the training of the vocoder, the speech frame that time-sequentially follows the speech frame sequence sampled from the training sample is used as the supervision label of the subsequent speech frame generated from that sequence and is used to calculate its loss value, thereby implementing effective supervision of the pre-training process of the vocoder.
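Assuming the vocoder is implemented as a PyTorch module that maps the acoustic features of a speech frame sequence to class logits for the next frame, and that a cross-entropy loss is used (the loss function itself is not fixed by the present embodiment), one supervised training step could be sketched as follows; all names are assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(vocoder, optimizer, acoustic_features, next_frame_target):
    """One iteration: predict the frame that follows the sampled speech frame sequence
    and supervise it with the time-sequentially next frame from the same training sample."""
    optimizer.zero_grad()
    logits = vocoder(acoustic_features)                 # conditional + autoregressive networks
    loss = F.cross_entropy(logits, next_frame_target)   # next frame acts as the supervision label
    loss.backward()                                     # back propagation through both sub-networks
    optimizer.step()                                    # gradient update of the weight parameters
    return loss.item()
```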
The above embodiments show that the vocoders constructed from the structural code output by the controller will be trained on the same training set, and therefore, a comparison of performance between the multiple vocoders produced by the controller may be subsequently made to produce the preferred vocoder as the speech synthesis model.
On the basis of any of the above embodiments, referring to fig. 5, after the step of selecting the vocoder as the speech synthesis model according to the performance score after the controller reaches convergence, the method includes:
Step S1500, training the speech synthesis model to a convergence state;
after at least one vocoder has been preferably selected in step S1400 as the speech synthesis model suitable for deployment on a terminal device, the speech synthesis model may be further trained to fully enhance its speech synthesis capability. An exemplary training manner is to configure the conditional network of the speech synthesis model as the generator and its autoregressive network as the discriminator, so as to perform training in the manner of a generative adversarial model. An embodiment corresponding to the training manner proposed by the present application is disclosed below and is not expanded here. Through such training, the optimized speech synthesis model reaches a convergence state again and can be deployed on a terminal device for use.
Step S1600, configuring the voice synthesis model to smoothly access the subsequent voice frame generated according to the voice stream into the voice stream.
When the finally trained speech synthesis model is deployed to the terminal device, the speech synthesis model can access the subsequent speech frames generated for the speech stream into the speech stream by configuring the transition mode between the subsequent speech frames generated by the speech synthesis model and the corresponding speech stream. The method can be realized by the following steps:
firstly, splicing the subsequent speech frames generated by the speech synthesis model to obtain a compensation frame sequence;
for the subsequent speech frames generated by the speech synthesis model, whether a single subsequent speech frame or a plurality of continuously generated subsequent speech frames, they can be processed collectively and spliced in order of their timestamps to form a compensation frame sequence.
Further, adjusting the volume corresponding to the compensation frame sequence to be not more than the volume of the voice frame sequence;
in order to unify the volume effect, a preset limiter, specifically a volume limiter, can be called to limit the volume of each subsequent speech frame in the compensation frame sequence. Taking the volume of the speech frame sequence as the reference, any excessive volume in the subsequent speech frames is reduced so that the volume of the subsequent speech frames in the compensation frame sequence does not exceed that of the speech frames in the speech frame sequence. In this way, the volume of the speech frames obtained through packet loss compensation is kept at a reasonable amplitude and the consistency of the speech quality is maintained.
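A minimal sketch of such a limiter, assuming a simple peak-based rule (the patent only requires that the compensation frames do not exceed the volume of the original speech frames):

```python
import numpy as np

def limit_volume(compensation_frames, reference_frames, eps=1e-8):
    """Scale the compensation frame sequence down so that its peak amplitude does not
    exceed the peak amplitude of the original speech frame sequence."""
    ref_peak = float(np.max(np.abs(reference_frames))) + eps
    comp_peak = float(np.max(np.abs(compensation_frames))) + eps
    if comp_peak > ref_peak:
        return compensation_frames * (ref_peak / comp_peak)
    return compensation_frames
```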
And finally, smoothly accessing the compensation frame sequence into the speech stream where the speech frame sequence is located.
By accessing the compensation frame sequence into the speech stream, packet loss compensation of the speech stream can be further realized. In order to keep the compensation frame sequence obtained by the speech synthesis model acoustically smooth after it is accessed into the speech stream, the compensation frame sequence may be smoothly accessed into the speech stream in a fade-in and fade-out manner.
In one embodiment, after the compensation frame sequence is accessed into the speech stream, fading out starts at 20 ms, i.e. from the second subsequent speech frame, and continues until complete muting is reached 20 ms after the end of the packet loss or 120 ms after the start of the packet loss; in addition, the speech frames of the speech stream that follow the compensation frame sequence are controlled to fade in within a 20 ms time window. The time settings related to fading out and fading in can be flexibly adjusted according to actual requirements and are not limited to the above example.
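For illustration, a sketch of the fade-out/fade-in described above, assuming 20 ms frames at 16 kHz, linear gain ramps, and complete muting by the sixth compensation frame (all of these settings are assumptions and adjustable):

```python
import numpy as np

def fade_out_compensation(comp_frames, sample_rate=16000, frame_ms=20,
                          fade_start_frame=1, full_mute_frame=6):
    """Linearly fade out a 1-D array of concatenated compensation frames: the first frame
    keeps full volume, then the gain ramps down until complete muting at `full_mute_frame`
    (about 120 ms after the packet loss starts, with 20 ms frames)."""
    frame_len = sample_rate * frame_ms // 1000
    out = comp_frames.astype(np.float32)
    for i in range(len(out) // frame_len):
        seg = out[i * frame_len:(i + 1) * frame_len]
        if i < fade_start_frame:
            continue                                   # full volume
        if i >= full_mute_frame:
            seg[:] = 0.0                               # completely muted
            continue
        start = 1.0 - (i - fade_start_frame) / (full_mute_frame - fade_start_frame)
        stop = 1.0 - (i + 1 - fade_start_frame) / (full_mute_frame - fade_start_frame)
        seg *= np.linspace(start, stop, frame_len, endpoint=False)
    return out

def fade_in_following_speech(frames, sample_rate=16000, fade_ms=20):
    """Linearly fade in the first `fade_ms` of the real speech following the compensation."""
    n = sample_rate * fade_ms // 1000
    out = frames.astype(np.float32)
    out[:n] *= np.linspace(0.0, 1.0, n, endpoint=False)
    return out
```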
According to the above process, it can be understood that the subsequent speech frames recovered by the packet loss compensation implemented by the speech synthesis model can be smoothly accessed into the original speech stream, so that the speech stream remains smooth to the ear and good sound quality is obtained.
The above embodiments show that, after the speech synthesis model preferably selected from the vocoders generated by the controller is trained again to a convergence state, it is suitable for deployment on a terminal device such as a mobile terminal. By configuring the speech synthesis model with the function of smoothly accessing the subsequent speech frames generated according to the speech stream into the speech stream, the model can continuously generate a plurality of subsequent speech frames for the speech stream, thereby implementing continuous packet loss compensation.
Based on any of the above embodiments, referring to fig. 6, the training the speech synthesis model to the convergence state includes:
Step S1510, performing first-stage training on the speech synthesis model by using the first type of training samples in the data set, training the speech synthesis model to a convergence state, and performing training according to a preset weight sparsification target in the first-stage training;
two types of training samples, namely a first type of training sample and a second type of training sample, are prepared, and the two types of training samples can be stored in the same data set or different data sets.
The first type of training samples are mainly used for pre-training the speech synthesis model, and the second type of training samples are mainly used for fine-tuning it. Therefore, the first type of training samples can use materials with a suitably relaxed requirement on environmental noise, while the second type of training samples can use materials with clearer foreground speech.
In one embodiment, the first type of training sample may be a public data set, and as described above, one public data set selected in the actual measurement training of the present application is a voice data set that includes multiple languages and is formed by aggregating audio data provided by tens of thousands of contributors, where each voice data may be used as the first type of training sample.
In one embodiment, the second type of training samples may be automatically acquired from online user data. As described above, the online user data used in the actual training of the present application includes audio sampling segments of tens of thousands of online users; after background noise is removed from the original sampling segments, pure speech segments are intercepted by Voice Activity Detection (VAD), finally forming training samples of 15-30 s each.
Both the first type and the second type of training samples can be subjected to speech preprocessing in advance to determine each speech frame therein, so that the speech frame sequences and their corresponding acoustic features can be extracted.
In the first-stage training, the speech synthesis model is trained on each first-type training sample in the same way as in the embodiments disclosed above, which is not repeated here. Likewise, it can be seen that, during the pre-training of the speech synthesis model, the speech frame that time-sequentially follows the speech frame sequence sampled from the first-type training sample is used as the supervision label of the subsequent speech frame generated from that sequence and is used to calculate its loss value, thereby implementing effective supervision of the pre-training process of the speech synthesis model.
To further slim down the speech synthesis model, in the first-stage training, for each training sample the weights of the speech synthesis model are sorted by absolute value and the n_t weights with the smallest absolute values are set to 0, where:

n_t = N · S · [1 - (1 - (t - t_0) / T)^α]

wherein N is the total number of weights of the model, S ∈ [0,1] is the target sparsification rate, t is the training step, t_0 is the step at which sparsification starts, T is the total number of sparsification training steps, and α > 1 is a constant controlling the rate of sparsification.
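A sketch of this magnitude-based sparsification, implementing the schedule as reconstructed above (PyTorch; function names and the default α = 3 are assumptions):

```python
import torch

def num_weights_to_zero(total_weights, target_sparsity, step, start_step, total_steps, alpha=3.0):
    """n_t = N * S * (1 - (1 - (t - t0) / T) ** alpha), with progress clamped to [0, 1]."""
    progress = min(max((step - start_step) / float(total_steps), 0.0), 1.0)
    return int(total_weights * target_sparsity * (1.0 - (1.0 - progress) ** alpha))

def magnitude_sparsify(weight, n_t):
    """Set the n_t weights with the smallest absolute values of a weight tensor to 0."""
    n_t = min(n_t, weight.numel())
    if n_t <= 0:
        return weight
    threshold = torch.kthvalue(weight.abs().flatten(), n_t).values  # n_t-th smallest magnitude
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask
```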
Step S1520, using the second class training sample in the data set to carry out the second stage training of the speech synthesis model, and training the speech synthesis model to a convergence state;
in the second-stage training, the basic procedure is the same as in the first stage, but the training samples used are the second type of training samples, and the speech synthesis model is trained with a smaller learning rate than in the first stage, so that the parameters of the conditional network better fit the data distribution of online users. Since the second type of training samples provided by online users are pure speech segments, training the speech synthesis model on them further improves its ability to generate subsequent speech frames.
Similarly, in the second stage training process, the weight sparseness training can be performed on the speech synthesis model according to the same principle as the first stage training.
Step S1530, the weights of the conditional network of the speech synthesis model are fixed, the second class training sample in the data set is used to perform the third-stage training on the speech synthesis model, and the speech synthesis model is trained to the convergence state, so as to adjust the weights of the autoregressive network in the speech synthesis model.
The third stage training is performed on the speech synthesis model, and in fact, after the speech synthesis model is pre-trained in the first two stages, fine tuning training is performed on the speech synthesis model to further adjust the weight of the autoregressive network therein, so as to ensure that the model can effectively produce subsequent speech frames of a given speech frame sequence.
For this purpose, before the third-stage training is performed, the conditional network is considered to have met the expected requirements, and its weights are frozen so that they are not updated by gradients during the third-stage training. The weights of the autoregressive network remain learnable and are further corrected during the third-stage training. The third-stage training of the speech synthesis model can then be started.
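A minimal PyTorch sketch of this freezing step, assuming the model exposes `conditional_net` and `autoregressive_net` attributes (these names and the optimizer choice are assumptions):

```python
import torch

def prepare_third_stage(model, learning_rate=1e-4):
    """Freeze the conditional network and keep only the autoregressive network learnable."""
    for p in model.conditional_net.parameters():
        p.requires_grad = False                       # conditional network receives no gradient updates
    return torch.optim.Adam(model.autoregressive_net.parameters(), lr=learning_rate)
```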
When the third-stage training is performed on the speech synthesis model, the training samples used are the second type of training samples corresponding to online user data. The basic procedure for each training sample is the same as in the first-stage and second-stage training, and weight sparsification training can likewise be applied as required. During this stage, the speech frame that time-sequentially follows the speech frame sequence sampled from the second-type training sample is used as the supervision label of the subsequent speech frame generated from that sequence and is used to calculate its loss value, thereby implementing effective supervision of the fine-tuning process of the speech synthesis model.
Through the third-stage training, the weights of the conditional network are kept unchanged while the weights of the autoregressive network are continuously corrected until a convergence state is finally reached, after which the whole training process of the speech synthesis model can be terminated.
According to the above embodiments, the speech synthesis model of the present application is trained in multiple stages: it is pre-trained in the first stage with the first type of training samples, pre-trained in the second stage with the second type of training samples at a smaller learning rate, and, in the third stage, its ability to generate subsequent speech frames is strengthened with the second type of training samples while the weights of the conditional network are frozen. Comprehensive training is thus achieved, giving the obtained speech synthesis model an effective ability to generate subsequent speech frames for a given speech frame sequence. The speech synthesis model obtained through this training process has a low parameter count and a low number of floating-point operations per second, can perform real-time inference on mobile terminals, and is particularly suitable for deployment on mobile terminals such as mobile phones and computers.
Based on any of the above embodiments, referring to fig. 7, the third stage training is performed on the speech synthesis model by using the second class of training samples in the data set, including:
Step S1531, replacing, with a mask representation, the acoustic features of a preset number of speech frames that time-sequentially follow the last speech frame in the speech frame sequence sampled from each second-type training sample;
in the third-stage training, the speech synthesis model can be trained according to the number of subsequent speech frames it is expected to generate for a speech frame sequence, that is, the maximum compensation number. To this end, in each iterative training, a speech frame sequence is obtained from the called second-type training sample, and for a number of consecutive speech frames after the last speech frame of the sequence, the specific number being determined by the preset maximum compensation number, the acoustic features of those speech frames are replaced with mask representations. For example, if the maximum compensation number is determined as 6 speech frames according to a duration of 120 ms, the acoustic features of the 6 time-sequentially subsequent speech frames following the speech frame sequence are replaced with mask representations. The mask representation may be implemented, for example, by replacing all feature values of the corresponding acoustic features with a value such as 1 or 0.5, and can be set flexibly.
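A minimal sketch of this masking step (array layout and mask value are assumptions):

```python
import numpy as np

def mask_following_frames(acoustic_features, seq_len, max_compensation=6, mask_value=1.0):
    """Replace the acoustic features of the frames that time-sequentially follow the sampled
    speech frame sequence (rows seq_len .. seq_len + max_compensation - 1) with a constant
    mask representation such as 1 or 0.5."""
    masked = acoustic_features.copy()                 # shape: (num_frames, feature_dim)
    end = min(seq_len + max_compensation, masked.shape[0])
    masked[seq_len:end, :] = mask_value
    return masked
```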
Step S1532, iteratively generating, by the speech synthesis model, subsequent speech frames corresponding to the plurality of following speech frames on the basis of the speech frame sequence of the training sample;
for a second-type training sample, the speech synthesis model starts to generate a subsequent speech frame based on the acoustic features of the initial speech frame sequence of the sample. After a subsequent speech frame is generated, it is taken as the last speech frame of the sequence to obtain a new speech frame sequence, whose acoustic features are then extracted to generate the next subsequent speech frame. The iteration continues until the plurality of subsequent speech frames corresponding to the maximum compensation number have been generated.
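The iterative roll-out can be sketched as follows; `extract_acoustic_features` and `predict_next_frame` are assumed helpers standing in for the feature extraction and model inference described above:

```python
def generate_compensation_frames(model, frame_sequence, extract_acoustic_features, max_compensation=6):
    """Iteratively generate up to `max_compensation` subsequent frames: each newly generated
    frame is appended as the last frame of the sequence before the next prediction."""
    generated = []
    sequence = list(frame_sequence)
    for _ in range(max_compensation):
        features = extract_acoustic_features(sequence)    # recompute features for the updated sequence
        next_frame = model.predict_next_frame(features)   # assumed inference entry point
        generated.append(next_frame)
        sequence.append(next_frame)
    return generated
```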
Step S1533, calculating a loss value of the corresponding subsequent speech frame according to the plurality of subsequent speech frames in the second class of training samples, and correcting the weight of the autoregressive network in the speech synthesis model according to the loss value.
While the speech synthesis model generates a plurality of subsequent speech frames for each second-type training sample, for each generated subsequent speech frame the speech frame in the training sample whose timestamp corresponds to it is used as its supervision label, and its loss value is calculated. Back propagation is then performed on the speech synthesis model according to the loss value, and the weights of the autoregressive network are updated by gradient descent; the conditional network does not participate in the gradient update because its weights are frozen.
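Continuing the sketches above (all function and attribute names remain assumptions), the per-frame losses of the iteratively generated frames can be accumulated against their timestamp-aligned ground-truth frames and back-propagated while the conditional network stays frozen:

```python
import torch
import torch.nn.functional as F

def third_stage_step(model, optimizer, frame_sequence, target_frames, extract_acoustic_features):
    """One third-stage iteration: roll out one generated frame per masked following frame,
    supervise each with the ground-truth frame sharing its timestamp, and update only the
    autoregressive weights (the conditional network has requires_grad=False)."""
    optimizer.zero_grad()
    sequence = list(frame_sequence)
    losses = []
    for target in target_frames:                             # one target per masked following frame
        features = extract_acoustic_features(sequence)
        logits = model.predict_next_frame_logits(features)   # assumed inference entry point
        losses.append(F.cross_entropy(logits, target))
        sequence.append(logits.argmax(dim=-1))               # feed the prediction back as the last frame
    loss = torch.stack(losses).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```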
According to the above embodiment, it can be seen that, in the third-stage training, the acoustic features of the following speech frames in the second-type training samples are replaced with mask representations, which guides the autoregressive network to generate the subsequent speech frames corresponding to those following frames. A plurality of subsequent speech frames can be generated continuously by iteration according to the preset maximum compensation number, so that the speech synthesis model learns the ability to compensate speech frames continuously, improving the efficiency of its packet loss compensation.
Referring to fig. 8, an apparatus for generating a speech synthesis model according to an aspect of the present application includes a code generation module 1100, a vocoder construction module 1200, an iterative decision module 1300, and a model generation module 1400, wherein: the code generation module 1100 is configured to invoke a controller, and the controller generates the structural coding of a vocoder; the vocoder construction module 1200 is configured to construct a vocoder according to the structural coding, the vocoder comprising a conditional network and an autoregressive network generated according to the structural coding; the iterative decision module 1300 is configured to iteratively train the vocoder to a convergence state using a training set, and, according to a performance score obtained by the vocoder on a test set, perform a gradient update on the controller and iteratively generate a new vocoder while the controller has not yet reached convergence; and the model generation module 1400 is configured to select a vocoder as the speech synthesis model according to the performance score after the controller has reached convergence.
On the basis of any of the above embodiments, the vocoder construction module 1200 includes: a first constructing unit configured to construct an upsampling network in a conditional network of a vocoder according to first coding information in a structure coding, the first coding information including layer types corresponding to a plurality of convolutional layers of the upsampling network and channel numbers corresponding thereto; a second constructing unit configured to construct the gated loop unit in the autoregressive network of the vocoder according to second encoding information in the structural encoding, the second encoding information including an operation type corresponding to a structural node of the gated loop unit and an activation type corresponding thereto; and the integral construction unit is arranged to construct the conditional network and the autoregressive network into a vocoder according to a preset topological structure.
On the basis of any of the above embodiments, the iterative decision module 1300 includes: the sample processing unit is used for calling a single training sample in the training set, acquiring a plurality of continuous voice frames with preset time length to construct a voice frame sequence, and extracting acoustic features corresponding to the voice sequence; the feature representation unit is arranged to input the acoustic features into a conditional network of the vocoder, acquire global feature information of the acoustic features through a residual network therein, acquire local feature information of the acoustic features under multiple scales through an up-sampling network therein, and acquire comprehensive feature information consisting of the global feature information and the local feature information; the feature prediction unit is arranged to extract relatively stable style features in the voice frame sequence from the comprehensive feature information through a gating circulation unit of an autoregressive network in the vocoder to obtain predicted feature information; a voice frame generating unit configured to generate a subsequent voice frame of the voice frame sequence according to the prediction feature information via a classification network in the vocoder; and the loss calculating unit is arranged to calculate the loss value of the subsequent speech frame by adopting the speech frame after the time sequence of the speech frame sequence in the training sample, and controls the iterative training of the vocoder according to the loss value.
On the basis of any of the above embodiments, the iterative decision module 1300 includes: the testing and scoring unit is arranged to test the vocoder by adopting a testing sample in the testing set to obtain a performance score, and the performance score comprises a quality score obtained after the testing sample is processed by the vocoder; and the gradient optimization unit is arranged for implementing strategy gradient optimization on the controller according to the performance score, and continuously and iteratively calling the controller to generate a new vocoder under the condition that the controller does not reach a convergence state.
Based on any of the above embodiments, the model generation module 1400 includes: a model training module configured to train the speech synthesis model to a convergence state; and a packet loss compensation module configured to configure the speech synthesis model to smoothly access a subsequent speech frame generated according to a speech stream into the speech stream.
On the basis of any of the above embodiments, the model training module includes: a first training unit configured to perform first-stage training on the vocoder using the first type of training samples in the data set, train the vocoder to a convergence state, and perform training according to a preset weight sparsification target in the first-stage training; a second training unit configured to perform second-stage training on the vocoder using the second type of training samples in the data set, training the vocoder to a convergence state; and a third training unit configured to freeze the weights of the conditional network of the vocoder, perform third-stage training on the vocoder using the second type of training samples in the data set, and train the vocoder to a convergence state so as to adjust the weights of the autoregressive network in the vocoder; wherein the training samples are audio data comprising a plurality of time-sequentially continuous speech frames, a time-sequentially subsequent speech frame being used to calculate the loss value of the subsequent speech frame generated by the vocoder from the time-sequentially preceding speech frames, and the second type of training samples are audio data corresponding to pure human voice segments.
On the basis of any of the above embodiments, the third training unit includes: a masking processing subunit configured to replace, with a mask representation, the acoustic features of a preset number of speech frames that time-sequentially follow the last speech frame in the speech frame sequence sampled from each second-type training sample; an iteration generation subunit configured to iteratively generate, by the vocoder, subsequent speech frames corresponding to the plurality of following speech frames on the basis of the speech frame sequence of the training sample; and a weight modification subunit configured to calculate the loss values of the corresponding subsequent speech frames from the plurality of following speech frames in the second-type training samples, and to modify the weights of the autoregressive network in the vocoder according to the loss values.
Another embodiment of the present application also provides a speech synthesis model generation device, whose internal structure is shown schematically in fig. 9. The speech synthesis model generation device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer-readable non-transitory storage medium of the device stores an operating system, a database storing information sequences, and computer-readable instructions which, when executed by the processor, cause the processor to implement a speech synthesis model generation method.
The processor of the speech synthesis model generation device is used to provide computational and control capabilities to support the operation of the overall speech synthesis model generation device. The memory of the speech synthesis model generation device may have stored therein computer readable instructions which, when executed by the processor, may cause the processor to perform the speech synthesis model generation method of the present application. The network interface of the speech synthesis model generation device is used for connecting and communicating with a terminal.
It will be understood by those skilled in the art that the structure shown in fig. 9 is a block diagram of only part of the structure related to the present application and does not constitute a limitation of the speech synthesis model generation device to which the present application is applied; a specific speech synthesis model generation device may include more or fewer components than shown in the figure, may combine some components, or may have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific functions of each module in fig. 8, and the memory stores the program codes and various data required for executing these modules or sub-modules. The network interface is used for realizing data transmission between user terminals or servers. The non-volatile readable storage medium in the present embodiment stores the program codes and data necessary for executing all the modules of the speech synthesis model generation device according to the present application, and the server can call these program codes and data to execute the functions of all the modules.
The present application also provides a non-transitory readable storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the speech synthesis model generation method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application may be implemented by hardware related to instructions of a computer program, which may be stored in a non-volatile readable storage medium, and when executed, may include the processes of the embodiments of the methods as described above. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
In summary, the present application realizes the automatic production and preferred selection of vocoders by means of the controller, so that the obtained speech synthesis model meets the deployment requirements of mobile terminal devices, achieves good performance after deployment on a mobile terminal device, and satisfies the requirements of model miniaturization and high real-time performance in speech synthesis scenarios.

Claims (11)

1. A method for generating a speech synthesis model, comprising:
calling a controller, and generating the structure code of the vocoder by the controller;
constructing a vocoder according to the structural code, the vocoder comprising a conditional network and an autoregressive network generated according to the structural code;
iteratively training the vocoder to a convergence state by using a training set, and, according to a performance score obtained by the vocoder on a test set, performing a gradient update on the controller and iteratively generating a new vocoder while the controller has not yet reached convergence;
after the controller reaches convergence, the vocoder is selected as the speech synthesis model according to the performance score.
2. The method of claim 1, wherein constructing the vocoder from the structural encoding comprises:
constructing an up-sampling network in a conditional network of a vocoder according to first coding information in the structural coding, wherein the first coding information comprises layer types corresponding to a plurality of convolutional layers of the up-sampling network and channel numbers corresponding to the layer types;
constructing a gated recurrent unit in an autoregressive network of the vocoder according to second coding information in the structural coding, wherein the second coding information comprises an operation type corresponding to a structural node of the gated recurrent unit and an activation type corresponding to the operation type;
and constructing the conditional network and the autoregressive network into a vocoder according to a preset topological structure.
3. The method of claim 1, wherein iteratively training the vocoder to a convergence state using a training set comprises:
calling a single training sample in a training set, acquiring a plurality of continuous speech frames of a preset duration to construct a speech frame sequence, and extracting acoustic features corresponding to the speech frame sequence;
inputting the acoustic features into a conditional network of a vocoder, acquiring global feature information of the acoustic features through a residual network therein, acquiring local feature information of the acoustic features under multiple scales through an up-sampling network therein, and acquiring comprehensive feature information consisting of the global feature information and the local feature information;
extracting relatively stable style features of the speech frame sequence from the comprehensive feature information through a gated recurrent unit of the autoregressive network in the vocoder to obtain predicted feature information;
generating subsequent speech frames of a sequence of speech frames according to the prediction feature information via a classification network in a vocoder;
and calculating the loss value of the subsequent speech frame by using the speech frame that time-sequentially follows the speech frame sequence in the training sample, and controlling the iterative training of the vocoder according to the loss value.
4. The method of claim 1, wherein the step of, according to the performance score obtained by the vocoder on the test set, performing a gradient update on the controller and iteratively generating a new vocoder while the controller has not yet reached convergence comprises:
testing the vocoder by using the test sample in the test set to obtain a performance score, wherein the performance score comprises a quality score obtained after the test sample is processed by the vocoder;
and implementing policy gradient optimization on the controller according to the performance score, and continuously and iteratively calling the controller to generate a new vocoder under the condition that the controller does not reach a convergence state.
5. The method of claim 1, wherein the step of selecting a vocoder as the speech synthesis model according to the performance score after the controller converges comprises:
training the speech synthesis model to a converged state;
the speech synthesis model is configured to smoothly access subsequent speech frames generated from a speech stream into the speech stream.
6. The method of generating a speech synthesis model according to claim 5, wherein the training of the speech synthesis model to a convergent state comprises:
performing first-stage training on the speech synthesis model using the first type of training samples in the data set, training the speech synthesis model to a convergence state, and performing training according to a preset weight sparsification target in the first-stage training;
performing second-stage training on the speech synthesis model using the second type of training samples in the data set, and training the speech synthesis model to a convergence state;
freezing the weights of the conditional network of the speech synthesis model, performing third-stage training on the speech synthesis model using the second type of training samples in the data set, and training the speech synthesis model to a convergence state so as to adjust the weights of the autoregressive network in the speech synthesis model;
wherein the training samples are audio data comprising a plurality of time-sequentially continuous speech frames, a time-sequentially subsequent speech frame being used for calculating the loss value of the subsequent speech frame generated by the speech synthesis model from the time-sequentially preceding speech frames, and the second type of training samples are audio data corresponding to pure human voice segments.
7. The method of generating a speech synthesis model according to claim 6, wherein the third stage training is performed on the speech synthesis model using the second class of training samples in the data set, and comprises:
replacing, with a mask representation, the acoustic features of a preset number of speech frames that time-sequentially follow the last speech frame in the speech frame sequence sampled from each second-type training sample;
iteratively generating, by the speech synthesis model, subsequent speech frames corresponding to the plurality of following speech frames on the basis of the speech frame sequence of the training sample;
and calculating the loss value of the corresponding subsequent speech frame according to the plurality of subsequent speech frames in the second class of training samples, and correcting the weight of the autoregressive network in the speech synthesis model according to the loss value.
8. A speech synthesis model generation apparatus, comprising:
a code generation module, which is set to call the controller, and the controller generates the structure code of the vocoder;
a vocoder construction module configured to construct a vocoder according to structural encoding, the vocoder comprising a conditional network and an autoregressive network generated according to the structural encoding;
an iterative decision module configured to iteratively train the vocoder to a convergence state using a training set, and, according to the performance score obtained by the vocoder on a test set, perform a gradient update on the controller and iteratively generate a new vocoder while the controller has not yet reached convergence;
and the model output module is set to select the vocoder as the voice synthesis model according to the performance score after the controller reaches convergence.
9. A speech synthesis model generation device comprising a central processor and a memory, characterized in that the central processor is configured to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
10. A non-transitory readable storage medium storing a computer program implemented according to the method of any one of claims 1 to 7 in the form of computer readable instructions, the computer program, when invoked by a computer, performing the steps included in the corresponding method.
11. A computer program product comprising computer programs/instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 7.
CN202210803325.8A 2022-07-07 2022-07-07 Speech synthesis model generation method and device, equipment, medium and product thereof Pending CN115188362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210803325.8A CN115188362A (en) 2022-07-07 2022-07-07 Speech synthesis model generation method and device, equipment, medium and product thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210803325.8A CN115188362A (en) 2022-07-07 2022-07-07 Speech synthesis model generation method and device, equipment, medium and product thereof

Publications (1)

Publication Number Publication Date
CN115188362A true CN115188362A (en) 2022-10-14

Family

ID=83517657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210803325.8A Pending CN115188362A (en) 2022-07-07 2022-07-07 Speech synthesis model generation method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN115188362A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination