CN113470616A - Speech processing method and apparatus, vocoder and vocoder training method - Google Patents

Speech processing method and apparatus, vocoder and vocoder training method

Info

Publication number
CN113470616A
Authority
CN
China
Prior art keywords
domain signal, sampling, low, rate, signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110794822.1A
Other languages
Chinese (zh)
Other versions
CN113470616B (en)
Inventor
张旭
张新
李楠
郑羲光
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110794822.1A priority Critical patent/CN113470616B/en
Publication of CN113470616A publication Critical patent/CN113470616A/en
Application granted granted Critical
Publication of CN113470616B publication Critical patent/CN113470616B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure provides a speech processing method and apparatus, a vocoder, and a training method for the vocoder. The speech processing method may include: downsampling high-sampling-rate mel spectral features to obtain low-sampling-rate mel spectral features; obtaining a low time-domain signal using a first neural network of a vocoder based on the low-sampling-rate mel spectral features; obtaining a high time-domain signal by up-sampling the low time-domain signal; and obtaining, using a second neural network of the vocoder, a speech signal corresponding to the high-sampling-rate mel spectral features based on the high-sampling-rate mel spectral features and the high time-domain signal. The present disclosure enables synthesis of high-sampling-rate speech signals while maintaining low computational complexity.

Description

Speech processing method and apparatus, vocoder and vocoder training method
Technical Field
The present disclosure relates to the field of speech processing, and in particular, to a speech processing method and a speech processing apparatus for speech synthesis, and a training method for a vocoder and a vocoder.
Background
Vocoders have found wide application in speech synthesis using deep learning. An existing speech synthesis process generally first predicts a frequency-domain mel spectrum from the input text and then converts the mel spectrum into time-domain sampling points. The conversion from mel spectrum to sampling points is conventionally performed with the Griffin-Lim algorithm, but this algorithm yields relatively poor speech quality, whereas speech converted with deep learning methods has higher quality. Generally, the higher the effective sampling rate of the speech, the higher the quality of the synthesized speech and the better the listening experience. However, synthesizing high-sampling-rate audio is generally accompanied by an increase in the number of network parameters, which increases the cost of running the network.
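As a point of reference for the conventional approach mentioned above, the mel-spectrum-to-waveform conversion can be sketched with the Griffin-Lim implementation in librosa; the sampling rate, FFT size, and hop length below are illustrative assumptions rather than values taken from this disclosure.

    import librosa

    # Illustrative settings (assumptions, not values from this disclosure).
    sr, n_fft, hop = 16000, 1024, 256

    def mel_to_wave_griffin_lim(mel_power):
        # mel_power: mel-scale power spectrogram of shape (n_mels, frames).
        # Approximately invert the mel filter bank to a linear magnitude spectrogram.
        linear_mag = librosa.feature.inverse.mel_to_stft(
            mel_power, sr=sr, n_fft=n_fft, power=2.0)
        # Iterative Griffin-Lim phase reconstruction; quality is limited compared
        # with a neural vocoder, which motivates the approach of this disclosure.
        return librosa.griffinlim(linear_mag, hop_length=hop)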
Disclosure of Invention
The present disclosure provides a speech processing method and a speech processing apparatus for speech synthesis and a training method of a vocoder and a vocoder to solve at least the above-mentioned problems.
According to a first aspect of embodiments of the present disclosure, there is provided a speech processing method, which may include the steps of: downsampling the high sampling rate Mel spectral features to obtain low sampling rate Mel spectral features; obtaining a low time domain signal using a first neural network of a vocoder based on low sample rate mel-frequency spectral features; obtaining a high time domain signal by up-sampling the low time domain signal; a second neural network of the vocoder is utilized to obtain a speech signal corresponding to the high-sampling-rate Mel spectral features based on the high-sampling-rate Mel spectral features and the high-time-domain signal.
Optionally, the step of obtaining the low time-domain signal using the first neural network of the vocoder based on the low-sampling-rate mel spectral features may include performing the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low time-domain signal based on the amplitude spectrum corresponding to the low-sampling-rate mel spectral features; obtaining a first embedding vector through the operation of a first encoder of the first neural network based on the low-sampling-rate mel spectral features; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
Optionally, the step of obtaining, using the second neural network of the vocoder, a speech signal corresponding to the high-sampling-rate mel spectral features based on the high-sampling-rate mel spectral features and the high time-domain signal may include performing the following for each sampling point of the speech signal: calculating a second estimated value of the current sampling point of the speech signal based on the amplitude spectrum corresponding to the high-sampling-rate mel spectral features; obtaining a second embedding vector through the operation of a second encoder of the second neural network based on the high-sampling-rate mel spectral features; and obtaining the current sampling point of the speech signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
Alternatively, the high sampling rate mel-frequency spectrum features may be obtained by performing mel-frequency spectrum prediction on the input text.
According to a second aspect of embodiments of the present disclosure, there is provided a training method of a vocoder, which may include the steps of: acquiring a sample set, wherein the sample set includes a high-sampling-rate time-domain signal, low-frequency-domain features, and high-frequency-domain features, a low-sampling-rate time-domain signal is obtained by down-sampling the high-sampling-rate time-domain signal, the low-frequency-domain features are mel spectral features of the low-sampling-rate time-domain signal, and the high-frequency-domain features are mel spectral features of the high-sampling-rate time-domain signal; obtaining a low time-domain signal using a first neural network of the vocoder based on the low-frequency-domain features; obtaining a high time-domain signal by up-sampling the low time-domain signal; obtaining a synthesized signal using a second neural network of the vocoder based on the high-frequency-domain features and the high time-domain signal; constructing a loss function using the low time-domain signal, the low-sampling-rate time-domain signal, the synthesized signal, and the high-sampling-rate time-domain signal; and training parameters of the vocoder based on the loss calculated by the loss function.
Optionally, the step of obtaining the low time-domain signal using the first neural network of the vocoder based on the low-frequency-domain features may include performing the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low-sampling-rate time-domain signal based on the amplitude spectrum of the low-sampling-rate time-domain signal; obtaining a first embedding vector through the operation of a first encoder of the first neural network based on the low-frequency-domain features; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
Optionally, the step of obtaining the synthesized signal using the second neural network of the vocoder based on the high-frequency-domain features and the high time-domain signal may include performing the following for each sampling point of the synthesized signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum of the high-sampling-rate time-domain signal; obtaining a second embedding vector through the operation of a second encoder of the second neural network based on the high-frequency-domain features; and obtaining the current sampling point of the synthesized signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
Optionally, the step of constructing the loss function using the low time-domain signal, the low-sampling-rate time-domain signal, the synthesized signal, and the high-sampling-rate time-domain signal may include: constructing a first cross-entropy loss function using the low time-domain signal and the low-sampling-rate time-domain signal; constructing a second cross-entropy loss function using the synthesized signal and the high-sampling-rate time-domain signal; and constructing the loss function from the first cross-entropy loss function and the second cross-entropy loss function.
According to a third aspect of the embodiments of the present disclosure, there is provided a voice processing apparatus, which may include: an acquisition module configured to downsample the high sampling rate mel-frequency spectrum features to acquire low sampling rate mel-frequency spectrum features; and a processing module configured to: obtaining a low time domain signal using a first neural network of a vocoder based on low sample rate mel-frequency spectral features; obtaining a high time domain signal by up-sampling the low time domain signal; a second neural network of the vocoder is utilized to obtain a speech signal corresponding to the high-sampling-rate Mel spectral features based on the high-sampling-rate Mel spectral features and the high-time-domain signal.
Optionally, the processing module may be configured to perform the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low time-domain signal based on the amplitude spectrum corresponding to the low-sampling-rate mel spectral features; obtaining a first embedding vector through the operation of a first encoder of the first neural network based on the low-sampling-rate mel spectral features; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
Optionally, the processing module may be configured to perform the following for each sampling point of the speech signal: calculating a second estimated value of the current sampling point of the speech signal based on the amplitude spectrum corresponding to the high-sampling-rate mel spectral features; obtaining a second embedding vector through the operation of a second encoder of the second neural network based on the high-sampling-rate mel spectral features; and obtaining the current sampling point of the speech signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
Alternatively, the high sampling rate mel-frequency spectrum features may be obtained by performing mel-frequency spectrum prediction on the input text.
According to a fourth aspect of embodiments of the present disclosure, there is provided a training apparatus of a vocoder, which may include: an obtaining module configured to obtain a sample set, wherein the sample set includes a high-sampling-rate time-domain signal, low-frequency-domain features, and high-frequency-domain features, a low-sampling-rate time-domain signal is obtained by down-sampling the high-sampling-rate time-domain signal, the low-frequency-domain features are mel spectral features of the low-sampling-rate time-domain signal, and the high-frequency-domain features are mel spectral features of the high-sampling-rate time-domain signal; and a training module configured to: obtain a low time-domain signal using a first neural network of the vocoder based on the low-frequency-domain features; obtain a high time-domain signal by up-sampling the low time-domain signal; obtain a synthesized signal using a second neural network of the vocoder based on the high-frequency-domain features and the high time-domain signal; construct a loss function using the low time-domain signal, the low-sampling-rate time-domain signal, the synthesized signal, and the high-sampling-rate time-domain signal; and train parameters of the vocoder based on the loss calculated by the loss function.
Optionally, the training module may be configured to perform the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low-sampling-rate time-domain signal based on the amplitude spectrum of the low-sampling-rate time-domain signal; obtaining a first embedding vector through the operation of a first encoder of the first neural network based on the low-frequency-domain features; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
Optionally, the training module may be configured to perform the following for each sampling point of the synthesized signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum of the high-sampling-rate time-domain signal; obtaining a second embedding vector through the operation of a second encoder of the second neural network based on the high-frequency-domain features; and obtaining the current sampling point of the synthesized signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
Optionally, the training module may be configured to: construct a first cross-entropy loss function using the low time-domain signal and the low-sampling-rate time-domain signal; construct a second cross-entropy loss function using the synthesized signal and the high-sampling-rate time-domain signal; and construct the loss function from the first cross-entropy loss function and the second cross-entropy loss function.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus, which may include: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the speech processing method and the training method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the speech processing method and the training method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, instructions of which are executed by at least one processor in an electronic device to perform the speech processing method and the training method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the vocoder of the present disclosure is implemented by adding a network with a smaller number of parameters on the basis of the original LPCNet, so that the vocoder of the present disclosure can synthesize a high-sampling-rate speech signal while maintaining low computational complexity.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a diagram of an existing LPCNet;
fig. 2 is a schematic flow diagram for training a vocoder according to an embodiment of the present disclosure;
fig. 3 is a flow chart of a method of training a vocoder according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a vocoder according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow diagram for resampling, according to an embodiment of the disclosure;
FIG. 6 is a flow diagram of a method of speech processing according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a training apparatus of a vocoder according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 9 is a schematic block diagram of a speech processing device according to an embodiment of the present disclosure;
fig. 10 is a block diagram of an electronic device according to an embodiment of the disclosure.
Throughout the drawings, it should be noted that the same reference numerals are used to designate the same or similar elements, features and structures.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure as defined by the claims and their equivalents. Various specific details are included to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to the written meaning, but are used only by the inventors to achieve a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following descriptions of the various embodiments of the present disclosure are provided for illustration only and not for the purpose of limiting the disclosure as defined by the appended claims and their equivalents.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a diagram of an existing LPCNet.
Referring to fig. 1, the conventional LPCNet implements the vocoder function in the form of an encoder and a decoder. The input of the encoder portion (such as 101 in fig. 1) is a frame-rate-domain feature of speech (such as features in fig. 1), and the output is an embedded vector provided to the decoder portion (such as 102 in fig. 1). The linear prediction coefficient (LPC) module may compute the LPC coefficients based on the features and predict/compute an estimate of the current sampling point. The decoder portion 102 receives the output of the encoder portion 101, the estimate p_t of the current sampling point calculated by the LPC module, the sampling point s_{t-1} output by the decoder portion 102 at the previous time, and the error e_{t-1} between the output of the decoder portion at the previous time and the true sampling point, performs a series of operations, and outputs the sampling point s_t at the current time.
In fig. 1, the encoder portion 101 includes two convolutional layers (such as conv 1x3) and two fully-connected layers (such as FC). The encoder portion 101 may output the embedded vector by performing two convolution operations, a summation operation, and two fully-connected layer operations on the features.
The decoder portion 102 includes two gated recurrent units (such as GRU_A and GRU_B), a dual fully-connected layer (such as dual FC), and a normalization layer (such as softmax). The decoder portion 102 may perform a concatenation (concat) operation on the output of the encoder, p_t, s_{t-1}, and e_{t-1}, followed by the two GRU operations, the dual FC operation, a softmax operation, a sampling operation, and a summation operation, to output the current sampling point s_t of the LPCNet.
The LPCNet outputs sampling points autoregressively: each run outputs only one time-domain sampling point, so outputting speech at a higher sampling rate multiplies the amount of computation. For example, at a sampling rate of 16k, the LPCNet needs to run its decoder 16000 times for every second of speech; after the sampling rate is increased to 32k, the decoder, which contains two GRU layers, needs to run 32000 times, so the amount of computation is large.
To solve the above problem, the present disclosure proposes a cascaded deep neural network algorithm to synthesize high-sampling-rate audio. According to an embodiment of the present disclosure, a neural network (hereinafter referred to as network B) is added on the basis of the existing LPCNet (hereinafter referred to as network A), where network A may be used to process frequency-domain features at a low sampling rate and network B may be used to process frequency-domain features at a high sampling rate. The low-sampling-rate signal output by network A is up-sampled to a pseudo high-sampling-rate signal, and the high-sampling-rate frequency-domain features and the pseudo high-sampling-rate signal are connected in parallel and then serve as the input of the newly added network B, so that network B outputs a high-sampling-rate synthesized signal. In this way, the network complexity is reduced while the quality of the synthesized audio is ensured.
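The cascaded flow described above can be summarized in the following sketch; the function names are placeholders for the modules of network A, network B, and the resampling step, and their interfaces are assumptions for illustration only.

    def synthesize(mel_high, downsample_mel, network_a, upsample, network_b):
        # Step 1: derive low-sampling-rate mel features from the high-rate features.
        mel_low = downsample_mel(mel_high)
        # Step 2: network A (LPCNet-style) synthesizes a low-sampling-rate signal.
        wav_low = network_a(mel_low)
        # Step 3: up-sample to a pseudo high-sampling-rate signal.
        wav_pseudo_high = upsample(wav_low)
        # Step 4: network B conditions on the high-rate mel features and the
        # pseudo high-rate signal to produce the final high-sampling-rate speech.
        return network_b(mel_high, wav_pseudo_high)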
Hereinafter, according to various embodiments of the present disclosure, a method, an apparatus, and a system of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 2 is a schematic flow diagram for training a vocoder according to an embodiment of the present disclosure. According to an embodiment of the present disclosure, a vocoder may be constructed of network a and network B.
Referring to fig. 2, a sample set is first obtained before training the vocoder. The target time-domain signal with a high sampling rate (i.e., the high-sampling-rate time-domain signal) may be down-sampled to obtain a low-sampling-rate signal (i.e., the low-sampling-rate time-domain signal), and the low-sampling-rate signal is then subjected to a short-time Fourier transform (STFT) and passed through a mel filter bank (Mel Bank) to obtain low-sampling-rate mel spectral features (i.e., the low-frequency-domain features), which are used as the input features of network A. Here network A may comprise a first encoder, a first decoder, and an LPC module. The LPC module may be used to calculate the LPC coefficients as well as the estimate of the current sampling point. The first encoder performs an encoding operation on the mel spectral features to obtain a first embedding vector, and the first decoder performs a decoding operation using the first embedding vector to obtain a synthesized low-frequency signal (i.e., a low time-domain signal). The synthesized low-frequency signal is up-sampled to obtain a synthesized high-frequency signal (i.e., a high time-domain signal) as an input of network B. Here, the high time-domain signal may be regarded as a pseudo high-frequency signal.
In addition, after the short-time Fourier transform (STFT) is performed on the high-sampling-rate target time-domain signal, high-sampling-rate mel spectral features (i.e., the high-frequency-domain features) are obtained through the mel filter bank and are used as an input of network B.
According to an embodiment of the present disclosure, network B may include an LPC module, a second encoder, and a second decoder. The LPC module may be used to calculate LPC coefficients as well as an estimate of the current sample point. The second encoder may perform an encoding operation on the mel-frequency spectral features at the high sampling rate to obtain a second embedded vector. The second decoder performs a decoding operation using the second embedded vector and the pseudo-synthesized high-frequency signal to obtain a final high-frequency signal. The required size of the network parameters can be reduced since network B has enough information to synthesize a high sample rate speech signal.
Compared with network A, the second encoder of network B according to embodiments of the present disclosure may use one fewer convolutional layer and one fewer fully-connected layer, and the second decoder may use only one GRU layer. Compared with directly using network A to synthesize high-sampling-rate speech signals, the network structure and the computational complexity of the vocoder are reduced. This is because network B has more input information than network A, so a complex network structure is not required to synthesize the high-sampling-rate signal, and the high-sampling-rate signal can be synthesized with a smaller number of parameters. The training process of the vocoder of the present disclosure will be described in detail with reference to fig. 3.
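The difference in encoder size can be illustrated with the following PyTorch sketch; the channel width, mel dimension, and activation choices are assumptions, and the residual summation of LPCNet's encoder is omitted for brevity.

    import torch
    import torch.nn as nn

    class EncoderA(nn.Module):
        """First encoder: two 1x3 convolutions and two fully-connected layers."""
        def __init__(self, n_mels=80, width=128):  # sizes are assumptions
            super().__init__()
            self.convs = nn.Sequential(
                nn.Conv1d(n_mels, width, kernel_size=3, padding=1), nn.Tanh(),
                nn.Conv1d(width, width, kernel_size=3, padding=1), nn.Tanh())
            self.fcs = nn.Sequential(
                nn.Linear(width, width), nn.Tanh(),
                nn.Linear(width, width), nn.Tanh())

        def forward(self, mel):                  # mel: (batch, n_mels, frames)
            h = self.convs(mel).transpose(1, 2)  # (batch, frames, width)
            return self.fcs(h)                   # first embedding vector v_A

    class EncoderB(nn.Module):
        """Second encoder: one fewer convolution and one fewer FC layer."""
        def __init__(self, n_mels=80, width=128):
            super().__init__()
            self.conv = nn.Conv1d(n_mels, width, kernel_size=3, padding=1)
            self.fc = nn.Linear(width, width)

        def forward(self, mel):
            h = torch.tanh(self.conv(mel)).transpose(1, 2)
            return torch.tanh(self.fc(h))        # second embedding vector v_B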
Fig. 3 is a flow chart of a method of training a vocoder according to an embodiment of the present disclosure. The vocoder according to the present disclosure has a better effect in synthesizing audio of a high sampling rate.
Referring to fig. 3, in step S301, a training sample set is acquired. The training sample set for training the vocoder may include a high sampling rate time domain signal, a low frequency domain feature and a high frequency domain feature, wherein the low sampling rate time domain signal is obtained by down-sampling the high sampling rate time domain signal, the low frequency domain feature is a mel-spectrum feature of the low sampling rate time domain signal, and the high frequency domain feature is a mel-spectrum feature of the high sampling rate time domain signal.
As an example, the low-sampling-rate time-domain signal and the high-sampling-rate time-domain signal may be time-frequency converted to obtain a low-frequency-domain signal and a high-frequency-domain signal, respectively. The low frequency domain features and the high frequency domain features are obtained by applying a mel filter to energy spectra of the low frequency domain signal and the high frequency domain signal.
For example, a high-sampling-rate time-domain signal x_h(t) of length T in the time domain is taken as a training sample, where t represents time and 0 < t ≤ T. First, the signal x_h(t) is down-sampled to obtain a low-sampling-rate time-domain signal x_l(t); then, a short-time Fourier transform (STFT) is performed on x_h(t) and x_l(t) separately to obtain the amplitude spectra Mag_h and Mag_l of the corresponding frequency-domain signals using the following equations (1) and (2):
Mag_h = abs(STFT(x_h))    (1)
Mag_l = abs(STFT(x_l))    (2)
From the amplitude spectra Mag_h and Mag_l, the energy spectra are taken and passed through a mel filter bank H_m(k) to obtain the mel spectra Mel_h and Mel_l. The mel filter bank H_m(k) is a set of non-linearly distributed triangular filters with center frequencies f(m), where m = 1, 2, ..., M, and each triangular filter has the form of equation (3):
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)),  for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)),  for f(m) < k ≤ f(m+1)
H_m(k) = 0,  otherwise    (3)
where f(m) denotes the center frequency of the m-th filter.
the Mel filter bank H is calculated according to the following equation (4)m(k) Logarithmic energy q (m) output by each filter in (a):
Figure BDA0003162267740000093
where K represents a frequency point subscript, | X (K) | is equivalent to the high and low frequency signal xh(t) and xl(t) magnitude spectrum of the frequency domain signal. Therefore, the input characteristics Mel of the network A and the network B can be obtained by the above calculationhAnd Nell
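A minimal sketch of this feature preparation, assuming 32 kHz / 16 kHz sampling rates, an FFT size of 1024, and 80 mel bands (all illustrative values), could look as follows.

    import numpy as np
    import librosa

    def make_features(x_h, sr_high=32000, sr_low=16000, n_fft=1024, hop=256, n_mels=80):
        # Down-sample the high-rate target x_h(t) to obtain x_l(t).
        x_l = librosa.resample(x_h, orig_sr=sr_high, target_sr=sr_low)
        # Amplitude spectra Mag_h and Mag_l, equations (1) and (2).
        mag_h = np.abs(librosa.stft(x_h, n_fft=n_fft, hop_length=hop))
        mag_l = np.abs(librosa.stft(x_l, n_fft=n_fft, hop_length=hop))
        # Triangular mel filter banks H_m(k) and log energies q(m), equations (3) and (4).
        fb_h = librosa.filters.mel(sr=sr_high, n_fft=n_fft, n_mels=n_mels)
        fb_l = librosa.filters.mel(sr=sr_low, n_fft=n_fft, n_mels=n_mels)
        mel_h = np.log(fb_h @ (mag_h ** 2) + 1e-10)   # Mel_h, input to network B
        mel_l = np.log(fb_l @ (mag_l ** 2) + 1e-10)   # Mel_l, input to network A
        return x_l, mag_h, mag_l, mel_h, mel_l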
In step S302, a low-time domain signal is obtained using a first neural network of a vocoder based on low-frequency domain features. Here, the first neural network may be the network a described above. The first neural network outputs a sampling point after each operation, thereby outputting a low time domain signal.
The following operations may be performed for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low time-domain signal based on the amplitude spectrum of the low-frequency-domain signal; obtaining a first embedding vector through the operation of a first encoder of the first neural network based on the low-frequency-domain features; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
As an example, the mel spectrum Mel_l is input into the first encoder of network A. The first encoder may be implemented by two convolutional layers and two fully-connected layers, but is not limited thereto. The first encoder may output a first embedding vector v_A of a fixed dimension according to the operation of equation (5), for use by the first decoder of network A:
v_A = En(Mel_l)    (5)
where En denotes the operation procedure of the encoder.
A low time-domain signal may be obtained with the first decoder of network A based on the first embedding vector. Here, the low time-domain signal is a synthesized low-sampling-rate time-domain signal.
For the first decoder, the input at each time is v_A, p_t, and e_{t-1}, where p_t is an estimate of the current sampling point predicted from the LPC coefficients, e_{t-1} is the difference between the sampling point s_{t-1} output by the first decoder at the previous time and the true sampling point, and s_{t-1} is the output of the first decoder at the previous time. Here, the true sampling point is taken from the low-sampling-rate time-domain signal obtained by down-sampling the high-sampling-rate time-domain signal.
In case the current sampling point is the first sampling point of the low time-domain signal, the sampling point error of the first decoder at the previous time may be set to zero. However, the above example is merely exemplary, and the initialization may be performed in various ways according to design requirements.
The estimated value p_t of the current sampling point can be predicted according to equation (6) below:
p_t = Σ_{k=1}^{K} a_k · s_{t-k}    (6)
where K represents the order of the LPC and a_k represents the LPC coefficient of each order, which can be predicted from the amplitude spectrum Mag_l. For example, assume the sampling rate of the high-sampling-rate time-domain signal is 32k and the sampling rate of the down-sampled signal is 16k; in this case K may be 16. That is, for the first decoder, the LPC coefficients are calculated from the amplitude spectrum of the down-sampled signal, and p_t is then calculated using the LPC coefficients and the previous outputs of the first decoder. The calculation of p_t may be implemented by the LPC module in network A.
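The prediction in equation (6) can be sketched as follows; estimating the LPC coefficients directly from the waveform with librosa.lpc is an assumption made for illustration, whereas the disclosure derives them from the amplitude spectrum Mag_l.

    import numpy as np
    import librosa

    def lpc_predict(prev_samples, lpc_order=16):
        # prev_samples: at least lpc_order + 1 previously synthesized samples.
        # librosa.lpc returns the prediction-error filter [1, a_1, ..., a_K],
        # so the one-step prediction is p_t = -sum_k a_k * s_{t-k}.
        a = librosa.lpc(np.asarray(prev_samples, dtype=float), order=lpc_order)
        recent = np.asarray(prev_samples[-1:-lpc_order - 1:-1])  # s_{t-1}, ..., s_{t-K}
        return float(-np.dot(a[1:], recent))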
The operation of the first decoder can be described by the following equation (7):
ŝ_t = De_A(v_A, p_t, e_{t-1})    (7)
where ŝ_t is the synthesized low-sampling-rate time-domain signal output by the first decoder at time t, and De_A denotes one run of the first decoder; each run outputs one sampling point ŝ_t of the synthesized low-sampling-rate time-domain signal.
In step S303, a high time domain signal is obtained by upsampling the low time domain signal.
As an example, the synthesized low-sampling-rate time-domain signal ŝ may be resampled to a high-sampling-rate time-domain signal ŝ_up (with sampling points ŝ_up,t) according to equation (8) below, and ŝ_up is used as an additional input to the second decoder:
ŝ_up = resample(ŝ, L, M)    (8)
where resample represents the resampling operation, L represents the current low sampling rate, and M represents the high sampling rate after resampling. Here, ŝ_up may be considered a pseudo high-sampling-rate signal. The resampling operation may be implemented by a resampling module in the vocoder, or the resampling module may be included in network A, and the present disclosure is not limited thereto. Resampling is described in detail below.
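Equation (8) can be realized, for example, with a polyphase resampler; the concrete rates below and the zero-valued stand-in signal are assumptions for illustration.

    import numpy as np
    from scipy.signal import resample_poly

    L, M = 16000, 32000                      # current low rate and target high rate (assumed)
    wav_low = np.zeros(L, dtype=np.float32)  # stand-in for the first decoder's output
    # Polyphase resampling includes the anti-imaging / anti-aliasing filtering
    # discussed later in this disclosure.
    wav_pseudo_high = resample_poly(wav_low, up=M, down=L)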
In step S304, a synthesized signal is obtained using a second neural network of the vocoder based on the high frequency domain features and the high time domain signal. The synthesized signal here is a speech signal to be finally output. According to an embodiment of the present disclosure, the second neural network may be implemented by network B described above.
The following operations may be performed for each sampling point of the final synthesized signal: calculating a second estimated value of the current sampling point of the final synthesized signal based on the amplitude spectrum of the high-sampling-rate time-domain signal in the sample set; obtaining a second embedding vector through the operation of a second encoder of the second neural network based on the high-frequency-domain features; and obtaining the current sampling point of the final synthesized signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
As an example, the mel spectrum Mel_h is input into the second encoder of network B. The second encoder according to the present disclosure may be implemented by one convolutional layer and one fully-connected layer. The second encoder may output an embedding vector v_B of a fixed dimension according to the operation of equation (9), for use by the second decoder:
v_B = En(Mel_h)    (9)
where En denotes the operation procedure of the encoder.
For the second decoder, the input at each time is the current sampling point ŝ_up,t of the high time-domain signal, p_t, sb_{t-1}, e'_{t-1}, eb_{t-1}, and v_B, where p_t is the estimate of the current sampling point predicted from the LPC coefficients; e'_{t-1} is the difference between the sampling point ŝ_up,t-1 of the high time-domain signal at the previous time and the true sampling point; eb_{t-1} is the difference between the sampling point sb_{t-1} output by the second decoder at the previous time and the true sampling point; and sb_{t-1} is the output of the second decoder at the previous time. It should be noted that, for the p_t input into the second decoder, the LPC coefficients are calculated from the amplitude spectrum of the high-sampling-rate time-domain signal in the sample set, and p_t is then calculated using the LPC coefficients and the previous outputs of the second decoder; it can be calculated similarly using equation (6).
In the case where the current sampling point is the first sampling point of the final synthesized signal, the sampling point error for the high time-domain signal at the previous time may be set to zero, the sampling point error of the second decoder at the previous time may be set to zero, and the sampling point output by the second decoder at the previous time may be set to the second estimated value. However, the above example is merely exemplary, and the initialization may be performed in various ways according to design requirements.
The operation of the second decoder can be described by the following equation (10):
sb_t = De_B(v_B, ŝ_up,t, p_t, e'_{t-1}, eb_{t-1}, sb_{t-1})    (10)
where sb_t is the synthesized high-sampling-rate time-domain signal output by the second decoder at time t, and De_B denotes one run of the second decoder, which outputs the high-sampling-rate time-domain sampling point sb_t at time t.
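The autoregressive use of the second decoder in equation (10) can be sketched as follows; decoder_b stands in for network B's GRU / dual-FC / softmax / sampling stack, lpc_predictor for the LPC module, and the error handling is simplified (the errors are held at zero, as is done for a trained vocoder, see below); all names are placeholders.

    def run_second_decoder(decoder_b, lpc_predictor, v_b, wav_pseudo_high):
        # v_b: per-sample (frame-repeated) conditioning vectors;
        # wav_pseudo_high: pseudo high-sampling-rate signal from equation (8).
        out = []
        err_up_prev, err_b_prev = 0.0, 0.0   # previous-time errors, held at zero here
        sb_prev = None
        for t, s_up_t in enumerate(wav_pseudo_high):
            p_t = lpc_predictor(out)          # estimate of the current sampling point
            if sb_prev is None:
                sb_prev = p_t                 # first sample: previous output <- estimate
            sb_t = decoder_b(v_b[t], s_up_t, p_t, err_up_prev, err_b_prev, sb_prev)
            out.append(sb_t)
            sb_prev = sb_t
        return out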
In step S305, a loss function is constructed using the low time-domain signal, the low-sampling-rate time-domain signal in the sample set, the synthesized signal, and the high-sampling-rate time-domain signal in the sample set. Specifically, a first cross-entropy loss function is constructed using the low time-domain signal and the down-sampled low-sampling-rate time-domain signal, a second cross-entropy loss function is constructed using the final synthesized signal and the high-sampling-rate time-domain signal, and the loss function is constructed from the first cross-entropy loss function and the second cross-entropy loss function.
As an example, the first loss function is constructed using the output of the first decoder and the signal obtained by down-sampling the high-sampling-rate time-domain signal (i.e., the low-sampling-rate time-domain signal), and the second loss function is constructed using the output of the second decoder and the high-sampling-rate time-domain signal. For example, the loss function may be constructed according to equation (11) below:
Loss = CrossEntropy(ŝ_t, s_l,t) + CrossEntropy(sb_t, s_h,t)    (11)
where CrossEntropy is the cross-entropy loss function, ŝ_t is the output of the first decoder at each time, sb_t is the output of the second decoder at each time, s_l,t is the low-sampling-rate true signal at each time (the signal obtained by down-sampling the high-sampling-rate training signal), and s_h,t is the high-sampling-rate true signal (the high-sampling-rate training signal) at each time.
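Equation (11) can be sketched in PyTorch as two cross-entropy terms; treating each decoder output as 256-way logits over mu-law-quantized sample values is an assumption borrowed from LPCNet-style vocoders rather than a detail stated in this disclosure.

    import torch.nn.functional as F

    def vocoder_loss(logits_low, target_low_idx, logits_high, target_high_idx):
        # logits_*: (time, 256) class scores output by the decoders;
        # target_*_idx: (time,) indices of the quantized true sampling points.
        loss_a = F.cross_entropy(logits_low, target_low_idx)    # first decoder vs. down-sampled signal
        loss_b = F.cross_entropy(logits_high, target_high_idx)  # second decoder vs. high-rate signal
        return loss_a + loss_b

The parameters of both encoders and both decoders can then be updated by back-propagating this combined loss, as described in step S306.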
In step S306, parameters of the vocoder are trained based on the loss calculated by the loss function. For example, parameters of the vocoder are trained by minimizing the loss calculated by the first and second loss functions, i.e., the network parameters of the first encoder, the second encoder, the first decoder, and the second decoder in the vocoder are trained.
The following explains the sampling rate and resampling.
The sampling rate indicates how many sampling points are used to describe a signal within a certain time period. The basic idea of sampling-rate conversion is decimation and interpolation, and, from a signal-processing perspective, audio resampling is filtering: once the window size of the filter function and the interpolation function are determined, the resampling performance is determined. Decimation may cause spectral aliasing, and interpolation may produce image components. Usually, an anti-aliasing filter is added before decimation and an anti-image filter is added after interpolation; as shown in fig. 5, h(n) represents the anti-aliasing filter and g(n) represents the anti-image filter.
Assuming that the original sampling rate of the audio signal is L, the new sampling rate is M, and the original signal length is N, the signal length K at the new sampling rate satisfies the following relationship (12):
K = N · M / L    (12)
for each discrete time value: k (K is more than or equal to 1 and less than or equal to K), the actual value nkIs expressed by equation (13):
Figure BDA0003162267740000126
nkthe position to be interpolated or decimated in the case of the original sampling interval.
Under ideal conditions, the frequency response of the filter h_D(n) is shown in equation (14):
H_D(e^{jω}) = D, for |ω| ≤ π/D
H_D(e^{jω}) = 0, otherwise    (14)
where D is the decimation or interpolation factor, i.e., D = L/M.
The signal after the filter output can be represented by equation (15):
y(k) = Σ_n x(n) · h_D(n_k - n)    (15)
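The interpolation described by equations (12) to (15) can be sketched from scratch as follows; the Hann-windowed sinc kernel and the tap count are stand-ins for the ideal filter h_D(n) and are assumptions for illustration.

    import numpy as np

    def resample_windowed_sinc(x, L, M, taps=32):
        # L: original sampling rate; M: new sampling rate; taps: one-sided kernel width.
        D = max(L / M, 1.0)                       # cutoff control for decimation
        K = int(len(x) * M / L)                   # new signal length, equation (12)
        y = np.zeros(K)
        for k in range(K):
            n_k = k * L / M                       # fractional position, equation (13)
            n0 = int(np.floor(n_k))
            for n in range(n0 - taps, n0 + taps + 1):
                if 0 <= n < len(x):
                    # Windowed version of the ideal low-pass kernel with cutoff pi/D.
                    w = 0.5 + 0.5 * np.cos(np.pi * (n_k - n) / (taps + 1))
                    y[k] += x[n] * np.sinc((n_k - n) / D) / D * w
        return y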
according to the embodiment of the disclosure, a network B with smaller parameters can be added on the basis of the existing LPCNet, so that the new vocoder synthesizes a voice signal with a high sampling rate while maintaining lower operation complexity.
Fig. 4 is a schematic diagram of a vocoder according to an embodiment of the present disclosure. Fig. 4 mainly shows and describes network B of the vocoder.
Referring to fig. 4, a vocoder according to the present disclosure may include a network A (such as the LPCNet A in fig. 4) and a network B (such as the portions other than LPCNet A in fig. 4). The structure of network B is similar to that of network A, but network B uses fewer parameters. The present disclosure adds network B on the basis of the existing LPCNet (i.e., LPCNet A in fig. 4), so that the vocoder of the present disclosure synthesizes a speech signal with a high sampling rate while maintaining low computational complexity.
The network a is used for processing the low-frequency mel-frequency spectrum characteristics to obtain a synthesized low-frequency signal, and obtaining a pseudo-synthesized high-frequency signal by up-sampling the synthesized low-frequency signal. Here, for the pseudo synthesized high frequency signal, it may be obtained by the network a, or may be obtained by other modules.
Network B is used to process high frequency mel-frequency spectral features (such as the features in fig. 4) to obtain a composite high frequency signal.
Network B may include an LPC module (for computing the LPC coefficients and the estimate of the current sampling point), a second encoder 401, a second decoder 402, and other modules (such as an up-sampling module). Compared with network A, the encoder portion 401 has one fewer convolutional layer and one fewer fully-connected layer, the decoder portion 402 has only one GRU, and the decoder input additionally receives the current sampling point ŝ_up,t of the pseudo high-sampling-rate signal and the corresponding sampling point error e'_{t-1} at the previous time. Therefore, network B has a simpler structure and a smaller number of parameters.
For a vocoder that has been trained, the sample point error of the decoder in network a at the previous time and the sample point error of the decoder in network B at the previous time may be set to zero.
The network B shown in fig. 4 is merely exemplary, and the present disclosure is not limited thereto.
In summary, a network B is added on the basis of the original LPCNet A: a low-sampling-rate signal is synthesized from the low-frequency-domain features using the original LPCNet A, the output of the decoder of LPCNet A is up-sampled to a pseudo high-sampling-rate signal, the high-frequency-domain features and the pseudo high-sampling-rate signal are connected in parallel and then serve as the input of the newly added network B, and the decoder of network B outputs a high-sampling-rate synthesized signal.
Fig. 6 is a flow diagram of a method of speech processing according to an embodiment of the present disclosure. Text-to-speech (TTS) conversion is mainly divided into two parts: predicting the frequency-domain mel spectrum of the input text, and converting the mel spectrum into time-domain sampling points; the vocoder is mainly used to convert the mel spectrum into the time-domain sampling points. The speech processing method shown in fig. 6 is mainly applied to convert frequency-domain features converted from text into a speech signal.
Referring to fig. 6, in step S601, low-frequency domain features and high-frequency domain features predicted from text are obtained, wherein the low-frequency domain features are low-sampling-rate mel spectral features corresponding to the text, and the high-frequency domain features are high-sampling-rate mel spectral features corresponding to the text. Here, the low-sampling-rate mel-frequency spectrum feature and the high-sampling-rate mel-frequency spectrum feature may be obtained for the same text. For example, a high-sampling-rate mel-frequency spectrum feature is obtained by performing mel-frequency spectrum prediction on input characters, and then a low-sampling-rate mel-frequency spectrum feature is obtained by down-sampling the high-sampling-rate mel-frequency spectrum feature.
In step S602, a low time-domain signal is obtained using the first neural network of the vocoder based on the low-frequency-domain features. The following is performed for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low time-domain signal based on the amplitude spectrum corresponding to the low-sampling-rate mel spectral features; obtaining a first embedding vector through the operation of the first encoder of the first neural network based on the low-frequency-domain features; and obtaining the current sampling point of the low time-domain signal through the operation of the first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time. In speech processing, the sampling point error may be set to zero. For example, equation (6) above may be used to calculate the first estimated value, and equation (7) may be used to obtain the current sampling point of the low time-domain signal.
In step S603, a high time domain signal is obtained by upsampling the low time domain signal. Resampling may be performed using equation (8) above.
In step S604, a synthesized signal corresponding to the input text is obtained using a second neural network of the vocoder based on the high frequency domain features and the high time domain signal.
The following is performed for each sampling point of the final synthesized signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum corresponding to the high-sampling-rate mel spectral features; obtaining a second embedding vector through the operation of the second encoder of the second neural network based on the high-frequency-domain features; and obtaining the current sampling point of the synthesized signal through the operation of the second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time. Here, in speech processing, the sampling point errors may be set to zero.
In the case where the current sample point is the first sample point of the synthesized signal, the sample point output by the second decoder at the previous time may be set to the second estimation value. For example, the second estimation value may be calculated using equation (6), and the final speech signal may be obtained using equation (10).
Fig. 7 is a block diagram of a training apparatus of a vocoder according to an embodiment of the present disclosure.
Referring to fig. 7, training apparatus 700 may include an acquisition module 701 and a training module 702. Each module in the training apparatus 700 may be implemented by one or more modules, and the name of the corresponding module may vary according to the type of the module. In various embodiments, some modules in training device 700 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the respective modules/elements prior to combination.
The obtaining module 701 may obtain a sample set, where the sample set includes a high sampling rate time domain signal, a low frequency domain feature, and a high frequency domain feature, where the low sampling rate time domain signal is obtained by down-sampling the high sampling rate time domain signal, the low frequency domain feature is a mel-frequency spectrum feature of the low sampling rate time domain signal, and the high frequency domain feature is a mel-frequency spectrum feature of the high sampling rate time domain signal. Here, the acquisition module 701 may directly acquire the sample set from the outside. Alternatively, the obtaining module 701 may obtain the high-sampling-rate time-domain signal from the outside, then down-sample the high-sampling-rate time-domain signal to obtain the low-sampling-rate time-domain signal, and perform time-frequency transformation and filtering processing (such as through a mel filter bank) on the high-sampling-rate time-domain signal and the low-sampling-rate time-domain signal, respectively, so as to obtain the low-frequency-domain feature and the high-frequency-domain feature.
Training module 702 may obtain a low time-domain signal using a first neural network of a vocoder based on low frequency-domain features and obtain a high time-domain signal by upsampling the low time-domain signal, obtain a synthesized signal using a second neural network of the vocoder based on high frequency-domain features and the high time-domain signal, construct a loss function using the low time-domain signal, the low sample rate time-domain signal, the synthesized signal, and the high sample rate time-domain signal, train network parameters of the vocoder by minimizing a loss calculated from the loss function.
Alternatively, training module 702 may perform the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low-sampling-rate time-domain signal based on the amplitude spectrum of the low-sampling-rate time-domain signal; obtaining a first embedding vector through the operation of the first encoder of the first neural network based on the low-frequency-domain features; and obtaining the current sampling point of the low time-domain signal through the operation of the first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
Alternatively, the training module 702 may perform the following for each sampling point of the final synthesized signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum of the high-sampling-rate time-domain signal; obtaining a second embedding vector through the operation of the second encoder of the second neural network based on the high-frequency-domain features; and obtaining the current sampling point of the synthesized signal through the operation of the second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
Optionally, the training module 702 may construct a first cross-entropy loss function using the low time-domain signal and the low-sampling-rate time-domain signal, construct a second cross-entropy loss function using the synthesized signal and the high-sampling-rate time-domain signal, and form the loss function from the first cross-entropy loss function and the second cross-entropy loss function.
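One possible realization of this loss, assuming the two networks emit 256-way categorical distributions over mu-law-quantized sample values (an LPCNet-style convention that the present disclosure does not mandate), is sketched below.

```python
# Hypothetical cross-entropy loss over mu-law-quantized samples.
import torch
import torch.nn.functional as F

def mu_law_quantize(x, bits=8):
    mu = 2 ** bits - 1
    y = torch.sign(x) * torch.log1p(mu * torch.abs(x)) / torch.log1p(torch.tensor(float(mu)))
    return ((y + 1.0) / 2.0 * mu).long().clamp(0, mu)

def vocoder_loss(low_logits, y_low, synth_logits, y_high):
    # First cross-entropy: low time-domain signal vs. low-sampling-rate target.
    loss_low = F.cross_entropy(low_logits.transpose(1, 2), mu_law_quantize(y_low))
    # Second cross-entropy: synthesized signal vs. high-sampling-rate target.
    loss_high = F.cross_entropy(synth_logits.transpose(1, 2), mu_law_quantize(y_high))
    return loss_low + loss_high
```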
Fig. 8 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure.
Referring to fig. 8, the speech processing apparatus 800 may include an acquisition module 801 and a processing module 802. Each module in the speech processing apparatus 800 may be implemented by one or more sub-modules, and the name of each module may vary according to its type. In various embodiments, some modules of the speech processing apparatus 800 may be omitted, or additional modules may be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined into a single entity that equivalently performs the functions of the respective modules/elements prior to combination.
The acquisition module 801 may obtain low-frequency-domain features and high-frequency-domain features predicted from text, where the low-frequency-domain features are low-sampling-rate mel-spectrum features corresponding to the text and the high-frequency-domain features are high-sampling-rate mel-spectrum features corresponding to the text. The mel-spectrum features acquired by the acquisition module 801 may be obtained from the outside.
The processing module 802 may obtain a low time-domain signal using a first neural network of the vocoder based on the low-frequency-domain features, obtain a high time-domain signal by up-sampling the low time-domain signal, and obtain a synthesized signal corresponding to the text using a second neural network of the vocoder based on the high-frequency-domain features and the high time-domain signal.
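A compact sketch of this inference path is shown below; it assumes the two trained sub-networks are exposed as plain callables, uses scipy's polyphase resampler for the up-sampling step, and the factor of 2 (e.g. 8 kHz to 16 kHz) is only an example.

```python
# Schematic inference pipeline: mel features -> low waveform -> up-sampled -> speech.
from scipy.signal import resample_poly

def synthesize(mel_low, mel_high, first_net, second_net, up=2, down=1):
    low_wave = first_net(mel_low)                  # low time-domain signal
    high_wave = resample_poly(low_wave, up, down)  # high time-domain signal (up-sampled)
    return second_net(mel_high, high_wave)         # synthesized signal for the text
```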
Alternatively, the processing module 802 may perform the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low time-domain signal based on the magnitude spectrum corresponding to the low-sampling-rate mel-spectrum feature; obtaining a first embedding vector based on the low-frequency-domain feature through the operation of a first encoder of the first neural network; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time instant. Here, the sampling point error of the first decoder is used during the training phase of the vocoder; therefore, during speech processing it may be set to zero, although the present disclosure is not limited thereto.
Alternatively, the processing module 802 may perform the following for each sampling point of the final synthesized signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the magnitude spectrum corresponding to the high-sampling-rate mel-spectrum feature; obtaining a second embedding vector based on the high-frequency-domain feature through the operation of a second encoder of the second neural network; and obtaining the current sampling point of the synthesized signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error of the high time-domain signal at the previous time instant, the sampling point error of the second decoder at the previous time instant, and the sampling point output by the second decoder at the previous time instant. Here, the sampling point errors may likewise be set to zero, and the present disclosure is not limited thereto.
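For completeness, the same assumed decoder step at inference time, with the error feedback terms simply set to zero as noted above (encoder, lpc_predict, and decoder_step remain hypothetical placeholders):

```python
# Inference-time variant of the second-stage loop: error inputs are zeroed.
import numpy as np

def generate_second_stage_inference(mel_high, high_wave, lpc_predict, encoder, decoder_step):
    cond = encoder(mel_high)
    synth = np.zeros(len(high_wave), dtype=np.float32)
    prev_out = 0.0
    for t in range(len(high_wave)):
        estimate = lpc_predict(synth, t)
        out = decoder_step(cond, high_wave[t], estimate, 0.0, 0.0, prev_out)
        synth[t] = out
        prev_out = out
    return synth
```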
Fig. 9 is a schematic structural diagram of a speech processing device in a hardware operating environment according to an embodiment of the present disclosure.
As shown in fig. 9, the speech processing device 900 may include: a processing component 901, a communication bus 902, a network interface 903, an input/output interface 904, a memory 905, and a power component 906. The communication bus 902 is used to enable connection and communication between these components. The input/output interface 904 may include a video display (such as a liquid crystal display), a microphone and speakers, and a user interaction interface (such as a keyboard, mouse, or touch input device); optionally, the input/output interface 904 may also include a standard wired interface and a wireless interface. The network interface 903 may optionally include a standard wired interface and a wireless interface (e.g., a wireless fidelity interface). The memory 905 may be a high-speed random access memory or a stable non-volatile memory. The memory 905 may optionally be a storage device separate from the processing component 901.
Those skilled in the art will appreciate that the architecture shown in fig. 9 does not limit the speech processing device 900, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in fig. 9, the memory 905, which is one type of storage medium, may include an operating system (such as a MAC operating system), a data storage module, a network communication module, a user interface module, a speech processing program, a model training program, and a database.
In the speech processing device 900 shown in fig. 9, the network interface 903 is mainly used for data communication with an external electronic device/terminal; the input/output interface 904 is mainly used for data interaction with a user; and the processing component 901 and the memory 905 may be provided in the speech processing device 900, which executes the speech processing method or the vocoder training method provided by the embodiments of the present disclosure by having the processing component 901 call the speech processing program or the model training program stored in the memory 905, together with the various APIs provided by the operating system.
The processing component 901 may include at least one processor, with the memory 905 having stored therein a set of computer-executable instructions that, when executed by the at least one processor, perform a method of speech processing or a method of vocoder training in accordance with embodiments of the present disclosure. Further, the processing component 901 may perform encoding operations and decoding operations, and the like. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
The processing component 901 may be used to train the vocoder of the present disclosure. For example, the processing component 901 may obtain a sample set from the outside, where the sample set includes a high-sampling-rate time-domain signal, a low-sampling-rate time-domain signal, a low-frequency-domain feature, and a high-frequency-domain feature, the low-sampling-rate time-domain signal being obtained by down-sampling the high-sampling-rate time-domain signal, the low-frequency-domain feature being a mel-spectrum feature of the low-sampling-rate time-domain signal, and the high-frequency-domain feature being a mel-spectrum feature of the high-sampling-rate time-domain signal. The processing component 901 may then obtain a low time-domain signal using the first neural network of the vocoder based on the low-frequency-domain feature, obtain a high time-domain signal by up-sampling the low time-domain signal, obtain a synthesized signal using the second neural network of the vocoder based on the high-frequency-domain feature and the high time-domain signal, construct a loss function using the low time-domain signal, the low-sampling-rate time-domain signal, the synthesized signal, and the high-sampling-rate time-domain signal, and train the parameters of the vocoder by minimizing the loss calculated from the loss function.
As another example, the processing component 901 may serve as the vocoder of the present disclosure to convert text into a speech signal. For example, the processing component 901 may obtain, from the outside, low-frequency-domain features and high-frequency-domain features predicted from the text, where the low-frequency-domain features are low-sampling-rate mel-spectrum features corresponding to the text and the high-frequency-domain features are high-sampling-rate mel-spectrum features corresponding to the text; obtain a low time-domain signal using the first neural network of the vocoder based on the low-frequency-domain features; obtain a high time-domain signal by up-sampling the low time-domain signal; and obtain a synthesized signal corresponding to the text using the second neural network of the vocoder based on the high-frequency-domain features and the high time-domain signal. Deep neural network implementations other than LPCNet may also be employed for the first and second neural networks.
The processing component 901 may control the components included in the speech processing device 900 by executing programs. The input/output interface 904 may output the final synthesized speech signal.
The speech processing device 900 may receive or output video and/or audio via the input-output interface 904. For example, the speech processing apparatus 900 may output the synthesized speech signal via the input-output interface 904.
By way of example, the speech processing device 900 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The speech processing device 900 need not be a single electronic device, but may be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or jointly. The speech processing device 900 may also be part of an integrated control system or system manager, or may be a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the speech processing device 900, the processing component 901 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example and not limitation, processing component 901 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processing component 901 may execute instructions or code stored in the memory 905, which may also store data. Instructions and data may also be sent and received over a network via the network interface 903, which may employ any known transmission protocol.
The memory 905 may be integrated with the processing component 901, for example, by arranging RAM or flash memory within an integrated circuit microprocessor or the like. Furthermore, the memory 905 may comprise a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 905 and the processing component 901 may be operatively coupled, or may communicate with each other, e.g., through I/O ports or network connections, so that the processing component 901 can read data stored in the memory 905.
According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 10 is a block diagram of an electronic device according to an embodiment of the disclosure, the electronic device 1000 may include at least one memory 1002 and at least one processor 1001, the at least one memory 1002 storing a set of computer-executable instructions, the set of computer-executable instructions, when executed by the at least one processor 1001, performing a method of speech processing or a method of vocoder training according to an embodiment of the disclosure.
The processor 1001 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 1001 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, or the like.
The memory 1002, which is one type of storage medium, may include an operating system (e.g., a MAC operating system), a data storage module, a network communication module, a user interface module, a speech processing program, a model training program, and a database.
The memory 1002 may be integrated with the processor 1001, for example, RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. Further, memory 1002 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 1002 and the processor 1001 may be operatively coupled, or may communicate with each other, such as through I/O ports, network connections, etc., so that the processor 1001 can read files stored in the memory 1002.
In addition, the electronic device 1000 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 1000 may be connected to each other via a bus and/or a network.
Those skilled in the art will appreciate that the configuration shown in fig. 10 is not limiting, and that the electronic device 1000 may include more or fewer components than shown, combine certain components, or arrange the components differently.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the speech processing method or the model training method according to the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system so that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there may also be provided a computer program product, in which instructions are executable by a processor of a computer device to perform the above-mentioned speech processing method or model training method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A speech processing method, characterized in that the speech processing method comprises:
downsampling the high sampling rate Mel spectral features to obtain low sampling rate Mel spectral features;
obtaining a low time domain signal using a first neural network of a vocoder based on low sample rate mel-frequency spectral features;
obtaining a high time domain signal by up-sampling the low time domain signal;
obtaining, using a second neural network of the vocoder, a speech signal corresponding to the high sampling rate Mel spectral features based on the high sampling rate Mel spectral features and the high time domain signal.
2. The speech processing method of claim 1 wherein the step of obtaining the low time domain signal using the first neural network of the vocoder based on the low sample rate mel spectral features comprises:
performing the following for each sample point of the low time domain signal:
calculating a first estimation value of a current sampling point of the low time domain signal based on an amplitude spectrum corresponding to the low sampling rate Mel spectrum characteristic;
obtaining a first embedded vector via operation of a first encoder of a first neural network based on low-sample-rate mel-frequency spectral features;
and obtaining a current sampling point of the low time domain signal through an operation of a first decoder of the first neural network based on the first embedding vector, the first estimation value, and a sampling point error of the first decoder at a previous time instant.
3. The speech processing method of claim 1 wherein the step of utilizing a second neural network of the vocoder to obtain the speech signal corresponding to the high-sample-rate mel spectral features based on the high-sample-rate mel spectral features and the high-time-domain signal comprises:
performing the following for each sample point of the speech signal:
calculating a second estimation value of the current sampling point of the voice signal based on the amplitude spectrum corresponding to the high sampling rate Mel spectrum characteristic;
obtaining a second embedding vector via operation of a second encoder of a second neural network based on the high-sampling-rate mel-frequency spectral features;
and obtaining the current sampling point of the speech signal through an operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time domain signal, the second estimation value, a sampling point error of the high time domain signal at a previous time instant, a sampling point error of the second decoder at the previous time instant, and a sampling point output by the second decoder at the previous time instant.
4. The speech processing method of claim 1, wherein the high sampling rate Mel spectral features are obtained by performing Mel spectrum prediction on input text.
5. A method of training a vocoder, the method comprising:
acquiring a sample set, wherein the sample set comprises a high-sampling-rate time domain signal, a low-sampling-rate time domain signal, a low-frequency domain feature, and a high-frequency domain feature, wherein the low-sampling-rate time domain signal is obtained by down-sampling the high-sampling-rate time domain signal, the low-frequency domain feature is a mel-frequency spectrum feature of the low-sampling-rate time domain signal, and the high-frequency domain feature is a mel-frequency spectrum feature of the high-sampling-rate time domain signal;
obtaining a low time domain signal using a first neural network of a vocoder based on low frequency domain features;
obtaining a high time domain signal by up-sampling the low time domain signal;
obtaining a synthesized signal using a second neural network of the vocoder based on the high frequency domain features and the high time domain signal;
constructing a loss function using the low time domain signal, the low-sampling-rate time domain signal, the synthesized signal, and the high-sampling-rate time domain signal;
training parameters of the vocoder based on a loss calculated by the loss function.
6. Training method according to claim 5, wherein the step of constructing a loss function using the low time-domain signal, the low sample rate time-domain signal, the synthesis signal and the high sample rate time-domain signal comprises:
constructing a first cross entropy loss function using the low time domain signal and the low-sampling-rate time domain signal;
constructing a second cross entropy loss function using the synthesized signal and the high-sampling-rate time domain signal; and
constructing the loss function from the first cross entropy loss function and the second cross entropy loss function.
7. A speech processing apparatus, characterized in that the speech processing apparatus comprises:
an acquisition module configured to downsample the high sampling rate mel-frequency spectrum features to acquire low sampling rate mel-frequency spectrum features; and
a processing module configured to:
obtaining a low time domain signal using a first neural network of a vocoder based on low sample rate mel-frequency spectral features;
obtaining a high time domain signal by up-sampling the low time domain signal;
obtaining, using a second neural network of the vocoder, a speech signal corresponding to the high sampling rate Mel spectral features based on the high sampling rate Mel spectral features and the high time domain signal.
8. An apparatus for training a vocoder, the apparatus comprising:
an obtaining module configured to obtain a sample set, wherein the sample set includes a high-sampling-rate time-domain signal, a low-sampling-rate time-domain signal, a low-frequency domain feature, and a high-frequency domain feature, wherein the low-sampling-rate time-domain signal is obtained by down-sampling the high-sampling-rate time-domain signal, the low-frequency domain feature is a mel-frequency spectrum feature of the low-sampling-rate time-domain signal, and the high-frequency domain feature is a mel-frequency spectrum feature of the high-sampling-rate time-domain signal; and
a training module configured to:
obtaining a low time domain signal using a first neural network of a vocoder based on low frequency domain features;
obtaining a high time domain signal by up-sampling the low time domain signal;
obtaining a synthesized signal using a second neural network of the vocoder based on the high frequency domain features and the high time domain signal;
constructing a loss function using the low time domain signal, the low-sampling-rate time-domain signal, the synthesized signal, and the high-sampling-rate time-domain signal;
training parameters of the vocoder based on a loss calculated by the loss function.
9. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the speech processing method of any one of claims 1 to 4 or the training method of any one of claims 5 to 6.
10. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a speech processing method as claimed in any one of claims 1 to 4 or a training method as claimed in any one of claims 5 to 6.
CN202110794822.1A 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder Active CN113470616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110794822.1A CN113470616B (en) 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110794822.1A CN113470616B (en) 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder

Publications (2)

Publication Number Publication Date
CN113470616A true CN113470616A (en) 2021-10-01
CN113470616B CN113470616B (en) 2024-02-23

Family

ID=77880157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110794822.1A Active CN113470616B (en) 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder

Country Status (1)

Country Link
CN (1) CN113470616B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07334194A (en) * 1994-06-14 1995-12-22 Matsushita Electric Ind Co Ltd Method and device for encoding/decoding voice
KR20020084765A (en) * 2001-05-03 2002-11-11 (주)디지텍 Method for synthesizing voice
WO2020191271A1 (en) * 2019-03-20 2020-09-24 Research Foundation Of The City University Of New York Method for extracting speech from degraded signals by predicting the inputs to a speech vocoder
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium
US20210193113A1 (en) * 2019-12-23 2021-06-24 Ubtech Robotics Corp Ltd Speech synthesis method and apparatus and computer readable storage medium using the same
CN111316352A (en) * 2019-12-24 2020-06-19 深圳市优必选科技股份有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN111627418A (en) * 2020-05-27 2020-09-04 携程计算机技术(上海)有限公司 Training method, synthesizing method, system, device and medium for speech synthesis model
CN112599141A (en) * 2020-11-26 2021-04-02 北京百度网讯科技有限公司 Neural network vocoder training method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Xiao; ZHANG Wei; WANG Wenhao; WAN Yongjing: "Voice conversion algorithm based on multi-spectral-feature generative adversarial networks", Computer Engineering and Science, no. 05 *
ZHANG Yizhi: "Research on Speech Synthesis Algorithms Based on Deep Neural Networks", China Master's Theses Full-text Database, Information Science and Technology Series, no. 05 *

Also Published As

Publication number Publication date
CN113470616B (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN112289333B (en) Training method and device of voice enhancement model and voice enhancement method and device
Kuleshov et al. Audio super resolution using neural networks
KR101378696B1 (en) Determining an upperband signal from a narrowband signal
TW202111692A (en) Artificial intelligence based audio coding
CN109147805B (en) Audio tone enhancement based on deep learning
CN112927707A (en) Training method and device of voice enhancement model and voice enhancement method and device
RU2677453C2 (en) Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
US20230008547A1 (en) Audio frame loss concealment
CN112712812A (en) Audio signal generation method, device, equipment and storage medium
JPH10307599A (en) Waveform interpolating voice coding using spline
CN110556121B (en) Band expansion method, device, electronic equipment and computer readable storage medium
US20210366461A1 (en) Generating speech signals using both neural network-based vocoding and generative adversarial training
JP2013057735A (en) Hidden markov model learning device for voice synthesis and voice synthesizer
JP2019045856A (en) Audio data learning device, audio data inference device, and program
Hao et al. Time-domain neural network approach for speech bandwidth extension
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus
KR20200123395A (en) Method and apparatus for processing audio data
JP2009223210A (en) Signal band spreading device and signal band spreading method
CN112309425B (en) Sound tone changing method, electronic equipment and computer readable storage medium
CN113362837A (en) Audio signal processing method, device and storage medium
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
CN116705056A (en) Audio generation method, vocoder, electronic device and storage medium
US20190066657A1 (en) Audio data learning method, audio data inference method and recording medium
CN113470616B (en) Speech processing method and device, vocoder and training method of vocoder
JPH09127985A (en) Signal coding method and device therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant