CN113470616B - Speech processing method and device, vocoder and training method of vocoder


Info

Publication number
CN113470616B
Authority
CN
China
Prior art keywords: domain signal, low, signal, time domain, mel
Legal status: Active
Application number
CN202110794822.1A
Other languages
Chinese (zh)
Other versions
CN113470616A (en)
Inventor
张旭
张新
李楠
郑羲光
张晨
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110794822.1A
Publication of CN113470616A
Application granted
Publication of CN113470616B


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The present disclosure provides a speech processing method and apparatus, a vocoder, and a method of training the vocoder. The speech processing method may include: downsampling a high sampling rate mel-spectrum feature to obtain a low sampling rate mel-spectrum feature; obtaining a low time domain signal using a first neural network of the vocoder based on the low sampling rate mel-spectrum feature; obtaining a high time domain signal by upsampling the low time domain signal; and obtaining a speech signal corresponding to the high sampling rate mel-spectrum feature using a second neural network of the vocoder based on the high sampling rate mel-spectrum feature and the high time domain signal. The method and apparatus can synthesize a high sampling rate speech signal while keeping the operation complexity low.

Description

Speech processing method and device, vocoder and training method of vocoder
Technical Field
The present disclosure relates to the field of speech processing, and more particularly, to a speech processing method and apparatus for speech synthesis and a vocoder and a training method of the vocoder.
Background
Vocoders are widely used in deep-learning-based speech synthesis. An existing speech synthesis flow generally predicts the mel spectrum of the input text in the frequency domain and then converts the mel spectrum into time domain sampling points. The conversion from mel spectrum to sampling points is usually performed with the Griffin-Lim algorithm, but that algorithm can result in poor voice quality; conversion with a deep learning method yields higher quality. In general, the higher the effective sampling rate of the speech, the higher the quality of the synthesized speech and the better it sounds. However, synthesizing high sampling rate audio is usually accompanied by an increased number of network parameters, which raises the cost of running the network.
Disclosure of Invention
The present disclosure provides a voice processing method and a voice processing apparatus for voice synthesis and a vocoder and a training method of the vocoder to solve at least the above-mentioned problems.
According to a first aspect of embodiments of the present disclosure, there is provided a voice processing method, which may include the steps of: downsampling the high sampling rate mel-spectrum features to obtain low sampling rate mel-spectrum features; obtaining a low time domain signal using a first neural network of a vocoder based on the low sample rate mel-spectrum feature; obtaining a high time domain signal by upsampling a low time domain signal; a second neural network of the vocoder is utilized to obtain a speech signal corresponding to the high sample rate Mel-spectral feature based on the high sample rate Mel-spectral feature and the high time domain signal.
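Purely as an illustrative, non-limiting sketch (not the claimed implementation), the four steps above can be arranged as a small pipeline; net_a, net_b, downsample_mel and upsample_signal are hypothetical placeholders for the first neural network, the second neural network and the resampling operations, and the default rates are assumptions.

def synthesize(mel_high, net_a, net_b, downsample_mel, upsample_signal,
               low_rate=16000, high_rate=32000):
    # Step 1: downsample the high sampling rate mel-spectrum feature.
    mel_low = downsample_mel(mel_high)
    # Step 2: the first neural network produces the low time domain signal.
    sig_low = net_a(mel_low)
    # Step 3: upsample the low time domain signal to a (pseudo) high time domain signal.
    sig_high = upsample_signal(sig_low, low_rate, high_rate)
    # Step 4: the second neural network combines the high sampling rate mel-spectrum
    # feature and the high time domain signal into the final speech signal.
    return net_b(mel_high, sig_high)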
Optionally, the step of obtaining the low time domain signal using the first neural network of the vocoder based on the low sample rate mel-spectrum feature may comprise: the following is performed for each sample point of the low time domain signal: calculating a first estimated value of a current sampling point of the low time domain signal based on an amplitude spectrum corresponding to the low sampling rate mel spectrum feature; obtaining a first embedded vector via operation of a first encoder of a first neural network based on low sample rate mel-spectrum features; the current sampling point of the low time domain signal is obtained via an operation of the first decoder based on the first embedded vector, the first estimate, and a sampling point error at a previous time instant for the first decoder of the first neural network.
Optionally, the step of obtaining a speech signal corresponding to the high sample rate mel-spectrum feature using the second neural network of the vocoder based on the high sample rate mel-spectrum feature and the high time domain signal may comprise: the following is performed for each sample point of the speech signal: calculating a second estimated value of a current sampling point of the voice signal based on an amplitude spectrum corresponding to the high sampling rate mel-spectrum feature; obtaining a second embedded vector via operation of a second encoder of a second neural network based on the high sample rate mel-spectrum feature; the current sampling point of the speech signal is obtained via an operation of the second decoder based on the second embedded vector, the current sampling point of the high-time-domain signal, the second estimated value, the sampling point error for the previous time instant of the high-time-domain signal, the sampling point error for the previous time instant of the second decoder of the second neural network, and the sampling point output by the second decoder at the previous time instant.
Optionally, the high sampling rate mel-spectrum feature may be obtained by performing mel-spectrum prediction on the input text.
According to a second aspect of embodiments of the present disclosure, there is provided a method of training a vocoder, the method of training may include the steps of: obtaining a sample set, wherein the sample set comprises a high sampling rate time domain signal, a low frequency domain feature and a high frequency domain feature, wherein the low sampling rate time domain signal is obtained by downsampling the high sampling rate time domain signal, the low frequency domain feature is a mel spectrum feature of the low sampling rate time domain signal, and the high frequency domain feature is a mel spectrum feature of the high sampling rate time domain signal; obtaining a low time domain signal using a first neural network of a vocoder based on the low frequency domain features; obtaining a high time domain signal by upsampling a low time domain signal; obtaining a synthesized signal using a second neural network of the vocoder based on the high frequency domain features and the high time domain signals; constructing a loss function using a low-sample-rate time-domain signal, the composite signal, and the high-sample-rate time-domain signal; parameters of the vocoder are trained based on the loss calculated by the loss function.
Optionally, the step of obtaining the low time domain signal using the first neural network of the vocoder based on the low frequency domain characteristics may include: the following is performed for each sample point of the low time domain signal: calculating a first estimated value of a current sampling point of the low-time-domain signal based on an amplitude spectrum of the low-sampling-rate time-domain signal; obtaining a first embedded vector based on the low frequency domain features via operation of a first encoder of a first neural network; the current sampling point of the low time domain signal is obtained via an operation of the first decoder based on the first embedded vector, the first estimate, and a sampling point error at a previous time instant for the first decoder of the first neural network.
Optionally, the step of obtaining the synthesized signal using the second neural network of the vocoder based on the high frequency domain features and the high time domain signals may include: the following is performed for each sample point of the composite signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum of the high sampling rate time domain signal; obtaining a second embedded vector based on the high frequency domain features via operation of a second encoder of a second neural network; the current sampling point of the synthesized signal is obtained via an operation of the second decoder based on the second embedded vector, the current sampling point of the high-time-domain signal, the second estimated value, the sampling point error for the previous time of the high-time-domain signal, the sampling point error for the previous time of the second decoder of the second neural network, and the sampling point output by the second decoder at the previous time.
Optionally, the step of constructing the loss function using the low-sample-rate time-domain signal, the composite signal, and the high-sample-rate time-domain signal may comprise: constructing a first cross entropy loss function using a low time domain signal and the low sample rate time domain signal; constructing a second cross entropy loss function using the composite signal and the high sample rate time domain signal; the loss function is formed by a first cross entropy loss function and a second cross entropy loss function.
According to a third aspect of embodiments of the present disclosure, there is provided a voice processing apparatus, which may include: an acquisition module configured to downsample the high-sampling-rate mel-spectrum features to acquire low-sampling-rate mel-spectrum features; and a processing module configured to: obtaining a low time domain signal using a first neural network of a vocoder based on the low sample rate mel-spectrum feature; obtaining a high time domain signal by upsampling a low time domain signal; a second neural network of the vocoder is utilized to obtain a speech signal corresponding to the high sample rate Mel-spectral feature based on the high sample rate Mel-spectral feature and the high time domain signal.
Optionally, the processing module may be configured to, for each sampling point of the low time domain signal, perform the following: calculating a first estimated value of a current sampling point of the low time domain signal based on an amplitude spectrum corresponding to the low sampling rate mel spectrum feature; obtaining a first embedded vector via operation of a first encoder of a first neural network based on low sample rate mel-spectrum features; the current sampling point of the low time domain signal is obtained via an operation of the first decoder based on the first embedded vector, the first estimate, and a sampling point error at a previous time instant for the first decoder of the first neural network.
Optionally, the processing module is configured to, for each sampling point of the speech signal: calculating a second estimated value of a current sampling point of the voice signal based on an amplitude spectrum corresponding to the high sampling rate mel-spectrum feature; obtaining a second embedded vector via operation of a second encoder of a second neural network based on the high sample rate mel-spectrum feature; the current sampling point of the speech signal is obtained via an operation of the second decoder based on the second embedded vector, the current sampling point of the high-time-domain signal, the second estimated value, the sampling point error for the previous time instant of the high-time-domain signal, the sampling point error for the previous time instant of the second decoder of the second neural network, and the sampling point output by the second decoder at the previous time instant.
Optionally, the high sampling rate mel-spectrum feature may be obtained by performing mel-spectrum prediction on the input text.
According to a fourth aspect of embodiments of the present disclosure, there is provided a training device of a vocoder, the training device may include: an acquisition module configured to acquire a sample set, wherein the sample set includes a high-sampling-rate time-domain signal, a low-frequency-domain feature, and a high-frequency-domain feature, wherein the low-sampling-rate time-domain signal is obtained by downsampling the high-sampling-rate time-domain signal, the low-frequency-domain feature is a mel-spectrum feature of the low-sampling-rate time-domain signal, and the high-frequency-domain feature is a mel-spectrum feature of the high-sampling-rate time-domain signal; and a training module configured to: obtaining a low time domain signal using a first neural network of a vocoder based on the low frequency domain features; obtaining a high time domain signal by upsampling a low time domain signal; obtaining a synthesized signal using a second neural network of the vocoder based on the high frequency domain features and the high time domain signals; constructing a loss function using a low-sample-rate time-domain signal, the composite signal, and the high-sample-rate time-domain signal; parameters of the vocoder are trained based on the loss calculated by the loss function.
Optionally, the training module may be configured to perform the following for each sampling point of the low time domain signal: calculating a first estimated value of a current sampling point of the low-time-domain signal based on an amplitude spectrum of the low-sampling-rate time-domain signal; obtaining a first embedded vector based on the low frequency domain features via operation of a first encoder of a first neural network; the current sampling point of the low time domain signal is obtained via an operation of the first decoder based on the first embedded vector, the first estimate, and a sampling point error at a previous time instant for the first decoder of the first neural network.
Alternatively, the training module may be configured to: the following is performed for each sample point of the composite signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum of the high sampling rate time domain signal; obtaining a second embedded vector based on the high frequency domain features via operation of a second encoder of a second neural network; the current sampling point of the synthesized signal is obtained via an operation of the second decoder based on the second embedded vector, the current sampling point of the high-time-domain signal, the second estimated value, the sampling point error for the previous time of the high-time-domain signal, the sampling point error for the previous time of the second decoder of the second neural network, and the sampling point output by the second decoder at the previous time.
Alternatively, the training module may be configured to: constructing a first cross entropy loss function using a low time domain signal and the low sample rate time domain signal; constructing a second cross entropy loss function using the composite signal and the high sample rate time domain signal; the loss function is formed by a first cross entropy loss function and a second cross entropy loss function.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, which may include: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the speech processing method and training method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the speech processing method and training method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, instructions in which are executed by at least one processor in an electronic device to perform the speech processing method and training method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
The vocoder of the present disclosure is realized by adding a network with a smaller number of parameters to the original LPCNet, so that it can synthesize high sampling rate speech signals while keeping the operation complexity low.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is a diagram of a conventional LPCNet;
fig. 2 is a flow diagram for training a vocoder according to an embodiment of the present disclosure;
fig. 3 is a flow chart of a method of training a vocoder according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a vocoder according to an embodiment of the present disclosure;
FIG. 5 is a flow diagram for resampling according to an embodiment of the disclosure;
FIG. 6 is a flow chart of a speech processing method according to an embodiment of the present disclosure;
Fig. 7 is a block diagram of a training device of a vocoder according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic structural view of a voice processing apparatus according to an embodiment of the present disclosure;
fig. 10 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Throughout the drawings, it should be noted that the same reference numerals are used to designate the same or similar elements, features and structures.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure defined by the claims and their equivalents. Various specific details are included to aid understanding, but are merely to be considered exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to written meanings, but are used only by the inventors to achieve a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following descriptions of the various embodiments of the present disclosure are provided for illustration only and not for the purpose of limiting the disclosure as defined by the claims and their equivalents.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Fig. 1 is a diagram of a conventional LPCNet.
Referring to fig. 1, the conventional LPCNet implements the function of a vocoder with an encoder and a decoder. The input of the encoder part (101 in fig. 1) is a frame of frequency domain features of the speech (features in fig. 1), and its output is an embedded vector provided to the decoder part (102 in fig. 1). The linear prediction coefficient (LPC) module calculates the LPC coefficients from the features and predicts/calculates an estimate of the current sampling point. The decoder part 102 receives the output of the encoder part 101 and performs a serial (concatenation) operation on it together with the estimated value p_t of the current sampling point calculated by the LPC module, the sampling point s_{t-1} output by the decoder part 102 at the previous time instant, and the error e_{t-1} between the output of the decoder part at the previous time instant and the true sampling point, and then outputs the sampling point s_t at the current time instant.
In fig. 1, the encoder part 101 includes two convolutional layers (conv 1×3) and two fully-connected layers (FC). The encoder part 101 may output the embedded vector by performing two convolution operations, a summation operation, and two fully-connected layer operations on features.
The decoder part 102 includes two gated recurrent units (GRU_A and GRU_B), a dual fully-connected layer (dual FC) and a normalization layer (softmax). The decoder part 102 performs a concatenation operation on the output of the encoder, p_t, s_{t-1} and e_{t-1}, followed by the two GRU operations, the dual FC operation, the softmax operation, a sampling operation and a summation operation, and outputs the current sampling point s_t of the LPCNet.
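The structure described above can be illustrated with a rough PyTorch sketch. This is only an assumption-laden illustration: the layer sizes are made up, the dual FC is approximated by two stacked linear layers, and LPCNetDecoderSketch is a hypothetical name, not code from the patent or from any LPCNet release.

import torch
import torch.nn as nn

class LPCNetDecoderSketch(nn.Module):
    # Rough sketch of the decoder in fig. 1: concat -> GRU_A -> GRU_B -> dual FC -> softmax -> sample.
    def __init__(self, cond_dim=128, hidden_a=384, hidden_b=16, n_classes=256):
        super().__init__()
        self.gru_a = nn.GRU(cond_dim + 3, hidden_a, batch_first=True)
        self.gru_b = nn.GRU(hidden_a, hidden_b, batch_first=True)
        self.fc1 = nn.Linear(hidden_b, n_classes)   # stand-in for the dual FC layer
        self.fc2 = nn.Linear(n_classes, n_classes)

    def forward(self, cond, p_t, s_prev, e_prev, state_a=None, state_b=None):
        # Concatenate the encoder output with p_t, s_{t-1} and e_{t-1}.
        x = torch.cat([cond, p_t, s_prev, e_prev], dim=-1).unsqueeze(1)
        h_a, state_a = self.gru_a(x, state_a)
        h_b, state_b = self.gru_b(h_a, state_b)
        logits = self.fc2(torch.relu(self.fc1(h_b)))
        probs = torch.softmax(logits, dim=-1)            # distribution over quantized samples
        s_t = torch.multinomial(probs.squeeze(1), 1)     # sampling operation
        return s_t, state_a, state_b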
The LPCNet outputs sampling points in an autoregressive manner: only one time domain sampling point can be output per run, so the amount of computation grows proportionally when high sampling rate speech is to be output. For example, at a sampling rate of 16 kHz the LPCNet needs to run the decoder 16000 times for each second of speech, and after the sampling rate is raised to 32 kHz it needs to run the decoder 32000 times; since the decoder contains two GRU layers, the amount of computation is large.
In order to solve the above problems, the present disclosure proposes a two-stage (cascaded) deep neural network algorithm for synthesizing high sampling rate audio. According to an embodiment of the present disclosure, a neural network (hereinafter referred to as network B) is added to the existing LPCNet (hereinafter referred to as network A); network A is used to process the low sampling rate frequency domain features, and network B is used to process the high sampling rate frequency domain features. The low sampling rate signal output by network A is upsampled to a pseudo high sampling rate signal, and the high sampling rate frequency domain feature and the pseudo high sampling rate signal are combined in parallel as the input of the newly added network B, so that network B outputs a high sampling rate synthesized signal. In this way, the network complexity is reduced while the quality of the synthesized audio is guaranteed.
Hereinafter, according to various embodiments of the present disclosure, the method, apparatus, and system of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 2 is a flow diagram for training a vocoder according to an embodiment of the present disclosure. According to embodiments of the present disclosure, a vocoder may be comprised of network a and network B.
Referring to fig. 2, a sample set is first obtained before training the vocoder. The high sampling rate target time domain signal (i.e., the high sampling rate time domain signal) is downsampled to obtain a low sampling rate signal (i.e., the low sampling rate time domain signal), which is subjected to a short-time Fourier transform (STFT) and passed through a Mel filter bank (Mel Bank) to obtain the low sampling rate Mel spectrum feature (i.e., the low frequency domain feature) used as the input feature of network A. Here network A may comprise a first encoder, a first decoder and an LPC module. The LPC module may be used to calculate the LPC coefficients as well as the estimate of the current sampling point. The first encoder performs an encoding operation on the Mel spectrum feature to obtain a first embedded vector, and the first decoder performs a decoding operation on the first embedded vector to obtain a synthesized low frequency signal (i.e., the low time domain signal). The synthesized low frequency signal is upsampled to obtain a synthesized high frequency signal (i.e., the high time domain signal) as an input of network B. Here, the high time domain signal may be regarded as a pseudo high frequency signal.
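A minimal sketch of this sample-set preparation is shown below, assuming scipy and librosa are used for the downsampling and Mel feature extraction; the FFT size, hop length and number of Mel bands are illustrative assumptions, not values taken from the patent.

import numpy as np
import librosa
from scipy.signal import resample_poly

def build_training_pair(x_high, high_rate=32000, low_rate=16000,
                        n_fft=1024, hop=256, n_mels=80):
    # Downsample the high sampling rate target signal to obtain the low sampling rate signal.
    x_low = resample_poly(x_high, low_rate, high_rate)
    # STFT followed by a Mel filter bank yields the low and high frequency domain features.
    mel_low = librosa.feature.melspectrogram(y=x_low, sr=low_rate,
                                             n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    mel_high = librosa.feature.melspectrogram(y=x_high, sr=high_rate,
                                              n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    return x_high, x_low, np.log(mel_low + 1e-6), np.log(mel_high + 1e-6)

The log compression of the Mel energies follows common practice; the patent itself specifies the filter-bank log energy of equation (4) below.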
In addition, the target time domain signal with high sampling rate is subjected to short-time Fourier transform STFT and then passes through a Mel filter bank to obtain Mel spectrum characteristics (namely high frequency domain characteristics) with high sampling rate, and the Mel spectrum characteristics are used as the input of the network B.
According to an embodiment of the present disclosure, network B may include an LPC module, a second encoder, and a second decoder. The LPC module may be used to calculate LPC coefficients as well as an estimate of the current sample point. The second encoder may encode the mel-spectrum features at a high sampling rate to obtain a second embedded vector. The second decoder performs a decoding operation using the second embedded vector and the pseudo-synthesized high frequency signal to obtain a final high frequency signal. Since network B has enough information to synthesize a high sample rate speech signal, the required network parameter size can be reduced.
The second encoder of network B according to embodiments of the present disclosure may have one fewer convolutional layer and one fewer fully-connected layer than network A, and the second decoder may use only one GRU. Compared with directly synthesizing the high sampling rate speech signal with network A, the vocoder has a reduced network structure and lower operation complexity. This is because network B has more input information than network A, so a complex network structure is no longer required, and the high sampling rate signal can be synthesized with a smaller number of parameters. The training process of the vocoder of the present disclosure will be described in detail with reference to fig. 3.
Fig. 3 is a flow chart of a method of training a vocoder according to an embodiment of the present disclosure. The vocoder according to the present disclosure has a better effect in synthesizing high sample rate audio.
Referring to fig. 3, in step S301, a training sample set is acquired. The training sample set for training the vocoder may include a high sampling rate time domain signal, a low sampling rate time domain signal obtained by downsampling the high sampling rate time domain signal, a low frequency domain feature, and a high frequency domain feature, wherein the low frequency domain feature is a mel-spectrum feature of the low sampling rate time domain signal, and the high frequency domain feature is a mel-spectrum feature of the high sampling rate time domain signal.
As an example, the low sampling rate time domain signal and the high sampling rate time domain signal may be time-frequency converted to obtain a low frequency domain signal and a high frequency domain signal, respectively. The low frequency domain features and the high frequency domain features are obtained by applying mel filters to the energy spectra of the low frequency domain signals and the high frequency domain signals.
For example, a high sampling rate time domain signal x_h(t) of length T is taken as a training sample, where t denotes time and 0 < t ≤ T. The signal x_h(t) is first downsampled to obtain a low sampling rate time domain signal x_l(t); then a short-time Fourier transform (STFT) is applied to x_h(t) and x_l(t) separately, and the amplitude spectra Mag_h and Mag_l of the corresponding frequency domain signals are obtained by the following equations (1) and (2):

Mag_h = abs(STFT(x_h))    (1)

Mag_l = abs(STFT(x_l))    (2)
The energy spectra are taken from the amplitude spectra Mag_h and Mag_l and then passed through the Mel filter bank H_m(k) to obtain the Mel spectra Mel_h and Mel_l. The Mel filter bank H_m(k) is a set of non-linearly distributed triangular filters with center frequencies f(m), m = 1, 2, ..., M, where M is the number of filters; its frequency response is defined as equation (3):

H_m(k) = 0,                               k < f(m-1)
         (k - f(m-1)) / (f(m) - f(m-1)),  f(m-1) ≤ k ≤ f(m)
         (f(m+1) - k) / (f(m+1) - f(m)),  f(m) ≤ k ≤ f(m+1)
         0,                               k > f(m+1)    (3)

The log energy Q(m) output by each filter of the Mel filter bank H_m(k) is calculated according to the following equation (4):

Q(m) = ln( Σ_k |X(k)|² · H_m(k) ),  1 ≤ m ≤ M    (4)

where k is the frequency bin index and |X(k)| is the amplitude spectrum of the frequency domain signal of x_h(t) or x_l(t), respectively. Thus, the input features Mel_l of network A and Mel_h of network B can be obtained by the above calculation.
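A rough numpy sketch of equations (3) and (4) is given below. The Mel-scale spacing of the center frequencies f(m) and the HTK-style bin mapping are common-practice assumptions, not values taken from the patent.

import numpy as np

def mel_filter_bank(sr, n_fft, n_mels):
    # Build the triangular filters H_m(k) of equation (3).
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    f = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)  # bin index of f(m)
    H = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        for k in range(f[m - 1], f[m]):
            H[m - 1, k] = (k - f[m - 1]) / max(f[m] - f[m - 1], 1)
        for k in range(f[m], f[m + 1]):
            H[m - 1, k] = (f[m + 1] - k) / max(f[m + 1] - f[m], 1)
    return H

def log_energy(mag, H):
    # Equation (4): Q(m) = ln( sum_k |X(k)|^2 * H_m(k) ).
    return np.log(H @ (mag ** 2) + 1e-10)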
In step S302, a low time domain signal is obtained using the first neural network of the vocoder based on the low frequency domain features. Here, the first neural network may be the network A described above. The first neural network outputs one sampling point per run, thereby outputting the low time domain signal.
The following may be performed for each sample point of the low time domain signal: a first estimate of a current sample point of the low-time-domain signal is calculated based on an amplitude spectrum of the low-frequency-domain signal, a first embedded vector is obtained via operation of a first encoder of the first neural network based on the low-frequency-domain feature, and the current sample point of the low-time-domain signal is obtained via operation of the first decoder based on the first embedded vector, the first estimate, and a sample point error at a previous time instant for a first decoder of the first neural network.
As an example, the Mel spectrum Mel_l is input to the first encoder of network A. The first encoder may be implemented by two convolutional layers and two fully-connected layers, but is not limited thereto. The first encoder may output a first embedded vector v_A of a fixed dimension according to equation (5), for use by the first decoder of network A.

v_A = En(Mel_l)    (5)
Where En denotes the operational procedure of the encoder.
The low time domain signal may be obtained using a first decoder of network a based on the first embedded vector. Here, the low time domain signal is a synthesized low sample rate time domain signal.
For the first decoder, the inputs at each step are v_A, p_t and e_{t-1}, where p_t is the estimated value of the current sampling point predicted from the LPC coefficients, e_{t-1} is the difference between the sampling point s_{t-1} output by the first decoder at the previous time instant and the true sampling point, and s_{t-1} is the output of the first decoder at the previous time instant. Here, the true sampling point is taken from the low sampling rate time domain signal obtained by downsampling the high sampling rate time domain signal.
In case that the current sampling point is the first sampling point of the low time domain signal, the sampling point error of the first decoder at the previous time instant may be set to zero. However, the above examples are merely exemplary, and may be initialized differently according to design requirements.
The estimated value p_t of the current sampling point can be predicted according to the following equation (6):

p_t = Σ_{k=1}^{K} a_k · s_{t-k}    (6)

where K is the order of the LPC and a_k are the LPC coefficients of the corresponding order; these coefficients can be predicted from the amplitude spectrum Mag_l. For example, if the sampling rate of the high sampling rate time domain signal is 32 kHz and the sampling rate of the downsampled signal is 16 kHz, K is preferably 16. That is, for the first decoder, the LPC coefficients are calculated from the amplitude spectrum of the downsampled signal, and p_t is then calculated using the LPC coefficients and the previous outputs of the first decoder. The calculation of p_t may be implemented by the LPC module in network A.
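A minimal sketch of equation (6) is given below; it assumes the K LPC coefficients a_k are already available (from the LPC module) and that at least K previous sampling points exist.

import numpy as np

def lpc_predict(a, history):
    # Equation (6): p_t = sum_{k=1..K} a_k * s_{t-k}.
    # `a` holds the K LPC coefficients; history[-1] is s_{t-1}. Assumes len(history) >= K.
    K = len(a)
    past = np.asarray(history[-K:])[::-1]   # s_{t-1}, s_{t-2}, ..., s_{t-K}
    return float(np.dot(a, past))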
The operation of the first decoder can be described by the following equation (7):

ŝ_t = De_A(v_A, p_t, e_{t-1})    (7)

where ŝ_t is the synthesized low sampling rate time domain signal output by the first decoder and De_A denotes the operation of the first decoder; each run of the first decoder outputs the synthesized low sampling rate sampling point ŝ_t at time t.
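The autoregressive use of equations (5) to (7) can be sketched as follows; encoder_a, decoder_a and lpc_module are hypothetical callables standing in for the first encoder, the first decoder and the LPC module, and the error handling shown is only an assumption about how e_{t-1} would be fed at inference time.

def run_network_a(mel_low, encoder_a, decoder_a, lpc_module, n_samples):
    # Sketch of the autoregressive loop of network A (equations (5)-(7)).
    v_a = encoder_a(mel_low)               # equation (5): fixed-dimension embedded vector
    out, e_prev = [], 0.0
    for t in range(n_samples):
        p_t = lpc_module(out)              # equation (6): LPC estimate of the current sample
        s_t = decoder_a(v_a, p_t, e_prev)  # equation (7)
        # At inference there is no ground truth, so e_prev stays zero; during
        # training it is the difference to the true low sampling rate sample.
        out.append(s_t)
    return out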
In step S303, a high time domain signal is obtained by upsampling a low time domain signal.
As an example, the synthesized low sampling rate time domain signal ŝ may be resampled to a high sampling rate time domain signal s̃ according to the following equation (8), and s̃ is used as an additional input of the second decoder:

s̃ = Resample(ŝ, L, M)    (8)

where Resample denotes the resampling operation, L is the current low sampling rate, and M is the high sampling rate after resampling. Here, s̃ can be considered a pseudo high sampling rate signal. The resampling may be performed by a resampling module in the vocoder, or the resampling module may be included in network A; the present disclosure is not limited in this respect. Resampling is described in detail below.
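A one-line sketch of equation (8), assuming scipy's polyphase resampler is an acceptable stand-in for the Resample operation (it performs the anti-aliasing/anti-image filtering discussed later); the 16 kHz and 32 kHz defaults are illustrative.

from scipy.signal import resample_poly

def upsample_to_pseudo_high(sig_low, L=16000, M=32000):
    # Equation (8): resample from the low rate L to the high rate M.
    return resample_poly(sig_low, M, L)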
In step S304, a synthesized signal is obtained using a second neural network of the vocoder based on the high frequency domain features and the high time domain signals. The synthesized signal here is a speech signal to be finally output. According to embodiments of the present disclosure, the second neural network may be implemented by the network B described above.
The following operations may be performed for each sample point of the final composite signal: a second estimated value of a current sampling point of the final synthesized signal is calculated based on an amplitude spectrum of the high sampling rate time domain signal in the sample set, a second embedded vector is obtained based on the high frequency domain feature via an operation of a second encoder of the second neural network, and the current sampling point of the final synthesized signal is obtained based on the second embedded vector, the current sampling point of the high time domain signal, the second estimated value, a sampling point error for a previous time of the high time domain signal, a sampling point error for a second decoder of the second neural network at the previous time, and a sampling point output by the second decoder at the previous time via an operation of the second decoder.
As an example, the Mel spectrum Mel_h is input to the second encoder of network B. The second encoder according to the present disclosure may be implemented by one convolutional layer and one fully-connected layer. The second encoder may output a second embedded vector v_B of a fixed dimension according to equation (9), for use by the second decoder.

v_B = En(Mel_h)    (9)
Where En denotes the operational procedure of the encoder.
For the second decoder, the inputs at each step are s̃_t, p_t, sb_{t-1}, ẽ_{t-1}, eb_{t-1} and v_B, where p_t is the estimated value of the current sampling point predicted from the LPC coefficients, s̃_t is the current sampling point of the pseudo high sampling rate signal, ẽ_{t-1} is the difference between the sampling point s̃_{t-1} of the pseudo high sampling rate signal and the true sampling point at the previous time instant, eb_{t-1} is the difference between the sampling point sb_{t-1} output by the second decoder at the previous time instant and the true sampling point, and sb_{t-1} is the output of the second decoder at the previous time instant. It should be noted that for the p_t input to the second decoder, the LPC coefficients are calculated from the amplitude spectrum of the high sampling rate time domain signal in the sample set, and p_t is then calculated, similarly to equation (6), using the LPC coefficients and the previous outputs of the second decoder.
In the case where the current sample point is the first sample point of the final synthesized signal, the sample point error of the previous time of the high-time-domain signal may be set to zero, the sample point error of the second decoder at the previous time may be set to zero, and the sample point output by the second decoder at the previous time may be set to the second estimated value. However, the above examples are merely exemplary, and the present disclosure may be initialized differently according to design requirements.
The operation of the second decoder can be described by the following equation (10):

sb_t = De_B(v_B, s̃_t, p_t, ẽ_{t-1}, eb_{t-1}, sb_{t-1})    (10)

where sb_t is the synthesized high sampling rate time domain signal output by the second decoder and De_B denotes the operation of the second decoder; each run of the second decoder outputs the high sampling rate sampling point sb_t at time t.
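As with network A, the per-sample use of equations (9) and (10) can be sketched as follows; encoder_b, decoder_b and lpc_module_high are hypothetical callables, and the first-sample initialization follows the rule stated above (errors set to zero, previous output set to the estimate).

def run_network_b(mel_high, pseudo_high, encoder_b, decoder_b, lpc_module_high):
    # Sketch of the autoregressive loop of network B (equations (9)-(10)).
    v_b = encoder_b(mel_high)                  # equation (9)
    out, e_tilde_prev, eb_prev = [], 0.0, 0.0  # both error terms start at zero
    for t in range(len(pseudo_high)):
        p_t = lpc_module_high(out)             # LPC estimate for the high sampling rate signal
        sb_prev = out[-1] if out else p_t      # first sample: previous output := estimate
        sb_t = decoder_b(v_b, pseudo_high[t], p_t, e_tilde_prev, eb_prev, sb_prev)
        out.append(sb_t)
    return out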
In step S305, a loss function is constructed using the low time domain signal, the low sample rate time domain signal in the sample set, the composite signal, and the high sample rate time domain signal in the sample set. The first cross entropy loss function may be constructed using the low time domain signal and the downsampled low sample rate time domain signal, and the second cross entropy loss function may be constructed using the final composite signal and the high sample rate time domain signal, the loss function being constructed from the first cross entropy loss function and the second cross entropy loss function.
As an example, a first loss function is constructed using the output of the first decoder and the signal after downsampling the high sample rate time domain signal (i.e., the low sample rate time domain signal), and a second loss function is constructed using the output of the second decoder and the high sample rate time domain signal. For example, the loss function may be constructed according to the following equation (11):
Loss = CrossEntropy(ŝ_t, x_l(t)) + CrossEntropy(sb_t, x_h(t))    (11)

where CrossEntropy is the cross entropy loss function, ŝ_t is the output of the first decoder at each time instant, sb_t is the output of the second decoder at each time instant, x_l(t) is the low sampling rate real signal at each time instant (the signal obtained by downsampling the high sampling rate training signal), and x_h(t) is the high sampling rate real signal (the high sampling rate training signal) at each time instant.
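A sketch of equation (11) in PyTorch is shown below, under the assumption that both decoders output logits over quantized sampling-point classes (e.g. 8-bit mu-law) and that the reference signals have been quantized to the same classes; the function name and shapes are illustrative.

import torch.nn.functional as F

def vocoder_loss(logits_low, target_low, logits_high, target_high):
    # Equation (11): sum of the two cross entropy terms.
    # logits_*: (N, n_classes); target_*: (N,) integer class indices.
    loss_a = F.cross_entropy(logits_low, target_low)    # first cross entropy loss
    loss_b = F.cross_entropy(logits_high, target_high)  # second cross entropy loss
    return loss_a + loss_b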
In step S306, the parameters of the vocoder are trained based on the loss calculated by the loss function. For example, parameters of the vocoder, i.e., network parameters in the first encoder, the second encoder, the first decoder, and the second decoder in the vocoder, are trained by minimizing the loss calculated by the first loss function and the second loss function.
The sampling rate and resampling are explained below.
The sampling rate indicates how many sampling points are used to describe a signal within a period of time. The basic idea of sampling rate conversion is decimation and interpolation, and from the signal point of view audio resampling is a filtering operation: once the window size of the filter function and the interpolation function are determined, the behavior of the resampler is determined. Decimation may cause aliasing of the spectrum, while interpolation may produce image components, so an anti-aliasing filter is typically added before decimation and an anti-image filter after interpolation. As shown in fig. 5, h(n) denotes the anti-aliasing filter and g(n) denotes the anti-image filter.
Assuming that the original sampling rate of the audio signal is L, the new sampling rate is M, and the original signal length is N, the signal length K at the new sampling rate satisfies the following relationship (12):

K = N · M / L    (12)

For each discrete time index k (1 ≤ k ≤ K), the actual position n_k is given by equation (13):

n_k = k · L / M    (13)

n_k is the position, in terms of the original sampling interval, at which interpolation or decimation is performed.
In the ideal case, the frequency response of the filter h_D(n) is as shown in equation (14):

H_D(e^{jω}) = D for |ω| ≤ π/D, and 0 otherwise    (14)

where D is the decimation or interpolation factor, i.e. D = L/M.

The signal after the filter output can be represented by equation (15):

y(k) = Σ_m x(m) · h_D(n_k - m)    (15)
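A simplified numpy sketch of relations (12), (13) and (15) is given below; the windowed-sinc low-pass used here is a common stand-in for the ideal filter h_D of equation (14), and the window length is an arbitrary illustrative choice.

import numpy as np

def resample_sketch(x, L, M, half=16):
    N = len(x)
    K = int(N * M / L)                    # relation (12): output length at the new rate
    c = min(1.0, M / L)                   # normalized cutoff (< 1 only when decimating)
    window = np.hamming(2 * half + 1)
    y = np.zeros(K)
    for k in range(K):
        n_k = k * L / M                   # relation (13): position on the original time grid
        base = int(np.floor(n_k))
        m = np.arange(base - half, base + half + 1)
        valid = (m >= 0) & (m < N)
        # Relation (15): y(k) = sum_m x(m) * h(n_k - m), with h a windowed-sinc low-pass.
        h = c * np.sinc(c * (n_k - m[valid])) * window[valid]
        y[k] = np.dot(x[m[valid]], h)
    return y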
according to the embodiment of the disclosure, a network B with smaller parameters can be added on the basis of the existing LPCNet, so that the new vocoder can synthesize the voice signal with high sampling rate while keeping low operation complexity.
Fig. 4 is a schematic diagram of a vocoder according to an embodiment of the present disclosure. In fig. 4, it is shown and described primarily with respect to network B in the vocoder.
Referring to fig. 4, a vocoder according to the present disclosure may include a network A (LPCNet A in fig. 4) and a network B (the portions of fig. 4 other than LPCNet A). The structure of network B is similar to that of network A, but network B uses fewer parameters. By adding network B on top of the existing LPCNet (i.e., LPCNet A in fig. 4), the vocoder of the present disclosure can synthesize high sampling rate speech signals while maintaining low operational complexity.
Network A is configured to process the low frequency Mel spectrum features to obtain a synthesized low frequency signal, and a pseudo-synthesized high frequency signal is obtained by upsampling the synthesized low frequency signal. Here, the pseudo-synthesized high frequency signal may be obtained by network A or by another module.
The network B is used to process high frequency mel-spectrum features such as features in fig. 4 to obtain a composite high frequency signal.
Network B may include an LPC module (for calculating the LPC coefficients and the estimate of the sampling point), a second encoder 401, a second decoder 402, and other modules (such as an upsampling module). The encoder part 401 has one fewer convolutional layer and one fewer fully-connected layer than network A, the decoder part 402 uses only one GRU, and the current sampling point of the pseudo high sampling rate signal and its sampling point error are added as additional inputs of the decoder. Therefore, the structure of network B is simpler and the number of parameters is smaller.
For a trained vocoder, the sample point error of the decoder in network a at the last time and the sample point error of the decoder in network B at the last time may be set to zero.
The network B shown in fig. 4 is merely exemplary, and the present disclosure is not limited thereto.
In the present disclosure, a network B is added on the basis of the original LPCNet A. The low frequency domain features and the original LPCNet A are used to synthesize a low sampling rate signal, the decoder output of LPCNet A is upsampled to a pseudo high sampling rate signal, the high frequency domain features and the pseudo high sampling rate signal are combined in parallel as the input of the newly added network B, and network B outputs the high sampling rate synthesized signal.
Fig. 6 is a flowchart of a speech processing method according to an embodiment of the present disclosure. Text-to-speech (TTS) conversion is mainly divided into two parts: predicting the Mel spectrum of the input text in the frequency domain, and converting the Mel spectrum into time domain sampling points; the vocoder is mainly used for the latter conversion. The speech processing method shown in fig. 6 is mainly applied to convert the frequency domain features converted from text into a speech signal.
Referring to fig. 6, in step S601, a low frequency domain feature predicted from a text, which is a low sampling rate mel-spectrum feature corresponding to the text, and a high frequency domain feature, which is a high sampling rate mel-spectrum feature corresponding to the text, are acquired. Here, the low sampling rate mel-spectrum feature and the high sampling rate mel-spectrum feature may be obtained for the same text. For example, a high sampling rate mel-spectrum feature is obtained by performing mel-spectrum prediction on an input text, and then downsampling the high sampling rate mel-spectrum feature to obtain a low sampling rate mel-spectrum feature.
In step S602, a low time domain signal is obtained using a first neural network of a vocoder based on the low frequency domain characteristics. The following is performed for each sample point of the low time domain signal: a first estimated value of a current sampling point of the low-time-domain signal is calculated based on an amplitude spectrum corresponding to the low-sampling-rate Mel spectrum feature, a first embedded vector is obtained based on the low-frequency-domain feature via an operation of a first encoder of a first neural network, and the current sampling point of the low-time-domain signal is obtained based on the first embedded vector, the first estimated value, and a sampling point error at a previous time instant for a first decoder of the first neural network via an operation of the first decoder. In speech processing, the sample point error may be set to zero. For example, the first estimated value may be calculated using equation (6) above, and the current sampling point of the low time domain signal may be obtained using equation (7).
In step S603, a high time domain signal is obtained by upsampling a low time domain signal. Resampling may be performed using equation (8) above.
In step S604, a synthesized signal corresponding to the input text is obtained using a second neural network of the vocoder based on the high frequency domain features and the high time domain signals.
The following is performed for each sample point of the final composite signal: a second estimated value of a current sampling point of the synthesized signal is calculated based on an amplitude spectrum corresponding to the high-sampling rate Mel spectrum feature, a second embedded vector is obtained based on the high-frequency domain feature via an operation of a second encoder of the second neural network, and the current sampling point of the synthesized signal is obtained based on the second embedded vector, the current sampling point of the high-time domain signal, the second estimated value, a sampling point error for a previous time of the high-time domain signal, a sampling point error for a second decoder of the second neural network at the previous time, and a sampling point output by the second decoder at the previous time via an operation of the second decoder. Here, in the voice processing, the sampling point error may be set to zero.
In the case where the current sampling point is the first sampling point of the synthesized signal, the sampling point output by the second decoder at the previous time instant may be set to the second estimated value. For example, equation (6) may be used to calculate the second estimate and equation (10) may be used to obtain the final speech signal.
Fig. 7 is a block diagram of a training device of a vocoder according to an embodiment of the present disclosure.
Referring to fig. 7, the training apparatus 700 may include an acquisition module 701 and a training module 702. Each module in the training apparatus 700 may be implemented by one or more modules, and the name of the corresponding module may vary according to the type of the module. In various embodiments, some modules in the training apparatus 700 may be omitted, or additional modules may be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus the functions of the respective modules/elements prior to combination may be equivalently performed.
The obtaining module 701 may obtain a sample set, where the sample set includes a high sampling rate time domain signal, a low frequency domain feature, and a high frequency domain feature, where the low sampling rate time domain signal is obtained by downsampling the high sampling rate time domain signal, the low frequency domain feature is a mel-spectrum feature of the low sampling rate time domain signal, and the high frequency domain feature is a mel-spectrum feature of the high sampling rate time domain signal. Here, the acquisition module 701 may acquire the sample set directly from the outside. Alternatively, the acquisition module 701 may acquire the high sampling rate time domain signal from the outside, then downsamples the high sampling rate time domain signal to obtain the low sampling rate time domain signal, and performs time-frequency transformation and filtering processing (such as through a mel filter bank) on the high sampling rate time domain signal and the low sampling rate time domain signal, respectively, so as to obtain the low frequency domain feature and the high frequency domain feature.
The training module 702 may obtain a low time domain signal using the first neural network of the vocoder based on the low frequency domain features, obtain a high time domain signal by upsampling the low time domain signal, obtain a synthesized signal using the second neural network of the vocoder based on the high frequency domain features and the high time domain signal, construct a loss function using the low time domain signal, the low sampling rate time domain signal, the synthesized signal and the high sampling rate time domain signal, and train the network parameters of the vocoder by minimizing the loss calculated by the loss function.
Alternatively, training module 702 may perform the following for each sample point of the low-time-domain signal: calculating a first estimated value of a current sampling point of the low-time-domain signal based on an amplitude spectrum of the low-sampling-rate time-domain signal; obtaining a first embedded vector based on the low frequency domain features via operation of a first encoder of a first neural network; the current sampling point of the low time domain signal is obtained via an operation of the first decoder based on the first embedded vector, the first estimate, and a sampling point error at a previous time instant for the first decoder of the first neural network.
Alternatively, training module 702 may perform the following for each sample point of the final composite signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum of the high sampling rate time domain signal; obtaining a second embedded vector based on the high frequency domain features via operation of a second encoder of a second neural network; the current sampling point of the synthesized signal is obtained via an operation of the second decoder based on the second embedded vector, the current sampling point of the high-time-domain signal, the second estimated value, the sampling point error for the previous time of the high-time-domain signal, the sampling point error for the previous time of the second decoder of the second neural network, and the sampling point output by the second decoder at the previous time.
Alternatively, training module 702 may construct a first cross entropy loss function using the low time domain signal and the low sample rate time domain signal; constructing a second cross entropy loss function using the composite signal and the high sample rate time domain signal; the loss function is formed by a first cross entropy loss function and a second cross entropy loss function.
Fig. 8 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure.
Referring to fig. 8, a speech processing apparatus 800 may include an acquisition module 801 and a processing module 802. Each module in the speech processing apparatus 800 may be implemented by one or more modules, and the names of the corresponding modules may vary according to the types of the modules. In various embodiments, some modules in speech processing device 800 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus functions of the respective modules/elements prior to combination may be equivalently performed.
The obtaining module 801 may obtain a low frequency domain feature predicted from the text and a high frequency domain feature, where the low frequency domain feature is a low sample rate mel spectrum feature corresponding to the text and the high frequency domain feature is a high sample rate mel spectrum feature corresponding to the text. The mel-spectrum features acquired for the acquisition module 801 may be obtained externally.
The processing module 802 may obtain a low time domain signal using a first neural network of a vocoder based on the low frequency domain characteristics, obtain a high time domain signal by upsampling the low time domain signal, and obtain a synthesized signal corresponding to text using a second neural network of the vocoder based on the high frequency domain characteristics and the high time domain signal.
Alternatively, the processing module 802 may perform the following for each sample point of the low time domain signal: a first estimated value of a current sampling point of the low-time-domain signal is calculated based on an amplitude spectrum corresponding to the low-sampling-rate Mel spectrum feature, a first embedded vector is obtained based on the low-frequency-domain feature via an operation of a first encoder of a first neural network, and the current sampling point of the low-time-domain signal is obtained based on the first embedded vector, the first estimated value, and a sampling point error at a previous time instant for a first decoder of the first neural network via an operation of the first decoder. Here, the sampling point error of the first decoder is used in the training phase of the vocoder, and thus, in the voice processing, the sampling point error may be set to zero, to which the present disclosure is not limited.
Alternatively, the processing module 802 may perform the following for each sampling point of the final synthesized signal: calculating a second estimated value of a current sampling point of the synthesized signal based on an amplitude spectrum corresponding to the high sampling rate mel-spectrum feature; obtaining a second embedded vector based on the high frequency domain feature via an operation of a second encoder of the second neural network; and obtaining the current sampling point of the synthesized signal via an operation of the second decoder based on the second embedded vector, the current sampling point of the high time domain signal, the second estimated value, the sampling point error at the previous time instant of the high time domain signal, the sampling point error of the second decoder of the second neural network at the previous time instant, and the sampling point output by the second decoder at the previous time instant. Here, too, the sampling point errors may be set to zero during speech processing, although the present disclosure is not limited thereto.
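A minimal sketch of this per-sample (autoregressive) loop for the second network is given below. The callables second_encoder, second_decoder, and estimate_from_magnitude, together with their signatures, are assumptions introduced for illustration; the present disclosure does not define these interfaces. Consistent with the note above, the previous-time errors are simply kept at zero during inference.

```python
import numpy as np

def generate_high_rate(high_mel, high_time, second_encoder, second_decoder,
                       estimate_from_magnitude):
    """Autoregressive sampling loop of the second network (illustrative only)."""
    num_samples = len(high_time)
    synthesized = np.zeros(num_samples)
    prev_high_error = 0.0   # sampling point error of the high time domain signal at t-1
    prev_dec_error = 0.0    # sampling point error of the second decoder at t-1
    prev_output = 0.0       # sample output by the second decoder at t-1

    # Embedded vector(s) derived from the high frequency domain (mel) features.
    embedding = second_encoder(high_mel)

    for t in range(num_samples):
        # Estimate of the current sample derived from the magnitude spectrum.
        estimate = estimate_from_magnitude(high_mel, t)
        # The decoder consumes the embedding, the current upsampled sample, the estimate,
        # both previous-time errors, and the previous output sample.
        out = second_decoder(embedding, high_time[t], estimate,
                             prev_high_error, prev_dec_error, prev_output)
        synthesized[t] = out
        # During inference the errors may simply remain zero (see above).
        prev_output = out
    return synthesized
```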
Fig. 9 is a schematic structural diagram of a speech processing device of a hardware running environment of an embodiment of the present disclosure.
As shown in fig. 9, the speech processing device 900 may include: a processing component 901, a communication bus 902, a network interface 903, an input output interface 904, a memory 905, and a power component 906. The communication bus 902 is used to enable communication among these components. The input output interface 904 may include a video display (such as a liquid crystal display), a microphone and speaker, and a user interaction interface (such as a keyboard, mouse, or touch input device); optionally, the input output interface 904 may also include standard wired and wireless interfaces. The network interface 903 may optionally include a standard wired interface and a wireless interface (e.g., a wireless fidelity interface). The memory 905 may be a high-speed random access memory or a stable nonvolatile memory. The memory 905 may alternatively be a storage device separate from the processing component 901 described previously.
Those skilled in the art will appreciate that the structure shown in fig. 9 does not limit the speech processing device 900, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in fig. 9, an operating system (such as a MAC operating system), a data storage module, a network communication module, a user interface module, a voice processing program, a model training program, and a database may be included in the memory 905 as one storage medium.
In the speech processing device 900 shown in fig. 9, the network interface 903 is mainly used for data communication with an external electronic device/terminal; the input/output interface 904 is mainly used for data interaction with a user; and the processing component 901 and the memory 905 are provided in the speech processing device 900. The speech processing device 900 executes the speech processing method provided by the embodiments of the present disclosure by having the processing component 901 call the speech processing program, the material, and the various APIs provided by the operating system stored in the memory 905.
The processing component 901 may include at least one processor, with a set of computer-executable instructions stored in memory 905 that, when executed by the at least one processor, perform a voice processing method or a vocoder training method according to embodiments of the present disclosure. Further, the processing component 901 may perform encoding operations, decoding operations, and the like. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
The processing component 901 may be used to train a vocoder of the present disclosure. For example, the processing component 901 may obtain a sample set from an external source, where the sample set includes a high sampling rate time domain signal, a low frequency domain feature, and a high frequency domain feature; the low sampling rate time domain signal is obtained by downsampling the high sampling rate time domain signal, the low frequency domain feature is a mel-spectrum feature of the low sampling rate time domain signal, and the high frequency domain feature is a mel-spectrum feature of the high sampling rate time domain signal. The processing component 901 may then obtain a low time domain signal using the first neural network of the vocoder based on the low frequency domain feature, obtain a high time domain signal by upsampling the low time domain signal, obtain a synthesized signal using the second neural network of the vocoder based on the high frequency domain feature and the high time domain signal, construct a loss function using the low time domain signal, the low sampling rate time domain signal, the synthesized signal, and the high sampling rate time domain signal, and train parameters of the vocoder by minimizing the loss calculated by the loss function.
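A highly simplified training-step sketch under these definitions might look as follows. The use of torchaudio for downsampling and mel extraction, the specific rates (32 kHz to 16 kHz), the stand-in networks, and the L1 surrogate loss are assumptions made only to keep the example self-contained; the disclosure itself describes cross-entropy terms (see the earlier loss sketch), and the stand-in networks are assumed to emit waveforms matching the target lengths.

```python
import torch
import torchaudio

def training_step(high_wave, first_net, second_net, optimizer,
                  high_sr=32000, low_sr=16000):
    """One illustrative optimization step of the two-stage vocoder."""
    # Ground-truth low-sampling-rate time domain signal obtained by downsampling.
    low_wave = torchaudio.functional.resample(high_wave, high_sr, low_sr)

    # Mel-spectrum features of both signals (the low/high frequency domain features).
    low_mel = torchaudio.transforms.MelSpectrogram(sample_rate=low_sr, n_mels=80)(low_wave)
    high_mel = torchaudio.transforms.MelSpectrogram(sample_rate=high_sr, n_mels=80)(high_wave)

    # First network: low time domain signal from the low frequency domain features.
    low_out = first_net(low_mel)
    # Upsample the low time domain signal to the high sampling rate.
    high_time = torch.repeat_interleave(low_out, high_sr // low_sr, dim=-1)
    # Second network: synthesized signal from the high frequency domain features
    # and the upsampled signal.
    synth = second_net(high_mel, high_time)

    # Loss built from both signal pairs; an L1 surrogate replaces the cross-entropy
    # terms here purely so the example runs without a quantization front end.
    loss = torch.nn.functional.l1_loss(low_out, low_wave) \
         + torch.nn.functional.l1_loss(synth, high_wave)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```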
As another example, the processing component 901 may operate as a vocoder of the present disclosure to convert text into a speech signal. For example, the processing component 901 may obtain, from an external source, a low frequency domain feature and a high frequency domain feature predicted from text, where the low frequency domain feature is a low sampling rate mel-spectrum feature corresponding to the text and the high frequency domain feature is a high sampling rate mel-spectrum feature corresponding to the text; obtain a low time domain signal using the first neural network of the vocoder based on the low frequency domain feature; obtain a high time domain signal by upsampling the low time domain signal; and obtain a synthesized signal corresponding to the text using the second neural network of the vocoder based on the high frequency domain feature and the high time domain signal. The first and second neural networks may also be implemented with deep neural networks other than LPCNet.
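For the upsampling step from the low time domain signal to the high time domain signal, a polyphase resampler is one common choice. The sketch below, using scipy and a 16 kHz to 32 kHz ratio, is only an illustrative assumption; the present disclosure does not prescribe a particular upsampling method.

```python
import numpy as np
from scipy.signal import resample_poly

def upsample_low_to_high(low_time, up=2, down=1):
    """Upsample the low time domain signal to the high sampling rate (illustrative)."""
    # resample_poly applies an anti-imaging low-pass filter while changing the rate
    # by the rational factor up/down (e.g. 16 kHz -> 32 kHz for up=2, down=1).
    return resample_poly(np.asarray(low_time, dtype=np.float64), up, down)

high_time = upsample_low_to_high(np.random.randn(16000))  # 1 s at 16 kHz -> 32000 samples
```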
The processing component 901 can control the components included in the speech processing device 900 by executing programs. The speech processing device 900 can receive or output video and/or audio via the input output interface 904; for example, it can output the final synthesized speech signal through the input output interface 904.
By way of example, the speech processing device 900 can be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the above-described set of instructions. Here, the speech processing device 900 need not be a single electronic device, but may be any device or aggregate of circuits capable of executing the above-described instructions (or instruction set) alone or in combination. The speech processing device 900 can also be part of an integrated control system or system manager, or can be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the speech processing apparatus 900, the processing component 901 can comprise a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processing component 901 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processing component 901 may execute instructions or code stored in a memory, wherein the memory 905 may also store data. Instructions and data may also be transmitted and received over a network via network interface 903, where network interface 903 may employ any known transmission protocol.
The memory 905 may be integrated with the processing component 901, for example, as RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, the memory 905 may include a separate device, such as an external disk drive, a storage array, or another storage device usable by any database system. The memory 905 and the processing component 901 may be operatively coupled or may communicate with each other, for example, through an I/O port or a network connection, such that the processing component 901 can read data stored in the memory 905.
According to embodiments of the present disclosure, an electronic device may be provided. Fig. 10 is a block diagram of an electronic device according to an embodiment of the present disclosure. The electronic device 1000 may include at least one memory 1002 and at least one processor 1001, the at least one memory 1002 storing a set of computer-executable instructions that, when executed by the at least one processor 1001, perform a speech processing method or a vocoder training method according to an embodiment of the present disclosure.
The processor 1001 may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 1001 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and so forth.
The memory 1002, which is one storage medium, may include an operating system (e.g., a MAC operating system), a data storage module, a network communication module, a user interface module, a speech processing program, a model training program, and a database.
The memory 1002 may be integrated with the processor 1001, for example, RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. In addition, the memory 1002 may include a stand-alone device, such as an external disk drive, a storage array, or other storage device usable by any database system. The memory 1002 and the processor 1001 may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., so that the processor 1001 can read files stored in the memory 1002.
In addition, the electronic device 1000 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 1000 may be connected to each other via buses and/or networks.
It will be appreciated by those skilled in the art that the structure shown in fig. 10 is not limiting and may include more or fewer components than shown, or certain components may be combined, or a different arrangement of components.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform a speech processing method or a model training method according to the present disclosure. Examples of the computer readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, nonvolatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drives (HDD), solid state drives (SSD), card memory (such as multimedia cards, Secure Digital (SD) cards, or eXtreme Digital (xD) cards), magnetic tape, floppy disks, magneto-optical data storage, hard disks, solid state disks, and any other device configured to store computer programs and any associated data, data files, and data structures in a non-transitory manner and to provide the computer programs and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the programs. The computer programs in the computer readable storage media described above can be run in an environment deployed in a computer device, such as a client, host, proxy device, or server; furthermore, in one example, the computer programs and any associated data, data files, and data structures are distributed across networked computer systems such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
In accordance with embodiments of the present disclosure, a computer program product may also be provided, instructions in which are executable by a processor of a computer device to perform the above-described speech processing method or model training method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A speech processing method, the speech processing method comprising:
downsampling the high sampling rate mel-spectrum features of the input text to obtain low sampling rate mel-spectrum features;
Obtaining a low time domain signal using a first neural network of a vocoder based on the low sample rate mel-spectrum feature;
obtaining a high time domain signal by upsampling the low time domain signal;
obtaining a speech signal corresponding to the high sample rate mel-spectrum feature using a second neural network of the vocoder based on the high sample rate mel-spectrum feature and the high time domain signal,
wherein the step of obtaining a speech signal corresponding to the high sample rate mel-spectrum feature using the second neural network of the vocoder based on the high sample rate mel-spectrum feature and the high time domain signal comprises:
the following is performed for each sample point of the speech signal:
calculating a second estimated value of a current sampling point of the voice signal based on an amplitude spectrum corresponding to the high sampling rate mel-spectrum feature;
obtaining a second embedded vector via operation of a second encoder of a second neural network based on the high sample rate mel-spectrum feature;
the current sampling point of the speech signal is obtained via an operation of the second decoder based on the second embedded vector, the current sampling point of the high-time-domain signal, the second estimated value, the sampling point error for the previous time instant of the high-time-domain signal, the sampling point error for the previous time instant of the second decoder of the second neural network, and the sampling point output by the second decoder at the previous time instant.
2. The method of claim 1, wherein the step of using the first neural network of the vocoder to obtain the low time domain signal based on the low sample rate mel-spectrum feature comprises:
the following is performed for each sample point of the low time domain signal:
calculating a first estimated value of a current sampling point of the low time domain signal based on an amplitude spectrum corresponding to the low sampling rate mel spectrum feature;
obtaining a first embedded vector via operation of a first encoder of a first neural network based on low sample rate mel-spectrum features;
the current sampling point of the low time domain signal is obtained via an operation of the first decoder based on the first embedded vector, the first estimate, and a sampling point error at a previous time instant for the first decoder of the first neural network.
3. The method of claim 1, wherein the high sampling rate mel-spectrum feature of the input text is obtained by performing mel-spectrum prediction on the input text.
4. A method of training a vocoder, the method comprising:
obtaining a sample set, wherein the sample set comprises a high sampling rate time domain signal, a low frequency domain feature and a high frequency domain feature, wherein the low sampling rate time domain signal is obtained by downsampling the high sampling rate time domain signal, the low frequency domain feature is a mel spectrum feature of the low sampling rate time domain signal, and the high frequency domain feature is a mel spectrum feature of the high sampling rate time domain signal;
Obtaining a low time domain signal using a first neural network of a vocoder based on the low frequency domain features;
obtaining a high time domain signal by upsampling the low time domain signal;
obtaining a synthesized signal using a second neural network of the vocoder based on the high frequency domain features and the high time domain signal;
constructing a loss function using the low time domain signal, the low sampling rate time domain signal, the synthesized signal, and the high sampling rate time domain signal;
training parameters of the vocoder based on the loss calculated by the loss function,
wherein the step of obtaining the synthesized signal using the second neural network of the vocoder based on the high frequency domain features and the high time domain signal comprises:
the following is performed for each sample point of the synthesized signal:
calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum of the high sampling rate time domain signal;
obtaining a second embedded vector based on the high frequency domain features via operation of a second encoder of a second neural network;
the current sampling point of the synthesized signal is obtained via an operation of the second decoder based on the second embedded vector, the current sampling point of the high-time-domain signal, the second estimated value, the sampling point error for the previous time of the high-time-domain signal, the sampling point error for the previous time of the second decoder of the second neural network, and the sampling point output by the second decoder at the previous time.
5. The training method of claim 4, wherein the step of using the first neural network of the vocoder to obtain the low time domain signal based on the low frequency domain characteristics comprises:
the following is performed for each sample point of the low time domain signal:
calculating a first estimated value of a current sampling point of the low-time-domain signal based on an amplitude spectrum of the low-sampling-rate time-domain signal;
obtaining a first embedded vector based on the low frequency domain features via operation of a first encoder of a first neural network;
the current sampling point of the low time domain signal is obtained via an operation of the first decoder based on the first embedded vector, the first estimate, and a sampling point error at a previous time instant for the first decoder of the first neural network.
6. The training method of claim 4, wherein the step of constructing the loss function using the low time domain signal, the low sampling rate time domain signal, the synthesized signal, and the high sampling rate time domain signal comprises:
constructing a first cross entropy loss function using the low time domain signal and the low sampling rate time domain signal;
constructing a second cross entropy loss function using the synthesized signal and the high sampling rate time domain signal;
forming the loss function from the first cross entropy loss function and the second cross entropy loss function.
7. A speech processing apparatus, characterized in that the speech processing apparatus comprises:
an acquisition module configured to downsample high sample rate mel-spectrum features of the input text to acquire low sample rate mel-spectrum features; and
a processing module configured to:
obtaining a low time domain signal using a first neural network of a vocoder based on the low sample rate mel-spectrum feature;
obtaining a high time domain signal by upsampling the low time domain signal;
obtaining a speech signal corresponding to the high sample rate mel-spectrum feature using a second neural network of the vocoder based on the high sample rate mel-spectrum feature and the high time domain signal,
wherein the processing module is configured to, for each sampling point of the speech signal:
calculating a second estimated value of a current sampling point of the voice signal based on an amplitude spectrum corresponding to the high sampling rate mel-spectrum feature;
obtaining a second embedded vector via operation of a second encoder of a second neural network based on the high sample rate mel-spectrum feature;
the current sampling point of the speech signal is obtained via an operation of the second decoder based on the second embedded vector, the current sampling point of the high-time-domain signal, the second estimated value, the sampling point error for the previous time instant of the high-time-domain signal, the sampling point error for the previous time instant of the second decoder of the second neural network, and the sampling point output by the second decoder at the previous time instant.
8. The speech processing apparatus of claim 7 wherein the processing module is configured to:
the following is performed for each sample point of the low time domain signal:
calculating a first estimated value of a current sampling point of the low time domain signal based on an amplitude spectrum corresponding to the low sampling rate mel spectrum feature;
obtaining a first embedded vector via operation of a first encoder of a first neural network based on low sample rate mel-spectrum features;
the current sampling point of the low time domain signal is obtained via an operation of the first decoder based on the first embedded vector, the first estimate, and a sampling point error at a previous time instant for the first decoder of the first neural network.
9. The speech processing apparatus of claim 7 wherein the high sample rate mel-spectrum feature of the input text is obtained by performing mel-spectrum prediction on the input text.
10. A training device for a vocoder, the training device comprising:
an acquisition module configured to acquire a sample set, wherein the sample set includes a high-sampling-rate time-domain signal, a low-frequency-domain feature, and a high-frequency-domain feature, wherein the low-sampling-rate time-domain signal is obtained by downsampling the high-sampling-rate time-domain signal, the low-frequency-domain feature is a mel-spectrum feature of the low-sampling-rate time-domain signal, and the high-frequency-domain feature is a mel-spectrum feature of the high-sampling-rate time-domain signal;
A training module configured to:
obtaining a low time domain signal using a first neural network of a vocoder based on the low frequency domain features;
obtaining a high time domain signal by upsampling the low time domain signal;
obtaining a synthesized signal using a second neural network of the vocoder based on the high frequency domain features and the high time domain signal;
constructing a loss function using the low time domain signal, the low sampling rate time domain signal, the synthesized signal, and the high sampling rate time domain signal;
training parameters of the vocoder based on the loss calculated by the loss function,
wherein the training module is configured to perform the following for each sampling point of the synthesized signal:
calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum of the high sampling rate time domain signal;
obtaining a second embedded vector based on the high frequency domain features via operation of a second encoder of a second neural network;
the current sampling point of the synthesized signal is obtained via an operation of the second decoder based on the second embedded vector, the current sampling point of the high-time-domain signal, the second estimated value, the sampling point error for the previous time of the high-time-domain signal, the sampling point error for the previous time of the second decoder of the second neural network, and the sampling point output by the second decoder at the previous time.
11. The training device of claim 10, wherein the training module is configured to:
the following is performed for each sample point of the low time domain signal:
calculating a first estimated value of a current sampling point of the low-time-domain signal based on an amplitude spectrum of the low-sampling-rate time-domain signal;
obtaining a first embedded vector based on the low frequency domain features via operation of a first encoder of a first neural network;
the current sampling point of the low time domain signal is obtained via an operation of the first decoder based on the first embedded vector, the first estimate, and a sampling point error at a previous time instant for the first decoder of the first neural network.
12. The training device of claim 10, wherein the training module is configured to:
constructing a first cross entropy loss function using the low time domain signal and the low sampling rate time domain signal;
constructing a second cross entropy loss function using the synthesized signal and the high sampling rate time domain signal;
forming the loss function from the first cross entropy loss function and the second cross entropy loss function.
13. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
Wherein the computer executable instructions, when executed by the at least one processor, cause the at least one processor to perform the speech processing method of any one of claims 1 to 3 or the training method of any one of claims 4 to 6.
14. A computer readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the speech processing method of any one of claims 1 to 3 or the training method of any one of claims 4 to 6.
CN202110794822.1A 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder Active CN113470616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110794822.1A CN113470616B (en) 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110794822.1A CN113470616B (en) 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder

Publications (2)

Publication Number Publication Date
CN113470616A CN113470616A (en) 2021-10-01
CN113470616B true CN113470616B (en) 2024-02-23

Family

ID=77880157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110794822.1A Active CN113470616B (en) 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder

Country Status (1)

Country Link
CN (1) CN113470616B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111133507B (en) * 2019-12-23 2023-05-23 深圳市优必选科技股份有限公司 Speech synthesis method, device, intelligent terminal and readable medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07334194A (en) * 1994-06-14 1995-12-22 Matsushita Electric Ind Co Ltd Method and device for encoding/decoding voice
KR20020084765A (en) * 2001-05-03 2002-11-11 (주)디지텍 Method for synthesizing voice
WO2020191271A1 (en) * 2019-03-20 2020-09-24 Research Foundation Of The City University Of New York Method for extracting speech from degraded signals by predicting the inputs to a speech vocoder
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium
CN111316352A (en) * 2019-12-24 2020-06-19 深圳市优必选科技股份有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN111627418A (en) * 2020-05-27 2020-09-04 携程计算机技术(上海)有限公司 Training method, synthesizing method, system, device and medium for speech synthesis model
CN112599141A (en) * 2020-11-26 2021-04-02 北京百度网讯科技有限公司 Neural network vocoder training method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Research on Speech Synthesis Algorithms Based on Deep Neural Networks"; Zhang Yizhi; China Masters' Theses Full-text Database, Information Science and Technology Series (No. 05); full text *
"Voice Conversion Algorithm Based on Multi-Spectral Feature Generative Adversarial Networks"; Zhang Xiao; Zhang Wei; Wang Wenhao; Wan Yongjing; Computer Engineering & Science (No. 05); full text *

Also Published As

Publication number Publication date
CN113470616A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
CN103503061B (en) In order to process the device and method of decoded audio signal in a spectrum domain
KR101378696B1 (en) Determining an upperband signal from a narrowband signal
TW202111692A (en) Artificial intelligence based audio coding
RU2677453C2 (en) Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
CN109147805B (en) Audio tone enhancement based on deep learning
CN104969290A (en) Method and apparatus for controlling audio frame loss concealment
CN105825861A (en) Apparatus and method for determining weighting function, and quantization apparatus and method
US10339939B2 (en) Audio frame loss concealment
WO2011128723A1 (en) Audio communication device, method for outputting an audio signal, and communication system
EP1527441A2 (en) Audio coding
KR20160097232A (en) Systems and methods of blind bandwidth extension
WO2023001128A1 (en) Audio data processing method, apparatus and device
JP7209275B2 (en) AUDIO DATA LEARNING DEVICE, AUDIO DATA REASONING DEVICE, AND PROGRAM
Dendani et al. Speech enhancement based on deep AutoEncoder for remote Arabic speech recognition
Hao et al. Time-domain neural network approach for speech bandwidth extension
CN117174105A (en) Speech noise reduction and dereverberation method based on improved deep convolutional network
JP2023553629A (en) Audio signal enhancement method, device, computer equipment and computer program
CN113470616B (en) Speech processing method and device, vocoder and training method of vocoder
WO2015196835A1 (en) Codec method, device and system
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus
US20190066657A1 (en) Audio data learning method, audio data inference method and recording medium
JPH09127985A (en) Signal coding method and device therefor
CN114242110A (en) Model training method, audio processing method, device, equipment, medium and product
JPH09127987A (en) Signal coding method and device therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant