CN113470616A - Speech processing method and apparatus, vocoder and vocoder training method - Google Patents

Speech processing method and apparatus, vocoder and vocoder training method

Info

Publication number
CN113470616A
Authority
CN
China
Prior art keywords
domain signal, sampling, low, rate, signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110794822.1A
Other languages
Chinese (zh)
Other versions
CN113470616B (en)
Inventor
张旭
张新
李楠
郑羲光
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110794822.1A priority Critical patent/CN113470616B/en
Publication of CN113470616A publication Critical patent/CN113470616A/en
Application granted granted Critical
Publication of CN113470616B publication Critical patent/CN113470616B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure provides a speech processing method and apparatus, a vocoder, and a training method for the vocoder. The speech processing method may include: downsampling high-sampling-rate mel spectral features to obtain low-sampling-rate mel spectral features; obtaining a low time-domain signal using a first neural network of a vocoder based on the low-sampling-rate mel spectral features; obtaining a high time-domain signal by up-sampling the low time-domain signal; and obtaining, using a second neural network of the vocoder, a speech signal corresponding to the high-sampling-rate mel spectral features based on the high-sampling-rate mel spectral features and the high time-domain signal. The present disclosure enables synthesis of high-sampling-rate speech signals while maintaining low computational complexity.

Description

Speech processing method and apparatus, vocoder and vocoder training method
Technical Field
The present disclosure relates to the field of speech processing, and in particular, to a speech processing method and a speech processing apparatus for speech synthesis, and a training method for a vocoder and a vocoder.
Background
Vocoders have found wide application in speech synthesis using deep learning. An existing speech synthesis process generally first predicts a frequency-domain mel spectrum from the input text and then converts the mel spectrum into time-domain sampling points. The conversion from mel spectrum to sampling points is conventionally performed with the Griffin-Lim algorithm, but this algorithm yields relatively poor speech quality, whereas speech converted with deep learning methods has higher quality. Generally, the higher the effective sampling rate of the speech, the higher the quality of the synthesized speech and the better the listening experience. However, synthesizing high-sampling-rate audio is generally accompanied by an increase in the number of network parameters, which increases the cost of running the network.
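As a point of reference for the conventional approach mentioned above, the mel-spectrum-to-waveform conversion can be sketched with the Griffin-Lim implementation in librosa; the sampling rate, FFT size, and hop length below are illustrative assumptions rather than values taken from this disclosure.

    import librosa

    # Illustrative settings (assumptions, not values from this disclosure).
    sr, n_fft, hop = 16000, 1024, 256

    def mel_to_wave_griffin_lim(mel_power):
        # mel_power: mel-scale power spectrogram of shape (n_mels, frames).
        # Approximately invert the mel filter bank to a linear magnitude spectrogram.
        linear_mag = librosa.feature.inverse.mel_to_stft(
            mel_power, sr=sr, n_fft=n_fft, power=2.0)
        # Iterative Griffin-Lim phase reconstruction; quality is limited compared
        # with a neural vocoder, which motivates the approach of this disclosure.
        return librosa.griffinlim(linear_mag, hop_length=hop)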
Disclosure of Invention
The present disclosure provides a speech processing method and a speech processing apparatus for speech synthesis and a training method of a vocoder and a vocoder to solve at least the above-mentioned problems.
According to a first aspect of embodiments of the present disclosure, there is provided a speech processing method, which may include the steps of: downsampling the high sampling rate Mel spectral features to obtain low sampling rate Mel spectral features; obtaining a low time domain signal using a first neural network of a vocoder based on low sample rate mel-frequency spectral features; obtaining a high time domain signal by up-sampling the low time domain signal; a second neural network of the vocoder is utilized to obtain a speech signal corresponding to the high-sampling-rate Mel spectral features based on the high-sampling-rate Mel spectral features and the high-time-domain signal.
Optionally, the step of obtaining the low time-domain signal using the first neural network of the vocoder based on the low-sampling-rate mel spectral features may include performing the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low time-domain signal based on the amplitude spectrum corresponding to the low-sampling-rate mel spectral features; obtaining a first embedding vector through the operation of a first encoder of the first neural network based on the low-sampling-rate mel spectral features; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
Optionally, the step of obtaining, using the second neural network of the vocoder, a speech signal corresponding to the high-sampling-rate mel spectral features based on the high-sampling-rate mel spectral features and the high time-domain signal may include performing the following for each sampling point of the speech signal: calculating a second estimated value of the current sampling point of the speech signal based on the amplitude spectrum corresponding to the high-sampling-rate mel spectral features; obtaining a second embedding vector through the operation of a second encoder of the second neural network based on the high-sampling-rate mel spectral features; and obtaining the current sampling point of the speech signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
Alternatively, the high sampling rate mel-frequency spectrum features may be obtained by performing mel-frequency spectrum prediction on the input text.
According to a second aspect of embodiments of the present disclosure, there is provided a training method of a vocoder, which may include the steps of: acquiring a sample set, wherein the sample set includes a high-sampling-rate time-domain signal, low-frequency-domain features, and high-frequency-domain features, a low-sampling-rate time-domain signal is obtained by down-sampling the high-sampling-rate time-domain signal, the low-frequency-domain features are mel spectral features of the low-sampling-rate time-domain signal, and the high-frequency-domain features are mel spectral features of the high-sampling-rate time-domain signal; obtaining a low time-domain signal using a first neural network of the vocoder based on the low-frequency-domain features; obtaining a high time-domain signal by up-sampling the low time-domain signal; obtaining a synthesized signal using a second neural network of the vocoder based on the high-frequency-domain features and the high time-domain signal; constructing a loss function using the low time-domain signal, the low-sampling-rate time-domain signal, the synthesized signal, and the high-sampling-rate time-domain signal; and training parameters of the vocoder based on the loss calculated by the loss function.
Optionally, the step of obtaining the low time-domain signal using the first neural network of the vocoder based on the low-frequency-domain features may include performing the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low-sampling-rate time-domain signal based on the amplitude spectrum of the low-sampling-rate time-domain signal; obtaining a first embedding vector through the operation of a first encoder of the first neural network based on the low-frequency-domain features; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
Optionally, the step of obtaining the synthesized signal using the second neural network of the vocoder based on the high-frequency-domain features and the high time-domain signal may include performing the following for each sampling point of the synthesized signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum of the high-sampling-rate time-domain signal; obtaining a second embedding vector through the operation of a second encoder of the second neural network based on the high-frequency-domain features; and obtaining the current sampling point of the synthesized signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
Optionally, the step of constructing the loss function using the low time-domain signal, the low-sampling-rate time-domain signal, the synthesized signal, and the high-sampling-rate time-domain signal may include: constructing a first cross-entropy loss function using the low time-domain signal and the low-sampling-rate time-domain signal; constructing a second cross-entropy loss function using the synthesized signal and the high-sampling-rate time-domain signal; and constructing the loss function from the first cross-entropy loss function and the second cross-entropy loss function.
According to a third aspect of the embodiments of the present disclosure, there is provided a voice processing apparatus, which may include: an acquisition module configured to downsample the high sampling rate mel-frequency spectrum features to acquire low sampling rate mel-frequency spectrum features; and a processing module configured to: obtaining a low time domain signal using a first neural network of a vocoder based on low sample rate mel-frequency spectral features; obtaining a high time domain signal by up-sampling the low time domain signal; a second neural network of the vocoder is utilized to obtain a speech signal corresponding to the high-sampling-rate Mel spectral features based on the high-sampling-rate Mel spectral features and the high-time-domain signal.
Optionally, the processing module may be configured to perform the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low time-domain signal based on the amplitude spectrum corresponding to the low-sampling-rate mel spectral features; obtaining a first embedding vector through the operation of a first encoder of the first neural network based on the low-sampling-rate mel spectral features; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
Optionally, the processing module may be configured to perform the following for each sampling point of the speech signal: calculating a second estimated value of the current sampling point of the speech signal based on the amplitude spectrum corresponding to the high-sampling-rate mel spectral features; obtaining a second embedding vector through the operation of a second encoder of the second neural network based on the high-sampling-rate mel spectral features; and obtaining the current sampling point of the speech signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
Alternatively, the high sampling rate mel-frequency spectrum features may be obtained by performing mel-frequency spectrum prediction on the input text.
According to a fourth aspect of embodiments of the present disclosure, there is provided a training apparatus of a vocoder, which may include: an obtaining module configured to obtain a sample set, wherein the sample set includes a high-sampling-rate time-domain signal, low-frequency-domain features, and high-frequency-domain features, a low-sampling-rate time-domain signal is obtained by down-sampling the high-sampling-rate time-domain signal, the low-frequency-domain features are mel spectral features of the low-sampling-rate time-domain signal, and the high-frequency-domain features are mel spectral features of the high-sampling-rate time-domain signal; and a training module configured to: obtain a low time-domain signal using a first neural network of the vocoder based on the low-frequency-domain features; obtain a high time-domain signal by up-sampling the low time-domain signal; obtain a synthesized signal using a second neural network of the vocoder based on the high-frequency-domain features and the high time-domain signal; construct a loss function using the low time-domain signal, the low-sampling-rate time-domain signal, the synthesized signal, and the high-sampling-rate time-domain signal; and train parameters of the vocoder based on the loss calculated by the loss function.
Optionally, the training module may be configured to perform the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low-sampling-rate time-domain signal based on the amplitude spectrum of the low-sampling-rate time-domain signal; obtaining a first embedding vector through the operation of a first encoder of the first neural network based on the low-frequency-domain features; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
Optionally, the training module may be configured to perform the following for each sampling point of the synthesized signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum of the high-sampling-rate time-domain signal; obtaining a second embedding vector through the operation of a second encoder of the second neural network based on the high-frequency-domain features; and obtaining the current sampling point of the synthesized signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
Optionally, the training module may be configured to: construct a first cross-entropy loss function using the low time-domain signal and the low-sampling-rate time-domain signal; construct a second cross-entropy loss function using the synthesized signal and the high-sampling-rate time-domain signal; and construct the loss function from the first cross-entropy loss function and the second cross-entropy loss function.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus, which may include: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the speech processing method and the training method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the speech processing method and the training method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, instructions of which are executed by at least one processor in an electronic device to perform the speech processing method and the training method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the vocoder of the present disclosure is implemented by adding a network with a smaller number of parameters on the basis of the original LPCNet, so that the vocoder of the present disclosure can synthesize a high-sampling-rate speech signal while maintaining low computational complexity.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a diagram of an existing LPCNet;
fig. 2 is a schematic flow diagram for training a vocoder according to an embodiment of the present disclosure;
fig. 3 is a flow chart of a method of training a vocoder according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a vocoder according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow diagram for resampling, according to an embodiment of the disclosure;
FIG. 6 is a flow diagram of a method of speech processing according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a training apparatus of a vocoder according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 9 is a schematic block diagram of a speech processing device according to an embodiment of the present disclosure;
fig. 10 is a block diagram of an electronic device according to an embodiment of the disclosure.
Throughout the drawings, it should be noted that the same reference numerals are used to designate the same or similar elements, features and structures.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure as defined by the claims and their equivalents. Various specific details are included to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to the written meaning, but are used only by the inventors to achieve a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following descriptions of the various embodiments of the present disclosure are provided for illustration only and not for the purpose of limiting the disclosure as defined by the appended claims and their equivalents.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a diagram of an existing LPCNet.
Referring to fig. 1, the conventional LPCNet implements the vocoder function in the form of an encoder and a decoder. The input of the encoder portion (such as 101 in fig. 1) is a frame-rate-domain feature of speech (such as features in fig. 1), and the output is an embedded vector provided to the decoder portion (such as 102 in fig. 1). The linear prediction coefficient (LPC) module may compute the LPC coefficients based on the features and predict/compute an estimate of the current sampling point. The decoder portion 102 receives the output of the encoder portion 101, the estimate p_t of the current sampling point calculated by the LPC module, the sampling point s_{t-1} output by the decoder portion 102 at the previous time, and the error e_{t-1} between the output of the decoder portion at the previous time and the true sampling point, performs a series of operations, and outputs the sampling point s_t at the current time.
In fig. 1, the encoder portion 101 includes two convolutional layers (such as conv 1x3) and two fully-connected layers (such as FC). The encoder portion 101 may output the embedded vector by performing two convolution operations, a summation operation, and two fully-connected layer operations on the features.
The decoder portion 102 includes two gated recurrent units (such as GRU_A and GRU_B), a dual fully-connected layer (such as dual FC), and a normalization layer (such as softmax). The decoder portion 102 may perform a concatenation (concat) operation on the output of the encoder, p_t, s_{t-1}, and e_{t-1}, followed by the two GRU operations, the dual FC operation, a softmax operation, a sampling operation, and a summation operation, to output the current sampling point s_t of the LPCNet.
The LPCNet outputs sampling points autoregressively: each run outputs only one time-domain sampling point, so outputting speech at a higher sampling rate multiplies the amount of computation. For example, at a sampling rate of 16k, the LPCNet needs to run its decoder 16000 times for every second of speech; after the sampling rate is increased to 32k, the decoder, which contains two GRU layers, needs to run 32000 times, so the amount of computation is large.
To solve the above problem, the present disclosure proposes a cascaded deep neural network algorithm to synthesize high-sampling-rate audio. According to an embodiment of the present disclosure, a neural network (hereinafter referred to as network B) is added on the basis of the existing LPCNet (hereinafter referred to as network A), where network A may be used to process frequency-domain features at a low sampling rate and network B may be used to process frequency-domain features at a high sampling rate. The low-sampling-rate signal output by network A is up-sampled to a pseudo high-sampling-rate signal, and the high-sampling-rate frequency-domain features and the pseudo high-sampling-rate signal are connected in parallel and then serve as the input of the newly added network B, so that network B outputs a high-sampling-rate synthesized signal. In this way, the network complexity is reduced while the quality of the synthesized audio is ensured.
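The cascaded flow described above can be summarized in the following sketch; the function names are placeholders for the modules of network A, network B, and the resampling step, and their interfaces are assumptions for illustration only.

    def synthesize(mel_high, downsample_mel, network_a, upsample, network_b):
        # Step 1: derive low-sampling-rate mel features from the high-rate features.
        mel_low = downsample_mel(mel_high)
        # Step 2: network A (LPCNet-style) synthesizes a low-sampling-rate signal.
        wav_low = network_a(mel_low)
        # Step 3: up-sample to a pseudo high-sampling-rate signal.
        wav_pseudo_high = upsample(wav_low)
        # Step 4: network B conditions on the high-rate mel features and the
        # pseudo high-rate signal to produce the final high-sampling-rate speech.
        return network_b(mel_high, wav_pseudo_high)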
Hereinafter, according to various embodiments of the present disclosure, a method, an apparatus, and a system of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 2 is a schematic flow diagram for training a vocoder according to an embodiment of the present disclosure. According to an embodiment of the present disclosure, a vocoder may be constructed of network a and network B.
Referring to fig. 2, a sample set is first obtained before training the vocoder. The target time-domain signal with a high sampling rate (i.e., the high-sampling-rate time-domain signal) may be down-sampled to obtain a low-sampling-rate signal (i.e., the low-sampling-rate time-domain signal), and the low-sampling-rate signal is then subjected to a short-time Fourier transform (STFT) and passed through a mel filter bank (Mel Bank) to obtain low-sampling-rate mel spectral features (i.e., the low-frequency-domain features), which are used as the input features of network A. Here network A may comprise a first encoder, a first decoder, and an LPC module. The LPC module may be used to calculate the LPC coefficients as well as the estimate of the current sampling point. The first encoder performs an encoding operation on the mel spectral features to obtain a first embedding vector, and the first decoder performs a decoding operation using the first embedding vector to obtain a synthesized low-frequency signal (i.e., a low time-domain signal). The synthesized low-frequency signal is up-sampled to obtain a synthesized high-frequency signal (i.e., a high time-domain signal) as an input of network B. Here, the high time-domain signal may be regarded as a pseudo high-frequency signal.
In addition, after the short-time Fourier transform (STFT) is performed on the high-sampling-rate target time-domain signal, high-sampling-rate mel spectral features (i.e., the high-frequency-domain features) are obtained through the mel filter bank and are used as an input of network B.
According to an embodiment of the present disclosure, network B may include an LPC module, a second encoder, and a second decoder. The LPC module may be used to calculate LPC coefficients as well as an estimate of the current sample point. The second encoder may perform an encoding operation on the mel-frequency spectral features at the high sampling rate to obtain a second embedded vector. The second decoder performs a decoding operation using the second embedded vector and the pseudo-synthesized high-frequency signal to obtain a final high-frequency signal. The required size of the network parameters can be reduced since network B has enough information to synthesize a high sample rate speech signal.
Compared with network A, the second encoder of network B according to embodiments of the present disclosure may use one fewer convolutional layer and one fewer fully-connected layer, and the second decoder may use only one GRU layer. Compared with directly using network A to synthesize high-sampling-rate speech signals, the network structure and the computational complexity of the vocoder are reduced. This is because network B has more input information than network A, so a complex network structure is not required to synthesize the high-sampling-rate signal, and the high-sampling-rate signal can be synthesized with a smaller number of parameters. The training process of the vocoder of the present disclosure will be described in detail with reference to fig. 3.
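The difference in encoder size can be illustrated with the following PyTorch sketch; the channel width, mel dimension, and activation choices are assumptions, and the residual summation of LPCNet's encoder is omitted for brevity.

    import torch
    import torch.nn as nn

    class EncoderA(nn.Module):
        """First encoder: two 1x3 convolutions and two fully-connected layers."""
        def __init__(self, n_mels=80, width=128):  # sizes are assumptions
            super().__init__()
            self.convs = nn.Sequential(
                nn.Conv1d(n_mels, width, kernel_size=3, padding=1), nn.Tanh(),
                nn.Conv1d(width, width, kernel_size=3, padding=1), nn.Tanh())
            self.fcs = nn.Sequential(
                nn.Linear(width, width), nn.Tanh(),
                nn.Linear(width, width), nn.Tanh())

        def forward(self, mel):                  # mel: (batch, n_mels, frames)
            h = self.convs(mel).transpose(1, 2)  # (batch, frames, width)
            return self.fcs(h)                   # first embedding vector v_A

    class EncoderB(nn.Module):
        """Second encoder: one fewer convolution and one fewer FC layer."""
        def __init__(self, n_mels=80, width=128):
            super().__init__()
            self.conv = nn.Conv1d(n_mels, width, kernel_size=3, padding=1)
            self.fc = nn.Linear(width, width)

        def forward(self, mel):
            h = torch.tanh(self.conv(mel)).transpose(1, 2)
            return torch.tanh(self.fc(h))        # second embedding vector v_B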
Fig. 3 is a flow chart of a method of training a vocoder according to an embodiment of the present disclosure. The vocoder according to the present disclosure has a better effect in synthesizing audio of a high sampling rate.
Referring to fig. 3, in step S301, a training sample set is acquired. The training sample set for training the vocoder may include a high sampling rate time domain signal, a low frequency domain feature and a high frequency domain feature, wherein the low sampling rate time domain signal is obtained by down-sampling the high sampling rate time domain signal, the low frequency domain feature is a mel-spectrum feature of the low sampling rate time domain signal, and the high frequency domain feature is a mel-spectrum feature of the high sampling rate time domain signal.
As an example, the low-sampling-rate time-domain signal and the high-sampling-rate time-domain signal may be time-frequency converted to obtain a low-frequency-domain signal and a high-frequency-domain signal, respectively. The low frequency domain features and the high frequency domain features are obtained by applying a mel filter to energy spectra of the low frequency domain signal and the high frequency domain signal.
For example, a high-sampling-rate time-domain signal x_h(t) of length T in the time domain is taken as a training sample, where t represents time and 0 < t ≤ T. First, the signal x_h(t) is down-sampled to obtain a low-sampling-rate time-domain signal x_l(t); then, a short-time Fourier transform (STFT) is performed on x_h(t) and x_l(t) separately to obtain the amplitude spectra Mag_h and Mag_l of the corresponding frequency-domain signals using the following equations (1) and (2):
Mag_h = abs(STFT(x_h))    (1)
Mag_l = abs(STFT(x_l))    (2)
From the amplitude spectra Mag_h and Mag_l, the energy spectra are taken and passed through a mel filter bank H_m(k) to obtain the mel spectra Mel_h and Mel_l. The mel filter bank H_m(k) is a set of non-linearly distributed triangular filters with center frequencies f(m), where m = 1, 2, ..., M, and each triangular filter has the form of equation (3):
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)),  for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)),  for f(m) < k ≤ f(m+1)
H_m(k) = 0,  otherwise    (3)
where f(m) denotes the center frequency of the m-th filter.
the Mel filter bank H is calculated according to the following equation (4)m(k) Logarithmic energy q (m) output by each filter in (a):
Figure BDA0003162267740000093
where K represents a frequency point subscript, | X (K) | is equivalent to the high and low frequency signal xh(t) and xl(t) magnitude spectrum of the frequency domain signal. Therefore, the input characteristics Mel of the network A and the network B can be obtained by the above calculationhAnd Nell
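A minimal sketch of this feature preparation, assuming 32 kHz / 16 kHz sampling rates, an FFT size of 1024, and 80 mel bands (all illustrative values), could look as follows.

    import numpy as np
    import librosa

    def make_features(x_h, sr_high=32000, sr_low=16000, n_fft=1024, hop=256, n_mels=80):
        # Down-sample the high-rate target x_h(t) to obtain x_l(t).
        x_l = librosa.resample(x_h, orig_sr=sr_high, target_sr=sr_low)
        # Amplitude spectra Mag_h and Mag_l, equations (1) and (2).
        mag_h = np.abs(librosa.stft(x_h, n_fft=n_fft, hop_length=hop))
        mag_l = np.abs(librosa.stft(x_l, n_fft=n_fft, hop_length=hop))
        # Triangular mel filter banks H_m(k) and log energies q(m), equations (3) and (4).
        fb_h = librosa.filters.mel(sr=sr_high, n_fft=n_fft, n_mels=n_mels)
        fb_l = librosa.filters.mel(sr=sr_low, n_fft=n_fft, n_mels=n_mels)
        mel_h = np.log(fb_h @ (mag_h ** 2) + 1e-10)   # Mel_h, input to network B
        mel_l = np.log(fb_l @ (mag_l ** 2) + 1e-10)   # Mel_l, input to network A
        return x_l, mag_h, mag_l, mel_h, mel_l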
In step S302, a low-time domain signal is obtained using a first neural network of a vocoder based on low-frequency domain features. Here, the first neural network may be the network a described above. The first neural network outputs a sampling point after each operation, thereby outputting a low time domain signal.
The following operations may be performed for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low time-domain signal based on the amplitude spectrum of the low-frequency-domain signal; obtaining a first embedding vector through the operation of a first encoder of the first neural network based on the low-frequency-domain features; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
As an example, the mel spectrum Mel_l is input into the first encoder of network A. The first encoder may be implemented by two convolutional layers and two fully-connected layers, but is not limited thereto. The first encoder may output a first embedding vector v_A of a fixed dimension according to the operation of equation (5), for use by the first decoder of network A:
v_A = En(Mel_l)    (5)
where En denotes the operation procedure of the encoder.
A low time-domain signal may be obtained with the first decoder of network A based on the first embedding vector. Here, the low time-domain signal is a synthesized low-sampling-rate time-domain signal.
For the first decoder, the input at each time is v_A, p_t, and e_{t-1}, where p_t is an estimate of the current sampling point predicted from the LPC coefficients, e_{t-1} is the difference between the sampling point s_{t-1} output by the first decoder at the previous time and the true sampling point, and s_{t-1} is the output of the first decoder at the previous time. Here, the true sampling point is taken from the low-sampling-rate time-domain signal obtained by down-sampling the high-sampling-rate time-domain signal.
In case the current sampling point is the first sampling point of the low time-domain signal, the sampling point error of the first decoder at the previous time may be set to zero. However, the above example is merely exemplary, and the initialization may be performed in various ways according to design requirements.
The estimated value p_t of the current sampling point can be predicted according to equation (6) below:
p_t = Σ_{k=1}^{K} a_k · s_{t-k}    (6)
where K represents the order of the LPC and a_k represents the LPC coefficient of each order, which can be predicted from the amplitude spectrum Mag_l. For example, assume the sampling rate of the high-sampling-rate time-domain signal is 32k and the sampling rate of the down-sampled signal is 16k; in this case K may be 16. That is, for the first decoder, the LPC coefficients are calculated from the amplitude spectrum of the down-sampled signal, and p_t is then calculated using the LPC coefficients and the previous outputs of the first decoder. The calculation of p_t may be implemented by the LPC module in network A.
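The prediction in equation (6) can be sketched as follows; estimating the LPC coefficients directly from the waveform with librosa.lpc is an assumption made for illustration, whereas the disclosure derives them from the amplitude spectrum Mag_l.

    import numpy as np
    import librosa

    def lpc_predict(prev_samples, lpc_order=16):
        # prev_samples: at least lpc_order + 1 previously synthesized samples.
        # librosa.lpc returns the prediction-error filter [1, a_1, ..., a_K],
        # so the one-step prediction is p_t = -sum_k a_k * s_{t-k}.
        a = librosa.lpc(np.asarray(prev_samples, dtype=float), order=lpc_order)
        recent = np.asarray(prev_samples[-1:-lpc_order - 1:-1])  # s_{t-1}, ..., s_{t-K}
        return float(-np.dot(a[1:], recent))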
The operation of the first decoder can be described by the following equation (7):
ŝ_t = De_A(v_A, p_t, e_{t-1})    (7)
where ŝ_t is the synthesized low-sampling-rate time-domain signal output by the first decoder at time t, and De_A denotes one run of the first decoder; each run outputs one sampling point ŝ_t of the synthesized low-sampling-rate time-domain signal.
In step S303, a high time domain signal is obtained by upsampling the low time domain signal.
As an example, the synthesized low-sampling-rate time-domain signal ŝ may be resampled to a high-sampling-rate time-domain signal ŝ_up (with sampling points ŝ_up,t) according to equation (8) below, and ŝ_up is used as an additional input to the second decoder:
ŝ_up = resample(ŝ, L, M)    (8)
where resample represents the resampling operation, L represents the current low sampling rate, and M represents the high sampling rate after resampling. Here, ŝ_up may be considered a pseudo high-sampling-rate signal. The resampling operation may be implemented by a resampling module in the vocoder, or the resampling module may be included in network A, and the present disclosure is not limited thereto. Resampling is described in detail below.
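Equation (8) can be realized, for example, with a polyphase resampler; the concrete rates below and the zero-valued stand-in signal are assumptions for illustration.

    import numpy as np
    from scipy.signal import resample_poly

    L, M = 16000, 32000                      # current low rate and target high rate (assumed)
    wav_low = np.zeros(L, dtype=np.float32)  # stand-in for the first decoder's output
    # Polyphase resampling includes the anti-imaging / anti-aliasing filtering
    # discussed later in this disclosure.
    wav_pseudo_high = resample_poly(wav_low, up=M, down=L)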
In step S304, a synthesized signal is obtained using a second neural network of the vocoder based on the high frequency domain features and the high time domain signal. The synthesized signal here is a speech signal to be finally output. According to an embodiment of the present disclosure, the second neural network may be implemented by network B described above.
The following operations may be performed for each sampling point of the final synthesized signal: calculating a second estimated value of the current sampling point of the final synthesized signal based on the amplitude spectrum of the high-sampling-rate time-domain signal in the sample set; obtaining a second embedding vector through the operation of a second encoder of the second neural network based on the high-frequency-domain features; and obtaining the current sampling point of the final synthesized signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
As an example, the mel spectrum Mel_h is input into the second encoder of network B. The second encoder according to the present disclosure may be implemented by one convolutional layer and one fully-connected layer. The second encoder may output an embedding vector v_B of a fixed dimension according to the operation of equation (9), for use by the second decoder:
v_B = En(Mel_h)    (9)
where En denotes the operation procedure of the encoder.
For the second decoder, the input at each time is the current sampling point ŝ_up,t of the high time-domain signal, p_t, sb_{t-1}, e'_{t-1}, eb_{t-1}, and v_B, where p_t is the estimate of the current sampling point predicted from the LPC coefficients; e'_{t-1} is the difference between the sampling point ŝ_up,t-1 of the high time-domain signal at the previous time and the true sampling point; eb_{t-1} is the difference between the sampling point sb_{t-1} output by the second decoder at the previous time and the true sampling point; and sb_{t-1} is the output of the second decoder at the previous time. It should be noted that, for the p_t input into the second decoder, the LPC coefficients are calculated from the amplitude spectrum of the high-sampling-rate time-domain signal in the sample set, and p_t is then calculated using the LPC coefficients and the previous outputs of the second decoder; it can be calculated similarly using equation (6).
In the case where the current sampling point is the first sampling point of the final synthesized signal, the sampling point error for the high time-domain signal at the previous time may be set to zero, the sampling point error of the second decoder at the previous time may be set to zero, and the sampling point output by the second decoder at the previous time may be set to the second estimated value. However, the above example is merely exemplary, and the initialization may be performed in various ways according to design requirements.
The operation of the second decoder can be described by the following equation (10):
sb_t = De_B(v_B, ŝ_up,t, p_t, e'_{t-1}, eb_{t-1}, sb_{t-1})    (10)
where sb_t is the synthesized high-sampling-rate time-domain signal output by the second decoder at time t, and De_B denotes one run of the second decoder, which outputs the high-sampling-rate time-domain sampling point sb_t at time t.
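The autoregressive use of the second decoder in equation (10) can be sketched as follows; decoder_b stands in for network B's GRU / dual-FC / softmax / sampling stack, lpc_predictor for the LPC module, and the error handling is simplified (the errors are held at zero, as is done for a trained vocoder, see below); all names are placeholders.

    def run_second_decoder(decoder_b, lpc_predictor, v_b, wav_pseudo_high):
        # v_b: per-sample (frame-repeated) conditioning vectors;
        # wav_pseudo_high: pseudo high-sampling-rate signal from equation (8).
        out = []
        err_up_prev, err_b_prev = 0.0, 0.0   # previous-time errors, held at zero here
        sb_prev = None
        for t, s_up_t in enumerate(wav_pseudo_high):
            p_t = lpc_predictor(out)          # estimate of the current sampling point
            if sb_prev is None:
                sb_prev = p_t                 # first sample: previous output <- estimate
            sb_t = decoder_b(v_b[t], s_up_t, p_t, err_up_prev, err_b_prev, sb_prev)
            out.append(sb_t)
            sb_prev = sb_t
        return out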
In step S305, a loss function is constructed using the low time-domain signal, the low-sampling-rate time-domain signal in the sample set, the synthesized signal, and the high-sampling-rate time-domain signal in the sample set. Specifically, a first cross-entropy loss function is constructed using the low time-domain signal and the down-sampled low-sampling-rate time-domain signal, a second cross-entropy loss function is constructed using the final synthesized signal and the high-sampling-rate time-domain signal, and the loss function is constructed from the first cross-entropy loss function and the second cross-entropy loss function.
As an example, the first loss function is constructed using the output of the first decoder and the signal obtained by down-sampling the high-sampling-rate time-domain signal (i.e., the low-sampling-rate time-domain signal), and the second loss function is constructed using the output of the second decoder and the high-sampling-rate time-domain signal. For example, the loss function may be constructed according to equation (11) below:
Loss = CrossEntropy(ŝ_t, s_l,t) + CrossEntropy(sb_t, s_h,t)    (11)
where CrossEntropy is the cross-entropy loss function, ŝ_t is the output of the first decoder at each time, sb_t is the output of the second decoder at each time, s_l,t is the low-sampling-rate true signal at each time (the signal obtained by down-sampling the high-sampling-rate training signal), and s_h,t is the high-sampling-rate true signal (the high-sampling-rate training signal) at each time.
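Equation (11) can be sketched in PyTorch as two cross-entropy terms; treating each decoder output as 256-way logits over mu-law-quantized sample values is an assumption borrowed from LPCNet-style vocoders rather than a detail stated in this disclosure.

    import torch.nn.functional as F

    def vocoder_loss(logits_low, target_low_idx, logits_high, target_high_idx):
        # logits_*: (time, 256) class scores output by the decoders;
        # target_*_idx: (time,) indices of the quantized true sampling points.
        loss_a = F.cross_entropy(logits_low, target_low_idx)    # first decoder vs. down-sampled signal
        loss_b = F.cross_entropy(logits_high, target_high_idx)  # second decoder vs. high-rate signal
        return loss_a + loss_b

The parameters of both encoders and both decoders can then be updated by back-propagating this combined loss, as described in step S306.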
In step S306, parameters of the vocoder are trained based on the loss calculated by the loss function. For example, parameters of the vocoder are trained by minimizing the loss calculated by the first and second loss functions, i.e., the network parameters of the first encoder, the second encoder, the first decoder, and the second decoder in the vocoder are trained.
The following explains the sampling rate and resampling.
The sampling rate indicates how many sampling points are used to describe a signal within a certain time period. The basic idea of sampling-rate conversion is decimation and interpolation, and, from a signal-processing perspective, audio resampling is filtering: once the window size of the filter function and the interpolation function are determined, the resampling performance is determined. Decimation may cause spectral aliasing, and interpolation may produce image components. Usually, an anti-aliasing filter is added before decimation and an anti-image filter is added after interpolation; as shown in fig. 5, h(n) represents the anti-aliasing filter and g(n) represents the anti-image filter.
Assuming that the original sampling rate of the audio signal is L, the new sampling rate is M, and the original signal length is N, the signal length K at the new sampling rate satisfies the following relationship (12):
K = N · M / L    (12)
for each discrete time value: k (K is more than or equal to 1 and less than or equal to K), the actual value nkIs expressed by equation (13):
Figure BDA0003162267740000126
nkthe position to be interpolated or decimated in the case of the original sampling interval.
Under ideal conditions, the frequency response of the filter h_D(n) is shown in equation (14):
H_D(e^{jω}) = D, for |ω| ≤ π/D
H_D(e^{jω}) = 0, otherwise    (14)
where D is the decimation or interpolation factor, i.e., D = L/M.
The signal after the filter output can be represented by equation (15):
y(k) = Σ_n x(n) · h_D(n_k - n)    (15)
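The interpolation described by equations (12) to (15) can be sketched from scratch as follows; the Hann-windowed sinc kernel and the tap count are stand-ins for the ideal filter h_D(n) and are assumptions for illustration.

    import numpy as np

    def resample_windowed_sinc(x, L, M, taps=32):
        # L: original sampling rate; M: new sampling rate; taps: one-sided kernel width.
        D = max(L / M, 1.0)                       # cutoff control for decimation
        K = int(len(x) * M / L)                   # new signal length, equation (12)
        y = np.zeros(K)
        for k in range(K):
            n_k = k * L / M                       # fractional position, equation (13)
            n0 = int(np.floor(n_k))
            for n in range(n0 - taps, n0 + taps + 1):
                if 0 <= n < len(x):
                    # Windowed version of the ideal low-pass kernel with cutoff pi/D.
                    w = 0.5 + 0.5 * np.cos(np.pi * (n_k - n) / (taps + 1))
                    y[k] += x[n] * np.sinc((n_k - n) / D) / D * w
        return y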
according to the embodiment of the disclosure, a network B with smaller parameters can be added on the basis of the existing LPCNet, so that the new vocoder synthesizes a voice signal with a high sampling rate while maintaining lower operation complexity.
Fig. 4 is a schematic diagram of a vocoder according to an embodiment of the present disclosure. Fig. 4 mainly shows and describes network B of the vocoder.
Referring to fig. 4, a vocoder according to the present disclosure may include a network A (such as the LPCNet A in fig. 4) and a network B (such as the portions other than LPCNet A in fig. 4). The structure of network B is similar to that of network A, but network B uses fewer parameters. The present disclosure adds network B on the basis of the existing LPCNet (i.e., LPCNet A in fig. 4), so that the vocoder of the present disclosure synthesizes a speech signal with a high sampling rate while maintaining low computational complexity.
The network a is used for processing the low-frequency mel-frequency spectrum characteristics to obtain a synthesized low-frequency signal, and obtaining a pseudo-synthesized high-frequency signal by up-sampling the synthesized low-frequency signal. Here, for the pseudo synthesized high frequency signal, it may be obtained by the network a, or may be obtained by other modules.
Network B is used to process high frequency mel-frequency spectral features (such as the features in fig. 4) to obtain a composite high frequency signal.
Network B may include an LPC module (for computing the LPC coefficients and the estimate of the current sampling point), a second encoder 401, a second decoder 402, and other modules (such as an up-sampling module). Compared with network A, the encoder portion 401 has one fewer convolutional layer and one fewer fully-connected layer, the decoder portion 402 has only one GRU, and the decoder input additionally receives the current sampling point ŝ_up,t of the pseudo high-sampling-rate signal and the corresponding sampling point error e'_{t-1} at the previous time. Therefore, network B has a simpler structure and a smaller number of parameters.
For a vocoder that has been trained, the sample point error of the decoder in network a at the previous time and the sample point error of the decoder in network B at the previous time may be set to zero.
The network B shown in fig. 4 is merely exemplary, and the present disclosure is not limited thereto.
In summary, a network B is added on the basis of the original LPCNet A: a low-sampling-rate signal is synthesized from the low-frequency-domain features using the original LPCNet A, the output of the decoder of LPCNet A is up-sampled to a pseudo high-sampling-rate signal, the high-frequency-domain features and the pseudo high-sampling-rate signal are connected in parallel and then serve as the input of the newly added network B, and the decoder of network B outputs a high-sampling-rate synthesized signal.
Fig. 6 is a flow diagram of a method of speech processing according to an embodiment of the present disclosure. Text-to-speech (TTS) conversion is mainly divided into two parts: predicting the frequency-domain mel spectrum of the input text, and converting the mel spectrum into time-domain sampling points; the vocoder is mainly used to convert the mel spectrum into the time-domain sampling points. The speech processing method shown in fig. 6 is mainly applied to convert frequency-domain features converted from text into a speech signal.
Referring to fig. 6, in step S601, low-frequency domain features and high-frequency domain features predicted from text are obtained, wherein the low-frequency domain features are low-sampling-rate mel spectral features corresponding to the text, and the high-frequency domain features are high-sampling-rate mel spectral features corresponding to the text. Here, the low-sampling-rate mel-frequency spectrum feature and the high-sampling-rate mel-frequency spectrum feature may be obtained for the same text. For example, a high-sampling-rate mel-frequency spectrum feature is obtained by performing mel-frequency spectrum prediction on input characters, and then a low-sampling-rate mel-frequency spectrum feature is obtained by down-sampling the high-sampling-rate mel-frequency spectrum feature.
In step S602, a low time-domain signal is obtained using the first neural network of the vocoder based on the low-frequency-domain features. The following is performed for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low time-domain signal based on the amplitude spectrum corresponding to the low-sampling-rate mel spectral features; obtaining a first embedding vector through the operation of the first encoder of the first neural network based on the low-frequency-domain features; and obtaining the current sampling point of the low time-domain signal through the operation of the first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time. In speech processing, the sampling point error may be set to zero. For example, equation (6) above may be used to calculate the first estimated value, and equation (7) may be used to obtain the current sampling point of the low time-domain signal.
In step S603, a high time domain signal is obtained by upsampling the low time domain signal. Resampling may be performed using equation (8) above.
In step S604, a synthesized signal corresponding to the input text is obtained using a second neural network of the vocoder based on the high frequency domain features and the high time domain signal.
The following is performed for each sampling point of the final synthesized signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum corresponding to the high-sampling-rate mel spectral features; obtaining a second embedding vector through the operation of the second encoder of the second neural network based on the high-frequency-domain features; and obtaining the current sampling point of the synthesized signal through the operation of the second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time. Here, in speech processing, the sampling point errors may be set to zero.
In the case where the current sample point is the first sample point of the synthesized signal, the sample point output by the second decoder at the previous time may be set to the second estimation value. For example, the second estimation value may be calculated using equation (6), and the final speech signal may be obtained using equation (10).
Fig. 7 is a block diagram of a training apparatus of a vocoder according to an embodiment of the present disclosure.
Referring to fig. 7, training apparatus 700 may include an acquisition module 701 and a training module 702. Each module in the training apparatus 700 may be implemented by one or more modules, and the name of the corresponding module may vary according to the type of the module. In various embodiments, some modules in training device 700 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the respective modules/elements prior to combination.
The obtaining module 701 may obtain a sample set, where the sample set includes a high sampling rate time domain signal, a low frequency domain feature, and a high frequency domain feature, where the low sampling rate time domain signal is obtained by down-sampling the high sampling rate time domain signal, the low frequency domain feature is a mel-frequency spectrum feature of the low sampling rate time domain signal, and the high frequency domain feature is a mel-frequency spectrum feature of the high sampling rate time domain signal. Here, the acquisition module 701 may directly acquire the sample set from the outside. Alternatively, the obtaining module 701 may obtain the high-sampling-rate time-domain signal from the outside, then down-sample the high-sampling-rate time-domain signal to obtain the low-sampling-rate time-domain signal, and perform time-frequency transformation and filtering processing (such as through a mel filter bank) on the high-sampling-rate time-domain signal and the low-sampling-rate time-domain signal, respectively, so as to obtain the low-frequency-domain feature and the high-frequency-domain feature.
Training module 702 may obtain a low time-domain signal using a first neural network of a vocoder based on low frequency-domain features and obtain a high time-domain signal by upsampling the low time-domain signal, obtain a synthesized signal using a second neural network of the vocoder based on high frequency-domain features and the high time-domain signal, construct a loss function using the low time-domain signal, the low sample rate time-domain signal, the synthesized signal, and the high sample rate time-domain signal, train network parameters of the vocoder by minimizing a loss calculated from the loss function.
Alternatively, training module 702 may perform the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low-sampling-rate time-domain signal based on the amplitude spectrum of the low-sampling-rate time-domain signal; obtaining a first embedding vector through the operation of the first encoder of the first neural network based on the low-frequency-domain features; and obtaining the current sampling point of the low time-domain signal through the operation of the first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time.
Alternatively, the training module 702 may perform the following for each sampling point of the final synthesized signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the amplitude spectrum of the high-sampling-rate time-domain signal; obtaining a second embedding vector through the operation of the second encoder of the second neural network based on the high-frequency-domain features; and obtaining the current sampling point of the synthesized signal through the operation of the second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error for the high time-domain signal at the previous time, the sampling point error of the second decoder at the previous time, and the sampling point output by the second decoder at the previous time.
Optionally, the training module 702 may construct a first cross-entropy loss function using the low time-domain signal and the low-sampling-rate time-domain signal, construct a second cross-entropy loss function using the synthesized signal and the high-sampling-rate time-domain signal, and form the loss function from the first cross-entropy loss function and the second cross-entropy loss function.
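One possible realization of this loss, assuming the two networks emit 256-way categorical distributions over mu-law-quantized sample values (an LPCNet-style convention that the present disclosure does not mandate), is sketched below.

```python
# Hypothetical cross-entropy loss over mu-law-quantized samples.
import torch
import torch.nn.functional as F

def mu_law_quantize(x, bits=8):
    mu = 2 ** bits - 1
    y = torch.sign(x) * torch.log1p(mu * torch.abs(x)) / torch.log1p(torch.tensor(float(mu)))
    return ((y + 1.0) / 2.0 * mu).long().clamp(0, mu)

def vocoder_loss(low_logits, y_low, synth_logits, y_high):
    # First cross-entropy: low time-domain signal vs. low-sampling-rate target.
    loss_low = F.cross_entropy(low_logits.transpose(1, 2), mu_law_quantize(y_low))
    # Second cross-entropy: synthesized signal vs. high-sampling-rate target.
    loss_high = F.cross_entropy(synth_logits.transpose(1, 2), mu_law_quantize(y_high))
    return loss_low + loss_high
```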
Fig. 8 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure.
Referring to fig. 8, the speech processing apparatus 800 may include an acquisition module 801 and a processing module 802. Each module in the speech processing apparatus 800 may be implemented by one or more sub-modules, and the name of each module may vary according to its type. In various embodiments, some modules of the speech processing apparatus 800 may be omitted, or additional modules may be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined into a single entity that equivalently performs the functions of the respective modules/elements prior to combination.
The acquisition module 801 may obtain low-frequency-domain features and high-frequency-domain features predicted from text, where the low-frequency-domain features are low-sampling-rate mel-spectrum features corresponding to the text and the high-frequency-domain features are high-sampling-rate mel-spectrum features corresponding to the text. The mel-spectrum features acquired by the acquisition module 801 may be obtained from the outside.
The processing module 802 may obtain a low time-domain signal using a first neural network of the vocoder based on the low-frequency-domain features, obtain a high time-domain signal by up-sampling the low time-domain signal, and obtain a synthesized signal corresponding to the text using a second neural network of the vocoder based on the high-frequency-domain features and the high time-domain signal.
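A compact sketch of this inference path is shown below; it assumes the two trained sub-networks are exposed as plain callables, uses scipy's polyphase resampler for the up-sampling step, and the factor of 2 (e.g. 8 kHz to 16 kHz) is only an example.

```python
# Schematic inference pipeline: mel features -> low waveform -> up-sampled -> speech.
from scipy.signal import resample_poly

def synthesize(mel_low, mel_high, first_net, second_net, up=2, down=1):
    low_wave = first_net(mel_low)                  # low time-domain signal
    high_wave = resample_poly(low_wave, up, down)  # high time-domain signal (up-sampled)
    return second_net(mel_high, high_wave)         # synthesized signal for the text
```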
Alternatively, the processing module 802 may perform the following for each sampling point of the low time-domain signal: calculating a first estimated value of the current sampling point of the low time-domain signal based on the magnitude spectrum corresponding to the low-sampling-rate mel-spectrum feature; obtaining a first embedding vector based on the low-frequency-domain feature through the operation of a first encoder of the first neural network; and obtaining the current sampling point of the low time-domain signal through the operation of a first decoder of the first neural network based on the first embedding vector, the first estimated value, and the sampling point error of the first decoder at the previous time instant. Here, the sampling point error of the first decoder is used during the training phase of the vocoder; therefore, during speech processing it may be set to zero, although the present disclosure is not limited thereto.
Alternatively, the processing module 802 may perform the following for each sampling point of the final synthesized signal: calculating a second estimated value of the current sampling point of the synthesized signal based on the magnitude spectrum corresponding to the high-sampling-rate mel-spectrum feature; obtaining a second embedding vector based on the high-frequency-domain feature through the operation of a second encoder of the second neural network; and obtaining the current sampling point of the synthesized signal through the operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time-domain signal, the second estimated value, the sampling point error of the high time-domain signal at the previous time instant, the sampling point error of the second decoder at the previous time instant, and the sampling point output by the second decoder at the previous time instant. Here, the sampling point errors may likewise be set to zero, and the present disclosure is not limited thereto.
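For completeness, the same assumed decoder step at inference time, with the error feedback terms simply set to zero as noted above (encoder, lpc_predict, and decoder_step remain hypothetical placeholders):

```python
# Inference-time variant of the second-stage loop: error inputs are zeroed.
import numpy as np

def generate_second_stage_inference(mel_high, high_wave, lpc_predict, encoder, decoder_step):
    cond = encoder(mel_high)
    synth = np.zeros(len(high_wave), dtype=np.float32)
    prev_out = 0.0
    for t in range(len(high_wave)):
        estimate = lpc_predict(synth, t)
        out = decoder_step(cond, high_wave[t], estimate, 0.0, 0.0, prev_out)
        synth[t] = out
        prev_out = out
    return synth
```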
Fig. 9 is a schematic structural diagram of a speech processing device in a hardware operating environment according to an embodiment of the present disclosure.
As shown in fig. 9, the speech processing device 900 may include: a processing component 901, a communication bus 902, a network interface 903, an input/output interface 904, a memory 905, and a power component 906. The communication bus 902 is used to enable connection and communication between these components. The input/output interface 904 may include a video display (such as a liquid crystal display), a microphone and speakers, and a user interaction interface (such as a keyboard, mouse, or touch input device); optionally, the input/output interface 904 may also include a standard wired interface and a wireless interface. The network interface 903 may optionally include a standard wired interface and a wireless interface (e.g., a wireless fidelity interface). The memory 905 may be a high-speed random access memory or a stable non-volatile memory. The memory 905 may optionally be a storage device separate from the processing component 901.
Those skilled in the art will appreciate that the architecture shown in fig. 9 does not limit the speech processing device 900, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in fig. 9, the memory 905, which is one type of storage medium, may include an operating system (such as a MAC operating system), a data storage module, a network communication module, a user interface module, a speech processing program, a model training program, and a database.
In the speech processing device 900 shown in fig. 9, the network interface 903 is mainly used for data communication with an external electronic device/terminal; the input/output interface 904 is mainly used for data interaction with a user; and the processing component 901 and the memory 905 may be provided in the speech processing device 900, which executes the speech processing method or the vocoder training method provided by the embodiments of the present disclosure by having the processing component 901 call the speech processing program or the model training program stored in the memory 905, together with the various APIs provided by the operating system.
The processing component 901 may include at least one processor, with the memory 905 having stored therein a set of computer-executable instructions that, when executed by the at least one processor, perform a method of speech processing or a method of vocoder training in accordance with embodiments of the present disclosure. Further, the processing component 901 may perform encoding operations and decoding operations, and the like. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
The processing component 901 may be used to train the vocoder of the present disclosure. For example, the processing component 901 may obtain a sample set from the outside, where the sample set includes a high-sampling-rate time-domain signal, a low-sampling-rate time-domain signal, a low-frequency-domain feature, and a high-frequency-domain feature, the low-sampling-rate time-domain signal being obtained by down-sampling the high-sampling-rate time-domain signal, the low-frequency-domain feature being a mel-spectrum feature of the low-sampling-rate time-domain signal, and the high-frequency-domain feature being a mel-spectrum feature of the high-sampling-rate time-domain signal. The processing component 901 may then obtain a low time-domain signal using the first neural network of the vocoder based on the low-frequency-domain feature, obtain a high time-domain signal by up-sampling the low time-domain signal, obtain a synthesized signal using the second neural network of the vocoder based on the high-frequency-domain feature and the high time-domain signal, construct a loss function using the low time-domain signal, the low-sampling-rate time-domain signal, the synthesized signal, and the high-sampling-rate time-domain signal, and train the parameters of the vocoder by minimizing the loss calculated from the loss function.
As another example, the processing component 901 may serve as the vocoder of the present disclosure to convert text into a speech signal. For example, the processing component 901 may obtain, from the outside, low-frequency-domain features and high-frequency-domain features predicted from the text, where the low-frequency-domain features are low-sampling-rate mel-spectrum features corresponding to the text and the high-frequency-domain features are high-sampling-rate mel-spectrum features corresponding to the text; obtain a low time-domain signal using the first neural network of the vocoder based on the low-frequency-domain features; obtain a high time-domain signal by up-sampling the low time-domain signal; and obtain a synthesized signal corresponding to the text using the second neural network of the vocoder based on the high-frequency-domain features and the high time-domain signal. Deep neural network implementations other than LPCNet may also be employed for the first and second neural networks.
The processing component 901 may control the components included in the speech processing device 900 by executing programs. The input/output interface 904 may output the final synthesized speech signal.
The speech processing device 900 may receive or output video and/or audio via the input-output interface 904. For example, the speech processing apparatus 900 may output the synthesized speech signal via the input-output interface 904.
By way of example, the speech processing device 900 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The speech processing device 900 need not be a single electronic device, but may be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or jointly. The speech processing device 900 may also be part of an integrated control system or system manager, or may be a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the speech processing device 900, the processing component 901 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example and not limitation, processing component 901 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processing component 901 may execute instructions or code stored in the memory 905, which may also store data. Instructions and data may also be sent and received over a network via the network interface 903, which may employ any known transmission protocol.
The memory 905 may be integrated with the processing component 901, for example, by arranging RAM or flash memory within an integrated circuit microprocessor or the like. Furthermore, the memory 905 may comprise a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 905 and the processing component 901 may be operatively coupled, or may communicate with each other, e.g., through I/O ports or network connections, so that the processing component 901 can read data stored in the memory 905.
According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 10 is a block diagram of an electronic device according to an embodiment of the disclosure, the electronic device 1000 may include at least one memory 1002 and at least one processor 1001, the at least one memory 1002 storing a set of computer-executable instructions, the set of computer-executable instructions, when executed by the at least one processor 1001, performing a method of speech processing or a method of vocoder training according to an embodiment of the disclosure.
The processor 1001 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 1001 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, or the like.
The memory 1002, which is one type of storage medium, may include an operating system (e.g., a MAC operating system), a data storage module, a network communication module, a user interface module, a speech processing program, a model training program, and a database.
The memory 1002 may be integrated with the processor 1001, for example, RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. Further, memory 1002 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 1002 and the processor 1001 may be operatively coupled, or may communicate with each other, such as through I/O ports, network connections, etc., so that the processor 1001 can read files stored in the memory 1002.
In addition, the electronic device 1000 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 1000 may be connected to each other via a bus and/or a network.
Those skilled in the art will appreciate that the configuration shown in fig. 10 is not limiting, and that the electronic device 1000 may include more or fewer components than shown, combine certain components, or arrange the components differently.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the speech processing method or the model training method according to the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed in computer equipment such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system so that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there may also be provided a computer program product, in which instructions are executable by a processor of a computer device to perform the above-mentioned speech processing method or model training method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A speech processing method, characterized in that the speech processing method comprises:
downsampling the high sampling rate Mel spectral features to obtain low sampling rate Mel spectral features;
obtaining a low time domain signal using a first neural network of a vocoder based on low sample rate mel-frequency spectral features;
obtaining a high time domain signal by up-sampling the low time domain signal;
obtaining, using a second neural network of the vocoder, a speech signal corresponding to the high sampling rate Mel spectral features based on the high sampling rate Mel spectral features and the high time domain signal.
2. The speech processing method of claim 1 wherein the step of obtaining the low time domain signal using the first neural network of the vocoder based on the low sample rate mel spectral features comprises:
performing the following for each sample point of the low time domain signal:
calculating a first estimation value of a current sampling point of the low time domain signal based on an amplitude spectrum corresponding to the low sampling rate Mel spectrum characteristic;
obtaining a first embedded vector via operation of a first encoder of a first neural network based on low-sample-rate mel-frequency spectral features;
and obtaining a current sampling point of the low time domain signal through an operation of a first decoder of the first neural network based on the first embedding vector, the first estimation value, and a sampling point error of the first decoder at a previous time instant.
3. The speech processing method of claim 1 wherein the step of utilizing a second neural network of the vocoder to obtain the speech signal corresponding to the high-sample-rate mel spectral features based on the high-sample-rate mel spectral features and the high-time-domain signal comprises:
performing the following for each sample point of the speech signal:
calculating a second estimation value of the current sampling point of the voice signal based on the amplitude spectrum corresponding to the high sampling rate Mel spectrum characteristic;
obtaining a second embedding vector via operation of a second encoder of a second neural network based on the high-sampling-rate mel-frequency spectral features;
and obtaining the current sampling point of the speech signal through an operation of a second decoder of the second neural network based on the second embedding vector, the current sampling point of the high time domain signal, the second estimation value, a sampling point error of the high time domain signal at a previous time instant, a sampling point error of the second decoder at the previous time instant, and a sampling point output by the second decoder at the previous time instant.
4. The speech processing method of claim 1, wherein the high sampling rate Mel spectral features are obtained by performing Mel spectrum prediction on input text.
5. A method of training a vocoder, the method comprising:
acquiring a sample set, wherein the sample set comprises a high-sampling-rate time domain signal, a low-sampling-rate time domain signal, a low-frequency domain feature, and a high-frequency domain feature, wherein the low-sampling-rate time domain signal is obtained by down-sampling the high-sampling-rate time domain signal, the low-frequency domain feature is a mel-frequency spectrum feature of the low-sampling-rate time domain signal, and the high-frequency domain feature is a mel-frequency spectrum feature of the high-sampling-rate time domain signal;
obtaining a low time domain signal using a first neural network of a vocoder based on low frequency domain features;
obtaining a high time domain signal by up-sampling the low time domain signal;
obtaining a synthesized signal using a second neural network of the vocoder based on the high frequency domain features and the high time domain signal;
constructing a loss function using the low time domain signal, the low-sampling-rate time domain signal, the synthesized signal, and the high-sampling-rate time domain signal;
training parameters of the vocoder based on a loss calculated by the loss function.
6. Training method according to claim 5, wherein the step of constructing a loss function using the low time-domain signal, the low sample rate time-domain signal, the synthesis signal and the high sample rate time-domain signal comprises:
constructing a first cross entropy loss function using the low time domain signal and the low-sampling-rate time domain signal;
constructing a second cross entropy loss function using the synthesized signal and the high-sampling-rate time domain signal; and
constructing the loss function from the first cross entropy loss function and the second cross entropy loss function.
7. A speech processing apparatus, characterized in that the speech processing apparatus comprises:
an acquisition module configured to downsample the high sampling rate mel-frequency spectrum features to acquire low sampling rate mel-frequency spectrum features; and
a processing module configured to:
obtaining a low time domain signal using a first neural network of a vocoder based on low sample rate mel-frequency spectral features;
obtaining a high time domain signal by up-sampling the low time domain signal;
obtaining, using a second neural network of the vocoder, a speech signal corresponding to the high sampling rate Mel spectral features based on the high sampling rate Mel spectral features and the high time domain signal.
8. An apparatus for training a vocoder, the apparatus comprising:
an obtaining module configured to obtain a sample set, wherein the sample set includes a high-sampling-rate time-domain signal, a low-sampling-rate time-domain signal, a low-frequency domain feature, and a high-frequency domain feature, wherein the low-sampling-rate time-domain signal is obtained by down-sampling the high-sampling-rate time-domain signal, the low-frequency domain feature is a mel-frequency spectrum feature of the low-sampling-rate time-domain signal, and the high-frequency domain feature is a mel-frequency spectrum feature of the high-sampling-rate time-domain signal; and
a training module configured to:
obtaining a low time domain signal using a first neural network of a vocoder based on low frequency domain features;
obtaining a high time domain signal by up-sampling the low time domain signal;
obtaining a synthesized signal using a second neural network of the vocoder based on the high frequency domain features and the high time domain signal;
constructing a loss function using the low time domain signal, the low-sampling-rate time-domain signal, the synthesized signal, and the high-sampling-rate time-domain signal;
training parameters of the vocoder based on a loss calculated by the loss function.
9. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the speech processing method of any one of claims 1 to 4 or the training method of any one of claims 5 to 6.
10. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a speech processing method as claimed in any one of claims 1 to 4 or a training method as claimed in any one of claims 5 to 6.
CN202110794822.1A 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder Active CN113470616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110794822.1A CN113470616B (en) 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110794822.1A CN113470616B (en) 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder

Publications (2)

Publication Number Publication Date
CN113470616A true CN113470616A (en) 2021-10-01
CN113470616B CN113470616B (en) 2024-02-23

Family

ID=77880157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110794822.1A Active CN113470616B (en) 2021-07-14 2021-07-14 Speech processing method and device, vocoder and training method of vocoder

Country Status (1)

Country Link
CN (1) CN113470616B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07334194A (en) * 1994-06-14 1995-12-22 Matsushita Electric Ind Co Ltd Method and device for encoding/decoding voice
KR20020084765A (en) * 2001-05-03 2002-11-11 (주)디지텍 Method for synthesizing voice
WO2020191271A1 (en) * 2019-03-20 2020-09-24 Research Foundation Of The City University Of New York Method for extracting speech from degraded signals by predicting the inputs to a speech vocoder
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium
US20210193113A1 (en) * 2019-12-23 2021-06-24 Ubtech Robotics Corp Ltd Speech synthesis method and apparatus and computer readable storage medium using the same
CN111316352A (en) * 2019-12-24 2020-06-19 深圳市优必选科技股份有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN111627418A (en) * 2020-05-27 2020-09-04 携程计算机技术(上海)有限公司 Training method, synthesizing method, system, device and medium for speech synthesis model
CN112599141A (en) * 2020-11-26 2021-04-02 北京百度网讯科技有限公司 Neural network vocoder training method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Xiao; ZHANG Wei; WANG Wenhao; WAN Yongjing: "Voice conversion algorithm based on multi-spectral-feature generative adversarial networks", Computer Engineering and Science, no. 05 *
ZHANG Yizhi: "Research on Speech Synthesis Algorithms Based on Deep Neural Networks", China Master's Theses Full-text Database, Information Science and Technology Series, no. 05 *

Also Published As

Publication number Publication date
CN113470616B (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN112289333B (en) Training method and device of voice enhancement model and voice enhancement method and device
Kuleshov et al. Audio super resolution using neural networks
KR101378696B1 (en) Determining an upperband signal from a narrowband signal
TW202111692A (en) Artificial intelligence based audio coding
CN109147805B (en) Audio tone enhancement based on deep learning
CN112927707A (en) Training method and device of voice enhancement model and voice enhancement method and device
RU2677453C2 (en) Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
US20230008547A1 (en) Audio frame loss concealment
CN112712812A (en) Audio signal generation method, device, equipment and storage medium
JPH10307599A (en) Waveform interpolating voice coding using spline
CN110556121B (en) Band expansion method, device, electronic equipment and computer readable storage medium
US20210366461A1 (en) Generating speech signals using both neural network-based vocoding and generative adversarial training
JP2013057735A (en) Hidden markov model learning device for voice synthesis and voice synthesizer
JP2019045856A (en) Audio data learning device, audio data inference device, and program
Hao et al. Time-domain neural network approach for speech bandwidth extension
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus
KR20200123395A (en) Method and apparatus for processing audio data
JP2009223210A (en) Signal band spreading device and signal band spreading method
CN112309425B (en) Sound tone changing method, electronic equipment and computer readable storage medium
CN113362837A (en) Audio signal processing method, device and storage medium
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
CN116705056A (en) Audio generation method, vocoder, electronic device and storage medium
US20190066657A1 (en) Audio data learning method, audio data inference method and recording medium
CN113470616B (en) Speech processing method and device, vocoder and training method of vocoder
JPH09127985A (en) Signal coding method and device therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant