CN117423348A - Speech compression method and system based on deep learning and vector prediction


Info

Publication number
CN117423348A
CN117423348A
Authority
CN
China
Prior art keywords
vector
prediction
difference
voice
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311743425.7A
Other languages
Chinese (zh)
Other versions
CN117423348B (en)
Inventor
李晔
于兴业
吝灵霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202311743425.7A
Publication of CN117423348A
Application granted
Publication of CN117423348B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The disclosure provides a speech compression method and system based on deep learning and vector prediction, relating to the technical field of speech signal processing. The method comprises the following steps: acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences; extracting acoustic features by taking the current-frame speech sequence as the input signal of a first deep network, and predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector; subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to a second deep network; the second deep network looks up the corresponding difference quantization vector in the codebook according to the received residual index, adds it to the prediction vector to obtain a reconstruction vector, and decodes the reconstruction vector to output synthesized speech.

Description

Speech compression method and system based on deep learning and vector prediction
Technical Field
The disclosure relates to the technical field of speech signal processing, in particular to a speech compression method and system based on deep learning and vector prediction.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Low-rate speech coding has wide application in satellite communication, short-wave communication, underwater acoustic communication, secure communication and other fields. For example, in extremely harsh mountainous communication environments, an ultra-short-wave radio station must guarantee around-the-clock, 24-hour communication, and the coding rate of the speech coder is often below 600 bps. As the coding rate decreases, speech synthesis quality suffers, so the extraction of acoustic features and the allocation of bits become particularly important. This is especially true for ultra-low-rate speech compression coding based on deep learning, realized by what are known as neural vocoders.
The basic steps of current neural vocoders are: at the encoding end, extract acoustic features from the input signal samples; at the quantization end, quantize the extracted features and pack them into binary bytes for transmission; at the dequantization end, unpack the received data packets and restore the acoustic features according to the codebook; finally, at the decoding end, synthesize speech from the restored acoustic features to recover the input speech signal. Neural vocoders mainly quantize with scalar quantization or residual vector quantization, but these schemes still have the following defects:
1) For scenarios with large amounts of data, scalar quantization quantizes each dimension independently into a single scalar, which easily leads to information loss; scalar quantization is also very sensitive to noisy signals.
2) Residual vector quantization reduces quantization loss to some extent relative to scalar quantization, but its use of multiple quantizers adds extra computational and storage overhead.
Moreover, both schemes quantize each vector independently, without depending on other quantization results, in other words on past or future states of the encoder or decoder, so the correlation between the data cannot be exploited.
Disclosure of Invention
In order to solve the above problems, the present disclosure proposes a speech compression method and system based on deep learning and vector prediction, which improves speech coding quality by introducing predictive vector quantization on top of deep learning and vector-quantizing the difference between the input vector and the prediction vector.
According to some embodiments, the present disclosure employs the following technical solutions:
a voice compression method based on deep learning and vector prediction comprises the following steps:
acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
extracting acoustic features by taking the current-frame speech sequence as the input signal of a first deep network, and predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to a second deep network; and the second deep network finds the corresponding difference quantization vector in the codebook according to the received residual index, adds it to the prediction vector to obtain a reconstruction vector, decodes the reconstruction vector to output synthesized speech, and judges the authenticity of the generated speech through a discriminator.
According to some embodiments, the present disclosure employs the following technical solutions:
a speech compression system based on deep learning and vector prediction, comprising:
the data acquisition module is used for acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
the prediction module is used for extracting acoustic features by taking the current-frame speech sequence as the input signal of the first deep network, and for predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
the vector quantization module is used for subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to the second deep network, which looks up the corresponding difference quantization vector in the codebook according to the received residual index and adds it to the prediction vector to obtain a reconstruction vector;
and the speech synthesis module is used for decoding the reconstruction vector to output synthesized speech and judging the authenticity of the generated speech through a discriminator.
Compared with the prior art, the beneficial effects of the present disclosure are:
the present disclosure provides a speech compression method based on deep learning and vector prediction, which introduces a prediction vector quantization technique into a deep-learning low-rate neural vocoder, reduces quantization loss by using a predictor, improves time correlation between vectors, predicts a next frame vector by training a predictor with a past reconstructed vector as an input, and quantizes a difference value between the predicted vector and the input vector by inputting the difference value into a codebook to obtain a quantization index for transmission. The quantization index is received at the decoding end, a quantization vector is obtained from the quantization index, and then the quantization vector is added to the predictor output to obtain a reconstructed vector of the input vector. The method reduces quantization error and improves the coding synthesis quality of voice by utilizing the time correlation of data through the predictor.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flow chart of a speech synthesis method based on predictive vector quantization in accordance with an embodiment of the present disclosure;
FIG. 2 is a flow chart of prediction vector quantization according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of the encoding and decoding structure of a neural vocoder according to an embodiment of the present disclosure.
Detailed Description
The disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
In one embodiment of the present disclosure, a speech compression method based on deep learning and vector prediction is provided, including:
step one: acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
step two: extracting acoustic features by taking the current-frame speech sequence as the input signal of a first deep network, and predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
step three: subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to a second deep network; the second deep network finds the corresponding difference quantization vector in the codebook according to the received residual index and adds it to the prediction vector to obtain a reconstruction vector;
step four: decoding the reconstruction vector to output synthesized speech, and judging the authenticity of the generated speech through a discriminator.
As an embodiment, the disclosed speech compression method based on deep learning and vector prediction adopts a high-quality, low-rate neural vocoder based on predictive vector quantization, further improving the quality of speech synthesized at a low rate. First, features are extracted from the input speech signal by the first deep network, and the resulting parameters are quantized and coded with predictive vector quantization; the coded indices are packed into a binary byte stream for transmission. At the decoding end, the transmitted bytes are unpacked, the decoder retrieves the quantization vector of the codebook according to the index, and the deep network synthesizes speech from it. Finally, the synthesized speech is judged real or fake by a discriminator, which further improves its quality. The specific implementation process is as follows:
(1) The input speech is sampled at 8 kHz. The real speech sequence can be denoted $x \in \mathbb{R}^{C \times T}$, where $C$ is the number of speech channels and $T = d \cdot f_s$ is the total number of speech samples, with $d$ the duration of the speech and $f_s$ the sampling rate.
(2) The encoder Enc of the first deep network consists of a one-dimensional convolution with $C_e$ channels and kernel size $K$, followed by $V$ convolution blocks. Each convolution block consists of a single residual unit, built from skip-connected convolutions with kernel sizes 3 and 1, followed by a downsampling layer made of a strided convolution whose kernel size is twice the stride $S$. The number of channels doubles at every downsampling. The convolution blocks are followed by two LSTM layers to better capture long-term dependencies in the sequence data, and finally by a one-dimensional convolution layer with kernel size $K$ and $D$ output channels. Each convolution is preceded by a Snake activation function, defined as $\mathrm{snake}(x) = x + \frac{1}{\alpha}\sin^2(\alpha x)$, where $\alpha$ is a trainable parameter controlling the frequency of the periodic component of the signal; the larger $\alpha$, the higher the frequency. The encoded speech signal is $z = \mathrm{Enc}(x)$. Here $C_e = 32$, $V = 4$, $S = \{1, 4, 5, 8\}$, $K = 7$, $D = 512$.
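To make the layer stack concrete, the following is a minimal PyTorch sketch of an encoder with this shape. Only the hyperparameters (32 starting channels, V = 4 blocks, strides {1, 4, 5, 8}, kernel size 7, 512 output channels) and the Snake definition come from the description above; the mono input, padding choices, and class names are assumptions for illustration.

import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x), with trainable alpha."""
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x):
        return x + (1.0 / self.alpha) * torch.sin(self.alpha * x) ** 2

class ResidualUnit(nn.Module):
    """Skip-connected convolutions with kernel sizes 3 and 1."""
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            Snake(channels),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            Snake(channels),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)

class Encoder(nn.Module):
    def __init__(self, c=32, strides=(1, 4, 5, 8), k=7, d_out=512):
        super().__init__()
        layers = [nn.Conv1d(1, c, kernel_size=k, padding=k // 2)]
        ch = c
        for s in strides:                      # V = 4 convolution blocks
            layers += [
                ResidualUnit(ch),
                Snake(ch),
                # downsampling layer: strided conv, kernel size = 2 * stride
                nn.Conv1d(ch, 2 * ch, kernel_size=2 * s, stride=s, padding=s // 2),
            ]
            ch *= 2                            # channels double per downsampling
        self.conv = nn.Sequential(*layers)
        self.lstm = nn.LSTM(ch, ch, num_layers=2, batch_first=True)
        self.out = nn.Sequential(
            Snake(ch),
            nn.Conv1d(ch, d_out, kernel_size=k, padding=k // 2),
        )

    def forward(self, x):                      # x: (batch, 1, samples)
        h = self.conv(x)                       # (batch, 512, frames)
        h, _ = self.lstm(h.transpose(1, 2))    # two LSTM layers over frames
        return self.out(h.transpose(1, 2))     # z: (batch, 512, frames)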
(3) As shown in fig. 2, the predictive vector quantization process is as follows. At the quantization end, the codebook is designed with size $N$, i.e. it contains $N$ codewords, where each codeword represents a block signal (vector) mapped to an index suitable for transmission in the channel; the decoding end restores the reconstructed block signal (vector) from this index. With frame length $L$ and frame rate $M$, each frame can be encoded with $\log_2 N$ bits.
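For concreteness, with the parameter values given below ($N = 256$ codewords, frame rate $M = 50$ frames per second), each frame costs $\log_2 256 = 8$ bits, so the quantizer's index stream amounts to $M \log_2 N = 50 \times 8 = 400$ bps, within the sub-600 bps regime described in the background.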
For codebook initialization, clustering is performed with the K-Means algorithm. The encoded features $z$ at the coded frame rate are first obtained and the number of clusters is set to $N$; $N$ samples are randomly selected from $z$ as the initial means. The distance from each sample to each mean is computed, each sample is assigned to the nearest mean to form clusters, the new mean of each cluster is computed, and the means are updated, yielding the initialized codebook $C = \{c_1, c_2, \dots, c_N\}$.
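A minimal NumPy sketch of this initialization, assuming plain Lloyd-style iterations; the function name and iteration count are illustrative, not from the disclosure.

import numpy as np

def init_codebook(z: np.ndarray, n_codewords: int = 256,
                  n_iters: int = 50, seed: int = 0) -> np.ndarray:
    """z: (num_frames, dim) encoded feature vectors -> (N, dim) codebook."""
    rng = np.random.default_rng(seed)
    # randomly select N samples as the initial means
    codebook = z[rng.choice(len(z), size=n_codewords, replace=False)].copy()
    for _ in range(n_iters):
        # assign each sample to its nearest mean (Euclidean distance)
        dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # recompute each cluster mean; empty clusters keep their old mean
        for j in range(n_codewords):
            members = z[assign == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
    return codebook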
First, the encoded feature vector $z_t$ at the coded frame rate is obtained. The quantizer quantizes the difference $r_t = z_t - \hat{z}_t$ between the input vector $z_t$ and the prediction vector $\hat{z}_t$. The quantizer searches the codebook by Euclidean distance for the difference quantization vector $\hat{r}_t = Q(r_t)$ that best matches the input difference vector $r_t$, where $Q$ denotes the quantizer. The index $i_t$ of $\hat{r}_t$ is transmitted to the decoding end. The prediction vector $\hat{z}_t$ is obtained by prediction from the previous reconstruction vectors (reconstruction vector $\tilde{z}$); the predictor has the form $\hat{z}_t = P(\tilde{z}_{t-1}, \tilde{z}_{t-2}, \dots)$. Here $N = 256$, $L = 20$ ms, $M = 50$.
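The encoder-side quantization step can then be sketched directly from these formulas ($r_t = z_t - \hat{z}_t$, nearest codeword by Euclidean distance); variable names are illustrative.

import numpy as np

def quantize_step(z_t: np.ndarray, z_hat_t: np.ndarray, codebook: np.ndarray):
    """One frame of predictive vector quantization.

    z_t:      (dim,) input feature vector
    z_hat_t:  (dim,) predictor output for this frame
    codebook: (N, dim) difference codebook
    Returns the transmitted index and the local reconstruction.
    """
    r_t = z_t - z_hat_t                                  # difference vector
    idx = int(np.linalg.norm(codebook - r_t, axis=1).argmin())
    z_tilde_t = z_hat_t + codebook[idx]                  # reconstruction vector
    return idx, z_tilde_t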
The predictor consists of 4 temporal convolutional network blocks. Each block first passes through a one-dimensional convolution with 512 channels, then through a dilated convolution with kernel size 3 and dilation $D$, and finally through another one-dimensional convolution with 512 channels; to reduce information loss, skip links are added in the first and last layers. The dilations $D$ of the 4 temporal convolutional network blocks are $\{1, 2, 5, 8\}$, respectively.
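A minimal PyTorch sketch of such a predictor; the 512-channel one-dimensional convolutions, kernel-3 dilated convolutions with dilations {1, 2, 5, 8}, and skip links in the first and last blocks follow the description above, while the causal padding and class names are assumptions.

import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """1x1 conv -> causal dilated conv (kernel 3) -> 1x1 conv."""
    def __init__(self, channels: int, dilation: int, skip: bool = False):
        super().__init__()
        self.skip = skip                       # skip link in first/last block
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ConstantPad1d((2 * dilation, 0), 0.0),   # causal left padding
            nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        y = self.net(x)
        return x + y if self.skip else y

class Predictor(nn.Module):
    """Predicts the next frame vector from past reconstruction vectors."""
    def __init__(self, channels: int = 512, dilations=(1, 2, 5, 8)):
        super().__init__()
        last = len(dilations) - 1
        self.blocks = nn.Sequential(*[
            TCNBlock(channels, d, skip=(i == 0 or i == last))
            for i, d in enumerate(dilations)
        ])

    def forward(self, past):                   # past: (batch, 512, frames)
        return self.blocks(past)[:, :, -1]     # prediction for the next frame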
(4) The quantization index is packed into a binary byte stream for transmission and sent to the second deep network for decoding;
(5) The transmitted binary bytes are unpacked, the corresponding difference quantization vector is found in the codebook according to the received residual index, and it is added to the prediction vector to obtain the reconstruction vector.
At the decoding end, the transmitted binary bytes are unpacked, and the decoder looks up the corresponding difference quantization vector $\hat{r}_t$ in the codebook based on the received index $i_t$. Then $\hat{r}_t$ and the predictor output $\hat{z}_t$ are added to obtain the reconstruction vector $\tilde{z}_t$ of the input vector $z_t$: $\tilde{z}_t = \hat{z}_t + \hat{r}_t$.
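The decoding loop implied here can be sketched as follows. The key property is that the decoder drives the same predictor with its own past reconstructions, so encoder and decoder stay synchronized without any prediction being transmitted; the zero initial state and the callable predictor interface are assumptions.

import numpy as np

def decode_indices(indices, codebook, predictor):
    """indices: received codeword indices; codebook: (N, dim); returns (T, dim)."""
    dim = codebook.shape[1]
    history = [np.zeros(dim)]                  # assumed zero initial state
    for idx in indices:
        z_hat = predictor(history)             # prediction from past reconstructions
        z_tilde = z_hat + codebook[idx]        # reconstruction = prediction + codeword
        history.append(z_tilde)
    return np.stack(history[1:])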
(6) The dequantized speech features $\tilde{z}$ are input into a decoder with the same structure as the encoder but symmetrically inverted: first a one-dimensional convolution with kernel size K, then two LSTM layers and V convolution blocks, and finally a one-dimensional convolution with kernel size K, yielding the reconstructed speech. The parameters are consistent with (2).
(7) The quality of the synthesized speech is further improved by introducing a multi-scale STFT discriminator (MS-STFT) and a multi-period discriminator (MPD) to judge whether the generated speech is real or fake. The MS-STFT discriminator consists of identically structured networks operating on multi-scale complex-valued STFTs, with the real and imaginary parts concatenated. Each sub-network consists of a two-dimensional convolution layer (kernel size 3 x 8 with 32 channels), followed by two-dimensional convolutions whose dilation rate D increases in the time dimension, with a stride of 2 along the frequency axis. A final two-dimensional convolution with kernel size 3 x 3 and stride (1, 1) provides the final prediction. Five different scales are used, with STFT window lengths (2048, 1024, 512, 256, 128). The MPD is a mixture of sub-discriminators, each of which accepts only equally spaced samples of the input speech; the spacing is given by the period p. The sub-discriminators aim to capture implicit structures different from each other by looking at different portions of the input audio. The periods p are set to [2, 3, 5, 7, 11] to avoid overlap as much as possible. The 1D raw audio of length T is first reshaped into 2D data of height T/p and width p, and a 2D convolution is then applied to the reshaped data. In each convolution layer of the MPD, the kernel size along the width axis is limited to 1 so that the periodic samples are processed independently.
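A short sketch of the period-p reshaping used by each MPD sub-discriminator; the reflect padding for lengths not divisible by p follows common practice and is an assumption.

import torch
import torch.nn.functional as F

def reshape_for_period(audio: torch.Tensor, p: int) -> torch.Tensor:
    """(batch, 1, T) 1D audio -> (batch, 1, T/p, p) 2D map for period p."""
    b, c, t = audio.shape
    if t % p != 0:                             # pad so T is a multiple of p
        audio = F.pad(audio, (0, p - t % p), mode="reflect")
        t = audio.shape[-1]
    return audio.view(b, c, t // p, p)

# Each sub-discriminator then stacks 2D convolutions whose kernel size along
# the width (p) axis is 1, so the p phases of a period are processed independently.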
The improvement of the present disclosure is that the problems of large quantization error, complex computation, and inability to exploit temporal correlation are solved by introducing a predictor into the quantization stage of the neural vocoder. In the quantization stage, the codebook is initialized with the K-Means algorithm; past reconstruction vectors are fed to the predictor to predict the next-frame vector, and the difference between the prediction vector and the input vector is quantized against the codebook to obtain a quantization index for transmission. At the decoding end, the quantization index is received, the corresponding quantization vector is retrieved, and it is added to the predictor output to obtain the reconstruction vector of the input vector. In addition, an STFT discriminator and a multi-period discriminator are introduced after decoding to improve speech synthesis quality.
Example 2
In one embodiment of the present disclosure, a speech compression system based on deep learning and vector prediction is provided, comprising:
the data acquisition module is used for acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
the prediction module is used for extracting acoustic features by taking the current-frame speech sequence as the input signal of the first deep network, and for predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
the vector quantization module is used for subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to the second deep network, which looks up the corresponding difference quantization vector in the codebook according to the received residual index and adds it to the prediction vector to obtain a reconstruction vector;
and the speech synthesis module is used for decoding the reconstruction vector to output synthesized speech and judging the authenticity of the generated speech through a discriminator.
By training a predictor that takes past reconstruction vectors as input, the next-frame vector is predicted, and the difference between the prediction vector and the input vector is quantized against the codebook to obtain a quantization index for transmission. At the decoding end, the quantization index is received, the corresponding quantization vector is retrieved, and it is added to the predictor output to obtain the reconstruction vector of the input vector. By exploiting the temporal correlation of the data through the predictor, the method reduces quantization error and improves the synthesis quality of speech.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the specific embodiments of the present disclosure have been described above with reference to the drawings, it should be understood that the present disclosure is not limited to the embodiments, and that various modifications and changes can be made by one skilled in the art without inventive effort on the basis of the technical solutions of the present disclosure while remaining within the scope of the present disclosure.

Claims (10)

1. A speech compression method based on deep learning and vector prediction, characterized by comprising the following steps:
acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
extracting acoustic features by taking the current-frame speech sequence as the input signal of a first deep network, and predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to a second deep network; and the second deep network finds the corresponding difference quantization vector in the codebook according to the received residual index, adds it to the prediction vector to obtain a reconstruction vector, decodes the reconstruction vector to output synthesized speech, and judges the authenticity of the generated speech through a discriminator.
2. The speech compression method based on deep learning and vector prediction of claim 1, wherein the first deep network consists of a one-dimensional convolution and a plurality of convolution blocks, each convolution block consisting of a single residual unit built from skip-connected convolutions with kernel sizes 3 and 1, followed by a downsampling layer consisting of a strided convolution whose kernel size is twice the stride S.
3. The speech compression method based on deep learning and vector prediction of claim 2, wherein the number of channels doubles at every downsampling, and the convolution blocks are followed by two LSTM layers to capture long-term dependencies in the speech sequence data.
4. The speech compression method based on deep learning and vector prediction of claim 1, wherein the codebook is designed with size N, frame length L, and frame rate M, each frame being encodable with $\log_2 N$ bits; the codebook is initialized by clustering into N clusters with the K-Means algorithm to obtain the initialized codebook $C$.
5. The speech compression method based on deep learning and vector prediction of claim 1,
wherein the difference $r_t = z_t - \hat{z}_t$ between the acoustic features $z_t$ and the prediction vector $\hat{z}_t$ is quantized, and the quantization vector $\hat{r}_t = Q(r_t)$ that best matches the input difference vector is found in the codebook by Euclidean distance, where $Q$ denotes the quantizer.
6. The speech compression method based on deep learning and vector prediction of claim 5,
wherein the quantization index is packed into a binary byte stream for transmission and sent to the second deep network for decoding; the transmitted binary byte stream is unpacked, the corresponding difference quantization vector is found in the codebook according to the received residual index, and it is added to the prediction vector to obtain the reconstruction vector.
7. The speech compression method of claim 6, wherein the reconstruction vector is input into a second deep network having the same structure as the first deep network but symmetrically inverted, and the synthesized speech is output after first a one-dimensional convolution with kernel size K, then two LSTM layers and a plurality of convolution blocks, and finally a one-dimensional convolution with kernel size K.
8. The speech compression method based on deep learning and vector prediction of claim 1, characterized in that the authenticity of the generated speech is judged by introducing a multi-scale STFT discriminator and a multi-period discriminator, the STFT discriminator consisting of identically structured networks operating on multi-scale complex-valued STFTs, in which the real and imaginary parts are concatenated, each sub-network consisting of a two-dimensional convolution layer.
9. The speech compression method based on deep learning and vector prediction of claim 8, wherein the multi-period discriminator is a mixture of sub-discriminators, each sub-discriminator accepting only equally spaced samples of the input speech sequence and capturing implicit structures different from each other by looking at different parts of the input speech sequence.
10. A speech compression system based on deep learning and vector prediction, comprising:
the data acquisition module is used for acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
the prediction module is used for extracting acoustic features by taking the current-frame speech sequence as the input signal of the first deep network, and for predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
the vector quantization module is used for subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to the second deep network, which looks up the corresponding difference quantization vector in the codebook according to the received residual index and adds it to the prediction vector to obtain a reconstruction vector;
and the speech synthesis module is used for decoding the reconstruction vector to output synthesized speech and judging the authenticity of the generated speech through a discriminator.
CN202311743425.7A 2023-12-19 2023-12-19 Speech compression method and system based on deep learning and vector prediction Active CN117423348B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311743425.7A CN117423348B (en) 2023-12-19 2023-12-19 Speech compression method and system based on deep learning and vector prediction


Publications (2)

Publication Number Publication Date
CN117423348A 2024-01-19
CN117423348B 2024-04-02

Family

ID=89530574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311743425.7A Active CN117423348B (en) 2023-12-19 2023-12-19 Speech compression method and system based on deep learning and vector prediction

Country Status (1)

Country Link
CN (1) CN117423348B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1930881A2 (en) * 1998-08-24 2008-06-11 Mindspeed Technologies, Inc. Speech decoder employing noise compensation
KR20020075592A (en) * 2001-03-26 2002-10-05 한국전자통신연구원 LSF quantization for wideband speech coder
CN1420487A (en) * 2002-12-19 2003-05-28 北京工业大学 Method for quantizing one-step interpolation predicted vector of 1kb/s line spectral frequency parameter
CN103050122A (en) * 2012-12-18 2013-04-17 北京航空航天大学 MELP-based (Mixed Excitation Linear Prediction-based) multi-frame joint quantization low-rate speech coding and decoding method
CN103325375A (en) * 2013-06-05 2013-09-25 上海交通大学 Coding and decoding device and method of ultralow-bit-rate speech
CN106203624A (en) * 2016-06-23 2016-12-07 上海交通大学 Vector Quantization based on deep neural network and method
US20190371349A1 (en) * 2018-06-01 2019-12-05 Qualcomm Incorporated Audio coding based on audio pattern recognition
CN116153320A (en) * 2023-02-27 2023-05-23 上海交通大学 Speech signal combined noise reduction compression method and system
CN116504254A (en) * 2023-04-18 2023-07-28 平安科技(深圳)有限公司 Audio encoding and decoding method and device, storage medium and computer equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Alexandre Défossez et al., "High Fidelity Neural Audio Compression", arXiv, 24 October 2022 (2022-10-24), pages 1-19 *
Liu Jixin (刘继新), "Research on Audio Information Hiding Algorithms Based on Vector Quantization Technology", China Doctoral Dissertations Full-text Database, Information Science and Technology, 15 April 2011 (2011-04-15) *

Also Published As

Publication number Publication date
CN117423348B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US7729905B2 (en) Speech coding apparatus and speech decoding apparatus each having a scalable configuration
TWI405187B (en) Scalable speech and audio encoder device, processor including the same, and method and machine-readable medium therefor
US8392176B2 (en) Processing of excitation in audio coding and decoding
Zhen et al. Cascaded cross-module residual learning towards lightweight end-to-end speech coding
EP2254110B1 (en) Stereo signal encoding device, stereo signal decoding device and methods for them
US20080091440A1 (en) Sound Encoder And Sound Encoding Method
JP4875249B2 (en) Automatic speech recognition execution method
KR20050087956A (en) Lossless audio decoding/encoding method and apparatus
JP4489959B2 (en) Speech synthesis method and speech synthesizer for synthesizing speech from pitch prototype waveform by time synchronous waveform interpolation
CN112767954A (en) Audio encoding and decoding method, device, medium and electronic equipment
WO2014051964A1 (en) Apparatus and method for audio frame loss recovery
US8027242B2 (en) Signal coding and decoding based on spectral dynamics
CN102598125B (en) Encoder apparatus, decoder apparatus and methods of these
JP4978539B2 (en) Encoding apparatus, encoding method, and program.
CN100585700C (en) Sound encoding device and method thereof
JP3344944B2 (en) Audio signal encoding device, audio signal decoding device, audio signal encoding method, and audio signal decoding method
CN117423348B (en) Speech compression method and system based on deep learning and vector prediction
US20120123788A1 (en) Coding method, decoding method, and device and program using the methods
EP1121686A1 (en) Speech parameter compression
CN112669857B (en) Voice processing method, device and equipment
CN114913862A (en) Vocoder parameter error code masking method and system based on tabu transfer matrix
US8949117B2 (en) Encoding device, decoding device and methods therefor
JPH08179800A (en) Sound coding device
CN117831548A (en) Training method, encoding method, decoding method and device of audio coding and decoding system
JPH09120300A (en) Vector quantization device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant