CN117423348A - Speech compression method and system based on deep learning and vector prediction


Info

Publication number
CN117423348A
CN117423348A
Authority
CN
China
Prior art keywords
vector
prediction
difference
voice
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311743425.7A
Other languages
Chinese (zh)
Other versions
CN117423348B (en)
Inventor
李晔
于兴业
吝灵霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202311743425.7A
Publication of CN117423348A
Application granted
Publication of CN117423348B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The disclosure provides a speech compression method and system based on deep learning and vector prediction, relating to the technical field of speech signal processing. The method comprises the following steps: acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences; extracting acoustic features by taking the current-frame speech sequence as the input signal of a first deep network, and predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector; subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to a second deep network; the second deep network looks up the corresponding difference quantization vector in the codebook according to the received residual index, adds it to the prediction vector to obtain a reconstruction vector, and decodes the reconstruction vector to output synthesized speech.

Description

Speech compression method and system based on deep learning and vector prediction
Technical Field
The disclosure relates to the technical field of speech signal processing, in particular to a speech compression method and system based on deep learning and vector prediction.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Low-rate speech coding has wide application in satellite communication, short-wave communication, underwater acoustic communication, secure communication and other fields. For example, in extremely harsh mountainous communication environments, an ultra-short-wave radio station must guarantee around-the-clock, 24-hour communication, and the coding rate of the speech coder is often below 600 bps. As the coding rate decreases, speech synthesis quality suffers, so the extraction of acoustic features and the allocation of bits become particularly important. This is especially true for ultra-low-rate speech compression coding based on deep learning, realized by what are known as neural vocoders.
The basic steps of current neural vocoders are: at the encoding end, extract acoustic features from the input signal samples; at the quantization end, quantize the extracted features and pack them into binary bytes for transmission; at the dequantization end, unpack the received data packets and restore the acoustic features according to the codebook; finally, at the decoding end, synthesize speech from the restored acoustic features to recover the input speech signal. Neural vocoders mainly quantize with scalar quantization or residual vector quantization, but these schemes still have the following defects:
1) For scenarios with large amounts of data, scalar quantization quantizes each dimension independently into a single scalar, which easily leads to information loss; scalar quantization is also very sensitive to noisy signals.
2) Residual vector quantization reduces quantization loss to some extent relative to scalar quantization, but its use of multiple quantizers adds extra computational and storage overhead.
Moreover, both schemes quantize each vector independently, without depending on other quantization results, in other words on past or future states of the encoder or decoder, so the correlation between the data cannot be exploited.
Disclosure of Invention
In order to solve the above problems, the present disclosure proposes a speech compression method and system based on deep learning and vector prediction, which improves speech coding quality by introducing predictive vector quantization on top of deep learning and vector-quantizing the difference between the input vector and the prediction vector.
According to some embodiments, the present disclosure employs the following technical solutions:
a voice compression method based on deep learning and vector prediction comprises the following steps:
acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
extracting acoustic features by taking the current-frame speech sequence as the input signal of a first deep network, and predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to a second deep network; and the second deep network finds the corresponding difference quantization vector in the codebook according to the received residual index, adds it to the prediction vector to obtain a reconstruction vector, decodes the reconstruction vector to output synthesized speech, and judges the authenticity of the generated speech through a discriminator.
According to some embodiments, the present disclosure employs the following technical solutions:
a speech compression system based on deep learning and vector prediction, comprising:
the data acquisition module is used for acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
the prediction module is used for extracting acoustic features by taking the current-frame speech sequence as the input signal of the first deep network, and for predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
the vector quantization module is used for subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to the second deep network, which looks up the corresponding difference quantization vector in the codebook according to the received residual index and adds it to the prediction vector to obtain a reconstruction vector;
and the speech synthesis module is used for decoding the reconstruction vector to output synthesized speech and judging the authenticity of the generated speech through a discriminator.
Compared with the prior art, the beneficial effects of the present disclosure are:
the present disclosure provides a speech compression method based on deep learning and vector prediction, which introduces a prediction vector quantization technique into a deep-learning low-rate neural vocoder, reduces quantization loss by using a predictor, improves time correlation between vectors, predicts a next frame vector by training a predictor with a past reconstructed vector as an input, and quantizes a difference value between the predicted vector and the input vector by inputting the difference value into a codebook to obtain a quantization index for transmission. The quantization index is received at the decoding end, a quantization vector is obtained from the quantization index, and then the quantization vector is added to the predictor output to obtain a reconstructed vector of the input vector. The method reduces quantization error and improves the coding synthesis quality of voice by utilizing the time correlation of data through the predictor.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flow chart of a speech synthesis method based on predictive vector quantization in accordance with an embodiment of the present disclosure;
FIG. 2 is a flow chart of prediction vector quantization according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of the encoding and decoding structure of a neural vocoder according to an embodiment of the present disclosure.
Detailed Description
The disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
In one embodiment of the present disclosure, a speech compression method based on deep learning and vector prediction is provided, including:
step one: acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
step two: extracting acoustic features by taking the current-frame speech sequence as the input signal of a first deep network, and predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
step three: subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to a second deep network; the second deep network finds the corresponding difference quantization vector in the codebook according to the received residual index and adds it to the prediction vector to obtain a reconstruction vector;
step four: decoding the reconstruction vector to output synthesized speech, and judging the authenticity of the generated speech through a discriminator.
As an embodiment, the disclosed speech compression method based on deep learning and vector prediction adopts a high-quality, low-rate neural vocoder based on predictive vector quantization, further improving the quality of speech synthesized at a low rate. First, features are extracted from the input speech signal by the first deep network, and the resulting parameters are quantized and coded with predictive vector quantization; the coded indices are packed into a binary byte stream for transmission. At the decoding end, the transmitted bytes are unpacked, the decoder retrieves the quantization vector of the codebook according to the index, and the deep network synthesizes speech from it. Finally, the synthesized speech is judged real or fake by a discriminator, which further improves its quality. The specific implementation process is as follows:
(1) The input speech is sampled at 8 kHz. The real speech sequence can be denoted $x \in \mathbb{R}^{C \times T}$, where $C$ is the number of speech channels and $T = d \cdot f_s$ is the total number of speech samples, with $d$ the duration of the speech and $f_s$ the sampling rate.
(2) The encoder Enc of the first deep network consists of a one-dimensional convolution with $C_e$ channels and kernel size $K$, followed by $V$ convolution blocks. Each convolution block consists of a single residual unit, built from skip-connected convolutions with kernel sizes 3 and 1, followed by a downsampling layer made of a strided convolution whose kernel size is twice the stride $S$. The number of channels doubles at every downsampling. The convolution blocks are followed by two LSTM layers to better capture long-term dependencies in the sequence data, and finally by a one-dimensional convolution layer with kernel size $K$ and $D$ output channels. Each convolution is preceded by a Snake activation function, defined as $\mathrm{snake}(x) = x + \frac{1}{\alpha}\sin^2(\alpha x)$, where $\alpha$ is a trainable parameter controlling the frequency of the periodic component of the signal; the larger $\alpha$, the higher the frequency. The encoded speech signal is $z = \mathrm{Enc}(x)$. Here $C_e = 32$, $V = 4$, $S = \{1, 4, 5, 8\}$, $K = 7$, $D = 512$.
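To make the layer stack concrete, the following is a minimal PyTorch sketch of an encoder with this shape. Only the hyperparameters (32 starting channels, V = 4 blocks, strides {1, 4, 5, 8}, kernel size 7, 512 output channels) and the Snake definition come from the description above; the mono input, padding choices, and class names are assumptions for illustration.

import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x), with trainable alpha."""
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x):
        return x + (1.0 / self.alpha) * torch.sin(self.alpha * x) ** 2

class ResidualUnit(nn.Module):
    """Skip-connected convolutions with kernel sizes 3 and 1."""
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            Snake(channels),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            Snake(channels),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)

class Encoder(nn.Module):
    def __init__(self, c=32, strides=(1, 4, 5, 8), k=7, d_out=512):
        super().__init__()
        layers = [nn.Conv1d(1, c, kernel_size=k, padding=k // 2)]
        ch = c
        for s in strides:                      # V = 4 convolution blocks
            layers += [
                ResidualUnit(ch),
                Snake(ch),
                # downsampling layer: strided conv, kernel size = 2 * stride
                nn.Conv1d(ch, 2 * ch, kernel_size=2 * s, stride=s, padding=s // 2),
            ]
            ch *= 2                            # channels double per downsampling
        self.conv = nn.Sequential(*layers)
        self.lstm = nn.LSTM(ch, ch, num_layers=2, batch_first=True)
        self.out = nn.Sequential(
            Snake(ch),
            nn.Conv1d(ch, d_out, kernel_size=k, padding=k // 2),
        )

    def forward(self, x):                      # x: (batch, 1, samples)
        h = self.conv(x)                       # (batch, 512, frames)
        h, _ = self.lstm(h.transpose(1, 2))    # two LSTM layers over frames
        return self.out(h.transpose(1, 2))     # z: (batch, 512, frames)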
(3) As shown in fig. 2, the predictive vector quantization process is as follows. At the quantization end, the codebook is designed with size $N$, i.e. it contains $N$ codewords, where each codeword represents a block signal (vector) mapped to an index suitable for transmission in the channel; the decoding end restores the reconstructed block signal (vector) from this index. With frame length $L$ and frame rate $M$, each frame can be encoded with $\log_2 N$ bits.
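For concreteness, with the parameter values given below ($N = 256$ codewords, frame rate $M = 50$ frames per second), each frame costs $\log_2 256 = 8$ bits, so the quantizer's index stream amounts to $M \log_2 N = 50 \times 8 = 400$ bps, within the sub-600 bps regime described in the background.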
For codebook initialization, clustering is performed with the K-Means algorithm. The encoded features $z$ at the coded frame rate are first obtained and the number of clusters is set to $N$; $N$ samples are randomly selected from $z$ as the initial means. The distance from each sample to each mean is computed, each sample is assigned to the nearest mean to form clusters, the new mean of each cluster is computed, and the means are updated, yielding the initialized codebook $C = \{c_1, c_2, \dots, c_N\}$.
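A minimal NumPy sketch of this initialization, assuming plain Lloyd-style iterations; the function name and iteration count are illustrative, not from the disclosure.

import numpy as np

def init_codebook(z: np.ndarray, n_codewords: int = 256,
                  n_iters: int = 50, seed: int = 0) -> np.ndarray:
    """z: (num_frames, dim) encoded feature vectors -> (N, dim) codebook."""
    rng = np.random.default_rng(seed)
    # randomly select N samples as the initial means
    codebook = z[rng.choice(len(z), size=n_codewords, replace=False)].copy()
    for _ in range(n_iters):
        # assign each sample to its nearest mean (Euclidean distance)
        dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # recompute each cluster mean; empty clusters keep their old mean
        for j in range(n_codewords):
            members = z[assign == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
    return codebook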
First, the encoded feature vector $z_t$ at the coded frame rate is obtained. The quantizer quantizes the difference $r_t = z_t - \hat{z}_t$ between the input vector $z_t$ and the prediction vector $\hat{z}_t$. The quantizer searches the codebook by Euclidean distance for the difference quantization vector $\hat{r}_t = Q(r_t)$ that best matches the input difference vector $r_t$, where $Q$ denotes the quantizer. The index $i_t$ of $\hat{r}_t$ is transmitted to the decoding end. The prediction vector $\hat{z}_t$ is obtained by prediction from the previous reconstruction vectors (reconstruction vector $\tilde{z}$); the predictor has the form $\hat{z}_t = P(\tilde{z}_{t-1}, \tilde{z}_{t-2}, \dots)$. Here $N = 256$, $L = 20$ ms, $M = 50$.
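The encoder-side quantization step can then be sketched directly from these formulas ($r_t = z_t - \hat{z}_t$, nearest codeword by Euclidean distance); variable names are illustrative.

import numpy as np

def quantize_step(z_t: np.ndarray, z_hat_t: np.ndarray, codebook: np.ndarray):
    """One frame of predictive vector quantization.

    z_t:      (dim,) input feature vector
    z_hat_t:  (dim,) predictor output for this frame
    codebook: (N, dim) difference codebook
    Returns the transmitted index and the local reconstruction.
    """
    r_t = z_t - z_hat_t                                  # difference vector
    idx = int(np.linalg.norm(codebook - r_t, axis=1).argmin())
    z_tilde_t = z_hat_t + codebook[idx]                  # reconstruction vector
    return idx, z_tilde_t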
The predictor consists of 4 temporal convolutional network blocks. Each block first passes through a one-dimensional convolution with 512 channels, then through a dilated convolution with kernel size 3 and dilation $D$, and finally through another one-dimensional convolution with 512 channels; to reduce information loss, skip links are added in the first and last layers. The dilations $D$ of the 4 temporal convolutional network blocks are $\{1, 2, 5, 8\}$, respectively.
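A minimal PyTorch sketch of such a predictor; the 512-channel one-dimensional convolutions, kernel-3 dilated convolutions with dilations {1, 2, 5, 8}, and skip links in the first and last blocks follow the description above, while the causal padding and class names are assumptions.

import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """1x1 conv -> causal dilated conv (kernel 3) -> 1x1 conv."""
    def __init__(self, channels: int, dilation: int, skip: bool = False):
        super().__init__()
        self.skip = skip                       # skip link in first/last block
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ConstantPad1d((2 * dilation, 0), 0.0),   # causal left padding
            nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        y = self.net(x)
        return x + y if self.skip else y

class Predictor(nn.Module):
    """Predicts the next frame vector from past reconstruction vectors."""
    def __init__(self, channels: int = 512, dilations=(1, 2, 5, 8)):
        super().__init__()
        last = len(dilations) - 1
        self.blocks = nn.Sequential(*[
            TCNBlock(channels, d, skip=(i == 0 or i == last))
            for i, d in enumerate(dilations)
        ])

    def forward(self, past):                   # past: (batch, 512, frames)
        return self.blocks(past)[:, :, -1]     # prediction for the next frame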
(4) The quantization index is packed into a binary byte stream for transmission and sent to the second deep network for decoding;
(5) The transmitted binary bytes are unpacked, the corresponding difference quantization vector is found in the codebook according to the received residual index, and it is added to the prediction vector to obtain the reconstruction vector.
At the decoding end, the transmitted binary bytes are unpacked, and the decoder looks up the corresponding difference quantization vector $\hat{r}_t$ in the codebook based on the received index $i_t$. Then $\hat{r}_t$ and the predictor output $\hat{z}_t$ are added to obtain the reconstruction vector $\tilde{z}_t$ of the input vector $z_t$: $\tilde{z}_t = \hat{z}_t + \hat{r}_t$.
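The decoding loop implied here can be sketched as follows. The key property is that the decoder drives the same predictor with its own past reconstructions, so encoder and decoder stay synchronized without any prediction being transmitted; the zero initial state and the callable predictor interface are assumptions.

import numpy as np

def decode_indices(indices, codebook, predictor):
    """indices: received codeword indices; codebook: (N, dim); returns (T, dim)."""
    dim = codebook.shape[1]
    history = [np.zeros(dim)]                  # assumed zero initial state
    for idx in indices:
        z_hat = predictor(history)             # prediction from past reconstructions
        z_tilde = z_hat + codebook[idx]        # reconstruction = prediction + codeword
        history.append(z_tilde)
    return np.stack(history[1:])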
(6) The dequantized speech features $\tilde{z}$ are input into a decoder with the same structure as the encoder but symmetrically inverted: first a one-dimensional convolution with kernel size K, then two LSTM layers and V convolution blocks, and finally a one-dimensional convolution with kernel size K, yielding the reconstructed speech. The parameters are consistent with (2).
(7) The quality of the synthesized speech is further improved by introducing a multi-scale STFT discriminator (MS-STFT) and a multi-period discriminator (MPD) to judge whether the generated speech is real or fake. The MS-STFT discriminator consists of identically structured networks operating on multi-scale complex-valued STFTs, with the real and imaginary parts concatenated. Each sub-network consists of a two-dimensional convolution layer (kernel size 3 x 8 with 32 channels), followed by two-dimensional convolutions whose dilation rate D increases in the time dimension, with a stride of 2 along the frequency axis. A final two-dimensional convolution with kernel size 3 x 3 and stride (1, 1) provides the final prediction. Five different scales are used, with STFT window lengths (2048, 1024, 512, 256, 128). The MPD is a mixture of sub-discriminators, each of which accepts only equally spaced samples of the input speech; the spacing is given by the period p. The sub-discriminators aim to capture implicit structures different from each other by looking at different portions of the input audio. The periods p are set to [2, 3, 5, 7, 11] to avoid overlap as much as possible. The 1D raw audio of length T is first reshaped into 2D data of height T/p and width p, and a 2D convolution is then applied to the reshaped data. In each convolution layer of the MPD, the kernel size along the width axis is limited to 1 so that the periodic samples are processed independently.
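A short sketch of the period-p reshaping used by each MPD sub-discriminator; the reflect padding for lengths not divisible by p follows common practice and is an assumption.

import torch
import torch.nn.functional as F

def reshape_for_period(audio: torch.Tensor, p: int) -> torch.Tensor:
    """(batch, 1, T) 1D audio -> (batch, 1, T/p, p) 2D map for period p."""
    b, c, t = audio.shape
    if t % p != 0:                             # pad so T is a multiple of p
        audio = F.pad(audio, (0, p - t % p), mode="reflect")
        t = audio.shape[-1]
    return audio.view(b, c, t // p, p)

# Each sub-discriminator then stacks 2D convolutions whose kernel size along
# the width (p) axis is 1, so the p phases of a period are processed independently.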
The improvement of the present disclosure is that the problems of large quantization error, complex computation, and inability to exploit temporal correlation are solved by introducing a predictor into the quantization stage of the neural vocoder. In the quantization stage, the codebook is initialized with the K-Means algorithm; past reconstruction vectors are fed to the predictor to predict the next-frame vector, and the difference between the prediction vector and the input vector is quantized against the codebook to obtain a quantization index for transmission. At the decoding end, the quantization index is received, the corresponding quantization vector is retrieved, and it is added to the predictor output to obtain the reconstruction vector of the input vector. In addition, an STFT discriminator and a multi-period discriminator are introduced after decoding to improve speech synthesis quality.
Example 2
In one embodiment of the present disclosure, a speech compression system based on deep learning and vector prediction is provided, comprising:
the data acquisition module is used for acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
the prediction module is used for extracting acoustic features by taking the current-frame speech sequence as the input signal of the first deep network, and for predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
the vector quantization module is used for subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to the second deep network, which looks up the corresponding difference quantization vector in the codebook according to the received residual index and adds it to the prediction vector to obtain a reconstruction vector;
and the speech synthesis module is used for decoding the reconstruction vector to output synthesized speech and judging the authenticity of the generated speech through a discriminator.
By training a predictor that takes past reconstruction vectors as input, the next-frame vector is predicted, and the difference between the prediction vector and the input vector is quantized against the codebook to obtain a quantization index for transmission. At the decoding end, the quantization index is received, the corresponding quantization vector is retrieved, and it is added to the predictor output to obtain the reconstruction vector of the input vector. By exploiting the temporal correlation of the data through the predictor, the method reduces quantization error and improves the synthesis quality of speech.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the specific embodiments of the present disclosure have been described above with reference to the drawings, it should be understood that the present disclosure is not limited to the embodiments, and that various modifications and changes can be made by one skilled in the art without inventive effort on the basis of the technical solutions of the present disclosure while remaining within the scope of the present disclosure.

Claims (10)

1. A speech compression method based on deep learning and vector prediction, characterized by comprising the following steps:
acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
extracting acoustic features by taking the current-frame speech sequence as the input signal of a first deep network, and predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to a second deep network; and the second deep network finds the corresponding difference quantization vector in the codebook according to the received residual index, adds it to the prediction vector to obtain a reconstruction vector, decodes the reconstruction vector to output synthesized speech, and judges the authenticity of the generated speech through a discriminator.
2. The speech compression method based on deep learning and vector prediction of claim 1, wherein the first deep network consists of a one-dimensional convolution and a plurality of convolution blocks, each convolution block consisting of a single residual unit built from skip-connected convolutions with kernel sizes 3 and 1, followed by a downsampling layer consisting of a strided convolution whose kernel size is twice the stride S.
3. The speech compression method based on deep learning and vector prediction of claim 2, wherein the number of channels doubles at every downsampling, and the convolution blocks are followed by two LSTM layers to capture long-term dependencies in the speech sequence data.
4. The speech compression method based on deep learning and vector prediction of claim 1, wherein the codebook is designed with size N, frame length L, and frame rate M, each frame being encodable with $\log_2 N$ bits; the codebook is initialized by clustering into N clusters with the K-Means algorithm to obtain the initialized codebook $C$.
5. The speech compression method based on deep learning and vector prediction of claim 1,
wherein the difference $r_t = z_t - \hat{z}_t$ between the acoustic features $z_t$ and the prediction vector $\hat{z}_t$ is quantized, and the quantization vector $\hat{r}_t = Q(r_t)$ that best matches the input difference vector is found in the codebook by Euclidean distance, where $Q$ denotes the quantizer.
6. The speech compression method based on deep learning and vector prediction of claim 5,
wherein the quantization index is packed into a binary byte stream for transmission and sent to the second deep network for decoding; the transmitted binary byte stream is unpacked, the corresponding difference quantization vector is found in the codebook according to the received residual index, and it is added to the prediction vector to obtain the reconstruction vector.
7. The speech compression method of claim 6, wherein the reconstruction vector is input into a second deep network having the same structure as the first deep network but symmetrically inverted, and the synthesized speech is output after first a one-dimensional convolution with kernel size K, then two LSTM layers and a plurality of convolution blocks, and finally a one-dimensional convolution with kernel size K.
8. The speech compression method based on deep learning and vector prediction of claim 1, characterized in that the authenticity of the generated speech is judged by introducing a multi-scale STFT discriminator and a multi-period discriminator, the STFT discriminator consisting of identically structured networks operating on multi-scale complex-valued STFTs, in which the real and imaginary parts are concatenated, each sub-network consisting of a two-dimensional convolution layer.
9. The speech compression method based on deep learning and vector prediction of claim 8, wherein the multi-period discriminator is a mixture of sub-discriminators, each sub-discriminator accepting only equally spaced samples of the input speech sequence and capturing implicit structures different from each other by looking at different parts of the input speech sequence.
10. A speech compression system based on deep learning and vector prediction, comprising:
the data acquisition module is used for acquiring multi-frame speech signals at a low rate and preprocessing them into speech sequences;
the prediction module is used for extracting acoustic features by taking the current-frame speech sequence as the input signal of the first deep network, and for predicting the acoustic features of the next-frame speech sequence from those features as the prediction vector;
the vector quantization module is used for subtracting the prediction vector from the original acoustic features to obtain a difference vector, searching a designed codebook for the quantization vector that best matches the difference vector, and transmitting it as a residual index to the second deep network, which looks up the corresponding difference quantization vector in the codebook according to the received residual index and adds it to the prediction vector to obtain a reconstruction vector;
and the speech synthesis module is used for decoding the reconstruction vector to output synthesized speech and judging the authenticity of the generated speech through a discriminator.
CN202311743425.7A 2023-12-19 2023-12-19 Speech compression method and system based on deep learning and vector prediction Active CN117423348B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311743425.7A CN117423348B (en) 2023-12-19 2023-12-19 Speech compression method and system based on deep learning and vector prediction


Publications (2)

Publication Number Publication Date
CN117423348A 2024-01-19
CN117423348B 2024-04-02

Family

ID=89530574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311743425.7A Active CN117423348B (en) 2023-12-19 2023-12-19 Speech compression method and system based on deep learning and vector prediction

Country Status (1)

Country Link
CN (1) CN117423348B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1930881A2 (en) * 1998-08-24 2008-06-11 Mindspeed Technologies, Inc. Speech decoder employing noise compensation
KR20020075592A (en) * 2001-03-26 2002-10-05 한국전자통신연구원 LSF quantization for wideband speech coder
CN1420487A (en) * 2002-12-19 2003-05-28 北京工业大学 Method for quantizing one-step interpolation predicted vector of 1kb/s line spectral frequency parameter
CN103050122A (en) * 2012-12-18 2013-04-17 北京航空航天大学 MELP-based (Mixed Excitation Linear Prediction-based) multi-frame joint quantization low-rate speech coding and decoding method
CN103325375A (en) * 2013-06-05 2013-09-25 上海交通大学 Coding and decoding device and method of ultralow-bit-rate speech
CN106203624A (en) * 2016-06-23 2016-12-07 上海交通大学 Vector Quantization based on deep neural network and method
US20190371349A1 (en) * 2018-06-01 2019-12-05 Qualcomm Incorporated Audio coding based on audio pattern recognition
CN116153320A (en) * 2023-02-27 2023-05-23 上海交通大学 Speech signal combined noise reduction compression method and system
CN116504254A (en) * 2023-04-18 2023-07-28 平安科技(深圳)有限公司 Audio encoding and decoding method and device, storage medium and computer equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Alexandre Défossez et al., "High Fidelity Neural Audio Compression", arXiv, 24 October 2022 (2022-10-24), pages 1-19 *
Liu Jixin (刘继新), "Research on Audio Information Hiding Algorithms Based on Vector Quantization Technology", China Doctoral Dissertations Full-text Database, Information Science and Technology, 15 April 2011 (2011-04-15) *

Also Published As

Publication number Publication date
CN117423348B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US7729905B2 (en) Speech coding apparatus and speech decoding apparatus each having a scalable configuration
TWI405187B (en) Scalable speech and audio encoder device, processor including the same, and method and machine-readable medium therefor
US8392176B2 (en) Processing of excitation in audio coding and decoding
Zhen et al. Cascaded cross-module residual learning towards lightweight end-to-end speech coding
EP2254110B1 (en) Stereo signal encoding device, stereo signal decoding device and methods for them
US20080091440A1 (en) Sound Encoder And Sound Encoding Method
JP4875249B2 (en) Automatic speech recognition execution method
KR20050087956A (en) Lossless audio decoding/encoding method and apparatus
JP4489959B2 (en) Speech synthesis method and speech synthesizer for synthesizing speech from pitch prototype waveform by time synchronous waveform interpolation
CN112767954A (en) Audio encoding and decoding method, device, medium and electronic equipment
WO2014051964A1 (en) Apparatus and method for audio frame loss recovery
US8027242B2 (en) Signal coding and decoding based on spectral dynamics
CN102598125B (en) Encoder apparatus, decoder apparatus and methods of these
JP4978539B2 (en) Encoding apparatus, encoding method, and program.
CN100585700C (en) Sound encoding device and method thereof
JP3344944B2 (en) Audio signal encoding device, audio signal decoding device, audio signal encoding method, and audio signal decoding method
CN117423348B (en) Speech compression method and system based on deep learning and vector prediction
US20120123788A1 (en) Coding method, decoding method, and device and program using the methods
EP1121686A1 (en) Speech parameter compression
CN112669857B (en) Voice processing method, device and equipment
CN114913862A (en) Vocoder parameter error code masking method and system based on tabu transfer matrix
US8949117B2 (en) Encoding device, decoding device and methods therefor
JPH08179800A (en) Sound coding device
CN117831548A (en) Training method, encoding method, decoding method and device of audio coding and decoding system
JPH09120300A (en) Vector quantization device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant