WO2023241205A1 - Audio processing method and apparatus, electronic device, computer-readable storage medium, and computer program product - Google Patents

Audio processing method and apparatus, electronic device, computer-readable storage medium, and computer program product

Info

Publication number: WO2023241205A1
Authority: WO (WIPO PCT)
Prior art keywords: frequency, low, signal, sub, band
Application number: PCT/CN2023/088638
Other languages: English (en), French (fr)
Inventors: 王蒙, 阳珊, 黄庆博, 康迂勇, 史裕鹏, 肖玮, 商世东, 苏丹
Original Assignee: 腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2023241205A1


Classifications

    • G: PHYSICS
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
            • G10L19/04: ... using predictive techniques
              • G10L19/16: Vocoder architecture
            • G10L19/02: ... using spectral analysis, e.g. transform vocoders or subband vocoders
              • G10L19/032: Quantisation or dequantisation of spectral components
          • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03: ... characterised by the type of extracted parameters
              • G10L25/18: ... the extracted parameters being spectral information of each sub-band

Definitions

  • The present application relates to data processing technology, and in particular, to an audio processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
  • Audio codec technology is a core technology in communication services, including remote audio and video calls. Speech coding technology, simply put, uses less network bandwidth to transmit as much speech information as possible. From the perspective of Shannon information theory, speech coding is a kind of source coding; the purpose of source coding is to compress, on the encoding side, the amount of data of the information to be transmitted as much as possible and remove the redundancy in the information, while on the decoding side the information can still be restored losslessly (or nearly losslessly).
  • Embodiments of the present application provide an audio processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can improve audio coding efficiency while ensuring audio quality.
  • An embodiment of the present application provides an audio processing method, including: performing subband decomposition on an audio signal to obtain a low-frequency subband signal and a high-frequency subband signal of the audio signal; performing feature extraction on the low-frequency subband signal to obtain low-frequency features of the low-frequency subband signal; performing high-frequency analysis on the high-frequency subband signal to obtain high-frequency features of the high-frequency subband signal, wherein the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features; and quantizing and encoding the low-frequency features to obtain a low-frequency code stream of the audio signal, and quantizing and encoding the high-frequency features to obtain a high-frequency code stream of the audio signal.
  • An embodiment of the present application provides an audio processing method, including: quantization-decoding a low-frequency code stream to obtain low-frequency features corresponding to the low-frequency code stream, and quantization-decoding a high-frequency code stream to obtain high-frequency features corresponding to the high-frequency code stream; performing feature reconstruction on the low-frequency features to obtain a low-frequency subband signal corresponding to the low-frequency features; performing high-frequency reconstruction on the high-frequency features to obtain a high-frequency subband signal corresponding to the high-frequency features; and performing subband synthesis on the low-frequency subband signal and the high-frequency subband signal to obtain a synthesized audio signal corresponding to the low-frequency code stream and the high-frequency code stream.
  • An embodiment of the present application provides an audio processing device, including:
  • a decomposition module configured to decompose the audio signal into a low-frequency subband signal and a high-frequency subband signal;
  • a feature extraction module configured to perform feature extraction on the low-frequency subband signal to obtain low-frequency features of the low-frequency subband signal;
  • a high-frequency analysis module configured to obtain high-frequency features of the high-frequency subband signal, wherein the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features;
  • and an encoding module configured to quantize and encode the low-frequency features and the high-frequency features to obtain a low-frequency code stream and a high-frequency code stream of the audio signal.
  • An embodiment of the present application further provides an audio processing device, including:
  • a feature reconstruction module configured to perform feature reconstruction on the low-frequency features to obtain the low-frequency subband signal corresponding to the low-frequency features;
  • a high-frequency reconstruction module configured to perform high-frequency reconstruction on the high-frequency features to obtain the high-frequency subband signal corresponding to the high-frequency features;
  • and a synthesis module configured to perform subband synthesis on the low-frequency subband signal and the high-frequency subband signal to obtain a synthesized audio signal corresponding to the low-frequency code stream and the high-frequency code stream.
  • An embodiment of the present application provides an electronic device for audio processing,
  • the electronic device including:
  • a memory for storing computer programs or instructions; and a processor for implementing the audio processing method provided by the embodiments of the present application when executing the computer programs or instructions stored in the memory.
  • Figure 1 is a schematic diagram of spectrum comparison under different code rates provided by an embodiment of the present application;
  • Figure 2 is a schematic architectural diagram of an audio encoding and decoding system provided by an embodiment of the present application;
  • Figure 3A is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
  • Figure 3B is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
  • Figure 4 is a schematic flowchart of an audio processing method provided by an embodiment of the present application;
  • Figure 6 is a schematic diagram of an end-to-end voice communication link provided by an embodiment of the present application;
  • Figure 7 is a schematic flowchart of a speech encoding and decoding method based on subband decomposition and neural network provided by an embodiment of the present application;
  • Figure 8 is a schematic diagram of a filter bank provided by an embodiment of the present application;
  • Figure 9A is a schematic diagram of an ordinary convolutional network provided by an embodiment of the present application;
  • Figure 9B is a schematic diagram of a dilated convolutional network provided by an embodiment of the present application;
  • Figure 10 is a schematic diagram of frequency band extension provided by an embodiment of the present application;
  • Figure 11 is a schematic diagram of a first neural network provided by an embodiment of the present application;
  • Figure 12 is a schematic diagram of a neural network structure for high-frequency subband signals provided by an embodiment of the present application;
  • Figure 13 is a schematic diagram of a second neural network provided by an embodiment of the present application.
  • The terms "first" and "second" involved herein are only used to distinguish similar objects and do not denote a specific ordering of objects. It is understood that, where permitted, the specific order or sequence of "first" and "second" may be interchanged, so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described herein.
  • Neural Network (NN): an algorithmic/mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. Such a network relies on the complexity of the system to process information by adjusting the interconnections among a large number of internal nodes.
  • Deep Learning (DL): a research direction in the field of machine learning (ML). Deep learning learns the inherent laws and representation levels of sample data; the information obtained during learning greatly helps the interpretation of data such as text, images, and sounds. Its ultimate goal is to enable machines to have the same analytical learning capabilities as humans and to recognize data such as text, images, and sounds.
  • Quantization refers to the process of approximating the continuous values of a signal (or a large number of discrete values) to a finite number of (or fewer) discrete values.
  • quantization includes vector quantization (VQ, Vector Quantization) and scalar quantization.
  • vector quantization is an effective lossy compression technology, and its theoretical basis is Shannon's rate distortion theory.
  • The basic principle of vector quantization is to replace the input vector, for transmission and storage, with the index of the codeword in the codebook that best matches it; only a simple table-lookup operation is required during decoding. For example, several scalar data are grouped into a vector space, the vector space is divided into several small regions, and during quantization each input vector that falls into a small region is replaced with the corresponding index.
  • Scalar quantization is the quantization of a scalar, that is, one-dimensional vector quantization: the dynamic range is divided into several small intervals, each of which has a representative value (i.e., index); when the input signal falls into a certain interval, it is quantized to that representative value. The sketch below illustrates both ideas.
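  • The following Python sketch is illustrative only and not part of the application; the 4-codeword codebook is hypothetical. It shows a uniform scalar quantizer that maps each sample to the representative value of its interval, and a toy vector quantizer that replaces each input vector with the index of its nearest codeword, so that decoding is a simple table lookup.

```python
import numpy as np

def scalar_quantize(x, x_min=-1.0, x_max=1.0, levels=16):
    """Uniform scalar quantization: map each sample to the index of its interval."""
    step = (x_max - x_min) / levels
    return np.clip(np.floor((x - x_min) / step), 0, levels - 1).astype(int)

def scalar_dequantize(idx, x_min=-1.0, x_max=1.0, levels=16):
    """Reconstruct each sample as the representative value (midpoint) of its interval."""
    step = (x_max - x_min) / levels
    return x_min + (idx + 0.5) * step

def vector_quantize(vectors, codebook):
    """Replace each input vector with the index of its nearest codeword (L2 distance)."""
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
    return np.argmin(dists, axis=1)

# Hypothetical 4-codeword codebook of 2-D vectors.
codebook = np.array([[0.0, 0.0], [0.5, 0.5], [-0.5, 0.5], [0.0, -0.7]])
x = np.array([[0.45, 0.52], [-0.4, 0.6]])
indices = vector_quantize(x, codebook)   # only these indices are transmitted/stored
reconstructed = codebook[indices]        # decoding is a simple table lookup
```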
  • Entropy coding: a lossless coding method that, in accordance with the entropy principle, loses no information during the coding process; it is also a key module in lossy encoders, located at the end of the encoder. Entropy coding includes Shannon coding, Huffman coding, Exp-Golomb coding, and arithmetic coding.
  • Quadrature Mirror Filter (QMF) bank: a filter pair that includes analysis and synthesis filters.
  • The QMF analysis filter is used for subband signal decomposition, reducing the signal bandwidth so that each subband signal can be processed smoothly through its respective channel;
  • the QMF synthesis filter is used to synthesize the subband signals recovered at the decoder, for example reconstructing the original audio signal through zero-value interpolation and band-pass filtering.
  • Speech coding technology uses less network bandwidth resources to transmit as much speech information as possible.
  • The compression rate of a voice codec can reach more than 10x. That is, after the original 10 megabytes (MB) of voice data are compressed by the encoder, only 1 MB of voice data needs to be transmitted, which greatly reduces the bandwidth resources consumed in transmitting the information.
  • For example, for a speech signal sampled at 16000 Hz with 16-bit samples, the code rate of the uncompressed version (the amount of data transmitted per unit time) is 16000 × 16 = 256 kilobits per second (kbps). If speech coding technology is used, even lossy coding, the quality of the reconstructed speech signal within the code rate range of 10-20 kbps can be close to the uncompressed version, with even no audible difference. If a higher-sampling-rate service is required, such as 32000 Hz ultra-wideband voice, the bit rate range must reach at least 30 kbps.
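  • To make the arithmetic above concrete, a minimal sketch (assuming 16-bit samples):

```python
# Uncompressed code rate of 16000 Hz speech with 16-bit samples.
sample_rate_hz = 16000
bits_per_sample = 16
uncompressed_kbps = sample_rate_hz * bits_per_sample / 1000   # 256.0 kbps

# Lossy coding at 10-20 kbps then implies a compression ratio of roughly:
for coded_kbps in (10, 20):
    print(f"{coded_kbps} kbps -> {uncompressed_kbps / coded_kbps:.1f}x compression")
```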
  • Figure 1 gives a schematic diagram of spectrum comparison under different bit rates to demonstrate the relationship between compression bit rate and quality.
  • Curve 101 is the spectrum curve of the original speech, that is, the signal without compression;
  • curve 102 is the spectrum curve of the OPUS encoder at a code rate of 20 kbps;
  • curve 103 is the spectrum curve of the OPUS encoder at a code rate of 6 kbps.
  • Speech coding can directly encode speech waveform samples sample by sample; alternatively, based on the principles of human vocalization, relevant low-dimensional features can be extracted, the encoding end encodes these features, and the decoding end reconstructs the speech signal parametrically from them.
  • Embodiments of the present application provide an audio processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can improve coding efficiency. Exemplary applications of the electronic device provided by the embodiments of the present application are described below.
  • The electronic device provided by the embodiments of the present application can be implemented as a terminal device, as a server, or collaboratively by a terminal device and a server. The following description takes the implementation of the electronic device as a terminal device as an example.
  • FIG. 2 is a schematic architectural diagram of an audio encoding and decoding system 100 provided by an embodiment of the present application.
  • The audio encoding and decoding system 100 includes: a server 200, a network 300, a terminal device 400 (i.e., the encoding end), and a terminal device 500 (i.e., the decoding end), where the network 300 may be a local area network, a wide area network, or a combination of the two.
  • the client 410 runs on the terminal device 400.
  • the client 410 may be various types of clients, such as instant messaging clients, network conferencing clients, live broadcast clients, browsers, etc.
  • The client 410 calls the microphone of the terminal device 400 to collect an audio signal, and encodes the collected audio signal to obtain a code stream.
  • The client 410 calls the audio processing method provided by the embodiment of the present application to encode the collected audio signal: the audio signal is decomposed into subbands to obtain the low-frequency subband signal and the high-frequency subband signal of the audio signal;
  • feature extraction is performed on the low-frequency subband signal to obtain the low-frequency features of the low-frequency subband signal;
  • high-frequency analysis is performed on the high-frequency subband signal to obtain the high-frequency features of the high-frequency subband signal,
  • where the feature dimension of the high-frequency features is lower than that of the low-frequency features;
  • the low-frequency features are quantized and encoded to obtain the low-frequency code stream of the audio signal;
  • and the high-frequency features are quantized and encoded to obtain the high-frequency code stream of the audio signal.
  • In this way, the encoding end (that is, the terminal device 400) performs differentiated signal processing on the low-frequency subband signal and the high-frequency subband signal, so that the feature dimension of the high-frequency features is lower than that of the low-frequency features;
  • the low-frequency features and the dimension-reduced high-frequency features are then quantized and encoded separately, improving audio encoding efficiency while ensuring audio quality.
  • The client 410 can send the code streams (i.e., the low-frequency code stream and the high-frequency code stream) to the server 200 through the network 300, so that the server 200 forwards the code streams to the terminal device 500 associated with the recipient (such as a network conference participant, a viewer, or the other party of a voice call).
  • After receiving the code streams sent by the server 200, the client 510 (such as an instant messaging client, a network conferencing client, a live broadcast client, or a browser) can decode the code streams to obtain the audio signal, thereby realizing audio communication.
  • The client 510 calls the audio processing method provided by the embodiment of the present application to decode the received code streams: the low-frequency code stream is quantization-decoded to obtain the low-frequency features corresponding to the low-frequency code stream, and the high-frequency code stream is quantization-decoded to obtain the high-frequency features corresponding to the high-frequency code stream;
  • feature reconstruction is performed on the low-frequency features to obtain the low-frequency subband signal corresponding to the low-frequency features, and high-frequency reconstruction is performed on the high-frequency features to obtain the high-frequency subband signal corresponding to the high-frequency features; subband synthesis then yields the synthesized audio signal.
  • Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, and application technology based on the cloud computing business model. It can form a resource pool that is used on demand, flexibly and conveniently, and cloud computing technology will become an important support for it.
  • For example, the service interaction functions involving the above server 200 can be realized through cloud technology.
  • The server 200 shown in Figure 2 can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • the terminal device 400 and the terminal device 500 shown in FIG. 2 can be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, etc., but are not limited thereto.
  • the terminal devices (such as the terminal device 400 and the terminal device 500) and the server 200 can be connected directly or indirectly through wired or wireless communication methods, which are not limited in the embodiments of this application.
  • the terminal device or server 200 can also implement the audio processing method provided by the embodiments of the present application by running a computer program.
  • The computer program can be a native program or software module in the operating system; it can be a native application (APP), that is, a program that needs to be installed in the operating system to run, such as a live broadcast APP, a network conferencing APP, or an instant messaging APP; it can also be a mini program, that is, a program that only needs to be downloaded into the browser environment to run.
  • In general, the computer program described above can be any form of application, module, or plug-in.
  • In some embodiments, multiple servers may form a blockchain network, and the server 200 is a node on the blockchain network.
  • Data related to the audio processing method provided by the embodiments of the present application can be saved on the blockchain network; any operation on the data by any server needs to be confirmed by the other servers through a consensus algorithm, thereby preventing unilateral tampering of the data and avoiding unnecessary data leakage.
  • The processor 520 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
  • User interface 540 includes one or more output devices 541 that enable the presentation of media content, including one or more speakers and/or one or more visual displays.
  • User interface 540 also includes one or more input devices 542, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, and other input buttons and controls.
  • Presentation module 553, for enabling the presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 541 (e.g., display screens, speakers, etc.) associated with the user interface 540;
  • Figure 3B shows the audio processing device 555 stored in the memory 550, which may include a decoding module 5555, a feature reconstruction module 5556, a high-frequency reconstruction module 5557, and a synthesis module 5558 for implementing the audio decoding function. These modules are logical, and can therefore be arbitrarily combined or further split according to the functions implemented.
  • the input signal is sampled to obtain a sampling signal x(n) including 640 sample points.
  • The low-pass filtered signal is down-sampled to obtain the low-frequency subband signal x_LB(n) of the audio signal, and the high-pass filtered signal is down-sampled to obtain the high-frequency subband signal x_HB(n) of the audio signal.
  • The effective bandwidths of the low-frequency subband signal x_LB(n) and the high-frequency subband signal x_HB(n) are 0-8 kHz and 8-16 kHz respectively, and each subband signal contains 320 sample points.
  • In step 102, feature extraction is performed on the low-frequency subband signal to obtain the low-frequency features of the low-frequency subband signal.
  • A neural network model is called based on the low-frequency subband signal x_LB(n) to generate a lower-dimensional feature vector F_LB(n), that is, the low-frequency features.
  • The input low-frequency subband signal x_LB(n) is convolved through causal convolution to obtain a 24×320 convolution feature.
  • The 24×320 convolution feature is pooled (i.e., preprocessed) with a factor of 2 to obtain a 24×160 pooled feature.
  • The 24×160 pooled feature is downsampled to obtain a 192×1 downsampled feature.
  • Finally, a 56-dimensional feature vector F_LB(n) is obtained.
  • In some embodiments, the downsampling is implemented through multiple cascaded coding layers. Downsampling the pooled features to obtain the downsampled features of the low-frequency subband signal can be achieved in the following manner: the pooled features are downsampled through the first coding layer among the multiple cascaded coding layers; the downsampling result of the first coding layer is output to the subsequent cascaded coding layers, which continue downsampling and outputting downsampling results until the last coding layer; and the downsampling result output by the last coding layer is used as the downsampled feature of the low-frequency subband signal.
  • For example, three cascaded coding blocks (i.e., coding layers) are used, each with its own downsampling factor (Down_factor); in the first coding block, the 24×160 pooled features are downsampled to obtain a 48×40 downsampling result, as the sketch below illustrates.
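  • A minimal PyTorch sketch of such an encoder follows. It is not the application's network: the channel widths, the downsampling factors (4, 4, 10), the dilation rate, and the final 1×1 convolution are assumptions chosen only to reproduce the 24×320 -> 24×160 -> 48×40 -> 192×1 -> 56 shapes described above.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution padded on the left only, so output[t] depends on input[<=t]."""
    def __init__(self, in_ch, out_ch, kernel=3, dilation=1):
        super().__init__()
        self.pad = nn.ConstantPad1d(((kernel - 1) * dilation, 0), 0.0)
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, dilation=dilation)

    def forward(self, x):
        return self.conv(self.pad(x))

class CodingBlock(nn.Module):
    """Coding layer: dilated causal convolution, then pooling to complete downsampling."""
    def __init__(self, in_ch, out_ch, down_factor, dilation=3):
        super().__init__()
        self.conv = CausalConv1d(in_ch, out_ch, dilation=dilation)
        self.pool = nn.AvgPool1d(down_factor)

    def forward(self, x):
        return self.pool(torch.relu(self.conv(x)))

class LowBandEncoder(nn.Module):
    """Hypothetical first NN: one 320-sample low-band frame -> 56-dim feature vector."""
    def __init__(self):
        super().__init__()
        self.stem = CausalConv1d(1, 24)           # 1x320  -> 24x320
        self.pre_pool = nn.AvgPool1d(2)           # 24x320 -> 24x160 (pooling factor 2)
        self.blocks = nn.Sequential(              # cascaded coding layers
            CodingBlock(24, 48, down_factor=4),   # 24x160 -> 48x40
            CodingBlock(48, 96, down_factor=4),   # 48x40  -> 96x10  (assumed factor)
            CodingBlock(96, 192, down_factor=10), # 96x10  -> 192x1  (assumed factor)
        )
        self.head = nn.Conv1d(192, 56, 1)         # 192x1  -> 56-dim F_LB(n)

    def forward(self, x):                         # x: (batch, 1, 320)
        z = self.blocks(self.pre_pool(torch.relu(self.stem(x))))
        return self.head(z).squeeze(-1)           # (batch, 56)

features = LowBandEncoder()(torch.randn(1, 1, 320))   # torch.Size([1, 56])
```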
  • In step 103, high-frequency analysis is performed on the high-frequency subband signal to obtain the high-frequency features of the high-frequency subband signal.
  • The feature dimension of the high-frequency features is lower than that of the low-frequency features. Since the low-frequency subband signal has a greater impact on audio quality than the high-frequency subband signal, differentiated signal processing is performed on the low-frequency subband signal and the high-frequency subband signal, so that the feature dimension of the high-frequency features is lower than that of the low-frequency features. The dimension of the high-frequency features is also smaller than the dimension of the high-frequency subband signal itself: high-frequency analysis reduces the dimension of the high-frequency subband signal and thus realizes data compression.
  • In some embodiments, performing high-frequency analysis on the high-frequency subband signal to obtain its high-frequency features can be achieved by calling a first neural network model to perform feature extraction on the high-frequency subband signal;
  • alternatively, it can be achieved by performing frequency band extension on the high-frequency subband signal to obtain the high-frequency features of the high-frequency subband signal.
  • Frequency band extension can be used to recover wideband speech signals from band-limited narrowband speech signals; here it serves to compactly extract the high-frequency features of the high-frequency subband signal.
  • In some embodiments, performing frequency band extension on the high-frequency subband signal to obtain its high-frequency features can be achieved in the following manner: perform frequency domain transformation based on the multiple sample points included in the high-frequency subband signal to obtain the transformation coefficients corresponding to the multiple sample points; divide the transformation coefficients corresponding to the multiple sample points into multiple subbands; calculate the average value of the transformation coefficients included in each subband to obtain the average energy corresponding to each subband, and use the average energy as the subband spectral envelope corresponding to each subband; and determine the subband spectral envelopes corresponding to the multiple subbands as the high-frequency features of the high-frequency subband signal.
  • The frequency domain transformation methods in the embodiments of the present application include the Modified Discrete Cosine Transform (MDCT), the Discrete Cosine Transform (DCT), the Fast Fourier Transform (FFT), and so on.
  • The embodiments of the present application do not limit the frequency domain transformation method.
  • The types of average values calculated in the embodiments of the present application include the arithmetic mean and the geometric mean.
  • The embodiments of the present application do not limit how the average value is calculated.
  • In some embodiments, the frequency domain transformation performed on the multiple sample points included in the high-frequency subband signal to obtain their corresponding transformation coefficients includes: obtaining the reference high-frequency subband signal of a reference audio signal, where the reference audio signal is an audio signal adjacent to the audio signal; and performing a discrete cosine transform on the multiple sample points included in the high-frequency subband signal, based on the multiple sample points included in the reference high-frequency subband signal and the multiple sample points included in the high-frequency subband signal, to obtain the transformation coefficients corresponding to the multiple sample points included in the high-frequency subband signal.
  • The process of calculating the average value of the transformation coefficients included in each subband is as follows: determine the sum of squares of the transformation coefficients corresponding to the sample points included in each subband, and divide the sum of squares by the number of sample points included in the subband to obtain the average energy corresponding to each subband.
  • The modified discrete cosine transform is called to generate 320-point MDCT coefficients (that is, the transformation coefficients corresponding to the multiple sample points included in the high-frequency subband signal).
  • For example, the transform is performed based on the (n+1)-th frame of high-frequency data (i.e., from the reference audio signal) and the n-th frame of high-frequency data (i.e., from the current audio signal).
  • A subband is a group of multiple adjacent MDCT coefficients.
  • For example, the 320 MDCT coefficients can be divided into 8 subbands.
  • The 320 points can be divided evenly, that is, each subband includes the same number of points.
  • The embodiment of the present application can also divide the 320 points non-uniformly.
  • For example, the subbands with lower frequencies include fewer MDCT coefficients (higher frequency resolution), and the subbands with higher frequencies include more MDCT coefficients (lower frequency resolution).
  • After the 320-point MDCT coefficients are divided into 8 subbands, 8 subband spectral envelopes can be obtained. These 8 subband spectral envelopes constitute the feature vector F_HB(n) of the high-frequency subband signal, that is, the high-frequency features. A sketch of this computation follows.
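  • The following Python sketch of the envelope computation uses a plain DCT as the frequency-domain transform for brevity (the embodiments also permit MDCT or FFT); the uniform 8-way split is the even division described above.

```python
import numpy as np
from scipy.fft import dct

def subband_envelopes(x_hb, num_subbands=8):
    """Divide the transform coefficients into subbands and use each subband's
    average energy (sum of squares / number of points) as its spectral envelope."""
    coeffs = dct(x_hb, norm="ortho")              # 320 transform coefficients
    bands = np.array_split(coeffs, num_subbands)  # uniform split: 8 bands of 40
    return np.array([np.sum(b ** 2) / len(b) for b in bands])

x_hb = np.random.randn(320)      # one 320-sample high-frequency subband frame
f_hb = subband_envelopes(x_hb)   # 8-dim feature vector F_HB(n)
```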
  • In some embodiments, the low-frequency features are quantized to obtain the index values of the low-frequency features, and the index values of the low-frequency features are entropy-encoded to obtain the low-frequency code stream of the audio signal; the high-frequency features are quantized to obtain the index values of the high-frequency features, and the index values of the high-frequency features are entropy-encoded to obtain the high-frequency code stream of the audio signal.
  • In this way, the feature dimension of the low-frequency features, which have a greater impact on the audio signal, is kept higher than that of the high-frequency features, ensuring the quality of the encoded audio signal; meanwhile, the feature dimension of the high-frequency features, which have less impact on audio quality, is reduced, which reduces the amount of data to be quantized and encoded and improves coding efficiency.
  • the low-frequency code stream and the high-frequency code stream are obtained by encoding the sub-band signals obtained after the audio signal is decomposed into sub-bands.
  • the feature dimension of the high-frequency feature is lower than the feature dimension of the low-frequency feature.
  • In step 202, feature reconstruction is performed on the low-frequency features to obtain the low-frequency subband signal corresponding to the low-frequency features.
  • Feature reconstruction is the inverse process of feature extraction; performing feature reconstruction on the low-frequency features yields (an estimate of) the low-frequency subband signal corresponding to the low-frequency features.
  • In step 203, high-frequency reconstruction is performed on the high-frequency features to obtain the high-frequency subband signal corresponding to the high-frequency features.
  • When the encoding end has obtained the high-frequency features by performing feature extraction on the high-frequency subband signal, the decoding end performs feature reconstruction on the high-frequency features to obtain the high-frequency subband signal corresponding to the high-frequency features.
  • the frequency domain transformation method in the embodiment of the present application includes modified discrete cosine transform (MDCT, Modified Discrete Cosine Transform), discrete cosine transform (DCT, Discrete Cosine Transform), fast Fourier transform (FFT, Fast Fourier Transform) etc.
  • In some embodiments, amplifying the reference transformation coefficients of the reference high-frequency subband signal to obtain the amplified reference transformation coefficients can be achieved in the following manner: based on the subband spectral envelopes corresponding to the high-frequency features, divide the reference transformation coefficients of the reference high-frequency subband signal into multiple subbands, and perform the following processing for each of the multiple subbands: determine the first average energy corresponding to the subband spectral envelope and the second average energy corresponding to the subband; determine the amplification factor based on the ratio of the first average energy to the second average energy; and multiply each reference transformation coefficient included in the subband by the amplification factor to obtain the amplified reference transformation coefficients.
  • Then the 320-point MDCT coefficients generated from x′_LB(n) are copied to generate the MDCT coefficients of the high-frequency part (i.e., the reference transformation coefficients of the reference high-frequency subband signal).
  • The 8 subband spectral envelopes obtained previously (that is, the 8 subband spectral envelopes obtained by querying the quantization table) are the subband spectral envelopes corresponding to the high-frequency features.
  • These 8 sub-band spectrum envelopes correspond to 8 high-frequency subbands
  • The reference values of the generated 320-point MDCT coefficients of the reference high-frequency subband signal are divided into 8 reference high-frequency subbands (that is, the reference transformation coefficients of the reference high-frequency subband signal are divided into multiple subbands).
  • Suppose the average energy of a high-frequency subband of the reference high-frequency subband signal generated by copying is Y_L, and the average energy of the high-frequency subband that currently needs to be amplified (i.e., the high-frequency subband corresponding to the subband spectral envelope decoded from the code stream) is Y_H;
  • then the amplification factor is calculated as a = sqrt(Y_H / Y_L), and with the amplification factor a, each point in the copied reference high-frequency subband is directly multiplied by a.
  • The inverse MDCT is then called to generate the estimated value x′_HB(n) of the high-frequency subband signal (that is, the high-frequency subband signal corresponding to the high-frequency features), as the sketch below summarizes.
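  • The following sketch summarizes this band-extension decoding path, again using a plain DCT in place of the MDCT; the epsilon guard is an added safety detail, not from the application.

```python
import numpy as np
from scipy.fft import dct, idct

def reconstruct_high_band(x_lb_decoded, envelopes_hb, num_subbands=8):
    """Copy the low-band spectrum, then shape each copied subband so that its
    average energy matches the decoded envelope Y_H, via a = sqrt(Y_H / Y_L)."""
    ref = dct(x_lb_decoded, norm="ortho")      # reference transform coefficients
    bands = np.array_split(ref, num_subbands)  # 8 reference high-frequency subbands
    shaped = []
    for band, y_h in zip(bands, envelopes_hb):
        y_l = np.sum(band ** 2) / len(band)    # average energy of the copied band
        a = np.sqrt(y_h / max(y_l, 1e-12))     # amplification factor
        shaped.append(a * band)                # multiplication in the frequency domain
    return idct(np.concatenate(shaped), norm="ortho")   # inverse transform -> x'_HB(n)
```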
  • In step 204, subband synthesis is performed on the low-frequency subband signal and the high-frequency subband signal to obtain the synthesized audio signal corresponding to the low-frequency code stream and the high-frequency code stream.
  • subband synthesis is the reverse process of subband decomposition.
  • the decoder performs subband synthesis on low-frequency subband signals and high-frequency subband signals to restore the audio signal, where the synthesized audio signal is the restored audio signal.
  • In some embodiments, performing subband synthesis on the low-frequency subband signal and the high-frequency subband signal to obtain the synthesized audio signal corresponding to the low-frequency code stream and the high-frequency code stream can be achieved in the following manner: upsample the low-frequency subband signal to obtain a low-pass filtered signal; upsample the high-frequency subband signal to obtain a high-pass filtered signal; and filter and synthesize the low-pass filtered signal and the high-pass filtered signal to obtain the synthesized audio signal.
  • the low-frequency sub-band signal and the high-frequency sub-band signal are sub-band synthesized through the QMF synthesis filter to restore the audio signal.
  • the embodiments of this application can be applied to various audio scenarios, such as voice calls, instant messaging, etc.
  • the following takes a voice call as an example for explanation.
  • Embodiments of this application provide a speech encoding and decoding method (i.e., an audio processing method) based on subband decomposition and neural networks. Based on the characteristics of the speech signal, a speech signal with a specific sampling rate is decomposed into a low-frequency subband signal and a high-frequency subband signal, and different subband signals can be compressed using different data compression mechanisms. For the important part (the low-frequency subband signal), neural network (NN) processing is used to obtain a feature vector of lower dimension than the input low-frequency subband signal; for the relatively unimportant part (the high-frequency subband signal), fewer bits are used for encoding.
  • the embodiment of the present application can be applied to the voice communication link as shown in Figure 6.
  • The voice codec technology involved in the embodiment of the present application is deployed in the encoding and decoding parts to provide the basic functions of speech compression.
  • the encoder is deployed on the uplink client 601, and the decoder is deployed on the downlink client 602.
  • Voice is collected through the uplink client 601, preprocessing enhancement, encoding, and other processing are performed, and the encoded code stream is transmitted to the downlink client 602 through the network.
  • The downlink client 602 performs decoding, enhancement, and other processing, so that the decoded speech is played back on the downlink client 602.
  • A transcoder needs to be deployed in the background of the system (that is, on the server) to solve the problem of interconnection between the new encoder and existing encoders.
  • For example, if the sending end (the upstream client) uses the encoder of the embodiments of the present application while the receiving end (the downstream client) is a Public Switched Telephone Network (PSTN) device using G.722, then in the background the NN decoder needs to be executed to generate the speech signal, and the G.722 encoder is then called to generate a specific code stream, implementing the transcoding function so that the receiving end can correctly decode the specific code stream.
  • The encoding end performs the following processing: an analysis filter is used to decompose the input speech signal x(n) of the n-th frame into a low-frequency subband signal x_LB(n) and a high-frequency subband signal x_HB(n).
  • For the low-frequency subband signal, the first NN is called to obtain a low-dimensional feature vector F_LB(n).
  • The dimension of the feature vector F_LB(n) is smaller than the dimension of the low-frequency subband signal, so as to reduce the amount of data.
  • For example, the first NN can be implemented as a dilated convolutional network (Dilated CNN). The embodiments of this application do not exclude other NN structures, such as an autoencoder, a fully connected (FC) network, a long short-term memory (LSTM) network, a convolutional neural network (CNN) combined with LSTM, and so on.
  • The high-frequency subband signal x_HB(n) can use other solutions to extract the feature vector F_HB(n).
  • For example, frequency band extension technology based on speech signal analysis can represent the high-frequency subband signal with only a 1-2 kbps code rate; alternatively, the same NN structure as for the low-frequency subband signal, or a more streamlined network (e.g., one whose output feature vector is smaller than the low-frequency feature vector F_LB(n)), can be used.
  • The decoding end performs the following processing: the code stream received by the decoding end is decoded to obtain the estimated value F′_LB(n) of the low-frequency feature vector and the estimated value F′_HB(n) of the high-frequency feature vector.
  • The second NN is called based on the estimated value F′_LB(n) of the low-frequency feature vector to generate the low-frequency subband signal estimate x′_LB(n).
  • High-frequency reconstruction is called based on the estimated value F′_HB(n) of the high-frequency feature vector to generate the high-frequency subband signal estimate x′_HB(n).
  • QMF synthesis filtering is called to generate the reconstructed speech signal x′(n).
  • Before specifically introducing the speech encoding and decoding method based on subband decomposition and neural network provided by the embodiment of the present application, the QMF filter bank, the dilated convolutional network, and the frequency band extension technology are introduced below.
  • a QMF filter bank is a pair of analysis-synthesis filters.
  • the input signal with a sampling rate of Fs can be decomposed into two signals with a sampling rate of Fs/2, representing the QMF low-pass signal and the QMF high-pass signal respectively.
  • the spectral response of the low-pass part H_Low(z) and the high-pass part H_High(z) of the QMF filter is shown in Figure 8.
  • Here, h_Low(k) represents the coefficients of the low-pass filter, and h_High(k) represents the coefficients of the high-pass filter.
  • Based on the QMF analysis filter bank H_Low(z) and H_High(z), the QMF synthesis filter bank can be described as shown in equation (2):
  • G_Low(z) = H_Low(z); G_High(z) = (-1) * H_High(z)   (2)
  • where G_Low(z) is used to recover the low-pass signal,
  • and G_High(z) is used to recover the high-pass signal.
  • the low-pass signal and high-pass signal recovered at the decoding end are synthesized through the QMF synthesis filter bank, and the reconstructed signal with the sampling rate Fs corresponding to the input signal can be recovered.
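  • A minimal two-channel QMF sketch follows; the 64-tap firwin prototype is a hypothetical stand-in, since production codecs use carefully optimized QMF prototypes, so this reconstruction is only approximate and delayed by the filter order.

```python
import numpy as np
from scipy.signal import firwin, lfilter

h_low = firwin(64, 0.5)                           # prototype low-pass filter
h_high = h_low * (-1) ** np.arange(len(h_low))    # mirror filter: H_High(z) = H_Low(-z)

def qmf_analysis(x):
    """Split x (rate Fs) into low/high subband signals at rate Fs/2."""
    low = lfilter(h_low, 1.0, x)[::2]             # low-pass filter, then downsample by 2
    high = lfilter(h_high, 1.0, x)[::2]           # high-pass filter, then downsample by 2
    return low, high

def qmf_synthesis(low, high):
    """Recombine the subbands: zero-value interpolation, band filtering, summation."""
    up_l = np.zeros(2 * len(low))
    up_h = np.zeros(2 * len(high))
    up_l[::2] = low
    up_h[::2] = high
    g_low, g_high = h_low, -h_high                # G_Low(z)=H_Low(z), G_High(z)=-H_High(z)
    return 2.0 * (lfilter(g_low, 1.0, up_l) + lfilter(g_high, 1.0, up_h))

x = np.random.randn(640)                          # one 640-sample frame at Fs = 32 kHz
x_lb, x_hb = qmf_analysis(x)                      # two 320-sample subband signals
x_rec = qmf_synthesis(x_lb, x_hb)                 # reconstructed signal at rate Fs
```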
  • Figure 9A is a schematic diagram of a common convolutional network provided by an embodiment of the present application
  • Figure 9B is a schematic diagram of a dilated convolutional network provided by an embodiment of the present application.
  • Dilated convolution can increase the receptive field while keeping the size of the feature map unchanged, and can also avoid errors caused by upsampling and downsampling.
  • The convolution kernel sizes (Kernel Size) shown in Figure 9A and Figure 9B are both 3×3; however, the receptive field 901 of the ordinary convolution shown in Figure 9A is only 3, while the receptive field 902 of the dilated convolution shown in Figure 9B reaches 5.
  • The ordinary convolution shown in Figure 9A has a receptive field of 3 and a dilation rate (Dilation Rate, the number of intervals between points in the convolution kernel) of 1, while the dilated convolution shown in Figure 9B has a receptive field of 5 and a dilation rate of 2.
  • The convolution kernel can also move on a plane similar to Figure 9A or Figure 9B, which involves the concept of stride (Stride Rate, i.e., step size); for example, if the convolution kernel is shifted by 1 grid each time, the corresponding stride is 1.
  • The size of the dilated convolution kernel (for example, for speech signals, the kernel size can be set to 1×3), the dilation rate, the stride, and the number of channels can all be defined according to actual application needs; the embodiments of this application do not specifically limit these. The sketch below illustrates the receptive-field effect.
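  • The receptive-field arithmetic can be checked with a few lines of PyTorch (illustrative; 1-D convolutions stand in for the 1×3 speech kernels mentioned above):

```python
import torch
import torch.nn as nn

# Kernel 3 with dilation 1 sees 3 inputs; the same kernel dilated to rate 2
# sees 5 inputs, with no extra parameters and an unchanged feature-map size.
ordinary = nn.Conv1d(1, 1, kernel_size=3, dilation=1, padding=1)  # receptive field 3
dilated = nn.Conv1d(1, 1, kernel_size=3, dilation=2, padding=2)   # receptive field 5

x = torch.randn(1, 1, 320)
assert ordinary(x).shape == dilated(x).shape == x.shape           # size preserved

# Stacked layers: each layer with kernel k and dilation d adds (k - 1) * d.
def receptive_field(layers):
    return 1 + sum((k - 1) * d for k, d in layers)

print(receptive_field([(3, 1), (3, 2), (3, 4)]))                  # -> 15
```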
  • In frequency band extension, the wideband signal is first reconstructed, then the wideband signal is copied to the ultra-wideband range, and finally shaping is performed based on the ultra-wideband envelope.
  • Figure 10 shows a specific frequency-domain implementation scheme, including: 1) implementing core-layer encoding at a low sampling rate; 2) selecting the spectrum of the low-frequency part and copying it to the high frequency; and 3) performing amplification control on the copied high-frequency spectrum based on side information recorded in advance (describing the energy correlation between high and low frequencies, etc.). With a bit rate of only 1-2 kbps, the effect of doubling the sampling rate can be produced.
  • First, the input signal is obtained. For an ultra-wideband signal sampled at 32000 Hz, the input signal of the n-th frame includes 640 sample points and is recorded as input signal x(n).
  • The input signal x(n) is passed through the QMF analysis filter (2-channel QMF) and downsampled to obtain two subband signals, namely the low-frequency subband signal x_LB(n) and the high-frequency subband signal x_HB(n).
  • The effective bandwidths of the low-frequency subband signal x_LB(n) and the high-frequency subband signal x_HB(n) are 0-8 kHz and 8-16 kHz respectively, and each subband signal contains 320 sample points.
  • The low-frequency subband signal x_LB(n) is input to the first NN for data compression.
  • the size of each convolution kernel is fixed at 1 ⁇ 3 and the shift rate (Stride Rate) is 1.
  • the dilation rate (Dilation Rate) of one or more dilated convolutions can be set according to requirements, such as 3.
  • The embodiments of this application do not limit the dilation rates set for different dilated convolutions.
  • Next, high-frequency analysis is performed on the high-frequency subband signal x_HB(n).
  • The purpose of high-frequency analysis is to extract the key information of the high-frequency subband signal x_HB(n) and generate a lower-dimensional feature vector F_HB(n).
  • another NN structure similar to the first NN can be introduced to generate a low-dimensional feature vector.
  • high-frequency sub-band signals are less important to quality. Therefore, the NN structure for high-frequency sub-band signals does not need to be as complicated as the first NN.
  • the NN structure for high-frequency subband signals is similar to the structure of the first NN, but compared with the structure of the first NN, the number of channels is greatly reduced.
  • A modified discrete cosine transform (MDCT) is called to generate 320-point MDCT coefficients.
  • After the 320-point MDCT coefficients are divided into 8 subbands, 8 subband spectral envelopes can be obtained, and these 8 subband spectral envelopes are the feature vector F_HB(n) of the high-frequency subband signal.
  • For the feature vectors, scalar quantization (each component is individually quantized) and entropy coding can be performed.
  • The embodiments of the present application also do not exclude the combination of vector quantization (multiple adjacent components are combined into one vector for joint quantization) and entropy coding.
  • After quantization and entropy coding, the corresponding code stream can be generated. According to experiments, high-quality compression of 32 kHz ultra-wideband signals can be achieved at a code rate of 6-10 kbps. A rough sketch of this step follows.
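  • The following sketch of "scalar quantization followed by entropy coding" is illustrative: the uniform quantizer range and the Huffman construction are assumptions, and the application equally allows arithmetic coding or vector quantization of grouped components.

```python
import heapq
from collections import Counter
import numpy as np

def quantize_features(f, levels=64, lo=-4.0, hi=4.0):
    """Scalar quantization: each component is individually mapped to an index."""
    step = (hi - lo) / levels
    return np.clip(np.floor((f - lo) / step), 0, levels - 1).astype(int)

def huffman_code_lengths(symbols):
    """Return Huffman code lengths for the observed symbol distribution."""
    counts = Counter(symbols)
    if len(counts) == 1:                       # degenerate single-symbol case
        return {next(iter(counts)): 1}
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:                       # merge the two least frequent trees
        c1, _, a = heapq.heappop(heap)
        c2, _, b = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**a, **b}.items()}
        heapq.heappush(heap, (c1 + c2, uid, merged))
        uid += 1
    return heap[0][2]

f_lb = np.random.randn(56)                     # one frame's 56-dim feature vector
idx = quantize_features(f_lb).tolist()         # index values to be entropy coded
lengths = huffman_code_lengths(idx)
frame_bits = sum(lengths[s] for s in idx)      # entropy-coded size of this frame
```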
  • The second NN shown in Figure 13 is called to generate the estimated value x′_LB(n) of the low-frequency subband signal.
  • The second NN is similar in structure to the first NN, for example using causal convolution;
  • the post-processing structure in the second NN mirrors the pre-processing in the first NN.
  • The structure of the decoding block is symmetrical to that of the encoding block on the encoding side: the encoding block on the encoding side first performs dilated convolution and then pooling to complete downsampling, while the decoding block on the decoding side first performs pooling to complete upsampling and then performs dilated convolution, as the sketch below illustrates.
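  • A sketch of this symmetry, under the same assumptions as the encoder sketch earlier (nearest-neighbor interpolation stands in for the pooling-based upsampling):

```python
import torch
import torch.nn as nn

class DecodingBlock(nn.Module):
    """Mirror of the coding block: first upsample (inverse of pooling),
    then apply the dilated causal convolution."""
    def __init__(self, in_ch, out_ch, up_factor, dilation=3, kernel=3):
        super().__init__()
        self.up = nn.Upsample(scale_factor=up_factor, mode="nearest")
        self.pad = nn.ConstantPad1d(((kernel - 1) * dilation, 0), 0.0)
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, dilation=dilation)

    def forward(self, x):
        return torch.relu(self.conv(self.pad(self.up(x))))

# Reverse the encoder's assumed factors: 192x1 -> 96x10 -> 48x40 -> 24x160.
decoder_blocks = nn.Sequential(
    DecodingBlock(192, 96, up_factor=10),
    DecodingBlock(96, 48, up_factor=4),
    DecodingBlock(48, 24, up_factor=4),
)
print(decoder_blocks(torch.randn(1, 192, 1)).shape)   # torch.Size([1, 24, 160])
```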
  • the high-frequency reconstruction in the embodiment of this application includes two solutions.
  • In the first solution, the structure of the deep neural network is similar to the first implementation of high-frequency analysis (shown in Figure 12), for example using causal convolution, and its post-processing structure is similar to the pre-processing in that implementation.
  • The structure of the decoding block is symmetrical to that of the encoding block on the encoding side: the encoding block first performs dilated convolution and then pooling to complete downsampling, while the decoding block first performs pooling to complete upsampling and then performs dilated convolution.
  • The second implementation of high-frequency reconstruction corresponds to the second implementation of high-frequency analysis at the encoding end (i.e., the frequency band extension technology).
  • The following operations are performed: first, a 640-point MDCT similar to that on the encoding end is performed on the estimated value x′_LB(n) of the low-frequency subband signal generated by the decoder, generating 320-point MDCT coefficients (that is, the MDCT coefficients of the low-frequency part).
  • Then the 320-point MDCT coefficients generated from x′_LB(n) are copied to generate the MDCT coefficients of the high-frequency part.
  • The reference values of the MDCT coefficients of the high-frequency subband signal are divided into 8 reference high-frequency subbands, and the following processing is performed for each high-frequency subband: based on the high-frequency subband and the corresponding reference high-frequency subband, amplification control is performed on the reference values of the generated 320-point MDCT coefficients of the high-frequency subband signal (multiplication in the frequency domain).
  • The embodiments of the present application can jointly train the relevant networks of the encoding end and the decoding end on collected data to obtain optimal parameters. Users only need to prepare the data and set up the corresponding network structure; after training is completed in the background, the trained model can be put into use.
  • The speech encoding and decoding method based on subband decomposition and neural networks organically combines signal decomposition, signal processing technology, and deep neural networks; while ensuring audio quality at acceptable complexity, it significantly improves coding efficiency compared with the signal processing schemes of the related art.
  • Each functional module in the audio processing device can be implemented cooperatively by the hardware resources of an electronic device (such as a terminal device, a server, or a server cluster), including computing resources such as processors, communication resources (for example, for supporting communications in various manners such as optical cable and cellular), and memory.
  • Figures 3A and 3B show the audio processing device 555 stored in the memory 550, which can be software in the form of programs and plug-ins, for example, software modules designed in programming languages such as C/C++ or Java, application software designed in programming languages such as C/C++ or Java, or dedicated software modules, application program interfaces, plug-ins, cloud services, and the like in large software systems. Examples of different implementation manners are given below.
  • the decomposition module is configured to perform sub-band decomposition on the audio signal to obtain the low-frequency sub-band signal and the high-frequency sub-band signal of the audio signal;
  • the feature extraction module is configured to perform feature extraction on the low-frequency subband signal to obtain the low-frequency features of the low-frequency subband signal;
  • a high-frequency analysis module configured to perform high-frequency analysis on the high-frequency subband signal to obtain the high-frequency features of the high-frequency subband signal, wherein the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features;
  • the encoding module is configured to quantize and encode the low-frequency features to obtain the low-frequency code stream of the audio signal, and to quantize and encode the high-frequency features to obtain the high-frequency code stream of the audio signal.
  • In some embodiments, the decomposition module is further configured to sample the audio signal to obtain a sampled signal, wherein the sampled signal includes multiple sample points obtained by sampling; perform low-pass filtering on the sampled signal to obtain a low-pass filtered signal; downsample the low-pass filtered signal to obtain the low-frequency subband signal of the audio signal; perform high-pass filtering on the sampled signal to obtain a high-pass filtered signal; and downsample the high-pass filtered signal to obtain the high-frequency subband signal of the audio signal.
  • the feature extraction module is further configured to convolve the low-frequency sub-band signal to obtain the convolution feature of the low-frequency sub-band signal; perform pooling on the convolution feature to obtain the Pooling characteristics of the low-frequency sub-band signal; down-sampling the pooling characteristics to obtain the down-sampling characteristics of the low-frequency sub-band signal; performing convolution on the down-sampling characteristics to obtain the low-frequency characteristics of the low-frequency sub-band signal feature.
  • In some embodiments, the downsampling is implemented through multiple cascaded coding layers; the feature extraction module is further configured to downsample the pooled features through the first coding layer among the multiple cascaded coding layers, output the downsampling result of the first coding layer to the subsequent cascaded coding layers, and continue downsampling and outputting downsampling results through the subsequent cascaded coding layers until the last coding layer; the downsampling result output by the last coding layer is used as the downsampled feature of the low-frequency subband signal.
  • In some embodiments, the high-frequency analysis module is further configured to call a first neural network model and perform feature extraction on the high-frequency subband signal through the first neural network model to obtain the high-frequency features of the high-frequency subband signal;
  • or to perform frequency band extension on the high-frequency subband signal to obtain the high-frequency features of the high-frequency subband signal.
  • In some embodiments, the high-frequency analysis module is further configured to perform frequency domain transformation based on the multiple sample points included in the high-frequency subband signal to obtain the transformation coefficients corresponding to the multiple sample points; divide the transformation coefficients corresponding to the multiple sample points into multiple subbands; average the transformation coefficients included in each subband to obtain the average energy corresponding to each subband, and use the average energy as the subband spectral envelope corresponding to each subband; and determine the subband spectral envelopes corresponding to the multiple subbands as the high-frequency features of the high-frequency subband signal.
  • In some embodiments, the high-frequency analysis module is further configured to determine the sum of squares of the transformation coefficients corresponding to the sample points included in each subband, and to determine the ratio of the sum of squares to the number of sample points included in the subband to obtain the average energy corresponding to each subband.
  • In some embodiments, the encoding module is further configured to quantize the low-frequency features to obtain the index values of the low-frequency features, and to entropy-encode the index values of the low-frequency features to obtain the low-frequency code stream of the audio signal;
  • the quantizing and encoding of the high-frequency features to obtain the high-frequency code stream of the audio signal includes: quantizing the high-frequency features to obtain the index values of the high-frequency features, and entropy-encoding the index values of the high-frequency features to obtain the high-frequency code stream of the audio signal.
  • the audio processing device 555 shown in FIG. 3B includes a series of modules: a decoding module 5555, a feature reconstruction module 5556, a high-frequency reconstruction module 5557, and a synthesis module 5558. The following continues to describe how the modules of the audio processing device 555 provided by the embodiments of the present application cooperate to implement audio decoding.
  • the decoding module is configured to perform quantized decoding on the low-frequency code stream to obtain the low-frequency features corresponding to the low-frequency code stream, and to perform quantized decoding on the high-frequency code stream to obtain the high-frequency features corresponding to the high-frequency code stream; wherein the low-frequency code stream and the high-frequency code stream are obtained by separately encoding the sub-band signals obtained by sub-band decomposition of the audio signal, and the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features.
  • a feature reconstruction module is configured to perform feature reconstruction on the low-frequency features to obtain low-frequency subband signals corresponding to the low-frequency features.
  • a high-frequency reconstruction module is configured to perform high-frequency reconstruction on the high-frequency features to obtain a high-frequency sub-band signal corresponding to the high-frequency features.
  • a synthesis module is configured to perform sub-band synthesis on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain a synthesized audio signal corresponding to the low-frequency code stream and the high-frequency code stream.
  • the feature reconstruction module is further configured to perform convolution on the low-frequency features to obtain convolution features of the low-frequency features; perform up-sampling on the convolution features to obtain up-sampled features of the low-frequency features; perform pooling on the up-sampled features to obtain pooled features of the low-frequency features; and perform convolution on the pooled features to obtain the low-frequency sub-band signal corresponding to the low-frequency features.
  • the up-sampling is implemented through multiple cascaded decoding layers; the feature reconstruction module is further configured to up-sample the convolution features through the first decoding layer among the multiple cascaded decoding layers; output the up-sampling result of the first decoding layer to the subsequent cascaded decoding layers, which continue up-sampling and outputting up-sampling results until the last decoding layer; and use the up-sampling result output by the last decoding layer as the up-sampled features of the low-frequency features.
  • the decoding module is further configured to perform entropy decoding on the low-frequency code stream to obtain index values corresponding to the low-frequency code stream; perform inverse quantization on the index values corresponding to the low-frequency code stream to obtain the low-frequency features corresponding to the low-frequency code stream; perform entropy decoding on the high-frequency code stream to obtain index values corresponding to the high-frequency code stream; and perform inverse quantization on the index values corresponding to the high-frequency code stream to obtain the high-frequency features corresponding to the high-frequency code stream.
  • Embodiments of the present application provide a computer program product or computer program, which includes a computer program or instructions, and the computer program or instructions are stored in a computer-readable storage medium.
  • the processor of the electronic device reads the computer program or instructions from the computer-readable storage medium and executes them, so that the electronic device performs the audio processing method described above in the embodiments of the present application.
  • a computer program or instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An audio processing method, an apparatus (555), an electronic device (500), a computer-readable storage medium (550), and a computer program product. The method includes: performing sub-band decomposition on an audio signal to obtain a low-frequency sub-band signal and a high-frequency sub-band signal of the audio signal (101); performing feature extraction on the low-frequency sub-band signal to obtain low-frequency features of the low-frequency sub-band signal (102); performing high-frequency analysis on the high-frequency sub-band signal to obtain high-frequency features of the high-frequency sub-band signal (103), wherein the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features; performing quantized coding on the low-frequency features to obtain a low-frequency code stream of the audio signal; and performing quantized coding on the high-frequency features to obtain a high-frequency code stream of the audio signal (104).

Description

Audio processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on and claims priority to Chinese patent application No. 202210681365.X, filed on June 15, 2022, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to data processing technology, and in particular to an audio processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
BACKGROUND
Audio coding/decoding technology is a core technology in communication services, including remote audio and video calls. Speech coding technology, simply put, uses fewer network bandwidth resources to convey as much speech information as possible. From the perspective of Shannon information theory, speech coding is a form of source coding; the purpose of source coding is to compress, on the encoding side, the amount of data carrying the information to be conveyed as much as possible, removing redundancy from the information, while still allowing it to be recovered losslessly (or nearly losslessly) on the decoding side.
However, the related art offers no effective solution for improving the efficiency of audio coding while guaranteeing audio quality.
SUMMARY
Embodiments of this application provide an audio processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can improve audio coding efficiency while guaranteeing audio quality.
The technical solutions of the embodiments of this application are implemented as follows:
An embodiment of this application provides an audio processing method, including:
decomposing an audio signal into a low-frequency sub-band signal and a high-frequency sub-band signal;
acquiring low-frequency features of the low-frequency sub-band signal;
acquiring high-frequency features of the high-frequency sub-band signal, wherein the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features;
performing quantized coding on the low-frequency features to obtain a low-frequency code stream of the audio signal; and
performing quantized coding on the high-frequency features to obtain a high-frequency code stream of the audio signal.
An embodiment of this application provides an audio processing method, including:
performing quantized decoding on a low-frequency code stream to obtain low-frequency features corresponding to the low-frequency code stream;
performing quantized decoding on a high-frequency code stream to obtain high-frequency features corresponding to the high-frequency code stream, wherein the low-frequency code stream and the high-frequency code stream are obtained by separately encoding sub-band signals obtained by sub-band decomposition of an audio signal, and the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features;
performing feature reconstruction on the low-frequency features to obtain a low-frequency sub-band signal corresponding to the low-frequency features;
performing high-frequency reconstruction on the high-frequency features to obtain a high-frequency sub-band signal corresponding to the high-frequency features; and
performing sub-band synthesis on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain a synthesized audio signal corresponding to the low-frequency code stream and the high-frequency code stream.
An embodiment of this application provides an audio processing apparatus, including:
a decomposition module configured to decompose an audio signal into a low-frequency sub-band signal and a high-frequency sub-band signal;
a feature extraction module configured to acquire low-frequency features of the low-frequency sub-band signal;
a high-frequency analysis module configured to acquire high-frequency features of the high-frequency sub-band signal, wherein the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features; and
an encoding module configured to perform quantized coding on the low-frequency features to obtain a low-frequency code stream of the audio signal, and perform quantized coding on the high-frequency features to obtain a high-frequency code stream of the audio signal.
An embodiment of this application provides an audio processing apparatus, including:
a decoding module configured to perform quantized decoding on a low-frequency code stream to obtain low-frequency features corresponding to the low-frequency code stream, and perform quantized decoding on a high-frequency code stream to obtain high-frequency features corresponding to the high-frequency code stream, wherein the low-frequency code stream and the high-frequency code stream are obtained by separately encoding sub-band signals obtained by sub-band decomposition of an audio signal, and the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features;
a feature reconstruction module configured to perform feature reconstruction on the low-frequency features to obtain a low-frequency sub-band signal corresponding to the low-frequency features;
a high-frequency reconstruction module configured to perform high-frequency reconstruction on the high-frequency features to obtain a high-frequency sub-band signal corresponding to the high-frequency features; and
a synthesis module configured to perform sub-band synthesis on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain a synthesized audio signal corresponding to the low-frequency code stream and the high-frequency code stream.
An embodiment of this application provides an electronic device for audio processing, the electronic device including:
a memory for storing a computer program or instructions; and
a processor for implementing the audio processing method provided by the embodiments of this application when executing the computer program or instructions stored in the memory.
An embodiment of this application provides a computer-readable storage medium storing a computer program or instructions which, when executed by a processor, implement the audio processing method provided by the embodiments of this application.
An embodiment of this application provides a computer program product including a computer program or instructions which, when executed by a processor, implement the audio processing method provided by the embodiments of this application.
The embodiments of this application have the following beneficial effects:
The audio signal is decomposed into a low-frequency sub-band signal and a high-frequency sub-band signal, and the two are processed separately so that the feature dimension of the high-frequency features is lower than that of the low-frequency features. On the one hand, the feature dimension of the low-frequency features of the low-frequency sub-band signal, which has a larger influence on the audio signal, is kept higher than the feature dimension of the high-frequency features of the high-frequency sub-band signal, which has a relatively small influence on coding quality; the low-frequency components are thus preserved in the coding result as far as possible, guaranteeing the quality of the coded audio signal. On the other hand, the smaller feature dimension of the high-frequency features of the high-frequency sub-band signal actually reduces the amount of data for audio coding, thereby improving audio coding efficiency.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram comparing spectra at different bit rates according to an embodiment of this application;
FIG. 2 is a schematic architecture diagram of an audio coding/decoding system according to an embodiment of this application;
FIG. 3A is a schematic structural diagram of an electronic device according to an embodiment of this application;
FIG. 3B is a schematic structural diagram of an electronic device according to an embodiment of this application;
FIG. 4 is a schematic flowchart of an audio processing method according to an embodiment of this application;
FIG. 5 is a schematic flowchart of an audio processing method according to an embodiment of this application;
FIG. 6 is a schematic diagram of an end-to-end speech communication link according to an embodiment of this application;
FIG. 7 is a schematic flowchart of a speech coding/decoding method based on sub-band decomposition and neural networks according to an embodiment of this application;
FIG. 8 is a schematic diagram of a filter bank according to an embodiment of this application;
FIG. 9A is a schematic diagram of an ordinary convolutional network according to an embodiment of this application;
FIG. 9B is a schematic diagram of a dilated convolutional network according to an embodiment of this application;
FIG. 10 is a schematic diagram of band extension according to an embodiment of this application;
FIG. 11 is a schematic diagram of a first neural network according to an embodiment of this application;
FIG. 12 is a schematic diagram of a neural network structure for the high-frequency sub-band signal according to an embodiment of this application;
FIG. 13 is a schematic diagram of a second neural network according to an embodiment of this application.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings. The described embodiments are not to be regarded as limiting this application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of this application.
In the following description, the terms "first/second" merely distinguish similar objects and do not represent a specific ordering of the objects. It is understood that, where permitted, "first/second" may be interchanged in a specific order or sequence so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described here.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used herein are only for the purpose of describing the embodiments of this application and are not intended to limit this application.
Before the embodiments of this application are described in further detail, the nouns and terms involved in the embodiments of this application are explained; the following interpretations apply to them.
1) Neural network (NN): an algorithmic-mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. Depending on the complexity of the system, such a network processes information by adjusting the interconnections among a large number of internal nodes.
2) Deep learning (DL): a new research direction in the field of machine learning (ML). Deep learning learns the inherent regularities and representation levels of sample data, and the information obtained in this learning process greatly helps in interpreting data such as text, images, and sound. Its ultimate goal is to give machines the same analytical learning ability as humans, enabling them to recognize data such as text, images, and sound.
3) Quantization: the process of approximating the continuous values of a signal (or a large number of discrete values) by a finite number of (or fewer) discrete values. Quantization includes vector quantization (VQ) and scalar quantization.
Vector quantization is an effective lossy compression technique whose theoretical basis is Shannon's rate-distortion theory. The basic principle of vector quantization is to transmit and store the index of the codeword in a codebook that best matches the input vector in place of the input vector itself, while decoding requires only a simple table lookup. For example, several scalar values are combined into a vector space, the vector space is partitioned into several small regions, and during quantization a vector falling into a small region is replaced by the corresponding index.
Scalar quantization is quantization of scalars, i.e., one-dimensional vector quantization: the dynamic range is divided into several small intervals, each with a representative value (i.e., an index). When the input signal falls into an interval, it is quantized to that representative value.
4) Entropy coding: a lossless coding method that, according to the entropy principle, loses no information during coding; it is also a key module in lossy coding, located at the end of the encoder. Entropy coding includes Shannon coding, Huffman coding, exponential Golomb coding (Exp-Golomb), and arithmetic coding.
5) Quadrature mirror filters (QMF): an analysis-synthesis filter pair. The QMF analysis filter is used for sub-band signal decomposition to reduce the signal bandwidth so that each sub-band signal can be processed smoothly through its own channel; the QMF synthesis filter is used to synthesize the sub-band signals recovered on the decoding side, for example, reconstructing the original audio signal through zero-value interpolation and band-pass filtering.
Speech coding technology uses fewer network bandwidth resources to convey as much speech information as possible. The compression ratio of a speech codec can reach 10 times or more; that is, speech data that originally occupied 10 megabytes (MB) needs only 1 MB to be transmitted after compression by the encoder, greatly reducing the bandwidth consumed in conveying the information. For example, for a wideband speech signal with a sampling rate of 16000 Hz, with a 16-bit sampling depth (the precision with which speech intensity is recorded in sampling), the bit rate (the amount of data transmitted per unit time) of the uncompressed version is 256 kilobits per second (kbps). With speech coding technology, even with lossy coding, the quality of the reconstructed speech signal in the 10-20 kbps range can approach the uncompressed version, and may even be perceptually indistinguishable. If a service with a higher sampling rate is required, such as 32000 Hz super-wideband speech, the bit-rate range must reach at least 30 kbps.
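The 256 kbps figure above is just the product of sampling rate and sample depth; a trivial check, not part of the embodiment:

```python
fs_hz, bits_per_sample = 16000, 16
rate_kbps = fs_hz * bits_per_sample / 1000
print(rate_kbps)  # 256.0 kbps for uncompressed 16 kHz / 16-bit wideband speech
```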
In communication systems, to ensure smooth communication, the industry deploys standard speech coding/decoding protocols, such as standards from international and domestic standards organizations like ITU-T, 3GPP, IETF, AVS, and CCSA, including G.711, G.722, the AMR series, EVS, and OPUS. FIG. 1 shows a schematic comparison of spectra at different bit rates to illustrate the relationship between compression bit rate and quality. Curve 101 is the spectrum curve of the original speech, i.e., the uncompressed signal; curve 102 is the spectrum curve of the OPUS encoder at 20 kbps; curve 103 is the spectrum curve of OPUS coding at 6 kbps. As FIG. 1 shows, as the coding bit rate increases, the compressed signal becomes closer to the original signal.
In the related art, the principle of speech coding is roughly as follows: speech coding may encode the speech waveform directly, sample by sample; or, based on the principles of human vocalization, relevant low-dimensional features are extracted, the encoding side encodes the features, and the decoding side reconstructs the speech signal based on these parameters.
The above coding principles all derive from speech signal modeling, i.e., compression methods based on signal processing. To improve coding efficiency relative to signal-processing-based compression methods while guaranteeing speech quality, embodiments of this application provide an audio processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product that can improve coding efficiency. Exemplary applications of the electronic device provided by the embodiments of this application are described below; the electronic device may be implemented as a terminal device, as a server, or cooperatively by a terminal device and a server. The following description takes implementation as a terminal device as an example.
As an example, see FIG. 2, a schematic architecture diagram of an audio coding/decoding system 100 according to an embodiment of this application. The audio coding/decoding system 100 includes: a server 200, a network 300, a terminal device 400 (i.e., the encoding side), and a terminal device 500 (i.e., the decoding side), where the network 300 may be a local area network, a wide area network, or a combination of the two.
In some embodiments, a client 410 runs on the terminal device 400. The client 410 may be any of various types of clients, such as an instant messaging client, a web conferencing client, a live streaming client, or a browser. In response to an audio capture instruction triggered by a sender (e.g., the initiator of a web conference, a streamer, or the initiator of a voice call), the client 410 calls the microphone of the terminal device 400 to capture an audio signal and encodes the captured audio signal to obtain a code stream.
For example, the client 410 calls the audio processing method provided by the embodiments of this application to encode the captured audio signal: performing sub-band decomposition on the audio signal to obtain a low-frequency sub-band signal and a high-frequency sub-band signal of the audio signal; performing feature extraction on the low-frequency sub-band signal to obtain low-frequency features of the low-frequency sub-band signal; performing high-frequency analysis on the high-frequency sub-band signal to obtain high-frequency features of the high-frequency sub-band signal, wherein the feature dimension of the high-frequency features is lower than that of the low-frequency features; performing quantized coding on the low-frequency features to obtain a low-frequency code stream of the audio signal; and performing quantized coding on the high-frequency features to obtain a high-frequency code stream of the audio signal. The encoding side (i.e., the terminal device 400) applies differentiated signal processing to the low-frequency sub-band signal and the high-frequency sub-band signal so that the feature dimension of the high-frequency features is lower than that of the low-frequency features, and performs quantized coding separately on the low-frequency features and the dimension-reduced high-frequency features, thereby improving audio coding efficiency while guaranteeing audio quality.
The client 410 may send the code streams (i.e., the low-frequency code stream and the high-frequency code stream) to the server 200 via the network 300, so that the server 200 sends the code streams to the terminal device 500 associated with the receiver (e.g., participants of the web conference, viewers, or the recipient of the voice call).
After receiving the code streams sent by the server 200, a client 510 (e.g., an instant messaging client, web conferencing client, live streaming client, or browser) can decode the code streams to obtain the audio signal, thereby implementing audio communication.
For example, the client 510 calls the audio processing method provided by the embodiments of this application to decode the received code streams: performing quantized decoding on the low-frequency code stream to obtain low-frequency features corresponding to the low-frequency code stream, performing quantized decoding on the high-frequency code stream to obtain high-frequency features corresponding to the high-frequency code stream, performing feature reconstruction on the low-frequency features to obtain a low-frequency sub-band signal corresponding to the low-frequency features, performing high-frequency reconstruction on the high-frequency features to obtain a high-frequency sub-band signal corresponding to the high-frequency features, and performing sub-band synthesis on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain the decoded audio signal.
In some embodiments, the embodiments of this application can be implemented by means of cloud technology, which refers to a hosting technology that unifies hardware, software, network, and other resources within a wide area network or a local area network to realize the computation, storage, processing, and sharing of data.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, and application technology applied based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The service interaction functions of the above server 200 can be implemented through cloud technology.
As an example, the server 200 shown in FIG. 2 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal device 400 and the terminal device 500 shown in FIG. 2 may be, but are not limited to, smartphones, tablets, laptops, desktop computers, smart speakers, smart watches, in-vehicle terminals, and the like. The terminal devices (e.g., the terminal device 400 and the terminal device 500) and the server 200 may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of this application.
In some embodiments, the terminal device or the server 200 can also implement the audio processing method provided by the embodiments of this application by running a computer program. For example, the computer program may be a native program or software module in an operating system; a native application (APP), i.e., a program that must be installed in the operating system to run, such as a live streaming APP, a web conferencing APP, or an instant messaging APP; or a mini program, i.e., a program that only needs to be downloaded into a browser environment to run. In short, the computer program may be an application, module, or plug-in in any form.
In some embodiments, multiple servers may form a blockchain network, with the server 200 as a node on the blockchain network. Information connections may exist between the nodes in the blockchain network, and information may be transmitted between nodes through these connections. Data related to the audio processing method provided by the embodiments of this application (e.g., audio processing logic and code streams) may be stored on the blockchain network; any server's operation on the data must be confirmed by the other servers through a consensus algorithm, thereby preventing unilateral tampering with the data and avoiding unnecessary data leakage.
Referring to FIG. 3A and FIG. 3B, schematic structural diagrams of an electronic device 500 according to an embodiment of this application (taking the electronic device 500 being a terminal device as an example), the electronic device 500 shown in FIG. 3A and FIG. 3B includes: at least one processor 520, a memory 550, at least one network interface 530, and a user interface 540. The components of the electronic device 500 are coupled together by a bus system 550. It is understood that the bus system 550 is used to implement connection and communication among these components. In addition to a data bus, the bus system 550 includes a power bus, a control bus, and a status signal bus. For clarity, however, the various buses are all labeled as bus system 550 in FIG. 3A and FIG. 3B.
The processor 520 may be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 540 includes one or more output devices 541 that enable the presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 540 also includes one or more input devices 542, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch-screen display, camera, and other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard disk drives, optical disc drives, and the like. The memory 550 optionally includes one or more storage devices physically remote from the processor 520.
The memory 550 includes volatile memory or non-volatile memory, and may include both. The non-volatile memory may be read-only memory (ROM), and the volatile memory may be random access memory (RAM). The memory 550 described in the embodiments of this application is intended to include any suitable type of memory.
In some embodiments, the memory 550 can store data to support various operations; examples of such data include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
Operating system 551, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, core library layer, and driver layer, for implementing various basic services and handling hardware-based tasks;
Network communication module 552, for reaching other computing devices via one or more (wired or wireless) network interfaces 530; exemplary network interfaces 530 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
Presentation module 553, for enabling the presentation of information via one or more output devices 541 (e.g., display screens, speakers) associated with the user interface 540 (e.g., a user interface for operating peripheral devices and displaying content and information);
Input processing module 554, for detecting one or more user inputs or interactions from one of the one or more input devices 542 and translating the detected inputs or interactions.
In some embodiments, the audio processing apparatus provided by the embodiments of this application may be implemented in software. FIG. 3A shows an audio processing apparatus 555 stored in the memory 550, which may be software in the form of programs and plug-ins, including the following software modules: a decomposition module 5551, a feature extraction module 5552, a high-frequency analysis module 5553, and an encoding module 5554, where the decomposition module 5551, feature extraction module 5552, high-frequency analysis module 5553, and encoding module 5554 are used to implement the audio encoding function.
FIG. 3B shows an audio processing apparatus 555 stored in the memory 550, which may include a decoding module 5555, a feature reconstruction module 5556, a high-frequency reconstruction module 5557, and a synthesis module 5558, used to implement the audio decoding function. These modules are logical and may therefore be arbitrarily combined or further split according to the functions implemented.
As mentioned above, the audio processing method provided by the embodiments of this application can be implemented by various types of electronic devices. Referring to FIG. 4, a schematic flowchart of the audio processing method according to an embodiment of this application, the audio encoding function is implemented through audio processing; the steps shown in FIG. 4 are described below.
In step 101, sub-band decomposition is performed on the audio signal to obtain a low-frequency sub-band signal and a high-frequency sub-band signal of the audio signal.
As an example of acquiring the audio signal, the encoding side, in response to an audio capture instruction triggered by the sender (e.g., the initiator of a web conference, a streamer, or the initiator of a voice call), calls the microphone of the encoding-side terminal device to capture the audio signal (also called the input signal).
After the audio signal is acquired, the QMF analysis filter decomposes the audio signal into a low-frequency sub-band signal xLB(n) and a high-frequency sub-band signal xHB(n). Because the low-frequency sub-band signal has a greater influence on audio coding than the high-frequency sub-band signal, differentiated signal processing can subsequently be applied to the low-frequency and high-frequency sub-band signals.
In some embodiments, performing sub-band decomposition on the audio signal to obtain its low-frequency and high-frequency sub-band signals can be implemented as follows: performing sampling processing on the audio signal to obtain a sampled signal, where the sampled signal includes multiple sample points obtained by sampling; performing low-pass filtering on the sampled signal to obtain a low-pass filtered signal; down-sampling the low-pass filtered signal to obtain the low-frequency sub-band signal of the audio signal; performing high-pass filtering on the sampled signal to obtain a high-pass filtered signal; and down-sampling the high-pass filtered signal to obtain the high-frequency sub-band signal of the audio signal.
It should be noted that the audio signal is a continuous analog signal, the sampled signal is a discrete digital signal, and a sample point is a sample value obtained by sampling the audio signal.
As an example, taking an input signal with a sampling rate Fs of 32000 Hz, the input signal is sampled to obtain a sampled signal x(n) containing 640 sample points. The analysis filter (2-channel) in the QMF filter bank is called to low-pass filter the sampled signal to obtain a low-pass filtered signal and to high-pass filter the sampled signal to obtain a high-pass filtered signal; the low-pass filtered signal is down-sampled to obtain the low-frequency sub-band signal xLB(n) of the audio signal, and the high-pass filtered signal is down-sampled to obtain the high-frequency sub-band signal xHB(n) of the audio signal. The effective bandwidths of the low-frequency sub-band signal xLB(n) and the high-frequency sub-band signal xHB(n) are 0-8 kHz and 8-16 kHz, respectively, and each contains 320 sample points.
It should be noted that the QMF filter bank is an analysis-synthesis filter pair. The QMF analysis filter can decompose an input signal with sampling rate Fs into two signals with sampling rate Fs/2, representing the QMF low-pass signal and the QMF high-pass signal. After the decoding side recovers the low-pass and high-pass signals, they are synthesized by the QMF synthesis filter, recovering a reconstructed signal at the input signal's sampling rate Fs.
In step 102, feature extraction is performed on the low-frequency sub-band signal to obtain low-frequency features of the low-frequency sub-band signal.
For example, because the low-frequency sub-band signal has a greater influence on audio coding than the high-frequency sub-band signal, a neural network model can be used to extract features from the low-frequency sub-band signal to obtain its low-frequency features, minimizing the feature dimension of the low-frequency features while guaranteeing their completeness. The embodiments of this application are not limited to any particular neural network model structure. The dimension of the low-frequency features of the low-frequency sub-band signal is smaller than the dimension of the low-frequency sub-band signal.
In some embodiments, performing feature extraction on the low-frequency sub-band signal to obtain its low-frequency features includes: convolving the low-frequency sub-band signal to obtain convolution features of the low-frequency sub-band signal; pooling the convolution features to obtain pooled features of the low-frequency sub-band signal; down-sampling the pooled features to obtain down-sampled features of the low-frequency sub-band signal; and convolving the down-sampled features to obtain the low-frequency features of the low-frequency sub-band signal.
As shown in FIG. 11, a neural network model is called on the low-frequency sub-band signal xLB(n) to generate a lower-dimensional feature vector FLB(n), i.e., the low-frequency features. First, the input low-frequency sub-band signal xLB(n) is convolved by a causal convolution to obtain 24×320 convolution features. Then, the 24×320 convolution features are pooled with a factor of 2 (i.e., pre-processing) to obtain 24×160 pooled features. Next, the 24×160 pooled features are down-sampled to obtain 192×1 down-sampled features. Finally, the 192×1 down-sampled features are convolved again by a causal convolution, yielding a 56-dimensional feature vector FLB(n).
In some embodiments, the down-sampling is implemented through multiple cascaded encoding layers; down-sampling the pooled features to obtain the down-sampled features of the low-frequency sub-band signal can be implemented as follows: down-sampling the pooled features through the first encoding layer among the multiple cascaded encoding layers; outputting the down-sampling result of the first encoding layer to the subsequent cascaded encoding layers, which continue down-sampling and outputting down-sampling results until the last encoding layer; and using the down-sampling result output by the last encoding layer as the down-sampled features of the low-frequency sub-band signal.
As shown in FIG. 11, three encoding blocks (i.e., encoding layers) with different down-sampling factors (Down_factor) are cascaded to down-sample the pooled features: first, the 24×160 pooled features are down-sampled by an encoding block with Down_factor=4 to obtain a 48×40 result; then that result is down-sampled by an encoding block with Down_factor=5 to obtain a 96×8 result; finally, that result is down-sampled by an encoding block with Down_factor=8 to obtain the 192×1 down-sampled features. Taking the encoding block with Down_factor=4 as an example, one or more dilated convolutions may be executed first, followed by pooling based on Down_factor to achieve the down-sampling effect.
It should be noted that after processing by one encoding layer, the understanding of the down-sampled features deepens by one step; after learning through multiple encoding layers, the down-sampled features of the low-frequency sub-band signal can be learned progressively and accurately. Through cascaded encoding layers, down-sampled features of the low-frequency sub-band signal with progressively increasing precision can be obtained.
In step 103, high-frequency analysis is performed on the high-frequency sub-band signal to obtain high-frequency features of the high-frequency sub-band signal.
The feature dimension of the high-frequency features is lower than that of the low-frequency features. Because the low-frequency sub-band signal has a greater influence on audio coding than the high-frequency sub-band signal, differentiated signal processing is applied to the two sub-band signals so that the feature dimension of the high-frequency features is lower than that of the low-frequency features. The dimension of the high-frequency features is smaller than the dimension of the high-frequency sub-band signal; high-frequency analysis is used to reduce the dimensionality of the high-frequency sub-band signal, implementing data compression.
In some embodiments, performing high-frequency analysis on the high-frequency sub-band signal to obtain its high-frequency features can be implemented as follows: calling a first neural network model to perform feature extraction on the high-frequency sub-band signal to obtain its high-frequency features.
For example, a first neural network model similar to the neural network model for the low-frequency sub-band signal can be called on the high-frequency sub-band signal (as shown in FIG. 12): the high-frequency sub-band signal is convolved through the first neural network model to obtain its convolution features; the convolution features are pooled to obtain pooled features; the pooled features are down-sampled to obtain down-sampled features; and the down-sampled features are convolved to obtain the high-frequency features of the high-frequency sub-band signal.
It should be noted that, compared with the low-frequency sub-band signal, the high-frequency sub-band signal is relatively less important to quality; therefore, the structure of the first neural network model for the high-frequency sub-band signal (FIG. 12) can be less complex than the model structure shown in FIG. 11. For example, compared with the model structure of FIG. 11, the model structure of FIG. 12 can reduce the number of channels, thereby saving computing resources.
In some embodiments, performing high-frequency analysis on the high-frequency sub-band signal to obtain its high-frequency features can be implemented as follows: performing band extension on the high-frequency sub-band signal to obtain its high-frequency features.
For example, because the high-frequency sub-band signal is relatively less important to quality than the low-frequency sub-band signal, band extension (recovering a wideband speech signal from a band-limited narrowband speech signal) can be used to quickly compress the high-frequency sub-band signal and extract its high-frequency features.
In some embodiments, performing band extension on the high-frequency sub-band signal to obtain its high-frequency features can be implemented as follows: performing a frequency-domain transform based on the multiple sample points included in the high-frequency sub-band signal to obtain transform coefficients corresponding to the sample points; dividing the transform coefficients corresponding to the sample points into multiple sub-bands; computing a mean over the transform coefficients included in each sub-band to obtain the average energy corresponding to each sub-band, and using the average energy as the sub-band spectral envelope corresponding to each sub-band; and determining the sub-band spectral envelopes corresponding to the multiple sub-bands as the high-frequency features of the high-frequency sub-band signal.
For example, the frequency-domain transform methods of the embodiments of this application include the modified discrete cosine transform (MDCT), the discrete cosine transform (DCT), the fast Fourier transform (FFT), and the like; the embodiments of this application are not limited to any particular frequency-domain transform. The types of mean computed in the embodiments of this application include the arithmetic mean and the geometric mean; the embodiments of this application are not limited to any particular way of computing the mean.
In some embodiments, performing the frequency-domain transform based on the multiple sample points included in the high-frequency sub-band signal to obtain the corresponding transform coefficients includes: acquiring a reference high-frequency sub-band signal of a reference audio signal, where the reference audio signal is an audio signal adjacent to the audio signal; and performing a discrete cosine transform on the multiple sample points included in the high-frequency sub-band signal based on the sample points of the reference high-frequency sub-band signal and the sample points of the high-frequency sub-band signal, to obtain the transform coefficients corresponding to the sample points of the high-frequency sub-band signal.
In some embodiments, the process of computing the mean over the transform coefficients included in each sub-band is as follows: determining the sum of squares of the transform coefficients corresponding to the sample points included in each sub-band; and determining the ratio of the sum of squares to the number of sample points included in the sub-band as the average energy corresponding to each sub-band.
As an example, for the high-frequency sub-band signal xHB(n) containing 320 points, the modified discrete cosine transform (MDCT) is called to generate 320 MDCT coefficients (i.e., the transform coefficients corresponding to the sample points of the high-frequency sub-band signal). Specifically, with 50% overlap, the (n+1)-th frame of high-frequency data (i.e., the reference audio signal) can be merged (concatenated) with the n-th frame of high-frequency data (i.e., the audio signal), and a 640-point MDCT computed to obtain 320 MDCT coefficients.
The 320 MDCT coefficients are divided into N sub-bands (i.e., the transform coefficients corresponding to the sample points are divided into multiple sub-bands); here a sub-band groups multiple adjacent MDCT coefficients, and the 320 MDCT coefficients can be divided into 8 sub-bands. For example, the 320 points can be allocated uniformly, i.e., each sub-band contains the same number of points. Of course, the embodiments of this application may also divide the 320 points non-uniformly; for example, sub-bands toward the low end contain fewer MDCT coefficients (higher frequency resolution), and sub-bands toward the high end contain more MDCT coefficients (lower frequency resolution).
According to the Nyquist sampling theorem (to recover the original signal from the sampled signal without distortion, the sampling frequency must be greater than twice the highest frequency of the original signal; if the sampling frequency is less than twice the highest spectral frequency, the signal's spectrum aliases; if greater, it does not), the above 320 MDCT coefficients represent the 8-16 kHz spectrum. However, super-wideband speech communication does not necessarily require the spectrum to reach 16 kHz; for example, if the spectrum is set to 14 kHz, only the first 240 MDCT coefficients need to be considered, and correspondingly the number of sub-bands can be kept at 6.
For each sub-band, the average energy of all MDCT coefficients in the current sub-band is computed (i.e., the transform coefficients included in each sub-band are averaged) as the sub-band spectral envelope (the spectral envelope is a smooth curve through the main peaks of the spectrum). For example, if the MDCT coefficients in the current sub-band are x(n), n = 1, 2, ..., 40, the average energy is Y = (x(1)² + x(2)² + ... + x(40)²)/40. When the 320 MDCT coefficients are divided into 8 sub-bands, 8 sub-band spectral envelopes are obtained; these 8 sub-band spectral envelopes are the generated feature vector FHB(n) of the high-frequency sub-band signal, i.e., the high-frequency features.
In step 104, quantized coding is performed on the low-frequency features to obtain the low-frequency code stream of the audio signal, and quantized coding is performed on the high-frequency features to obtain the high-frequency code stream of the audio signal.
In some embodiments: quantization is performed on the low-frequency features to obtain index values of the low-frequency features; entropy coding is performed on the index values of the low-frequency features to obtain the low-frequency code stream of the audio signal; quantization is performed on the high-frequency features to obtain index values of the high-frequency features; and entropy coding is performed on the index values of the high-frequency features to obtain the high-frequency code stream of the audio signal.
For example, for the feature vector FLB(n) of the low-frequency sub-band signal and the feature vector FHB(n) of the high-frequency sub-band signal, scalar quantization (each component quantized separately) and entropy coding can both be applied. In addition, the embodiments of this application do not restrict the technical combination of vector quantization (multiple adjacent components combined into one vector for joint quantization) and entropy coding. The high-frequency and low-frequency code streams obtained by encoding are transmitted to the decoding side, which decodes them.
By applying differentiated signal processing to the low-frequency and high-frequency sub-band signals, on the one hand, the feature dimension of the low-frequency features, which have a larger influence on the audio signal, is kept higher than that of the high-frequency features, guaranteeing the quality of the coded audio signal; on the other hand, lowering the feature dimension of the high-frequency features, which have a smaller influence on audio quality, reduces the amount of data for quantized coding and improves coding efficiency.
As mentioned above, the audio processing method provided by the embodiments of this application can be implemented by various types of electronic devices. Referring to FIG. 5, a schematic flowchart of the audio processing method according to an embodiment of this application, the audio decoding function is implemented through audio processing; the steps shown in FIG. 5 are described below.
In step 201, quantized decoding is performed on the low-frequency code stream to obtain the low-frequency features corresponding to the low-frequency code stream, and quantized decoding is performed on the high-frequency code stream to obtain the high-frequency features corresponding to the high-frequency code stream.
The low-frequency code stream and the high-frequency code stream are obtained by separately encoding the sub-band signals obtained by sub-band decomposition of the audio signal, and the feature dimension of the high-frequency features is lower than that of the low-frequency features.
For example, after the high-frequency and low-frequency code streams are obtained through the audio processing method shown in FIG. 4, they are transmitted to the decoding side. After receiving them, the decoding side performs quantized decoding on the low-frequency code stream to obtain the corresponding low-frequency features and on the high-frequency code stream to obtain the corresponding high-frequency features.
It should be noted that quantized decoding is the inverse process of quantized coding. For the received code stream, entropy decoding is performed first, and the quantization table is consulted (i.e., inverse quantization; the quantization table is the mapping table produced by quantization during encoding) to obtain the estimate F′LB(n) of the low-frequency feature vector, i.e., the low-frequency features corresponding to the low-frequency code stream, and the estimate F′HB(n) of the high-frequency feature vector, i.e., the high-frequency features corresponding to the high-frequency code stream. It should be noted that the decoding side's process of decoding the received code stream is the inverse of the encoding side's encoding process; therefore, values produced during decoding are estimates of the values in the encoding process. For example, the high-frequency features produced during decoding are estimates of the high-frequency features in the encoding process.
For example, performing quantized decoding on the low-frequency code stream to obtain its corresponding low-frequency features includes: performing entropy decoding on the low-frequency code stream to obtain the index values corresponding to the low-frequency code stream; and performing inverse quantization on those index values to obtain the low-frequency features corresponding to the low-frequency code stream. Performing quantized decoding on the high-frequency code stream to obtain its corresponding high-frequency features can be implemented as follows: performing entropy decoding on the high-frequency code stream to obtain the index values corresponding to the high-frequency code stream; and performing inverse quantization on those index values to obtain the high-frequency features corresponding to the high-frequency code stream.
In step 202, feature reconstruction is performed on the low-frequency features to obtain the low-frequency sub-band signal corresponding to the low-frequency features.
For example, feature reconstruction is the inverse process of feature extraction; feature reconstruction on the low-frequency features yields the low-frequency sub-band signal corresponding to the low-frequency features (an estimate).
In some embodiments, performing feature reconstruction on the low-frequency features to obtain the corresponding low-frequency sub-band signal can be implemented as follows: convolving the low-frequency features to obtain convolution features of the low-frequency features; up-sampling the convolution features to obtain up-sampled features of the low-frequency features; pooling the up-sampled features to obtain pooled features of the low-frequency features; and convolving the pooled features to obtain the low-frequency sub-band signal corresponding to the low-frequency features.
As shown in FIG. 13, based on the low-frequency feature vector F′LB(n), the neural network model shown in FIG. 13 is called to generate the low-frequency sub-band signal x′LB(n). The neural network model of FIG. 13 is similar to that of FIG. 11; for example, the causal convolution and the post-processing structure are similar to the pre-processing. The structure of the decoding blocks is symmetric to the encoding blocks on the encoding side: the encoding side's encoding block performs dilated convolution first and then pooling to complete down-sampling, while the decoding side's decoding block performs pooling first to complete up-sampling and then dilated convolution.
First, the input low-frequency feature vector F′LB(n) is convolved by a causal convolution to obtain 192×1 convolution features. Then, the 192×1 convolution features are up-sampled to obtain 24×160 up-sampled features. Next, the 24×160 up-sampled features are pooled (i.e., post-processed) to obtain 24×320 pooled features. Finally, the pooled features are convolved again by a causal convolution, yielding a 320-dimensional low-frequency sub-band signal x′LB(n).
In some embodiments, the up-sampling is implemented through multiple cascaded decoding layers; up-sampling the convolution features to obtain the up-sampled features of the low-frequency features can be implemented as follows: up-sampling the convolution features through the first decoding layer among the multiple cascaded decoding layers; outputting the up-sampling result of the first decoding layer to the subsequent cascaded decoding layers, which continue up-sampling and outputting up-sampling results until the last decoding layer; and using the up-sampling result output by the last decoding layer as the up-sampled features of the low-frequency features.
As shown in FIG. 13, three decoding blocks (i.e., decoding layers) with different up-sampling factors (Up_factor) are cascaded to up-sample the convolution features: first, the 192×1 convolution features are up-sampled by a decoding block with Up_factor=8 to obtain a 96×8 result; then that result is up-sampled by a decoding block with Up_factor=5 to obtain a 48×40 result; finally, that result is up-sampled by a decoding block with Up_factor=4 to obtain the 24×160 up-sampled features. Taking the decoding block with Up_factor=4 as an example, pooling based on Up_factor can be performed first, followed by one or more dilated convolutions, to achieve the up-sampling effect.
It should be noted that after processing by one decoding layer, the understanding of the up-sampled features deepens by one step; after learning through multiple decoding layers, the up-sampled features of the low-frequency features can be learned progressively and accurately. Through cascaded decoding layers, up-sampled features of the low-frequency features with progressively increasing precision can be obtained.
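A sketch of the mirrored decoding block, up-sampling first and dilated convolution second, with the Up_factors of FIG. 13 (nearest-neighbor up-sampling is an assumption of the sketch):

```python
import torch
import torch.nn as nn

class DecodeBlock(nn.Module):
    """Mirror of the encoder block: up-sample first, then dilated convolution (sketch)."""
    def __init__(self, c_in, c_out, up_factor):
        super().__init__()
        self.up = nn.Upsample(scale_factor=up_factor, mode='nearest')
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=3, dilation=3, padding=3)
    def forward(self, x):
        return torch.relu(self.conv(self.up(x)))

# 192x1 -> 96x8 -> 48x40 -> 24x160, mirroring Up_factor = 8, 5, 4 in Fig. 13
blocks = nn.Sequential(DecodeBlock(192, 96, 8),
                       DecodeBlock(96, 48, 5),
                       DecodeBlock(48, 24, 4))
out = blocks(torch.randn(1, 192, 1))   # -> torch.Size([1, 24, 160])
```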
In step 203, high-frequency reconstruction is performed on the high-frequency features to obtain the high-frequency sub-band signal corresponding to the high-frequency features.
For example, high-frequency reconstruction is the inverse process of high-frequency analysis; through high-frequency reconstruction, the dimensionality of the high-frequency features is raised, implementing data decompression.
In some embodiments, performing high-frequency reconstruction on the high-frequency features to obtain the corresponding high-frequency sub-band signal can be implemented as follows: calling a second neural network model and performing feature reconstruction on the high-frequency features through the second neural network model to obtain the high-frequency sub-band signal corresponding to the high-frequency features.
For example, when the encoding side extracts features from the high-frequency sub-band signal to obtain the high-frequency features, the decoding side performs feature reconstruction on the high-frequency features to obtain the corresponding high-frequency sub-band signal.
In some embodiments, performing high-frequency reconstruction on the high-frequency features to obtain the corresponding high-frequency sub-band signal can be implemented as follows: performing the inverse process of band extension on the high-frequency features to obtain the high-frequency sub-band signal corresponding to the high-frequency features.
For example, when the encoding side performs band extension on the high-frequency sub-band signal to obtain the high-frequency features, the decoding side performs the inverse process of band extension on the high-frequency features to obtain the corresponding high-frequency sub-band signal.
In some embodiments, performing the inverse process of band extension on the high-frequency features to obtain the corresponding high-frequency sub-band signal can be implemented as follows: performing a frequency-domain transform based on the multiple sample points included in the low-frequency sub-band signal to obtain transform coefficients corresponding to the sample points; performing spectral copying on the latter half of the transform coefficients corresponding to the sample points to obtain reference transform coefficients of a reference high-frequency sub-band signal; amplifying the reference transform coefficients of the reference high-frequency sub-band signal based on the sub-band spectral envelopes corresponding to the high-frequency features to obtain amplified reference transform coefficients; and performing an inverse frequency-domain transform (i.e., the inverse of the frequency-domain transform) on the amplified reference transform coefficients to obtain the high-frequency sub-band signal corresponding to the high-frequency features.
It should be noted that the frequency-domain transform methods of the embodiments of this application include the modified discrete cosine transform (MDCT), the discrete cosine transform (DCT), the fast Fourier transform (FFT), and the like; the embodiments of this application are not limited to any particular frequency-domain transform.
In some embodiments, amplifying the reference transform coefficients based on the sub-band spectral envelopes corresponding to the high-frequency features can be implemented as follows: dividing the reference transform coefficients of the reference high-frequency sub-band signal into multiple sub-bands based on the sub-band spectral envelopes corresponding to the high-frequency features; and performing the following processing for any of the multiple sub-bands: determining the first average energy corresponding to the sub-band in the sub-band spectral envelopes and the second average energy corresponding to the sub-band; determining an amplification factor based on the ratio of the first average energy to the second average energy; and multiplying the amplification factor by each reference transform coefficient included in the sub-band to obtain the amplified reference transform coefficients.
As an example, the low-frequency sub-band signal x′LB(n) generated on the decoding side is first subjected to a 640-point MDCT similar to the encoding side, generating 320 MDCT coefficients (i.e., the MDCT coefficients of the low-frequency part); that is, a frequency-domain transform is performed based on the multiple sample points included in the low-frequency sub-band signal to obtain the transform coefficients corresponding to the sample points.
Then, the 320 MDCT coefficients generated from x′LB(n) are copied to generate the MDCT coefficients of the high-frequency part (i.e., the reference transform coefficients of the reference high-frequency sub-band signal). Referring to the basic characteristics of speech signals, the low-frequency part contains more harmonics and the high-frequency part fewer. Therefore, to prevent simple copying from causing the artificially generated high-frequency MDCT spectrum to contain too many harmonics, the last 160 of the 320 MDCT coefficients derived from the low-frequency sub-band signal can be used as a master and the spectrum copied twice, generating 320 reference values of the MDCT coefficients of the reference high-frequency sub-band signal (i.e., the reference transform coefficients of the reference high-frequency sub-band signal); that is, spectral copy processing is performed on the latter half of the transform coefficients corresponding to the sample points to obtain the reference transform coefficients of the reference high-frequency sub-band signal.
Next, the 8 sub-band spectral envelopes obtained earlier (i.e., the 8 sub-band spectral envelopes obtained after consulting the quantization table, i.e., the sub-band spectral envelopes corresponding to the high-frequency features) are called; these 8 sub-band spectral envelopes correspond to 8 high-frequency sub-bands, and the generated 320 reference values of the MDCT coefficients of the reference high-frequency sub-band signal are divided into 8 reference high-frequency sub-bands (i.e., the reference transform coefficients of the reference high-frequency sub-band signal are divided into multiple sub-bands). For each high-frequency sub-band, the following processing is performed: based on a high-frequency sub-band and the corresponding reference high-frequency sub-band, amplification control is applied to the generated 320-point reference values of the MDCT coefficients (in the frequency domain this is a multiplication). For example, an amplification factor is computed from the average energy of the high-frequency sub-band (i.e., the first average energy) and the average energy of the corresponding reference high-frequency sub-band (the second average energy), and the MDCT coefficient corresponding to each point in that reference high-frequency sub-band is multiplied by the amplification factor, ensuring that the energy of the virtually generated high-frequency MDCT coefficients at the decoder is close to the original coefficient energy at the encoding side.
For example, suppose the average energy of a high-frequency sub-band of the reference high-frequency sub-band signal generated by copying is Y_L, and the average energy of the high-frequency sub-band currently to be amplified (i.e., the high-frequency sub-band corresponding to the sub-band spectral envelope decoded from the code stream) is Y_H; then an amplification factor a = sqrt(Y_H/Y_L) is computed. With the amplification factor a, each point in the reference high-frequency sub-band generated by copying is directly multiplied by a.
Finally, the inverse MDCT is called to generate the estimate x′HB(n) of the high-frequency sub-band signal (i.e., the high-frequency sub-band signal corresponding to the high-frequency features). The inverse MDCT is applied to the amplified 320-point MDCT coefficients to generate 640 estimated points; through overlapping, the valid first 320 estimated points are taken as x′HB(n).
In step 204, sub-band synthesis is performed on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain the synthesized audio signal corresponding to the low-frequency code stream and the high-frequency code stream.
For example, sub-band synthesis is the inverse process of sub-band decomposition; the decoding side performs sub-band synthesis on the low-frequency and high-frequency sub-band signals to recover the audio signal, where the synthesized audio signal is the recovered audio signal.
In some embodiments, performing sub-band synthesis on the low-frequency and high-frequency sub-band signals to obtain the synthesized audio signal corresponding to the low-frequency and high-frequency code streams can be implemented as follows: up-sampling the low-frequency sub-band signal to obtain a low-pass filtered signal; up-sampling the high-frequency sub-band signal to obtain a high-frequency filtered signal; and performing filter synthesis on the low-pass filtered signal and the high-frequency filtered signal to obtain the synthesized audio signal.
For example, after the low-frequency and high-frequency sub-band signals are obtained, the QMF synthesis filter performs sub-band synthesis on them to recover the audio signal.
An exemplary application of the embodiments of this application in a practical application scenario is described below.
The embodiments of this application can be applied to various audio scenarios, such as voice calls and instant messaging. Voice calls are taken as an example below.
In the related art, the principle of speech coding is roughly as follows: speech coding may encode the speech waveform directly, sample by sample; or, based on the principles of human vocalization, relevant low-dimensional features are extracted, the encoding side encodes the features, and the decoding side reconstructs the speech signal based on these parameters.
The above coding principles all derive from speech signal modeling, i.e., compression methods based on signal processing. To improve coding efficiency relative to signal-processing-based compression methods while guaranteeing speech quality, embodiments of this application provide a speech coding/decoding method based on sub-band decomposition and neural networks (i.e., the audio processing method). Based on the characteristics of speech signals, a speech signal at a given sampling rate is decomposed into a low-frequency sub-band signal and a high-frequency sub-band signal, and the different sub-band signals can be compressed with different data compression mechanisms. For the important part (the low-frequency sub-band signal), processing based on neural network (NN) technology yields a feature vector of lower dimension than the input low-frequency sub-band signal. For the relatively unimportant part (the high-frequency sub-band signal), fewer bits are used for encoding.
The embodiments of this application can be applied to the speech communication link shown in FIG. 6. Taking a Voice over Internet Protocol (VoIP) conference system as an example, the speech coding/decoding technology involved in the embodiments of this application is deployed in the encoding and decoding parts to provide the basic function of speech compression. The encoder is deployed at the uplink client 601 and the decoder at the downlink client 602. Speech is captured through the uplink client and subjected to pre-processing enhancement, encoding, and the like; the resulting code stream is transmitted over the network to the downlink client 602, which performs decoding, enhancement, and other processing to play back the decoded speech at the downlink client 602.
Considering forward compatibility (i.e., a new encoder being compatible with existing encoders), a transcoder needs to be deployed in the background of the system (i.e., the server) to solve the problem of interconnection between the new encoder and existing encoders. For example, suppose the sending end (uplink client) uses the new NN encoder and the receiving end (downlink client) is on the public switched telephone network (PSTN) (G.722). In the background, the NN decoder must be executed to generate the speech signal, and the G.722 encoder then called to generate the specific code stream, implementing the transcoding function so that the receiving end can decode correctly based on that code stream.
The speech coding/decoding method based on sub-band decomposition and neural networks provided by the embodiments of this application is described below with reference to FIG. 7:
The encoding side performs the following processing. For the n-th frame of the input speech signal x(n), the analysis filter decomposes it into a low-frequency sub-band signal xLB(n) and a high-frequency sub-band signal xHB(n). For the low-frequency sub-band signal xLB(n), a first NN is called to obtain a low-dimensional feature vector FLB(n); the dimension of FLB(n) is smaller than that of the low-frequency sub-band signal, reducing the amount of data. For example, for each frame of xLB(n), a dilated convolutional network (Dilated CNN) is called to generate a lower-dimensional feature vector FLB(n). The embodiments of this application do not restrict other NN structures, such as autoencoders, fully connected (FC) networks, long short-term memory (LSTM) networks, convolutional neural network (CNN) + LSTM, and so on.
For the high-frequency sub-band signal xHB(n), considering that high frequencies are less important to quality than low frequencies, other schemes can be used to extract the feature vector FHB(n). For example, band extension technology based on speech signal analysis can generate the high-frequency sub-band signal at only a 1-2 kbps bit rate; alternatively, the same NN structure as for the low-frequency sub-band signal, or a more streamlined network (e.g., one whose output feature vector is smaller than the low-frequency feature vector FLB(n)), can be used.
The feature vectors corresponding to the sub-bands (i.e., FLB(n) and FHB(n)) are vector-quantized or scalar-quantized, the quantized index values are entropy-coded, and the resulting code stream is transmitted to the decoding side.
The decoding side performs the following processing. The code stream received by the decoding side is decoded to obtain the estimate F′LB(n) of the low-frequency feature vector and the estimate F′HB(n) of the high-frequency feature vector. For the low-frequency part, a second NN is called based on F′LB(n) to generate the low-frequency sub-band signal estimate x′LB(n). For the high-frequency part, high-frequency reconstruction is called based on F′HB(n) to generate the high-frequency sub-band signal estimate x′HB(n). Finally, QMF synthesis filtering is called to generate the reconstructed speech signal x′(n).
Before the speech coding/decoding method based on sub-band decomposition and neural networks provided by the embodiments of this application is described in detail, the QMF filter bank, dilated convolutional networks, and band extension technology are introduced.
The QMF filter bank is an analysis-synthesis filter pair. The QMF analysis filter can decompose an input signal with sampling rate Fs into two signals with sampling rate Fs/2, representing the QMF low-pass signal and the QMF high-pass signal. FIG. 8 shows the spectral responses of the low-pass part H_Low(z) and the high-pass part H_High(z) of the QMF filter. Based on the relevant theory of QMF analysis filter banks, the correlation between the coefficients of the above low-pass and high-pass filters can be readily described, as shown in formula (1):

h_High(k) = (-1)^k · h_Low(k)              (1)

where h_Low(k) denotes the low-pass filter coefficients and h_High(k) denotes the high-pass filter coefficients.
Similarly, according to QMF theory, the QMF synthesis filter bank can be described based on the QMF analysis filter banks H_Low(z) and H_High(z), as shown in formula (2):

G_Low(z) = H_Low(z)
G_High(z) = (-1) · H_High(z)              (2)

where G_Low(z) denotes the recovered low-pass signal and G_High(z) denotes the recovered high-pass signal.
The low-pass and high-pass signals recovered on the decoding side are synthesized by the QMF synthesis filter bank, recovering a reconstructed signal at the input signal's sampling rate Fs.
Referring to FIG. 9A and FIG. 9B, FIG. 9A is a schematic diagram of an ordinary convolutional network according to an embodiment of this application, and FIG. 9B is a schematic diagram of a dilated convolutional network according to an embodiment of this application. Compared with an ordinary convolutional network, dilated convolution can enlarge the receptive field while keeping the size of the feature map unchanged, and can also avoid the errors caused by up-sampling and down-sampling. Although the kernel size shown in both FIG. 9A and FIG. 9B is 3×3, the receptive field 901 of the ordinary convolution in FIG. 9A is only 3, while the receptive field 902 of the dilated convolution in FIG. 9B reaches 5. That is, for a 3×3 kernel, the ordinary convolution of FIG. 9A has a receptive field of 3 and a dilation rate (the number of intervals between points in the kernel) of 1, while the dilated convolution of FIG. 9B has a receptive field of 5 and a dilation rate of 2.
The kernel can also move over a plane like that of FIG. 9A or FIG. 9B; this involves the concept of the stride rate (step size). For example, if the kernel shifts by 1 cell each time, the corresponding stride rate is 1.
In addition, the concept of the number of convolution channels is involved, i.e., how many kernels' parameters are used for the convolution analysis. In theory, the more channels, the more comprehensive the analysis of the signal and the higher the precision; however, the higher the channel count, the higher the complexity. For example, a 1×320 tensor convolved with 24 channels yields a 24×320 tensor as output.
It should be noted that the dilated convolution kernel size (e.g., for speech signals the kernel size can be set to 1×3), dilation rate, stride rate, and channel count can be defined as needed for the actual application; the embodiments of this application do not specifically limit them.
FIG. 10 shows a schematic diagram of band extension (or band replication): first a wideband signal is reconstructed, then the wideband signal is copied onto the super-wideband signal, and finally shaping is performed based on the super-wideband envelope. The frequency-domain implementation shown in FIG. 10 specifically includes: 1) implementing a core-layer coding at a low sampling rate; 2) selecting a portion of the low-frequency spectrum and copying it to the high frequencies; 3) applying amplification control to the copied high-frequency spectrum according to boundary information recorded in advance (describing the energy correlation between high and low frequencies, etc.). At only 1-2 kbps, the effect of doubling the sampling rate can be achieved.
The speech coding/decoding method based on sub-band decomposition and neural networks provided by the embodiments of this application is specifically described below.
In some embodiments, a speech signal with sampling rate Fs = 32000 Hz is taken as an example (it should be noted that the method provided by the embodiments of this application is also applicable to scenarios with other sampling rates, including but not limited to 8000 Hz, 16000 Hz, and 48000 Hz). Meanwhile, the frame length is assumed to be 20 ms; hence, for Fs = 32000 Hz, each frame contains 640 sample points.
Referring to the schematic diagram shown in FIG. 7, the encoding side and decoding side are described in detail. First, the encoding principle of the encoding side.
First, the input signal is generated.
For a speech signal with sampling rate Fs = 32000 Hz, the input signal for the n-th frame contains 640 sample points, denoted as the input signal x(n).
Second, QMF signal decomposition.
The QMF analysis filter (2-channel QMF) is called and down-sampling performed to obtain two sub-band signals, the low-frequency sub-band signal xLB(n) and the high-frequency sub-band signal xHB(n). Their effective bandwidths are 0-8 kHz and 8-16 kHz, respectively, and each contains 320 sample points.
Third, the low-frequency sub-band signal xLB(n) is input to the first NN for data compression.
Based on the low-frequency sub-band signal xLB(n), the first NN is called to generate a lower-dimensional feature vector FLB(n). It should be noted that the dimension of xLB(n) is 320 and that of FLB(n) is 56; in terms of data volume, the first NN performs dimensionality reduction, implementing data compression.
Referring to the network structure diagram of the first NN shown in FIG. 11, the data compression process of the first NN is described in detail below.
A 24-channel causal convolution is called to expand the input tensor (i.e., vector) into a 24×320 tensor. The 24×320 tensor is pre-processed; for example, a pooling operation with factor 2 is applied to the 24×320 tensor, with ReLU as a possible activation function, to generate a 24×160 tensor.
Three encoding blocks with different down-sampling factors (Down_factor) are cascaded. Taking the encoding block with Down_factor=4 as an example, one or more dilated convolutions can be executed first, each with a fixed kernel size of 1×3 and a stride rate of 1. In addition, the dilation rate of the one or more dilated convolutions can be set as required, e.g., 3; of course, the embodiments of this application also do not restrict different dilated convolutions to different dilation-rate settings.
The Down_factors of the three encoding blocks are set to 4, 5, and 8, respectively, equivalent to setting pooling factors of different sizes, achieving down-sampling. The channel counts of the three encoding blocks are set to 48, 96, and 192, respectively. Through the three encoding blocks, the 24×160 tensor is successively transformed into 48×40, 96×8, and 192×1 tensors. The 192×1 tensor, after a causal convolution similar to the pre-processing, yields a 56-dimensional output feature vector FLB(n).
Continuing with FIG. 7: fourth, high-frequency analysis is performed on the high-frequency sub-band signal xHB(n). The purpose of high-frequency analysis is to extract the key information of xHB(n) and generate a lower-dimensional feature vector FHB(n).
In some embodiments, another NN structure similar to the first NN can be introduced to generate the low-dimensional feature vector. Compared with the low-frequency sub-band signal, the high-frequency sub-band signal is relatively less important to quality; therefore, the NN structure for the high-frequency sub-band signal need not be as complex as the first NN. FIG. 12 shows such an NN structure for the high-frequency sub-band signal; it is similar to the first NN's structure but greatly reduces the channel count relative to the first NN.
However, for the high-frequency sub-band signal, although the NN structure of FIG. 12 greatly reduces the data volume (from 320 dimensions down to 8), the model complexity of the NN structure is not optimal. Therefore, the embodiments of this application propose another method to compress the high-frequency sub-band signal, namely band extension (recovering a wideband speech signal from a band-limited narrowband speech signal). The application of band extension in the embodiments of this application is specifically introduced below.
For the high-frequency sub-band signal xHB(n) containing 320 points, the modified discrete cosine transform (MDCT) is called to generate 320 MDCT coefficients. Specifically, with 50% overlap, the (n+1)-th frame of high-frequency data can be merged (concatenated) with the n-th frame of high-frequency data, and a 640-point MDCT computed to obtain 320 MDCT coefficients.
The 320 MDCT coefficients are divided into N sub-bands; here a sub-band groups multiple adjacent MDCT coefficients, and the 320 MDCT coefficients can be divided into 8 sub-bands. For example, the 320 points can be allocated uniformly, i.e., each sub-band contains the same number of points. Of course, the embodiments of this application may also divide the 320 points non-uniformly; for example, sub-bands toward the low end contain fewer MDCT coefficients (higher frequency resolution), and sub-bands toward the high end contain more MDCT coefficients (lower frequency resolution).
According to the Nyquist sampling theorem (to recover the original signal from the sampled signal without distortion, the sampling frequency must be greater than twice the highest frequency of the original signal; if the sampling frequency is less than twice the highest spectral frequency, the signal's spectrum aliases; if greater, it does not), the above 320 MDCT coefficients represent the 8-16 kHz spectrum. However, super-wideband speech communication does not necessarily require the spectrum to reach 16 kHz; for example, if the spectrum is set to 14 kHz, only the first 240 MDCT coefficients need to be considered, and correspondingly the number of sub-bands can be kept at 6.
For each sub-band, the average energy of all MDCT coefficients in the current sub-band is computed as the sub-band spectral envelope (the spectral envelope is a smooth curve through the main peaks of the spectrum); for example, if the MDCT coefficients in the current sub-band are x(n), n = 1, 2, ..., 40, the average energy is Y = (x(1)² + x(2)² + ... + x(40)²)/40. When the 320 MDCT coefficients are divided into 8 sub-bands, 8 sub-band spectral envelopes are obtained; these 8 sub-band spectral envelopes are the generated feature vector FHB(n) of the high-frequency sub-band signal.
In short, with either of the above two methods (the NN structure and band extension), the 320-dimensional high-frequency sub-band signal can be output as an 8-dimensional feature vector. Hence only a small amount of data is needed to represent the high-frequency information, and coding efficiency is significantly improved.
Continuing with FIG. 7: finally, quantized coding.
For the feature vector FLB(n) of the low-frequency sub-band signal and the feature vector FHB(n) of the high-frequency sub-band signal, scalar quantization (each component quantized separately) and entropy coding can both be applied. In addition, the embodiments of this application do not restrict the technical combination of vector quantization (multiple adjacent components combined into one vector for joint quantization) and entropy coding. After quantized coding of the feature vectors, the corresponding code streams can be generated. According to experiments, high-quality compression of a 32 kHz super-wideband signal can be achieved at a bit rate of 6-10 kbps.
Continuing with FIG. 7, the decoding principle of the decoding side is as follows.
First, quantized decoding.
Quantized decoding is the inverse process of quantized coding. For the received code stream, entropy decoding is performed first, and the quantization table is consulted to obtain the estimate F′LB(n) of the low-frequency feature vector and the estimate F′HB(n) of the high-frequency feature vector.
Second, the estimate F′LB(n) of the low-frequency feature vector is input to the second NN.
Based on F′LB(n), the second NN shown in FIG. 13 is called to generate the estimate x′LB(n) of the low-frequency sub-band signal. The second NN is similar to the first NN; for example, the causal convolution and the post-processing structure in the second NN are similar to the pre-processing in the first NN. The structure of the decoding blocks is symmetric to the encoding blocks on the encoding side: the encoding side's encoding block performs dilated convolution first and then pooling to complete down-sampling, while the decoding side's decoding block performs pooling first to complete up-sampling and then dilated convolution.
Third, high-frequency reconstruction is performed on the estimate F′HB(n) of the high-frequency feature vector.
Similar to the high-frequency analysis on the encoding side, the high-frequency reconstruction in the embodiments of this application includes two schemes.
The first implementation of high-frequency reconstruction corresponds to the first implementation of high-frequency analysis on the encoding side (the NN structure of FIG. 12). Based on the estimate F′HB(n) of the high-frequency feature vector, a deep neural network is called to generate the estimate x′HB(n) of the high-frequency sub-band signal.
The structure of this deep neural network is similar to the first implementation of high-frequency analysis (FIG. 12); for example, causal convolution, with a post-processing structure similar to the pre-processing in the first implementation of high-frequency analysis. The decoding block structure is symmetric to the encoding block on the encoding side: the encoding block performs dilated convolution first and then pooling to complete down-sampling, while the decoding block performs pooling first to complete up-sampling and then dilated convolution.
The second implementation of high-frequency reconstruction corresponds to the second implementation of high-frequency analysis on the encoding side (the band extension technique). Based on the 8 sub-band spectral envelopes decoded from the code stream, i.e., the estimate F′HB(n) of the high-frequency feature vector, the following operations are performed: the estimate x′LB(n) of the low-frequency sub-band signal generated on the decoding side is first subjected to a 640-point MDCT similar to the encoding side, generating 320 MDCT coefficients (i.e., the MDCT coefficients of the low-frequency part). The 320 MDCT coefficients generated from x′LB(n) are then copied to generate the MDCT coefficients of the high-frequency part.
Referring to the basic characteristics of speech signals, the low-frequency part contains more harmonics and the high-frequency part fewer. Therefore, to prevent simple copying from causing the artificially generated high-frequency MDCT spectrum to contain too many harmonics, the last 160 of the 320 MDCT coefficients derived from the low-frequency sub-band can be used as a master and the spectrum copied twice, generating 320 reference values of the MDCT coefficients of the high-frequency sub-band signal.
The 8 sub-band spectral envelopes obtained earlier (i.e., the 8 sub-band spectral envelopes obtained after consulting the quantization table) are called; these 8 sub-band spectral envelopes correspond to 8 high-frequency sub-bands, and the generated 320 reference values of the MDCT coefficients of the high-frequency sub-band signal are divided into 8 reference high-frequency sub-bands. For each high-frequency sub-band, the following processing is performed: based on a high-frequency sub-band and the corresponding reference high-frequency sub-band, amplification control is applied to the generated 320-point reference values of the MDCT coefficients of the high-frequency sub-band signal (in the frequency domain this is a multiplication).
For example, an amplification factor is computed from the average energy of the high-frequency sub-band and the average energy of the corresponding reference high-frequency sub-band, and the MDCT coefficient corresponding to each point in that reference high-frequency sub-band is multiplied by the amplification factor, ensuring that the energy of the virtually generated high-frequency MDCT coefficients at the decoder is close to the original coefficient energy at the encoding side.
For example, suppose the average energy of a high-frequency sub-band of the high-frequency sub-band signal generated by copying is Y_L, and the average energy of the high-frequency sub-band currently to be amplified (i.e., the high-frequency sub-band corresponding to the sub-band spectral envelope decoded from the code stream) is Y_H; then an amplification factor a = sqrt(Y_H/Y_L) is computed. With the amplification factor a, each point in the copied high-frequency sub-band is directly multiplied by a. The inverse MDCT is then called to generate the estimate x′HB(n) of the high-frequency sub-band signal: the amplified 320-point MDCT coefficients undergo the inverse MDCT to generate 640 estimated points, and through overlapping, the valid first 320 estimated points are taken as x′HB(n).
Continuing with FIG. 7: finally, the synthesis filter is called.
After the decoding side obtains the estimate x′LB(n) of the low-frequency sub-band signal and the estimate x′HB(n) of the high-frequency sub-band signal, it only needs to up-sample and call the QMF synthesis filter to generate the 640-point reconstructed signal x′(n).
In the embodiments of this application, data can be collected to jointly train the related networks of the encoding side and the decoding side to obtain optimal parameters. Users only need to prepare the data and set up the corresponding network structure; after training is completed in the background, the trained model can be put into use.
In summary, the speech coding/decoding method based on sub-band decomposition and neural networks provided by the embodiments of this application, through the organic combination of signal decomposition, signal processing technology, and deep neural networks, significantly improves coding efficiency compared with the signal-processing schemes of the related art, while guaranteeing audio quality at acceptable complexity.
The audio processing method provided by the embodiments of this application has now been described with reference to the exemplary applications and implementations of the terminal device provided by the embodiments of this application. The embodiments of this application also provide an audio processing apparatus. In practical applications, the functional modules in the audio processing apparatus can be implemented cooperatively by the hardware resources of an electronic device (such as a terminal device, server, or server cluster), including computing resources such as processors, communication resources (e.g., for supporting communication in various modes such as optical cable and cellular), and memory. FIG. 3A and FIG. 3B show the audio processing apparatus 555 stored in the memory 550, which may be software in the form of programs and plug-ins, e.g., software modules designed in programming languages such as C/C++ and Java; application software designed in programming languages such as C/C++ and Java; or dedicated software modules, application program interfaces, plug-ins, cloud services, and other implementations within large software systems. Different implementations are exemplified below.
The audio processing apparatus 555 shown in FIG. 3A includes a series of modules: a decomposition module 5551, a feature extraction module 5552, a high-frequency analysis module 5553, and an encoding module 5554. The following continues to describe how the modules of the audio processing apparatus 555 provided by the embodiments of this application cooperate to implement audio encoding.
A decomposition module configured to perform sub-band decomposition on an audio signal to obtain a low-frequency sub-band signal and a high-frequency sub-band signal of the audio signal; a feature extraction module configured to perform feature extraction on the low-frequency sub-band signal to obtain low-frequency features of the low-frequency sub-band signal; a high-frequency analysis module configured to perform high-frequency analysis on the high-frequency sub-band signal to obtain high-frequency features of the high-frequency sub-band signal, wherein the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features; and an encoding module configured to perform quantized coding on the low-frequency features to obtain a low-frequency code stream of the audio signal, and perform quantized coding on the high-frequency features to obtain a high-frequency code stream of the audio signal.
In some embodiments, the decomposition module is further configured to perform sampling processing on the audio signal to obtain a sampled signal, where the sampled signal includes multiple sample points obtained by sampling; perform low-pass filtering on the sampled signal to obtain a low-pass filtered signal; down-sample the low-pass filtered signal to obtain the low-frequency sub-band signal of the audio signal; perform high-pass filtering on the sampled signal to obtain a high-pass filtered signal; and down-sample the high-pass filtered signal to obtain the high-frequency sub-band signal of the audio signal.
In some embodiments, the feature extraction module is further configured to convolve the low-frequency sub-band signal to obtain convolution features of the low-frequency sub-band signal; pool the convolution features to obtain pooled features of the low-frequency sub-band signal; down-sample the pooled features to obtain down-sampled features of the low-frequency sub-band signal; and convolve the down-sampled features to obtain the low-frequency features of the low-frequency sub-band signal.
In some embodiments, the down-sampling is implemented through multiple cascaded encoding layers; the feature extraction module is further configured to down-sample the pooled features through the first encoding layer among the multiple cascaded encoding layers; output the down-sampling result of the first encoding layer to the subsequent cascaded encoding layers, which continue down-sampling and outputting down-sampling results until the last encoding layer; and use the down-sampling result output by the last encoding layer as the down-sampled features of the low-frequency sub-band signal.
In some embodiments, the high-frequency analysis module is further configured to call a first neural network model and perform feature extraction on the high-frequency sub-band signal through the first neural network model to obtain the high-frequency features of the high-frequency sub-band signal; or perform band extension on the high-frequency sub-band signal to obtain the high-frequency features of the high-frequency sub-band signal.
In some embodiments, the high-frequency analysis module is further configured to perform a frequency-domain transform based on the multiple sample points included in the high-frequency sub-band signal to obtain transform coefficients corresponding to the sample points; divide the transform coefficients corresponding to the sample points into multiple sub-bands; average the transform coefficients included in each sub-band to obtain the average energy corresponding to each sub-band, and use the average energy as the sub-band spectral envelope corresponding to each sub-band; and determine the sub-band spectral envelopes corresponding to the multiple sub-bands as the high-frequency features of the high-frequency sub-band signal.
In some embodiments, the high-frequency analysis module is further configured to acquire a reference high-frequency sub-band signal of a reference audio signal, where the reference audio signal is an audio signal adjacent to the audio signal; and perform a discrete cosine transform on the multiple sample points included in the high-frequency sub-band signal based on the sample points of the reference high-frequency sub-band signal and the sample points of the high-frequency sub-band signal, to obtain the transform coefficients corresponding to the sample points of the high-frequency sub-band signal.
In some embodiments, the high-frequency analysis module is further configured to determine the sum of squares of the transform coefficients corresponding to the sample points included in each sub-band; and determine the ratio of the sum of squares to the number of sample points included in the sub-band as the average energy corresponding to each sub-band.
In some embodiments, the encoding module is further configured to quantize the low-frequency features to obtain index values of the low-frequency features; entropy-code the index values of the low-frequency features to obtain the low-frequency code stream of the audio signal; quantize the high-frequency features to obtain index values of the high-frequency features; and entropy-code the index values of the high-frequency features to obtain the high-frequency code stream of the audio signal.
The audio processing apparatus 555 shown in FIG. 3B includes a series of modules: a decoding module 5555, a feature reconstruction module 5556, a high-frequency reconstruction module 5557, and a synthesis module 5558. The following continues to describe how the modules of the audio processing apparatus 555 provided by the embodiments of this application cooperate to implement audio decoding.
A decoding module configured to perform quantized decoding on a low-frequency code stream to obtain low-frequency features corresponding to the low-frequency code stream, and perform quantized decoding on a high-frequency code stream to obtain high-frequency features corresponding to the high-frequency code stream, wherein the low-frequency code stream and the high-frequency code stream are obtained by separately encoding sub-band signals obtained by sub-band decomposition of an audio signal, and the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features.
A feature reconstruction module configured to perform feature reconstruction on the low-frequency features to obtain a low-frequency sub-band signal corresponding to the low-frequency features.
A high-frequency reconstruction module configured to perform high-frequency reconstruction on the high-frequency features to obtain a high-frequency sub-band signal corresponding to the high-frequency features.
A synthesis module configured to perform sub-band synthesis on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain a synthesized audio signal corresponding to the low-frequency code stream and the high-frequency code stream.
In some embodiments, the feature reconstruction module is further configured to convolve the low-frequency features to obtain convolution features of the low-frequency features; up-sample the convolution features to obtain up-sampled features of the low-frequency features; pool the up-sampled features to obtain pooled features of the low-frequency features; and convolve the pooled features to obtain the low-frequency sub-band signal corresponding to the low-frequency features.
In some embodiments, the up-sampling is implemented through multiple cascaded decoding layers; the feature reconstruction module is further configured to up-sample the convolution features through the first decoding layer among the multiple cascaded decoding layers; output the up-sampling result of the first decoding layer to the subsequent cascaded decoding layers, which continue up-sampling and outputting up-sampling results until the last decoding layer; and use the up-sampling result output by the last decoding layer as the up-sampled features of the low-frequency features.
In some embodiments, the high-frequency reconstruction module is further configured to call a second neural network model and perform feature reconstruction on the high-frequency features through the second neural network model to obtain the high-frequency sub-band signal corresponding to the high-frequency features; or perform the inverse process of band extension on the high-frequency features to obtain the high-frequency sub-band signal corresponding to the high-frequency features.
In some embodiments, the high-frequency reconstruction module is further configured to perform a frequency-domain transform based on the multiple sample points included in the low-frequency sub-band signal to obtain transform coefficients corresponding to the sample points; perform spectral copy processing on the latter half of the transform coefficients corresponding to the sample points to obtain reference transform coefficients of a reference high-frequency sub-band signal; amplify the reference transform coefficients of the reference high-frequency sub-band signal based on the sub-band spectral envelopes corresponding to the high-frequency features to obtain amplified reference transform coefficients; and perform an inverse frequency-domain transform on the amplified reference transform coefficients to obtain the high-frequency sub-band signal corresponding to the high-frequency features.
In some embodiments, the high-frequency reconstruction module is further configured to divide the reference transform coefficients of the reference high-frequency sub-band signal into multiple sub-bands based on the sub-band spectral envelopes corresponding to the high-frequency features; and perform the following processing for any of the multiple sub-bands: determining the first average energy corresponding to the sub-band in the sub-band spectral envelopes and the second average energy corresponding to the sub-band; determining an amplification factor based on the ratio of the first average energy to the second average energy; and multiplying the amplification factor by each reference transform coefficient included in the sub-band to obtain the amplified reference transform coefficients.
In some embodiments, the decoding module is further configured to perform entropy decoding on the low-frequency code stream to obtain index values corresponding to the low-frequency code stream; perform inverse quantization on those index values to obtain the low-frequency features corresponding to the low-frequency code stream; perform entropy decoding on the high-frequency code stream to obtain index values corresponding to the high-frequency code stream; and perform inverse quantization on those index values to obtain the high-frequency features corresponding to the high-frequency code stream.
In some embodiments, the synthesis module is further configured to up-sample the low-frequency sub-band signal to obtain a low-pass filtered signal; up-sample the high-frequency sub-band signal to obtain a high-frequency filtered signal; and perform filter synthesis on the low-pass filtered signal and the high-frequency filtered signal to obtain the synthesized audio signal.
An embodiment of this application provides a computer program product or computer program including a computer program or instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer program or instructions from the computer-readable storage medium and executes them, causing the electronic device to perform the audio processing method described above in the embodiments of this application.
An embodiment of this application provides a computer-readable storage medium storing a computer program or instructions which, when executed by a processor, cause the processor to perform the audio processing method provided by the embodiments of this application, for example, the audio processing methods shown in FIG. 4 and FIG. 5.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; it may also be any device including one of, or any combination of, the above memories.
In some embodiments, the computer program or instructions may take the form of a program, software, software module, script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
As an example, the computer program or instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
It can be understood that the embodiments of this application involve data related to user information and the like; when the embodiments of this application are applied to specific products or technologies, user permission or consent must be obtained, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The above are merely embodiments of this application and are not intended to limit the protection scope of this application. Any modifications, equivalent replacements, and improvements made within the spirit and scope of this application are included within the protection scope of this application.

Claims (21)

  1. An audio processing method, the method comprising:
    decomposing an audio signal into a low-frequency sub-band signal and a high-frequency sub-band signal;
    acquiring low-frequency features of the low-frequency sub-band signal;
    acquiring high-frequency features of the high-frequency sub-band signal, wherein the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features;
    performing quantized coding on the low-frequency features to obtain a low-frequency code stream of the audio signal; and
    performing quantized coding on the high-frequency features to obtain a high-frequency code stream of the audio signal.
  2. The method according to claim 1, wherein decomposing the audio signal into the low-frequency sub-band signal and the high-frequency sub-band signal comprises:
    acquiring a sampled signal of the audio signal, wherein the sampled signal includes a plurality of sample points obtained by sampling;
    performing low-pass filtering on the sampled signal to obtain a low-pass filtered signal;
    down-sampling the low-pass filtered signal to obtain the low-frequency sub-band signal of the audio signal;
    performing high-pass filtering on the sampled signal to obtain a high-pass filtered signal; and
    down-sampling the high-pass filtered signal to obtain the high-frequency sub-band signal of the audio signal.
  3. The method according to claim 1, wherein acquiring the low-frequency features of the low-frequency sub-band signal comprises:
    convolving the low-frequency sub-band signal to obtain convolution features of the low-frequency sub-band signal;
    pooling the convolution features to obtain pooled features of the low-frequency sub-band signal;
    down-sampling the pooled features to obtain down-sampled features of the low-frequency sub-band signal; and
    convolving the down-sampled features to obtain the low-frequency features of the low-frequency sub-band signal.
  4. The method according to claim 3, wherein
    the down-sampling is implemented through a plurality of cascaded encoding layers; and
    down-sampling the pooled features to obtain the down-sampled features of the low-frequency sub-band signal comprises:
    down-sampling the pooled features through the first encoding layer among the plurality of cascaded encoding layers;
    outputting the down-sampling result of the first encoding layer to the subsequent cascaded encoding layers, which continue down-sampling and outputting down-sampling results until the last encoding layer; and
    using the down-sampling result output by the last encoding layer as the down-sampled features of the low-frequency sub-band signal.
  5. The method according to claim 1, wherein acquiring the high-frequency features of the high-frequency sub-band signal comprises:
    calling a first neural network model to extract the high-frequency features of the high-frequency sub-band signal; or
    performing band extension on the high-frequency sub-band signal to obtain the high-frequency features of the high-frequency sub-band signal.
  6. The method according to claim 5, wherein performing band extension on the high-frequency sub-band signal to obtain the high-frequency features of the high-frequency sub-band signal comprises:
    performing a frequency-domain transform based on a plurality of sample points included in the high-frequency sub-band signal to obtain transform coefficients corresponding to the plurality of sample points;
    dividing the transform coefficients corresponding to the plurality of sample points into a plurality of sub-bands;
    computing a mean based on the transform coefficients included in each of the sub-bands, using the mean as the average energy corresponding to each of the sub-bands, and using the average energy as the sub-band spectral envelope corresponding to each of the sub-bands; and
    determining the sub-band spectral envelopes corresponding to the plurality of sub-bands as the high-frequency features of the high-frequency sub-band signal.
  7. The method according to claim 6, wherein performing the frequency-domain transform based on the plurality of sample points included in the high-frequency sub-band signal to obtain the transform coefficients corresponding to the plurality of sample points comprises:
    acquiring a reference high-frequency sub-band signal of a reference audio signal, wherein the reference audio signal is an audio signal adjacent to the audio signal; and
    performing a discrete cosine transform on the plurality of sample points included in the high-frequency sub-band signal based on a plurality of sample points included in the reference high-frequency sub-band signal and the plurality of sample points included in the high-frequency sub-band signal, to obtain the transform coefficients corresponding to the plurality of sample points included in the high-frequency sub-band signal.
  8. The method according to claim 6, wherein determining the average energy corresponding to each of the sub-bands based on the transform coefficients included in each of the sub-bands comprises:
    determining the sum of squares of the transform coefficients corresponding to the sample points included in each of the sub-bands; and
    determining the ratio of the sum of squares to the number of sample points included in the sub-band as the average energy corresponding to each of the sub-bands.
  9. The method according to claim 1, wherein performing quantized coding on the low-frequency features to obtain the low-frequency code stream of the audio signal comprises:
    quantizing the low-frequency features into index values of the low-frequency features; and
    performing entropy coding on the index values of the low-frequency features to obtain the low-frequency code stream of the audio signal; and
    wherein performing quantized coding on the high-frequency features to obtain the high-frequency code stream of the audio signal comprises:
    quantizing the high-frequency features into index values of the high-frequency features; and
    performing entropy coding on the index values of the high-frequency features to obtain the high-frequency code stream of the audio signal.
  10. An audio processing method, the method comprising:
    performing quantized decoding on a low-frequency code stream to obtain low-frequency features corresponding to the low-frequency code stream;
    performing quantized decoding on a high-frequency code stream to obtain high-frequency features corresponding to the high-frequency code stream, wherein the low-frequency code stream and the high-frequency code stream are obtained by separately encoding sub-band signals obtained by sub-band decomposition of an audio signal, and the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features;
    performing feature reconstruction on the low-frequency features to obtain a low-frequency sub-band signal corresponding to the low-frequency features;
    performing high-frequency reconstruction on the high-frequency features to obtain a high-frequency sub-band signal corresponding to the high-frequency features; and
    performing sub-band synthesis on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain a synthesized audio signal corresponding to the low-frequency code stream and the high-frequency code stream.
  11. The method according to claim 10, wherein performing feature reconstruction on the low-frequency features to obtain the low-frequency sub-band signal corresponding to the low-frequency features comprises:
    convolving the low-frequency features to obtain convolution features of the low-frequency features;
    up-sampling the convolution features to obtain up-sampled features of the low-frequency features;
    pooling the up-sampled features to obtain pooled features of the low-frequency features; and
    convolving the pooled features to obtain the low-frequency sub-band signal corresponding to the low-frequency features.
  12. The method according to claim 11, wherein
    the up-sampling is implemented through a plurality of cascaded decoding layers; and
    up-sampling the convolution features to obtain the up-sampled features of the low-frequency features comprises:
    up-sampling the convolution features through the first decoding layer among the plurality of cascaded decoding layers;
    outputting the up-sampling result of the first decoding layer to the subsequent cascaded decoding layers, which continue up-sampling and outputting up-sampling results until the last decoding layer; and
    using the up-sampling result output by the last decoding layer as the up-sampled features of the low-frequency features.
  13. The method according to claim 10, wherein performing high-frequency reconstruction on the high-frequency features to obtain the high-frequency sub-band signal corresponding to the high-frequency features comprises:
    calling a second neural network model to perform feature reconstruction on the high-frequency features to obtain the high-frequency sub-band signal corresponding to the high-frequency features; or
    performing the inverse process of band extension on the high-frequency features to obtain the high-frequency sub-band signal corresponding to the high-frequency features.
  14. The method according to claim 13, wherein performing the inverse process of band extension on the high-frequency features to obtain the high-frequency sub-band signal corresponding to the high-frequency features comprises:
    performing a frequency-domain transform based on a plurality of sample points included in the low-frequency sub-band signal to obtain transform coefficients corresponding to the plurality of sample points;
    performing spectral copy processing on the latter half of the transform coefficients corresponding to the plurality of sample points to obtain reference transform coefficients of a reference high-frequency sub-band signal;
    amplifying the reference transform coefficients of the reference high-frequency sub-band signal based on the sub-band spectral envelopes corresponding to the high-frequency features to obtain the amplified reference transform coefficients; and
    performing an inverse frequency-domain transform on the amplified reference transform coefficients to obtain the high-frequency sub-band signal corresponding to the high-frequency features.
  15. The method according to claim 14, wherein amplifying the reference transform coefficients of the reference high-frequency sub-band signal based on the sub-band spectral envelopes corresponding to the high-frequency features to obtain the amplified reference transform coefficients comprises:
    dividing the reference transform coefficients of the reference high-frequency sub-band signal into a plurality of sub-bands based on the sub-band spectral envelopes corresponding to the high-frequency features; and
    performing the following processing for any of the plurality of sub-bands:
    determining the first average energy corresponding to the sub-band in the sub-band spectral envelopes, and determining the second average energy corresponding to the sub-band;
    determining an amplification factor based on the ratio of the first average energy to the second average energy; and
    multiplying the amplification factor by each of the reference transform coefficients included in the sub-band to obtain the amplified reference transform coefficients.
  16. The method according to claim 10, wherein
    performing quantized decoding on the low-frequency code stream to obtain the low-frequency features corresponding to the low-frequency code stream comprises:
    performing entropy decoding on the low-frequency code stream to obtain index values corresponding to the low-frequency code stream; and
    performing inverse quantization on the index values corresponding to the low-frequency code stream to obtain the low-frequency features corresponding to the low-frequency code stream; and
    performing quantized decoding on the high-frequency code stream to obtain the high-frequency features corresponding to the high-frequency code stream comprises:
    performing entropy decoding on the high-frequency code stream to obtain index values corresponding to the high-frequency code stream; and
    performing inverse quantization on the index values corresponding to the high-frequency code stream to obtain the high-frequency features corresponding to the high-frequency code stream.
  17. The method according to claim 10, wherein performing sub-band synthesis on the low-frequency sub-band signal and the high-frequency sub-band signal to obtain the synthesized audio signal corresponding to the low-frequency code stream and the high-frequency code stream comprises:
    up-sampling the low-frequency sub-band signal to obtain a low-pass filtered signal;
    up-sampling the high-frequency sub-band signal to obtain a high-frequency filtered signal; and
    performing filter synthesis on the low-pass filtered signal and the high-frequency filtered signal to obtain the synthesized audio signal.
  18. An audio processing apparatus, the apparatus comprising:
    a decomposition module configured to decompose an audio signal into a low-frequency sub-band signal and a high-frequency sub-band signal;
    a feature extraction module configured to acquire low-frequency features of the low-frequency sub-band signal;
    a high-frequency analysis module configured to acquire high-frequency features of the high-frequency sub-band signal, wherein the feature dimension of the high-frequency features is lower than the feature dimension of the low-frequency features; and
    an encoding module configured to perform quantized coding on the low-frequency features to obtain a low-frequency code stream of the audio signal, and perform quantized coding on the high-frequency features to obtain a high-frequency code stream of the audio signal.
  19. An electronic device, the electronic device comprising:
    a memory for storing a computer program or instructions; and
    a processor for implementing the audio processing method according to any one of claims 1 to 17 when executing the computer program or instructions stored in the memory.
  20. A computer-readable storage medium storing a computer program or instructions for implementing, when executed by a processor, the audio processing method according to any one of claims 1 to 17.
  21. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the audio processing method according to any one of claims 1 to 17.
PCT/CN2023/088638 2022-06-15 2023-04-17 音频处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品 WO2023241205A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210681365.XA CN115116456A (zh) 2022-06-15 2022-06-15 音频处理方法、装置、设备、存储介质及计算机程序产品
CN202210681365.X 2022-06-15

Publications (1)

Publication Number Publication Date
WO2023241205A1 true WO2023241205A1 (zh) 2023-12-21

Family

ID=83328558

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/088638 WO2023241205A1 (zh) 2022-06-15 2023-04-17 音频处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品

Country Status (2)

Country Link
CN (1) CN115116456A (zh)
WO (1) WO2023241205A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116456A (zh) * 2022-06-15 2022-09-27 腾讯科技(深圳)有限公司 音频处理方法、装置、设备、存储介质及计算机程序产品

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5765127A (en) * 1992-03-18 1998-06-09 Sony Corp High efficiency encoding method
CN102158692A (zh) * 2010-02-11 2011-08-17 华为技术有限公司 编码方法、解码方法、编码器和解码器
WO2016023322A1 (zh) * 2014-08-15 2016-02-18 北京天籁传音数字技术有限公司 多声道声音信号编码方法、解码方法及装置
CN110556123A (zh) * 2019-09-18 2019-12-10 腾讯科技(深圳)有限公司 频带扩展方法、装置、电子设备及计算机可读存储介质
CN113470667A (zh) * 2020-03-11 2021-10-01 腾讯科技(深圳)有限公司 语音信号的编解码方法、装置、电子设备及存储介质
CN115116456A (zh) * 2022-06-15 2022-09-27 腾讯科技(深圳)有限公司 音频处理方法、装置、设备、存储介质及计算机程序产品

Also Published As

Publication number Publication date
CN115116456A (zh) 2022-09-27

Similar Documents

Publication Publication Date Title
JP4374233B2 (ja) 複数因子分解可逆変換(multiplefactorizationreversibletransform)を用いたプログレッシブ・ツー・ロスレス埋込みオーディオ・コーダ(ProgressivetoLosslessEmbeddedAudioCoder:PLEAC)
JP4850837B2 (ja) 異なるサブバンド領域同士の間の通過によるデータ処理方法
CN101944362B (zh) 一种基于整形小波变换的音频无损压缩编码、解码方法
US20220180881A1 (en) Speech signal encoding and decoding methods and apparatuses, electronic device, and storage medium
CN103187065B (zh) 音频数据的处理方法、装置和系统
WO2019233362A1 (zh) 基于深度学习的语音音质增强方法、装置和系统
JPWO2007088853A1 (ja) 音声符号化装置、音声復号装置、音声符号化システム、音声符号化方法及び音声復号方法
WO2019233364A1 (zh) 基于深度学习的音频音质增强
WO2023241240A1 (zh) 音频处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品
WO2023241205A1 (zh) 音频处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品
WO2023241222A1 (zh) 音频处理方法、装置、设备、存储介质及计算机程序产品
WO2023241254A1 (zh) 音频编解码方法、装置、电子设备、计算机可读存储介质及计算机程序产品
WO2023241193A1 (zh) 音频编码方法、装置、电子设备、存储介质及程序产品
Joseph Spoken digit compression using wavelet packet
CN115116457A (zh) 音频编码及解码方法、装置、设备、介质及程序产品
CN113314131B (zh) 一种基于两级滤波的多步音频对象编解码方法
Romano et al. A real-time audio compression technique based on fast wavelet filtering and encoding
CN117219095A (zh) 音频编码方法、音频解码方法、装置、设备及存储介质
CN117198301A (zh) 音频编码方法、音频解码方法、装置、可读存储介质
CN113314132A (zh) 一种应用于交互式音频系统中的音频对象编码方法、解码方法及装置
CN117219099A (zh) 音频编码、音频解码方法、音频编码装置、音频解码装置
JPH09127987A (ja) 信号符号化方法及び装置
JP5491193B2 (ja) 音声コード化の方法および装置
CN117834596A (zh) 音频处理方法、装置、设备、存储介质及计算机程序产品
CN117831548A (zh) 音频编解码系统的训练方法、编码方法、解码方法、装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23822776

Country of ref document: EP

Kind code of ref document: A1