WO2021258940A1 - Audio encoding and decoding method, apparatus, medium, and electronic device - Google Patents

Audio encoding and decoding method, apparatus, medium, and electronic device

Info

Publication number
WO2021258940A1
Authority
WO
WIPO (PCT)
Prior art keywords: frequency, audio, low, information, encoded
Application number
PCT/CN2021/095022
Other languages: English (en), French (fr)
Inventor
梁俊斌 (LIANG Junbin)
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2021258940A1
Priority to US 17/740,304 (published as US20220270623A1)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N3/08 - Learning methods
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - ... using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204 - ... using subband decomposition
    • G10L19/04 - ... using predictive techniques
    • G10L19/08 - Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - ... characterised by the type of extracted parameters
    • G10L25/18 - ... the extracted parameters being spectral information of each sub-band
    • G10L25/27 - ... characterised by the analysis technique
    • G10L25/30 - ... using neural networks
    • G10L25/48 - ... specially adapted for particular use
    • G10L25/51 - ... for comparison or discrimination

Description

  • This application relates to the field of artificial intelligence technology, in particular to audio coding and decoding technology.
  • Audio codecs occupy an important position in modern communication systems. By compressing and encoding audio data, the network bandwidth pressure of transmitting audio data can be reduced, and the storage and transmission costs of audio data can be saved.
  • The frequency spectrum of audio data such as music and speech is concentrated mainly in the low frequency band, with only a small share in the high frequency band. If the entire frequency band is encoded, then protecting the high-frequency data forces the low-frequency data to be coded too finely, producing a huge amount of encoded file data and making the ideal compression effect hard to achieve; if instead only the main low-frequency components are kept and the high-frequency components are discarded, sound quality is lost and the decoded audio is seriously distorted. Therefore, how to ensure the accurate transmission of high-frequency data as far as possible during audio encoding and decoding is an urgent problem to be solved.
  • The purpose of this application is to provide an audio encoding method, an audio decoding method, an audio encoding device, an audio decoding device, a computer-readable medium, and an electronic device, which, at least to a certain extent, overcome technical problems in audio codec technology such as the loss of high-frequency audio components and poor transmission accuracy.
  • According to one aspect, an audio encoding method is provided, which includes: performing subband decomposition on the audio to be encoded to obtain a low-frequency signal to be encoded corresponding to the low frequency band and a high-frequency signal to be encoded corresponding to the high frequency band; compressing and encoding the low-frequency signal to be encoded to obtain the low-frequency encoded data of the low-frequency signal to be encoded; determining high-frequency prediction information from the low-frequency signal to be encoded based on the correlation between low-frequency and high-frequency signals; performing feature extraction on the high-frequency signal to be encoded to obtain its high-frequency feature information, and determining the high-frequency compensation information of the high-frequency signal to be encoded according to the difference between the high-frequency feature information and the high-frequency prediction information; and encapsulating the low-frequency encoded data and the high-frequency compensation information to obtain the audio encoded data of the audio to be encoded.
  • According to another aspect, an audio encoding device is provided, comprising: an audio decomposition module for performing subband decomposition on the audio to be encoded to obtain a low-frequency signal to be encoded corresponding to the low frequency band and a high-frequency signal to be encoded corresponding to the high frequency band; a low-frequency encoding module for compressing and encoding the low-frequency signal to be encoded to obtain the low-frequency encoded data; a high-frequency prediction module for determining high-frequency prediction information from the low-frequency signal to be encoded based on the correlation between low-frequency and high-frequency signals; a high-frequency compensation module for performing feature extraction on the high-frequency signal to be encoded to obtain its high-frequency feature information, and determining the high-frequency compensation information according to the difference between the high-frequency feature information and the high-frequency prediction information; and an encapsulation module for encapsulating the low-frequency encoded data and the high-frequency compensation information to obtain the audio encoded data.
  • According to another aspect, an audio decoding method is provided, which includes: parsing the encapsulation of the audio encoded data to be decoded to obtain the low-frequency encoded data and high-frequency compensation information in the audio encoded data; decoding the low-frequency encoded data to obtain a restored low-frequency signal; determining high-frequency prediction information from the restored low-frequency signal based on the correlation between low-frequency and high-frequency signals; performing gain compensation on the high-frequency prediction information according to the high-frequency compensation information to obtain high-frequency feature information, and performing feature restoration on the high-frequency feature information to obtain a restored high-frequency signal; and performing subband synthesis on the restored low-frequency signal and the restored high-frequency signal to obtain the restored audio of the audio encoded data.
  • According to another aspect, an audio decoding device is provided, comprising: an encapsulation parsing module for parsing the encapsulation of the audio encoded data to be decoded to obtain the low-frequency encoded data and high-frequency compensation information in the audio encoded data; a low-frequency decoding module for decoding the low-frequency encoded data to obtain a restored low-frequency signal; a high-frequency prediction module for determining high-frequency prediction information from the restored low-frequency signal based on the correlation between low-frequency and high-frequency signals; a high-frequency restoration module for performing gain compensation on the high-frequency prediction information according to the high-frequency compensation information to obtain high-frequency feature information, and performing feature restoration on the high-frequency feature information to obtain a restored high-frequency signal; and an audio synthesis module for performing subband synthesis on the restored low-frequency signal and the restored high-frequency signal to obtain the restored audio of the audio encoded data.
  • According to another aspect, a computer-readable medium is provided, having a computer program stored thereon which, when executed by a processor, implements the audio encoding method or the audio decoding method of the above technical solutions.
  • According to another aspect, an electronic device is provided, including a processor and a memory for storing executable instructions of the processor, wherein the processor is configured to execute the executable instructions so as to perform the audio encoding method or the audio decoding method of the above technical solutions.
  • According to another aspect, a computer program product or computer program is provided, including computer instructions stored in a computer-readable medium. The processor of a computer device reads the computer instructions from the computer-readable medium and executes them, causing the computer device to perform the audio encoding method or the audio decoding method of the above technical solutions.
  • In the technical solutions provided by this application, high-frequency prediction information can be determined from the low-frequency signal to be encoded, and high-frequency compensation information can then be determined based on the difference between the high-frequency prediction information and the high-frequency signal to be encoded. Accordingly, only the high-frequency compensation information needs to be transmitted with the audio encoded data, which greatly compresses the coding rate of the high-frequency signal and reduces the bandwidth pressure of network transmission. At the decoding end, the high-frequency signal can be reconstructed and restored from the high-frequency compensation information, ensuring its integrity and accuracy and avoiding problems such as audio distortion and poor sound quality caused by lossy compression.
  • FIG. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied;
  • FIG. 2 schematically shows a flowchart of the steps of an audio encoding method in some embodiments of the present application;
  • FIG. 3 schematically shows a flowchart of a method for obtaining high-frequency prediction information at the encoding end in some embodiments of the present application;
  • FIG. 4 schematically shows a flowchart of the method steps for classifying audio to be encoded in some embodiments of the present application;
  • FIG. 5 schematically shows a flowchart of a method for training a high-frequency prediction neural network based on a preprocessing pipeline of feature extraction first and then frequency band segmentation in some embodiments of the present application;
  • FIG. 6 schematically shows a flowchart of a method for training a high-frequency prediction neural network based on a preprocessing pipeline of frequency band segmentation first and then feature extraction in some embodiments of the present application;
  • FIG. 7 schematically shows a flowchart of the steps of a method for determining high-frequency compensation information in some embodiments of the present application;
  • FIG. 8 schematically shows a flowchart of a method for encoding an input signal in an application scenario according to an embodiment of the present application;
  • FIG. 9 schematically shows a flowchart of the steps of an audio decoding method in some embodiments of the present application;
  • FIG. 10 schematically shows a flowchart of a method for obtaining high-frequency prediction information at the decoding end in some embodiments of the present application;
  • FIG. 11 schematically shows a flowchart of a method for obtaining high-frequency feature information through gain compensation in some embodiments of the present application;
  • FIG. 12 schematically shows a flowchart of a method for decoding an input code stream in an application scenario according to an embodiment of the present application;
  • FIG. 13 schematically shows a structural block diagram of an audio encoding device provided in some embodiments of the present application;
  • FIG. 14 schematically shows a structural block diagram of an audio decoding device provided in some embodiments of the present application;
  • FIG. 15 schematically shows a structural block diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
  • Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
  • the system architecture 100 may include a terminal device 110, a network 120, and a server 130.
  • the terminal device 110 may include various electronic devices such as a smart phone, a tablet computer, a notebook computer, and a desktop computer.
  • the server 130 may be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server that provides cloud computing services.
  • the network 120 may provide various connection types of communication links between the terminal device 110 and the server 130, for example, a wired communication link or a wireless communication link.
  • the system architecture in the embodiments of the present application may have any number of terminal devices, networks, and servers.
  • the server 130 may be a server group composed of multiple server devices.
  • the technical solutions provided by the embodiments of the present application may be applied to the terminal device 110, may also be applied to the server 130, or may be implemented jointly by the terminal device 110 and the server 130, which is not specifically limited in this application.
  • For example, user A, as the voice transmitter, can collect an analog sound signal through the microphone of the terminal device 110 and convert it into a digital sound signal through an analog-to-digital conversion circuit. The digital sound signal is then compressed by a speech encoder, packaged according to the communication network transmission format and protocol, and sent to the server 130.
  • the server 130 sends the voice coded data packet to the user B as the voice receiver.
  • User B then unpacks the received speech coded data packet through the terminal device 110 to obtain the speech coded compressed code stream, regenerates the digital speech signal from the compressed code stream through the speech decoder, and finally converts the digital signal back to analog and plays the sound through the speaker.
  • Voice codecs can effectively reduce the bandwidth required to transmit voice signals, and play a decisive role in saving the storage and transmission costs of voice information and in ensuring the integrity of voice information during communication network transmission.
  • FIG. 2 schematically shows a step flow chart of the audio coding method in some embodiments of the present application.
  • the audio coding method may be executed by a terminal device, a server, or a terminal device and a server jointly.
  • the embodiment of the present application takes the audio coding method executed by the terminal device as an example for description.
  • the audio coding method may mainly include the following steps S210 to S250.
  • Step S210: Perform subband decomposition on the audio to be encoded to obtain a low-frequency signal to be encoded corresponding to the low frequency band and a high-frequency signal to be encoded corresponding to the high frequency band.
  • Step S220: Compress and encode the low-frequency signal to be encoded to obtain the low-frequency encoded data of the low-frequency signal to be encoded.
  • Step S230: Based on the correlation between low-frequency and high-frequency signals, determine the high-frequency prediction information from the low-frequency signal to be encoded.
  • Step S240: Perform feature extraction on the high-frequency signal to be encoded to obtain its high-frequency feature information, and determine the high-frequency compensation information of the high-frequency signal to be encoded according to the difference between the high-frequency feature information and the high-frequency prediction information.
  • Step S250: Encapsulate the low-frequency encoded data and the high-frequency compensation information to obtain the audio encoded data of the audio to be encoded.
  • In the audio encoding method provided by this application, the high-frequency prediction information can be determined from the low-frequency signal to be encoded, and the high-frequency compensation information can then be determined based on the difference between the high-frequency prediction information and the high-frequency signal to be encoded. Accordingly, only the high-frequency compensation information needs to be transmitted with the audio encoded data, which greatly compresses the coding rate of the high-frequency signal and reduces the bandwidth pressure of network transmission. At the decoding end, the high-frequency signal can be reconstructed and restored from the high-frequency compensation information, ensuring its integrity and accuracy and avoiding problems such as audio distortion and poor sound quality caused by lossy compression.
  • In step S210, subband decomposition is performed on the audio to be encoded to obtain a low-frequency signal to be encoded corresponding to the low frequency band and a high-frequency signal to be encoded corresponding to the high frequency band.
  • Subband decomposition converts the original audio to be encoded from the time domain to the frequency domain and then splits the complete frequency band into several contiguous frequency bands according to frequency; each of these bands is called a subband.
  • In some embodiments, a quadrature mirror filter bank composed of a low-pass filter corresponding to the low frequency band and a high-pass filter corresponding to the high frequency band can be obtained, and this filter bank can then be used to decompose the audio to be encoded into subbands, yielding the low-frequency signal to be encoded corresponding to the low frequency band and the high-frequency signal to be encoded corresponding to the high frequency band.
  • A quadrature mirror filter (QMF) bank is formed by combining two or more filters through a shared input interface or a shared output interface.
  • For example, a low-pass filter corresponding to the low frequency band and a high-pass filter corresponding to the high frequency band can form a quadrature mirror filter bank by sharing an input interface. The audio to be encoded is input to this filter bank, and after subband decomposition the low-frequency signal to be encoded is obtained at the output of the low-pass filter and the high-frequency signal to be encoded at the output of the high-pass filter.
  • The advantage of using the quadrature mirror filter bank is that it cancels the spectrum aliasing caused by the subband decomposition.
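  • As an illustration of this subband decomposition step, the following Python sketch builds a two-channel QMF pair from a low-pass prototype and splits a signal into decimated low- and high-band components. The patent prescribes neither a language nor a filter design; numpy/scipy, the filter length, and the prototype design here are assumptions for illustration only:

```python
import numpy as np
from scipy.signal import firwin

def qmf_analysis(x, num_taps=64):
    """Two-channel QMF analysis: split x into half-band low/high
    subband signals, each decimated by 2 (illustrative sketch)."""
    h0 = firwin(num_taps, 0.5)                 # half-band low-pass prototype
    h1 = h0 * (-1.0) ** np.arange(num_taps)    # mirrored high-pass filter
    low = np.convolve(x, h0)[::2]              # filter, then downsample by 2
    high = np.convolve(x, h1)[::2]
    return low, high
```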
  • In step S220, compression encoding is performed on the low-frequency signal to be encoded to obtain the low-frequency encoded data of the low-frequency signal to be encoded.
  • For the low-frequency signal to be encoded obtained by subband decomposition, an encoder can compress and encode it to obtain the corresponding low-frequency encoded data.
  • The basic idea of audio compression coding is to use the encoder to remove the time-domain redundancy, frequency-domain redundancy, and auditory redundancy in the audio signal, thereby achieving compression of the audio signal.
  • Existing audio compression coding methods mainly compress redundant information based on techniques such as long-term prediction (LTP), short-term linear prediction (LPC), pitch period search, and band replication.
  • In some embodiments, encoding algorithms such as CELP, SILK, or AAC may be used to compress and encode the low-frequency signal to be encoded.
  • The CELP (Code Excited Linear Prediction) encoding algorithm is an effective medium-to-low bit rate speech compression coding technology. It uses a codebook as the excitation source and has the advantages of a low bit rate, high synthesized speech quality, and strong noise robustness; it is widely used at rates of 4.8 to 16 kbps.
  • the speech coders using CELP technology include G.723, G.728, G.729, G.722.2 and so on.
  • The SILK encoding algorithm is a wideband audio encoder developed by the instant messaging software Skype and made available to third-party developers and hardware manufacturers.
  • the SILK encoding algorithm has good flexibility for audio bandwidth, network bandwidth and algorithm complexity.
  • The AAC (Advanced Audio Coding) encoding algorithm is an audio compression algorithm with a high compression ratio based on MPEG-2. Thanks to its multi-channel support and low-complexity design, AAC can provide better sound quality while greatly compressing the audio data.
  • In step S230, based on the correlation between the low-frequency signal and the high-frequency signal, the high-frequency prediction information is determined from the low-frequency signal to be encoded.
  • Fig. 3 schematically shows a flowchart of a method for obtaining high-frequency prediction information at an encoding end in some embodiments of the present application.
  • As shown in FIG. 3, step S230 (determining the high-frequency prediction information from the low-frequency signal to be encoded based on the correlation between the low-frequency signal and the high-frequency signal) may mainly include the following steps S310 to S330.
  • Step S310: Perform classification processing on the audio to be encoded to obtain the audio category information of the audio to be encoded.
  • Step S320: Determine the high-frequency prediction neural network corresponding to the audio category information; the high-frequency prediction neural network is trained based on the correlation between low-frequency and high-frequency signals.
  • Step S330: Map the low-frequency signal to be encoded through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  • Considering the correlation between low-frequency and high-frequency signals, the embodiments of this application use a neural network that takes the low-frequency signal as input and the high-frequency signal as the prediction target. For different types of audio data, however, the correlation between the high- and low-frequency signals differs. For example, voiced signals have an obvious harmonic structure, so their low-frequency and high-frequency spectra share similar harmonic structures; unvoiced signals have no harmonic components and are characterized by a block-like distribution of mid- and high-frequency energy, with high-frequency energy much higher than low-frequency energy; and music signals depend on the acoustic characteristics of the different sounding instruments.
  • Therefore, the embodiments of the present application propose to classify the audio data first and then, based on the classification result, use different types of neural networks for training and prediction, so as to obtain more stable and accurate prediction results.
  • The high-frequency prediction neural network used in the embodiments of this application can be implemented with various network architectures, such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Generative Adversarial Networks (GAN), and so on.
  • In other embodiments, machine learning models other than neural networks may also be used to map the low-frequency signal to be encoded to the corresponding high-frequency prediction information; this application does not specifically limit this.
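  • The patent does not fix a concrete network architecture for the high-frequency prediction neural network. As one hedged example, a minimal fully connected predictor in PyTorch that maps low-band spectral features to predicted high-band features might look like the following; the layer sizes and the 256-bin split are illustrative assumptions, echoing the 256/256 division discussed later:

```python
import torch
import torch.nn as nn

class HighFreqPredictor(nn.Module):
    """Illustrative predictor: low-band MDCT features in, predicted
    high-band features out. One such model would be trained per audio
    category (voiced, unvoiced, non-speech, music)."""
    def __init__(self, n_bins=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins),
        )

    def forward(self, low_feats):    # (batch, n_bins) low-band features
        return self.net(low_feats)   # (batch, n_bins) high-band prediction
```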
  • FIG. 4 schematically shows a flowchart of the method steps for classifying the audio to be encoded in some embodiments of the present application. As shown in FIG. 4, on the basis of the above embodiments, step S310 (performing classification processing on the audio to be encoded to obtain its audio category information) may include the following steps S410 to S440.
  • Step S410: Acquire audio data samples and label them frame by frame to obtain the audio category identifier of each data frame in the audio data samples.
  • The audio data samples can be real audio data collected by audio input devices such as a microphone, or artificially constructed data synthesized by audio synthesis software. They include many different types of data, such as voiced speech, unvoiced speech, non-speech, music, and so on. The samples are labeled with the data frame as the unit to obtain the audio category identifier of each data frame.
  • Step S420: Perform feature extraction on the audio data samples across multiple feature dimensions to obtain the multi-dimensional sample features of the audio data samples.
  • To obtain sample features with strong characterization capability, this step extracts features across multiple feature dimensions. In some embodiments, the extracted multi-dimensional sample features may include dimensions such as spectral flatness, spectral slope, pitch period, and MDCT (Modified Discrete Cosine Transform) coefficients together with their first- and second-order derivatives. A rough sketch of such per-frame features is given below.
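  • The following numpy sketch computes a few of the named feature dimensions per frame; the exact feature set, windowing, and pitch search range are not specified in the source and are assumed purely for illustration:

```python
import numpy as np

def frame_features(frame, min_lag=32, max_lag=400):
    """Illustrative multi-dimensional frame features: spectral flatness,
    spectral slope, and a crude autocorrelation pitch period."""
    eps = 1e-12
    spec = np.abs(np.fft.rfft(frame)) + eps
    flatness = np.exp(np.mean(np.log(spec))) / np.mean(spec)
    slope = np.polyfit(np.arange(len(spec)), spec, 1)[0]
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    pitch = np.argmax(ac[min_lag:max_lag]) + min_lag
    return np.array([flatness, slope, pitch], dtype=np.float32)
```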
  • Step S430: Taking the multi-dimensional sample features as input values and the audio category identifiers as the corresponding target values, train an audio classification neural network for classifying audio data.
  • The training goal of the audio classification neural network is that, for an audio data sample, the correct audio category identifier is output. When an incorrect identifier is output, the network parameters of the neural network can be updated to improve its ability to predict the correct audio category identifier.
  • The training process can be ended when a preset convergence target is reached; the convergence target can be, for example, that the error of the loss function falls below an error threshold, or that the number of training iterations exceeds a count threshold.
  • Step S440: Classify the audio to be encoded through the audio classification neural network to obtain the audio category information of the audio to be encoded.
  • For the audio to be encoded, the same feature extraction method as for the audio data samples is used to obtain the corresponding multi-dimensional audio features; these features are then input into the trained audio classification neural network, which processes them and outputs the audio category information with the highest predicted probability.
  • In this way, an audio classification neural network can be trained on audio data samples and then used to accurately predict the audio category of the audio to be encoded.
  • Once the audio classification neural network is trained, its network structure and network parameters can be saved on the terminal device serving as the encoding end, or on the server. When new audio to be encoded needs to be classified, the network can be invoked directly to determine its audio category information quickly and accurately.
  • In some embodiments, the audio to be encoded is classified into four types: voiced speech, unvoiced speech, non-speech, and music. Accordingly, four types of high-frequency prediction neural networks can be trained, each dedicated to predicting high-frequency information from low-frequency signals for audio to be encoded of the corresponding category.
  • In other embodiments, the audio to be encoded may be further subdivided into more audio categories based on its spectral energy distribution. The finer the category division, the more accurate the representation and prediction capability of the corresponding high-frequency prediction neural network.
  • FIGS. 5 and 6 respectively show two methods for training high-frequency prediction neural networks based on different preprocessing pipelines.
  • FIG. 5 schematically shows a flowchart of the method steps for training a high-frequency prediction neural network based on a preprocessing pipeline of feature extraction first and then frequency band segmentation in some embodiments of the present application.
  • As shown in FIG. 5, the training method of this high-frequency prediction neural network may mainly include the following steps S510 to S530.
  • Step S510: Acquire audio data samples corresponding to the audio category information, and compress and transform them to obtain the spectral feature samples of the audio data samples.
  • That is, feature extraction is performed on the audio data samples through a compression transform to obtain their spectral feature samples. The compression transform may be, for example, the Modified Discrete Cosine Transform (MDCT).
  • MDCT is a linear orthogonal lapped transform that uses time-domain alias cancellation (TDAC) with a 50% time-domain overlapping window; it effectively overcomes the periodic noise caused by edge effects without reducing coding performance. Other forms of discrete Fourier transform (DFT) may also be used as the compression transform.
  • Step S520: Divide the spectral feature samples according to frequency point values to obtain low-frequency feature samples and high-frequency feature samples.
  • The spectral feature sample obtained by compressing and transforming an audio data sample in step S510 is full-band data; it can be divided into a low-frequency MDCT part and a high-frequency MDCT part according to the corresponding physical frequency values.
  • For example, an audio data sample of 1024 sample points can be processed by MDCT to obtain a spectral feature sample of 512 frequency points; the data of points 1 to 256 can then be taken as the low-frequency feature sample, and the data of points 257 to 512 as the high-frequency feature sample.
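  • A naive numpy rendering of this compression transform and split, evaluating the MDCT definition directly and omitting the windowing and 50% overlap a production codec would add, might read:

```python
import numpy as np

def mdct(frame):
    """Direct MDCT: 2N input samples -> N coefficients (no windowing)."""
    n2 = len(frame)
    n = n2 // 2
    t = np.arange(n2)[:, None]
    k = np.arange(n)[None, :]
    basis = np.cos(np.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
    return frame @ basis

coeffs = mdct(np.random.randn(1024))              # 1024 samples -> 512 bins
low_feat, high_feat = coeffs[:256], coeffs[256:]  # split as described above
```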
  • Step S530: Train the high-frequency prediction neural network using the low-frequency feature samples as input values and the high-frequency feature samples as the corresponding target values.
  • The high-frequency prediction neural network can use various network architectures such as CNN, RNN, and GAN. In addition, the embodiments of the present application can also train machine learning models other than neural networks to predict high-frequency signals from low-frequency signals; this application does not specifically limit this.
  • In the embodiment shown in FIG. 5, the audio data samples of the entire frequency band are first compressed and transformed, and the resulting spectrum is then segmented by frequency point value into low-frequency feature samples and high-frequency feature samples. The advantage of this preprocessing scheme is that each audio data sample requires only one compression transform and one band-splitting operation, which reduces computation cost and improves the processing efficiency of the samples.
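  • A minimal PyTorch training loop for this step, reusing the HighFreqPredictor sketched earlier, could look as follows; the MSE loss and Adam optimizer are assumptions, since the source names neither a loss nor an optimizer:

```python
import torch
import torch.nn as nn

def train_predictor(model, low_feats, high_feats, epochs=100, lr=1e-3):
    """Regression training: low-band features as input values, high-band
    features as the corresponding target values (illustrative sketch)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(low_feats), high_feats)
        loss.backward()
        opt.step()
    return model
```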
  • FIG. 6 schematically shows a flowchart of the method steps for training a high-frequency prediction neural network based on a preprocessing pipeline of frequency band segmentation first and then feature extraction in some embodiments of the present application. As shown in FIG. 6, the training method of this high-frequency prediction neural network may mainly include the following steps S610 to S630.
  • Step S610: Acquire audio data samples corresponding to the audio category information, and decompose each audio data sample into a low-frequency data sample and a high-frequency data sample according to frequency band.
  • Step S620: Perform a compression transform on the low-frequency data sample and the high-frequency data sample respectively to obtain the corresponding low-frequency feature samples and high-frequency feature samples.
  • Step S630: Train the high-frequency prediction neural network using the low-frequency feature samples as input values and the high-frequency feature samples as the corresponding target values.
  • Different from the embodiment shown in FIG. 5, this embodiment adopts a preprocessing scheme of first performing frequency band segmentation on the audio data samples and then performing the compression transform. Compared with the scheme of FIG. 5, this adds one compression transform per audio data sample, that is, a compression transform is required for the low-frequency data sample and the high-frequency data sample respectively.
  • The advantage of this preprocessing scheme is that it keeps the training process of the high-frequency prediction neural network consistent with how the network is used, which can improve the accuracy of high-frequency signal prediction to a certain extent.
  • After training, the network structure and network parameters of the high-frequency prediction neural network can be saved on the terminal devices of the encoding end and the decoding end, or on the server.
  • In one embodiment, when high-frequency prediction needs to be performed on the low-frequency signal, the high-frequency prediction neural network may directly map the low-frequency signal to be encoded to obtain the corresponding high-frequency prediction information.
  • In another embodiment, the low-frequency encoded data obtained by compressing and encoding the low-frequency signal to be encoded may first be decoded to obtain a low-frequency decoded signal corresponding to the low-frequency signal to be encoded; the low-frequency decoded signal is then mapped through the high-frequency prediction neural network to obtain the high-frequency prediction information. This high-frequency prediction scheme keeps the operations of the encoding end and the decoding end consistent, thereby improving the accuracy of high-frequency prediction.
  • In some embodiments, the method for mapping the low-frequency decoded signal through the high-frequency prediction neural network may include: compressing and transforming the low-frequency decoded signal to obtain its low-frequency spectral features, and then mapping these features through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  • The method of compressing and transforming the low-frequency decoded signal may be, for example, the modified discrete cosine transform (MDCT) or another form of discrete Fourier transform (DFT).
  • In step S240, feature extraction is performed on the high-frequency signal to be encoded to obtain the high-frequency feature information of the high-frequency signal to be encoded, and the high-frequency compensation information of the high-frequency signal to be encoded is determined according to the difference between the high-frequency feature information and the high-frequency prediction information.
  • The feature extraction method for the high-frequency signal to be encoded can be the same compression transform as used for the low-frequency signal to be encoded (or the low-frequency decoded signal); that is, the high-frequency signal to be encoded is compressed and transformed to obtain its high-frequency feature information. Using the same feature extraction method gives the high-frequency feature information and the high-frequency prediction information consistent feature attributes, which facilitates determining the feature difference between the two. According to the difference between the high-frequency feature information and the high-frequency prediction information, the high-frequency compensation information of the high-frequency signal to be encoded can be determined.
  • FIG. 7 schematically shows a flowchart of the method steps for determining the high-frequency compensation information in some embodiments of the present application. As shown in FIG. 7, determining the high-frequency compensation information of the high-frequency signal to be encoded according to the difference between the high-frequency feature information and the high-frequency prediction information in step S240 may mainly include the following steps S710 to S730.
  • Step S710: Map the high-frequency feature information from the linear frequency domain to the critical band domain to obtain the feature spectrum information corresponding to the high-frequency feature information.
  • Step S720: Map the high-frequency prediction information from the linear frequency domain to the critical band domain to obtain the predicted spectrum information corresponding to the high-frequency prediction information.
  • Step S730: Determine the high-frequency compensation information of the high-frequency signal to be encoded according to the difference between the feature spectrum information and the predicted spectrum information.
  • The critical band domain is a term of art in audiology and psychoacoustics. The critical band refers to the frequency bandwidth of the auditory filter that arises from the structure of the auditory sensor (such as the cochlea in the human ear). Intuitively, a critical band is the frequency band within which the perception of a first tone is disturbed by the auditory masking of a second tone.
  • In acoustics, auditory filters are used to simulate the different critical bands. The structure of the human ear resonates at roughly 24 frequency points, so the audio signal in the critical band domain is correspondingly divided into 24 critical bands, numbered 1 to 24.
  • Compared with the linear frequency domain, the Bark domain better matches the human ear's perception of acoustic frequency, and its number of subbands is relatively small, which is conducive to coding compression.
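  • One common way to realize such a linear-to-Bark mapping is Zwicker's approximation; the numpy sketch below groups linear-frequency bins into 24 Bark bands and returns the per-band energy E(k). The assumed sample rate and the integer band edges are illustrative, not taken from the source:

```python
import numpy as np

def bark_band_energy(spec, sr=32000):
    """Group a linear-frequency magnitude spectrum into 24 Bark bands
    and return the per-band RMS energy E(k) (illustrative sketch)."""
    freqs = np.linspace(0, sr / 2, len(spec))
    # Zwicker's Bark approximation
    z = 13 * np.arctan(0.00076 * freqs) + 3.5 * np.arctan((freqs / 7500) ** 2)
    bands = np.clip(z.astype(int), 0, 23)
    energy = np.zeros(24)
    for b in range(24):
        sel = spec[bands == b]
        if sel.size:
            energy[b] = np.sqrt(np.mean(sel ** 2))
    return energy
```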
  • In some embodiments, logarithmic transformation may be performed on the feature spectrum information and the predicted spectrum information respectively to obtain a feature spectrum logarithm and a predicted spectrum logarithm; the gain code table is then queried according to the difference between the feature spectrum logarithm and the predicted spectrum logarithm to obtain a gain quantization value, and this gain quantization value is determined as the high-frequency compensation information of the high-frequency signal to be encoded.
  • The gain code table is a quantization table of size N whose values increase monotonically; the gain quantization value is obtained by querying this table.
  • After mapping into the Bark domain, the corresponding Bark-domain spectrum information E(k) can be obtained, and logarithmic transformation then gives the spectrum logarithm 20*log10(E(k)^2), where k is the index of the high-frequency subband. The difference ΔE(k) between the feature spectrum logarithm and the predicted spectrum logarithm is then determined.
  • The query logic for numerically quantizing the difference ΔE(k) against the gain code table is as follows: Table is the gain code table with monotonically increasing values; N is the size of the gain code table, meaning the table contains N quantized values indexed 0 to N-1; Index is the finally selected gain quantization value.
  • Quantizing the gain compensation through the gain code table discretizes the originally continuous gain compensation information and reduces the amount of computation needed to encode and transmit the high-frequency signal part.
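  • Since the original pseudo-code of this table lookup is not reproduced above, the following numpy sketch shows one plausible reading of the described logic: a nearest-entry search of a monotonically increasing gain code table. The table values and the size N=64 are illustrative assumptions:

```python
import numpy as np

def quantize_gain(feat_db, pred_db, table):
    """Quantize the per-subband difference dE(k) = feat_db - pred_db to
    the index of the nearest entry of an increasing gain code table."""
    delta = feat_db - pred_db
    idx = np.clip(np.searchsorted(table, delta), 1, len(table) - 1)
    lower_closer = (delta - table[idx - 1]) < (table[idx] - delta)
    return np.where(lower_closer, idx - 1, idx)   # gain quantization indices

table = np.linspace(-30.0, 30.0, 64)              # N = 64 increasing values
index = quantize_gain(np.array([3.2]), np.array([1.0]), table)
```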
  • In step S250, encapsulation processing is performed on the low-frequency encoded data and the high-frequency compensation information to obtain the audio encoded data of the audio to be encoded.
  • the encapsulation process is a process of combining various coded contents to form a specified audio file.
  • the audio coded data obtained by encapsulation can be audio files in various formats such as MP3, AAC, WAV, FLAC, DSD, etc.
  • In some embodiments, the audio to be encoded is first classified to obtain the corresponding audio category information, and different types of high-frequency prediction neural networks are then selected based on that information to make targeted high-frequency predictions for the low-frequency signal to be encoded. In this case, the audio category information obtained by the classification can also be acquired in step S250 and encapsulated together with the low-frequency encoded data and the high-frequency compensation information to obtain the audio encoded data of the audio to be encoded, so that the audio category information is transmitted to the decoding end along with them.
  • FIG. 8 schematically shows a flowchart of a method for encoding an input signal in an application scenario according to an embodiment of the present application.
  • As shown in FIG. 8, the method for the encoding end to perform audio encoding on the input signal may mainly include the following steps S801 to S811.
  • Step S801: Signal classification is performed on the input signal to obtain a signal classification result, where the classification categories may include four types: voiced speech, unvoiced speech, non-speech, and music.
  • the signal classification result can guide the selection of the high-frequency prediction neural network of the codec.
  • Each signal type corresponds to one high-frequency prediction neural network; a large amount of audio data is likewise classified, and each high-frequency prediction neural network is trained independently on the corresponding data of the same type.
  • the high-frequency prediction neural network that has been trained is used in the codec.
  • Step S802: The input signal is decomposed into a low-frequency signal and a high-frequency signal through a QMF (quadrature mirror filter) bank. The advantage of using QMF is that it cancels the aliasing caused by the subband division.
  • Step S803: The low-frequency signal obtained from the decomposition in step S802 is compressed and encoded by a speech encoder to obtain the low-frequency encoding parameters of the low-frequency signal. The speech encoder used in this step can be an encoder based on algorithms such as CELP, SILK, or AAC.
  • Step S804: To keep the input of the high-frequency prediction neural network synchronized between the encoder and the decoder, the code stream of the low-frequency signal is speech-decoded once to obtain the restored low-frequency signal.
  • Step S805: The low-frequency signal restored by speech decoding in step S804 undergoes MDCT (Modified Discrete Cosine Transform) to obtain the corresponding low-frequency spectrum information.
  • Step S806: Input the low-frequency spectrum information obtained by the MDCT into the high-frequency prediction neural network selected according to the signal classification result of step S801, and perform prediction through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  • Step S807: To correspond to the acoustic perception bands of the human ear, the high-frequency prediction information is converted from the linear frequency domain to the Bark domain to obtain the Bark-domain high-frequency spectrum predicted value (which can be expressed in logarithmic form).
  • Step S808 The real high-frequency signal obtained by QMF decomposition in step S802 is transformed by MDCT to obtain relevant high-frequency spectrum information.
  • Step S809 Perform Bark domain conversion on the high frequency spectrum information obtained in Step S808 to obtain the true value of the Bark domain high frequency spectrum (which can be expressed in logarithmic form).
  • Step S810: The Bark-domain high-frequency spectrum predicted value obtained in step S807 is subtracted from the Bark-domain high-frequency spectrum real value obtained in step S809 to obtain the subband gain compensation value, which is further gain-quantized to obtain the high-frequency encoding parameters.
  • Step S811: The signal classification result obtained in step S801, the low-frequency encoding parameters obtained in step S803, and the high-frequency encoding parameters obtained in step S810 are encapsulated to form the encoding parameters for output.
  • the encoding parameters obtained by the above encoding process can be transmitted to other terminal devices or servers as audio data receiving ends through the network, so that the receiving end can decode them to obtain decoded signals.
  • Fig. 9 schematically shows a step flow chart of an audio decoding method in some embodiments of the present application.
  • the audio decoding method may be executed by a terminal device, a server, or a terminal device and a server jointly.
  • the embodiment of the present application takes the audio decoding method executed by the terminal device as an example for description.
  • the audio decoding method may mainly include the following steps S910 to S950.
  • Step S910: Parse the encapsulation of the audio encoded data to be decoded to obtain the low-frequency encoded data and high-frequency compensation information in the audio encoded data.
  • Step S920: Decode the low-frequency encoded data to obtain a restored low-frequency signal.
  • Step S930: Based on the correlation between low-frequency and high-frequency signals, determine the high-frequency prediction information from the restored low-frequency signal.
  • Step S940: Perform gain compensation on the high-frequency prediction information according to the high-frequency compensation information to obtain high-frequency feature information, and perform feature restoration on the high-frequency feature information to obtain a restored high-frequency signal.
  • Step S950: Perform subband synthesis on the restored low-frequency signal and the restored high-frequency signal to obtain the restored audio of the audio encoded data.
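  • Mirroring the QMF analysis sketch given for the encoding end, a two-channel QMF synthesis in the same illustrative style could be written as follows; filter length and delay compensation are again simplified away:

```python
import numpy as np
from scipy.signal import firwin

def qmf_synthesis(low, high, num_taps=64):
    """Two-channel QMF synthesis: upsample both subbands by 2, filter with
    the synthesis pair (g0 = h0, g1 = -h1), and sum (illustrative sketch)."""
    h0 = firwin(num_taps, 0.5)
    h1 = h0 * (-1.0) ** np.arange(num_taps)
    up_low = np.zeros(2 * len(low))
    up_low[::2] = low
    up_high = np.zeros(2 * len(high))
    up_high[::2] = high
    return 2 * (np.convolve(up_low, h0) - np.convolve(up_high, h1))
```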
  • In the audio decoding method provided by this application, the corresponding high-frequency prediction information can be determined from the restored low-frequency signal obtained by decoding, and gain compensation is then performed on the high-frequency prediction information using the high-frequency compensation information to restore the high-frequency signal. The audio decoding method uses the same high-frequency signal prediction and high-frequency gain compensation schemes at the encoding end and the decoding end, ensuring the integrity and accuracy of the high-frequency signal during transmission and avoiding problems such as audio distortion and poor sound quality caused by lossy compression.
  • In step S910, the encapsulation of the audio encoded data to be decoded is parsed to obtain the low-frequency encoded data and high-frequency compensation information in the audio encoded data.
  • The audio encoded data to be decoded may be composed of consecutive code stream units, with every two adjacent code stream units separated by code stream unit separation information.
  • In some embodiments, the audio encoded data is composed of multiple consecutive ADTS (Audio Data Transport Stream) units, each ADTS unit serving as one audio content packaging unit. Every two ADTS units are separated by a sync word, which can be 0xFFF (binary 111111111111).
  • In some embodiments, the method for parsing the encapsulation of the audio encoded data to be decoded may include: first searching for the code stream unit separation information in the audio encoded data to be decoded; then separating the code stream unit to be decoded from the audio encoded data according to the found separation information; and then performing field parsing on the code stream unit to obtain the low-frequency encoded data and high-frequency compensation information encapsulated in it.
  • For example, after the decoder receives the audio encoded data to be decoded, it can search the original code stream for the sync word 0xFFF, separate the ADTS units using this field as the delimiter, and then parse the fields of each ADTS unit to obtain the encapsulated low-frequency encoded data and high-frequency compensation information.
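  • A toy Python version of this sync-word scan is shown below; it only matches the 12-bit 0xFFF pattern, whereas a real ADTS parser would also validate the remaining header fields:

```python
def split_adts_units(stream: bytes):
    """Split a raw code stream at each 12-bit 0xFFF sync word (sketch)."""
    starts = [i for i in range(len(stream) - 1)
              if stream[i] == 0xFF and (stream[i + 1] & 0xF0) == 0xF0]
    return [stream[a:b] for a, b in zip(starts, starts[1:] + [len(stream)])]
```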
  • Through the encapsulation parsing, the low-frequency encoded data corresponding to the low-frequency signal part and the high-frequency compensation information corresponding to the high-frequency signal part can be obtained. If the audio encoded data also carries audio category information, it can likewise be obtained in this parsing step, so that a processing scheme consistent with the encoding end can be selected according to the audio category information.
  • In step S920, decoding processing is performed on the low-frequency encoded data to obtain the restored low-frequency signal.
  • A decoder can be used to decode the low-frequency encoded data to obtain the corresponding restored low-frequency signal. The decoder used in this step corresponds to the encoder used at the encoding end: for example, if the encoding end used the CELP algorithm for compression encoding, this step also uses the corresponding CELP algorithm for decoding; if the encoding end used an algorithm such as SILK or AAC, this step likewise uses the corresponding SILK or AAC algorithm for decoding.
  • In step S930, based on the correlation between the low-frequency signal and the high-frequency signal, the high-frequency prediction information is determined from the restored low-frequency signal.
  • Fig. 10 schematically shows a flow chart of the method for obtaining high-frequency prediction information at the decoding end in some embodiments of the present application.
  • As shown in FIG. 10, step S930 (determining the high-frequency prediction information from the restored low-frequency signal based on the correlation between low-frequency and high-frequency signals) may mainly include the following steps S1010 to S1030.
  • Step S1010: Parse the encapsulation of the audio encoded data to obtain the audio category information in the audio encoded data.
  • Step S1020: Determine the high-frequency prediction neural network corresponding to the audio category information; the high-frequency prediction neural network is trained based on the correlation between low-frequency and high-frequency signals.
  • Step S1030: Map the restored low-frequency signal through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  • By parsing the audio coded data, the audio category information determined when the encoding end classified the audio data can be recovered.
  • At the encoding end, the audio category information guides the selection of the high-frequency prediction neural network.
  • The decoding end can therefore select the same high-frequency prediction neural network as the encoding end based on this audio category information, which ensures that the decoding end stays consistent with the encoding end in high-frequency signal prediction.
  • In some embodiments, the high-frequency prediction neural network can be trained at the encoding end. After training, its network structure and network parameters can be saved at the encoding end, and the relevant data can also be sent to the decoding end. After the decoding end loads the network parameters into the received network structure, it obtains a high-frequency prediction neural network consistent with that of the encoding end.
  • Alternatively, the high-frequency prediction neural network can be trained at the decoding end. After training, its network structure and network parameters can be saved at the decoding end and the relevant data transmitted to the encoding end, so that the encoding end and the decoding end use the same high-frequency prediction neural network to predict high-frequency signals.
  • The method for training the high-frequency prediction neural network at the decoding end is similar or identical to that of the encoding end; refer to the relevant method steps in FIG. 5 and FIG. 6, which are not repeated here.
  • The high-frequency prediction neural network can also be trained on a server. After training, its network structure and network parameters are saved on the server, which transmits the relevant data to both the encoding end and the decoding end so that both ends use the same high-frequency prediction neural network for prediction.
  • When mapping the restored low-frequency signal through the high-frequency prediction neural network at the decoding end, the restored low-frequency signal can first be compression-transformed to obtain its low-frequency spectral features, and the high-frequency prediction neural network then maps the low-frequency spectral features to obtain the high-frequency prediction information.
  • The compression transform applied to the restored low-frequency signal may be, for example, the modified discrete cosine transform (MDCT) or another form of discrete Fourier transform (DFT).
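  • For concreteness, a direct, unoptimized MDCT sketch is given below, assuming 2N-sample frames (a minimal illustration of the compression transform; production codecs evaluate this with FFT-based fast forms over 50%-overlapped windowed frames):

    import numpy as np

    def mdct(frame):
        """MDCT of a frame of 2N samples -> N spectral coefficients
        (direct O(N^2) evaluation of the standard MDCT basis)."""
        two_n = frame.shape[0]
        n = two_n // 2
        k = np.arange(n)[:, None]
        t = np.arange(two_n)[None, :]
        basis = np.cos(np.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
        return basis @ frame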
  • In step S940, gain compensation is performed on the high-frequency prediction information according to the high-frequency compensation information to obtain high-frequency feature information, and feature restoration is performed on the high-frequency feature information to obtain a restored high-frequency signal.
  • The decoding end's method of gain-compensating the high-frequency prediction information with the high-frequency compensation information, and the encoding end's method of determining that compensation information from the difference between the high-frequency feature information and the high-frequency prediction information, are two inverse processes.
  • Likewise, feature restoration of the high-frequency feature information at the decoding end and feature extraction of the to-be-encoded high-frequency signal at the encoding end are two inverse processes.
  • FIG. 11 schematically shows a flowchart of a method for obtaining high-frequency feature information through gain compensation in some embodiments of the present application.
  • Performing gain compensation on the high-frequency prediction information according to the high-frequency compensation information in step S940 may mainly include the following steps S1110 to S1130.
  • Step S1110: Map the high-frequency prediction information from the linear frequency domain to the critical band domain to obtain predicted spectrum information corresponding to the high-frequency prediction information.
  • Step S1120: Perform gain compensation on the predicted spectrum information according to the high-frequency compensation information to obtain feature spectrum information.
  • Step S1130: Map the feature spectrum information from the critical band domain back to the linear frequency domain to obtain high-frequency feature information corresponding to the feature spectrum information.
  • A mapping from the linear frequency domain to the critical band (Bark) domain is performed at the encoding end.
  • Correspondingly, after the predicted spectrum information has been gain-compensated with the high-frequency compensation information at the decoding end, the resulting feature spectrum information must be mapped from the critical band domain back to the linear frequency domain, so that high-frequency feature information in the linear frequency domain is obtained and feature restoration can be carried out in that domain.
  • When the encoding end quantifies the difference between the feature spectrum information and the predicted spectrum information, it can work with values after a logarithmic transform.
  • Correspondingly, at the decoding end the predicted spectrum information may first be log-transformed to obtain a predicted spectrum log value; the predicted spectrum log value is then gain-compensated according to the high-frequency compensation information to obtain a feature spectrum log value; and the feature spectrum log value is finally exponentially restored to obtain the feature spectrum information.
  • The exponential restoration and the logarithmic transform are mutually inverse processes.
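  • The sketch below illustrates this log-domain compensation on one frame; the Bark subband edges, the shared gain code table, and the 20*log10(E^2) log-energy convention are assumptions carried over from the encoder-side example, and the array names and shapes are ours:

    import numpy as np

    def gain_compensate(pred, gain_idx, edges, table):
        """Adjust each Bark subband of the predicted high-band spectrum by the
        decoded log-domain gain offset, returning the compensated spectrum."""
        out = pred.copy()
        for k in range(len(edges) - 1):
            lo, hi = edges[k], edges[k + 1]
            delta = table[gain_idx[k]]             # decoded log-domain offset
            out[lo:hi] *= 10.0 ** (delta / 40.0)   # 20*log10(E^2) = 40*log10(E)
        return out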
  • In step S940, after the high-frequency feature information corresponding to the feature spectrum information has been obtained via the Bark-domain transform, feature restoration can be performed on it to obtain the restored high-frequency signal.
  • The encoding end may use a compression transform to perform feature extraction on the to-be-encoded high-frequency signal.
  • Correspondingly, the decoding end can use the matching decompression transform to perform feature restoration on the high-frequency feature information. For example, if the Modified Discrete Cosine Transform (MDCT) is used for feature extraction at the encoding end, the Inverse Modified Discrete Cosine Transform (IMDCT) can be used for feature restoration at the decoding end.
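  • A matching IMDCT sketch, the inverse of the mdct sketch above, is shown below; note that exact reconstruction additionally requires windowing and 50% overlap-add across consecutive frames so that the time-domain aliasing cancels:

    import numpy as np

    def imdct(coeffs):
        """IMDCT of N coefficients -> 2N time samples (alias-containing until
        windowed 50% overlap-add is applied across frames)."""
        n = coeffs.shape[0]
        k = np.arange(n)[None, :]
        t = np.arange(2 * n)[:, None]
        basis = np.cos(np.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
        return (basis @ coeffs) / n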
  • In step S950, sub-band synthesis is performed on the restored low-frequency signal and the restored high-frequency signal to obtain the restored audio of the audio coded data.
  • Sub-band synthesis at the decoding end is the inverse of sub-band decomposition at the encoding end; it merges signals from multiple frequency bands back into one complete band.
  • A quadrature mirror filter bank composed of a low-pass filter corresponding to the low frequency band and a high-pass filter corresponding to the high frequency band can be obtained, and the restored low-frequency signal and restored high-frequency signal are then sub-band synthesized through this filter bank to obtain the restored audio of the audio coded data.
  • A quadrature mirror filter (QMF) bank is formed by combining two or more filters through a shared input interface or a shared output interface.
  • In the embodiments of this application, a low-pass filter corresponding to the low band and a high-pass filter corresponding to the high band can form a quadrature mirror filter bank by sharing an output interface; the restored low-frequency signal is input to the low-pass filter and the restored high-frequency signal to the high-pass filter.
  • After sub-band synthesis, the restored audio over the complete frequency band output by the quadrature mirror filter bank is obtained.
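  • A minimal two-band QMF synthesis sketch follows; the prototype low-pass filter h is an assumption (alias cancellation holds for this mirrored structure, while near-perfect reconstruction depends on the prototype design), and sign and scaling conventions vary between implementations:

    import numpy as np

    def qmf_synthesis(low, high, h):
        """Two-band QMF synthesis: upsample each half-rate band by 2, filter
        with the mirrored pair, and sum into a full-band signal."""
        g = h * (-1.0) ** np.arange(h.shape[0])   # high-pass mirror of prototype
        up_lo = np.zeros(2 * low.shape[0]);  up_lo[::2] = low
        up_hi = np.zeros(2 * high.shape[0]); up_hi[::2] = high
        return 2.0 * (np.convolve(up_lo, h) - np.convolve(up_hi, g))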
  • FIG. 12 schematically shows a flowchart of a method for decoding an input code stream in an application scenario according to an embodiment of the present application.
  • The method by which the decoding end performs audio decoding on the input code stream may mainly include the following steps S1201 to S1207.
  • Step S1201: Parse the encapsulation of the received input code stream to obtain the low-frequency speech coding parameters, high-frequency gain compensation parameters, and signal classification parameters corresponding to each data frame.
  • The signal classification parameter indicates which high-frequency prediction neural network the current data frame uses.
  • Step S1202: Decode the low-frequency speech coding parameters obtained in step S1201 with a decoder corresponding to the encoding end to obtain the low-frequency signal.
  • Step S1203: Apply the MDCT to the low-frequency signal to obtain low-frequency spectrum information.
  • Step S1204: Input the low-frequency spectrum information from step S1203 into the high-frequency prediction neural network selected according to the signal classification parameters from step S1201; the network outputs the predicted high-frequency linear spectrum information.
  • Step S1205: Convert the high-frequency linear spectrum information from step S1204 to the Bark domain, adjust the Bark subband spectrum energy using the high-frequency gain compensation parameters parsed in step S1201, and after the adjustment convert back from the Bark domain to the linear domain to obtain the high-frequency spectrum information.
  • Step S1206: Apply the IMDCT to the high-frequency spectrum information from step S1205 to obtain the reconstructed high-frequency signal.
  • Step S1207: Synthesize the low-frequency signal from step S1202 and the high-frequency signal from step S1206 into a full-band decoded signal through a QMF synthesis filter and output it.
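  • Tying these steps together, a per-frame decoding sketch might look as follows; everything except mdct, imdct, gain_compensate, and qmf_synthesis (the sketches above) is passed in as an assumed stand-in rather than an API defined by this application:

    def decode_frame(lf_params, gain_idx, audio_class,
                     decode_low, nets, h_proto, edges, table):
        """Sketch of steps S1202-S1207 for one parsed frame."""
        low = decode_low(lf_params)                                  # S1202
        lf_spec = mdct(low)                                          # S1203
        hf_pred = nets[audio_class](lf_spec)                         # S1204
        hf_spec = gain_compensate(hf_pred, gain_idx, edges, table)   # S1205
        high = imdct(hf_spec)                                        # S1206
        return qmf_synthesis(low, high, h_proto)                     # S1207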
  • The audio codec method provided by the embodiments of the present application improves the prediction of high-frequency signals by applying neural network prediction to audio sub-band coding, thereby further compressing the high-frequency coding rate.
  • In addition, the embodiments of the present application can classify the input signal and use a different neural network for each category. The technical solution provided by this application is therefore suitable not only for signals with a harmonic structure but also for other types of signals, and it achieves a good high-frequency prediction fit for different input signals.
  • Fig. 13 schematically shows a structural block diagram of an audio encoding device provided in some embodiments of the present application.
  • The audio encoding device 1300 may mainly include: an audio decomposition module 1310, a low-frequency encoding module 1320, a high-frequency prediction module 1330, a high-frequency compensation module 1340, and an encoding packaging module 1350.
  • The audio decomposition module 1310 is used to perform sub-band decomposition on the audio to be encoded, obtaining the to-be-encoded low-frequency signal corresponding to the low frequency band and the to-be-encoded high-frequency signal corresponding to the high frequency band.
  • The low-frequency encoding module 1320 is used to compression-encode the to-be-encoded low-frequency signal to obtain its low-frequency encoded data.
  • The high-frequency prediction module 1330 is used to determine high-frequency prediction information from the to-be-encoded low-frequency signal, based on the correlation between low-frequency and high-frequency signals.
  • The high-frequency compensation module 1340 is used to perform feature extraction on the to-be-encoded high-frequency signal to obtain its high-frequency feature information, and to determine the high-frequency compensation information of the to-be-encoded high-frequency signal according to the difference between the high-frequency feature information and the high-frequency prediction information.
  • The encoding packaging module 1350 is used to encapsulate the low-frequency encoded data and the high-frequency compensation information to obtain the audio encoded data of the audio to be encoded.
  • The high-frequency prediction module 1330 includes: an audio classification unit, configured to classify the audio to be encoded to obtain its audio category information; and an encoding-end network acquisition unit, configured to determine the high-frequency prediction neural network corresponding to the audio category information.
  • The high-frequency prediction neural network is trained based on the correlation between low-frequency and high-frequency signals. An encoding-end network mapping unit is configured to map the to-be-encoded low-frequency signal through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  • The audio classification unit includes: a sample category labeling subunit, configured to obtain audio data samples and label them frame by frame, yielding an audio category identifier for each data frame of the audio data samples.
  • A sample feature extraction subunit is configured to extract features from the audio data samples over multiple feature dimensions, obtaining multi-dimensional sample features of the audio data samples.
  • The classification network training subunit is configured to take the multi-dimensional sample features as input values and the audio category identifiers as the target values corresponding to those input values, and to train the audio classification neural network used to classify audio data.
  • The classification network processing subunit is configured to classify the audio to be encoded through the audio classification neural network, obtaining the audio category information of the audio to be encoded.
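  • As a rough illustration of such a classifier, here is a hedged PyTorch sketch; the feature dimension, the layer sizes, and the four-class set (voiced speech, unvoiced speech, non-speech, music) are assumptions made for the example:

    import torch
    from torch import nn

    class AudioClassifier(nn.Module):
        """Small MLP mapping per-frame multi-dimensional features to a class."""
        def __init__(self, feat_dim=16, n_classes=4):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_classes))

        def forward(self, x):
            return self.net(x)

    # One supervised step on stand-in data (features -> frame class labels):
    model, loss_fn = AudioClassifier(), nn.CrossEntropyLoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    feats, labels = torch.randn(32, 16), torch.randint(0, 4, (32,))
    opt.zero_grad(); loss_fn(model(feats), labels).backward(); opt.step()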
  • In some embodiments, the encoding-end network acquisition unit includes: a first sample transformation subunit, configured to obtain audio data samples corresponding to the audio category information and compression-transform them to obtain spectral feature samples of the audio data samples; a first frequency band division subunit, configured to divide the spectral feature samples by frequency-point value into low-frequency feature samples and high-frequency feature samples; and a first network acquisition subunit, configured to train the high-frequency prediction neural network with the low-frequency feature samples as input values and the high-frequency feature samples as the corresponding target values.
  • In other embodiments, the encoding-end network acquisition unit includes: a second frequency band division subunit, configured to obtain audio data samples corresponding to the audio category information and decompose them by frequency band into low-frequency data samples and high-frequency data samples;
  • a second sample transformation subunit, configured to compression-transform the low-frequency data samples and the high-frequency data samples separately, obtaining the corresponding low-frequency feature samples and high-frequency feature samples;
  • and a second network acquisition subunit, configured to train the high-frequency prediction neural network with the low-frequency feature samples as input values and the high-frequency feature samples as the target values corresponding to those input values.
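  • A corresponding training sketch for the high-frequency prediction network is given below; the regression from low-frequency feature samples to high-frequency feature samples uses an MSE objective, with the sizes, architecture, and stand-in data being assumptions of this example:

    import torch
    from torch import nn

    hf_net = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                           nn.Linear(512, 256))  # low-band feats -> high-band feats
    opt = torch.optim.Adam(hf_net.parameters(), lr=1e-3)
    mse = nn.MSELoss()
    low_feats, high_feats = torch.randn(32, 256), torch.randn(32, 256)  # stand-ins
    opt.zero_grad()
    loss = mse(hf_net(low_feats), high_feats)  # high-band features as target values
    loss.backward(); opt.step()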
  • The encoding-end network mapping unit includes: an encoding-end low-frequency decoding subunit, configured to decode the low-frequency encoded data to obtain the low-frequency decoded signal corresponding to the to-be-encoded low-frequency signal; and an encoding-end low-frequency mapping subunit, configured to map the low-frequency decoded signal through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  • The encoding-end low-frequency mapping subunit includes: an encoding-end compression transformation subunit, configured to compression-transform the low-frequency decoded signal to obtain its low-frequency spectral features;
  • and an encoding-end feature mapping subunit, configured to map the low-frequency spectral features through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  • The audio decomposition module 1310 includes: a filter acquisition unit, configured to obtain a quadrature mirror filter bank composed of a low-pass filter corresponding to the low frequency band and a high-pass filter corresponding to the high frequency band; and a sub-band decomposition unit, configured to perform sub-band decomposition of the audio to be encoded through the quadrature mirror filter bank, obtaining the to-be-encoded low-frequency signal corresponding to the low band and the to-be-encoded high-frequency signal corresponding to the high band.
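  • The analysis counterpart of the synthesis bank sketched earlier can be written as follows (again with the prototype filter h assumed rather than specified by the patent):

    import numpy as np

    def qmf_analysis(audio, h):
        """Two-band QMF analysis: filter with the mirrored low/high pair,
        then downsample each branch by 2."""
        g = h * (-1.0) ** np.arange(h.shape[0])
        return np.convolve(audio, h)[::2], np.convolve(audio, g)[::2]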
  • The high-frequency compensation module 1340 includes: a high-frequency compression transformation unit, configured to compression-transform the to-be-encoded high-frequency signal to obtain its high-frequency feature information.
  • The high-frequency compensation module 1340 further includes: a feature spectrum conversion unit, configured to map the high-frequency feature information from the linear frequency domain to the critical band domain, obtaining the feature spectrum information corresponding to the high-frequency feature information; a prediction spectrum conversion unit, configured to map the high-frequency prediction information from the linear frequency domain to the critical band domain, obtaining the predicted spectrum information corresponding to the high-frequency prediction information; and a compensation information determination unit, configured to determine the high-frequency compensation information of the to-be-encoded high-frequency signal according to the difference between the feature spectrum information and the predicted spectrum information.
  • The compensation information determination unit includes: a first logarithmic transformation subunit, configured to log-transform the feature spectrum information and the predicted spectrum information separately, obtaining a feature spectrum log value and a predicted spectrum log value; and a gain quantization subunit, configured to query the gain code table with the difference between the feature spectrum log value and the predicted spectrum log value to obtain a gain quantization value, which is determined as the high-frequency compensation information of the to-be-encoded high-frequency signal.
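  • The gain quantization can be implemented as a scan of a monotonically increasing code table, returning index i when Table[i] <= ΔE < Table[i+1]; a minimal sketch follows (defaulting to index 0 when ΔE falls outside the table is an assumption of this example):

    def quantize_gain(delta_e, table):
        """Return i with table[i] <= delta_e < table[i+1] in an increasing table."""
        index = 0
        for i in range(len(table) - 1):
            if table[i] <= delta_e < table[i + 1]:
                index = i
                break
        return index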
  • The encoding packaging module 1350 includes: an encoding packaging unit, configured to encapsulate the audio category information, the low-frequency encoded data, and the high-frequency compensation information, obtaining the audio encoded data of the audio to be encoded.
  • Fig. 14 schematically shows a structural block diagram of an audio decoding device provided in some embodiments of the present application.
  • The audio decoding device 1400 may mainly include: a package parsing module 1410, a low-frequency decoding module 1420, a high-frequency prediction module 1430, a high-frequency restoration module 1440, and an audio synthesis module 1450.
  • The package parsing module 1410 is used to parse the encapsulation of the audio coded data to be decoded, obtaining the low-frequency coded data and high-frequency compensation information in the audio coded data.
  • The low-frequency decoding module 1420 is used to decode the low-frequency coded data to obtain the restored low-frequency signal.
  • The high-frequency prediction module 1430 is used to determine high-frequency prediction information from the restored low-frequency signal, based on the correlation between low-frequency and high-frequency signals.
  • The high-frequency restoration module 1440 is configured to perform gain compensation on the high-frequency prediction information according to the high-frequency compensation information, obtaining high-frequency feature information, and to perform feature restoration on the high-frequency feature information, obtaining the restored high-frequency signal.
  • The audio synthesis module 1450 is used to perform sub-band synthesis of the restored low-frequency signal and the restored high-frequency signal, obtaining the original audio of the audio coded data.
  • The high-frequency prediction module 1430 includes: a category acquisition unit, configured to parse the encapsulation of the audio coded data to obtain the audio category information in it; and a decoding-end network acquisition unit, configured to determine the high-frequency prediction neural network corresponding to the audio category information.
  • The high-frequency prediction neural network is trained based on the correlation between low-frequency and high-frequency signals. A decoding-end network mapping unit is configured to map the restored low-frequency signal through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  • The decoding-end network acquisition unit includes: a first sample transformation subunit, configured to obtain audio data samples corresponding to the audio category information and compression-transform them to obtain spectral feature samples of the audio data samples; a first frequency band division subunit, configured to divide the spectral feature samples by frequency-point value into low-frequency feature samples and high-frequency feature samples; and a first network acquisition subunit, configured to train the high-frequency prediction neural network with the low-frequency feature samples as input values and the high-frequency feature samples as the corresponding target values.
  • Alternatively, the decoding-end network acquisition unit includes: a second frequency band division subunit, configured to obtain audio data samples corresponding to the audio category information and decompose them by frequency band into low-frequency data samples and high-frequency data samples;
  • a second sample transformation subunit, configured to compression-transform the low-frequency data samples and the high-frequency data samples separately, obtaining the corresponding low-frequency feature samples and high-frequency feature samples;
  • and a second network acquisition subunit, configured to train the high-frequency prediction neural network with the low-frequency feature samples as input values and the high-frequency feature samples as the target values corresponding to those input values.
  • The decoding-end network mapping unit includes: a decoding-end compression transformation subunit, configured to compression-transform the restored low-frequency signal to obtain its low-frequency spectral features;
  • and a decoding-end feature mapping subunit, configured to map the low-frequency spectral features through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  • The high-frequency restoration module 1440 includes: a spectrum information prediction unit, configured to map the high-frequency prediction information from the linear frequency domain to the critical band domain, obtaining the predicted spectrum information corresponding to it; a spectrum information compensation unit, configured to perform gain compensation on the predicted spectrum information according to the high-frequency compensation information, obtaining the feature spectrum information; and a feature information determination unit, configured to map the feature spectrum information from the critical band domain to the linear frequency domain, obtaining the high-frequency feature information corresponding to the feature spectrum information.
  • The spectrum information compensation unit includes: a second logarithmic transformation subunit, configured to log-transform the predicted spectrum information to obtain the predicted spectrum log value; a log value compensation subunit, configured to gain-compensate the predicted spectrum log value according to the high-frequency compensation information, obtaining the feature spectrum log value; and an exponential restoration subunit, configured to exponentially restore the feature spectrum log value, obtaining the feature spectrum information.
  • The high-frequency restoration module further includes: a feature information decompression unit, configured to decompression-transform the high-frequency feature information to obtain the restored high-frequency signal.
  • The audio synthesis module 1450 includes: a filter acquisition unit, configured to obtain a quadrature mirror filter bank composed of a low-pass filter corresponding to the low frequency band and a high-pass filter corresponding to the high frequency band; and a sub-band synthesis unit, configured to perform sub-band synthesis of the restored low-frequency signal and the restored high-frequency signal through the quadrature mirror filter bank, obtaining the restored audio of the audio coded data.
  • The package parsing module 1410 includes: a code stream search unit, configured to search the audio coded data to be decoded for the code stream unit separation information; a code stream separation unit, configured to separate the code stream unit to be decoded from the audio coded data according to the separation information found; and a code stream parsing unit, configured to perform field parsing on the code stream unit, obtaining the low-frequency coded data and high-frequency compensation information encapsulated in the code stream unit.
  • FIG. 15 schematically shows a structural block diagram of a computer system used to implement an electronic device of an embodiment of the present application.
  • The computer system 1500 includes a central processing unit (CPU) 1501, which can execute various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1502 or a program loaded from a storage part 1508 into a random access memory (RAM) 1503. The RAM 1503 also stores the various programs and data required for system operation.
  • The CPU 1501, the ROM 1502, and the RAM 1503 are connected to each other through a bus 1504.
  • An input/output (I/O) interface 1505 is also connected to the bus 1504.
  • The following components are connected to the I/O interface 1505: an input part 1506 including a keyboard, a mouse, and the like; an output part 1507 including a cathode ray tube (CRT) or liquid crystal display (LCD), speakers, and the like; a storage part 1508 including a hard disk and the like; and a communication part 1509 including a network interface card such as a LAN (Local Area Network) card or a modem.
  • The communication part 1509 performs communication processing via a network such as the Internet.
  • A drive 1510 is also connected to the I/O interface 1505 as needed.
  • A removable medium 1511, such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, is mounted on the drive 1510 as needed, so that a computer program read from it can be installed into the storage part 1508 as required.
  • In particular, the processes described in the various method flowcharts can be implemented as computer software programs.
  • The embodiments of the present application include a computer program product, which comprises a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the flowcharts.
  • The computer program may be downloaded and installed from a network through the communication part 1509, and/or installed from the removable medium 1511. When the computer program is executed by the central processing unit (CPU) 1501, the various functions defined in the system of the present application are performed.
  • The computer-readable medium shown in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, where the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code carried in it.
  • Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above.
  • The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a medium can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
  • The program code contained on the computer-readable medium can be transmitted over any suitable medium, including but not limited to wireless or wired channels, or any suitable combination of the above.

Abstract

This application belongs to the technical field of audio coding and decoding, and specifically relates to an audio coding/decoding method, apparatus, medium, and electronic device. The audio encoding method includes: performing sub-band decomposition on audio to be encoded to obtain a to-be-encoded low-frequency signal corresponding to a low frequency band and a to-be-encoded high-frequency signal corresponding to a high frequency band; performing compression encoding on the to-be-encoded low-frequency signal to obtain its low-frequency encoded data; determining high-frequency prediction information according to the to-be-encoded low-frequency signal, based on the correlation between low-frequency and high-frequency signals; performing feature extraction on the to-be-encoded high-frequency signal to obtain its high-frequency feature information, and determining high-frequency compensation information of the to-be-encoded high-frequency signal according to the difference between the high-frequency feature information and the high-frequency prediction information; and encapsulating the low-frequency encoded data and the high-frequency compensation information to obtain the audio encoded data of the audio to be encoded. The method can compress the coding rate of the high-frequency signal while preserving its accuracy.


Claims (27)

  1. An audio encoding method, performed by an electronic device, comprising:
    performing subband decomposition on audio to be encoded to obtain a to-be-encoded low-frequency signal corresponding to a low-frequency band and a to-be-encoded high-frequency signal corresponding to a high-frequency band;
    performing compression encoding on the to-be-encoded low-frequency signal to obtain low-frequency coded data of the to-be-encoded low-frequency signal;
    determining high-frequency prediction information according to the to-be-encoded low-frequency signal based on a correlation between low-frequency signals and high-frequency signals;
    performing feature extraction on the to-be-encoded high-frequency signal to obtain high-frequency feature information of the to-be-encoded high-frequency signal, and determining high-frequency compensation information of the to-be-encoded high-frequency signal according to a difference between the high-frequency feature information and the high-frequency prediction information; and
    encapsulating the low-frequency coded data and the high-frequency compensation information to obtain audio coded data of the audio to be encoded.
  2. The audio encoding method according to claim 1, wherein determining the high-frequency prediction information according to the to-be-encoded low-frequency signal based on the correlation between low-frequency signals and high-frequency signals comprises:
    classifying the audio to be encoded to obtain audio category information of the audio to be encoded;
    determining a high-frequency prediction neural network corresponding to the audio category information, the high-frequency prediction neural network being trained based on the correlation between low-frequency signals and high-frequency signals; and
    mapping the to-be-encoded low-frequency signal through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  3. The audio encoding method according to claim 2, wherein classifying the audio to be encoded to obtain the audio category information of the audio to be encoded comprises:
    obtaining audio data samples, and annotating the audio data samples frame by frame to obtain an audio category identifier for each data frame in the audio data samples;
    performing feature extraction on the audio data samples across multiple feature dimensions to obtain multi-dimensional sample features of the audio data samples;
    training, with the multi-dimensional sample features as input values and the audio category identifiers as target values corresponding to the input values, an audio classification neural network for classifying audio data; and
    classifying the audio to be encoded through the audio classification neural network to obtain the audio category information of the audio to be encoded.
  4. The audio encoding method according to claim 2, wherein the high-frequency prediction neural network is trained as follows:
    obtaining audio data samples corresponding to the audio category information, and performing a compression transform on the audio data samples to obtain spectral feature samples of the audio data samples;
    dividing the spectral feature samples according to frequency-bin values to obtain low-frequency feature samples and high-frequency feature samples; and
    training the high-frequency prediction neural network with the low-frequency feature samples as input values and the high-frequency feature samples as target values corresponding to the input values.
  5. The audio encoding method according to claim 2, wherein the high-frequency prediction neural network is trained as follows:
    obtaining audio data samples corresponding to the audio category information, and decomposing the audio data samples into low-frequency data samples and high-frequency data samples according to their frequency bands;
    performing compression transforms on the low-frequency data samples and the high-frequency data samples respectively to obtain corresponding low-frequency feature samples and high-frequency feature samples; and
    training the high-frequency prediction neural network with the low-frequency feature samples as input values and the high-frequency feature samples as target values corresponding to the input values.
  6. The audio encoding method according to claim 2, wherein mapping the to-be-encoded low-frequency signal through the high-frequency prediction neural network to obtain the high-frequency prediction information comprises:
    decoding the low-frequency coded data to obtain a low-frequency decoded signal corresponding to the to-be-encoded low-frequency signal; and
    mapping the low-frequency decoded signal through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  7. The audio encoding method according to claim 6, wherein mapping the low-frequency decoded signal through the high-frequency prediction neural network to obtain the high-frequency prediction information comprises:
    performing a compression transform on the low-frequency decoded signal to obtain low-frequency spectral features of the low-frequency decoded signal; and
    mapping the low-frequency spectral features through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  8. The audio encoding method according to claim 1, wherein performing subband decomposition on the audio to be encoded to obtain the to-be-encoded low-frequency signal corresponding to the low-frequency band and the to-be-encoded high-frequency signal corresponding to the high-frequency band comprises:
    obtaining a quadrature mirror filter bank composed of a low-pass filter corresponding to the low-frequency band and a high-pass filter corresponding to the high-frequency band; and
    performing subband decomposition on the audio to be encoded through the quadrature mirror filter bank to obtain the to-be-encoded low-frequency signal and the to-be-encoded high-frequency signal.
  9. The audio encoding method according to claim 1, wherein performing feature extraction on the to-be-encoded high-frequency signal to obtain the high-frequency feature information of the to-be-encoded high-frequency signal comprises:
    performing a compression transform on the to-be-encoded high-frequency signal to obtain the high-frequency feature information of the to-be-encoded high-frequency signal.
  10. The audio encoding method according to claim 1, wherein determining the high-frequency compensation information of the to-be-encoded high-frequency signal according to the difference between the high-frequency feature information and the high-frequency prediction information comprises:
    mapping the high-frequency feature information from the linear frequency domain to the critical band domain to obtain feature spectral information corresponding to the high-frequency feature information;
    mapping the high-frequency prediction information from the linear frequency domain to the critical band domain to obtain predicted spectral information corresponding to the high-frequency prediction information; and
    determining the high-frequency compensation information of the to-be-encoded high-frequency signal according to a difference between the feature spectral information and the predicted spectral information.
  11. The audio encoding method according to claim 10, wherein determining the high-frequency compensation information of the to-be-encoded high-frequency signal according to the difference between the feature spectral information and the predicted spectral information comprises:
    performing logarithmic transforms on the feature spectral information and the predicted spectral information respectively to obtain feature spectral log values and predicted spectral log values; and
    querying a gain codebook according to the difference between the feature spectral log values and the predicted spectral log values to obtain a gain quantization value, and determining the gain quantization value as the high-frequency compensation information of the to-be-encoded high-frequency signal.
  12. The audio encoding method according to claim 2, wherein encapsulating the low-frequency coded data and the high-frequency compensation information to obtain the audio coded data of the audio to be encoded comprises:
    encapsulating the audio category information, the low-frequency coded data, and the high-frequency compensation information to obtain the audio coded data of the audio to be encoded.
  13. An audio decoding method, performed by an electronic device, comprising:
    performing encapsulation parsing on audio coded data to be decoded to obtain low-frequency coded data and high-frequency compensation information in the audio coded data;
    decoding the low-frequency coded data to obtain a restored low-frequency signal;
    determining high-frequency prediction information according to the restored low-frequency signal based on a correlation between low-frequency signals and high-frequency signals;
    performing gain compensation on the high-frequency prediction information according to the high-frequency compensation information to obtain high-frequency feature information, and performing feature restoration on the high-frequency feature information to obtain a restored high-frequency signal; and
    performing subband synthesis on the restored low-frequency signal and the restored high-frequency signal to obtain restored audio of the audio coded data.
  14. The audio decoding method according to claim 13, wherein the audio coded data further comprises audio category information, and determining the high-frequency prediction information according to the restored low-frequency signal based on the correlation between low-frequency signals and high-frequency signals comprises:
    performing encapsulation parsing on the audio coded data to obtain the audio category information in the audio coded data;
    determining a high-frequency prediction neural network corresponding to the audio category information, the high-frequency prediction neural network being trained based on the correlation between low-frequency signals and high-frequency signals; and
    mapping the restored low-frequency signal through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  15. The audio decoding method according to claim 14, wherein the high-frequency prediction neural network is trained as follows:
    obtaining audio data samples corresponding to the audio category information, and performing a compression transform on the audio data samples to obtain spectral feature samples of the audio data samples;
    dividing the spectral feature samples according to frequency-bin values to obtain low-frequency feature samples and high-frequency feature samples; and
    training the high-frequency prediction neural network with the low-frequency feature samples as input values and the high-frequency feature samples as target values corresponding to the input values.
  16. The audio decoding method according to claim 14, wherein the high-frequency prediction neural network is trained as follows:
    obtaining audio data samples corresponding to the audio category information, and decomposing the audio data samples into low-frequency data samples and high-frequency data samples according to their frequency bands;
    performing compression transforms on the low-frequency data samples and the high-frequency data samples respectively to obtain corresponding low-frequency feature samples and high-frequency feature samples; and
    training the high-frequency prediction neural network with the low-frequency feature samples as input values and the high-frequency feature samples as target values corresponding to the input values.
  17. The audio decoding method according to claim 14, wherein mapping the restored low-frequency signal through the high-frequency prediction neural network to obtain the high-frequency prediction information comprises:
    performing a compression transform on the restored low-frequency signal to obtain low-frequency spectral features of the restored low-frequency signal; and
    mapping the low-frequency spectral features through the high-frequency prediction neural network to obtain the high-frequency prediction information.
  18. The audio decoding method according to claim 13, wherein performing gain compensation on the high-frequency prediction information according to the high-frequency compensation information to obtain the high-frequency feature information comprises:
    mapping the high-frequency prediction information from the linear frequency domain to the critical band domain to obtain predicted spectral information corresponding to the high-frequency prediction information, and performing gain compensation on the predicted spectral information according to the high-frequency compensation information to obtain feature spectral information; and
    mapping the feature spectral information from the critical band domain to the linear frequency domain to obtain high-frequency feature information corresponding to the feature spectral information.
  19. The audio decoding method according to claim 18, wherein performing gain compensation on the predicted spectral information according to the high-frequency compensation information to obtain the feature spectral information comprises:
    performing a logarithmic transform on the predicted spectral information to obtain predicted spectral log values;
    performing gain compensation on the predicted spectral log values according to the high-frequency compensation information to obtain feature spectral log values; and
    performing exponential restoration on the feature spectral log values to obtain the feature spectral information.
  20. The audio decoding method according to claim 13, wherein performing feature restoration on the high-frequency feature information to obtain the restored high-frequency signal comprises:
    performing a decompression transform on the high-frequency feature information to obtain the restored high-frequency signal.
  21. The audio decoding method according to claim 13, wherein performing subband synthesis on the restored low-frequency signal and the restored high-frequency signal to obtain the restored audio of the audio coded data comprises:
    obtaining a quadrature mirror filter bank composed of a low-pass filter corresponding to the low-frequency band and a high-pass filter corresponding to the high-frequency band; and
    performing subband synthesis on the restored low-frequency signal and the restored high-frequency signal through the quadrature mirror filter bank to obtain the restored audio of the audio coded data.
  22. The audio decoding method according to claim 13, wherein performing encapsulation parsing on the audio coded data to be decoded to obtain the low-frequency coded data and the high-frequency compensation information in the audio coded data comprises:
    searching the audio coded data to be decoded for bitstream-unit delimiter information;
    separating a bitstream unit to be decoded from the audio coded data according to the found bitstream-unit delimiter information; and
    performing field parsing on the bitstream unit to obtain the low-frequency coded data and the high-frequency compensation information encapsulated in the bitstream unit.
  23. An audio encoding apparatus, comprising:
    an audio decomposition module, configured to perform subband decomposition on audio to be encoded to obtain a to-be-encoded low-frequency signal corresponding to a low-frequency band and a to-be-encoded high-frequency signal corresponding to a high-frequency band;
    a low-frequency encoding module, configured to perform compression encoding on the to-be-encoded low-frequency signal to obtain low-frequency coded data of the to-be-encoded low-frequency signal;
    a high-frequency prediction module, configured to determine high-frequency prediction information according to the to-be-encoded low-frequency signal based on a correlation between low-frequency signals and high-frequency signals;
    a high-frequency compensation module, configured to perform feature extraction on the to-be-encoded high-frequency signal to obtain high-frequency feature information of the to-be-encoded high-frequency signal, and to determine high-frequency compensation information of the to-be-encoded high-frequency signal according to a difference between the high-frequency feature information and the high-frequency prediction information; and
    an encoding encapsulation module, configured to encapsulate the low-frequency coded data and the high-frequency compensation information to obtain audio coded data of the audio to be encoded.
  24. An audio decoding apparatus, comprising:
    an encapsulation parsing module, configured to perform encapsulation parsing on audio coded data to be decoded to obtain low-frequency coded data and high-frequency compensation information in the audio coded data;
    a low-frequency decoding module, configured to decode the low-frequency coded data to obtain a restored low-frequency signal;
    a high-frequency prediction module, configured to determine high-frequency prediction information according to the restored low-frequency signal based on a correlation between low-frequency signals and high-frequency signals;
    a high-frequency restoration module, configured to perform gain compensation on the high-frequency prediction information according to the high-frequency compensation information to obtain high-frequency feature information, and to perform feature restoration on the high-frequency feature information to obtain a restored high-frequency signal; and
    an audio synthesis module, configured to perform subband synthesis on the restored low-frequency signal and the restored high-frequency signal to obtain restored audio of the audio coded data.
  25. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 22.
  26. An electronic device, comprising:
    a processor; and
    a memory for storing executable instructions of the processor;
    wherein the processor is configured to perform the method according to any one of claims 1 to 22 by executing the executable instructions.
  27. A computer program product comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 22.
PCT/CN2021/095022 2020-06-24 2021-05-21 Audio coding and decoding method and apparatus, medium, and electronic device WO2021258940A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/740,304 US20220270623A1 (en) 2020-06-24 2022-05-09 Audio coding and decoding method and apparatus, medium, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010592469.4 2020-06-24
CN202010592469.4A CN112767954A (zh) 2020-06-24 2021-05-07 Audio coding and decoding method and apparatus, medium, and electronic device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/740,304 Continuation US20220270623A1 (en) 2020-06-24 2022-05-09 Audio coding and decoding method and apparatus, medium, and electronic device

Publications (1)

Publication Number Publication Date
WO2021258940A1 true WO2021258940A1 (zh) 2021-12-30

Family

ID=75693051

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095022 WO2021258940A1 (zh) 2020-06-24 2021-05-21 音频编解码方法、装置、介质及电子设备

Country Status (3)

Country Link
US (1) US20220270623A1 (zh)
CN (1) CN112767954A (zh)
WO (1) WO2021258940A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767954A (zh) 2020-06-24 2021-05-07 Tencent Technology (Shenzhen) Company Limited Audio coding and decoding method and apparatus, medium, and electronic device
CN115691521A (zh) 2021-07-29 2023-02-03 Huawei Technologies Co., Ltd. Audio signal encoding and decoding method and apparatus
CN114900779B (zh) 2022-04-12 2023-06-06 Dongguan Chenxin Electronic Technology Co., Ltd. Audio compensation method, system, and electronic device
CN114550732B (zh) 2022-04-15 2022-07-08 Tencent Technology (Shenzhen) Company Limited Encoding and decoding method and related apparatus for high-frequency audio signals
CN114582361B (zh) 2022-04-29 2022-07-08 Beijing Bairui Interconnection Technology Co., Ltd. High-resolution audio coding and decoding method and system based on generative adversarial networks
CN114999503A (zh) 2022-05-23 2022-09-02 Beijing Bairui Interconnection Technology Co., Ltd. Full-bandwidth spectral coefficient generation method and system based on a generative adversarial network
CN115116454A (zh) 2022-06-15 2022-09-27 Tencent Technology (Shenzhen) Company Limited Audio encoding method, apparatus, device, storage medium, and program product
CN115116455A (zh) 2022-06-15 2022-09-27 Tencent Technology (Shenzhen) Company Limited Audio processing method, apparatus, device, storage medium, and computer program product
CN115120247A (zh) 2022-07-19 2022-09-30 Tiangong University System for joint analysis of multiple physiological signals

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000060575A1 (en) * 1999-04-05 2000-10-12 Hughes Electronics Corporation A voicing measure as an estimate of signal periodicity for a frequency domain interpolative speech codec system
CN101436406A (zh) * 2008-12-22 2009-05-20 Xidian University Audio codec
CN103714822A (zh) * 2013-12-27 2014-04-09 Guangzhou Huaduo Network Technology Co., Ltd. Subband encoding and decoding method and apparatus based on the SILK codec
CN105070293A (zh) * 2015-08-31 2015-11-18 Wuhan University Audio bandwidth extension encoding and decoding method and apparatus based on deep neural networks
CN112767954A (zh) * 2020-06-24 2021-05-07 Tencent Technology (Shenzhen) Company Limited Audio coding and decoding method and apparatus, medium, and electronic device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4899359B2 (ja) * 2005-07-11 2012-03-21 Sony Corporation Signal encoding apparatus and method, signal decoding apparatus and method, and program and recording medium
JP5103880B2 (ja) * 2006-11-24 2012-12-19 Fujitsu Limited Decoding apparatus and decoding method
JP4967618B2 (ja) * 2006-11-24 2012-07-04 Fujitsu Limited Decoding apparatus and decoding method
KR101413969B1 (ko) * 2012-12-20 2014-07-08 Samsung Electronics Co., Ltd. Method and apparatus for decoding an audio signal
CN105976830B (zh) * 2013-01-11 2019-09-20 Huawei Technologies Co., Ltd. Audio signal encoding and decoding methods and apparatuses
CN106847297B (zh) * 2013-01-29 2020-07-07 Huawei Technologies Co., Ltd. High-band signal prediction method and encoding/decoding device

Also Published As

Publication number Publication date
US20220270623A1 (en) 2022-08-25
CN112767954A (zh) 2021-05-07

Similar Documents

Publication Publication Date Title
WO2021258940A1 (zh) Audio coding and decoding method and apparatus, medium, and electronic device
RU2437172C1 (ru) Method for encoding/decoding of codebook indices for a quantized MDCT spectrum in scalable speech and audio codecs
US8428957B2 (en) Spectral noise shaping in audio coding based on spectral dynamics in frequency sub-bands
RU2636685C2 (ru) Decision on presence/absence of voicing for speech processing
JP2021012398A (ja) Speech encoding apparatus and method
CN105723454B (zh) Energy-lossless encoding method and device, signal encoding method and device, energy-lossless decoding method and device, and signal decoding method and device
JP2009515212A (ja) Audio compression
JPH06118995A (ja) Wideband speech signal restoration method
JP2010537261A (ja) Temporal masking in audio coding based on the spectral dynamics of frequency subbands
WO2006049179A1 (ja) Vector transformation apparatus and vector transformation method
KR20080053131A (ko) Speech encoding apparatus and method
US20030088402A1 (en) Method and system for low bit rate speech coding with speech recognition features and pitch providing reconstruction of the spectral envelope
Zhen et al. Psychoacoustic calibration of loss functions for efficient end-to-end neural audio coding
CN115171709B (zh) Speech encoding and decoding methods and apparatuses, computer device, and storage medium
KR20160122160A (ko) Signal encoding method and apparatus, and signal decoding method and apparatus
CN100585700C (zh) Speech encoding apparatus and method
Jiang et al. Latent-domain predictive neural speech coding
CN111816197B (zh) Audio encoding method and apparatus, electronic device, and storage medium
Anees Speech coding techniques and challenges: A comprehensive literature survey
KR102052144B1 (ko) Method and apparatus for band-selective quantization of a speech signal
KR20150032220A (ko) Signal encoding method and apparatus, and signal decoding method and apparatus
US10950251B2 (en) Coding of harmonic signals in transform-based audio codecs
JP3348759B2 (ja) Transform encoding method and transform decoding method
KR20230003546A (ko) Maintaining invariance of sensory dissonance and sound localization cues in audio codecs
JPH09127987A (ja) Signal encoding method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21828634

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17.05.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21828634

Country of ref document: EP

Kind code of ref document: A1