WO2023241193A1 - Audio encoding method, apparatus, electronic device, storage medium and program product


Info

Publication number
WO2023241193A1
Authority
WO
WIPO (PCT)
Prior art keywords: level, signal, frequency, features, low
Application number
PCT/CN2023/088014
Other languages
English (en)
French (fr)
Inventor
康迂勇
王蒙
黄庆博
史裕鹏
肖玮
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2023241193A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04: ... using predictive techniques
    • G10L 19/16: Vocoder architecture
    • G10L 19/02: ... using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/032: Quantisation or dequantisation of spectral components
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: ... characterised by the type of extracted parameters
    • G10L 25/18: ... the extracted parameters being spectral information of each sub-band

Definitions

  • the present application relates to the field of audio processing technology, and in particular to an audio encoding method, audio decoding method, device, electronic equipment, storage medium and computer program product.
  • Audio codec technology is a core technology used in communication services, including remote audio and video calls. Audio coding technology can be understood as using less network bandwidth to transmit as much voice information as possible. Audio coding is a type of source coding: the purpose of source coding is to compress, on the encoding side, the information that the user wants to transmit as much as possible, reducing the amount of data and removing the redundancy in the information, while allowing it to be recovered losslessly (or nearly losslessly) at the decoding end.
  • However, in the related art, audio coding efficiency is low when audio coding quality must be ensured.
  • Embodiments of the present application provide an audio encoding method, device, electronic equipment, computer-readable storage medium, and computer program product, which can improve audio encoding efficiency and ensure audio encoding quality.
  • This embodiment of the present application provides an audio coding method, including:
  • performing first-level feature extraction on the audio signal to obtain first-level signal features;
  • for the i-th level among N levels, splicing the audio signal and the signal features of the (i-1)-th level to obtain splicing features, and performing i-th level feature extraction on the splicing features to obtain the signal features of the i-th level, where N and i are integers greater than 1 and i is less than or equal to N;
  • traversing i to obtain the signal features of each of the N levels, where the data dimension of the signal features is smaller than the data dimension of the audio signal;
  • encoding the first-level signal features and the signal features of each of the N levels separately to obtain the code stream of the audio signal at each level.
  • An embodiment of the present application also provides an audio decoding method, including: receiving code streams corresponding to multiple levels obtained by encoding an audio signal; decoding the code streams of each level separately to obtain the signal features of each level, where the data dimension of the signal features is smaller than the data dimension of the audio signal; performing feature reconstruction on the signal features of each level to obtain the hierarchical audio signal of each level; and performing audio synthesis on the hierarchical audio signals of the multiple levels to obtain the audio signal.
  • An embodiment of the present application also provides an audio coding device, including:
  • the first feature extraction module is configured to perform first-level feature extraction on the audio signal to obtain the first-level signal features
  • the second feature extraction module is configured to, for the i-th level among N levels, splice the audio signal and the signal features of the (i-1)-th level to obtain splicing features, and perform i-th level feature extraction on the splicing features to obtain the signal features of the i-th level, where N and i are integers greater than 1 and i is less than or equal to N;
  • a traversal module configured to traverse the i to obtain signal features of each level in the N levels, where the data dimension of the signal feature is smaller than the data dimension of the audio signal;
  • the encoding module is configured to separately encode the signal characteristics of the first level and the signal characteristics of each level in the N levels to obtain the code stream of the audio signal at each level.
  • An embodiment of the present application also provides an audio decoding device, including:
  • a receiving module configured to receive code streams corresponding to multiple levels obtained by encoding the audio signal
  • a decoding module configured to decode the code streams of each of the levels respectively to obtain signal features of each of the levels, where the data dimension of the signal feature is smaller than the data dimension of the audio signal;
  • a feature reconstruction module configured to perform feature reconstruction on the signal features of each of the levels, respectively, to obtain the hierarchical audio signal of each of the levels
  • an audio synthesis module configured to perform audio synthesis on the hierarchical audio signals of the multiple levels to obtain the audio signal.
  • An embodiment of the present application also provides an electronic device, including:
  • a memory configured to store executable instructions;
  • a processor configured to implement the method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
  • Embodiments of the present application also provide a computer-readable storage medium that stores executable instructions. When the executable instructions are executed by a processor, the method provided by the embodiments of the present application is implemented.
  • An embodiment of the present application also provides a computer program product, which includes a computer program or instructions; when the computer program or instructions are executed by a processor, the method provided by the embodiments of the present application is implemented.
  • In the embodiments of the present application, the signal features at each level are obtained by coding the audio signal in layers. Since the data dimension of the signal features at each level is smaller than the data dimension of the audio signal, the dimension of the data processed during audio encoding is reduced and the encoding efficiency of the audio signal is improved. When the signal features of the audio signal are extracted level by level, the output of each level serves as part of the input of the next level, so that each level combines the signal features extracted at the previous level to extract more accurate features of the audio signal, and increasing the number of levels minimizes the information loss of the audio signal during feature extraction.
  • As a result, the audio signal information contained in the multiple code streams obtained by encoding the signal features extracted in this way is closer to the original audio signal, which reduces the information loss of the audio signal during encoding and ensures the encoding quality of audio encoding.
  • Figure 1 is a schematic architectural diagram of an audio coding system 100 provided by an embodiment of the present application.
  • Figure 2 is a schematic structural diagram of an electronic device 500 for implementing an audio encoding method provided by an embodiment of the present application
  • Figure 3 is a schematic flow chart of an audio encoding method provided by an embodiment of the present application.
  • Figure 4 is a schematic flowchart of an audio encoding method provided by an embodiment of the present application.
  • Figure 5 is a schematic flowchart of an audio encoding method provided by an embodiment of the present application.
  • Figure 6 is a schematic flowchart of an audio encoding method provided by an embodiment of the present application.
  • Figure 7 is a schematic flowchart of an audio encoding method provided by an embodiment of the present application.
  • Figure 8 is a schematic flowchart of an audio encoding method provided by an embodiment of the present application.
  • Figure 9 is a schematic flowchart of an audio encoding method provided by an embodiment of the present application.
  • Figure 10 is a schematic flow chart of an audio decoding method provided by an embodiment of the present application.
  • Figure 11 is a schematic flow chart of an audio decoding method provided by an embodiment of the present application.
  • Figure 12 is a schematic diagram of spectrum comparison under different code rates provided by the embodiment of the present application.
  • Figure 13 is a schematic flow chart of audio encoding and audio decoding provided by the embodiment of the present application.
  • Figure 14 is a schematic diagram of a voice communication link provided by an embodiment of the present application.
  • Figure 15 is a schematic diagram of a filter bank provided by an embodiment of the present application.
  • Figure 16A is a schematic diagram of a common convolutional network provided by an embodiment of the present application.
  • Figure 16B is a schematic diagram of the dilated convolutional network provided by the embodiment of the present application.
  • Figure 17 is a schematic structural diagram of the first layer low-frequency analysis neural network model provided by the embodiment of the present application.
  • Figure 18 is a schematic structural diagram of the second layer low-frequency analysis neural network model provided by the embodiment of the present application.
  • Figure 19 is a schematic diagram of the first layer low-frequency synthetic neural network model provided by the embodiment of the present application.
  • Figure 20 is a schematic structural diagram of the second layer low-frequency synthetic neural network model provided by the embodiment of the present application.
  • The terms "first", "second", and "third" are only used to distinguish similar objects and do not represent a specific ordering of objects. It is understandable that, where permitted, the specific order or sequence of "first", "second", and "third" may be interchanged, so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described herein.
  • Client: an application running in the terminal to provide various services, such as an instant messaging client or an audio playback client.
  • Audio coding: an application of data compression to digital audio signals containing speech.
  • Quadrature Mirror Filters (QMF): a filter bank used to decompose a signal into multiple sub-band signals, thereby reducing the signal bandwidth; the decomposed sub-band signals are each filtered through their respective channels.
  • Quantization: the process of approximating the continuous values of a signal (or a large number of possible discrete values) with a finite number of (or fewer) discrete values, including vector quantization, scalar quantization, etc.
  • Vector quantization: several scalar data are combined into a vector, the vector space is divided into several small regions, and a representative vector is found for each region; during quantization, a vector that falls into a region is replaced by, that is, quantized as, the representative vector of that region.
  • Scalar quantization: the entire dynamic range is divided into several small intervals, each of which has a representative value; during quantization, a signal value that falls into an interval is replaced by, that is, quantized as, the corresponding representative value.
  • Entropy coding: coding that, following the entropy principle, loses no information during the coding process.
  • Information entropy: the average amount of information in the source.
  • Common entropy coding methods include Shannon coding, Huffman coding, and arithmetic coding.
  • Neural network: an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. Such a network relies on the complexity of the system and processes information by adjusting the interconnections among a large number of internal nodes.
  • Deep learning: a research direction in the field of machine learning (ML, Machine Learning). Deep learning learns the inherent laws and representation levels of sample data, and the information obtained during learning greatly helps the interpretation of data such as text, images, and sounds. Its ultimate goal is to enable machines to have the same analytical learning capability as humans and to recognize data such as text, images, and sounds.
  • Embodiments of the present application provide an audio encoding method, an audio decoding method, a device, an electronic device, a computer-readable storage medium, and a computer program product, which can improve audio encoding efficiency and ensure audio encoding quality.
  • Figure 1 is a schematic architectural diagram of an audio coding system 100 provided by an embodiment of the present application.
  • The audio coding system 100 includes terminals (terminal 400-1 and terminal 400-2 are illustrated as examples), a network 300, and a server 200. The network 300 can be a wide area network or a local area network, or a combination of the two, and uses wireless or wired links to realize data transmission.
  • the terminal 400-1 is the sending end of the audio signal
  • the terminal 400-2 is the receiving end of the audio signal.
  • The terminal 400-1 is configured to: perform first-level feature extraction on the audio signal to obtain first-level signal features; for the i-th level among N levels, splice the audio signal and the (i-1)-th level signal features to obtain splicing features, and perform i-th level feature extraction on the splicing features to obtain the signal features of the i-th level, where N and i are integers greater than 1 and i is less than or equal to N; traverse i to obtain the signal features of each of the N levels, where the data dimension of the signal features is smaller than the data dimension of the audio signal; encode the first-level signal features and the signal features of each of the N levels separately to obtain the code stream of the audio signal at each level; and send the code streams of the audio signal at each level to the server 200.
  • Server 200 is configured to receive code streams corresponding to multiple levels obtained by encoding the audio signal by terminal 400-1; and to send code streams corresponding to multiple levels to terminal 400-2;
  • The terminal 400-2 is configured to receive the code streams, corresponding to multiple levels, obtained by encoding the audio signal and sent by the server 200; decode the code streams of each level separately to obtain the signal features of each level, where the data dimension of the signal features is smaller than the data dimension of the audio signal; perform feature reconstruction on the signal features of each level to obtain the hierarchical audio signal of each level; and perform audio synthesis on the hierarchical audio signals of the multiple levels to obtain the audio signal.
  • the audio encoding method provided by the embodiments of the present application can be implemented by various electronic devices.
  • it can be implemented by the terminal alone, by the server alone, or by the terminal and the server collaboratively.
  • the terminal alone executes the audio encoding method provided by the embodiments of the present application, or the terminal sends a coding request for the audio signal to the server, and the server executes the audio encoding method provided by the embodiments of the present application according to the received encoding request.
  • Embodiments of this application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, smart transportation, assisted driving, etc.
  • the electronic device that implements audio encoding provided by the embodiments of the present application may be various types of terminal devices or servers.
  • The server (for example, the server 200) may be an independent physical server, or a server cluster or distributed system composed of multiple physical servers.
  • the terminal (such as the terminal 400) can be a smartphone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device (such as a smart speaker), a smart home appliance (such as a smart TV), a smart watch, a vehicle-mounted terminal, etc., but is not limited to this.
  • the terminal and the server can be connected directly or indirectly through wired or wireless communication methods, and the embodiments of the present application do not limit this.
  • the audio encoding method provided by the embodiments of the present application can be implemented with the help of cloud technology.
  • Cloud technology refers to the unification of a series of resources such as hardware, software, and networks within a wide area network or a local area network to realize data processing.
  • Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, and the like based on the cloud computing business model. It can form a resource pool that is used on demand, which is flexible and convenient. Cloud computing technology will become an important support, since the background services of technical network systems require a large amount of computing and storage resources.
  • The above-mentioned server can be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), and big data and artificial intelligence platforms.
  • a terminal or server can implement the audio encoding method provided in the embodiments of the present application by running a computer program.
  • The computer program can be a native program or a software module in the operating system, or a native application (APP, Application); in short, the computer program can be any form of application, module, or plug-in.
  • In some embodiments, multiple servers can be composed into a blockchain, with each server being a node on the blockchain; information connections can exist between the nodes in the blockchain, and information can be transmitted through these connections.
  • data related to the audio coding method provided by the embodiments of the present application (such as the code stream of the audio signal at each level, the neural network model used for feature extraction) can be saved on the blockchain.
  • FIG. 2 is a schematic structural diagram of an electronic device 500 for implementing an audio encoding method provided by an embodiment of the present application.
  • The electronic device 500 that implements the audio encoding method provided by the embodiments of the present application includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530.
  • the various components in electronic device 500 are coupled together by bus system 540 .
  • bus system 540 is used to implement connection communication between these components.
  • In addition to a data bus, the bus system 540 also includes a power bus, a control bus, and a status signal bus; however, for the sake of clarity, the various buses are all labeled as bus system 540 in Figure 2.
  • The processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
  • Memory 550 may be removable, non-removable, or a combination thereof. Memory 550 optionally includes one or more storage devices physically located remotely from processor 510 .
  • The memory 550 includes volatile memory or non-volatile memory, and may include both volatile and non-volatile memory. The non-volatile memory can be read-only memory (ROM, Read Only Memory), and the volatile memory can be random access memory (RAM, Random Access Memory).
  • the memory 550 described in the embodiments of this application is intended to include any suitable type of memory.
  • the memory 550 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplarily described below.
  • The operating system 551 includes system programs configured to handle various basic system services and perform hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks;
  • The network communication module 552 is configured to reach other computing devices via one or more (wired or wireless) network interfaces 520; exemplary network interfaces 520 include Bluetooth, Wireless Compatibility Certification (WiFi), Universal Serial Bus (USB, Universal Serial Bus), etc.;
  • the audio encoding device provided by the embodiment of the present application can be implemented in software.
  • Figure 2 shows the audio encoding device 553 stored in the memory 550, which can be software in the form of a program, a plug-in, etc., including the following software modules: a first feature extraction module 5531, a second feature extraction module 5532, a traversal module 5533, and an encoding module 5534. These modules are logical, so they can be combined or further split in any way according to the functions implemented; the functions of each module are described below.
  • the audio encoding method provided by the embodiments of the present application can be implemented by various electronic devices. For example, it can be implemented by the terminal alone, by the server alone, or by the terminal and the server collaboratively. Taking terminal implementation as an example, see Figure 3.
  • Figure 3 is a schematic flowchart of an audio encoding method provided by an embodiment of the present application.
  • the audio encoding method provided by an embodiment of the present application includes:
  • Step 101 The terminal performs first-level feature extraction on the audio signal to obtain first-level signal features.
  • the audio signal may be a voice signal during a call (such as an Internet call, a phone call), a voice message (such as a voice message sent in an instant messaging client), played music, audio, etc.
  • the audio signal needs to be encoded during transmission, so that the transmitting end of the audio signal can transmit the encoded code stream, and the receiving end of the code stream can decode the received code stream to obtain the audio signal.
  • the encoding process of the audio signal is explained.
  • the audio signal is encoded using a layered encoding method.
  • the layered encoding method is implemented by encoding the audio signal at multiple levels. The encoding process of each level is described below.
  • the terminal can perform first-level feature extraction on the audio signal to obtain the signal features of the audio signal extracted through the first level, that is, the first-level signal features.
  • the audio signal includes a low-frequency subband signal and a high-frequency subband signal.
  • When processing the audio signal (such as feature extraction and encoding), the low-frequency subband signal and the high-frequency subband signal included in the audio signal can be processed separately. Based on this, refer to Figure 4.
  • Figure 4 is a schematic flow chart of the audio encoding method provided by the embodiment of the present application. Figure 4 shows that step 101 of Figure 3 can be implemented through steps 201 to 203: step 201, perform subband decomposition on the audio signal to obtain the low-frequency subband signal and the high-frequency subband signal of the audio signal; step 202, perform first-level feature extraction on the low-frequency subband signal to obtain first-level low-frequency signal features, and perform first-level feature extraction on the high-frequency subband signal to obtain first-level high-frequency signal features; step 203, use the low-frequency signal features and the high-frequency signal features as the first-level signal features.
  • In actual implementation, the terminal may first decompose the audio signal into subbands to obtain the low-frequency subband signal and the high-frequency subband signal of the audio signal, so that features are extracted from the low-frequency subband signal and the high-frequency subband signal respectively.
  • Figure 5 is a schematic flow chart of the audio encoding method provided by the embodiment of the present application.
  • Figure 5 shows that step 201 of Figure 4 can be implemented through steps 2011 to 2013: step 2011, sample the audio signal at a first sampling frequency to obtain a sampled signal; step 2012, perform low-pass filtering on the sampled signal to obtain a low-pass filtered signal, and down-sample the low-pass filtered signal to obtain a low-frequency subband signal at a second sampling frequency; step 2013, perform high-pass filtering on the sampled signal to obtain a high-pass filtered signal, and down-sample the high-pass filtered signal to obtain a high-frequency subband signal at the second sampling frequency.
  • the second sampling frequency is smaller than the first sampling frequency.
  • the audio signal may be sampled according to a first sampling frequency to obtain a sampling signal, and the first sampling frequency may be preset.
  • The audio signal is a continuous analog signal; through sampling, a discrete digital signal, that is, the sampled signal, is obtained. The sampled signal includes multiple sample points (i.e., sampled values) taken from the audio signal.
  • In step 2012, the sampled signal is low-pass filtered to obtain a low-pass filtered signal, and the low-pass filtered signal is down-sampled to obtain the low-frequency subband signal at the second sampling frequency.
  • the low-pass filtering and high-pass filtering can be implemented by QMF analysis filters.
  • In some embodiments, the second sampling frequency may be half of the first sampling frequency, so that low-frequency and high-frequency subband signals of the same bandwidth can be obtained.
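  • To make steps 2011 to 2013 concrete, the following is a minimal Python sketch of such a two-band analysis: the sampled signal is passed through a mirrored low-pass/high-pass filter pair and each branch is down-sampled by 2, so each subband ends up at half the first sampling frequency. The filter length and design here are illustrative assumptions, not the patent's actual QMF bank.

```python
import numpy as np
from scipy import signal

def analyze_subbands(sampled: np.ndarray):
    """Split a sampled signal into low- and high-frequency subbands (steps 2012-2013)."""
    h_low = signal.firwin(64, 0.5)                    # prototype low-pass, cutoff at half Nyquist
    h_high = h_low * (-1.0) ** np.arange(h_low.size)  # QMF mirror: matching high-pass filter
    low = signal.lfilter(h_low, 1.0, sampled)[::2]    # low-pass filter, then downsample by 2
    high = signal.lfilter(h_high, 1.0, sampled)[::2]  # high-pass filter, then downsample by 2
    return low, high

# Example: a signal sampled at a first sampling frequency of 32 kHz yields
# two subband signals at a second sampling frequency of 16 kHz.
fs = 32000
t = np.arange(fs) / fs
low_band, high_band = analyze_subbands(np.sin(2 * np.pi * 440 * t))
```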
  • In step 202, after the low-frequency subband signal and the high-frequency subband signal of the audio signal are obtained, first-level feature extraction is performed on the low-frequency subband signal to obtain first-level low-frequency signal features, and first-level feature extraction is performed on the high-frequency subband signal to obtain first-level high-frequency signal features. In step 203, the low-frequency signal features and the high-frequency signal features are used as the first-level signal features.
  • Figure 6 is a schematic flowchart of the audio encoding method provided by an embodiment of the present application.
  • Figure 6 shows that step 101 of Figure 3 can also be implemented through steps 301 to 304: step 301, perform a first convolution process on the audio signal to obtain first-level convolution features; step 302, perform a first pooling process on the convolution features to obtain first-level pooling features; step 303, perform a first downsampling on the pooling features to obtain first-level downsampled features; step 304, perform a second convolution process on the downsampled features to obtain the first-level signal features.
  • the audio signal may be subjected to a first convolution process.
  • the first convolution process can be performed by calling a causal convolution with a preset number of channels (such as 24 channels), thereby obtaining the first-level convolution features.
  • the first pooling process is performed on the convolution features obtained in step 301.
  • For the first pooling process, a pooling factor (such as 2) can be preset, and the first pooling process is then performed on the convolution features based on the pooling factor to obtain the first-level pooling features.
  • the pooled features obtained in step 302 are first down-sampled.
  • the downsampling factor can be preset, so that downsampling is performed based on the downsampling factor.
  • the first downsampling can be implemented through one coding layer or through multiple coding layers. In some embodiments, the first downsampling is achieved by M cascaded coding layers.
  • Figure 7, is a schematic flow chart of the audio encoding method provided by the embodiment of the present application.
  • Figure 7 shows that step 303 of Figure 6 can also be implemented through steps 3031 to 3033: step 3031, perform the first downsampling on the pooled features through the first coding layer among the M cascaded coding layers to obtain the downsampling result of the first coding layer; step 3032, through the j-th coding layer among the M cascaded coding layers, perform the first downsampling on the downsampling result of the (j-1)-th coding layer to obtain the downsampling result of the j-th coding layer, where M and j are integers greater than 1 and j is less than or equal to M; step 3033, traverse j to obtain the downsampling result of the M-th coding layer, and use the downsampling result of the M-th coding layer as the first-level downsampled features.
  • the downsampling factors of each coding layer may be the same or different.
  • the downsampling factor is equivalent to the pooling factor and plays the role of downsampling.
  • a second convolution process may be performed on the downsampled features.
  • the second convolution process can be performed by calling a causal convolution with a preset number of channels to obtain the first-level signal characteristics.
  • steps 301 to 304 shown in Figure 6 can be implemented by calling the first neural network model.
  • the first neural network model includes a first convolution layer, a pooling layer, a downsampling layer and a second convolution layer.
  • The first-level convolution features can be obtained by calling the first convolution layer to perform the first convolution process on the audio signal; the pooling layer is called to perform the first pooling process on the convolution features to obtain the first-level pooling features; the downsampling layer is called to perform the first downsampling on the pooling features to obtain the first-level downsampled features; and the second convolution layer is called to perform the second convolution process on the downsampled features to obtain the first-level signal features.
  • It should be noted that, through steps 301 to 304 shown in Figure 6, first-level feature extraction can also be performed on the low-frequency subband signal and the high-frequency subband signal of the audio signal respectively (i.e., step 202 shown in Figure 4). That is, perform the first convolution process on the low-frequency subband signal of the audio signal to obtain the first convolution features of the first level; perform the first pooling process on the first convolution features to obtain the first pooling features of the first level; perform the first downsampling on the first pooling features to obtain the first downsampled features of the first level; and perform the second convolution process on the first downsampled features to obtain the first-level low-frequency signal features. The first-level high-frequency signal features are obtained from the high-frequency subband signal in the same way.
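  • As a hedged illustration of steps 301 to 304 (and of the M cascaded coding layers of Figure 7), the sketch below chains a causal convolution, a pooling layer with factor 2, cascaded strided "coding layers" for the first downsampling, and a second causal convolution. The 24 channels follow the example in the text; the kernel sizes and downsampling factors are assumptions.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """1-D convolution padded only on the left, so no future samples are used."""
    def forward(self, x):
        x = nn.functional.pad(x, (self.kernel_size[0] - 1, 0))
        return super().forward(x)

class FirstLevelAnalysis(nn.Module):
    def __init__(self, channels=24, down_factors=(2, 4, 4)):
        super().__init__()
        self.conv_in = CausalConv1d(1, channels, kernel_size=3)   # step 301: first convolution
        self.pool = nn.AvgPool1d(kernel_size=2)                   # step 302: pooling factor 2
        self.down = nn.Sequential(*(                              # step 303: M cascaded coding layers
            nn.Conv1d(channels, channels, kernel_size=f, stride=f)
            for f in down_factors))
        self.conv_out = CausalConv1d(channels, channels, kernel_size=3)  # step 304: second convolution

    def forward(self, audio):                 # audio: (batch, 1, samples)
        x = self.pool(self.conv_in(audio))
        x = self.down(x)
        return self.conv_out(x)               # first-level signal features

# A 320-sample frame shrinks by the factor 2 * 2 * 4 * 4 = 64 to a feature map
# of length 5, illustrating the dimension reduction described above.
features = FirstLevelAnalysis()(torch.randn(1, 1, 320))
```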
  • Step 102: For the i-th level among the N levels, splice the audio signal and the signal features of the (i-1)-th level to obtain the splicing features, and perform i-th level feature extraction on the splicing features to obtain the signal features of the i-th level.
  • N and i are integers greater than 1, and i is less than or equal to N.
  • the remaining levels of feature extraction can also be performed on the audio signal.
  • the remaining levels include N levels.
  • For the i-th level among the N levels, the audio signal and the signal features of the (i-1)-th level are spliced to obtain the splicing features, and i-th level feature extraction is performed on the splicing features to obtain the signal features of the i-th level.
  • For example, for the second level, the audio signal and the first-level signal features are spliced to obtain splicing features, and second-level feature extraction is performed on the splicing features to obtain the second-level signal features; for the third level, the audio signal and the second-level signal features are spliced to obtain splicing features, and third-level feature extraction is performed on the splicing features to obtain the third-level signal features; for the fourth level, the audio signal and the third-level signal features are spliced to obtain splicing features, and fourth-level feature extraction is performed on the splicing features to obtain the fourth-level signal features; and so on.
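  • The per-level recursion of steps 101 to 103 can be summarized in a few lines. The sketch below assumes analysis networks of the kind outlined above; how the previous level's features are brought back to the audio length before splicing is an assumption, since the text only states that the two are spliced.

```python
import torch

def extract_all_levels(audio, first_net, residual_nets):
    """audio: (batch, 1, samples); returns one feature tensor per level."""
    features = [first_net(audio)]                  # first-level signal features
    for net in residual_nets:                      # i-th level, i = 2, 3, ...
        prev = features[-1]
        # Assumed alignment step: stretch the previous level's features to the
        # audio length so both can be spliced along the channel dimension.
        prev_up = torch.nn.functional.interpolate(prev, size=audio.shape[-1])
        spliced = torch.cat([audio, prev_up], dim=1)   # splicing features
        features.append(net(spliced))                  # i-th level signal features
    return features
```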
  • the audio signal includes a low-frequency subband signal and a high-frequency subband signal.
  • the audio signal can also be decomposed into sub-bands to obtain the low-frequency sub-band signal and the high-frequency sub-band signal of the audio signal.
  • the process of subband decomposition can be referred to the above steps 2011 to 2013.
  • Correspondingly, the data output by the i-th level feature extraction includes: the low-frequency signal features of the i-th level and the high-frequency signal features of the i-th level.
  • Figure 8 is a schematic flowchart of the audio encoding method provided by the embodiment of the present application.
  • Figure 8 shows that step 102 of Figure 3 can be implemented through steps 401 to 403: step 401, splice the low-frequency subband signal of the audio signal and the (i-1)-th level low-frequency signal features to obtain the first splicing features, and perform i-th level feature extraction on the first splicing features to obtain the i-th level low-frequency signal features; step 402, splice the high-frequency subband signal of the audio signal and the (i-1)-th level high-frequency signal features to obtain the second splicing features, and perform i-th level feature extraction on the second splicing features to obtain the i-th level high-frequency signal features; step 403, use the i-th level low-frequency signal features and the i-th level high-frequency signal features as the signal features of the i-th level.
  • In step 401, after the low-frequency subband signal and the high-frequency subband signal of the audio signal are obtained, the low-frequency subband signal of the audio signal and the low-frequency signal features extracted at the (i-1)-th level are spliced to obtain the first splicing features, and then i-th level feature extraction is performed on the first splicing features to obtain the i-th level low-frequency signal features. Similarly, in step 402, the high-frequency subband signal of the audio signal and the high-frequency signal features extracted at the (i-1)-th level are spliced to obtain the second splicing features, and then i-th level feature extraction is performed on the second splicing features to obtain the i-th level high-frequency signal features.
  • In step 403, the i-th level low-frequency signal features and the i-th level high-frequency signal features are used as the signal features of the i-th level.
  • Figure 9 is a schematic flow chart of the audio encoding method provided by the embodiment of the present application.
  • Figure 9 shows that step 102 of Figure 3 can also be implemented through steps 501 to 504: step 501, perform a third convolution process on the splicing features to obtain the i-th level convolution features; step 502, perform a second pooling process on the convolution features to obtain the i-th level pooling features; step 503, perform a second downsampling on the pooling features to obtain the i-th level downsampled features; step 504, perform a fourth convolution process on the downsampled features to obtain the signal features of the i-th level.
  • a third convolution process can be performed on the spliced features (obtained by splicing the audio signal and the signal features of the (i-1)th level).
  • the third convolution process can be processed by calling a causal convolution with a preset number of channels, thereby obtaining the i-th level convolution feature.
  • a second pooling process is performed on the convolution features obtained in step 501.
  • the second pooling process can pre-set the pooling factor, and then perform the second pooling process on the convolution features based on the pooling factor to obtain the i-th level pooling feature.
  • a second downsampling is performed on the pooled features obtained in step 502.
  • the downsampling factor can be preset, so that downsampling is performed based on the downsampling factor.
  • the second downsampling can be implemented through one coding layer or through multiple coding layers. In some embodiments, the second downsampling may be achieved by X cascaded coding layers.
  • In some embodiments, step 503 in Figure 9 can also be implemented through steps 5031 to 5033: step 5031, perform the second downsampling on the pooled features through the first coding layer among the X cascaded coding layers to obtain the downsampling result of the first coding layer; step 5032, through the g-th coding layer among the X cascaded coding layers, perform the second downsampling on the downsampling result of the (g-1)-th coding layer to obtain the downsampling result of the g-th coding layer, where X and g are integers greater than 1 and g is less than or equal to X; step 5033, traverse g to obtain the downsampling result of the X-th coding layer, and use the downsampling result of the X-th coding layer as the i-th level downsampled features.
  • the downsampling factors of each coding layer may be the same or different.
  • the downsampling factor is equivalent to the pooling factor and plays the role of downsampling.
  • a fourth convolution process may be performed on the downsampled features.
  • the fourth convolution process can be processed by calling a causal convolution with a preset number of channels, thereby obtaining the signal characteristics of the i-th level.
  • steps 501 to 504 shown in Figure 9 can be implemented by calling the second neural network model.
  • the second neural network model includes a third convolution layer, a pooling layer, a downsampling layer and a fourth convolution layer.
  • The i-th level convolution features can be obtained by calling the third convolution layer to perform the third convolution process on the splicing features; the pooling layer is called to perform the second pooling process on the convolution features to obtain the i-th level pooling features; the downsampling layer is called to perform the second downsampling on the pooling features to obtain the i-th level downsampled features; and the fourth convolution layer is called to perform the fourth convolution process on the downsampled features to obtain the signal features of the i-th level.
  • The feature dimension of the signal features output by the second neural network model may be smaller than the feature dimension of the signal features output by the first neural network model.
  • It should be noted that steps 501 to 504 shown in Figure 9 can also be used to perform i-th level feature extraction on the low-frequency subband signal and the high-frequency subband signal of the audio signal respectively. That is, for the i-th level, perform the third convolution process on the low-frequency splicing features (obtained by splicing the low-frequency subband signal and the (i-1)-th level low-frequency signal features) to obtain the i-th level convolution features; perform the second pooling process on the convolution features to obtain the i-th level pooling features; perform the second downsampling on the pooling features to obtain the i-th level downsampled features; and perform the fourth convolution process on the downsampled features to obtain the i-th level low-frequency signal features. Likewise, for the i-th level, perform the third convolution process on the high-frequency splicing features (obtained by splicing the high-frequency subband signal and the (i-1)-th level high-frequency signal features) to obtain the i-th level convolution features; perform the second pooling process on the convolution features to obtain the i-th level pooling features; perform the second downsampling on the pooling features to obtain the i-th level downsampled features; and perform the fourth convolution process on the downsampled features to obtain the i-th level high-frequency signal features.
  • Step 103 Traverse i to obtain the signal characteristics of each level in N levels.
  • the data dimension of the signal feature is smaller than the data dimension of the audio signal.
  • Step 102 illustrates the feature extraction process for the i-th level.
  • i needs to be traversed to obtain the signal features of each level in the N levels.
  • the data dimension of the signal characteristics output by each layer is smaller than the data dimension of the audio signal. In this way, the data dimension of the data involved in the audio encoding process can be reduced and the encoding efficiency of audio encoding can be improved.
  • Step 104 Encode the signal characteristics of the first level and the signal characteristics of each level in the N levels respectively to obtain the code stream of the audio signal at each level.
  • After the first-level signal features and the signal features of each of the N levels are obtained, they are encoded separately to obtain the code stream of the audio signal at each level.
  • the code stream can be transmitted to the receiving end of the audio signal, so that the receiving end serves as the decoding end to decode the audio signal.
  • The signal features output by the i-th level among the N levels can be understood as the residual signal features between the signal features output by the (i-1)-th level and the original audio signal.
  • In this way, the extracted signal features of the audio signal include not only the signal features of the audio signal extracted at the first level but also the residual signal features extracted at each of the N levels, so the extracted signal features are more comprehensive and accurate, reducing the information loss of the audio signal during feature extraction. The quality of the code streams obtained by encoding the first-level signal features and the signal features of each of the N levels is therefore higher, the audio signal information they contain is closer to the original audio signal, and the encoding quality of audio encoding is improved.
  • In some embodiments, step 104 shown in Figure 3 can be implemented through steps 104a1 to 104a2: step 104a1, perform quantization processing on the first-level signal features and the signal features of each of the N levels respectively to obtain the quantization results of the signal features at each level; step 104a2, perform entropy coding on the quantization results of the signal features at each level to obtain the code stream of the audio signal at each level.
  • a quantization table may be set in advance, and the quantization table includes the correspondence between signal characteristics and quantization values.
  • Based on this, corresponding quantization values can be queried from the preset quantization table for the first-level signal features and for the signal features of each of the N levels, and the quantization values obtained by the query are used as the quantization results.
  • entropy coding is performed on the quantized results of the signal characteristics at each level to obtain the code stream of the audio signal at each level.
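  • As a toy illustration of steps 104a1 to 104a2, the sketch below quantizes a feature vector against a preset quantization table by nearest-neighbor lookup, and stubs the entropy-coding stage with zlib (whose Huffman stage stands in for Shannon, Huffman, or arithmetic coding). The uniform 8-bit table is an assumption for illustration only.

```python
import numpy as np
import zlib

quant_table = np.linspace(-1.0, 1.0, 256).astype(np.float32)  # assumed codebook

def quantize(features: np.ndarray) -> np.ndarray:
    """Step 104a1: replace each value by the index of the nearest table entry."""
    return np.abs(features[:, None] - quant_table[None, :]).argmin(axis=1).astype(np.uint8)

def entropy_encode(indices: np.ndarray) -> bytes:
    """Step 104a2 (stub): compress the quantization results into a code stream."""
    return zlib.compress(indices.tobytes())

level_features = np.tanh(np.random.randn(64)).astype(np.float32)
code_stream = entropy_encode(quantize(level_features))
```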
  • the audio signal includes low-frequency sub-band signals and high-frequency sub-band signals.
  • the signal characteristics output by each level include low-frequency signal characteristics and high-frequency signal characteristics.
  • In some embodiments, step 104 shown in Figure 3 can also be implemented through steps 104b1 to 104b3: step 104b1, encode the first-level low-frequency signal features and the low-frequency signal features of each of the N levels separately to obtain the low-frequency code stream of the audio signal at each level; step 104b2, encode the first-level high-frequency signal features and the high-frequency signal features of each of the N levels separately to obtain the high-frequency code stream of the audio signal at each level; step 104b3, use the low-frequency code stream and the high-frequency code stream of the audio signal at each level as the code stream of the audio signal at the corresponding level.
  • The encoding of low-frequency signal features in step 104b1 can be implemented with steps similar to steps 104a1 to 104a2: the first-level low-frequency signal features and the low-frequency signal features of each of the N levels are quantized separately to obtain the quantization results of the low-frequency signal features at each level, and entropy coding is performed on these quantization results to obtain the low-frequency code stream of the audio signal at each level. Similarly, the encoding of high-frequency signal features in step 104b2 can be implemented with steps similar to steps 104a1 to 104a2: the first-level high-frequency signal features and the high-frequency signal features of each of the N levels are quantized separately to obtain the quantization results of the high-frequency signal features at each level, and entropy coding is performed on these quantization results to obtain the high-frequency code stream of the audio signal at each level.
  • the audio signal includes low-frequency sub-band signals and high-frequency sub-band signals.
  • the signal characteristics output by each level include low-frequency signal characteristics and high-frequency signal characteristics.
  • In some embodiments, step 104 shown in Figure 3 can also be implemented through steps 104c1 to 104c3: step 104c1, encode the first-level low-frequency signal features according to a first encoding bit rate to obtain the first code stream of the first level, and encode the first-level high-frequency signal features according to a second encoding bit rate to obtain the second code stream of the first level; step 104c2, perform the following processing on the signal features of each of the N levels: encode the signal features of the level according to the third encoding bit rate of the level to obtain the third code stream of each level; step 104c3, use the first code stream and the second code stream of the first level, and the third code stream of each of the N levels, as the code streams of the audio signal at each level.
  • The first encoding bit rate is greater than the second encoding bit rate, and the second encoding bit rate is greater than the third encoding bit rate of any of the N levels; the encoding bit rate of a level is positively correlated with the decoding quality index of the code stream of the corresponding level.
  • a corresponding third encoding rate can be set for each level among the N levels.
  • the third coding rate of each level in the N levels may be the same, may be partially the same and partially different, or may be completely different.
  • the coding rate of the level is positively correlated with the decoding quality index of the code stream of the corresponding level.
  • In actual implementation, the low-frequency signal features contain most of the features of the audio signal, so the largest bit rate (the first encoding bit rate) is used for the first-level low-frequency signal features to ensure the encoding effect of the audio signal. Meanwhile, the first-level high-frequency signal features are encoded at the second encoding bit rate, which is lower than the first encoding bit rate, and the signal features of each of the N levels are encoded at a third encoding bit rate lower than the second encoding bit rate. In this way, more features of the audio signal (including the high-frequency signal features and the residual signal features) are encoded, and at the same time the coding efficiency of the audio signal is improved by reasonably allocating the encoding bit rate of each level.
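  • A toy allocation consistent with the ordering just described (first encoding bit rate > second > third) might look as follows; the kbps values are placeholders, not rates taken from the patent.

```python
def allocate_bitrates(n_levels: int,
                      first_kbps: float = 6.0,   # first-level low-frequency features
                      second_kbps: float = 4.0,  # first-level high-frequency features
                      third_kbps: float = 2.0):  # each of the N levels
    rates = {"level1_low": first_kbps, "level1_high": second_kbps}
    for i in range(2, n_levels + 1):
        rates[f"level{i}"] = third_kbps          # third rates may also differ per level
    return rates

# allocate_bitrates(3) ->
# {'level1_low': 6.0, 'level1_high': 4.0, 'level2': 2.0, 'level3': 2.0}
```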
  • In some embodiments, after obtaining the code stream of the audio signal at each level, the terminal can also perform the following processing for each level: configure a corresponding level transmission priority for the code stream of the audio signal at the level, where the level transmission priority is negatively correlated with the level number of the level, and positively correlated with the decoding quality index of the code stream of the corresponding level.
  • the hierarchical transmission priority of this level is used to represent the transmission priority of the code stream of this level.
  • The level transmission priority is negatively correlated with the level number of the level; that is, the larger the level number, the lower the corresponding level transmission priority. For example, the level transmission priority of the first level (level number 1) is higher than the level transmission priority of the second level (level number 2).
  • the code streams of the corresponding levels can be transmitted according to the configured level transmission priority.
  • When the code streams of the audio signal at multiple levels are transmitted to the decoding end, the code streams of some levels or the code streams of all levels can be transmitted; when only the code streams of some levels are transmitted, the code streams of the corresponding levels are transmitted according to the configured level transmission priority.
  • the signal characteristics include low-frequency signal characteristics and high-frequency signal characteristics.
  • In some embodiments, the code streams of the audio signal at each level include: a low-frequency code stream encoded based on the low-frequency signal features, and a high-frequency code stream encoded based on the high-frequency signal features. After obtaining the code stream of the audio signal at each level, the terminal can also perform the following processing for each level: configure a first transmission priority for the low-frequency code stream of the level, and configure a second transmission priority for the high-frequency code stream of the level.
  • In actual implementation, for each level, a first transmission priority can be configured for the low-frequency code stream of the level, and a second transmission priority can be configured for the high-frequency code stream of the level, where the first transmission priority is higher than the second transmission priority.
  • In addition, the second transmission priority of the (i-1)-th level can be lower than the first transmission priority of the i-th level. That is to say, for each level, the transmission priority of the low-frequency code stream is higher than the transmission priority of the high-frequency code stream, which ensures that the low-frequency code stream of each level can be transmitted first; across the multiple levels, the transmission priority of the low-frequency code stream of the i-th level is higher than the transmission priority of the high-frequency code stream of the (i-1)-th level, which ensures that the low-frequency code streams of all the levels can be transmitted first.
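  • The two ordering rules above (low-frequency before high-frequency within a level, and every low-frequency code stream before any high-frequency code stream across levels) can be captured with a simple numbering, where a smaller number means a higher transmission priority. This is an illustration consistent with the described ordering, not the patent's exact scheme.

```python
def stream_priorities(num_levels: int) -> dict:
    """Smaller value = higher transmission priority."""
    priority = {}
    for level in range(1, num_levels + 1):
        priority[("low", level)] = level                # first transmission priority
        priority[("high", level)] = num_levels + level  # second transmission priority
    return priority

# stream_priorities(3): low streams of levels 1..3 get priorities 1..3 and high
# streams get 4..6, so the low-frequency stream of level i always outranks the
# high-frequency stream of level i-1.
```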
  • In this way, layered coding of audio signals is realized: first, first-level feature extraction is performed on the audio signal to obtain first-level signal features; then, for the i-th level (i is an integer greater than 1 and less than or equal to N) among N levels (N is an integer greater than 1), the audio signal and the signal features of the (i-1)-th level are spliced to obtain splicing features, and i-th level feature extraction is performed on the splicing features to obtain the signal features of the i-th level; then, by traversing i, the signal features of each of the N levels are obtained; finally, the first-level signal features and the signal features of each of the N levels are encoded separately to obtain the code stream of the audio signal at each level.
  • In this way, the signal features at each level are obtained by coding the audio signal in layers; since the data dimension of the signal features at each level is smaller than the data dimension of the audio signal, the dimension of the data processed during audio encoding is reduced and the encoding efficiency of the audio signal is improved.
  • When the signal features of the audio signal are extracted level by level, the output of each level is used as part of the input of the next level, so that each level combines the signal features extracted at the previous level to perform more accurate feature extraction of the audio signal, and increasing the number of levels minimizes the information loss of the audio signal during feature extraction.
  • The audio signal information contained in the multiple code streams obtained by encoding the signal features extracted in this way is closer to the original audio signal, which reduces the information loss of the audio signal during encoding and ensures the encoding quality of audio encoding.
  • the audio decoding method provided by the embodiment of the present application can be implemented by various electronic devices. For example, it can be implemented by the terminal alone, by the server alone, or by the terminal and the server collaboratively. Taking terminal implementation as an example, see Figure 10.
  • Figure 10 is a schematic flowchart of an audio decoding method provided by an embodiment of the present application.
  • the audio decoding method provided by an embodiment of the present application includes:
  • Step 601 The terminal receives code streams corresponding to multiple levels obtained by encoding the audio signal.
  • the terminal serves as a decoding end and receives code streams corresponding to multiple levels obtained by encoding the audio signal.
  • Step 602 Decode the code streams at each level to obtain signal characteristics at each level.
  • the data dimension of the signal feature is smaller than the data dimension of the audio signal.
  • In some embodiments, the terminal can decode the code streams of each level separately to obtain the signal features of each level in the following manner: for each level, perform entropy decoding on the code stream of the level to obtain the quantized value of the code stream, and then perform inverse quantization on the quantized value of the code stream to obtain the signal features of the level.
  • Here, the inverse quantization can be based on the quantization table used in the process of encoding the audio signal: the signal features corresponding to the quantized value of the code stream are looked up in the quantization table, thereby obtaining the signal features of the level, as sketched below.
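  • A hedged sketch of inverse quantization by table lookup: entropy decoding yields integer quantization indices, and the same quantization table used at the encoder maps each index back to a feature value (the table values and indices here are made up for illustration):

    import numpy as np

    quant_table = np.linspace(-1.0, 1.0, 8)  # 8-level scalar quantization table
    indices = np.array([0, 3, 7, 4])         # quantized values from entropy decoding
    features = quant_table[indices]          # dequantized signal features
    print(features)                          # [-1.0, -0.1428..., 1.0, 0.1428...]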
  • It should be noted that the received code streams of each level may include low-frequency code streams and high-frequency code streams, where the low-frequency code stream is encoded based on the low-frequency signal features of the audio signal, and the high-frequency code stream is encoded based on the high-frequency signal features of the audio signal.
  • The decoding of the high-frequency and low-frequency code streams is similar to the decoding of the code stream described above. That is, for the low-frequency code stream of each level, the following processing is performed: entropy decoding is performed on the low-frequency code stream of the level to obtain the quantized value of the low-frequency code stream, and inverse quantization is performed on that quantized value to obtain the low-frequency signal features of the level. Correspondingly, for the high-frequency code stream of each level: entropy decoding is performed on the high-frequency code stream of the level to obtain the quantized value of the high-frequency code stream, and inverse quantization is performed on that quantized value to obtain the high-frequency signal features of the level.
  • Step 603 Perform feature reconstruction on the signal features of each level to obtain hierarchical audio signals of each level.
  • In some embodiments, the terminal can perform feature reconstruction on the signal features of each level to obtain the hierarchical audio signal of each level in the following manner. For the signal features of each level, the following processing is performed: first, a first convolution is applied to the signal features; this first convolution can be implemented by calling a causal convolution with a preset number of channels, yielding the convolution features of the level.
  • Next, the convolution features are upsampled; an upsampling factor can be set in advance, and upsampling is performed based on this factor to obtain the upsampling features of the level. The upsampling features are then pooled; a pooling factor can be preset, and pooling is performed on the upsampling features based on this factor to obtain the pooling features of the level.
  • Finally, a second convolution is applied to the pooling features; this second convolution can likewise be implemented by calling a causal convolution with a preset number of channels, thereby obtaining the hierarchical audio signal of the level.
  • This upsampling can be achieved through one decoding layer or through multiple cascaded decoding layers. When L cascaded decoding layers are used, the terminal can upsample the convolution features in the following way to obtain the upsampling features of the level: through the first decoding layer among the L cascaded decoding layers, the convolution features are upsampled to obtain the upsampling result of the first decoding layer; through the k-th decoding layer among the L cascaded decoding layers, the upsampling result of the (k-1)-th decoding layer is upsampled to obtain the upsampling result of the k-th decoding layer, where L and k are integers greater than 1 and k is less than or equal to L; k is traversed to obtain the upsampling result of the L-th decoding layer, and the upsampling result of the L-th decoding layer is used as the upsampling features of the level.
  • It should be noted that the upsampling factors of the decoding layers can be the same or different. A sketch of this reconstruction pipeline follows.
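  • An illustrative decoder-side reconstruction for one level (all channel counts, kernel sizes and upsampling factors are assumptions, not this application's exact configuration): a first convolution, L = 3 cascaded upsampling decoding layers, pooling, and a second convolution:

    import torch
    import torch.nn as nn

    decoder = nn.Sequential(
        nn.Conv1d(64, 192, kernel_size=3, padding=2, dilation=2),  # first convolution
        nn.ConvTranspose1d(192, 96, kernel_size=8, stride=8),      # decoding layer 1: 1 -> 8
        nn.ConvTranspose1d(96, 48, kernel_size=8, stride=8),       # decoding layer 2: 8 -> 64
        nn.ConvTranspose1d(48, 24, kernel_size=10, stride=10),     # decoding layer 3: 64 -> 640
        nn.AvgPool1d(kernel_size=2),                               # pooling: 640 -> 320
        nn.Conv1d(24, 1, kernel_size=3, padding=1),                # second convolution
    )
    feature = torch.randn(1, 64, 1)   # one level's 64-dimensional signal feature
    print(decoder(feature).shape)     # torch.Size([1, 1, 320]): one 320-sample frame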
  • Step 604: Perform audio synthesis on the hierarchical audio signals of the multiple levels to obtain the audio signal.
  • In some embodiments, the code stream includes a low-frequency code stream and a high-frequency code stream. In this case, step 602 shown in Figure 10 can be implemented by the following steps: decode the low-frequency code stream of each level separately to obtain the low-frequency signal features of each level, and decode the high-frequency code stream of each level separately to obtain the high-frequency signal features of each level. Correspondingly, step 603 shown in Figure 10 can be implemented through the following steps: step 6031, perform feature reconstruction on the low-frequency signal features of each level to obtain the hierarchical low-frequency subband signal of each level, and perform feature reconstruction on the high-frequency signal features of each level to obtain the hierarchical high-frequency subband signal of each level; step 6032, use the hierarchical low-frequency subband signal and the hierarchical high-frequency subband signal as the hierarchical audio signal of the level.
  • Correspondingly, step 604 shown in Figure 10 can be implemented through the following steps: step 6041, add the hierarchical low-frequency subband signals of the multiple levels to obtain the low-frequency subband signal, and add the hierarchical high-frequency subband signals of the multiple levels to obtain the high-frequency subband signal; step 6042, synthesize the low-frequency subband signal and the high-frequency subband signal to obtain the audio signal.
  • In some embodiments, step 6042 can be implemented by the following steps: step 60421, upsample the low-frequency subband signal to obtain a low-pass filtered signal; step 60422, upsample the high-frequency subband signal to obtain a high-frequency filtered signal; step 60423, perform filtering synthesis on the low-pass filtered signal and the high-frequency filtered signal to obtain the audio signal. It should be noted that in step 60423, a QMF synthesis filter can be used to perform the synthesis processing to obtain the audio signal.
  • Figure 11 is a schematic flowchart of the audio decoding method provided by the embodiment of the present application.
  • As shown in Figure 11, the audio decoding method provided by the embodiment of the present application includes: step 701, receive the low-frequency code streams and high-frequency code streams respectively corresponding to the multiple levels obtained by encoding the audio signal; step 702a, decode the low-frequency code stream of each level separately to obtain the low-frequency signal features of each level; step 702b, decode the high-frequency code stream of each level separately to obtain the high-frequency signal features of each level; step 703a, perform feature reconstruction on the low-frequency signal features of each level to obtain the hierarchical low-frequency subband signal of each level; step 703b, perform feature reconstruction on the high-frequency signal features of each level to obtain the hierarchical high-frequency subband signal of each level; step 704a, add the hierarchical low-frequency subband signals of the multiple levels to obtain the low-frequency subband signal; step 704b, add the hierarchical high-frequency subband signals of the multiple levels to obtain the high-frequency subband signal; step 705a, upsample the low-frequency subband signal to obtain a low-pass filtered signal; step 705b, upsample the high-frequency subband signal to obtain a high-frequency filtered signal; finally, perform filtering synthesis on the low-pass filtered signal and the high-frequency filtered signal to obtain the audio signal (as in step 60423 above).
  • It should be noted that the feature reconstruction of the high-frequency signal features and of the low-frequency signal features can refer to the feature reconstruction of the signal features in step 603. That is, for the high-frequency signal features of each level, the following processing is performed: apply the first convolution to the high-frequency signal features to obtain the high-frequency convolution features of the level; upsample the high-frequency convolution features to obtain the high-frequency upsampling features of the level; pool the high-frequency upsampling features to obtain the high-frequency pooling features of the level; and apply the second convolution to the high-frequency pooling features to obtain the hierarchical high-frequency subband signal of the level.
  • Similarly, for the low-frequency signal features of each level, the following processing is performed: apply the first convolution to the low-frequency signal features to obtain the low-frequency convolution features of the level; upsample the low-frequency convolution features to obtain the low-frequency upsampling features of the level; pool the low-frequency upsampling features to obtain the low-frequency pooling features of the level; and apply the second convolution to the low-frequency pooling features to obtain the hierarchical low-frequency subband signal of the level.
  • Audio coding and decoding technology uses limited network bandwidth resources to transmit as much voice information as possible. The compression rate of audio codecs can reach more than 10 times; that is, 10 MB of voice data only requires 1 MB for transmission after compression by the encoder, which greatly reduces the bandwidth resources required to transmit the information.
  • Standard voice codec protocols have been deployed within the industry, such as the standards of international and domestic standardization organizations including the ITU Telecommunication Standardization Sector (ITU-T), the 3rd Generation Partnership Project (3GPP), the Internet Engineering Task Force (IETF), the Audio Video coding Standard (AVS) workgroup and the China Communications Standards Association (CCSA), for example G.711, G.722, the AMR series, EVS and OPUS.
  • Figure 12 gives a schematic diagram of spectrum comparison under different bit rates to demonstrate the relationship between compression bit rate and quality.
  • Curve 1201 is the spectrum curve of the original speech, that is, the signal without compression; curve 1202 is the spectrum curve of the OPUS encoder at the 20kbps code rate; curve 1203 is the spectrum curve of the OPUS encoding at the 6kbps code rate. It can be seen from Figure 12 that as the coding rate increases, the compressed signal becomes closer to the original signal.
  • Speech coding can be broadly divided into time-domain coding and frequency-domain coding. Time-domain coding, such as waveform speech coding, directly encodes the waveform of the speech signal; the advantage of this coding method is that the encoded speech quality is high, but the coding efficiency is not high.
  • Alternatively, parametric coding can be used, where the encoding end extracts the corresponding parameters of the speech signal to be transmitted; the advantage of parametric coding is that the coding efficiency is extremely high, but the quality of the restored speech is low.
  • Frequency-domain coding transforms the audio signal into the frequency domain, extracts the frequency-domain coefficients, and then encodes the frequency-domain coefficients, but its coding efficiency is not ideal either. In short, compression methods based purely on signal processing cannot improve coding efficiency while ensuring coding quality.
  • embodiments of the present application provide an audio encoding method and an audio decoding method to ensure encoding quality while improving encoding efficiency.
  • With the solutions provided by the embodiments of the present application, different coding configurations can be flexibly selected according to the content to be encoded and the network bandwidth conditions, even in the low bit rate range, and the coding efficiency can be improved while the complexity and coding quality remain acceptable.
  • Figure 13 is a schematic flow chart of audio encoding and audio decoding provided by an embodiment of the present application.
  • the audio encoding method provided by the embodiment of this application includes:
  • First, the audio signal can be sampled at a first sampling frequency to obtain a sampled signal, and the sampled signal can then be decomposed into subbands with an analysis filter, such as a QMF analysis filter, to obtain subband signals whose sampling frequency is lower than the first sampling frequency, including a low-frequency subband signal and a high-frequency subband signal.
  • Then, based on the low-frequency subband signal x_LB(n), the first-layer low-frequency analysis neural network is called to obtain the low-dimensional first-layer low-frequency signal feature F_LB(n). It should be noted that the dimension of the signal feature is smaller than the dimension of the low-frequency subband signal (so as to reduce the amount of data).
  • Neural networks include but are not limited to Dilated CNN, Autoencoder, Full-connection, LSTM, CNN+LSTM, etc.
  • Similarly, the first-layer high-frequency analysis neural network is called on the high-frequency subband signal x_HB(n) to obtain the low-dimensional first-layer high-frequency signal feature F_HB(n). Next, based on the second-layer low-frequency analysis neural network, the low-frequency subband signal and the first-layer low-frequency signal feature are analyzed to obtain the second-layer low-frequency signal feature (that is, the second-layer low-frequency residual signal feature); for example, by combining x_LB(n) and F_LB(n), the second-layer low-frequency analysis neural network is called to obtain the low-dimensional second-layer low-frequency signal feature F_LB,e(n).
  • Likewise, based on the second-layer high-frequency analysis neural network, the high-frequency subband signal and the first-layer high-frequency signal feature are analyzed to obtain the second-layer high-frequency signal feature (that is, the second-layer high-frequency residual signal feature); for example, by combining x_HB(n) and F_HB(n), the second-layer high-frequency analysis neural network is called to obtain the low-dimensional second-layer high-frequency signal feature F_HB,e(n).
  • Finally, the two layers of signal features (including the first-layer low-frequency signal feature, the first-layer high-frequency signal feature, the second-layer low-frequency signal feature and the second-layer high-frequency signal feature) are quantized and encoded.
  • the decoding end may only receive one layer of code streams, as shown in Figure 13, and can use "one-layer decoding" for decoding.
  • In this case, the audio decoding method provided by the embodiment of the present application includes: (1) decode the received layer of code stream to obtain the low-frequency signal features and high-frequency signal features of that layer; (2) analyze the low-frequency signal features based on the first-layer low-frequency synthesis neural network to obtain the low-frequency subband signal estimate; for example, based on the quantized value F′_LB(n) of the low-frequency signal features, the first-layer low-frequency synthesis neural network is called to generate the low-frequency subband signal estimate x′_LB(n); (3) analyze the high-frequency signal features based on the first-layer high-frequency synthesis neural network to obtain the high-frequency subband signal estimate; for example, based on the quantized value F′_HB(n) of the high-frequency signal features, the first-layer high-frequency synthesis neural network is called to generate the high-frequency subband signal estimate x′_HB(n).
  • the decoding end may receive the code streams of both layers.
  • "two-layer decoding" can be used for decoding.
  • the audio decoding method provided by the embodiment of this application includes:
  • Based on the first-layer low-frequency synthesis neural network, the first-layer low-frequency signal features are analyzed to obtain the first-layer low-frequency subband signal estimate. For example, based on the quantized value F′_LB(n) of the first-layer low-frequency signal features, the first-layer low-frequency synthesis neural network is called to generate the first-layer low-frequency subband signal estimate x′_LB(n).
  • Based on the first-layer high-frequency synthesis neural network, the first-layer high-frequency signal features are analyzed to obtain the first-layer high-frequency subband signal estimate. For example, based on the quantized value F′_HB(n) of the first-layer high-frequency signal features, the first-layer high-frequency synthesis neural network is called to generate the first-layer high-frequency subband signal estimate x′_HB(n).
  • Based on the second-layer low-frequency synthesis neural network, the second-layer low-frequency signal features are analyzed to obtain the second-layer low-frequency subband residual signal estimate. For example, based on the quantized value F′_LB,e(n) of the second-layer low-frequency signal features, the second-layer low-frequency synthesis neural network is called to generate the low-frequency subband residual signal estimate x′_LB,e(n).
  • Based on the second-layer high-frequency synthesis neural network, the second-layer high-frequency signal features are analyzed to obtain the second-layer high-frequency subband residual signal estimate. For example, based on the quantized value F′_HB,e(n) of the second-layer high-frequency signal features, the second-layer high-frequency synthesis neural network is called to generate the high-frequency subband residual signal estimate x′_HB,e(n).
  • The first-layer low-frequency subband signal estimate and the low-frequency subband residual signal estimate are summed to obtain the low-frequency subband signal estimate; for example, x′_LB(n) and x′_LB,e(n) are summed to obtain a high-quality low-frequency subband signal estimate.
  • The first-layer high-frequency subband signal estimate and the high-frequency subband residual signal estimate are summed to obtain the high-frequency subband signal estimate; for example, x′_HB(n) and x′_HB,e(n) are summed to obtain a high-quality high-frequency subband signal estimate.
  • FIG 14 is a schematic diagram of a voice communication link provided by an embodiment of the present application.
  • Taking a Voice over Internet Protocol (VoIP) communication link as an example, the encoder is deployed on the uplink client 1401, and the decoder is deployed on the downlink client 1402. Voice is collected through the uplink client and processed by preprocessing enhancement, encoding and the like, and the encoded code stream is transmitted to the downlink client 1402 through the network; the downlink client 1402 performs decoding, enhancement and other processing so that the decoded voice can be played on the downlink client 1402.
  • In some cases, a transcoder needs to be deployed in the background of the system (that is, on the server) to solve the problem of interconnection between the new encoder and existing encoders. For example, when the sending end (the upstream client) uses the encoder provided by the embodiments of the present application while the receiving end (the downstream client) only supports a standard decoder, such as a G.722 decoder on a Public Switched Telephone Network (PSTN) terminal, the server-side transcoder converts the code stream so that the two ends can interoperate.
  • Before the encoding flow is detailed, the QMF filter bank and the dilated (atrous) convolution network are introduced first.
  • a QMF filter bank is a pair of analysis-synthesis filters.
  • the input signal with a sampling rate of Fs can be decomposed into two signals with a sampling rate of Fs/2, representing the QMF low-pass signal and the QMF high-pass signal respectively.
  • the spectral response of the low-pass part (H_Low(z)) and high-pass part (H_High(z)) of the QMF filter is shown in Figure 15.
  • Here, h_Low(k) denotes the coefficients of the low-pass filter and h_High(k) the coefficients of the high-pass filter; in a typical QMF design the two are related by h_High(k) = (-1)^k · h_Low(k), as in formula (1). Based on h_Low(k) and h_High(k), the QMF analysis filter bank H_Low(z) and H_High(z) can be described, and the corresponding QMF synthesis filter bank can be described as shown in formula (2), where G_Low(z) is the synthesis filter that recovers the low-pass signal and G_High(z) is the synthesis filter that recovers the high-pass signal (for example, G_Low(z) = H_Low(z) and G_High(z) = -H_High(z)).
  • The low-pass and high-pass signals recovered at the decoding end are synthesized by the QMF synthesis filter bank, and the reconstructed signal with the sampling rate Fs corresponding to the input signal can be recovered, as sketched below.
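  • A minimal 2-channel QMF sketch using a 2-tap (Haar-like) prototype, so the analysis/synthesis relations above can be checked numerically; production codecs use much longer prototype filters for sharper band separation:

    import numpy as np

    x = np.random.randn(640)                     # one input frame at sampling rate Fs
    h_low = np.array([1.0, 1.0]) / np.sqrt(2.0)  # prototype low-pass h_Low(k)
    h_high = h_low * np.array([1.0, -1.0])       # h_High(k) = (-1)^k * h_Low(k)

    def upsample2(s):                            # zero-insertion upsampling by 2
        u = np.zeros(2 * len(s))
        u[::2] = s
        return u

    # Analysis: filter, then downsample by 2 -> two subband signals at Fs/2.
    x_lb = np.convolve(x, h_low)[1::2]           # low-frequency subband (320 samples)
    x_hb = np.convolve(x, h_high)[1::2]          # high-frequency subband (320 samples)

    # Synthesis: upsample by 2, filter with G_Low = H_Low and G_High = -H_High, sum.
    y = np.convolve(upsample2(x_lb), h_low) + np.convolve(upsample2(x_hb), -h_high)
    print(np.allclose(y[:640], x))               # True: the Fs-rate signal is recovered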
  • Figure 16A is a schematic diagram of a common convolutional network provided by an embodiment of the present application
  • Figure 16B is a schematic diagram of a dilated convolutional network provided by an embodiment of the present application.
  • Dilated (atrous) convolution can increase the receptive field while keeping the size of the feature map unchanged, and also avoids the errors introduced by upsampling and downsampling. The convolution kernel size (Kernel Size) in both Figure 16A and Figure 16B is 3×3; however, the receptive field 901 of the ordinary convolution shown in Figure 16A is only 3, while the receptive field 902 of the dilated convolution shown in Figure 16B reaches 5. In other words, the ordinary convolution shown in Figure 16A has a receptive field of 3 and a dilation rate (Dilation Rate, the number of intervals between points in the convolution kernel) of 1, whereas the dilated convolution shown in Figure 16B has a receptive field of 5 and a dilation rate of 2.
  • In addition, the convolution kernel can move on the plane shown in Figure 16A or Figure 16B, which involves the concept of the stride rate (Stride Rate, i.e., the step size); for example, if the convolution kernel shifts by 1 grid each time, the corresponding stride rate is 1.
  • Another relevant parameter is the number of convolution channels, that is, the number of convolution kernels used to perform the convolution analysis. Theoretically, the greater the number of channels, the more comprehensive the analysis of the signal and the higher the accuracy, but also the higher the complexity. For example, a 1×320 tensor can be processed with a 24-channel convolution operation, and the output is a 24×320 tensor.
  • In practical applications, the size of the dilated convolution kernel (for example, for speech signals the kernel size can be set to 1×3), the dilation rate, the stride rate and the number of channels can all be defined according to actual application needs; the embodiments of the present application do not limit this. A parameterized sketch follows.
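  • A sketch of the dilated convolution parameters discussed above, applied to a speech-like tensor; the 24-channel, length-3 kernel mirrors the text's example, while the padding choice is an assumption made to keep the frame length unchanged:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 320)                    # one frame: 1 channel x 320 samples
    conv = nn.Conv1d(in_channels=1, out_channels=24,
                     kernel_size=3, dilation=2,   # dilation rate 2 -> receptive field 5
                     stride=1, padding=2)         # "same" padding keeps the length
    print(conv(x).shape)                          # torch.Size([1, 24, 320])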
  • As shown in Figure 13, the audio encoding method provided by the embodiment of the present application includes the following steps. Step 1: generation of the input signal. Assuming a sampling rate of 32000 Hz and a frame length of 20 ms, each frame contains 640 sample points, and the 640 sample points of the n-th frame are recorded as x(n).
  • Step 2: QMF subband signal decomposition. The QMF analysis filter (such as a 2-channel QMF filter) is called to filter x(n), and the filtered signals are downsampled to obtain two subband signals, namely the low-frequency subband signal x_LB(n) and the high-frequency subband signal x_HB(n). Here, the effective bandwidth of the low-frequency subband signal x_LB(n) is 0-8 kHz, the effective bandwidth of the high-frequency subband signal x_HB(n) is 8-16 kHz, and each subband signal contains 320 sample points per frame.
  • Step 3: first-layer low-frequency analysis. The purpose of calling the first-layer low-frequency analysis neural network is to generate the lower-dimensional first-layer low-frequency signal feature F_LB(n) based on the low-frequency subband signal x_LB(n). Here, the data dimension of x_LB(n) is 320 and the data dimension of F_LB(n) is 64, so the function of this step can be understood as data compression.
  • Referring to Figure 17, which is a schematic structural diagram of the first-layer low-frequency analysis neural network provided by the embodiment of the present application, the processing flow of the low-frequency subband signal x_LB(n) is as follows: after the preceding stages shown in Figure 17 (for example, a causal convolution followed by pooling) produce a 24×160 tensor, three cascaded coding blocks, whose channel numbers are set to 48, 96 and 192 respectively, convert the 24×160 tensor successively into 48×40, 96×8 and 192×1 tensors; a sketch of these blocks is given below.
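  • An illustrative stand-in for the three cascaded coding blocks: each block is modeled here as a single strided convolution chosen to reproduce the tensor shapes quoted above (24×160 -> 48×40 -> 96×8 -> 192×1); the real blocks may also contain dilated convolutions and nonlinearities, which are omitted:

    import torch
    import torch.nn as nn

    blocks = nn.Sequential(
        nn.Conv1d(24, 48, kernel_size=4, stride=4),   # 24 x 160 -> 48 x 40
        nn.Conv1d(48, 96, kernel_size=5, stride=5),   # 48 x 40  -> 96 x 8
        nn.Conv1d(96, 192, kernel_size=8, stride=8),  # 96 x 8   -> 192 x 1
    )
    t = torch.randn(1, 24, 160)
    print(blocks(t).shape)                            # torch.Size([1, 192, 1])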
  • Step 4: first-layer high-frequency analysis. The purpose of calling the first-layer high-frequency analysis neural network is to generate the lower-dimensional first-layer high-frequency signal feature F_HB(n) based on the high-frequency subband signal x_HB(n). The structure of the first-layer high-frequency analysis neural network can be consistent with that of the first-layer low-frequency analysis neural network; that is, the data dimension of the input (x_HB(n)) is 320 and the data dimension of the output (F_HB(n)) is 64.
  • Alternatively, the output dimension can be appropriately reduced, which can lower the complexity of the first-layer high-frequency analysis neural network; the embodiments of the present application do not limit this.
  • Step 5: second-layer low-frequency analysis. The purpose of calling the second-layer low-frequency analysis neural network is to obtain the lower-dimensional second-layer low-frequency signal feature F_LB,e(n) based on the low-frequency subband signal x_LB(n) and the first-layer low-frequency signal feature F_LB(n).
  • The second-layer low-frequency signal feature reflects the residual between the audio signal that the output of the first-layer low-frequency analysis neural network reconstructs at the decoding end and the original audio signal. Therefore, at the decoding end, the residual signal of the low-frequency subband signal can be predicted from F_LB,e(n) and summed with the low-frequency subband signal estimate predicted from the output of the first-layer low-frequency analysis neural network, yielding a higher-precision low-frequency subband signal estimate.
  • The second-layer low-frequency analysis neural network adopts a structure similar to that of the first-layer low-frequency analysis neural network; see Figure 18, which is a schematic structural diagram of the second-layer low-frequency analysis neural network provided by the embodiment of the present application. The main differences from the first-layer low-frequency analysis neural network are: (1) in addition to the low-frequency subband signal x_LB(n), the input of the second-layer low-frequency analysis neural network also includes the output F_LB(n) of the first-layer low-frequency analysis neural network; the two variables x_LB(n) and F_LB(n) can be spliced into a 384-dimensional spliced feature; (2) considering that the second layer of low-frequency analysis processes a residual signal, the dimension of the output F_LB,e(n) of the second-layer low-frequency analysis neural network is set to 28.
  • Step 6: second-layer high-frequency analysis. The purpose of calling the second-layer high-frequency analysis neural network is to obtain the lower-dimensional second-layer high-frequency signal feature F_HB,e(n) based on the high-frequency subband signal x_HB(n) and the first-layer high-frequency signal feature F_HB(n). The structure of the second-layer high-frequency analysis neural network can be the same as that of the second-layer low-frequency analysis neural network; that is, the data dimension of the input (the spliced feature of x_HB(n) and F_HB(n)) is 384 and the data dimension of the output (F_HB,e(n)) is 28.
  • Step 7: quantization encoding. The signal features output by the two layers are quantized, and the quantization results are encoded. Here, the quantization can use scalar quantization (each component is quantized individually), and the encoding can use an entropy coding method. The embodiments of the present application also do not limit the technical solution of combining vector quantization (multiple adjacent components are combined into one vector for joint quantization) with entropy coding. A minimal quantization sketch follows.
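  • A minimal scalar quantization sketch: each feature component is independently quantized to the nearest entry of a small codebook, and the resulting indices are what the entropy coder would then compress (the codebook and feature values here are illustrative):

    import numpy as np

    features = np.array([0.31, -0.77, 0.05, 0.92])
    codebook = np.linspace(-1.0, 1.0, 8)          # a 3-bit scalar quantizer
    indices = np.abs(features[:, None] - codebook[None, :]).argmin(axis=1)
    dequantized = codebook[indices]
    print(indices, dequantized)                   # the indices feed entropy coding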
  • For example, the first-layer low-frequency signal feature F_LB(n) is a 64-dimensional feature and can be encoded at 8 kbps, which corresponds to an average of 2.5 bits per parameter per frame; the first-layer high-frequency signal feature F_HB(n) is a 64-dimensional feature and can be encoded at 6 kbps, which corresponds to an average of 1.875 bits per parameter per frame. Encoding the first layer therefore takes 14 kbps in total.
  • The second-layer low-frequency signal feature F_LB,e(n) is a 28-dimensional feature and can be encoded at 3.5 kbps, which corresponds to an average of 2.5 bits per parameter per frame; the second-layer high-frequency signal feature F_HB,e(n) is likewise a 28-dimensional feature and can be encoded at 3.5 kbps, also an average of 2.5 bits per parameter per frame. Encoding the second layer therefore takes 7 kbps in total. The arithmetic is checked below.
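  • Checking the per-parameter bit budgets quoted above, assuming 20 ms frames (50 frames per second), which matches the 640-sample frames at 32 kHz:

    frames_per_second = 50
    for name, kbps, dims in [("layer-1 low", 8.0, 64), ("layer-1 high", 6.0, 64),
                             ("layer-2 low", 3.5, 28), ("layer-2 high", 3.5, 28)]:
        bits_per_frame = kbps * 1000 / frames_per_second
        print(name, bits_per_frame / dims, "bits per parameter")
    # layer-1 low 2.5, layer-1 high 1.875, layer-2 low 2.5, layer-2 high 2.5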
  • In this way, different feature vectors can be progressively encoded through hierarchical coding; the embodiments of the present application do not limit other bit rate allocations for different application scenarios. If conditions permit, third or higher layers of coding can also be introduced iteratively. After quantization encoding, a code stream is generated, and different transmission strategies can be used so that different layers are transmitted with different priorities.
  • For example, a Forward Error Correction (FEC) mechanism can be used to transmit the code streams with redundancy, and the redundancy multiples of different layers can differ; for instance, the redundancy multiple of the first layer can be set higher, as in the sketch below.
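  • A hedged sketch of layer-dependent FEC redundancy, where the lower (more important) layer is sent with more redundant copies; the packetization details are assumptions, not part of this application:

    def packets_with_redundancy(streams):
        # streams: {layer_number: payload_bytes}; layer 1 gets 3 copies,
        # higher layers get 2, so the base layer survives more packet loss.
        packets = []
        for layer, payload in streams.items():
            copies = 3 if layer == 1 else 2
            packets.extend([(layer, payload)] * copies)
        return packets

    print(len(packets_with_redundancy({1: b"base", 2: b"enhance"})))  # 5 packets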
  • Correspondingly, as shown in Figure 13, the audio decoding method provided by the embodiment of the present application includes the following steps. Step 1: decoding. Decoding is the reverse process of encoding: the received code stream is parsed, and the low-frequency signal feature estimates and high-frequency signal feature estimates are obtained by looking up the quantization table. Specifically, for the first layer, the quantized value F′_LB(n) of the 64-dimensional signal feature of the low-frequency subband signal and the quantized value F′_HB(n) of the 64-dimensional signal feature of the high-frequency subband signal are obtained; for the second layer, the quantized value F′_LB,e(n) of the 28-dimensional signal feature of the low-frequency subband signal and the quantized value F′_HB,e(n) of the 28-dimensional signal feature of the high-frequency subband signal are obtained.
  • Step 2: first-layer low-frequency synthesis. The purpose of calling the first-layer low-frequency synthesis neural network is to generate the first-layer low-frequency subband signal estimate x′_LB(n) based on the quantized value F′_LB(n) of the low-frequency feature vector. Referring to Figure 19, which is a model schematic diagram of the first-layer low-frequency synthesis neural network provided by the embodiment of the present application, the processing flow of the first-layer low-frequency synthesis neural network is similar to that of the first-layer low-frequency analysis neural network; for example, it likewise uses causal convolution, and its post-processing structure is similar to that of the first-layer low-frequency analysis neural network.
  • Step 3: first-layer high-frequency synthesis. The structure of the first-layer high-frequency synthesis neural network is the same as that of the first-layer low-frequency synthesis neural network; the first-layer high-frequency subband signal estimate x′_HB(n) can be obtained from the quantized value F′_HB(n) of the first-layer high-frequency signal feature.
  • Step 4: second-layer low-frequency synthesis. Referring to Figure 20, which is a schematic structural diagram of the second-layer low-frequency synthesis neural network provided by the embodiment of the present application, the structure of the second-layer low-frequency synthesis neural network is similar to that of the first-layer low-frequency synthesis neural network; the difference lies in the input, whose data dimension is 28. Based on the quantized value F′_LB,e(n) of the second-layer low-frequency signal feature, the low-frequency subband residual signal estimate x′_LB,e(n) is generated.
  • Step 5: second-layer high-frequency synthesis. The structure of the second-layer high-frequency synthesis neural network is the same as that of the second-layer low-frequency synthesis neural network; the high-frequency subband residual signal estimate x′_HB,e(n) can be generated based on the quantized value F′_HB,e(n) of the second-layer high-frequency signal feature.
  • Step 6: synthesis filtering. Through the above steps, the decoding end obtains the first-layer low-frequency subband signal estimate x′_LB(n) and the first-layer high-frequency subband signal estimate x′_HB(n), as well as the low-frequency subband residual signal estimate x′_LB,e(n) and the high-frequency subband residual signal estimate x′_HB,e(n). The residual estimates are summed with the corresponding first-layer estimates to obtain the low-frequency and high-frequency subband signal estimates; these are upsampled, and the QMF synthesis filter is called to perform synthesis filtering on the upsampling results, generating the 640-point reconstructed audio signal x′(n).
  • It should be noted that the relevant neural networks at the encoding end and the decoding end can be jointly trained on collected data to obtain optimal parameters, and the trained network models are then put into use.
  • The above discloses only one specific configuration of network inputs, network structures and network outputs; engineers in related fields can modify the above configuration as needed.
  • Through the above scheme, a low-bit-rate audio encoding and decoding solution based on signal processing and deep learning networks is realized. Compared with the related art, the coding efficiency is significantly improved, and the coding quality is also improved. In addition, the encoding end can select different layered transmission strategies for code stream transmission: the decoding end outputs an audio signal of acceptable quality when it receives only the lower-layer code stream, and can output high-quality audio if it also receives the other, higher-layer code streams.
  • Continuing with the structure of the electronic device, the software modules of the audio encoding apparatus 553 stored in the memory 550 may include: a first feature extraction module 5531, configured to perform first-level feature extraction on the audio signal to obtain the signal features of the first level; a second feature extraction module 5532, configured to, for the i-th level among the N levels, splice the audio signal and the signal features of the (i-1)-th level to obtain spliced features, and perform i-th-level feature extraction on the spliced features to obtain the signal features of the i-th level, where N and i are integers greater than 1 and i is less than or equal to N; a traversal module 5533, configured to traverse i to obtain the signal features of each of the N levels, where the data dimension of the signal features is smaller than the data dimension of the audio signal; and an encoding module 5534, configured to encode the signal features of the first level and the signal features of each of the N levels separately to obtain the code streams of the audio signal at each level.
  • In some embodiments, the first feature extraction module 5531 is further configured to perform subband decomposition on the audio signal to obtain a low-frequency subband signal and a high-frequency subband signal of the audio signal; perform first-level feature extraction on the low-frequency subband signal to obtain the first-level low-frequency signal features, and perform first-level feature extraction on the high-frequency subband signal to obtain the first-level high-frequency signal features; and use the low-frequency signal features and the high-frequency signal features as the signal features of the first level.
  • the first feature extraction module 5531 is further configured to sample the audio signal according to a first sampling frequency to obtain a sampled signal; and perform low-pass filtering on the sampled signal to obtain a low-pass filtered signal. , and down-sample the low-pass filtered signal to obtain the low-frequency sub-band signal of the second sampling frequency; perform high-pass filtering on the sampled signal to obtain a high-pass filtered signal, and down-sample the high-pass filtered signal. , obtaining the high-frequency subband signal of a second sampling frequency; wherein the second sampling frequency is smaller than the first sampling frequency.
  • In some embodiments, the second feature extraction module 5532 is further configured to splice the low-frequency subband signal of the audio signal and the low-frequency signal features of the (i-1)-th level to obtain a first spliced feature, and perform i-th-level feature extraction on the first spliced feature to obtain the low-frequency signal features of the i-th level; splice the high-frequency subband signal of the audio signal and the high-frequency signal features of the (i-1)-th level to obtain a second spliced feature, and perform i-th-level feature extraction on the second spliced feature to obtain the high-frequency signal features of the i-th level; and use the low-frequency signal features of the i-th level and the high-frequency signal features of the i-th level as the signal features of the i-th level.
  • the first feature extraction module 5531 is further configured to perform a first convolution process on the audio signal to obtain the first-level convolution feature; and perform a first convolution process on the convolution feature. Pooling processing to obtain the pooling features of the first level; performing first downsampling on the pooling features to obtain downsampling features of the first level; performing second convolution processing on the downsampling features , obtain the signal characteristics of the first level.
  • In some embodiments, the first downsampling is implemented through M cascaded coding layers, and the first feature extraction module 5531 is further configured to: through the first coding layer among the M cascaded coding layers, perform the first downsampling on the pooling features to obtain the downsampling result of the first coding layer; through the j-th coding layer among the M cascaded coding layers, perform the first downsampling on the downsampling result of the (j-1)-th coding layer to obtain the downsampling result of the j-th coding layer, where M and j are integers greater than 1 and j is less than or equal to M; and traverse j to obtain the downsampling result of the M-th coding layer, and use the downsampling result of the M-th coding layer as the downsampling features of the first level.
  • the second feature extraction module 5532 is also configured to perform a third convolution process on the spliced features to obtain the convolution features of the i-th level; and perform a second convolution process on the convolution features. Pooling processing is performed to obtain the pooling features of the i-th level; second down-sampling is performed on the pooling features to obtain down-sampling features of the i-th level; and fourth convolution processing is performed on the down-sampling features. , obtain the signal characteristics of the i-th level.
  • In some embodiments, the encoding module 5534 is further configured to separately perform quantization on the signal features of the first level and the signal features of each of the N levels to obtain the quantization result of the signal features of each level, and to perform entropy coding on the quantization result of the signal features of each level to obtain the code stream of the audio signal at each level.
  • In some embodiments, the signal features include low-frequency signal features and high-frequency signal features, and the encoding module 5534 is further configured to encode the low-frequency signal features of the first level and the low-frequency signal features of each of the N levels separately to obtain the low-frequency code stream of the audio signal at each level; encode the high-frequency signal features of the first level and the high-frequency signal features of each of the N levels separately to obtain the high-frequency code stream of the audio signal at each level; and use the low-frequency code stream and the high-frequency code stream of the audio signal at each level as the code stream of the audio signal at the corresponding level.
  • In some embodiments, the signal features include low-frequency signal features and high-frequency signal features, and the encoding module 5534 is further configured to encode the low-frequency signal features of the first level according to a first encoding bit rate to obtain the first code stream of the first level, and encode the high-frequency signal features of the first level according to a second encoding bit rate to obtain the second code stream of the first level; for the signal features of each of the N levels, the following processing is performed: encode the signal features of the level according to the third encoding bit rate of the level to obtain the third code stream of the level. The first code stream and the second code stream of the first level, together with the third code stream of each of the N levels, are used as the code streams of the audio signal at each level; wherein the first encoding bit rate is greater than the second encoding bit rate, the second encoding bit rate is greater than the third encoding bit rate of any of the N levels, and the encoding bit rate of a level is positively correlated with the decoding quality indicator of the code stream of the corresponding level.
  • In some embodiments, the encoding module 5534 is further configured to perform the following processing for each level: configure a corresponding level transmission priority for the code stream of the audio signal at the level; wherein the level transmission priority is negatively correlated with the level number of the level, and positively correlated with the decoding quality indicator of the code stream of the corresponding level.
  • In some embodiments, the signal features include low-frequency signal features and high-frequency signal features, and the code streams of the audio signal at each level include: a low-frequency code stream encoded based on the low-frequency signal features, and a high-frequency code stream encoded based on the high-frequency signal features. The encoding module 5534 is further configured to perform the following processing for each level: configure a first transmission priority for the low-frequency code stream of the level and a second transmission priority for the high-frequency code stream of the level; wherein the first transmission priority is higher than the second transmission priority, the second transmission priority of the (i-1)-th level is lower than the first transmission priority of the i-th level, and the transmission priority of a code stream is positively correlated with the decoding quality indicator of the corresponding code stream.
  • Through the above apparatus, layered coding of audio signals is realized: first-level feature extraction is performed on the audio signal to obtain the first-level signal features; for the i-th level among the N levels, the audio signal and the signal features of the (i-1)-th level are spliced and i-th-level feature extraction is performed on the spliced features to obtain the signal features of the i-th level; i is traversed to obtain the signal features of each of the N levels; and the first-level signal features and the signal features of each of the N levels are encoded separately to obtain the code streams of the audio signal at each level.
  • Since the data dimension of the extracted signal features is smaller than that of the audio signal, the data dimension of the data processed during audio encoding is reduced and the encoding efficiency of the audio signal is improved; since the output of each level is used as part of the input of the next level, each level combines the signal features extracted by the previous level to extract more accurate features of the audio signal, so that the information loss of the audio signal during feature extraction is minimized. The information contained in the multiple code streams obtained by encoding the signal features extracted in this way is closer to the original audio signal, which reduces the information loss of the audio signal during encoding and ensures the encoding quality of the audio encoding.
  • The audio decoding apparatus provided by the embodiment of the present application includes: a receiving module, configured to receive the code streams respectively corresponding to multiple levels obtained by encoding the audio signal; a decoding module, configured to decode the code streams of each level separately to obtain the signal features of each level, where the data dimension of the signal features is smaller than the data dimension of the audio signal; a feature reconstruction module, configured to perform feature reconstruction on the signal features of each level separately to obtain the hierarchical audio signal of each level; and an audio synthesis module, configured to perform audio synthesis on the hierarchical audio signals of the multiple levels to obtain the audio signal.
  • In some embodiments, the code stream includes a low-frequency code stream and a high-frequency code stream. The decoding module is further configured to decode the low-frequency code stream of each level separately to obtain the low-frequency signal features of each level, and to decode the high-frequency code stream of each level separately to obtain the high-frequency signal features of each level. The feature reconstruction module is further configured to perform feature reconstruction on the low-frequency signal features of each level separately to obtain the hierarchical low-frequency subband signal of each level, to perform feature reconstruction on the high-frequency signal features of each level to obtain the hierarchical high-frequency subband signal of each level, and to use the hierarchical low-frequency subband signal and the hierarchical high-frequency subband signal as the hierarchical audio signal of the level. The audio synthesis module is further configured to add the hierarchical low-frequency subband signals of the multiple levels to obtain the low-frequency subband signal, to add the hierarchical high-frequency subband signals of the multiple levels to obtain the high-frequency subband signal, and to synthesize the low-frequency subband signal and the high-frequency subband signal to obtain the audio signal.
  • the audio synthesis module is further configured to upsample the low-frequency subband signal to obtain a low-pass filtered signal; to upsample the high-frequency subband signal to obtain a high-frequency filtered signal; The low-pass filtered signal and the high-frequency filtered signal are filtered and synthesized to obtain the audio signal.
  • the feature reconstruction module is further configured to perform the following processing on the signal features of each level: perform a first convolution process on the signal features to obtain the convolution features of the level; Upsample the convolution features to obtain the upsampling features of the level; perform pooling processing on the upsampling features to obtain the pooling features of the level; perform a second convolution on the pooling features Process to obtain the hierarchical audio signal of the hierarchical level.
  • In some embodiments, the upsampling is implemented through L cascaded decoding layers, and the feature reconstruction module is further configured to: through the first decoding layer among the L cascaded decoding layers, upsample the convolution features to obtain the upsampling result of the first decoding layer; through the k-th decoding layer among the L cascaded decoding layers, upsample the upsampling result of the (k-1)-th decoding layer to obtain the upsampling result of the k-th decoding layer, where L and k are integers greater than 1 and k is less than or equal to L; and traverse k to obtain the upsampling result of the L-th decoding layer, and use the upsampling result of the L-th decoding layer as the upsampling features of the level.
  • the decoding module is further configured to perform the following processing for each of the levels: perform entropy decoding on the code stream of the level to obtain the quantized value of the code stream; The quantized value is subjected to inverse quantization processing to obtain the signal characteristics of the level.
  • Embodiments of the present application also provide a computer program product or computer program.
  • the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided by the embodiment of the present application.
  • Embodiments of the present application also provide a computer-readable storage medium in which executable instructions are stored. When the executable instructions are executed by a processor, they will cause the processor to execute the method provided by the embodiments of the present application.
  • Here, the computer-readable storage medium may be a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a flash memory, a magnetic surface memory, an optical disc, or a CD-ROM; it may also be one of various devices including one of the above memories or any combination thereof.
  • In some embodiments, the executable instructions may take the form of a program, software, software module, script or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including being deployed as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment.
  • As an example, the executable instructions may, but do not necessarily, correspond to files in a file system, and may be stored as part of a file holding other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple collaborative files (e.g., files that store one or more modules, subroutines, or portions of code).
  • As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

This application provides an audio encoding method, apparatus, device, storage medium and computer program product. The method includes: performing first-level feature extraction on an audio signal to obtain first-level signal features; for the i-th level among N levels, splicing the audio signal and the signal features of the (i-1)-th level to obtain spliced features, and performing i-th-level feature extraction on the spliced features to obtain the signal features of the i-th level, where N and i are integers greater than 1 and i is less than or equal to N; traversing i to obtain the signal features of each of the N levels, the data dimension of the signal features being smaller than the data dimension of the audio signal; and encoding the first-level signal features and the signal features of each of the N levels separately to obtain code streams of the audio signal at each level.

Description

Audio encoding method, apparatus, electronic device, storage medium and program product
Cross-reference to related application
This application is based on, and claims priority to, the Chinese patent application No. 202210677636.4 filed on June 15, 2022, the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of audio processing technology, and in particular to an audio encoding method, an audio decoding method, an apparatus, an electronic device, a storage medium and a computer program product.
Background
Audio coding and decoding technology is a core technology of communication services including remote audio and video calls. Audio coding can be understood as using fewer network bandwidth resources to transmit as much voice information as possible; it is a type of source coding, whose purpose is to compress, at the encoding end, the amount of data of the information the user wants to transmit as much as possible and to remove the redundancy in the information, while allowing it to be recovered losslessly (or nearly losslessly) at the decoding end.
However, in the related art, the efficiency of audio encoding is low when the quality of audio encoding is guaranteed.
Summary
Embodiments of the present application provide an audio encoding method, an apparatus, an electronic device, a computer-readable storage medium and a computer program product, which can improve audio encoding efficiency while ensuring audio encoding quality.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides an audio encoding method, including:
performing first-level feature extraction on an audio signal to obtain signal features of the first level;
for the i-th level among N levels, splicing the audio signal and the signal features of the (i-1)-th level to obtain spliced features, and performing i-th-level feature extraction on the spliced features to obtain the signal features of the i-th level, where N and i are integers greater than 1, and i is less than or equal to N;
traversing i to obtain the signal features of each of the N levels, where the data dimension of the signal features is smaller than the data dimension of the audio signal;
encoding the signal features of the first level and the signal features of each of the N levels separately to obtain code streams of the audio signal at each level.
An embodiment of the present application further provides an audio decoding method, including:
receiving code streams respectively corresponding to multiple levels obtained by encoding an audio signal;
decoding the code streams of each of the levels separately to obtain the signal features of each of the levels, where the data dimension of the signal features is smaller than the data dimension of the audio signal;
performing feature reconstruction on the signal features of each of the levels separately to obtain hierarchical audio signals of each of the levels;
performing audio synthesis on the hierarchical audio signals of the multiple levels to obtain the audio signal.
An embodiment of the present application further provides an audio encoding apparatus, including:
a first feature extraction module, configured to perform first-level feature extraction on an audio signal to obtain signal features of the first level;
a second feature extraction module, configured to, for the i-th level among N levels, splice the audio signal and the signal features of the (i-1)-th level to obtain spliced features, and perform i-th-level feature extraction on the spliced features to obtain the signal features of the i-th level, where N and i are integers greater than 1, and i is less than or equal to N;
a traversal module, configured to traverse i to obtain the signal features of each of the N levels, where the data dimension of the signal features is smaller than the data dimension of the audio signal;
an encoding module, configured to encode the signal features of the first level and the signal features of each of the N levels separately to obtain code streams of the audio signal at each level.
An embodiment of the present application further provides an audio decoding apparatus, including:
a receiving module, configured to receive code streams respectively corresponding to multiple levels obtained by encoding an audio signal;
a decoding module, configured to decode the code streams of each of the levels separately to obtain the signal features of each of the levels, where the data dimension of the signal features is smaller than the data dimension of the audio signal;
a feature reconstruction module, configured to perform feature reconstruction on the signal features of each of the levels separately to obtain hierarchical audio signals of each of the levels;
an audio synthesis module, configured to perform audio synthesis on the hierarchical audio signals of the multiple levels to obtain the audio signal.
An embodiment of the present application further provides an electronic device, including:
a memory, configured to store executable instructions;
a processor, configured to implement the method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
An embodiment of the present application further provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the method provided by the embodiments of the present application.
An embodiment of the present application further provides a computer program product, including a computer program or instructions which, when executed by a processor, implement the method provided by the embodiments of the present application.
The embodiments of the present application have the following beneficial effects:
By coding the audio signal in layers, the signal features of each level are obtained; since the data dimension of the signal features of each level is smaller than the data dimension of the audio signal, the data dimension of the data processed during audio encoding is reduced and the encoding efficiency of the audio signal is improved. When the signal features of the audio signal are extracted hierarchically, the output of each level is used as the input of the next level, so that each level combines the signal features extracted by the previous level to perform more accurate feature extraction on the audio signal; as the number of levels increases, the information loss of the audio signal during feature extraction can be minimized. In this way, the information of the audio signal contained in the multiple code streams obtained by encoding the signal features extracted in this manner is closer to the original audio signal, which reduces the information loss of the audio signal during encoding and ensures the encoding quality of the audio encoding.
Brief description of the drawings
Figure 1 is an architecture schematic diagram of the audio encoding system 100 provided by an embodiment of the present application;
Figure 2 is a structural schematic diagram of the electronic device 500 implementing the audio encoding method provided by an embodiment of the present application;
Figure 3 is a schematic flowchart of the audio encoding method provided by an embodiment of the present application;
Figure 4 is a schematic flowchart of the audio encoding method provided by an embodiment of the present application;
Figure 5 is a schematic flowchart of the audio encoding method provided by an embodiment of the present application;
Figure 6 is a schematic flowchart of the audio encoding method provided by an embodiment of the present application;
Figure 7 is a schematic flowchart of the audio encoding method provided by an embodiment of the present application;
Figure 8 is a schematic flowchart of the audio encoding method provided by an embodiment of the present application;
Figure 9 is a schematic flowchart of the audio encoding method provided by an embodiment of the present application;
Figure 10 is a schematic flowchart of the audio decoding method provided by an embodiment of the present application;
Figure 11 is a schematic flowchart of the audio decoding method provided by an embodiment of the present application;
Figure 12 is a schematic diagram of spectrum comparison under different bit rates provided by an embodiment of the present application;
Figure 13 is a schematic flowchart of audio encoding and audio decoding provided by an embodiment of the present application;
Figure 14 is a schematic diagram of a voice communication link provided by an embodiment of the present application;
Figure 15 is a schematic diagram of a filter bank provided by an embodiment of the present application;
Figure 16A is a schematic diagram of an ordinary convolutional network provided by an embodiment of the present application;
Figure 16B is a schematic diagram of a dilated convolutional network provided by an embodiment of the present application;
Figure 17 is a structural schematic diagram of the first-layer low-frequency analysis neural network model provided by an embodiment of the present application;
Figure 18 is a structural schematic diagram of the second-layer low-frequency analysis neural network model provided by an embodiment of the present application;
Figure 19 is a model schematic diagram of the first-layer low-frequency synthesis neural network model provided by an embodiment of the present application;
Figure 20 is a structural schematic diagram of the second-layer low-frequency synthesis neural network model provided by an embodiment of the present application.
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。
在以下的描述中，所涉及的术语“第一\第二\第三”仅仅是区别类似的对象，不代表针对对象的特定排序，可以理解地，“第一\第二\第三”在允许的情况下可以互换特定的顺序或先后次序，以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
对本申请实施例进行详细说明之前,对本申请实施例中涉及的名词和术语进行说明,本申请实施例中涉及的名词和术语适用于如下的解释。
1)客户端,终端中运行的用于提供各种服务的应用程序,例如即时通讯客户端、音频播放客户端。
2)音频编码(Audio Coding),对包含语音的数字音频信号进行数据压缩的一种应用。
3)正交镜像滤波器组(Quadrature Mirror Filters,QMF)，QMF滤波器组用于将输入信号分解为多个子带信号，从而降低信号带宽，分解后的各路子带信号通过各自的通道进行滤波。
4)量化,指将信号的连续取值(或者大量可能的离散取值)近似为有限多个(或较少的)离散值的过程,包括矢量量化、标量量化等。
5)矢量量化,将若干个标量数据组成一个矢量,把矢量空间划分为若干个小区域,每个小区域寻找一个代表矢量,量化时落入小区域的矢量,使用对应的代表矢量代替,即,被量化为该代表矢量。
6)标量量化,将整个动态范围分为若干个小区间,每个小区间具有一个代表值,在量化时落入小区间的信号值,使用对应的代表值代替,即,将信号值量化为该代表值。
7)熵编码,即编码过程中按熵原理不丢失任何信息的编码,信息熵为信源的平均信息量,常见的熵编码有:香农(Shannon)编码、哈夫曼(Huffman)编码和算术编码(arithmetic coding)。
8)神经网络(NN,Neural Network):是一种模仿动物神经网络行为特征,进行分布式并行信息处理的算法数学模型。这种网络依靠系统的复杂程度,通过调整内部大量节点之间相互连接的关系,从而达到处理信息的目的。
9)深度学习(DL,Deep Learning):是机器学习(ML,Machine Learning)领域中一个新的研究方向,深度学习是学习样本数据的内在规律和表示层次,这些学习过程中获得的信息对诸如文字,图像和声音等数据的解释有很大的帮助。它的最终目标是让机器能够像人一样具有分析学习能力,能够识别文字、图像和声音等数据。
本申请实施例提供一种音频编码方法、音频解码方法、装置、电子设备、计算机可读存储介质及计算机程序产品,能够提高音频编码效率并保证音频编码质量。
下面说明本申请实施例提供的音频编码方法的实施场景。参见图1,图1是本申请实施例提供的音频编码系统100的架构示意图,为实现支撑一个示例性应用,终端(示例性示出了终端400-1和终端400-2)通过网络300连接服务器200,网络300可以是广域网或者局域网,又或者是二者的组合,使用无线或有线链路实现数据传输。其中,终端400-1为音频信号的发送端,终端400-2为音频信号的接收端。
在终端400-1向终端400-2发送音频信号的过程中(如终端400-1和终端400-2基于设置的客户端进行远程通话的过程中),终端400-1,配置为对音频信号进行第一层级的特征提取,得到第一层级的信号特征;针对N个层级中的第i层级,对音频信号和第(i-1)层级的信号特征进行拼接,得到拼接特征,并对拼接特征进行第i层级的特征提取,得到第i层级的信号特征,N和i为大于1的整数,i小于或等于N;对i进行遍历,得到N个层级中每个层级的信号特征,信号特征的数据维度小于音频信号的数据维度;对第一层级的信号特征、以及N个层级中每个层级的信号特征,分别进行编码,得到音频信号在各层级的码流;将音频信号在各层级的码流发送至服务器200;
服务器200,配置为接收终端400-1对音频信号进行编码得到的多个层级分别对应的码流;将多个层级分别对应的码流发送至终端400-2;
终端400-2,配置为接收服务器200发送的对音频信号进行编码得到的多个层级分别对应的码流;分别对各层级的码流进行解码,得到各层级的信号特征,信号特征的数据维度小于音频信号的数据维度;分别对各层级的信号特征进行特征重建,得到各层级的层级音频信号;对多个层级的层级音频信号进行音频合成,得到音频信号。
在一些实施例中,本申请实施例提供的音频编码方法可以由各种电子设备实施,例如,可以由终端单独实施,也可以由服务器单独实施,也可以由终端和服务器协同实施。例如终端独自执行本申请实施例提供的音频编码方法,或者,终端向服务器发送针对音频信号的编码请求,服务器根据接收的编码请求执行本申请实施例提供的音频编码方法。本申请实施例可应用于各种场景,包括但不限于云技术、人工智能、智慧交通、辅助驾驶等。
在一些实施例中,本申请实施例提供的实施音频编码的电子设备可以是各种类型的终端设备或服务器。其中,服务器(例如服务器200)可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统。终端(例如终端400)可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能语音交互设备(例如智能音箱)、智能家电(例如智能电视)、智能手表、车载终端等,但并不局限于此。终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接,本申请实施例对此不做限制。
在一些实施例中，本申请实施例提供的音频编码方法可以借助于云技术（Cloud Technology）实现，云技术是指在广域网或局域网内将硬件、软件、网络等系列资源统一起来，实现数据的计算、存储、处理和共享的一种托管技术。云技术是基于云计算商业模式应用的网络技术、信息技术、整合技术、管理平台技术、以及应用技术等的总称，可以组成资源池，按需所用，灵活便利。云计算技术将变成重要支撑，技术网络系统的后台服务需要大量的计算、存储资源。作为示例，上述服务器（例如服务器200）可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络（Content Delivery Network，CDN）、以及大数据和人工智能平台等基础云计算服务的云服务器。
在一些实施例中,终端或服务器可以通过运行计算机程序来实现本申请实施例提供的音频编码方法,举例来说,计算机程序可以是操作系统中的原生程序或软件模块;可以是本地(Native)应用程序(APP,Application),即需要在操作系统中安装才能运行的程序;也可以是小程序,即只需要下载到浏览器环境中就可以运行的程序;还可以是能够嵌入至任意APP中的小程序。总而言之,上述计算机程序可以是任意形式的应用程序、模块或插件。
在一些实施例中，多个服务器可组成为一区块链，而服务器为区块链上的节点，区块链中的每个节点之间可以存在信息连接，节点之间可以通过上述信息连接进行信息传输。其中，本申请实施例提供的音频编码方法相关的数据（例如音频信号在各层级的码流、用于进行特征提取的神经网络模型等）可保存于区块链上。
下面说明本申请实施例提供的实施音频编码方法的电子设备。参见图2,图2是本申请实施例提供的实施音频编码方法的电子设备500的结构示意图。以电子设备500为图1所示的终端(如终端400-1)为例,本申请实施例提供的实施音频编码方法的电子设备500包括:至少一个处理器510、存储器550、至少一个网络接口520和用户接口530。电子设备500中的各个组件通过总线系统540耦合在一起。可理解,总线系统540用于实现这些组件之间的连接通信。总线系统540除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图2中将各种总线都标为总线系统540。
处理器510可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
存储器550可以是可移除的,不可移除的或其组合。存储器550可选地包括在物理位置上远离处理器510的一个或多个存储设备。存储器550包括易失性存储器或非易失性存储器,也可包括易失性存储器和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器550旨在包括任意适合类型的存储器。
在一些实施例中,存储器550能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。
操作系统551,包括配置为处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务;
网络通信模块552,配置为经由一个或多个(有线或无线)网络接口520到达其他计算设备,示例性的网络接口520包括:蓝牙、无线相容性认证(WiFi)、和通用串行总线(USB,Universal Serial Bus)等;
在一些实施例中,本申请实施例提供的音频编码装置可以采用软件方式实现,图2示出了存储在存储器550中的音频编码装置553,其可以是程序和插件等形式的软件,包括以下软件模块:第一特征提取模块5531、第二特征提取模块5532、遍历模块5533和编码模块5534,这些模块是逻辑上的,因此根据所实现的功能可以进行任意的组合或拆分,将在下文中说明各个模块的功能。
下面说明本申请实施例提供的音频编码方法。在一些实施例中,本申请实施例提供的音频编码方法可以由各种电子设备实施,例如,可以由终端单独实施,也可以由服务器单独实施,也可以由终端和服务器协同实施。以终端实施为例,参见图3,图3是本申请实施例提供的音频编码方法的流程示意图,本申请实施例提供的音频编码方法包括:
步骤101:终端对音频信号进行第一层级的特征提取,得到第一层级的信号特征。
在实际应用中,该音频信号可以是通话(如网络通话、电话)过程中的语音信号、语音消息(如即时通信客户端中发送的语音消息)、所播放的音乐、音频等等。音频信号在传输时需要进行音频信号的编码,从而音频信号的发送端可以对编码得到的码流进行传输,而码流的接收端则可以对接收到的码流进行解码以得到该音频信号。接下来对音频信号的编码过程进行说明。在本申请实施例中,采用分层编码的方式对音频信号进行编码,该分层编码的方式是通过对音频信号进行多个层级的编码实现,下面对每个层级的编码过程进行说明。首先,针对第一层级,终端可对音频信号进行第一层级的特征提取,得到音频信号通过第一层级提取的信号特征,即第一层级的信号特征。
在一些实施例中,音频信号包括低频子带信号和高频子带信号,在对音频信号进行处理(如特征提取、编码)时,可以对音频信号包括的低频子带信号和高频子带信号分别进行处理。基于此,参见图4,图4是本申请实施例提供的音频编码方法的流程示意图,图4示出图3的步骤101可通过步骤201-步骤203实现:步骤201,对音频信号进行子带分解,得到音频信号的低频子带信号和高频子带信号;步骤202,对低频子带信号进行第一层级的特征提取,得到第一层级的低频信号特征,并对高频子带信号进行第一层级的特征提取,得到第一层级的高频信号特征;步骤203,将低频信号特征和高频信号特征,作为第一层级的信号特征。
需要说明的是,在步骤201中,通过第一层级对音频信号进行特征提取的过程中,终端可以首先对音频 信号进行子带分解,得到音频信号的低频子带信号和高频子带信号,从而分别对低频子带信号和高频子带信号进行特征提取。在一些实施例中,参见图5,图5是本申请实施例提供的音频编码方法的流程示意图,图5示出图4的步骤201可通过步骤2011-步骤2013实现:步骤2011,按照第一采样频率对音频信号进行采样,得到采样信号;步骤2012,对采样信号进行低通滤波,得到低通滤波信号,并对低通滤波信号进行下采样,得到第二采样频率的低频子带信号;步骤2013,对采样信号进行高通滤波,得到高通滤波信号,并对高通滤波信号进行下采样,得到第二采样频率的高频子带信号。其中,第二采样频率小于第一采样频率。
在步骤2011中,可以按照第一采样频率对音频信号进行采样,得到采样信号,该第一采样频率可以是预设的。在实际应用中,音频信号为连续的模拟信号,通过采用第一采样频率对音频信号进行采样,得到离散的数字信号,即采样信号,该采样信号包括从音频信号中采样得到的多个样本点(即采样值)。
在步骤2012中，对采样信号进行低通滤波，得到低通滤波信号，并对低通滤波信号进行下采样，得到第二采样频率的低频子带信号。在步骤2013中，对采样信号进行高通滤波，得到高通滤波信号，并对高通滤波信号进行下采样，得到第二采样频率的高频子带信号。在步骤2012和步骤2013中，该低通滤波和高通滤波可以通过QMF分析滤波器实现。在实际实施时，该第二采样频率可以为第一采样频率的二分之一，如此则可以得到采样频率相同的低频子带信号和高频子带信号。
在步骤202中,得到音频信号的低频子带信号和高频子带信号之后,对音频信号的低频子带信号进行第一层级的特征提取,得到第一层级的低频信号特征,并对高频子带信号进行第一层级的特征提取,得到第一层级的高频信号特征。在步骤203中,将低频信号特征和高频信号特征作为第一层级的信号特征。
在一些实施例中,参见图6,图6是本申请实施例提供的音频编码方法的流程示意图,图6示出图3的步骤101还可通过步骤301-步骤304实现:步骤301,对音频信号进行第一卷积处理,得到第一层级的卷积特征;步骤302,对卷积特征进行第一池化处理,得到第一层级的池化特征;步骤303,对池化特征进行第一下采样,得到第一层级的下采样特征;步骤304,对下采样特征进行第二卷积处理,得到第一层级的信号特征。
需要说明的是,在步骤301中,可以对音频信号进行第一卷积处理。在实际应用中,该第一卷积处理可以通过调用预设通道数(如24通道)的因果卷积进行处理,从而得到第一层级的卷积特征。
在步骤302中,对步骤301得到的卷积特征进行第一池化处理。在实际应用中,该第一池化处理可以预先设置池化因子(比如2),进而基于该池化因子对卷积特征进行第一池化处理,得到第一层级的池化特征。
在步骤303中,对步骤302得到的池化特征进行第一下采样。在实际应用中,可以预先设置下采样因子,从而基于该下采样因子进行下采样。该第一下采样可以通过一个编码层实现,也可以通过多个编码层实现。在一些实施例中,第一下采样通过M个级联的编码层实现。相应的,参见图7,图7是本申请实施例提供的音频编码方法的流程示意图,图7示出图6的步骤303还可通过步骤3031-步骤3033实现:步骤3031,通过M个级联的编码层中的第一个编码层,对池化特征进行第一下采样,得到第一个编码层的下采样结果;步骤3032,通过M个级联的编码层中的第j个编码层,对第(j-1)个编码层的下采样结果进行第一下采样,得到第j个编码层的下采样结果;其中,M和j为大于1的整数,j小于或等于M;步骤3033,对j进行遍历,得到第M个编码层的下采样结果,并将第M个编码层的下采样结果,作为第一层级的下采样特征。
需要说明的是,在步骤3031-步骤3033中,每个编码层的下采样因子可以是相同的,也可以是不同的。在实际应用中,下采样因子相当于池化因子,起到降采样的作用。
在步骤304中,可以对下采样特征进行第二卷积处理。在实际应用中,该第二卷积处理可以通过调用预设通道数的因果卷积进行处理,从而得到第一层级的信号特征。
在实际应用中,图6示出的步骤301-步骤304可以通过调用第一神经网络模型实现,第一神经网络模型包括第一卷积层、池化层、下采样层以及第二卷积层。如此,可通过调用第一卷积层对音频信号进行第一卷积处理,得到第一层级的卷积特征;调用池化层对卷积特征进行第一池化处理,得到第一层级的池化特征;调用下采样层对池化特征进行第一下采样,得到第一层级的下采样特征;调用第二卷积层对下采样特征进行第二卷积处理,得到第一层级的信号特征。
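结合图6示出的处理顺序，下面给出一段示意性代码（并非本申请的原始实现；其中的通道数、池化因子、下采样因子以及“以左侧填充近似因果卷积”的做法均为假设，仅用于说明“第一卷积、池化、下采样、第二卷积”的数据流向）：

```python
# 示意性草图（非本申请的原始实现）：通道数、池化因子、下采样因子均为假设值
import torch
import torch.nn as nn

class FirstLevelExtractor(nn.Module):
    def __init__(self, channels=24, feat_dim=64):
        super().__init__()
        self.pad = nn.ConstantPad1d((2, 0), 0.0)            # 左侧填充，近似因果卷积
        self.conv1 = nn.Conv1d(1, channels, kernel_size=3)  # 第一卷积处理
        self.pool = nn.AvgPool1d(2)                         # 第一池化处理（池化因子为2）
        self.down = nn.Conv1d(channels, channels,
                              kernel_size=4, stride=4)      # 第一下采样（单个编码层的情形）
        self.conv2 = nn.Conv1d(channels, feat_dim, 1)       # 第二卷积处理

    def forward(self, x):               # x: (批大小, 1, 样本点数)
        h = torch.relu(self.conv1(self.pad(x)))
        h = self.pool(h)
        h = self.down(h)
        return self.conv2(h)            # 第一层级的信号特征

feat = FirstLevelExtractor()(torch.randn(1, 1, 320))
print(feat.shape)                       # torch.Size([1, 64, 40])
```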
需要说明的是,在对音频信号进行第一层级的特征提取时,也可以通过图6示出的步骤301-步骤304,对音频信号的低频子带信号和高频子带信号分别进行第一层级的特征提取(即图4示出的步骤202)。即,对音频信号的低频子带信号进行第一卷积处理,得到第一层级的第一卷积特征;对第一卷积特征进行第一池化处理,得到第一层级的第一池化特征;对第一池化特征进行第一下采样,得到第一层级的第一下采样特征;对第一下采样特征进行第二卷积处理,得到第一层级的低频信号特征。对音频信号的高频子带信号进行第一卷积处理,得到第一层级的第二卷积特征;对第二卷积特征进行第一池化处理,得到第一层级的第二池化特征;对第二池化特征进行第一下采样,得到第一层级的第二下采样特征;对第二下采样特征进行第二卷积处理,得到第一层级的高频信号特征。
步骤102:针对N个层级中的第i层级,对音频信号和第(i-1)层级的信号特征进行拼接,得到拼接特征,并对拼接特征进行第i层级的特征提取,得到第i层级的信号特征。
其中,N和i为大于1的整数,i小于或等于N。
在针对音频信号进行第一层级的特征提取后，还可以对音频信号进行剩余层级的特征提取。在本申请实施例中，该剩余层级包括N个层级，针对N个层级中的第i层级，对音频信号和第(i-1)层级的信号特征进行拼接，得到拼接特征，并对拼接特征进行第i层级的特征提取，得到第i层级的信号特征。如，针对第二层级，对音频信号和第一层级的信号特征进行拼接，得到拼接特征，并对拼接特征进行第二层级的特征提取，得到第二层级的信号特征；针对第三层级，对音频信号和第二层级的信号特征进行拼接，得到拼接特征，并对拼接特征进行第三层级的特征提取，得到第三层级的信号特征；针对第四层级，对音频信号和第三层级的信号特征进行拼接，得到拼接特征，并对拼接特征进行第四层级的特征提取，得到第四层级的信号特征，等等。
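为便于理解上述逐层级的拼接与特征提取流程，下面给出一段示意性代码（假设 extractors[0] 为第一层级的特征提取网络，extractors[1..N] 为N个层级各自的特征提取网络，且各网络均在张量的最后一维上工作；以上均为假设，并非本申请的原始实现）：

```python
# 示意性草图：逐层级拼接并提取特征，每个层级的输出作为下一层级的输入之一
import torch

def hierarchical_features(x, extractors):
    feats = [extractors[0](x)]                      # 第一层级的信号特征
    for i in range(1, len(extractors)):             # 对 i 进行遍历，对应第 2..N 层级
        concat = torch.cat([x, feats[-1]], dim=-1)  # 音频信号与第(i-1)层级特征的拼接特征
        feats.append(extractors[i](concat))         # 第 i 层级的信号特征
    return feats
```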
在一些实施例中,音频信号包括低频子带信号和高频子带信号,在对音频信号进行处理(如特征提取、编码)时,可以对音频信号包括的低频子带信号和高频子带信号分别进行处理。基于此,针对N个层级中的第i层级,还可以对音频信号进行子带分解,得到音频信号的低频子带信号和高频子带信号。子带分解的过程可参见上述步骤2011-步骤2013。如此,针对N个层级中的第i层级,其执行特征提取输出的数据包括:第i层级的低频信号特征、以及第i层级的高频信号特征。
相应的,参见图8,图8是本申请实施例提供的音频编码方法的流程示意图,图8示出图3的步骤102可通过步骤401-步骤403实现:步骤401,对音频信号的低频子带信号和第(i-1)层级的低频信号特征进行拼接,得到第一拼接特征,并对第一拼接特征进行第i层级的特征提取,得到第i层级的低频信号特征;步骤402,对音频信号的高频子带信号和第(i-1)层级的高频信号特征进行拼接,得到第二拼接特征,并对第二拼接特征进行第i层级的特征提取,得到第i层级的高频信号特征;步骤403,将第i层级的低频信号特征和第i层级的高频信号特征,作为第i层级的信号特征。
需要说明的是,在步骤401中,得到音频信号的低频子带信号和高频子带信号之后,对音频信号的低频子带信号、以及第(i-1)层级提取得到的低频信号特征进行拼接,得到第一拼接特征,然后对第一拼接特征进行第i层级的特征提取,得到第i层级的低频信号特征。同样的,在步骤402中,对音频信号的高频子带信号、以及第(i-1)层级提取得到的高频信号特征进行拼接,得到第二拼接特征,然后对第二拼接特征进行第i层级的特征提取,得到第i层级的高频信号特征。如此,在步骤403中,将第i层级的低频信号特征和第i层级的高频信号特征,作为第i层级的信号特征。
在一些实施例中,参见图9,图9是本申请实施例提供的音频编码方法的流程示意图,图9示出图3的步骤102还可通过步骤501-步骤504实现:步骤501,对拼接特征进行第三卷积处理,得到第i层级的卷积特征;步骤502,对卷积特征进行第二池化处理,得到第i层级的池化特征;步骤503,对池化特征进行第二下采样,得到第i层级的下采样特征;步骤504,对下采样特征进行第四卷积处理,得到第i层级的信号特征。
需要说明的是,在步骤501中,可以对拼接特征(由音频信号和第(i-1)层级的信号特征拼接得到的)进行第三卷积处理。在实际应用中,该第三卷积处理可以通过调用预设通道数的因果卷积进行处理,从而得到第i层级的卷积特征。
在步骤502中,对步骤501得到的卷积特征进行第二池化处理。在实际应用中,该第二池化处理可以预先设置池化因子,进而基于该池化因子对卷积特征进行第二池化处理,得到第i层级的池化特征。
在步骤503中,对步骤502得到的池化特征进行第二下采样。在实际应用中,可以预先设置下采样因子,从而基于该下采样因子进行下采样。该第二下采样可以通过一个编码层实现,也可以通过多个编码层实现。在一些实施例中,第二下采样可通过X个级联的编码层实现。相应的,图9的步骤503还可通过步骤5031-步骤5033实现:步骤5031,通过X个级联的编码层中的第一个编码层,对池化特征进行第二下采样,得到第一个编码层的下采样结果;步骤5032,通过X个级联的编码层中的第g个编码层,对第(g-1)个编码层的下采样结果进行第二下采样,得到第g个编码层的下采样结果;其中,X和g为大于1的整数,g小于或等于X;步骤5033,对g进行遍历,得到第X个编码层的下采样结果,并将第X个编码层的下采样结果,作为第i层级的下采样特征。
需要说明的是,在步骤5031-步骤5033中,每个编码层的下采样因子可以是相同的,也可以是不同的。在实际应用中,下采样因子相当于池化因子,起到降采样的作用。
在步骤504中,可以对下采样特征进行第四卷积处理。在实际应用中,该第四卷积处理可以通过调用预设通道数的因果卷积进行处理,从而得到第i层级的信号特征。
在实际应用中，图9示出的步骤501-步骤504可以通过调用第二神经网络模型实现，第二神经网络模型包括第三卷积层、池化层、下采样层以及第四卷积层。如此，可通过调用第三卷积层对拼接特征进行第三卷积处理，得到第i层级的卷积特征；调用池化层对卷积特征进行第二池化处理，得到第i层级的池化特征；调用下采样层对池化特征进行第二下采样，得到第i层级的下采样特征；调用第四卷积层对下采样特征进行第四卷积处理，得到第i层级的信号特征。在实际实施时，第二神经网络输出的信号特征的特征维度，可以少于第一神经网络输出的信号特征的特征维度。
需要说明的是，在进行第i层级的特征提取时，也可以通过图9示出的步骤501-步骤504，对音频信号的低频子带信号和高频子带信号分别进行第i层级的特征提取。即，针对第i层级，对低频拼接特征（由低频子带信号和第(i-1)层级的低频信号特征拼接得到的）进行第三卷积处理，得到第i层级的卷积特征；对卷积特征进行第二池化处理，得到第i层级的池化特征；对池化特征进行第二下采样，得到第i层级的下采样特征；对下采样特征进行第四卷积处理，得到第i层级的低频信号特征。针对第i层级，对高频拼接特征（由高频子带信号和第(i-1)层级的高频信号特征拼接得到的）进行第三卷积处理，得到第i层级的卷积特征；对卷积特征进行第二池化处理，得到第i层级的池化特征；对池化特征进行第二下采样，得到第i层级的下采样特征；对下采样特征进行第四卷积处理，得到第i层级的高频信号特征。
步骤103:对i进行遍历,得到N个层级中每个层级的信号特征。
其中,信号特征的数据维度小于音频信号的数据维度。
在步骤102中说明了针对第i层级的特征提取过程,在实际应用中,需要对i进行遍历,以得到N个层级中每个层级的信号特征。在本申请实施例中,每个层级输出的信号特征的数据维度小于音频信号的数据维度,如此,能够降低音频编码过程中所涉及数据的数据维度,提高音频编码的编码效率。
步骤104:对第一层级的信号特征、以及N个层级中每个层级的信号特征,分别进行编码,得到音频信号在各层级的码流。
在实际应用中,在得到的第一层级的信号特征、以及N个层级中每个层级的信号特征之后,则可以对第一层级的信号特征、以及N个层级中每个层级的信号特征,分别进行编码,从而得到音频信号在各层级的码流。该码流可以被传输至音频信号的接收端,从而使得接收端作为解码端对音频信号进行解码。
需要说明的是,该N个层级中的第i层级输出的信号特征,可以理解为第(i-1)层级输出的信号特征和原始的音频信号之间的残差信号特征,如此,所提取的音频信号的信号特征,既包含了第一层级提取到的音频信号的信号特征,还包括了该N个层级中每个层级提取到的残差信号特征,使得所提取的音频信号的信号特征更加全面和精确,减少音频信号在特征提取过程中的信息损失,从而在对第一层级的信号特征、以及N个层级中每个层级的信号特征分别进行编码时,使得编码得到的码流质量更高,所包含的音频信号的信息更加接近于原始的音频信号,提高音频编码的编码质量。
在一些实施例中,图3示出的步骤104可通过步骤104a1-步骤104a2实现:步骤104a1,对第一层级的信号特征、以及N个层级中每个层级的信号特征,分别进行量化处理,得到各层级的信号特征的量化结果;步骤104a2,对各层级的信号特征的量化结果进行熵编码,得到音频信号在各层级的码流。
需要说明的是,在步骤104a1中,可以预先设置量化表,该量化表包括信号特征和量化值之间的对应关系。在进行量化处理时,可以通过查询预设的量化表,针对第一层级的信号特征、以及N个层级中每个层级的信号特征,分别查询到相应的量化值,从而将查询得到的量化值作为量化结果。在步骤104a2中,对各层级的信号特征的量化结果分别进行熵编码,得到音频信号在各层级的码流。
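下面给出步骤104a1-步骤104a2的一段示意性代码（其中的量化表为假设的均匀量化表，熵编码仅以香农熵估计理论码长作示意，实际可替换为哈夫曼编码或算术编码等）：

```python
# 示意性草图：标量量化（查量化表取最近代表值）+ 熵编码的理论码长估计
import numpy as np

def scalar_quantize(feat, table):
    idx = np.abs(feat[:, None] - table[None, :]).argmin(axis=1)  # 逐分量查最近代表值
    return idx, table[idx]                                       # 量化索引与量化结果

feat = np.random.randn(64)                      # 某层级的 64 维信号特征
table = np.linspace(-4.0, 4.0, 32)              # 假设的 32 级量化表
idx, q = scalar_quantize(feat, table)

# 对量化索引做熵编码即可得到该层级的码流；这里按经验分布估计理论码长
p = np.bincount(idx, minlength=len(table)) / len(idx)
entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()
print(f"每个分量的平均理论码长约 {entropy:.2f} bit")
```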
在实际应用中,音频信号包括低频子带信号和高频子带信号,那么相应的,每个层级输出的信号特征则包括低频信号特征和高频信号特征。基于此,当信号特征包括低频信号特征和高频信号特征时,在一些实施例中,图3示出的步骤104还可通过步骤104b1-步骤104b3实现:步骤104b1,对第一层级的低频信号特征、以及N个层级中每个层级的低频信号特征,分别进行编码,得到音频信号在各层级的低频码流;步骤104b2,对第一层级的高频信号特征、以及N个层级中每个层级的高频信号特征,分别进行编码,得到音频信号在各层级的高频码流;步骤104b3,将音频信号在各层级的低频码流以及高频码流,作为音频信号在相应层级的码流。
需要说明的是,在步骤104b1中的低频信号特征的编码过程也可采用与步骤104a1-步骤104a2类似的步骤实现,即,对第一层级的低频信号特征、以及N个层级中每个层级的低频信号特征,分别进行量化处理,得到各层级的低频信号特征的量化结果;对各层级的低频信号特征的量化结果进行熵编码,得到音频信号在各层级的低频码流。在步骤104b2中的高频信号特征的编码过程也可采用与步骤104a1-步骤104a2类似的步骤实现,即,对第一层级的高频信号特征、以及N个层级中每个层级的高频信号特征,分别进行量化处理,得到各层级的高频信号特征的量化结果;对各层级的高频信号特征的量化结果进行熵编码,得到音频信号在各层级的高频码流。
在实际应用中,音频信号包括低频子带信号和高频子带信号,那么相应的,每个层级输出的信号特征则包括低频信号特征和高频信号特征。基于此,当信号特征包括低频信号特征和高频信号特征时,在一些实施例中,图3示出的步骤104还可通过步骤104c1-步骤104c3实现:步骤104c1,按照第一编码码率,对第一层级的低频信号特征进行编码,得到第一层级的第一码流,并按照第二编码码率,对第一层级的高频信号特征进行编码,得到第一层级的第二码流;步骤104c2,针对N个层级中每个层级的信号特征,分别执行如下处理:按照层级的第三编码码率,对层级的信号特征分别进行编码,得到各层级的第二码流;步骤104c3,将第一层级的第二码流、以及N个层级中每个层级的第二码流,作为音频信号在各层级的码流。
需要说明的是，第一编码码率大于第二编码码率，第二编码码率大于N个层级中任一层级的第三编码码率，层级的编码码率与相应层级的码流的解码质量指标正相关。在步骤104c2中，可以针对N个层级中每个层级，分别设置相应的第三编码码率。该N个层级中每个层级的第三编码码率可以是相同的，也可以是部分相同而部分不同，还可以是完全不相同。这里，层级的编码码率与相应层级的码流的解码质量指标为正相关关系，即编码码率越高，其得到的码流的解码质量指标（的值）越高，而由第一层级的低频信号特征所包含的音频信号的特征最多，因此，第一层级的低频信号特征所采用的第一编码码率最大，以保证音频信号的编码效果；同时针对第一层级的高频信号特征，采用低于第一编码码率的第二编码码率进行编码，以及针对N个层级中每个层级的信号特征，采用低于第二编码码率的第三编码码率进行编码，在增加音频信号的更多特征（包括高频信号特征、残差信号特征）的同时，通过合理分配每个层级的编码码率，提高了音频信号的编码效率。
在一些实施例中,终端在得到音频信号在各层级的码流之后,还可针对各层级,分别执行如下处理:对音频信号在层级的码流配置相应的层级传输优先级;其中,层级传输优先级与层级的层级数负相关,层级传输优先级与相应层级的码流的解码质量指标正相关。
需要说明的是,该层级的层级传输优先级,用于表征该层级的码流的传输优先级。层级传输优先级与层级的层级数负相关,即层级数越大,其对应的层级传输优先级越低,如第一层级(层级数为1)的层级传输优先级,高于第二层级(层级数为2)的层级传输优先级。基于此,在将各层级的码流传输至解码端时,可以按照配置的层级传输优先级,来传输相应层级的码流。在实际应用中,将音频信号在多个层级的码流传输至解码端时,可以传输部分层级的码流,也可以传输全部层级的码流,当传输部分层级的码流时,则可以按照配置的层级传输优先级,来传输相应层级的码流。
在一些实施例中,信号特征包括低频信号特征和高频信号特征,音频信号在各层级的码流包括:基于低频信号特征编码得到的低频码流、以及基于高频信号特征编码得到的高频码流;终端在得到音频信号在各层级的码流之后,还可针对各层级,分别执行如下处理:为层级的低频码流配置第一传输优先级,并为层级的高频码流配置第二传输优先级;其中,第一传输优先级高于第二传输优先级,第(i-1)层级的第二传输优先级低于第i层级的第一传输优先级,码流的传输优先级与相应码流的解码质量指标正相关。
需要说明的是，由于码流的传输优先级与相应码流的解码质量指标正相关，而由于高频码流的数据维度小于低频码流的数据维度，因此，每个层级的低频码流所包含的音频信号的原始信息多于高频码流所包含的音频信号的原始信息，也就是说，为保证低频码流的解码质量高于高频码流的解码质量，可以针对每个层级，为层级的低频码流配置第一传输优先级，并为层级的高频码流配置第二传输优先级，该第一传输优先级高于第二传输优先级。同时，还可以配置第(i-1)层级的第二传输优先级低于第i层级的第一传输优先级，也就是说，针对每个层级，低频码流的传输优先级高于高频码流的传输优先级，如此，保证每个层级的低频码流可以优先传输；针对多个层级来说，第i层级的低频码流的传输优先级，高于第(i-1)层级的高频码流的传输优先级，如此，保证多个层级的所有低频码流可以优先传输。
应用本申请上述实施例,实现了对音频信号的分层编码:首先,对音频信号进行第一层级的特征提取,得到第一层级的信号特征;然后,针对N(N为大于1的整数)个层级中的第i(i为大于1的整数,i小于或等于N)层级,对音频信号和第(i-1)层级的信号特征进行拼接,得到拼接特征,并对拼接特征进行第i层级的特征提取,得到第i层级的信号特征;再通过对i进行遍历,得到N个层级中每个层级的信号特征;最后,对第一层级的信号特征以及N个层级中每个层级的信号特征,分别进行编码,得到音频信号在各层级的码流。
通过对音频信号分层进行编码得到各层级的信号特征,由于各层级的信号特征的数据维度小于音频信号的数据维度。如此,降低了音频编码过程中所处理数据的数据维度,提高了音频信号的编码效率;
分层提取音频信号的信号特征时,每个层级的输出均作为下一层级的输入,使得每个层级均结合上一层级提取的信号特征,对音频信号进行更精确的特征提取,随着层级数量的增加,可以使音频信号在特征提取过程中的信息损失降到最低。如此,通过对该方式提取的信号特征进行编码所得到的多个码流,其包含的音频信号的信息更加接近于原始的音频信号,减少了音频信号在编码过程中的信息损失,保证了音频编码的编码质量。
下面说明本申请实施例提供的音频解码方法。在一些实施例中,本申请实施例提供的音频解码方法可以由各种电子设备实施,例如,可以由终端单独实施,也可以由服务器单独实施,也可以由终端和服务器协同实施。以终端实施为例,参见图10,图10是本申请实施例提供的音频解码方法的流程示意图,本申请实施例提供的音频解码方法包括:
步骤601:终端接收对音频信号进行编码得到的多个层级分别对应的码流。
这里,终端作为解码端,接收到对音频信号进行编码得到的多个层级分别对应的码流。
步骤602:分别对各层级的码流进行解码,得到各层级的信号特征。
其中,信号特征的数据维度小于音频信号的数据维度。
在一些实施例中，终端可通过如下方式分别对各层级的码流进行解码，得到各层级的信号特征：针对各层级，分别执行如下处理：对层级的码流进行熵解码，得到码流的量化值；对码流的量化值进行逆量化处理，得到层级的信号特征。
在实际应用中,针对各层级的码流,可以分别执行如下处理:对该层级的码流进行熵解码,得到码流的量化值;然后基于对音频信号进行编码得到该码流的过程中所采用的量化表,对码流的量化值进行逆量化处理,即通过量化表,查询码流的量化值所对应的信号特征,从而得到该层级的信号特征。
在实际应用中,该接收的各层级的码流可以包括低频码流和高频码流,其中,低频码流是基于音频信号的低频信号特征编码得到的,高频码流是基于音频信号的高频信号特征编码得到的。如此,在对各层级的码流进行解码时,可以是对各层级的低频码流和高频码流分别进行解码。其中,高频码流和低频码流的解码过程,和码流的解码过程类似,即,针对各层级的低频码流,分别执行如下处理:对该层级的低频码流进行熵解码,得到低频码流的量化值;对低频码流的量化值进行逆量化处理,得到该层级的低频信号特征。针对各层级的高频码流,分别执行如下处理:对该层级的高频码流进行熵解码,得到高频码流的量化值;对高频码流的量化值进行逆量化处理,得到该层级的高频信号特征。
步骤603:分别对各层级的信号特征进行特征重建,得到各层级的层级音频信号。
在实际应用中，在解码得到各层级的信号特征之后，分别对各层级的信号特征进行特征重建，得到各层级的层级音频信号。在一些实施例中，终端可通过如下方式分别对各层级的信号特征进行特征重建，得到各层级的层级音频信号：针对各层级的信号特征，分别执行如下处理：对信号特征进行第一卷积处理，得到层级的卷积特征；对卷积特征进行上采样，得到层级的上采样特征；对上采样特征进行池化处理，得到层级的池化特征；对池化特征进行第二卷积处理，得到层级的层级音频信号。
在实际应用中，针对各层级的信号特征，分别执行如下处理：首先，对信号特征进行第一卷积处理，该第一卷积处理可以通过调用预设通道数的因果卷积进行处理，从而得到该层级的卷积特征。然后，对卷积特征进行上采样，可以预先设置上采样因子，从而基于该上采样因子进行上采样，得到该层级的上采样特征。接着，对上采样特征进行池化处理，该池化处理可以预先设置池化因子，进而基于该池化因子对上采样特征进行池化处理，得到该层级的池化特征。最后，对池化特征进行第二卷积处理，该第二卷积处理可以通过调用预设通道数的因果卷积进行处理，从而得到该层级的层级音频信号。
该上采样可以通过一个解码层实现，也可以通过多个解码层实现。当上采样通过L(L>1)个级联的解码层实现时，终端可通过如下方式对卷积特征进行上采样，得到层级的上采样特征：通过L个级联的解码层中的第一个解码层，对卷积特征进行上采样，得到第一个解码层的上采样结果；通过L个级联的解码层中的第k个解码层，对第(k-1)个解码层的上采样结果进行上采样，得到第k个解码层的上采样结果；其中，L和k为大于1的整数，k小于或等于L；对k进行遍历，得到第L个解码层的上采样结果，并将第L个解码层的上采样结果，作为层级的上采样特征。
需要说明的是,每个解码层的上采样因子可以是相同的,也可以是不同的。
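下面给出特征重建流程的一段示意性代码（并非本申请的原始实现；特征维度、通道数与上采样因子均为假设值，上采样以单个解码层的转置卷积示意）：

```python
# 示意性草图：按“第一卷积、上采样、池化、第二卷积”重建层级音频信号
import torch
import torch.nn as nn

class LevelReconstructor(nn.Module):
    def __init__(self, feat_dim=64, channels=24, up_factor=8):
        super().__init__()
        self.conv1 = nn.Conv1d(feat_dim, channels, 1)        # 第一卷积处理
        self.up = nn.ConvTranspose1d(channels, channels,
                                     kernel_size=up_factor,
                                     stride=up_factor)       # 上采样（单个解码层）
        self.pool = nn.AvgPool1d(3, stride=1, padding=1)     # 池化处理（长度不变）
        self.conv2 = nn.Conv1d(channels, 1, 1)               # 第二卷积处理

    def forward(self, f):               # f: (批大小, feat_dim, 帧数)
        h = torch.relu(self.conv1(f))
        h = self.up(h)
        h = self.pool(h)
        return self.conv2(h)            # 该层级的层级音频信号

y = LevelReconstructor()(torch.randn(1, 64, 40))
print(y.shape)                          # torch.Size([1, 1, 320])
```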
步骤604:对多个层级的层级音频信号进行音频合成,得到音频信号。
在实际应用中,得到各层级的层级音频信号之后,对多个层级的层级音频信号进行音频合成,得到音频信号。
在一些实施例中，码流包括低频码流和高频码流，图10示出的步骤602可通过如下步骤实现：分别对各层级的低频码流进行解码，得到各层级的低频信号特征，并分别对各层级的高频码流进行解码，得到各层级的高频信号特征；相应的，图10示出的步骤603可通过如下步骤实现：步骤6031，分别对各层级的低频信号特征进行特征重建，得到各层级的层级低频子带信号，并分别对各层级的高频信号特征进行特征重建，得到各层级的层级高频子带信号；步骤6032，将层级低频子带信号和层级高频子带信号，作为层级的层级音频信号；相应的，图10示出的步骤604可通过如下步骤实现：步骤6041，将多个层级的层级低频子带信号进行相加，得到低频子带信号，并将多个层级的层级高频子带信号进行相加，得到高频子带信号；步骤6042，对低频子带信号和高频子带信号进行合成，得到音频信号。
在一些实施例中，步骤6042可通过如下步骤实现：步骤60421，对低频子带信号进行上采样，得到低通滤波信号；步骤60422，对高频子带信号进行上采样，得到高频滤波信号；步骤60423，对低通滤波信号和高频滤波信号进行滤波合成，得到音频信号。需要说明的是，在步骤60423中，可以通过QMF合成滤波器进行合成处理，得到音频信号。
基于此,当码流包括低频码流和高频码流时,参见图11,图11是本申请实施例提供的音频解码方法的流程示意图,本申请实施例提供的音频解码方法包括:步骤701,接收对音频信号进行编码得到的多个层级分别对应的低频码流和高频码流;步骤702a,分别对各层级的低频码流进行解码,得到各层级的低频信号特征;步骤702b,分别对各层级的高频码流进行解码,得到各层级的高频信号特征;步骤703a,分别对各层级的低频信号特征进行特征重建,得到各层级的层级低频子带信号;步骤703b,分别对各层级的高频信号特征进行特征重建,得到各层级的层级高频子带信号;步骤704a,将多个层级的层级低频子带信号进行相加,得到低频子带信号;步骤704b,将多个层级的层级高频子带信号进行相加,得到高频子带信号;步骤705a,对低频子带信号进行上采样,得到低通滤波信号;步骤705b,对高频子带信号进行上采样,得到高频滤波信号;步骤706,对低通滤波信号和高频滤波信号进行滤波合成,得到音频信号。
需要说明的是,高频信号特征以及低频信号特征的特征重建过程,可以参照步骤603中的信号特征的特征重建过程。即,针对各层级的高频信号特征,分别执行如下处理:对高频信号特征进行第一卷积处理,得到层级的高频卷积特征;对高频卷积特征进行上采样,得到层级的高频上采样特征;对高频上采样特征进行池化处理,得到层级的高频池化特征;对高频池化特征进行第二卷积处理,得到层级的高频层级音频信号。 针对各层级的低频信号特征,分别执行如下处理:对低频信号特征进行第一卷积处理,得到层级的低频卷积特征;对低频卷积特征进行上采样,得到层级的低频上采样特征;对低频上采样特征进行池化处理,得到层级的低频池化特征;对低频池化特征进行第二卷积处理,得到层级的低频层级音频信号。
应用本申请上述实施例,对多个层级的码流分别进行解码,得到各层级的信号特征,并分别对各层级的信号特征进行特征重建,得到各层级的层级音频信号,对多个层级的层级音频信号进行音频合成,得到音频信号。由于码流中的信号特征的数据维度小于音频信号的数据维度,相较于相关技术中对原始的音频信号直接进行编码得到的码流的数据维度更小,减少了音频解码过程中所处理数据的数据维度,提高了音频信号的解码效率。
下面将说明本申请实施例在一个实际的应用场景中的示例性应用。
音频编解码技术，就是使用较少的网络带宽资源去尽量多的传递语音信息。音频编解码器的压缩率可以达到10倍以上，也就是原本10MB的语音数据经过编码器的压缩只需要1MB来传输，大大降低了传递信息所需消耗的带宽资源。在通信系统中，为了保证通信的顺利进行，行业内部署了标准的语音编解码协议，例如来自国际电信联盟电信标准分局（ITU-T，ITU Telecommunication Standardization Sector）、第三代合作伙伴计划（3GPP，3rd Generation Partnership Project）、国际互联网工程任务组（IETF，The Internet Engineering Task Force）、音视频编码标准（AVS，Audio Video coding Standard）、中国通信标准化协会（CCSA，China Communications Standards Association）等国际国内标准组织制定的G.711、G.722、AMR系列、EVS、OPUS等标准。图12给出一个不同码率下的频谱比较示意图，以示范压缩码率与质量的关系。曲线1201为原始语音的频谱曲线，即没有压缩的信号；曲线1202为OPUS编码器在20kbps码率下的频谱曲线；曲线1203为OPUS编码器在6kbps码率下的频谱曲线。由图12可知，随着编码码率提升，压缩后的信号更为接近原始信号。
传统音频编码可以分成时域编码和频域编码两类，均为基于信号处理的压缩方法。其中，1)时域编码，比如波形编码(waveform speech coding)，直接对语音信号的波形进行编码，这种编码方式的优点是编码语音质量高，但是编码效率不高。特别地，对于语音信号还可以使用参数编码，编码端要做的就是提取想要传递的语音信号的对应参数；参数编码的优点是编码效率极高，但是恢复语音的质量很低。2)频域编码，就是将音频信号变换到频域，提取频域系数，然后，将频域系数进行编码，但是其编码效率也不理想。如此，基于信号处理的压缩方法并不能在保证编码质量的情况下，提高编码效率。
基于此，本申请实施例提供一种音频编码方法以及音频解码方法，以在提高编码效率的同时，保证编码质量。在本申请实施例中，即使是在低码率区间，也具有根据编码内容、网络带宽情况选择不同编码方式的自由度；且能够在复杂度和编码质量可接受的情况下，提升编码效率。参见图13，图13是本申请实施例提供的音频编码和音频解码的流程示意图。这里，以层级的数量为两层为例（本申请不限制第三层或者更高层级的迭代操作），本申请实施例提供的音频编码方法包括：
(1)对音频信号进行子带分解，得到低频子带信号和高频子带信号。在实际实施时，可以按照第一采样频率对音频信号进行采样，得到采样信号，然后对采样信号进行子带分解，得到采样频率低于第一采样频率的子带信号，包括低频子带信号和高频子带信号。例如，对于第n帧的音频信号x(n)，使用分析滤波器（如QMF滤波器）分解为低频子带信号xLB(n)和高频子带信号xHB(n)。
(2)基于第一层低频分析神经网络对低频子带信号进行分析,得到第一层低频信号特征。例如,对于低频子带信号xLB(n),调用第一层低频分析神经网络,获得低维度的第一层低频信号特征FLB(n)。需要说明的是,信号特征的维度小于低频子带信号的维度(以减少数据量),神经网络包括但不限于Dilated CNN,Autoencoder,Full-connection,LSTM,CNN+LSTM等。
(3)基于第一层高频分析神经网络对高频子带信号进行分析,得到第一层高频信号特征。例如,对于高频子带信号xHB(n),调用第一层高频分析神经网络,获得低维度的第一层高频信号特征FHB(n)。
(4)基于第二层低频分析神经网络,对低频子带信号和第一层低频信号特征进行分析,得到第二层低频信号特征(即第二层低频残差信号特征)。例如,联合xLB(n)和FLB(n),调用第二层低频分析神经网络,获得低维度的第二层低频信号特征FLB,e(n)。
(5)基于第二层高频分析神经网络,对高频子带信号和第一层高频信号特征进行分析,得到第二层高频信号特征(即第二层高频残差信号特征)。例如,联合xHB(n)和FHB(n),调用第二层高频分析神经网络,获得低维度的第二层高频信号特征FHB,e(n)。
(6)通过量化编码部分,对两层信号特征(包括第一层低频信号特征、第一层高频信号特征、第二层低频信号特征以及第二层高频信号特征)进行量化和编码,得到音频信号在每层的码流;并为每层的码流配置相应的传输优先级,如,第一层以更高优先级进行传输,第二层次之,以此类推。
在实际应用中,解码端可能仅接收到一层的码流,如图13所示,可以采用“一层解码”的方式进行解码。 基于此,本申请实施例提供的音频解码方法包括:(1)对接收到的一层码流进行解码,得到该层的低频信号特征以及高频信号特征;(2)基于第一层低频合成神经网络,对低频信号特征进行分析,得到低频子带信号估计值。例如,基于低频信号特征的量化值F′LB(n),调用第一层低频合成神经网络,生成低频子带信号估计值x′LB(n);(3)基于第一层高频合成神经网络,对高频信号特征进行分析,得到高频子带信号估计值。例如,基于高频信号特征的量化值F′HB(n),调用第一层高频合成神经网络,生成高频子带信号估计值x′HB(n)。(4)基于低频子带信号估计值x′LB(n)和高频子带信号估计值x′HB(n),通过合成滤波器进行合成滤波,得到最终重建的原采样频率下的音频信号x′(n),以完成解码过程。
在实际应用中,解码端可能针对两层的码流均接收到,如图13所示,可以采用“二层解码”的方式进行解码。基于此,本申请实施例提供的音频解码方法包括:
(1)对接收到的各层的码流进行解码,得到各层的低频信号特征以及高频信号特征。
(2)基于第一层低频合成神经网络,对第一层低频信号特征进行分析,得到第一层低频子带信号估计值。例如,基于第一层低频信号特征的量化值F′LB(n),调用第一层低频合成神经网络,生成第一层低频子带信号估计值x′LB(n)。
(3)基于第一层高频合成神经网络,对第一层高频信号特征进行分析,得到第一层高频子带信号估计值。例如,基于第一层高频信号特征的量化值F′HB(n),调用第一层高频合成神经网络,生成第一层高频子带信号估计值x′HB(n)。
(4)基于第二层低频合成神经网络,对第二层低频信号特征进行分析,得到第二层低频子带残差信号估计值。例如,基于第二层低频信号特征的量化值F′LB,e(n),调用第二层低频合成神经网络,生成低频子带残差信号估计值x′LB,e(n)。
(5)基于第二层高频合成神经网络,对第二层高频信号特征进行分析,得到第二层高频子带残差信号估计值。例如,基于第二层高频信号特征的量化值F′HB,e(n),调用第二层高频合成神经网络,生成高频子带残差信号估计值x′HB,e(n)。
(6)通过低频部分,将第一层低频子带信号估计值与低频子带残差信号估计值进行求和,得到低频子带信号估计值。例如,将x′LB(n)与x′LB,e(n)求和,获得低频子带信号估计值。
(7)通过高频部分,将第一层高频子带信号估计值与高频子带残差信号估计值进行求和,得到高频子带信号估计值。例如,将x′HB(n)与x′HB,e(n)求和,获得高质量高频子带信号估计值。
(8)基于低频子带信号估计值和高频子带信号估计值,通过合成滤波器进行合成滤波,得到最终重建的原采样频率下的音频信号x′(n),以完成解码过程。
本申请实施例可以应用于各种音频场景,例如远程语音通信。以远程语音通信为例,参见图14,图14是本申请实施例提供的语音通信链路的示意图。这里,以基于网际互连协议的语音传输(VoIP,Voice over Internet Protocol)会议系统为例,将本申请实施例涉及的语音编解码技术部署在编码和解码部分,以解决语音压缩的基本功能。编码器部署在上行客户端1401,解码器部署在下行客户端1402,通过上行客户端采集语音,并进行前处理增强、编码等处理,将编码得到的码流通过网络传输至下行客户端1402,通过下行客户端1402进行解码、增强等处理,以在下行客户端1402播放解码出的语音。
考虑前向兼容（即新的编码器与已有的编码器兼容），需要在系统的后台（即服务器）部署转码器，以解决新的编码器与已有的编码器互联互通问题。例如，发送端（上行客户端）是新的NN编码器，而接收端（下行客户端）是公用电话交换网（PSTN，Public Switched Telephone Network）的解码器（例如G.722解码器），此时，服务器在接收到发送端发送的码流之后，首先需要执行NN解码器生成语音信号，然后调用G.722编码器生成特定码流，才能让接收端正确解码。类似的转码场景不再展开。
下面在详细介绍本申请实施例提供的音频编码方法以及音频解码方法之前,先对QMF滤波器组以及空洞卷积网络进行介绍。
QMF滤波器组是一个包含分析-合成的滤波器对。对于QMF分析滤波器，可以将输入的采样率为Fs的信号分解成两路采样率为Fs/2的信号，分别表示QMF低通信号和QMF高通信号。图15示出了QMF滤波器的低通部分(H_Low(z))和高通部分(H_High(z))的频谱响应。基于QMF分析滤波器组的相关理论知识，可以容易地描述上述低通滤波和高通滤波的系数之间的相关性，如公式(1)所示：
hHigh(k) = (-1)^k·hLow(k)   (1)
其中,hLow(k)表示低通滤波的系数,hHigh(k)表示高通滤波的系数。
类似地,根据QMF相关理论,可以基于QMF分析滤波器组H_Low(z)和H_High(z),描述QMF合成滤波器组,如公式(2)所示。
GLow(z)=HLow(z)
GHigh(z)=(-1)*HHigh(z)    (2)
其中，GLow(z)表示用于恢复低通信号的合成滤波器，GHigh(z)表示用于恢复高通信号的合成滤波器。
解码端恢复出的低通和高通信号,经过QMF合成滤波器组进行合成处理,即可以恢复出输入信号对应的采样率Fs的重建信号。
参见图16A和图16B,图16A是本申请实施例提供的普通卷积网络的示意图,图16B是本申请实施例提供的空洞卷积网络的示意图。相对普通卷积网络,空洞卷积能够增加感受野的同时保持特征图的尺寸不变,还可以避免因为上采样、下采样引起的误差。虽然图16A和图16B中示出的卷积核大小(Kernel Size)均为3×3;但是,图16A所示的普通卷积的感受野901只有3,而图16B所示的空洞卷积的感受野902达到了5。也就是说,对于尺寸为3×3的卷积核,图16A所示的普通卷积的感受野为3、扩张率(Dilation Rate)(卷积核中的点的间隔数量)为1;而图16B所示的空洞卷积的感受野为5、扩张率为2。
卷积核还可以在类似图16A或者图16B的平面上进行移动，这里涉及移位率（Stride Rate，即步长）的概念。比如，每次卷积核移位1格，则对应的移位率为1。此外，还有卷积通道数的概念，就是用多少个卷积核对应的参数去进行卷积分析。理论上，通道数越多，对信号的分析越为全面，精度越高；但是，通道数越多，复杂度也越高。比如，一个1×320的张量，使用24通道卷积运算，输出就是24×320的张量。需要说明的是，可以根据实际应用需要，自行定义空洞卷积核大小（例如：针对语音信号，卷积核的大小可以设置为1×3）、扩张率、移位率和通道数，本申请实施例对此不作限定。
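下面给出一段示意性代码，核对上述感受野与张量尺寸的计算（感受野公式 dilation*(kernel_size-1)+1 为通用结论，其余数值取自正文）：

```python
# 示意性草图：核对感受野与24通道卷积的输出张量形状
import torch
import torch.nn as nn

def receptive_field(kernel_size, dilation):
    return dilation * (kernel_size - 1) + 1

assert receptive_field(3, 1) == 3        # 普通卷积，扩张率为1
assert receptive_field(3, 2) == 5        # 空洞卷积，扩张率为2

x = torch.zeros(1, 1, 320)                          # 1×320 的张量
conv = nn.Conv1d(1, 24, kernel_size=3, padding=1)   # 24 通道卷积
print(conv(x).shape)                                # torch.Size([1, 24, 320])
```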
下面以Fs=32000Hz的音频信号为例(本申请实施例也适用于其它采样频率的场景,包括但不限于8000Hz、16000Hz、48000Hz等),其中,帧长设置为20ms,对于Fs=32000Hz,相当于每帧包含640个样本点。
接下来继续参见图13,对本申请实施例提供的音频编码方法和音频解码方法分别进行详细说明。其中,本申请实施例提供的音频编码方法包括:第1步,输入信号的生成。
这里,将第n帧的640个样本点,记为x(n)。
第2步,QMF子带信号分解。
这里,调用QMF分析滤波器(如2通道QMF滤波器)进行滤波处理,并对滤波得到的滤波信号进行下采样,获得两部分子带信号,即低频子带信号xLB(n)和高频子带信号xHB(n)。其中,低频子带信号xLB(n)的有效带宽为0-8kHz,高频子带信号xHB(n)的有效带宽为8-16kHz,每帧样本点数为320。
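下面给出该步骤的一段示意性代码（其中的原型低通滤波器系数为假设值，仅示意“滤波+2倍下采样”的流程；实际QMF系数需按滤波器组设计方法获得）：

```python
# 示意性草图：按公式(1)由低通系数得到高通系数，滤波后下采样2倍
import numpy as np
from scipy.signal import firwin, lfilter

x = np.random.randn(640)                          # 第 n 帧的 640 个样本点 x(n)
h_low = firwin(64, 0.5)                           # 假设的原型低通滤波器 H_Low
h_high = h_low * (-1.0) ** np.arange(h_low.size)  # 公式(1)：hHigh(k) = (-1)^k·hLow(k)

x_lb = lfilter(h_low, 1.0, x)[::2]                # 低频子带信号 xLB(n)，320 点，0-8kHz
x_hb = lfilter(h_high, 1.0, x)[::2]               # 高频子带信号 xHB(n)，320 点，8-16kHz
print(x_lb.size, x_hb.size)                       # 320 320
```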
第3步,第一层低频分析。
这里,调用第一层低频分析神经网络的目的是,基于低频子带信号xLB(n),生成更低维度的第一层低频信号特征FLB(n)。在本示例中,xLB(n)的数据维度为320,FLB(n)的数据维度为64,从数据量看,显然经过第一层低频分析神经网络后,起到了“降维”的作用,可以理解为数据压缩。作为示例,参见图17,图17是本申请实施例提供的第一层低频分析神经网络的结构示意图,对低频子带信号xLB(n)的处理流程包括:
(1)调用一个24通道的因果卷积,将输入的张量(即xLB(n)),扩展为24*320的张量。
(2)对24*320的张量进行预处理。在实际应用中,可以做池化因子为2的池化(Pooling)操作、且激活函数可以为ReLU,以生成24*160的张量。
(3)级联3个不同降采样因子(Down_factor)的编码块。以编码块(Down_factor=4)为例，可以先执行1个或者多个空洞卷积，每个卷积核大小均固定为1*3，移位率(Stride rate)均为1。此外，该1个或者多个空洞卷积的扩张率(Dilation rate)可以根据需要自行设置，比如3；当然，本申请实施例也不限制不同空洞卷积设置不同的扩张率。然后，3个编码块的Down_factor分别设置为4、5、8，等效于设置了不同大小的池化因子，起到降采样作用。最后，3个编码块的通道数分别设置为48、96、192。因此，经过3个级联的编码块，依次将24*160的张量分别转换成48*40、96*8和192*1的张量。
(4)对192*1的张量,经过类似预处理的因果卷积,输出一个64维的特征向量,即第一层低频信号特征FLB(n)。
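按照上述各步给出的张量尺寸，下面给出一段示意性代码用于核对各级形状（空洞卷积以普通卷积近似，激活与池化方式均为假设，并非本申请的原始实现）：

```python
# 示意性草图：按正文尺寸组织的第一层低频分析网络，仅核对张量形状
import torch
import torch.nn as nn

def encode_block(in_ch, out_ch, down_factor):
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),  # 空洞卷积部分（此处近似）
        nn.ReLU(),
        nn.AvgPool1d(down_factor),                           # Down_factor 起降采样作用
    )

analysis_lb = nn.Sequential(
    nn.Conv1d(1, 24, kernel_size=3, padding=1),  # 24通道因果卷积：1*320 -> 24*320
    nn.AvgPool1d(2), nn.ReLU(),                  # 预处理：-> 24*160
    encode_block(24, 48, 4),                     # -> 48*40
    encode_block(48, 96, 5),                     # -> 96*8
    encode_block(96, 192, 8),                    # -> 192*1
    nn.Conv1d(192, 64, kernel_size=1),           # 输出 64 维特征，即 FLB(n)
)

print(analysis_lb(torch.randn(1, 1, 320)).shape)  # torch.Size([1, 64, 1])
```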
第4步,第一层高频分析。
这里，调用第一层高频分析神经网络的目的是，基于高频子带信号xHB(n)，生成更低维度的第一层高频信号特征FHB(n)。在本示例中，第一层高频分析神经网络的结构可以与第一层低频分析神经网络相一致，即输入（即xHB(n)）的数据维度为320维，输出（即FHB(n)）的数据维度为64维。考虑到高频子带信号的重要性低于低频子带信号，可以适当减少输出维度，这样可以减少第一层高频分析神经网络的复杂度，在本示例中不作限制。
第5步,第二层低频分析。
这里,调用第二层低频分析神经网络的目的是,基于低频子带信号xLB(n)和第一层低频信号特征FLB(n),得到更低维度的第二层低频信号特征FLB,e(n)。第二层低频信号特征反映了:第一层低频分析神经网络的输出在解码端的重建音频信号,相对原始音频信号的残差;因此,在解码端,可以根据FLB,e(n)预测低频子带信号的残差信号,并与通过第一层低频分析神经网络的输出预测的低频子带信号估计值进行求和,获得更高精度的低频子带信号估计值。
第二层低频分析神经网络采用与第一层低频分析神经网络类似的结构，参见图18，图18是本申请实施例提供的第二层低频分析神经网络的结构示意图。这里，和第一层低频分析神经网络的主要差异点包括：(1)第二层低频分析神经网络的输入除了包括低频子带信号xLB(n)，还包括第一层低频分析神经网络的输出FLB(n)，xLB(n)和FLB(n)两个变量可以拼接成384维的拼接特征。(2)考虑第二层低频分析所处理的是残差信号，第二层低频分析神经网络的输出FLB,e(n)的维度设置为28。
第6步,第二层高频分析。
这里，调用第二层高频分析神经网络的目的是，基于高频子带信号xHB(n)和第一层高频信号特征FHB(n)，得到更低维度的第二层高频信号特征FHB,e(n)。第二层高频分析神经网络的结构可以和第二层低频分析神经网络的结构相同，即输入（xHB(n)和FHB(n)的拼接特征）的数据维度为384维，输出（FHB,e(n)）的数据维度为28维。
第7步,量化编码。
通过查询预先设置好的量化表,对2层输出的信号特征进行量化处理,并对量化得到的量化结果进行编码,其中,量化可以采用标量量化(各分量单独量化)的方式,编码可以采用熵编码的方式。另外,本申请实施例也不限制矢量量化(相邻多个分量组合成一个矢量进行联合量化)和熵编码的技术组合。
在实际实施时,第一层低频信号特征FLB(n)为64维特征,可以使用8kbps完成编码,每帧量化一个参数的平均码率为2.5bit;第一层高频信号特征FHB(n)为64维特征,可以使用6kbps完成编码,每帧量化一个参数的平均码率为1.875bit。因此,编码第一层总共是14kbps。
在实际实施时,第二层低频信号特征FLB,e(n)为28维特征,可以使用3.5kbps完成编码,每帧量化一个参数的平均码率为2.5bit;第二层高频信号特征FHB,e(n)为28维特征,可以使用3.5kbps完成编码,每帧量化一个参数的平均码率为2.5bit。因此,编码第二层总共是7kbps。
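下面给出一段示意性代码，核对上述码率分配的算术关系（帧长20ms对应每秒50帧）：

```python
# 示意性草图：按“特征维度 × 每参数平均码率 × 每秒帧数”核对各层码率
def kbps(dim, avg_bits_per_param, frames_per_sec=50):
    return dim * avg_bits_per_param * frames_per_sec / 1000.0

assert kbps(64, 2.5) == 8.0      # 第一层低频 FLB(n)：8 kbps
assert kbps(64, 1.875) == 6.0    # 第一层高频 FHB(n)：6 kbps，第一层合计 14 kbps
assert kbps(28, 2.5) == 3.5      # 第二层低频/高频各 3.5 kbps，第二层合计 7 kbps
```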
基于此,通过分层编码的方式,可以渐进编码不同的特征向量;根据不同应用场景,本申请实施例不限制其它方式的码率分布,比如,还可以迭代地引入第三层或者更高层编码。量化编码之后可以生成码流,针对不同层的码流,可以采用不同的传输策略,保证以不同优先级的传输,如可以采用前向纠错机制(Forward Error Correction,FEC),通过冗余传输提升传输的质量,不同层的冗余倍数不一样,比如第一层的冗余倍数可以设置高一些。
以所有层的码流均被解码端接收并且准确解码为例，本申请实施例提供的音频解码方法包括：
第1步,解码。
这里，解码即为编码的逆过程。对接收到的码流进行解析，并通过查询量化表，获得低频信号特征估计值和高频信号特征估计值。示例性地，第一层，获得低频子带信号的64维信号特征的量化值F′LB(n)，以及高频子带信号的64维信号特征的量化值F′HB(n)；第二层，获得低频子带信号的28维信号特征的量化值F′LB,e(n)，以及高频子带信号的28维信号特征的量化值F′HB,e(n)。
第2步,第一层低频合成。
这里，调用第一层低频合成神经网络的目的是，基于低频特征向量的量化值F′LB(n)，生成第一层低频子带信号估计值x′LB(n)。作为示例，参见图19，图19是本申请实施例提供的第一层低频合成神经网络的模型示意图。这里，第一层低频合成神经网络的处理流程与第一层低频分析神经网络的处理流程类似，例如同样使用因果卷积；第一层低频合成神经网络的后处理结构类似于第一层低频分析神经网络的预处理结构；解码块结构与编码块结构是对称的：编码侧的编码块是先做空洞卷积再池化完成降采样，解码侧的解码块是先进行池化完成升采样，再做空洞卷积。
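下面给出第一层低频合成网络的一段示意性代码（与前文分析网络的草图对称；上采样方式、激活函数等均为假设，仅用于核对各级张量形状）：

```python
# 示意性草图：解码块先升采样、再做（此处近似的）空洞卷积，与编码块对称
import torch
import torch.nn as nn

def decode_block(in_ch, out_ch, up_factor):
    return nn.Sequential(
        nn.Upsample(scale_factor=up_factor),                 # 先完成升采样
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),  # 再做空洞卷积（此处近似）
        nn.ReLU(),
    )

synthesis_lb = nn.Sequential(
    nn.Conv1d(64, 192, kernel_size=1),   # 因果卷积：64维特征 -> 192*1
    decode_block(192, 96, 8),            # -> 96*8
    decode_block(96, 48, 5),             # -> 48*40
    decode_block(48, 24, 4),             # -> 24*160
    nn.Upsample(scale_factor=2),         # 后处理：-> 24*320
    nn.Conv1d(24, 1, kernel_size=1),     # -> 1*320，即 x′LB(n)
)

print(synthesis_lb(torch.randn(1, 64, 1)).shape)  # torch.Size([1, 1, 320])
```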
第3步,第一层高频合成。
这里，第一层高频合成神经网络的结构和第一层低频合成神经网络的结构相同，可以根据第一层高频信号特征的量化值F′HB(n)，获得第一层高频子带信号估计值x′HB(n)。
第4步,第二层低频合成。
这里，调用第二层低频合成神经网络的目的是，基于第二层低频信号特征的量化值F′LB,e(n)，生成低频子带残差信号估计值x′LB,e(n)。参见图20，图20是本申请实施例提供的第二层低频合成神经网络的结构示意图，该第二层低频合成神经网络的结构和第一层低频合成神经网络的结构类似，差异点在于输入的数据维度为28维。
第5步,第二层高频合成。
这里，第二层高频合成神经网络的结构和第二层低频合成神经网络的结构相同，可以基于第二层高频信号特征的量化值F′HB,e(n)，生成高频子带残差信号估计值x′HB,e(n)。
第6步,合成滤波。
基于前面步骤，解码端获得低频子带信号估计值x′LB(n)和高频子带信号估计值x′HB(n)，以及低频子带残差信号估计值x′LB,e(n)和高频子带残差信号估计值x′HB,e(n)。将x′LB(n)和x′LB,e(n)相加，生成高精度的低频子带信号估计值；将x′HB(n)和x′HB,e(n)相加，生成高精度的高频子带信号估计值。最后，对低频子带信号估计值以及高频子带信号估计值进行上采样，并调用QMF合成滤波器，对上采样结果进行合成滤波，生成640点的重建音频信号x′(n)。
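下面给出合成滤波步骤的一段示意性代码（滤波器系数沿用前文的假设；x_lb_1、x_lb_e、x_hb_1、x_hb_e 为四个320点估计值的占位符号，分别对应 x′LB(n)、x′LB,e(n)、x′HB(n)、x′HB,e(n)）：

```python
# 示意性草图：子带估计值求和，插零上采样后按公式(2)的合成滤波器组滤波
import numpy as np
from scipy.signal import firwin, lfilter

h_low = firwin(64, 0.5)                           # 与分析端一致的假设原型滤波器
h_high = h_low * (-1.0) ** np.arange(h_low.size)
x_lb_1, x_lb_e, x_hb_1, x_hb_e = (np.random.randn(320) for _ in range(4))

def upsample2(s):
    y = np.zeros(2 * s.size)
    y[::2] = s                                    # 插零上采样
    return y

x_lb_hat = x_lb_1 + x_lb_e                        # 高精度低频子带信号估计值
x_hb_hat = x_hb_1 + x_hb_e                        # 高精度高频子带信号估计值
g_low, g_high = h_low, -h_high                    # 公式(2)：GLow=HLow，GHigh=(-1)·HHigh
x_hat = 2.0 * (lfilter(g_low, 1.0, upsample2(x_lb_hat))
               + lfilter(g_high, 1.0, upsample2(x_hb_hat)))  # 重建的 640 点 x′(n)
print(x_hat.size)                                 # 640
```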
在本申请实施例中,可以通过采集数据对编码端和解码端的相关神经网络进行联合训练,获得最优参数,从而将训练好的网络模型投入使用。在本申请实施例中,仅公开了一种特定的网络输入、网络结构和网络输出的实施例;相关领域工程人员可根据需要修改上述配置。
应用本申请上述实施例，可以实现基于信号处理和深度学习网络的低码率音频编解码方案。通过信号分解等相关信号处理技术与深度神经网络的有机结合，编码效率较相关技术有了显著提升，在复杂度可接受的情况下，编码质量也得到了提高。在不同的编码内容和带宽情况下，编码端选择不同的分层传输策略进行码流的传输；解码端接收到低层码流，即可输出可接受质量的音频信号，如果还收到其它高层的码流，则可以输出高质量的音频。
可以理解的是,在本申请实施例中,涉及到用户信息(如用户发送的音频信号)等相关的数据,当本申请实施例运用到产品或技术中时,需要获得用户许可或者同意,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
下面继续说明本申请实施例提供的音频编码装置553的实施为软件模块的示例性结构,在一些实施例中,如图2所示,存储在存储器550的音频编码装置553中的软件模块可以包括:
第一特征提取模块5531,配置为对音频信号进行第一层级的特征提取,得到所述第一层级的信号特征;第二特征提取模块5532,配置为针对N个层级中的第i层级,对所述音频信号和第(i-1)层级的信号特征进行拼接,得到拼接特征,并对所述拼接特征进行第i层级的特征提取,得到所述第i层级的信号特征,所述N和所述i为大于1的整数,所述i小于或等于所述N;遍历模块5533,配置为对所述i进行遍历,得到所述N个层级中每个层级的信号特征,所述信号特征的数据维度小于所述音频信号的数据维度;编码模块5534,配置为对所述第一层级的信号特征、以及所述N个层级中每个层级的信号特征,分别进行编码,得到所述音频信号在各层级的码流。
在一些实施例中,所述第一特征提取模块5531,还配置为对所述音频信号进行子带分解,得到所述音频信号的低频子带信号和高频子带信号;对所述低频子带信号进行第一层级的特征提取,得到所述第一层级的低频信号特征,并对所述高频子带信号进行第一层级的特征提取,得到所述第一层级的高频信号特征;将所述低频信号特征和所述高频信号特征,作为所述第一层级的信号特征。
在一些实施例中,所述第一特征提取模块5531,还配置为按照第一采样频率对所述音频信号进行采样,得到采样信号;对所述采样信号进行低通滤波,得到低通滤波信号,并对所述低通滤波信号进行下采样,得到第二采样频率的所述低频子带信号;对所述采样信号进行高通滤波,得到高通滤波信号,并对所述高通滤波信号进行下采样,得到第二采样频率的所述高频子带信号;其中,所述第二采样频率小于所述第一采样频率。
在一些实施例中,所述第二特征提取模块5532,还配置为对所述音频信号的低频子带信号和第(i-1)层级的低频信号特征进行拼接,得到第一拼接特征,并对所述第一拼接特征进行第i层级的特征提取,得到所述第i层级的低频信号特征;对所述音频信号的高频子带信号和第(i-1)层级的高频信号特征进行拼接,得到第二拼接特征,并对所述第二拼接特征进行第i层级的特征提取,得到所述第i层级的高频信号特征;将所述第i层级的低频信号特征和所述第i层级的高频信号特征,作为所述第i层级的信号特征。
在一些实施例中,所述第一特征提取模块5531,还配置为对所述音频信号进行第一卷积处理,得到所述第一层级的卷积特征;对所述卷积特征进行第一池化处理,得到所述第一层级的池化特征;对所述池化特征进行第一下采样,得到所述第一层级的下采样特征;对所述下采样特征进行第二卷积处理,得到所述第一层级的信号特征。
在一些实施例中,所述第一下采样通过M个级联的编码层实现,所述第一特征提取模块5531,还配置为通过所述M个级联的编码层中的第一个编码层,对所述池化特征进行第一下采样,得到所述第一个编码层的下采样结果;通过所述M个级联的编码层中的第j个编码层,对第(j-1)个编码层的下采样结果进行第一下采样,得到所述第j个编码层的下采样结果;其中,所述M和所述j为大于1的整数,所述j小于或等于所述M;对所述j进行遍历,得到第M个编码层的下采样结果,并将所述第M个编码层的下采样结果,作为所述第一层级的下采样特征。
在一些实施例中,所述第二特征提取模块5532,还配置为对所述拼接特征进行第三卷积处理,得到所述第i层级的卷积特征;对所述卷积特征进行第二池化处理,得到所述第i层级的池化特征;对所述池化特征进行第二下采样,得到所述第i层级的下采样特征;对所述下采样特征进行第四卷积处理,得到所述第i层级的信号特征。
在一些实施例中,所述编码模块5534,还配置为对所述第一层级的信号特征、以及所述N个层级中每个层级的信号特征,分别进行量化处理,得到各层级的信号特征的量化结果;对所述各层级的信号特征的量化结果进行熵编码,得到所述音频信号在各层级的码流。
在一些实施例中,所述信号特征包括低频信号特征和高频信号特征,所述编码模块5534,还配置为对所述第一层级的低频信号特征、以及所述N个层级中每个层级的低频信号特征,分别进行编码,得到所述音频信号在各层级的低频码流;对所述第一层级的高频信号特征、以及所述N个层级中每个层级的高频信号特征,分别进行编码,得到所述音频信号在各层级的高频码流;将所述音频信号在各层级的低频码流以及高频码流,作为所述音频信号在相应层级的码流。
在一些实施例中,所述信号特征包括低频信号特征和高频信号特征,所述编码模块5534,还配置为按照第一编码码率,对所述第一层级的低频信号特征进行编码,得到第一层级的第一码流,并按照第二编码码率,对所述第一层级的高频信号特征进行编码,得到第一层级的第二码流;针对所述N个层级中每个层级的信号特征,分别执行如下处理:按照所述层级的第三编码码率,对所述层级的信号特征分别进行编码,得到各所述层级的第二码流;将所述第一层级的第二码流、以及所述N个层级中每个层级的第二码流,作为所述音频信号在各层级的码流;其中,所述第一编码码率大于所述第二编码码率,所述第二编码码率,大于所述N个层级中任一层级的第三编码码率,所述层级的编码码率与相应层级的码流的解码质量指标正相关。
在一些实施例中,所述编码模块5534,还配置为针对各所述层级,分别执行如下处理:对所述音频信号在所述层级的码流配置相应的层级传输优先级;其中,所述层级传输优先级与所述层级的层级数负相关,所述层级传输优先级与相应层级的码流的解码质量指标正相关。
在一些实施例中,所述信号特征包括低频信号特征和高频信号特征,所述音频信号在各层级的码流包括:基于所述低频信号特征编码得到的低频码流、以及基于所述高频信号特征编码得到的高频码流;所述编码模块5534,还配置为针对各所述层级,分别执行如下处理:为所述层级的低频码流配置第一传输优先级,并为所述层级的高频码流配置第二传输优先级;其中,所述第一传输优先级高于所述第二传输优先级,第(i-1)层级的所述第二传输优先级低于第i层级的所述第一传输优先级,所述码流的传输优先级与相应码流的解码质量指标正相关。
应用本申请上述实施例,实现了对音频信号的分层编码:首先,对音频信号进行第一层级的特征提取,得到第一层级的信号特征;然后,针对N(N为大于1的整数)个层级中的第i(i为大于1的整数,i小于或等于N)层级,对音频信号和第(i-1)层级的信号特征进行拼接,得到拼接特征,并对拼接特征进行第i层级的特征提取,得到第i层级的信号特征;再通过对i进行遍历,得到N个层级中每个层级的信号特征;最后,对第一层级的信号特征以及N个层级中每个层级的信号特征,分别进行编码,得到音频信号在各层级的码流。
第一,所提取的信号特征的数据维度小于音频信号的数据维度。如此,降低了音频编码过程中所处理数据的数据维度,提高了音频信号的编码效率;
第二，分层提取音频信号的信号特征时，每个层级的输出均作为下一层级的输入，使得每个层级均结合上一层级提取的信号特征，对音频信号进行更精确的特征提取，随着层级数量的增加，可以使音频信号在特征提取过程中的信息损失降到最低。如此，通过对该方式提取的信号特征进行编码所得到的多个码流，其包含的音频信号的信息更加接近于原始的音频信号，减少了音频信号在编码过程中的信息损失，保证了音频编码的编码质量。
下面说明本申请实施例提供的音频解码装置。本申请实施例提供的音频解码装置包括:接收模块,配置为接收对音频信号进行编码得到的多个层级分别对应的码流;解码模块,配置为分别对各所述层级的码流进行解码,得到各所述层级的信号特征,所述信号特征的数据维度小于所述音频信号的数据维度;特征重建模块,配置为分别对各所述层级的信号特征进行特征重建,得到各所述层级的层级音频信号;音频合成模块,配置为对多个所述层级的层级音频信号进行音频合成,得到所述音频信号。
在一些实施例中,所述码流包括低频码流和高频码流,所述解码模块,还配置为分别对各所述层级的低频码流进行解码,得到各所述层级的低频信号特征,并分别对各所述层级的高频码流进行解码,得到各所述层级的高频信号特征;相应的,所述特征重建模块,还配置为分别对各所述层级的低频信号特征进行特征重建,得到各所述层级的层级低频子带信号,并分别对各所述层级的高频信号特征进行特征重建,得到各所述层级的层级高频子带信号;将所述层级低频子带信号和所述层级高频子带信号,作为所述层级的层级音频信号;相应的,所述音频合成模块,还配置为将多个所述层级的层级低频子带信号进行相加,得到低频子带信号,并将多个所述层级的层级高频子带信号进行相加,得到高频子带信号;对所述低频子带信号和所述高频子带信号进行合成,得到所述音频信号。
在一些实施例中,所述音频合成模块,还配置为对所述低频子带信号进行上采样,得到低通滤波信号;对所述高频子带信号进行上采样,得到高频滤波信号;对所述低通滤波信号和所述高频滤波信号进行滤波合成,得到所述音频信号。
在一些实施例中,所述特征重建模块,还配置为针对各所述层级的信号特征,分别执行如下处理:对所述信号特征进行第一卷积处理,得到所述层级的卷积特征;对所述卷积特征进行上采样,得到所述层级的上采样特征;对所述上采样特征进行池化处理,得到所述层级的池化特征;对所述池化特征进行第二卷积处理,得到所述层级的层级音频信号。
在一些实施例中,所述上采样通过L个级联的解码层实现,所述特征重建模块,还配置为通过所述L个级联的解码层中的第一个解码层,对所述池化特征进行上采样,得到所述第一个解码层的上采样结果;通过所述L个级联的解码层中的第k个解码层,对第(k-1)个解码层的第一上采样结果进行上采样,得到所述第k个解码层的上采样结果;其中,所述L和所述k为大于1的整数,所述k小于或等于所述L;对所述k进行遍历,得到第L个解码层的上采样结果,并将所述第L个解码层的上采样结果,作为所述层级的上采样特征。
在一些实施例中,所述解码模块,还配置为针对各所述层级,分别执行如下处理:对所述层级的码流进行熵解码,得到所述码流的量化值;对所述码流的量化值进行逆量化处理,得到所述层级的信号特征。
应用本申请上述实施例,对多个层级的码流分别进行解码,得到各层级的信号特征,并分别对各层级的信号特征进行特征重建,得到各层级的层级音频信号,对多个层级的层级音频信号进行音频合成,得到音频信号。由于信号特征的数据维度小于音频信号的数据维度,减少了音频解码过程中所处理数据的数据维度,提高了音频信号的解码效率。
本申请实施例还提供一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行本申请实施例提供的方法。
本申请实施例还提供一种计算机可读存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的方法。
在一些实施例中，计算机可读存储介质可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、可擦写可编程只读存储器(Erasable Programmable Read-Only Memory,EPROM)、电可擦可编程只读存储器(Electrically Erasable Programmable Read-Only Memory,EEPROM)、闪存、磁表面存储器、光盘、或CD-ROM等存储器；也可以是包括上述存储器之一或任意组合的各种设备。
在一些实施例中,可执行指令可以采用程序、软件、软件模块、脚本或代码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。
作为示例，可执行指令可以但不一定对应于文件系统中的文件，可以被存储在保存其它程序或数据的文件的一部分中，例如，存储在超文本标记语言(HTML,Hyper Text Markup Language)文档中的一个或多个脚本中，存储在专用于所讨论的程序的单个文件中，或者，存储在多个协同文件(例如，存储一个或多个模块、子程序或代码部分的文件)中。
作为示例,可执行指令可被部署为在一个计算设备上执行,或者在位于一个地点的多个计算设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算设备上执行。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在本申请的保护范围之内。

Claims (23)

  1. 一种音频编码方法,所述方法由电子设备执行,所述方法包括:
    对音频信号进行第一层级的特征提取,得到所述第一层级的信号特征;
    针对N个层级中的第i层级,对所述音频信号和第(i-1)层级的信号特征进行拼接,得到拼接特征,并对所述拼接特征进行第i层级的特征提取,得到所述第i层级的信号特征,所述N和所述i为大于1的整数,所述i小于或等于所述N;
    对所述i进行遍历,得到所述N个层级中每个层级的信号特征,所述信号特征的数据维度小于所述音频信号的数据维度;
    对所述第一层级的信号特征、以及所述N个层级中每个层级的信号特征,分别进行编码,得到所述音频信号在各层级的码流。
  2. 如权利要求1所述的方法,其中,所述对音频信号进行第一层级的特征提取,得到所述第一层级的信号特征,包括:
    对所述音频信号进行子带分解,得到所述音频信号的低频子带信号和高频子带信号;
    对所述低频子带信号进行第一层级的特征提取,得到所述第一层级的低频信号特征,并对所述高频子带信号进行第一层级的特征提取,得到所述第一层级的高频信号特征;
    将所述低频信号特征和所述高频信号特征,确定为所述第一层级的信号特征。
  3. 如权利要求2所述的方法,其中,所述对所述音频信号进行子带分解,得到所述音频信号的低频子带信号和高频子带信号,包括:
    按照第一采样频率对所述音频信号进行采样,得到采样信号;
    对所述采样信号进行低通滤波,得到低通滤波信号,并对所述低通滤波信号进行下采样,得到第二采样频率的所述低频子带信号;
    对所述采样信号进行高通滤波,得到高通滤波信号,并对所述高通滤波信号进行下采样,得到第二采样频率的所述高频子带信号;
    其中,所述第二采样频率小于所述第一采样频率。
  4. 如权利要求2所述的方法,其中,所述对所述音频信号和第(i-1)层级的信号特征进行拼接,得到拼接特征,并对所述拼接特征进行第i层级的特征提取,得到所述第i层级的信号特征,包括:
    对所述音频信号的低频子带信号和第(i-1)层级的低频信号特征进行拼接,得到第一拼接特征,并对所述第一拼接特征进行第i层级的特征提取,得到所述第i层级的低频信号特征;
    对所述音频信号的高频子带信号和第(i-1)层级的高频信号特征进行拼接,得到第二拼接特征,并对所述第二拼接特征进行第i层级的特征提取,得到所述第i层级的高频信号特征;
    将所述第i层级的低频信号特征和所述第i层级的高频信号特征,确定为所述第i层级的信号特征。
  5. 如权利要求1所述的方法,其中,所述对音频信号进行第一层级的特征提取,得到所述第一层级的信号特征,包括:
    对所述音频信号进行第一卷积处理,得到所述第一层级的卷积特征;
    对所述卷积特征进行第一池化处理,得到所述第一层级的池化特征;
    对所述池化特征进行第一下采样,得到所述第一层级的下采样特征;
    对所述下采样特征进行第二卷积处理,得到所述第一层级的信号特征。
  6. 如权利要求5所述的方法,其中,所述第一下采样通过M个级联的编码层实现,所述对所述池化特征进行第一下采样,得到所述第一层级的下采样特征,包括:
    通过所述M个级联的编码层中的第一个编码层,对所述池化特征进行第一下采样,得到所述第一个编码层的下采样结果;
    通过所述M个级联的编码层中的第j个编码层,对第(j-1)个编码层的下采样结果进行第一下采样,得到所述第j个编码层的下采样结果;
    其中,所述M和所述j为大于1的整数,所述j小于或等于所述M;
    对所述j进行遍历,得到第M个编码层的下采样结果,并将所述第M个编码层的下采样结果,确定为所述第一层级的下采样特征。
  7. 如权利要求1所述的方法,其中,所述对所述拼接特征进行第i层级的特征提取,得到所述第i层级的信号特征,包括:
    对所述拼接特征进行第三卷积处理,得到所述第i层级的卷积特征;
    对所述卷积特征进行第二池化处理,得到所述第i层级的池化特征;
    对所述池化特征进行第二下采样,得到所述第i层级的下采样特征;
    对所述下采样特征进行第四卷积处理,得到所述第i层级的信号特征。
  8. 如权利要求1所述的方法,其中,所述对所述第一层级的信号特征、以及所述N个层级中每个层级的信号特征,分别进行编码,得到所述音频信号在各层级的码流,包括:
    对所述第一层级的信号特征、以及所述N个层级中每个层级的信号特征,分别进行量化处理,得到各层级的信号特征的量化结果;
    对所述各层级的信号特征的量化结果进行熵编码,得到所述音频信号在各层级的码流。
  9. 如权利要求1所述的方法,其中,所述信号特征包括低频信号特征和高频信号特征,所述对所述第一层级的信号特征、以及所述N个层级中每个层级的信号特征,分别进行编码,得到所述音频信号在各层级的码流,包括:
    对所述第一层级的低频信号特征、以及所述N个层级中每个层级的低频信号特征,分别进行编码,得到所述音频信号在各层级的低频码流;
    对所述第一层级的高频信号特征、以及所述N个层级中每个层级的高频信号特征,分别进行编码,得到所述音频信号在各层级的高频码流;
    将所述音频信号在各层级的低频码流以及高频码流,确定为所述音频信号在相应层级的码流。
  10. 如权利要求1所述的方法,其中,所述信号特征包括低频信号特征和高频信号特征,所述对所述第一层级的信号特征、以及所述N个层级中每个层级的信号特征,分别进行编码,得到所述音频信号在各层级的码流,包括:
    按照第一编码码率,对所述第一层级的低频信号特征进行编码,得到第一层级的第一码流,并按照第二编码码率,对所述第一层级的高频信号特征进行编码,得到第一层级的第二码流;
    针对所述N个层级中每个层级的信号特征,分别执行如下处理:按照所述层级的第三编码码率,对所述层级的信号特征分别进行编码,得到各所述层级的第二码流;
    将所述第一层级的第二码流、以及所述N个层级中每个层级的第二码流,确定为所述音频信号在各层级的码流;
    其中,所述第一编码码率大于所述第二编码码率,所述第二编码码率,大于所述N个层级中任一层级的第三编码码率,所述层级的编码码率与相应层级的码流的解码质量指标正相关。
  11. 如权利要求1所述的方法,其中,所述对所述第一层级的信号特征、以及所述N个层级中每个层级的信号特征,分别进行编码,得到所述音频信号在各层级的码流之后,所述方法还包括:
    针对各所述层级,分别执行如下处理:
    对所述音频信号在所述层级的码流配置相应的层级传输优先级;
    其中,所述层级传输优先级与所述层级的层级数负相关,所述层级传输优先级与相应层级的码流的解码质量指标正相关。
  12. 如权利要求1所述的方法,其中,所述信号特征包括低频信号特征和高频信号特征,所述音频信号在各层级的码流包括:基于所述低频信号特征编码得到的低频码流、以及基于所述高频信号特征编码得到的高频码流;所述方法还包括:
    针对各所述层级,分别执行如下处理:为所述层级的低频码流配置第一传输优先级,并为所述层级的高频码流配置第二传输优先级;
    其中,所述第一传输优先级高于所述第二传输优先级,第(i-1)层级的所述第二传输优先级低于第i层级的所述第一传输优先级,所述码流的传输优先级与相应码流的解码质量指标正相关。
  13. 一种音频解码方法,所述方法由电子设备执行,所述方法包括:
    接收对音频信号进行编码得到的多个层级分别对应的码流;
    分别对各所述层级的码流进行解码,得到各所述层级的信号特征,所述信号特征的数据维度小于所述音频信号的数据维度;
    分别对各所述层级的信号特征进行特征重建,得到各所述层级的层级音频信号;
    对多个所述层级的层级音频信号进行音频合成,得到所述音频信号。
  14. 如权利要求13所述的方法,其中,所述码流包括低频码流和高频码流,所述分别对各所述层级的码流进行解码,得到各所述层级的信号特征,包括:
    分别对各所述层级的低频码流进行解码,得到各所述层级的低频信号特征,并分别对各所述层级的高频码流进行解码,得到各所述层级的高频信号特征;
    所述分别对各所述层级的信号特征进行特征重建,得到各所述层级的层级音频信号,包括:
    分别对各所述层级的低频信号特征进行特征重建,得到各所述层级的层级低频子带信号,并分别对各所述层级的高频信号特征进行特征重建,得到各所述层级的层级高频子带信号;
    将所述层级低频子带信号和所述层级高频子带信号,作为所述层级的层级音频信号;
    所述对多个所述层级的层级音频信号进行音频合成,得到所述音频信号,包括:
    将多个所述层级的层级低频子带信号进行相加,得到低频子带信号,并将多个所述层级的层级高频子带信号进行相加,得到高频子带信号;
    对所述低频子带信号和所述高频子带信号进行合成,得到所述音频信号。
  15. 如权利要求14所述的方法,其中,所述对所述低频子带信号和所述高频子带信号进行合成,得到所述音频信号,包括:
    对所述低频子带信号进行上采样,得到低通滤波信号;
    对所述高频子带信号进行上采样,得到高频滤波信号;
    对所述低通滤波信号和所述高频滤波信号进行滤波合成,得到所述音频信号。
  16. 如权利要求13所述的方法,其中,所述分别对各所述层级的信号特征进行特征重建,得到各所述层级的层级音频信号,包括:
    针对各所述层级的信号特征,分别执行如下处理:
    对所述信号特征进行第一卷积处理,得到所述层级的卷积特征;
    对所述卷积特征进行上采样,得到所述层级的上采样特征;
    对所述上采样特征进行池化处理,得到所述层级的池化特征;
    对所述池化特征进行第二卷积处理,得到所述层级的层级音频信号。
  17. 如权利要求16所述的方法,其中,所述上采样通过L个级联的解码层实现,所述对所述卷积特征进行上采样,得到所述层级的上采样特征,包括:
    通过所述L个级联的解码层中的第一个解码层，对所述卷积特征进行上采样，得到所述第一个解码层的上采样结果；
    通过所述L个级联的解码层中的第k个解码层，对第(k-1)个解码层的上采样结果进行上采样，得到所述第k个解码层的上采样结果；
    其中,所述L和所述k为大于1的整数,所述k小于或等于所述L;
    对所述k进行遍历,得到第L个解码层的上采样结果,并将所述第L个解码层的上采样结果,作为所述层级的上采样特征。
  18. 如权利要求13所述的方法,其中,所述分别对各所述层级的码流进行解码,得到各所述层级的信号特征,包括:
    针对各所述层级,分别执行如下处理:
    对所述层级的码流进行熵解码,得到所述码流的量化值;
    对所述码流的量化值进行逆量化处理,得到所述层级的信号特征。
  19. 一种音频编码装置,所述装置包括:
    第一特征提取模块,配置为对音频信号进行第一层级的特征提取,得到所述第一层级的信号特征;
    第二特征提取模块,配置为针对N个层级中的第i层级,对所述音频信号和第(i-1)层级的信号特征进行拼接,得到拼接特征,并对所述拼接特征进行第i层级的特征提取,得到所述第i层级的信号特征,所述N和所述i为大于1的整数,所述i小于或等于所述N;
    遍历模块,配置为对所述i进行遍历,得到所述N个层级中每个层级的信号特征,所述信号特征的数据维度小于所述音频信号的数据维度;
    编码模块,配置为对所述第一层级的信号特征、以及所述N个层级中每个层级的信号特征,分别进行编码,得到所述音频信号在各层级的码流。
  20. 一种音频解码装置,所述装置包括:
    接收模块,配置为接收对音频信号进行编码得到的多个层级分别对应的码流;
    解码模块,配置为分别对各所述层级的码流进行解码,得到各所述层级的信号特征,所述信号特征的数据维度小于所述音频信号的数据维度;
    特征重建模块,配置为分别对各所述层级的信号特征进行特征重建,得到各所述层级的层级音频信号;
    音频合成模块,配置为对多个所述层级的层级音频信号进行音频合成,得到所述音频信号。
  21. 一种电子设备,所述电子设备包括:
    存储器,配置为存储可执行指令;
    处理器,配置为执行所述存储器中存储的可执行指令时,实现权利要求1至18任一项所述的方法。
  22. 一种计算机可读存储介质,存储有可执行指令,所述可执行指令被处理器执行时,实现权利要求1至18任一项所述的方法。
  23. 一种计算机程序产品,包括计算机程序或指令,所述计算机程序或指令被处理器执行时,实现权利要求1至18任一项所述的方法。
PCT/CN2023/088014 2022-06-15 2023-04-13 音频编码方法、装置、电子设备、存储介质及程序产品 WO2023241193A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210677636.4A CN115116454A (zh) 2022-06-15 2022-06-15 音频编码方法、装置、设备、存储介质及程序产品
CN202210677636.4 2022-06-15

Publications (1)

Publication Number Publication Date
WO2023241193A1 (zh)

Family

ID=83327948

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/088014 WO2023241193A1 (zh) 2022-06-15 2023-04-13 音频编码方法、装置、电子设备、存储介质及程序产品

Country Status (2)

Country Link
CN (1) CN115116454A (zh)
WO (1) WO2023241193A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116454A (zh) * 2022-06-15 2022-09-27 腾讯科技(深圳)有限公司 音频编码方法、装置、设备、存储介质及程序产品
CN117476024A (zh) * 2023-11-29 2024-01-30 腾讯科技(深圳)有限公司 音频编码方法、音频解码方法、装置、可读存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160196828A1 (en) * 2015-01-07 2016-07-07 Adobe Systems Incorporated Acoustic Matching and Splicing of Sound Tracks
CN112420065A (zh) * 2020-11-05 2021-02-26 北京中科思创云智能科技有限公司 音频降噪处理方法和装置及设备
CN113470667A (zh) * 2020-03-11 2021-10-01 腾讯科技(深圳)有限公司 语音信号的编解码方法、装置、电子设备及存储介质
CN113889076A (zh) * 2021-09-13 2022-01-04 北京百度网讯科技有限公司 语音识别及编解码方法、装置、电子设备及存储介质
CN115116454A (zh) * 2022-06-15 2022-09-27 腾讯科技(深圳)有限公司 音频编码方法、装置、设备、存储介质及程序产品

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100711989B1 (ko) * 2002-03-12 2007-05-02 노키아 코포레이션 효율적으로 개선된 스케일러블 오디오 부호화
US8560328B2 (en) * 2006-12-15 2013-10-15 Panasonic Corporation Encoding device, decoding device, and method thereof
CN101572087B (zh) * 2008-04-30 2012-02-29 北京工业大学 嵌入式语音或音频信号编解码方法和装置
CN105070293B (zh) * 2015-08-31 2018-08-21 武汉大学 基于深度神经网络的音频带宽扩展编码解码方法及装置
CN112767954B (zh) * 2020-06-24 2024-06-14 腾讯科技(深圳)有限公司 音频编解码方法、装置、介质及电子设备
CN113299313B (zh) * 2021-01-28 2024-03-26 维沃移动通信有限公司 音频处理方法、装置及电子设备
CN112992161A (zh) * 2021-04-12 2021-06-18 北京世纪好未来教育科技有限公司 音频编码方法、音频解码方法、装置、介质及电子设备
CN113628610B (zh) * 2021-08-12 2024-02-13 科大讯飞股份有限公司 一种语音合成方法和装置、电子设备
CN113628630B (zh) * 2021-08-12 2023-12-01 科大讯飞股份有限公司 基于编解码网络的信息转换方法和装置、电子设备
CN114582317B (zh) * 2022-03-29 2023-08-08 马上消费金融股份有限公司 语音合成方法、声学模型的训练方法及装置


Also Published As

Publication number Publication date
CN115116454A (zh) 2022-09-27

Similar Documents

Publication Publication Date Title
WO2023241193A1 (zh) 音频编码方法、装置、电子设备、存储介质及程序产品
JP4850837B2 (ja) 異なるサブバンド領域同士の間の通過によるデータ処理方法
JP5722040B2 (ja) スケーラブルなスピーチおよびオーディオコーデックにおける、量子化mdctスペクトルに対するコードブックインデックスのエンコーディング/デコーディングのための技術
JP4374233B2 (ja) 複数因子分解可逆変換(multiplefactorizationreversibletransform)を用いたプログレッシブ・ツー・ロスレス埋込みオーディオ・コーダ(ProgressivetoLosslessEmbeddedAudioCoder:PLEAC)
US10468045B2 (en) Methods, encoder and decoder for linear predictive encoding and decoding of sound signals upon transition between frames having different sampling rates
WO2023241254A9 (zh) 音频编解码方法、装置、电子设备、计算机可读存储介质及计算机程序产品
RU2408089C2 (ru) Декодирование кодированных с предсказанием данных с использованием адаптации буфера
US20220180881A1 (en) Speech signal encoding and decoding methods and apparatuses, electronic device, and storage medium
CN104718572A (zh) 音频编码方法和装置、音频解码方法和装置及采用该方法和装置的多媒体装置
RU2530926C2 (ru) Изменение формы шума округления для основанных на целочисленном преобразовании кодирования и декодирования аудио и видеосигнала
KR102083768B1 (ko) 오디오 신호의 고주파 재구성을 위한 하모닉 트랜스포저의 하위호환형 통합
WO2023241222A1 (zh) 音频处理方法、装置、设备、存储介质及计算机程序产品
WO2023241205A1 (zh) 音频处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品
CN115116457A (zh) 音频编码及解码方法、装置、设备、介质及程序产品
CN117831548A (zh) 音频编解码系统的训练方法、编码方法、解码方法、装置
CN117476024A (zh) 音频编码方法、音频解码方法、装置、可读存储介质
WO2024093588A1 (zh) 语音合成模型的训练方法、装置、设备、存储介质及程序产品
CN117198301A (zh) 音频编码方法、音频解码方法、装置、可读存储介质
CN117834596A (zh) 音频处理方法、装置、设备、存储介质及计算机程序产品
CN117219095A (zh) 音频编码方法、音频解码方法、装置、设备及存储介质
US11881227B2 (en) Audio signal compression method and apparatus using deep neural network-based multilayer structure and training method thereof
WO2022252957A1 (zh) 音频数据编解码方法和相关装置及计算机可读存储介质
CN117351943A (zh) 音频处理方法、装置、设备和存储介质
KR20230134856A (ko) 정규화 플로우를 활용한 오디오 신호를 부호화 및 복호화 하는 방법 및 그 학습 방법
CN117854516A (zh) 音频编解码方法、装置和设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 23822764; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2023822764; Country of ref document: EP; Effective date: 20240515)