WO2023241254A9 - Audio encoding and decoding method and apparatus, electronic device, computer-readable storage medium, and computer program product - Google Patents

Audio encoding and decoding method and apparatus, electronic device, computer-readable storage medium, and computer program product

Info

Publication number
WO2023241254A9
Authority
WO
WIPO (PCT)
Prior art keywords
signal
predicted value
vector
feature vector
feature
Application number
PCT/CN2023/092246
Other languages
English (en)
French (fr)
Other versions
WO2023241254A1 (zh)
Inventor
史裕鹏
肖玮
王蒙
康迂勇
黄庆博
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2023241254A1
Publication of WO2023241254A9

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering

Definitions

  • the present application relates to the field of communication technology, and in particular to an audio encoding and decoding method, device, electronic device, computer-readable storage medium and computer program product.
  • voice calls are increasingly used, such as transmitting audio signals (such as voice signals) between participants in a network conference.
  • the voice signal may be mixed with acoustic interference such as noise.
  • the noise mixed in the voice signal will cause the call quality to deteriorate, thereby greatly affecting the user's auditory experience.
  • the embodiments of the present application provide an audio coding and decoding method, apparatus, electronic device, computer-readable storage medium, and computer program product, which can effectively suppress acoustic interference in audio signals, thereby improving the quality of reconstructed audio signals.
  • the present application provides an audio decoding method, including:
  • acquiring a bit stream, wherein the bit stream is obtained by encoding an audio signal;
  • decoding the bit stream to obtain a predicted value of a feature vector of the audio signal;
  • performing label extraction processing on the predicted value of the feature vector to obtain a label information vector, wherein the dimension of the label information vector is the same as the dimension of the predicted value of the feature vector;
  • performing signal reconstruction based on the predicted value of the feature vector and the label information vector;
  • using the predicted value of the audio signal obtained by the signal reconstruction as the decoding result of the bit stream.
  • the present application provides an audio decoding device, including:
  • An acquisition module configured to acquire a bit stream, wherein the bit stream is obtained by encoding the audio signal
  • a decoding module configured to decode the bit stream to obtain a predicted value of the feature vector of the audio signal
  • a label extraction module configured to perform label extraction processing on the predicted value of the feature vector to obtain a label information vector, wherein the dimension of the label information vector is the same as the dimension of the predicted value of the feature vector;
  • a reconstruction module configured to perform signal reconstruction based on the predicted value of the feature vector and the label information vector
  • the determination module is configured to use the predicted value of the audio signal obtained by reconstructing the signal as the decoding result of the code stream.
  • the present application provides an audio encoding method, including:
  • acquiring an audio signal;
  • encoding the audio signal to obtain a code stream, wherein the code stream is used by an electronic device to execute the audio decoding method provided in an embodiment of the present application.
  • the present application provides an audio encoding device, including:
  • An acquisition module configured to acquire an audio signal
  • the encoding module is configured to encode the audio signal to obtain a code stream, wherein the code stream is used for the electronic device to execute the audio decoding method provided in the embodiment of the present application.
  • An embodiment of the present application provides an electronic device, including:
  • a memory for storing computer executable instructions
  • the processor is used to implement the audio encoding and decoding method provided in the embodiment of the present application when executing the computer executable instructions stored in the memory.
  • An embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for implementing the audio encoding and decoding method provided in the embodiment of the present application when executed by a processor.
  • An embodiment of the present application provides a computer program product, including a computer program or computer executable instructions, which are used to implement the audio encoding and decoding method provided in the embodiment of the present application when executed by a processor.
  • In the embodiments of the present application, a label information vector is obtained from the predicted value of the feature vector, and the signal is reconstructed by combining the predicted value of the feature vector and the label information vector.
  • Since the label information vector only reflects the core components of the audio signal, that is, the label information vector does not include acoustic interference such as noise, the label information vector can be used to emphasize those core components when it is combined with the predicted value of the feature vector.
  • In this way, the proportion of the core components in the audio signal can be increased through the label information vector, and the proportion of acoustic interference such as noise can be correspondingly reduced, thereby effectively suppressing the noise component included in the audio signal collected by the encoding end, achieving a signal enhancement effect, and thus improving the quality of the reconstructed audio signal.
  • FIG. 1 is a schematic diagram of spectrum comparison at different bit rates provided in an embodiment of the present application;
  • FIG. 2 is a schematic diagram of the architecture of an audio codec system 100 provided in an embodiment of the present application;
  • FIG. 3 is a schematic diagram of the structure of a second terminal device 500 provided in an embodiment of the present application;
  • FIG. 4A is a schematic flow chart of an audio encoding method provided in an embodiment of the present application;
  • FIG. 4B is a schematic flow chart of an audio decoding method provided in an embodiment of the present application;
  • FIG. 4C is a schematic flow chart of an audio encoding and decoding method provided in an embodiment of the present application;
  • FIG. 5 is a schematic diagram of the structure of the encoding end and the decoding end provided in an embodiment of the present application;
  • FIGS. 6A and 6B are schematic flow charts of an audio decoding method provided in an embodiment of the present application;
  • FIG. 7 is a schematic diagram of an end-to-end voice communication link provided in an embodiment of the present application;
  • FIG. 8 is a schematic flow chart of an audio encoding and decoding method provided in an embodiment of the present application;
  • FIG. 9A is a schematic diagram of a common convolution provided in an embodiment of the present application;
  • FIG. 9B is a schematic diagram of a dilated convolution provided in an embodiment of the present application;
  • FIG. 10 is a schematic diagram of the spectrum response of the low-pass part and the high-pass part of the QMF analysis filter bank provided in an embodiment of the present application;
  • FIG. 11A is a schematic diagram showing the principle of obtaining 4-channel sub-band signals based on a QMF filter bank according to an embodiment of the present application;
  • FIG. 11B is a schematic diagram showing the principle of obtaining 3-channel sub-band signals based on a QMF filter bank according to an embodiment of the present application;
  • FIG. 12 is a schematic diagram of the structure of an analysis network provided in an embodiment of the present application;
  • FIG. 13 is a schematic diagram of the structure of an enhancement network provided in an embodiment of the present application;
  • FIG. 14 is a schematic diagram of the structure of a synthesis network provided in an embodiment of the present application;
  • FIG. 15 is a schematic flow chart of an audio encoding and decoding method provided in an embodiment of the present application;
  • FIG. 16 is a schematic flow chart of an audio encoding and decoding method provided in an embodiment of the present application;
  • FIG. 17 is a schematic diagram of the structure of a first analysis network provided in an embodiment of the present application;
  • FIG. 18 is a schematic diagram of the structure of a first enhancement network provided in an embodiment of the present application;
  • FIG. 19 is a schematic diagram of the structure of a first synthesis network provided in an embodiment of the present application;
  • FIG. 20 is a schematic diagram showing a comparison of encoding and decoding effects provided in an embodiment of the present application.
  • The terms "first" and "second" are merely used to distinguish similar objects and do not represent a specific ordering of the objects. It can be understood that "first" and "second" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
  • Neural Network (NN): an algorithmic mathematical model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. Such a network relies on the complexity of the system and adjusts the interconnections between a large number of internal nodes to process information.
  • Deep Learning: a new research direction in the field of Machine Learning (ML). Deep learning learns the inherent laws and representation levels of sample data, and the information obtained in the learning process is very helpful for interpreting data such as text, images and sounds. Its ultimate goal is to enable machines to have human-like analytical learning capabilities and to recognize data such as text, images and sounds.
  • Vector Quantization (VQ): the process of encoding points in a vector space using a finite subset of that space (a codebook).
  • Scalar Quantization: quantization of a scalar, i.e. one-dimensional vector quantization, which divides the dynamic range into several small intervals, each of which has a representative value. When the input signal falls into a certain interval, it is quantized to that interval's representative value.
  • Entropy Coding: a lossless coding method that, in accordance with the entropy principle, loses no information during the coding process; it is also a key module at the back end of lossy coding. Common entropy coding methods include Shannon coding, Huffman coding, Exp-Golomb coding, and arithmetic coding.
  • Quadrature Mirror Filters (QMF): a filter pair comprising analysis and synthesis filters. The QMF analysis filter bank decomposes a signal into sub-band signals to reduce the signal bandwidth so that each sub-band signal can be processed smoothly by its channel; the QMF synthesis filter bank synthesizes the sub-band signals recovered at the decoding end, for example reconstructing the original audio signal through zero-value interpolation and bandpass filtering.
  • Voice coding and decoding technology is a core technology in communication services including remote audio and video calls.
  • The goal of voice coding technology is to use as few network bandwidth resources as possible to transmit as much voice information as possible.
  • Voice coding is a kind of source coding. The purpose of source coding is to compress the amount of data to be transmitted as much as possible at the encoding end, remove the redundancy in the information, and restore it losslessly (or nearly losslessly) at the decoding end.
  • The compression ratio of voice codecs provided by the related art can reach more than 10 times; that is, 10 MB of original voice data only requires 1 MB to be transmitted after being compressed by the encoder, which greatly reduces the bandwidth resources consumed to transmit the information.
  • For a wideband voice signal with a sampling rate of 16000 Hz, if a 16-bit sampling depth is used, the bit rate of the uncompressed version is 256 kilobits per second (kbps). If voice coding technology is used, even with lossy coding, the quality of the reconstructed voice signal can be close to the uncompressed version within a bit rate range of 10-20 kbps, and the auditory perception may even be indistinguishable.
  • If a service with a higher sampling rate is required, such as 32000 Hz ultra-wideband voice, the bit rate range must be at least 30 kbps.
  • Speech coding schemes are generally divided into three categories: waveform coding (waveform speech coding), parametric coding (parametric speech coding), and hybrid coding (hybrid speech coding).
  • waveform coding is to directly encode the waveform of the speech signal.
  • the advantage of this coding method is that the encoded speech quality is high, but the compression rate is not high.
  • Parametric coding refers to modeling the speech process, and what the encoder needs to do is to extract the corresponding parameters of the speech signal to be transmitted.
  • the advantage of parametric coding is that the compression rate is extremely high, but the disadvantage is that the quality of the restored speech is not high.
  • Hybrid coding is a combination of the above two coding methods.
  • the speech components that can be coded using parameters are expressed by parameters, and the remaining components that cannot be effectively expressed by parameters are coded using waveform coding.
  • the combination of the two can achieve high coding efficiency and high restored speech quality.
  • The above three coding principles are derived from classic speech signal modeling, also known as compression methods based on signal processing. Based on rate-distortion analysis and combined with decades of standardization experience, a bit rate of at least 0.75 bit/sample is recommended to provide ideal speech quality; for a wideband speech signal with a sampling rate of 16000 Hz, this is equivalent to 12 kbps. For example, the IETF OPUS standard recommends 16 kbps for providing high-quality wideband voice calls.
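  • As a quick illustrative calculation (a sketch using only the figures quoted above), the relationship between sampling rate, bit depth and the 0.75 bit/sample guideline is:

```python
# Illustrative bit-rate arithmetic for a 16000 Hz wideband speech signal.
sample_rate_hz = 16000
bit_depth = 16

uncompressed_bps = sample_rate_hz * bit_depth        # 256000 bps = 256 kbps (uncompressed)
recommended_bps = int(sample_rate_hz * 0.75)         # 0.75 bit/sample -> 12000 bps = 12 kbps

print(uncompressed_bps // 1000, "kbps uncompressed")                # 256 kbps
print(recommended_bps // 1000, "kbps at 0.75 bit/sample")           # 12 kbps
print(round(uncompressed_bps / recommended_bps), "x compression")   # roughly 21x
```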
  • Figure 1 is a schematic diagram of spectrum comparison at different bit rates provided in an embodiment of the present application to demonstrate the relationship between compression bit rate and quality.
  • curve 101 is the original speech, that is, the uncompressed audio signal;
  • curve 102 is the result of the OPUS encoder at 20 kbps;
  • curve 103 is the result of the OPUS encoder at 6 kbps. It can be seen from Figure 1 that as the encoding bit rate increases, the compressed signal becomes closer to the original signal.
  • The applicant also found that, for audio codec solutions based on artificial intelligence, although the bit rate can be lower than 2 kbps, generative networks such as WaveNet are generally invoked, which leads to very high complexity at the decoding end, makes use in mobile terminals very challenging, and results in an absolute quality that differs significantly from encoders based on traditional signal processing.
  • In other solutions, the bit rate is 6-10 kbps and the subjective quality is close to that of traditional signal processing solutions;
  • however, deep learning networks are used at both ends of the codec, resulting in very high complexity.
  • the embodiments of the present application provide an audio coding method, device, electronic device, computer-readable storage medium and computer program product, which can effectively suppress acoustic interference in audio signals while improving coding efficiency, thereby improving the quality of reconstructed audio signals.
  • the following describes an exemplary application of the electronic device provided by the embodiments of the present application.
  • the electronic device provided by the embodiments of the present application can be implemented as a terminal device, or as a server, or implemented by a terminal device and a server in collaboration.
  • the following is an example of an audio coding method provided by an embodiment of the present application being implemented in collaboration with a terminal device and a server.
  • the audio codec system 100 includes: a server 200, a network 300, a first terminal device 400 (i.e., an encoding end) and a second terminal device 500 (i.e., a decoding end), wherein the network 300 can be a local area network, a wide area network, or a combination of the two.
  • a client 410 is running on the first terminal device 400, and the client 410 can be various types of clients, such as an instant messaging client, a web conference client, a live broadcast client, a browser, etc.
  • the client 410 calls the microphone in the terminal device 400 to collect the audio signal, and encodes the collected audio signal to obtain a code stream.
  • the client 410 can send the code stream to the server 200 through the network 300, so that the server 200 sends the code stream to the second terminal device 500 associated with the recipient (such as the participant of the web conference, the audience, the recipient of the voice call, etc.).
  • the client 510 can decode the code stream to obtain the predicted value (also called estimated value) of the feature vector of the audio signal; then the client 510 can also call the enhancement network to perform label extraction processing on the predicted value of the feature vector to obtain a label information vector for signal enhancement, wherein the dimension of the label information vector is the same as the dimension of the predicted value of the feature vector; then the client 510 can call the synthesis network to reconstruct the signal based on the predicted value of the decoded feature vector and the label information vector obtained after the label extraction processing to obtain the predicted value of the audio signal, thereby completing the reconstruction of the audio signal, and suppressing the noise component contained in the audio signal collected by the encoding end, thereby improving the quality of the reconstructed audio signal.
  • The audio codec method provided in the embodiment of the present application can be widely used in various types of voice or video call application scenarios, such as in-vehicle voice realized by an application running on a vehicle terminal, voice calls or video calls made through an instant messaging client, voice calls in game applications, voice calls in network conference clients, and so on.
  • voice enhancement can be performed according to the audio decoding method provided in the embodiment of the present application at the receiving end of the voice call or the server providing the voice communication service.
  • online conference is an important part of online office.
  • After collecting the voice signal of the speaker, the sound collection device (such as a microphone) needs to send the collected voice signal to the other participants of the online conference. This process involves the transmission and playback of voice signals among multiple participants; if the noise mixed in the voice signal is not processed, it will greatly affect the auditory experience of the conference participants.
  • the audio decoding method provided in the embodiment of the present application can be used to enhance the voice signal in the online conference, so that the voice signal heard by the conference participants is the enhanced voice signal, that is, the noise component in the voice signal collected by the encoding end is suppressed in the reconstructed voice signal, thereby improving the quality of voice calls in the online conference.
  • Cloud Technology refers to a hosting technology that unifies a series of resources such as hardware, software, and network in a wide area network or local area network to achieve data computing, storage, processing, and sharing.
  • Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, and application technology based on the cloud computing business model. It can form a resource pool, which is used on demand and is flexible and convenient. Cloud computing technology will become an important support.
  • the service interaction function between the above servers 200 can be realized through cloud technology.
  • the server 200 shown in FIG. 2 may be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content distribution networks (CDN, Content Delivery Network), and big data and artificial intelligence platforms.
  • the terminal device 400 and the terminal device 500 shown in FIG. 2 may be a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a car terminal, etc., but are not limited thereto.
  • the terminal device (such as the first terminal device 400 and the second terminal device 500) and the server 200 may be directly or indirectly connected via wired or wireless communication, which is not limited in the embodiments of the present application.
  • the terminal device (such as the second terminal device 500) or the server 200 can also implement the audio decoding method provided in the embodiment of the present application by running a computer program.
  • the computer program can be a native program or software module in the operating system; it can be a native application (APP, Application), that is, a program that needs to be installed in the operating system to run, such as a live broadcast APP, a web conference APP, or an instant messaging APP; it can also be a mini program, that is, a program that only needs to be downloaded into a browser environment to run.
  • the above-mentioned computer program can be an application, module or plug-in in any form.
  • FIG. 3 is a schematic diagram of the structure of the second terminal device 500 provided in an embodiment of the present application.
  • the second terminal device 500 shown in Figure 3 includes: at least one processor 520, a memory 560, at least one network interface 530 and a user interface 540.
  • the various components in the second terminal device 500 are coupled together through a bus system 550.
  • the bus system 550 is used to realize the connection and communication between these components.
  • the bus system 550 also includes a power bus, a control bus and a status signal bus.
  • various buses are marked as bus systems 550 in Figure 3.
  • Processor 520 can be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., where the general-purpose processor can be a microprocessor or any conventional processor, etc.
  • the user interface 540 includes one or more output devices 541 that enable presentation of media content, including one or more speakers and/or one or more visual display screens.
  • the user interface 540 also includes one or more input devices 542, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
  • the memory 560 may be removable, non-removable, or a combination thereof.
  • Exemplary hardware devices include solid-state memory, hard drives, optical drives, etc.
  • the memory 560 may optionally include one or more storage devices that are physically remote from the processor 520.
  • the memory 560 includes a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memories.
  • the non-volatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM).
  • the memory 560 described in the embodiment of the present application is intended to include any suitable type of memory.
  • memory 560 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or a subset or superset thereof, as exemplarily described below.
  • Operating system 561 including system programs for processing various basic system services and performing hardware-related tasks, such as framework layer, core library layer, driver layer, etc., for implementing various basic services and processing hardware-based tasks;
  • a network communication module 562 for reaching other computing devices via one or more (wired or wireless) network interfaces 530
  • exemplary network interfaces 530 include: Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB), etc.;
  • a presentation module 563 for enabling presentation of information via one or more output devices 541 (e.g., display screen, speaker, etc.) associated with the user interface 540 (e.g., a user interface for operating peripherals and displaying content and information);
  • the input processing module 564 is used to detect one or more user inputs or interactions from one of the one or more input devices 542 and translate the detected inputs or interactions.
  • the audio decoding device provided in the embodiments of the present application can be implemented in software.
  • Figure 3 shows an audio decoding device 565 stored in the memory 560, which can be software in the form of programs and plug-ins, including the following software modules: an acquisition module 5651, a decoding module 5652, a label extraction module 5653, a reconstruction module 5654 and a determination module 5655. These modules are logical, and therefore can be arbitrarily combined or further split according to the functions implemented. The functions of each module will be explained below.
  • the audio encoding and decoding method provided in the embodiment of the present application will be specifically described below in combination with the exemplary application of the terminal device provided in the embodiment of the present application.
  • Figure 4A is a flow chart of the audio encoding method provided in an embodiment of the present application.
  • the main steps performed at the encoding end include: step 101, obtaining an audio signal; step 102, encoding the audio signal to obtain a bit stream.
  • FIG. 4B is a flowchart of an audio decoding method provided in an embodiment of the present application.
  • the main steps performed at the decoding end include: step 201, obtaining a bit stream; step 202, decoding the bit stream to obtain a predicted value of a feature vector of an audio signal; step 203, performing label extraction processing on the predicted value of the feature vector to obtain a label information vector; step 204, performing signal reconstruction based on the predicted value of the feature vector and the label information vector; step 205, using the predicted value of the audio signal obtained by signal reconstruction as the decoding result of the bit stream.
  • VoIP (Voice over Internet Protocol): voice transmission over the Internet Protocol.
  • FIG. 4C is a flowchart of the audio encoding and decoding method provided in an embodiment of the present application, and will be described in conjunction with the steps shown in FIG. 4C .
  • the steps executed by the terminal device can be executed by a client running on the terminal device.
  • the embodiment of the present application does not make a specific distinction between the terminal device and the client running on the terminal device.
  • the audio encoding and decoding method provided in the embodiment of the present application can be executed by various forms of computer programs running on the terminal device, and is not limited to the client running on the above-mentioned terminal device, but can also be the operating system 561, software module, script and applet described above. Therefore, the example of the client in the following text should not be regarded as a limitation on the embodiment of the present application.
  • FIG5 is a schematic diagram of the structure of the encoding end and the decoding end provided by the embodiment of the present application.
  • the encoding end includes an analysis network for performing feature extraction processing on the input audio signal to obtain the feature vector of the audio signal, and then the feature vector of the audio signal can be quantized and encoded to obtain a bit stream.
  • the decoding end includes an enhancement network and a synthesis network.
  • the enhancement network can be called to perform label extraction processing on the predicted value of the feature vector of the audio signal to obtain a label information vector, and then the synthesis network can be called to perform signal reconstruction based on the label information vector and the predicted value of the feature vector to obtain the predicted value of the audio signal.
  • the audio encoding and decoding method provided in the embodiment of the present application will be specifically described below in combination with the above-mentioned structures of the encoding end and the decoding end.
  • In step 301, a first terminal device obtains an audio signal.
  • the first terminal device responds to an audio collection instruction triggered by the user and calls an audio collection device (such as a built-in microphone in the first terminal device or an external microphone) to collect audio signals to obtain audio signals, such as the voice signal of a speaker in a network conference scenario, the voice signal of a host in a live broadcast scenario, etc.
  • the microphone (or microphone array) of the first terminal device is called to collect the voice signal emitted by the user to obtain the voice signal of the initiator of the online conference.
  • In step 302, the first terminal device encodes the audio signal to obtain a code stream.
  • the first terminal device can encode the audio signal in the following manner to obtain a code stream: first call the analysis network (such as a neural network) to perform feature extraction on the audio signal to obtain a feature vector of the audio signal, then quantize the feature vector of the audio signal (such as vector quantization or scalar quantization) to obtain the index value of the feature vector, and finally encode the index value of the feature vector, such as performing entropy coding on the index value of the feature vector to obtain a code stream.
  • the above-mentioned vector quantization refers to the process of encoding a point in a vector space using a finite subset thereof.
  • the key is the establishment of a codebook (or quantization table) and a codeword search algorithm.
  • the codeword in the codebook that best matches the feature vector of the audio signal can be queried first, and then the index value of the queried codeword can be used as the index value of the feature vector, that is, the index value of the codeword in the codebook that best matches the feature vector of the audio signal is used to replace the feature vector of the audio signal for transmission and storage.
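  • As an illustrative aid, a minimal sketch of the codeword search described above (assuming the codebook is stored as a (K, D) numpy array and Euclidean distance is used as the matching criterion; neither choice is fixed by this description):

```python
import numpy as np

def vq_encode(feature: np.ndarray, codebook: np.ndarray) -> int:
    """Return the index of the codeword that best matches the feature vector.

    feature  : shape (D,)   -- feature vector of the audio signal
    codebook : shape (K, D) -- quantization table holding K codewords
    The returned index value is transmitted/stored instead of the vector itself.
    """
    distances = np.linalg.norm(codebook - feature, axis=1)  # distance to every codeword
    return int(np.argmin(distances))                        # best-matching codeword index
```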
  • the above scalar quantization refers to one-dimensional vector quantization, which divides the entire dynamic range into several small intervals, each of which has a representative value.
  • The signal values that fall into a small interval are uniformly replaced by, that is, quantized to, the representative value of that interval.
  • Because the quantity being quantized is one-dimensional, this is called scalar quantization.
  • For example, when the feature vector falls into small interval 2, the representative value corresponding to small interval 2 can be used as the index value of the feature vector.
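  • For comparison, a minimal sketch of the scalar quantization just described, assuming a uniform division of the dynamic range (the interval count and midpoint representatives are illustrative choices):

```python
import numpy as np

def scalar_quantize(x: float, lo: float, hi: float, levels: int) -> tuple[int, float]:
    """Uniform scalar quantization of a one-dimensional value.

    The dynamic range [lo, hi] is divided into `levels` small intervals; the value
    is mapped to the index of its interval and to that interval's representative
    value (here the interval midpoint).
    """
    step = (hi - lo) / levels
    index = int(np.clip((x - lo) // step, 0, levels - 1))  # which small interval
    representative = lo + (index + 0.5) * step             # representative value
    return index, representative

idx, rep = scalar_quantize(0.37, -1.0, 1.0, 8)             # idx = 5, rep = 0.375
```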
  • the first terminal device can perform feature extraction processing on the audio signal in the following manner to obtain a feature vector of the audio signal: first, perform convolution processing (such as causal convolution) on the audio signal to obtain the convolution feature of the audio signal; then, perform pooling processing on the convolution feature of the audio signal to obtain the pooling feature of the audio signal; then, perform downsampling processing on the pooled feature of the audio signal to obtain the downsampled feature of the audio signal; finally, perform convolution processing on the downsampled feature of the audio signal to obtain the feature vector of the audio signal.
  • pooling is used to reduce the feature dimension of the convolution layer output, reduce network parameters and computational costs, and reduce overfitting.
  • Pooling includes maximum pooling (Max Pooling) and average pooling (Average Pooling).
  • Maximum pooling refers to taking the point with the largest value in the local receptive field, that is, reducing the amount of data by the maximum value; average pooling refers to taking the average value of the local receptive field.
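  • A minimal PyTorch-style sketch of the analysis-network pipeline described above (causal convolution, pooling, downsampling, convolution); the channel counts, kernel sizes and the choice of average pooling are assumptions for illustration, not values fixed by this description:

```python
import torch
import torch.nn as nn

class AnalysisNet(nn.Module):
    """Sketch: causal convolution -> pooling -> downsampling -> convolution."""
    def __init__(self, channels: int = 64, feat_dim: int = 56):
        super().__init__()
        self.causal_conv = nn.Conv1d(1, channels, kernel_size=5)            # convolution feature
        self.pool = nn.AvgPool1d(kernel_size=2)                             # pooling feature
        self.down = nn.Conv1d(channels, channels, kernel_size=4, stride=4)  # downsampled feature
        self.out_conv = nn.Conv1d(channels, feat_dim, kernel_size=1)        # feature vector

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, 1, samples); left padding keeps the convolution causal
        x = self.causal_conv(nn.functional.pad(audio, (4, 0)))
        x = self.pool(x)
        x = self.down(x)
        return self.out_conv(x)                    # (batch, feat_dim, frames)

feature = AnalysisNet()(torch.randn(1, 1, 320))    # e.g. one 20 ms frame at 16 kHz -> (1, 56, 40)
```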
  • the first terminal device may also encode the audio signal to obtain a code stream in the following manner: decompose the collected audio signal, for example, decompose it through a 2-channel QMF analysis filter group to obtain a low-frequency sub-band signal and a high-frequency sub-band signal; then perform feature extraction on the low-frequency sub-band signal and the high-frequency sub-band signal respectively to obtain a feature vector of the low-frequency sub-band signal and a feature vector of the high-frequency sub-band signal respectively; then perform quantization encoding on the feature vector of the low-frequency sub-band signal to obtain a low-frequency code stream of the audio signal, and perform quantization encoding on the feature vector of the high-frequency sub-band signal to obtain a high-frequency code stream of the audio signal.
  • In this way, the information loss caused by compression can be effectively reduced.
  • the first terminal device can decompose the audio signal in the following manner to obtain a low-frequency sub-band signal and a high-frequency sub-band signal: first, the audio signal is sampled to obtain a sampled signal, wherein the sampled signal includes a plurality of sample points obtained by acquisition; then, the sampled signal is low-pass filtered to obtain a low-pass filtered signal; then, the low-pass filtered signal is down-sampled to obtain a low-frequency sub-band signal. Similarly, the sampled signal is high-pass filtered to obtain a high-pass filtered signal, and the high-pass filtered signal is down-sampled to obtain a high-frequency sub-band signal.
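  • A simplified numpy/scipy sketch of this two-band split is shown below (illustrative only: generic FIR low-pass and high-pass prototypes are assumed rather than a properly designed QMF pair); the filtering and downsampling operations it uses are described in more detail in the following paragraphs:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def two_band_split(x: np.ndarray, num_taps: int = 33):
    """Split a sampled signal into low- and high-frequency sub-band signals.

    Low-pass filter then downsample by 2 -> low-frequency sub-band signal;
    high-pass filter then downsample by 2 -> high-frequency sub-band signal.
    """
    lp = firwin(num_taps, 0.5)                   # low-pass prototype, cutoff at half of Nyquist
    hp = firwin(num_taps, 0.5, pass_zero=False)  # mirrored high-pass prototype
    low = lfilter(lp, 1.0, x)[::2]               # keep every other sample
    high = lfilter(hp, 1.0, x)[::2]
    return low, high
```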
  • The above-mentioned low-pass filtering is a filtering method whose rule is that low-frequency signals can pass normally, while high-frequency signals exceeding a set threshold are blocked and attenuated.
  • Low-pass filtering can be simply understood as follows: a frequency point is set, and signal components whose frequency is higher than this point cannot pass. In digital signals, this frequency point is the cutoff frequency; all frequency-domain components above the cutoff frequency are set to 0. Because all low-frequency signals are allowed to pass in this process, it is called low-pass filtering.
  • The above-mentioned high-pass filtering is a filtering method whose rule is that high-frequency signals can pass normally, while low-frequency signals below a set threshold are blocked and attenuated.
  • High-pass filtering can be simply understood as follows: a frequency point is set, and signal components whose frequency is lower than this point cannot pass. In digital signals, this frequency point is also called the cutoff frequency; all frequency-domain components below the cutoff frequency are set to 0. Because all high-frequency signals are allowed to pass in this process, it is called high-pass filtering.
  • the above-mentioned downsampling processing is a method for reducing the number of sampling points.
  • For example, downsampling can be performed by taking every other sample point to obtain a low-frequency sub-band signal.
  • Alternatively, one sample can be selected every 3 positions, that is, the 1st sample point, the 4th sample point, the 7th sample point, and so on are selected, so as to obtain a low-frequency sub-band signal.
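  • A short numpy illustration of this index-based downsampling (the sample values are made up for the example):

```python
import numpy as np

samples = np.arange(1, 13)   # 12 filtered sample points: 1, 2, ..., 12
every_other = samples[::2]   # keep the 1st, 3rd, 5th, ... sample -> [1 3 5 7 9 11]
every_third = samples[::3]   # keep the 1st, 4th, 7th, ... sample -> [1 4 7 10]
```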
  • the first terminal device can also encode the audio signal in the following manner to obtain a code stream: first, the collected audio signal is decomposed to obtain N sub-band signals, where N is an integer greater than 2; then, each sub-band signal is subjected to feature extraction processing to obtain a feature vector of each sub-band signal. For example, for each sub-band signal obtained by decomposition, a neural network model can be called to perform feature extraction processing to obtain a feature vector of the sub-band signal; then, the feature vector of each sub-band signal is subjected to quantization encoding processing to obtain N sub-code streams.
  • the above-mentioned decomposition and processing of the collected audio signal to obtain N sub-band signals can be achieved in the following manner: for example, the decomposition and processing can be performed through a 4-channel QMF analysis filter group to obtain 4 sub-band signals.
  • the audio signal can first be low-pass filtered and high-pass filtered to obtain a low-frequency sub-band signal and a high-frequency sub-band signal. Then, the low-frequency sub-band signal can be low-pass filtered and high-pass filtered again to obtain sub-band signal 1 and sub-band signal 2 accordingly; similarly, the high-frequency sub-band signal obtained by decomposition can be low-pass filtered and high-pass filtered again to obtain sub-band signal 3 and sub-band signal 4 accordingly.
  • the audio signal can be decomposed into 4 sub-band signals by iterating two layers of 2-channel QMF analysis filtering.
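  • Continuing the earlier two-band sketch, iterating the 2-channel split over two layers yields the 4 sub-band signals (again a purely illustrative sketch with assumed FIR prototypes):

```python
import numpy as np
from scipy.signal import firwin, lfilter

def two_band_split(x: np.ndarray, num_taps: int = 33):
    lp, hp = firwin(num_taps, 0.5), firwin(num_taps, 0.5, pass_zero=False)
    return lfilter(lp, 1.0, x)[::2], lfilter(hp, 1.0, x)[::2]

def four_band_split(x: np.ndarray):
    """Two layers of 2-channel analysis filtering -> 4 sub-band signals."""
    low, high = two_band_split(x)         # first layer: low/high split
    band1, band2 = two_band_split(low)    # split the low-frequency sub-band again
    band3, band4 = two_band_split(high)   # split the high-frequency sub-band again
    return band1, band2, band3, band4
```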
  • In step 303, the first terminal device sends the code stream to the server.
  • the code stream can be sent to the server through the network.
  • In step 304, the server sends the code stream to the second terminal device.
  • After receiving the code stream sent by the first terminal device (i.e., the encoding end, such as the terminal device associated with the initiator of the online conference), the server can send the code stream through the network to the second terminal device (i.e., the decoding end, such as the terminal device associated with the participants of the online conference).
  • a transcoder can be deployed in the server to solve the interconnection problem between the new encoder (i.e., an encoder that encodes based on artificial intelligence, such as an NN encoder) and the traditional encoder (i.e., an encoder that encodes based on the transformation of the time domain and the frequency domain, such as a G.722 encoder).
  • If the new NN encoder is deployed in the first terminal device (i.e., the transmitting end) while a traditional decoder (e.g., a G.722 decoder) is deployed in the second terminal device, the second terminal device will not be able to correctly decode the code stream sent by the first terminal device.
  • a transcoder can be deployed in the server. For example, after receiving the code stream encoded based on the NN encoder sent by the first terminal device, the server can first call the NN decoder to generate an audio signal, and then call the traditional encoder (e.g., a G.722 encoder) to generate a specific code stream. In this way, the second terminal device can decode correctly. In other words, it can avoid the problem that the decoding end cannot correctly decode due to the inconsistency of the encoder deployed at the encoding end and the decoder deployed at the decoding end, thereby improving the compatibility in the encoding and decoding process.
  • In step 305, the second terminal device decodes the code stream to obtain a predicted value of the feature vector of the audio signal.
  • The second terminal device can implement step 305 in the following manner: first, decode the code stream to obtain the index value of the feature vector of the audio signal; then query the quantization table based on the index value to obtain the predicted value of the feature vector of the audio signal. For example, when the encoding end replaces the feature vector of the audio signal with the index value of the best-matching codeword in the quantization table for subsequent encoding, the decoding end can, after decoding the code stream to obtain the index value, perform a simple table lookup operation based on that index value and thus obtain the predicted value of the feature vector of the audio signal.
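  • The corresponding decoder-side lookup can be sketched in one line (mirroring the illustrative codebook layout assumed in the earlier encoder sketch):

```python
import numpy as np

def vq_decode(index: int, codebook: np.ndarray) -> np.ndarray:
    """Table lookup: the decoded index value selects the codeword that serves
    as the predicted value of the feature vector."""
    return codebook[index]
```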
  • the decoding process is an inverse process of the encoding process.
  • when the encoding end uses entropy coding to encode the feature vector of the audio signal to obtain a bit stream,
  • the decoding end can correspondingly use entropy decoding to decode the received bit stream to obtain the index value of the feature vector of the audio signal.
  • the second terminal device can also implement the above step 305 in the following manner: decode the low-frequency code stream to obtain the predicted value of the feature vector of the low-frequency sub-band signal; decode the high-frequency code stream to obtain the predicted value of the feature vector of the high-frequency sub-band signal, wherein the low-frequency code stream is obtained by encoding the low-frequency sub-band signal obtained after the audio signal is decomposed, and the high-frequency code stream is obtained by encoding the high-frequency sub-band signal obtained after the audio signal is decomposed.
  • the decoding end when the encoding end encodes the feature vector of the low-frequency sub-band signal by entropy encoding, the decoding end can decode the low-frequency code stream by a corresponding entropy decoding method.
  • the second terminal device may first decode the low-frequency code stream to obtain the index value of the feature vector of the low-frequency sub-band signal (assuming the index value is 1), and then query the quantization table based on the index value 1 to obtain the predicted value of the feature vector of the low-frequency sub-band signal.
  • the second terminal device may first decode the high-frequency code stream to obtain the index value of the feature vector of the high-frequency sub-band signal (assuming the index value is 2), and then query the quantization table based on the index value 2 to obtain the predicted value of the feature vector of the high-frequency sub-band signal.
  • the second terminal device can also implement the above step 305 in the following manner: respectively decode the N sub-code streams to obtain the predicted values of the feature vectors corresponding to the N sub-band signals.
  • the decoding process for the N sub-code streams here can be implemented by referring to the decoding process for the low-frequency code stream or the high-frequency code stream above, and the embodiments of the present application will not be repeated here.
  • For example, take the N sub-code streams to be 4 sub-streams, namely sub-stream 1, sub-stream 2, sub-stream 3 and sub-stream 4, where sub-stream 1 is obtained by encoding sub-band signal 1, sub-stream 2 is obtained by encoding sub-band signal 2, sub-stream 3 is obtained by encoding sub-band signal 3, and sub-stream 4 is obtained by encoding sub-band signal 4.
  • the second terminal device can decode the 4 sub-streams respectively to obtain the predicted values of the feature vectors corresponding to the 4 sub-band signals, for example, including the predicted value of the feature vector of sub-band signal 1, the predicted value of the feature vector of sub-band signal 2, the predicted value of the feature vector of sub-band signal 3 and the predicted value of the feature vector of sub-band signal 4.
  • In step 306, the second terminal device performs label extraction processing on the predicted value of the feature vector to obtain a label information vector.
  • the label information vector is used for signal enhancement.
  • the dimension of the label information vector is the same as the dimension of the predicted value of the feature vector, so that, when performing subsequent signal reconstruction, the predicted value of the feature vector and the label information vector can be spliced together, thereby achieving the effect of enhancing the reconstructed audio signal by increasing the proportion of the core components. That is to say, by combining the predicted value of the feature vector and the label information vector for signal reconstruction, all core components in the reconstructed audio signal can be enhanced, thereby improving the quality of the reconstructed audio signal.
  • the second terminal device can perform label extraction processing on the predicted value of the feature vector by calling the enhanced network to obtain a label information vector, wherein the enhanced network includes a convolutional layer, a neural network layer, a fully connected network layer and an activation layer.
  • the process of extracting the label information vector is explained in combination with the above structure of the enhanced network.
  • FIG. 6A is a flow chart of the audio decoding method provided in an embodiment of the present application.
  • step 306 shown in Figure 4C can be implemented by steps 3061 to 3064 shown in Figure 6A, which will be explained in conjunction with the steps shown in Figure 6A.
  • In step 3061, the second terminal device performs convolution processing on the predicted value of the feature vector to obtain a first tensor of the same dimension as the predicted value of the feature vector.
  • The second terminal device can use the predicted value of the feature vector obtained in step 305 as input and call the convolution layer included in the enhancement network (for example, a one-dimensional causal convolution) to generate a first tensor of the same dimension as the predicted value of the feature vector (a tensor is a quantity containing values in multiple dimensions). For example, as shown in Figure 13, when the dimension of the predicted value of the feature vector is 56×1, a 56×1 tensor is generated after the causal convolution processing.
  • In step 3062, the second terminal device performs feature extraction processing on the first tensor to obtain a second tensor of the same dimension as the first tensor.
  • a neural network layer (such as a long short-term memory network, a time recursive neural network, etc.) included in the enhanced network can be used for feature extraction processing to generate a second tensor with the same dimension as the first tensor.
  • For example, continuing the example of Figure 13, when the dimension of the first tensor is 56×1, the long short-term memory (LSTM) layer generates a 56×1 second tensor.
  • In step 3063, the second terminal device performs full connection processing on the second tensor to obtain a third tensor of the same dimension as the second tensor.
  • the second terminal device can call the fully connected network layer included in the enhanced network to perform full connection processing on the second tensor to obtain a third tensor of the same dimension as the second tensor.
  • For example, when the dimension of the second tensor is 56×1, a 56×1 third tensor is generated after the full connection processing.
  • In step 3064, the second terminal device performs activation processing on the third tensor to obtain a label information vector.
  • the second terminal device can call the activation layer included in the enhanced network, that is, the activation function (such as ReLU function, Sigmoid function, Tanh function, etc.) to activate the third tensor, so that a label information vector of the same dimension as the predicted value of the feature vector is generated.
  • For example, when the dimension of the third tensor is 56×1, a 56×1 label information vector is obtained.
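  • Putting steps 3061 to 3064 together, a minimal PyTorch sketch of such an enhancement network is shown below; the 56-dimensional size follows the example above, while the kernel size and the choice of Sigmoid as the activation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EnhancementNet(nn.Module):
    """Sketch: convolution -> LSTM -> fully connected layer -> activation,
    producing a label information vector with the same dimension as the input."""
    def __init__(self, dim: int = 56):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=1)    # step 3061: first tensor
        self.lstm = nn.LSTM(dim, dim, batch_first=True)   # step 3062: second tensor
        self.fc = nn.Linear(dim, dim)                     # step 3063: third tensor
        self.act = nn.Sigmoid()                           # step 3064: label information vector

    def forward(self, feat_pred: torch.Tensor) -> torch.Tensor:
        # feat_pred: (batch, dim, frames), the predicted value of the feature vector
        x = self.conv(feat_pred)                 # (batch, dim, frames)
        x, _ = self.lstm(x.transpose(1, 2))      # run the LSTM over frames: (batch, frames, dim)
        x = self.act(self.fc(x))                 # per-frame label information vector in [0, 1]
        return x.transpose(1, 2)                 # back to (batch, dim, frames)

label_info = EnhancementNet()(torch.randn(1, 56, 40))   # same shape as the input: (1, 56, 40)
```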
  • the second terminal device can also implement the above-mentioned step 306 in the following manner: perform label extraction processing on the predicted value of the feature vector of the low-frequency sub-band signal to obtain a first label information vector, wherein the dimension of the first label information vector is the same as the dimension of the predicted value of the feature vector of the low-frequency sub-band signal, and is used for signal enhancement of the low-frequency sub-band signal; perform label extraction processing on the predicted value of the feature vector of the high-frequency sub-band signal to obtain a second label information vector, wherein the dimension of the second label information vector is the same as the dimension of the predicted value of the feature vector of the high-frequency sub-band signal, and is used for signal enhancement of the high-frequency sub-band signal.
  • the second terminal device can implement the above-mentioned label extraction processing on the predicted value of the feature vector of the low-frequency subband signal to obtain the first label information vector by calling the first enhanced network to perform the following processing: convolution processing on the predicted value of the feature vector of the low-frequency subband signal to obtain a fourth tensor of the same dimension as the predicted value of the feature vector of the low-frequency subband signal; feature extraction processing on the fourth tensor to obtain a fifth tensor of the same dimension as the fourth tensor; fully connect processing on the fifth tensor to obtain a sixth tensor of the same dimension as the fifth tensor; activation processing on the sixth tensor to obtain the first label information vector.
  • the second terminal device can implement the above-mentioned label extraction processing on the predicted value of the feature vector of the high-frequency sub-band signal to obtain the second label information vector in the following manner: call the second enhanced network to perform the following processing: perform convolution processing on the predicted value of the feature vector of the high-frequency sub-band signal to obtain a seventh tensor of the same dimension as the predicted value of the feature vector of the high-frequency sub-band signal; perform feature extraction processing on the seventh tensor to obtain an eighth tensor of the same dimension as the seventh tensor; perform full connection processing on the eighth tensor to obtain a ninth tensor of the same dimension as the eighth tensor; perform activation processing on the ninth tensor to obtain the second label information vector.
  • the label extraction process for the predicted value of the feature vector of the low-frequency sub-band signal and the label extraction process for the predicted value of the feature vector of the high-frequency sub-band signal are similar to the label extraction process for the predicted value of the feature vector of the audio signal, and can be implemented with reference to the description of Figure 6A, and the embodiments of the present application are not repeated here.
  • the structures of the first enhancement network and the second enhancement network are similar to the structures of the enhancement networks described above, and the embodiments of the present application are not repeated here.
  • the second terminal device can implement the above-mentioned step 306 in the following manner: perform label extraction processing on the predicted values of the feature vectors corresponding to the N sub-band signals, respectively, to obtain N label information vectors, wherein the dimension of each label information vector is the same as the dimension of the predicted value of the feature vector of the corresponding sub-band signal.
  • the second terminal device can implement the above-mentioned label extraction processing for the predicted values of the feature vectors corresponding to the N sub-band signals respectively to obtain N label information vectors in the following manner: based on the predicted value of the feature vector of the i-th sub-band signal, call the i-th enhancement network to perform label extraction processing to obtain the i-th label information vector; wherein the value range of i satisfies 1 ⁇ i ⁇ N, and the dimension of the i-th label information vector is the same as the dimension of the predicted value of the feature vector of the i-th sub-band signal, and is used for signal enhancement of the i-th sub-band signal.
  • The second terminal device can implement the above-mentioned calling of the i-th enhancement network, based on the predicted value of the feature vector of the i-th sub-band signal, to perform label extraction processing and obtain the i-th label information vector in the following manner: call the i-th enhancement network to perform the following processing: performing convolution processing on the predicted value of the feature vector of the i-th sub-band signal to obtain a tenth tensor of the same dimension as the predicted value of the feature vector of the i-th sub-band signal; performing feature extraction processing on the tenth tensor to obtain an eleventh tensor of the same dimension as the tenth tensor; performing full connection processing on the eleventh tensor to obtain a twelfth tensor of the same dimension as the eleventh tensor; and performing activation processing on the twelfth tensor to obtain the i-th label information vector.
  • the structure of the i-th enhanced network is similar to the structure of the enhanced network mentioned above, and will not be described in detail in the embodiments of the present application.
  • step 307 the second terminal device performs signal reconstruction based on the predicted value of the feature vector and the label information vector to obtain the predicted value of the audio signal.
  • the second terminal device can implement step 307 in the following manner: concatenate the predicted value of the feature vector and the label information vector to obtain a concatenated vector; compress the concatenated vector to obtain a predicted value of the audio signal, wherein the compression process can be implemented by one or more cascades of convolution, upsampling, and pooling, for example, it can be implemented by the following steps 3072 to 3075, and the predicted value of the audio signal includes the predicted values corresponding to the frequency, wavelength, amplitude and other parameters of the audio signal.
  • the second terminal device can call the synthesis network to perform signal reconstruction based on the predicted value of the feature vector and the label information vector to obtain the predicted value of the audio signal, wherein the synthesis network includes a first convolutional layer, an upsampling layer, a pooling layer, and a second convolutional layer.
  • the signal reconstruction process is explained in combination with the above structure of the synthesis network.
  • FIG. 6B is a flow chart of the audio decoding method provided in an embodiment of the present application.
  • step 307 shown in Figure 4C can be implemented by steps 3071 to 3075 shown in Figure 6B, which will be explained in conjunction with the steps shown in Figure 6B.
  • step 3071 the second terminal device concatenates the predicted value of the feature vector and the label information vector to obtain a concatenated vector.
  • the second terminal device may concatenate the predicted value of the feature vector obtained based on step 305 and the label information vector obtained based on step 306 to obtain a concatenated vector, and use the concatenated vector as an input of a synthesis network to perform signal reconstruction.
  • step 3072 the second terminal device performs a first convolution process on the concatenated vector to obtain a convolution feature of the audio signal.
  • the second terminal device may call the first convolution layer (e.g., a one-dimensional causal convolution) included in the synthesis network to perform convolution on the concatenated vector to obtain the convolution feature of the audio signal.
  • as shown in FIG. 14, after the concatenated vector is causally convolved, a tensor with a dimension of 192 × 1 is obtained (i.e., the convolution feature of the audio signal).
  • step 3073 the second terminal device upsamples the convolution feature to obtain the upsampled feature of the audio signal.
  • the second terminal device may call the upsampling layer included in the synthesis network to perform upsampling processing on the convolution features of the audio signal, wherein the upsampling processing may be implemented through multiple cascaded decoding layers, and the sampling factors of different decoding layers are different.
  • the second terminal device may perform upsampling processing on the convolution features of the audio signal in the following manner to obtain the upsampling features of the audio signal: upsampling processing is performed on the convolution features through the first decoding layer in multiple cascaded decoding layers; the upsampling result of the first decoding layer is output to the subsequent cascaded decoding layers, and the upsampling processing and upsampling result output are continued through the subsequent cascaded decoding layers until it is output to the last decoding layer; the upsampling result output by the last decoding layer is used as the upsampling feature of the audio signal.
  • the above-mentioned upsampling process is a method for increasing the dimension of the convolution feature of the audio signal.
  • the convolution feature of the audio signal can be upsampled by interpolation (such as bilinear interpolation) to obtain the upsampled feature of the audio signal, wherein the dimension of the upsampled feature is greater than the dimension of the convolution feature. That is to say, the dimension of the convolution feature can be increased by upsampling.
  • the Up_factor of the three decoding layers is set to 8, 5, and 4, respectively, which is equivalent to setting pooling factors of different sizes, which plays the role of upsampling.
  • the number of channels of the three decoding layers is set to 96, 48, and 24, respectively. In this way, after upsampling through three decoding layers, the convolutional features of the audio signal (e.g., a 192 × 1 tensor) will be converted into 96 × 8, 48 × 40, and 24 × 160 tensors respectively, and the 24 × 160 tensor can be used as the upsampled features of the audio signal.
  • step 3074 the second terminal device performs pooling processing on the upsampled features to obtain pooled features of the audio signal.
  • the second terminal device can call the pooling layer in the synthesis network to perform pooling on the upsampled features, for example, performing a pooling operation on the upsampled features with a factor of 2 to obtain the pooled features of the audio signal.
  • the upsampled features of the audio signal are a 24 × 160 tensor. After pooling (i.e., the post-processing shown in Figure 14), a 24 × 320 tensor (i.e., the pooled features of the audio signal) is generated.
  • step 3075 the second terminal device performs a second convolution process on the pooled features to obtain a predicted value of the audio signal.
  • the second terminal device can also call the second convolutional layer included in the synthesis network for the pooled features of the audio signal, for example, calling the causal convolution shown in Figure 14, and perform a dilated convolution on the pooled features to generate a predicted value of the audio signal.
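  • to make the dimension flow of steps 3072 to 3075 concrete, the following is a minimal Python (PyTorch) sketch of a synthesis network with the tensor sizes given above (112-dimensional concatenated vector → 192 × 1 → 96 × 8 → 48 × 40 → 24 × 160 → 24 × 320 → 1 × 320). It only illustrates the structure described in this embodiment, not the trained network itself: the causal convolutions are approximated by "same"-padded convolutions, nearest-neighbour upsampling stands in for the pooling-based upsampling of the decoding layers, and the kernel sizes, dilation rate, and ReLU activations are assumptions.

    import torch
    import torch.nn as nn

    class DecoderBlock(nn.Module):
        # one decoding layer: upsample by Up_factor, then a dilated convolution
        def __init__(self, in_ch, out_ch, up_factor, dilation=3):
            super().__init__()
            self.up = nn.Upsample(scale_factor=up_factor, mode="nearest")
            self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3,
                                  dilation=dilation, padding="same")
            self.act = nn.ReLU()
        def forward(self, x):
            return self.act(self.conv(self.up(x)))

    class SynthesisNet(nn.Module):
        def __init__(self, in_dim=112):
            super().__init__()
            self.conv_in = nn.Conv1d(in_dim, 192, kernel_size=1)              # (B, 192, 1)
            self.blocks = nn.Sequential(
                DecoderBlock(192, 96, up_factor=8),                           # (B, 96, 8)
                DecoderBlock(96, 48, up_factor=5),                            # (B, 48, 40)
                DecoderBlock(48, 24, up_factor=4),                            # (B, 24, 160)
            )
            self.post = nn.Upsample(scale_factor=2, mode="nearest")           # (B, 24, 320)
            self.conv_out = nn.Conv1d(24, 1, kernel_size=3, padding="same")   # (B, 1, 320)
        def forward(self, concat_vec):            # concat_vec: (B, 112) spliced vector
            x = concat_vec.unsqueeze(-1)          # (B, 112, 1)
            return self.conv_out(self.post(self.blocks(self.conv_in(x))))

    print(SynthesisNet()(torch.randn(1, 112)).shape)   # torch.Size([1, 1, 320])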
  • the second terminal device can also implement the above-mentioned step 307 in the following manner: splicing the predicted value of the feature vector of the low-frequency subband signal and the first label information vector (i.e., the label information vector obtained by performing label extraction processing on the predicted value of the feature vector of the low-frequency subband signal) to obtain a first splicing vector; calling the first synthesis network to perform signal reconstruction based on the first splicing vector to obtain the predicted value of the low-frequency subband signal; splicing the predicted value of the feature vector of the high-frequency subband signal and the second label information vector (i.e., the label information vector obtained by performing label extraction processing on the predicted value of the feature vector of the high-frequency subband signal) to obtain a second splicing vector; calling the second synthesis network to perform signal reconstruction based on the second splicing vector to obtain the predicted value of the high-frequency subband signal; and synthesizing the predicted value of the low-frequency subband signal and the predicted value of the high-frequency subband signal to obtain the predicted value of the audio signal.
  • the second terminal device can implement the above-mentioned calling the first synthesis network based on the first splicing vector to perform signal reconstruction and obtain the predicted value of the low-frequency sub-band signal in the following manner: calling the first synthesis network to perform the following processing: performing a first convolution processing on the first splicing vector to obtain the convolution feature of the low-frequency sub-band signal; performing an upsampling processing on the convolution feature of the low-frequency sub-band signal to obtain the upsampling feature of the low-frequency sub-band signal; performing a pooling processing on the upsampling feature of the low-frequency sub-band signal to obtain the pooling feature of the low-frequency sub-band signal; performing a second convolution processing on the pooling feature of the low-frequency sub-band signal to obtain the predicted value of the low-frequency sub-band signal; wherein the upsampling processing can be implemented by multiple cascaded decoding layers, and the sampling factors of different decoding layers are different.
  • the second terminal device can implement the above-mentioned calling the second synthesis network to perform signal reconstruction based on the second splicing vector to obtain the predicted value of the high-frequency sub-band signal in the following manner: calling the second synthesis network to perform the following processing: performing a first convolution processing on the second splicing vector to obtain the convolution features of the high-frequency sub-band signal; upsampling the convolution features of the high-frequency sub-band signal to obtain the upsampled features of the high-frequency sub-band signal; pooling the upsampled features of the high-frequency sub-band signal to obtain the pooled features of the high-frequency sub-band signal; performing a second convolution on the pooled features of the high-frequency sub-band signal to obtain a predicted value of the high-frequency sub-band signal; wherein the upsampling process can be implemented through multiple cascaded decoding layers, and the sampling factors of different decoding layers are different.
  • the reconstruction process of the low-frequency subband signal (i.e., the process of generating the predicted value of the low-frequency subband signal) and the reconstruction process of the high-frequency subband signal (i.e., the process of generating the predicted value of the high-frequency subband signal) are similar to the signal reconstruction process of the audio signal described above, and can be implemented with reference to the description of Figure 6B, which is not repeated here in the embodiments of the present application.
  • the structures of the first synthesis network and the second synthesis network are similar to the structures of the synthesis networks described above, and the embodiments of the present application will not be repeated here.
  • the second terminal device can also implement the above-mentioned step 307 in the following manner: one-to-one splicing processing is performed on the predicted values of the feature vectors corresponding to the N sub-band signals and the N label information vectors to obtain N splicing vectors; based on the j-th splicing vector, the j-th synthesis network is called to perform signal reconstruction to obtain the predicted value of the j-th sub-band signal; wherein the value range of j satisfies 1 ≤ j ≤ N; the predicted values corresponding to the N sub-band signals are synthesized to obtain the predicted value of the audio signal.
  • the second terminal device can implement the above-mentioned calling the jth synthesis network to perform signal reconstruction based on the jth splicing vector to obtain the predicted value of the jth subband signal in the following manner: calling the jth synthesis network to perform the following processing: performing a first convolution processing on the jth splicing vector to obtain the convolution feature of the jth subband signal; performing upsampling processing on the convolution feature of the jth subband signal to obtain the upsampling feature of the jth subband signal; performing pooling processing on the upsampling feature of the jth subband signal to obtain the pooling feature of the jth subband signal; performing a second convolution processing on the pooling feature of the jth subband signal to obtain the predicted value of the jth subband signal; wherein the upsampling processing can be implemented by multiple cascaded decoding layers, and the sampling factors of different decoding layers are different.
  • the structure of the jth synthetic network is similar to the structure of the synthetic network mentioned above, and will not be described in detail in the embodiments of the present application.
  • step 308 the second terminal device uses the predicted value of the audio signal obtained by signal reconstruction as the decoding result of the code stream.
  • the second terminal device can use the predicted value of the audio signal obtained through signal reconstruction as the decoding result of the code stream, and send the decoding result to the built-in speaker of the second terminal device for playback.
  • the audio decoding method provided in the embodiment of the present application performs label extraction processing on the predicted value of the decoded feature vector to obtain a label information vector, and combines the predicted value of the feature vector and the label information vector to perform signal reconstruction. Since the label information vector reflects the core components of the audio signal (i.e., does not include acoustic interference such as noise), compared to signal reconstruction based solely on the predicted value of the feature vector, the embodiment of the present application combines the predicted value of the feature vector and the label information vector to perform signal reconstruction, which is equivalent to increasing the proportion of core components (such as human voices) in the audio signal and reducing the proportion of acoustic interference such as noise (such as background sound) in the audio signal, thereby effectively suppressing the noise components included in the audio signal collected by the encoding end, thereby improving the quality of the reconstructed audio signal.
  • VoIP conference system is taken as an example to illustrate an exemplary application of the embodiment of the present application in an actual application scenario.
  • Figure 7 is a schematic diagram of an end-to-end voice communication link provided by an embodiment of the present application.
  • the audio encoding method provided by the embodiment of the present application can be applied at the encoding end (i.e., the transmitting end of the code stream), and the audio decoding method provided by the embodiment of the present application can be applied at the decoding end (i.e., the receiving end of the code stream).
  • This is the core part of a communication system such as a conference system, and it implements the basic function of compression.
  • the encoder is deployed on the uplink client and the decoder is deployed on the downlink client.
  • a transcoder needs to be deployed in the server to solve the interoperability problem between the new encoder and the encoder of related technologies. For example, if the sender deploys a new NN encoder and the receiver deploys a traditional public switched telephone network (PSTN) decoder (such as a G.722 decoder), the receiver will not be able to correctly decode the code stream sent directly by the sender. Therefore, after receiving the code stream sent by the sender, the server first needs to execute the NN decoder to generate a voice signal, and then call the G.722 encoder to generate a specific code stream so that the receiver can decode it correctly. Similar transcoding scenarios are no longer expanded.
  • Figure 8 is a flow chart of the audio encoding and decoding method provided by the embodiment of the present application.
  • the main steps of the encoding end include: for the input signal, such as the nth frame speech signal, denoted as x(n), calling the analysis network to perform feature extraction processing to obtain a low-dimensional feature vector, denoted as F(n); in particular, the dimension of the feature vector F(n) is smaller than the dimension of the input signal x(n), thereby reducing the amount of data.
  • a specific implementation may be calling a dilated convolutional network (Dilated CNN) to perform feature extraction processing on the nth frame speech signal x(n) to generate a lower-dimensional feature vector F(n).
  • the embodiments of the present application do not exclude other NN structures, including but not limited to autoencoders (AE, Autoencoder), fully connected (FC, Full-Connection) networks, long short-term memory (LSTM, Long Short-Term Memory) networks, convolutional neural networks (CNN, Convolutional Neural Network) + LSTM, etc.
  • the feature vector F(n) can be vector quantized or scalar quantized, and the index value obtained after quantization can be entropy encoded to obtain a bitstream, and finally the bitstream is transmitted to the decoding end.
  • the main steps of the decoding end include: decoding the received bitstream to obtain an estimated value of the feature vector, denoted as F'(n). Then, based on the estimated value of the feature vector F'(n), the enhancement network is called to generate a label information vector for enhancement, denoted as E(n). Finally, the estimated value of the feature vector F'(n) and the label information vector E(n) are combined to call the synthesis network (corresponding to the inverse process of the encoding end) to reconstruct the signal, and suppress the noise components contained in the speech signal collected by the encoding end, and generate a signal estimation value corresponding to the input signal x(n), denoted as x'(n).
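  • the end-to-end data flow of the encoding end and the decoding end described above can be summarized with the following toy Python sketch. The fixed random linear maps, the quantization step size, and the tanh "enhancement" are stand-ins chosen only so that the 320 → 56 → 320 dimension flow can be executed; in this embodiment these stages are the trained analysis network, enhancement network, and synthesis network, and the quantization indices would additionally be entropy coded.

    import numpy as np

    # Toy stand-ins for the analysis / enhancement / synthesis networks.
    rng = np.random.default_rng(0)
    W_analysis = rng.standard_normal((56, 320)) * 0.05    # stands in for the analysis network
    W_synthesis = rng.standard_normal((320, 112)) * 0.05  # stands in for the synthesis network
    STEP = 0.1                                            # assumed scalar quantization step

    def encode(x):                       # x: one frame of 320 samples, i.e. x(n)
        F = W_analysis @ x               # feature vector F(n), 56-dimensional
        return np.round(F / STEP).astype(np.int32)   # quantization indices (entropy coding omitted)

    def decode(idx):
        F_hat = idx * STEP               # estimated feature vector F'(n)
        E = np.tanh(F_hat)               # stand-in for the label information vector E(n)
        concat = np.concatenate([F_hat, E])          # 112-dimensional spliced vector
        return W_synthesis @ concat      # reconstructed frame x'(n), 320 samples

    frame = rng.standard_normal(320)
    print(decode(encode(frame)).shape)   # (320,)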
  • the dilated convolutional network and the QMF filter group are first introduced.
  • Figure 9A is a schematic diagram of ordinary convolution provided in an embodiment of the present application
  • Figure 9B is a schematic diagram of dilated convolution provided in an embodiment of the present application.
  • dilated convolution was proposed to increase the receptive field while keeping the size of the feature map unchanged, thereby avoiding the errors caused by upsampling and downsampling.
  • the convolution kernel size (Kernel size) shown in Figures 9A and 9B is 3 × 3; however, the receptive field of the ordinary convolution shown in Figure 9A is only 3, while the receptive field of the dilated convolution shown in Figure 9B reaches 5.
  • the receptive field of the ordinary convolution shown in Figure 9A is 3, and the dilation rate (Dilation rate) is 1; while the receptive field of the dilated convolution shown in Figure 9B is 5, and the dilation rate is 2.
  • the convolution kernel can also be moved on a plane similar to FIG. 9A or FIG. 9B, which involves the concept of shift rate (stride). For example, if the convolution kernel shifts by 1 grid each time, the corresponding shift rate is 1.
  • the number of convolution channels, which is the number of convolution kernel parameters used to perform convolution analysis. Theoretically, the more channels there are, the more comprehensive the signal analysis is and the higher the accuracy is; however, the more channels there are, the higher the complexity is. For example, for a 1 × 320 tensor, a 24-channel convolution operation can be used, and the output is a 24 × 320 tensor.
  • the size of the dilated convolution kernel (for example, for speech signals, the size of the convolution kernel is generally 1 × 3), the dilation rate, the shift rate, and the number of channels can be defined according to actual application needs.
  • the embodiments of the present application do not make specific limitations on this.
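  • the receptive field figures quoted above (3 for the ordinary convolution of Figure 9A, 5 for the dilation-rate-2 convolution of Figure 9B) follow from the usual rule for stacked stride-1 convolutions, RF = 1 + Σ (kernel_size − 1) × dilation_rate, as the small Python helper below illustrates; the two-layer example in the last line is an added assumption for illustration only.

    def receptive_field(kernel_sizes, dilation_rates):
        # receptive field of stacked stride-1 convolutions:
        # RF = 1 + sum over layers of (kernel_size - 1) * dilation_rate
        rf = 1
        for k, d in zip(kernel_sizes, dilation_rates):
            rf += (k - 1) * d
        return rf

    print(receptive_field([3], [1]))       # 3 -> ordinary convolution (Figure 9A)
    print(receptive_field([3], [2]))       # 5 -> dilated convolution, dilation rate 2 (Figure 9B)
    print(receptive_field([3, 3], [1, 3])) # 9 -> two stacked layers (illustrative assumption)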
  • the QMF filter bank is described below.
  • the QMF filter bank is a filter pair that includes analysis and synthesis.
  • the input signal with a sampling rate of Fs can be decomposed into two signals with a sampling rate of Fs/2, which represent the QMF low-pass signal and the QMF high-pass signal respectively.
  • where h Low (k) represents the coefficient of the low-pass filter and h High (k) represents the coefficient of the high-pass filter.
  • the QMF synthesis filter bank can also be described based on the QMF analysis filter bank H Low (z) and H High (z); the detailed mathematical background will not be repeated here:
  • G Low (z) = H Low (z)  (2)
  • G High (z) = (-1) * H High (z)  (3)
  • where G Low (z) represents the restored low-pass signal and G High (z) represents the restored high-pass signal.
  • after the decoding end recovers the low-pass signal and the high-pass signal, they are synthesized by the QMF synthesis filter bank to recover the reconstructed signal with the sampling rate Fs corresponding to the input signal.
  • based on the 2-channel QMF scheme, it can also be expanded to an N-channel QMF scheme; in particular, a binary-tree method can be used to iteratively perform 2-channel QMF analysis on the current sub-band signal to obtain sub-band signals with a lower resolution.
  • Figure 11A shows a 2-channel QMF analysis filter that iterates two layers, and a 4-channel sub-band signal can be obtained.
  • Figure 11B is another implementation method. Considering that the high-frequency signal has little effect on the quality, such a high-precision analysis is not required; therefore, only one high-pass filtering of the original signal is required. Similarly, more channels can be implemented, such as 8, 16, and 32 channels, which will not be further expanded here.
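  • to make the analysis/synthesis relationship of equations (2) and (3) concrete, the following numpy sketch implements the simplest possible 2-tap (Haar) QMF pair, for which h High (k) = (-1)^k · h Low (k), so that perfect reconstruction can be checked directly. It is only an illustrative example; practical QMF banks (such as the one whose responses are shown in Figure 10) use much longer prototype filters, but the analysis-downsample / upsample-synthesis structure is the same.

    import numpy as np

    def qmf_analysis(x):                       # x: even-length frame at sampling rate Fs
        pairs = x.reshape(-1, 2)
        low = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)    # sub-band at rate Fs/2 (low band)
        high = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)   # sub-band at rate Fs/2 (high band)
        return low, high

    def qmf_synthesis(low, high):              # inverse operation at the decoding end
        x = np.empty(2 * low.size)
        x[0::2] = (low + high) / np.sqrt(2.0)
        x[1::2] = (low - high) / np.sqrt(2.0)
        return x

    x = np.random.default_rng(0).standard_normal(320)
    low, high = qmf_analysis(x)
    print(low.size, high.size)                        # 160 160
    print(np.allclose(qmf_synthesis(low, high), x))   # True: perfect reconstruction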
  • the input signal is generated.
  • the speech signal of the nth frame includes 320 sample points, which are recorded as input signal x(n).
  • the analysis network is called for data compression.
  • the purpose of the analysis network is to generate a lower-dimensional feature vector F(n) based on the input signal x(n) by calling the analysis network (e.g., a neural network).
  • for example, the dimension of the input signal x(n) is 320, while the dimension of the feature vector F(n) is 56. From the perspective of data volume, the feature extraction by the analysis network plays a role of "dimensionality reduction" and realizes the function of data compression.
  • FIG 12 is a schematic diagram of the structure of the analysis network provided by the embodiment of the present application.
  • a 24-channel causal convolution is first called to expand the input signal x(n) to a 24 × 320 tensor, where the input signal x(n) is a 1 × 320 tensor.
  • the expanded 24 × 320 tensor is then preprocessed.
  • for example, the expanded 24 × 320 tensor can be pooled with a factor of 2 and passed through an activation function, which can be a rectified linear unit (ReLU), to generate a 24 × 160 tensor.
  • the dilation rate (Dilation Rate) of one or more dilated convolutions can be set according to demand, for example, it can be set to 3.
  • the embodiments of the present application do not limit the setting of different dilated rates for different dilated convolutions.
  • the Down_factor of the three coding blocks is set to 4, 5, and 8, respectively, which is equivalent to setting pooling factors of different sizes to play a downsampling role.
  • the number of channels of the three coding blocks is set to 48, 96, and 192, respectively.
  • the 24 × 160 tensor will be converted into 48 × 40, 96 × 8, and 192 × 1 tensors respectively.
  • a 56-dimensional feature vector F(n) can be output.
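  • the following Python (PyTorch) sketch mirrors the dimension flow of the analysis network of Figure 12 as described above (1 × 320 → 24 × 320 → 24 × 160 → 48 × 40 → 96 × 8 → 192 × 1 → 56). It is an illustrative reconstruction, not the network disclosed in the embodiment: causal convolutions are approximated with "same"-padded convolutions, average pooling is used for the Down_factor pooling, and the kernel sizes, dilation rate, and activations are assumptions.

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        # one encoding block: dilated convolution, activation, then pooling by Down_factor
        def __init__(self, in_ch, out_ch, down_factor, dilation=3):
            super().__init__()
            self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3,
                                  dilation=dilation, padding="same")
            self.act = nn.ReLU()
            self.pool = nn.AvgPool1d(down_factor)
        def forward(self, x):
            return self.pool(self.act(self.conv(x)))

    class AnalysisNet(nn.Module):
        def __init__(self, feat_dim=56):
            super().__init__()
            self.conv_in = nn.Conv1d(1, 24, kernel_size=3, padding="same")    # (B, 24, 320)
            self.pre = nn.Sequential(nn.AvgPool1d(2), nn.ReLU())              # (B, 24, 160)
            self.blocks = nn.Sequential(
                EncoderBlock(24, 48, down_factor=4),                          # (B, 48, 40)
                EncoderBlock(48, 96, down_factor=5),                          # (B, 96, 8)
                EncoderBlock(96, 192, down_factor=8),                         # (B, 192, 1)
            )
            self.head = nn.Conv1d(192, feat_dim, kernel_size=1)               # (B, 56, 1)
        def forward(self, x):                     # x: (B, 1, 320), one frame x(n)
            z = self.blocks(self.pre(self.conv_in(x)))
            return self.head(z).squeeze(-1)       # feature vector F(n): (B, 56)

    print(AnalysisNet()(torch.randn(1, 1, 320)).shape)   # torch.Size([1, 56])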
  • for the feature vector F(n), scalar quantization (i.e., each component is quantized separately) followed by entropy coding can be used for quantization coding; vector quantization (i.e., multiple adjacent components are combined into a vector for joint quantization) followed by entropy coding can also be used, which is not specifically limited in the embodiments of the present application.
  • after quantizing and encoding the feature vector F(n), a bit stream can be generated. According to experiments, high-quality compression of a 16 kHz wideband signal can be achieved at a bit rate of 6-8 kbps.
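  • a minimal example of the scalar quantization and table lookup mentioned above is given below; the uniform step size of 0.1 and the value range are assumptions made only for the illustration, and the resulting indices would in practice be entropy coded (e.g., with Huffman or arithmetic coding) to form the bit stream.

    import numpy as np

    levels = np.arange(-2.0, 2.0 + 1e-9, 0.1)     # assumed quantization table (step 0.1)

    def quantize(F):                              # F: 56-dimensional feature vector F(n)
        return np.abs(F[:, None] - levels[None, :]).argmin(axis=1)   # one index per component

    def dequantize(idx):                          # decoder side: simple table lookup
        return levels[idx]

    F = np.random.default_rng(0).uniform(-1.0, 1.0, 56)
    idx = quantize(F)
    print(idx[:5])                                             # five example indices into the table
    print(np.max(np.abs(dequantize(idx) - F)) <= 0.05 + 1e-9)  # True: error at most half a step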
  • Decoding is the reverse process of encoding.
  • the received bitstream is decoded, and then the quantization table is queried based on the index value obtained by decoding to obtain the estimated value of the feature vector, which is recorded as F′(n).
  • the enhanced network is called to extract the label information vector.
  • the estimated value F′(n) of the feature vector contains a compressed version of the original speech signal collected by the encoder, reflecting the core components of the speech signal, and also contains acoustic interference such as noise mixed during collection. Therefore, the enhancement network is used to collect relevant label embedding information from the estimated value F′(n) of the feature vector to generate a relatively clean speech signal during decoding.
  • FIG 13 is a schematic diagram of the structure of the enhanced network provided in an embodiment of the present application.
  • first, a one-dimensional causal convolution is called to generate a 56 × 1 tensor.
  • then, a layer of LSTM network is passed to generate a 56 × 1 tensor.
  • next, a fully connected (FC, Full-Connection) network is called to generate a 56 × 1 tensor.
  • finally, an activation function (for example, ReLU; other activation functions such as the Sigmoid function or the Tanh function can also be used) is applied, and a label information vector of the same dimension as the estimated value F′(n) of the feature vector is generated, denoted as E(n).
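  • the enhancement network of Figure 13 can be sketched as the following dimension-preserving PyTorch module (convolution → LSTM → fully connected layer → activation, each stage keeping the 56-dimensional shape of F′(n)). The kernel size, the LSTM hidden size, and the use of a single time step per call are assumptions made for the illustration; in a streaming decoder the LSTM state would be carried across frames.

    import torch
    import torch.nn as nn

    class EnhancementNet(nn.Module):
        # dimension-preserving label extraction: convolution -> LSTM -> fully connected -> activation
        def __init__(self, dim=56):
            super().__init__()
            self.conv = nn.Conv1d(dim, dim, kernel_size=1)   # stand-in for the 1-D causal convolution
            self.lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)
            self.fc = nn.Linear(dim, dim)
            self.act = nn.ReLU()                             # Sigmoid or Tanh could be used instead
        def forward(self, F_hat):                            # F_hat: (B, 56), i.e. F'(n)
            x = self.conv(F_hat.unsqueeze(-1)).squeeze(-1)   # (B, 56)
            x, _ = self.lstm(x.unsqueeze(1))                 # processed as a single time step
            return self.act(self.fc(x.squeeze(1)))           # label information vector E(n): (B, 56)

    print(EnhancementNet()(torch.randn(1, 56)).shape)        # torch.Size([1, 56])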
  • the synthesis network is called to reconstruct the signal.
  • at the decoding end, the estimated value F′(n) of the feature vector and the locally generated label information vector E(n) are spliced into a 112-dimensional vector, and the synthesis network is then called to reconstruct the signal and generate an estimated value of the speech signal, denoted as x′(n).
  • the synthetic network generates input vectors by splicing, which is only one way.
  • the embodiments of the present application do not limit other methods. For example, F′(n)+E(n) can be used as input, and the dimension is 56. For this method, you can refer to Figure 14 to redesign the network, and the embodiments of the present application will not be repeated here.
  • Figure 14 is a schematic diagram of the structure of the synthesis network provided in an embodiment of the present application.
  • the structure of the synthesis network is highly similar to that of the analysis network, such as causal convolution; but the dimension of the input quantity is increased to 112 dimensions.
  • the post-processing process is similar to the pre-processing in the analysis network.
  • the structure of the decoding block (also called the decoding layer) is the inverse of that of the encoding block (also called the encoding layer): the encoding block in the analysis network first performs a dilated convolution and then performs pooling to complete downsampling, while the decoding block in the synthesis network first performs pooling to complete upsampling and then performs a dilated convolution.
  • decoding is the inverse process of encoding. Please refer to the description of Figure 12, and the embodiments of the present application will not be repeated here.
  • the relevant networks (such as the analysis network and the synthesis network) of the encoding end and the decoding end can be jointly trained on collected data to obtain the optimal parameters; after the training is completed, they can be put into use.
  • the above embodiment of the present application assumes that the parameters of the analysis network and the synthesis network have been trained, and only discloses a specific network input, network structure and network output. The engineers in the relevant fields can further modify the above configuration according to the actual situation.
  • the analysis network, the enhancement network and the synthesis network are respectively called on the encoding and decoding path to complete the low bit rate compression and signal reconstruction.
  • the complexity of these networks is relatively high.
  • the embodiment of the present application can introduce a QMF analysis filter to decompose the input signal into sub-band signals with a lower bit rate; then for each sub-band signal, the input and output dimensions of the neural network will be at least halved.
  • the computational complexity of neural networks is O(N³), so this "divide and conquer" idea can effectively reduce the complexity.
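  • as a rough illustration of this point: if the cost of a network is assumed to scale as O(N³) in its input/output dimension N, then replacing one full-band network of dimension N by two half-band networks of dimension N/2 costs about 2 × (N/2)³ = N³/4, i.e., roughly a quarter of the original computation.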
  • Figure 15 is a flow chart of the audio encoding and decoding method provided by an embodiment of the present application.
  • a QMF analysis filter is used to decompose it into two sub-band signals: a low-frequency sub-band signal, denoted as x LB (n), and a high-frequency sub-band signal, denoted as x HB (n).
  • the first analysis network can be called to obtain a low-dimensional feature vector of the low-frequency sub-band signal, denoted as F LB (n).
  • the dimension of the feature vector F LB (n) of the low-frequency sub-band signal is smaller than that of the low-frequency sub-band signal x LB (n), thereby reducing the amount of data.
  • since the dimension of the low-frequency sub-band signal is half that of the input signal, the parameters of the first analysis network, including the dimension of the feature vector F LB (n) of the low-frequency sub-band signal, can be halved accordingly.
  • the eigenvector F LB (n) of the low frequency subband signal can be vector quantized or scalar quantized, and the index value obtained after quantization is entropy encoded to obtain a bit stream, which is then transmitted to the decoding end.
  • the decoder After receiving the code stream sent by the encoder, the decoder can decode the received code stream to obtain an estimated value of the feature vector of the low-frequency subband signal, denoted as F′ LB (n). Then, based on the estimated value F′ LB (n) of the feature vector of the low-frequency subband signal, the first enhancement network can be called to generate a label information vector corresponding to the low-frequency subband signal, denoted as E LB (n).
  • the first synthesis network of the inverse process of the corresponding encoder is called to complete the reconstruction of the estimated value of the low-frequency subband signal, denoted as x′ LB (n), and suppress acoustic interference such as noise contained in the speech signal collected by the encoder.
  • the functions of the first enhancement network and the first synthesis network are merged into the first synthesis module below, that is, at the decoding end, based on F′ LB (n) and E LB (n), the first synthesis module is called to reconstruct the signal to obtain the estimated value x′ LB (n) of the low-frequency subband signal.
  • for the high-frequency subband signal x HB (n) obtained after the input signal x(n) is decomposed by the QMF analysis filter, the second analysis network and the second synthesis module (including the second enhancement network and the second synthesis network) are respectively called, and the estimated value of the high-frequency subband signal, denoted as x′ HB (n), can be obtained at the decoding end.
  • processing flow for the high-frequency subband signal x HB (n) is similar to the processing flow for the low-frequency subband signal x LB (n), and can be implemented by referring to the processing flow for the low-frequency subband signal x LB (n), and the embodiments of the present application will not be repeated here.
  • the processing completed by iterating the 2-channel QMF can be further expanded to the multi-channel QMF scheme shown in Figure 16.
  • the input signal x(n) can be decomposed into N sub-band signals, and encoding and decoding processing can be performed separately for each sub-band signal. Because the principles are similar, the embodiments of the present application will not be repeated here.
  • the following uses 2-channel QMF as an example to illustrate the audio encoding and decoding method provided in the embodiment of the present application.
  • the input signal is generated.
  • the speech signal of the nth frame includes 320 sample points, which are recorded as input signal x(n).
  • the QMF signal is decomposed.
  • a QMF analysis filter (here specifically 2-channel QMF) can be called and down-sampled to obtain two sub-band signals, namely, a low-frequency sub-band signal x LB (n) and a high-frequency sub-band signal x HB (n).
  • the effective bandwidth of the low-frequency sub-band signal x LB (n) is 0-4 kHz, the effective bandwidth of the high-frequency sub-band signal x HB (n) is 4-8 kHz, and the number of sample points per frame of each sub-band signal is 160.
  • the first analysis network and the second analysis network are called to perform data compression.
  • the first analysis network shown in Figure 17 can be called to perform feature extraction processing on the low-frequency subband signal x LB (n) to obtain a feature vector F LB (n) of the low-frequency subband signal; similarly, for the high-frequency subband signal x HB (n), the second analysis network can be called to perform feature extraction processing to obtain a feature vector of the high-frequency subband signal, denoted as F HB (n).
  • the dimension of the feature vector of the output subband signal can be lower than the dimension of the feature vector of the input signal in the above embodiment.
  • the dimension of the feature vector of the low-frequency subband signal and the dimension of the feature vector of the high-frequency subband signal can both be set to 28. In this way, the dimension of the feature vector of the overall output is consistent with the dimension of the feature vector of the input signal in the above embodiment, that is, the bit rates of the two are consistent.
  • the embodiment of the present application does not limit the definition of different numbers of dimensions for the feature vectors of different sub-band signals.
  • the dimension of the feature vector of the low-frequency sub-band signal can be set to 32, while the dimension of the feature vector of the high-frequency sub-band signal is set to 24, which still ensures that the total dimension is consistent with the dimension of the feature vector of the input signal.
  • it can be achieved by adjusting the internal parameter quantities of the first analysis network and the second analysis network accordingly, and the embodiment of the present application will not be repeated here.
  • the estimated value F′ LB (n) of the feature vector of the low frequency sub-band signal and the estimated value F′ HB (n) of the feature vector of the high frequency sub-band signal can be obtained.
  • the first enhanced network and the second enhanced network are called to extract the label information vector.
  • the first enhancement network shown in Figure 18 can be called to collect label embedding information (i.e., the label information vector of the low-frequency part) from the estimated value F′ LB (n) of the feature vector of the low-frequency subband signal, which is recorded as E LB (n) and is used to generate a relatively clean low-frequency subband speech signal during decoding.
  • the structure of the first enhancement network shown in Figure 18, including the parameter amount of the first enhancement network, can be adjusted according to the dimension of the feature vector output by the first analysis network at the encoding end.
  • the second enhancement network can be called to obtain the label information vector of the high-frequency part, which is recorded as E HB (n) for subsequent processes.
  • the label information vectors of the two sub-band signals can be obtained, which are the label information vector E LB (n) of the low-frequency part and the label information vector E HB (n) of the high-frequency part.
  • the first synthesis network and the second synthesis network are called to perform signal reconstruction.
  • Figure 19 is a schematic diagram of the structure of the first synthesis network provided by an embodiment of the present application.
  • the first synthesis network can be called to generate an estimated value of the low-frequency sub-band signal based on the estimated value F′ LB (n) of the characteristic vector of the low-frequency sub-band signal and the label information vector E LB (n) of the locally generated low-frequency part, denoted as x′ LB (n).
  • the specific calculation process can refer to the description of Figure 14, and the embodiment of the present application will not be repeated here.
  • Figure 19 only provides a specific configuration of the first synthesis network corresponding to the low-frequency part, and the implementation form of the high-frequency part is similar, which will not be repeated here.
  • the estimated value x′ LB (n) of the low frequency sub-band signal and the estimated value x′ HB (n) of the high frequency sub-band signal are generated.
  • acoustic interference such as noise in these two sub-band signals is effectively suppressed.
  • the embodiment of the present application significantly improves the coding efficiency compared with the traditional signal processing scheme through the organic combination of signal decomposition and related signal processing technology and deep neural network.
  • the speech enhancement is implemented at the decoding end, so that the effect of reconstructing clean speech can be achieved with a low bit rate under acoustic interference such as noise.
  • even if the speech signal collected by the encoding end is mixed with a large amount of noise interference, a clean speech signal can be reconstructed at the decoding end, thereby improving the quality of the voice call.
  • the software modules stored in the audio decoding device 565 of the memory 560 may include: an acquisition module 5651, a decoding module 5652, a label extraction module 5653, a reconstruction module 5654 and a determination module 5655.
  • the acquisition module 5651 is configured to acquire a bit stream, wherein the bit stream is obtained by encoding an audio signal; the decoding module 5652 is configured to decode the bit stream to obtain a predicted value of a feature vector of the audio signal; the label extraction module 5653 is configured to perform label extraction on the predicted value of the feature vector to obtain a label information vector, wherein the dimension of the label information vector is the same as the dimension of the predicted value of the feature vector; the reconstruction module 5654 is configured to perform signal reconstruction based on the predicted value of the feature vector and the label information vector; the determination module 5655 is configured to use the predicted value of the audio signal obtained by signal reconstruction as the decoding result of the bit stream.
  • the decoding module 5652 is further configured to decode the bit stream to obtain an index value of the feature vector of the audio signal; and query the quantization table based on the index value to obtain a predicted value of the feature vector of the audio signal.
  • the label extraction module 5653 is also configured to perform convolution processing on the predicted value of the feature vector to obtain a first tensor of the same dimension as the predicted value of the feature vector; perform feature extraction processing on the first tensor to obtain a second tensor of the same dimension as the first tensor; perform full connection processing on the second tensor to obtain a third tensor of the same dimension as the second tensor; perform activation processing on the third tensor to obtain a label information vector.
  • the reconstruction module 5654 is further configured to perform splicing processing on the predicted value of the feature vector and the label information vector to obtain a spliced vector;
  • the concatenated vector is subjected to a first convolution process to obtain a convolution feature of the audio signal;
  • the convolution feature is subjected to an upsampling process to obtain an upsampling feature of the audio signal;
  • the upsampling feature is subjected to a pooling process to obtain a pooling feature of the audio signal;
  • the pooling feature is subjected to a second convolution process to obtain a predicted value of the audio signal.
  • the upsampling process is implemented through multiple cascaded decoding layers, and the sampling factors of different decoding layers are different; the reconstruction module 5654 is also configured to upsample the convolution features through the first decoding layer in the multiple cascaded decoding layers; output the upsampling result of the first decoding layer to the subsequent cascaded decoding layers, and continue to perform upsampling processing and output upsampling results through the subsequent cascaded decoding layers until it is output to the last decoding layer; the upsampling result output by the last decoding layer is used as the upsampling feature of the audio signal.
  • the bit stream includes a low-frequency bit stream and a high-frequency bit stream, wherein the low-frequency bit stream is obtained by encoding the low-frequency sub-band signal obtained after decomposition of the audio signal, and the high-frequency bit stream is obtained by encoding the high-frequency sub-band signal obtained after decomposition of the audio signal; the decoding module 5652 is also configured to decode the low-frequency bit stream to obtain a predicted value of the feature vector of the low-frequency sub-band signal; and is configured to decode the high-frequency bit stream to obtain a predicted value of the feature vector of the high-frequency sub-band signal.
  • the label extraction module 5653 is further configured to perform label extraction processing on the predicted value of the feature vector of the low-frequency sub-band signal to obtain a first label information vector, wherein the dimension of the first label information vector is the same as the dimension of the predicted value of the feature vector of the low-frequency sub-band signal; and is configured to perform label extraction processing on the predicted value of the feature vector of the high-frequency sub-band signal to obtain a second label information vector, wherein the dimension of the second label information vector is the same as the dimension of the predicted value of the feature vector of the high-frequency sub-band signal.
  • the label extraction module 5653 is further configured to call the first enhancement network to perform the following processing: perform convolution processing on the predicted value of the feature vector of the low-frequency subband signal to obtain a fourth tensor of the same dimension as the predicted value of the feature vector of the low-frequency subband signal; perform feature extraction processing on the fourth tensor to obtain a fifth tensor of the same dimension as the fourth tensor; perform full connection processing on the fifth tensor to obtain a sixth tensor of the same dimension as the fifth tensor; perform activation processing on the sixth tensor to obtain a first label information vector.
  • the label extraction module 5653 is further configured to call the second enhancement network to perform the following processing: perform convolution processing on the predicted value of the feature vector of the high-frequency subband signal to obtain a seventh tensor of the same dimension as the predicted value of the feature vector of the high-frequency subband signal; perform feature extraction processing on the seventh tensor to obtain an eighth tensor of the same dimension as the seventh tensor; perform full connection processing on the eighth tensor to obtain a ninth tensor of the same dimension as the eighth tensor; perform activation processing on the ninth tensor to obtain a second label information vector.
  • the predicted value of the feature vector includes: the predicted value of the feature vector of the low-frequency subband signal, and the predicted value of the feature vector of the high-frequency subband signal; the reconstruction module 5654 is also configured to perform splicing processing on the predicted value of the feature vector of the low-frequency subband signal and the first label information vector to obtain a first splicing vector; based on the first splicing vector, call the first synthesis network to perform signal reconstruction to obtain the predicted value of the low-frequency subband signal; perform splicing processing on the predicted value of the feature vector of the high-frequency subband signal and the second label information vector to obtain a second splicing vector; based on the second splicing vector, call the second synthesis network to perform signal reconstruction to obtain the predicted value of the high-frequency subband signal; perform synthesis processing on the predicted value of the low-frequency subband signal and the predicted value of the high-frequency subband signal to obtain the predicted value of the audio signal.
  • the reconstruction module 5654 is further configured to call the first synthesis network to perform the following processing: performing a first convolution processing on the first splicing vector to obtain a convolution feature of the low-frequency sub-band signal; performing an upsampling processing on the convolution feature to obtain an upsampling feature of the low-frequency sub-band signal; performing a pooling processing on the upsampling feature to obtain a pooling feature of the low-frequency sub-band signal; performing a second convolution processing on the pooling feature to obtain a predicted value of the low-frequency sub-band signal; wherein the upsampling processing is implemented through multiple cascaded decoding layers, and the sampling factors of different decoding layers are different.
  • the reconstruction module 5654 is further configured to call the second synthesis network to perform the following processing: performing a first convolution process on the second splicing vector to obtain a convolution feature of the high-frequency sub-band signal; performing an upsampling process on the convolution feature to obtain an upsampling feature of the high-frequency sub-band signal; performing a pooling process on the upsampling feature to obtain a pooling feature of the high-frequency sub-band signal; performing a second convolution process on the pooling feature to obtain a predicted value of the high-frequency sub-band signal; wherein the upsampling process is implemented through multiple cascaded decoding layers, and the sampling factors of different decoding layers are different.
  • the code stream includes N sub-code streams, the N sub-code streams correspond to different frequency bands, and are obtained by respectively encoding N sub-band signals obtained after decomposition of the audio signal, where N is an integer greater than 2; the decoding module 5652 is also configured to respectively decode the N sub-code streams to obtain predicted values of the feature vectors corresponding to the N sub-band signals.
  • the label extraction module 5653 is further configured to perform label extraction processing on the predicted values of the feature vectors corresponding to the N sub-band signals respectively, to obtain N label information vectors for signal enhancement, wherein the dimension of each label information vector is the same as the dimension of the predicted value of the feature vector of the corresponding sub-band signal.
  • the label extraction module 5653 is further configured to call the i-th enhancement network to perform label extraction processing based on the predicted value of the feature vector of the i-th subband signal to obtain the i-th label information vector; wherein the value range of i satisfies 1 ≤ i ≤ N, and the dimension of the i-th label information vector is the same as the dimension of the predicted value of the feature vector of the i-th subband signal.
  • the label extraction module 5653 is further configured to call the i-th enhancement network to perform the following processing: perform convolution processing on the predicted value of the feature vector of the i-th subband signal to obtain a tenth tensor of the same dimension as the predicted value of the feature vector of the i-th subband signal; perform feature extraction processing on the tenth tensor to obtain an eleventh tensor of the same dimension as the tenth tensor; perform full connection processing on the eleventh tensor to obtain a twelfth tensor of the same dimension as the eleventh tensor; perform activation processing on the twelfth tensor to obtain the i-th label information vector.
  • the reconstruction module 5654 is further configured to perform one-to-one splicing processing on the predicted values of the feature vectors corresponding to the N sub-band signals and the N label information vectors to obtain N splicing vectors; based on the j-th splicing vector, call the j-th synthesis network to perform signal reconstruction to obtain the predicted value of the j-th sub-band signal; wherein the value range of j satisfies 1 ≤ j ≤ N; and perform synthesis processing on the predicted values corresponding to the N sub-band signals to obtain the predicted value of the audio signal.
  • the reconstruction module 5654 is further configured to call the j-th synthesis network to perform the following processing: perform a first convolution process on the j-th splicing vector to obtain the convolution feature of the j-th subband signal; perform an upsampling process on the convolution feature to obtain the upsampling feature of the j-th subband signal; perform a pooling process on the upsampling feature to obtain the pooling feature of the j-th subband signal; perform a second convolution process on the pooling feature to obtain the predicted value of the j-th subband signal; wherein the upsampling process is implemented through multiple cascaded decoding layers, and the sampling factors of different decoding layers are different.
  • the description of the device in the embodiment of the present application is similar to the description of the method embodiment above and has similar beneficial effects as the method embodiment.
  • technical details not exhaustively described for the audio decoding device provided in the embodiment of the present application can be understood from the description of any one of FIG. 4C, FIG. 6A, or FIG. 6B.
  • the embodiment of the present application provides a computer program product or a computer program, which includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the audio encoding and decoding method described in the embodiment of the present application.
  • An embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions.
  • the processor will execute the audio codec method provided by the embodiment of the present application, for example, the audio codec method shown in FIG. 4C .
  • the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface storage, optical disk, or CD-ROM; or it may be various devices including one or any combination of the above memories.
  • computer executable instructions may be in the form of a program, software, software module, script or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment.
  • executable instructions may, but do not necessarily, correspond to a file in a file system, may be stored as part of a file that stores other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files storing one or more modules, subroutines, or code portions).
  • HTML HyperText Markup Language
  • the executable instructions may be deployed to be executed on one electronic device, or on multiple electronic devices located at one site, or on multiple electronic devices distributed at multiple sites and interconnected by a communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present application provides an audio decoding and encoding method, apparatus, electronic device, and storage medium, which can be applied to in-vehicle scenarios. The audio decoding method includes: acquiring a code stream, wherein the code stream is obtained by encoding an audio signal; decoding the code stream to obtain a predicted value of a feature vector of the audio signal; performing label extraction processing on the predicted value of the feature vector to obtain a label information vector for signal enhancement, wherein the dimension of the label information vector is the same as the dimension of the predicted value of the feature vector; performing signal reconstruction based on the predicted value of the feature vector and the label information vector; and using the predicted value of the audio signal obtained through the signal reconstruction as the decoding result of the code stream.

Description

Audio encoding and decoding method, apparatus, electronic device, computer-readable storage medium, and computer program product
Cross-Reference to Related Applications
This application is based on and claims priority to the Chinese patent application No. 202210676984.X filed on June 15, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of communication technology, and in particular to an audio encoding and decoding method, apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
Due to the convenience and timeliness of voice communication, voice calls are used more and more widely, for example, for transmitting audio signals (such as voice signals) between the participants of a network conference. In a voice call, the voice signal may be mixed with acoustic interference such as noise, and the noise mixed in the voice signal will degrade the call quality, thereby greatly affecting the user's auditory experience.
However, as for how to enhance the voice signal so as to suppress the noise component, the related art has no effective solution yet.
Summary
The embodiments of the present application provide an audio encoding and decoding method, apparatus, electronic device, computer-readable storage medium, and computer program product, which can effectively suppress the acoustic interference in an audio signal, thereby improving the quality of the reconstructed audio signal.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides an audio decoding method, including:
acquiring a code stream, wherein the code stream is obtained by encoding an audio signal;
decoding the code stream to obtain a predicted value of a feature vector of the audio signal;
performing label extraction processing on the predicted value of the feature vector to obtain a label information vector, wherein the dimension of the label information vector is the same as the dimension of the predicted value of the feature vector;
performing signal reconstruction based on the predicted value of the feature vector and the label information vector; and
using the predicted value of the audio signal obtained through the signal reconstruction as a decoding result of the code stream.
An embodiment of the present application provides an audio decoding apparatus, including:
an acquisition module configured to acquire a code stream, wherein the code stream is obtained by encoding an audio signal;
a decoding module configured to decode the code stream to obtain a predicted value of a feature vector of the audio signal;
a label extraction module configured to perform label extraction processing on the predicted value of the feature vector to obtain a label information vector, wherein the dimension of the label information vector is the same as the dimension of the predicted value of the feature vector;
a reconstruction module configured to perform signal reconstruction based on the predicted value of the feature vector and the label information vector; and
a determination module configured to use the predicted value of the audio signal obtained through the signal reconstruction as a decoding result of the code stream.
An embodiment of the present application provides an audio encoding method, including:
acquiring an audio signal;
encoding the audio signal to obtain a code stream, wherein the code stream is used by an electronic device to perform the audio decoding method provided in the embodiments of the present application.
An embodiment of the present application provides an audio encoding apparatus, including:
an acquisition module configured to acquire an audio signal;
an encoding module configured to encode the audio signal to obtain a code stream, wherein the code stream is used by an electronic device to perform the audio decoding method provided in the embodiments of the present application.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer-executable instructions; and
a processor for implementing the audio encoding and decoding method provided in the embodiments of the present application when executing the computer-executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the audio encoding and decoding method provided in the embodiments of the present application.
An embodiment of the present application provides a computer program product, including a computer program or computer-executable instructions which, when executed by a processor, implement the audio encoding and decoding method provided in the embodiments of the present application.
The embodiments of the present application have the following beneficial effects:
By performing label extraction processing on the predicted value of the feature vector obtained by decoding to obtain a label information vector, and performing signal reconstruction by combining the predicted value of the feature vector and the label information vector, since the label information vector reflects only the core components of the audio signal, that is, the label information vector does not include acoustic interference such as noise, when signal reconstruction is performed by combining the predicted value of the feature vector and the label information vector, compared with performing signal reconstruction based only on the predicted value of the feature vector, the proportion of the core components in the audio signal can be increased through the label information vector and the proportion of acoustic interference such as noise is correspondingly reduced, so that the noise components included in the audio signal collected by the encoding end can be effectively suppressed, a signal enhancement effect is achieved, and the quality of the reconstructed audio signal is improved.
Brief Description of the Drawings
Figure 1 is a schematic diagram of spectrum comparison at different bit rates provided by an embodiment of the present application;
Figure 2 is a schematic diagram of the architecture of the audio encoding and decoding system 100 provided by an embodiment of the present application;
Figure 3 is a schematic diagram of the structure of the second terminal device 500 provided by an embodiment of the present application;
Figure 4A is a flow chart of the audio encoding method provided by an embodiment of the present application;
Figure 4B is a flow chart of the audio decoding method provided by an embodiment of the present application;
Figure 4C is a flow chart of the audio encoding and decoding method provided by an embodiment of the present application;
Figure 5 is a schematic diagram of the structure of the encoding end and the decoding end provided by an embodiment of the present application;
Figures 6A and 6B are flow charts of the audio decoding method provided by an embodiment of the present application;
Figure 7 is a schematic diagram of an end-to-end voice communication link provided by an embodiment of the present application;
Figure 8 is a flow chart of the audio encoding and decoding method provided by an embodiment of the present application;
Figure 9A is a schematic diagram of ordinary convolution provided by an embodiment of the present application;
Figure 9B is a schematic diagram of dilated convolution provided by an embodiment of the present application;
Figure 10 is a schematic diagram of the spectral responses of the low-pass part and the high-pass part of the QMF analysis filter bank provided by an embodiment of the present application;
Figure 11A is a schematic diagram of the principle of obtaining 4-channel sub-band signals based on a QMF filter bank provided by an embodiment of the present application;
Figure 11B is a schematic diagram of the principle of obtaining 3-channel sub-band signals based on a QMF filter bank provided by an embodiment of the present application;
Figure 12 is a schematic diagram of the structure of the analysis network provided by an embodiment of the present application;
Figure 13 is a schematic diagram of the structure of the enhancement network provided by an embodiment of the present application;
Figure 14 is a schematic diagram of the structure of the synthesis network provided by an embodiment of the present application;
Figure 15 is a flow chart of the audio encoding and decoding method provided by an embodiment of the present application;
Figure 16 is a flow chart of the audio encoding and decoding method provided by an embodiment of the present application;
Figure 17 is a schematic diagram of the structure of the first analysis network provided by an embodiment of the present application;
Figure 18 is a schematic diagram of the structure of the first enhancement network provided by an embodiment of the present application;
Figure 19 is a schematic diagram of the structure of the first synthesis network provided by an embodiment of the present application;
Figure 20 is a schematic diagram comparing encoding and decoding effects provided by an embodiment of the present application.
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。
可以理解的是,在本申请实施例中,涉及到用户信息等相关的数据(例如用户发出的语音信号),当本申请实施例运用到具体产品或技术中时,需要获得用户许可或者同意,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
在以下的描述中,所涉及的术语“第一\第二\...”仅仅是是区别类似的对象,不代表针对对象的特定排序,可以理解地,“第一\第二\...”在允许的情况下可以互换特定的顺序或先后次序,以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
对本申请实施例进行进一步详细说明之前,对本申请实施例中涉及的名词和术语进行说明,本申请实施例中涉及的名词和术语适用于如下的解释。
1)神经网络(NN,Neural Network):是一种模仿动物神经网络行为特征,进行分布式并行信息处理的算法数学模型。这种网络依靠系统的复杂程度,通过调整内部大量节点之间相互连接的关系,从而达到处理信息的目的。
2)深度学习(DL,Deep Learning):是机器学习(ML,Machine Learning)领域中一个新的研究方向,深度学习是学习样本数据的内在规律和表示层次,这些学习过程中获得的信息对诸如文字,图像和声音等数据的解释有很大的帮助。它的最终目标是让机器能够像人一样具有分析学习能力,能够识别文字、图像和声音等数据。
3)矢量量化(VQ,Vector Quantization):一种有效的有损压缩技术,其理论基础是香农的速率失真理论。矢量量化的基本原理是用码书中与输入矢量最匹配的码字的索引代替输入矢量进行传输与存储,而解码时仅需要简单地查表操作。
4)标量量化:是对标量进行量化,即一维的矢量量化,将动态范围分成若干个小区间,每个小区间有一个代表值。当输入信号落入某区间时,量化成该代表值。
5）熵编码：即编码过程中按熵原理不丢失任何信息的无损编码方式，也是有损编码中的一个关键模块，处于编码器的末端。常见的熵编码有：香农（Shannon）编码、哈夫曼（Huffman）编码、指数哥伦布编码（Exp-Golomb）和算术编码（arithmetic coding）。
6)正交镜像滤波器组(QMF,Quadrature Mirror Filters):是一个包含分析-合成的滤波器对,其中,QMF分析滤波器组用于子带信号分解,以降低信号带宽,使各个子带信号可顺利由通道处理;QMF合成滤波器组用于将解码端恢复出的各子带信号进行合成处理,例如通过零值内插和带通滤波等方式重建出原始的音频信号。
语音编解码技术，是包括远程音视频通话在内的通信服务中的一项核心技术。语音编码技术，简单来讲，就是使用较少的网络带宽资源去尽量多地传递语音信息。从香农信息论的角度来讲，语音编码是一种信源编码，信源编码的目的是在编码端尽可能地压缩想要传递的信息的数据量，去掉信息中的冗余，同时在解码端还能够无损（或者接近无损）地恢复出来。
相关技术提供的语音编解码器的压缩率都可以达到10倍以上，也就是说，原本10MB的语音数据经过编码器的压缩后只需要1MB来传输，大大降低了传递信息所需消耗的带宽资源。例如对于采样率为16000Hz的宽带语音信号，如果采用16-bit的采样深度，无压缩版本的码率为256千比特每秒（kbps，kilobit per second）；如果使用语音编码技术，即使是有损编码，在10-20kbps的码率范围内，重建的语音信号的质量可以接近无压缩版本，甚至听感上认为无差别。如果需要更高采样率的服务，比如32000Hz的超宽带语音，码率范围至少要达到30kbps以上。
相关技术提供的传统语音编码方案，根据编码原理一般可以分为三种：波形编码（waveform speech coding）、参数编码（parametric speech coding）、混合编码（hybrid speech coding）。
其中,波形编码就是直接对语音信号的波形进行编码,这种编码方式的优点是编码语音质量高,但是压缩率不高。
参数编码指的是对语音发声过程进行建模,而编码端要做的就是提取想要传递的语音信号的对应参数。参数编码的优点是压缩率极高,缺点是恢复语音的质量不高。
混合编码是将上述两种编码方式结合，将能够使用参数编码的语音成分用参数表示，剩下的、参数无法有效表达的成分使用波形编码。两者结合能够做到在编码效率较高的情况下，恢复出的语音质量也很高。
一般地,上述三种编码原理均来自经典的语音信号建模,也称之为基于信号处理的压缩方法。根据率失真分析并结合过去几十年的标准化经验,推荐至少0.75bit/sample的码率才能提供理想的语音质量;对于采样率为16000Hz的宽带语音信号,等效于12kbps。例如IETF OPUS标准推荐16kbps作为提供高质量宽带语音通话的推荐码率。
示例的,参见图1,图1是本申请实施例提供的不同码率下的频谱比较示意图,以示范压缩码率与质量的关系。其中,曲线101为原始语音,即没有压缩的音频信号;曲线102为OPUS编码器20kbps的效果;曲线103为OPUS编码器6kbps的效果。从图1可以看出,随着编码码率的提升,压缩后的信号更接近原始信号。
然而,申请人发现,相关技术提供的上述方案主要还是通过传统信号处理的方法,在保持现有质量的前提下,码率很难再有明显的下降。
近年来,随着深度学习的进步,相关技术也提供了使用人工智能来提升编码码率的方案。
然而，申请人还发现：对于基于人工智能的音频编解码方案，虽然码率可以低于2kbps，但一般需要调用Wavenet等生成网络，导致解码端的复杂度非常高，使得在移动终端中使用时具有非常大的挑战性，并且，绝对质量与传统信号处理的编码器相比差距也非常明显。而对于基于端到端的NN编解码方案，码率为6-10kbps，主观质量接近传统信号处理的方案，然而，编解码两端均采用了深度学习网络，导致复杂度非常高。
此外,不管是传统信号处理的方案还是基于深度神经网络的方案,只能对语音信号进行压缩。然而,实际的语音通信会受到噪声等声学干扰的影响。也就是说,相关技术中尚无同时具备语音增强和低码率高质量压缩效果的解决方案。
鉴于此,本申请实施例提供一种音频编解码方法、装置、电子设备、计算机可读存储介质及计算机程序产品,能够在提高编码效率的同时,有效抑制音频信号中的声学干扰,进而提高重建得到的音频信号的质量。下面说明本申请实施例提供的电子设备的示例性应用,本申请实施例提供的电子设备可以实施为终端设备,也可以实施为服务器,或者由终端设备和服务器协同实施。下面以由终端设备和服务器协同实施本申请实施例提供的音频编解码方法为例进行说明。
示例的,参见图2,图2是本申请实施例提供的音频编解码系统100的架构示意图,为实现支撑一个能够在提高编码效率的同时,有效抑制音频信号中的声学干扰,进而提高重建得到的音频信号的质量的应用,如图2所示,音频编解码系统100包括:服务器200、网络300、第一终端设备400(即编码端)和第二终端设备500(即解码端),其中,网络300可以是局域网,或者是广域网,又或者是二者的组合。
在一些实施例中,在第一终端设备400上运行有客户端410,客户端410可以是各种类型的客户端,例如包括即时通信客户端、网络会议客户端、直播客户端、浏览器等。客户端410响应于发送方(例如网络会议的发起者、主播、语音通话的发起者等)触发的音频采集指令,调用终端设备400中的麦克风进行音频信号的采集,并对采集得到的音频信号进行编码处理,得到码流。接着,客户端410可以将码流通过网络300发送至服务器200,以使服务器200将码流发送至接收方(例如网络会议的参会对象、观众、语音通话的接收者等)关联的第二终端设备500。客户端510在接收到服务器200发送的码流后,可以对码流进行解码处理,得到音频信号的特征向量的预测值(又称估计值);接着客户端510还可以调用增强网络对特征向量的预测值进行标签提取处理,得到用于信号增强的标签信息向量,其中,标签信息向量的维度与特征向量的预测值的维度相同;随后客户端510可以基于解码得到的特征向量的预测值、以及经过标签提取处理得到的标签信息向量,调用合成网络进行信号重建,得到音频信号的预测值,从而完成音频信号的重建,并抑制了编码端采集到的音频信号中包含的噪声成分,提高了重建得到的音频信号的质量。
本申请实施例提供的音频编解码方法可以广泛应用于各种不同类型的语音、或者视频通话的应用场景中，例如通过车载终端上运行的应用实现的车载语音、通过即时通信客户端进行的语音通话或者视频通话、游戏应用中的语音通话、网络会议客户端中的语音通话等。例如可以在语音通话的接收端或者提供语音通信服务的服务器来按照本申请实施例提供的音频解码方法进行语音增强。
示例的,以网络会议场景为例,网络会议是线上办公中一个重要的环节,在网络会议中,网络会议的参与方的声音采集装置(例如麦克风)在采集到发言人的语音信号后,需要将所采集到的语音信号发送至网络会议的其他参与方,该过程涉及到语音信号在多个参与方之间的传输和播放,如果不对语音信号中所混有的噪声进行处理,会极大影响会议参与方的听觉体验。在该场景中,可以应用本申请实施例提供的音频解码方法对网络会议中的语音信号进行增强,从而使得会议参与方所听到的语音信号是进行增强后的语音信号,即在重建得到的语音信号中抑制了编码端采集的语音信号中的噪声成分,提高了网络会议中语音通话的质量。
在另一些实施例中,本申请实施例可以借助云技术(Cloud Technology)实现,云技术是指在广域网或局域网内将硬件、软件、网络等系列资源统一起来,实现数据的计算、存储、处理和共享的一种托管技术。
云技术是基于云计算商业模式应用的网络技术、信息技术、整合技术、管理平台技术、以及应用技术等的总称,可以组成资源池,按需所用,灵活便利。云计算技术将变成重要支撑。上述服务器200之间的服务交互功能可以通过云技术实现。
示例的,图2中示出的服务器200可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(CDN,Content Delivery Network)、以及大数据和人工智能平台等基础云计算服务的云服务器。图2中示出的终端设备400和终端设备500可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表、车载终端等,但并不局限于此。终端设备(例如第一终端设备400和第二终端设备500)以及服务器200可以通过有线或无线通信方式进行直接或间接地连接,本申请实施例中不做限制。
在一些实施例中,终端设备(例如第二终端设备500)或服务器200还可以通过运行计算机程序来实现本申请实施例提供的音频解码方法。举例来说,计算机程序可以是操作系统中的原生程序或软件模块;可以是本地(Native)应用程序(APP,Application),即需要在操作系统中安装才能运行的程序,如直播APP、网络会议APP、或者即时通信APP等;也可以是小程序,即只需要下载到浏览器环境中就可以运行的程序。总而言之,上述计算机程序可以是任意形式的应用程序、模块或插件。
下面继续对图2中示出的第二终端设备500的结构进行说明。参见图3,图3是本申请实施例提供的第二终端设备500的结构示意图,图3所示的第二终端设备500包括:至少一个处理器520、存储器560、至少一个网络接口530和用户接口540。第二终端设备500中的各个组件通过总线系统550耦合在一起。可理解,总线系统550用于实现这些组件之间的连接通信。总线系统550除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图3中将各种总线都标为总线系统550。
处理器520可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
用户接口540包括使得能够呈现媒体内容的一个或多个输出装置541,包括一个或多个扬声器和/或一个或多个视觉显示屏。用户接口540还包括一个或多个输入装置542,包括有助于用户输入的用户接口部件,比如键盘、鼠标、麦克风、触屏显示屏、摄像头、其他输入按钮和控件。
存储器560可以是可移除的,不可移除的或其组合。示例性的硬件设备包括固态存储器,硬盘驱动器,光盘驱动器等。存储器560可选地包括在物理位置上远离处理器520的一个或多个存储设备。
存储器560包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器560旨在包括任意适合类型的存储器。
在一些实施例中,存储器560能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。
操作系统561,包括用于处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务;
网络通信模块562,用于经由一个或多个(有线或无线)网络接口530到达其他计算设备,示例性的网络接口530包括:蓝牙、无线相容性认证(WiFi)、和通用串行总线(USB,Universal Serial Bus)等;
呈现模块563,用于经由一个或多个与用户接口540相关联的输出装置541(例如,显示屏、扬声器等)使得能够呈现信息(例如,用于操作外围设备和显示内容和信息的用户接口);
输入处理模块564,用于对一个或多个来自一个或多个输入装置542之一的一个或多个用户输入或互动进行检测以及翻译所检测的输入或互动。
在一些实施例中,本申请实施例提供的音频解码装置可以采用软件方式实现,图3示出了存储在存储器560中的音频解码装置565,其可以是程序和插件等形式的软件,包括以下软件模块:获取模块5651、解码模块5652、标签提取模块5653、重建模块5654和确定模块5655,这些模块是逻辑上的,因此根据所实现的功能可以进行任意的组合或进一步拆分,将在下文中说明各个模块的功能。
下面将结合本申请实施例提供的终端设备的示例性应用,对本申请实施例提供的音频编解码方法进行具体说明。
示例的,参见图4A,图4A是本申请实施例提供的音频编码方法的流程示意图,如图4A所示,在编码端执行的主要步骤包括:步骤101、获取音频信号;步骤102、对音频信号进行编码处理,得到码流。
示例的,参见图4B,图4B是本申请实施例提供的音频解码方法的流程示意图,如图4B所示,在解码端执行的主要步骤包括:步骤201、获取码流;步骤202、对码流进行解码处理,得到音频信号的特征向量的预测值;步骤203、 对特征向量的预测值进行标签提取处理,得到标签信息向量;步骤204、基于特征向量的预测值和标签信息向量进行信号重建;步骤205、将通过信号重建得到的音频信号的预测值,作为码流的解码结果。
下面将以基于网际互连协议的语音传输(VoIP,Voice over Internet Protocol)的会议系统为例,从第一终端设备(即编码端)、服务器、以及第二终端设备(即解码端)之间交互的角度,对本申请实施例提供的音频编解码方法进行具体说明。
示例的,参见图4C,图4C是本申请实施例提供的音频编解码方法的流程示意图,将结合图4C示出的步骤进行说明。
需要说明的是,终端设备执行的步骤可以是由终端设备上运行的客户端执行的,为了表述方便,本申请实施例不对终端设备和终端设备上运行的客户端进行具体区分。此外,还需要说明的是,本申请实施例提供的音频编解码方法可以由终端设备上运行的各种形式的计算机程序执行,并不局限于上述终端设备运行的客户端,还可以是上文所述的操作系统561、软件模块、脚本和小程序,因此下文中以客户端的示例不应视为对本申请实施例的限定。
在对图4C进行说明之前,首先对编码端和解码端的结构进行说明。
示例的,参见图5,图5是本申请实施例提供的编码端和解码端的结构示意图,如图5所示,编码端包括分析网络,用于对输入的音频信号进行特征提取处理,得到音频信号的特征向量,接着可以对音频信号的特征向量进行量化编码处理,得到码流。解码端包括增强网络和合成网络,在对接收到的码流进行解码,得到音频信号的特征向量的预测值之后,可以调用增强网络对音频信号的特征向量的预测值进行标签提取处理,得到标签信息向量,随后可以基于标签信息向量和特征向量的预测值,调用合成网络进行信号重建,得到音频信号的预测值。
下面将结合编码端和解码端的上述结构,对本申请实施例提供的音频编解码方法进行具体说明。
在步骤301中,第一终端设备获取音频信号。
在一些实施例中,第一终端设备响应于用户触发的音频采集指令,调用音频采集装置(例如第一终端设备中内置的麦克风或者外接的麦克风)进行音频信号的采集,得到音频信号,例如可以是网络会议场景中发言人的语音信号、直播场景中主播的语音信号等。
示例的,以网络会议场景为例,当第一终端设备上运行的网络会议APP接收到用户(例如网络会议的发起者)针对人机交互界面中显示的打开“麦克风”按钮的点击操作时,调用第一终端设备自带的麦克风(或者麦克风阵列)对用户发出的语音信号进行采集,得到网络会议的发起者的语音信号。
在步骤302中,第一终端设备对音频信号进行编码处理,得到码流。
在一些实施例中,第一终端设备在调用麦克风采集得到音频信号之后,可以通过以下方式对音频信号进行编码处理,得到码流:首先调用分析网络(例如神经网络)对音频信号进行特征提取处理,得到音频信号的特征向量,接着对音频信号的特征向量进行量化处理(例如矢量量化或者标量量化),得到特征向量的索引值,最后对特征向量的索引值进行编码处理,例如对特征向量的索引值进行熵编码处理,得到码流。
示例的,上述的矢量量化是指将一个向量空间中的点用其中的一个有限子集来进行编码的过程。在矢量量化编码中,关键是码书(或者量化表)的建立和码字的搜索算法,在得到音频信号的特征向量之后,可以首先查询码书中与音频信号的特征向量最匹配的码字,接着可以将查询得到的码字的索引值作为特征向量的索引值,即使用码书中与音频信号的特征向量最匹配的码字的索引值,代替音频信号的特征向量进行传输和存储。
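作为示意，下面给出一段基于numpy的最近码字搜索草图，其中codebook、feature等名称与具体维度均为本示例自行假设，仅用于说明“用最匹配码字的索引代替输入矢量进行传输和存储”这一思路，并非本申请限定的实现。

```python
import numpy as np

def vq_encode(feature, codebook):
    # feature: (D,) 输入特征向量；codebook: (K, D) 码书
    # 计算输入矢量与每个码字之间的欧氏距离，返回最匹配码字的索引值
    dists = np.linalg.norm(codebook - feature[None, :], axis=1)
    return int(np.argmin(dists))

def vq_decode(index, codebook):
    # 解码端仅需按索引值查表，即可得到该矢量的近似（预测值）
    return codebook[index]

codebook = np.random.randn(256, 56)   # 假设码书含256个56维码字
feature = np.random.randn(56)
idx = vq_encode(feature, codebook)    # 实际传输与存储的只是idx
```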
示例的，上述的标量量化是指一维的矢量量化，通过将整个动态范围划分成若干个小区间，每个小区间有一个代表值，量化时落入小区间的信号值就用这个代表值代替，或者叫被量化为这个代表值。这时的信号量是一维的，所以称为标量量化。例如假设音频信号的特征向量落入了小区间2，则可以将小区间2对应的代表值，作为特征向量的索引值。
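类似地，下面用几行代码示意均匀标量量化：落入某个小区间的信号值被量化为该区间的代表值（步长step为本示例假设的取值）。

```python
def scalar_quantize(x, step=0.1):
    # 将标量x映射到小区间的索引值
    return int(round(x / step))

def scalar_dequantize(index, step=0.1):
    # 解码端用索引值乘以步长，得到该小区间的代表值
    return index * step

idx = scalar_quantize(0.237)     # -> 2，即落入小区间2
rep = scalar_dequantize(idx)     # -> 0.2，即该小区间的代表值
```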
示例的,第一终端设备可以通过以下方式对音频信号进行特征提取处理,得到音频信号的特征向量:首先对音频信号进行卷积处理(例如因果卷积),得到音频信号的卷积特征;接着对音频信号的卷积特征进行池化处理,得到音频信号的池化特征;随后对音频信号的池化特征进行下采样处理,得到音频信号的下采样特征;最后对音频信号的下采样特征进行卷积处理,即可得到音频信号的特征向量。
示例的,上述池化处理的本质是降维,在卷积层之后,通过池化来降低卷积层输出的特征维度,减少网络参数和计算成本的同时,降低过拟合现象。池化包括最大池化(Max Pooling)和平均池化(Average Pooling),其中,最大池化是指取局部接受域中值最大的点,即通过最大值的方式减少数据量;平均池化是指取局部接受域中值的平均值。
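下面用一小段numpy代码示意因子为2的最大池化与平均池化的降维过程（窗口划分方式为本示例假设，仅作说明）。

```python
import numpy as np

def pool_1d(x, factor=2, mode="max"):
    # 将一维特征按factor分组，每组取最大值或平均值，从而降低特征维度
    x = x[: len(x) // factor * factor].reshape(-1, factor)
    return x.max(axis=1) if mode == "max" else x.mean(axis=1)

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
pool_1d(x, 2, "max")    # -> [3., 5., 4.]，最大池化
pool_1d(x, 2, "mean")   # -> [2., 3.5, 2.]，平均池化
```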
在另一些实施例中,第一终端设备还可以通过以下方式对音频信号进行编码处理,得到码流:对采集得到的音频信号进行分解处理,例如可以通过2通道的QMF分析滤波器组进行分解处理,得到低频子带信号和高频子带信号;接着分别对低频子带信号和高频子带信号进行特征提取处理,对应得到低频子带信号的特征向量和高频子带信号的特征向量;随后对低频子带信号的特征向量进行量化编码处理,得到音频信号的低频码流,并对高频子带信号的特征向量进行量化编码处理,得到音频信号的高频码流,如此,通过先对音频信号进行分解处理,再对分解得到的低频子带信号和高频子带信号分别进行量化编码的方式,可以有效减少由于压缩造成的信息丢失。
示例的,第一终端设备可以通过以下方式对音频信号进行分解处理,得到低频子带信号和高频子带信号:首先对音频信号进行采样处理,得到采样信号,其中,采样信号包括采集得到的多个样本点;接着对采样信号进行低通滤波处理,得到低通滤波信号;随后对低通滤波信号进行下采样处理,得到低频子带信号。类似的,对采样信号进行高通滤波处理,得到高通滤波信号,并对高通滤波信号进行下采样处理,即可得到高频子带信号。
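下面给出一段基于scipy的两通道分解示意代码：先进行低通/高通滤波，再做因子为2的下采样，对应得到低频子带信号和高频子带信号。其中滤波器阶数、截止频率等均为本示例假设的取值，实际可按QMF分析滤波器组的系数设计。

```python
import numpy as np
from scipy.signal import firwin, lfilter

def split_two_bands(x, taps=64):
    h_low = firwin(taps, 0.5)                  # 低通原型滤波器，归一化截止频率0.5（即Fs/4）
    h_high = h_low * (-1) ** np.arange(taps)   # 高通系数可由低通系数符号交替得到
    low = lfilter(h_low, 1.0, x)[::2]          # 低通滤波后隔点下采样
    high = lfilter(h_high, 1.0, x)[::2]        # 高通滤波后隔点下采样
    return low, high

x = np.random.randn(320)                       # 一帧20ms、16000Hz的采样信号
x_lb, x_hb = split_two_bands(x)                # 低频/高频子带信号各160个样本点
```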
示例的,上述的低通滤波(Low-Pass Filter)是一种过滤方式,规则为低频信号能够正常通过,而超过设定的临界值的高频信号则被阻隔、减弱。低通滤波可以简单的认为:设定一个频率点,当信号频率高于这个频率时不能通过。在数字信号中,这个频率点也就是截止频率,当频域高于这个截止频率时,则全部赋值为0。因为在这一处理过程中,让低频信号全部通过,所以称为低通滤波。
示例的,上述的高通滤波(High-Pass Filter)是一种过滤方式,规则为高频信号能够正常通过,而低于设定的临界值的低频信号则被阻隔、减弱。高通滤波可以简单的认为:设定一个频率点,当信号频率低于这个频率时不能通过。在数字信号中,这个频率点也称为截止频率,当频域低于这个截止频率时,则全部赋值为0。因为在这一处理过程中,让高频信号全部通过,所以称为高通滤波。
示例的，上述的下采样处理（又称为降采样处理）是一种减少采样点数的方法，例如可以通过隔位取值的方式进行下采样处理，得到低频子带信号，例如针对低通滤波信号包括的多个采样信号，可以每3位选取一次（即每隔2位取1个），即分别选取第1位采样信号、第4位采样信号、第7位采样信号，以此类推，从而得到低频子带信号。
在一些实施例中,第一终端设备还可以通过以下方式对音频信号进行编码处理,得到码流:首先对采集得到的音频信号进行分解处理,得到N个子带信号,其中,N为大于2的整数;接着分别对每个子带信号进行特征提取处理,得到每个子带信号的特征向量,例如针对分解得到的每个子带信号,可以调用神经网络模型进行特征提取处理,得到该子带信号的特征向量;随后分别对每个子带信号的特征向量进行量化编码处理,得到N个子码流。
示例的,可以通过以下方式实现上述的对采集得到的音频信号进行分解处理,得到N个子带信号:例如可以通过4通道的QMF分析滤波器组进行分解处理,得到4个子带信号,举例来说,可以首先对音频信号进行一次低通滤波和高通滤波,得到低频子带信号和高频子带信号,接着,针对低频子带信号可以再进行一次低通滤波和高通滤波,对应得到子带信号1和子带信号2;类似的,可以对分解得到的高频子带信号再次进行低通滤波和高通滤波,对应得到子带信号3和子带信号4,如此,通过迭代两层的2通道QMF分析滤波,即可将音频信号分解成4个子带信号。
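基于上面两通道分解的思路，可以用二叉树方式迭代两层得到4个子带信号，如下所示（split_two_bands沿用前文示例中的假设实现）。

```python
def split_four_bands(x):
    low, high = split_two_bands(x)       # 第一层：低频/高频子带信号
    sub1, sub2 = split_two_bands(low)    # 第二层：对低频子带信号再分解
    sub3, sub4 = split_two_bands(high)   # 第二层：对高频子带信号再分解
    return sub1, sub2, sub3, sub4        # 4个子带信号，每个80个样本点

subbands = split_four_bands(np.random.randn(320))
```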
在步骤303中,第一终端设备向服务器发送码流。
在一些实施例中,第一终端设备在对采集得到的音频信号进行编码处理,得到码流之后,可以通过网络将码流发送给服务器。
在步骤304中,服务器向第二终端设备发送码流。
在一些实施例中,服务器在接收到第一终端设备(即编码端,例如网络会议的发起者所关联的终端设备)发送的码流之后,可以通过网络将码流发送至第二终端设备(即解码端,例如网络会议的参会对象所关联的终端设备)。
在另一些实施例中,考虑前向兼容性,可以在服务器中部署转码器,以解决新的编码器(即基于人工智能的方式进行编码的编码器,例如NN编码器)和传统的编码器(即基于时域和频域的变换的方式进行编码的编码器,例如G.722编码器)之间互联互通问题。例如,如果第一终端设备(即发送端)中部署的是新的NN编码器,而第二终端设备(即接收端)中部署的是传统的解码器(例如G.722解码器),将导致第二终端设备无法正确解码第一终端设备发送的码流。针对上述情况,可以在服务器中部署转码器,例如服务器在接收到第一终端设备发送的基于NN编码器编码得到的码流之后,可以首先调用NN解码器生成音频信号,然后调用传统的编码器(例如G.722编码器)生成特定码流,如此,第二终端设备可以正确解码,也就是说,能够避免由于编码端部署的编码器和解码端部署的解码器版本不一致,导致解码端无法正确解码的问题,提高了编解码过程中的兼容性。
在步骤305中,第二终端设备对码流进行解码处理,得到音频信号的特征向量的预测值。
在一些实施例中,第二终端设备可以通过以下方式实现步骤305:首先对码流进行解码处理,得到音频信号的特征向量的索引值;接着基于索引值查询量化表,得到音频信号的特征向量的预测值。例如当编码端采用量化表中与音频信号的特征向量最匹配的码字的索引值,代替特征向量进行后续的编码时,则解码端在对码流进行解码处理,得到索引值之后,可以基于索引值执行简单的查表操作,即可得到音频信号的特征向量的预测值。
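下面用几行代码示意“解码得到索引值后查询量化表”的过程（quant_table为假设的量化表，与编码端使用的码书一致）。

```python
import numpy as np

quant_table = np.random.randn(256, 56)   # 假设与编码端一致的量化表（码书）

def decode_feature(index):
    # 熵解码得到索引值后，仅需执行简单的查表操作
    return quant_table[index]

f_pred = decode_feature(17)              # 得到56维的特征向量的预测值
```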
需要说明的是,解码处理与编码处理为逆过程,例如当编码端采用熵编码的方式对音频信号的特征向量进行编码得到码流时,解码端可以相应采用熵解码的方式对接收到的码流进行解码处理,得到音频信号的特征向量的索引值。
在另一些实施例中,当码流包括低频码流和高频码流时,第二终端设备还可以通过以下方式实现上述的步骤305:对低频码流进行解码处理,得到低频子带信号的特征向量的预测值;对高频码流进行解码处理,得到高频子带信号的特征向量的预测值,其中,低频码流是对音频信号经过分解处理后得到的低频子带信号进行编码得到的,高频码流是对音频信号经过分解处理后得到的高频子带信号进行编码得到的。以低频码流为例,当编码端采用熵编码的方式对低频子带信号的特征向量进行编码时,解码端可以采用相应的熵解码方式对低频码流进行解码处理。
示例的,针对低频码流,第二终端设备可以先对低频码流进行解码处理,得到低频子带信号的特征向量的索引值(假设为索引值1),接着基于索引值1查询量化表,得到低频子带信号的特征向量的预测值。类似的,针对高频码流,第二终端设备可以先对高频码流进行解码处理,得到高频子带信号的特征向量的索引值(假设为索引值2),接着基于索引值2查询量化表,得到高频子带信号的特征向量的预测值。
在一些实施例中，当码流包括N个子码流时，其中，N个子码流对应不同的频段，且是对音频信号经过分解处理后得到的N个子带信号分别进行编码得到的，N为大于2的整数，第二终端设备还可以通过以下方式实现上述的步骤305：对N个子码流分别进行解码处理，得到N个子带信号分别对应的特征向量的预测值。需要说明的是，此处针对N个子码流的解码过程可以参考上文针对低频码流或者高频码流的解码过程实现，本申请实施例在此不再赘述。
示例的,以N个子码流为4个子码流为例,分别为子码流1、子码流2、子码流3和子码流4,其中,子码流1是对子带信号1进行编码得到的,子码流2是对子带信号2进行编码得到的,子码流3是对子带信号3进行编码得到的,子码流4是对子带信号4进行编码得到的,第二终端设备在接收到这4个子码流之后,可以分别对这4个子码流进行解码处理,对应得到4个子带信号分别对应的特征向量的预测值,例如包括子带信号1的特征向量的预测值、子带信号2的特征向量的预测值、子带信号3的特征向量的预测值和子带信号4的特征向量的预测值。
在步骤306中,第二终端设备对特征向量的预测值进行标签提取处理,得到标签信息向量。
这里,标签信息向量是用于信号增强的,同时,标签信息向量的维度与特征向量的预测值的维度是相同的,如此, 后续在进行信号重建时,能够将特征向量的预测值和标签信息向量进行拼接,实现了通过增加核心成分所占的比例对重建得到的音频信号进行信号增强的效果,也就是说,通过结合特征向量的预测值和标签信息向量进行信号重建,使得重建得到的音频信号中所有的核心成分都能得到增强,从而提高了重建得到的音频信号的质量。
在一些实施例中,第二终端设备可以通过调用增强网络对特征向量的预测值进行标签提取处理,得到标签信息向量,其中,增强网络包括卷积层、神经网络层、全连接网络层和激活层,下面,结合增强网络的上述结构说明提取标签信息向量的过程。
示例的,参见图6A,图6A是本申请实施例提供的音频解码方法的流程示意图,如图6A所示,图4C示出的步骤306可以通过图6A示出的步骤3061至步骤3064实现,将结合图6A示出的步骤进行说明。
在步骤3061中,第二终端设备对特征向量的预测值进行卷积处理,得到与特征向量的预测值相同维度的第一张量。
在一些实施例中,第二终端设备可以将在步骤305中得到的特征向量的预测值作为输入,调用增强网络包括的卷积层(例如一个一维的因果卷积),生成与特征向量的预测值相同维度的第一张量(张量是包括多个维度的数值的一种量),例如,如图13所示,特征向量的预测值的维度为56×1,则经过因果卷积处理之后,生成56×1的张量。
在步骤3062中,第二终端设备对第一张量进行特征提取处理,得到与第一张量相同维度的第二张量。
在一些实施例中,对于经过因果卷积处理后得到的第一张量,可以通过增强网络包括的神经网络层(例如长短期记忆网络、时间递归神经网络等)进行特征提取处理,生成与第一张量相同维度的第二张量。例如,如图13所示,第一张量的维度为56×1,则经过一层长短期记忆(LSTM,Long Short-Term Memory)网络进行特征提取处理之后,生成56×1的张量。
在步骤3063中,第二终端设备对第二张量进行全连接处理,得到与第二张量相同维度的第三张量。
在一些实施例中,在经过增强网络包括的神经网络层的特征提取处理,得到与第一张量相同维度的第二张量之后,第二终端设备可以调用增强网络包括的全连接网络层对第二张量进行全连接处理,得到与第二张量相同维度的第三张量。例如,如图13所示,第二张量的维度为56×1,则调用一个全连接网络层进行全连接处理之后,生成56×1的张量。
在步骤3064中,第二终端设备对第三张量进行激活处理,得到标签信息向量。
在一些实施例中,在经过增强网络包括的全连接网络层的全连接处理,得到与第二张量相同维度的第三张量之后,第二终端设备可以调用增强网络包括的激活层,即激活函数(例如ReLU函数、Sigmoid函数、Tanh函数等)对第三张量进行激活处理,这样,就生成了与特征向量的预测值相同维度的标签信息向量。例如,如图13所示,第三张量的维度为56×1,在调用ReLU函数对第三张量进行激活处理后,得到维度为56×1的标签信息向量。
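结合上述步骤3061至步骤3064，下面给出增强网络的一个PyTorch示意实现，其中通道数与维度按图13中56×1的设置假设，因果填充方式、层的具体超参数等细节均为本示例自行选择，并非本申请限定的实现。

```python
import torch
import torch.nn as nn

class EnhanceNet(nn.Module):
    def __init__(self, dim=56):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3)   # 一维因果卷积（配合左侧填充）
        self.lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)
        self.fc = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, f_pred):
        # f_pred: (batch, dim)，即特征向量的预测值
        x = nn.functional.pad(f_pred.unsqueeze(-1), (2, 0))   # 仅在左侧补零，保证因果性
        x = self.conv(x)                                      # 第一张量，维度仍为dim×1
        x, _ = self.lstm(x.transpose(1, 2))                   # 第二张量
        x = self.fc(x)                                        # 第三张量
        return self.act(x).squeeze(1)                         # 标签信息向量，维度与f_pred相同

e_vec = EnhanceNet()(torch.randn(2, 56))                      # -> (2, 56)
```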
在另一些实施例中,当特征向量的预测值包括低频子带信号的特征向量的预测值、以及高频子带信号的特征向量的预测值时,第二终端设备还可以通过以下方式实现上述的步骤306:对低频子带信号的特征向量的预测值进行标签提取处理,得到第一标签信息向量,其中,第一标签信息向量的维度与低频子带信号的特征向量的预测值的维度相同,用于低频子带信号的信号增强;对高频子带信号的特征向量的预测值进行标签提取处理,得到第二标签信息向量,其中,第二标签信息向量的维度与高频子带信号的特征向量的预测值的维度相同,用于高频子带信号的信号增强。
示例的,第二终端设备可以通过以下方式实现上述的对低频子带信号的特征向量的预测值进行标签提取处理,得到第一标签信息向量:调用第一增强网络执行以下处理:对低频子带信号的特征向量的预测值进行卷积处理,得到与低频子带信号的特征向量的预测值相同维度的第四张量;对第四张量进行特征提取处理,得到与第四张量相同维度的第五张量;对第五张量进行全连接处理,得到与第五张量相同维度的第六张量;对第六张量进行激活处理,得到第一标签信息向量。
示例的,第二终端设备可以通过以下方式实现上述的对高频子带信号的特征向量的预测值进行标签提取处理,得到第二标签信息向量:调用第二增强网络执行以下处理:对高频子带信号的特征向量的预测值进行卷积处理,得到与高频子带信号的特征向量的预测值相同维度的第七张量;对第七张量进行特征提取处理,得到与第七张量相同维度的第八张量;对第八张量进行全连接处理,得到与第八张量相同维度的第九张量;对第九张量进行激活处理,得到第二标签信息向量。
需要说明的是,针对低频子带信号的特征向量的预测值的标签提取过程、以及针对高频子带信号的特征向量的预测值的标签提取过程,与针对音频信号的特征向量的预测值的标签提取过程是类似的,可以参考图6A的描述实现,本申请实施例在此不再赘述。此外,还需要说明的是,第一增强网络和第二增强网络的结构与上文中增强网络的结构是类似的,本申请实施例在此不再赘述。
在一些实施例中,当特征向量的预测值包括N个子带信号分别对应的特征向量的预测值时,第二终端设备可以通过以下方式实现上述的步骤306:对N个子带信号分别对应的特征向量的预测值分别进行标签提取处理,得到N个标签信息向量,其中,每个标签信息向量的维度与对应子带信号的特征向量的预测值的维度相同。
示例的,第二终端设备可以通过以下方式实现上述的对N个子带信号分别对应的特征向量的预测值分别进行标签提取处理,得到N个标签信息向量:基于第i子带信号的特征向量的预测值,调用第i增强网络进行标签提取处理,得到第i标签信息向量;其中,i的取值范围满足1≤i≤N,且第i标签信息向量的维度与第i子带信号的特征向量的预测值的维度相同,用于第i子带信号的信号增强。
举例来说,第二终端设备可以通过以下方式实现上述的基于第i子带信号的特征向量的预测值,调用第i增强网络进行标签提取处理,得到第i标签信息向量:调用第i增强网络执行以下处理:对第i子带信号的特征向量的预测值进行卷积处理,得到与第i子带信号的特征向量的预测值相同维度的第十张量;对第十张量进行特征提取处理,得到与第十张量相同维度的第十一张量;对第十一张量进行全连接处理,得到与第十一张量相同维度的第十二张量;对 第十二张量进行激活处理,得到第i标签信息向量。
需要说明的是,第i增强网络的结构与上文中增强网络的结构是类似的,本申请实施例在此不再赘述。
在步骤307中,第二终端设备基于特征向量的预测值和标签信息向量进行信号重建,得到音频信号的预测值。
在一些实施例中,第二终端设备可以通过以下方式实现步骤307:对特征向量的预测值和标签信息向量进行拼接处理,得到拼接向量;对拼接向量进行压缩处理,得到音频信号的预测值,其中,压缩处理可以通过卷积处理、上采样处理以及池化处理的一次或者多次级联实现,例如可以通过下述的步骤3072至步骤3075实现,音频信号的预测值包括音频信号的频率、波长、振幅等参数分别对应的预测值。
在另一些实施例中,第二终端设备可以基于特征向量的预测值和标签信息向量,调用合成网络进行信号重建,得到音频信号的预测值,其中,合成网络包括第一卷积层、上采样层、池化层、第二卷积层,下面,结合合成网络的上述结构说明信号重建的过程。
示例的,参见图6B,图6B是本申请实施例提供的音频解码方法的流程示意图,如图6B所示,图4C示出的步骤307可以通过图6B示出的步骤3071至步骤3075实现,将结合图6B示出的步骤进行说明。
在步骤3071中,第二终端设备对特征向量的预测值和标签信息向量进行拼接处理,得到拼接向量。
在一些实施例中,第二终端设备可以将基于步骤305得到的特征向量的预测值、以及基于步骤306得到的标签信息向量进行拼接处理,得到拼接向量,并将拼接向量作为合成网络的输入,进行信号重建。
在步骤3072中,第二终端设备对拼接向量进行第一卷积处理,得到音频信号的卷积特征。
在一些实施例中,在对特征向量的预测值和标签信息向量进行拼接处理,得到拼接向量之后,第二终端设备可以调用合成网络包括的第一卷积层(例如一个一维的因果卷积)对拼接向量进行卷积处理,得到音频信号的卷积特征。例如,如图14所示,在对拼接向量进行因果卷积处理之后,得到一个维度为192×1的张量(即音频信号的卷积特征)。
在步骤3073中,第二终端设备对卷积特征进行上采样处理,得到音频信号的上采样特征。
在一些实施例中,在得到音频信号的卷积特征之后,第二终端设备可以调用合成网络包括的上采样层对音频信号的卷积特征进行上采样处理,其中,上采样处理可以是通过多个级联的解码层实现的,且不同解码层的采样因子不同,则第二终端设备可以通过以下方式对音频信号的卷积特征进行上采样处理,得到音频信号的上采样特征:通过多个级联的解码层中的第一个解码层,对卷积特征进行上采样处理;将第一个解码层的上采样结果输出到后续级联的解码层,并通过后续级联的解码层继续进行上采样处理和上采样结果输出,直至输出到最后一个解码层;将最后一个解码层输出的上采样结果,作为音频信号的上采样特征。
示例的,上述的上采样处理是一种增加音频信号的卷积特征的维度的方法,例如可以通过插值(例如双线性插值)的方式对音频信号的卷积特征进行上采样处理,得到音频信号的上采样特征,其中,上采样特征的维度大于卷积特征的维度,也就是说,通过上采样处理可以增加卷积特征的维度。
示例的,参见图14,以多个级联的解码层(又称解码块)为3个级联的解码层为例,可以级联3个不同上采样因子(Up_factor)的解码层。以解码层(Up_factor=8)为例,可以先执行1个或者多个空洞卷积,每个卷积核大小均固定为1×3、移位率(Stride Rate)为1。此外,1个或者多个空洞卷积的扩张率(Dilation Rate)可根据需求设置,比如可以设置为3,当然,本申请实施例也不限制不同空洞卷积设置不同的扩展率。然后,将3个解码层的Up_factor分别设置为8、5、4,等效于设置了不同大小的池化因子,起到上采样的作用。最后,将3个解码层的通道数分别设置为96、48、24。如此,经过3个解码层进行上采样处理之后,音频信号的卷积特征(例如192×1的张量)将依次转换成96×8、48×40和24×160的张量,则可以将24×160的张量作为音频信号的上采样特征。
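下面给出级联解码层的一个示意实现：每个解码层先按Up_factor做上采样，再做空洞卷积，通道数按图14中96、48、24的设置假设，上采样方式（最近邻插值）与扩张率为本示例自行选择。

```python
import torch
import torch.nn as nn

class DecodeBlock(nn.Module):
    def __init__(self, in_ch, out_ch, up_factor, dilation=3):
        super().__init__()
        self.up = nn.Upsample(scale_factor=up_factor, mode="nearest")
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3,
                              dilation=dilation, padding=dilation)

    def forward(self, x):
        return self.conv(self.up(x))    # 先上采样，再做空洞卷积

decode_layers = nn.Sequential(
    DecodeBlock(192, 96, up_factor=8),
    DecodeBlock(96, 48, up_factor=5),
    DecodeBlock(48, 24, up_factor=4),
)
x = torch.randn(1, 192, 1)              # 对应192×1的卷积特征
y = decode_layers(x)                    # 依次变为96×8、48×40、24×160的张量
```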
在步骤3074中,第二终端设备对上采样特征进行池化处理,得到音频信号的池化特征。
在一些实施例中,在对音频信号的卷积特征进行上采样处理,得到音频信号的上采样特征之后,第二终端设备可以调用合成网络中的池化层对上采样特征进行池化处理,例如对上采样特征做因子为2的池化操作,得到音频信号的池化特征,例如,参见图14,音频信号的上采样特征为24×160的张量,则经过池化处理(即图14中示出的后处理)之后,生成24×320的张量(即音频信号的池化特征)。
在步骤3075中,第二终端设备对池化特征进行第二卷积处理,得到音频信号的预测值。
在一些实施例中，在对音频信号的上采样特征进行池化处理，得到音频信号的池化特征之后，第二终端设备还可以对音频信号的池化特征，调用合成网络包括的第二卷积层，例如调用图14所示的因果卷积，对池化特征进行空洞卷积，生成音频信号的预测值。

在另一些实施例中，当特征向量的预测值包括低频子带信号的特征向量的预测值、以及高频子带信号的特征向量的预测值时，第二终端设备还可以通过以下方式实现上述的步骤307：对低频子带信号的特征向量的预测值、以及第一标签信息向量（即对低频子带信号的特征向量的预测值进行标签提取处理得到的标签信息向量）进行拼接处理，得到第一拼接向量；基于第一拼接向量调用第一合成网络进行信号重建，得到低频子带信号的预测值；对高频子带信号的特征向量的预测值、以及第二标签信息向量（即对高频子带信号的特征向量的预测值进行标签提取处理得到的标签信息向量）进行拼接处理，得到第二拼接向量；基于第二拼接向量调用第二合成网络进行信号重建，得到高频子带信号的预测值；对低频子带信号的预测值和高频子带信号的预测值进行合成处理，得到音频信号的预测值。
示例的,第二终端设备可以通过以下方式实现上述的基于第一拼接向量调用第一合成网络进行信号重建,得到低频子带信号的预测值:调用第一合成网络执行以下处理:对第一拼接向量进行第一卷积处理,得到低频子带信号的卷积特征;对低频子带信号的卷积特征进行上采样处理,得到低频子带信号的上采样特征;对低频子带信号的上采样特征进行池化处理,得到低频子带信号的池化特征;对低频子带信号的池化特征进行第二卷积处理,得到低频子带信号的预测值;其中,上采样处理可以是通过多个级联的解码层实现的,且不同解码层的采样因子不同。
示例的，第二终端设备可以通过以下方式实现上述的基于第二拼接向量调用第二合成网络进行信号重建，得到高频子带信号的预测值：调用第二合成网络执行以下处理：对第二拼接向量进行第一卷积处理，得到高频子带信号的卷积特征；对高频子带信号的卷积特征进行上采样处理，得到高频子带信号的上采样特征；对高频子带信号的上采样特征进行池化处理，得到高频子带信号的池化特征；对高频子带信号的池化特征进行第二卷积处理，得到高频子带信号的预测值；其中，上采样处理可以是通过多个级联的解码层实现的，且不同解码层的采样因子不同。
需要说明的是,针对低频子带信号的重建过程(即低频子带信号的预测值的生成过程)、以及高频子带信号的重建过程(即高频子带信号的预测值的生成过程),与音频信号的重建过程(即音频信号的预测值的生成过程)是类似的,可以参考图6B的描述实现,本申请实施例在此不再赘述。此外,还需要说明的是,第一合成网络和第二合成网络的结构与上文中合成网络的结构是类似的,本申请实施例在此不再赘述。
在另一些实施例中,当特征向量的预测值包括N个子带信号分别对应的特征向量的预测值时,第二终端设备还可以通过以下方式实现上述的步骤307:对N个子带信号分别对应的特征向量的预测值、以及N个标签信息向量进行一一对应的拼接处理,得到N个拼接向量;基于第j拼接向量调用第j合成网络进行信号重建,得到第j子带信号的预测值;其中,j的取值范围满足1≤j≤N;对N个子带信号分别对应的预测值进行合成处理,得到音频信号的预测值。
示例的,第二终端设备可以通过以下方式实现上述的基于第j拼接向量调用第j合成网络进行信号重建,得到第j子带信号的预测值:调用第j合成网络执行以下处理:对第j拼接向量进行第一卷积处理,得到第j子带信号的卷积特征;对第j子带信号的卷积特征进行上采样处理,得到第j子带信号的上采样特征;对第j子带信号的上采样特征进行池化处理,得到第j子带信号的池化特征;对第j子带信号的池化特征进行第二卷积处理,得到第j子带信号的预测值;其中,上采样处理可以是通过多个级联的解码层实现的,且不同解码层的采样因子不同。
需要说明的是,第j合成网络的结构与上文中合成网络的结构是类似的,本申请实施例在此不再赘述。
在步骤308中,第二终端设备将通过信号重建得到的音频信号的预测值,作为码流的解码结果。
在一些实施例中,第二终端设备在通过信号重建得到音频信号的预测值之后,可以将通过信号重建得到的音频信号的预测值,作为码流的解码结果,并将解码结果发送至第二终端设备内置的扬声器中进行播放。
本申请实施例提供的音频解码方法,通过对解码得到的特征向量的预测值进行标签提取处理,得到标签信息向量,并结合特征向量的预测值和标签信息向量进行信号重建,由于标签信息向量反映的是音频信号的核心成分(即不包括噪声等声学干扰),因此,相较于仅仅基于特征向量的预测值进行信号重建,本申请实施例结合特征向量的预测值和标签信息向量进行信号重建,相当于增加了音频信号中核心成分(例如人声)所占的比例,减小了音频信号中噪声等声学干扰(例如背景音)所占的比例,从而可以有效抑制编码端采集的音频信号中包括的噪声成分,进而提高了重建得到的音频信号的质量。
下面,将以VoIP会议系统为例,说明本申请实施例在一个实际的应用场景中的示例性应用。
示例的,参见图7,图7是本申请实施例提供的端到端的语音通信链路示意图,如图7所示,可以在编码端(即码流的发送端)应用本申请实施例提供的音频编码方法、在解码端(即码流的接收端)应用本申请实施例提供的音频解码方法。这是会议这种通信系统的最核心部分,解决了压缩的基本功能。一般地,编码器部署在上行客户端,解码器部署在下行客户端。
此外,考虑前向兼容,需要在服务器中同样部署转码器,以解决新的编码器与相关技术的编码器之间的互联互通问题。例如,如果发送端部署的是新的NN编码器,而接收端部署的是传统的公用电话交换网(PSTN,Public Switched Telephone Network)的解码器(例如G.722解码器),会导致接收端无法正确解码发送端直接发送的码流。因此,服务器在接收到发送端发送的码流之后,首先需要执行NN解码器生成语音信号,然后调用G.722编码器生成特定码流,才能让接收端正确解码。类似的转码场景不再展开。
下面对本申请实施例提供的音频编解码方法进行具体说明。
在一些实施例中,参见图8,图8是本申请实施例提供的音频编解码方法的流程示意图,如图8所示,编码端的主要步骤包括:对于输入信号,例如第n帧语音信号,记为x(n),调用分析网络进行特征提取处理,获得低维度的特征向量,记为F(n);特别的,特征向量F(n)的维度小于输入信号x(n)的维度,从而减少数据量。一个特定的实现可以是调用空洞卷积网络(Dilated CNN)对第n帧语音信号x(n)进行特征提取处理,生成更低维度的特征向量F(n)。需要说明的是,本申请实施例不限制其他的NN结构,包括但不限于自编码器(AE,Autoencoder)、全连接(FC,Full-Connection)网络、长短期记忆(LSTM,Long Short-Term Memory)网络、卷积神经网络(CNN,Convolutional Neural Network)+LSTM等。在得到特征向量F(n)之后,可以对特征向量F(n)进行矢量量化或者标量量化,并将量化后得到的索引值进行熵编码,得到码流(bitstream),最后将码流传输到解码端。
继续参见图8,解码端的主要步骤包括:对接收到的码流进行解码,得到特征向量的估计值,记为F′(n)。接着基于特征向量的估计值F′(n)调用增强网络,生成增强用的标签信息向量,记为E(n),最后结合特征向量的估计值F′(n)和标签信息向量E(n),调用合成网络(对应编码端的逆过程)进行信号的重建,并抑制编码端采集到的语音信号中包含的噪声成分,生成与输入信号x(n)对应的信号估计值,记为x′(n)。
为了更好地理解本申请实施例提供的音频编解码方法,在对本申请实施例提供的音频编解码方法进行具体说明之前,首先对空洞卷积网络以及QMF滤波器组进行介绍。
示例的,参见图9A和图9B,图9A是本申请实施例提供的普通卷积的示意图,图9B是本申请实施例提供的空洞卷积的示意图。相对普通卷积,空洞卷积的提出,是为了解决增加感受野的同时保持特征图的尺寸不变,从而避免因为上采样、下采样引起的误差。虽然图9A和图9B中示出的卷积核大小(Kernel size)均为3×3;但是,图9A所示的普通卷积的感受野只有3,而图9B所示的空洞卷积的感受野达到了5。也就是说,对于尺寸为3×3的卷积核,图9A所示的普通卷积的感受野为3,扩张率(Dilation rate)为1;而图9B所示的空洞卷积的感受野为5,扩张率为2。
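单层空洞卷积的感受野与卷积核大小、扩张率之间的关系，可以用下面几行代码粗略验证。

```python
def receptive_field(kernel_size, dilation):
    # 单层空洞卷积的感受野：dilation * (kernel_size - 1) + 1
    return dilation * (kernel_size - 1) + 1

receptive_field(3, 1)   # -> 3，对应图9A所示的普通卷积
receptive_field(3, 2)   # -> 5，对应图9B所示的空洞卷积
```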
卷积核还可以在类似图9A或者图9B的平面上进行移动，这里会涉及到移位率（Stride rate）的概念，例如，假设卷积核每次移位1格，则对应的移位率为1。
需要说明的是,可以根据实际应用需要,自行定义空洞卷积核大小(例如,针对语音信号,卷积核的大小一般为1×3)、扩张率、移位率和通道数等,本申请实施例对此不作具体限定。
下面继续对QMF滤波器组进行说明。
QMF滤波器组是一个包含分析-合成的滤波器对。对于QMF分析滤波器,可以将输入的采样率为Fs的信号分解成两路采样率为Fs/2的信号,分别表示QMF低通信号和QMF高通信号。如图10所示,为QMF分析滤波器组的低通部分H_Low(z)和高通部分H_High(z)的频谱响应。基于QMF分析滤波器组的相关理论知识,可以很容易地描述上述低通滤波和高通滤波的系数之间的相关性:
h_High(k) = (-1)^k · h_Low(k)     (1)
其中，h_Low(k)表示低通滤波的系数，h_High(k)表示高通滤波的系数。
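式(1)所描述的系数关系可以用如下代码直接实现（h_low为假设已设计好的低通滤波系数）。

```python
import numpy as np

def qmf_highpass(h_low):
    # h_High(k) = (-1)^k * h_Low(k)
    k = np.arange(len(h_low))
    return ((-1) ** k) * h_low
```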
类似的,根据QMF相关理论,也可以基于QMF分析滤波器组H_Low(z)和H_High(z),描述QMF合成滤波器组,详细数学背景不在此重复。
G_Low(z) = H_Low(z)      (2)
G_High(z) = (-1) · H_High(z)    (3)
其中，G_Low(z)表示用于恢复低通信号的合成滤波器，G_High(z)表示用于恢复高通信号的合成滤波器。
解码端恢复出低通信号和高通信号之后,经过QMF合成滤波器组进行合成处理,即可以恢复出输入信号对应的采样率Fs的重建信号。
此外，除了上述2通道QMF方案，还可以扩展为N通道QMF的方案；特别地，可以使用二叉树的方式，迭代地对当前子带信号做2通道QMF分析，以获得更低分辨率的子带信号。图11A表示通过迭代两层的2通道QMF分析滤波，可以获得4通道的子带信号。图11B是另一种实现方式，考虑到高频部分信号对质量影响小，无需那么高精度的分析；因此，只需要对原始信号做一次高通滤波即可。类似地，可以实现更多通道的方式，比如，8、16、32通道，在此不进一步展开。
下面对本申请实施例提供的音频编解码方法进行具体说明。
在一些实施例中,以采样率Fs=16000Hz的语音信号为例,需要说明的是,本申请实施例提供的方法也适用于其他采样率的场景,包括但不限于:8000Hz、32000Hz、48000Hz。同时,假设帧长设置为20ms,因此,对于Fs=16000Hz,相当于每帧包含320个样本点。
下面将结合图8所示的音频编解码方法的流程示意图,分别对编码端和解码端的流程进行详细说明。
(一)关于编码端的流程如下:
首先,输入信号的生成。
如前所述,对于采样率Fs=16000Hz的语音信号,假设帧长为20ms,则对于第n帧的语音信号,其包括320个样本点,记为输入信号x(n)。
其次,调用分析网络进行数据压缩。
分析网络的目的是基于输入信号x(n),通过调用分析网络(例如神经网络),生成更低维度的特征向量F(n)。在本实施例中,输入信号x(n)的维度为320,特征向量F(n)的维度为56,从数据量看,经过分析网络进行特征提取之后,起到了“降维”的作用,实现了数据压缩的功能。
示例的,参见图12,图12是本申请实施例提供的分析网络的结构示意图,如图12所示,首先调用一个24通道的因果卷积,将输入信号x(n)扩展为24×320的张量,其中,输入信号x(n)为1×320的张量。接着对扩展得到的24×320的张量进行预处理。例如,可以对扩展得到的24×320的张量做因子为2的池化(Pooling)操作、且激活函数可以为线性整流函数(ReLU,Linear Rectification Function),生成24×160的张量。接下来,可以级联3个不同下采样因子(Down_factor)的编码块。以编码块(Down_factor=4)为例,可以先执行1个或者多个空洞卷积,每个卷积核大小均固定为1×3、移位率(Stride Rate)为1。此外,1个或者多个空洞卷积的扩张率(Dilation Rate)可根据需求设置,比如可以设置为3,当然,本申请实施例也不限制不同空洞卷积设置不同的扩展率。然后,将3个编码块的Down_factor分别设置为4、5、8,等效于设置了不同大小的池化因子,起到下采样的作用。最后,将3个编码块的通道数分别设置为48、96、192。如此,经过3个编码块进行下采样处理之后,24×160的张量将依次转换成48×40、96×8和192×1的张量。最后,对192×1的张量,再经过类似预处理的因果卷积,可以输出一个56维的特征向量F(n)。
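按照图12的结构描述，下面给出分析网络的一个PyTorch示意实现：其中Down_factor用与池化因子等大的平均池化近似，因果卷积用普通卷积加填充代替，激活位置等细节为本示例自行假设，并非本申请限定的实现。

```python
import torch
import torch.nn as nn

class EncodeBlock(nn.Module):
    def __init__(self, in_ch, out_ch, down_factor, dilation=3):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3,
                              dilation=dilation, padding=dilation)
        self.pool = nn.AvgPool1d(kernel_size=down_factor)   # 等效于下采样

    def forward(self, x):
        return self.pool(torch.relu(self.conv(x)))          # 先空洞卷积，再池化下采样

class AnalysisNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.pre_conv = nn.Conv1d(1, 24, kernel_size=3, padding=1)   # 1×320 -> 24×320
        self.pre_pool = nn.AvgPool1d(2)                              # 预处理：-> 24×160
        self.blocks = nn.Sequential(
            EncodeBlock(24, 48, down_factor=4),     # -> 48×40
            EncodeBlock(48, 96, down_factor=5),     # -> 96×8
            EncodeBlock(96, 192, down_factor=8),    # -> 192×1
        )
        self.post_conv = nn.Conv1d(192, 56, kernel_size=1)           # -> 56×1

    def forward(self, x):                      # x: (batch, 1, 320)，即输入信号x(n)
        x = self.pre_pool(torch.relu(self.pre_conv(x)))
        x = self.blocks(x)
        return self.post_conv(x).squeeze(-1)   # (batch, 56)，即特征向量F(n)

f = AnalysisNet()(torch.randn(1, 1, 320))      # -> (1, 56)
```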
再次,量化编码。
对于编码端提取得到的特征向量F(n),可以采用标量量化(即各分量单独量化)和熵编码的方式进行量化编码。当然,也可以采用矢量量化(即相邻多个分量组合成一个矢量进行联合量化)和熵编码的方式进行量化编码,本申请实施例对此不作具体限定。
对特征向量F(n)进行量化编码后,可以生成码流。根据实验,通过6-8kbps码率就可以对16kHz宽带信号实现高质量的压缩。
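上述码率可以粗略估算：每帧56维特征、帧长20ms（每秒50帧），若经量化与熵编码后平均每个分量约占2~3比特，则码率大致为5.6~8.4kbps，与上述6-8kbps的量级一致（仅为示意性的数量级估算，实际比特分配取决于量化与熵编码方案）。

```python
dims_per_frame = 56
frames_per_sec = 1000 // 20                    # 帧长20ms -> 每秒50帧
for bits_per_dim in (2, 3):
    kbps = dims_per_frame * frames_per_sec * bits_per_dim / 1000
    print(bits_per_dim, "bit/分量 ->", kbps, "kbps")   # 5.6 kbps 与 8.4 kbps
```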
(二)关于解码端的流程如下:
首先,解码。
解码是编码的逆过程。对于接收到的码流,进行解码,然后基于解码得到的索引值查询量化表,即可获得特征向量的估计值,记为F′(n)。
其次,调用增强网络提取标签信息向量。
特征向量的估计值F′(n)包含了编码端采集得到的原始语音信号的压缩版本，反映了语音信号的核心成分，同时也包含了在采集时所混有的噪声等声学干扰。因此，增强网络用于从特征向量的估计值F′(n)中提取相关的标签嵌入（embedding）信息，以便在解码时生成相对干净的语音信号。
示例的,参见图13,图13是本申请实施例提供的增强网络的结构示意图,如图13所示,以特征向量的估计值F′(n)为输入量,调用一个一维的因果卷积,生成56×1的张量。接着对于56×1的张量,经过一层LSTM网络,生成56×1的张量。然后,调用一个全连接(FC,Full-Connection)网络,生成56×1的张量。最后,调用激活函数(例如可以是ReLU,当然,也可以是其他激活函数,例如Sigmoid函数、Tanh函数等)进行激活处理,这样,就生成了与特征向量的估计值F′(n)相同维度的标签信息向量,记为E(n)。
再次,调用合成网络进行信号重建。
合成网络的目的，是将解码端获得的特征向量的估计值F′(n)和本地生成的标签信息向量E(n)拼接成一个112维的向量，然后调用合成网络进行信号重建，生成语音信号的估计值，记为x′(n)。需要说明的是，通过拼接方式生成合成网络的输入向量只是其中的一种方式，本申请实施例不限制其他方式，例如可以将F′(n)+E(n)作为输入，维度就是56。针对这种方式，可以参考图14重新设计网络即可，本申请实施例在此不再赘述。
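上述两种组合方式可以用几行代码示意（f_pred、e_vec分别为假设的56维特征向量估计值与标签信息向量）。

```python
import numpy as np

f_pred, e_vec = np.random.randn(56), np.random.randn(56)
z_concat = np.concatenate([f_pred, e_vec])   # 拼接方式：得到112维的输入向量
z_sum = f_pred + e_vec                       # 相加方式：维度仍为56
```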
示例的,参见图14,图14是本申请实施例提供的合成网络的结构示意图,如图14所示,合成网络的结构与分析网络的结构高度类似,例如因果卷积;但输入量的维度增加到了112维。后处理的过程类似于分析网络中的预处理。此外,解码块(又称解码层)的结构与分析网络中的编码块(又称编码层)是对称的,例如分析网络中的编码块是先做空洞卷积再池化完成下采样,而合成网络中的解码块是先进行池化完成上采样,再做空洞卷积。也就是说,解码是编码的逆过程,可以参考图12的描述,本申请实施例在此不再赘述。
本申请实施例中,可以通过采集数据,对编码端和解码端的相关网络(例如分析网络和合成网络)进行联合训练,获得最优参数。目前公开有很多的神经网络和深度学习的开源平台,基于上述开源平台,用户仅需要准备好数据和设置相应的网络结构,在服务器完成训练后,即可将训练好的网络投入使用。本申请上述实施例假定分析网络和合成网络的参数已经训练完毕,仅公开一种特定的网络输入、网络结构和网络输出的实现,相关领域的工程人员可以根据实际情况进一步修改上述配置。
在上述实施例中，对于输入信号，在编解码路径上，分别调用分析网络、增强网络和合成网络，完成低码率压缩和信号重建。但是，这些网络的复杂度较高。为了降低复杂度，本申请实施例可以引入QMF分析滤波器，将输入信号分解成更低采样率的子带信号；接着对于每个子带信号，神经网络的输入和输出维度将至少减半。一般地，神经网络的计算复杂度大致为O(N^3)量级，因此，这种“分治”思想可以有效降低复杂度。
示例的，参见图15，图15是本申请实施例提供的音频编解码方法的流程示意图，如图15所示，对于第n帧的输入信号x(n)，使用QMF分析滤波器分解为2个子带信号，例如输入信号x(n)经过QMF分析滤波器分解之后，可以获得低频子带信号（记为xLB(n)）和高频子带信号（记为xHB(n)）。接着，针对低频子带信号xLB(n)，可以调用第一分析网络，获得低维度的低频子带信号的特征向量，记为FLB(n)。特别地，低频子带信号的特征向量FLB(n)的维度小于低频子带信号xLB(n)的维度，从而减少了数据量。此外，还需要说明的是，由于低频子带信号xLB(n)的分辨率比输入信号x(n)少了一半，因此第一分析网络的参数可以相应减半，包括低频子带信号的特征向量FLB(n)的维度。
在得到低频子带信号的特征向量FLB(n)之后,可以对低频子带信号的特征向量FLB(n)进行矢量量化或者标量量化,并将量化后得到的索引值进行熵编码,得到码流,随后将码流传输到解码端。
解码端在接收到编码端发送的码流后，可以对接收到的码流进行解码，从而获得低频子带信号的特征向量的估计值，记为F′LB(n)。接着，可以基于低频子带信号的特征向量的估计值F′LB(n)，调用第一增强网络，生成与低频子带信号对应的标签信息向量，记为ELB(n)。最后，结合F′LB(n)和ELB(n)，调用对应编码端的逆过程的第一合成网络，完成低频子带信号的估计值（记为x′LB(n)）的重建，并抑制编码端采集到的语音信号中包含的噪声等声学干扰。为了表述方便，下文中将第一增强网络和第一合成网络的功能合并成第一合成模块，即在解码端，基于F′LB(n)和ELB(n)，调用第一合成模块进行信号重建，即可获得低频子带信号的估计值x′LB(n)。
类似的,对于输入信号x(n)经过QMF分析滤波器分解后得到的高频子带信号,记为xHB(n),在编解码处理流程上,分别调用第二分析网络、第二合成模块(包括第二增强网络和第二合成网络),可以在解码端获得高频子带信号的估计值,记为x′HB(n)。需要说明的是,针对高频子带信号xHB(n)的处理流程与低频子带信号xLB(n)的处理流程类似,可以参考低频子带信号xLB(n)的处理流程实现,本申请实施例在此不再赘述。
参考上述2通道QMF的处理实例，以及上文中介绍的多通道QMF，可以通过迭代2通道QMF分解的方式，进一步扩展到如图16所示的多通道QMF的方案，例如可以将输入信号x(n)分解为N个子带信号，并针对每个子带信号分别进行编解码处理，因为原理类似，本申请实施例在此不再赘述。
下面以2通道QMF为例,对本申请实施例提供的音频编解码方法进行说明。
(一)关于编码端的流程如下:
首先,输入信号的生成。
如前所述,对于采样率Fs=16000Hz的语音信号,假设帧长为20ms,则对于第n帧的语音信号,其包括320个样本点,记为输入信号x(n)。
其次,QMF信号分解。
如前所述,针对输入信号x(n),可以调用QMF分析滤波器(这里特指2通道QMF),并进行下采样,可以获得两部分的子带信号,分别为低频子带信号xLB(n)和高频子带信号xHB(n)。其中,低频子带信号xLB(n)的有效带宽是0-4kHz,高频子带信号xHB(n)的有效带宽是4-8kHz,且每帧样本点的数量为160。
再次,调用第一分析网络和第二分析网络进行数据压缩。
示例的,在将输入信号x(n)分解成低频子带信号xLB(n)和高频子带信号xHB(n)之后,针对低频子带信号xLB(n),可以调用如图17所示的第一分析网络进行特征提取处理,得到低频子带信号的特征向量FLB(n);类似的,对于高频子带信号xHB(n),可以调用第二分析网络进行特征提取处理,得到高频子带信号的特征向量,记为FHB(n)。
需要说明的是,由于子带信号的采样率相对于输入信号减半,因此,在本实施例中,输出的子带信号的特征向量的维度可以低于上述实施例中输入信号的特征向量的维度。例如,在本实施例中,低频子带信号的特征向量和高频子带信号的特征向量的维度均可以设置为28。这样,整体输出的特征向量的维度与上述实施例中输入信号的特征向量的维度一致,即两者的码率是一致的。
此外,考虑到低频和高频对语音质量的影响因子不一,本申请实施例也不限制对不同子带信号的特征向量定义不同数量的维度。例如对于低频子带信号的特征向量的维度可以设置为32,而将高频子带信号的特征向量的维度设置为24,这样仍然保证了总维度与输入信号的特征向量的维度一致。针对上述情况,可以通过相应调整第一分析网络和第二分析网络的内部参数量实现,本申请实施例在此不再赘述。
最后,量化编码。
与针对输入信号的特征向量的处理过程类似,考虑总的特征向量的维度不变,通过6-8kbps码率就可以对16kHz宽带信号实现高质量的压缩。
(二)关于解码端的流程如下:
首先,解码。
与上述实施例类似,通过对接收到的码流进行解码,即可获得低频子带信号的特征向量的估计值F′LB(n)和高频子带信号的特征向量的估计值F′HB(n)。
其次,调用第一增强网络和第二增强网络提取标签信息向量。
示例的,在对接收到的码流进行解码,得到低频子带信号的特征向量的估计值F′LB(n)和高频子带信号的特征向量的估计值F′HB(n)之后,针对低频子带信号的特征向量的估计值F′LB(n),可以调用如图18所示的第一增强网络采集用于低频部分语音增强的标签embedding信息(即低频部分的标签信息向量),记为ELB(n),用于在解码时,生成相对干净的低频子带语音信号。上述计算过程,可以参考图13实现,本申请实施例在此不再赘述。此外,由于采样率减半的原因,可以参考编码端中的第一分析网络的输出特征向量的维度,对应调整图18示出的第一增强网络的结构,例如包括第一增强网络的参数量。
类似的,针对解码得到的高频子带信号的特征向量的估计值F′HB(n),可以调用第二增强网络,可以获得高频部分的标签信息向量,记为EHB(n),用于后续流程。
总之,这一步执行后,可以获得两个子带信号的标签信息向量,分别为低频部分的标签信息向量ELB(n)和高频部分的标签信息向量EHB(n)。
再次,调用第一合成网络和第二合成网络进行信号重建。
示例的,参见图19,图19是本申请实施例提供的第一合成网络的结构示意图,如图19所示,可以调用第一合成网络,基于低频子带信号的特征向量的估计值F′LB(n)和本地生成的低频部分的标签信息向量ELB(n),生成低频子带信号的估计值,记为x′LB(n)。具体计算过程可以参考图14的描述,本申请实施例在此不再赘述。此外,由于采样率减半的原因,图19只提供了对应于低频部分的第一合成网络的一个具体配置,高频部分的实现形式类似,在此不再赘述。
经过这一步,生成了低频子带信号的估计值x′LB(n)和高频子带信号的估计值x′HB(n)。特别的,这两个子带信号中的噪声等声学干扰获得了有效抑制。
最后,基于QMF合成滤波器进行合成处理。
基于前两步,在解码端获得低频子带信号的估计值x′LB(n)和高频子带信号的估计值x′HB(n)之后,只需要上采样并调用QMF合成滤波器,就可以生成320点的重建信号,即输入信号x(n)的估计值x′(n),从而完成整个解码的过程。
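解码端的合成处理可以用如下代码示意：对两路子带信号的估计值做零值内插上采样，再分别经合成滤波器滤波后组合得到重建信号（滤波器系数沿用前文示例中由低通原型推导的假设，且未考虑滤波器群延迟对齐等细节）。

```python
import numpy as np
from scipy.signal import lfilter

def qmf_synthesis(x_lb, x_hb, h_low):
    h_high = h_low * (-1) ** np.arange(len(h_low))
    up_lb = np.zeros(2 * len(x_lb)); up_lb[::2] = x_lb   # 零值内插上采样
    up_hb = np.zeros(2 * len(x_hb)); up_hb[::2] = x_hb
    # 按式(2)(3)：G_Low(z)=H_Low(z)，G_High(z)=-H_High(z)，两路滤波后相加
    return 2 * (lfilter(h_low, 1.0, up_lb) - lfilter(h_high, 1.0, up_hb))

# x_rec = qmf_synthesis(x_lb_hat, x_hb_hat, h_low)   # 得到320点的重建信号估计值
```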
综上，本申请实施例通过信号分解等信号处理技术与深度神经网络的有机结合，编码效率相较于传统的信号处理方案显著提升；并且在复杂度可接受的情况下，将语音增强在解码端实施，使得在噪声等声学干扰下，可以用低码率实现重建干净语音的效果。例如参见图20，编码端采集的语音信号中混有大量的噪声干扰，经过本申请实施例提供的语音增强和超低码率压缩的方案，可以在解码端重建一个干净的语音信号，从而提高了语音通话的质量。
下面继续说明本申请实施例提供的音频解码装置565的实施为软件模块的示例性结构,在一些实施例中,如图3所示,存储在存储器560的音频解码装置565中的软件模块可以包括:获取模块5651、解码模块5652、标签提取模块5653、重建模块5654和确定模块5655。
获取模块5651,配置为获取码流,其中,码流是对音频信号进行编码得到的;解码模块5652,配置为对码流进行解码处理,得到音频信号的特征向量的预测值;标签提取模块5653,配置为对特征向量的预测值进行标签提取处理,得到标签信息向量,其中,标签信息向量的维度与特征向量的预测值的维度相同;重建模块5654,配置为基于特征向量的预测值和标签信息向量进行信号重建;确定模块5655,配置为将通过信号重建得到的音频信号的预测值,作为码流的解码结果。
在一些实施例中,解码模块5652,还配置为对码流进行解码处理,得到音频信号的特征向量的索引值;基于索引值查询量化表,得到音频信号的特征向量的预测值。
在一些实施例中,标签提取模块5653,还配置为对特征向量的预测值进行卷积处理,得到与特征向量的预测值相同维度的第一张量;对第一张量进行特征提取处理,得到与第一张量相同维度的第二张量;对第二张量进行全连接处理,得到与第二张量相同维度的第三张量;对第三张量进行激活处理,得到标签信息向量。
在一些实施例中,重建模块5654,还配置为对特征向量的预测值和标签信息向量进行拼接处理,得到拼接向量; 对拼接向量进行第一卷积处理,得到音频信号的卷积特征;对卷积特征进行上采样处理,得到音频信号的上采样特征;对上采样特征进行池化处理,得到音频信号的池化特征;对池化特征进行第二卷积处理,得到音频信号的预测值。
在一些实施例中,上采样处理是通过多个级联的解码层实现的,且不同解码层的采样因子不同;重建模块5654,还配置为通过多个级联的解码层中的第一个解码层,对卷积特征进行上采样处理;将第一个解码层的上采样结果输出到后续级联的解码层,并通过后续级联的解码层继续进行上采样处理和上采样结果输出,直至输出到最后一个解码层;将最后一个解码层输出的上采样结果,作为音频信号的上采样特征。
在一些实施例中,码流包括低频码流和高频码流,其中,低频码流是对音频信号经过分解处理后得到的低频子带信号进行编码得到的,高频码流是对音频信号经过分解处理后得到的高频子带信号进行编码得到的;解码模块5652,还配置为对低频码流进行解码处理,得到低频子带信号的特征向量的预测值;以及配置为对高频码流进行解码处理,得到高频子带信号的特征向量的预测值。
在一些实施例中,标签提取模块5653,还配置为对低频子带信号的特征向量的预测值进行标签提取处理,得到第一标签信息向量,其中,第一标签信息向量的维度与低频子带信号的特征向量的预测值的维度相同;以及配置为对高频子带信号的特征向量的预测值进行标签提取处理,得到第二标签信息向量,其中,第二标签信息向量的维度与高频子带信号的特征向量的预测值的维度相同。
在一些实施例中,标签提取模块5653,还配置为调用第一增强网络执行以下处理:对低频子带信号的特征向量的预测值进行卷积处理,得到与低频子带信号的特征向量的预测值相同维度的第四张量;对第四张量进行特征提取处理,得到与第四张量相同维度的第五张量;对第五张量进行全连接处理,得到与第五张量相同维度的第六张量;对第六张量进行激活处理,得到第一标签信息向量。
在一些实施例中,标签提取模块5653,还配置为调用第二增强网络执行以下处理:对高频子带信号的特征向量的预测值进行卷积处理,得到与高频子带信号的特征向量的预测值相同维度的第七张量;对第七张量进行特征提取处理,得到与第七张量相同维度的第八张量;对第八张量进行全连接处理,得到与第八张量相同维度的第九张量;对第九张量进行激活处理,得到第二标签信息向量。
在一些实施例中,特征向量的预测值包括:低频子带信号的特征向量的预测值,高频子带信号的特征向量的预测值;重建模块5654,还配置为对低频子带信号的特征向量的预测值、以及第一标签信息向量进行拼接处理,得到第一拼接向量;基于第一拼接向量调用第一合成网络进行信号重建,得到低频子带信号的预测值;对高频子带信号的特征向量的预测值、以及第二标签信息向量进行拼接处理,得到第二拼接向量;基于第二拼接向量调用第二合成网络进行信号重建,得到高频子带信号的预测值;对低频子带信号的预测值和高频子带信号的预测值进行合成处理,得到音频信号的预测值。
在一些实施例中,重建模块5654,还配置为调用第一合成网络执行以下处理:对第一拼接向量进行第一卷积处理,得到低频子带信号的卷积特征;对卷积特征进行上采样处理,得到低频子带信号的上采样特征;对上采样特征进行池化处理,得到低频子带信号的池化特征;对池化特征进行第二卷积处理,得到低频子带信号的预测值;其中,上采样处理是通过多个级联的解码层实现的,且不同解码层的采样因子不同。
在一些实施例中,重建模块5654,还配置为调用第二合成网络执行以下处理:对第二拼接向量进行第一卷积处理,得到高频子带信号的卷积特征;对卷积特征进行上采样处理,得到高频子带信号的上采样特征;对上采样特征进行池化处理,得到高频子带信号的池化特征;对池化特征进行第二卷积处理,得到高频子带信号的预测值;其中,上采样处理是通过多个级联的解码层实现的,且不同解码层的采样因子不同。
在一些实施例中,码流包括N个子码流,N个子码流对应不同的频段,且是对音频信号经过分解处理后得到的N个子带信号分别进行编码得到的,N为大于2的整数;解码模块5652,还配置为对N个子码流分别进行解码处理,得到N个子带信号分别对应的特征向量的预测值。
在一些实施例中,标签提取模块5653,还配置为对N个子带信号分别对应的特征向量的预测值分别进行标签提取处理,得到用于信号增强的N个标签信息向量,其中,每个标签信息向量的维度与对应子带信号的特征向量的预测值的维度相同。
在一些实施例中,标签提取模块5653,还配置为基于第i子带信号的特征向量的预测值,调用第i增强网络进行标签提取处理,得到第i标签信息向量;其中,i的取值范围满足1≤i≤N,且第i标签信息向量的维度与第i子带信号的特征向量的预测值的维度相同。
在一些实施例中,标签提取模块5653,还配置为调用第i增强网络执行以下处理:对第i子带信号的特征向量的预测值进行卷积处理,得到与第i子带信号的特征向量的预测值相同维度的第十张量;对第十张量进行特征提取处理,得到与第十张量相同维度的第十一张量;对第十一张量进行全连接处理,得到与第十一张量相同维度的第十二张量;对第十二张量进行激活处理,得到第i标签信息向量。
在一些实施例中,重建模块5654,还配置为对N个子带信号分别对应的特征向量的预测值、以及N个标签信息向量进行一一对应的拼接处理,得到N个拼接向量;基于第j拼接向量调用第j合成网络进行信号重建,得到第j子带信号的预测值;其中,j的取值范围满足1≤j≤N;对N个子带信号分别对应的预测值进行合成处理,得到音频信号的预测值。
在一些实施例中,重建模块5654,还配置为调用第j合成网络执行以下处理:对第j拼接向量进行第一卷积处理,得到第j子带信号的卷积特征;对卷积特征进行上采样处理,得到第j子带信号的上采样特征;对上采样特征进行池化处理,得到第j子带信号的池化特征;对池化特征进行第二卷积处理,得到第j子带信号的预测值;其中,上采样处理是通过多个级联的解码层实现的,且不同解码层的采样因子不同。
需要说明的是，本申请实施例装置的描述，与上述方法实施例的描述是类似的，具有同方法实施例相似的有益效果，因此不做赘述。对于本申请实施例提供的音频解码装置中未尽的技术细节，可以根据图4C、图6A、或图6B中的任一附图的说明而理解。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行本申请实施例上述的音频编解码方法。
本申请实施例提供一种存储有可执行指令的计算机可读存储介质,其中存储有计算机可执行指令,当计算机可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的音频编解码方法,例如,如图4C示出的音频编解码方法。
在一些实施例中,计算机可读存储介质可以是FRAM、ROM、PROM、EPROM、EEPROM、闪存、磁表面存储器、光盘、或CD-ROM等存储器;也可以是包括上述存储器之一或任意组合的各种设备。
在一些实施例中,计算机可执行指令可以采用程序、软件、软件模块、脚本或代码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。
作为示例,可执行指令可以但不一定对应于文件系统中的文件,可以可被存储在保存其它程序或数据的文件的一部分,例如,存储在超文本标记语言(HTML,Hyper Text Markup Language)文档中的一个或多个脚本中,存储在专用于所讨论的程序的单个文件中,或者,存储在多个协同文件(例如,存储一个或多个模块、子程序或代码部分的文件)中。
作为示例,可执行指令可被部署为在一个电子设备上执行,或者在位于一个地点的多个电子设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个电子设备上执行。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在本申请的保护范围之内。

Claims (25)

  1. 一种音频解码方法,由电子设备执行,所述方法包括:
    获取码流,其中,所述码流是对音频信号进行编码得到的;
    对所述码流进行解码处理,得到所述音频信号的特征向量的预测值;
    对所述特征向量的预测值进行标签提取处理,得到标签信息向量,其中,所述标签信息向量的维度与所述特征向量的预测值的维度相同;
    基于所述特征向量的预测值和所述标签信息向量进行信号重建;
    将通过所述信号重建得到的所述音频信号的预测值,作为所述码流的解码结果。
  2. 根据权利要求1所述的方法,其中,所述对所述码流进行解码处理,得到所述音频信号的特征向量的预测值,包括:
    对所述码流进行解码处理,得到所述音频信号的特征向量的索引值;
    基于所述索引值查询量化表,得到所述音频信号的特征向量的预测值。
  3. 根据权利要求1所述的方法,其中,所述对所述特征向量的预测值进行标签提取处理,得到标签信息向量,包括:
    对所述特征向量的预测值进行卷积处理,得到与所述特征向量的预测值相同维度的第一张量;
    对所述第一张量进行特征提取处理,得到与所述第一张量相同维度的第二张量;
    对所述第二张量进行全连接处理,得到与所述第二张量相同维度的第三张量;
    对所述第三张量进行激活处理,得到标签信息向量。
  4. 根据权利要求1至3任一项所述的方法,其中,所述基于所述特征向量的预测值和所述标签信息向量进行信号重建,包括:
    对所述特征向量的预测值和所述标签信息向量进行拼接处理,得到拼接向量;
    对所述拼接向量进行压缩处理,得到所述音频信号的预测值。
  5. 根据权利要求4所述的方法,其中,所述对所述拼接向量进行压缩处理,得到所述音频信号的预测值,包括:
    对所述拼接向量进行第一卷积处理,得到所述音频信号的卷积特征;
    对所述卷积特征进行上采样处理,得到所述音频信号的上采样特征;
    对所述上采样特征进行池化处理,得到所述音频信号的池化特征;
    对所述池化特征进行第二卷积处理,得到所述音频信号的预测值。
  6. 根据权利要求5所述的方法,其中,
    所述上采样处理是通过多个级联的解码层实现的,且不同解码层的采样因子不同;
    所述对所述卷积特征进行上采样处理,得到所述音频信号的上采样特征,包括:
    通过所述多个级联的解码层中的第一个解码层,对所述卷积特征进行上采样处理;
    将所述第一个解码层的上采样结果输出到后续级联的解码层,并通过所述后续级联的解码层继续进行上采样处理和上采样结果输出,直至输出到最后一个解码层;
    将所述最后一个解码层输出的上采样结果,作为所述音频信号的上采样特征。
  7. 根据权利要求1至6任一项所述的方法,其中,
    所述码流包括低频码流和高频码流,其中,所述低频码流是对所述音频信号经过分解处理后得到的低频子带信号进行编码得到的,所述高频码流是对所述音频信号经过分解处理后得到的高频子带信号进行编码得到的;
    所述对所述码流进行解码处理,得到所述音频信号的特征向量的预测值,包括:
    对所述低频码流进行解码处理,得到所述低频子带信号的特征向量的预测值;
    对所述高频码流进行解码处理,得到所述高频子带信号的特征向量的预测值。
  8. 根据权利要求7所述的方法,其中,所述对所述特征向量的预测值进行标签提取处理,得到标签信息向量,包括:
    对所述低频子带信号的特征向量的预测值进行标签提取处理,得到第一标签信息向量,其中,所述第一标签信息向量的维度与所述低频子带信号的特征向量的预测值的维度相同;
    对所述高频子带信号的特征向量的预测值进行标签提取处理,得到第二标签信息向量,其中,所述第二标签信息向量的维度与所述高频子带信号的特征向量的预测值的维度相同。
  9. 根据权利要求7所述的方法,其中,所述对所述低频子带信号的特征向量的预测值进行标签提取处理,得到第一标签信息向量,包括:
    调用第一增强网络执行以下处理:
    对所述低频子带信号的特征向量的预测值进行卷积处理,得到与所述低频子带信号的特征向量的预测值相同维度的第四张量;
    对所述第四张量进行特征提取处理,得到与所述第四张量相同维度的第五张量;
    对所述第五张量进行全连接处理,得到与所述第五张量相同维度的第六张量;
    对所述第六张量进行激活处理,得到第一标签信息向量。
  10. 根据权利要求7所述的方法,其中,所述对所述高频子带信号的特征向量的预测值进行标签提取处理,得到第二标签信息向量,包括:
    调用第二增强网络执行以下处理:
    对所述高频子带信号的特征向量的预测值进行卷积处理,得到与所述高频子带信号的特征向量的预测值相同维度的第七张量;
    对所述第七张量进行特征提取处理,得到与所述第七张量相同维度的第八张量;
    对所述第八张量进行全连接处理,得到与所述第八张量相同维度的第九张量;
    对所述第九张量进行激活处理,得到第二标签信息向量。
  11. 根据权利要求8至10任一项所述的方法,其中,
    所述特征向量的预测值包括:所述低频子带信号的特征向量的预测值,所述高频子带信号的特征向量的预测值;
    所述基于所述特征向量的预测值和所述标签信息向量进行信号重建,包括:
    对所述低频子带信号的特征向量的预测值、以及所述第一标签信息向量进行拼接处理,得到第一拼接向量;
    基于所述第一拼接向量调用第一合成网络进行信号重建,得到所述低频子带信号的预测值;
    对所述高频子带信号的特征向量的预测值、以及所述第二标签信息向量进行拼接处理,得到第二拼接向量;
    基于所述第二拼接向量调用第二合成网络进行信号重建,得到所述高频子带信号的预测值;
    对所述低频子带信号的预测值和所述高频子带信号的预测值进行合成处理,得到所述音频信号的预测值。
  12. 根据权利要求11所述的方法,其中,所述基于所述第一拼接向量调用第一合成网络进行信号重建,得到所述低频子带信号的预测值,包括:
    调用所述第一合成网络执行以下处理:
    对所述第一拼接向量进行第一卷积处理,得到所述低频子带信号的卷积特征;
    对所述卷积特征进行上采样处理,得到所述低频子带信号的上采样特征;
    对所述上采样特征进行池化处理,得到所述低频子带信号的池化特征;
    对所述池化特征进行第二卷积处理,得到所述低频子带信号的预测值;
    其中,所述上采样处理是通过多个级联的解码层实现的,且不同解码层的采样因子不同。
  13. 根据权利要求11所述的方法,其中,所述基于所述第二拼接向量调用第二合成网络进行信号重建,得到所述高频子带信号的预测值,包括:
    调用所述第二合成网络执行以下处理:
    对所述第二拼接向量进行第一卷积处理,得到所述高频子带信号的卷积特征;
    对所述卷积特征进行上采样处理,得到所述高频子带信号的上采样特征;
    对所述上采样特征进行池化处理,得到所述高频子带信号的池化特征;
    对所述池化特征进行第二卷积处理,得到所述高频子带信号的预测值;
    其中,所述上采样处理是通过多个级联的解码层实现的,且不同解码层的采样因子不同。
  14. 根据权利要求1至13任一项所述的方法,其中,
    所述码流包括N个子码流,所述N个子码流对应不同的频段,且是对所述音频信号经过分解处理后得到的N个子带信号分别进行编码得到的,N为大于2的整数;
    所述对所述码流进行解码处理,得到所述音频信号的特征向量的预测值,包括:
    对所述N个子码流分别进行解码处理,得到所述N个子带信号分别对应的特征向量的预测值。
  15. 根据权利要求14所述的方法,其中,所述对所述特征向量的预测值进行标签提取处理,得到标签信息向量,包括:
    对所述N个子带信号分别对应的特征向量的预测值分别进行标签提取处理,得到N个标签信息向量,其中,每个所述标签信息向量的维度与对应子带信号的特征向量的预测值的维度相同。
  16. 根据权利要求15所述的方法,其中,所述对所述N个子带信号分别对应的特征向量的预测值分别进行标签提取处理,得到N个标签信息向量,包括:
    基于第i子带信号的特征向量的预测值,调用第i增强网络进行标签提取处理,得到第i标签信息向量;
    其中,i的取值范围满足1≤i≤N,且所述第i标签信息向量的维度与所述第i子带信号的特征向量的预测值的维度相同。
  17. 根据权利要求16所述的方法,其中,所述基于第i子带信号的特征向量的预测值,调用第i增强网络进行标签提取处理,得到第i标签信息向量,包括:
    调用所述第i增强网络执行以下处理:
    对所述第i子带信号的特征向量的预测值进行卷积处理,得到与所述第i子带信号的特征向量的预测值相同维度的第十张量;
    对所述第十张量进行特征提取处理,得到与所述第十张量相同维度的第十一张量;
    对所述第十一张量进行全连接处理,得到与所述第十一张量相同维度的第十二张量;
    对所述第十二张量进行激活处理,得到第i标签信息向量。
  18. 根据权利要求15至17任一项所述的方法,其中,所述基于所述特征向量的预测值和所述标签信息向量进行信号重建,包括:
    对所述N个子带信号分别对应的特征向量的预测值、以及所述N个标签信息向量进行一一对应的拼接处理,得到N个拼接向量;
    基于第j拼接向量调用第j合成网络进行信号重建,得到第j子带信号的预测值;其中,j的取值范围满足1≤j≤N;
    对所述N个子带信号分别对应的预测值进行合成处理,得到所述音频信号的预测值。
  19. 根据权利要求18所述的方法,其中,所述基于第j拼接向量调用第j合成网络进行信号重建,得到第j子带信号的预测值,包括:
    调用所述第j合成网络执行以下处理:
    对所述第j拼接向量进行第一卷积处理,得到所述第j子带信号的卷积特征;
    对所述卷积特征进行上采样处理,得到所述第j子带信号的上采样特征;
    对所述上采样特征进行池化处理,得到所述第j子带信号的池化特征;
    对所述池化特征进行第二卷积处理,得到所述第j子带信号的预测值;
    其中,所述上采样处理是通过多个级联的解码层实现的,且不同解码层的采样因子不同。
  20. 一种音频编码方法,所述方法包括:
    获取音频信号;
    对所述音频信号进行编码处理,得到码流,其中,所述码流用于供电子设备执行如权利要求1至19任一项所述的音频解码方法。
  21. 一种音频解码装置,所述装置包括:
    获取模块,配置为获取码流,其中,所述码流是对音频信号进行编码得到的;
    解码模块,配置为对所述码流进行解码处理,得到所述音频信号的特征向量的预测值;
    标签提取模块,配置为对所述特征向量的预测值进行标签提取处理,得到标签信息向量,其中,所述标签信息向量的维度与所述特征向量的预测值的维度相同;
    重建模块,配置为基于所述特征向量的预测值和所述标签信息向量进行信号重建;
    确定模块,配置为将通过所述信号重建得到的所述音频信号的预测值,作为所述码流的解码结果。
  22. 一种音频编码装置,所述装置包括:
    获取模块,配置为获取音频信号;
    编码模块,配置为对所述音频信号进行编码处理,得到码流,其中,所述码流用于供电子设备执行如权利要求1至19任一项所述的音频解码方法。
  23. 一种电子设备,包括:
    存储器,用于存储计算机可执行指令;
    处理器,用于执行所述存储器中存储的计算机可执行指令时,实现权利要求1至19任一项所述的音频解码方法、或权利要求20所述的音频编码方法。
  24. 一种计算机可读存储介质,存储有计算机可执行指令,所述计算机可执行指令被处理器执行时,实现权利要求1至19任一项所述的音频解码方法、或权利要求20所述的音频编码方法。
  25. 一种计算机程序产品,包括计算机程序或计算机可执行指令,所述计算机程序或计算机可执行指令被处理器执行时,实现权利要求1至19任一项所述的音频解码方法、或权利要求20所述的音频编码方法。
PCT/CN2023/092246 2022-06-15 2023-05-05 音频编解码方法、装置、电子设备、计算机可读存储介质及计算机程序产品 WO2023241254A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210676984.X 2022-06-15
CN202210676984.XA CN115116451A (zh) 2022-06-15 2022-06-15 音频解码、编码方法、装置、电子设备及存储介质

Publications (2)

Publication Number Publication Date
WO2023241254A1 WO2023241254A1 (zh) 2023-12-21
WO2023241254A9 true WO2023241254A9 (zh) 2024-04-18

Family

ID=83328395

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092246 WO2023241254A1 (zh) 2022-06-15 2023-05-05 音频编解码方法、装置、电子设备、计算机可读存储介质及计算机程序产品

Country Status (2)

Country Link
CN (1) CN115116451A (zh)
WO (1) WO2023241254A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116451A (zh) * 2022-06-15 2022-09-27 腾讯科技(深圳)有限公司 音频解码、编码方法、装置、电子设备及存储介质
CN117965214B (zh) * 2024-04-01 2024-06-18 新疆凯龙清洁能源股份有限公司 一种天然气脱二氧化碳制合成气的方法和系统

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101202043B (zh) * 2007-12-28 2011-06-15 清华大学 音频信号的编码方法和装置与解码方法和装置
CN101572586B (zh) * 2008-04-30 2012-09-19 北京工业大学 编解码方法、装置及系统
EP2887350B1 (en) * 2013-12-19 2016-10-05 Dolby Laboratories Licensing Corporation Adaptive quantization noise filtering of decoded audio data
CN105374359B (zh) * 2014-08-29 2019-05-17 中国电信股份有限公司 语音数据的编码方法和系统
CN110009013B (zh) * 2019-03-21 2021-04-27 腾讯科技(深圳)有限公司 编码器训练及表征信息提取方法和装置
CN110689876B (zh) * 2019-10-14 2022-04-12 腾讯科技(深圳)有限公司 语音识别方法、装置、电子设备及存储介质
KR102594160B1 (ko) * 2019-11-29 2023-10-26 한국전자통신연구원 필터뱅크를 이용한 오디오 신호 부호화/복호화 장치 및 방법
CN113140225A (zh) * 2020-01-20 2021-07-20 腾讯科技(深圳)有限公司 语音信号处理方法、装置、电子设备及存储介质
CN113470667A (zh) * 2020-03-11 2021-10-01 腾讯科技(深圳)有限公司 语音信号的编解码方法、装置、电子设备及存储介质
KR102501773B1 (ko) * 2020-08-28 2023-02-21 주식회사 딥브레인에이아이 랜드마크를 함께 생성하는 발화 동영상 생성 장치 및 방법
CN113035211B (zh) * 2021-03-11 2021-11-16 马上消费金融股份有限公司 音频压缩方法、音频解压缩方法及装置
CN113823298B (zh) * 2021-06-15 2024-04-16 腾讯科技(深圳)有限公司 语音数据处理方法、装置、计算机设备及存储介质
CN113488063B (zh) * 2021-07-02 2023-12-19 国网江苏省电力有限公司电力科学研究院 一种基于混合特征及编码解码的音频分离方法
CN113990347A (zh) * 2021-10-25 2022-01-28 腾讯音乐娱乐科技(深圳)有限公司 一种信号处理方法、计算机设备及存储介质
CN114550732B (zh) * 2022-04-15 2022-07-08 腾讯科技(深圳)有限公司 一种高频音频信号的编解码方法和相关装置
CN115116451A (zh) * 2022-06-15 2022-09-27 腾讯科技(深圳)有限公司 音频解码、编码方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
WO2023241254A1 (zh) 2023-12-21
CN115116451A (zh) 2022-09-27

Similar Documents

Publication Publication Date Title
WO2023241254A9 (zh) 音频编解码方法、装置、电子设备、计算机可读存储介质及计算机程序产品
JP4850837B2 (ja) 異なるサブバンド領域同士の間の通過によるデータ処理方法
JP4374233B2 (ja) 複数因子分解可逆変換(multiplefactorizationreversibletransform)を用いたプログレッシブ・ツー・ロスレス埋込みオーディオ・コーダ(ProgressivetoLosslessEmbeddedAudioCoder:PLEAC)
US20220180881A1 (en) Speech signal encoding and decoding methods and apparatuses, electronic device, and storage medium
CN103187065B (zh) 音频数据的处理方法、装置和系统
RU2408089C9 (ru) Декодирование кодированных с предсказанием данных с использованием адаптации буфера
JP2001202097A (ja) 符号化二進オーディオ処理方法
WO2023241193A1 (zh) 音频编码方法、装置、电子设备、存储介质及程序产品
KR20150032614A (ko) 오디오 부호화방법 및 장치, 오디오 복호화방법 및 장치, 및 이를 채용하는 멀티미디어 기기
CN101223598B (zh) 基于虚拟源位置信息的通道等级差量化和解量化方法
WO2023241240A1 (zh) 音频处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品
WO2023241222A9 (zh) 音频处理方法、装置、设备、存储介质及计算机程序产品
WO2023241205A1 (zh) 音频处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品
WO2021244418A1 (zh) 一种音频编码方法和音频编码装置
Geiger et al. ISO/IEC MPEG-4 high-definition scalable advanced audio coding
JPWO2008066071A1 (ja) 復号化装置および復号化方法
Bhatt et al. A novel approach for artificial bandwidth extension of speech signals by LPC technique over proposed GSM FR NB coder using high band feature extraction and various extension of excitation methods
JP3487250B2 (ja) 符号化音声信号形式変換装置
CN113314132A (zh) 一种应用于交互式音频系统中的音频对象编码方法、解码方法及装置
CN115116457A (zh) 音频编码及解码方法、装置、设备、介质及程序产品
CN117219095A (zh) 音频编码方法、音频解码方法、装置、设备及存储介质
CN117834596A (zh) 音频处理方法、装置、设备、存储介质及计算机程序产品
CN117476024A (zh) 音频编码方法、音频解码方法、装置、可读存储介质
WO2022252957A1 (zh) 音频数据编解码方法和相关装置及计算机可读存储介质
US20130197919A1 (en) "method and device for determining a number of bits for encoding an audio signal"

Legal Events

Date Code Title Description
121: Ep - the epo has been informed by wipo that ep was designated in this application (Ref document number: 23822825; Country of ref document: EP; Kind code of ref document: A1)
WWE: Wipo information - entry into national phase (Ref document number: 2023822825; Country of ref document: EP)
ENP: Entry into the national phase (Ref document number: 2023822825; Country of ref document: EP; Effective date: 20240326)