CN115881140A - Encoding and decoding method, device, equipment, storage medium and computer program product - Google Patents


Info

Publication number
CN115881140A
CN115881140A (application CN202111155384.0A)
Authority
CN
China
Prior art keywords
signal
scheme
current frame
decoding
encoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111155384.0A
Other languages
Chinese (zh)
Inventor
刘帅
高原
王宾
王喆
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202111155384.0A priority Critical patent/CN115881140A/en
Priority to TW111135552A priority patent/TW202333139A/en
Priority to PCT/CN2022/120495 priority patent/WO2023051368A1/en
Publication of CN115881140A publication Critical patent/CN115881140A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L19/16: Vocoder architecture
    • G10L19/18: Vocoders using multiple modes
    • G10L19/22: Mode decision, i.e. based on audio signal content versus external parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Stereophonic System (AREA)

Abstract

The embodiments of this application disclose an encoding and decoding method, apparatus, device, storage medium, and computer program product, belonging to the technical field of audio processing. The method combines a codec scheme based on virtual speaker selection with a codec scheme based on directional audio coding to encode and decode the HOA signals of audio frames; that is, a suitable codec scheme is selected for each audio frame, which improves the compression rate of the audio signal. Meanwhile, to make the auditory quality transition smoothly when switching between the different codec schemes, some audio frames are encoded and decoded with a new codec scheme rather than with either of the two schemes directly: the signals of specified channels in the HOA signals of those frames are encoded into the code stream. By adopting this compromise scheme, the auditory quality of the HOA signal restored by decoding transitions smoothly after rendering and playback.

Description

Encoding and decoding method, device, equipment, storage medium and computer program product
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to a method, an apparatus, a device, a storage medium, and a computer program product for encoding and decoding.
Background
Higher-order ambisonics (HOA) technology has attracted much attention as a three-dimensional audio technology because of its high flexibility in three-dimensional audio playback. To achieve a good listening effect, the HOA technique requires a large amount of data to record detailed sound scene information. However, as the HOA order increases, even more data is generated, and the large amount of data makes transmission and storage difficult. Therefore, how to efficiently encode and decode HOA signals has become an important problem.
The related art proposes two schemes for encoding and decoding HOA signals. The first is a codec scheme based on directional audio coding (DirAC). In this scheme, the encoding end extracts a core-layer signal and spatial parameters from the HOA signal of the current frame and encodes them into a code stream, and the decoding end reconstructs the HOA signal of the current frame from the code stream using a decoding method symmetric to the encoding. The second is a codec scheme based on virtual speaker selection. In this scheme, the encoding end selects, from a virtual speaker set, a target virtual speaker matching the HOA signal of the current frame using a matching-projection (MP) algorithm, determines a virtual speaker signal based on the HOA signal of the current frame and the target virtual speaker, determines a residual signal based on the HOA signal of the current frame and the virtual speaker signal, and encodes the virtual speaker signal and the residual signal into a code stream. The decoding end again reconstructs the HOA signal of the current frame from the code stream using a decoding method symmetric to the encoding.
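As an illustrative sketch (not the patent's actual algorithm), the matching-projection step of the second scheme can be modelled as picking the candidate virtual speaker whose coefficient vector best aligns with the frame, projecting the HOA frame onto it to obtain the virtual speaker signal, and keeping the remainder as the residual. The function name, the dictionary representation, and the alignment score below are all assumptions:

```python
def match_projection(hoa_frame, speaker_dict):
    """Toy matching-projection (MP) step.

    hoa_frame: list of channel lists (channels x samples).
    speaker_dict: list of unit-norm coefficient vectors, one per candidate
    virtual speaker (an assumed representation).
    """
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    # per-channel energy proxy: sum of samples in each channel
    channel_sums = [sum(ch) for ch in hoa_frame]
    # pick the candidate whose coefficients best align with the frame
    scores = [abs(dot(coeffs, channel_sums)) for coeffs in speaker_dict]
    best = scores.index(max(scores))
    coeffs = speaker_dict[best]
    n_samples = len(hoa_frame[0])
    # project the frame onto the chosen speaker to get its signal
    speaker_sig = [dot(coeffs, [ch[t] for ch in hoa_frame])
                   for t in range(n_samples)]
    # residual: the part the single-speaker projection cannot explain
    residual = [[hoa_frame[c][t] - coeffs[c] * speaker_sig[t]
                 for t in range(n_samples)]
                for c in range(len(hoa_frame))]
    return best, speaker_sig, residual
```

When the frame is exactly one speaker's contribution, the residual vanishes, which is the ideal case for this scheme (few distinct sources).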
However, the codec scheme based on virtual speaker selection achieves a higher compression rate when the sound field contains few distinct sound sources, whereas the DirAC-based codec scheme achieves a higher compression rate when the sound field contains many distinct sound sources. Here, distinct sound sources are point sources that differ in position and/or direction. Because the sound field type of an audio frame (related to the distinct sound sources in the sound field) may vary from frame to frame, achieving a high compression rate for audio frames of different sound field types requires selecting a suitable codec scheme for each audio frame according to its sound field type, and therefore requires switching between the codec schemes. However, the auditory quality, after rendering and playback, of HOA signals reconstructed by different codec schemes differs; how to ensure a smooth transition in auditory quality when switching between different codec schemes is therefore a problem that needs to be considered.
Disclosure of Invention
The embodiments of this application provide an encoding and decoding method, apparatus, device, storage medium, and computer program product, which ensure a smooth transition in auditory quality when switching between different codec schemes. The technical solution is as follows:
in a first aspect, an encoding method is provided, where the method includes:
determining an encoding scheme of the current frame according to the HOA signal of the current frame, where the encoding scheme of the current frame is one of a first encoding scheme, a second encoding scheme, and a third encoding scheme; the first encoding scheme is an HOA encoding scheme based on directional audio coding (i.e., a DirAC encoding scheme), the second encoding scheme is an HOA encoding scheme based on virtual speaker selection (referred to for short as an MP-based HOA encoding scheme), and the third encoding scheme is a hybrid encoding scheme; and if the encoding scheme of the current frame is the third encoding scheme, encoding signals of specified channels in the HOA signal into a code stream, where the specified channels are a subset of all channels of the HOA signal. The third scheme is called a hybrid encoding scheme because, during encoding, it uses technical means of both the first encoding scheme (the DirAC encoding scheme) and the second encoding scheme (the MP-based HOA encoding scheme).
In the embodiments of this application, a suitable codec scheme is selected for each audio frame, which improves the compression rate of the audio signal. Meanwhile, some audio frames are encoded and decoded with a new codec scheme rather than with either the first or the second encoding scheme directly; that is, signals of specified channels in the HOA signals of those frames are encoded into the code stream. By adopting this compromise scheme, the auditory quality of the HOA signal restored by decoding transitions smoothly after rendering and playback.
Optionally, the signals of the specified channels include a first-order ambisonics (FOA) signal, and the FOA signal includes an omnidirectional W signal and directional X, Y, and Z signals.
Optionally, encoding a signal of a specified channel in the HOA signal into a code stream, including: determining a virtual speaker signal and a residual signal based on the W signal, the X signal, the Y signal, and the Z signal; and coding the virtual loudspeaker signal and the residual signal into a code stream.
Optionally, determining the virtual speaker signal and the residual signal based on the W signal, the X signal, the Y signal, and the Z signal includes: determining the W signal as one virtual speaker signal; and determining three residual signals based on the W signal, the X signal, the Y signal, and the Z signal, or determining the X signal, the Y signal, and the Z signal as the three residual signals. In the former case, the difference signals between the X, Y, and Z signals and the W signal, respectively, may be determined as the three residual signals.
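A minimal sketch of the two residual variants described above, assuming each channel is a per-sample list; the function and parameter names (`split_foa`, `use_difference`) are illustrative, not from the patent:

```python
def split_foa(w, x, y, z, use_difference=True):
    """Split a FOA frame into one virtual speaker signal and three residuals,
    per the two variants described above."""
    speaker = list(w)  # the W signal is taken as the one virtual speaker signal
    if use_difference:
        # residuals are the differences between X/Y/Z and W, sample by sample
        residuals = [[a - b for a, b in zip(ch, w)] for ch in (x, y, z)]
    else:
        # residuals are the directional channels themselves
        residuals = [list(x), list(y), list(z)]
    return speaker, residuals
```

Either variant yields one virtual speaker signal plus three residual signals, which the later steps then pack into stereo or mono streams.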
Optionally, encoding the virtual speaker signal and the residual signal into a code stream, including: combining the virtual loudspeaker signal with a first preset single-channel signal to obtain a stereo signal; combining the three residual signals with a second preset single-channel signal to obtain two paths of stereo signals; and respectively coding the obtained three paths of stereo signals into code streams through a stereo encoder.
Optionally, combining the three residual signals with a second preset mono signal to obtain a binaural signal, including: combining two paths of residual signals with highest correlation in the three paths of residual signals to obtain one path of stereo signal in the two paths of stereo signals; and combining one path of residual signal except the two paths of residual signals with the highest correlation in the three paths of residual signals with a second path of preset single-channel signal to obtain the other path of stereo signal in the two paths of stereo signals.
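The correlation-based pairing above can be sketched as follows, assuming Pearson correlation over time-domain samples (the patent does not fix the correlation measure, so this choice and the function names are assumptions):

```python
from itertools import combinations

def pair_residuals(residuals, preset):
    """Group three residual signals into two stereo signals: the two most
    correlated residuals form one pair, and the remaining residual is paired
    with the preset mono signal (e.g. an all-zero signal)."""
    def pearson(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
        va = sum((p - ma) ** 2 for p in a)
        vb = sum((q - mb) ** 2 for q in b)
        return cov / ((va * vb) ** 0.5) if va and vb else 0.0

    # pick the pair of residual channels with the highest |correlation|
    i, j = max(combinations(range(3), 2),
               key=lambda ij: abs(pearson(residuals[ij[0]], residuals[ij[1]])))
    k = ({0, 1, 2} - {i, j}).pop()
    stereo_a = (residuals[i], residuals[j])   # most-correlated pair
    stereo_b = (residuals[k], preset)         # leftover + preset mono signal
    return stereo_a, stereo_b
```

Pairing correlated channels lets a stereo encoder exploit their inter-channel redundancy, which is presumably the motivation for this grouping.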
Optionally, the first preset single-channel signal is an all-zero signal or an all-one signal, where the all-zero signal is a signal whose sample values are all zero or whose frequency-bin values are all zero, and the all-one signal is a signal whose sample values are all one or whose frequency-bin values are all one; the second preset single-channel signal is likewise an all-zero signal or an all-one signal; and the first preset single-channel signal may be the same as or different from the second preset single-channel signal.
Optionally, encoding the virtual speaker signal and the residual signal into a code stream includes: encoding the virtual speaker signal and each of the three residual signals into the code stream through a mono encoder.
Optionally, after determining the encoding scheme of the current frame according to the HOA signal of the current frame, the method further includes: if the coding scheme of the current frame is the first coding scheme, coding the HOA signal into a code stream according to the first coding scheme; and if the coding scheme of the current frame is the second coding scheme, coding the HOA signal into a code stream according to the second coding scheme.
Optionally, determining an encoding scheme of the current frame according to the higher-order ambisonic HOA signal of the current frame includes: determining an initial coding scheme of the current frame according to the HOA signal, wherein the initial coding scheme is a first coding scheme or a second coding scheme; if the initial coding scheme of the current frame is the same as the initial coding scheme of the previous frame of the current frame, determining that the coding scheme of the current frame is the initial coding scheme of the current frame; and if the initial coding scheme of the current frame is the first coding scheme and the initial coding scheme of the previous frame of the current frame is the second coding scheme, or the initial coding scheme of the current frame is the second coding scheme and the initial coding scheme of the previous frame of the current frame is the first coding scheme, determining that the coding scheme of the current frame is the third coding scheme.
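The frame-level switching rule above reduces to a small decision function; the scheme labels below are illustrative names, not values from the patent:

```python
FIRST, SECOND, THIRD = "DirAC", "MP", "hybrid"  # illustrative scheme labels

def select_scheme(initial_current, initial_previous):
    """Apply the switching rule: keep the initial scheme when it does not
    change between frames; use the hybrid (third) scheme on a switch frame."""
    if initial_current == initial_previous:
        return initial_current
    return THIRD
```

A frame whose initial scheme differs from the previous frame's is thus always a switch frame, regardless of the switch direction.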
Optionally, after determining the initial coding scheme of the current frame according to the HOA signal, the method further includes: and coding the indication information of the initial coding scheme of the current frame into a code stream.
Optionally, after determining the encoding scheme of the current frame according to the higher-order ambisonics HOA signal of the current frame, the method further includes: determining the value of a switching flag of the current frame, where the value of the switching flag of the current frame is a first value when the encoding scheme of the current frame is the first encoding scheme or the second encoding scheme, and is a second value when the encoding scheme of the current frame is the third encoding scheme; and encoding the value of the switching flag into the code stream. That is, the switching flag indicates whether the current frame is a switch frame.
Optionally, after determining the encoding scheme of the current frame according to the HOA signal of the current frame, the method further includes: and coding the indication information of the coding scheme of the current frame into a code stream.
Optionally, the designated channel is consistent with a transmission channel preset in the first coding scheme. This ensures that the perceptual quality of the switching frame is similar to the perceptual quality of the audio frame encoded using the first encoding scheme.
In a second aspect, a decoding method is provided, which includes:
obtaining a decoding scheme of a current frame based on a code stream, where the decoding scheme of the current frame is one of a first decoding scheme, a second decoding scheme, and a third decoding scheme; the first decoding scheme is a higher-order ambisonics (HOA) decoding scheme based on directional audio coding (i.e., a DirAC decoding scheme), the second decoding scheme is an HOA decoding scheme based on virtual speaker selection (an MP-based HOA decoding scheme), and the third decoding scheme is a hybrid decoding scheme; if the decoding scheme of the current frame is the third decoding scheme, determining signals of specified channels in the HOA signal of the current frame based on the code stream, where the specified channels are a subset of all channels of the HOA signal; determining, based on the signals of the specified channels, gains of one or more remaining channels of the HOA signal other than the specified channels; determining a signal of each of the one or more remaining channels based on the signals of the specified channels and the gains of the one or more remaining channels; and obtaining a reconstructed HOA signal of the current frame based on the signals of the specified channels and the signals of the one or more remaining channels. The third scheme is called a hybrid decoding scheme because, during decoding, it uses technical means of both the first decoding scheme (the DirAC decoding scheme) and the second decoding scheme (the MP-based HOA decoding scheme).
In the embodiments of this application, when the encoding end encodes the HOA signal of the current frame with the third encoding scheme, the signals of the specified channels are encoded into the code stream; the decoding end then parses the signals of the specified channels from the code stream and reconstructs the signals of the remaining channels from them, thereby reconstructing the HOA signal. That is, a compromise scheme is adopted, so that the auditory quality of the HOA signal restored by decoding transitions smoothly after rendering and playback.
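A toy sketch of the decoder-side reconstruction for a hybrid-coded frame, under the simplifying assumption that each remaining channel is a single gain applied to the first specified channel (the patent leaves the exact gain mapping open, so this one-to-many mapping and the names are assumptions):

```python
def reconstruct_remaining(specified, gains):
    """Rebuild the remaining HOA channels from the decoded specified channels.

    specified: list of decoded specified-channel sample lists (e.g. FOA).
    gains: one gain per remaining channel; each remaining channel is modelled
    as that gain applied to the first specified channel (e.g. W).
    """
    reference = specified[0]                              # e.g. the W channel
    remaining = [[g * s for s in reference] for g in gains]
    return specified + remaining                          # full channel list
```

The reconstructed channel list (specified channels followed by gain-derived remaining channels) then forms the reconstructed HOA signal of the frame.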
Optionally, determining a signal of a specified channel in the HOA signal of the current frame based on the code stream includes: determining a virtual speaker signal and a residual signal based on the code stream; based on the virtual loudspeaker signal and the residual signal, a signal specifying the channel is determined.
Optionally, determining a virtual speaker signal and a residual signal based on the code stream includes: decoding the code stream through a stereo decoder to obtain three paths of stereo signals; based on the three paths of stereo signals, a path of virtual speaker signal and three paths of residual signals are determined.
Optionally, determining one virtual speaker signal and three residual signals based on the three stereo signals includes: determining the virtual speaker signal based on one of the three stereo signals; and determining the three residual signals based on the other two of the three stereo signals.
Optionally, determining a virtual speaker signal and a residual signal based on the code stream includes: decoding the code stream through a mono decoder to obtain one virtual speaker signal and three residual signals.
Optionally, the signals of the specified channels include a first-order ambisonics (FOA) signal, the FOA signal including an omnidirectional W signal and directional X, Y, and Z signals; and determining the signals of the specified channels based on the virtual speaker signal and the residual signal includes: determining the W signal based on the virtual speaker signal; and determining the X signal, the Y signal, and the Z signal based on the residual signal and the W signal, or based on the residual signal alone.
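Mirroring the encoder-side split, the decoder-side FOA recovery can be sketched as follows (the `difference_coded` flag name and per-sample list representation are assumptions):

```python
def recover_foa(speaker, residuals, difference_coded=True):
    """Invert the encoder-side split: W is the decoded virtual speaker signal;
    X/Y/Z come from the residuals either by adding W back (difference-coded
    residuals) or directly (residuals are the directional channels)."""
    w = list(speaker)
    if difference_coded:
        # undo the per-sample W-difference applied at the encoder
        x, y, z = ([r + s for r, s in zip(res, w)] for res in residuals)
    else:
        x, y, z = (list(res) for res in residuals)
    return w, x, y, z
```

With matching variants on both ends, this recovery is the exact inverse of the encoder-side residual computation.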
Optionally, the method further comprises: if the decoding scheme of the current frame is the first decoding scheme, acquiring a reconstructed HOA signal of the current frame according to the code stream according to the first decoding scheme; and if the decoding scheme of the current frame is the second decoding scheme, acquiring a reconstructed HOA signal of the current frame according to the code stream according to the second decoding scheme.
Optionally, obtaining the reconstructed HOA signal of the current frame from the code stream according to the second decoding scheme includes: obtaining an initial HOA signal from the code stream according to the second decoding scheme; if the decoding scheme of the previous frame of the current frame is the third decoding scheme, performing gain adjustment on the higher-order part of the initial HOA signal according to the higher-order gain of the previous frame of the current frame; and obtaining the reconstructed HOA signal based on the lower-order part and the gain-adjusted higher-order part of the initial HOA signal. That is, the higher-order gain adjustment further smooths the transition in auditory quality.
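A minimal sketch of the higher-order gain adjustment, assuming the lower-order part is the 4-channel FOA portion and the higher-order gain is a single scalar carried over from the previous frame (both assumptions, as the patent does not fix the channel split or gain form here):

```python
def smooth_higher_order(hoa_channels, prev_gain, n_low=4):
    """Gain-adjust the higher-order part of the initial HOA signal with the
    higher-order gain of the previous (hybrid-decoded) frame."""
    low = hoa_channels[:n_low]                              # FOA part, untouched
    high = [[prev_gain * s for s in ch] for ch in hoa_channels[n_low:]]
    return low + high
```

Carrying the previous frame's gain into the first frame after a switch avoids an abrupt jump in the energy of the higher-order channels.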
Optionally, obtaining the decoding scheme of the current frame based on the code stream includes: parsing the value of the switching flag of the current frame from the code stream; if the value of the switching flag is the first value, parsing indication information of the decoding scheme of the current frame from the code stream, where the indication information indicates that the decoding scheme of the current frame is the first decoding scheme or the second decoding scheme; and if the value of the switching flag is the second value, determining that the decoding scheme of the current frame is the third decoding scheme.
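The decoder-side parsing rule can be sketched as a small function; the concrete values 0 and 1 for the "first value" and "second value" of the switching flag, and 0/1 for the indication information, are assumptions:

```python
def parse_scheme(switch_flag, indication=None):
    """Decoder-side scheme selection per the rule above."""
    if switch_flag == 1:
        return "third"                # switch frame: hybrid decoding scheme
    # non-switch frame: the indication selects the first or second scheme
    return "first" if indication == 0 else "second"
```

Only non-switch frames carry the extra indication bit, so the hybrid scheme costs no additional scheme signalling beyond the flag itself.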
Optionally, obtaining a decoding scheme of the current frame based on the code stream includes: and analyzing the indication information of the decoding scheme of the current frame from the code stream, wherein the indication information is used for indicating that the decoding scheme of the current frame is a first decoding scheme, a second decoding scheme or a third decoding scheme.
Optionally, obtaining a decoding scheme of the current frame based on the code stream includes: analyzing an initial decoding scheme of the current frame from the code stream, wherein the initial decoding scheme is a first decoding scheme or a second decoding scheme; if the initial decoding scheme of the current frame is the same as the initial decoding scheme of the previous frame of the current frame, determining that the decoding scheme of the current frame is the initial decoding scheme of the current frame; and if the initial decoding scheme of the current frame is the first decoding scheme and the initial decoding scheme of the previous frame of the current frame is the second decoding scheme, or the initial decoding scheme of the current frame is the second decoding scheme and the initial decoding scheme of the previous frame of the current frame is the first decoding scheme, determining that the decoding scheme of the current frame is the third decoding scheme.
In a third aspect, there is provided an encoding apparatus having a function of implementing the behavior of the encoding method in the first aspect. The encoding apparatus includes one or more modules, and the one or more modules are configured to implement the encoding method provided by the first aspect.
That is, there is provided an encoding apparatus including:
a first determining module, configured to determine a coding scheme of the current frame according to the higher-order ambisonic HOA signal of the current frame, where the coding scheme of the current frame is one of a first coding scheme, a second coding scheme, and a third coding scheme; wherein the first encoding scheme is a HOA encoding scheme based on directional audio encoding, the second encoding scheme is a HOA encoding scheme based on virtual speaker selection, and the third encoding scheme is a hybrid encoding scheme;
and the first coding module is used for coding the signals of the appointed channels in the HOA signals into a code stream if the coding scheme of the current frame is the third coding scheme, and the appointed channels are partial channels in all the channels of the HOA signals.
Optionally, the channel-specific signals include a first-order ambisonic FOA signal, the FOA signal including an omnidirectional W signal, and directional X, Y, and Z signals.
Optionally, the first encoding module comprises:
a first determining sub-module for determining a virtual speaker signal and a residual signal based on the W signal, the X signal, the Y signal, and the Z signal;
and the coding submodule is used for coding the virtual loudspeaker signal and the residual signal into a code stream.
Optionally, the first determining sub-module is configured to:
determining the W signal as a path of virtual loudspeaker signal;
determining three-way residual signals based on the W signal, the X signal, the Y signal, and the Z signal, or determining the X signal, the Y signal, and the Z signal as the three-way residual signals.
Optionally, the encoding submodule is configured to:
combining the virtual loudspeaker signal with a first preset single-channel signal to obtain a stereo signal;
combining the three residual signals with a second preset single-channel signal to obtain two paths of stereo signals;
and respectively coding the obtained three paths of stereo signals into code streams through a stereo encoder.
Optionally, the encoding submodule is configured to:
combining two paths of residual signals with highest correlation in the three paths of residual signals to obtain one path of stereo signal in the two paths of stereo signals;
and combining one path of residual signal except the two paths of residual signals with the highest correlation in the three paths of residual signals with a second path of preset single-channel signal to obtain the other path of stereo signal in the two paths of stereo signals.
Optionally, the first preset single-channel signal is an all-zero signal or an all-one signal, where the all-zero signal is a signal whose sample values are all zero or whose frequency-bin values are all zero, and the all-one signal is a signal whose sample values are all one or whose frequency-bin values are all one; the second preset single-channel signal is likewise an all-zero signal or an all-one signal; and the first preset single-channel signal may be the same as or different from the second preset single-channel signal.
Optionally, the encoding submodule is configured to:
encoding the virtual speaker signal and each of the three residual signals into the code stream through a mono encoder.
Optionally, the apparatus further comprises:
a second coding module, configured to code the HOA signal into a code stream according to the first coding scheme if the coding scheme of the current frame is the first coding scheme;
and the third coding module is used for coding the HOA signal into a code stream according to the second coding scheme if the coding scheme of the current frame is the second coding scheme.
Optionally, the first determining module includes:
a second determining sub-module, configured to determine an initial coding scheme of the current frame according to the HOA signal, where the initial coding scheme is the first coding scheme or the second coding scheme;
a third determining sub-module, configured to determine that the coding scheme of the current frame is the initial coding scheme of the current frame if the initial coding scheme of the current frame is the same as the initial coding scheme of the previous frame of the current frame;
and a fourth determining sub-module, configured to determine that the encoding scheme of the current frame is the third encoding scheme if the initial encoding scheme of the current frame is the first encoding scheme and the initial encoding scheme of the previous frame of the current frame is the second encoding scheme, or the initial encoding scheme of the current frame is the second encoding scheme and the initial encoding scheme of the previous frame of the current frame is the first encoding scheme.
Optionally, the apparatus further comprises:
and the fourth coding module is used for coding the indication information of the initial coding scheme of the current frame into the code stream.
Optionally, the apparatus further comprises:
a second determining module, configured to determine the value of a switching flag of the current frame, where the value of the switching flag of the current frame is a first value when the encoding scheme of the current frame is the first encoding scheme or the second encoding scheme, and is a second value when the encoding scheme of the current frame is the third encoding scheme;
and the fifth coding module is configured to encode the value of the switching flag into the code stream.
Optionally, the apparatus further comprises:
and the sixth coding module is used for coding the indication information of the coding scheme of the current frame into the code stream.
Optionally, the designated channel is consistent with a transmission channel preset in the first coding scheme.
In a fourth aspect, there is provided a decoding apparatus having a function of implementing the behavior of the decoding method in the second aspect described above. The decoding device comprises one or more modules for implementing the decoding method provided by the second aspect.
That is, there is provided a decoding apparatus including:
a first obtaining module, configured to obtain a decoding scheme of a current frame based on a code stream, where the decoding scheme of the current frame is one of a first decoding scheme, a second decoding scheme, and a third decoding scheme; the first decoding scheme is a higher-order ambisonics (HOA) decoding scheme based on directional audio coding, the second decoding scheme is an HOA decoding scheme based on virtual speaker selection, and the third decoding scheme is a hybrid decoding scheme;
a first determining module, configured to determine, based on the code stream, a signal of a specified channel in the HOA signal of the current frame if the decoding scheme of the current frame is the third decoding scheme, where the specified channel is a partial channel in all channels of the HOA signal;
a second determining module for determining, based on the signal of the specified channel, gains of one or more remaining channels of the HOA signal other than the specified channel;
a third determining module for determining a signal of each of the one or more remaining channels based on the signal of the designated channel and the gain of the one or more remaining channels;
a second obtaining module, configured to obtain a reconstructed HOA signal of the current frame based on the signal of the specified channel and the signals of the one or more remaining channels.
Optionally, the first determining module includes:
the first determining submodule is used for determining a virtual loudspeaker signal and a residual error signal based on the code stream;
a second determining sub-module for determining a signal of the specified channel based on the virtual speaker signal and the residual signal.
Optionally, the first determining sub-module is configured to:
decoding the code stream with a stereo decoder to obtain three stereo signals;
determining one virtual speaker signal and three residual signals based on the three stereo signals.
Optionally, the first determining submodule is configured to:
determining the virtual speaker signal based on one of the three stereo signals;
determining the three residual signals based on the other stereo signals of the three stereo signals.
Optionally, the first determining sub-module is configured to:
decoding the code stream with a mono decoder to obtain one virtual speaker signal and three residual signals.
Optionally, the signal of the specified channel includes a first-order ambisonics (FOA) signal, the FOA signal including an omnidirectional W signal and directional X, Y, and Z signals;
the first determining sub-module is configured to:
determining a W signal based on the virtual speaker signal;
the X signal, the Y signal, and the Z signal are determined based on the residual signal and the W signal, or the X signal, the Y signal, and the Z signal are determined based on the residual signal.
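The two FOA variants described above can be sketched as follows. The additive combination of the residual signals with the W signal in the first variant is a hypothetical mixing rule — the exact combination is not specified in this excerpt — and the channel ordering (W, X, Y, Z) is the conventional FOA layout.

```python
import numpy as np

def foa_from_transport(virtual_spk, residuals, use_w=True):
    """Assemble an FOA frame (W, X, Y, Z) from one virtual-speaker signal
    and three residual signals, per the two variants in the text."""
    w = np.asarray(virtual_spk, dtype=float)          # W derived from virtual speaker signal
    res = [np.asarray(r, dtype=float) for r in residuals]
    if use_w:
        # Variant 1: X, Y, Z determined from the residuals and W
        # (additive mixing is an illustrative assumption).
        x, y, z = (r + w for r in res)
    else:
        # Variant 2: X, Y, Z determined from the residuals alone.
        x, y, z = res
    return np.vstack([w, x, y, z])                    # shape (4, n_samples)

foa = foa_from_transport([1.0, 1.0], [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]])
```

Both variants return the four specified channels that the later modules use to derive the gains of the remaining HOA channels.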
Optionally, the apparatus further comprises:
a first decoding module for obtaining a reconstructed HOA signal of the current frame from the code stream according to the first decoding scheme if the decoding scheme of the current frame is the first decoding scheme;
a second decoding module for obtaining a reconstructed HOA signal of the current frame from the code stream according to the second decoding scheme if the decoding scheme of the current frame is the second decoding scheme.
Optionally, the second decoding module comprises:
a first obtaining sub-module for obtaining an initial HOA signal from the code stream according to the second decoding scheme;
a gain adjustment sub-module, configured to perform gain adjustment on the high-order part of the initial HOA signal according to a high-order gain of a previous frame of the current frame if the decoding scheme of the previous frame of the current frame is the third decoding scheme;
a second obtaining sub-module for obtaining a reconstructed HOA signal based on the low order part and the gain adjusted high order part of the initial HOA signal.
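The gain adjustment of the high-order part can be sketched as follows, assuming (hypothetically) that the low-order part comprises the first four (FOA) channels and that the previous frame's high-order gain is a single scalar; the real scheme may use per-channel gains.

```python
import numpy as np

def smooth_high_order(initial_hoa, prev_high_order_gain, n_low=4):
    """Gain-adjust the high-order part of the initial HOA signal using the
    previous frame's high-order gain, then recombine with the low-order part.

    initial_hoa: (n_channels, n_samples) array from the second decoding scheme.
    n_low: number of low-order channels (assumed 4, i.e. the FOA part).
    """
    low = initial_hoa[:n_low]                            # low-order part, untouched
    high = initial_hoa[n_low:] * prev_high_order_gain    # gain-adjusted high-order part
    return np.concatenate([low, high], axis=0)

initial = np.ones((16, 8))            # 3rd-order HOA, 16 channels
adjusted = smooth_high_order(initial, 0.8)
```

The adjustment only applies when the previous frame used the third (hybrid) decoding scheme, which keeps the high-order energy continuous across the scheme switch.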
Optionally, the first obtaining module includes:
a first parsing sub-module for parsing the value of the switching flag of the current frame from the code stream;
a second parsing sub-module, configured to parse, if the value of the switching flag is the first value, indication information of the decoding scheme of the current frame from the code stream, where the indication information indicates that the decoding scheme of the current frame is the first decoding scheme or the second decoding scheme;
and a third determining sub-module, configured to determine that the decoding scheme of the current frame is the third decoding scheme if the value of the switching flag is the second value.
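A minimal sketch of this parsing logic follows. The bit encodings (switching flag first value = 0, one indication bit selecting between schemes 1 and 2) are assumptions for illustration; the actual code-stream syntax is not given in this excerpt.

```python
def parse_decoding_scheme(bits):
    """Return 1, 2, or 3 (first/second/third decoding scheme) from a
    sequence of 0/1 ints read from the code stream.

    Hypothetical layout: a 1-bit switching flag; if it equals the first
    value, a 1-bit indication selects scheme 1 or 2; otherwise the
    current frame uses the third (hybrid) scheme.
    """
    it = iter(bits)
    FIRST_VALUE = 0                  # assumed encoding of the first flag value
    if next(it) == FIRST_VALUE:
        indication = next(it)
        return 1 if indication == 0 else 2
    return 3                         # second value -> third decoding scheme
```

Note the asymmetry: the indication information is only present in the code stream when the switching flag takes the first value, so the third scheme costs one fewer bit of signaling per frame.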
Optionally, the first obtaining module includes:
a third parsing sub-module for parsing indication information of the decoding scheme of the current frame from the code stream, where the indication information indicates that the decoding scheme of the current frame is the first decoding scheme, the second decoding scheme, or the third decoding scheme.
Optionally, the first obtaining module includes:
a fourth parsing sub-module for parsing the initial decoding scheme of the current frame from the code stream, where the initial decoding scheme is the first decoding scheme or the second decoding scheme;
a fourth determining sub-module, configured to determine that the decoding scheme of the current frame is the initial decoding scheme of the current frame if the initial decoding scheme of the current frame is the same as the initial decoding scheme of the previous frame of the current frame;
and a fifth determining sub-module, configured to determine that the decoding scheme of the current frame is the third decoding scheme if the initial decoding scheme of the current frame is the first decoding scheme and the initial decoding scheme of the previous frame of the current frame is the second decoding scheme, or the initial decoding scheme of the current frame is the second decoding scheme and the initial decoding scheme of the previous frame of the current frame is the first decoding scheme.
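The decision rule of the fourth and fifth determining sub-modules reduces to a small function. The identifiers 1, 2, and 3 are assumed labels for the first, second, and third decoding schemes:

```python
def decide_scheme(initial_current, initial_previous):
    """initial_current / initial_previous: 1 or 2, the initial decoding
    schemes parsed from the code stream for the current and previous frame.
    Returns the decoding scheme actually used for the current frame."""
    if initial_current == initial_previous:
        return initial_current       # no switch: keep the initial scheme
    return 3                         # a 1<->2 switch: use the third (hybrid) scheme
```

In other words, the third scheme never appears in the code stream explicitly under this option; it is inferred whenever consecutive frames signal different initial schemes, which is exactly the switching situation the hybrid scheme is meant to smooth.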
In a fifth aspect, an encoding-side device is provided. The encoding-side device includes a processor and a memory, where the memory is configured to store a program for executing the encoding method provided by the first aspect and to store data involved in implementing the encoding method provided by the first aspect. The processor is configured to execute the program stored in the memory. The encoding-side device may further include a communication bus for establishing a connection between the processor and the memory.
In a sixth aspect, a decoding-side device is provided. The decoding-side device includes a processor and a memory, where the memory is configured to store a program for executing the decoding method provided by the second aspect and to store data involved in implementing the decoding method provided by the second aspect. The processor is configured to execute the program stored in the memory. The decoding-side device may further include a communication bus for establishing a connection between the processor and the memory.
In a seventh aspect, a computer-readable storage medium is provided, which stores instructions that, when executed on a computer, cause the computer to perform the encoding method of the first aspect or the decoding method of the second aspect.
In an eighth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the encoding method of the first aspect or the decoding method of the second aspect.
The technical effects obtained by the third aspect, the fourth aspect, the fifth aspect, the sixth aspect, the seventh aspect, and the eighth aspect are similar to the technical effects obtained by the corresponding technical means in the first aspect or the second aspect, and are not described herein again.
The technical scheme provided by the embodiment of the application can at least bring the following beneficial effects:
in the embodiment of the present application, the HOA signal of an audio frame is encoded and decoded by combining two schemes (i.e., a coding and decoding scheme based on virtual speaker selection and a coding and decoding scheme based on directional audio coding); that is, a suitable coding and decoding scheme is selected for each audio frame, which can improve the compression rate of the audio signal. Meanwhile, to ensure a smooth transition in auditory quality when switching between the two coding and decoding schemes, some audio frames are encoded and decoded with a third, compromise scheme rather than either of the two schemes directly: only the signals of a specified channel in the HOA signal of the audio frame are encoded into the code stream. In this way, the auditory quality of the decoded and reconstructed HOA signal, after rendering and playback, transitions smoothly.
Drawings
fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
fig. 2 is a schematic diagram of an implementation environment of a terminal scenario provided in an embodiment of the present application;
fig. 3 is a schematic diagram of an implementation environment of a transcoding scenario of a wireless or core network device according to an embodiment of the present application;
fig. 4 is a schematic diagram of an implementation environment of a broadcast television scene provided in an embodiment of the present application;
fig. 5 is a schematic diagram of an implementation environment of a virtual reality streaming scene according to an embodiment of the present application;
fig. 6 is a flowchart of an encoding method provided in an embodiment of the present application;
fig. 7 is a schematic diagram of a switching frame coding scheme provided in an embodiment of the present application;
fig. 8 is a schematic diagram of an HOA encoding scheme based on virtual speaker selection according to an embodiment of the present application;
fig. 9 is a schematic diagram of a DirAC-based HOA coding scheme provided in an embodiment of the present application;
fig. 10 is a flowchart of another encoding method provided by an embodiment of the present application;
fig. 11 is a flowchart of a decoding method provided in an embodiment of the present application;
fig. 12 is a schematic diagram of a switching frame decoding scheme provided by an embodiment of the present application;
fig. 13 is a schematic diagram of an HOA decoding scheme based on virtual speaker selection according to an embodiment of the present application;
fig. 14 is a schematic diagram of a DirAC-based HOA decoding scheme provided in an embodiment of the present application;
fig. 15 is a flowchart of another decoding method provided in the embodiments of the present application;
fig. 16 is a schematic structural diagram of an encoding apparatus according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a decoding apparatus according to an embodiment of the present application;
fig. 18 is a schematic block diagram of a coding and decoding device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before explaining the encoding and decoding method provided in the embodiment of the present application in detail, an implementation environment related to the embodiment of the present application will be introduced.
Referring to fig. 1, fig. 1 is a schematic diagram of an implementation environment according to an embodiment of the present application. The implementation environment includes a source device 10, a destination device 20, a link 30, and a storage device 40. Source device 10 may generate encoded media data; accordingly, source device 10 may also be referred to as a media data encoding device. Destination device 20 may decode the encoded media data generated by source device 10; accordingly, destination device 20 may also be referred to as a media data decoding device. Link 30 may receive encoded media data generated by source device 10 and may transmit the encoded media data to destination device 20. Storage device 40 may receive encoded media data generated by source device 10 and may store the encoded media data, in which case destination device 20 may retrieve the encoded media data directly from storage device 40. Alternatively, storage device 40 may correspond to a file server or another intermediate storage device that may hold the encoded media data generated by source device 10, in which case destination device 20 may stream or download the encoded media data stored by storage device 40.
Source device 10 and destination device 20 may each include one or more processors and memory coupled to the one or more processors, which may include random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, any other medium that may be used to store desired program code in the form of computer-accessible instructions or data structures, and the like. For example, source device 10 and destination device 20 may each comprise a desktop computer, a mobile computing device, a notebook (e.g., laptop) computer, a tablet computer, a set-top box, a telephone handset such as a so-called "smart" phone, a television, a camera, a display device, a digital media player, a video game console, an on-board computer, or the like.
Link 30 may include one or more media or devices capable of transmitting encoded media data from source device 10 to destination device 20. In one possible implementation, link 30 may include one or more communication media that enable source device 10 to transmit encoded media data directly to destination device 20 in real-time. In the present embodiment, source device 10 may modulate the encoded media data based on a communication standard, which may be a wireless communication protocol or the like, and may transmit the modulated media data to destination device 20. The one or more communication media may include wireless and/or wired communication media, for example, the one or more communication media may include a Radio Frequency (RF) spectrum or one or more physical transmission lines. The one or more communication media may form part of a packet-based network, which may be a local area network, a wide area network, or a global network (e.g., the internet), among others. The one or more communication media may include a router, a switch, a base station, or other devices that facilitate communication from source device 10 to destination device 20, and the like, which is not specifically limited in this embodiment.
In one possible implementation, storage device 40 may store the received encoded media data sent by source device 10, and destination device 20 may retrieve the encoded media data directly from storage device 40. In such a case, the storage device 40 may include any of a variety of distributed or locally accessed data storage media, such as a hard drive, a blu-ray disc, a Digital Versatile Disc (DVD), a compact disc read-only memory (CD-ROM), a flash memory, a volatile or non-volatile memory, or any other suitable digital storage media for storing encoded media data.
In one possible implementation, storage device 40 may correspond to a file server or another intermediate storage device that may hold the encoded media data generated by source device 10, and destination device 20 may stream or download the media data stored by storage device 40. The file server may be any type of server capable of storing encoded media data and transmitting the encoded media data to the destination device 20. In one possible implementation, the file server may include a network server, a File Transfer Protocol (FTP) server, a Network Attached Storage (NAS) device, a local disk drive, or the like. Destination device 20 may obtain the encoded media data over any standard data connection, including an internet connection. Any standard data connection may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., a Digital Subscriber Line (DSL), cable modem, etc.), or a combination of both suitable for acquiring encoded media data stored on a file server. The transmission of the encoded media data from storage device 40 may be a streaming transmission, a download transmission, or a combination of both.
The implementation environment shown in fig. 1 is only one possible implementation manner, and the technology of the embodiment of the present application may be applied to not only the source device 10 that may encode media data and the destination device 20 that may decode encoded media data shown in fig. 1, but also other devices that may encode media data and decode encoded media data, which is not specifically limited in the embodiment of the present application.
In the implementation environment shown in fig. 1, source device 10 includes a data source 120, an encoder 100, and an output interface 140. In some embodiments, output interface 140 may include a modulator/demodulator (modem) and/or a transmitter. Data source 120 may include an image capture device (e.g., a camera), an archive containing previously captured media data, a feed interface for receiving media data from a media data content provider, and/or a computer graphics system for generating media data, or a combination of these sources of media data.
The data source 120 may transmit media data to the encoder 100, and the encoder 100 may encode the received media data transmitted by the data source 120 to obtain encoded media data. The encoder may send the encoded media data to the output interface. In some embodiments, source device 10 sends the encoded media data directly to destination device 20 via output interface 140. In other embodiments, the encoded media data may also be stored onto storage device 40 for later retrieval by destination device 20 and for decoding and/or display.
In the implementation environment shown in fig. 1, destination device 20 includes an input interface 240, a decoder 200, and a display device 220. In some embodiments, input interface 240 includes a receiver and/or a modem. The input interface 240 may receive the encoded media data via the link 30 and/or from the storage device 40 and then send it to the decoder 200, and the decoder 200 may decode the received encoded media data to obtain decoded media data. The decoder may send the decoded media data to the display device 220. Display device 220 may be integrated with destination device 20 or may be external to destination device 20. In general, display device 220 displays the decoded media data. The display device 220 may be any one of a plurality of types of display devices, for example, the display device 220 may be a Liquid Crystal Display (LCD), a plasma display, an organic light-emitting diode (OLED) display, or other types of display devices.
Although not shown in fig. 1, in some aspects, encoder 100 and decoder 200 may each be integrated with an audio encoder and an audio decoder, and may include appropriate multiplexer-demultiplexer (MUX-DEMUX) units or other hardware and software for encoding both audio and video in a common data stream or separate data streams. In some embodiments, the MUX-DEMUX unit may conform to the ITU H.223 multiplexer protocol, or other protocols such as the User Datagram Protocol (UDP), if applicable.
Encoder 100 and decoder 200 may each be any of the following circuits: one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combination thereof. If the techniques of embodiments of the present application are implemented in part in software, a device may store instructions for the software in a suitable non-volatile computer-readable storage medium and may execute the instructions in hardware using one or more processors to implement the techniques of embodiments of the present application. Any of the foregoing, including hardware, software, a combination of hardware and software, etc., can be considered as one or more processors. Each of the encoder 100 and decoder 200 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (codec) in the respective device.
Embodiments of the present application may generally refer to encoder 100 as "signaling" or "sending" certain information to another device, such as decoder 200. The terms "signaling" or "sending" may generally refer to the transfer of syntax elements and/or other data used to decode compressed media data. This transfer may occur in real time or near real time. Alternatively, such communication may occur over a period of time, such as may occur when, at the time of encoding, syntax elements are stored in the encoded bitstream to a computer-readable storage medium, which the decoding device may then retrieve at any time after the syntax elements are stored to such medium.
The encoding and decoding method provided by the embodiment of the application can be applied to various scenes, and then, taking media data to be encoded as an HOA signal as an example, several scenes are introduced respectively.
Referring to fig. 2, fig. 2 is a schematic diagram of an implementation environment in which an encoding and decoding method provided by an embodiment of the present application is applied to a terminal scenario. The implementation environment comprises a first terminal 101 and a second terminal 201, and the first terminal 101 is in communication connection with the second terminal 201. The communication connection may be a wireless connection or a wired connection, which is not limited in this embodiment of the present application.
The first terminal 101 may be a sending end device or a receiving end device, and similarly, the second terminal 201 may be a receiving end device or a sending end device. For example, when the first terminal 101 is a transmitting terminal, the second terminal 201 is a receiving terminal, and when the first terminal 101 is a receiving terminal, the second terminal 201 is a transmitting terminal.
Next, the first terminal 101 is taken as a sending end device, and the second terminal 201 is taken as a receiving end device for example.
The first terminal 101 and the second terminal 201 each include an audio acquisition module, an audio playback module, an encoder, a decoder, a channel encoding module, and a channel decoding module. In an embodiment of the present application, the encoder is a three-dimensional audio encoder and the decoder is a three-dimensional audio decoder.
The audio acquisition module in the first terminal 101 acquires the HOA signal and transmits the HOA signal to the encoder, and the encoder encodes the HOA signal by using the encoding method provided in the embodiment of the present application, where the encoding may be referred to as source encoding. Then, in order to realize the transmission of the HOA signal in the channel, the channel coding module further needs to perform channel coding, and then transmit the code stream obtained by coding in the digital channel through a wireless or wired network communication device.
The second terminal 201 receives the code stream transmitted in the digital channel through the wireless or wired network communication device, the channel decoding module performs channel decoding on the code stream, and then the decoder decodes the code stream by using the decoding method provided by the embodiment of the application to obtain the HOA signal, and plays the HOA signal through the audio playback module.
The first terminal 101 and the second terminal 201 may each be any electronic product capable of human-computer interaction with a user through one or more of a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, or a handwriting device, such as a personal computer (PC), a mobile phone, a smart phone, a personal digital assistant (PDA), a wearable device, a pocket PC, a tablet computer, a smart in-vehicle terminal, a smart television, a smart sound box, and the like.
Those skilled in the art should appreciate that the above-described terminal is only exemplary and that other existing or future existing terminals, which may be suitable for use with the embodiments of the present application, are also included within the scope of the embodiments of the present application and are hereby incorporated by reference.
Referring to fig. 3, fig. 3 is a schematic diagram of an implementation environment in which an encoding and decoding method provided in an embodiment of the present application is applied to a transcoding scenario of a wireless or core network device. The implementation environment includes a channel decoding module, an audio decoder, an audio encoder, and a channel encoding module. In an embodiment of the present application, the audio encoder is a three-dimensional audio encoder, and the audio decoder is a three-dimensional audio decoder.
The audio decoder may be a decoder using the decoding method provided in the embodiment of the present application, and may also be a decoder using another decoding method. The audio encoder may be an encoder using the encoding method provided in the embodiments of the present application, or may be an encoder using another encoding method. In the case where the audio decoder is a decoder using the decoding method provided in the embodiment of the present application, the audio encoder is an encoder using another encoding method, and in the case where the audio decoder is a decoder using another decoding method, the audio encoder is an encoder using the encoding method provided in the embodiment of the present application.
In the first case, the audio decoder is a decoder using the decoding method provided in the embodiments of the present application, and the audio encoder is an encoder using another encoding method.
At this time, the channel decoding module is used for performing channel decoding on the received code stream, then the audio decoder is used for performing source decoding by using the decoding method provided by the embodiment of the application, and then the audio encoder is used for encoding according to other encoding methods, so that conversion from one format to another format, namely transcoding, is realized. And then, the signal is transmitted after channel coding.
In the second case, the audio decoder is a decoder using another decoding method, and the audio encoder is an encoder using the encoding method provided in the embodiments of the present application.
At this time, the channel decoding module is used for performing channel decoding on the received code stream, then the audio decoder is used for performing source decoding by using other decoding methods, and then the audio encoder is used for encoding by using the encoding method provided by the embodiment of the application, so that conversion from one format to another format, namely transcoding, is realized. And then, the signal is transmitted after channel coding.
The wireless device may be a wireless access point, a wireless router, a wireless connector, etc. The core network device may be a mobility management entity, a gateway, etc.
Those skilled in the art will appreciate that the above-described wireless devices or core network devices are merely examples, and that other wireless or core network devices, now existing or later to be developed, that may be suitable for use in the embodiments of the present application are also included within the scope of the embodiments of the present application and are hereby incorporated by reference.
Referring to fig. 4, fig. 4 is a schematic diagram of an implementation environment in which an encoding and decoding method provided by an embodiment of the present application is applied to a broadcast television scene. The broadcast television scene is divided into a live broadcast scene and a post-production scene. For a live scene, the implementation environment comprises a live program three-dimensional sound production module, a three-dimensional sound coding module, a set top box and a loudspeaker set, wherein the set top box comprises a three-dimensional sound decoding module. For the post-production scene, the implementation environment comprises a post-program three-dimensional sound production module, a three-dimensional sound coding module, a network receiver, a mobile terminal, earphones and the like.
In a live broadcast scene, a live broadcast program three-dimensional sound making module generates a three-dimensional sound signal (such as an HOA signal), the three-dimensional sound signal obtains a code stream by applying the coding method of the embodiment of the application, the code stream is transmitted to a user side through a broadcast television network, and a three-dimensional sound decoder in a set top box decodes the code stream by using the decoding method provided by the embodiment of the application, so that the three-dimensional sound signal is reconstructed and played back by a loudspeaker set. Or, the code stream is transmitted to the user side via the internet, and the three-dimensional sound decoder in the network receiver decodes the code stream by using the decoding method provided by the embodiment of the application, so as to reconstruct the three-dimensional sound signal, and the speaker group plays back the three-dimensional sound signal. Or, the code stream is transmitted to the user side through the internet, and the three-dimensional sound decoder in the mobile terminal decodes the code stream by using the decoding method provided by the embodiment of the application, so that the three-dimensional sound signal is reconstructed and played back by the earphone.
In a post-production scene, a three-dimensional sound signal is produced by a post-program three-dimensional sound production module, the three-dimensional sound signal obtains a code stream by applying the coding method of the embodiment of the application, the code stream is transmitted to a user side through a broadcast television network, and a three-dimensional sound decoder in the set top box decodes the code stream by using the decoding method provided by the embodiment of the application, so that the three-dimensional sound signal is reconstructed and played back by a loudspeaker set. Or, the code stream is transmitted to the user side via the internet, and the three-dimensional sound decoder in the network receiver decodes the code stream by using the decoding method provided by the embodiment of the application, so as to reconstruct the three-dimensional sound signal, and the speaker group plays back the three-dimensional sound signal. Or, the code stream is transmitted to the user side through the internet, and the three-dimensional sound decoder in the mobile terminal decodes the code stream by using the decoding method provided by the embodiment of the application, so that the three-dimensional sound signal is reconstructed and played back by the earphone.
Referring to fig. 5, fig. 5 is a schematic view illustrating an implementation environment in which an encoding and decoding method provided by an embodiment of the present application is applied to a virtual reality stream scene. The implementation environment comprises an encoding end and a decoding end, wherein the encoding end comprises a collection module, a preprocessing module, an encoding module, a packing module and a sending module, and the decoding end comprises a unpacking module, a decoding module, a rendering module and an earphone.
The acquisition module acquires the HOA signal, and the preprocessing module then performs preprocessing on the HOA signal, including filtering out the low-frequency part of the HOA signal (typically using 20 Hz or 50 Hz as the cut-off point) and extracting azimuth information from the HOA signal. The encoding module then encodes the signal using the encoding method provided by the embodiment of the present application, the packing module packs the encoding result, and the sending module transmits the packed code stream to the decoding end.
At the decoding end, the unpacking module first unpacks the received data, the decoding module then decodes it using the decoding method provided by the embodiment of the present application, and the rendering module performs binaural rendering on the decoded signal, mapping the rendered signal to the listener's headphones. The headphones may be standalone headphones or headphones on virtual-reality glasses.
It should be noted that the system architecture and the service scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not constitute a limitation to the technical solution provided in the embodiment of the present application, and it is known by a person of ordinary skill in the art that, with the evolution of the system architecture and the occurrence of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.
The encoding and decoding methods provided in the embodiments of the present application are explained in detail below. It should be noted that, in conjunction with the implementation environment shown in fig. 1, any of the encoding methods below may be performed by the encoder 100 in the source device 10. Any of the decoding methods hereinafter may be performed by the decoder 200 in the destination device 20.
Fig. 6 is a flowchart of an encoding method provided in an embodiment of the present application, where the encoding method is applied to an encoding end. Referring to fig. 6, the method includes the following steps.
Step 601: the encoding scheme of the current frame is determined from the HOA signal of the current frame.
For HOA signals of multiple audio frames to be encoded, the encoding end encodes on a frame-by-frame basis. The HOA signal of an audio frame is an audio signal obtained by an HOA acquisition technique. The HOA signal is a scene audio signal and also a three-dimensional audio signal: it is an audio signal acquired from the sound field at the position in space where the microphone is located, and the acquired audio signal is referred to as the original HOA signal. The HOA signal of the audio frame may also be an HOA signal obtained by converting a three-dimensional audio signal of another format, for example, a 5.1-channel signal converted into an HOA signal, or a three-dimensional audio signal in which a 5.1-channel signal and object audio are mixed converted into an HOA signal. Optionally, the HOA signal of the audio frame to be encoded is a time-domain signal or a frequency-domain signal, and may include all channels of the HOA signal or only some channels of the HOA signal. Illustratively, if the order of the HOA signal of an audio frame is 3, the number of channels of the HOA signal is 16, the frame length of the audio frame is 20 ms, and the sampling rate is 48 kHz, then the HOA signal of the audio frame to be encoded contains signals of 16 channels, each channel containing 960 sampling points.
In order to reduce computational complexity, if the HOA signal of the audio frame acquired by the encoding end is the original HOA signal and the number of sampling points or frequency points of the original HOA signal is large, the encoding end may downsample the original HOA signal to obtain the HOA signal of the audio frame to be encoded. For example, the encoding end performs 1/Q downsampling on the original HOA signal to reduce the number of sampling points or frequency points of the HOA signal to be encoded; in the embodiment of the present application, for example, each channel of the original HOA signal includes 960 sampling points, and after 1/120 downsampling each channel of the HOA signal to be encoded includes 8 sampling points.
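The 1/Q downsampling described above can be sketched as plain decimation. This is an illustrative sketch, not the patent's implementation; in particular, a real encoder would normally low-pass filter before decimating, which is omitted here.

```python
def downsample_hoa(channels, q):
    """Decimate each channel of an HOA frame by keeping every q-th signal point.

    channels: list of per-channel signal-point lists; q: decimation factor.
    Plain 1/q decimation sketch; anti-aliasing filtering is deliberately omitted.
    """
    return [ch[::q] for ch in channels]

# A 16-channel frame of 960 points per channel, decimated by 1/120,
# leaves 8 points per channel, matching the example in the text.
frame = [[float(i) for i in range(960)] for _ in range(16)]
reduced = downsample_hoa(frame, 120)
```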
In the embodiment of the present application, a coding method of a coding end is introduced by taking a coding end as an example to code a current frame. The current frame is an audio frame to be encoded. That is, the encoding end acquires the HOA signal of the current frame, and encodes the HOA signal of the current frame by using the encoding method provided in the embodiment of the present application.
It should be noted that, in order to achieve a higher compression rate for audio frames of different sound field types, a suitable encoding and decoding scheme needs to be selected for each audio frame according to its sound field type. In the embodiment of the present application, the encoding end first determines an initial encoding scheme of the current frame according to the HOA signal of the current frame, where the initial encoding scheme is the first encoding scheme or the second encoding scheme. The encoding end then determines whether to encode the HOA signal of the current frame with the first encoding scheme, the second encoding scheme, or the third encoding scheme by comparing the initial encoding scheme of the current frame with the initial encoding scheme of the previous frame of the current frame. If the initial coding scheme of the current frame is the same as that of the previous frame, the encoding end encodes the HOA signal of the current frame with a coding scheme consistent with the initial coding scheme of the current frame. If the two differ, the encoding end encodes the HOA signal of the current frame with the switching frame coding scheme.
In an embodiment of the present application, the encoding scheme of the current frame is one of a first encoding scheme, a second encoding scheme, and a third encoding scheme. Wherein the first coding scheme is a DirAC-based HOA coding scheme, the second coding scheme is a HOA coding scheme selected based on virtual loudspeakers, and the third coding scheme is a hybrid coding scheme. Alternatively, the hybrid coding scheme is also referred to as a switching frame coding scheme. The third coding scheme is a switching frame coding scheme provided by the embodiment of the application, and the third coding scheme is used for smooth transition of auditory quality when switching between different coding schemes. The embodiments of the present application will be described in detail below with respect to these three encoding schemes. In an embodiment of the application, the HOA encoding scheme based on virtual speaker selection is also referred to as MP-based HOA encoding scheme.
In the embodiment of the application, the encoding end determines the initial encoding scheme of the current frame according to the HOA signal of the current frame. Then, the coding terminal determines a coding scheme of the current frame based on the initial coding scheme of the current frame and the initial coding scheme of the previous frame of the current frame. It should be noted that the embodiment of the present application does not limit the implementation manner of determining the initial coding scheme by the coding end.
Optionally, the encoding end performs sound field type analysis on the HOA signal of the current frame to obtain a sound field classification result of the current frame, and determines an initial encoding scheme of the current frame based on the sound field classification result of the current frame. It should be noted that the embodiments of the present application do not limit the method of sound field type analysis, for example, the encoding side performs the sound field type analysis by performing singular value decomposition on the HOA signal of the current frame, or performs other linear decomposition on the HOA signal to perform the sound field type analysis.
Optionally, the sound field classification result includes the number of distinct sound sources. Taking the case where the encoding end directly performs sound field type analysis on the HOA signal of the current frame, one implementation of obtaining the sound field classification result of the current frame is as follows: the encoding end performs singular value decomposition on the HOA signal of the current frame to obtain M singular values, and computes the ratio of the i-th singular value to the (i+1)-th singular value, i = 1, 2, …, M-1, to obtain M-1 sound field classification parameters. The encoding end then determines the number of distinct sound sources corresponding to the current frame based on the M-1 sound field classification parameters. Here M = min(L, K), where L represents the number of channels of the HOA signal of the current frame, K represents the number of signal points of each channel of the HOA signal of the current frame, and min represents taking the minimum. If the HOA signal is a time-domain signal, the number of signal points is the number of sampling points; if the HOA signal is a frequency-domain signal, the number of signal points is the number of frequency points.
Optionally, denoting the M-1 sound field classification parameters as temp[i], i = 0, 1, …, M-2, the encoding end determines the number of distinct sound sources corresponding to the current frame as follows. Starting from i = 0, the following procedure is performed in order: judge whether temp[i] is greater than or equal to a preset distinct-sound-source decision threshold; if temp[i] in the current round is less than the threshold, update i to i+1 and continue with the next round; if temp[i] in the current round is greater than or equal to the threshold, determine that the number of distinct sound sources corresponding to the current frame equals i+1 and end the procedure. Optionally, the distinct-sound-source decision threshold is 30, 80, 100, or the like; it is a preset value, which may be set empirically or through statistics.
Accordingly, in one implementation, after determining the number of distinct sound sources corresponding to the current frame, if that number is greater than a first threshold and less than a second threshold, the encoding end determines that the initial encoding scheme of the current frame is the second encoding scheme. If the number of distinct sound sources corresponding to the current frame is not greater than the first threshold or not less than the second threshold, the encoding end determines that the initial encoding scheme of the current frame is the first encoding scheme. The first threshold is less than the second threshold. Optionally, the first threshold is 0 or another value and the second threshold is 3 or another value. The first threshold and the second threshold are preset values, which may be set empirically or through statistics.
Exemplarily, assume the number of channels of the HOA signal of the current frame is L = 16 and the number of frequency points per channel is K = 8, so min(L, K) = 8. The encoding end performs singular value decomposition on the HOA signal of the current frame, obtaining singular values v[i], i = 0, 1, …, min(L, K)-1. The encoding end computes the ratios between adjacent singular values as the sound field classification parameters of the current frame: temp[i] = v[i]/v[i+1], i = 0, 1, …, min(L, K)-2. Assuming the distinct-sound-source decision threshold is 100, the number n of distinct sound sources is determined as follows: starting from i = 0, judge whether temp[i] is greater than or equal to 100; if so, stop judging; otherwise, set i = i+1 and continue. When judging stops, the index i at that point plus 1 equals the number n of distinct sound sources corresponding to the current frame. For example, when i = 0, if temp[0] is greater than or equal to 100, judging stops and the number n of distinct sound sources equals 1; otherwise, set i = 1 and continue; when i = 1, if temp[1] ≥ 100, judging stops and the number n of distinct sound sources equals i+1 = 2. Assuming the first threshold is 0 and the second threshold is 3: if the number n of distinct sound sources corresponding to the current frame satisfies 0 < n < 3, the encoding end determines that the initial encoding scheme of the current frame is the second encoding scheme; if n = 0 or n ≥ 3, the encoding end determines that the initial encoding scheme of the current frame is the first encoding scheme.
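The source-counting and scheme-selection procedure above can be sketched as follows. The singular values are taken as given here (in practice they come from a singular value decomposition of the frame's HOA signal); the behavior when no ratio reaches the threshold is not spelled out in the text, so returning 0 (treated as a diffuse field) is an assumption, and the scheme names are illustrative strings.

```python
def count_distinct_sources(singular_values, ratio_threshold=100.0):
    """Count distinct sound sources from the ordered singular values of the
    current frame's HOA signal (largest first): scan temp[i] = v[i]/v[i+1]
    and stop at the first ratio >= threshold, returning i + 1."""
    for i in range(len(singular_values) - 1):
        if singular_values[i] / singular_values[i + 1] >= ratio_threshold:
            return i + 1
    return 0  # assumption: no ratio reached the threshold -> diffuse field

def initial_scheme(n_sources, first_threshold=0, second_threshold=3):
    """Map the source count to an initial scheme: 'second' (MP-based HOA
    coding) when first_threshold < n < second_threshold, otherwise 'first'
    (DirAC-based HOA coding), per the thresholds in the text."""
    if first_threshold < n_sources < second_threshold:
        return "second"
    return "first"

# One dominant source: v[0]/v[1] = 125 >= 100, so n = 1 -> MP-based scheme.
v = [500.0, 4.0, 2.0, 1.0, 0.5, 0.4, 0.3, 0.2]
n = count_distinct_sources(v)
```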
Optionally, the sound field classification result includes the sound field type, and sound field types are classified into a diffuse sound field and a distinct sound field. The sound field type may be determined from the number of distinct sound sources obtained by the foregoing method; that is, the encoding end determines the sound field type of the current frame based on the number of distinct sound sources corresponding to the current frame. For example, if the number of distinct sound sources corresponding to the current frame is greater than the first threshold and less than the second threshold, the encoding end determines that the sound field type of the current frame is a distinct sound field. If it is not greater than the first threshold or not less than the second threshold, the encoding end determines that the sound field type of the current frame is a diffuse sound field. Accordingly, if the sound field type of the current frame is a distinct sound field, the encoding end determines that the initial encoding scheme of the current frame is the second encoding scheme, i.e., the MP-based HOA encoding scheme. If the sound field type of the current frame is a diffuse sound field, the encoding end determines that the initial encoding scheme of the current frame is the first encoding scheme, i.e., the DirAC-based HOA encoding scheme.
In some embodiments, after the initial coding scheme of each audio frame (including the current frame) is determined through the foregoing implementation, the initial coding scheme may switch back and forth between audio frames, meaning that many switching frames would ultimately need to be coded. Since switching between coding schemes introduces many problems to be solved, the number of switching frames can be reduced to mitigate them. To that end, the encoding end may first determine the predicted coding scheme of the current frame according to the sound field classification result of the current frame, that is, treat the initial coding scheme determined by the foregoing method as the predicted coding scheme. The encoding end then updates the initial coding scheme of the current frame based on the predicted coding scheme using a sliding-window method, for example through a hangover process.
Alternatively, assuming that the sliding window has a length of N, the sliding window includes the predicted coding scheme of the current frame and the updated initial coding scheme of the previous N-1 frames of the current frame. And if the accumulated number of the second coding schemes in the sliding window is not less than the first specified threshold, the coding end updates the initial coding scheme of the current frame to the second coding scheme. If the accumulated number of the second coding schemes in the sliding window is smaller than a first specified threshold value, the coding end updates the initial coding scheme of the current frame to the first coding scheme. The length N of the sliding window is 8, 10, 15, and the like, and the first specified threshold is 5, 6, 7, and the like. For example, assuming that the length of the sliding window is 10, the first specified threshold is 7, and the sliding window includes the predicted coding scheme of the current frame and the updated initial coding scheme of the first 9 frames of the current frame, if the number of the second coding schemes in the sliding window is not less than 7, the encoding end determines the initial coding scheme of the current frame as the second coding scheme, and if the number of the second coding schemes in the sliding window is less than 7, the encoding end updates the initial coding scheme of the current frame to the first coding scheme.
Or if the accumulated number of the first coding schemes in the sliding window is not less than the second specified threshold, the coding end updates the initial coding scheme of the current frame to the first coding scheme. And if the accumulated number of the first coding schemes in the sliding window is smaller than a second specified threshold value, the coding end updates the initial coding scheme of the current frame to the second coding scheme. The second specified threshold is 5, 6, 7, and the like, and the value of the second specified threshold is not limited in the embodiment of the application. Optionally, the second specified threshold is different from or the same as the first specified threshold.
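The first sliding-window update rule above can be sketched as follows. This is a literal reading of the text: the window holds the updated initial schemes of the previous N-1 frames plus the current frame's predicted scheme, and the current entry is replaced with the updated decision. The window length, the threshold, and the handling of a not-yet-full window are taken from the text's example (N = 10, first specified threshold 7); everything else, including the deque-based bookkeeping, is an illustrative assumption.

```python
from collections import deque

def update_with_hangover(window, predicted, win_len=10, threshold=7):
    """Hangover update sketch for one frame.

    window: deque of the updated initial schemes of previous frames.
    predicted: the current frame's predicted scheme ('first' or 'second').
    Returns the updated initial scheme of the current frame and stores it
    back into the window so later frames see updated schemes."""
    window.append(predicted)
    while len(window) > win_len:
        window.popleft()
    # Rule: 'second' only if it has accumulated at least `threshold`
    # occurrences in the window, otherwise 'first'.
    updated = "second" if list(window).count("second") >= threshold else "first"
    window[-1] = updated
    return updated
```

Under this literal rule a long run of 'second' predictions can be held back by a 'first'-dominated history, which is exactly the smoothing effect the hangover is meant to provide.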
Besides the foregoing implementations, the encoding side may also use other methods to obtain the sound field classification result of the current frame, and the method for determining the initial encoding scheme based on the sound field classification result may also use other methods, which is not limited in this embodiment of the present application.
In the embodiment of the present application, after the encoding end determines the initial coding scheme of the current frame, if the initial coding scheme of the current frame is the same as the initial coding scheme of the previous frame of the current frame, the encoding end determines that the coding scheme of the current frame is the initial coding scheme of the current frame. And if the initial coding scheme of the current frame is different from the initial coding scheme of the previous frame of the current frame, the coding end determines that the coding scheme of the current frame is the third coding scheme. That is, if the initial coding scheme of the current frame is the same as the initial coding scheme of the previous frame of the current frame and is the first coding scheme, the encoding end determines that the coding scheme of the current frame is the first coding scheme. And if the initial coding scheme of the current frame is the same as the initial coding scheme of the previous frame of the current frame and is the second coding scheme, the coding end determines that the coding scheme of the current frame is the second coding scheme. If one of the initial coding scheme of the current frame and the initial coding scheme of the previous frame of the current frame is the first coding scheme and the other one is the second coding scheme, the coding end determines that the coding scheme of the current frame is the third coding scheme. 
One of the initial coding scheme of the current frame and the initial coding scheme of the previous frame being the first coding scheme and the other being the second coding scheme means that either the initial coding scheme of the current frame is the first coding scheme and that of the previous frame is the second coding scheme, or the initial coding scheme of the current frame is the second coding scheme and that of the previous frame is the first coding scheme. That is, for a switching frame, the encoding end encodes the HOA signal of the switching frame using neither the first encoding scheme nor the second encoding scheme, but the switching frame encoding scheme. For a non-switching frame, the encoding end encodes the HOA signal of the non-switching frame using an encoding scheme consistent with the initial encoding scheme of that frame. An audio frame whose initial coding scheme differs from that of its previous frame is a switching frame, and an audio frame whose initial coding scheme is the same as that of its previous frame is a non-switching frame.
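The scheme-selection decision above reduces to a small function. Treating the very first frame (no previous frame) as keeping its own initial scheme is an assumption not stated in the text, and the scheme names are illustrative strings.

```python
def select_scheme(initial_current, initial_previous):
    """Select the actual coding scheme of the current frame, per the text:
    same initial scheme as the previous frame -> use it; differing initial
    schemes -> the third (switching frame / hybrid) scheme.

    initial_previous is None for the first frame (assumed behavior)."""
    if initial_previous is None or initial_current == initial_previous:
        return initial_current
    return "third"
```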
It should be noted that, in addition to determining the encoding scheme of the current frame, the encoding end needs to encode information capable of indicating the encoding scheme of the current frame into the code stream, so that the decoding end determines which decoding scheme is used to decode the code stream of the current frame. In the embodiment of the present application, there are various implementation manners in which the encoding end encodes information capable of indicating the encoding scheme of the current frame into the code stream, and three implementation manners are described next.
First implementation: encoding the switching flag and the indication information of the two coding schemes
In this implementation, the encoding end needs to determine the value of the switching flag of the current frame, and the value of the switching flag of the current frame is encoded into the code stream. When the coding scheme of the current frame is the first coding scheme or the second coding scheme, the value of the switching flag of the current frame is the first value. And when the coding scheme of the current frame is the third coding scheme, the value of the switching flag of the current frame is a second value. Alternatively, the first value is "0", the second value is "1", and the first value and the second value may be other values.
In addition, the encoding end encodes the indication information of the initial encoding scheme of the current frame into the code stream. Or, if the value of the switching flag of the current frame is a first value, the encoding end encodes the indication information of the initial coding scheme of the current frame into the code stream, and if the value of the switching flag of the current frame is a second value, the encoding end encodes the preset indication information into the code stream.
Alternatively, the indication information of the initial coding scheme is represented in a coding mode (coding mode) corresponding to the initial coding scheme, that is, the coding mode is taken as the indication information. For example, the encoding mode corresponding to the initial encoding scheme is an initial encoding mode, which is a first encoding mode (i.e., dirAC mode) or a second encoding mode (i.e., MP mode). Optionally, the preset indication information is a preset coding mode, and the preset coding mode is a first coding mode or a second coding mode. In some other embodiments, the preset indication information is other coding modes, that is, what is specifically the indication information does not limit the coding scheme of the switching frame coded into the code stream.
That is, in the first implementation manner, the coding end indicates the switch frame by using the switch flag, and may not limit the indication information of the coding scheme of the switch frame encoded in the code stream, and the indication information of the coding scheme of the switch frame may be an initial coding mode, may also be a preset coding mode, may also be randomly selected from the first coding mode and the second coding mode, and may also be other indication information. It should be noted that, in this implementation manner, the switching flag is used to indicate whether the current frame is a switching frame, so that the decoding end can determine whether the current frame is a switching frame by directly acquiring the switching flag in the code stream.
Optionally, in the first implementation manner, the switching flag of the current frame and the indication information of the initial coding scheme each occupy one bit of the code stream. Illustratively, the value of the switching flag of the current frame is "0" or "1", wherein the value of the switching flag is "0" indicates that the current frame is not a switching frame, i.e., the value of the switching flag of the current frame is the first value. A switching flag of "1" indicates that the current frame is a switching frame, i.e., the value of the switching flag of the current frame is a second value. Alternatively, the indication information of the initial coding scheme is "0" or "1", where "0" denotes DirAC mode (i.e., dirAC coding scheme) and "1" denotes MP mode (i.e., MP-based coding scheme).
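The first signalling implementation can be sketched as follows: a 1-bit switching flag (0 = non-switching frame, 1 = switching frame) followed by a 1-bit coding-mode indicator (0 = DirAC mode, 1 = MP mode), with a preset mode value written for switching frames as one of the options described above. Actual bit packing into a stream is omitted; the function names and the choice of preset value are illustrative.

```python
def write_first_impl_bits(scheme, initial_mode, preset_mode_bit=0):
    """Return (switch_flag_bit, mode_bit) for the current frame.

    scheme: 'first', 'second', or 'third' (the actual coding scheme).
    initial_mode: 'DirAC' or 'MP' (the initial coding mode).
    For a switching frame (third scheme) the mode bit carries a preset
    value, since the decoder relies on the switch flag alone."""
    if scheme == "third":
        return (1, preset_mode_bit)
    mode_bit = 1 if initial_mode == "MP" else 0
    return (0, mode_bit)
```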
In some other embodiments, if the initial coding scheme of the current frame is different from the initial coding scheme of the previous frame of the current frame, the coding end determines that the value of the switching flag of the current frame is the second value, and codes the value of the switching flag of the current frame into the code stream. That is, for the switch frame, since the switch flag in the code stream can indicate the switch frame, the indication information of the coding scheme of the coded switch frame is not needed.
Second implementation: encoding the indication information of the two encoding schemes
In this implementation, the encoding end encodes the indication information of the initial encoding scheme of the current frame into the code stream. Taking the encoding mode as the indication information, the indication information of the encoded code stream is substantially the encoding mode consistent with the initial encoding scheme, i.e. the initial encoding mode, and the initial encoding mode is the first encoding mode or the second encoding mode. In addition, the encoding end may not encode the switch flag.
Optionally, in this second implementation manner, the indication information of the initial coding scheme occupies one bit of the code stream. Illustratively, taking the coding mode as the indication information, the coding mode in the code stream is "0" or "1", where "0" represents DirAC mode and indicates that the initial coding scheme of the current frame is the first coding scheme, and "1" represents MP mode and indicates that the initial coding scheme of the current frame is the second coding scheme.
Third implementation: encoding the indication information of the three encoding schemes
In this implementation, the encoding end encodes the indication information of the encoding scheme of the current frame into the code stream. Taking the encoding mode as the indication information, the indication information of the encoded code stream is substantially the encoding mode consistent with the encoding scheme of the current frame, and the encoding mode consistent with the encoding scheme of the current frame is the actual encoding mode, i.e. the first encoding mode, the second encoding mode or the third encoding mode. Optionally, the third encoding mode is an MP-W mode.
Optionally, in this third implementation manner, the indication information of the coding scheme of the current frame occupies two bits of the code stream. Illustratively, the indication information of the coding scheme of the current frame is "00", "01", or "10". Wherein "00" indicates that the coding scheme of the current frame is the first coding scheme, "01" indicates that the coding scheme of the current frame is the second coding scheme, and "10" indicates that the coding scheme of the current frame is the third coding scheme.
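The two-bit signalling of the third implementation is a simple bidirectional mapping, following the bit patterns given in the text ("00", "01", "10"); the function names are illustrative.

```python
# Two-bit field identifying the coding scheme of the current frame directly.
SCHEME_TO_BITS = {"first": "00", "second": "01", "third": "10"}
BITS_TO_SCHEME = {bits: scheme for scheme, bits in SCHEME_TO_BITS.items()}

def encode_scheme_bits(scheme):
    """Encoder side: write the two-bit indicator for the frame's scheme."""
    return SCHEME_TO_BITS[scheme]

def decode_scheme_bits(bits):
    """Decoder side: recover the scheme from the two-bit indicator."""
    return BITS_TO_SCHEME[bits]
```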
As can be seen from the above, in the first implementation manner, after the encoding end determines the initial encoding scheme of the current frame, the value of the switching flag is determined, and the value of the switching flag is encoded into the code stream. In addition, the indication information of the initial coding scheme of the current frame is coded into the code stream, or if the current frame is a switching frame, the coding end codes preset indication information into the code stream, and if the current frame is a non-switching frame, the coding end codes the indication information of the initial coding scheme of the current frame into the code stream. In the second implementation manner, after the encoding end determines the initial encoding scheme of the current frame, the indication information of the initial encoding scheme of the current frame is directly encoded into the code stream. In the third implementation manner, after the encoding end determines the initial encoding scheme of the current frame, the encoding end determines the encoding scheme of the current frame based on the initial encoding scheme of the current frame and the initial encoding scheme of the previous frame of the current frame, and encodes the indication information of the encoding scheme of the current frame into the code stream.
Step 602: if the encoding scheme of the current frame is the third encoding scheme, encoding a signal of a designated channel in the HOA signal into a code stream, wherein the designated channel is a part of channels in all the channels of the HOA signal.
In the embodiment of the present application, if the coding scheme of the current frame is the third coding scheme, which indicates that the current frame is a handover frame, the coding end codes the HOA signal of the current frame according to the third coding scheme (i.e. a hybrid coding scheme). Corresponding to the first implementation manner in step 601, if the value of the switch flag of the current frame is the second value, it indicates that the current frame is the switch frame. Corresponding to the second implementation manner in step 601, if the initial coding scheme of the current frame is different from the initial coding scheme of the previous frame of the current frame, it indicates that the current frame is a switching frame. Corresponding to the third implementation manner in step 601, if the coding scheme of the current frame is the third coding scheme, the coding scheme of the current frame indicates that the current frame is the switch frame. For the handover frame, the encoding end employs a third encoding scheme to encode the HOA signal of the current frame. And the third coding scheme indicates that a signal of a specified channel in the HOA signal of the current frame is coded into the code stream, wherein the specified channel is part of all channels of the HOA signal. That is, for the switching frame, the encoding end encodes the signal of the specified channel in the HOA signal of the switching frame into the code stream, instead of encoding the switching frame by using the first encoding scheme or the second encoding scheme, that is, the switching frame is encoded by using a compromise method for smooth transition of the hearing quality during switching of the encoding scheme.
Optionally, the designated channel is consistent with a transmission channel preset in the first coding scheme, that is, the designated channel is a preset channel. That is, on the premise that the third coding scheme is different from the second coding scheme, in order to make the coding effects of the third coding scheme and the second coding scheme approach each other, the coding end codes a signal of a channel, which is the same as a transmission channel preset in the first coding scheme, in the HOA signal of the handover frame into the code stream, so that the auditory quality is transited smoothly as much as possible. It should be noted that different transmission channels may be respectively preset according to different coding bandwidths and code rates, even different application scenarios. Optionally, the preset transmission channels may also be the same in different coding bandwidths, code rates, or application scenarios.
Optionally, the signal of the designated channel includes an FOA signal, which includes an omnidirectional W signal and directional X, Y, and Z signals. That is, the designated channel includes the FOA channels, and the signals of the FOA channels are low-order signals. In other words, if the current frame is a switching frame, the encoding end encodes the low-order portion of the HOA signal of the current frame into the code stream, where the low-order portion includes the W, X, Y, and Z signals of the FOA channels.
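Keeping only the designated FOA channels of a higher-order frame is a simple selection. The assumption that the FOA channels are the first four entries of the channel list is illustrative; the actual positions depend on the HOA channel-ordering convention in use (ACN ordering, for instance, places them as W, Y, Z, X).

```python
def extract_specified_channels(hoa_channels, num_foa=4):
    """Keep only the designated (FOA) channels of an HOA frame.

    hoa_channels: list of per-channel signal-point lists; the first
    `num_foa` entries are assumed to be the FOA channels."""
    return hoa_channels[:num_foa]

# A 3rd-order HOA frame has (3+1)^2 = 16 channels; the switching-frame
# scheme transmits only the 4 designated channels.
frame = [[0.0] * 8 for _ in range(16)]
kept = extract_specified_channels(frame)
```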
It should be noted that, in the embodiment of the present application, there are many implementation manners in which the encoding end encodes the signal of the specified channel in the HOA signal into the code stream. Some of them are described next.
In the embodiment of the present application, if the specified channel includes the FOA channels, the encoding end determines a virtual speaker signal and residual signals based on the W signal, the X signal, the Y signal, and the Z signal, and encodes the virtual speaker signal and the residual signals into the code stream.
Optionally, the encoding end determines the W signal as one virtual speaker signal, and determines three residual signals based on the W signal, the X signal, the Y signal, and the Z signal, or determines the X signal, the Y signal, and the Z signal as the three residual signals. Optionally, the encoding end determines the difference signals between any three of the W signal, the X signal, the Y signal, and the Z signal and the remaining one signal as the three residual signals. For example, the encoding end determines the difference signals between the X signal, the Y signal, and the Z signal, respectively, and the W signal as the three residual signals. Illustratively, the encoding end takes the difference signals X', Y', and Z' obtained by X-W, Y-W, and Z-W, respectively, as the three residual signals.
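The X-W, Y-W, Z-W example above can be sketched as follows. This is a minimal sketch in which signals are modeled as equal-length sample lists; the function name is illustrative.

```python
def foa_to_residuals(w, x, y, z):
    """Compute the three residual signals X' = X - W, Y' = Y - W,
    Z' = Z - W sample by sample, as in the example in the text."""
    x_res = [xi - wi for xi, wi in zip(x, w)]
    y_res = [yi - wi for yi, wi in zip(y, w)]
    z_res = [zi - wi for zi, wi in zip(z, w)]
    return x_res, y_res, z_res
```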
If the core encoder used by the encoding end to encode the current frame is a stereo encoder, then because the determined virtual speaker signal and the three residual signals are all mono signals, the encoding end needs to combine these mono signals into stereo signals before encoding with the stereo encoder. Optionally, the encoding end combines the virtual speaker signal with a first preset mono signal to obtain one stereo signal, and combines the three residual signals with a second preset mono signal to obtain two stereo signals. The encoding end then encodes the resulting three stereo signals into the code stream through the stereo encoder.
The embodiment of the present application does not limit the specific manner in which the encoding end combines the three residual signals with one preset mono signal to obtain two stereo signals. Optionally, the encoding end combines the two residual signals with the highest correlation among the three residual signals to obtain one of the two stereo signals, and combines the remaining residual signal with the second preset mono signal to obtain the other stereo signal. That is, the encoding end combines the stereo signals according to the correlation of the signals. In some other embodiments, the encoding end may also combine any two of the three residual signals to obtain one of the two stereo signals, and combine the remaining residual signal with the second preset mono signal to obtain the other stereo signal.
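The correlation-based pairing can be sketched as follows. This is a sketch under stated assumptions: signals are sample lists, and correlation is measured with a plain inner product, an illustrative choice, since the text does not fix the correlation measure.

```python
from itertools import combinations

def pair_residuals(residuals, preset_mono):
    """Pair the two most-correlated of three residual signals into one
    stereo signal, and pair the remaining residual with a preset mono
    signal. Inner-product correlation is an illustrative stand-in."""
    def corr(a, b):
        return abs(sum(ai * bi for ai, bi in zip(a, b)))
    # pick the index pair (i, j) with the highest correlation
    i, j = max(combinations(range(3), 2),
               key=lambda p: corr(residuals[p[0]], residuals[p[1]]))
    k = ({0, 1, 2} - {i, j}).pop()      # the remaining residual
    stereo_1 = (residuals[i], residuals[j])
    stereo_2 = (residuals[k], preset_mono)
    return stereo_1, stereo_2
```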
Optionally, in the embodiment of the present application, the first preset mono signal is an all-zero signal or an all-one signal, and the second preset mono signal is an all-zero signal or an all-one signal. Optionally, the first preset mono signal is the same as or different from the second preset mono signal; that is, both are all-zero signals, both are all-one signals, the first is an all-zero signal and the second is an all-one signal, or the first is an all-one signal and the second is an all-zero signal. If the HOA signal is a time-domain signal, the all-zero signal is a signal whose sample values are all zero, and the all-one signal is a signal whose sample values are all one. If the HOA signal is a frequency-domain signal, the all-zero signal is a signal whose frequency-bin values are all zero, and the all-one signal is a signal whose frequency-bin values are all one. In other embodiments, the first preset mono signal and/or the second preset mono signal may also be preset signals in other forms.
If the core encoder used by the encoding end is a mono encoder, the encoding end encodes the virtual speaker signal and each of the three residual signals into the code stream through the mono encoder.
Fig. 7 is a schematic diagram of a switching frame coding scheme according to an embodiment of the present application. Referring to fig. 7, the current frame to be encoded is a switching frame. The encoding end obtains the HOA signal of the current frame, uses the W signal in the HOA signal as the virtual speaker signal, and determines the residual signals according to the FOA signal in the HOA signal, for example, according to the X signal, the Y signal, and the Z signal, or according to the W signal together with the X signal, the Y signal, and the Z signal. The encoding end encodes the determined virtual speaker signal and residual signals into the code stream through the core encoder to obtain the code stream of the switching frame.
Optionally, in other embodiments, the encoding end determines two of the W signal, the X signal, the Y signal, and the Z signal as two virtual speaker signals, and determines the remaining two signals as two residual signals. The encoding end combines the two virtual speaker signals to obtain one stereo signal, and combines the two residual signals to obtain the other stereo signal. The encoding end then encodes the two resulting stereo signals into the code stream through the stereo encoder.
The embodiment of the present application does not limit the specific manner in which the encoding end combines the W signal, the X signal, the Y signal, and the Z signal two by two to obtain the stereo signals. Optionally, the encoding end determines the W signal as one virtual speaker signal and determines the one of the X signal, the Y signal, and the Z signal with the highest correlation with the W signal as the other virtual speaker signal; that is, the W signal is combined with the signal most correlated with it, and the remaining two signals are combined with each other. Alternatively, the encoding end combines any two of the W signal, the X signal, the Y signal, and the Z signal to obtain one stereo signal, and combines the remaining two signals to obtain the other stereo signal.
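The W-plus-most-correlated pairing can be sketched in the same spirit, again with an inner-product correlation as an illustrative stand-in for whatever correlation measure an implementation actually uses.

```python
def pair_foa_two_by_two(w, x, y, z):
    """Pair W with whichever of X, Y, Z is most correlated with it
    (the two virtual speaker signals), and pair the remaining two
    signals (the two residual signals)."""
    def corr(a, b):
        return abs(sum(ai * bi for ai, bi in zip(a, b)))
    named = {"X": x, "Y": y, "Z": z}
    best = max(named, key=lambda name: corr(w, named[name]))
    rest = [named[name] for name in ("X", "Y", "Z") if name != best]
    return (w, named[best]), (rest[0], rest[1])
```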
It should be noted that the embodiment of the present application does not limit the specific manner in which the encoding end encodes the virtual speaker signal and the residual signals through the core encoder; for example, the number of coding bits allocated to the virtual speaker signal and to the residual signals is not limited.
The above describes the process of encoding the current frame when it is a switching frame, that is, the encoding end encodes the signal of the specified channel in the HOA signal of the switching frame into the code stream according to the third coding scheme, which is the switching frame coding scheme. As can be seen from the above, in the embodiment of the present application, the signal of the specified channel may include the W signal, which is the core signal of the HOA signal; therefore, the switching frame coding scheme may also be referred to as an MP-W based coding scheme. Next, the process of encoding the current frame when it is a non-switching frame is described.
In the embodiment of the present application, if the coding scheme of the current frame is the first coding scheme, the encoding end encodes the HOA signal of the current frame into the code stream according to the first coding scheme; if the coding scheme of the current frame is the second coding scheme, the encoding end encodes the HOA signal of the current frame into the code stream according to the second coding scheme. That is, if the current frame is not a switching frame, the encoding end encodes it using the initial coding scheme of the current frame.
Exemplarily, referring to fig. 8, the encoding end encodes the HOA signal of the current frame into the code stream according to the second coding scheme as follows: the encoding end selects a target virtual speaker matching the HOA signal of the current frame from a virtual speaker set based on the MP algorithm, determines a virtual speaker signal through an MP-based spatial encoder based on the HOA signal and the target virtual speaker, determines a residual signal through the MP-based spatial encoder based on the HOA signal and the virtual speaker signal, and encodes the virtual speaker signal and the residual signal into the code stream through the core encoder. It should be noted that the principle and specific manner of determining the virtual speaker signal and the residual signal differ between the MP-based HOA coding scheme and the switching frame coding scheme, so the virtual speaker signal and the residual signal determined by the two schemes also differ. For the same frame, the code stream encoded with the MP-based HOA coding scheme carries more effective information than that encoded with the switching frame coding scheme. On the premise that the switching frame coding scheme is different from the second coding scheme, in order to make the coding effect of the switching frame coding scheme approach that of the second coding scheme, the switching frame coding scheme also encodes a virtual speaker signal and residual signals into the code stream, so that the auditory quality transitions as smoothly as possible.
The encoding end encodes the HOA signal of the current frame into the code stream according to the first coding scheme as follows: the encoding end extracts a core layer signal and spatial parameters from the HOA signal of the current frame and encodes them into the code stream. For example, referring to fig. 9, the encoding end extracts the core layer signal from the HOA signal of the current frame through the core coding signal acquisition module, extracts the spatial parameters from the HOA signal through the DirAC-based spatial parameter extraction module, encodes the core layer signal into the code stream through the core encoder, and encodes the spatial parameters into the code stream through the spatial parameter encoder. The channel corresponding to the core layer signal is consistent with the specified channel in this scheme. In addition, the first coding scheme encodes not only the core layer signal but also the extracted spatial parameters into the code stream, and the spatial parameters contain rich scene information, such as direction information. It can be seen that, for the same frame, the code stream encoded with the DirAC-based HOA coding scheme carries more effective information than that encoded with the switching frame coding scheme. On the premise that the switching frame coding scheme is different from the first coding scheme, in order to make the coding effect of the switching frame coding scheme approach that of the first coding scheme, the switching frame coding scheme also encodes the signal of the transmission channel preset in the first coding scheme into the code stream, but does not encode more of the HOA signal beyond the signal of the specified channel; that is, it neither extracts the spatial parameters nor encodes them into the code stream, so that the auditory quality transitions as smoothly as possible.
Fig. 10 is a flowchart of another encoding method provided in the embodiment of the present application. Referring to fig. 10, the encoding method provided in the embodiment of the present application is explained again by taking the example of encoding the indication information of the initial encoding scheme of the current frame into the code stream. The encoding end firstly acquires the HOA signal of the current frame to be encoded. Then, the encoding end analyzes the sound field type of the HOA signal to determine the initial encoding scheme of the current frame, and the encoding end encodes the indication information of the initial encoding scheme of the current frame into a code stream. The encoding end judges whether the initial encoding scheme of the current frame is the same as the initial encoding scheme of the previous frame. If the initial coding scheme of the current frame is the same as the initial coding scheme of the previous frame, the coding end adopts the initial coding scheme of the current frame to code the HOA signal of the current frame so as to obtain the code stream of the current frame. If the initial coding scheme of the current frame is different from the initial coding scheme of the previous frame, the coding end adopts a switching frame coding scheme to code the HOA signal of the current frame so as to obtain the code stream of the current frame.
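The per-frame decision in the flow above can be sketched as follows; the scheme labels are illustrative stand-ins for the first coding scheme, the second coding scheme, and the switching frame coding scheme, and `None` stands for "no previous frame" (the first audio frame case described below).

```python
def choose_coding_scheme(initial_scheme, prev_initial_scheme):
    """Return the scheme actually used to encode the current frame:
    the frame's initial scheme when it matches the previous frame's
    initial scheme (or when there is no previous frame), otherwise
    the switching frame scheme."""
    if prev_initial_scheme is None or initial_scheme == prev_initial_scheme:
        return initial_scheme
    return "switching"
```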
It should be noted that, if the current frame is the first audio frame to be encoded, the initial encoding scheme of the current frame is the first encoding scheme or the second encoding scheme, and the encoding end encodes the HOA signal of the current frame into the code stream by using the initial encoding scheme of the current frame.
In summary, in the embodiment of the present application, the HOA signals of audio frames are encoded and decoded by combining two schemes (i.e., the coding and decoding scheme based on virtual speaker selection and the coding and decoding scheme based on directional audio coding); that is, a suitable coding and decoding scheme is selected for each audio frame, which improves the compression rate of the audio signal. Meanwhile, to make the auditory quality transition smoothly when switching between the two schemes, some audio frames are encoded and decoded with a new scheme rather than either of the two schemes directly: the signal of the specified channel in the HOA signal of such an audio frame is encoded into the code stream. This compromise scheme allows the auditory quality of the rendered and played HOA signal restored by decoding to transition smoothly.
Fig. 11 is a flowchart of a decoding method provided in an embodiment of the present application, where the method is applied to a decoding end. Note that this decoding method corresponds to the encoding method shown in fig. 6. Referring to fig. 11, the method includes the following steps.
Step 1101: and obtaining a decoding scheme of the current frame based on the code stream.
Wherein the decoding scheme of the current frame is one of a first decoding scheme, a second decoding scheme, and a third decoding scheme. The first decoding scheme is a DirAC-based HOA decoding scheme, the second decoding scheme is a virtual speaker selection-based HOA decoding scheme, and the third decoding scheme is a hybrid decoding scheme. Alternatively, the hybrid decoding scheme is also referred to as a switching frame decoding scheme.
It should be noted that, since the encoding end encodes different audio frames by using different encoding schemes, the decoding end also needs to decode each audio frame by using a corresponding decoding scheme.
Next, how the decoding end determines the decoding scheme of the current frame is described. As can be seen from the foregoing, step 601 of the encoding method shown in fig. 6 introduces three implementation manners in which the encoding end encodes, into the code stream, information capable of indicating the coding scheme of the current frame; accordingly, the decoding end determines the decoding scheme of the current frame in three corresponding manners, which are described next.
First implementation: the switching flag and indication information of one of two schemes are encoded
The decoding end firstly analyzes the value of the switching mark of the current frame from the code stream. If the value of the switching mark is the first value, the decoding end analyzes the indication information of the decoding scheme of the current frame from the code stream, and the indication information is used for indicating that the decoding scheme of the current frame is the first decoding scheme or the second decoding scheme. If the value of the switching flag is the second value, the decoding end determines that the decoding scheme of the current frame is the third decoding scheme. It should be noted that the indication information of the coding scheme coded into the code stream by the coding end is the indication information of the decoding scheme parsed from the code stream by the decoding end.
In other words, if the decoding end parses the value of the switching flag of the current frame as the first value, the current frame is a non-switching frame; the decoding end parses the indication information of the decoding scheme from the code stream and determines the decoding scheme of the current frame based on the indication information. If the decoding end parses the value of the switching flag as the second value, the current frame is a switching frame, and the decoding end does not need to parse indication information of the decoding scheme from the code stream.
It should be noted that, if the value of the switching flag is the second value, the decoding end determines that the decoding scheme of the current frame is the switching frame decoding scheme, and the current frame is the switching frame, where the switching frame decoding scheme is a decoding scheme different from the first decoding scheme and the second decoding scheme, and the switching frame decoding scheme is for smooth transition of auditory quality.
Optionally, in the first implementation manner, the indication information of the decoding scheme and the switch flag each occupy one bit of the code stream. Illustratively, the decoding end analyzes the value of the switching flag of the current frame from the code stream, if the analyzed value of the switching flag is "0", that is, the value of the switching flag is a first value, the decoding end analyzes the indication information of the decoding scheme of the current frame from the code stream, and if the analyzed indication information is "0", the decoding end determines that the decoding scheme of the current frame is the first decoding scheme. And if the analyzed indication information is '1', the decoding end determines that the decoding scheme of the current frame is the second decoding scheme. If the analyzed switching flag has a value of "1", the decoding end determines that the decoding scheme of the current frame is a switching frame decoding scheme (third decoding scheme).
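A sketch of this parsing logic, assuming the bit values of the example above (switching flag "1" means switching frame; scheme indicator "0" means first scheme and "1" means second scheme); the function name and the representation of the bit source are illustrative.

```python
def parse_scheme_impl1(bits):
    """First implementation: read a 1-bit switching flag; only when the
    flag is 0 read a further 1-bit scheme indicator. `bits` is any
    iterable of 0/1 integers, in stream order."""
    it = iter(bits)
    if next(it) == 1:                 # switching flag == second value
        return "third"                # switching frame decoding scheme
    return "first" if next(it) == 0 else "second"
```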
Second implementation: indication information of one of two schemes is encoded
And the decoding end analyzes the initial decoding scheme of the current frame from the code stream, wherein the initial decoding scheme is a first decoding scheme or a second decoding scheme. And if the initial decoding scheme of the current frame is the same as the initial decoding scheme of the previous frame of the current frame, determining that the decoding scheme of the current frame is the initial decoding scheme of the current frame. And if the initial decoding scheme of the current frame is different from the initial decoding scheme of the previous frame of the current frame, determining that the decoding scheme of the current frame is a third decoding scheme, namely a hybrid decoding scheme. The difference between the initial decoding scheme of the current frame and the initial decoding scheme of the previous frame of the current frame means that the initial decoding scheme of the current frame is the first decoding scheme and the initial decoding scheme of the previous frame of the current frame is the second decoding scheme, or the initial decoding scheme of the current frame is the second decoding scheme and the initial decoding scheme of the previous frame of the current frame is the first decoding scheme. That is, one of the initial decoding scheme of the current frame and the initial decoding scheme of the previous frame of the current frame is the first decoding scheme, and the other is the second decoding scheme.
Optionally, in the second implementation manner, the indication information used for indicating the initial coding scheme occupies one bit of the code stream, and taking the coding mode as the indication information, the coding mode in the code stream occupies one bit. Illustratively, the decoding end parses the indication information of the initial coding scheme of the current frame from the code stream, and if the parsed indication information is "0" and the indication information of the previous frame of the current frame is also "0", the decoding end determines that the decoding scheme of the current frame is the first decoding scheme. If the analyzed indication information is "1" and the indication information of the previous frame of the current frame is also "1", the decoding end determines that the decoding scheme of the current frame is the second decoding scheme. If the analyzed indication information is '0' and the indication information of the previous frame of the current frame is '1', or the analyzed indication information is '1' and the indication information of the previous frame of the current frame is '0', the decoding end determines that the decoding scheme of the current frame is the switching frame decoding scheme.
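The comparison against the buffered indicator of the previous frame can be sketched as follows, with the 0/1 mapping of the example above; `None` stands for "no buffered previous frame", an illustrative convention.

```python
def parse_scheme_impl2(indicator, prev_indicator):
    """Second implementation: only the 1-bit initial-scheme indicator is
    in the stream; the decoder compares it with the buffered indicator
    of the previous frame, and a mismatch means a switching frame."""
    if prev_indicator is not None and indicator != prev_indicator:
        return "third"                # switching frame decoding scheme
    return "first" if indicator == 0 else "second"
```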
Optionally, the indication information of the initial decoding scheme of the frame previous to the current frame is buffered data. When the current frame is decoded, the decoding end can obtain the indication information of the initial decoding scheme of the previous frame of the current frame from the buffer.
Third implementation: indication information of one of three schemes is encoded
And the decoding end analyzes the indication information of the decoding scheme of the current frame from the code stream, and the indication information is used for indicating that the decoding scheme of the current frame is the first decoding scheme, the second decoding scheme or the third decoding scheme.
Optionally, in the third implementation manner, the indication information of the decoding scheme occupies two bits of the code stream. For example, taking the coding mode as the indication information, the coding mode of the current frame occupies two bits of the code stream.
Illustratively, the decoding end parses the indication information of the decoding scheme of the current frame from the code stream, and if the parsed indication information is "00", the decoding end determines that the decoding scheme of the current frame is the first decoding scheme. If the analyzed indication information is '01', the decoding end determines that the decoding scheme of the current frame is the second decoding scheme. If the analyzed indication information is '10', the decoding end determines that the decoding scheme of the current frame is a switching frame decoding scheme.
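A sketch of this 2-bit mapping; "11" is unused in the example, and how an implementation treats an unused value is an assumption here.

```python
def parse_scheme_impl3(two_bits):
    """Third implementation: a single 2-bit field selects the decoding
    scheme directly, following the '00'/'01'/'10' example above."""
    mapping = {"00": "first", "01": "second", "10": "third"}
    if two_bits not in mapping:
        raise ValueError("unused indication value: " + two_bits)
    return mapping[two_bits]
```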
Step 1102: and if the decoding scheme of the current frame is the third decoding scheme, determining a signal of a designated channel in the HOA signal of the current frame based on the code stream, wherein the designated channel is a part of channels in all the channels of the HOA signal.
In the embodiment of the present application, after the decoding end obtains the decoding scheme of the current frame, if the decoding scheme of the current frame is the third decoding scheme, the current frame is a switching frame, and the decoding end determines the signal of the specified channel in the HOA signal of the current frame based on the code stream. That is, for a switching frame, the encoding end encodes the signal of the specified channel into the code stream, and the decoding end decodes the switching frame with the switching frame decoding scheme, that is, parses the signal of the specified channel from the code stream.
Next, the implementation process in which the decoding end decodes the switching frame with the switching frame decoding scheme is described in detail, that is, the process in which the decoding end determines the signal of the specified channel in the HOA signal of the current frame based on the code stream when the current frame is a switching frame.
It should be noted that the process in which the decoding end determines the signal of the specified channel in the HOA signal of the current frame based on the code stream is symmetric to the process in which the encoding end encodes that signal into the code stream. The foregoing embodiment of the encoding method describes several implementation processes for encoding the signal of the specified channel into the code stream; the decoding processes symmetric to them are described below.
In this embodiment of the application, if the encoding end determines the virtual speaker signal and the residual signal based on the signal of the specified channel, and then encodes the virtual speaker signal and the residual signal into the code stream, correspondingly, the decoding end determines the virtual speaker signal and the residual signal based on the code stream, and then determines the signal of the specified channel based on the virtual speaker signal and the residual signal.
Optionally, if the encoding end encodes the three paths of stereo signals obtained by combining the virtual speaker signals and the residual signals into a code stream through a stereo encoder, the decoding end decodes the code stream through a stereo decoder to obtain the three paths of stereo signals, and then determines one path of virtual speaker signals and three paths of residual signals based on the three paths of stereo signals. Optionally, the decoding end determines a path of virtual speaker signal based on one path of stereo signal in the three paths of stereo signals, and determines three paths of residual signals based on another path of stereo signal in the three paths of stereo signals. That is, the decoding end firstly analyzes the three paths of stereo signals from the code stream, and then obtains a path of virtual speaker signals and three paths of residual signals by disassembling the three paths of stereo signals.
Illustratively, the decoding end analyzes three paths of stereo signals from the code stream to obtain S1, S2 and S3 respectively, where S1 is obtained by combining one path of virtual speaker signal and one path of preset mono signal, S2 is obtained by combining two paths of residual signals, and S3 is obtained by combining the remaining one path of residual signal and one path of preset mono signal. And the decoding end disassembles the S1 to obtain a path of virtual loudspeaker signals, disassembles the S2 to obtain two paths of residual signals, and disassembles the S3 to obtain the remaining path of residual signals.
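The disassembly of S1, S2, and S3 in this example can be sketched as follows; which side of each stereo pair carries the useful signal is an assumption for illustration and must match the encoder's combination order.

```python
def split_stereo_signals(s1, s2, s3):
    """Disassemble the three decoded stereo signals:
    S1 = (virtual speaker signal, preset mono signal),
    S2 = (residual, residual),
    S3 = (remaining residual, preset mono signal).
    Each argument is a (left, right) pair of sample lists."""
    virtual_speaker = s1[0]
    residuals = (s2[0], s2[1], s3[0])
    return virtual_speaker, residuals
```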
Optionally, if the encoding end encodes four paths of monaural signals determined based on the virtual speaker signals and the residual signals into the code stream through a monaural encoder, the decoding end decodes the code stream through a monaural decoder to obtain a path of virtual speaker signals and three paths of residual signals, where the four paths of monaural signals include the path of virtual speaker signals and the three paths of residual signals.
Optionally, if the signal of the specified channel includes an FOA signal, and the FOA signal includes an omnidirectional W signal and directional X, Y, and Z signals, then after determining the virtual speaker signal and the residual signals based on the code stream, the decoding end determines the W signal based on the virtual speaker signal, and determines the X signal, the Y signal, and the Z signal based on the residual signals and the W signal, or based on the residual signals alone. For example, when the decoding end parses out three residual signals, the sum of each residual signal and the W signal is determined as the X signal, the Y signal, and the Z signal, or each residual signal is directly determined as the X signal, the Y signal, and the Z signal. If the encoding end determined the difference signals between the X, Y, and Z signals and the W signal as the three residual signals, the decoding end determines the sums of the three residual signals and the W signal as the X, Y, and Z signals. If the encoding end determined the X, Y, and Z signals themselves as the three residual signals, the decoding end determines the three residual signals as the X, Y, and Z signals, respectively. That is, the decoding process at the decoding end matches the encoding process at the encoding end.
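The two reconstruction variants can be sketched as follows, with signals as sample lists; the flag name is illustrative, and its value must match the variant the encoder used.

```python
def residuals_to_xyz(w, residuals, residuals_are_differences=True):
    """Reconstruct the X, Y, Z signals from the three residual signals:
    add W back sample by sample when the encoder sent X-W, Y-W, Z-W,
    or pass the residuals through when the encoder sent X, Y, Z
    directly."""
    if not residuals_are_differences:
        return tuple(list(r) for r in residuals)
    return tuple([ri + wi for ri, wi in zip(r, w)] for r in residuals)
```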
If the encoding end encoded two stereo signals determined based on the virtual speaker signals and the residual signals into the code stream through the stereo encoder, the decoding end decodes the code stream through the stereo decoder to obtain the two stereo signals. The decoding end determines two virtual speaker signals based on one of the two stereo signals and two residual signals based on the other, where the two virtual speaker signals and the two residual signals together comprise the W signal, the X signal, the Y signal, and the Z signal. Optionally, if the encoding end determined the W signal and the one of the X, Y, and Z signals with the highest correlation with the W signal as the two virtual speaker signals, the two virtual speaker signals determined by the decoding end include the W signal and that most-correlated signal. For example, if the X signal has the highest correlation with the W signal, the two virtual speaker signals determined by the decoding end include the W signal and the X signal, and the two residual signals determined by the decoding end include the Y signal and the Z signal.
Step 1103: based on the signal of the specified channel, determine the gain of one or more remaining channels of the HOA signal of the current frame other than the specified channel.
In the embodiment of the present application, after the decoding end determines the signal of the specified channel in the HOA signal of the current frame based on the code stream, the decoding end determines, based on the signal of the specified channel, the gain of one or more remaining channels of the HOA signal other than the specified channel.
Illustratively, assume that the specified channel is the FOA channel. The FOA channel may be referred to as the low-order channel, the signal of the FOA channel may be referred to as the low-order part of the HOA signal, the one or more remaining channels of the HOA signal other than the specified channel may be referred to as the high-order channels, and the signal of the high-order channels may be referred to as the high-order part of the HOA signal. The decoding end then determines the high-order gain of the HOA signal, that is, the gain of the high-order channels, based on the low-order part of the HOA signal.
Optionally, the decoding end performs analysis filtering on the signal of the specified channel in the HOA signal to obtain an analysis-filtered signal of the specified channel, and determines the gain of the one or more remaining channels based on the analysis-filtered signal. For example, assuming that the signal of the specified channel is the low-order part of the HOA signal, the decoding end performs analysis filtering on the low-order part of the HOA signal to obtain the low-order part of the analysis-filtered HOA signal, and then estimates the high-order gain based on that low-order part. Optionally, for the switching frame in the present solution, the analysis filter used by the decoding end is the same as the analysis filter used in the DirAC-based HOA decoding scheme, so that the decoding delay of the switching frame is consistent with the decoding delay of the DirAC-based HOA decoding scheme, that is, the delays are aligned. It should be noted that the decoding delay referred to herein is an end-to-end coding and decoding delay, which may also be referred to as a codec delay.
It should be noted that, in the embodiment of the present application, the process by which the decoding end determines the gains of the one or more remaining channels of the HOA signal based on the signal of the specified channel, that is, the process of estimating the gains of the remaining channels from the signal of the specified channel, is the same in its specific implementation as the remaining-channel gain estimation method in the DirAC-based coding and decoding scheme, and is therefore not described in detail here. Illustratively, for the switching frame in the present scheme, the method by which the decoding end estimates the high-order gain from the low-order part of the HOA signal is the same as the high-order gain estimation method in the DirAC-based coding and decoding scheme.
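As a rough illustration of the idea of remaining-channel gain estimation (the actual DirAC-based gain estimation method is not reproduced here), one might derive per-band gains of the missing high-order channels from the energies of the decoded low-order channels. Everything below — the function name, the shared energy-ratio gain, and the band layout — is an assumption:

```python
import numpy as np

def estimate_high_order_gains(low_bands, num_high_channels, eps=1e-12):
    # low_bands: analysis-filtered low-order (FOA) part,
    #            shape (4, num_bands); row 0 is the W channel.
    w_energy = np.abs(low_bands[0]) ** 2 + eps
    # Per-band average energy of the directional X/Y/Z channels.
    dir_energy = np.mean(np.abs(low_bands[1:]) ** 2, axis=0)
    # One shared per-band gain from the directional-to-omni energy ratio,
    # broadcast to every high-order channel.
    gain = np.sqrt(dir_energy / w_energy)
    return np.tile(gain, (num_high_channels, 1))
```

A real implementation would follow the DirAC scheme's own estimator; the point of the sketch is only that the high-order gains are a function of the decoded low-order part.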
Step 1104: determining a signal for each of the one or more remaining channels based on the signal for the specified channel and the gain for the one or more remaining channels.
In the embodiment of the present application, the decoding end determines the signal of each of the one or more remaining channels based on the signal of the specified channel and the gain of the one or more remaining channels. Illustratively, assuming that the signal of the specified channel is the low-order part of the HOA signal and the gain of the one or more remaining channels is the high-order gain, the decoding end may determine the high-order part of the HOA signal based on the W signal in the low-order part and the high-order gain. Alternatively, if the decoding end has performed analysis filtering on the low-order part of the HOA signal, the decoding end may determine the high-order part of the analysis-filtered HOA signal based on the W signal in the low-order part of the analysis-filtered HOA signal and the high-order gain.
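The step above, in which each remaining channel is obtained from the W signal and its gain, can be sketched as follows (the band-wise formulation and names are assumptions):

```python
import numpy as np

def apply_high_order_gains(w_bands, gains):
    # w_bands: W channel of the (analysis-filtered) low-order part, (num_bands,)
    # gains:   estimated gains, (num_high_channels, num_bands)
    # Each high-order channel is the W channel scaled by its per-band gain.
    return gains * w_bands[np.newaxis, :]
```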
Step 1105: based on the signal of the specified channel and the signals of the one or more remaining channels, a reconstructed HOA signal of the current frame is obtained.
In the embodiment of the present application, after obtaining the signal of the specified channel and the signals of the one or more remaining channels, the decoding end obtains the reconstructed HOA signal of the current frame based on these signals. Illustratively, the decoding end performs synthesis filtering on the signal of the specified channel and the signals of the one or more remaining channels to obtain the reconstructed HOA signal of the current frame. For example, assuming that the signal of the specified channel is the low-order part of the HOA signal and the signals of the one or more remaining channels are the high-order part of the HOA signal, the decoding end performs synthesis filtering on the low-order part and the high-order part of the HOA signal to obtain the reconstructed HOA signal of the current frame. Alternatively, if the decoding end has performed analysis filtering on the low-order part of the HOA signal, the decoding end performs synthesis filtering on the low-order part and the high-order part of the analysis-filtered HOA signal to obtain the reconstructed HOA signal of the current frame. Optionally, for the switching frame in the present solution, the synthesis filter used by the decoding end is the same as the synthesis filter used in the DirAC-based HOA coding and decoding scheme, so that the decoding delay of the switching frame is consistent with the decoding delay of the DirAC-based HOA decoding scheme, that is, the delays are aligned.
Fig. 12 is a schematic diagram of a switching frame decoding scheme according to an embodiment of the present application. Referring to fig. 12, the current frame to be decoded is a switching frame, and it is assumed that the signal of the specified channel is the low-order part of the HOA signal. In the decoding process, the decoding end acquires the code stream of the current frame to be decoded, performs core decoding on the code stream through a core decoder to reconstruct the low-order part of the HOA signal of the current frame, and then estimates the high-order part based on the low-order part, that is, reconstructs the high-order part of the HOA signal, using a method similar to the method for determining the high-order part in the DirAC-based HOA decoding scheme. The decoding end then reconstructs the HOA signal based on the decoded low-order part and the estimated high-order part.
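The decoding flow of Fig. 12 can be summarized in a single sketch; the frame-level placeholder gain here merely stands in for the DirAC-style high-order estimation, and all names and shapes are assumptions:

```python
import numpy as np

def decode_switching_frame(w, x, y, z, hoa_order=3):
    # Decoded low-order (FOA) part of the current frame.
    low = np.stack([w, x, y, z])
    num_high = (hoa_order + 1) ** 2 - 4          # e.g. 12 high-order channels
    # Placeholder frame-level gain from the directional-to-omni energy ratio.
    gain = np.sqrt(np.mean(low[1:] ** 2) / (np.mean(w ** 2) + 1e-12))
    high = np.tile(gain * w, (num_high, 1))      # estimated high-order part
    return np.concatenate([low, high])           # reconstructed HOA frame
```

For a third-order HOA signal this yields 4 decoded channels plus 12 estimated channels, i.e. the full 16-channel frame that the synthesis filtering would then operate on.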
The above describes the process of decoding the current frame when the current frame is a switching frame: the decoding end decodes the switching frame using the switching frame decoding scheme, that is, it first decodes the signal of the specified channel in the HOA signal (such as the low-order part) and then reconstructs the signal of each remaining channel (such as the high-order part). Next, the process of decoding the current frame when the current frame is a non-switching frame is described.
In the embodiment of the present application, after the decoding end determines the decoding scheme of the current frame, if the decoding scheme of the current frame is the first decoding scheme, the decoding end obtains the reconstructed HOA signal of the current frame from the code stream according to the first decoding scheme. If the decoding scheme of the current frame is the second decoding scheme, the decoding end obtains the reconstructed HOA signal of the current frame from the code stream according to the second decoding scheme.
In the embodiment of the present application, referring to fig. 13, the implementation process by which the decoding end obtains the reconstructed HOA signal of the current frame from the code stream according to the second decoding scheme is as follows: the decoding end parses the virtual speaker signal and the residual signal from the code stream through a core decoder, and sends the parsed virtual speaker signal and residual signal to an MP-based spatial decoder to obtain the reconstructed HOA signal of the current frame. It should be noted that the decoding scheme shown in fig. 13 corresponds to the encoding scheme shown in fig. 8.
The implementation process by which the decoding end obtains the reconstructed HOA signal of the current frame from the code stream according to the first decoding scheme is as follows: the decoding end parses the core layer signal and the spatial parameters from the code stream and reconstructs the HOA signal of the current frame based on them. Illustratively, referring to fig. 14, the decoding end parses the core layer signal from the code stream through a core decoder, parses the spatial parameters from the code stream through a spatial parameter decoder, and performs DirAC-based HOA signal synthesis based on the parsed core layer signal and spatial parameters to obtain the reconstructed HOA signal of the current frame. It should be noted that the decoding scheme shown in fig. 14 corresponds to the encoding scheme shown in fig. 9.
Optionally, since the high-order part of the HOA signal has a large influence on the auditory quality, in order to further ensure a smooth transition of the auditory quality when switching between different coding and decoding schemes, the decoding end may also perform gain adjustment on the high-order part of the current frame when obtaining the reconstructed HOA signal of the current frame from the code stream according to the second decoding scheme. For example, the decoding end obtains an initial HOA signal from the code stream according to the second decoding scheme, and if the decoding scheme of the previous frame of the current frame is the third decoding scheme, that is, the previous frame is a switching frame, the decoding end performs gain adjustment on the high-order part of the initial HOA signal according to the high-order gain of the previous frame. The decoding end then obtains the reconstructed HOA signal of the current frame based on the low-order part of the initial HOA signal and the gain-adjusted high-order part.
It should be noted that, if the previous frame of the current frame is a switching frame, the decoding end performs gain adjustment on the high-order part of the initial HOA signal of the current frame using the high-order gain of the previous frame, so that the gain-adjusted high-order part of the current frame is similar to the high-order part of the previous frame; for example, the gain adjustment makes the energies of the high-order parts of the HOA signals of the two adjacent frames similar. In this way, when the decoding end subsequently renders and plays each audio frame, the auditory quality transitions smoothly between the switching frame and the frame following it.
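One way to realize such an energy-matching adjustment is sketched below, under the assumption (not stated in the embodiment) that the carried-over high-order gain of the previous frame can be interpreted as a target RMS level:

```python
import numpy as np

def adjust_high_order(high_part, prev_high_gain, eps=1e-12):
    # high_part: high-order part of the initial HOA signal, (channels, samples)
    # prev_high_gain: level carried over from the switching frame,
    #                 here interpreted as a target RMS value (an assumption).
    cur_rms = np.sqrt(np.mean(high_part ** 2)) + eps
    # Scale so the adjusted high-order energy matches the previous frame's.
    return high_part * (prev_high_gain / cur_rms)
```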
Optionally, in addition to performing high-order gain adjustment on the audio frame that follows a switching frame and whose decoding scheme is the second decoding scheme, the decoding end may also perform gain adjustment on the high-order part of the HOA signal of any other audio frame whose decoding scheme is the second decoding scheme; the embodiment of the present application does not limit the specific implementation of this gain adjustment. Optionally, the decoding end may also gain-adjust other parts of the HOA signal of these audio frames in addition to the high-order part. That is, the embodiment of the present application does not limit which channels of the HOA signal are gain-adjusted. In other words, the decoding end may perform gain adjustment on the signals of any one or more channels of the HOA signal, where the one or more channels may include part or all of the high-order channels, part or all of the remaining channels other than the specified channel, or other channels.
Fig. 15 is a flowchart of another decoding method provided in an embodiment of the present application. Referring to fig. 15, taking as an example the case where the encoding end encodes the indication information of the initial encoding scheme into the code stream and no switching flag is encoded into the code stream, the decoding end first parses the indication information of the initial decoding scheme of the current frame from the code stream. The decoding end then judges whether the initial decoding scheme of the current frame is the same as that of the previous frame. If they are the same, indicating that the current frame is a non-switching frame, the decoding end decodes the code stream using the initial decoding scheme of the current frame to obtain the reconstructed HOA signal of the current frame. If they differ, indicating that the current frame is a switching frame, the decoding end decodes the code stream using the switching frame decoding scheme to obtain the reconstructed HOA signal of the current frame.
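The scheme-selection logic of Fig. 15 reduces to a simple comparison; the scheme labels used here are illustrative stand-ins:

```python
def select_decoding_scheme(cur_initial, prev_initial):
    # cur_initial / prev_initial: "dirac" (first scheme, DirAC-based)
    #                             or "mp" (second scheme, MP-based)
    if cur_initial == prev_initial:
        return cur_initial        # non-switching frame: keep initial scheme
    return "hybrid"               # switching frame: third (hybrid) scheme
```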
In summary, in the embodiment of the present application, the HOA signal of an audio frame is encoded and decoded by combining two schemes (the coding and decoding scheme based on virtual speaker selection and the coding and decoding scheme based on directional audio coding), that is, a suitable coding and decoding scheme is selected for each audio frame, which improves the compression rate of the audio signal. Meanwhile, to ensure a smooth transition of the auditory quality when switching between the schemes, some audio frames in the present solution are encoded and decoded with a new scheme rather than either of the two schemes directly: during encoding, the signal of the specified channel in the HOA signal of the audio frame is encoded into the code stream. This compromise scheme allows the auditory quality of the rendered and played HOA signal recovered by decoding to transition smoothly.
Fig. 16 is a schematic structural diagram of an encoding apparatus 1600 provided in an embodiment of the present application. The encoding apparatus 1600 may be implemented by software, hardware, or a combination of the two as part or all of an encoding-end device, and the encoding-end device may be any of the encoding-end devices in the foregoing embodiments. Referring to fig. 16, the apparatus 1600 comprises: a first determining module 1601 and a first encoding module 1602.
A first determining module 1601, configured to determine an encoding scheme of the current frame according to the higher-order ambisonic HOA signal of the current frame, where the encoding scheme of the current frame is one of a first encoding scheme, a second encoding scheme, and a third encoding scheme; wherein the first encoding scheme is a HOA encoding scheme based on directional audio encoding, the second encoding scheme is a HOA encoding scheme based on virtual speaker selection, and the third encoding scheme is a hybrid encoding scheme;
a first encoding module 1602, configured to, if the encoding scheme of the current frame is the third encoding scheme, encode a signal of a specified channel in the HOA signal into the code stream, where the specified channel is a part of channels in all channels of the HOA signal.
Optionally, the signal of the specified channel includes a first-order ambisonic (FOA) signal, the FOA signal including an omnidirectional W signal and directional X, Y, and Z signals.
Optionally, the first encoding module 1602 includes:
a first determining sub-module for determining a virtual speaker signal and a residual signal based on the W signal, the X signal, the Y signal, and the Z signal;
and an encoding submodule, configured to encode the virtual speaker signal and the residual signal into the code stream.
Optionally, the first determining sub-module is configured to:
determining the W signal as one virtual speaker signal;
determining three residual signals based on the W signal, the X signal, the Y signal, and the Z signal, or determining the X signal, the Y signal, and the Z signal as the three residual signals.
Optionally, the encoding submodule is configured to:
combining the virtual speaker signal with a first preset single-channel signal to obtain one stereo signal;
combining the three residual signals, together with a second preset single-channel signal, to obtain two stereo signals;
and encoding the resulting three stereo signals into the code stream through a stereo encoder.
Optionally, the encoding submodule is configured to:
combining the two most correlated of the three residual signals to obtain one of the two stereo signals;
and combining the remaining residual signal with the second preset single-channel signal to obtain the other of the two stereo signals.
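The correlation-based pairing described above can be sketched as follows; the function name and the use of the Pearson correlation coefficient as the correlation measure are assumptions:

```python
import numpy as np

def pair_residuals(residuals, preset_mono):
    # residuals: three residual signals, shape (3, num_samples)
    pairs = [(0, 1), (0, 2), (1, 2)]
    # Pick the two residuals with the highest absolute correlation.
    best = max(pairs, key=lambda p: abs(np.corrcoef(residuals[p[0]],
                                                    residuals[p[1]])[0, 1]))
    rest = ({0, 1, 2} - set(best)).pop()
    stereo_a = np.stack([residuals[best[0]], residuals[best[1]]])
    # The leftover residual is paired with the preset single-channel signal.
    stereo_b = np.stack([residuals[rest], preset_mono])
    return stereo_a, stereo_b
```

Grouping the two most correlated residuals into one stereo pair lets the stereo encoder exploit their inter-channel redundancy, which is presumably the motivation for this pairing rule.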
Optionally, the first preset single-channel signal is an all-zero signal or an all-one signal, where an all-zero signal is a signal whose sample values, or whose frequency-bin values, are all zero, and an all-one signal is a signal whose sample values, or whose frequency-bin values, are all one; the second preset single-channel signal is likewise an all-zero signal or an all-one signal; and the first preset single-channel signal may be the same as or different from the second preset single-channel signal.
Optionally, the encoding submodule is configured to:
and encoding the virtual speaker signal and each of the three residual signals into the code stream through a mono encoder.
Optionally, the apparatus 1600 further comprises:
a second encoding module, configured to encode the HOA signal into a code stream according to the first encoding scheme if the encoding scheme of the current frame is the first encoding scheme;
and the third coding module is used for coding the HOA signal into the code stream according to the second coding scheme if the coding scheme of the current frame is the second coding scheme.
Optionally, the first determining module 1601 comprises:
a second determining sub-module, configured to determine an initial coding scheme of the current frame according to the HOA signal, where the initial coding scheme is the first coding scheme or the second coding scheme;
a third determining sub-module, configured to determine that the coding scheme of the current frame is the same as the initial coding scheme of the previous frame of the current frame if the initial coding scheme of the current frame is the same as the initial coding scheme of the previous frame of the current frame;
and a fourth determining sub-module, configured to determine that the encoding scheme of the current frame is the third encoding scheme if the initial encoding scheme of the current frame is the first encoding scheme and the initial encoding scheme of the previous frame of the current frame is the second encoding scheme, or the initial encoding scheme of the current frame is the second encoding scheme and the initial encoding scheme of the previous frame of the current frame is the first encoding scheme.
Optionally, the apparatus 1600 further comprises:
and the fourth coding module is used for coding the indication information of the initial coding scheme of the current frame into the code stream.
Optionally, the apparatus 1600 further comprises:
a second determining module, configured to determine a value of a switch flag of a current frame, where the value of the switch flag of the current frame is a first value when a coding scheme of the current frame is a first coding scheme or a second coding scheme; when the coding scheme of the current frame is a third coding scheme, the value of the switching mark of the current frame is a second value;
and the fifth coding module is used for coding the value of the switching mark into the code stream.
Optionally, the apparatus 1600 further comprises:
and the sixth coding module is used for coding the indication information of the coding scheme of the current frame into the code stream.
Optionally, the specified channel is consistent with the transmission channel preset in the first encoding scheme.
In the embodiment of the present application, the HOA signal of an audio frame is encoded and decoded by combining two schemes (the coding and decoding scheme based on virtual speaker selection and the coding and decoding scheme based on directional audio coding), that is, a suitable coding and decoding scheme is selected for each audio frame, which improves the compression rate of the audio signal. Meanwhile, to ensure a smooth transition of the auditory quality when switching between the schemes, some audio frames in the present solution are encoded and decoded with a new scheme rather than either of the two schemes directly: the signal of the specified channel in the HOA signal of the audio frame is encoded into the code stream. This compromise scheme allows the auditory quality of the rendered and played HOA signal recovered by decoding to transition smoothly.
It should be noted that: in the encoding apparatus provided in the foregoing embodiment, when encoding an audio frame, only the division of the above functional modules is used for illustration, and in practical applications, the above functions may be allocated by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the encoding apparatus and the encoding method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Fig. 17 is a schematic structural diagram of a decoding apparatus 1700 provided in an embodiment of the present application. The decoding apparatus 1700 may be implemented by software, hardware, or a combination of the two as part or all of a decoding-end device, and the decoding-end device may be any of the decoding-end devices in the foregoing embodiments. Referring to fig. 17, the apparatus 1700 includes: a first obtaining module 1701, a first determining module 1702, a second determining module 1703, a third determining module 1704, and a second obtaining module 1705.
A first obtaining module 1701, configured to obtain, based on the code stream, a decoding scheme of the current frame, where the decoding scheme of the current frame is one of a first decoding scheme, a second decoding scheme, and a third decoding scheme; wherein the first decoding scheme is a higher-order ambisonic (HOA) decoding scheme based on directional audio decoding, the second decoding scheme is a HOA decoding scheme based on virtual speaker selection, and the third decoding scheme is a hybrid decoding scheme;
a first determining module 1702, configured to determine, based on the code stream, the signal of a specified channel in the HOA signal of the current frame if the decoding scheme of the current frame is the third decoding scheme, where the specified channel is some of all the channels of the HOA signal;
a second determining module 1703, configured to determine, based on the signal of the specified channel, gains of one or more remaining channels of the HOA signal except for the specified channel;
a third determining module 1704 for determining a signal for each of the one or more remaining channels based on the signal for the specified channel and the gain for the one or more remaining channels;
a second obtaining module 1705, configured to obtain a reconstructed HOA signal of the current frame based on the signal of the specified channel and the signals of the one or more remaining channels.
Optionally, the first determining module 1702 includes:
the first determining submodule is used for determining a virtual loudspeaker signal and a residual signal based on the code stream;
a second determining sub-module for determining a signal of the specified channel based on the virtual speaker signal and the residual signal.
Optionally, the first determining sub-module is configured to:
decoding the code stream through a stereo decoder to obtain three stereo signals;
and determining one virtual speaker signal and three residual signals based on the three stereo signals.
Optionally, the first determining sub-module is configured to:
determining the virtual speaker signal based on one of the three stereo signals;
and determining the three residual signals based on the remaining stereo signals of the three stereo signals.
Optionally, the first determining sub-module is configured to:
and decoding the code stream through a mono decoder to obtain one virtual speaker signal and three residual signals.
Optionally, the signal of the specified channel includes a first-order ambisonic (FOA) signal, the FOA signal including an omnidirectional W signal and directional X, Y, and Z signals;
the first determination submodule is configured to:
determining a W signal based on the virtual speaker signal;
the X signal, the Y signal, and the Z signal are determined based on the residual signal and the W signal, or the X signal, the Y signal, and the Z signal are determined based on the residual signal.
Optionally, the apparatus 1700 further comprises:
the first decoding module is used for obtaining a reconstructed HOA signal of the current frame according to the code stream according to the first decoding scheme if the decoding scheme of the current frame is the first decoding scheme;
and the second decoding module is used for obtaining the reconstructed HOA signal of the current frame according to the code stream according to the second decoding scheme if the decoding scheme of the current frame is the second decoding scheme.
Optionally, the second decoding module comprises:
a first obtaining sub-module, configured to obtain an initial HOA signal from the code stream according to the second decoding scheme;
a gain adjustment sub-module, configured to perform gain adjustment on a high-order portion of the initial HOA signal according to a high-order gain of a previous frame of the current frame if a decoding scheme of the previous frame of the current frame is a third decoding scheme;
a second obtaining sub-module, configured to obtain the reconstructed HOA signal based on the low-order part and the gain-adjusted high-order part of the initial HOA signal.
Optionally, the first obtaining module 1701 includes:
the first analysis submodule is used for analyzing the value of the switching mark of the current frame from the code stream;
a second parsing sub-module, configured to parse, if the value of the switch flag is the first value, indication information of the decoding scheme of the current frame from the code stream, where the indication information is used to indicate that the decoding scheme of the current frame is the first decoding scheme or the second decoding scheme;
and a third determining sub-module, configured to determine that the decoding scheme of the current frame is the third decoding scheme if the value of the switching flag is the second value.
Optionally, the first obtaining module 1701 includes:
and the third analysis sub-module is used for analyzing the indication information of the decoding scheme of the current frame from the code stream, and the indication information is used for indicating that the decoding scheme of the current frame is the first decoding scheme, the second decoding scheme or the third decoding scheme.
Optionally, the first obtaining module 1701 includes:
the fourth analysis submodule is used for analyzing the initial decoding scheme of the current frame from the code stream, and the initial decoding scheme is the first decoding scheme or the second decoding scheme;
a fourth determining sub-module, configured to determine that the decoding scheme of the current frame is the initial decoding scheme of the current frame if the initial decoding scheme of the current frame is the same as the initial decoding scheme of the previous frame of the current frame;
and a fifth determining sub-module, configured to determine that the decoding scheme of the current frame is the third decoding scheme if the initial decoding scheme of the current frame is the first decoding scheme and the initial decoding scheme of the previous frame of the current frame is the second decoding scheme, or the initial decoding scheme of the current frame is the second decoding scheme and the initial decoding scheme of the previous frame of the current frame is the first decoding scheme.
In the embodiment of the present application, the HOA signal of an audio frame is encoded and decoded by combining two schemes (the coding and decoding scheme based on virtual speaker selection and the coding and decoding scheme based on directional audio coding), that is, a suitable coding and decoding scheme is selected for each audio frame, which improves the compression rate of the audio signal. Meanwhile, to ensure a smooth transition of the auditory quality when switching between the schemes, some audio frames in the present solution are encoded and decoded with a new scheme rather than either of the two schemes directly: the signal of the specified channel in the HOA signal of the audio frame is encoded into the code stream. This compromise scheme allows the auditory quality of the rendered and played HOA signal recovered by decoding to transition smoothly.
It should be noted that: in the decoding apparatus provided in the foregoing embodiment, when decoding an audio frame, only the division of the functional modules is illustrated, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the decoding apparatus and the decoding method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Fig. 18 is a schematic block diagram of a codec device 1800 used in an embodiment of the present application. The codec 1800 may include a processor 1801, a memory 1802, and a bus system 1803, among other things. The processor 1801 is connected to the memory 1802 through a bus system 1803, the memory 1802 is configured to store instructions, and the processor 1801 is configured to execute the instructions stored in the memory 1802, so as to perform various encoding or decoding methods described in the embodiments of the present application. To avoid repetition, it is not described in detail here.
In the embodiment of the present application, the processor 1801 may be a Central Processing Unit (CPU), and the processor 1801 may also be other general-purpose processors, DSPs, ASICs, FPGAs, or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 1802 may include a ROM device or a RAM device. Any other suitable type of memory device may also be used for memory 1802. Memory 1802 may include code and data 18021 accessed by processor 1801 using bus 1803. The memory 1802 may further include an operating system 18023 and application programs 18022, the application programs 18022 including at least one program that allows the processor 1801 to perform the encoding or decoding methods described in embodiments of the present application. For example, the application 18022 may include applications 1 to N, which further include an encoding or decoding application (abbreviated as a codec application) that performs the encoding or decoding method described in the embodiments of the present application.
The bus system 1803 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. For clarity of illustration, however, the various buses are designated in the figure as the bus system 1803.
Optionally, the codec 1800 may also include one or more output devices, such as a display 1804. In one example, the display 1804 may be a touch-sensitive display that incorporates a display with a touch-sensitive unit operable to sense touch input. The display 1804 may be coupled to the processor 1801 via the bus 1803.
It should be noted that the codec device 1800 may execute the encoding method in the embodiment of the present application, and may also execute the decoding method in the embodiment of the present application.
Those of skill in the art will appreciate that the functions described in connection with the various illustrative logical blocks, modules, and algorithm steps described in the disclosure herein may be implemented as hardware, software, firmware, or any combination thereof. If implemented in software, the functions described in the various illustrative logical blocks, modules, and steps may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which corresponds to tangible media, such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another (e.g., based on a communication protocol). In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium, such as a signal or carrier wave. A data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described herein. The computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Thus, the term "processor," as used herein, may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functions described by the various illustrative logical blocks, modules, and steps described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques may be fully implemented in one or more circuits or logic elements. In one example, various illustrative logical blocks, units, and modules within the encoder 100 and the decoder 200 may be understood as corresponding circuit devices or logical elements.
The techniques of embodiments of the present application may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an Integrated Circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in embodiments of the application to emphasize functional aspects of means for performing the disclosed techniques, but do not necessarily require realization by different hardware units. Indeed, as described above, the various units may be combined in a codec hardware unit, in conjunction with suitable software and/or firmware, or provided by an interoperating hardware unit (including one or more processors as described above).
That is, the above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced, in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that includes one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), among others. It is noted that the computer-readable storage medium referred to in the embodiments of the present application may be a non-volatile storage medium, in other words, a non-transitory storage medium.
It is to be understood that reference herein to "at least one" means one or more and "a plurality" means two or more. In the description of the embodiments of the present application, "/" means "or" unless otherwise specified, for example, a/B may mean a or B; "and/or" herein is merely an association relationship describing an associated object, and means that there may be three relationships, for example, a and/or B, and may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, in order to facilitate clear description of technical solutions of the embodiments of the present application, in the embodiments of the present application, words such as "first" and "second" are used to distinguish identical items or similar items with substantially identical functions and actions. Those skilled in the art will appreciate that the terms "first," "second," etc. do not denote any order or quantity, nor do the terms "first," "second," etc. denote any order or importance.
The above-mentioned embodiments are provided not to limit the present application, and any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (54)

1. A method of encoding, the method comprising:
determining an encoding scheme of a current frame according to a higher-order ambisonic (HOA) signal of the current frame, wherein the encoding scheme of the current frame is one of a first encoding scheme, a second encoding scheme and a third encoding scheme; wherein the first encoding scheme is a HOA encoding scheme based on directional audio encoding, the second encoding scheme is a HOA encoding scheme based on virtual speaker selection, and the third encoding scheme is a hybrid encoding scheme;
and if the coding scheme of the current frame is the third coding scheme, coding a signal of a specified channel in the HOA signal into a code stream, wherein the specified channel is a part of channels in all channels of the HOA signal.
2. The method of claim 1, wherein the signal of the specified channel comprises a first-order ambisonic (FOA) signal, the FOA signal comprising an omnidirectional W signal and directional X, Y, and Z signals.
3. The method of claim 2, wherein said encoding a signal of a specified channel in said HOA signal into said code stream comprises:
determining a virtual speaker signal and a residual signal based on the W signal, the X signal, the Y signal, and the Z signal;
and encoding the virtual speaker signal and the residual signal into the code stream.
4. The method of claim 3, wherein determining a virtual speaker signal and a residual signal based on the W signal, the X signal, the Y signal, and the Z signal comprises:
determining the W signal as one path of virtual speaker signal;
determining difference signals between the X signal, the Y signal, and the Z signal and the W signal, respectively, as three paths of residual signals, or determining the X signal, the Y signal, and the Z signal as three paths of residual signals.
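The split in claim 4 (with the difference-signal variant spelled out in apparatus claim 29) can be sketched as below. The function name `split_foa` and the per-channel list representation are illustrative assumptions, not part of the claims.

```python
def split_foa(w, x, y, z, use_difference=True):
    """Split a FOA frame into one virtual speaker signal and three residuals.

    w, x, y, z are per-channel sample lists. With use_difference=True the
    residuals are X-W, Y-W, Z-W (the difference-signal variant of claim 29);
    otherwise X, Y, Z are used directly (the second alternative of claim 4).
    """
    speaker = list(w)  # the W channel is taken as the virtual speaker signal
    if use_difference:
        residuals = [[a - b for a, b in zip(ch, w)] for ch in (x, y, z)]
    else:
        residuals = [list(x), list(y), list(z)]
    return speaker, residuals
```

Either variant yields one virtual speaker signal plus three residual channels, which the subsequent claims then pack into stereo or mono streams.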
5. The method of claim 4, wherein said encoding the virtual speaker signal and the residual signal into the code stream comprises:
combining the virtual loudspeaker signal with a first preset single-channel signal to obtain a stereo signal;
combining the three residual signals with a second preset single-channel signal to obtain two paths of stereo signals;
and respectively encoding the obtained three paths of stereo signals into the code stream through a stereo encoder.
6. The method of claim 5, wherein said combining the three residual signals with the second preset single-channel signal to obtain the two paths of stereo signals comprises:
combining two paths of residual signals with highest correlation in the three paths of residual signals to obtain one path of stereo signal in the two paths of stereo signals;
and combining one path of residual signal except the two paths of residual signals with the highest correlation in the three paths of residual signals with the second path of preset single-channel signal to obtain the other path of stereo signal in the two paths of stereo signals.
7. The method according to claim 5 or 6, wherein the first path of preset single-channel signal is an all-zero signal or an all-one signal, the all-zero signal includes a signal whose sampling points all have a value of zero or a signal whose frequency points all have a value of zero, and the all-one signal includes a signal whose sampling points all have a value of one or a signal whose frequency points all have a value of one;
the second path of preset single-channel signal is an all-zero signal or an all-one signal;
the first path of preset single-channel signal is the same as or different from the second path of preset single-channel signal.
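The stereo packing of claims 5-7 can be sketched as follows: the two most-correlated residuals form one stereo pair, and the remaining residual is paired with a preset mono signal (an all-zero signal is assumed here; claim 7 also allows an all-one signal). The function name `pair_residuals` and the normalized-correlation measure are assumptions of this sketch.

```python
from itertools import combinations

def pair_residuals(residuals, preset=None):
    """Group three residual channels into two stereo pairs (claims 5-7 sketch).

    The two residuals with the highest (absolute, normalized) correlation
    form one stereo signal; the remaining residual is paired with a preset
    mono signal, assumed all-zero here, to form the other stereo signal.
    """
    n = len(residuals[0])
    preset = preset if preset is not None else [0.0] * n  # all-zero signal

    def corr(a, b):
        # Absolute normalized correlation between two sample lists.
        num = sum(p * q for p, q in zip(a, b))
        den = (sum(p * p for p in a) * sum(q * q for q in b)) ** 0.5
        return abs(num / den) if den else 0.0

    i, j = max(combinations(range(3), 2),
               key=lambda ij: corr(residuals[ij[0]], residuals[ij[1]]))
    k = 3 - i - j  # index of the leftover residual
    return (residuals[i], residuals[j]), (residuals[k], preset)
```

Each returned pair would then be fed to a stereo encoder, giving the three stereo streams of claim 5 together with the W-plus-preset pair.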
8. The method of claim 4, wherein said encoding the virtual speaker signal and the residual signal into the code stream comprises:
encoding the one path of virtual speaker signal and each of the three paths of residual signals into the code stream through a single-channel encoder, respectively.
9. The method according to any of claims 1-8, wherein after said determining an encoding scheme of a current frame according to the higher-order ambisonic (HOA) signal of the current frame, the method further comprises:
if the coding scheme of the current frame is the first coding scheme, coding the HOA signal into the code stream according to the first coding scheme;
and if the coding scheme of the current frame is the second coding scheme, coding the HOA signal into the code stream according to the second coding scheme.
10. The method according to any of claims 1-9, wherein said determining an encoding scheme of a current frame according to a higher-order ambisonic (HOA) signal of the current frame comprises:
determining an initial coding scheme for the current frame from the HOA signal, the initial coding scheme being the first coding scheme or the second coding scheme;
if the initial coding scheme of the current frame is the same as the initial coding scheme of the previous frame of the current frame, determining that the coding scheme of the current frame is the initial coding scheme of the current frame;
determining that the encoding scheme of the current frame is the third encoding scheme if the initial encoding scheme of the current frame is the first encoding scheme and the initial encoding scheme of the previous frame of the current frame is the second encoding scheme, or if the initial encoding scheme of the current frame is the second encoding scheme and the initial encoding scheme of the previous frame of the current frame is the first encoding scheme.
11. The method of claim 10, wherein after said determining an initial coding scheme for the current frame from the HOA signal, the method further comprises:
and coding the indication information of the initial coding scheme of the current frame into the code stream.
12. The method according to any of claims 1-11, wherein said determining a coding scheme for a current frame based on a higher-order ambisonic (HOA) signal for said current frame further comprises:
determining a value of a switch flag of the current frame, wherein the value of the switch flag of the current frame is a first value when the coding scheme of the current frame is the first coding scheme or the second coding scheme; when the coding scheme of the current frame is the third coding scheme, the value of the switching flag of the current frame is a second value;
and coding the value of the switching mark into the code stream.
13. The method according to any of claims 1-10, wherein after determining the coding scheme of the current frame based on the HOA signal of the current frame, further comprising:
and coding the indication information of the coding scheme of the current frame into the code stream.
14. The method according to any of claims 1-13, wherein the specified channel coincides with a transmission channel preset in the first coding scheme.
15. A method of decoding, the method comprising:
obtaining a decoding scheme of a current frame based on the code stream, wherein the decoding scheme of the current frame is one of a first decoding scheme, a second decoding scheme and a third decoding scheme; wherein the first decoding scheme is a higher-order ambisonic (HOA) decoding scheme based on directional audio decoding, the second decoding scheme is a HOA decoding scheme based on virtual speaker selection, and the third decoding scheme is a hybrid decoding scheme;
if the decoding scheme of the current frame is the third decoding scheme, determining a signal of a specified channel in the HOA signal of the current frame based on a code stream, wherein the specified channel is a part of channels in all channels of the HOA signal;
determining, based on the signals of the specified channel, gains of one or more remaining channels of the HOA signal other than the specified channel;
determining a signal for each of the one or more remaining channels based on the signal for the specified channel and the gain for the one or more remaining channels;
obtaining a reconstructed HOA signal of the current frame based on the signal of the specified channel and the signals of the one or more remaining channels.
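Claim 15 determines gains for the remaining channels from the specified-channel signals but does not fix a formula. The following is a minimal illustrative sketch under the assumption that each remaining higher-order channel is approximated as a scalar gain applied to the omnidirectional W channel; the dictionary layout, channel names, and the function name `reconstruct_remaining` are assumptions of this sketch only.

```python
def reconstruct_remaining(specified, gains):
    """Illustrative sketch of claim 15's reconstruction step.

    specified maps channel names to sample lists and is assumed to contain
    the omnidirectional channel "W"; gains maps each remaining-channel name
    to a scalar gain. Each remaining channel is approximated as its gain
    applied to the W channel (an assumption; the claim does not fix this).
    """
    w = specified["W"]
    return {name: [g * s for s in w] for name, g in gains.items()}
```

The reconstructed HOA frame of claim 15 is then the union of the decoded specified channels and these gain-derived remaining channels.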
16. The method of claim 15, wherein said determining a signal of a specified channel in the HOA signal of the current frame based on the code stream comprises:
determining a virtual speaker signal and a residual signal based on the code stream;
determining a signal of the specified channel based on the virtual speaker signal and the residual signal.
17. The method of claim 16, wherein the determining a virtual speaker signal and a residual signal based on the code stream comprises:
decoding the code stream through a stereo decoder to obtain three paths of stereo signals;
and determining one path of the virtual loudspeaker signal and three paths of the residual signals based on the three paths of stereo signals.
18. The method of claim 17, wherein said determining one path of the virtual speaker signal and three paths of the residual signals based on the three paths of stereo signals comprises:
determining the one path of virtual speaker signal based on one path of stereo signal in the three paths of stereo signals;
determining the three paths of residual signals based on the other two paths of stereo signals in the three paths of stereo signals.
19. The method of claim 16, wherein the determining a virtual speaker signal and a residual signal based on the code stream comprises:
and decoding the code stream through a single-channel decoder to obtain one path of virtual speaker signal and three paths of residual signals.
20. The method of any of claims 16-19, wherein the signal of the specified channel comprises a first-order ambisonic (FOA) signal, the FOA signal comprising an omnidirectional W signal and directional X, Y, and Z signals;
the determining the signal of the specified channel based on the virtual speaker signal and the residual signal comprises:
determining the W signal based on the virtual speaker signal;
the X signal, the Y signal, and the Z signal are determined based on the residual signal and the W signal, or the X signal, the Y signal, and the Z signal are determined based on the residual signal.
21. The method of any of claims 15-20, further comprising:
if the decoding scheme of the current frame is the first decoding scheme, obtaining a reconstructed HOA signal of the current frame from the code stream according to the first decoding scheme;
and if the decoding scheme of the current frame is the second decoding scheme, obtaining a reconstructed HOA signal of the current frame from the code stream according to the second decoding scheme.
22. The method of claim 21, wherein said obtaining the reconstructed HOA signal of the current frame from the code stream according to the second decoding scheme comprises:
obtaining an initial HOA signal from the code stream according to the second decoding scheme;
if the decoding scheme of the previous frame of the current frame is the third decoding scheme, performing gain adjustment on the high-order part of the initial HOA signal according to the high-order gain of the previous frame of the current frame;
obtaining the reconstructed HOA signal based on the lower order part and the gain-adjusted higher order part of the initial HOA signal.
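The gain smoothing of claim 22 can be sketched as below. The constant 3 for the third (hybrid) scheme, the assumption that the low-order part is the first four (FOA) channels, and the function name `smooth_higher_order` are all illustrative choices not fixed by the claim.

```python
def smooth_higher_order(initial_hoa, prev_scheme, prev_gain, low_order_count=4):
    """Sketch of claim 22's higher-order gain adjustment.

    initial_hoa is a list of per-channel sample lists. When the previous
    frame used the hybrid (third, assumed value 3) scheme, the higher-order
    channels are scaled by the previous frame's higher-order gain; the
    low-order (assumed FOA) channels pass through unchanged.
    """
    if prev_scheme != 3:
        return initial_hoa  # no adjustment needed outside a hybrid transition
    low = initial_hoa[:low_order_count]
    high = [[prev_gain * s for s in ch] for ch in initial_hoa[low_order_count:]]
    return low + high
```

This keeps the energy of the higher-order part continuous across the frame where decoding switches away from the compromise scheme.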
23. The method according to any one of claims 15-22, wherein said obtaining a decoding scheme for a current frame based on a code stream comprises:
parsing a value of a switching flag of the current frame from the code stream;
if the value of the switching flag is a first value, parsing indication information of the decoding scheme of the current frame from the code stream, wherein the indication information is used for indicating that the decoding scheme of the current frame is the first decoding scheme or the second decoding scheme;
and if the value of the switching flag is a second value, determining that the decoding scheme of the current frame is the third decoding scheme.
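The flag-driven parsing of claim 23 can be sketched with two hypothetical bitstream-reader callbacks. The concrete flag values (0 as the first value, 1 as the second), the value 3 for the third scheme, and the callback names `read_bit` / `read_indication` are assumptions of this sketch; the claim only distinguishes a first and a second value.

```python
def parse_scheme(read_bit, read_indication):
    """Bitstream-side sketch of claim 23 (hypothetical reader callbacks).

    read_bit() returns the switching flag; read_indication() returns 1 or 2
    (first or second decoding scheme) and is consumed only when the flag
    carries the first value (assumed 0). The second value (assumed 1)
    means the third (hybrid) decoding scheme without further indication.
    """
    flag = read_bit()
    if flag == 0:  # first value: a base scheme, indication bits follow
        return read_indication()
    return 3       # second value: third (hybrid) decoding scheme
```

Note that under this layout the indication information is only present in the code stream for base-scheme frames, matching the conditional parse in the claim.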
24. The method according to any one of claims 15-22, wherein said obtaining a decoding scheme for a current frame based on a code stream comprises:
and parsing indication information of the decoding scheme of the current frame from the code stream, wherein the indication information is used for indicating that the decoding scheme of the current frame is the first decoding scheme, the second decoding scheme, or the third decoding scheme.
25. The method according to any one of claims 15-22, wherein said obtaining a decoding scheme of the current frame based on the code stream comprises:
parsing an initial decoding scheme of the current frame from the code stream, wherein the initial decoding scheme is the first decoding scheme or the second decoding scheme;
if the initial decoding scheme of the current frame is the same as the initial decoding scheme of the previous frame of the current frame, determining that the decoding scheme of the current frame is the initial decoding scheme of the current frame;
determining that the decoding scheme of the current frame is the third decoding scheme if the initial decoding scheme of the current frame is the first decoding scheme and the initial decoding scheme of the previous frame of the current frame is the second decoding scheme or the initial decoding scheme of the current frame is the second decoding scheme and the initial decoding scheme of the previous frame of the current frame is the first decoding scheme.
26. An encoding apparatus, characterized in that the apparatus comprises:
a first determining module, configured to determine a coding scheme of a current frame according to a higher-order ambisonic HOA signal of the current frame, where the coding scheme of the current frame is one of a first coding scheme, a second coding scheme, and a third coding scheme; wherein the first encoding scheme is a HOA encoding scheme based on directional audio encoding, the second encoding scheme is a HOA encoding scheme based on virtual speaker selection, and the third encoding scheme is a hybrid encoding scheme;
a first encoding module, configured to encode, into a code stream, a signal of a specified channel in the HOA signal if the encoding scheme of the current frame is the third encoding scheme, where the specified channel is a partial channel in all channels of the HOA signal.
27. The apparatus of claim 26, wherein the signal of the specified channel comprises a first-order ambisonic (FOA) signal, the FOA signal comprising an omnidirectional W signal and directional X, Y, and Z signals.
28. The apparatus of claim 27, wherein the first encoding module comprises:
a first determining sub-module for determining a virtual speaker signal and a residual signal based on the W signal, the X signal, the Y signal, and the Z signal;
and the coding submodule is used for coding the virtual loudspeaker signal and the residual error signal into the code stream.
29. The apparatus of claim 28, wherein the first determination submodule is to:
determining the W signal as a path of the virtual loudspeaker signal;
determining difference signals between the X signal, the Y signal, and the Z signal and the W signal, respectively, as three paths of the residual signals, or determining the X signal, the Y signal, and the Z signal as three paths of the residual signals.
30. The apparatus of claim 29, wherein the encoding submodule is to:
combining the virtual loudspeaker signal with a first preset single-channel signal to obtain a stereo signal;
combining the three residual signals with a second preset single-channel signal to obtain two paths of stereo signals;
and respectively encoding the obtained three paths of stereo signals into the code stream through a stereo encoder.
31. The apparatus of claim 30, wherein the encoding sub-module is to:
combining two paths of residual signals with highest correlation in the three paths of residual signals to obtain one path of stereo signal in the two paths of stereo signals;
and combining one path of residual signal except the two paths of residual signals with the highest correlation in the three paths of residual signals with the second path of preset single-channel signal to obtain the other path of stereo signal in the two paths of stereo signals.
32. The apparatus according to claim 30 or 31, wherein the first path of preset single-channel signal is an all-zero signal or an all-one signal, the all-zero signal includes a signal whose sampling points all have a value of zero or a signal whose frequency points all have a value of zero, and the all-one signal includes a signal whose sampling points all have a value of one or a signal whose frequency points all have a value of one;
the second path of preset single-channel signal is an all-zero signal or an all-one signal;
the first path of preset single-channel signal is the same as or different from the second path of preset single-channel signal.
33. The apparatus of claim 29, wherein the encoding submodule is to:
and encoding the one path of virtual speaker signal and each path of residual signal in the three paths of residual signals into the code stream through a single-channel encoder, respectively.
34. The apparatus of any of claims 26-33, wherein the apparatus further comprises:
a second encoding module, configured to, if the encoding scheme of the current frame is the first encoding scheme, encode the HOA signal into the code stream according to the first encoding scheme;
and a third encoding module, configured to, if the encoding scheme of the current frame is the second encoding scheme, encode the HOA signal into the code stream according to the second encoding scheme.
35. The apparatus of any of claims 26-34, wherein the first determining module comprises:
a second determining sub-module for determining an initial coding scheme of the current frame from the HOA signal, the initial coding scheme being the first coding scheme or the second coding scheme;
a third determining sub-module, configured to determine that the coding scheme of the current frame is the initial coding scheme of the current frame if the initial coding scheme of the current frame is the same as the initial coding scheme of the previous frame of the current frame;
a fourth determining sub-module, configured to determine that the encoding scheme of the current frame is the third encoding scheme if the initial encoding scheme of the current frame is the first encoding scheme and the initial encoding scheme of the previous frame of the current frame is the second encoding scheme, or if the initial encoding scheme of the current frame is the second encoding scheme and the initial encoding scheme of the previous frame of the current frame is the first encoding scheme.
36. The apparatus of claim 35, wherein the apparatus further comprises:
and the fourth coding module is used for coding the indication information of the initial coding scheme of the current frame into the code stream.
37. The apparatus of any of claims 26-36, wherein the apparatus further comprises:
a second determining module, configured to determine a value of the switch flag of the current frame, where the value of the switch flag of the current frame is a first value when the coding scheme of the current frame is the first coding scheme or the second coding scheme; when the coding scheme of the current frame is the third coding scheme, the value of the switching flag of the current frame is a second value;
and the fifth coding module is used for coding the value of the switching mark into the code stream.
38. The apparatus of any of claims 26-35, wherein the apparatus further comprises:
and the sixth coding module is used for coding the indication information of the coding scheme of the current frame into the code stream.
39. The apparatus according to any of claims 26-38, wherein the specified channel coincides with a transmission channel preset in the first coding scheme.
40. An apparatus for decoding, the apparatus comprising:
a first obtaining module, configured to obtain a decoding scheme of a current frame based on a code stream, where the decoding scheme of the current frame is one of a first decoding scheme, a second decoding scheme, and a third decoding scheme; wherein the first decoding scheme is a higher-order ambisonic (HOA) decoding scheme based on directional audio decoding, the second decoding scheme is a HOA decoding scheme based on virtual speaker selection, and the third decoding scheme is a hybrid decoding scheme;
a first determining module, configured to determine, based on a code stream, a signal of a specified channel in the HOA signal of the current frame if the decoding scheme of the current frame is the third decoding scheme, where the specified channel is a partial channel in all channels of the HOA signal;
a second determining module for determining, based on the signals of the specified channel, gains of one or more remaining channels of the HOA signal other than the specified channel;
a third determining module, configured to determine a signal of each of the one or more remaining channels based on the signal of the specified channel and the gain of the one or more remaining channels;
a second obtaining module, configured to obtain a reconstructed HOA signal of the current frame based on the signal of the specified channel and the signals of the one or more remaining channels.
41. The apparatus of claim 40, wherein the first determining module comprises:
the first determining submodule is used for determining a virtual loudspeaker signal and a residual signal based on the code stream;
a second determining sub-module for determining a signal of the specified channel based on the virtual speaker signal and the residual signal.
42. The apparatus of claim 41, wherein the first determination submodule is to:
decoding the code stream through a stereo decoder to obtain three paths of stereo signals;
and determining one path of the virtual loudspeaker signal and three paths of the residual signals based on the three paths of stereo signals.
43. The apparatus of claim 42, wherein the first determining sub-module is configured to:
determine the virtual speaker signal based on one stereo signal of the three stereo signals; and
determine the three residual signals based on the remaining stereo signals of the three stereo signals.
44. The apparatus of claim 41, wherein the first determining sub-module is configured to:
decode the code stream by using a single-channel decoder to obtain one virtual speaker signal and three residual signals.
45. The apparatus of any one of claims 41 to 44, wherein the signal of the specified channel comprises a first-order ambisonics (FOA) signal, the FOA signal comprising an omnidirectional W signal and directional X, Y, and Z signals; and
the first determining sub-module is configured to:
determine the W signal based on the virtual speaker signal; and
determine the X signal, the Y signal, and the Z signal based on the residual signal and the W signal, or determine the X signal, the Y signal, and the Z signal based on the residual signal.
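Claim 45's two alternatives for forming the FOA channels can be sketched as below. This is an illustrative sketch, not the claimed implementation: how the decoder chooses between the two branches (here a plain `add_w` flag) and the exact residual-to-channel arithmetic are assumptions.

```python
# Illustrative sketch of the two alternatives in claim 45: W comes from the
# virtual speaker signal; X, Y, Z come either from the residuals plus W
# (add_w=True) or from the residuals alone (add_w=False). The additive
# combination used here is a hypothetical choice for illustration.

def foa_from_virtual_speaker(vs, residuals, add_w):
    w = list(vs)  # W signal determined from the virtual speaker signal
    if add_w:
        # X, Y, Z determined based on the residual signals and the W signal
        x, y, z = ([r + wi for r, wi in zip(res, w)] for res in residuals)
    else:
        # X, Y, Z determined based on the residual signals alone
        x, y, z = (list(res) for res in residuals)
    return w, x, y, z

# Example: two-sample frame, residuals coded relative to W.
w, x, y, z = foa_from_virtual_speaker(
    [2.0, 4.0], [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], add_w=True
)
```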
46. The apparatus of any one of claims 40 to 45, further comprising:
a first decoding module, configured to decode the code stream according to the first decoding scheme to obtain the reconstructed HOA signal of the current frame if the decoding scheme of the current frame is the first decoding scheme; and
a second decoding module, configured to decode the code stream according to the second decoding scheme to obtain the reconstructed HOA signal of the current frame if the decoding scheme of the current frame is the second decoding scheme.
47. The apparatus of claim 46, wherein the second decoding module comprises:
a first obtaining sub-module, configured to obtain an initial HOA signal from the code stream according to the second decoding scheme;
a gain adjustment sub-module, configured to perform gain adjustment on a high-order part of the initial HOA signal according to a high-order gain of a previous frame of the current frame if a decoding scheme of the previous frame is the third decoding scheme; and
a second obtaining sub-module, configured to obtain the reconstructed HOA signal based on a low-order part of the initial HOA signal and the gain-adjusted high-order part.
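The gain-adjusted scheme transition in claim 47 can be sketched as follows. This is an illustrative sketch under stated assumptions: the split point between the low-order and high-order parts (here the four FOA channels) and the per-channel scalar gain application are illustration choices, not details given by the claim.

```python
# Illustrative sketch of claim 47: when the previous frame used the hybrid
# (third) scheme, the high-order part of the initial HOA signal decoded with
# the second scheme is scaled by the previous frame's high-order gain to
# smooth the switch; the low-order part passes through unchanged.

FOA_CHANNELS = 4  # assumed low-order part: the first-order (FOA) channels

def smooth_transition(initial_hoa, prev_scheme, prev_high_order_gain):
    """initial_hoa: list of channels, each a list of samples (low order first)."""
    low = initial_hoa[:FOA_CHANNELS]
    high = initial_hoa[FOA_CHANNELS:]
    if prev_scheme == "third":
        # apply the previous frame's high-order gain to the high-order channels
        high = [[prev_high_order_gain * s for s in ch] for ch in high]
    return low + high

# Example: six-channel frame (4 low-order + 2 high-order), previous frame hybrid.
frame = [[1.0], [1.0], [1.0], [1.0], [2.0], [4.0]]
out = smooth_transition(frame, "third", 0.5)
```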
48. The apparatus of any one of claims 40 to 47, wherein the first obtaining module comprises:
a first parsing sub-module, configured to parse a value of a switch flag of the current frame from the code stream;
a second parsing sub-module, configured to parse, from the code stream, indication information of the decoding scheme of the current frame if the value of the switch flag is a first value, where the indication information indicates that the decoding scheme of the current frame is the first decoding scheme or the second decoding scheme; and
a third determining sub-module, configured to determine that the decoding scheme of the current frame is the third decoding scheme if the value of the switch flag is a second value.
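The switch-flag signaling of claim 48 can be sketched as a two-step parse. This is an illustrative sketch: the concrete bit values chosen for the first and second values, and the one-bit encoding of the indication information, are assumptions for illustration.

```python
# Illustrative sketch of claim 48: parse a switch flag first; only when it
# equals the first value is the indication information parsed to select the
# first or second scheme, otherwise the third (hybrid) scheme is implied
# without any further bits. Bit assignments below are hypothetical.

FIRST_VALUE, SECOND_VALUE = 0, 1

def parse_decoding_scheme(bits):
    it = iter(bits)
    switch_flag = next(it)
    if switch_flag == FIRST_VALUE:
        indication = next(it)  # indication info present only in this branch
        return "first" if indication == 0 else "second"
    return "third"  # switch flag == SECOND_VALUE: hybrid scheme

scheme = parse_decoding_scheme([1])
```

Note the asymmetry: the hybrid case costs one bit, while the explicit cases cost two.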
49. The apparatus of any one of claims 40 to 47, wherein the first obtaining module comprises:
a third parsing sub-module, configured to parse, from the code stream, indication information of the decoding scheme of the current frame, where the indication information indicates that the decoding scheme of the current frame is the first decoding scheme, the second decoding scheme, or the third decoding scheme.
50. The apparatus of any one of claims 40 to 47, wherein the first obtaining module comprises:
a fourth parsing sub-module, configured to parse an initial decoding scheme of the current frame from the code stream, where the initial decoding scheme is the first decoding scheme or the second decoding scheme;
a fourth determining sub-module, configured to determine that the decoding scheme of the current frame is the initial decoding scheme of the current frame if the initial decoding scheme of the current frame is the same as an initial decoding scheme of a previous frame of the current frame; and
a fifth determining sub-module, configured to determine that the decoding scheme of the current frame is the third decoding scheme if the initial decoding scheme of the current frame is the first decoding scheme and the initial decoding scheme of the previous frame is the second decoding scheme, or if the initial decoding scheme of the current frame is the second decoding scheme and the initial decoding scheme of the previous frame is the first decoding scheme.
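The inference rule of claim 50 can be sketched as below: the code stream carries only the initial scheme, and the decoder derives the hybrid (third) scheme on any frame where the initial scheme changes. This is an illustrative sketch; the treatment of the very first frame (no previous frame) is an assumption.

```python
# Illustrative sketch of claim 50: the effective scheme equals the parsed
# initial scheme when it matches the previous frame's initial scheme, and is
# the third (hybrid) scheme on a first<->second transition frame.

def effective_scheme(initial, prev_initial):
    if initial == prev_initial:
        return initial
    return "third"  # transition frame: decoded with the hybrid scheme

# Example: run over a sequence of parsed initial schemes.
schemes = []
prev = None
for init in ["first", "first", "second", "second", "first"]:
    # assumption: the first frame simply uses its parsed initial scheme
    schemes.append(init if prev is None else effective_scheme(init, prev))
    prev = init  # claim 50 compares against the previous frame's INITIAL scheme
```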
51. An encoding-side device, comprising a memory and a processor, wherein
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory to implement the encoding method according to any one of claims 1 to 14.
52. A decoding-side device, comprising a memory and a processor, wherein
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory to implement the decoding method according to any one of claims 15 to 25.
53. A computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 25.
54. A computer program product comprising instructions that, when executed by a processor, implement the method according to any one of claims 1 to 25.
CN202111155384.0A 2021-09-29 2021-09-29 Encoding and decoding method, device, equipment, storage medium and computer program product Pending CN115881140A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202111155384.0A CN115881140A (en) 2021-09-29 2021-09-29 Encoding and decoding method, device, equipment, storage medium and computer program product
TW111135552A TW202333139A (en) 2021-09-29 2022-09-20 Encoding/decoding method, apparatus, device, storage medium, and computer program product
PCT/CN2022/120495 WO2023051368A1 (en) 2021-09-29 2022-09-22 Encoding and decoding method and apparatus, and device, storage medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111155384.0A CN115881140A (en) 2021-09-29 2021-09-29 Encoding and decoding method, device, equipment, storage medium and computer program product

Publications (1)

Publication Number Publication Date
CN115881140A true CN115881140A (en) 2023-03-31

Family

ID=85756476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111155384.0A Pending CN115881140A (en) 2021-09-29 2021-09-29 Encoding and decoding method, device, equipment, storage medium and computer program product

Country Status (3)

Country Link
CN (1) CN115881140A (en)
TW (1) TW202333139A (en)
WO (1) WO2023051368A1 (en)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2003208517A1 (en) * 2003-03-11 2004-09-30 Nokia Corporation Switching between coding schemes
US8824553B2 (en) * 2003-05-12 2014-09-02 Google Inc. Video compression method
JP4977157B2 (en) * 2009-03-06 2012-07-18 株式会社エヌ・ティ・ティ・ドコモ Sound signal encoding method, sound signal decoding method, encoding device, decoding device, sound signal processing system, sound signal encoding program, and sound signal decoding program
KR101379261B1 (en) * 2009-09-17 2014-04-02 연세대학교 산학협력단 A method and an apparatus for processing an audio signal
EP2665208A1 (en) * 2012-05-14 2013-11-20 Thomson Licensing Method and apparatus for compressing and decompressing a Higher Order Ambisonics signal representation
EP2963949A1 (en) * 2014-07-02 2016-01-06 Thomson Licensing Method and apparatus for decoding a compressed HOA representation, and method and apparatus for encoding a compressed HOA representation
EP3067887A1 (en) * 2015-03-09 2016-09-14 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio encoder for encoding a multichannel signal and audio decoder for decoding an encoded audio signal
CN109215668B (en) * 2017-06-30 2021-01-05 华为技术有限公司 Method and device for encoding inter-channel phase difference parameters
JP7261807B2 (en) * 2018-02-01 2023-04-20 フラウンホーファー-ゲゼルシャフト・ツール・フェルデルング・デル・アンゲヴァンテン・フォルシュング・アインゲトラーゲネル・フェライン Acoustic scene encoder, acoustic scene decoder and method using hybrid encoder/decoder spatial analysis

Also Published As

Publication number Publication date
WO2023051368A1 (en) 2023-04-06
TW202333139A (en) 2023-08-16

Similar Documents

Publication Publication Date Title
US20230137053A1 (en) Audio Coding Method and Apparatus
CN110603585A (en) Hierarchical intermediate compression of audio data for higher order stereo surround
CN114067810A (en) Audio signal rendering method and device
US20200020342A1 (en) Error concealment for audio data using reference pools
CN111149157A (en) Spatial relationship coding of higher order ambisonic coefficients using extended parameters
WO2021213128A1 (en) Audio signal encoding method and apparatus
US20230298600A1 (en) Audio encoding and decoding method and apparatus
US20230105508A1 (en) Audio Coding Method and Apparatus
KR20220062621A (en) Spatial audio parameter encoding and related decoding
US20230145725A1 (en) Multi-channel audio signal encoding and decoding method and apparatus
CN114008705A (en) Performing psychoacoustic audio encoding and decoding based on operating conditions
WO2023051368A1 (en) Encoding and decoding method and apparatus, and device, storage medium and computer program product
EP3818730A1 (en) Energy-ratio signalling and synthesis
WO2023051367A1 (en) Decoding method and apparatus, and device, storage medium and computer program product
AU2021388397A1 (en) Audio encoding/decoding method and device
WO2023051370A1 (en) Encoding and decoding methods and apparatus, device, storage medium, and computer program
WO2022242534A1 (en) Encoding method and apparatus, decoding method and apparatus, device, storage medium and computer program
WO2022012554A1 (en) Multi-channel audio signal encoding method and apparatus
WO2022258036A1 (en) Encoding method and apparatus, decoding method and apparatus, and device, storage medium and computer program
JP2024518846A (en) Method and apparatus for encoding three-dimensional audio signals, and encoder
CA3221992A1 (en) Three-dimensional audio signal processing method and apparatus
CN115346537A (en) Audio coding and decoding method and device
KR20240013221A (en) 3D audio signal processing method and device
CN116582697A (en) Audio transmission method, device, terminal, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination