WO2023051370A1

WO2023051370A1 - Encoding and decoding methods and apparatus, device, storage medium, and computer program

Info

Publication number: WO2023051370A1
Application number: PCT/CN2022/120507
Authority: WO
Inventors: 刘帅; 高原; 王宾; 王喆
Original assignee: 华为技术有限公司
Priority date: 2021-09-29
Filing date: 2022-09-22
Publication date: 2023-04-06
Also published as: AR127171A1; CN115881139A

Abstract

Encoding and decoding methods and apparatus, a device, a storage medium, and a computer program, belonging to the technical field of three-dimensional audio encoding and decoding. The encoding method comprises: performing transient detection on signals of M channels comprised in a time-domain three-dimensional audio signal of a current frame, so as to obtain M transient detection results corresponding to the M channels, M being an integer greater than 1 (601); determining a global transient detection result on the basis of the M transient detection results (602); on the basis of the global transient detection result, converting the time-domain three-dimensional audio signal of the current frame into a frequency-domain three-dimensional audio signal (603); on the basis of the global transient detection result, performing spatial encoding on the frequency-domain three-dimensional audio signal of the current frame, so as to obtain a spatial encoding parameter and frequency-domain signals of N transmission channels, N being an integer greater than or equal to 1 and less than or equal to M (604); encoding the frequency-domain signals of the N transmission channels on the basis of the global transient detection result, so as to obtain a frequency-domain signal encoding result (605); and encoding the spatial encoding parameter, so as to obtain a spatial encoding parameter encoding result, and writing the spatial encoding parameter encoding result and the frequency-domain signal encoding result into a bitstream (606). In this way, the complexity of encoding can be reduced, and the efficiency of encoding is increased.

Description

Encoding and decoding method, device, equipment, storage medium and computer program

This application claims the priority of the Chinese patent application with the application number 202111155355.4 and the title of the invention "encoding and decoding method, device, equipment, storage medium and computer program" filed on September 29, 2021, the entire contents of which are incorporated by reference in In this application.

technical field

The present application relates to the technical field of three-dimensional audio coding and decoding, and in particular to a coding and decoding method, device, equipment, storage medium and computer program.

Background technique

Three-dimensional audio technology is an audio technology that acquires, processes, transmits, and renders sound events and three-dimensional sound field information in the real world through computers and signal processing. In order to achieve a better audio auditory effect, a 3D audio signal usually needs to include a large amount of data, so as to record spatial information of a sound scene in more detail. However, there are difficulties in the process of transmission and storage of a large amount of data, so it is necessary to encode and decode the 3D audio signal.

Higher order ambisonics (HOA) audio technology is a three-dimensional audio technology, because it has no relationship with speaker layout in the recording, encoding and playback stages, and HOA format data has the characteristics of rotatable playback, so HOA signals have higher flexibility in playback, and thus have received more widespread attention.

Related technologies propose a method for encoding HOA signals. In this method, the time-domain HOA signal is first subjected to time-frequency transformation to obtain the frequency-domain HOA signal, and the frequency-domain HOA signal is spatially encoded to obtain multiple The frequency domain signal of the channel. Afterwards, the time-frequency inverse transform is performed on the frequency domain signals of each channel to obtain the time domain signals of each channel, and the transient detection is performed on the time domain signals of each channel to obtain the transient detection results of each channel. Then, the time-frequency transform is performed on the time-domain signals of each channel again to obtain the frequency-domain signals of each channel, and the frequency-domain signals of each channel are encoded by using the transient detection results of each channel.

However, in the above method, it is necessary to perform multiple transformations on the audio signal between the time domain and the frequency domain, thereby increasing the coding complexity and further reducing the coding efficiency.

Contents of the invention

Embodiments of the present application provide an encoding and decoding method, device, device, storage medium, and computer program, which can reduce encoding complexity and improve encoding efficiency. Described technical scheme is as follows:

In the first aspect, a coding method is provided, which performs transient detection on the signals of M channels included in the time-domain three-dimensional audio signal of the current frame, so as to obtain M transient detection results corresponding to the M channels, where M is greater than An integer of 1; based on the M transient detection results, determine the global transient detection result; based on the global transient detection results, convert the three-dimensional audio signal in the time domain into a three-dimensional audio signal in the frequency domain; The three-dimensional audio signal is spatially encoded to obtain spatial encoding parameters and frequency domain signals of N transmission channels, where N is an integer greater than or equal to 1 and less than or equal to M; based on the global transient detection results, the frequency domain signals of N transmission channels Encoding the signal in the frequency domain to obtain the encoding result of the frequency domain signal; encoding the spatial encoding parameters to obtain the encoding result of the spatial encoding parameter; writing the encoding result of the spatial encoding parameter and the encoding result of the frequency domain signal into the code stream.

The transient detection result includes a transient flag, or, the transient detection result includes a transient flag and transient position information. The transient flag is used to indicate whether the signal of the corresponding channel is a transient signal, and the transient position information is used to indicate the position where the transient occurs in the signal of the corresponding channel. There are many ways to determine the M transient detection results corresponding to the M channels, and one of the ways will be introduced next. Since the determination method of the transient state detection result corresponding to each of the M channels is the same, the method for determining the transient state detection result corresponding to the channel is introduced next by taking one of the channels as an example. For ease of description, this channel is referred to as a target channel, and the transient flag and transient position information of the target channel will be respectively introduced next.

The method for determining the transient flag of the target channel includes: determining a transient detection parameter corresponding to the target channel based on a signal of the target channel. Based on the transient detection parameters corresponding to the target channel, the transient flag corresponding to the target channel is determined.

As an example, the transient detection parameter corresponding to the target channel is the absolute value of the energy difference between frames. That is, the energy of the signal of the target channel in the current frame and the energy of the signal of the target channel in the previous frame of the current frame are determined. The absolute value of the difference between the energy of the signal of the target channel in the current frame and the energy of the signal of the target channel in the previous frame is determined to obtain the absolute value of the inter-frame energy difference. If the absolute value of the inter-frame energy difference exceeds the first energy difference threshold, determine that the transient flag corresponding to the target channel in the current frame is the first value, otherwise, determine that the transient flag corresponding to the target channel in the current frame is the second value .

As another example, the transient detection parameter corresponding to the target channel is the absolute value of the subframe energy difference. That is, the signal of the target channel in the current frame includes signals of multiple subframes, the absolute value of the subframe energy difference corresponding to each subframe in the multiple subframes is determined, and then the transient flag corresponding to each subframe is determined. If there is a subframe whose transient flag is the first value in the multiple subframes, determine that the transient flag corresponding to the target channel in the current frame is the first value. If there is no subframe whose transient flag is the first value among the multiple subframes, determine that the transient flag corresponding to the target channel in the current frame is the second value.

The method for determining the transient location information of the target channel includes: determining the transient location information corresponding to the target channel based on the transient flag corresponding to the target channel.

As an example, if the transient flag corresponding to the target channel is the first value, determine the transient position information corresponding to the target channel. If the transient flag corresponding to the target channel is the second value, it is determined that the target channel does not have corresponding transient position information, or, the transient position information corresponding to the target channel is set to a preset value, such as -1.

In some embodiments, the transient detection result includes a transient flag, the global transient detection result includes a global transient flag, and the transient flag is used to indicate whether the signal of the corresponding channel is a transient signal. Determining the global transient detection result based on the M transient detection results, including: if the number of transient flags with the first value among the M transient flags is greater than or equal to m, then determining that the global transient flag is the first value , m is a positive integer greater than 0 and less than M. Alternatively, if the number of channels that satisfy the first preset condition among the M channels and whose corresponding transient flag is the first value is greater than or equal to n, then determine that the global transient flag is the first value, and n is greater than 0 and less than M positive integer.

Wherein, m and n are preset values, and m and n can also be adjusted according to different requirements. In the case where the three-dimensional audio signal is an HOA signal, the first preset condition includes a channel belonging to a first-order ambisonics (first-order ambisonics, FOA) signal, for example, the channel of the FOA signal may include the first 4 channels of the HOA signal. channels. In other words, in the case where the 3D audio signal is an HOA signal, if the number of channels whose corresponding transient flag is the first value in the channel of the FOA signal in the 3D audio signal of the current frame is greater than or equal to n, determine the global The transient flag is the first value. Of course, the first preset condition can also be other conditions.

In some other embodiments, the transient detection result further includes transient position information, the global transient detection result further includes global transient position information, and the transient position information is used to indicate the position where the transient occurs in the signal of the corresponding channel. Determine the global transient detection result based on the M transient detection results, including: if only one of the M transient flags is the first value, the transient state corresponding to the channel whose transient flag is the first value The location information is determined as global transient location information. If there are at least two transient flags with the first value among the M transient flags, the transient position information corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to the at least two transient flags is determined as the global Transient location information.

Or if at least two of the M transient flags are the first value, and the gap between the transient position information corresponding to the two channels is smaller than the position difference threshold, then the transient position information corresponding to the two channels The average value of the position information is determined as the global transient position information. The position difference threshold is set in advance, and the position difference threshold can be adjusted according to different requirements.

Based on the above description, the transient detection parameter corresponding to the channel is the absolute value of the energy difference between frames or the absolute value of the energy difference between subframes. In the case where the transient detection parameter corresponding to the channel is the absolute value of the inter-frame energy difference, one channel corresponds to an absolute value of the inter-frame energy difference, at this time, the corresponding inter-frame energy difference can be selected from the at least two channels The channel with the largest absolute value, and then determine the transient position information corresponding to the selected channel as the global transient position information. In the case where the transient detection parameter corresponding to the channel is the absolute value of the subframe energy difference, one channel corresponds to the absolute value of the energy difference of multiple subframes, at this time, the corresponding subframe energy can be selected from the at least two channels The channel with the largest absolute value of the difference is selected, and then the transient position information corresponding to the selected channel is determined as the global transient position information.

Optionally, converting the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the global transient detection result includes: determining a target encoding parameter based on the global transient detection result, where the target encoding parameter includes a window function type of the current frame and/or or the frame type of the current frame. The time-domain three-dimensional audio signal is converted into a frequency-domain three-dimensional audio signal based on the target encoding parameters.

As an example, the global transient detection result includes a global transient flag. The implementation process of determining the window function type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, determining the first preset window function type as the window function type of the current frame. If the global transient flag is the second value, the type of the second preset window function is determined as the type of the window function of the current frame. Wherein, the window length of the first preset window function is smaller than the window length of the second preset window function.

As another example, the global transient detection result includes global transient flags and global transient location information. The implementation process of determining the window function type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, then determining the window function type of the current frame based on the global transient position information. If the global transient flag is the second value, the type of the third preset window function is determined as the type of the window function of the current frame, or the type of the window function of the current frame is determined based on the type of the window function of the previous frame of the current frame.

Since the global transient detection result may only include the global transient flag, it may also include the global transient flag and the global transient position information, and the global transient position information may be the transient position corresponding to the channel whose transient flag is the first value information, and possibly preset values. In the case of different global transient detection results, the method of determining the frame type of the current frame is different. Therefore, the following three cases will be described separately:

In the first case, the global transient detection result includes a global transient flag. The implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, then determining that the frame type of the current frame is the first type, and the first type is used to indicate that the current frame includes multiple short frame. If the global transient flag is the second value, it is determined that the frame type of the current frame is the second type, and the second type is used to indicate that the current frame includes a long frame.

In the second case, the global transient detection result includes the global transient flag and the global transient position information. The implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient flag is the first value and the global transient position information satisfies the second preset condition, then determining that the frame type of the current frame is the third type , the third type is used to indicate that the current frame includes multiple ultrashort frames. If the global transient flag is the first value and the global transient position information does not satisfy the second preset condition, then determine that the frame type of the current frame is the first type, and the first type is used to indicate that the current frame includes multiple short frames. If the global transient flag is the second value, it is determined that the frame type of the current frame is the second type, and the second type is used to indicate that the current frame includes a long frame. The frame length of ultra-short frame is less than that of short frame, and the frame length of short frame is less than that of long frame. The second preset condition may be that the distance between the transient occurrence position indicated by the global transient position information and the start position of the current frame is less than the frame length of the ultra-short frame, or that the distance between the transient occurrence position indicated by the global transient position information and the end position of the current frame The frame length is smaller than the thumb frame.

In the third case, the global transient detection result includes global transient position information. The implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient position information is a preset value, such as -1, then determine that the frame type of the current frame is the second type, and the second type is used for Indicates that the current frame includes a long frame. If the global transient position information is not a preset value and satisfies the second preset condition, it is determined that the frame type of the current frame is a third type, and the third type is used to indicate that the current frame includes multiple ultra-short frames. If the global transient position information is not a preset value and does not meet the second preset condition, then determine that the frame type of the current frame is the first type, and the first type is used to indicate that the current frame includes multiple short frames. The frame length of the ultra-short frame is smaller than the frame length of the short frame, and the frame length of the short frame is smaller than that of the long frame. The second preset condition may be that the distance between the transient occurrence position indicated by the global transient position information and the start position of the current frame is less than the frame length of the ultra-short frame, or that the distance between the transient occurrence position indicated by the global transient position information and the end position of the current frame The frame length is smaller than the thumb frame.

It should be noted that the window function type of the current frame is used to indicate the shape and length of the window function corresponding to the current frame, and the window function of the current frame is used to perform windowing processing on the time-domain three-dimensional audio signal of the current frame. The frame type of the current frame is used to indicate whether the current frame is a very short frame, a short frame or a long frame. The ultra-short frame, short frame and long frame can be distinguished based on the duration of the frame, and the specific duration can be set according to different requirements, which is not limited in this embodiment of the present application.

Based on the above description, the target coding parameters include the window function type of the current frame and/or the frame type of the current frame. That is, the target coding parameters include the window function type of the current frame, or the target coding parameters include the frame type of the current frame, or the target coding parameters include the window function type and the frame type of the current frame. When the parameters included in the target coding parameters are different, the process of converting the time-domain 3D audio signal of the current frame into the frequency-domain 3D audio signal based on the target coding parameters is different, so the following descriptions will be made respectively.

In the first case, the target coding parameters include the window function type of the current frame. In this case, windowing processing is performed on the time-domain three-dimensional audio signal of the current frame based on the window function indicated by the window function type of the current frame. Afterwards, the windowed three-dimensional audio signal is converted into a frequency-domain three-dimensional audio signal.

In the second case, the target coding parameters include the frame type of the current frame. In this case, if the frame type of the current frame is the first type, it indicates that the current frame includes multiple short frames, and at this time, the time domain 3D audio signal of each short frame included in the current frame is converted into a frequency domain 3D audio signal. If the frame type of the current frame is the second type, it indicates that the current frame includes a long frame. At this time, the time domain 3D audio signal of the long frame included in the current frame is directly converted into a frequency domain 3D audio signal. If the frame type of the current frame is the third type, it indicates that the current frame includes a plurality of ultrashort frames, and at this time, the time domain 3D audio signal of each ultrashort frame included in the current frame is converted into a frequency domain 3D audio signal.

In the third case, the target encoding parameters include the window function type and frame type of the current frame. In this case, if the frame type of the current frame is the first type, it indicates that the current frame includes a plurality of short frames. At this time, based on the window function indicated by the window function type of the current frame, each short frame included in the current frame The three-dimensional audio signals in the time domain are respectively subjected to windowing processing, and the three-dimensional audio signals in the time domain of each short frame after the windowing processing are converted into three-dimensional audio signals in the frequency domain. If the frame type of the current frame is the second type, it indicates that the current frame includes a long and short frame. At this time, based on the window function indicated by the window function type of the current frame, the time-domain three-dimensional audio signal of the long frame included in the current frame is added. window processing, and convert the time-domain three-dimensional audio signal of the windowed long frame into a frequency-domain three-dimensional audio signal. If the frame type of the current frame is the third type, it indicates that the current frame includes multiple ultrashort frames. At this time, based on the window function indicated by the window function type of the current frame, the temporal three-dimensional The audio signals are respectively subjected to windowing processing, and the three-dimensional audio signals in the time domain of each ultrashort frame after the windowing processing are converted into three-dimensional audio signals in the frequency domain.

In some embodiments, the target encoding parameters can also be encoded to obtain an encoding result of the target encoding parameters. Write the encoding result of the target encoding parameters into the code stream.

In some embodiments, performing spatial encoding on the frequency-domain three-dimensional audio signal based on the global transient detection result includes: performing spatial encoding on the frequency-domain three-dimensional audio signal based on the frame type.

When spatially encoding the frequency-domain three-dimensional audio signal of the current frame based on the frame type of the current frame, if the frame type of the current frame is the first type, that is, the current frame includes multiple short frames, at this time, the current The three-dimensional audio signals in the frequency domain of multiple short frames included in the frame are interleaved to obtain a three-dimensional audio signal in the frequency domain of a long frame, and spatial coding is performed on the three-dimensional audio signal in the frequency domain of the long frame obtained after the interleaving. If the frame type of the current frame is the second type, that is, the current frame includes a long frame, at this time, the frequency domain 3D audio signal of the long frame is spatially encoded. If the frame type of the current frame is the third type, that is, the current frame includes multiple ultrashort frames, at this time, the frequency-domain three-dimensional audio signals of the multiple ultrashort frames included in the current frame are interleaved to obtain a long The three-dimensional audio signal in the frequency domain of the frame is spatially encoded on the three-dimensional audio signal in the frequency domain of the long frame obtained after interleaving.

In some embodiments, encoding the frequency-domain signals of the N transmission channels based on the global transient detection result includes: encoding the frequency-domain signals of the N transmission channels based on the frame type of the current frame.

As an example, the implementation process of encoding the frequency-domain signals of the N transmission channels includes: performing noise shaping processing on the frequency-domain signals of the N transmission channels based on the frame type of the current frame. The transmission channel downmixing process is performed on the frequency domain signals of the N transmission channels after the noise shaping process, to obtain the downmixed signal. Perform quantization and encoding processing on the low-frequency part of the downmixed signal, and write the encoding result into the code stream. Perform bandwidth expansion and encoding processing on the high-frequency part of the downmixed signal, and write the encoding result into the code stream.

Optionally, the method further includes: encoding the global transient detection result to obtain an encoding result of the global transient detection result. Write the encoding result of the global transient detection result into the code stream.

In the second aspect, a decoding method is provided, which analyzes the global transient detection result and the spatial encoding parameters from the code stream; decodes based on the global transient detection result and the code stream to obtain frequency domain signals of N transmission channels; Based on the global transient detection results and the spatial coding parameters, spatially decode the frequency-domain signals of N transmission channels to obtain the reconstructed frequency-domain 3D audio signals; based on the global transient detection results and the reconstructed frequency-domain 3D audio signals, determine Reconstructed time domain 3D audio signal.

Optionally, based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal, determining the reconstructed time-domain three-dimensional audio signal includes: determining a target encoding parameter based on the global transient detection result, and the target encoding parameter includes a window function of the current frame type and/or the frame type of the current frame; based on the target encoding parameters, converting the reconstructed frequency-domain 3D audio signal into a reconstructed time-domain 3D audio signal.

Based on the above description, the target coding parameters include the window function type of the current frame and/or the frame type of the current frame. That is, the target coding parameters include the window function type of the current frame, or the target coding parameters include the frame type of the current frame, or the target coding parameters include the window function type and the frame type of the current frame. When the parameters included in the target coding parameters are different, the process of converting the reconstructed frequency domain 3D audio signal into the reconstructed time domain 3D audio signal based on the target coding parameters is different, so the following will describe them respectively.

In the first case, the target coding parameters include the window function type of the current frame. In this case, based on the window function indicated by the window function type of the current frame, de-windowing processing is performed on the reconstructed three-dimensional audio signal in the frequency domain. Afterwards, the de-windowed frequency-domain 3D audio signal is converted into a reconstructed time-domain 3D audio signal.

Wherein, de-windowing processing is also referred to as windowing and splicing-add processing.

In the second case, the target coding parameters include the frame type of the current frame. In this case, if the frame type of the current frame is the first type, it indicates that the current frame includes multiple short frames. At this time, the reconstructed frequency-domain three-dimensional audio signal of each short frame is converted into a time-domain three-dimensional audio signal to obtain reconstruction 3D audio signal in the time domain. If the frame type of the current frame is the second type, it indicates that the current frame includes a long frame. At this time, the reconstructed frequency-domain three-dimensional audio signal of the long frame included in the current frame is directly converted into a time-domain three-dimensional audio signal to obtain a reconstructed time-domain audio signal. domain 3D audio signal. If the frame type of the current frame is the third type, it indicates that the current frame includes a plurality of ultrashort frames. At this time, the reconstructed frequency domain 3D audio signal of each ultrashort frame is converted into a time domain 3D audio signal to obtain a reconstructed time domain 3D audio signal.

In the third case, the target encoding parameters include the window function type and frame type of the current frame. In this case, if the frame type of the current frame is the first type, it indicates that the current frame includes a plurality of short frames. At this time, based on the window function indicated by the window function type of the current frame, each short frame included in the current frame The frequency-domain 3D audio signals are respectively subjected to de-windowing processing, and the reconstructed frequency-domain 3-D audio signals of each short frame after de-windowing processing are converted into time-domain 3-D audio signals to obtain reconstructed time-domain 3-D audio signals. If the frame type of the current frame is the second type, it indicates that the current frame includes a long and short frame. At this time, based on the window function indicated by the window function type of the current frame, the reconstructed frequency-domain three-dimensional audio signal of the long frame included in the current frame is performed. De-windowing processing, and converting the de-windowed long-frame frequency domain 3D audio signal into a time domain 3D audio signal, so as to obtain a reconstructed time domain 3D audio signal. If the frame type of the current frame is the third type, it indicates that the current frame includes a plurality of ultrashort frames. At this time, based on the window function indicated by the window function type of the current frame, the frequency domain three-dimensional The audio signals are respectively subjected to de-windowing processing, and the reconstructed frequency-domain 3D audio signals of each ultrashort frame after de-windowing processing are converted into time-domain 3-D audio signals to obtain reconstructed time-domain 3-D audio signals.

Optionally, the global transient detection result includes a global transient flag, and the target coding parameter includes a window function type of the current frame. Determine the target encoding parameters based on the global transient detection results, including: if the global transient flag is the first value, then determine the type of the first preset window function as the window function type of the current frame; if the global transient flag is the second value, the type of the second preset window function is determined as the window function type of the current frame; wherein, the window length of the first preset window function is smaller than the window length of the second preset window function.

Optionally, the global transient detection result includes the global transient flag and the global transient position information, and the target coding parameter includes the window function type of the current frame; the target coding parameter is determined based on the global transient detection result, including: if the global transient flag is the first value, the window function type of the current frame is determined based on the global transient position information.

In a third aspect, an encoding device is provided, and the encoding device has a function of implementing the behavior of the encoding method in the first aspect above. The encoding device includes at least one module, and the at least one module is used to implement the encoding method provided in the first aspect above.

In a fourth aspect, a decoding device is provided, and the decoding device has the function of realizing the behavior of the decoding method in the second aspect above. The decoding device includes at least one module, and the at least one module is used to implement the decoding method provided by the second aspect above.

According to a fifth aspect, an encoding device is provided, and the encoding device includes a processor and a memory, and the memory is used to store a program for executing the encoding method provided in the first aspect above. The processor is configured to execute the program stored in the memory, so as to implement the encoding method provided in the first aspect above.

Optionally, the encoding end device may further include a communication bus, which is used to establish a connection between the processor and the memory.

In a sixth aspect, a decoding end device is provided, the decoding end device includes a processor and a memory, and the memory is used to store a program for executing the decoding method provided in the second aspect above. The processor is configured to execute the program stored in the memory, so as to implement the decoding method provided by the second aspect above.

Optionally, the decoding device may further include a communication bus, which is used to establish a connection between the processor and the memory.

In the seventh aspect, there is provided a computer-readable storage medium, and instructions are stored in the storage medium, and when the instructions are run on a computer, the computer is made to execute the steps of the encoding method described in the first aspect above, or execute The steps of the decoding method described in the second aspect above.

In the eighth aspect, there is provided a computer program product containing instructions, which, when the instructions are run on a computer, cause the computer to execute the steps of the encoding method described in the above-mentioned first aspect, or perform the decoding described in the above-mentioned second aspect method steps. Or in other words, a computer program is provided, and when the computer program is executed, the steps of the encoding method described in the above-mentioned first aspect are realized, or the steps of the decoding method described in the above-mentioned second aspect are realized.

A ninth aspect provides a computer-readable storage medium, where the computer-readable storage medium includes the code stream obtained by the encoding method described in the first aspect.

The technical effects obtained in the third aspect, the fourth aspect, the fifth aspect, the sixth aspect, the seventh aspect, the eighth aspect and the ninth aspect are the same as the technical effects obtained by the corresponding technical means in the first aspect or the second aspect Approximate, no more details here.

The technical solution provided by the application at least includes the following beneficial effects:

A global transient detection result is determined by performing transient detection on signals of M channels included in the time-domain three-dimensional audio signal of the current frame. Afterwards, based on the global transient detection results, sequentially perform time-frequency transformation of the audio signal, spatial coding, and coding of the frequency domain signals of each transmission channel, especially when encoding the frequency domain signals of each transmission channel obtained after spatial coding, The encoding of the frequency-domain signals of each transmission channel is guided by the global transient detection results. It is not necessary to convert the frequency-domain signals of each transmission channel to the time domain to determine the corresponding transient detection results of each transmission channel, and then there is no need to convert the three-dimensional audio The signal undergoes multiple transformations between the time domain and the frequency domain, thereby reducing coding complexity and improving coding efficiency.

Description of drawings

FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an implementation environment of a terminal scenario provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of an implementation environment of a transcoding scenario of a wireless or core network device provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of an implementation environment of a broadcast television scene provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of an implementation environment of a virtual reality streaming scene provided by an embodiment of the present application;

Fig. 6 is a flow chart of the first encoding method provided by the embodiment of the present application;

Fig. 7 is an exemplary block diagram of the first encoding method shown in Fig. 6 provided by the embodiment of the present application;

FIG. 8 is an exemplary block diagram of a second encoding method shown in FIG. 6 provided by an embodiment of the present application;

FIG. 9 is a flow chart of the first decoding method provided by the embodiment of the present application;

FIG. 10 is an exemplary block diagram of the decoding method shown in FIG. 9 provided by an embodiment of the present application;

Fig. 11 is a flow chart of the second encoding method provided by the embodiment of the present application;

Fig. 12 is an exemplary block diagram of the first encoding method shown in Fig. 11 provided by the embodiment of the present application;

FIG. 13 is an exemplary block diagram of the second encoding method shown in FIG. 11 provided by the embodiment of the present application;

Fig. 14 is a flow chart of the second decoding method provided by the embodiment of the present application;

Fig. 15 is an exemplary block diagram of a decoding method shown in Fig. 14 provided by an embodiment of the present application;

FIG. 16 is a schematic structural diagram of an encoding device provided by an embodiment of the present application;

FIG. 17 is a schematic structural diagram of a decoding device provided by an embodiment of the present application;

Fig. 18 is a schematic block diagram of a codec device provided by an embodiment of the present application.

Detailed ways

In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the following will further describe the embodiments of the present application in detail in conjunction with the accompanying drawings.

Before explaining in detail the encoding and decoding methods provided by the embodiments of the present application, terms and implementation environments involved in the embodiments of the present application are firstly introduced.

For ease of understanding, terms involved in the embodiments of the present application are explained first.

Encoding: refers to the process of compressing the audio signal to be encoded into a code stream. It should be noted that after the audio signal is compressed into a code stream, it may be referred to as an encoded audio signal or a compressed audio signal.

Decoding: refers to the process of restoring the coded stream into a reconstructed audio signal according to specific grammatical rules and processing methods.

Three-dimensional audio signal: a signal including multiple channels, which is used to characterize the sound field in a three-dimensional space, and may be one or a combination of HOA signals, multi-channel signals, and object audio signals. For the HOA signal, the number of channels of the 3D audio signal is related to the order of the 3D audio signal. For example, if the 3D audio signal is an A-level signal, the number of channels of the 3D audio signal is (A+1) ² .

The three-dimensional audio signal mentioned below may be any three-dimensional audio signal, for example, it may be one or a combination of HOA signals, multi-channel signals, and object audio signals.

Transient signal: It is used to characterize the transient phenomenon of the signal of the corresponding channel of the 3D audio signal. If the signal of a certain channel is a transient signal, it indicates that the signal of this channel is a non-stationary signal. For example, signals with large energy changes in a short period of time, such as the sound of drums and percussion instruments.

Next, the implementation environment involved in the embodiment of the present application will be introduced.

Please refer to FIG. 1 , which is a schematic diagram of an implementation environment provided by an embodiment of the present application. The implementation environment includes source device 10 , destination device 20 , link 30 and storage device 40 . Wherein, the source device 10 may generate an encoded 3D audio signal. Therefore, the source device 10 may also be called a three-dimensional audio signal encoding device. Destination device 20 may decode the encoded three-dimensional audio signal generated by source device 10. Therefore, the destination device 20 may also be referred to as a three-dimensional audio signal decoding device. Link 30 may receive an encoded 3D audio signal generated by source device 10 and may transmit the encoded 3D audio signal to destination device 20 . The storage device 40 can receive the encoded 3D audio signal generated by the source device 10, and can store the encoded 3D audio signal. Under such conditions, the destination device 20 can directly obtain the encoded 3D audio signal from the storage device 40. 3D audio signal. Alternatively, the storage device 40 may correspond to a file server or another intermediate storage device that may store the encoded three-dimensional audio signal generated by the source device 10, in which case the destination device 20 may communicate via streaming or downloading the storage device. 40 stored encoded three-dimensional audio signal.

Both the source device 10 and the destination device 20 may include one or more processors and a memory coupled to the one or more processors, and the memory may include random access memory (random access memory, RAM), read-only memory ( read-only memory, ROM), charged erasable programmable read-only memory (electrically erasable programmable read-only memory, EEPROM), flash memory, can be used to store the desired program in the form of instructions or data structures that can be accessed by the computer Any other media etc. of the code. For example, both source device 10 and destination device 20 may include desktop computers, mobile computing devices, notebook (e.g., laptop) computers, tablet computers, set-top boxes, telephone handsets such as so-called "smart" phones, Televisions, cameras, display devices, digital media players, video game consoles, in-vehicle computers, or the like.

Link 30 may include one or more media or devices capable of transmitting the encoded three-dimensional audio signal from source device 10 to destination device 20 . In one possible implementation, link 30 may include one or more communication media that enable source device 10 to transmit the encoded three-dimensional audio signal directly to destination device 20 in real time. In the embodiment of the present application, the source device 10 may modulate the encoded 3D audio signal based on a communication standard, such as a wireless communication protocol, etc., and may send the modulated 3D audio signal to the destination device 20 . The one or more communication media may include wireless and/or wired communication media, for example, the one or more communication media may include radio frequency (radio frequency, RF) spectrum or one or more physical transmission lines. The one or more communication media may form part of a packet-based network, such as a local area network, a wide area network, or a global network (eg, the Internet), among others. The one or more communication media may include routers, switches, base stations, or other devices that facilitate communication from the source device 10 to the destination device 20, etc., which are not specifically limited in this embodiment of the present application.

In a possible implementation manner, the storage device 40 may store the received encoded 3D audio signal sent by the source device 10, and the destination device 20 may directly obtain the encoded 3D audio signal from the storage device 40. . Under such conditions, the storage device 40 may include any one of a variety of distributed or locally accessed data storage media, for example, any one of the various distributed or locally accessed data storage media may be Hard disk drive, Blu-ray Disc, digital versatile disc (DVD), compact disc read-only memory (CD-ROM), flash memory, volatile or nonvolatile memory, or Any other suitable digital storage medium for storing encoded three-dimensional audio signals, etc.

In one possible implementation, the storage device 40 may correspond to a file server or another intermediate storage device that may save the encoded 3D audio signal generated by the source device 10, and the destination device 20 may store the encoded 3D audio signal via streaming or downloading. The device 40 stores the three-dimensional audio signal. The file server may be any type of server capable of storing and transmitting the encoded three-dimensional audio signal to the destination device 20 . In a possible implementation manner, the file server may include a network server, a file transfer protocol (file transfer protocol, FTP) server, a network attached storage (network attached storage, NAS) device, or a local disk drive. Destination device 20 may acquire the encoded three-dimensional audio signal over any standard data connection, including an Internet connection. Any standard data connection may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., a digital subscriber line (DSL), cable modem, etc.), or is suitable for obtaining encoded data stored on a file server. A combination of the two for 3D audio data. The transmission of the encoded three-dimensional audio signal from the storage device 40 may be a streaming transmission, a download transmission, or a combination of both.

The technology of the embodiment of the present application can be applied to the source device 10 shown in FIG. 1 that encodes the 3D audio signal, and can also be applied to the destination device 20 that decodes the encoded 3D audio signal.

In the implementation environment shown in FIG. 1 , the source device 10 includes a data source 120 , an encoder 100 and an output interface 140 . In some embodiments, output interface 140 may include a conditioner/demodulator (modem) and/or a transmitter, where a transmitter may also be referred to as a transmitter. Data source 120 may include an image capture device (e.g., video camera, etc.), an archive containing previously captured 3D audio signals, a feed-in interface for receiving 3D audio signals from a 3D audio signal content provider, and/or for generating 3D audio signals. A computer graphics system for audio signals, or a combination of these sources for three-dimensional audio signals.

The data source 120 may send a 3D audio signal to the encoder 100, and the encoder 100 may encode the received 3D audio signal sent by the data source 120 to obtain an encoded 3D audio signal. The encoder may send the encoded three-dimensional audio signal to the output interface. In some embodiments, source device 10 sends the encoded three-dimensional audio signal directly to destination device 20 via output interface 140 . In other embodiments, the encoded 3D audio signal may also be stored on storage device 40 for later retrieval by destination device 20 for decoding and/or display.

In the implementation environment shown in FIG. 1 , the destination device 20 includes an input interface 240 , a decoder 200 and a display device 220 . In some embodiments, input interface 240 includes a receiver and/or a modem. The input interface 240 can receive the encoded three-dimensional audio signal via the link 30 and/or from the storage device 40, and then send it to the decoder 200, and the decoder 200 can decode the received encoded three-dimensional audio signal to obtain the encoded three-dimensional audio signal. Decoded 3D audio signal. The decoder may transmit the decoded three-dimensional audio signal to the display device 220 . The display device 220 may be integrated with the destination device 20 or may be external to the destination device 20 . In general, the display device 220 displays the decoded 3D audio signal. The display device 220 can be any type of display device in various types, for example, the display device 220 can be a liquid crystal display (liquid crystal display, LCD), a plasma display, an organic light-emitting diode (organic light-emitting diode, OLED) monitor or other type of display device.

Although not shown in FIG. 1 , in some aspects encoder 100 and decoder 200 may be individually integrated with the encoder and decoder, and may include appropriate multiplexer-demultiplexer (multiplexer-demultiplexer) , MUX-DEMUX) unit or other hardware and software for encoding both audio and video in a common data stream or in separate data streams. In some embodiments, the MUX-DEMUX unit may conform to the ITU H.223 multiplexer protocol, or other protocols such as user datagram protocol (UDP), if applicable.

Each of the encoder 100 and the decoder 200 can be any one of the following circuits: one or more microprocessors, digital signal processing (digital signal processing, DSP), application specific integrated circuit (application specific integrated circuit, ASIC) ), field-programmable gate array (FPGA), discrete logic, hardware, or any combination thereof. If the techniques of the embodiments of the present application are implemented partially in software, the device may store instructions for the software in a suitable non-transitory computer-readable storage medium, and may use one or more processors in hardware The instructions are executed to implement the technology of the embodiments of the present application. Any of the foregoing (including hardware, software, a combination of hardware and software, etc.) may be considered to be one or more processors. Each of encoder 100 and decoder 200 may be included in one or more encoders or decoders, either of which may be integrated into a combined encoding in a corresponding device Part of a codec/decoder (codec).

Embodiments of the present application may generally refer to the encoder 100 as “signaling” or “sending” certain information to another device such as the decoder 200 . The term "signaling" or "sending" may generally refer to the transmission of syntax elements and/or other data for decoding a compressed three-dimensional audio signal. This transfer can occur in real time or near real time. Alternatively, this communication may occur after a period of time, such as upon encoding when storing syntax elements in an encoded bitstream to a computer-readable storage medium, which the decoding device may then perform after the syntax elements are stored on this medium The syntax element is retrieved at any time.

The codec method provided in the embodiment of the present application can be applied to various scenarios, and several scenarios among them will be introduced respectively next.

Please refer to FIG. 2 . FIG. 2 is a schematic diagram of an implementation environment in which a codec method provided by an embodiment of the present application is applied to a terminal scenario. The implementation environment includes a first terminal 101 and a second terminal 201 , and the first terminal 101 and the second terminal 201 are connected in communication. The communication connection may be a wireless connection or a wired connection, which is not limited in this embodiment of the present application.

Wherein, the first terminal 101 may be a sending end device or a receiving end device. Similarly, the second terminal 201 may be a receiving end device or a sending end device. In the case where the first terminal 101 is a sending end device, the second terminal 201 is a receiving end device, and in the case where the first terminal 101 is a receiving end device, the second terminal 201 is a sending end device.

Next, an introduction will be made by taking the first terminal 101 as a sending end device and the second terminal 201 as a receiving end device as an example.

The first terminal 101 may be the source device 10 in the implementation environment shown in FIG. 1 above. The second terminal 201 may be the destination device 20 in the implementation environment shown in FIG. 1 above. Wherein, both the first terminal 101 and the second terminal 201 include an audio collection module, an audio playback module, an encoder, a decoder, a channel encoding module and a channel decoding module.

The audio acquisition module in the first terminal 101 collects the 3D audio signal and transmits it to the encoder. The encoder encodes the 3D audio signal using the encoding method provided in the embodiment of the present application. The encoding may be called source encoding. Afterwards, in order to realize the transmission of the 3D audio signal in the channel, the channel coding module needs to perform channel coding again, and then transmit the coded stream through the wireless or wired network communication device in the digital channel.

The second terminal 201 receives the code stream transmitted in the digital channel through a wireless or wired network communication device, the channel decoding module performs channel decoding on the code stream, and then the decoder uses the decoding method provided by the embodiment of the present application to decode to obtain a three-dimensional audio signal, and then passes Audio playback module to play.

Among them, the first terminal 101 and the second terminal 201 can be any electronic product that can interact with the user through one or more ways such as keyboard, touch pad, touch screen, remote control, voice interaction or handwriting equipment, etc., Such as personal computer (personal computer, PC), mobile phone, smart phone, personal digital assistant (personal digital assistant, PDA), wearable device, PPC (pocket PC), tablet computer, smart car machine, smart TV, smart speaker wait.

Those skilled in the art should understand that the above-mentioned terminals are only examples, and other existing or future terminals that are applicable to this embodiment of the application should also be included in the scope of protection of this embodiment of the application, and are hereby referenced included here.

Please refer to FIG. 3 . FIG. 3 is a schematic diagram of an implementation environment in which a codec method provided by an embodiment of the present application is applied to a transcoding scenario of a wireless or core network device. The implementation environment includes a channel decoding module, an audio decoder, an audio encoder and a channel encoding module.

Wherein, the audio decoder may be a decoder using the decoding method provided by the embodiment of the present application, or may be a decoder using other decoding methods. The audio encoder may be an encoder using the encoding method provided by the embodiment of the present application, or may be an encoder using other encoding methods. In the case where the audio decoder is a decoder using the decoding method provided by the embodiment of the present application, the audio encoder is a coder using other encoding methods, and in the case where the audio decoder is a decoder using other decoding methods, the audio The encoder is an encoder using the encoding method provided by the embodiment of the present application.

In the first case, the audio decoder is a decoder using the decoding method provided by the embodiment of the present application, and the audio encoder is an encoder using other encoding methods.

At this time, the channel decoding module is used to perform channel decoding on the received code stream, and then the audio decoder is used to use the decoding method provided by the embodiment of the application to perform source decoding, and then the audio encoder is used to encode according to other encoding methods to achieve a The conversion from one format to another is known as transcoding. After that, it is sent after channel coding.

In the second case, the audio decoder is a decoder using other decoding methods, and the audio encoder is an encoder using the encoding method provided by the embodiment of the present application.

At this time, the channel decoding module is used to perform channel decoding on the received code stream, and then the audio decoder is used to use other decoding methods to perform source decoding, and then the audio encoder uses the encoding method provided by the embodiment of the application to perform encoding to realize a The conversion from one format to another is known as transcoding. After that, it is sent after channel coding.

Wherein, the wireless device may be a wireless access point, a wireless router, a wireless connector, and the like. A core network device may be a mobility management entity, a gateway, and the like.

Those skilled in the art should understand that the above-mentioned wireless devices or core network devices are only examples, and other existing or future wireless or core network devices that are applicable to this embodiment of the application should also be included in the protection of this embodiment of the application. scope, and is hereby incorporated by reference.

Please refer to FIG. 4 . FIG. 4 is a schematic diagram of an implementation environment in which a codec method provided by an embodiment of the present application is applied to a broadcast television scene. The broadcast TV scene is divided into a live scene and a post-production scene. For the live broadcast scene, the implementation environment includes a live program 3D sound production module, a 3D sound encoding module, a set-top box and a speaker group, and the set-top box includes a 3D sound decoding module. For post-production scenarios, the implementation environment includes post-program 3D sound production modules, 3D sound coding modules, network receivers, mobile terminals, earphones, and the like.

In the live broadcast scene, the 3D sound production module of the live program produces a 3D sound signal, and the 3D sound signal includes a 3D audio signal. The three-dimensional sound signal is encoded by applying the encoding method of the embodiment of the present application to obtain a code stream, and the code stream is transmitted to the user side through the broadcasting network, and is decoded by the three-dimensional sound decoder in the set-top box using the decoding method provided by the embodiment of the present application. The three-dimensional sound signal is thus reconstructed and played back by the loudspeaker group. Alternatively, the code stream is transmitted to the user side through the Internet, and the 3D sound decoder in the network receiver decodes it using the decoding method provided by the embodiment of the present application, so as to reconstruct the 3D sound signal and play it back by the speaker group. Or, the code stream is transmitted to the user side through the Internet, and the 3D sound decoder in the mobile terminal decodes it using the decoding method provided by the embodiment of the present application, thereby reconstructing the 3D sound signal, and playing it back by the earphone.

In the post-production scenario, the post-program 3D sound production module produces a 3D sound signal, and the 3D sound signal is encoded by applying the coding method of the embodiment of the application to obtain a code stream, and the code stream is transmitted to the user side through the radio and television network, and is transmitted by the set-top box The 3D sound decoder uses the decoding method provided by the embodiment of the present application to decode, thereby reconstructing the 3D sound signal, which is played back by the speaker group. Alternatively, the code stream is transmitted to the user side through the Internet, and the 3D sound decoder in the network receiver decodes it using the decoding method provided by the embodiment of the present application, so as to reconstruct the 3D sound signal and play it back by the speaker group. Or, the code stream is transmitted to the user side through the Internet, and the 3D sound decoder in the mobile terminal decodes it using the decoding method provided by the embodiment of the present application, thereby reconstructing the 3D sound signal, and playing it back by the earphone.

Please refer to FIG. 5 , which is a schematic diagram of an implementation environment in which a codec method provided by an embodiment of the present application is applied to a virtual reality streaming scene. The implementation environment includes an encoding end and a decoding end. The encoding end includes an acquisition module, a preprocessing module, an encoding module, a packaging module and a sending module, and the decoding end includes an unpacking module, a decoding module, a rendering module and earphones.

The acquisition module collects three-dimensional audio signals, and then performs preprocessing operations through the preprocessing module. The preprocessing operations include filtering out the low-frequency part of the signal, usually with 20Hz or 50Hz as the cut-off point, and extracting the orientation information in the signal. Then use the encoding module to perform encoding processing using the encoding method provided by the embodiment of the present application. After encoding, use the packing module to pack and send to the decoding end through the sending module.

The unpacking module at the decoding end first unpacks, and then uses the decoding method provided by the embodiment of the application to decode through the decoding module, and then performs binaural rendering processing on the decoded signal through the rendering module, and the rendered signal is mapped to the listener's earphones superior. The earphone can be an independent earphone, or an earphone on a virtual reality glasses device.

It should be noted that the system architecture and business scenarios described in the embodiments of the present application are for more clearly illustrating the technical solutions of the embodiments of the present application, and do not constitute limitations on the technical solutions provided by the embodiments of the present application. Those of ordinary skill in the art It can be seen that with the evolution of the system architecture and the emergence of new business scenarios, the technical solutions provided by the embodiments of the present application are also applicable to similar technical problems.

Next, the codec method provided by the embodiment of the present application is explained in detail. It should be noted that, in combination with the implementation environment shown in FIG. 1 , any of the following encoding methods may be executed by the encoder 100 in the source device 10 . Any of the following decoding methods may be performed by the decoder 200 in the destination device 20 .

Please refer to FIG. 6, which is a flowchart of the first encoding method provided by the embodiment of the present application. The encoding method is applied to an encoding end device, and includes the following steps.

Step 601: Perform transient detection on signals of M channels included in the time-domain three-dimensional audio signal of the current frame, to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1.

The M transient detection results are in one-to-one correspondence with the M channels included in the time-domain three-dimensional audio signal of the current frame. The transient detection result includes a transient flag, or, the transient detection result includes a transient flag and transient position information. The transient flag is used to indicate whether the signal of the corresponding channel is a transient signal, and the transient position information is used to indicate the position where the transient occurs in the signal of the corresponding channel.

There are many ways to determine the M transient detection results corresponding to the M channels, and one of the ways will be introduced next. Since the determination method of the transient state detection result corresponding to each of the M channels is the same, the method for determining the transient state detection result corresponding to the channel is introduced next by taking one of the channels as an example. For ease of description, this channel is referred to as a target channel, and the transient flag and transient position information of the target channel will be respectively introduced next.

Destination channel's transient flag

Based on the signal of the target channel, the transient detection parameters corresponding to the target channel are determined. Based on the transient detection parameters corresponding to the target channel, the transient flag corresponding to the target channel is determined.

Based on the above description, the transient flag is used to indicate whether the signal of the corresponding channel is a transient signal. Therefore, when the absolute value of the energy difference between frames exceeds the first energy difference threshold, it indicates that the signal of the target channel in the current frame is a transient signal, and at this time, determine that the transient flag corresponding to the target channel in the current frame is the first value. When the absolute value of the inter-frame energy difference does not exceed the first energy difference threshold, it indicates that the signal of the target channel in the current frame is not a transient signal. At this time, it is determined that the transient flag corresponding to the target channel in the current frame is the first binary value.

It should be noted that the first value and the second value can be expressed in various ways. For example, the first value is true, and the second value is flase. Alternatively, the first value is 1 and the second value is 0. Of course, it can also be expressed in other ways. The first energy difference threshold is preset, and the first energy difference threshold can be adjusted according to different requirements.

Wherein, the implementation manner of determining the transient state flag of each subframe in the multiple subframes is the same, so the i-th subframe among the multiple subframes is taken as an example for illustration, and i is a positive integer. That is, the energy of the signal of the i-th subframe and the energy of the signal of the i-1th subframe in the plurality of subframes are determined. Determine the absolute value of the difference between the energy of the signal of the i-th subframe and the energy of the signal of the i-1th subframe, so as to obtain the absolute value of the energy difference of the subframe corresponding to the i-th subframe. If the absolute value of the subframe energy difference corresponding to the i subframe exceeds the second energy difference threshold, then determine that the transient flag of the i subframe is the first value, otherwise, determine that the transient flag of the i subframe is the second value.

Based on the above description, the transient flag is used to indicate whether the signal of the corresponding channel is a transient signal. Therefore, when the absolute value of the energy difference of the subframe corresponding to the i-th subframe exceeds the second energy difference threshold, it indicates that the i-th subframe The signal of the subframe is a transient signal, and at this time, it is determined that the transient flag of the ith subframe is the first value. In the case that the absolute value of the energy difference of the subframe corresponding to the i subframe does not exceed the second energy difference threshold, it indicates that the signal of the i subframe is not a transient signal, and at this time, determine the transient flag of the i subframe is the second value.

It should be noted that, when i=0, the energy of the signal of the i-1th subframe is the energy of the signal of the last subframe of the target channel in the previous frame of the current frame. The second energy difference threshold is preset, and the second energy difference threshold can be adjusted according to different requirements. In addition, the second energy difference threshold may be the same as or different from the first energy difference threshold.

Transient position information of the target channel

Based on the transient flag corresponding to the target channel, the transient position information corresponding to the target channel is determined.

That is, when the transient flag corresponding to the target channel is the second value, it indicates that the signal of the target channel is not a transient signal. At this time, the transient detection result of the target channel does not include the transient position information, or the target channel is directly The transient position information corresponding to the channel is set as a preset value, and the preset value is used to indicate that the signal of the target channel is not a transient signal. That is, the transient detection result of the transient signal includes the transient flag and the transient location information, and the transient detection result of the non-transient signal may include the transient flag, and may also include the transient flag and the transient location information.

It should be noted that, in the case that the transient flag corresponding to the target channel is the first value, there are multiple ways to determine the transient position information corresponding to the target channel. As an example, the signal of the target channel in the current frame includes signals of a plurality of subframes, and the subframe whose transient flag is the first value and whose absolute value of the subframe energy difference is the highest is selected from the plurality of subframes, and the selected subframe The sequence number of the frame is determined as the transient position information corresponding to the target channel in the current frame.

For example, the transient flag corresponding to the target channel in the current frame is the first value, the signal of the target channel in the current frame includes signals of 4 subframes, and i=0, 1, 2, 3. The absolute value of the subframe energy difference of the 0th subframe is 18, the absolute value of the subframe energy difference of the first subframe is 21, the absolute value of the subframe energy difference of the second subframe is 24, and the absolute value of the subframe energy difference of the third subframe is The absolute value of the subframe energy difference is 35. Assuming that the preset second energy difference threshold is 20, the signal of the first subframe is a transient signal, the signal of the second subframe is a transient signal, and the signal of the third subframe is a transient signal. At this time, it is determined that the transient flags of the first subframe, the second subframe, and the third subframe are all the first values, and the subframe with the highest absolute value of the subframe energy difference among the three subframes is the third subframe , the sequence number 3 of the third subframe is determined as the transient position information corresponding to the target channel in the current frame.

Step 602: Determine a global transient detection result based on the M transient detection results.

In some embodiments, the global transient detection results include a global transient flag. If the number of transient flags with the first value among the M transient flags is greater than or equal to m, determine that the global transient flag is the first value, and m is a positive integer greater than 0 and less than M. Or, if the number of channels that satisfy the first preset condition and the corresponding transient flag is the first value among the M channels is greater than or equal to n, then determine that the global transient flag is the first value, and n is greater than 0 and less than M positive integer of .

For example, the 3D audio signal of the current frame is a third-order HOA signal, and the number of channels of the HOA signal is (3+1) ² , that is, 16. Assuming that m is 1, if the number of transient flags with the first value among the 16 transient flags is greater than or equal to 1, then the global transient flag is determined to be the first value. Alternatively, the first preset condition includes channels belonging to the FOA signal, for example, the channels of the FOA signal may include the first 4 channels in the HOA signal, and it is assumed that the channel that satisfies the first preset condition among the M channels is the current frame The channel where the FOA signal is located, n is 1. If the number of channels whose corresponding transient flag is the first value among the 16 channels belonging to the FOA is greater than or equal to 1, then it is determined that the global transient flag is the first value.

Wherein, m and n are preset values, and m and n can also be adjusted according to different requirements. In the case where the 3D audio signal is an HOA signal, the first preset condition includes channels belonging to the FOA signal, and the channel that satisfies the first preset condition among the M channels is where the FOA signal in the 3D audio signal of the current frame is located. channel, the FOA signal is the signal of the first 4 channels in the HOA signal, of course, the first preset condition can also be other conditions.

In other embodiments, the global transient detection result further includes global transient location information. If only one of the M transient flags is the first value, the transient position information corresponding to the channel whose transient flag is the first value is determined as the global transient position information. If there are at least two transient flags in the M transient flags as the first value, determine the transient position information corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to the at least two transient flags is the global transient position information, or if at least two of the M transient flags are the first value, and the gap between the transient position information corresponding to the two channels is smaller than the position difference threshold, then the The average value of the transient position information corresponding to the two channels is determined as the global transient position information. The position difference threshold is set in advance, and the position difference threshold can be adjusted according to different requirements.

For example, for a 3rd-order HOA signal, if only the transient flag corresponding to the third channel is the first value among the 16 transient flags of the HOA signal, then the transient position corresponding to the third channel can be directly set The information is determined as global transient position information.

If there are 3 transient flags with the first value among the 16 transient flags of the HOA signal, they are channel 1, channel 2 and channel 3 respectively. The transient position information corresponding to channel 1 is 1, the absolute value of the inter-frame energy difference corresponding to channel 1 is 22, the transient position information corresponding to channel 2 is 2, and the absolute value of the inter-frame energy difference corresponding to channel 2 is 23, The transient position information corresponding to channel 3 is 3, and the absolute value of the inter-frame energy difference corresponding to channel 3 is 28. Among the three channels, the channel with the largest absolute value of the inter-frame energy difference is channel 3, and then the transient position information 3 corresponding to channel 3 is determined as the global transient position information.

For another example, if there are 3 transient flags with the first value among the 16 transient flags of the HOA signal, they are channel 1, channel 2 and channel 3 respectively. The transient position information corresponding to channel 1 is 1, the signal of channel 1 includes three subframes, and the absolute values of the subframe energy differences corresponding to these three subframes are 20, 18, and 22 respectively, and the transient position information corresponding to channel 2 is 2. The signal of channel 2 includes three subframes. The absolute values of the energy differences of the subframes corresponding to these three subframes are 20, 23, and 25 respectively. The transient position information corresponding to channel 3 is 3. The signal of channel 3 includes three subframes frame, the absolute values of energy differences of subframes corresponding to these three subframes are 25, 28, and 30. Among the three channels, the channel with the largest absolute value of subframe energy difference is channel 3, and then the transient position information 3 corresponding to channel 3 is determined as the global transient position information.

If there are 3 transient flags with the first value among the 16 transient flags of the HOA signal, they are channel 1, channel 2 and channel 3 respectively. The transient position information corresponding to channel 1 is 1, the transient position information corresponding to channel 2 is 3, and the transient position information corresponding to channel 3 is 6. The gap 2 between the transient position information corresponding to channel 1 and channel 2 among the three channels is less than the preset position difference threshold 3, then the average value 2 of the transient position information corresponding to channel 1 and channel 2 is determined as the global Transient location information.

Step 603: Convert the time domain 3D audio signal of the current frame into a frequency domain 3D audio signal based on the global transient detection result.

In some embodiments, target encoding parameters are determined based on global transient detection results, where the target encoding parameters include the window function type of the current frame and/or the frame type of the current frame. The time-domain three-dimensional audio signal of the current frame is converted into a frequency-domain three-dimensional audio signal based on the target encoding parameter.

Wherein, when the global transient flag is the first value, there are multiple ways to determine the window function type of the current frame based on the global transient position information. For example, the type of the fourth preset window function is adjusted based on the global transient position information, so that the center position of the fourth preset window function corresponds to the global transient occurrence position, and then the value of the window function corresponding to the global transient occurrence position maximum. Alternatively, a window function corresponding to the location where the global transient occurs is selected from the window function set, and then the type of the selected window function is determined as the window function type of the current frame. That is to say, window functions corresponding to each transient occurrence position are stored in the window function set, so that the window function corresponding to the global transient occurrence position can be selected.

In addition, there are many methods for determining the window function type of the current frame based on the window function type of the previous frame of the current frame. For details, reference may be made to related technologies, which will not be elaborated in the embodiments of the present application.

In the second case, the global transient detection result includes the global transient flag and the global transient position information. The implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient flag is the first value and the global transient position information satisfies the second preset condition, then determining that the frame type of the current frame is the third type , the third type is used to indicate that the current frame includes multiple ultrashort frames. If the global transient flag is the first value and the global transient position information does not satisfy the second preset condition, then determine that the frame type of the current frame is the first type, and the first type is used to indicate that the current frame includes multiple short frames. If the global transient flag is the second value, it is determined that the frame type of the current frame is the second type, and the second type is used to indicate that the current frame includes a long frame. The frame length of the ultra-short frame is smaller than the frame length of the short frame, and the frame length of the short frame is smaller than that of the long frame. The second preset condition may be that the distance between the transient occurrence position indicated by the global transient position information and the start position of the current frame is less than the frame length of the ultra-short frame, or that the distance between the transient occurrence position indicated by the global transient position information and the end position of the current frame The frame length is smaller than the thumb frame.

The method of converting the time-domain three-dimensional audio signal of the current frame into the frequency-domain three-dimensional audio signal can be a modified discrete cosine transform (modified discrete cosine transform, MDCT), or a modified discrete sine transform (modified discrete sine transform, MDST) , can also be fast Fourier transform (fast fourier transform, FFT).

That is to say, in the case that the current frame includes a plurality of ultrashort frames and short frames, after converting the time-domain three-dimensional audio signal of the current frame into a frequency-domain three-dimensional audio signal, each ultrashort frame and short frame included in the current frame are obtained 3D audio signal in the frequency domain. In the case that the current frame includes a long frame, after converting the time-domain three-dimensional audio signal of the current frame into a frequency-domain three-dimensional audio signal, a frequency-domain three-dimensional audio signal of one long frame included in the current frame is obtained.

Step 604: Based on the global transient detection result, spatially encode the frequency domain 3D audio signal of the current frame to obtain spatial encoding parameters and frequency domain signals of N transmission channels, where N is greater than or equal to 1 and less than or equal to M integer.

In some embodiments, based on the frame type of the current frame, the frequency-domain three-dimensional audio signal of the current frame is spatially encoded to obtain spatial encoding parameters and frequency-domain signals of N transmission channels.

The method of spatial coding can be any method that can obtain spatial coding parameters and frequency domain signals of N transmission channels based on the frequency-domain three-dimensional audio signal of the current frame. For example, a spatial coding method of matching projection can be adopted. This application implements The example does not limit the spatial encoding method.

The spatial coding parameters refer to parameters determined during the process of spatially coding the frequency-domain 3D audio signal of the current frame, including side information, bit pre-allocation side information, and the like. The frequency-domain signals of the N transmission channels may include virtual speaker signals of one or more channels, and residual signals of one or more channels. In addition, when the number of coding bits is insufficient, the frequency domain signals of the N transmission channels may only include virtual speaker signals of one or more channels.

Step 605: Based on the global transient detection result, encode the frequency-domain signals of the N transmission channels to obtain a frequency-domain signal encoding result.

In some embodiments, the frequency domain signals of the N transmission channels are encoded based on the frame type of the current frame.

It should be noted that, for the manner of performing noise shaping processing based on the frame type of the current frame, reference may be made to related technologies, which will not be described in detail in the embodiment of the present application. Wherein, the noise shaping processing includes temporal noise shaping (temporal noise shaping, TNS) processing and frequency domain noise shaping (frequency domain noise shaping, FDNS) processing.

Wherein, when performing transmission channel downmix processing on the frequency domain signals of the N transmission channels after the noise shaping processing, the N transmission channels after the noise shaping processing may be paired according to a preset criterion, or the N transmission channels after the noise shaping processing may be paired according to the signal correlation degree The frequency-domain signals of the N transmission channels after noise shaping are paired. Then, mid-side (mid side, MS) downmix processing is performed based on the paired two channels of frequency domain signals.

For example, if the N transmission channels include 2 channels of virtual speaker signals and 4 channels of residual signals, the 2 channels of virtual speaker signals may be combined into a pair according to a preset criterion for downmix processing. It is also possible to determine the correlation between every 2 residual signals in the 4 residual signals, select the 2 residual signals with high correlation to form a pair, and the remaining 2 residual signals form a pair, and perform downmix processing respectively.

It should be noted that, when the paired two channels of frequency domain signals are downmixed, the result of the downmixing process may be one channel of frequency domain signals or two channels of frequency domain signals, depending on the encoding process.

Wherein, the low-frequency part and the high-frequency part of the signal can be divided in various ways. For example, with 2000 Hz as the cut-off point, the part of the downmixed signal whose frequency is less than 2000 Hz is regarded as the low frequency part of the signal, and the part of the downmixed signal whose frequency is greater than 2000 Hz is regarded as the high frequency part of the signal. For another example, with 5000 Hz as the cut-off point, the part of the downmixed signal whose frequency is less than 5000 Hz is taken as the low frequency part of the signal, and the part of the downmixed signal whose frequency is greater than 5000 Hz is taken as the high frequency part of the signal.

Step 606: Encode the spatial encoding parameters to obtain the encoding result of the spatial encoding parameters, and write the encoding result of the spatial encoding parameters and the encoding result of the frequency domain signal into the code stream.

Optionally, the global transient detection result may also be encoded to obtain an encoding result of the global transient detection result, and the encoding result of the global transient detection result may be written into the code stream. Alternatively, the target encoding parameter is encoded to obtain an encoding result of the target encoding parameter, and the encoding result of the target encoding parameter is written into the code stream.

In the embodiment of the present application, transient detection may be performed on signals of M channels included in the time-domain three-dimensional audio signal of the current frame, so as to determine a global transient detection result. Afterwards, based on the global transient detection results, sequentially perform time-frequency transformation of the audio signal, spatial coding, and coding of the frequency domain signals of each transmission channel, especially when encoding the frequency domain signals of each transmission channel obtained after spatial coding, The transient detection results of each transmission channel are multiplexed with the global transient detection results, and there is no need to convert the frequency domain signals of each transmission channel to the time domain to determine the corresponding transient detection results of each transmission channel, and then there is no need to convert the three-dimensional audio signal Multiple transformations are performed between the time domain and the frequency domain, thereby reducing coding complexity and improving coding efficiency. Moreover, the embodiment of the present application does not need to encode the transient detection results of each transmission channel, but only needs to encode the global transient detection results into the code stream, so that the number of coding bits can be reduced.

Please refer to FIG. 7 and FIG. 8 , both of which are block diagrams of an exemplary encoding method provided by an embodiment of the present application. FIG. 7 and FIG. 8 are mainly for exemplary explanation of the encoding method shown in FIG. 6 . In FIG. 7 , the signals of M channels included in the time-domain three-dimensional audio signal of the current frame are respectively subjected to transient detection to obtain M transient detection results corresponding to the M channels. Based on the M transient detection results, a global transient detection result is determined, and the global transient detection result is encoded to obtain an encoding result of the global transient detection result, and the encoding result of the global transient detection result is written into a code stream. Based on the global transient detection result, the time-domain three-dimensional audio signal of the current frame is converted into a frequency-domain three-dimensional audio signal. Based on the global transient detection result, perform spatial encoding on the frequency-domain three-dimensional audio signal of the current frame to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, and encode the spatial encoding parameters to obtain the encoding result of the spatial encoding parameters, Write the coding result of the spatial coding parameter and the coding result of the frequency domain signal into the code stream. Based on the global transient detection results, the frequency domain signals of the N transmission channels are encoded. Further, in FIG. 8 , the frequency-domain three-dimensional audio signal of the current frame is spatially encoded, and after obtaining the spatial encoding parameters and the frequency-domain signals of N transmission channels, the spatial encoding parameters are encoded to obtain the encoding result of the spatial encoding parameters , write the coding result of the spatial coding parameter and the coding result of the frequency domain signal into the code stream. Then, based on the global transient detection results, the frequency domain signals of the N transmission channels are subjected to noise shaping processing, transmission channel downmix processing, quantization and encoding processing, and bandwidth extension processing, and the encoding results of the signals after bandwidth extension processing Write code stream.

Based on the description in step 606 above, the encoder device may encode the global transient detection result into the code stream, or may not encode the global transient detection result into the code stream. Moreover, the encoding end device may also encode the target encoding parameters into the code stream, or may not encode the target encoding parameters into the code stream. In the case that the encoding end device encodes the global transient detection result into the code stream, the decoding end device can perform decoding according to the method shown in FIG. 9 below. When the encoder device encodes the target encoding parameters into the code stream, the decoder device can parse the target encoding parameters from the code stream, and then decode based on the frame type of the current frame included in the target encoding parameters. The specific implementation process is shown in Fig. The procedure in 9 is similar. Of course, the encoder device may not encode the global transient detection results into the code stream, nor encode the target encoding parameters into the code stream. In this case, the decoding process of the 3D audio signal can refer to related technologies. The embodiment of this application No elaboration on this.

Please refer to FIG. 9. FIG. 9 is a flow chart of the first decoding method provided by the embodiment of the present application. The method is applied to the decoding end and includes the following steps.

Step 901: Parse the global transient detection result and spatial coding parameters from the code stream.

Step 902: Decode based on the global transient detection result and the code stream to obtain frequency domain signals of N transmission channels.

In some embodiments, the frame type of the current frame is determined based on global transient detection results. Decoding is performed based on the frame type of the current frame and the code stream to obtain frequency domain signals of the N transmission channels.

For an implementation manner of determining the frame type of the current frame based on the global transient detection result, reference may be made to the relevant description in the above step 603 , which will not be repeated here. For an implementation manner of decoding based on the frame type of the current frame and the code stream, reference may be made to related technologies, and the embodiments of the present application will not describe in detail.

Step 903: Based on the global transient detection result and the spatial coding parameters, spatially decode the frequency-domain signals of the N transmission channels to obtain a reconstructed frequency-domain three-dimensional audio signal.

In some embodiments, the frequency domain signals of the N transmission channels are spatially decoded based on the frame type and spatial coding parameters of the current frame to obtain a reconstructed frequency domain 3D audio signal, and the frame type of the current frame is based on the global transient state The test result is confirmed. That is, the frame type of the current frame is determined based on the global transient detection result, and then, based on the frame type and spatial coding parameters of the current frame, the frequency domain signals of the N transmission channels are spatially decoded to obtain the reconstructed frequency domain 3D audio signal.

Wherein, based on the frame type and spatial coding parameters of the current frame, the implementation process of spatially decoding the frequency-domain signals of the N transmission channels may refer to related technologies, which will not be described in detail in the embodiments of the present application.

Step 904: Determine a reconstructed time domain 3D audio signal based on the global transient detection result and the reconstructed frequency domain 3D audio signal.

In some embodiments, target encoding parameters are determined based on global transient detection results, where the target encoding parameters include the window function type of the current frame and/or the frame type of the current frame. Based on the target encoding parameters, the reconstructed frequency domain 3D audio signal is converted into a reconstructed time domain 3D audio signal.

For an implementation manner of determining the target encoding parameter based on the global transient detection result, reference may be made to the relevant description in the above step 603 , which will not be repeated here.

In the first case, the target coding parameters include the window function type of the current frame. In this case, based on the window function indicated by the window function type of the current frame, de-windowing processing is performed on the reconstructed three-dimensional audio signal in the frequency domain. Afterwards, the de-windowed three-dimensional audio signal in the frequency domain is converted into a reconstructed three-dimensional audio signal in the time domain.

In the embodiment of the present application, the decoder parses the global transient detection results and spatial coding parameters from the code stream, so that the time-domain three-dimensional audio signal can be reconstructed based on the global transient detection results and spatial coding parameters without the The transient detection results of each transmission channel are analyzed in the middle, so that the decoding complexity can be reduced and the decoding efficiency can be improved. Moreover, without encoding the target coding parameters into the code stream, the target coding parameters can be directly determined based on the global transient detection results, thereby realizing the reconstruction of the three-dimensional audio signal in the time domain.

Please refer to FIG. 10 , which is a block diagram of an exemplary decoding method provided by an embodiment of the present application. FIG. 10 is mainly an exemplary explanation of the decoding method shown in FIG. 9 . In Fig. 10, the global transient detection results and spatial coding parameters are parsed from the code stream. Decoding is performed based on the global transient detection result and the code stream to obtain frequency domain signals of N transmission channels. Based on the global transient detection result and the spatial encoding parameters, the frequency domain signals of the N transmission channels are spatially decoded to obtain a reconstructed frequency domain 3D audio signal. Based on the global transient detection result and the reconstructed three-dimensional audio signal in the frequency domain, the reconstructed three-dimensional audio signal in the time domain is determined through de-windowing processing and inverse time-frequency transformation.

Please refer to FIG. 11 , which is a flowchart of the second encoding method provided by the embodiment of the present application. The encoding method is applied to an encoding end device, and includes the following steps.

Step 1101: Perform transient detection on signals of M channels included in the time-domain three-dimensional audio signal of the current frame, to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1.

For an implementation manner of determining the M transient detection results corresponding to the M channels, reference may be made to related descriptions in step 601 , which will not be repeated here.

Step 1102: Based on the M transient detection results, determine a global transient detection result.

For an implementation manner of determining global transient position information based on the M transient detection results, reference may be made to relevant descriptions in step 602 , which will not be repeated here.

Step 1103: Convert the time-domain 3D audio signal of the current frame into a frequency-domain 3D audio signal based on the global transient detection result.

Wherein, based on the global transient detection result, the method of converting the time-domain 3D audio signal of the current frame into the frequency-domain 3D audio signal can refer to the relevant description in step 603 , which will not be repeated here.

Step 1104: Based on the global transient detection result, perform spatial encoding on the frequency-domain 3D audio signal of the current frame to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, where N is greater than or equal to 1 and less than or equal to M integer.

Wherein, the implementation manner of performing spatial coding on the frequency-domain three-dimensional audio signal of the current frame based on the global transient detection result can refer to the related description in step 604 , which will not be repeated here.

Step 1105: Based on the M transient detection results, determine N transient detection results corresponding to the N transmission channels.

In some embodiments, based on the M transient flags, the N transmission channels include one or more transient flags of virtual loudspeaker signals determined according to a first preset rule. Based on the M transient flags, the N transmission channels include transient flags of residual signals of one or more channels according to a second preset rule.

As an example, the first preset rule includes: if the number of the first values among the M transient flags is greater than or equal to P, then the N transmission channels include virtual speaker signal transient flags of one or more channels are the first values. The second preset rule includes: if the quantity of the first value among the M transient flags is greater than or equal to Q, then the N transmission channels including one or more residual signal transient flags of the channels are all of the first value .

Wherein, both P and Q are positive integers smaller than M. P and Q are preset values, and P and Q can also be adjusted according to different requirements. Optionally, since the virtual speaker signal is used to record the real 3D audio signal and is more important than the residual signal, P is smaller than Q.

As another example, the first preset rule includes: if the number of the first values among the M transient flags is greater than or equal to P, then the N transmission channels include one or more channels corresponding to the virtual speaker signal The transient flags are all first values. The second preset rule includes: if the number of channels among the M channels that meet the first preset condition and whose corresponding transient flag is the first value is greater than or equal to R, then the N transmission channels include one or more channels The transient flags corresponding to the residual signal all have the first value.

Wherein, both P and R are positive integers smaller than M. P and R are preset values, and P and R can also be adjusted according to different requirements. In the case where the 3D audio signal is an HOA signal, the first preset condition includes channels belonging to the FOA signal, and the channel that satisfies the first preset condition among the M channels is where the FOA signal in the 3D audio signal of the current frame is located. channel, the FOA signal is the signal of the first 4 channels in the HOA signal, of course, the first preset condition can also be other conditions.

In some other embodiments, the N transient flags may also be determined based on the M transient flags and according to the mapping relationship between the M transient flags and the N transmission channels. Wherein, the mapping relationship is determined in advance.

For example, a certain transmission channel included in the N transmission channels is mapped to a certain number of channels in the M channels, if at least one of the M transient flags has a first value, then the N transmission The transient flag of this transmission channel of the channels is the first value.

It should be noted that step 1105 may be executed at any timing after step 1101 and before step 1106, and the embodiment of the present application does not limit the execution timing of step 1105.

Step 1106: Based on the N transient detection results, encode the frequency-domain signals of the N transmission channels to obtain a frequency-domain signal encoding result.

In some embodiments, based on the N transient detection results, the frame type corresponding to each of the N transmission channels is determined. Based on the frame type corresponding to each of the N transmission channels, the frequency domain signal corresponding to the N transmission channels is encoded.

Since the implementation manner of determining the frame type corresponding to each of the N transmission channels is the same, one of the transmission channels is taken as an example for description below. For ease of description, this transmission channel is referred to as a target transmission channel.

The implementation process of determining the frame type corresponding to the target transmission channel based on the transient detection result corresponding to the target transmission channel includes: if the transient flag corresponding to the target transmission channel is the first value, then determining that the frame type corresponding to the target transmission channel is the first type , the signal of the first type used to indicate the target transmission channel includes a plurality of short frames. If the transient flag corresponding to the target transmission channel is the second value, it is determined that the frame type corresponding to the target transmission channel is the second type, and the signal of the second type used to indicate the target transmission channel includes a long frame.

It should be noted that the frame type of the current frame is used to indicate whether the current frame is a short frame or a long frame. Wherein, the short frame and the long frame can be distinguished based on the duration of the frame, and the specific duration can be set according to different requirements, which is not limited in this embodiment of the present application.

After the frame type corresponding to each transmission channel is determined, noise shaping processing can be performed on the frequency domain signal of each transmission channel based on the frame type corresponding to each transmission channel. Afterwards, the frequency domain signals of the N transmission channels after the noise shaping process are subjected to transmission channel downmixing processing to obtain the downmixed signals. Perform quantization and encoding processing on the low-frequency part of the downmixed signal, and write the encoding result into the code stream. Perform bandwidth expansion and encoding processing on the high-frequency part of the downmixed signal, and write the encoding result into the code stream.

For details about noise shaping, transmission channel downmixing, quantization and encoding of low frequency parts, and bandwidth extension and encoding, reference may be made to relevant descriptions in step 605 , which will not be repeated here.

Step 1107: encode the spatial encoding parameters and the N transient detection results to obtain the encoding results of the spatial encoding parameters and the encoding results of the N transient detection results, and encode the encoding results of the spatial encoding parameters and the N transient detection results Write code stream.

In the embodiment of the present application, based on the M transient detection results corresponding to the M channels included in the 3D audio signal, the transient detection results corresponding to the virtual speaker signal included in each transmission channel and the residual signal are determined. In this way, the encoding accuracy can be improved when encoding the frequency domain signals of each transmission channel. Moreover, the transient detection results corresponding to each transmission channel are determined based on M transient detection results, and there is no need to convert the frequency domain signals of each transmission channel to the time domain to determine the transient detection results corresponding to each transmission channel, and then The three-dimensional audio signal does not need to be transformed multiple times between the time domain and the frequency domain, so that the coding complexity can be reduced and the coding efficiency can be improved.

Please refer to FIG. 12 and FIG. 13 . FIG. 12 and FIG. 13 are block diagrams of another exemplary encoding method provided by an embodiment of the present application. FIG. 12 and FIG. 13 mainly illustrate the encoding method shown in FIG. 11 . In FIG. 12 , the signals of M channels included in the time-domain three-dimensional audio signal of the current frame are respectively subjected to transient detection to obtain M transient detection results corresponding to the M channels. Based on the M transient detection results, a global transient detection result is determined, and the global transient detection result is encoded to obtain an encoding result of the global transient detection result, and the encoding result of the global transient detection result is written into a code stream. The time-domain three-dimensional audio signal of the current frame is converted into a frequency-domain three-dimensional audio signal based on the global transient detection result. Based on the global transient detection result, perform spatial encoding on the frequency-domain three-dimensional audio signal of the current frame to obtain the spatial encoding parameters and the frequency-domain signals of the N transmission channels, and encode the spatial encoding parameters to obtain the encoding result of the spatial encoding parameters , write the encoding result of the spatial encoding parameters into the code stream. Based on the M transient detection results, N transient detection results corresponding to the N transmission channels are determined, and the N transient detection results are encoded to obtain N transient detection result encoding results, and the N transient detection results are encoded. The encoding result of the transient detection result is written into the code stream. Based on the N transient detection results, the frequency domain signals of the N transmission channels are encoded. Further, in FIG. 13 , after the N transient detection results are determined, noise shaping processing is performed on the frequency-domain signals of the N transmission channels based on the N transient detection results. Then perform transmission channel downmixing processing, quantization and encoding processing, and bandwidth expansion processing on the frequency domain signals of each transmission channel processed by the noise shaping process, and write the encoding results of the signals after the bandwidth expansion processing into the code stream.

Based on the description in step 1107 above, the encoder device may encode the global transient detection result into the code stream, or may not encode the global transient detection result into the code stream. Moreover, the encoding end device may also encode the target encoding parameters into the code stream, or may not encode the target encoding parameters into the code stream. In the case that the encoding end device encodes the global transient detection result into the code stream, the decoding end device can perform decoding according to the method shown in Figure 14 below. When the encoder device encodes the target encoding parameters into the code stream, the decoder device can parse the target encoding parameters from the code stream, and then decode based on the frame type of the current frame included in the target encoding parameters. The specific implementation process is shown in Fig. The procedure in 14 is similar. Of course, the encoder device may not encode the global transient detection results into the code stream, nor encode the target encoding parameters into the code stream. In this case, the decoding process of the 3D audio signal can refer to related technologies. The embodiment of this application No elaboration on this.

Please refer to FIG. 14 . FIG. 14 is a flowchart of a second decoding method provided by an embodiment of the present application. The method is applied to a decoding end and includes the following steps.

Step 1401: Analyze the global transient detection result, N transient detection results corresponding to N transmission channels and spatial coding parameters from the code stream.

Step 1402: Decode based on the N transient detection results and the code stream to obtain frequency domain signals of the N transmission channels.

In some embodiments, the frame type corresponding to each transmission channel is determined based on the N transient detection results. Decoding is performed based on the frame type corresponding to each transmission channel and the code stream, so as to obtain frequency domain signals of the N transmission channels.

For an implementation manner of determining the frame type corresponding to each transmission channel based on the N transient detection results, reference may be made to the relevant description in the above step 1106 , which will not be repeated here. For an implementation manner of decoding based on the frame type corresponding to each transmission channel and the code stream, reference may be made to related technologies, and the embodiments of the present application will not describe in detail.

Step 1403: Based on the frequency domain signals and spatial coding parameters of the N transmission channels, perform spatial decoding on the frequency domain signals of the N transmission channels to obtain a reconstructed frequency domain 3D audio signal.

In some embodiments, the frame type corresponding to each transmission channel is determined based on the N transient detection results. Based on the frame type and spatial encoding parameters corresponding to each transmission channel, spatial decoding is performed on the frequency domain signals of the N transmission channels to obtain a reconstructed three-dimensional audio signal in the frequency domain.

Wherein, based on the frame type and spatial coding parameters corresponding to each transmission channel, the implementation process of spatially decoding the frequency domain signals of the N transmission channels can refer to related technologies, which will not be described in detail in the embodiment of the present application.

Step 1404: Determine a reconstructed time domain 3D audio signal based on the global transient detection result and the reconstructed frequency domain 3D audio signal.

For an implementation manner of determining the reconstructed time-domain 3D audio signal based on the global transient detection result and the reconstructed frequency-domain 3D audio signal, reference may be made to relevant descriptions in step 904 , which will not be repeated here.

In the embodiment of the present application, the decoding end parses the global transient detection result, the transient detection result corresponding to each transmission channel, and the spatial coding parameters from the code stream. In this way, when decoding is performed based on the transient detection results corresponding to each transmission channel, the frequency domain signal of each transmission channel can be accurately obtained. Moreover, without encoding the target coding parameters into the code stream, the target coding parameters can be directly determined based on the global transient detection results, thereby realizing the reconstruction of the three-dimensional audio signal in the time domain.

Please refer to FIG. 15 , which is a block diagram of another exemplary decoding method provided by an embodiment of the present application. FIG. 15 is mainly an exemplary explanation of the decoding method shown in FIG. 14 . In FIG. 15 , the global transient detection result, N transient detection results corresponding to N transmission channels, and spatial coding parameters are analyzed from the code stream. Decoding is performed based on the N transient detection results and the code stream to obtain frequency domain signals of the N transmission channels. Based on the frequency-domain signals of the N transmission channels and the spatial coding parameters, spatial decoding is performed on the frequency-domain signals of the N transmission channels to obtain a reconstructed frequency-domain three-dimensional audio signal. Based on the global transient detection result and the reconstructed frequency domain 3D audio signal, a reconstructed time domain 3D audio signal is determined.

Figure 16 is a schematic structural diagram of an encoding device provided by an embodiment of the present application. The encoding device can be implemented by software, hardware, or a combination of the two to become part or all of the encoding end device. The encoding end device can be as shown in Figure 1 source device. Referring to FIG. 16 , the device includes: a transient detection module 1601 , a determination module 1602 , a conversion module 1603 , a spatial encoding module 1604 , a first encoding module 1605 , a second encoding module 1606 , and a first writing module 1607 .

The transient detection module 1601 is configured to perform transient detection on the signals of M channels included in the time-domain three-dimensional audio signal of the current frame, so as to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1 . For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.

A determining module 1602, configured to determine a global transient detection result based on the M transient detection results. For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.

The conversion module 1603 is configured to convert the three-dimensional audio signal in the time domain into a three-dimensional audio signal in the frequency domain based on the global transient detection result. For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.

The spatial encoding module 1604 is configured to perform spatial encoding on the frequency-domain three-dimensional audio signal based on the global transient detection result to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, where N is greater than or equal to 1 and less than or equal to M an integer of . For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.

The first coding module 1605 is configured to code the frequency-domain signals of the N transmission channels based on the global transient detection result, so as to obtain a frequency-domain signal coding result. For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.

The second encoding module 1606 is configured to encode the spatial encoding parameters to obtain an encoding result of the spatial encoding parameters. For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.

The first writing module 1607 is configured to write the coding result of the spatial coding parameter and the coding result of the frequency domain signal into the code stream. For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.

Optionally, the conversion module 1603 includes:

A determining unit, configured to determine a target encoding parameter based on a global transient detection result, where the target encoding parameter includes a window function type of the current frame and/or a frame type of the current frame;

The converting unit is configured to convert the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the target coding parameter.

Optionally, the global transient detection result includes a global transient flag, and the target encoding parameter includes a window function type of the current frame;

Identify units specifically for:

If the global transient flag is the first value, then determining the type of the first preset window function as the window function type of the current frame;

If the global transient flag is the second value, then determining the type of the second preset window function as the window function type of the current frame;

Wherein, the window length of the first preset window function is smaller than the window length of the second preset window function.

Optionally, the global transient detection result includes global transient flags and global transient position information, and the target encoding parameters include the window function type of the current frame;

Identify units specifically for:

If the global transient flag is the first value, the window function type of the current frame is determined based on the global transient position information.

Optionally, the device also includes:

The third encoding module is configured to encode the target encoding parameters to obtain an encoding result of the target encoding parameters;

The second writing module is used to write the encoding result of the target encoding parameter into the code stream.

Optionally, the spatial encoding module 1604 is specifically configured to:

Based on the frame type, the frequency-domain three-dimensional audio signal is spatially encoded.

Optionally, the first encoding module 1605 is specifically configured to:

Based on the frame type of the current frame, the frequency domain signals of the N transmission channels are encoded.

Optionally, the transient detection result includes a transient flag, the global transient detection result includes a global transient flag, and the transient flag is used to indicate whether the signal of the corresponding channel is a transient signal;

The determining module 1602 is specifically used for:

If the number of transient flags that are the first value among the M transient flags is greater than or equal to m, then determine that the global transient flag is the first value, and m is a positive integer greater than 0 and less than M; or

If the number of channels that satisfy the first preset condition among the M channels and whose corresponding transient flag is the first value is greater than or equal to n, then determine that the global transient flag is the first value, and n is a positive integer greater than 0 and less than M .

Optionally, the transient detection result further includes transient position information, the global transient detection result further includes global transient position information, and the transient position information is used to indicate the position where the transient occurs in the signal of the corresponding channel;

The determining module 1602 is specifically used for:

If only one of the M transient flags is the first value, the transient position information corresponding to the channel whose transient flag is the first value is determined as the global transient position information;

If there are at least two transient flags with the first value among the M transient flags, the transient position information corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to the at least two transient flags is determined as the global Transient location information.

Optionally, the device also includes:

The fourth encoding module is used to encode the global transient detection result to obtain the global transient detection result encoding result;

The third writing module is used to write the encoding result of the global transient detection result into the code stream.

It should be noted that: when the encoding device provided in the above-mentioned embodiments performs encoding, it only uses the division of the above-mentioned functional modules as an example for illustration. In practical applications, the above-mentioned function allocation can be completed by different functional modules based on needs. The internal structure of the system is divided into different functional modules to complete all or part of the functions described above. In addition, the encoding device and the encoding method embodiments provided in the above embodiments belong to the same idea, and the specific implementation process thereof is detailed in the method embodiments, and will not be repeated here.

Fig. 17 is a schematic structural diagram of a decoding device provided by an embodiment of the present application. The decoding device can be implemented by software, hardware or a combination of the two to become part or all of the decoding end device. The decoding end device can be as shown in Fig. 1 destination device. Referring to FIG. 17 , the device includes: an analysis module 1701 , a decoding module 1702 , a spatial decoding module 1703 and a determination module 1704 .

The parsing module 1701 is configured to parse out the global transient detection result and spatial coding parameters from the code stream. For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.

The decoding module 1702 is configured to perform decoding based on the global transient detection result and code stream, so as to obtain frequency domain signals of N transmission channels. For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.

The spatial decoding module 1703 is used to spatially decode the frequency-domain signals of the N transmission channels based on the global transient detection result and the spatial coding parameters, so as to obtain a reconstructed frequency-domain three-dimensional audio signal. For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.

A determining module 1704, configured to determine a reconstructed time domain 3D audio signal based on the global transient detection result and the reconstructed frequency domain 3D audio signal. For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.

Optionally, the determining module 1704 includes:

The converting unit is configured to convert the reconstructed frequency domain 3D audio signal into a reconstructed time domain 3D audio signal based on the target encoding parameters.

Identify units specifically for:

It should be noted that when the decoding device provided in the above-mentioned embodiments performs decoding, the division of the above-mentioned functional modules is used as an example for illustration. In practical applications, the above-mentioned function allocation can be completed by different functional modules based on needs, namely, the device The internal structure of the system is divided into different functional modules to complete all or part of the functions described above. In addition, the decoding device and the decoding method embodiments provided in the above embodiments belong to the same idea, and the specific implementation process thereof is detailed in the method embodiments, and will not be repeated here.

Fig. 18 is a schematic block diagram of a codec device 1800 used in an embodiment of the present application. Wherein, the codec apparatus 1800 may include a processor 1801 , a memory 1802 and a bus system 1803 . Among them, the processor 1801 and the memory 1802 are connected through the bus system 1803, the memory 1802 is used to store instructions, and the processor 1801 is used to execute the instructions stored in the memory 1802 to perform various encoding or decoding described in the embodiments of this application method. To avoid repetition, no detailed description is given here.

In the embodiment of the present application, the processor 1801 can be a central processing unit (central processing unit, CPU), and the processor 1801 can also be other general-purpose processors, DSP, ASIC, FPGA or other programmable logic devices, discrete gates Or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.

The memory 1802 may include a ROM device or a RAM device. Any other suitable type of storage device may also be used as memory 1802. Memory 1802 may include code and data 18021 accessed by processor 1801 using bus 1803 . The memory 1802 may further include an operating system 18023 and an application program 18022, where the application program 18022 includes at least one program that allows the processor 1801 to execute the encoding or decoding method described in the embodiment of this application. For example, the application program 18022 may include applications 1 to N, which further include an encoding or decoding application (codec application for short) that executes the encoding or decoding method described in the embodiment of this application.

The bus system 1803 may include not only a data bus, but also a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are labeled as bus system 1803 in the figure.

Optionally, the codec apparatus 1800 may also include one or more output devices, such as a display 1804 . In one example, display 1804 may be a touch-sensitive display that incorporates a display with a haptic unit operable to sense touch input. The display 1804 may be connected to the processor 1801 via the bus 1803 .

It should be noted that the codec device 1800 may implement the encoding method in the embodiment of the present application, and may also implement the decoding method in the embodiment of the present application.

Those of skill in the art would appreciate that the functions described in conjunction with the various illustrative logical blocks, modules, and algorithm steps disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions described by the various illustrative logical blocks, modules, and steps may be stored or transmitted as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to tangible media, such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another (eg, based on a communication protocol) . In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium, such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this application. A computer program product may include a computer readable medium.

By way of example and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage, flash memory, or any other medium that can contain the desired program code in the form of a computer and can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable Wire, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of media. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, DVD and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

can be processed by one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuits. device to execute instructions. Accordingly, the term "processor," as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. Additionally, in some aspects, the functionality described by the various illustrative logical blocks, modules, and steps described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or in conjunction with into the combined codec. Also, the techniques may be fully implemented in one or more circuits or logic elements. In one example, various illustrative logical blocks, units, and modules in the encoder 100 and the decoder 200 may be understood as corresponding circuit devices or logic elements.

The techniques of embodiments of the present application may be implemented in a wide variety of devices or devices, including a wireless handset, an integrated circuit (IC), or a group of ICs (eg, a chipset). Various components, modules or units are described in the embodiments of the present application to emphasize the functional aspects of the apparatus for performing the disclosed technology, but they do not necessarily need to be realized by different hardware units. Indeed, as described above, the various units may be combined in a codec hardware unit in conjunction with suitable software and/or firmware, or by interoperating hardware units (comprising one or more processors as described above) to supply.

That is to say, in the above embodiments, all or part of them may be implemented by software, hardware, firmware or any combination thereof. When implemented using software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part. The computer can be a general purpose computer, a special purpose computer, a computer network or other programmable devices. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from a website, computer, server or data center Transmission to another website site, computer, server or data center by wired (eg coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (eg infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or may be a data storage device such as a server or a data center integrated with one or more available media. The available medium may be a magnetic medium (for example: floppy disk, hard disk, magnetic tape), an optical medium (for example: digital versatile disc (digital versatile disc, DVD)) or a semiconductor medium (for example: solid state disk (solid state disk, SSD)) wait. It should be noted that the computer-readable storage medium mentioned in the embodiment of the present application may be a non-volatile storage medium, in other words, may be a non-transitory storage medium.

It should be understood that the "plurality" mentioned herein refers to two or more than two. In the description of the embodiments of this application, unless otherwise specified, "/" means or, for example, A/B can mean A or B; "and/or" in this article is only a description of the association of associated objects A relationship means that there may be three kinds of relationships, for example, A and/or B means: A exists alone, A and B exist simultaneously, and B exists alone. In addition, in order to clearly describe the technical solutions of the embodiments of the present application, in the embodiments of the present application, words such as "first" and "second" are used to distinguish the same or similar items with basically the same function and effect. Those skilled in the art can understand that words such as "first" and "second" do not limit the number and execution order, and words such as "first" and "second" do not necessarily limit the difference.

It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and All signals are authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data need to comply with relevant laws, regulations and standards of relevant countries and regions. For example, the time-domain three-dimensional audio signal and code stream involved in the embodiments of the present application are obtained under the condition of sufficient authorization.

The above-mentioned embodiments provided by the application are not intended to limit the application. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the application shall be included in the protection scope of the application. Inside.

Claims

An encoding method, characterized in that the method comprises:

Performing transient detection on signals of M channels included in the time-domain three-dimensional audio signal of the current frame, respectively, to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1;

Determine a global transient detection result based on the M transient detection results;

converting the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the global transient detection result;

Based on the global transient detection result, spatially encode the frequency-domain three-dimensional audio signal to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, where N is greater than or equal to 1 and less than or equal to the an integer of M;

Encoding the frequency-domain signals of the N transmission channels based on the global transient detection result to obtain a frequency-domain signal encoding result;

Encoding the spatial encoding parameters to obtain an encoding result of the spatial encoding parameters;

Writing the coding result of the spatial coding parameter and the coding result of the frequency domain signal into a code stream.
The method according to claim 1, wherein said converting the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the global transient detection result comprises:

determining a target encoding parameter based on the global transient detection result, where the target encoding parameter includes a window function type of the current frame and/or a frame type of the current frame;

converting the time domain 3D audio signal into the frequency domain 3D audio signal based on the target encoding parameters.
The method according to claim 2, wherein the global transient detection result includes a global transient flag, and the target encoding parameter includes a window function type of the current frame;

The determining target encoding parameters based on the global transient detection results includes:

If the global transient flag is the first value, then determining the type of the first preset window function as the window function type of the current frame;

If the global transient flag is the second value, then determining the type of the second preset window function as the window function type of the current frame;

Wherein, the window length of the first preset window function is smaller than the window length of the second preset window function.
The method according to claim 2, wherein the global transient detection result includes a global transient flag and global transient position information, and the target coding parameters include the window function type of the current frame;

The determining target encoding parameters based on the global transient detection results includes:

If the global transient flag is the first value, then determine the window function type of the current frame based on the global transient position information.
The method according to any one of claims 2-4, wherein the method further comprises:

Encoding the target encoding parameters to obtain an encoding result of the target encoding parameters;

Writing the encoding result of the target encoding parameter into the code stream.
The method according to any one of claims 2-5, wherein the spatial encoding of the frequency-domain three-dimensional audio signal based on the global transient detection result comprises:

Based on the frame type, spatially encode the frequency-domain three-dimensional audio signal.
The method according to any one of claims 2-6, wherein said encoding the frequency-domain signals of said N transmission channels based on said global transient detection result comprises:

Encoding the frequency domain signals of the N transmission channels based on the frame type of the current frame.
The method according to any one of claims 1-7, wherein the transient detection result includes a transient flag, the global transient detection result includes a global transient flag, and the transient flag is used to indicate the corresponding Whether the signal of the channel is a transient signal;

The determining the global transient detection result based on the M transient detection results includes:

If the number of transient flags with the first value among the M transient flags is greater than or equal to m, then determine that the global transient flag is the first value, and the m is a positive value greater than 0 and less than the M an integer; or

If the number of channels among the M channels that meet the first preset condition and whose corresponding transient flag is the first value is greater than or equal to n, then determine that the global transient flag is the first value, and the n is greater than 0 and a positive integer less than the M.
The method according to claim 8, wherein the transient detection result further includes transient position information, the global transient detection result further comprises global transient position information, and the transient position information is used to indicate The position where the transient occurs in the signal of the corresponding channel;

The determining the global transient detection result based on the M transient detection results includes:

If only one of the M transient flags is the first value, then determine the transient position information corresponding to the channel whose transient flag is the first value as the global transient position information;

If there are at least two transient flags in the M transient flags as the first value, the transient position corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to the at least two transient flags The information is determined as the global transient location information.
The method according to any one of claims 1-9, wherein the method further comprises:

Encoding the global transient detection result to obtain an encoding result of the global transient detection result;

Writing the encoding result of the global transient detection result into the code stream.
A decoding method, characterized in that the method comprises:

Analyze the global transient detection results and spatial encoding parameters from the code stream;

Decoding based on the global transient detection result and the code stream to obtain frequency domain signals of N transmission channels;

Based on the global transient detection result and the spatial coding parameters, spatially decode the frequency domain signals of the N transmission channels to obtain a reconstructed frequency domain three-dimensional audio signal;

Based on the global transient detection result and the reconstructed frequency domain 3D audio signal, determine a reconstructed time domain 3D audio signal.
The method according to claim 11, wherein said determining a reconstructed time-domain three-dimensional audio signal based on said global transient detection result and said reconstructed frequency-domain three-dimensional audio signal comprises:

determining a target encoding parameter based on the global transient detection result, where the target encoding parameter includes a window function type of the current frame and/or a frame type of the current frame;

Converting the reconstructed frequency-domain 3D audio signal into the reconstructed time-domain 3D audio signal based on the target encoding parameters.
The method according to claim 12, wherein the global transient detection result includes a global transient flag, and the target coding parameter includes a window function type of the current frame;

The determining target encoding parameters based on the global transient detection results includes:

If the global transient flag is the first value, then determining the type of the first preset window function as the window function type of the current frame;

If the global transient flag is the second value, then determining the type of the second preset window function as the window function type of the current frame;

Wherein, the window length of the first preset window function is smaller than the window length of the second preset window function.
The method according to claim 12, wherein the global transient detection result includes a global transient flag and global transient position information, and the target coding parameters include the window function type of the current frame;

The determining target encoding parameters based on the global transient detection results includes:

If the global transient flag is the first value, then determine the window function type of the current frame based on the global transient position information.
An encoding device, characterized in that the device comprises:

The transient detection module is used to perform transient detection on the signals of M channels included in the time-domain three-dimensional audio signal of the current frame, so as to obtain M transient detection results corresponding to the M channels, and the M is greater than an integer of 1;

A determining module, configured to determine a global transient detection result based on the M transient detection results;

A conversion module, configured to convert the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the global transient detection result;

A spatial encoding module, configured to perform spatial encoding on the frequency-domain three-dimensional audio signal based on the global transient detection result, to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, where N is greater than or equal to 1 and an integer less than or equal to said M;

A first encoding module, configured to encode the frequency-domain signals of the N transmission channels based on the global transient detection result, so as to obtain a frequency-domain signal encoding result;

A second encoding module, configured to encode the spatial encoding parameters to obtain an encoding result of the spatial encoding parameters;

The first writing module is configured to write the coding result of the spatial coding parameter and the coding result of the frequency domain signal into a code stream.
The device according to claim 15, wherein the conversion module comprises:

A determining unit, configured to determine a target encoding parameter based on the global transient detection result, where the target encoding parameter includes a window function type of the current frame and/or a frame type of the current frame;

A conversion unit, configured to convert the time-domain 3D audio signal into the frequency-domain 3D audio signal based on the target encoding parameter.
The device according to claim 16, wherein the global transient detection result includes a global transient flag, and the target encoding parameter includes a window function type of the current frame;

The determining unit is specifically used for:

If the global transient flag is the first value, then determining the type of the first preset window function as the window function type of the current frame;

If the global transient flag is the second value, then determining the type of the second preset window function as the window function type of the current frame;

Wherein, the window length of the first preset window function is smaller than the window length of the second preset window function.
The device according to claim 16, wherein the global transient detection result includes a global transient flag and global transient position information, and the target coding parameters include a window function type of the current frame;

The determining unit is specifically used for:

If the global transient flag is the first value, then determine the window function type of the current frame based on the global transient position information.
The device according to any one of claims 16-18, wherein the device further comprises:

A third encoding module, configured to encode the target encoding parameters to obtain an encoding result of the target encoding parameters;

The second writing module is used for writing the coding result of the target coding parameter into the code stream.
The device according to any one of claims 16-19, wherein the spatial encoding module is specifically used for:

Based on the frame type, spatially encode the frequency-domain three-dimensional audio signal.
The device according to any one of claims 16-20, wherein the first encoding module is specifically used for:

Encoding the frequency domain signals of the N transmission channels based on the frame type of the current frame.
The device according to any one of claims 15-21, wherein the transient detection result includes a transient flag, the global transient detection result includes a global transient flag, and the transient flag is used to indicate the corresponding Whether the signal of the channel is a transient signal;

The determination module is specifically used for:

If the number of transient flags with the first value among the M transient flags is greater than or equal to m, then determine that the global transient flag is the first value, and the m is a positive value greater than 0 and less than the M an integer; or

If the number of channels among the M channels that meet the first preset condition and whose corresponding transient flag is the first value is greater than or equal to n, then determine that the global transient flag is the first value, and the n is greater than 0 and a positive integer less than the M.
The device according to claim 22, wherein the transient detection result further includes transient position information, the global transient detection result further comprises global transient position information, and the transient position information is used to indicate The position where the transient occurs in the signal of the corresponding channel;

The determination module is specifically used for:

If only one of the M transient flags is the first value, then determine the transient position information corresponding to the channel whose transient flag is the first value as the global transient position information;

If there are at least two transient flags in the M transient flags as the first value, the transient position corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to the at least two transient flags The information is determined as the global transient location information.
The device according to any one of claims 15-23, wherein the device further comprises:

A fourth encoding module, configured to encode the global transient detection result to obtain an encoding result of the global transient detection result;

The third writing module is used to write the encoding result of the global transient detection result into the code stream.
A decoding device, characterized in that the device comprises:

The analysis module is used to analyze the global transient detection results and spatial encoding parameters from the code stream;

A decoding module, configured to decode based on the global transient detection result and the code stream, so as to obtain frequency domain signals of N transmission channels;

A spatial decoding module, configured to perform spatial decoding on the frequency-domain signals of the N transmission channels based on the global transient detection result and the spatial encoding parameters, so as to obtain a reconstructed three-dimensional audio signal in the frequency domain;

A determination module, configured to determine a reconstructed time domain 3D audio signal based on the global transient detection result and the reconstructed frequency domain 3D audio signal.
The device according to claim 25, wherein the determining module comprises:

a determining unit, configured to determine a target encoding parameter based on the global transient detection result, where the target encoding parameter includes a window function type of the current frame and/or a frame type of the current frame;

A conversion unit, configured to convert the reconstructed frequency-domain 3D audio signal into the reconstructed time-domain 3D audio signal based on the target coding parameters.
The device according to claim 26, wherein the global transient detection result includes a global transient flag, and the target coding parameters include the window function type of the current frame;

The determining unit is specifically used for:

If the global transient flag is the first value, then determining the type of the first preset window function as the window function type of the current frame;

If the global transient flag is the second value, then determining the type of the second preset window function as the window function type of the current frame;

Wherein, the window length of the first preset window function is smaller than the window length of the second preset window function.
The device according to claim 26, wherein the global transient detection result comprises a global transient flag and global transient position information, and the target encoding parameter comprises a window function type of the current frame;

The determining unit is specifically used for:

If the global transient flag is the first value, then determine the window function type of the current frame based on the global transient position information.
An encoding end device, characterized in that the encoding end device includes a memory and a processor;

The memory is used to store computer programs, and the processor is used to execute the computer programs stored in the memory, so as to realize the coding method described in any one of claims 1-10.
A decoding end device, characterized in that the decoding end device includes a memory and a processor;

The memory is used to store computer programs, and the processor is used to execute the computer programs stored in the memory, so as to realize the decoding method described in any one of claims 11-14.
A computer-readable storage medium, characterized in that instructions are stored in the storage medium, and when the instructions are run on the computer, the computer is made to perform the method described in any one of claims 1-14 step.
A computer-readable storage medium, characterized by comprising the code stream obtained by the encoding method according to any one of claims 1-10.
A computer program, characterized in that, when the computer program is executed, the method according to any one of claims 1-14 is realized.