WO2023051370A1 - Encoding and decoding methods and apparatus, device, storage medium, and computer program

Info

Publication number
WO2023051370A1
Authority
WO
WIPO (PCT)
Prior art keywords
transient
global
encoding
current frame
domain
Application number
PCT/CN2022/120507
Other languages
French (fr)
Chinese (zh)
Inventor
刘帅
高原
王宾
王喆
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2023051370A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Definitions

  • the present application relates to the technical field of three-dimensional audio coding and decoding, and in particular to a coding and decoding method, device, equipment, storage medium and computer program.
  • Three-dimensional audio technology is an audio technology that acquires, processes, transmits, and renders sound events and three-dimensional sound field information in the real world through computers and signal processing.
  • A 3D audio signal usually contains a large amount of data in order to record the spatial information of a sound scene in detail.
  • Transmitting and storing such a large amount of data is difficult, so the 3D audio signal needs to be encoded and decoded.
  • HOA: higher-order ambisonics
  • In the related art, the time-domain HOA signal is first time-frequency transformed to obtain a frequency-domain HOA signal, and the frequency-domain HOA signal is spatially encoded to obtain frequency-domain signals of multiple channels.
  • Then, an inverse time-frequency transform is performed on the frequency-domain signal of each channel to obtain the time-domain signal of each channel, and transient detection is performed on the time-domain signal of each channel to obtain the transient detection result of each channel.
  • Finally, the time-frequency transform is performed again on the time-domain signal of each channel to obtain the frequency-domain signal of each channel, and the frequency-domain signal of each channel is encoded using the transient detection result of that channel.
  • Embodiments of the present application provide an encoding and decoding method, apparatus, device, storage medium, and computer program, which can reduce encoding complexity and improve encoding efficiency. The technical solutions are as follows:
  • In a first aspect, an encoding method is provided: transient detection is performed on the signals of M channels included in the time-domain three-dimensional audio signal of the current frame to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1; a global transient detection result is determined based on the M transient detection results; and, based on the global transient detection result, the time-domain three-dimensional audio signal is converted into a frequency-domain three-dimensional audio signal.
  • The frequency-domain three-dimensional audio signal is spatially encoded to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, where N is an integer greater than or equal to 1 and less than or equal to M. Based on the global transient detection result, the frequency-domain signals of the N transmission channels are encoded to obtain an encoding result of the frequency-domain signals; the spatial encoding parameters are encoded to obtain an encoding result of the spatial encoding parameters; and the encoding result of the spatial encoding parameters and the encoding result of the frequency-domain signals are written into the code stream.
  • the transient detection result includes a transient flag, or, the transient detection result includes a transient flag and transient position information.
  • the transient flag is used to indicate whether the signal of the corresponding channel is a transient signal
  • the transient position information is used to indicate the position where the transient occurs in the signal of the corresponding channel.
  • the method for determining the transient flag of the target channel includes: determining a transient detection parameter corresponding to the target channel based on a signal of the target channel. Based on the transient detection parameters corresponding to the target channel, the transient flag corresponding to the target channel is determined.
  • In one case, the transient detection parameter corresponding to the target channel is the absolute value of the inter-frame energy difference. That is, the energy of the signal of the target channel in the current frame and the energy of the signal of the target channel in the previous frame of the current frame are determined, and the absolute value of the difference between the two energies is taken as the absolute value of the inter-frame energy difference. If the absolute value of the inter-frame energy difference exceeds the first energy difference threshold, the transient flag corresponding to the target channel in the current frame is determined to be the first value; otherwise, it is determined to be the second value.
  • In another case, the transient detection parameter corresponding to the target channel is the absolute value of the subframe energy difference. That is, the signal of the target channel in the current frame includes the signals of multiple subframes; the absolute value of the subframe energy difference corresponding to each of the subframes is determined, and the transient flag corresponding to each subframe is then determined. If any of the subframes has a transient flag equal to the first value, the transient flag corresponding to the target channel in the current frame is determined to be the first value; if no subframe has a transient flag equal to the first value, it is determined to be the second value. Both cases are illustrated by the sketch below.
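  • As an illustration only, the following Python sketch shows both ways of computing the per-channel transient flag. The subframe count, the two thresholds, and the flag values (1 for the first value, 0 for the second value) are assumptions made for this example and are not fixed by this application.

```python
import numpy as np

FIRST_VALUE, SECOND_VALUE = 1, 0   # assumed flag values ("first value" / "second value")
NUM_SUBFRAMES = 4                  # assumed number of subframes per frame
ENERGY_DIFF_THRESHOLD = 1.0        # assumed first energy difference threshold
SUBFRAME_DIFF_THRESHOLD = 1.0      # assumed subframe energy difference threshold

def transient_flag_interframe(cur_frame, prev_frame):
    """Transient flag from the absolute inter-frame energy difference."""
    cur_energy = float(np.sum(np.square(cur_frame)))
    prev_energy = float(np.sum(np.square(prev_frame)))
    diff = abs(cur_energy - prev_energy)
    return FIRST_VALUE if diff > ENERGY_DIFF_THRESHOLD else SECOND_VALUE

def transient_flag_subframe(cur_frame):
    """Transient flag from absolute energy differences between adjacent subframes."""
    energies = [float(np.sum(np.square(s))) for s in np.array_split(cur_frame, NUM_SUBFRAMES)]
    for prev_e, cur_e in zip(energies, energies[1:]):
        if abs(cur_e - prev_e) > SUBFRAME_DIFF_THRESHOLD:
            return FIRST_VALUE       # at least one subframe flagged as transient
    return SECOND_VALUE
```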
  • The method for determining the transient position information of the target channel includes: determining the transient position information corresponding to the target channel based on the transient flag corresponding to the target channel.
  • If the transient flag corresponding to the target channel is the first value, the transient position information corresponding to the target channel is determined. If the transient flag corresponding to the target channel is the second value, it is determined that the target channel has no corresponding transient position information, or the transient position information corresponding to the target channel is set to a preset value, such as -1.
  • In a possible implementation, the transient detection result includes a transient flag, and the global transient detection result includes a global transient flag.
  • The transient flag is used to indicate whether the signal of the corresponding channel is a transient signal. Determining the global transient detection result based on the M transient detection results includes: if the number of transient flags whose value is the first value among the M transient flags is greater than or equal to m, determining that the global transient flag is the first value, where m is a positive integer greater than 0 and less than M.
  • Alternatively, the global transient flag is determined based on the number of channels among the M channels that satisfy the first preset condition and whose corresponding transient flag is the first value: if that number is greater than or equal to n, the global transient flag is determined to be the first value, where n is a positive integer greater than 0 and less than M.
  • The first preset condition includes that the channel belongs to a first-order ambisonics (FOA) signal. For example, when the 3D audio signal is an HOA signal, the channels of the FOA signal may be the first 4 channels of the HOA signal.
  • Of course, the first preset condition may also be another condition. Both ways of determining the global transient flag are illustrated by the sketch below.
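  • The following sketch illustrates the two rules described above for deriving the global transient flag; the values of m and n and the assumption that the FOA channels are the first four channels are example choices, not values defined by this application.

```python
FIRST_VALUE, SECOND_VALUE = 1, 0   # assumed flag values

def global_flag_by_count(transient_flags, m=1):
    """Rule 1: the global flag is the first value if at least m of the M
    per-channel transient flags are the first value."""
    hits = sum(1 for f in transient_flags if f == FIRST_VALUE)
    return FIRST_VALUE if hits >= m else SECOND_VALUE

def global_flag_by_foa_channels(transient_flags, n=1, foa_channels=(0, 1, 2, 3)):
    """Rule 2: only channels satisfying the first preset condition are counted,
    here assumed to be the first four (FOA) channels of an HOA signal."""
    hits = sum(1 for ch in foa_channels if transient_flags[ch] == FIRST_VALUE)
    return FIRST_VALUE if hits >= n else SECOND_VALUE
```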
  • The transient detection result further includes transient position information, and the global transient detection result further includes global transient position information. The transient position information is used to indicate the position where the transient occurs in the signal of the corresponding channel.
  • Determining the global transient detection result based on the M transient detection results includes: if only one of the M transient flags is the first value, the transient position information corresponding to the channel whose transient flag is the first value is determined as the global transient position information. If there are at least two transient flags with the first value among the M transient flags, the transient position information corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to these transient flags is determined as the global transient position information.
  • For example, when the difference between the transient positions of two such channels is within the position difference threshold, the transient position information corresponding to either of the two channels may be determined as the global transient position information.
  • The position difference threshold is set in advance and can be adjusted according to different requirements.
  • The transient detection parameter corresponding to a channel is the absolute value of the inter-frame energy difference or the absolute value of the subframe energy difference.
  • When the transient detection parameter corresponding to a channel is the absolute value of the inter-frame energy difference, each channel corresponds to one absolute value of the inter-frame energy difference.
  • In this case, the channel with the largest absolute value of the inter-frame energy difference can be selected from the at least two channels, and the transient position information corresponding to the selected channel is determined as the global transient position information.
  • When the transient detection parameter corresponding to a channel is the absolute value of the subframe energy difference, each channel corresponds to the absolute values of the energy differences of multiple subframes.
  • In this case, the channel with the largest absolute value of the subframe energy difference can be selected from the at least two channels, and the transient position information corresponding to the selected channel is determined as the global transient position information, as in the sketch below.
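  • The sketch below shows how the global transient position information can be selected from the per-channel results; the preset value -1 for "no transient" follows the example given above, and the per-channel detection parameter is the absolute energy difference (inter-frame or subframe).

```python
FIRST_VALUE = 1                    # assumed "first value" of the transient flag
NO_TRANSIENT = -1                  # preset value used when no transient position exists

def global_transient_position(transient_flags, transient_positions, detection_params):
    """transient_positions[ch] and detection_params[ch] are the transient position
    and the transient detection parameter (absolute inter-frame or subframe energy
    difference) of channel ch."""
    transient_channels = [ch for ch, f in enumerate(transient_flags) if f == FIRST_VALUE]
    if not transient_channels:
        return NO_TRANSIENT
    if len(transient_channels) == 1:
        return transient_positions[transient_channels[0]]
    # Several transient channels: take the one with the largest detection parameter.
    best = max(transient_channels, key=lambda ch: detection_params[ch])
    return transient_positions[best]
```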
  • Converting the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the global transient detection result includes: determining a target encoding parameter based on the global transient detection result, where the target encoding parameter includes the window function type of the current frame and/or the frame type of the current frame.
  • the time-domain three-dimensional audio signal is converted into a frequency-domain three-dimensional audio signal based on the target encoding parameters.
  • the global transient detection result includes a global transient flag.
  • the implementation process of determining the window function type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, determining the first preset window function type as the window function type of the current frame. If the global transient flag is the second value, the type of the second preset window function is determined as the type of the window function of the current frame. Wherein, the window length of the first preset window function is smaller than the window length of the second preset window function.
  • the global transient detection result includes global transient flags and global transient location information.
  • In this case, the implementation process of determining the window function type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, the window function type of the current frame is determined based on the global transient position information; if the global transient flag is the second value, the type of the third preset window function is determined as the window function type of the current frame, or the window function type of the current frame is determined based on the window function type of the previous frame of the current frame. A sketch of this selection logic follows.
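  • A minimal sketch of the window-type decision described above is given below. The concrete window names and the position-dependent rule are placeholders, since the description leaves them open.

```python
FIRST_VALUE = 1                          # assumed "first value" of the global flag
SHORT_WINDOW = "first_preset_window"     # shorter window length
LONG_WINDOW = "second_preset_window"     # longer window length
THIRD_WINDOW = "third_preset_window"

def window_function_type(global_flag, global_position=None, prev_frame_window=None):
    if global_position is None:
        # Global result contains only the flag: pick the first or second preset window.
        return SHORT_WINDOW if global_flag == FIRST_VALUE else LONG_WINDOW
    if global_flag == FIRST_VALUE:
        # Flag and position available: choose based on the transient position
        # (the concrete mapping is left open; a short window is used here).
        return SHORT_WINDOW
    # Non-transient frame: third preset window, or reuse the previous frame's type.
    return prev_frame_window if prev_frame_window is not None else THIRD_WINDOW
```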
  • The global transient detection result may include only the global transient flag, or it may include the global transient flag and the global transient position information; the global transient position information may be the transient position information corresponding to the channel whose transient flag is the first value, or it may be a preset value.
  • In these different cases, the method of determining the frame type of the current frame differs, so the following three cases are described separately:
  • the global transient detection result includes a global transient flag.
  • In this case, the implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, the frame type of the current frame is determined to be the first type, and the first type is used to indicate that the current frame includes multiple short frames. If the global transient flag is the second value, the frame type of the current frame is determined to be the second type, and the second type is used to indicate that the current frame includes a long frame.
  • the global transient detection result includes the global transient flag and the global transient position information.
  • In this case, the implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient flag is the first value and the global transient position information satisfies the second preset condition, the frame type of the current frame is determined to be the third type, and the third type is used to indicate that the current frame includes multiple ultra-short frames. If the global transient flag is the first value and the global transient position information does not satisfy the second preset condition, the frame type of the current frame is determined to be the first type, and the first type is used to indicate that the current frame includes multiple short frames.
  • If the global transient flag is the second value, the frame type of the current frame is determined to be the second type, and the second type is used to indicate that the current frame includes a long frame.
  • The frame length of an ultra-short frame is less than that of a short frame, and the frame length of a short frame is less than that of a long frame.
  • The second preset condition may be that the distance between the transient occurrence position indicated by the global transient position information and the start position of the current frame is less than the frame length of an ultra-short frame, or that the distance between the transient occurrence position indicated by the global transient position information and the end position of the current frame is less than the frame length of an ultra-short frame.
  • In the third case, the global transient detection result includes global transient position information.
  • In this case, the implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient position information is a preset value, such as -1, the frame type of the current frame is determined to be the second type, and the second type is used to indicate that the current frame includes a long frame. If the global transient position information is not a preset value and satisfies the second preset condition, the frame type of the current frame is determined to be the third type, and the third type is used to indicate that the current frame includes multiple ultra-short frames.
  • If the global transient position information is not a preset value and does not satisfy the second preset condition, the frame type of the current frame is determined to be the first type, and the first type is used to indicate that the current frame includes multiple short frames.
  • As before, the frame length of an ultra-short frame is smaller than that of a short frame, the frame length of a short frame is smaller than that of a long frame, and the second preset condition may be that the distance between the transient occurrence position indicated by the global transient position information and the start position or the end position of the current frame is less than the frame length of an ultra-short frame. The sketch below illustrates the frame-type decision for the second case.
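  • The following sketch models the frame-type decision for the second case (global flag plus global position). The frame length, the ultra-short frame length, and the flag value are example values; the second preset condition is modelled as "the transient position lies within one ultra-short frame length of the start or end of the frame", as described above.

```python
FIRST_VALUE = 1                 # assumed "first value" of the global transient flag
FIRST_TYPE = "short_frames"     # current frame consists of multiple short frames
SECOND_TYPE = "long_frame"      # current frame is one long frame
THIRD_TYPE = "ultra_short_frames"
FRAME_LEN = 960                 # assumed frame length in samples
ULTRA_SHORT_LEN = 120           # assumed ultra-short frame length in samples

def frame_type(global_flag, global_position):
    if global_flag != FIRST_VALUE:
        return SECOND_TYPE                                   # no transient: one long frame
    near_start = global_position < ULTRA_SHORT_LEN
    near_end = (FRAME_LEN - global_position) < ULTRA_SHORT_LEN
    if near_start or near_end:
        return THIRD_TYPE                                    # transient near a frame edge
    return FIRST_TYPE                                        # transient elsewhere: short frames
```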
  • the window function type of the current frame is used to indicate the shape and length of the window function corresponding to the current frame, and the window function of the current frame is used to perform windowing processing on the time-domain three-dimensional audio signal of the current frame.
  • The frame type of the current frame is used to indicate whether the current frame is an ultra-short frame, a short frame, or a long frame.
  • the ultra-short frame, short frame and long frame can be distinguished based on the duration of the frame, and the specific duration can be set according to different requirements, which is not limited in this embodiment of the present application.
  • the target coding parameters include the window function type of the current frame and/or the frame type of the current frame. That is, the target coding parameters include the window function type of the current frame, or the target coding parameters include the frame type of the current frame, or the target coding parameters include the window function type and the frame type of the current frame.
  • In these different cases, the process of converting the time-domain 3D audio signal of the current frame into the frequency-domain 3D audio signal based on the target encoding parameters differs, so the cases are described separately below.
  • the target coding parameters include the window function type of the current frame.
  • windowing processing is performed on the time-domain three-dimensional audio signal of the current frame based on the window function indicated by the window function type of the current frame. Afterwards, the windowed three-dimensional audio signal is converted into a frequency-domain three-dimensional audio signal.
  • the target coding parameters include the frame type of the current frame.
  • If the frame type of the current frame is the first type, it indicates that the current frame includes multiple short frames; in this case, the time-domain 3D audio signal of each short frame included in the current frame is converted into a frequency-domain 3D audio signal.
  • If the frame type of the current frame is the second type, it indicates that the current frame includes a long frame; in this case, the time-domain 3D audio signal of the long frame included in the current frame is directly converted into a frequency-domain 3D audio signal.
  • If the frame type of the current frame is the third type, it indicates that the current frame includes multiple ultra-short frames; in this case, the time-domain 3D audio signal of each ultra-short frame included in the current frame is converted into a frequency-domain 3D audio signal.
  • the target encoding parameters include the window function type and frame type of the current frame.
  • If the frame type of the current frame is the first type, it indicates that the current frame includes multiple short frames.
  • In this case, based on the window function indicated by the window function type of the current frame, windowing processing is performed on the time-domain three-dimensional audio signal of each short frame included in the current frame, and the windowed time-domain three-dimensional audio signal of each short frame is converted into a frequency-domain three-dimensional audio signal.
  • If the frame type of the current frame is the second type, it indicates that the current frame includes a long frame.
  • In this case, based on the window function indicated by the window function type of the current frame, windowing processing is performed on the time-domain three-dimensional audio signal of the long frame included in the current frame, and the windowed time-domain three-dimensional audio signal of the long frame is converted into a frequency-domain three-dimensional audio signal. If the frame type of the current frame is the third type, it indicates that the current frame includes multiple ultra-short frames; in this case, based on the window function indicated by the window function type of the current frame, windowing processing is performed on the time-domain three-dimensional audio signal of each ultra-short frame included in the current frame, and the windowed time-domain three-dimensional audio signal of each ultra-short frame is converted into a frequency-domain three-dimensional audio signal. A minimal sketch of the windowing and transform step follows.
  • the target encoding parameters can also be encoded to obtain an encoding result of the target encoding parameters. Write the encoding result of the target encoding parameters into the code stream.
  • performing spatial encoding on the frequency-domain three-dimensional audio signal based on the global transient detection result includes: performing spatial encoding on the frequency-domain three-dimensional audio signal based on the frame type.
  • If the frame type of the current frame is the first type, that is, the current frame includes multiple short frames, the frequency-domain three-dimensional audio signals of the multiple short frames included in the current frame are interleaved to obtain the frequency-domain three-dimensional audio signal of one long frame, and spatial encoding is performed on the frequency-domain three-dimensional audio signal of the long frame obtained after interleaving.
  • If the frame type of the current frame is the second type, that is, the current frame includes a long frame, the frequency-domain 3D audio signal of the long frame is spatially encoded.
  • If the frame type of the current frame is the third type, that is, the current frame includes multiple ultra-short frames, the frequency-domain three-dimensional audio signals of the multiple ultra-short frames included in the current frame are interleaved to obtain the frequency-domain three-dimensional audio signal of one long frame, and spatial encoding is performed on the frequency-domain three-dimensional audio signal of the long frame obtained after interleaving. A sketch of the interleaving step is given below.
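  • The following sketch shows one possible way to interleave the subframe spectra into a single long-frame spectrum before spatial encoding, together with the corresponding de-interleaving; the exact interleaving order is not specified here, so coefficient-wise interleaving is assumed.

```python
import numpy as np

def interleave_subframe_spectra(spectra):
    """spectra: array of shape (num_subframes, coeffs_per_subframe).
    Returns a 1-D long-frame spectrum in which the k-th coefficient of every
    subframe is grouped together."""
    return np.asarray(spectra).T.reshape(-1)

def deinterleave_subframe_spectra(long_spectrum, num_subframes):
    """Inverse of the interleaving above, as used conceptually on the decoder side."""
    return np.asarray(long_spectrum).reshape(-1, num_subframes).T
```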
  • encoding the frequency-domain signals of the N transmission channels based on the global transient detection result includes: encoding the frequency-domain signals of the N transmission channels based on the frame type of the current frame.
  • the implementation process of encoding the frequency-domain signals of the N transmission channels includes: performing noise shaping processing on the frequency-domain signals of the N transmission channels based on the frame type of the current frame.
  • the transmission channel downmixing process is performed on the frequency domain signals of the N transmission channels after the noise shaping process, to obtain the downmixed signal.
  • The method further includes: encoding the global transient detection result to obtain an encoding result of the global transient detection result, and writing the encoding result of the global transient detection result into the code stream, as illustrated by the sketch below.
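  • As an illustration, the global transient detection result could be written to the code stream as in the sketch below; the bit widths (1 bit for the flag, 10 bits for the position) are assumptions, not values defined by this application.

```python
FIRST_VALUE = 1   # assumed "first value" of the global transient flag

def write_global_transient_result(bits, global_flag, global_position=0):
    """Append the global transient detection result to a list of bits."""
    bits.append(global_flag & 0x1)                 # 1-bit global transient flag (assumed)
    if global_flag == FIRST_VALUE:
        for i in range(9, -1, -1):                 # 10-bit transient position (assumed)
            bits.append((global_position >> i) & 0x1)
    return bits
```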
  • In a second aspect, a decoding method is provided: the global transient detection result and the spatial encoding parameters are parsed from the code stream; decoding is performed based on the global transient detection result and the code stream to obtain frequency-domain signals of N transmission channels; based on the global transient detection result and the spatial encoding parameters, spatial decoding is performed on the frequency-domain signals of the N transmission channels to obtain a reconstructed frequency-domain 3D audio signal; and, based on the global transient detection result and the reconstructed frequency-domain 3D audio signal, a reconstructed time-domain 3D audio signal is determined.
  • Determining the reconstructed time-domain three-dimensional audio signal includes: determining a target encoding parameter based on the global transient detection result, where the target encoding parameter includes the window function type of the current frame and/or the frame type of the current frame; and, based on the target encoding parameter, converting the reconstructed frequency-domain 3D audio signal into the reconstructed time-domain 3D audio signal.
  • the target coding parameters include the window function type of the current frame and/or the frame type of the current frame. That is, the target coding parameters include the window function type of the current frame, or the target coding parameters include the frame type of the current frame, or the target coding parameters include the window function type and the frame type of the current frame.
  • In these different cases, the process of converting the reconstructed frequency-domain 3D audio signal into the reconstructed time-domain 3D audio signal based on the target encoding parameters differs, so the cases are described separately below.
  • the target coding parameters include the window function type of the current frame.
  • de-windowing processing is performed on the reconstructed three-dimensional audio signal in the frequency domain.
  • the de-windowed frequency-domain 3D audio signal is converted into a reconstructed time-domain 3D audio signal.
  • De-windowing processing is also referred to as windowing and overlap-add (splicing-add) processing; a sketch of this step follows.
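  • A minimal sketch of the de-windowing (windowing and overlap-add) step is shown below, assuming a 50% overlap between consecutive blocks and a symmetric synthesis window; these are example choices, not requirements of this application.

```python
def dewindow_overlap_add(cur_block, prev_tail, window):
    """Apply the synthesis window to the current reconstructed block and
    overlap-add its first half with the tail kept from the previous block.
    cur_block, prev_tail, and window are NumPy arrays; prev_tail has half the
    length of cur_block. Returns the finished samples and the new tail."""
    half = len(cur_block) // 2
    windowed = cur_block * window
    output = windowed[:half] + prev_tail     # completed samples for this hop
    new_tail = windowed[half:]               # to be added to the next block
    return output, new_tail
```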
  • the target coding parameters include the frame type of the current frame.
  • If the frame type of the current frame is the first type, it indicates that the current frame includes multiple short frames.
  • In this case, the reconstructed frequency-domain three-dimensional audio signal of each short frame is converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain 3D audio signal.
  • If the frame type of the current frame is the second type, it indicates that the current frame includes a long frame.
  • In this case, the reconstructed frequency-domain three-dimensional audio signal of the long frame included in the current frame is directly converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain 3D audio signal.
  • If the frame type of the current frame is the third type, it indicates that the current frame includes multiple ultra-short frames.
  • the reconstructed frequency domain 3D audio signal of each ultrashort frame is converted into a time domain 3D audio signal to obtain a reconstructed time domain 3D audio signal.
  • the target encoding parameters include the window function type and frame type of the current frame.
  • If the frame type of the current frame is the first type, it indicates that the current frame includes multiple short frames.
  • In this case, de-windowing processing is performed on the reconstructed frequency-domain 3D audio signal of each short frame included in the current frame, and the de-windowed reconstructed frequency-domain 3D audio signal of each short frame is converted into a time-domain 3D audio signal to obtain the reconstructed time-domain 3D audio signal.
  • If the frame type of the current frame is the second type, it indicates that the current frame includes a long frame.
  • In this case, de-windowing processing is performed on the reconstructed frequency-domain 3D audio signal of the long frame included in the current frame, and the de-windowed frequency-domain 3D audio signal of the long frame is converted into a time-domain 3D audio signal to obtain the reconstructed time-domain 3D audio signal. If the frame type of the current frame is the third type, it indicates that the current frame includes multiple ultra-short frames.
  • In this case, de-windowing processing is performed on the reconstructed frequency-domain 3D audio signal of each ultra-short frame included in the current frame, and the de-windowed reconstructed frequency-domain 3D audio signal of each ultra-short frame is converted into a time-domain 3D audio signal to obtain the reconstructed time-domain 3D audio signal. A decoder-side sketch of the inverse transform per frame type follows.
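  • The decoder-side counterpart of the earlier windowing/transform sketch is shown below; an inverse DCT stands in for the actual inverse transform, and de-windowing / overlap-add would be applied to the result as described above.

```python
import numpy as np
from scipy.fft import idct

SECOND_TYPE = "long_frame"   # same assumed frame-type labels as in the encoder sketch

def inverse_transform(spectra, ftype):
    """Convert reconstructed frequency-domain (sub)frames back to time-domain samples."""
    if ftype == SECOND_TYPE:                     # one long frame
        return idct(np.asarray(spectra), norm="ortho")
    # First or third type: invert each short / ultra-short subframe and concatenate.
    return np.concatenate([idct(np.asarray(s), norm="ortho") for s in spectra])
```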
  • the global transient detection result includes a global transient flag
  • the target coding parameter includes a window function type of the current frame.
  • Determining the target encoding parameter based on the global transient detection result includes: if the global transient flag is the first value, the type of the first preset window function is determined as the window function type of the current frame; if the global transient flag is the second value, the type of the second preset window function is determined as the window function type of the current frame. The window length of the first preset window function is smaller than that of the second preset window function.
  • the global transient detection result includes the global transient flag and the global transient position information
  • the target coding parameter includes the window function type of the current frame
  • the target coding parameter is determined based on the global transient detection result, including: if the global transient flag is the first value, the window function type of the current frame is determined based on the global transient position information.
  • In a third aspect, an encoding device is provided, and the encoding device has the function of implementing the behavior of the encoding method in the first aspect above.
  • the encoding device includes at least one module, and the at least one module is used to implement the encoding method provided in the first aspect above.
  • In a fourth aspect, a decoding device is provided, which has the function of implementing the behavior of the decoding method in the second aspect above.
  • the decoding device includes at least one module, and the at least one module is used to implement the decoding method provided by the second aspect above.
  • In a fifth aspect, an encoding end device is provided, which includes a processor and a memory, where the memory is used to store a program for executing the encoding method provided in the first aspect above.
  • the processor is configured to execute the program stored in the memory, so as to implement the encoding method provided in the first aspect above.
  • the encoding end device may further include a communication bus, which is used to establish a connection between the processor and the memory.
  • In a sixth aspect, a decoding end device is provided, which includes a processor and a memory, where the memory is used to store a program for executing the decoding method provided in the second aspect above.
  • the processor is configured to execute the program stored in the memory, so as to implement the decoding method provided by the second aspect above.
  • the decoding device may further include a communication bus, which is used to establish a connection between the processor and the memory.
  • In another aspect, a computer-readable storage medium is provided, and instructions are stored in the storage medium; when the instructions are run on a computer, the computer is caused to execute the steps of the encoding method described in the first aspect above, or the steps of the decoding method described in the second aspect above.
  • In another aspect, a computer program product containing instructions is provided; when the instructions are run on a computer, the computer is caused to execute the steps of the encoding method described in the first aspect above, or the steps of the decoding method described in the second aspect above.
  • In another aspect, a computer program is provided; when the computer program is executed, the steps of the encoding method described in the first aspect above or the steps of the decoding method described in the second aspect above are implemented.
  • In a ninth aspect, a computer-readable storage medium is provided, where the computer-readable storage medium includes the code stream obtained by the encoding method described in the first aspect.
  • In the embodiments of the present application, a global transient detection result is determined by performing transient detection on the signals of the M channels included in the time-domain three-dimensional audio signal of the current frame. Afterwards, based on the global transient detection result, time-frequency transformation of the audio signal, spatial encoding, and encoding of the frequency-domain signals of each transmission channel are performed in sequence. In particular, when the frequency-domain signals of each transmission channel obtained after spatial encoding are encoded, the encoding is guided by the global transient detection result, which can reduce encoding complexity and improve encoding efficiency.
  • FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an implementation environment of a terminal scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an implementation environment of a transcoding scenario of a wireless or core network device provided in an embodiment of the present application;
  • FIG. 4 is a schematic diagram of an implementation environment of a broadcast television scene provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an implementation environment of a virtual reality streaming scene provided by an embodiment of the present application.
  • Fig. 6 is a flow chart of the first encoding method provided by the embodiment of the present application.
  • Fig. 7 is an exemplary block diagram of the first encoding method shown in Fig. 6 provided by the embodiment of the present application;
  • FIG. 8 is an exemplary block diagram of a second encoding method shown in FIG. 6 provided by an embodiment of the present application.
  • FIG. 9 is a flow chart of the first decoding method provided by the embodiment of the present application.
  • FIG. 10 is an exemplary block diagram of the decoding method shown in FIG. 9 provided by an embodiment of the present application.
  • Fig. 11 is a flow chart of the second encoding method provided by the embodiment of the present application.
  • Fig. 12 is an exemplary block diagram of the first encoding method shown in Fig. 11 provided by the embodiment of the present application;
  • FIG. 13 is an exemplary block diagram of the second encoding method shown in FIG. 11 provided by the embodiment of the present application.
  • Fig. 14 is a flow chart of the second decoding method provided by the embodiment of the present application.
  • Fig. 15 is an exemplary block diagram of a decoding method shown in Fig. 14 provided by an embodiment of the present application;
  • FIG. 16 is a schematic structural diagram of an encoding device provided by an embodiment of the present application.
  • FIG. 17 is a schematic structural diagram of a decoding device provided by an embodiment of the present application.
  • Fig. 18 is a schematic block diagram of a codec device provided by an embodiment of the present application.
  • Encoding refers to the process of compressing the audio signal to be encoded into a code stream. It should be noted that after the audio signal is compressed into a code stream, it may be referred to as an encoded audio signal or a compressed audio signal.
  • Decoding refers to the process of restoring the coded stream into a reconstructed audio signal according to specific grammatical rules and processing methods.
  • Three-dimensional audio signal: a signal including multiple channels, which is used to characterize the sound field in a three-dimensional space, and may be one or a combination of an HOA signal, a multi-channel signal, and an object audio signal.
  • The number of channels of the 3D audio signal is related to the order of the 3D audio signal. For example, if the order of the 3D audio signal is A, the number of channels of the 3D audio signal is (A+1)²; a third-order HOA signal, for instance, has (3+1)² = 16 channels.
  • the three-dimensional audio signal mentioned below may be any three-dimensional audio signal, for example, it may be one or a combination of HOA signals, multi-channel signals, and object audio signals.
  • Transient signal: used to characterize the transient phenomenon of the signal of the corresponding channel of the 3D audio signal. If the signal of a certain channel is a transient signal, the signal of this channel is a non-stationary signal, for example, a signal whose energy changes greatly within a short period of time, such as the sound of drums and other percussion instruments.
  • FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • the implementation environment includes source device 10 , destination device 20 , link 30 and storage device 40 .
  • the source device 10 may generate an encoded 3D audio signal. Therefore, the source device 10 may also be called a three-dimensional audio signal encoding device.
  • Destination device 20 may decode the encoded three-dimensional audio signal generated by source device 10. Therefore, the destination device 20 may also be referred to as a three-dimensional audio signal decoding device.
  • Link 30 may receive an encoded 3D audio signal generated by source device 10 and may transmit the encoded 3D audio signal to destination device 20 .
  • The storage device 40 can receive the encoded 3D audio signal generated by the source device 10 and store it; in this case, the destination device 20 can directly obtain the encoded 3D audio signal from the storage device 40.
  • The storage device 40 may correspond to a file server or another intermediate storage device that can store the encoded three-dimensional audio signal generated by the source device 10, in which case the destination device 20 may access, via streaming or downloading, the encoded three-dimensional audio signal stored in the storage device 40.
  • Both the source device 10 and the destination device 20 may include one or more processors and a memory coupled to the one or more processors. The memory may include random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures accessible by a computer.
  • RAM: random access memory
  • ROM: read-only memory
  • EEPROM: electrically erasable programmable read-only memory
  • Both the source device 10 and the destination device 20 may include desktop computers, mobile computing devices, notebook (e.g., laptop) computers, tablet computers, set-top boxes, telephone handsets such as so-called "smart" phones, televisions, cameras, display devices, digital media players, video game consoles, in-vehicle computers, or the like.
  • Link 30 may include one or more media or devices capable of transmitting the encoded three-dimensional audio signal from source device 10 to destination device 20 .
  • link 30 may include one or more communication media that enable source device 10 to transmit the encoded three-dimensional audio signal directly to destination device 20 in real time.
  • the source device 10 may modulate the encoded 3D audio signal based on a communication standard, such as a wireless communication protocol, etc., and may send the modulated 3D audio signal to the destination device 20 .
  • the one or more communication media may include wireless and/or wired communication media, for example, the one or more communication media may include radio frequency (radio frequency, RF) spectrum or one or more physical transmission lines.
  • the one or more communication media may form part of a packet-based network, such as a local area network, a wide area network, or a global network (eg, the Internet), among others.
  • the one or more communication media may include routers, switches, base stations, or other devices that facilitate communication from the source device 10 to the destination device 20, etc., which are not specifically limited in this embodiment of the present application.
  • the storage device 40 may store the received encoded 3D audio signal sent by the source device 10, and the destination device 20 may directly obtain the encoded 3D audio signal from the storage device 40.
  • The storage device 40 may include any one of a variety of distributed or locally accessed data storage media, for example, a hard disk drive, a Blu-ray disc, a digital versatile disc (DVD), a compact disc read-only memory (CD-ROM), flash memory, volatile or non-volatile memory, or any other suitable digital storage medium for storing the encoded three-dimensional audio signal.
  • Alternatively, the storage device 40 may correspond to a file server or another intermediate storage device that can save the encoded 3D audio signal generated by the source device 10, and the destination device 20 may access, via streaming or downloading, the encoded 3D audio signal stored in the storage device 40.
  • the file server may be any type of server capable of storing and transmitting the encoded three-dimensional audio signal to the destination device 20 .
  • the file server may include a network server, a file transfer protocol (file transfer protocol, FTP) server, a network attached storage (network attached storage, NAS) device, or a local disk drive.
  • Destination device 20 may acquire the encoded three-dimensional audio signal over any standard data connection, including an Internet connection.
  • The standard data connection may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., a digital subscriber line (DSL), a cable modem, etc.), or a combination of both that is suitable for accessing the encoded three-dimensional audio signal stored on a file server.
  • the transmission of the encoded three-dimensional audio signal from the storage device 40 may be a streaming transmission, a download transmission, or a combination of both.
  • the technology of the embodiment of the present application can be applied to the source device 10 shown in FIG. 1 that encodes the 3D audio signal, and can also be applied to the destination device 20 that decodes the encoded 3D audio signal.
  • the source device 10 includes a data source 120 , an encoder 100 and an output interface 140 .
  • The output interface 140 may include a modulator/demodulator (modem) and/or a transmitter.
  • The data source 120 may include a capture device (e.g., a video camera), an archive containing previously captured 3D audio signals, a feed interface for receiving 3D audio signals from a 3D audio signal content provider, and/or a source for generating 3D audio signals.
  • the data source 120 may send a 3D audio signal to the encoder 100, and the encoder 100 may encode the received 3D audio signal sent by the data source 120 to obtain an encoded 3D audio signal.
  • the encoder may send the encoded three-dimensional audio signal to the output interface.
  • source device 10 sends the encoded three-dimensional audio signal directly to destination device 20 via output interface 140 .
  • the encoded 3D audio signal may also be stored on storage device 40 for later retrieval by destination device 20 for decoding and/or display.
  • the destination device 20 includes an input interface 240 , a decoder 200 and a display device 220 .
  • input interface 240 includes a receiver and/or a modem.
  • The input interface 240 can receive the encoded three-dimensional audio signal via the link 30 and/or from the storage device 40 and then send it to the decoder 200, and the decoder 200 can decode the received encoded three-dimensional audio signal to obtain the decoded 3D audio signal.
  • the decoder may transmit the decoded three-dimensional audio signal to the display device 220 .
  • the display device 220 may be integrated with the destination device 20 or may be external to the destination device 20 .
  • the display device 220 displays the decoded 3D audio signal.
  • the display device 220 can be any type of display device in various types, for example, the display device 220 can be a liquid crystal display (liquid crystal display, LCD), a plasma display, an organic light-emitting diode (organic light-emitting diode, OLED) monitor or other type of display device.
  • LCD liquid crystal display
  • OLED organic light-emitting diode
  • In some implementations, the encoder 100 and the decoder 200 may each be integrated with a video encoder and a video decoder, and may include an appropriate multiplexer-demultiplexer (MUX-DEMUX) unit or other hardware and software for encoding both audio and video in a common data stream or in separate data streams.
  • the MUX-DEMUX unit may conform to the ITU H.223 multiplexer protocol, or other protocols such as user datagram protocol (UDP), if applicable.
  • Each of the encoder 100 and the decoder 200 can be any one of the following circuits: one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combination thereof. If the techniques of the embodiments of the present application are implemented partially in software, the device may store the instructions for the software in a suitable non-transitory computer-readable storage medium, and may use one or more processors to execute the instructions in hardware so as to implement the techniques of the embodiments of the present application. Any of the foregoing (including hardware, software, a combination of hardware and software, etc.) may be considered to be one or more processors. Each of the encoder 100 and the decoder 200 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (codec) in a corresponding device.
  • codec: combined encoder/decoder
  • Embodiments of the present application may generally refer to the encoder 100 as “signaling” or “sending” certain information to another device such as the decoder 200 .
  • The term “signaling” or “sending” may generally refer to the transmission of syntax elements and/or other data used for decoding the compressed three-dimensional audio signal. This transfer can occur in real time or near real time. Alternatively, this communication may occur after a period of time, for example, when syntax elements are stored in an encoded bitstream on a computer-readable storage medium at encoding time; the decoding device may then retrieve the syntax elements at any time after they are stored on this medium.
  • the codec method provided in the embodiment of the present application can be applied to various scenarios, and several scenarios among them will be introduced respectively next.
  • FIG. 2 is a schematic diagram of an implementation environment in which a codec method provided by an embodiment of the present application is applied to a terminal scenario.
  • the implementation environment includes a first terminal 101 and a second terminal 201 , and the first terminal 101 and the second terminal 201 are connected in communication.
  • the communication connection may be a wireless connection or a wired connection, which is not limited in this embodiment of the present application.
  • the first terminal 101 may be a sending end device or a receiving end device.
  • the second terminal 201 may be a receiving end device or a sending end device.
  • the first terminal 101 is a sending end device
  • the second terminal 201 is a receiving end device
  • the second terminal 201 is a sending end device.
  • the first terminal 101 may be the source device 10 in the implementation environment shown in FIG. 1 above.
  • the second terminal 201 may be the destination device 20 in the implementation environment shown in FIG. 1 above.
  • both the first terminal 101 and the second terminal 201 include an audio collection module, an audio playback module, an encoder, a decoder, a channel encoding module and a channel decoding module.
  • the audio acquisition module in the first terminal 101 collects the 3D audio signal and transmits it to the encoder.
  • the encoder encodes the 3D audio signal using the encoding method provided in the embodiment of the present application.
  • the encoding may be called source encoding.
  • Then the channel encoding module performs channel encoding, and the encoded stream is transmitted in a digital channel through a wireless or wired network communication device.
  • The second terminal 201 receives the code stream transmitted in the digital channel through a wireless or wired network communication device, the channel decoding module performs channel decoding on the code stream, the decoder then uses the decoding method provided by the embodiment of the present application to decode and obtain the three-dimensional audio signal, and the three-dimensional audio signal is then played through the audio playback module.
  • The first terminal 101 and the second terminal 201 can each be any electronic product that can interact with the user in one or more ways, such as through a keyboard, touch pad, touch screen, remote control, voice interaction, or handwriting device, for example, a personal computer (PC), a mobile phone, a smartphone, a personal digital assistant (PDA), a wearable device, a pocket PC (PPC), a tablet computer, a smart in-vehicle device, a smart TV, a smart speaker, or the like.
  • FIG. 3 is a schematic diagram of an implementation environment in which a codec method provided by an embodiment of the present application is applied to a transcoding scenario of a wireless or core network device.
  • the implementation environment includes a channel decoding module, an audio decoder, an audio encoder and a channel encoding module.
  • the audio decoder may be a decoder using the decoding method provided by the embodiment of the present application, or may be a decoder using other decoding methods.
  • the audio encoder may be an encoder using the encoding method provided by the embodiment of the present application, or may be an encoder using other encoding methods.
  • In the first case, the audio decoder is a decoder using the decoding method provided by the embodiment of the present application, and the audio encoder is an encoder using another encoding method.
  • In this case, the channel decoding module performs channel decoding on the received code stream, the audio decoder then performs source decoding using the decoding method provided by the embodiment of the present application, and the audio encoder then performs encoding according to the other encoding method, thereby realizing the conversion from one format to another, which is known as transcoding. Afterwards, the transcoded stream is sent after channel encoding.
  • In the second case, the audio decoder is a decoder using another decoding method, and the audio encoder is an encoder using the encoding method provided by the embodiment of the present application.
  • In this case, the channel decoding module performs channel decoding on the received code stream, the audio decoder then performs source decoding using the other decoding method, and the audio encoder then performs encoding using the encoding method provided by the embodiment of the present application, thereby realizing the conversion from one format to another, which is known as transcoding. Afterwards, the transcoded stream is sent after channel encoding.
  • the wireless device may be a wireless access point, a wireless router, a wireless connector, and the like.
  • a core network device may be a mobility management entity, a gateway, and the like.
  • FIG. 4 is a schematic diagram of an implementation environment in which a codec method provided by an embodiment of the present application is applied to a broadcast television scene.
  • the broadcast TV scene is divided into a live scene and a post-production scene.
  • the implementation environment includes a live program 3D sound production module, a 3D sound encoding module, a set-top box and a speaker group, and the set-top box includes a 3D sound decoding module.
  • the implementation environment includes post-program 3D sound production modules, 3D sound coding modules, network receivers, mobile terminals, earphones, and the like.
  • the 3D sound production module of the live program produces a 3D sound signal, and the 3D sound signal includes a 3D audio signal.
  • the three-dimensional sound signal is encoded by applying the encoding method of the embodiment of the present application to obtain a code stream, and the code stream is transmitted to the user side through the broadcasting network, and is decoded by the three-dimensional sound decoder in the set-top box using the decoding method provided by the embodiment of the present application.
  • the three-dimensional sound signal is thus reconstructed and played back by the loudspeaker group.
  • the code stream is transmitted to the user side through the Internet, and the 3D sound decoder in the network receiver decodes it using the decoding method provided by the embodiment of the present application, so as to reconstruct the 3D sound signal and play it back by the speaker group.
  • the code stream is transmitted to the user side through the Internet, and the 3D sound decoder in the mobile terminal decodes it using the decoding method provided by the embodiment of the present application, thereby reconstructing the 3D sound signal, and playing it back by the earphone.
  • the post-program 3D sound production module produces a 3D sound signal
  • the 3D sound signal is encoded by applying the coding method of the embodiment of the application to obtain a code stream
  • The code stream is transmitted to the user side through the radio and television network, and the 3D sound decoder in the set-top box decodes it using the decoding method provided by the embodiment of the present application, thereby reconstructing the 3D sound signal, which is played back by the speaker group.
  • the code stream is transmitted to the user side through the Internet, and the 3D sound decoder in the network receiver decodes it using the decoding method provided by the embodiment of the present application, so as to reconstruct the 3D sound signal and play it back by the speaker group.
  • the code stream is transmitted to the user side through the Internet, and the 3D sound decoder in the mobile terminal decodes it using the decoding method provided by the embodiment of the present application, thereby reconstructing the 3D sound signal, and playing it back by the earphone.
  • FIG. 5 is a schematic diagram of an implementation environment in which a codec method provided by an embodiment of the present application is applied to a virtual reality streaming scene.
  • the implementation environment includes an encoding end and a decoding end.
  • the encoding end includes an acquisition module, a preprocessing module, an encoding module, a packaging module and a sending module
  • the decoding end includes an unpacking module, a decoding module, a rendering module and earphones.
  • the acquisition module collects three-dimensional audio signals, and then performs preprocessing operations through the preprocessing module.
  • the preprocessing operations include filtering out the low-frequency part of the signal, usually with 20Hz or 50Hz as the cut-off point, and extracting the orientation information in the signal.
  • use the encoding module to perform encoding processing using the encoding method provided by the embodiment of the present application.
  • after encoding, the packing module packs the code stream, which is then sent to the decoding end through the sending module.
  • the unpacking module at the decoding end first unpacks the code stream, the decoding module then decodes it using the decoding method provided by the embodiment of the application, the rendering module performs binaural rendering processing on the decoded signal, and the rendered signal is mapped onto the listener's earphones.
  • the earphone can be an independent earphone, or an earphone on a virtual reality glasses device.
  • any of the following encoding methods may be executed by the encoder 100 in the source device 10 .
  • Any of the following decoding methods may be performed by the decoder 200 in the destination device 20 .
  • FIG. 6 is a flowchart of the first encoding method provided by the embodiment of the present application.
  • the encoding method is applied to an encoding end device, and includes the following steps.
  • Step 601 Perform transient detection on signals of M channels included in the time-domain three-dimensional audio signal of the current frame, to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1.
  • the M transient detection results are in one-to-one correspondence with the M channels included in the time-domain three-dimensional audio signal of the current frame.
  • the transient detection result includes a transient flag, or, the transient detection result includes a transient flag and transient position information.
  • the transient flag is used to indicate whether the signal of the corresponding channel is a transient signal
  • the transient position information is used to indicate the position where the transient occurs in the signal of the corresponding channel.
  • there are many ways to determine the M transient detection results corresponding to the M channels, and one of them is introduced next. Since the transient detection result is determined in the same way for each of the M channels, one channel is taken as an example and, for ease of description, referred to as the target channel; its transient flag and transient position information are introduced separately below.
  • the transient detection parameters corresponding to the target channel are determined. Based on the transient detection parameters corresponding to the target channel, the transient flag corresponding to the target channel is determined.
  • the transient detection parameter corresponding to the target channel is the absolute value of the energy difference between frames. That is, the energy of the signal of the target channel in the current frame and the energy of the signal of the target channel in the previous frame of the current frame are determined. The absolute value of the difference between the energy of the signal of the target channel in the current frame and the energy of the signal of the target channel in the previous frame is determined to obtain the absolute value of the inter-frame energy difference. If the absolute value of the inter-frame energy difference exceeds the first energy difference threshold, determine that the transient flag corresponding to the target channel in the current frame is the first value, otherwise, determine that the transient flag corresponding to the target channel in the current frame is the second value .
  • the transient flag is used to indicate whether the signal of the corresponding channel is a transient signal. Therefore, when the absolute value of the inter-frame energy difference exceeds the first energy difference threshold, the signal of the target channel in the current frame is a transient signal, and the transient flag corresponding to the target channel in the current frame is determined to be the first value. When the absolute value of the inter-frame energy difference does not exceed the first energy difference threshold, the signal of the target channel in the current frame is not a transient signal, and the transient flag corresponding to the target channel in the current frame is determined to be the second value.
  • the first value and the second value can be expressed in various ways: for example, the first value is true and the second value is false, or the first value is 1 and the second value is 0.
  • the first energy difference threshold is preset, and the first energy difference threshold can be adjusted according to different requirements.
  • the transient detection parameter corresponding to the target channel is the absolute value of the subframe energy difference. That is, the signal of the target channel in the current frame includes signals of multiple subframes, the absolute value of the subframe energy difference corresponding to each subframe in the multiple subframes is determined, and then the transient flag corresponding to each subframe is determined. If there is a subframe whose transient flag is the first value in the multiple subframes, determine that the transient flag corresponding to the target channel in the current frame is the first value. If there is no subframe whose transient flag is the first value among the multiple subframes, determine that the transient flag corresponding to the target channel in the current frame is the second value.
  • the transient flag is determined in the same way for each of the multiple subframes, so the i-th subframe among the multiple subframes is taken as an example, where i is a subframe index. That is, the energy of the signal of the i-th subframe and the energy of the signal of the (i-1)-th subframe are determined, and the absolute value of their difference is taken as the absolute value of the subframe energy difference corresponding to the i-th subframe. If this absolute value exceeds the second energy difference threshold, the transient flag of the i-th subframe is determined to be the first value; otherwise, it is determined to be the second value.
  • in other words, when the absolute value of the subframe energy difference corresponding to the i-th subframe exceeds the second energy difference threshold, the signal of the i-th subframe is a transient signal and its transient flag is the first value; when it does not exceed the threshold, the signal of the i-th subframe is not a transient signal and its transient flag is the second value.
  • when the i-th subframe is the first subframe of the current frame, the energy of the signal of the (i-1)-th subframe is the energy of the signal of the last subframe of the target channel in the previous frame of the current frame.
  • the second energy difference threshold is preset, and the second energy difference threshold can be adjusted according to different requirements.
  • the second energy difference threshold may be the same as or different from the first energy difference threshold.
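  • as a non-normative illustration of the per-channel transient flag determination described above, the following Python sketch computes the flag from the absolute inter-frame or subframe energy difference; the function names, array-based interface and threshold values are assumptions for illustration only.

```python
import numpy as np

# Illustrative thresholds; the specification only requires that they are preset and adjustable.
FIRST_ENERGY_DIFF_THRESHOLD = 20.0
SECOND_ENERGY_DIFF_THRESHOLD = 20.0

def frame_energy(samples: np.ndarray) -> float:
    """Energy of one channel's frame or subframe (sum of squared samples)."""
    return float(np.sum(samples.astype(np.float64) ** 2))

def transient_flag_interframe(cur_frame: np.ndarray, prev_frame: np.ndarray) -> int:
    """Flag from the absolute inter-frame energy difference (1 = first value, 0 = second value)."""
    diff = abs(frame_energy(cur_frame) - frame_energy(prev_frame))
    return 1 if diff > FIRST_ENERGY_DIFF_THRESHOLD else 0

def transient_flag_subframes(cur_subframes, prev_last_subframe: np.ndarray) -> int:
    """Flag from absolute subframe energy differences: the i-th subframe is compared
    with the (i-1)-th one; for the first subframe the last subframe of the previous
    frame is used as the reference."""
    prev_energy = frame_energy(prev_last_subframe)
    for sub in cur_subframes:
        cur_energy = frame_energy(sub)
        if abs(cur_energy - prev_energy) > SECOND_ENERGY_DIFF_THRESHOLD:
            return 1
        prev_energy = cur_energy
    return 0
```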
  • the transient position information corresponding to the target channel is determined.
  • the transient flag corresponding to the target channel is the first value, determine the transient position information corresponding to the target channel. If the transient flag corresponding to the target channel is the second value, it is determined that the target channel does not have corresponding transient position information, or, the transient position information corresponding to the target channel is set to a preset value, such as -1.
  • when the transient flag corresponding to the target channel is the second value, it indicates that the signal of the target channel is not a transient signal. In this case, the transient detection result of the target channel does not include transient position information, or the transient position information corresponding to the target channel is directly set to a preset value, and the preset value is used to indicate that the signal of the target channel is not a transient signal. That is, the transient detection result of a transient signal includes the transient flag and the transient position information, while the transient detection result of a non-transient signal may include only the transient flag, or may include both the transient flag and the transient position information.
  • if the transient flag corresponding to the target channel is the first value, the transient position information is determined as follows: the signal of the target channel in the current frame includes signals of a plurality of subframes; the subframe whose transient flag is the first value and whose absolute value of the subframe energy difference is the largest is selected from the plurality of subframes, and the sequence number of the selected subframe is determined as the transient position information corresponding to the target channel in the current frame.
  • for example, assume that the transient flag corresponding to the target channel in the current frame is the first value and that the signal of the target channel includes four subframes: the absolute value of the subframe energy difference of the 0th subframe is 18, that of the 1st subframe is 21, that of the 2nd subframe is 24, and that of the 3rd subframe is 35.
  • the preset second energy difference threshold is 20
  • since the absolute subframe energy differences of the 1st, 2nd and 3rd subframes all exceed 20, the signals of these three subframes are transient signals; because the 3rd subframe has the largest absolute subframe energy difference, the sequence number 3 of the 3rd subframe is determined as the transient position information corresponding to the target channel in the current frame.
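  • a minimal sketch of the transient position determination described above, assuming the absolute subframe energy differences of the target channel are already available as a list (all names and the threshold value are illustrative):

```python
def transient_position(subframe_energy_diffs, threshold=20.0):
    """Return the sequence number of the subframe with the largest absolute subframe
    energy difference among the subframes whose difference exceeds the threshold,
    or -1 (a preset value) if no subframe is transient."""
    best_idx, best_diff = -1, float("-inf")
    for idx, diff in enumerate(subframe_energy_diffs):
        if diff > threshold and diff > best_diff:
            best_idx, best_diff = idx, diff
    return best_idx

# With the example values above, subframe 3 is selected as the transient position.
assert transient_position([18, 21, 24, 35]) == 3
```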
  • Step 602 Determine a global transient detection result based on the M transient detection results.
  • the global transient detection results include a global transient flag. If the number of transient flags with the first value among the M transient flags is greater than or equal to m, determine that the global transient flag is the first value, and m is a positive integer greater than 0 and less than M. Or, if the number of channels that satisfy the first preset condition and the corresponding transient flag is the first value among the M channels is greater than or equal to n, then determine that the global transient flag is the first value, and n is greater than 0 and less than M positive integer of .
  • the 3D audio signal of the current frame is a third-order HOA signal
  • the number of channels of the HOA signal is (3+1)², that is, 16.
  • assume that the first preset condition is that a channel belongs to the FOA signal, that the channels of the FOA signal include the first 4 channels of the HOA signal, that the channels satisfying the first preset condition among the M channels are the channels in which the FOA signal of the current frame is located, and that n is 1. If the number of channels among the 16 channels that belong to the FOA signal and whose corresponding transient flag is the first value is greater than or equal to 1, the global transient flag is determined to be the first value.
  • m and n are preset values, and m and n can also be adjusted according to different requirements.
  • the first preset condition includes that a channel belongs to the FOA signal, and the channels that satisfy the first preset condition among the M channels are the channels in which the FOA signal of the 3D audio signal of the current frame is located.
  • the FOA signal is the signal of the first 4 channels in the HOA signal, of course, the first preset condition can also be other conditions.
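  • a minimal sketch of the global transient flag determination, assuming a third-order HOA input whose first four channels carry the FOA signal; the parameter names and default values are illustrative, not values fixed by the specification:

```python
def global_transient_flag(transient_flags, m=None, n=1, foa_channels=(0, 1, 2, 3)):
    """Derive the global transient flag from the M per-channel transient flags.

    Two strategies from the description are sketched:
      * if m is given, the global flag is 1 when at least m of the M flags are 1;
      * otherwise, the global flag is 1 when at least n of the channels satisfying
        the first preset condition (assumed here to be the FOA channels, i.e. the
        first four channels of the HOA signal) have flag 1.
    """
    if m is not None:
        return 1 if sum(transient_flags) >= m else 0
    foa_hits = sum(transient_flags[ch] for ch in foa_channels if ch < len(transient_flags))
    return 1 if foa_hits >= n else 0
```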
  • the global transient detection result further includes global transient position information. If only one of the M transient flags is the first value, the transient position information corresponding to the channel whose transient flag is the first value is determined as the global transient position information. If at least two of the M transient flags are the first value, the transient position information corresponding to the channel with the largest transient detection parameter among the at least two corresponding channels is determined as the global transient position information; alternatively, if at least two of the M transient flags are the first value and the gap between the transient position information corresponding to two of these channels is smaller than the position difference threshold, the average value of the transient position information corresponding to these two channels is determined as the global transient position information.
  • the position difference threshold is set in advance, and the position difference threshold can be adjusted according to different requirements.
  • the transient detection parameter corresponding to the channel is the absolute value of the energy difference between frames or the absolute value of the energy difference between subframes.
  • when the transient detection parameter corresponding to a channel is the absolute value of the inter-frame energy difference, each channel corresponds to one absolute value of the inter-frame energy difference; the channel with the largest absolute value of the inter-frame energy difference can be selected from the at least two channels, and the transient position information corresponding to the selected channel is then determined as the global transient position information.
  • when the transient detection parameter corresponding to a channel is the absolute value of the subframe energy difference, each channel corresponds to the absolute values of multiple subframe energy differences; the channel with the largest absolute value of the subframe energy difference can be selected from the at least two channels, and the transient position information corresponding to the selected channel is then determined as the global transient position information.
  • the transient position information corresponding to the third channel can be directly determined as the global transient position information.
  • for example, assume that three of the 16 transient flags of the HOA signal have the first value, corresponding to channel 1, channel 2 and channel 3 respectively.
  • the transient position information corresponding to channel 1 is 1, the absolute value of the inter-frame energy difference corresponding to channel 1 is 22, the transient position information corresponding to channel 2 is 2, and the absolute value of the inter-frame energy difference corresponding to channel 2 is 23,
  • the transient position information corresponding to channel 3 is 3, and the absolute value of the inter-frame energy difference corresponding to channel 3 is 28.
  • the channel with the largest absolute value of the inter-frame energy difference is channel 3, and then the transient position information 3 corresponding to channel 3 is determined as the global transient position information.
  • as another example, assume that three of the 16 transient flags of the HOA signal have the first value, corresponding to channel 1, channel 2 and channel 3 respectively.
  • the transient position information corresponding to channel 1 is 1, the signal of channel 1 includes three subframes, and the absolute values of the subframe energy differences corresponding to these three subframes are 20, 18, and 22 respectively, and the transient position information corresponding to channel 2 is 2.
  • the signal of channel 2 includes three subframes.
  • the absolute values of the energy differences of the subframes corresponding to these three subframes are 20, 23, and 25 respectively.
  • the transient position information corresponding to channel 3 is 3.
  • the signal of channel 3 includes three subframes, and the absolute values of the subframe energy differences corresponding to these three subframes are 25, 28, and 30 respectively.
  • the channel with the largest absolute value of subframe energy difference is channel 3, and then the transient position information 3 corresponding to channel 3 is determined as the global transient position information.
  • as a further example, assume that three of the 16 transient flags of the HOA signal have the first value, corresponding to channel 1, channel 2 and channel 3 respectively.
  • the transient position information corresponding to channel 1 is 1, the transient position information corresponding to channel 2 is 3, and the transient position information corresponding to channel 3 is 6.
  • since the gap of 2 between the transient position information corresponding to channel 1 and channel 2 among the three channels is less than the preset position difference threshold of 3, the average value 2 of the transient position information corresponding to channel 1 and channel 2 is determined as the global transient position information.
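  • one possible combination of the above rules for deriving the global transient position information, written as a Python sketch; the way the rules are combined and the default position difference threshold are assumptions for illustration:

```python
from itertools import combinations

def global_transient_position(flags, positions, detection_params, pos_diff_threshold=3):
    """flags, positions and detection_params are per-channel lists: the transient flag,
    the transient position information and the transient detection parameter
    (e.g. the absolute inter-frame energy difference)."""
    transient_channels = [ch for ch, f in enumerate(flags) if f == 1]
    if not transient_channels:
        return -1                        # preset value: no channel is transient
    if len(transient_channels) == 1:
        return positions[transient_channels[0]]
    # If two flagged channels have transient positions closer than the threshold,
    # use the average of their positions ...
    for a, b in combinations(transient_channels, 2):
        if abs(positions[a] - positions[b]) < pos_diff_threshold:
            return (positions[a] + positions[b]) / 2
    # ... otherwise use the position of the channel with the largest detection parameter.
    best = max(transient_channels, key=lambda ch: detection_params[ch])
    return positions[best]

# Reproduces the last example: channels 1 and 2 have positions 1 and 3 (gap 2 < 3).
flags = [0, 1, 1, 1] + [0] * 12
positions = [-1, 1, 3, 6] + [-1] * 12
params = [0, 22, 23, 28] + [0] * 12
assert global_transient_position(flags, positions, params) == 2
```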
  • Step 603 Convert the time domain 3D audio signal of the current frame into a frequency domain 3D audio signal based on the global transient detection result.
  • target encoding parameters are determined based on global transient detection results, where the target encoding parameters include the window function type of the current frame and/or the frame type of the current frame.
  • the time-domain three-dimensional audio signal of the current frame is converted into a frequency-domain three-dimensional audio signal based on the target encoding parameter.
  • the global transient detection result includes a global transient flag.
  • the implementation process of determining the window function type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, determining the first preset window function type as the window function type of the current frame. If the global transient flag is the second value, the type of the second preset window function is determined as the type of the window function of the current frame. Wherein, the window length of the first preset window function is smaller than the window length of the second preset window function.
  • the global transient detection result includes global transient flags and global transient location information.
  • the implementation process of determining the window function type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, then determining the window function type of the current frame based on the global transient position information. If the global transient flag is the second value, the type of the third preset window function is determined as the type of the window function of the current frame, or the type of the window function of the current frame is determined based on the type of the window function of the previous frame of the current frame.
  • the type of the fourth preset window function is adjusted based on the global transient position information, so that the center position of the fourth preset window function corresponds to the position where the global transient occurs, i.e. the window function takes its maximum value at the global transient occurrence position.
  • a window function corresponding to the location where the global transient occurs is selected from the window function set, and then the type of the selected window function is determined as the window function type of the current frame. That is to say, window functions corresponding to each transient occurrence position are stored in the window function set, so that the window function corresponding to the global transient occurrence position can be selected.
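  • a minimal sketch of the window function type selection described above, assuming the window function set is represented as a mapping from transient occurrence position to a window type identifier; the identifiers and the fallback behaviour are illustrative:

```python
def select_window_type(global_flag, global_position=None, window_set=None):
    """window_set is an optional mapping from transient occurrence position to a
    window type identifier; the identifiers used here are placeholders."""
    if global_flag != 1:
        # second preset window function: the longer window
        return "LONG_WINDOW"
    if window_set is not None and global_position in window_set:
        # pick the stored window whose peak corresponds to the transient position
        return window_set[global_position]
    # first preset window function: the shorter window
    return "SHORT_WINDOW"
```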
  • since the global transient detection result may include only the global transient flag, or may include both the global transient flag and the global transient position information, and since the global transient position information may be either the transient position information corresponding to a channel whose transient flag is the first value or a preset value, the method of determining the frame type of the current frame differs accordingly. Therefore, the following three cases are described separately:
  • the global transient detection result includes a global transient flag.
  • the implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, determining that the frame type of the current frame is the first type, where the first type is used to indicate that the current frame includes multiple short frames. If the global transient flag is the second value, determining that the frame type of the current frame is the second type, where the second type is used to indicate that the current frame includes a long frame.
  • the global transient detection result includes the global transient flag and the global transient position information.
  • the implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient flag is the first value and the global transient position information satisfies the second preset condition, then determining that the frame type of the current frame is the third type , the third type is used to indicate that the current frame includes multiple ultrashort frames. If the global transient flag is the first value and the global transient position information does not satisfy the second preset condition, then determine that the frame type of the current frame is the first type, and the first type is used to indicate that the current frame includes multiple short frames.
  • if the global transient flag is the second value, the frame type of the current frame is determined to be the second type, and the second type is used to indicate that the current frame includes a long frame.
  • the frame length of the ultra-short frame is smaller than the frame length of the short frame, and the frame length of the short frame is smaller than that of the long frame.
  • the second preset condition may be that the distance between the transient occurrence position indicated by the global transient position information and the start position of the current frame is less than the frame length of the ultra-short frame, or that the distance between the transient occurrence position indicated by the global transient position information and the end position of the current frame is less than the frame length of the ultra-short frame.
  • the global transient detection result includes global transient position information.
  • the implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient position information is a preset value, such as -1, then determine that the frame type of the current frame is the second type, and the second type is used for Indicates that the current frame includes a long frame. If the global transient position information is not a preset value and satisfies the second preset condition, it is determined that the frame type of the current frame is a third type, and the third type is used to indicate that the current frame includes multiple ultra-short frames.
  • if the global transient position information is not a preset value (such as -1) and does not satisfy the second preset condition, the frame type of the current frame is determined to be the first type, and the first type is used to indicate that the current frame includes multiple short frames.
  • the frame length of the ultra-short frame is smaller than the frame length of the short frame, and the frame length of the short frame is smaller than that of the long frame.
  • the second preset condition may be that the distance between the transient occurrence position indicated by the global transient position information and the start position of the current frame is less than the frame length of the ultra-short frame, or that the distance between the transient occurrence position indicated by the global transient position information and the end position of the current frame is less than the frame length of the ultra-short frame.
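  • a minimal sketch of the frame type decision based on the global transient flag and the global transient position information, treating the position as a sample offset within the frame; the frame lengths and type identifiers are illustrative assumptions:

```python
# Illustrative lengths in samples; the specification does not fix these values.
FRAME_LEN = 960
ULTRA_SHORT_FRAME_LEN = 120

def select_frame_type(global_flag, global_position=None):
    """Return "LONG" (second type), "SHORT" (first type) or "ULTRA_SHORT" (third type).
    The second preset condition is sketched as: the transient occurs within one
    ultra-short frame length of the start or end of the current frame."""
    if global_flag != 1:
        return "LONG"
    if global_position is not None and global_position >= 0:
        near_start = global_position < ULTRA_SHORT_FRAME_LEN
        near_end = (FRAME_LEN - global_position) < ULTRA_SHORT_FRAME_LEN
        if near_start or near_end:
            return "ULTRA_SHORT"
    return "SHORT"
```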
  • the window function type of the current frame is used to indicate the shape and length of the window function corresponding to the current frame, and the window function of the current frame is used to perform windowing processing on the time-domain three-dimensional audio signal of the current frame.
  • the frame type of the current frame is used to indicate whether the current frame is an ultra-short frame, a short frame or a long frame.
  • the ultra-short frame, short frame and long frame can be distinguished based on the duration of the frame, and the specific duration can be set according to different requirements, which is not limited in this embodiment of the present application.
  • the method of converting the time-domain three-dimensional audio signal of the current frame into the frequency-domain three-dimensional audio signal can be a modified discrete cosine transform (modified discrete cosine transform, MDCT), or a modified discrete sine transform (modified discrete sine transform, MDST) , can also be fast Fourier transform (fast fourier transform, FFT).
  • the target coding parameters include the window function type of the current frame and/or the frame type of the current frame. That is, the target coding parameters include the window function type of the current frame, or the target coding parameters include the frame type of the current frame, or the target coding parameters include the window function type and the frame type of the current frame.
  • the process of converting the time-domain 3D audio signal of the current frame into the frequency-domain 3D audio signal based on the target coding parameters is different, so the following descriptions will be made respectively.
  • the target coding parameters include the window function type of the current frame.
  • windowing processing is performed on the time-domain three-dimensional audio signal of the current frame based on the window function indicated by the window function type of the current frame. Afterwards, the windowed three-dimensional audio signal is converted into a frequency-domain three-dimensional audio signal.
  • the target coding parameters include the frame type of the current frame.
  • the frame type of the current frame is the first type, it indicates that the current frame includes multiple short frames, and at this time, the time domain 3D audio signal of each short frame included in the current frame is converted into a frequency domain 3D audio signal.
  • the frame type of the current frame is the second type, it indicates that the current frame includes a long frame. At this time, the time domain 3D audio signal of the long frame included in the current frame is directly converted into a frequency domain 3D audio signal.
  • the frame type of the current frame is the third type, it indicates that the current frame includes a plurality of ultrashort frames, and at this time, the time domain 3D audio signal of each ultrashort frame included in the current frame is converted into a frequency domain 3D audio signal.
  • the target encoding parameters include the window function type and frame type of the current frame.
  • if the frame type of the current frame is the first type, it indicates that the current frame includes a plurality of short frames. At this time, based on the window function indicated by the window function type of the current frame, the time-domain three-dimensional audio signals of each short frame included in the current frame are respectively subjected to windowing processing, and the windowed time-domain three-dimensional audio signal of each short frame is converted into a frequency-domain three-dimensional audio signal.
  • if the frame type of the current frame is the second type, it indicates that the current frame includes a long frame. At this time, based on the window function indicated by the window function type of the current frame, the time-domain three-dimensional audio signal of the long frame included in the current frame is subjected to windowing processing, and the windowed time-domain three-dimensional audio signal of the long frame is converted into a frequency-domain three-dimensional audio signal. If the frame type of the current frame is the third type, it indicates that the current frame includes multiple ultra-short frames. At this time, based on the window function indicated by the window function type of the current frame, the time-domain three-dimensional audio signals of each ultra-short frame included in the current frame are respectively subjected to windowing processing, and the windowed time-domain three-dimensional audio signal of each ultra-short frame is converted into a frequency-domain three-dimensional audio signal.
  • in the case that the current frame includes a plurality of ultra-short frames or short frames, the frequency-domain three-dimensional audio signal of each ultra-short frame or short frame included in the current frame is obtained.
  • in the case that the current frame includes a long frame, the frequency-domain three-dimensional audio signal of the one long frame included in the current frame is obtained.
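  • the windowing and time-frequency conversion described above can be sketched as follows, using a direct MDCT and a sine window for illustration; the 50 % overlap between neighbouring frames and the exact window shapes used by the codec are omitted here, and the frame is assumed to divide evenly into subframes:

```python
import numpy as np

def sine_window(length: int) -> np.ndarray:
    n = np.arange(length)
    return np.sin(np.pi * (n + 0.5) / length)

def mdct(block: np.ndarray) -> np.ndarray:
    """Direct O(N^2) MDCT of a length-2N block; adequate for illustration."""
    two_n = len(block)
    n_half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_half)
    basis = np.cos(np.pi / n_half * (n[None, :] + 0.5 + n_half / 2) * (k[:, None] + 0.5))
    return basis @ block

def time_to_frequency(frame: np.ndarray, num_subframes: int):
    """Split the current frame into num_subframes parts (1 for a long frame, more for
    short or ultra-short frames), window each part and transform it to the frequency domain."""
    parts = np.split(frame, num_subframes)          # assumes an even split
    return [mdct(part * sine_window(len(part))) for part in parts]
```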
  • Step 604 Based on the global transient detection result, spatially encode the frequency-domain 3D audio signal of the current frame to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, where N is an integer greater than or equal to 1 and less than or equal to M.
  • the frequency-domain three-dimensional audio signal of the current frame is spatially encoded to obtain spatial encoding parameters and frequency-domain signals of N transmission channels.
  • if the frame type of the current frame is the first type, that is, the current frame includes multiple short frames, the frequency-domain three-dimensional audio signals of the multiple short frames included in the current frame are interleaved to obtain the frequency-domain three-dimensional audio signal of one long frame, and spatial coding is performed on the frequency-domain three-dimensional audio signal of the long frame obtained after interleaving.
  • the frame type of the current frame is the second type, that is, the current frame includes a long frame, at this time, the frequency domain 3D audio signal of the long frame is spatially encoded.
  • if the frame type of the current frame is the third type, that is, the current frame includes multiple ultra-short frames, the frequency-domain three-dimensional audio signals of the multiple ultra-short frames included in the current frame are interleaved to obtain the frequency-domain three-dimensional audio signal of one long frame, and spatial coding is performed on the frequency-domain three-dimensional audio signal of the long frame obtained after interleaving.
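  • the interleaving of the short-frame or ultra-short-frame spectra into one long-frame spectrum before spatial coding can be sketched as follows; the coefficient-by-coefficient interleaving order is one common choice and is not mandated by the specification:

```python
import numpy as np

def interleave_subframe_spectra(subframe_spectra):
    """Interleave the frequency-domain coefficients of several short (or ultra-short)
    frames, coefficient by coefficient, into a single long-frame spectrum so that the
    subsequent spatial coding operates on one long frame."""
    stacked = np.stack(subframe_spectra)   # shape: (num_subframes, coeffs_per_subframe)
    return stacked.T.reshape(-1)           # k-th coefficient of every subframe kept together

def deinterleave_subframe_spectra(long_spectrum, num_subframes):
    """Decoder-side counterpart of the interleaving above."""
    coeffs = len(long_spectrum) // num_subframes
    return list(long_spectrum.reshape(coeffs, num_subframes).T)
```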
  • the method of spatial coding can be any method that can obtain spatial coding parameters and frequency domain signals of N transmission channels based on the frequency-domain three-dimensional audio signal of the current frame.
  • for example, a matching-projection-based spatial coding method can be adopted. The embodiment of the present application does not limit the spatial encoding method.
  • the spatial coding parameters refer to parameters determined during the process of spatially coding the frequency-domain 3D audio signal of the current frame, including side information, bit pre-allocation side information, and the like.
  • the frequency-domain signals of the N transmission channels may include virtual speaker signals of one or more channels, and residual signals of one or more channels.
  • the frequency domain signals of the N transmission channels may only include virtual speaker signals of one or more channels.
  • Step 605 Based on the global transient detection result, encode the frequency-domain signals of the N transmission channels to obtain a frequency-domain signal encoding result.
  • the frequency domain signals of the N transmission channels are encoded based on the frame type of the current frame.
  • the implementation process of encoding the frequency-domain signals of the N transmission channels includes: performing noise shaping processing on the frequency-domain signals of the N transmission channels based on the frame type of the current frame.
  • the transmission channel downmixing process is performed on the frequency domain signals of the N transmission channels after the noise shaping process, to obtain the downmixed signal.
  • the noise shaping processing includes temporal noise shaping (temporal noise shaping, TNS) processing and frequency domain noise shaping (frequency domain noise shaping, FDNS) processing.
  • when performing transmission channel downmix processing on the frequency-domain signals of the N transmission channels after the noise shaping processing, the N transmission channels after the noise shaping processing may be paired according to a preset criterion, or the frequency-domain signals of the N transmission channels after the noise shaping processing may be paired according to the degree of signal correlation.
  • mid-side (mid side, MS) downmix processing is performed based on the paired two channels of frequency domain signals.
  • for example, assuming that the N transmission channels include 2 channels of virtual speaker signals and 4 channels of residual signals, the 2 channels of virtual speaker signals may be combined into a pair according to a preset criterion for downmix processing. It is also possible to determine the correlation between every 2 residual signals among the 4 residual signals, select the 2 residual signals with the highest correlation to form a pair, let the remaining 2 residual signals form another pair, and perform downmix processing on each pair respectively.
  • the result of the downmixing process may be one channel of frequency domain signals or two channels of frequency domain signals, depending on the encoding process.
  • the low-frequency part and the high-frequency part of the signal can be divided in various ways. For example, with 2000 Hz as the cut-off point, the part of the downmixed signal whose frequency is less than 2000 Hz is regarded as the low frequency part of the signal, and the part of the downmixed signal whose frequency is greater than 2000 Hz is regarded as the high frequency part of the signal. For another example, with 5000 Hz as the cut-off point, the part of the downmixed signal whose frequency is less than 5000 Hz is taken as the low frequency part of the signal, and the part of the downmixed signal whose frequency is greater than 5000 Hz is taken as the high frequency part of the signal.
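  • the pairing and mid/side downmix of the transmission channels described above can be sketched as follows; the greedy correlation-based pairing is one possible realisation of pairing according to the signal correlation degree, not the codec's mandated procedure:

```python
import numpy as np

def ms_downmix(left: np.ndarray, right: np.ndarray):
    """Mid/side downmix of one pair of frequency-domain transmission-channel signals."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

def pair_by_correlation(signals):
    """Greedy pairing of transmission channels by normalised correlation."""
    remaining = list(range(len(signals)))
    pairs = []
    while len(remaining) >= 2:
        a = remaining.pop(0)
        def corr(b, a=a):
            sa, sb = signals[a], signals[b]
            denom = np.linalg.norm(sa) * np.linalg.norm(sb) + 1e-12
            return abs(float(np.dot(sa, sb))) / denom
        best = max(remaining, key=corr)
        remaining.remove(best)
        pairs.append((a, best))
    return pairs
```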
  • Step 606 Encode the spatial encoding parameters to obtain the encoding result of the spatial encoding parameters, and write the encoding result of the spatial encoding parameters and the encoding result of the frequency domain signal into the code stream.
  • the global transient detection result may also be encoded to obtain an encoding result of the global transient detection result, and the encoding result of the global transient detection result may be written into the code stream.
  • the target encoding parameter is encoded to obtain an encoding result of the target encoding parameter, and the encoding result of the target encoding parameter is written into the code stream.
  • in the embodiment of the present application, transient detection is performed on the signals of the M channels included in the time-domain three-dimensional audio signal of the current frame so as to determine a global transient detection result, and the time-frequency transformation of the audio signal, the spatial coding, and the coding of the frequency-domain signals of the transmission channels are then performed in sequence. In particular, when encoding the frequency-domain signals of the transmission channels obtained after spatial coding, the global transient detection result is reused as the transient detection result of each transmission channel, so there is no need to convert the frequency-domain signal of each transmission channel back to the time domain to determine a per-channel transient detection result, and the three-dimensional audio signal does not need to be transformed multiple times between the time domain and the frequency domain, thereby reducing coding complexity and improving coding efficiency.
  • in addition, the embodiment of the present application does not need to encode the transient detection result of each transmission channel; only the global transient detection result needs to be encoded into the code stream, so that the number of coding bits can be reduced.
  • FIG. 7 and FIG. 8 are block diagrams of an exemplary encoding method provided by an embodiment of the present application.
  • FIG. 7 and FIG. 8 are mainly for exemplary explanation of the encoding method shown in FIG. 6 .
  • the signals of M channels included in the time-domain three-dimensional audio signal of the current frame are respectively subjected to transient detection to obtain M transient detection results corresponding to the M channels.
  • a global transient detection result is determined, and the global transient detection result is encoded to obtain an encoding result of the global transient detection result, and the encoding result of the global transient detection result is written into a code stream.
  • the time-domain three-dimensional audio signal of the current frame is converted into a frequency-domain three-dimensional audio signal.
  • perform spatial encoding on the frequency-domain three-dimensional audio signal of the current frame to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, and encode the spatial encoding parameters to obtain the encoding result of the spatial encoding parameters, Write the coding result of the spatial coding parameter and the coding result of the frequency domain signal into the code stream.
  • then, based on the global transient detection result, the frequency-domain signals of the N transmission channels are encoded. Further, in FIG. 8, based on the global transient detection result, the frequency-domain three-dimensional audio signal of the current frame is spatially encoded to obtain the spatial encoding parameters and the frequency-domain signals of N transmission channels; the spatial encoding parameters are then encoded to obtain the encoding result of the spatial encoding parameters, and the encoding result of the spatial encoding parameters and the encoding result of the frequency-domain signal are written into the code stream. Then, based on the global transient detection result, the frequency-domain signals of the N transmission channels are subjected to noise shaping processing, transmission channel downmix processing, quantization and encoding processing, and bandwidth extension processing, and the encoding result of the signal after bandwidth extension processing is written into the code stream.
  • the encoder device may encode the global transient detection result into the code stream, or may not encode the global transient detection result into the code stream. Moreover, the encoding end device may also encode the target encoding parameters into the code stream, or may not encode the target encoding parameters into the code stream. In the case that the encoding end device encodes the global transient detection result into the code stream, the decoding end device can perform decoding according to the method shown in FIG. 9 below. When the encoder device encodes the target encoding parameters into the code stream, the decoder device can parse the target encoding parameters from the code stream, and then decode based on the frame type of the current frame included in the target encoding parameters.
  • the encoder device may not encode the global transient detection results into the code stream, nor encode the target encoding parameters into the code stream.
  • for the decoding process of the 3D audio signal in this case, reference can be made to related technologies, which is not elaborated in the embodiment of the present application.
  • FIG. 9 is a flow chart of the first decoding method provided by the embodiment of the present application. The method is applied to the decoding end and includes the following steps.
  • Step 901 Parse the global transient detection result and spatial coding parameters from the code stream.
  • Step 902 Decode based on the global transient detection result and the code stream to obtain frequency domain signals of N transmission channels.
  • the frame type of the current frame is determined based on global transient detection results. Decoding is performed based on the frame type of the current frame and the code stream to obtain frequency domain signals of the N transmission channels.
  • Step 903 Based on the global transient detection result and the spatial coding parameters, spatially decode the frequency-domain signals of the N transmission channels to obtain a reconstructed frequency-domain three-dimensional audio signal.
  • the frequency-domain signals of the N transmission channels are spatially decoded based on the frame type of the current frame and the spatial coding parameters to obtain the reconstructed frequency-domain 3D audio signal, where the frame type of the current frame is determined based on the global transient detection result. That is, the frame type of the current frame is first determined based on the global transient detection result, and then, based on the frame type of the current frame and the spatial coding parameters, the frequency-domain signals of the N transmission channels are spatially decoded to obtain the reconstructed frequency-domain 3D audio signal.
  • the implementation process of spatially decoding the frequency-domain signals of the N transmission channels may refer to related technologies, which will not be described in detail in the embodiments of the present application.
  • Step 904 Determine a reconstructed time domain 3D audio signal based on the global transient detection result and the reconstructed frequency domain 3D audio signal.
  • target encoding parameters are determined based on global transient detection results, where the target encoding parameters include the window function type of the current frame and/or the frame type of the current frame. Based on the target encoding parameters, the reconstructed frequency domain 3D audio signal is converted into a reconstructed time domain 3D audio signal.
  • the target coding parameters include the window function type of the current frame and/or the frame type of the current frame. That is, the target coding parameters include the window function type of the current frame, or the target coding parameters include the frame type of the current frame, or the target coding parameters include the window function type and the frame type of the current frame.
  • the process of converting the reconstructed frequency domain 3D audio signal into the reconstructed time domain 3D audio signal based on the target coding parameters is different, so the following will describe them respectively.
  • the target coding parameters include the window function type of the current frame.
  • de-windowing processing is performed on the reconstructed three-dimensional audio signal in the frequency domain.
  • the de-windowed three-dimensional audio signal in the frequency domain is converted into a reconstructed three-dimensional audio signal in the time domain.
  • de-windowing processing is also referred to as windowing and overlap-add processing.
  • the target coding parameters include the frame type of the current frame.
  • if the frame type of the current frame is the first type, it indicates that the current frame includes multiple short frames. At this time, the reconstructed frequency-domain three-dimensional audio signal of each short frame is converted into a time-domain three-dimensional audio signal, so as to obtain the reconstructed time-domain 3D audio signal.
  • if the frame type of the current frame is the second type, it indicates that the current frame includes a long frame. At this time, the reconstructed frequency-domain three-dimensional audio signal of the long frame included in the current frame is directly converted into a time-domain three-dimensional audio signal, so as to obtain the reconstructed time-domain 3D audio signal.
  • the frame type of the current frame is the third type, it indicates that the current frame includes a plurality of ultrashort frames.
  • the reconstructed frequency domain 3D audio signal of each ultrashort frame is converted into a time domain 3D audio signal to obtain a reconstructed time domain 3D audio signal.
  • the target encoding parameters include the window function type and frame type of the current frame.
  • if the frame type of the current frame is the first type, it indicates that the current frame includes a plurality of short frames. At this time, based on the window function indicated by the window function type of the current frame, the reconstructed frequency-domain 3D audio signals of each short frame included in the current frame are respectively subjected to de-windowing processing, and the de-windowed reconstructed frequency-domain 3D audio signal of each short frame is converted into a time-domain 3D audio signal, so as to obtain the reconstructed time-domain 3D audio signal.
  • if the frame type of the current frame is the second type, it indicates that the current frame includes a long frame. At this time, based on the window function indicated by the window function type of the current frame, de-windowing processing is performed on the reconstructed frequency-domain three-dimensional audio signal of the long frame included in the current frame, and the de-windowed frequency-domain 3D audio signal of the long frame is converted into a time-domain 3D audio signal, so as to obtain the reconstructed time-domain 3D audio signal. If the frame type of the current frame is the third type, it indicates that the current frame includes a plurality of ultra-short frames. At this time, based on the window function indicated by the window function type of the current frame, the reconstructed frequency-domain 3D audio signals of each ultra-short frame included in the current frame are respectively subjected to de-windowing processing, and the de-windowed reconstructed frequency-domain 3D audio signal of each ultra-short frame is converted into a time-domain 3D audio signal, so as to obtain the reconstructed time-domain 3D audio signal.
  • in the embodiment of the present application, the decoder parses the global transient detection result and the spatial coding parameters from the code stream, so that the time-domain three-dimensional audio signal can be reconstructed based on the global transient detection result and the spatial coding parameters without parsing the transient detection result of each transmission channel from the code stream, so that the decoding complexity can be reduced and the decoding efficiency can be improved.
  • the target coding parameters can be directly determined based on the global transient detection results, thereby realizing the reconstruction of the three-dimensional audio signal in the time domain.
  • FIG. 10 is a block diagram of an exemplary decoding method provided by an embodiment of the present application.
  • FIG. 10 is mainly an exemplary explanation of the decoding method shown in FIG. 9 .
  • the global transient detection results and spatial coding parameters are parsed from the code stream. Decoding is performed based on the global transient detection result and the code stream to obtain frequency domain signals of N transmission channels. Based on the global transient detection result and the spatial encoding parameters, the frequency domain signals of the N transmission channels are spatially decoded to obtain a reconstructed frequency domain 3D audio signal. Based on the global transient detection result and the reconstructed three-dimensional audio signal in the frequency domain, the reconstructed three-dimensional audio signal in the time domain is determined through de-windowing processing and inverse time-frequency transformation.
  • FIG. 11 is a flowchart of the second encoding method provided by the embodiment of the present application.
  • the encoding method is applied to an encoding end device, and includes the following steps.
  • Step 1101 Perform transient detection on signals of M channels included in the time-domain three-dimensional audio signal of the current frame, to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1.
  • for an implementation manner of determining the M transient detection results corresponding to the M channels, reference may be made to the related descriptions in step 601, which will not be repeated here.
  • Step 1102 Based on the M transient detection results, determine a global transient detection result.
  • for an implementation manner of determining global transient position information based on the M transient detection results, reference may be made to the related descriptions in step 602, which will not be repeated here.
  • Step 1103 Convert the time-domain 3D audio signal of the current frame into a frequency-domain 3D audio signal based on the global transient detection result.
  • the method of converting the time-domain 3D audio signal of the current frame into the frequency-domain 3D audio signal can refer to the relevant description in step 603 , which will not be repeated here.
  • Step 1104 Based on the global transient detection result, perform spatial encoding on the frequency-domain 3D audio signal of the current frame to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, where N is greater than or equal to 1 and less than or equal to M integer.
  • the implementation manner of performing spatial coding on the frequency-domain three-dimensional audio signal of the current frame based on the global transient detection result can refer to the related description in step 604 , which will not be repeated here.
  • Step 1105 Based on the M transient detection results, determine N transient detection results corresponding to the N transmission channels.
  • based on the M transient flags, the transient flags of the virtual speaker signals of one or more channels included in the N transmission channels are determined according to a first preset rule, and the transient flags of the residual signals of one or more channels included in the N transmission channels are determined according to a second preset rule.
  • the first preset rule includes: if the number of first values among the M transient flags is greater than or equal to P, the transient flags of the virtual speaker signals of the one or more channels included in the N transmission channels are all the first value.
  • the second preset rule includes: if the number of first values among the M transient flags is greater than or equal to Q, the transient flags of the residual signals of the one or more channels included in the N transmission channels are all the first value.
  • both P and Q are positive integers smaller than M.
  • P and Q are preset values, and P and Q can also be adjusted according to different requirements.
  • P is smaller than Q.
  • alternatively, the first preset rule includes: if the number of first values among the M transient flags is greater than or equal to P, the transient flags corresponding to the virtual speaker signals of the one or more channels included in the N transmission channels are all the first value.
  • the second preset rule includes: if the number of channels among the M channels that satisfy the first preset condition and whose corresponding transient flag is the first value is greater than or equal to R, the transient flags corresponding to the residual signals of the one or more channels included in the N transmission channels are all the first value.
  • both P and R are positive integers smaller than M.
  • P and R are preset values, and P and R can also be adjusted according to different requirements.
  • the 3D audio signal is an HOA signal
  • the first preset condition includes that a channel belongs to the FOA signal, and the channels that satisfy the first preset condition among the M channels are the channels in which the FOA signal of the 3D audio signal of the current frame is located.
  • the FOA signal is the signal of the first 4 channels in the HOA signal, of course, the first preset condition can also be other conditions.
  • the N transient flags may also be determined based on the M transient flags and according to the mapping relationship between the M transient flags and the N transmission channels. Wherein, the mapping relationship is determined in advance.
  • for example, if a certain transmission channel among the N transmission channels is mapped to several of the M channels, and the transient flag of at least one of those mapped channels is the first value, then the transient flag of that transmission channel is the first value.
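  • a minimal sketch of deriving the N transmission-channel transient flags from the M channel transient flags, covering both the preset rules and the mapping-based variant; the parameter names and the default values of P and Q are illustrative:

```python
def transmission_channel_flags(channel_flags, num_speaker_ch, num_residual_ch,
                               p=1, q=2, mapping=None):
    """Derive transient flags for the N transmission channels from the M channel flags.

    If mapping is given as a list of channel-index lists (one list per transmission
    channel), a transmission channel is flagged when any of its mapped channels is
    flagged. Otherwise the preset rules are applied: at least p set flags -> all
    virtual speaker transmission channels flagged; at least q set flags -> all
    residual transmission channels flagged."""
    if mapping is not None:
        return [1 if any(channel_flags[ch] == 1 for ch in mapped) else 0
                for mapped in mapping]
    num_set = sum(1 for f in channel_flags if f == 1)
    speaker_flag = 1 if num_set >= p else 0
    residual_flag = 1 if num_set >= q else 0
    return [speaker_flag] * num_speaker_ch + [residual_flag] * num_residual_ch
```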
  • step 1105 may be executed at any timing after step 1101 and before step 1106, and the embodiment of the present application does not limit the execution timing of step 1105.
  • Step 1106 Based on the N transient detection results, encode the frequency-domain signals of the N transmission channels to obtain a frequency-domain signal encoding result.
  • the frame type corresponding to each of the N transmission channels is determined. Based on the frame type corresponding to each of the N transmission channels, the frequency domain signal corresponding to the N transmission channels is encoded.
  • this transmission channel is referred to as a target transmission channel.
  • the implementation process of determining the frame type corresponding to the target transmission channel based on the transient detection result corresponding to the target transmission channel includes: if the transient flag corresponding to the target transmission channel is the first value, determining that the frame type corresponding to the target transmission channel is the first type, where the first type is used to indicate that the signal of the target transmission channel includes a plurality of short frames. If the transient flag corresponding to the target transmission channel is the second value, determining that the frame type corresponding to the target transmission channel is the second type, where the second type is used to indicate that the signal of the target transmission channel includes a long frame.
  • the frame type of the current frame is used to indicate whether the current frame is a short frame or a long frame.
  • the short frame and the long frame can be distinguished based on the duration of the frame, and the specific duration can be set according to different requirements, which is not limited in this embodiment of the present application.
  • noise shaping processing can be performed on the frequency domain signal of each transmission channel based on the frame type corresponding to each transmission channel.
  • the frequency domain signals of the N transmission channels after the noise shaping process are subjected to transmission channel downmixing processing to obtain the downmixed signals.
  • for details about noise shaping, transmission channel downmixing, quantization and encoding of the low-frequency part, and bandwidth extension and encoding, reference may be made to the relevant descriptions in step 605, which will not be repeated here.
  • Step 1107 Encode the spatial encoding parameters and the N transient detection results to obtain the encoding result of the spatial encoding parameters and the encoding results of the N transient detection results, and write the encoding result of the spatial encoding parameters and the encoding results of the N transient detection results into the code stream.
  • the global transient detection result may also be encoded to obtain an encoding result of the global transient detection result, and the encoding result of the global transient detection result may be written into the code stream.
  • the target encoding parameter is encoded to obtain an encoding result of the target encoding parameter, and the encoding result of the target encoding parameter is written into the code stream.
  • the transient detection results corresponding to the virtual speaker signal and the residual signal included in each transmission channel are determined.
  • the encoding accuracy can be improved when encoding the frequency domain signals of each transmission channel.
  • the transient detection results corresponding to each transmission channel are determined based on the M transient detection results, so there is no need to convert the frequency domain signals of each transmission channel to the time domain to determine the transient detection results corresponding to each transmission channel, and thus the three-dimensional audio signal does not need to be transformed multiple times between the time domain and the frequency domain, so that the coding complexity can be reduced and the coding efficiency can be improved.
  • FIG. 12 and FIG. 13 are block diagrams of another exemplary encoding method provided by an embodiment of the present application.
  • FIG. 12 and FIG. 13 mainly illustrate the encoding method shown in FIG. 11 .
  • the signals of M channels included in the time-domain three-dimensional audio signal of the current frame are respectively subjected to transient detection to obtain M transient detection results corresponding to the M channels.
  • a global transient detection result is determined, and the global transient detection result is encoded to obtain an encoding result of the global transient detection result, and the encoding result of the global transient detection result is written into a code stream.
  • the time-domain three-dimensional audio signal of the current frame is converted into a frequency-domain three-dimensional audio signal based on the global transient detection result.
  • Based on the global transient detection result perform spatial encoding on the frequency-domain three-dimensional audio signal of the current frame to obtain the spatial encoding parameters and the frequency-domain signals of the N transmission channels, and encode the spatial encoding parameters to obtain the encoding result of the spatial encoding parameters , write the encoding result of the spatial encoding parameters into the code stream.
  • N transient detection results corresponding to the N transmission channels are determined, the N transient detection results are encoded to obtain the encoding results of the N transient detection results, and the encoding results of the N transient detection results are written into the code stream.
  • the frequency domain signals of the N transmission channels are encoded.
  • noise shaping processing is performed on the frequency-domain signals of the N transmission channels based on the N transient detection results. Then perform transmission channel downmixing processing, quantization and encoding processing, and bandwidth expansion processing on the frequency domain signals of each transmission channel processed by the noise shaping process, and write the encoding results of the signals after the bandwidth expansion processing into the code stream.
  • the encoder device may encode the global transient detection result into the code stream, or may not encode the global transient detection result into the code stream. Moreover, the encoding end device may also encode the target encoding parameters into the code stream, or may not encode the target encoding parameters into the code stream. In the case that the encoding end device encodes the global transient detection result into the code stream, the decoding end device can perform decoding according to the method shown in Figure 14 below. When the encoder device encodes the target encoding parameters into the code stream, the decoder device can parse the target encoding parameters from the code stream, and then decode based on the frame type of the current frame included in the target encoding parameters.
  • the encoder device may not encode the global transient detection results into the code stream, nor encode the target encoding parameters into the code stream.
  • the decoding process of the 3D audio signal can refer to related technologies, and is not elaborated in the embodiments of this application.
  • FIG. 14 is a flowchart of a second decoding method provided by an embodiment of the present application. The method is applied to a decoding end and includes the following steps.
  • Step 1401 Parse the global transient detection result, N transient detection results corresponding to N transmission channels, and spatial coding parameters from the code stream.
  • Step 1402 Decode based on the N transient detection results and the code stream to obtain frequency domain signals of the N transmission channels.
  • the frame type corresponding to each transmission channel is determined based on the N transient detection results. Decoding is performed based on the frame type corresponding to each transmission channel and the code stream, so as to obtain frequency domain signals of the N transmission channels.
  • Step 1403 Based on the frequency domain signals and spatial coding parameters of the N transmission channels, perform spatial decoding on the frequency domain signals of the N transmission channels to obtain a reconstructed frequency domain 3D audio signal.
  • the frame type corresponding to each transmission channel is determined based on the N transient detection results. Based on the frame type and spatial encoding parameters corresponding to each transmission channel, spatial decoding is performed on the frequency domain signals of the N transmission channels to obtain a reconstructed three-dimensional audio signal in the frequency domain.
  • the implementation process of spatially decoding the frequency domain signals of the N transmission channels can refer to related technologies, which will not be described in detail in the embodiment of the present application.
  • Step 1404 Determine a reconstructed time domain 3D audio signal based on the global transient detection result and the reconstructed frequency domain 3D audio signal.
  • for an implementation manner of determining the reconstructed time-domain 3D audio signal based on the global transient detection result and the reconstructed frequency-domain 3D audio signal, reference may be made to the relevant descriptions in step 904, which will not be repeated here.
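The following Python sketch only mirrors the control flow of steps 1401 to 1404; the helper functions are hypothetical placeholders (not APIs defined by this application), and the dictionary-based "bitstream" stands in for real parsed fields.

```python
def parse_side_info(bitstream):
    # Placeholder for step 1401: read the coded side information from the code stream.
    return (bitstream["global_result"], bitstream["channel_results"],
            bitstream["spatial_params"])

def decode_transport_channel(coded_channel, transient_flag):
    # Placeholder for step 1402: the per-channel flag selects short- or long-frame decoding.
    return {"frame_type": "short" if transient_flag == 1 else "long",
            "spectrum": coded_channel}

def spatial_decode(freq_signals, spatial_params):
    # Placeholder for step 1403: upmix the N transport channels back to frequency-domain HOA.
    return {"channels": freq_signals, "params": spatial_params}

def inverse_transform(freq_hoa, global_result):
    # Placeholder for step 1404: the global result selects the window/frame type used here.
    return {"time_domain_hoa": freq_hoa, "window_selected_by": global_result}

def decode_frame(bitstream):
    global_result, channel_results, spatial_params = parse_side_info(bitstream)
    freq_signals = [decode_transport_channel(c, f)
                    for c, f in zip(bitstream["coded_channels"], channel_results)]
    freq_hoa = spatial_decode(freq_signals, spatial_params)
    return inverse_transform(freq_hoa, global_result)

example_stream = {"global_result": {"flag": 1, "position": 30},
                  "channel_results": [1, 0],
                  "spatial_params": {"gains": [0.7, 0.3]},
                  "coded_channels": ["ch0_bits", "ch1_bits"]}
reconstructed = decode_frame(example_stream)
```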
  • the decoding end parses the global transient detection result, the transient detection result corresponding to each transmission channel, and the spatial coding parameters from the code stream.
  • the frequency domain signal of each transmission channel can be accurately obtained.
  • the target coding parameters can be directly determined based on the global transient detection results, thereby realizing the reconstruction of the three-dimensional audio signal in the time domain.
  • FIG. 15 is a block diagram of another exemplary decoding method provided by an embodiment of the present application.
  • FIG. 15 is mainly an exemplary explanation of the decoding method shown in FIG. 14 .
  • the global transient detection result, N transient detection results corresponding to N transmission channels, and spatial coding parameters are analyzed from the code stream.
  • Decoding is performed based on the N transient detection results and the code stream to obtain frequency domain signals of the N transmission channels.
  • spatial decoding is performed on the frequency-domain signals of the N transmission channels to obtain a reconstructed frequency-domain three-dimensional audio signal.
  • a reconstructed time domain 3D audio signal is determined.
  • Figure 16 is a schematic structural diagram of an encoding device provided by an embodiment of the present application.
  • the encoding device can be implemented by software, hardware, or a combination of the two to become part or all of the encoding end device.
  • the encoding end device can be the source device shown in FIG. 1. Referring to FIG. 16 , the device includes: a transient detection module 1601 , a determination module 1602 , a conversion module 1603 , a spatial encoding module 1604 , a first encoding module 1605 , a second encoding module 1606 , and a first writing module 1607 .
  • the transient detection module 1601 is configured to perform transient detection on the signals of M channels included in the time-domain three-dimensional audio signal of the current frame, so as to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1 .
  • a determining module 1602 configured to determine a global transient detection result based on the M transient detection results.
  • the conversion module 1603 is configured to convert the three-dimensional audio signal in the time domain into a three-dimensional audio signal in the frequency domain based on the global transient detection result.
  • the spatial encoding module 1604 is configured to perform spatial encoding on the frequency-domain three-dimensional audio signal based on the global transient detection result to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, where N is an integer greater than or equal to 1 and less than or equal to M.
  • the first coding module 1605 is configured to code the frequency-domain signals of the N transmission channels based on the global transient detection result, so as to obtain a frequency-domain signal coding result.
  • for the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.
  • the second encoding module 1606 is configured to encode the spatial encoding parameters to obtain an encoding result of the spatial encoding parameters.
  • for the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.
  • the first writing module 1607 is configured to write the coding result of the spatial coding parameter and the coding result of the frequency domain signal into the code stream.
  • for the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.
  • the conversion module 1603 includes:
  • a determining unit configured to determine a target encoding parameter based on a global transient detection result, where the target encoding parameter includes a window function type of the current frame and/or a frame type of the current frame;
  • the converting unit is configured to convert the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the target coding parameter.
  • the global transient detection result includes a global transient flag
  • the target encoding parameter includes a window function type of the current frame
  • if the global transient flag is the first value, the type of the first preset window function is determined as the window function type of the current frame;
  • if the global transient flag is the second value, the type of the second preset window function is determined as the window function type of the current frame;
  • the window length of the first preset window function is smaller than the window length of the second preset window function.
  • the global transient detection result includes global transient flags and global transient position information
  • the target encoding parameters include the window function type of the current frame
  • the window function type of the current frame is determined based on the global transient position information.
  • the device also includes:
  • the third encoding module is configured to encode the target encoding parameters to obtain an encoding result of the target encoding parameters
  • the second writing module is used to write the encoding result of the target encoding parameter into the code stream.
  • the spatial encoding module 1604 is specifically configured to:
  • the frequency-domain three-dimensional audio signal is spatially encoded.
  • the first encoding module 1605 is specifically configured to:
  • the frequency domain signals of the N transmission channels are encoded.
  • the transient detection result includes a transient flag
  • the global transient detection result includes a global transient flag
  • the transient flag is used to indicate whether the signal of the corresponding channel is a transient signal
  • the determining module 1602 is specifically used for:
  • if the number of transient flags that are the first value among the M transient flags is greater than or equal to m, determine that the global transient flag is the first value, where m is a positive integer greater than 0 and less than M; or
  • if the number of channels among the M channels that meet the first preset condition and whose corresponding transient flag is the first value is greater than or equal to n, determine that the global transient flag is the first value, where n is a positive integer greater than 0 and less than M.
  • the transient detection result further includes transient position information
  • the global transient detection result further includes global transient position information
  • the transient position information is used to indicate the position where the transient occurs in the signal of the corresponding channel
  • the determining module 1602 is specifically used for:
  • the transient position information corresponding to the channel whose transient flag is the first value is determined as the global transient position information
  • the transient position information corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to the at least two transient flags is determined as the global Transient location information.
  • the device also includes:
  • the fourth encoding module is used to encode the global transient detection result to obtain the global transient detection result encoding result
  • the third writing module is used to write the encoding result of the global transient detection result into the code stream.
  • transient detection may be performed on signals of M channels included in the time-domain three-dimensional audio signal of the current frame, so as to determine a global transient detection result.
  • time-frequency transformation of the audio signal, spatial coding, and coding of the frequency domain signals of each transmission channel are then performed in sequence; in particular, when encoding the frequency domain signals of each transmission channel obtained after spatial coding, the global transient detection result is multiplexed (reused) as the transient detection result of each transmission channel, so there is no need to convert the frequency domain signals of each transmission channel to the time domain to determine the transient detection results corresponding to each transmission channel, and thus the three-dimensional audio signal does not need to be transformed multiple times between the time domain and the frequency domain, thereby reducing coding complexity and improving coding efficiency.
  • the embodiment of the present application does not need to encode the transient detection results of each transmission channel, but only needs to encode the global transient detection result into the code stream, so that the number of coding bits can be reduced.
  • when the encoding device provided in the above-mentioned embodiments performs encoding, the division into the above-mentioned functional modules is only used as an example for illustration. In practical applications, the above-mentioned functions can be allocated to different functional modules based on needs, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the encoding device and the encoding method embodiments provided in the above embodiments belong to the same idea, and the specific implementation process thereof is detailed in the method embodiments, and will not be repeated here.
  • Fig. 17 is a schematic structural diagram of a decoding device provided by an embodiment of the present application.
  • the decoding device can be implemented by software, hardware or a combination of the two to become part or all of the decoding end device.
  • the decoding end device can be the destination device shown in FIG. 1. Referring to FIG. 17 , the device includes: an analysis module 1701 , a decoding module 1702 , a spatial decoding module 1703 and a determination module 1704 .
  • the parsing module 1701 is configured to parse out the global transient detection result and spatial coding parameters from the code stream. For the detailed implementation process, refer to the corresponding content in the foregoing embodiments, and details are not repeated here.
  • the decoding module 1702 is configured to perform decoding based on the global transient detection result and code stream, so as to obtain frequency domain signals of N transmission channels.
  • the spatial decoding module 1703 is used to spatially decode the frequency-domain signals of the N transmission channels based on the global transient detection result and the spatial coding parameters, so as to obtain a reconstructed frequency-domain three-dimensional audio signal.
  • a determining module 1704 configured to determine a reconstructed time domain 3D audio signal based on the global transient detection result and the reconstructed frequency domain 3D audio signal.
  • the determining module 1704 includes:
  • a determining unit configured to determine a target encoding parameter based on a global transient detection result, where the target encoding parameter includes a window function type of the current frame and/or a frame type of the current frame;
  • the converting unit is configured to convert the reconstructed frequency domain 3D audio signal into a reconstructed time domain 3D audio signal based on the target encoding parameters.
  • the global transient detection result includes a global transient flag
  • the target encoding parameter includes a window function type of the current frame
  • if the global transient flag is the first value, the type of the first preset window function is determined as the window function type of the current frame;
  • if the global transient flag is the second value, the type of the second preset window function is determined as the window function type of the current frame;
  • the window length of the first preset window function is smaller than the window length of the second preset window function.
  • the global transient detection result includes global transient flags and global transient position information
  • the target encoding parameters include the window function type of the current frame
  • the window function type of the current frame is determined based on the global transient position information.
  • the decoder parses the global transient detection result and spatial coding parameters from the code stream, so that the time-domain three-dimensional audio signal can be reconstructed based on the global transient detection result and spatial coding parameters without parsing the transient detection results of each transmission channel from the code stream, so that the decoding complexity can be reduced and the decoding efficiency can be improved.
  • the target coding parameters can be directly determined based on the global transient detection results, thereby realizing the reconstruction of the three-dimensional audio signal in the time domain.
  • when the decoding device provided in the above-mentioned embodiments performs decoding, the division into the above-mentioned functional modules is only used as an example for illustration. In practical applications, the above-mentioned functions can be allocated to different functional modules based on needs, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the decoding device and the decoding method embodiments provided in the above embodiments belong to the same idea, and the specific implementation process thereof is detailed in the method embodiments, and will not be repeated here.
  • Fig. 18 is a schematic block diagram of a codec device 1800 used in an embodiment of the present application.
  • the codec apparatus 1800 may include a processor 1801 , a memory 1802 and a bus system 1803 .
  • the processor 1801 and the memory 1802 are connected through the bus system 1803, the memory 1802 is used to store instructions, and the processor 1801 is used to execute the instructions stored in the memory 1802 to perform various encoding or decoding described in the embodiments of this application method. To avoid repetition, no detailed description is given here.
  • the processor 1801 can be a central processing unit (central processing unit, CPU), and the processor 1801 can also be other general-purpose processors, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory 1802 may include a ROM device or a RAM device. Any other suitable type of storage device may also be used as memory 1802.
  • Memory 1802 may include code and data 18021 accessed by processor 1801 using bus 1803 .
  • the memory 1802 may further include an operating system 18023 and an application program 18022, where the application program 18022 includes at least one program that allows the processor 1801 to execute the encoding or decoding method described in the embodiment of this application.
  • the application program 18022 may include applications 1 to N, which further include an encoding or decoding application (codec application for short) that executes the encoding or decoding method described in the embodiment of this application.
  • the bus system 1803 may include not only a data bus, but also a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are labeled as bus system 1803 in the figure.
  • the codec apparatus 1800 may also include one or more output devices, such as a display 1804 .
  • display 1804 may be a touch-sensitive display that incorporates a display with a haptic unit operable to sense touch input.
  • the display 1804 may be connected to the processor 1801 via the bus 1803 .
  • codec device 1800 may implement the encoding method in the embodiment of the present application, and may also implement the decoding method in the embodiment of the present application.
  • Computer-readable media may include computer-readable storage media, which correspond to tangible media, such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another (eg, based on a communication protocol) .
  • a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium, such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this application.
  • a computer program product may include a computer readable medium.
  • such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • any connection is properly termed a computer-readable medium.
  • coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave
  • coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of media.
  • Disk and disc include compact disc (CD), laser disc, optical disc, DVD and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • DSPs digital signal processors
  • ASICs application specific integrated circuits
  • FPGAs field programmable logic arrays
  • the term "processor,” as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein.
  • the functionality described by the various illustrative logical blocks, modules, and steps described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec.
  • the techniques may be fully implemented in one or more circuits or logic elements.
  • various illustrative logical blocks, units, and modules in the encoder 100 and the decoder 200 may be understood as corresponding circuit devices or logic elements.
  • the techniques of the present application may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a group of ICs (eg, a chipset).
  • IC integrated circuit
  • a group of ICs eg, a chipset
  • Various components, modules or units are described in the embodiments of the present application to emphasize the functional aspects of the apparatus for performing the disclosed technology, but they do not necessarily need to be realized by different hardware units. Indeed, as described above, the various units may be combined in a codec hardware unit in conjunction with suitable software and/or firmware, or provided by interoperating hardware units (comprising one or more processors as described above).
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • software When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wired (eg coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (eg infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or may be a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (for example: floppy disk, hard disk, magnetic tape), an optical medium (for example: digital versatile disc (digital versatile disc, DVD)) or a semiconductor medium (for example: solid state disk (solid state disk, SSD)), etc.
  • a magnetic medium for example: floppy disk, hard disk, magnetic tape
  • an optical medium for example: digital versatile disc (digital versatile disc, DVD)
  • a semiconductor medium for example: solid state disk (solid state disk, SSD)
  • the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data need to comply with relevant laws, regulations and standards of relevant countries and regions.
  • the time-domain three-dimensional audio signal and code stream involved in the embodiments of the present application are obtained under the condition of sufficient authorization.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Encoding and decoding methods and apparatus, a device, a storage medium, and a computer program, belonging to the technical field of three-dimensional audio encoding and decoding. The encoding method comprises: performing transient detection on signals of M channels comprised in a time-domain three-dimensional audio signal of a current frame, so as to obtain M transient detection results corresponding to the M channels, M being an integer greater than 1 (601); determining a global transient detection result on the basis of the M transient detection results (602); on the basis of the global transient detection result, converting the time-domain three-dimensional audio signal of the current frame into a frequency-domain three-dimensional audio signal (603); on the basis of the global transient detection result, performing spatial encoding on the frequency-domain three-dimensional audio signal of the current frame, so as to obtain a spatial encoding parameter and frequency-domain signals of N transmission channels, N being an integer greater than or equal to 1 and less than or equal to M (604); encoding the frequency-domain signals of the N transmission channels on the basis of the global transient detection result, so as to obtain a frequency-domain signal encoding result (605); and encoding the spatial encoding parameter, so as to obtain a spatial encoding parameter encoding result, and writing the spatial encoding parameter encoding result and the frequency-domain signal encoding result into a bitstream (606). In this way, the complexity of encoding can be reduced, and the efficiency of encoding is increased.

Description

Encoding and decoding method, device, equipment, storage medium and computer program
This application claims priority to the Chinese patent application No. 202111155355.4, entitled "Encoding and decoding method, device, equipment, storage medium and computer program", filed on September 29, 2021, the entire contents of which are incorporated by reference in this application.
Technical field
The present application relates to the technical field of three-dimensional audio coding and decoding, and in particular to an encoding and decoding method, device, equipment, storage medium and computer program.
Background
Three-dimensional audio technology is an audio technology that acquires, processes, transmits, and renders for playback sound events and three-dimensional sound field information in the real world through computers, signal processing and other means. In order to achieve a better audio listening effect, a three-dimensional audio signal usually needs to include a large amount of data, so as to record the spatial information of a sound scene in more detail. However, such a large amount of data is difficult to transmit and store, so the three-dimensional audio signal needs to be encoded and decoded.
Higher order ambisonics (HOA) audio technology, as a three-dimensional audio technology, is independent of the speaker layout during the recording, encoding and playback stages, and HOA-format data can be rotated during playback, so HOA signals offer higher flexibility in playback and have therefore received more widespread attention.
Related technologies propose a method for encoding an HOA signal. In this method, the time-domain HOA signal is first subjected to time-frequency transformation to obtain a frequency-domain HOA signal, and the frequency-domain HOA signal is spatially encoded to obtain frequency-domain signals of multiple channels. Afterwards, time-frequency inverse transformation is performed on the frequency-domain signal of each channel to obtain the time-domain signal of each channel, and transient detection is performed on the time-domain signal of each channel to obtain the transient detection result of each channel. Then, time-frequency transformation is performed again on the time-domain signal of each channel to obtain the frequency-domain signal of each channel, and the frequency-domain signal of each channel is encoded using the transient detection result of each channel.
However, in the above method, the audio signal needs to be transformed multiple times between the time domain and the frequency domain, which increases the encoding complexity and thus reduces the encoding efficiency.
Summary of the invention
Embodiments of the present application provide an encoding and decoding method, device, equipment, storage medium and computer program, which can reduce encoding complexity and improve encoding efficiency. The technical solutions are as follows:
In a first aspect, an encoding method is provided: transient detection is performed on the signals of M channels included in the time-domain three-dimensional audio signal of the current frame, so as to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1; a global transient detection result is determined based on the M transient detection results; the time-domain three-dimensional audio signal is converted into a frequency-domain three-dimensional audio signal based on the global transient detection result; the frequency-domain three-dimensional audio signal is spatially encoded based on the global transient detection result to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, where N is an integer greater than or equal to 1 and less than or equal to M; the frequency-domain signals of the N transmission channels are encoded based on the global transient detection result to obtain a frequency-domain signal encoding result; the spatial encoding parameters are encoded to obtain a spatial encoding parameter encoding result; and the spatial encoding parameter encoding result and the frequency-domain signal encoding result are written into a code stream.
The transient detection result includes a transient flag, or the transient detection result includes a transient flag and transient position information. The transient flag is used to indicate whether the signal of the corresponding channel is a transient signal, and the transient position information is used to indicate the position where the transient occurs in the signal of the corresponding channel. There are multiple ways to determine the M transient detection results corresponding to the M channels, and one of them is introduced next. Since the transient detection result corresponding to each of the M channels is determined in the same way, one of the channels is taken as an example to introduce how its transient detection result is determined. For ease of description, this channel is referred to as the target channel, and the transient flag and transient position information of the target channel are introduced separately below.
The transient flag of the target channel is determined as follows: a transient detection parameter corresponding to the target channel is determined based on the signal of the target channel, and the transient flag corresponding to the target channel is determined based on the transient detection parameter corresponding to the target channel.
As an example, the transient detection parameter corresponding to the target channel is the absolute value of the inter-frame energy difference. That is, the energy of the signal of the target channel in the current frame and the energy of the signal of the target channel in the previous frame of the current frame are determined, and the absolute value of the difference between the two is determined to obtain the absolute value of the inter-frame energy difference. If the absolute value of the inter-frame energy difference exceeds a first energy difference threshold, the transient flag corresponding to the target channel in the current frame is determined to be the first value; otherwise, the transient flag corresponding to the target channel in the current frame is determined to be the second value.
As another example, the transient detection parameter corresponding to the target channel is the absolute value of the sub-frame energy difference. That is, the signal of the target channel in the current frame includes signals of multiple sub-frames; the absolute value of the sub-frame energy difference corresponding to each of the multiple sub-frames is determined, and then the transient flag corresponding to each sub-frame is determined. If there is a sub-frame whose transient flag is the first value among the multiple sub-frames, the transient flag corresponding to the target channel in the current frame is determined to be the first value. If there is no sub-frame whose transient flag is the first value among the multiple sub-frames, the transient flag corresponding to the target channel in the current frame is determined to be the second value.
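For illustration (not part of the original text), a minimal numpy sketch of the two energy-difference checks above. Flag value 1 is taken as the "first value"; the threshold, the frame sizes and the exact definition of the sub-frame energy difference (here, the jump against the previous sub-frame) are assumptions.

```python
import numpy as np

def frame_energy(x):
    return float(np.sum(np.square(x)))

def transient_flag_interframe(cur, prev, energy_diff_threshold=1.0):
    # First value (1) if |E(current frame) - E(previous frame)| exceeds the threshold.
    return 1 if abs(frame_energy(cur) - frame_energy(prev)) > energy_diff_threshold else 0

def transient_flag_subframes(cur, prev, num_subframes=4, energy_diff_threshold=1.0):
    # First value (1) if any sub-frame shows an energy jump above the threshold relative
    # to the preceding sub-frame (one possible reading of the sub-frame variant).
    subframes = np.array_split(np.concatenate([prev, cur]), 2 * num_subframes)
    energies = [frame_energy(s) for s in subframes]
    diffs = [abs(energies[i] - energies[i - 1]) for i in range(1, len(energies))]
    return 1 if any(d > energy_diff_threshold for d in diffs) else 0

prev = np.zeros(960)
cur = np.concatenate([np.zeros(480), np.ones(480)])      # onset in the second half
print(transient_flag_interframe(cur, prev), transient_flag_subframes(cur, prev))   # 1 1
```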
The transient position information of the target channel is determined as follows: the transient position information corresponding to the target channel is determined based on the transient flag corresponding to the target channel.
As an example, if the transient flag corresponding to the target channel is the first value, the transient position information corresponding to the target channel is determined. If the transient flag corresponding to the target channel is the second value, it is determined that the target channel has no corresponding transient position information, or the transient position information corresponding to the target channel is set to a preset value, for example, -1.
In some embodiments, the transient detection result includes a transient flag, the global transient detection result includes a global transient flag, and the transient flag is used to indicate whether the signal of the corresponding channel is a transient signal. Determining the global transient detection result based on the M transient detection results includes: if the number of transient flags with the first value among the M transient flags is greater than or equal to m, determining that the global transient flag is the first value, where m is a positive integer greater than 0 and less than M; or, if the number of channels among the M channels that satisfy the first preset condition and whose corresponding transient flag is the first value is greater than or equal to n, determining that the global transient flag is the first value, where n is a positive integer greater than 0 and less than M.
Here, m and n are preset values, and m and n can also be adjusted according to different requirements. In the case where the three-dimensional audio signal is an HOA signal, the first preset condition includes belonging to the first-order ambisonics (FOA) signal; for example, the channels of the FOA signal may include the first 4 channels of the HOA signal. In other words, when the three-dimensional audio signal is an HOA signal, if the number of channels of the FOA signal in the three-dimensional audio signal of the current frame whose corresponding transient flag is the first value is greater than or equal to n, the global transient flag is determined to be the first value. Of course, the first preset condition may also be another condition.
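For illustration (not part of the original text), a minimal sketch of the two alternative rules for the global transient flag, assuming flag value 1 is the "first value" and that the FOA signal occupies the first four channels of the HOA signal; m and n are chosen here purely as example thresholds.

```python
def global_flag_count_rule(channel_flags, m=2):
    # Rule 1: at least m of the M per-channel transient flags are the first value.
    return 1 if sum(channel_flags) >= m else 0

def global_flag_foa_rule(channel_flags, n=1, foa_channels=range(4)):
    # Rule 2: at least n channels satisfying the first preset condition (FOA channels)
    # have a transient flag equal to the first value.
    return 1 if sum(channel_flags[c] for c in foa_channels) >= n else 0

flags = [0, 1, 0, 0, 1, 0, 0, 0, 0]                 # 9 channels (2nd-order HOA example)
print(global_flag_count_rule(flags), global_flag_foa_rule(flags))   # 1 1
```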
In some other embodiments, the transient detection result further includes transient position information, the global transient detection result further includes global transient position information, and the transient position information is used to indicate the position where the transient occurs in the signal of the corresponding channel. Determining the global transient detection result based on the M transient detection results includes: if only one of the M transient flags is the first value, determining the transient position information corresponding to the channel whose transient flag is the first value as the global transient position information; if at least two of the M transient flags are the first value, determining the transient position information corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to the at least two transient flags as the global transient position information.
Alternatively, if at least two of the M transient flags are the first value and the gap between the transient position information corresponding to two of the channels is smaller than a position difference threshold, the average of the transient position information corresponding to the two channels is determined as the global transient position information. The position difference threshold is set in advance and can be adjusted according to different requirements.
Based on the above description, the transient detection parameter corresponding to a channel is the absolute value of the inter-frame energy difference or the absolute value of the sub-frame energy difference. When the transient detection parameter corresponding to a channel is the absolute value of the inter-frame energy difference, each channel corresponds to one absolute value of the inter-frame energy difference; in this case, the channel with the largest absolute value of the inter-frame energy difference can be selected from the at least two channels, and the transient position information corresponding to the selected channel is determined as the global transient position information. When the transient detection parameter corresponding to a channel is the absolute value of the sub-frame energy difference, each channel corresponds to multiple absolute values of sub-frame energy differences; in this case, the channel with the largest absolute value of the sub-frame energy difference can be selected from the at least two channels, and the transient position information corresponding to the selected channel is determined as the global transient position information.
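For illustration (not part of the original text), a minimal sketch of selecting the global transient position, where each channel carries a transient position and a transient detection parameter (its absolute energy difference); the averaging variant for two close positions is noted in a comment but not implemented.

```python
def global_transient_position(flags, positions, detection_params):
    flagged = [i for i, f in enumerate(flags) if f == 1]
    if not flagged:
        return -1                           # preset value when no channel flag is the first value
    if len(flagged) == 1:
        return positions[flagged[0]]        # only one flagged channel: take its position
    # Two or more flagged channels: take the position of the channel whose transient
    # detection parameter is largest. (The text also allows averaging two positions that
    # lie within a position-difference threshold; that variant is omitted here.)
    best = max(flagged, key=lambda i: detection_params[i])
    return positions[best]

print(global_transient_position([1, 0, 1], [100, -1, 120], [5.0, 0.0, 9.0]))   # 120
```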
Optionally, converting the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the global transient detection result includes: determining a target encoding parameter based on the global transient detection result, where the target encoding parameter includes the window function type of the current frame and/or the frame type of the current frame; and converting the time-domain three-dimensional audio signal into the frequency-domain three-dimensional audio signal based on the target encoding parameter.
As an example, the global transient detection result includes a global transient flag. The implementation process of determining the window function type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, determining the type of the first preset window function as the window function type of the current frame; if the global transient flag is the second value, determining the type of the second preset window function as the window function type of the current frame. The window length of the first preset window function is smaller than the window length of the second preset window function.
As another example, the global transient detection result includes a global transient flag and global transient position information. The implementation process of determining the window function type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, determining the window function type of the current frame based on the global transient position information; if the global transient flag is the second value, determining the type of the third preset window function as the window function type of the current frame, or determining the window function type of the current frame based on the window function type of the previous frame of the current frame.
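For illustration (not part of the original text), a minimal sketch of the first example above, where only the global transient flag selects the window type; the window names are placeholders, and all the text requires is that the first preset window be shorter than the second.

```python
FIRST_PRESET_WINDOW = "short_window"     # shorter window length
SECOND_PRESET_WINDOW = "long_window"     # longer window length

def window_type_from_global_flag(global_transient_flag):
    return FIRST_PRESET_WINDOW if global_transient_flag == 1 else SECOND_PRESET_WINDOW

print(window_type_from_global_flag(1), window_type_from_global_flag(0))
# short_window long_window
```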
Since the global transient detection result may include only the global transient flag, or may include the global transient flag and the global transient position information, and the global transient position information may be the transient position information corresponding to the channel whose transient flag is the first value or may be a preset value, the frame type of the current frame is determined in different ways depending on the content of the global transient detection result. The following three cases are therefore described separately:
In the first case, the global transient detection result includes the global transient flag. The implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, determining that the frame type of the current frame is the first type, where the first type is used to indicate that the current frame includes multiple short frames; if the global transient flag is the second value, determining that the frame type of the current frame is the second type, where the second type is used to indicate that the current frame includes one long frame.
In the second case, the global transient detection result includes the global transient flag and the global transient position information. The implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient flag is the first value and the global transient position information satisfies the second preset condition, determining that the frame type of the current frame is the third type, where the third type is used to indicate that the current frame includes multiple ultra-short frames; if the global transient flag is the first value and the global transient position information does not satisfy the second preset condition, determining that the frame type of the current frame is the first type, where the first type is used to indicate that the current frame includes multiple short frames; if the global transient flag is the second value, determining that the frame type of the current frame is the second type, where the second type is used to indicate that the current frame includes one long frame. The frame length of an ultra-short frame is smaller than that of a short frame, and the frame length of a short frame is smaller than that of a long frame. The second preset condition may be that the transient occurrence position indicated by the global transient position information is less than the frame length of an ultra-short frame away from the start position of the current frame, or that the transient occurrence position indicated by the global transient position information is less than the frame length of an ultra-short frame away from the end position of the current frame.
In the third case, the global transient detection result includes the global transient position information. The implementation process of determining the frame type of the current frame based on the global transient detection result includes: if the global transient position information is a preset value, such as -1, determining that the frame type of the current frame is the second type, where the second type is used to indicate that the current frame includes one long frame; if the global transient position information is not the preset value and satisfies the second preset condition, determining that the frame type of the current frame is the third type, where the third type is used to indicate that the current frame includes multiple ultra-short frames; if the global transient position information is not the preset value and does not satisfy the second preset condition, determining that the frame type of the current frame is the first type, where the first type is used to indicate that the current frame includes multiple short frames. The frame length of an ultra-short frame is smaller than that of a short frame, and the frame length of a short frame is smaller than that of a long frame. The second preset condition may be that the transient occurrence position indicated by the global transient position information is less than the frame length of an ultra-short frame away from the start position of the current frame, or that the transient occurrence position indicated by the global transient position information is less than the frame length of an ultra-short frame away from the end position of the current frame.
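For illustration (not part of the original text), a minimal sketch of the second case above (global flag plus global position); the frame length and ultra-short frame length are example numbers, not values fixed by this application.

```python
FIRST_TYPE = "short_frames"        # current frame is split into several short frames
SECOND_TYPE = "long_frame"         # current frame is one long frame
THIRD_TYPE = "ultra_short_frames"  # current frame is split into several ultra-short frames

def current_frame_type(global_flag, global_position, frame_len=960, ultra_short_len=120):
    if global_flag != 1:
        return SECOND_TYPE
    # Second preset condition: the transient lies within one ultra-short frame length
    # of the start or of the end of the current frame.
    near_edge = (global_position < ultra_short_len or
                 frame_len - global_position < ultra_short_len)
    return THIRD_TYPE if near_edge else FIRST_TYPE

print(current_frame_type(1, 30), current_frame_type(1, 480), current_frame_type(0, -1))
# ultra_short_frames short_frames long_frame
```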
It should be noted that the window function type of the current frame is used to indicate the shape and length of the window function corresponding to the current frame, and the window function of the current frame is used to perform windowing processing on the time-domain three-dimensional audio signal of the current frame. The frame type of the current frame is used to indicate whether the current frame is an ultra-short frame, a short frame or a long frame. The ultra-short frame, short frame and long frame can be distinguished based on the duration of the frame, and the specific duration can be set according to different requirements, which is not limited in the embodiments of the present application.
基于上文描述,目标编码参数包括当前帧的窗函数类型和/或当前帧的帧类型。也即是,目标编码参数包括当前帧的窗函数类型,或者,目标编码参数包括当前帧的帧类型,又或者,目标编码参数包括当前帧的窗函数类型和帧类型。在目标编码参数包括的参数不同时,基于该目标编码参数将当前帧的时域三维音频信号转换为频域三维音频信号的过程有所不同,因此接下来将分别进行说明。Based on the above description, the target coding parameters include the window function type of the current frame and/or the frame type of the current frame. That is, the target coding parameters include the window function type of the current frame, or the target coding parameters include the frame type of the current frame, or the target coding parameters include the window function type and the frame type of the current frame. When the parameters included in the target coding parameters are different, the process of converting the time-domain 3D audio signal of the current frame into the frequency-domain 3D audio signal based on the target coding parameters is different, so the following descriptions will be made respectively.
In the first case, the target encoding parameters include the window function type of the current frame. In this case, windowing is performed on the time-domain three-dimensional audio signal of the current frame based on the window function indicated by the window function type of the current frame, and the windowed three-dimensional audio signal is then converted into a frequency-domain three-dimensional audio signal.
In the second case, the target encoding parameters include the frame type of the current frame. In this case, if the frame type of the current frame is the first type, the current frame includes multiple short frames, and the time-domain three-dimensional audio signal of each short frame included in the current frame is converted into a frequency-domain three-dimensional audio signal. If the frame type of the current frame is the second type, the current frame includes one long frame, and the time-domain three-dimensional audio signal of that long frame is directly converted into a frequency-domain three-dimensional audio signal. If the frame type of the current frame is the third type, the current frame includes multiple ultra-short frames, and the time-domain three-dimensional audio signal of each ultra-short frame included in the current frame is converted into a frequency-domain three-dimensional audio signal.
In the third case, the target encoding parameters include both the window function type and the frame type of the current frame. In this case, if the frame type of the current frame is the first type, the current frame includes multiple short frames; the time-domain three-dimensional audio signal of each short frame included in the current frame is windowed based on the window function indicated by the window function type of the current frame, and the windowed time-domain three-dimensional audio signal of each short frame is converted into a frequency-domain three-dimensional audio signal. If the frame type of the current frame is the second type, the current frame includes one long frame; the time-domain three-dimensional audio signal of that long frame is windowed based on the window function indicated by the window function type of the current frame, and the windowed time-domain three-dimensional audio signal of the long frame is converted into a frequency-domain three-dimensional audio signal. If the frame type of the current frame is the third type, the current frame includes multiple ultra-short frames; the time-domain three-dimensional audio signal of each ultra-short frame included in the current frame is windowed based on the window function indicated by the window function type of the current frame, and the windowed time-domain three-dimensional audio signal of each ultra-short frame is converted into a frequency-domain three-dimensional audio signal.
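As a rough illustration of the third case (window function type plus frame type), the sketch below windows each (sub-)frame of one channel and transforms it to the frequency domain. The sub-frame counts, the window generator, and the use of a real FFT in place of the lapped transform (for example, an MDCT) that an audio encoder would typically use are all assumptions of this sketch.

```python
import numpy as np

FIRST_TYPE, SECOND_TYPE, THIRD_TYPE = 0, 1, 2   # as in the previous sketch

def split_into_subframes(samples, frame_type):
    """Split one channel of the current frame into the sub-frames implied by the
    frame type; 8 short frames and 16 ultra-short frames are assumed counts."""
    samples = np.asarray(samples, dtype=float)
    if frame_type == SECOND_TYPE:
        return [samples]                                   # one long frame
    return np.split(samples, 8 if frame_type == FIRST_TYPE else 16)

def time_to_frequency(samples, make_window, frame_type):
    """Window each (sub-)frame with the window indicated by the window function
    type, then transform it; np.fft.rfft stands in for the actual transform."""
    spectra = []
    for sub in split_into_subframes(samples, frame_type):
        window = make_window(len(sub))       # e.g. np.hanning, selected by the window function type
        spectra.append(np.fft.rfft(sub * window))
    return spectra

# Example with an assumed frame length of 960 samples:
# spectra = time_to_frequency(np.zeros(960), np.hanning, THIRD_TYPE)
```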
In some embodiments, the target encoding parameters may also be encoded to obtain an encoded result of the target encoding parameters, and the encoded result of the target encoding parameters is written into the bitstream.
In some embodiments, spatially encoding the frequency-domain three-dimensional audio signal based on the global transient detection result includes: spatially encoding the frequency-domain three-dimensional audio signal based on the frame type.
When the frequency-domain three-dimensional audio signal of the current frame is spatially encoded based on the frame type of the current frame, if the frame type of the current frame is the first type, that is, the current frame includes multiple short frames, the frequency-domain three-dimensional audio signals of the multiple short frames included in the current frame are interleaved to obtain the frequency-domain three-dimensional audio signal of one long frame, and the interleaved long-frame frequency-domain three-dimensional audio signal is spatially encoded. If the frame type of the current frame is the second type, that is, the current frame includes one long frame, the frequency-domain three-dimensional audio signal of that long frame is spatially encoded. If the frame type of the current frame is the third type, that is, the current frame includes multiple ultra-short frames, the frequency-domain three-dimensional audio signals of the multiple ultra-short frames included in the current frame are interleaved to obtain the frequency-domain three-dimensional audio signal of one long frame, and the interleaved long-frame frequency-domain three-dimensional audio signal is spatially encoded.
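The interleaving step can be pictured as merging the coefficients of the sub-frame spectra into a single long-frame spectrum, so that the downstream spatial encoding always operates on a long-frame-sized spectrum. The per-coefficient interleaving pattern used in the sketch below is an assumption; the application does not fix how the coefficients are ordered.

```python
import numpy as np

def interleave_subframe_spectra(subframe_spectra):
    """Interleave several equal-length sub-frame spectra into one long-frame
    spectrum: [s0[0], s1[0], ..., s0[1], s1[1], ...] (assumed pattern)."""
    spectra = np.asarray(subframe_spectra)      # shape: (n_sub, coeffs_per_subframe)
    return spectra.T.reshape(-1)

def deinterleave_long_spectrum(long_spectrum, n_sub):
    """Inverse operation, conceptually used on the decoder side."""
    return np.asarray(long_spectrum).reshape(-1, n_sub).T
```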
In some embodiments, encoding the frequency-domain signals of the N transmission channels based on the global transient detection result includes: encoding the frequency-domain signals of the N transmission channels based on the frame type of the current frame.
As an example, encoding the frequency-domain signals of the N transmission channels is implemented as follows: noise shaping is performed on the frequency-domain signals of the N transmission channels based on the frame type of the current frame; transmission-channel down-mixing is performed on the noise-shaped frequency-domain signals of the N transmission channels to obtain a down-mixed signal; the low-frequency part of the down-mixed signal is quantized and encoded, and the encoded result is written into the bitstream; and the high-frequency part of the down-mixed signal is processed with bandwidth extension and encoded, and the encoded result is written into the bitstream.
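The following schematic sketch mirrors the pipeline of this example. Every stage is a deliberately trivial stand-in (identity noise shaping, averaging down-mix, coarse rounding, per-band energies) so that the sketch runs; a real codec would replace each stage with its own algorithms, and the frame-type dependence of the noise shaping is elided here.

```python
import numpy as np

def encode_transmission_channels(channel_spectra, write_bits):
    """Schematic version of the pipeline above; all stages are simplified stand-ins."""
    shaped = [np.asarray(spec, dtype=float) for spec in channel_spectra]  # noise shaping (stand-in: identity)
    downmixed = np.mean(shaped, axis=0)                                   # transmission-channel down-mix (stand-in: average)
    cutoff = len(downmixed) // 2                                          # low/high split point, an assumption
    low, high = downmixed[:cutoff], downmixed[cutoff:]
    write_bits(np.round(low * 4).astype(np.int16).tobytes())              # low-frequency part: quantize and encode
    band_energies = np.array([np.sqrt(np.mean(band ** 2))                 # high-frequency part: bandwidth-extension-style
                              for band in np.array_split(high, 4)])       # per-band energies, then encode
    write_bits(band_energies.astype(np.float32).tobytes())

# Example: buffer = bytearray(); encode_transmission_channels(np.random.randn(4, 480), buffer.extend)
```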
Optionally, the method further includes: encoding the global transient detection result to obtain an encoded result of the global transient detection result, and writing the encoded result of the global transient detection result into the bitstream.
According to a second aspect, a decoding method is provided: a global transient detection result and spatial encoding parameters are parsed from a bitstream; decoding is performed based on the global transient detection result and the bitstream to obtain frequency-domain signals of N transmission channels; the frequency-domain signals of the N transmission channels are spatially decoded based on the global transient detection result and the spatial encoding parameters to obtain a reconstructed frequency-domain three-dimensional audio signal; and a reconstructed time-domain three-dimensional audio signal is determined based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal.
Optionally, determining the reconstructed time-domain three-dimensional audio signal based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal includes: determining target encoding parameters based on the global transient detection result, where the target encoding parameters include the window function type of the current frame and/or the frame type of the current frame; and converting the reconstructed frequency-domain three-dimensional audio signal into the reconstructed time-domain three-dimensional audio signal based on the target encoding parameters.
Based on the above description, the target encoding parameters include the window function type of the current frame and/or the frame type of the current frame. That is, the target encoding parameters include the window function type of the current frame, or the target encoding parameters include the frame type of the current frame, or the target encoding parameters include both the window function type and the frame type of the current frame. Because the process of converting the reconstructed frequency-domain three-dimensional audio signal into the reconstructed time-domain three-dimensional audio signal based on the target encoding parameters differs depending on which parameters are included, the cases are described separately below.
In the first case, the target encoding parameters include the window function type of the current frame. In this case, de-windowing is performed on the reconstructed frequency-domain three-dimensional audio signal based on the window function indicated by the window function type of the current frame, and the de-windowed frequency-domain three-dimensional audio signal is then converted into the reconstructed time-domain three-dimensional audio signal.
De-windowing is also referred to as windowing and overlap-add processing.
In the second case, the target encoding parameters include the frame type of the current frame. In this case, if the frame type of the current frame is the first type, the current frame includes multiple short frames, and the reconstructed frequency-domain three-dimensional audio signal of each short frame is converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal. If the frame type of the current frame is the second type, the current frame includes one long frame, and the reconstructed frequency-domain three-dimensional audio signal of that long frame is directly converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal. If the frame type of the current frame is the third type, the current frame includes multiple ultra-short frames, and the reconstructed frequency-domain three-dimensional audio signal of each ultra-short frame is converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal.
In the third case, the target encoding parameters include both the window function type and the frame type of the current frame. In this case, if the frame type of the current frame is the first type, the current frame includes multiple short frames; de-windowing is performed on the reconstructed frequency-domain three-dimensional audio signal of each short frame included in the current frame based on the window function indicated by the window function type of the current frame, and the de-windowed reconstructed frequency-domain three-dimensional audio signal of each short frame is converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal. If the frame type of the current frame is the second type, the current frame includes one long frame; de-windowing is performed on the reconstructed frequency-domain three-dimensional audio signal of that long frame based on the window function indicated by the window function type of the current frame, and the de-windowed frequency-domain three-dimensional audio signal of the long frame is converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal. If the frame type of the current frame is the third type, the current frame includes multiple ultra-short frames; de-windowing is performed on the reconstructed frequency-domain three-dimensional audio signal of each ultra-short frame included in the current frame based on the window function indicated by the window function type of the current frame, and the de-windowed reconstructed frequency-domain three-dimensional audio signal of each ultra-short frame is converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal.
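On the decoder side, the conversion back to the time domain mirrors the encoder-side sketch given earlier. The sketch below inverse-transforms each reconstructed (sub-)frame spectrum and applies the de-windowing (windowing and overlap-add) step; the inverse real FFT again stands in for the inverse lapped transform, and the overlap-add bookkeeping between neighbouring frames is omitted.

```python
import numpy as np

def frequency_to_time(reconstructed_spectra, make_window):
    """reconstructed_spectra: list of per-(sub-)frame spectra of one channel,
    a single element for the second type (one long frame), several elements
    for the first/third types (short or ultra-short frames)."""
    pieces = []
    for spectrum in reconstructed_spectra:
        sub = np.fft.irfft(spectrum)          # inverse time-frequency transform (stand-in)
        window = make_window(len(sub))        # window indicated by the window function type
        pieces.append(sub * window)           # de-windowing of this (sub-)frame
    return np.concatenate(pieces)             # reassembled time-domain signal of the current frame
```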
Optionally, the global transient detection result includes a global transient flag, and the target encoding parameters include the window function type of the current frame. Determining the target encoding parameters based on the global transient detection result includes: if the global transient flag is the first value, determining the type of a first preset window function as the window function type of the current frame; and if the global transient flag is the second value, determining the type of a second preset window function as the window function type of the current frame, where the window length of the first preset window function is smaller than the window length of the second preset window function.
Optionally, the global transient detection result includes a global transient flag and global transient position information, and the target encoding parameters include the window function type of the current frame. Determining the target encoding parameters based on the global transient detection result includes: if the global transient flag is the first value, determining the window function type of the current frame based on the global transient position information.
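A compact sketch of these two optional rules is given below. The window identifiers returned, the frame length, and the way the transient position is mapped to a window are assumptions for illustration only; the application only requires that the first preset window be shorter than the second.

```python
FRAME_LEN = 960   # assumed long-frame length in samples

def window_type_from_transient(transient_flag, transient_pos=None):
    """First rule: only the flag is available (transient_pos is None).
    Second rule: the flag plus the position information is available."""
    if not transient_flag:                    # second value: no transient
        return "LONG_WINDOW"                  # second preset window function (longer window)
    if transient_pos is None:
        return "SHORT_WINDOW"                 # first preset window function (shorter window)
    # Position-dependent choice: an assumed split at the middle of the frame.
    return "START_SHORT_WINDOW" if transient_pos < FRAME_LEN // 2 else "STOP_SHORT_WINDOW"
```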
According to a third aspect, an encoding apparatus is provided, where the encoding apparatus has the function of implementing the behavior of the encoding method in the first aspect. The encoding apparatus includes at least one module, and the at least one module is configured to implement the encoding method provided in the first aspect.
According to a fourth aspect, a decoding apparatus is provided, where the decoding apparatus has the function of implementing the behavior of the decoding method in the second aspect. The decoding apparatus includes at least one module, and the at least one module is configured to implement the decoding method provided in the second aspect.
According to a fifth aspect, an encoder-side device is provided, where the encoder-side device includes a processor and a memory, the memory is configured to store a program for performing the encoding method provided in the first aspect, and the processor is configured to execute the program stored in the memory to implement the encoding method provided in the first aspect.
Optionally, the encoder-side device may further include a communication bus, where the communication bus is configured to establish a connection between the processor and the memory.
According to a sixth aspect, a decoder-side device is provided, where the decoder-side device includes a processor and a memory, the memory is configured to store a program for performing the decoding method provided in the second aspect, and the processor is configured to execute the program stored in the memory to implement the decoding method provided in the second aspect.
Optionally, the decoder-side device may further include a communication bus, where the communication bus is configured to establish a connection between the processor and the memory.
According to a seventh aspect, a computer-readable storage medium is provided, where the storage medium stores instructions, and when the instructions are run on a computer, the computer is caused to perform the steps of the encoding method described in the first aspect or the steps of the decoding method described in the second aspect.
According to an eighth aspect, a computer program product containing instructions is provided, where when the instructions are run on a computer, the computer is caused to perform the steps of the encoding method described in the first aspect or the steps of the decoding method described in the second aspect. In other words, a computer program is provided, and when the computer program is executed, the steps of the encoding method described in the first aspect or the steps of the decoding method described in the second aspect are implemented.
According to a ninth aspect, a computer-readable storage medium is provided, where the computer-readable storage medium includes a bitstream obtained by the encoding method described in the first aspect.
The technical effects obtained in the third aspect, the fourth aspect, the fifth aspect, the sixth aspect, the seventh aspect, the eighth aspect, and the ninth aspect are similar to the technical effects obtained by the corresponding technical means in the first aspect or the second aspect, and details are not described herein again.
The technical solutions provided in this application have at least the following beneficial effects:
A global transient detection result is determined by performing transient detection on the signals of the M channels included in the time-domain three-dimensional audio signal of the current frame. Then, based on the global transient detection result, the time-frequency transform of the audio signal, the spatial encoding, and the encoding of the frequency-domain signals of the transmission channels are performed in sequence. In particular, when the frequency-domain signals of the transmission channels obtained after spatial encoding are encoded, the encoding of the frequency-domain signals of the transmission channels is guided by the global transient detection result, so there is no need to convert the frequency-domain signals of the transmission channels back to the time domain to determine a transient detection result for each transmission channel, and therefore no need to transform the three-dimensional audio signal between the time domain and the frequency domain multiple times. This reduces encoding complexity and improves encoding efficiency.
Description of drawings
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of an implementation environment of a terminal scenario provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of an implementation environment of a transcoding scenario of a wireless or core network device provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of an implementation environment of a broadcast television scenario provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of an implementation environment of a virtual reality streaming scenario provided by an embodiment of the present application;
Fig. 6 is a flowchart of a first encoding method provided by an embodiment of the present application;
Fig. 7 is a first exemplary block diagram of the encoding method shown in Fig. 6 provided by an embodiment of the present application;
Fig. 8 is a second exemplary block diagram of the encoding method shown in Fig. 6 provided by an embodiment of the present application;
Fig. 9 is a flowchart of a first decoding method provided by an embodiment of the present application;
Fig. 10 is an exemplary block diagram of the decoding method shown in Fig. 9 provided by an embodiment of the present application;
Fig. 11 is a flowchart of a second encoding method provided by an embodiment of the present application;
Fig. 12 is a first exemplary block diagram of the encoding method shown in Fig. 11 provided by an embodiment of the present application;
Fig. 13 is a second exemplary block diagram of the encoding method shown in Fig. 11 provided by an embodiment of the present application;
Fig. 14 is a flowchart of a second decoding method provided by an embodiment of the present application;
Fig. 15 is an exemplary block diagram of the decoding method shown in Fig. 14 provided by an embodiment of the present application;
Fig. 16 is a schematic structural diagram of an encoding apparatus provided by an embodiment of the present application;
Fig. 17 is a schematic structural diagram of a decoding apparatus provided by an embodiment of the present application;
Fig. 18 is a schematic block diagram of a codec apparatus provided by an embodiment of the present application.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the following further describes the implementations of this application in detail with reference to the accompanying drawings.
Before the encoding and decoding methods provided in the embodiments of this application are explained in detail, the terms and implementation environments involved in the embodiments of this application are first introduced.
For ease of understanding, the terms involved in the embodiments of this application are explained first.
Encoding: the process of compressing an audio signal to be encoded into a bitstream. It should be noted that after the audio signal is compressed into a bitstream, it may be referred to as an encoded audio signal or a compressed audio signal.
Decoding: the process of restoring an encoded bitstream into a reconstructed audio signal according to specific syntax rules and processing methods.
Three-dimensional audio signal: a signal that includes multiple channels and is used to represent the sound field of a three-dimensional space; it may be one of, or a combination of, an HOA signal, a multi-channel signal, and an object audio signal. For an HOA signal, the number of channels of the three-dimensional audio signal is related to the order of the three-dimensional audio signal. For example, if the three-dimensional audio signal is an A-th order signal, the number of channels of the three-dimensional audio signal is (A+1)^2.
The three-dimensional audio signal mentioned below may be any three-dimensional audio signal, for example, one of, or a combination of, an HOA signal, a multi-channel signal, and an object audio signal.
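As a small worked example of the (A+1)^2 relation mentioned above (the concrete orders below are only illustrative):

```python
def hoa_channel_count(order):
    """Number of channels of an A-th order HOA signal: (A + 1) ** 2."""
    return (order + 1) ** 2

# A first-order HOA signal has 4 channels; a third-order HOA signal has 16 channels.
assert hoa_channel_count(1) == 4 and hoa_channel_count(3) == 16
```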
Transient signal: used to characterize a transient phenomenon in the signal of a corresponding channel of the three-dimensional audio signal. If the signal of a certain channel is a transient signal, the signal of this channel is a non-stationary signal, for example, a signal whose energy changes greatly within a short time, such as the sound of drums or percussion instruments.
Next, the implementation environment involved in the embodiments of this application is introduced.
Please refer to Fig. 1, which is a schematic diagram of an implementation environment provided by an embodiment of this application. The implementation environment includes a source device 10, a destination device 20, a link 30, and a storage device 40. The source device 10 can generate an encoded three-dimensional audio signal, and therefore may also be referred to as a three-dimensional audio signal encoding device. The destination device 20 can decode the encoded three-dimensional audio signal generated by the source device 10, and therefore may also be referred to as a three-dimensional audio signal decoding device. The link 30 can receive the encoded three-dimensional audio signal generated by the source device 10 and transmit it to the destination device 20. The storage device 40 can receive and store the encoded three-dimensional audio signal generated by the source device 10; in this case, the destination device 20 can obtain the encoded three-dimensional audio signal directly from the storage device 40. Alternatively, the storage device 40 may correspond to a file server or another intermediate storage device that can save the encoded three-dimensional audio signal generated by the source device 10; in this case, the destination device 20 can obtain, via streaming or downloading, the encoded three-dimensional audio signal stored in the storage device 40.
Both the source device 10 and the destination device 20 may include one or more processors and a memory coupled to the one or more processors, where the memory may include a random access memory (RAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, any other medium that can be used to store desired program code in the form of instructions or data structures accessible by a computer, or the like. For example, both the source device 10 and the destination device 20 may include a desktop computer, a mobile computing apparatus, a notebook (for example, laptop) computer, a tablet computer, a set-top box, a telephone handset such as a so-called "smart" phone, a television, a camera, a display apparatus, a digital media player, a video game console, a vehicle-mounted computer, or the like.
The link 30 may include one or more media or apparatuses capable of transmitting the encoded three-dimensional audio signal from the source device 10 to the destination device 20. In a possible implementation, the link 30 may include one or more communication media that enable the source device 10 to send the encoded three-dimensional audio signal directly to the destination device 20 in real time. In the embodiments of this application, the source device 10 may modulate the encoded three-dimensional audio signal based on a communication standard, which may be a wireless communication protocol or the like, and may send the modulated three-dimensional audio signal to the destination device 20. The one or more communication media may include wireless and/or wired communication media; for example, the one or more communication media may include a radio frequency (RF) spectrum or one or more physical transmission lines. The one or more communication media may form part of a packet-based network, and the packet-based network may be a local area network, a wide area network, a global network (for example, the Internet), or the like. The one or more communication media may include a router, a switch, a base station, or another device that facilitates communication from the source device 10 to the destination device 20, which is not specifically limited in the embodiments of this application.
In a possible implementation, the storage device 40 can store the received encoded three-dimensional audio signal sent by the source device 10, and the destination device 20 can obtain the encoded three-dimensional audio signal directly from the storage device 40. In this case, the storage device 40 may include any one of a variety of distributed or locally accessed data storage media, for example, a hard disk drive, a Blu-ray disc, a digital versatile disc (DVD), a compact disc read-only memory (CD-ROM), a flash memory, a volatile or non-volatile memory, or any other suitable digital storage medium for storing the encoded three-dimensional audio signal.
In a possible implementation, the storage device 40 may correspond to a file server or another intermediate storage device that can save the encoded three-dimensional audio signal generated by the source device 10, and the destination device 20 may obtain, via streaming or downloading, the three-dimensional audio signal stored in the storage device 40. The file server may be any type of server capable of storing the encoded three-dimensional audio signal and sending the encoded three-dimensional audio signal to the destination device 20. In a possible implementation, the file server may include a web server, a file transfer protocol (FTP) server, a network attached storage (NAS) device, a local disk drive, or the like. The destination device 20 can obtain the encoded three-dimensional audio signal through any standard data connection, including an Internet connection. Any standard data connection may include a wireless channel (for example, a Wi-Fi connection), a wired connection (for example, a digital subscriber line (DSL) or a cable modem), or a combination of both that is suitable for obtaining the encoded three-dimensional audio data stored on the file server. The transmission of the encoded three-dimensional audio signal from the storage device 40 may be streaming transmission, download transmission, or a combination of both.
The technology of the embodiments of this application can be applied to the source device 10 shown in Fig. 1 that encodes the three-dimensional audio signal, and can also be applied to the destination device 20 that decodes the encoded three-dimensional audio signal.
In the implementation environment shown in Fig. 1, the source device 10 includes a data source 120, an encoder 100, and an output interface 140. In some embodiments, the output interface 140 may include a modulator/demodulator (modem) and/or a transmitter. The data source 120 may include an image capture apparatus (for example, a video camera), an archive containing previously captured three-dimensional audio signals, a feed interface for receiving three-dimensional audio signals from a three-dimensional audio signal content provider, and/or a computer graphics system for generating three-dimensional audio signals, or a combination of these sources of three-dimensional audio signals.
The data source 120 may send a three-dimensional audio signal to the encoder 100, and the encoder 100 may encode the received three-dimensional audio signal sent by the data source 120 to obtain an encoded three-dimensional audio signal. The encoder may send the encoded three-dimensional audio signal to the output interface. In some embodiments, the source device 10 sends the encoded three-dimensional audio signal directly to the destination device 20 via the output interface 140. In other embodiments, the encoded three-dimensional audio signal may also be stored on the storage device 40 for later retrieval by the destination device 20 for decoding and/or display.
In the implementation environment shown in Fig. 1, the destination device 20 includes an input interface 240, a decoder 200, and a display apparatus 220. In some embodiments, the input interface 240 includes a receiver and/or a modem. The input interface 240 may receive the encoded three-dimensional audio signal via the link 30 and/or from the storage device 40 and then send it to the decoder 200, and the decoder 200 may decode the received encoded three-dimensional audio signal to obtain a decoded three-dimensional audio signal. The decoder may send the decoded three-dimensional audio signal to the display apparatus 220. The display apparatus 220 may be integrated with the destination device 20 or may be external to the destination device 20. In general, the display apparatus 220 displays the decoded three-dimensional audio signal. The display apparatus 220 may be any one of a variety of types of display apparatuses, for example, a liquid crystal display (LCD), a plasma display, an organic light-emitting diode (OLED) display, or another type of display apparatus.
Although not shown in Fig. 1, in some aspects, the encoder 100 and the decoder 200 may each be integrated with an encoder and a decoder, and may include appropriate multiplexer-demultiplexer (MUX-DEMUX) units or other hardware and software for encoding both audio and video in a common data stream or in separate data streams. In some embodiments, if applicable, the MUX-DEMUX unit may conform to the ITU H.223 multiplexer protocol or another protocol such as the user datagram protocol (UDP).
The encoder 100 and the decoder 200 may each be any one of the following circuits: one or more microprocessors, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), discrete logic, hardware, or any combination thereof. If the technology of the embodiments of this application is partially implemented in software, an apparatus may store instructions for the software in a suitable non-volatile computer-readable storage medium, and may use one or more processors to execute the instructions in hardware, thereby implementing the technology of the embodiments of this application. Any of the foregoing (including hardware, software, a combination of hardware and software, and the like) may be regarded as one or more processors. Each of the encoder 100 and the decoder 200 may be included in one or more encoders or decoders, and either of them may be integrated as part of a combined encoder/decoder (codec) in a corresponding apparatus.
The embodiments of this application may generally refer to the encoder 100 as "signaling" or "sending" certain information to another apparatus such as the decoder 200. The term "signaling" or "sending" may generally refer to the transfer of syntax elements and/or other data used to decode the compressed three-dimensional audio signal. This transfer may occur in real time or almost in real time. Alternatively, this communication may occur after a period of time, for example, when syntax elements are stored in an encoded bitstream to a computer-readable storage medium at the time of encoding, and the decoding apparatus may then retrieve the syntax elements at any time after the syntax elements are stored on this medium.
The encoding and decoding methods provided in the embodiments of this application can be applied to a variety of scenarios, several of which are introduced below.
Please refer to Fig. 2, which is a schematic diagram of an implementation environment in which an encoding and decoding method provided by an embodiment of this application is applied to a terminal scenario. The implementation environment includes a first terminal 101 and a second terminal 201, and the first terminal 101 and the second terminal 201 are communicatively connected. The communication connection may be a wireless connection or a wired connection, which is not limited in the embodiments of this application.
The first terminal 101 may be a transmit-end device or a receive-end device; likewise, the second terminal 201 may be a receive-end device or a transmit-end device. When the first terminal 101 is a transmit-end device, the second terminal 201 is a receive-end device; when the first terminal 101 is a receive-end device, the second terminal 201 is a transmit-end device.
The following description takes the first terminal 101 as the transmit-end device and the second terminal 201 as the receive-end device as an example.
The first terminal 101 may be the source device 10 in the implementation environment shown in Fig. 1, and the second terminal 201 may be the destination device 20 in the implementation environment shown in Fig. 1. Both the first terminal 101 and the second terminal 201 include an audio capture module, an audio playback module, an encoder, a decoder, a channel encoding module, and a channel decoding module.
The audio capture module in the first terminal 101 captures a three-dimensional audio signal and transmits it to the encoder, and the encoder encodes the three-dimensional audio signal by using the encoding method provided in the embodiments of this application; this encoding may be referred to as source encoding. Then, to transmit the three-dimensional audio signal over a channel, the channel encoding module performs channel encoding, and the resulting bitstream is transmitted over a digital channel through a wireless or wired network communication device.
The second terminal 201 receives, through a wireless or wired network communication device, the bitstream transmitted over the digital channel; the channel decoding module performs channel decoding on the bitstream, the decoder then obtains the three-dimensional audio signal by decoding with the decoding method provided in the embodiments of this application, and the audio playback module plays it back.
The first terminal 101 and the second terminal 201 may each be any electronic product that can perform human-machine interaction with a user in one or more ways such as a keyboard, a touchpad, a touchscreen, a remote control, voice interaction, or a handwriting device, for example, a personal computer (PC), a mobile phone, a smartphone, a personal digital assistant (PDA), a wearable device, a pocket PC (PPC), a tablet computer, a smart head unit, a smart television, or a smart speaker.
A person skilled in the art should understand that the foregoing terminals are merely examples; other existing terminals or terminals that may appear in the future, if applicable to the embodiments of this application, should also fall within the protection scope of the embodiments of this application and are hereby incorporated by reference.
Please refer to Fig. 3, which is a schematic diagram of an implementation environment in which an encoding and decoding method provided by an embodiment of this application is applied to a transcoding scenario of a wireless or core network device. The implementation environment includes a channel decoding module, an audio decoder, an audio encoder, and a channel encoding module.
The audio decoder may be a decoder using the decoding method provided in the embodiments of this application, or may be a decoder using another decoding method. The audio encoder may be an encoder using the encoding method provided in the embodiments of this application, or may be an encoder using another encoding method. When the audio decoder is a decoder using the decoding method provided in the embodiments of this application, the audio encoder is an encoder using another encoding method; when the audio decoder is a decoder using another decoding method, the audio encoder is an encoder using the encoding method provided in the embodiments of this application.
In the first case, the audio decoder is a decoder using the decoding method provided in the embodiments of this application, and the audio encoder is an encoder using another encoding method.
In this case, the channel decoding module performs channel decoding on the received bitstream, the audio decoder then performs source decoding by using the decoding method provided in the embodiments of this application, and the audio encoder then performs encoding according to the other encoding method, thereby converting one format into another format, that is, transcoding. The result is then sent after channel encoding.
In the second case, the audio decoder is a decoder using another decoding method, and the audio encoder is an encoder using the encoding method provided in the embodiments of this application.
In this case, the channel decoding module performs channel decoding on the received bitstream, the audio decoder then performs source decoding by using the other decoding method, and the audio encoder then performs encoding by using the encoding method provided in the embodiments of this application, thereby converting one format into another format, that is, transcoding. The result is then sent after channel encoding.
The wireless device may be a wireless access point, a wireless router, a wireless connector, or the like. The core network device may be a mobility management entity, a gateway, or the like.
A person skilled in the art should understand that the foregoing wireless devices or core network devices are merely examples; other existing wireless or core network devices or those that may appear in the future, if applicable to the embodiments of this application, should also fall within the protection scope of the embodiments of this application and are hereby incorporated by reference.
Please refer to Fig. 4, which is a schematic diagram of an implementation environment in which an encoding and decoding method provided by an embodiment of this application is applied to a broadcast television scenario. The broadcast television scenario is divided into a live scenario and a post-production scenario. For the live scenario, the implementation environment includes a live-program three-dimensional sound production module, a three-dimensional sound encoding module, a set-top box, and a speaker group, and the set-top box includes a three-dimensional sound decoding module. For the post-production scenario, the implementation environment includes a post-production-program three-dimensional sound production module, a three-dimensional sound encoding module, a network receiver, a mobile terminal, earphones, and the like.
In the live scenario, the live-program three-dimensional sound production module produces a three-dimensional sound signal, and the three-dimensional sound signal includes a three-dimensional audio signal. The three-dimensional sound signal is encoded by applying the encoding method of the embodiments of this application to obtain a bitstream. The bitstream is transmitted to the user side over the broadcast television network, and the three-dimensional sound decoder in the set-top box decodes it by using the decoding method provided in the embodiments of this application, thereby reconstructing the three-dimensional sound signal, which is played back by the speaker group. Alternatively, the bitstream is transmitted to the user side over the Internet, and the three-dimensional sound decoder in the network receiver decodes it by using the decoding method provided in the embodiments of this application, thereby reconstructing the three-dimensional sound signal, which is played back by the speaker group. Alternatively, the bitstream is transmitted to the user side over the Internet, and the three-dimensional sound decoder in the mobile terminal decodes it by using the decoding method provided in the embodiments of this application, thereby reconstructing the three-dimensional sound signal, which is played back by the earphones.
In the post-production scenario, the post-production-program three-dimensional sound production module produces a three-dimensional sound signal. The three-dimensional sound signal is encoded by applying the encoding method of the embodiments of this application to obtain a bitstream. The bitstream is transmitted to the user side over the broadcast television network, and the three-dimensional sound decoder in the set-top box decodes it by using the decoding method provided in the embodiments of this application, thereby reconstructing the three-dimensional sound signal, which is played back by the speaker group. Alternatively, the bitstream is transmitted to the user side over the Internet, and the three-dimensional sound decoder in the network receiver decodes it by using the decoding method provided in the embodiments of this application, thereby reconstructing the three-dimensional sound signal, which is played back by the speaker group. Alternatively, the bitstream is transmitted to the user side over the Internet, and the three-dimensional sound decoder in the mobile terminal decodes it by using the decoding method provided in the embodiments of this application, thereby reconstructing the three-dimensional sound signal, which is played back by the earphones.
Please refer to Fig. 5, which is a schematic diagram of an implementation environment in which an encoding and decoding method provided by an embodiment of this application is applied to a virtual reality streaming scenario. The implementation environment includes an encoding side and a decoding side. The encoding side includes a capture module, a preprocessing module, an encoding module, a packing module, and a sending module, and the decoding side includes an unpacking module, a decoding module, a rendering module, and earphones.
The capture module captures a three-dimensional audio signal, and the preprocessing module then performs preprocessing operations. The preprocessing operations include filtering out the low-frequency part of the signal, usually with 20 Hz or 50 Hz as the cut-off point, and extracting orientation information from the signal. The encoding module then performs encoding by using the encoding method provided in the embodiments of this application, the packing module packs the encoded result, and the sending module sends it to the decoding side.
The unpacking module on the decoding side first unpacks the received data, the decoding module then performs decoding by using the decoding method provided in the embodiments of this application, the rendering module performs binaural rendering on the decoded signal, and the rendered signal is mapped to the listener's earphones. The earphones may be stand-alone earphones or earphones on a virtual reality glasses device.
It should be noted that the system architectures and service scenarios described in the embodiments of this application are intended to describe the technical solutions of the embodiments of this application more clearly and do not constitute a limitation on the technical solutions provided in the embodiments of this application. A person of ordinary skill in the art may know that, with the evolution of system architectures and the emergence of new service scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.
接下来对本申请实施例提供的编解码方法进行详细地解释说明。需要说明的是,结合图1所示的实施环境,下文中的任一种编码方法可以是源装置10中的编码器100执行的。下文中的任一种解码方法可以是目的地装置20中的解码器200执行的。Next, the codec method provided by the embodiment of the present application is explained in detail. It should be noted that, in combination with the implementation environment shown in FIG. 1 , any of the following encoding methods may be executed by the encoder 100 in the source device 10 . Any of the following decoding methods may be performed by the decoder 200 in the destination device 20 .
Refer to FIG. 6, which is a flowchart of a first encoding method provided by an embodiment of this application. The encoding method is applied to an encoding-end device and includes the following steps.
Step 601: Perform transient detection separately on the signals of M channels included in the time-domain three-dimensional audio signal of the current frame to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1.
The M transient detection results are in one-to-one correspondence with the M channels included in the time-domain three-dimensional audio signal of the current frame. A transient detection result includes a transient flag, or a transient detection result includes a transient flag and transient position information. The transient flag indicates whether the signal of the corresponding channel is a transient signal, and the transient position information indicates the position at which the transient occurs in the signal of the corresponding channel.
There are multiple ways to determine the M transient detection results corresponding to the M channels; one of them is introduced below. Because the transient detection result of each of the M channels is determined in the same way, one channel is taken as an example to describe how its transient detection result is determined. For ease of description, this channel is referred to as the target channel, and the transient flag and the transient position information of the target channel are introduced separately below.
Transient flag of the target channel
The transient detection parameter corresponding to the target channel is determined based on the signal of the target channel, and the transient flag corresponding to the target channel is determined based on the transient detection parameter corresponding to the target channel.
As an example, the transient detection parameter corresponding to the target channel is the absolute value of the inter-frame energy difference. That is, the energy of the signal of the target channel in the current frame and the energy of the signal of the target channel in the previous frame of the current frame are determined. The absolute value of the difference between the energy of the signal of the target channel in the current frame and the energy of the signal of the target channel in the previous frame is determined to obtain the absolute value of the inter-frame energy difference. If the absolute value of the inter-frame energy difference exceeds a first energy difference threshold, the transient flag corresponding to the target channel in the current frame is determined to be a first value; otherwise, the transient flag corresponding to the target channel in the current frame is determined to be a second value.
As described above, the transient flag indicates whether the signal of the corresponding channel is a transient signal. Therefore, when the absolute value of the inter-frame energy difference exceeds the first energy difference threshold, the signal of the target channel in the current frame is a transient signal, and the transient flag corresponding to the target channel in the current frame is determined to be the first value. When the absolute value of the inter-frame energy difference does not exceed the first energy difference threshold, the signal of the target channel in the current frame is not a transient signal, and the transient flag corresponding to the target channel in the current frame is determined to be the second value.
It should be noted that the first value and the second value can be represented in various ways. For example, the first value is true and the second value is false. Alternatively, the first value is 1 and the second value is 0. Other representations are of course also possible. The first energy difference threshold is preset and can be adjusted according to different requirements.
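For illustration only, the following Python sketch shows one possible form of this inter-frame check. It assumes the frame energy is computed as the sum of squared samples and that the first value and second value are represented as 1 and 0; the threshold is an assumed tuning parameter, and none of these choices is mandated by this embodiment.

```python
import numpy as np

def interframe_transient_flag(cur_frame, prev_frame, first_energy_diff_threshold):
    """Return 1 (first value) if the target channel is transient, else 0 (second value)."""
    cur_energy = np.sum(np.asarray(cur_frame, dtype=np.float64) ** 2)
    prev_energy = np.sum(np.asarray(prev_frame, dtype=np.float64) ** 2)
    # Absolute value of the inter-frame energy difference.
    energy_diff = abs(cur_energy - prev_energy)
    return 1 if energy_diff > first_energy_diff_threshold else 0
```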
As another example, the transient detection parameter corresponding to the target channel is the absolute value of the subframe energy difference. That is, the signal of the target channel in the current frame includes signals of a plurality of subframes; the absolute value of the subframe energy difference corresponding to each of the subframes is determined, and the transient flag corresponding to each subframe is then determined. If, among the plurality of subframes, there is a subframe whose transient flag is the first value, the transient flag corresponding to the target channel in the current frame is determined to be the first value. If no subframe among the plurality of subframes has a transient flag equal to the first value, the transient flag corresponding to the target channel in the current frame is determined to be the second value.
The transient flag of each of the plurality of subframes is determined in the same way, so the i-th subframe is taken as an example below, where i is a non-negative integer. That is, the energy of the signal of the i-th subframe and the energy of the signal of the (i-1)-th subframe are determined. The absolute value of the difference between the energy of the signal of the i-th subframe and the energy of the signal of the (i-1)-th subframe is determined to obtain the absolute value of the subframe energy difference corresponding to the i-th subframe. If the absolute value of the subframe energy difference corresponding to the i-th subframe exceeds a second energy difference threshold, the transient flag of the i-th subframe is determined to be the first value; otherwise, the transient flag of the i-th subframe is determined to be the second value.
As described above, the transient flag indicates whether the signal of the corresponding channel is a transient signal. Therefore, when the absolute value of the subframe energy difference corresponding to the i-th subframe exceeds the second energy difference threshold, the signal of the i-th subframe is a transient signal, and the transient flag of the i-th subframe is determined to be the first value. When the absolute value of the subframe energy difference corresponding to the i-th subframe does not exceed the second energy difference threshold, the signal of the i-th subframe is not a transient signal, and the transient flag of the i-th subframe is determined to be the second value.
It should be noted that when i = 0, the energy of the signal of the (i-1)-th subframe is the energy of the signal of the last subframe of the target channel in the previous frame of the current frame. The second energy difference threshold is preset and can be adjusted according to different requirements. In addition, the second energy difference threshold may be the same as or different from the first energy difference threshold.
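As a hedged sketch of the subframe variant (again assuming sum-of-squares energy and 1/0 flag values; the function and parameter names are illustrative, not taken from the embodiment):

```python
import numpy as np

def subframe_transient_flags(cur_subframes, prev_last_subframe, second_energy_diff_threshold):
    """Return the channel-level flag, the per-subframe flags, and the absolute
    subframe energy differences |E(i) - E(i-1)|. For i = 0, E(i-1) is the energy
    of the last subframe of the previous frame."""
    flags, diffs = [], []
    prev_energy = np.sum(np.asarray(prev_last_subframe, dtype=np.float64) ** 2)
    for sub in cur_subframes:
        energy = np.sum(np.asarray(sub, dtype=np.float64) ** 2)
        diff = abs(energy - prev_energy)
        flags.append(1 if diff > second_energy_diff_threshold else 0)
        diffs.append(diff)
        prev_energy = energy
    channel_flag = 1 if any(flags) else 0   # transient if any subframe is transient
    return channel_flag, flags, diffs
```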
Transient position information of the target channel
The transient position information corresponding to the target channel is determined based on the transient flag corresponding to the target channel.
As an example, if the transient flag corresponding to the target channel is the first value, the transient position information corresponding to the target channel is determined. If the transient flag corresponding to the target channel is the second value, it is determined that the target channel has no corresponding transient position information, or the transient position information corresponding to the target channel is set to a preset value, for example -1.
That is, when the transient flag corresponding to the target channel is the second value, the signal of the target channel is not a transient signal. In this case, the transient detection result of the target channel does not include transient position information, or the transient position information corresponding to the target channel is directly set to a preset value, where the preset value indicates that the signal of the target channel is not a transient signal. In other words, the transient detection result of a transient signal includes a transient flag and transient position information, whereas the transient detection result of a non-transient signal may include only a transient flag, or may include both a transient flag and transient position information.
It should be noted that, when the transient flag corresponding to the target channel is the first value, there are multiple ways to determine the transient position information corresponding to the target channel. As an example, the signal of the target channel in the current frame includes signals of a plurality of subframes; the subframe whose transient flag is the first value and whose absolute value of the subframe energy difference is the largest is selected from the plurality of subframes, and the sequence number of the selected subframe is determined as the transient position information corresponding to the target channel in the current frame.
For example, the transient flag corresponding to the target channel in the current frame is the first value, and the signal of the target channel in the current frame includes signals of 4 subframes, i = 0, 1, 2, 3. The absolute value of the subframe energy difference is 18 for subframe 0, 21 for subframe 1, 24 for subframe 2, and 35 for subframe 3. Assuming that the preset second energy difference threshold is 20, the signals of subframe 1, subframe 2, and subframe 3 are transient signals. In this case, the transient flags of subframe 1, subframe 2, and subframe 3 are all determined to be the first value, and among these three subframes the one with the largest absolute value of the subframe energy difference is subframe 3, so the sequence number 3 of subframe 3 is determined as the transient position information corresponding to the target channel in the current frame.
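The selection rule of this example can be sketched as follows; the helper reuses the per-subframe flags and energy differences from the previous sketch, and the preset value -1 for the non-transient case is an assumption carried over from the description above.

```python
def transient_position(flags, diffs, preset_value=-1):
    """Among the subframes flagged as transient, return the sequence number of the
    one with the largest absolute subframe energy difference."""
    candidates = [i for i, flag in enumerate(flags) if flag == 1]
    if not candidates:
        return preset_value          # the channel is not a transient signal
    return max(candidates, key=lambda i: diffs[i])

# With the example above (threshold 20): flags = [0, 1, 1, 1], diffs = [18, 21, 24, 35],
# so transient_position(flags, diffs) returns 3.
```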
Step 602: Determine a global transient detection result based on the M transient detection results.
In some embodiments, the global transient detection result includes a global transient flag. If the number of transient flags equal to the first value among the M transient flags is greater than or equal to m, the global transient flag is determined to be the first value, where m is a positive integer greater than 0 and less than M. Alternatively, if the number of channels among the M channels that satisfy a first preset condition and whose corresponding transient flag is the first value is greater than or equal to n, the global transient flag is determined to be the first value, where n is a positive integer greater than 0 and less than M.
For example, the three-dimensional audio signal of the current frame is a third-order HOA signal, and the number of channels of the HOA signal is (3+1)^2, that is, 16. Assuming that m is 1, if the number of transient flags equal to the first value among the 16 transient flags is greater than or equal to 1, the global transient flag is determined to be the first value. Alternatively, the first preset condition includes that a channel belongs to the FOA signal; for example, the channels of the FOA signal may be the first 4 channels of the HOA signal. Assuming that the channels among the M channels that satisfy the first preset condition are the channels of the FOA signal in the current frame and that n is 1, if the number of FOA channels among the 16 channels whose corresponding transient flag is the first value is greater than or equal to 1, the global transient flag is determined to be the first value.
Here, m and n are preset values, and they can also be adjusted according to different requirements. When the three-dimensional audio signal is an HOA signal, the first preset condition includes that a channel belongs to the FOA signal; the channels among the M channels that satisfy the first preset condition are the channels of the FOA signal in the three-dimensional audio signal of the current frame, where the FOA signal consists of the signals of the first 4 channels of the HOA signal. Of course, the first preset condition may also be another condition.
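A minimal sketch of the two decision rules for the global transient flag is given below. The defaults m = 1, n = 1 and the assumption that the FOA channels are the first four channels follow the example above and are adjustable; none of the names is taken from the embodiment.

```python
def global_transient_flag(channel_flags, m=1, n=1, foa_channels=range(4), use_foa_rule=False):
    """Rule 1: at least m of the M per-channel flags equal the first value.
    Rule 2: at least n of the channels satisfying the first preset condition
    (here assumed to be the FOA channels) equal the first value."""
    if use_foa_rule:
        hits = sum(1 for c in foa_channels if channel_flags[c] == 1)
        return 1 if hits >= n else 0
    hits = sum(1 for flag in channel_flags if flag == 1)
    return 1 if hits >= m else 0
```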
In other embodiments, the global transient detection result further includes global transient position information. If only one of the M transient flags is the first value, the transient position information corresponding to the channel whose transient flag is the first value is determined as the global transient position information. If at least two of the M transient flags are the first value, the transient position information corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to the at least two transient flags is determined as the global transient position information; or, if at least two of the M transient flags are the first value and the difference between the transient position information corresponding to two of the channels is smaller than a position difference threshold, the average of the transient position information corresponding to the two channels is determined as the global transient position information. The position difference threshold is preset and can be adjusted according to different requirements.
As described above, the transient detection parameter corresponding to a channel is the absolute value of the inter-frame energy difference or the absolute value of the subframe energy difference. When the transient detection parameter corresponding to a channel is the absolute value of the inter-frame energy difference, each channel corresponds to one absolute value of the inter-frame energy difference; in this case, the channel with the largest absolute value of the inter-frame energy difference may be selected from the at least two channels, and the transient position information corresponding to the selected channel is determined as the global transient position information. When the transient detection parameter corresponding to a channel is the absolute value of the subframe energy difference, each channel corresponds to multiple absolute values of the subframe energy difference; in this case, the channel with the largest absolute value of the subframe energy difference may be selected from the at least two channels, and the transient position information corresponding to the selected channel is determined as the global transient position information.
For example, for a third-order HOA signal, if only the transient flag corresponding to channel 3 among the 16 transient flags of the HOA signal is the first value, the transient position information corresponding to channel 3 can be directly determined as the global transient position information.
If 3 of the 16 transient flags of the HOA signal are the first value, corresponding to channel 1, channel 2, and channel 3: the transient position information of channel 1 is 1 and the absolute value of its inter-frame energy difference is 22; the transient position information of channel 2 is 2 and the absolute value of its inter-frame energy difference is 23; the transient position information of channel 3 is 3 and the absolute value of its inter-frame energy difference is 28. Among the three channels, the one with the largest absolute value of the inter-frame energy difference is channel 3, so the transient position information 3 of channel 3 is determined as the global transient position information.
For another example, if 3 of the 16 transient flags of the HOA signal are the first value, corresponding to channel 1, channel 2, and channel 3: the transient position information of channel 1 is 1, the signal of channel 1 includes three subframes, and the absolute values of the subframe energy differences of these three subframes are 20, 18, and 22; the transient position information of channel 2 is 2, the signal of channel 2 includes three subframes, and the absolute values of the subframe energy differences of these three subframes are 20, 23, and 25; the transient position information of channel 3 is 3, the signal of channel 3 includes three subframes, and the absolute values of the subframe energy differences of these three subframes are 25, 28, and 30. Among the three channels, the one with the largest absolute value of the subframe energy difference is channel 3, so the transient position information 3 of channel 3 is determined as the global transient position information.
If 3 of the 16 transient flags of the HOA signal are the first value, corresponding to channel 1, channel 2, and channel 3, where the transient position information of channel 1 is 1, that of channel 2 is 3, and that of channel 3 is 6: the difference 2 between the transient position information of channel 1 and channel 2 is smaller than the preset position difference threshold 3, so the average 2 of the transient position information of channel 1 and channel 2 is determined as the global transient position information.
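For illustration, the branch that selects the channel with the largest transient detection parameter can be sketched as below; the alternative averaging rule for positions that differ by less than the position difference threshold is not shown, and the argument names are assumptions.

```python
def global_transient_position(channel_flags, positions, detection_params, preset_value=-1):
    """positions[c]: transient position of channel c; detection_params[c]: the channel's
    (largest) absolute energy difference, i.e. its transient detection parameter."""
    transient_channels = [c for c, flag in enumerate(channel_flags) if flag == 1]
    if not transient_channels:
        return preset_value
    if len(transient_channels) == 1:
        return positions[transient_channels[0]]
    best = max(transient_channels, key=lambda c: detection_params[c])
    return positions[best]
```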
Step 603: Convert the time-domain three-dimensional audio signal of the current frame into a frequency-domain three-dimensional audio signal based on the global transient detection result.
In some embodiments, a target encoding parameter is determined based on the global transient detection result, where the target encoding parameter includes the window function type of the current frame and/or the frame type of the current frame. The time-domain three-dimensional audio signal of the current frame is converted into the frequency-domain three-dimensional audio signal based on the target encoding parameter.
As an example, the global transient detection result includes the global transient flag. Determining the window function type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, determining the type of a first preset window function as the window function type of the current frame; if the global transient flag is the second value, determining the type of a second preset window function as the window function type of the current frame. The window length of the first preset window function is smaller than the window length of the second preset window function.
As another example, the global transient detection result includes the global transient flag and the global transient position information. Determining the window function type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, determining the window function type of the current frame based on the global transient position information; if the global transient flag is the second value, determining the type of a third preset window function as the window function type of the current frame, or determining the window function type of the current frame based on the window function type of the previous frame of the current frame.
When the global transient flag is the first value, there are multiple ways to determine the window function type of the current frame based on the global transient position information. For example, the type of a fourth preset window function is adjusted based on the global transient position information so that the center position of the fourth preset window function corresponds to the position at which the global transient occurs, and the value of the window function is therefore largest at the global transient occurrence position. Alternatively, a window function corresponding to the global transient occurrence position is selected from a window function set, and the type of the selected window function is determined as the window function type of the current frame. That is, the window function set stores a window function for each transient occurrence position, so that the window function corresponding to the global transient occurrence position can be selected.
In addition, there are also multiple methods for determining the window function type of the current frame based on the window function type of the previous frame of the current frame; for details, reference may be made to the related art, which is not elaborated in the embodiments of this application.
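Purely as an illustration of the two examples above (the constants and the dictionary-style window function set indexed by transient occurrence position are assumptions, not part of the embodiment):

```python
SHORT_WINDOW, LONG_WINDOW = 1, 0   # assumed identifiers for the preset window function types

def window_type_from_flag(global_flag):
    """First example: a shorter first preset window when the global transient flag is
    the first value, otherwise the longer second preset window."""
    return SHORT_WINDOW if global_flag == 1 else LONG_WINDOW

def window_type_from_position(global_flag, global_position, window_function_set, fallback):
    """Second example: when the frame is transient, pick the window stored for the
    global transient occurrence position; otherwise fall back to a third preset
    window or to the previous frame's window type."""
    if global_flag == 1:
        return window_function_set[global_position]
    return fallback
```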
Because the global transient detection result may include only the global transient flag or may include the global transient flag and the global transient position information, and because the global transient position information may be either the transient position information of the channel whose transient flag is the first value or the preset value, the frame type of the current frame is determined differently for different global transient detection results. The following three cases are therefore described separately.
In the first case, the global transient detection result includes the global transient flag. Determining the frame type of the current frame based on the global transient detection result includes: if the global transient flag is the first value, determining that the frame type of the current frame is a first type, where the first type indicates that the current frame includes multiple short frames; if the global transient flag is the second value, determining that the frame type of the current frame is a second type, where the second type indicates that the current frame includes one long frame.
In the second case, the global transient detection result includes the global transient flag and the global transient position information. Determining the frame type of the current frame based on the global transient detection result includes: if the global transient flag is the first value and the global transient position information satisfies a second preset condition, determining that the frame type of the current frame is a third type, where the third type indicates that the current frame includes multiple ultra-short frames; if the global transient flag is the first value and the global transient position information does not satisfy the second preset condition, determining that the frame type of the current frame is the first type, where the first type indicates that the current frame includes multiple short frames; if the global transient flag is the second value, determining that the frame type of the current frame is the second type, where the second type indicates that the current frame includes one long frame. The frame length of an ultra-short frame is smaller than that of a short frame, and the frame length of a short frame is smaller than that of a long frame. The second preset condition may be that the distance between the transient occurrence position indicated by the global transient position information and the start position of the current frame is smaller than the frame length of an ultra-short frame, or that the distance between the transient occurrence position indicated by the global transient position information and the end position of the current frame is smaller than the frame length of an ultra-short frame.
In the third case, the global transient detection result includes the global transient position information. Determining the frame type of the current frame based on the global transient detection result includes: if the global transient position information is the preset value, for example -1, determining that the frame type of the current frame is the second type, where the second type indicates that the current frame includes one long frame; if the global transient position information is not the preset value and satisfies the second preset condition, determining that the frame type of the current frame is the third type, where the third type indicates that the current frame includes multiple ultra-short frames; if the global transient position information is not the preset value and does not satisfy the second preset condition, determining that the frame type of the current frame is the first type, where the first type indicates that the current frame includes multiple short frames. The frame length of an ultra-short frame is smaller than that of a short frame, and the frame length of a short frame is smaller than that of a long frame. The second preset condition may be that the distance between the transient occurrence position indicated by the global transient position information and the start position of the current frame is smaller than the frame length of an ultra-short frame, or that the distance between the transient occurrence position indicated by the global transient position information and the end position of the current frame is smaller than the frame length of an ultra-short frame.
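The third case can be sketched as follows. It assumes that the global transient position and the frame lengths are expressed in the same unit (for example, samples or subframes), and the frame type identifiers are illustrative only.

```python
FIRST_TYPE, SECOND_TYPE, THIRD_TYPE = 1, 2, 3   # short frames / one long frame / ultra-short frames

def frame_type_from_position(global_position, frame_length, ultra_short_length, preset_value=-1):
    """Case 3: decide the frame type of the current frame from the global transient
    position information alone."""
    if global_position == preset_value:
        return SECOND_TYPE                      # one long frame
    # Second preset condition: the transient occurs within one ultra-short frame
    # length of the start or of the end of the current frame.
    near_start = global_position < ultra_short_length
    near_end = (frame_length - global_position) < ultra_short_length
    if near_start or near_end:
        return THIRD_TYPE                       # multiple ultra-short frames
    return FIRST_TYPE                           # multiple short frames
```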
It should be noted that the window function type of the current frame indicates the shape and length of the window function of the current frame, and the window function of the current frame is used to perform windowing on the time-domain three-dimensional audio signal of the current frame. The frame type of the current frame indicates whether the current frame is an ultra-short frame, a short frame, or a long frame. Ultra-short frames, short frames, and long frames may be distinguished based on the frame duration, and the specific durations can be set according to different requirements, which is not limited in the embodiments of this application.
The time-domain three-dimensional audio signal of the current frame may be converted into the frequency-domain three-dimensional audio signal by means of a modified discrete cosine transform (MDCT), a modified discrete sine transform (MDST), or a fast Fourier transform (FFT).
As described above, the target encoding parameter includes the window function type of the current frame and/or the frame type of the current frame. That is, the target encoding parameter includes the window function type of the current frame, or the target encoding parameter includes the frame type of the current frame, or the target encoding parameter includes both the window function type and the frame type of the current frame. When the target encoding parameter includes different parameters, the process of converting the time-domain three-dimensional audio signal of the current frame into the frequency-domain three-dimensional audio signal based on the target encoding parameter differs, and the cases are therefore described separately below.
In the first case, the target encoding parameter includes the window function type of the current frame. In this case, windowing is performed on the time-domain three-dimensional audio signal of the current frame based on the window function indicated by the window function type of the current frame, and the windowed three-dimensional audio signal is then converted into the frequency-domain three-dimensional audio signal.
In the second case, the target encoding parameter includes the frame type of the current frame. In this case, if the frame type of the current frame is the first type, the current frame includes multiple short frames, and the time-domain three-dimensional audio signal of each short frame included in the current frame is converted into a frequency-domain three-dimensional audio signal. If the frame type of the current frame is the second type, the current frame includes one long frame, and the time-domain three-dimensional audio signal of that long frame is directly converted into a frequency-domain three-dimensional audio signal. If the frame type of the current frame is the third type, the current frame includes multiple ultra-short frames, and the time-domain three-dimensional audio signal of each ultra-short frame included in the current frame is converted into a frequency-domain three-dimensional audio signal.
In the third case, the target encoding parameter includes the window function type and the frame type of the current frame. In this case, if the frame type of the current frame is the first type, the current frame includes multiple short frames; windowing is performed on the time-domain three-dimensional audio signal of each short frame included in the current frame based on the window function indicated by the window function type of the current frame, and the windowed time-domain three-dimensional audio signal of each short frame is converted into a frequency-domain three-dimensional audio signal. If the frame type of the current frame is the second type, the current frame includes one long frame; windowing is performed on the time-domain three-dimensional audio signal of the long frame based on the window function indicated by the window function type of the current frame, and the windowed time-domain three-dimensional audio signal of the long frame is converted into a frequency-domain three-dimensional audio signal. If the frame type of the current frame is the third type, the current frame includes multiple ultra-short frames; windowing is performed on the time-domain three-dimensional audio signal of each ultra-short frame based on the window function indicated by the window function type of the current frame, and the windowed time-domain three-dimensional audio signal of each ultra-short frame is converted into a frequency-domain three-dimensional audio signal.
In other words, when the current frame includes multiple ultra-short frames or short frames, the frequency-domain three-dimensional audio signals of the individual ultra-short frames or short frames included in the current frame are obtained after the time-domain three-dimensional audio signal of the current frame is converted into the frequency domain. When the current frame includes one long frame, the frequency-domain three-dimensional audio signal of that one long frame is obtained after the time-domain three-dimensional audio signal of the current frame is converted into the frequency domain.
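The windowing and time-frequency conversion can be sketched with a direct-form MDCT as below. This is a simplified illustration: the window and each block must have the same length, the 50% overlap between consecutive blocks that a practical MDCT framing uses is omitted, and nothing here fixes which transform (MDCT, MDST, or FFT) is actually chosen.

```python
import numpy as np

def mdct(windowed_block):
    """Direct-form MDCT: maps a block of 2N windowed time samples to N coefficients."""
    two_n = len(windowed_block)
    n_half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_half)
    basis = np.cos(np.pi / n_half * (n[None, :] + 0.5 + n_half / 2) * (k[:, None] + 0.5))
    return basis @ windowed_block

def window_and_transform(blocks, window):
    """Apply the window indicated by the window function type to each block (one long
    block, or several short / ultra-short blocks) and transform it to the frequency domain."""
    return [mdct(window * block) for block in blocks]
```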
Step 604: Spatially encode the frequency-domain three-dimensional audio signal of the current frame based on the global transient detection result to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, where N is an integer greater than or equal to 1 and less than or equal to M.
In some embodiments, the frequency-domain three-dimensional audio signal of the current frame is spatially encoded based on the frame type of the current frame to obtain the spatial encoding parameters and the frequency-domain signals of the N transmission channels.
When spatially encoding the frequency-domain three-dimensional audio signal of the current frame based on the frame type of the current frame: if the frame type of the current frame is the first type, that is, the current frame includes multiple short frames, the frequency-domain three-dimensional audio signals of the multiple short frames included in the current frame are interleaved to obtain the frequency-domain three-dimensional audio signal of one long frame, and spatial encoding is performed on the interleaved long-frame frequency-domain three-dimensional audio signal. If the frame type of the current frame is the second type, that is, the current frame includes one long frame, spatial encoding is performed on the frequency-domain three-dimensional audio signal of that long frame. If the frame type of the current frame is the third type, that is, the current frame includes multiple ultra-short frames, the frequency-domain three-dimensional audio signals of the multiple ultra-short frames included in the current frame are interleaved to obtain the frequency-domain three-dimensional audio signal of one long frame, and spatial encoding is performed on the interleaved long-frame frequency-domain three-dimensional audio signal.
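The exact interleaving order is not specified here; the sketch below shows one common coefficient-major choice, in which coefficient k of every sub-block precedes coefficient k+1 of any sub-block, purely as an assumed example.

```python
import numpy as np

def interleave_subblocks(sub_spectra):
    """Interleave the frequency-domain signals of several short (or ultra-short)
    blocks into a single long-frame spectrum."""
    stacked = np.stack(sub_spectra)      # shape: (num_blocks, coefficients_per_block)
    return stacked.T.reshape(-1)         # k = 0 of all blocks, then k = 1 of all blocks, ...
```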
The spatial encoding method may be any method capable of obtaining spatial encoding parameters and the frequency-domain signals of the N transmission channels based on the frequency-domain three-dimensional audio signal of the current frame; for example, a matching-projection-based spatial encoding method may be used. The spatial encoding method is not limited in the embodiments of this application.
The spatial encoding parameters are the parameters determined in the process of spatially encoding the frequency-domain three-dimensional audio signal of the current frame, and include side information, bit pre-allocation side information, and the like. The frequency-domain signals of the N transmission channels may include virtual speaker signals of one or more channels and residual signals of one or more channels. In addition, when the number of coding bits is insufficient, the frequency-domain signals of the N transmission channels may include only virtual speaker signals of one or more channels.
Step 605: Encode the frequency-domain signals of the N transmission channels based on the global transient detection result to obtain a frequency-domain signal encoding result.
In some embodiments, the frequency-domain signals of the N transmission channels are encoded based on the frame type of the current frame.
As an example, encoding the frequency-domain signals of the N transmission channels includes: performing noise shaping on the frequency-domain signals of the N transmission channels based on the frame type of the current frame; performing transmission channel downmixing on the noise-shaped frequency-domain signals of the N transmission channels to obtain downmixed signals; performing quantization and encoding on the low-frequency part of the downmixed signals and writing the encoding result into the bitstream; and performing bandwidth extension and encoding on the high-frequency part of the downmixed signals and writing the encoding result into the bitstream.
It should be noted that, for the manner of performing noise shaping based on the frame type of the current frame, reference may be made to the related art, which is not elaborated in the embodiments of this application. The noise shaping includes temporal noise shaping (TNS) and frequency domain noise shaping (FDNS).
When performing transmission channel downmixing on the noise-shaped frequency-domain signals of the N transmission channels, the N noise-shaped transmission channels may be paired according to a preset criterion, or the noise-shaped frequency-domain signals of the N transmission channels may be paired according to signal correlation. Mid-side (MS) downmixing is then performed on each pair of frequency-domain signals.
For example, if the N transmission channels include 2 virtual speaker signals and 4 residual signals, the 2 virtual speaker signals may be paired according to a preset criterion and downmixed. The correlation between every 2 of the 4 residual signals may also be determined; the 2 residual signals with the highest correlation form one pair, the remaining 2 residual signals form another pair, and downmixing is performed on each pair separately.
It should be noted that, when downmixing is performed on a pair of frequency-domain signals, the result of the downmixing may be one frequency-domain signal or two frequency-domain signals, depending on the encoding process.
The low-frequency part and the high-frequency part of a signal can be divided in various ways. For example, with 2000 Hz as the cut-off point, the part of the downmixed signal below 2000 Hz is taken as the low-frequency part and the part above 2000 Hz as the high-frequency part. For another example, with 5000 Hz as the cut-off point, the part of the downmixed signal below 5000 Hz is taken as the low-frequency part and the part above 5000 Hz as the high-frequency part.
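For illustration, a possible form of the correlation-based pairing, the mid-side downmix, and the low/high split is sketched below; the 1/2 scaling of the mid-side downmix and the use of the sample correlation coefficient are assumed conventions, not requirements of this embodiment.

```python
import numpy as np

def ms_downmix(left, right):
    """Mid-side downmix of one channel pair: mid = (L + R) / 2, side = (L - R) / 2."""
    return 0.5 * (left + right), 0.5 * (left - right)

def pair_four_residuals(residuals):
    """Pair four residual channels: the two with the highest correlation form one pair,
    the remaining two form the other pair."""
    best_pair, best_corr = None, -np.inf
    for a in range(4):
        for b in range(a + 1, 4):
            corr = abs(np.corrcoef(residuals[a], residuals[b])[0, 1])
            if corr > best_corr:
                best_pair, best_corr = (a, b), corr
    other_pair = tuple(i for i in range(4) if i not in best_pair)
    return best_pair, other_pair

def split_low_high(spectrum, cutoff_bin):
    """Split a downmixed spectrum at the bin corresponding to the cut-off point
    (for example 2000 Hz or 5000 Hz)."""
    return spectrum[:cutoff_bin], spectrum[cutoff_bin:]
```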
Step 606: Encode the spatial encoding parameters to obtain a spatial encoding parameter encoding result, and write the spatial encoding parameter encoding result and the frequency-domain signal encoding result into the bitstream.
Optionally, the global transient detection result may also be encoded to obtain a global transient detection result encoding result, which is written into the bitstream. Alternatively, the target encoding parameter is encoded to obtain a target encoding parameter encoding result, which is written into the bitstream.
In the embodiments of this application, transient detection may first be performed on the signals of the M channels included in the time-domain three-dimensional audio signal of the current frame so as to determine the global transient detection result. Then, based on the global transient detection result, the time-frequency transform of the audio signal, the spatial encoding, and the encoding of the frequency-domain signals of the transmission channels are performed in sequence. In particular, when the frequency-domain signals of the transmission channels obtained by spatial encoding are encoded, the transient detection results of the transmission channels reuse the global transient detection result; it is not necessary to convert the frequency-domain signals of the transmission channels to the time domain in order to determine the transient detection result of each transmission channel, and it is therefore not necessary to transform the three-dimensional audio signal between the time domain and the frequency domain multiple times, which reduces encoding complexity and improves encoding efficiency. Moreover, the embodiments of this application do not need to encode the transient detection result of each transmission channel; only the global transient detection result needs to be encoded into the bitstream, which reduces the number of coding bits.
Refer to FIG. 7 and FIG. 8, both of which are block diagrams of an exemplary encoding method provided by an embodiment of this application. FIG. 7 and FIG. 8 mainly explain the encoding method shown in FIG. 6 by way of example. In FIG. 7, transient detection is performed separately on the signals of the M channels included in the time-domain three-dimensional audio signal of the current frame to obtain the M transient detection results corresponding to the M channels. Based on the M transient detection results, the global transient detection result is determined; the global transient detection result is encoded to obtain a global transient detection result encoding result, which is written into the bitstream. Based on the global transient detection result, the time-domain three-dimensional audio signal of the current frame is converted into a frequency-domain three-dimensional audio signal. Based on the global transient detection result, the frequency-domain three-dimensional audio signal of the current frame is spatially encoded to obtain the spatial encoding parameters and the frequency-domain signals of the N transmission channels; the spatial encoding parameters are encoded to obtain a spatial encoding parameter encoding result, and the spatial encoding parameter encoding result and the frequency-domain signal encoding result are written into the bitstream. Based on the global transient detection result, the frequency-domain signals of the N transmission channels are encoded. Further, in FIG. 8, after the frequency-domain three-dimensional audio signal of the current frame is spatially encoded to obtain the spatial encoding parameters and the frequency-domain signals of the N transmission channels, the spatial encoding parameters are encoded to obtain the spatial encoding parameter encoding result, and the spatial encoding parameter encoding result and the frequency-domain signal encoding result are written into the bitstream. Then, based on the global transient detection result, noise shaping, transmission channel downmixing, quantization and encoding, and bandwidth extension are performed on the frequency-domain signals of the N transmission channels, and the encoding result of the bandwidth-extended signal is written into the bitstream.
Based on the description in step 606 above, the encoding-end device may or may not encode the global transient detection result into the bitstream, and likewise may or may not encode the target encoding parameter into the bitstream. When the encoding-end device encodes the global transient detection result into the bitstream, the decoding-end device may perform decoding according to the method shown in FIG. 9 below. When the encoding-end device encodes the target encoding parameter into the bitstream, the decoding-end device may parse the target encoding parameter from the bitstream and then perform decoding based on the frame type of the current frame included in the target encoding parameter; the specific implementation is similar to the process in FIG. 9. Of course, the encoding-end device may encode neither the global transient detection result nor the target encoding parameter into the bitstream; in that case, reference may be made to the related art for the decoding process of the three-dimensional audio signal, which is not described in this application.
Refer to FIG. 9, which is a flowchart of a first decoding method provided by an embodiment of this application. The method is applied to the decoding end and includes the following steps.
Step 901: Parse the global transient detection result and the spatial encoding parameters from the bitstream.
Step 902: Perform decoding based on the global transient detection result and the bitstream to obtain the frequency-domain signals of the N transmission channels.
In some embodiments, the frame type of the current frame is determined based on the global transient detection result, and decoding is performed based on the frame type of the current frame and the bitstream to obtain the frequency-domain signals of the N transmission channels.
For the manner of determining the frame type of the current frame based on the global transient detection result, reference may be made to the related description in step 603 above, which is not repeated here. For the manner of performing decoding based on the frame type of the current frame and the bitstream, reference may be made to the related art, which is not elaborated in the embodiments of this application.
Step 903: Spatially decode the frequency-domain signals of the N transmission channels based on the global transient detection result and the spatial encoding parameters to obtain a reconstructed frequency-domain three-dimensional audio signal.
In some embodiments, the frequency-domain signals of the N transmission channels are spatially decoded based on the frame type of the current frame and the spatial encoding parameters to obtain the reconstructed frequency-domain three-dimensional audio signal, where the frame type of the current frame is determined based on the global transient detection result. That is, the frame type of the current frame is determined based on the global transient detection result, and then, based on the frame type of the current frame and the spatial encoding parameters, the frequency-domain signals of the N transmission channels are spatially decoded to obtain the reconstructed frequency-domain three-dimensional audio signal.
For the implementation of spatially decoding the frequency-domain signals of the N transmission channels based on the frame type of the current frame and the spatial encoding parameters, reference may be made to the related art, which is not elaborated in the embodiments of this application.
Step 904: Determine a reconstructed time-domain three-dimensional audio signal based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal.
In some embodiments, target encoding parameters are determined based on the global transient detection result, where the target encoding parameters include the window function type of the current frame and/or the frame type of the current frame, and the reconstructed frequency-domain three-dimensional audio signal is converted into the reconstructed time-domain three-dimensional audio signal based on the target encoding parameters.
For the implementation of determining the target encoding parameters based on the global transient detection result, refer to the related description in step 603 above, which is not repeated here.
As described above, the target encoding parameters include the window function type of the current frame and/or the frame type of the current frame. That is, the target encoding parameters include the window function type of the current frame, or the frame type of the current frame, or both. When the target encoding parameters include different parameters, the process of converting the reconstructed frequency-domain three-dimensional audio signal into the reconstructed time-domain three-dimensional audio signal based on the target encoding parameters differs, so the cases are described separately below.
In the first case, the target encoding parameters include the window function type of the current frame. In this case, de-windowing is performed on the reconstructed frequency-domain three-dimensional audio signal based on the window function indicated by the window function type of the current frame, and the de-windowed frequency-domain three-dimensional audio signal is then converted into the reconstructed time-domain three-dimensional audio signal.
De-windowing is also referred to as windowing and overlap-add processing.
In the second case, the target encoding parameters include the frame type of the current frame. In this case, if the frame type of the current frame is the first type, the current frame includes multiple short frames, and the reconstructed frequency-domain three-dimensional audio signal of each short frame is converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal. If the frame type of the current frame is the second type, the current frame includes one long frame, and the reconstructed frequency-domain three-dimensional audio signal of that long frame is directly converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal. If the frame type of the current frame is the third type, the current frame includes multiple ultra-short frames, and the reconstructed frequency-domain three-dimensional audio signal of each ultra-short frame is converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal.
In the third case, the target encoding parameters include both the window function type and the frame type of the current frame. In this case, if the frame type of the current frame is the first type, the current frame includes multiple short frames; de-windowing is performed on the frequency-domain three-dimensional audio signal of each short frame based on the window function indicated by the window function type of the current frame, and the de-windowed reconstructed frequency-domain three-dimensional audio signal of each short frame is converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal. If the frame type of the current frame is the second type, the current frame includes one long frame; de-windowing is performed on the reconstructed frequency-domain three-dimensional audio signal of the long frame based on the window function indicated by the window function type of the current frame, and the de-windowed frequency-domain three-dimensional audio signal of the long frame is converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal. If the frame type of the current frame is the third type, the current frame includes multiple ultra-short frames; de-windowing is performed on the frequency-domain three-dimensional audio signal of each ultra-short frame based on the window function indicated by the window function type of the current frame, and the de-windowed reconstructed frequency-domain three-dimensional audio signal of each ultra-short frame is converted into a time-domain three-dimensional audio signal to obtain the reconstructed time-domain three-dimensional audio signal.
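For illustration only, the following is a minimal sketch of the third case, assuming the frame carries either one long spectrum or a list of short or ultra-short sub-frame spectra, and that a simple inverse FFT with 50%-overlap overlap-add stands in for the codec's actual windows and inverse time-frequency transform; the function and parameter names are hypothetical and not taken from this application.

```python
import numpy as np

def sine_window(n):
    # example synthesis window; the real window is whichever one the signalled window type indicates
    return np.sin(np.pi * (np.arange(n) + 0.5) / n)

def reconstruct_time_domain(frame_type, spectra, window):
    """frame_type: 'long', 'short' or 'ultra_short'; spectra: one spectrum for a long frame,
    a list of sub-frame spectra otherwise. Returns reconstructed time-domain samples."""
    if frame_type == 'long':
        spectra = [spectra]
    n = len(window)
    hop = n // 2
    out = np.zeros(hop * (len(spectra) + 1))
    for i, spec in enumerate(spectra):
        x = np.fft.irfft(spec, n)                 # placeholder inverse transform
        out[i * hop:i * hop + n] += x * window    # de-windowing: apply synthesis window and overlap-add
    return out
```

For example, `reconstruct_time_domain('short', [spec0, spec1, spec2, spec3], sine_window(256))` would overlap-add four hypothetical short sub-frame spectra with a 256-sample window, while a long frame would pass a single spectrum with a correspondingly longer window.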
In the embodiments of this application, the decoder side parses the global transient detection result and the spatial encoding parameters from the bitstream, so the time-domain three-dimensional audio signal can be reconstructed based on the global transient detection result and the spatial encoding parameters without parsing the transient detection result of each transmission channel from the bitstream, which reduces decoding complexity and improves decoding efficiency. Moreover, when the target encoding parameters are not encoded into the bitstream, the target encoding parameters can be determined directly based on the global transient detection result, so that the time-domain three-dimensional audio signal can still be reconstructed.
Please refer to FIG. 10, which is a block diagram of an exemplary decoding method provided by an embodiment of this application and mainly serves as an exemplary explanation of the decoding method shown in FIG. 9. In FIG. 10, the global transient detection result and the spatial encoding parameters are parsed from the bitstream. Decoding is performed based on the global transient detection result and the bitstream to obtain the frequency-domain signals of the N transmission channels. Based on the global transient detection result and the spatial encoding parameters, the frequency-domain signals of the N transmission channels are spatially decoded to obtain a reconstructed frequency-domain three-dimensional audio signal. Based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal, the reconstructed time-domain three-dimensional audio signal is determined through de-windowing and an inverse time-frequency transform.
Please refer to FIG. 11, which is a flowchart of the second encoding method provided by an embodiment of this application. The encoding method is applied to an encoder-side device and includes the following steps.
Step 1101: Perform transient detection separately on the signals of the M channels included in the time-domain three-dimensional audio signal of the current frame, to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1.
For the implementation of determining the M transient detection results corresponding to the M channels, refer to the related description in step 601, which is not repeated here.
Step 1102: Determine a global transient detection result based on the M transient detection results.
For the implementation of determining the global transient detection result based on the M transient detection results, refer to the related description in step 602, which is not repeated here.
Step 1103: Convert the time-domain three-dimensional audio signal of the current frame into a frequency-domain three-dimensional audio signal based on the global transient detection result.
For the implementation of converting the time-domain three-dimensional audio signal of the current frame into the frequency-domain three-dimensional audio signal based on the global transient detection result, refer to the related description in step 603, which is not repeated here.
Step 1104: Based on the global transient detection result, perform spatial encoding on the frequency-domain three-dimensional audio signal of the current frame to obtain spatial encoding parameters and the frequency-domain signals of N transmission channels, where N is an integer greater than or equal to 1 and less than or equal to M.
For the implementation of spatially encoding the frequency-domain three-dimensional audio signal of the current frame based on the global transient detection result, refer to the related description in step 604, which is not repeated here.
Step 1105: Based on the M transient detection results, determine N transient detection results corresponding to the N transmission channels.
In some embodiments, based on the M transient flags, the transient flags of the virtual speaker signals of one or more channels included in the N transmission channels are determined according to a first preset rule, and the transient flags of the residual signals of one or more channels included in the N transmission channels are determined according to a second preset rule.
As an example, the first preset rule includes: if the number of the M transient flags whose value is the first value is greater than or equal to P, the virtual speaker signal transient flags of the one or more channels included in the N transmission channels are all the first value. The second preset rule includes: if the number of the M transient flags whose value is the first value is greater than or equal to Q, the residual signal transient flags of the one or more channels included in the N transmission channels are all the first value.
Here, P and Q are both positive integers less than M. P and Q are preset values and can be adjusted according to different requirements. Optionally, since the virtual speaker signal is used to record the real three-dimensional audio signal and is therefore more important than the residual signal, P is smaller than Q.
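As an illustration only, the following sketch implements this counting rule; the threshold values chosen for P and Q and the numeric encoding of the first and second values are hypothetical choices, not values specified by this application.

```python
def transport_flags_from_channel_flags(channel_flags, P=2, Q=4,
                                       first_value=1, second_value=0):
    """channel_flags: the M per-channel transient flags.
    Returns (virtual_speaker_flag, residual_flag) shared by the transmission channels."""
    hits = sum(1 for f in channel_flags if f == first_value)
    speaker_flag = first_value if hits >= P else second_value   # first preset rule (threshold P)
    residual_flag = first_value if hits >= Q else second_value  # second preset rule (threshold Q)
    return speaker_flag, residual_flag
```

With the illustrative thresholds P=2 and Q=4, a frame in which three of the M channels flag a transient would mark the virtual speaker signals as transient but leave the residual signals non-transient, reflecting the relation P < Q noted above.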
As another example, the first preset rule includes: if the number of the M transient flags whose value is the first value is greater than or equal to P, the transient flags corresponding to the virtual speaker signals of the one or more channels included in the N transmission channels are all the first value. The second preset rule includes: if the number of channels among the M channels that satisfy a first preset condition and whose corresponding transient flag is the first value is greater than or equal to R, the transient flags corresponding to the residual signals of the one or more channels included in the N transmission channels are all the first value.
Here, P and R are both positive integers less than M. P and R are preset values and can be adjusted according to different requirements. When the three-dimensional audio signal is an HOA signal, the first preset condition includes belonging to the FOA signal; the channels among the M channels that satisfy the first preset condition are the channels carrying the FOA signal of the three-dimensional audio signal of the current frame, and the FOA signal consists of the signals of the first four channels of the HOA signal. Of course, the first preset condition may also be another condition.
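Purely as an illustration of this second example, the sketch below assumes the M channels are ordered so that the first four carry the FOA signal and uses a hypothetical threshold R; neither the channel ordering nor the value of R is fixed by this application.

```python
def residual_flag_from_foa(channel_flags, R=2, first_value=1, second_value=0):
    """Second preset rule of the second example: count first-value flags among the FOA channels."""
    foa_hits = sum(1 for f in channel_flags[:4] if f == first_value)  # channels meeting the first preset condition
    return first_value if foa_hits >= R else second_value
```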
In some other embodiments, the N transient flags may also be determined based on the M transient flags according to a mapping relationship between the M transient flags and the N transmission channels, where the mapping relationship is determined in advance.
For example, if a certain transmission channel among the N transmission channels is mapped to several of the M channels, and at least one of the transient flags of those mapped channels is the first value, then the transient flag of that transmission channel is the first value.
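The following is a minimal sketch of this mapping-based variant, assuming the mapping is represented as a list of index sets; this representation is an assumption made here for illustration.

```python
def transport_flags_by_mapping(channel_flags, mapping, first_value=1, second_value=0):
    """mapping[t] lists the indices of the original channels mapped to transmission channel t."""
    return [first_value
            if any(channel_flags[c] == first_value for c in mapping[t])
            else second_value
            for t in range(len(mapping))]
```

For instance, with `mapping = [[0, 1], [2, 3, 4]]`, the first transmission channel is flagged as transient whenever channel 0 or channel 1 is.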
It should be noted that step 1105 may be performed at any time after step 1101 and before step 1106; the embodiments of this application do not limit when step 1105 is performed.
Step 1106: Based on the N transient detection results, encode the frequency-domain signals of the N transmission channels to obtain a frequency-domain signal encoding result.
In some embodiments, the frame type corresponding to each of the N transmission channels is determined based on the N transient detection results, and the frequency-domain signal of each transmission channel is encoded based on the frame type corresponding to that transmission channel.
Since the frame type corresponding to each of the N transmission channels is determined in the same way, one of the transmission channels is taken as an example below and, for ease of description, is referred to as the target transmission channel.
Determining the frame type corresponding to the target transmission channel based on its transient detection result includes: if the transient flag corresponding to the target transmission channel is the first value, the frame type corresponding to the target transmission channel is determined to be the first type, which indicates that the signal of the target transmission channel includes multiple short frames; if the transient flag corresponding to the target transmission channel is the second value, the frame type corresponding to the target transmission channel is determined to be the second type, which indicates that the signal of the target transmission channel includes one long frame.
It should be noted that the frame type of the current frame indicates whether the current frame is a short frame or a long frame. Short frames and long frames can be distinguished by frame duration, and the specific durations can be set according to different requirements, which is not limited in the embodiments of this application.
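As a simple illustration of this per-channel decision, the sketch below maps a transmission channel's transient flag to a frame type; the numeric codes used for the flag values and frame types are assumptions for the example only.

```python
FIRST_TYPE = 1    # signal of the channel consists of multiple short frames
SECOND_TYPE = 0   # signal of the channel consists of one long frame

def frame_type_for_transport_channel(transient_flag, first_value=1):
    return FIRST_TYPE if transient_flag == first_value else SECOND_TYPE
```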
After the frame type corresponding to each transmission channel has been determined, noise shaping is performed on the frequency-domain signal of each transmission channel based on the frame type corresponding to that channel. Then, transmission-channel downmixing is performed on the noise-shaped frequency-domain signals of the N transmission channels to obtain a downmixed signal. The low-frequency part of the downmixed signal is quantized and encoded, and the encoding result is written into the bitstream. The high-frequency part of the downmixed signal undergoes bandwidth extension and encoding, and the encoding result is written into the bitstream.
For the noise shaping, transmission-channel downmixing, quantization and encoding of the low-frequency part, and bandwidth extension and encoding, refer to the related description in step 605, which is not repeated here.
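For illustration, the sketch below strings these steps together for one frame, with trivial placeholder implementations standing in for the actual noise shaping, downmix, quantization and bandwidth-extension tools of the codec; only the ordering of the steps follows the description above, and the split point between the low- and high-frequency parts is an arbitrary example.

```python
import numpy as np

def noise_shape(spec, frame_type):
    return spec                          # placeholder: real shaping depends on the frame type

def downmix(specs):
    return np.mean(specs, axis=0)        # placeholder transmission-channel downmix

def quantize_and_code(low):
    return np.round(low).astype(int).tobytes()           # placeholder quantization and coding

def bandwidth_extension_code(high):
    return np.float32(np.mean(np.abs(high))).tobytes()   # placeholder envelope for bandwidth extension

def encode_transport_channels(freq_signals, frame_types, split_bin=256):
    """freq_signals: list of N spectra; frame_types: per-channel frame types from the transient flags."""
    shaped = [noise_shape(s, t) for s, t in zip(freq_signals, frame_types)]
    mixed = downmix(np.asarray(shaped))
    low, high = mixed[:split_bin], mixed[split_bin:]
    return quantize_and_code(low) + bandwidth_extension_code(high)  # payload that would go into the bitstream
```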
Step 1107: Encode the spatial encoding parameters and the N transient detection results to obtain a spatial encoding parameter encoding result and N transient detection result encoding results, and write the spatial encoding parameter encoding result and the N transient detection result encoding results into the bitstream.
Optionally, the global transient detection result may also be encoded to obtain a global transient detection result encoding result, which is written into the bitstream; or the target encoding parameters may be encoded to obtain a target encoding parameter encoding result, which is written into the bitstream.
In the embodiments of this application, the transient detection results corresponding to the virtual speaker signals and the residual signals included in each transmission channel are determined based on the M transient detection results corresponding to the M channels of the three-dimensional audio signal. This improves encoding accuracy when the frequency-domain signal of each transmission channel is encoded. Moreover, because the transient detection result corresponding to each transmission channel is determined from the M transient detection results, there is no need to convert the frequency-domain signal of each transmission channel back to the time domain to determine its transient detection result, and hence no need to transform the three-dimensional audio signal between the time domain and the frequency domain multiple times, which reduces encoding complexity and improves encoding efficiency.
Please refer to FIG. 12 and FIG. 13, which are block diagrams of another exemplary encoding method provided by an embodiment of this application and mainly serve as an exemplary explanation of the encoding method shown in FIG. 11. In FIG. 12, transient detection is performed separately on the signals of the M channels included in the time-domain three-dimensional audio signal of the current frame to obtain M transient detection results corresponding to the M channels. A global transient detection result is determined based on the M transient detection results, the global transient detection result is encoded to obtain a global transient detection result encoding result, and that encoding result is written into the bitstream. The time-domain three-dimensional audio signal of the current frame is converted into a frequency-domain three-dimensional audio signal based on the global transient detection result. Based on the global transient detection result, spatial encoding is performed on the frequency-domain three-dimensional audio signal of the current frame to obtain spatial encoding parameters and the frequency-domain signals of the N transmission channels; the spatial encoding parameters are encoded to obtain a spatial encoding parameter encoding result, which is written into the bitstream. Based on the M transient detection results, N transient detection results corresponding to the N transmission channels are determined and encoded to obtain N transient detection result encoding results, which are written into the bitstream. Based on the N transient detection results, the frequency-domain signals of the N transmission channels are encoded. Further, in FIG. 13, after the N transient detection results have been determined, noise shaping is performed on the frequency-domain signals of the N transmission channels based on the N transient detection results; the noise-shaped frequency-domain signals of the transmission channels then undergo transmission-channel downmixing, quantization and encoding, and bandwidth extension, and the encoding result of the bandwidth-extended signal is written into the bitstream.
Based on the description in step 1107 above, the encoder-side device may or may not encode the global transient detection result into the bitstream, and likewise may or may not encode the target encoding parameters into the bitstream. When the encoder-side device encodes the global transient detection result into the bitstream, the decoder-side device can decode according to the method shown in FIG. 14 below. When the encoder-side device encodes the target encoding parameters into the bitstream, the decoder-side device can parse the target encoding parameters from the bitstream and then decode based on the frame type of the current frame included in the target encoding parameters; the specific implementation is similar to the procedure in FIG. 14. Of course, the encoder-side device may encode neither the global transient detection result nor the target encoding parameters into the bitstream; in that case, the decoding of the three-dimensional audio signal can refer to the related art and is not elaborated in the embodiments of this application.
Please refer to FIG. 14, which is a flowchart of the second decoding method provided by an embodiment of this application. The method is applied to the decoder side and includes the following steps.
Step 1401: Parse the global transient detection result, the N transient detection results corresponding to the N transmission channels, and the spatial encoding parameters from the bitstream.
Step 1402: Decode based on the N transient detection results and the bitstream to obtain the frequency-domain signals of the N transmission channels.
In some embodiments, the frame type corresponding to each transmission channel is determined based on the N transient detection results, and decoding is performed based on the frame type corresponding to each transmission channel and the bitstream to obtain the frequency-domain signals of the N transmission channels.
For the implementation of determining the frame type corresponding to each transmission channel based on the N transient detection results, refer to the related description in step 1106 above, which is not repeated here. The implementation of decoding based on the frame type corresponding to each transmission channel and the bitstream can refer to the related art and is not detailed in the embodiments of this application.
Step 1403: Based on the frequency-domain signals of the N transmission channels and the spatial encoding parameters, spatially decode the frequency-domain signals of the N transmission channels to obtain a reconstructed frequency-domain three-dimensional audio signal.
In some embodiments, the frame type corresponding to each transmission channel is determined based on the N transient detection results, and the frequency-domain signals of the N transmission channels are spatially decoded based on the frame type corresponding to each transmission channel and the spatial encoding parameters to obtain the reconstructed frequency-domain three-dimensional audio signal.
The implementation of spatially decoding the frequency-domain signals of the N transmission channels based on the frame type corresponding to each transmission channel and the spatial encoding parameters can refer to the related art and is not detailed in the embodiments of this application.
Step 1404: Determine a reconstructed time-domain three-dimensional audio signal based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal.
For the implementation of determining the reconstructed time-domain three-dimensional audio signal based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal, refer to the related description in step 904, which is not repeated here.
In the embodiments of this application, the decoder side parses the global transient detection result, the transient detection result corresponding to each transmission channel, and the spatial encoding parameters from the bitstream. In this way, when decoding is performed based on the transient detection result corresponding to each transmission channel, the frequency-domain signal of each transmission channel can be obtained accurately. Moreover, when the target encoding parameters are not encoded into the bitstream, the target encoding parameters can be determined directly based on the global transient detection result, so that the time-domain three-dimensional audio signal can still be reconstructed.
Please refer to FIG. 15, which is a block diagram of another exemplary decoding method provided by an embodiment of this application and mainly serves as an exemplary explanation of the decoding method shown in FIG. 14. In FIG. 15, the global transient detection result, the N transient detection results corresponding to the N transmission channels, and the spatial encoding parameters are parsed from the bitstream. Decoding is performed based on the N transient detection results and the bitstream to obtain the frequency-domain signals of the N transmission channels. Based on the frequency-domain signals of the N transmission channels and the spatial encoding parameters, the frequency-domain signals of the N transmission channels are spatially decoded to obtain a reconstructed frequency-domain three-dimensional audio signal. Based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal, the reconstructed time-domain three-dimensional audio signal is determined.
FIG. 16 is a schematic structural diagram of an encoding apparatus provided by an embodiment of this application. The encoding apparatus may be implemented by software, hardware, or a combination of the two as part or all of an encoder-side device, which may be the source device shown in FIG. 1. Referring to FIG. 16, the apparatus includes: a transient detection module 1601, a determining module 1602, a conversion module 1603, a spatial encoding module 1604, a first encoding module 1605, a second encoding module 1606, and a first writing module 1607.
The transient detection module 1601 is configured to perform transient detection separately on the signals of the M channels included in the time-domain three-dimensional audio signal of the current frame, to obtain M transient detection results corresponding to the M channels, where M is an integer greater than 1. For the detailed implementation, refer to the corresponding content in the foregoing embodiments, which is not repeated here.
The determining module 1602 is configured to determine a global transient detection result based on the M transient detection results. For the detailed implementation, refer to the corresponding content in the foregoing embodiments, which is not repeated here.
The conversion module 1603 is configured to convert the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the global transient detection result. For the detailed implementation, refer to the corresponding content in the foregoing embodiments, which is not repeated here.
The spatial encoding module 1604 is configured to perform spatial encoding on the frequency-domain three-dimensional audio signal based on the global transient detection result, to obtain spatial encoding parameters and the frequency-domain signals of N transmission channels, where N is an integer greater than or equal to 1 and less than or equal to M. For the detailed implementation, refer to the corresponding content in the foregoing embodiments, which is not repeated here.
The first encoding module 1605 is configured to encode the frequency-domain signals of the N transmission channels based on the global transient detection result, to obtain a frequency-domain signal encoding result. For the detailed implementation, refer to the corresponding content in the foregoing embodiments, which is not repeated here.
The second encoding module 1606 is configured to encode the spatial encoding parameters to obtain a spatial encoding parameter encoding result. For the detailed implementation, refer to the corresponding content in the foregoing embodiments, which is not repeated here.
The first writing module 1607 is configured to write the spatial encoding parameter encoding result and the frequency-domain signal encoding result into the bitstream. For the detailed implementation, refer to the corresponding content in the foregoing embodiments, which is not repeated here.
Optionally, the conversion module 1603 includes:
a determining unit, configured to determine target encoding parameters based on the global transient detection result, where the target encoding parameters include the window function type of the current frame and/or the frame type of the current frame; and
a conversion unit, configured to convert the time-domain three-dimensional audio signal into the frequency-domain three-dimensional audio signal based on the target encoding parameters.
Optionally, the global transient detection result includes a global transient flag, and the target encoding parameters include the window function type of the current frame;
the determining unit is specifically configured to:
if the global transient flag is the first value, determine the type of the first preset window function as the window function type of the current frame;
if the global transient flag is the second value, determine the type of the second preset window function as the window function type of the current frame;
where the window length of the first preset window function is smaller than the window length of the second preset window function.
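As an illustration of this determining unit's rule, the sketch below selects between two window types; the concrete window lengths (256 versus 2048 samples) and the numeric flag values are assumptions for the example, since the text only requires the first preset window to be shorter than the second.

```python
# hypothetical window descriptors: (type id, window length in samples)
FIRST_PRESET_WINDOW = ("short_window", 256)
SECOND_PRESET_WINDOW = ("long_window", 2048)

def window_type_for_current_frame(global_transient_flag, first_value=1):
    """Shorter window when a transient is flagged, longer window otherwise."""
    return FIRST_PRESET_WINDOW if global_transient_flag == first_value else SECOND_PRESET_WINDOW
```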
Optionally, the global transient detection result includes a global transient flag and global transient position information, and the target encoding parameters include the window function type of the current frame;
the determining unit is specifically configured to:
if the global transient flag is the first value, determine the window function type of the current frame based on the global transient position information.
Optionally, the apparatus further includes:
a third encoding module, configured to encode the target encoding parameters to obtain a target encoding parameter encoding result; and
a second writing module, configured to write the target encoding parameter encoding result into the bitstream.
Optionally, the spatial encoding module 1604 is specifically configured to:
spatially encode the frequency-domain three-dimensional audio signal based on the frame type.
Optionally, the first encoding module 1605 is specifically configured to:
encode the frequency-domain signals of the N transmission channels based on the frame type of the current frame.
Optionally, the transient detection result includes a transient flag, the global transient detection result includes the global transient flag, and the transient flag indicates whether the signal of the corresponding channel is a transient signal;
the determining module 1602 is specifically configured to:
if the number of transient flags whose value is the first value among the M transient flags is greater than or equal to m, determine that the global transient flag is the first value, where m is a positive integer greater than 0 and less than M; or
if the number of channels among the M channels that satisfy the first preset condition and whose corresponding transient flag is the first value is greater than or equal to n, determine that the global transient flag is the first value, where n is a positive integer greater than 0 and less than M.
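The sketch below illustrates these two alternatives for the determining module, with m, n and the set of channels meeting the first preset condition treated as configuration inputs; their concrete values here are assumptions, not values fixed by this application.

```python
def global_transient_flag(channel_flags, m=1, first_value=1, second_value=0,
                          condition_channels=None, n=1):
    """First alternative: count first-value flags over all M channels (threshold m).
    Second alternative: count them only over the channels satisfying the first preset
    condition (threshold n), used when condition_channels is given."""
    if condition_channels is None:
        hits = sum(1 for f in channel_flags if f == first_value)
        return first_value if hits >= m else second_value
    hits = sum(1 for c in condition_channels if channel_flags[c] == first_value)
    return first_value if hits >= n else second_value
```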
Optionally, the transient detection result further includes transient position information, the global transient detection result further includes global transient position information, and the transient position information indicates the position at which the transient occurs in the signal of the corresponding channel;
the determining module 1602 is specifically configured to:
if only one of the M transient flags is the first value, determine the transient position information corresponding to the channel whose transient flag is the first value as the global transient position information; and
if at least two of the M transient flags are the first value, determine, as the global transient position information, the transient position information corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to the at least two transient flags.
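As a sketch of this selection rule, the code below picks the global transient position from per-channel positions, breaking ties between several flagged channels by the per-channel transient detection parameter; the data layout (plain Python lists) is an assumption for the example, and the text does not define the behaviour when no channel is flagged.

```python
def global_transient_position(channel_flags, positions, detection_params, first_value=1):
    """positions[c], detection_params[c]: transient position and detection parameter of channel c."""
    flagged = [c for c, f in enumerate(channel_flags) if f == first_value]
    if len(flagged) == 1:
        return positions[flagged[0]]
    if len(flagged) >= 2:
        best = max(flagged, key=lambda c: detection_params[c])  # channel with the largest detection parameter
        return positions[best]
    return None  # no channel flagged a transient: not specified by the description above
```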
Optionally, the apparatus further includes:
a fourth encoding module, configured to encode the global transient detection result to obtain a global transient detection result encoding result; and
a third writing module, configured to write the global transient detection result encoding result into the bitstream.
In the embodiments of this application, transient detection may first be performed on the signals of the M channels included in the time-domain three-dimensional audio signal of the current frame to determine a global transient detection result. Then, based on the global transient detection result, the time-frequency transform of the audio signal, the spatial encoding, and the encoding of the frequency-domain signal of each transmission channel are performed in sequence. In particular, when the frequency-domain signal of each transmission channel obtained by spatial encoding is encoded, the transient detection result of each transmission channel reuses the global transient detection result, so there is no need to convert the frequency-domain signal of each transmission channel to the time domain to determine its transient detection result, and hence no need to transform the three-dimensional audio signal between the time domain and the frequency domain multiple times, which reduces encoding complexity and improves encoding efficiency. Moreover, the embodiments of this application do not encode the transient detection result of each transmission channel; only the global transient detection result is encoded into the bitstream, which reduces the number of encoded bits.
It should be noted that when the encoding apparatus provided in the foregoing embodiments performs encoding, the division into the functional modules above is only an example. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the encoding apparatus provided in the foregoing embodiments and the encoding method embodiments belong to the same concept; for its specific implementation, refer to the method embodiments, which is not repeated here.
FIG. 17 is a schematic structural diagram of a decoding apparatus provided by an embodiment of this application. The decoding apparatus may be implemented by software, hardware, or a combination of the two as part or all of a decoder-side device, which may be the destination device shown in FIG. 1. Referring to FIG. 17, the apparatus includes: a parsing module 1701, a decoding module 1702, a spatial decoding module 1703, and a determining module 1704.
The parsing module 1701 is configured to parse the global transient detection result and the spatial encoding parameters from the bitstream. For the detailed implementation, refer to the corresponding content in the foregoing embodiments, which is not repeated here.
The decoding module 1702 is configured to decode based on the global transient detection result and the bitstream to obtain the frequency-domain signals of the N transmission channels. For the detailed implementation, refer to the corresponding content in the foregoing embodiments, which is not repeated here.
The spatial decoding module 1703 is configured to spatially decode the frequency-domain signals of the N transmission channels based on the global transient detection result and the spatial encoding parameters, to obtain a reconstructed frequency-domain three-dimensional audio signal. For the detailed implementation, refer to the corresponding content in the foregoing embodiments, which is not repeated here.
The determining module 1704 is configured to determine a reconstructed time-domain three-dimensional audio signal based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal. For the detailed implementation, refer to the corresponding content in the foregoing embodiments, which is not repeated here.
Optionally, the determining module 1704 includes:
a determining unit, configured to determine target encoding parameters based on the global transient detection result, where the target encoding parameters include the window function type of the current frame and/or the frame type of the current frame; and
a conversion unit, configured to convert the reconstructed frequency-domain three-dimensional audio signal into the reconstructed time-domain three-dimensional audio signal based on the target encoding parameters.
Optionally, the global transient detection result includes a global transient flag, and the target encoding parameters include the window function type of the current frame;
the determining unit is specifically configured to:
if the global transient flag is the first value, determine the type of the first preset window function as the window function type of the current frame;
if the global transient flag is the second value, determine the type of the second preset window function as the window function type of the current frame;
where the window length of the first preset window function is smaller than the window length of the second preset window function.
Optionally, the global transient detection result includes the global transient flag and global transient position information, and the target encoding parameters include the window function type of the current frame;
the determining unit is specifically configured to:
if the global transient flag is the first value, determine the window function type of the current frame based on the global transient position information.
In the embodiments of this application, the decoder side parses the global transient detection result and the spatial encoding parameters from the bitstream, so the time-domain three-dimensional audio signal can be reconstructed based on the global transient detection result and the spatial encoding parameters without parsing the transient detection result of each transmission channel from the bitstream, which reduces decoding complexity and improves decoding efficiency. Moreover, when the target encoding parameters are not encoded into the bitstream, the target encoding parameters can be determined directly based on the global transient detection result, so that the time-domain three-dimensional audio signal can still be reconstructed.
It should be noted that when the decoding apparatus provided in the foregoing embodiments performs decoding, the division into the functional modules above is only an example. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the decoding apparatus provided in the foregoing embodiments and the decoding method embodiments belong to the same concept; for its specific implementation, refer to the method embodiments, which is not repeated here.
FIG. 18 is a schematic block diagram of a codec apparatus 1800 used in an embodiment of this application. The codec apparatus 1800 may include a processor 1801, a memory 1802, and a bus system 1803. The processor 1801 and the memory 1802 are connected through the bus system 1803, the memory 1802 is configured to store instructions, and the processor 1801 is configured to execute the instructions stored in the memory 1802 to perform the various encoding or decoding methods described in the embodiments of this application. To avoid repetition, they are not described in detail here.
In the embodiments of this application, the processor 1801 may be a central processing unit (CPU), or another general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 1802 may include a ROM device or a RAM device. Any other suitable type of storage device may also be used as the memory 1802. The memory 1802 may include code and data 18021 accessed by the processor 1801 through the bus 1803. The memory 1802 may further include an operating system 18023 and application programs 18022, where the application programs 18022 include at least one program that allows the processor 1801 to perform the encoding or decoding methods described in the embodiments of this application. For example, the application programs 18022 may include applications 1 to N, which further include an encoding or decoding application (codec application for short) that performs the encoding or decoding methods described in the embodiments of this application.
In addition to a data bus, the bus system 1803 may further include a power bus, a control bus, a status signal bus, and the like. However, for clarity of description, the various buses are all labeled as the bus system 1803 in the figure.
Optionally, the codec apparatus 1800 may further include one or more output devices, such as a display 1804. In one example, the display 1804 may be a touch-sensitive display that combines a display with a touch-sensing unit operable to sense touch input. The display 1804 may be connected to the processor 1801 through the bus 1803.
It should be noted that the codec apparatus 1800 may perform the encoding method in the embodiments of this application, and may also perform the decoding method in the embodiments of this application.
本领域技术人员能够领会,结合本文公开描述的各种说明性逻辑框、模块和算法步骤所描述的功能可以硬件、软件、固件或其任何组合来实施。如果以软件来实施,那么各种说明性逻辑框、模块、和步骤描述的功能可作为一或多个指令或代码在计算机可读媒体上存储或传输,且由基于硬件的处理单元执行。计算机可读媒体可包含计算机可读存储媒体,其对应于有形媒体,例如数据存储媒体,或包括任何促进将计算机程序从一处传送到另一处的媒体(例如,基于通信协议)的通信媒体。以此方式,计算机可读媒体大体上可对应于(1)非暂时性的有形计算机可读存储媒体,或(2)通信媒体,例如信号或载波。数据存储媒体可为可由一或多个计算机或一或多个处理器存取以检索用于实施本申请中描述的技术的指令、代码和/或数据结构的任何可用媒体。计算机程序产品可包含计算机可读媒体。Those of skill in the art would appreciate that the functions described in conjunction with the various illustrative logical blocks, modules, and algorithm steps disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions described by the various illustrative logical blocks, modules, and steps may be stored or transmitted as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to tangible media, such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another (eg, based on a communication protocol) . In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium, such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this application. A computer program product may include a computer readable medium.
作为实例而非限制,此类计算机可读存储媒体可包括RAM、ROM、EEPROM、CD-ROM或其它光盘存储装置、磁盘存储装置或其它磁性存储装置、快闪存储器或可用来存储指令或数据结构的形式的所要程序代码并且可由计算机存取的任何其它媒体。并且,任何连接被恰当地称作计算机可读媒体。举例来说,如果使用同轴缆线、光纤缆线、双绞线、数字订户线(DSL)或例如红外线、无线电和微波等无线技术从网站、服务器或其它远程源传输指令,那么同轴缆线、光纤缆线、双绞线、DSL或例如红外线、无线电和微波等无线技术包含在媒体的定义中。但是,应理解,所述计算机可读存储媒体和数据存储媒体并不包括连接、载波、信号或其它暂时媒体,而是实际上针对于非暂时性有形存储媒体。如本文中所使用,磁盘和光盘包含压缩光盘(CD)、激光光盘、光学光盘、DVD和蓝光光盘,其中磁盘通常以磁性方式再现数据,而光盘利用激光以光学方式再现数据。以上各项的组合也应包含在计算机可读媒体的范围内。By way of example and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage, flash memory, or any other medium that can contain the desired program code in the form of a computer and can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable Wire, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of media. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, DVD and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or any other structure suitable for implementing the technologies described herein. In addition, in some aspects, the functions described with reference to the various illustrative logical blocks, modules, and steps described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or may be incorporated in a combined codec. Moreover, the technologies may be fully implemented in one or more circuits or logic elements. In an example, the various illustrative logical blocks, units, and modules in the encoder 100 and the decoder 200 may be understood as corresponding circuit devices or logic elements.
The technologies of the embodiments of this application may be implemented in a wide variety of apparatuses or devices, including a wireless handset, an integrated circuit (IC), or a set of ICs (for example, a chipset). Various components, modules, or units are described in the embodiments of this application to emphasize functional aspects of apparatuses configured to perform the disclosed technologies, but these components, modules, or units do not necessarily need to be implemented by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit in conjunction with suitable software and/or firmware, or may be provided by interoperating hardware units (including one or more processors as described above).
In other words, all or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used for implementation, all or some of the embodiments may be implemented in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, over a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or in a wireless manner (for example, over infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, a solid-state drive (SSD)), or the like. It should be noted that the computer-readable storage medium mentioned in the embodiments of this application may be a non-volatile storage medium, in other words, a non-transitory storage medium.
It should be understood that "a plurality of" mentioned herein means two or more. In the descriptions of the embodiments of this application, unless otherwise specified, "/" means "or"; for example, A/B may represent A or B. The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent the following three cases: only A exists, both A and B exist, and only B exists. In addition, to clearly describe the technical solutions of the embodiments of this application, terms such as "first" and "second" are used in the embodiments of this application to distinguish between identical or similar items whose functions and effects are basically the same. A person skilled in the art can understand that the terms "first", "second", and the like do not limit a quantity or an execution order, and do not indicate that the items are necessarily different.
It should be noted that the information (including but not limited to user device information and user personal information), data (including but not limited to data used for analysis, stored data, and displayed data), and signals involved in the embodiments of this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the time-domain three-dimensional audio signals and bitstreams involved in the embodiments of this application are obtained with full authorization.
The foregoing descriptions are embodiments provided in this application and are not intended to limit this application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of this application shall fall within the protection scope of this application.

Claims (33)

  1. An encoding method, wherein the method comprises:
    performing transient detection separately on signals of M channels comprised in a time-domain three-dimensional audio signal of a current frame, to obtain M transient detection results corresponding to the M channels, wherein M is an integer greater than 1;
    determining a global transient detection result based on the M transient detection results;
    converting the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the global transient detection result;
    spatially encoding the frequency-domain three-dimensional audio signal based on the global transient detection result, to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, wherein N is an integer greater than or equal to 1 and less than or equal to M;
    encoding the frequency-domain signals of the N transmission channels based on the global transient detection result, to obtain a frequency-domain signal encoding result;
    encoding the spatial encoding parameters to obtain a spatial encoding parameter encoding result; and
    writing the spatial encoding parameter encoding result and the frequency-domain signal encoding result into a bitstream.
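To make the claimed flow easier to follow, a minimal self-contained Python sketch of an encoder shaped like claim 1 is given below. It is a reading aid only, not part of the claims: every helper name (transient_detect, to_frequency, spatial_encode), the energy-ratio transient rule, the "any channel flagged" global rule, and the dict standing in for a bitstream are illustrative assumptions rather than the patent's actual algorithms.

```python
import numpy as np

# Toy stand-ins for the building blocks named in the claim. All of them are
# illustrative assumptions, not the patent's actual algorithms.

def transient_detect(ch):
    """Per-channel transient detection: compare the energies of the two halves
    of the frame and flag a sharp rise (assumed rule)."""
    half = len(ch) // 2
    e1 = float(np.sum(ch[:half] ** 2)) + 1e-12
    e2 = float(np.sum(ch[half:] ** 2)) + 1e-12
    ratio = e2 / e1
    return {"flag": int(ratio > 8.0),                # transient flag
            "position": int(np.argmax(np.abs(ch))),  # assumed transient position
            "param": ratio}                          # transient detection parameter

def to_frequency(frame, short_windows):
    """Time-to-frequency transform; a real codec would apply an MDCT whose
    window length is chosen from the global transient result."""
    n_blocks = 4 if short_windows else 1
    blocks = np.array_split(frame, n_blocks, axis=-1)
    return np.stack([np.fft.rfft(b, axis=-1) for b in blocks], axis=1)

def spatial_encode(freq, n_transport):
    """Keep N transport channels and summarize all channels as crude energy
    parameters (a gross simplification of spatial encoding)."""
    transport = freq[:n_transport]
    params = np.abs(freq).mean(axis=(1, 2))
    return params, transport

def encode_frame(frame, n_transport=2):
    """frame: (M, L) time-domain 3D-audio frame with M > 1 channels."""
    results = [transient_detect(ch) for ch in frame]             # step 1
    global_flag = int(any(r["flag"] for r in results))           # step 2 (assumed rule)
    freq = to_frequency(frame, short_windows=bool(global_flag))  # step 3
    params, transport = spatial_encode(freq, n_transport)        # step 4
    # steps 5-7: signal/parameter "encoding" and bitstream writing are reduced
    # to packing everything into a dict that stands in for the bitstream.
    return {"global_flag": global_flag,
            "spatial_params": params.tolist(),
            "transport": [t.tolist() for t in transport]}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal((4, 256))  # M = 4 channels, 256 samples
    frame[:, 200:] *= 12.0                 # inject a transient near the frame end
    print(encode_frame(frame)["global_flag"])  # -> 1 (short blocks selected)
```

Running the example flags the injected transient, so the transform step switches to short blocks; that coupling between the per-channel detection, the single global result, and the later transform and encoding steps is exactly what the claim spells out.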
  2. The method according to claim 1, wherein the converting the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the global transient detection result comprises:
    determining a target encoding parameter based on the global transient detection result, wherein the target encoding parameter comprises a window function type of the current frame and/or a frame type of the current frame; and
    converting the time-domain three-dimensional audio signal into the frequency-domain three-dimensional audio signal based on the target encoding parameter.
  3. The method according to claim 2, wherein the global transient detection result comprises a global transient flag, and the target encoding parameter comprises the window function type of the current frame;
    the determining a target encoding parameter based on the global transient detection result comprises:
    if the global transient flag is a first value, determining a type of a first preset window function as the window function type of the current frame; or
    if the global transient flag is a second value, determining a type of a second preset window function as the window function type of the current frame,
    wherein a window length of the first preset window function is less than a window length of the second preset window function.
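Illustratively, the window-type decision of claim 3 reduces to a single branch. The concrete flag values and window names below are assumptions, since the claim leaves them open:

```python
FIRST_VALUE, SECOND_VALUE = 1, 0                 # assumed codings of the two flag values
SHORT_WINDOW, LONG_WINDOW = "short", "long"      # first / second preset window types

def select_window_type(global_transient_flag: int) -> str:
    """Short (first preset) window when the frame is transient, long (second
    preset) window otherwise; the short window has the smaller window length."""
    if global_transient_flag == FIRST_VALUE:
        return SHORT_WINDOW
    return LONG_WINDOW

assert select_window_type(FIRST_VALUE) == SHORT_WINDOW
assert select_window_type(SECOND_VALUE) == LONG_WINDOW
```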
  4. The method according to claim 2, wherein the global transient detection result comprises a global transient flag and global transient position information, and the target encoding parameter comprises the window function type of the current frame;
    the determining a target encoding parameter based on the global transient detection result comprises:
    if the global transient flag is a first value, determining the window function type of the current frame based on the global transient position information.
  5. The method according to any one of claims 2 to 4, wherein the method further comprises:
    encoding the target encoding parameter to obtain a target encoding parameter encoding result; and
    writing the target encoding parameter encoding result into the bitstream.
  6. The method according to any one of claims 2 to 5, wherein the spatially encoding the frequency-domain three-dimensional audio signal based on the global transient detection result comprises:
    spatially encoding the frequency-domain three-dimensional audio signal based on the frame type.
  7. The method according to any one of claims 2 to 6, wherein the encoding the frequency-domain signals of the N transmission channels based on the global transient detection result comprises:
    encoding the frequency-domain signals of the N transmission channels based on the frame type of the current frame.
  8. The method according to any one of claims 1 to 7, wherein each transient detection result comprises a transient flag, the global transient detection result comprises a global transient flag, and the transient flag indicates whether the signal of the corresponding channel is a transient signal;
    the determining a global transient detection result based on the M transient detection results comprises:
    if a quantity of transient flags that are a first value among the M transient flags is greater than or equal to m, determining that the global transient flag is the first value, wherein m is a positive integer greater than 0 and less than M; or
    if a quantity of channels that satisfy a first preset condition and whose corresponding transient flags are the first value among the M channels is greater than or equal to n, determining that the global transient flag is the first value, wherein n is a positive integer greater than 0 and less than M.
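Read as code, the two alternative rules of claim 8 might look as follows. This is a sketch only: the thresholds m and n are claim parameters left open, and the energy-based "first preset condition" used here is purely an assumption.

```python
def global_flag_by_count(transient_flags, m, first_value=1, second_value=0):
    """First alternative: the global flag takes the first value when at least
    m of the M per-channel flags equal the first value."""
    count = sum(1 for f in transient_flags if f == first_value)
    return first_value if count >= m else second_value

def global_flag_by_condition(channels, transient_flags, n, first_value=1,
                             second_value=0,
                             condition=lambda ch: sum(x * x for x in ch) > 1.0):
    """Second alternative: only channels that also satisfy a first preset
    condition are counted (the energy threshold here is an assumption)."""
    count = sum(1 for ch, f in zip(channels, transient_flags)
                if condition(ch) and f == first_value)
    return first_value if count >= n else second_value

flags = [1, 0, 1, 0]                      # M = 4 channels, two flagged as transient
print(global_flag_by_count(flags, m=2))   # -> 1
print(global_flag_by_count(flags, m=3))   # -> 0
```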
  9. The method according to claim 8, wherein each transient detection result further comprises transient position information, the global transient detection result further comprises global transient position information, and the transient position information indicates a position at which a transient occurs in the signal of the corresponding channel;
    the determining a global transient detection result based on the M transient detection results comprises:
    if only one of the M transient flags is the first value, determining the transient position information corresponding to the channel whose transient flag is the first value as the global transient position information; or
    if at least two of the M transient flags are the first value, determining, as the global transient position information, the transient position information corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to the at least two transient flags.
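A corresponding sketch of the position-selection rule in claim 9; the dict layout and the example values are assumptions:

```python
def global_transient_position(results, first_value=1):
    """results: per-channel dicts with 'flag', 'position' and a transient
    detection 'param' (for example an energy ratio). Returns the global
    transient position information."""
    flagged = [r for r in results if r["flag"] == first_value]
    if len(flagged) == 1:
        return flagged[0]["position"]
    if len(flagged) >= 2:
        # the channel with the largest transient detection parameter wins
        return max(flagged, key=lambda r: r["param"])["position"]
    return None  # the no-transient case is not covered by the claim (assumption)

results = [{"flag": 1, "position": 37, "param": 2.5},
           {"flag": 0, "position": 12, "param": 0.3},
           {"flag": 1, "position": 90, "param": 7.1}]
print(global_transient_position(results))  # -> 90
```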
  10. The method according to any one of claims 1 to 9, wherein the method further comprises:
    encoding the global transient detection result to obtain a global transient detection result encoding result; and
    writing the global transient detection result encoding result into the bitstream.
  11. A decoding method, wherein the method comprises:
    parsing a global transient detection result and spatial encoding parameters from a bitstream;
    performing decoding based on the global transient detection result and the bitstream, to obtain frequency-domain signals of N transmission channels;
    spatially decoding the frequency-domain signals of the N transmission channels based on the global transient detection result and the spatial encoding parameters, to obtain a reconstructed frequency-domain three-dimensional audio signal; and
    determining a reconstructed time-domain three-dimensional audio signal based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal.
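For symmetry, a decoder shaped like claim 11 is sketched below. It continues the illustrative encoder example given after claim 1 (same numpy import; bitstream is the dict returned by encode_frame there), and the spatial decoding and inverse transform are placeholder assumptions, not the patent's method.

```python
import numpy as np

def decode_frame(bitstream, n_transport=2):
    """bitstream: the dict produced by the illustrative encode_frame() above."""
    # 1) parse the global transient result and the spatial encoding parameters
    global_flag = bitstream["global_flag"]
    spatial_params = np.asarray(bitstream["spatial_params"])
    # 2) decode the frequency-domain signals of the N transport channels
    transport = [np.asarray(t) for t in bitstream["transport"]]
    # 3) spatial decoding: rebuild all M channels from the transport channels,
    #    rescaled with the crude energy parameters (a gross simplification)
    ref_level = np.abs(transport[0]).mean() + 1e-12
    freq_rec = np.stack([transport[i % n_transport] * (spatial_params[i] / ref_level)
                         for i in range(len(spatial_params))])
    # 4) frequency-to-time conversion; in a real decoder the inverse transform's
    #    window length would be chosen from global_flag, here the block count is
    #    simply read back from the stored shape
    blocks = [np.fft.irfft(freq_rec[:, b, :], axis=-1)
              for b in range(freq_rec.shape[1])]
    return np.concatenate(blocks, axis=-1), global_flag
```

With the earlier example, decode_frame(encode_frame(frame)) returns a coarse (4, 256) reconstruction together with the parsed global flag; fidelity is not the point of the sketch, only the order of the claimed steps.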
  12. The method according to claim 11, wherein the determining a reconstructed time-domain three-dimensional audio signal based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal comprises:
    determining a target encoding parameter based on the global transient detection result, wherein the target encoding parameter comprises a window function type of a current frame and/or a frame type of the current frame; and
    converting the reconstructed frequency-domain three-dimensional audio signal into the reconstructed time-domain three-dimensional audio signal based on the target encoding parameter.
  13. The method according to claim 12, wherein the global transient detection result comprises a global transient flag, and the target encoding parameter comprises the window function type of the current frame;
    the determining a target encoding parameter based on the global transient detection result comprises:
    if the global transient flag is a first value, determining a type of a first preset window function as the window function type of the current frame; or
    if the global transient flag is a second value, determining a type of a second preset window function as the window function type of the current frame,
    wherein a window length of the first preset window function is less than a window length of the second preset window function.
  14. The method according to claim 12, wherein the global transient detection result comprises a global transient flag and global transient position information, and the target encoding parameter comprises the window function type of the current frame;
    the determining a target encoding parameter based on the global transient detection result comprises:
    if the global transient flag is a first value, determining the window function type of the current frame based on the global transient position information.
  15. An encoding apparatus, wherein the apparatus comprises:
    a transient detection module, configured to perform transient detection separately on signals of M channels comprised in a time-domain three-dimensional audio signal of a current frame, to obtain M transient detection results corresponding to the M channels, wherein M is an integer greater than 1;
    a determining module, configured to determine a global transient detection result based on the M transient detection results;
    a conversion module, configured to convert the time-domain three-dimensional audio signal into a frequency-domain three-dimensional audio signal based on the global transient detection result;
    a spatial encoding module, configured to spatially encode the frequency-domain three-dimensional audio signal based on the global transient detection result, to obtain spatial encoding parameters and frequency-domain signals of N transmission channels, wherein N is an integer greater than or equal to 1 and less than or equal to M;
    a first encoding module, configured to encode the frequency-domain signals of the N transmission channels based on the global transient detection result, to obtain a frequency-domain signal encoding result;
    a second encoding module, configured to encode the spatial encoding parameters to obtain a spatial encoding parameter encoding result; and
    a first writing module, configured to write the spatial encoding parameter encoding result and the frequency-domain signal encoding result into a bitstream.
  16. The apparatus according to claim 15, wherein the conversion module comprises:
    a determining unit, configured to determine a target encoding parameter based on the global transient detection result, wherein the target encoding parameter comprises a window function type of the current frame and/or a frame type of the current frame; and
    a conversion unit, configured to convert the time-domain three-dimensional audio signal into the frequency-domain three-dimensional audio signal based on the target encoding parameter.
  17. The apparatus according to claim 16, wherein the global transient detection result comprises a global transient flag, and the target encoding parameter comprises the window function type of the current frame;
    the determining unit is specifically configured to:
    if the global transient flag is a first value, determine a type of a first preset window function as the window function type of the current frame; or
    if the global transient flag is a second value, determine a type of a second preset window function as the window function type of the current frame,
    wherein a window length of the first preset window function is less than a window length of the second preset window function.
  18. The apparatus according to claim 16, wherein the global transient detection result comprises a global transient flag and global transient position information, and the target encoding parameter comprises the window function type of the current frame;
    the determining unit is specifically configured to:
    if the global transient flag is a first value, determine the window function type of the current frame based on the global transient position information.
  19. The apparatus according to any one of claims 16 to 18, wherein the apparatus further comprises:
    a third encoding module, configured to encode the target encoding parameter to obtain a target encoding parameter encoding result; and
    a second writing module, configured to write the target encoding parameter encoding result into the bitstream.
  20. The apparatus according to any one of claims 16 to 19, wherein the spatial encoding module is specifically configured to:
    spatially encode the frequency-domain three-dimensional audio signal based on the frame type.
  21. The apparatus according to any one of claims 16 to 20, wherein the first encoding module is specifically configured to:
    encode the frequency-domain signals of the N transmission channels based on the frame type of the current frame.
  22. The apparatus according to any one of claims 15 to 21, wherein each transient detection result comprises a transient flag, the global transient detection result comprises a global transient flag, and the transient flag indicates whether the signal of the corresponding channel is a transient signal;
    the determining module is specifically configured to:
    if a quantity of transient flags that are a first value among the M transient flags is greater than or equal to m, determine that the global transient flag is the first value, wherein m is a positive integer greater than 0 and less than M; or
    if a quantity of channels that satisfy a first preset condition and whose corresponding transient flags are the first value among the M channels is greater than or equal to n, determine that the global transient flag is the first value, wherein n is a positive integer greater than 0 and less than M.
  23. The apparatus according to claim 22, wherein each transient detection result further comprises transient position information, the global transient detection result further comprises global transient position information, and the transient position information indicates a position at which a transient occurs in the signal of the corresponding channel;
    the determining module is specifically configured to:
    if only one of the M transient flags is the first value, determine the transient position information corresponding to the channel whose transient flag is the first value as the global transient position information; or
    if at least two of the M transient flags are the first value, determine, as the global transient position information, the transient position information corresponding to the channel with the largest transient detection parameter among the at least two channels corresponding to the at least two transient flags.
  24. The apparatus according to any one of claims 15 to 23, wherein the apparatus further comprises:
    a fourth encoding module, configured to encode the global transient detection result to obtain a global transient detection result encoding result; and
    a third writing module, configured to write the global transient detection result encoding result into the bitstream.
  25. A decoding apparatus, wherein the apparatus comprises:
    a parsing module, configured to parse a global transient detection result and spatial encoding parameters from a bitstream;
    a decoding module, configured to perform decoding based on the global transient detection result and the bitstream, to obtain frequency-domain signals of N transmission channels;
    a spatial decoding module, configured to spatially decode the frequency-domain signals of the N transmission channels based on the global transient detection result and the spatial encoding parameters, to obtain a reconstructed frequency-domain three-dimensional audio signal; and
    a determining module, configured to determine a reconstructed time-domain three-dimensional audio signal based on the global transient detection result and the reconstructed frequency-domain three-dimensional audio signal.
  26. The apparatus according to claim 25, wherein the determining module comprises:
    a determining unit, configured to determine a target encoding parameter based on the global transient detection result, wherein the target encoding parameter comprises a window function type of a current frame and/or a frame type of the current frame; and
    a conversion unit, configured to convert the reconstructed frequency-domain three-dimensional audio signal into the reconstructed time-domain three-dimensional audio signal based on the target encoding parameter.
  27. The apparatus according to claim 26, wherein the global transient detection result comprises a global transient flag, and the target encoding parameter comprises the window function type of the current frame;
    the determining unit is specifically configured to:
    if the global transient flag is a first value, determine a type of a first preset window function as the window function type of the current frame; or
    if the global transient flag is a second value, determine a type of a second preset window function as the window function type of the current frame,
    wherein a window length of the first preset window function is less than a window length of the second preset window function.
  28. The apparatus according to claim 26, wherein the global transient detection result comprises a global transient flag and global transient position information, and the target encoding parameter comprises the window function type of the current frame;
    the determining unit is specifically configured to:
    if the global transient flag is a first value, determine the window function type of the current frame based on the global transient position information.
  29. An encoder-side device, wherein the encoder-side device comprises a memory and a processor;
    the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, to implement the encoding method according to any one of claims 1 to 10.
  30. A decoder-side device, wherein the decoder-side device comprises a memory and a processor;
    the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, to implement the decoding method according to any one of claims 11 to 14.
  31. A computer-readable storage medium, wherein the storage medium stores instructions, and when the instructions are run on a computer, the computer is enabled to perform the steps of the method according to any one of claims 1 to 14.
  32. A computer-readable storage medium, comprising a bitstream obtained by using the encoding method according to any one of claims 1 to 10.
  33. A computer program, wherein when the computer program is executed, the method according to any one of claims 1 to 14 is implemented.
PCT/CN2022/120507 2021-09-29 2022-09-22 Encoding and decoding methods and apparatus, device, storage medium, and computer program WO2023051370A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111155355.4 2021-09-29
CN202111155355.4A CN115881139A (en) 2021-09-29 2021-09-29 Encoding and decoding method, apparatus, device, storage medium, and computer program

Publications (1)

Publication Number Publication Date
WO2023051370A1 true WO2023051370A1 (en) 2023-04-06

Family

ID=85756468

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/120507 WO2023051370A1 (en) 2021-09-29 2022-09-22 Encoding and decoding methods and apparatus, device, storage medium, and computer program

Country Status (3)

Country Link
CN (1) CN115881139A (en)
AR (1) AR127171A1 (en)
WO (1) WO2023051370A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5687283A (en) * 1995-05-23 1997-11-11 Nec Corporation Pause compressing speech coding/decoding apparatus
CN1750407A (en) * 2002-08-21 2006-03-22 中山正音数字技术有限公司 Coding method for compressing coding of multiple audio track digital audio signal
CN1783726A (en) * 2002-08-21 2006-06-07 广州广晟数码技术有限公司 Decoder for decoding and reestablishing multi-channel audio signal from audio data code stream
CN101197577A (en) * 2006-12-07 2008-06-11 展讯通信(上海)有限公司 Encoding and decoding method for audio processing frame
CN103493129A (en) * 2011-02-14 2014-01-01 弗兰霍菲尔运输应用研究公司 Apparatus and method for coding a portion of an audio signal using a transient detection and a quality result
CN110544484A (en) * 2019-09-23 2019-12-06 中科超影(北京)传媒科技有限公司 high-order Ambisonic audio coding and decoding method and device

Also Published As

Publication number Publication date
AR127171A1 (en) 2023-12-27
CN115881139A (en) 2023-03-31

Similar Documents

Publication Publication Date Title
TW201005730A (en) Method and apparatus for error concealment of encoded audio data
TW202127916A (en) Soundfield adaptation for virtual reality audio
US20230179941A1 (en) Audio Signal Rendering Method and Apparatus
US20230137053A1 (en) Audio Coding Method and Apparatus
US20200020342A1 (en) Error concealment for audio data using reference pools
US11081116B2 (en) Embedding enhanced audio transports in backward compatible audio bitstreams
EP4131263A1 (en) Audio signal encoding method and apparatus
US20230298600A1 (en) Audio encoding and decoding method and apparatus
US20230298601A1 (en) Audio encoding and decoding method and apparatus
US10727858B2 (en) Error resiliency for entropy coded audio data
WO2023051370A1 (en) Encoding and decoding methods and apparatus, device, storage medium, and computer program
US20230145725A1 (en) Multi-channel audio signal encoding and decoding method and apparatus
WO2023051367A1 (en) Decoding method and apparatus, and device, storage medium and computer program product
WO2023051368A1 (en) Encoding and decoding method and apparatus, and device, storage medium and computer program product
WO2022258036A1 (en) Encoding method and apparatus, decoding method and apparatus, and device, storage medium and computer program
WO2022242534A1 (en) Encoding method and apparatus, decoding method and apparatus, device, storage medium and computer program
JP2023523081A (en) Bit allocation method and apparatus for audio signal
US20240169998A1 (en) Multi-Channel Signal Encoding and Decoding Method and Apparatus
WO2024021731A1 (en) Audio encoding and decoding method and apparatus, storage medium, and computer program product
EP4318470A1 (en) Audio encoding method and apparatus, and audio decoding method and apparatus
KR20240038770A (en) Audio signal encoding method and device and audio signal decoding method and device
JP2024518846A (en) Method and apparatus for encoding three-dimensional audio signals, and encoder

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874757

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022874757

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022874757

Country of ref document: EP

Effective date: 20240404

NENP Non-entry into the national phase

Ref country code: DE