CN118314908A - Scene audio decoding method and electronic equipment

Info

Publication number: CN118314908A
Application number: CN202310614158.7A
Authority: CN (China)
Language: Chinese (zh)
Inventors: 刘帅, 高原, 李佳蔚, 夏丙寅, 王喆
Applicant and assignee: Huawei Technologies Co Ltd
Legal status: Pending

Abstract

The application provides a scene audio decoding method and electronic equipment. The decoding method comprises: receiving a first code stream; decoding the first code stream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a higher-order energy gain coding result, where the first reconstructed signal is a reconstructed signal of a first audio signal in the scene audio signal; generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first audio signal; reconstructing, based on the attribute information of the target virtual speaker and the virtual speaker signal, a first reconstructed scene audio signal; and adjusting the first reconstructed scene audio signal according to the higher-order energy gain coding result to obtain a reconstructed scene audio signal. Compared with the prior art, the application achieves a lower decoding code rate at the same quality, and can also improve the decoding quality of the audio signal.

Description

Scene audio decoding method and electronic equipment
The present application claims priority to Chinese Patent Application No. 202310030731.X, entitled "A method and apparatus for processing three-dimensional audio signal", filed with the Chinese Patent Office on January 6, 2023, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiment of the application relates to the field of audio decoding, in particular to a scene audio decoding method and electronic equipment.
Background
Three-dimensional audio technology acquires, processes, transmits, renders and plays back sound events and three-dimensional sound field information of the real world by means of computers, signal processing, and the like. Three-dimensional audio gives sound a strong sense of space, envelopment and immersion, offering the listener the remarkable auditory experience of "being there in person". Among such technologies, higher order ambisonics (HOA) is independent of the speaker layout during the recording, encoding and playback phases, and HOA-format data can be rotated during playback; this flexibility in three-dimensional audio playback has attracted wide attention and research.
For an N-order HOA signal, the corresponding number of channels is (N+1)^2. As the HOA order increases, the HOA signal records more detailed sound scenes, but its data volume also increases; the large data volume makes transmission and storage difficult, so the HOA signal needs to be encoded and decoded. The prior art, however, reconstructs the HOA signal with low accuracy.
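As a quick numeric check of the channel-count relationship above, a minimal sketch (the helper name `hoa_channel_count` is ours, not from the patent):

```python
def hoa_channel_count(order: int) -> int:
    """Number of channels of an N-order HOA signal: (N + 1)^2."""
    return (order + 1) ** 2

# A first-order HOA signal has 4 channels; a third-order one already has 16,
# which is why the data volume grows quickly with the order.
counts = {n: hoa_channel_count(n) for n in range(1, 5)}
print(counts)  # {1: 4, 2: 9, 3: 16, 4: 25}
```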
Disclosure of Invention
The application provides a scene audio coding and decoding method and electronic equipment.
In a first aspect, an embodiment of the present application provides a method for encoding audio of a scene, including: acquiring a scene audio signal to be encoded, wherein the scene audio signal comprises audio signals of C1 channels, and C1 is a positive integer; acquiring attribute information of a target virtual speaker corresponding to the scene audio signal; acquiring a higher-order energy gain of the scene audio signal; coding the high-order energy gain to obtain a high-order energy gain coding result; encoding a first audio signal in the scene audio signal, attribute information of the target virtual speaker and the high-order energy gain encoding result to obtain a first code stream; the first audio signal is an audio signal of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
In a possible manner, the scene audio signal is an N1-order higher order ambisonic HOA signal, the N1-order HOA signal including a second audio signal, the second audio signal being an audio signal of the N1-order HOA signal other than the first audio signal, C1 being equal to a square of (n1+1); the acquiring the higher-order energy gain of the scene audio signal comprises: and acquiring the high-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal.
Illustratively, "the N1-order HOA signal comprises the second audio signal" may be understood to mean that, apart from the first audio signal, the N1-order HOA signal comprises only the second audio signal; alternatively, it may be understood to mean that, apart from the first audio signal, the N1-order HOA signal comprises the second audio signal as well as other audio signals.
For example, the first audio signal may be referred to as a low-order portion of the scene audio signal and the second audio signal may be referred to as a high-order portion of the scene audio signal. That is, a low-order portion of the scene audio signal and a part of a high-order portion of the scene audio signal may be encoded.
It should be appreciated that when only the first audio signal of the N1-order HOA signal is encoded, the number of encoded channels is smaller, and the corresponding code rate lower, than when the second audio signal is encoded as well.
In a possible manner, the obtaining the higher-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal includes: acquiring an energy gain of the first audio signal and an energy gain of the second audio signal; and acquiring the high-order energy gain according to the energy gain of the first audio signal and the energy gain of the second audio signal.
In a possible manner, the obtaining the higher-order energy gain according to the energy gain of the first audio signal and the energy gain of the second audio signal includes: the higher order energy Gain' (i, b) is obtained by:
Gain'(i, b) = 10 * log10(E(i, b) / E(1, b));
where log10 denotes the base-10 logarithm, * denotes multiplication, E(1, b) is the channel energy of the b-th frequency band of the first audio signal, E(i, b) is the energy of the i-th channel in the b-th frequency band of the second audio signal, i is the number of the i-th channel of the second audio signal, and b is the frequency band sequence number of the second audio signal.
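The dB-gain formula above can be sketched directly in code; the function name and scalar interface are illustrative, not from the patent:

```python
import math

def higher_order_energy_gain(e_first: float, e_high: float) -> float:
    """Gain'(i, b) = 10 * log10(E(i, b) / E(1, b)), in dB.

    e_first: channel energy E(1, b) of the b-th band of the first audio signal.
    e_high:  energy E(i, b) of the i-th channel, b-th band of the second signal.
    """
    return 10.0 * math.log10(e_high / e_first)

# If the higher-order channel carries a quarter of the reference energy,
# the gain is 10 * log10(0.25), i.e. about -6.02 dB.
print(round(higher_order_energy_gain(1.0, 0.25), 2))  # -6.02
```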
In a possible manner, the encoding the higher-order energy gain to obtain a higher-order energy gain encoding result includes: quantizing the high-order energy gain to obtain quantized high-order energy gain; and entropy coding is carried out on the quantized high-order energy gain so as to obtain a coding result of the high-order energy gain.
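The patent does not specify the quantizer or the entropy coder. Below is a minimal sketch under the assumption of a uniform scalar quantizer, with zlib standing in for a real entropy coder; the step size `STEP_DB` and all names are hypothetical:

```python
import struct
import zlib

STEP_DB = 1.5  # hypothetical quantization step in dB; not specified by the patent

def quantize_gains(gains_db):
    """Uniform scalar quantization of the higher-order energy gains."""
    return [round(g / STEP_DB) for g in gains_db]

def entropy_encode(indices):
    """Stand-in entropy coder: pack the indices as int16 and deflate them.

    A real codec would use e.g. arithmetic or Huffman coding instead."""
    raw = struct.pack(f"{len(indices)}h", *indices)
    return zlib.compress(raw)

def entropy_decode(payload, count):
    raw = zlib.decompress(payload)
    return list(struct.unpack(f"{count}h", raw))

def dequantize(indices):
    return [i * STEP_DB for i in indices]

gains = [-6.0, -6.1, -5.9, -12.3]
code = entropy_encode(quantize_gains(gains))
decoded = dequantize(entropy_decode(code, len(gains)))
print(decoded)  # each value within STEP_DB / 2 of the corresponding input
```

The quantization error is bounded by half the step size, which is the usual trade-off between gain precision and the bits spent on the coded result.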
It should be noted that, the position of the target virtual speaker matches the position of the sound source in the scene audio signal; according to the attribute information of the target virtual speaker and the first audio signal in the scene audio signals, a virtual speaker signal corresponding to the target virtual speaker can be generated; the scene audio signal can be reconstructed from the virtual speaker signal and the higher order energy gain encoding result. Therefore, the encoding end encodes the first audio signal in the scene audio signal, the attribute information of the target virtual speaker and the higher-order energy gain encoding result together and then sends the encoded first audio signal, the attribute information of the target virtual speaker and the higher-order energy gain encoding result to the decoding end, and the decoding end can reconstruct the scene audio signal based on the first reconstruction signal obtained by decoding (namely, the reconstruction signal of the first audio signal in the scene audio signal), the attribute information of the target virtual speaker and the higher-order energy gain encoding result.
Compared with other methods for reconstructing scene audio signals in the prior art, the audio quality of the scene audio signals reconstructed based on the virtual speaker signals is higher; therefore, when K is equal to C1, the audio quality of the reconstructed scene audio signal is higher under the same code rate.
When K is smaller than C1, the number of channels of the audio signal encoded by the application is smaller than in the prior art, and the data volume of the attribute information of the target virtual speaker is far smaller than that of the audio signal of one channel; therefore, the application achieves a lower coding rate at the same quality.
In addition, in the prior art, the scene audio signal is converted into a virtual speaker signal and a residual signal before encoding; in contrast, the encoding end of the application directly encodes the first audio signal in the scene audio signal, without calculating the virtual speaker signal or the residual signal, so the encoding complexity at the encoding end is lower.
Illustratively, the scene audio signal in the embodiments of the application may refer to a signal that describes a sound field. The scene audio signal may include an HOA signal (which may be a three-dimensional HOA signal or a two-dimensional HOA signal, also referred to as a planar HOA signal) and a three-dimensional audio signal; here, the three-dimensional audio signal may refer to an audio signal of the scene audio signal other than the HOA signal.
In one possible way, when N1 is equal to 1, K may be equal to C1; when N1 is greater than 1, K may be less than C1. It should be appreciated that K may also be less than C1 when N1 is equal to 1.
For example, the process of encoding the first audio signal in the scene audio signal and the attribute information of the target virtual speaker may include downmixing, transformation, quantization, entropy coding, and the like; the application is not limited in this regard.
For example, the first bitstream may include encoded data of a first audio signal among the scene audio signals, and encoded data of attribute information of the target virtual speaker.
In one possible manner, the target virtual speaker may be selected from a plurality of candidate virtual speakers based on the scene audio signal, and the attribute information of the target virtual speaker may then be determined. Illustratively, the virtual speakers (including the candidate virtual speakers and the target virtual speaker) are virtual rather than physically existing speakers.
For example, the plurality of candidate virtual speakers may be uniformly distributed on the sphere, and the number of target virtual speakers may be one or more.
In one possible manner, a preset target virtual speaker may be acquired, and then attribute information of the target virtual speaker may be determined.
It should be understood that the application is not limited in the manner in which the target virtual speaker is determined.
In a second aspect, an embodiment of the present application provides a method for decoding a scene audio, including: receiving a first code stream; decoding the first code stream to obtain a first reconstruction signal, attribute information of a target virtual speaker and a high-order energy gain coding result, wherein the first reconstruction signal is a reconstruction signal of a first audio signal in a scene audio signal, the scene audio signal comprises audio signals of C1 channels, the first audio signal is an audio signal of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer smaller than or equal to C1; generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first audio signal; reconstructing based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal; the first reconstructed scene audio signal comprises audio signals of C2 channels, and C2 is a positive integer; and adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result to obtain a reconstructed scene audio signal.
In a possible manner, the scene audio signal is an N1-order higher order ambisonic HOA signal, the N1-order HOA signal including a second audio signal, the second audio signal being an audio signal of the N1-order HOA signal other than the first audio signal, C1 being equal to a square of (n1+1); and/or the first reconstructed scene audio signal is an N2-order HOA signal, the N2-order HOA signal comprises a third audio signal, the third audio signal is a reconstructed signal corresponding to each channel of the second audio signal in the N2-order HOA signal, and C2 is equal to the square of (n2+1).
In a possible manner, the adjusting the first reconstructed scene audio signal according to the higher-order energy gain coding result to obtain a reconstructed scene audio signal includes: entropy decoding is carried out on the high-order energy gain coding result so as to obtain high-order energy gain after entropy decoding; performing inverse quantization on the entropy decoded high-order energy gain to obtain the high-order energy gain; adjusting the higher-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal to obtain an adjusted decoded higher-order energy gain; and adjusting a third audio signal in the N2-order HOA signal according to the adjusted decoded high-order energy gain to obtain an adjusted third audio signal, wherein the adjusted third audio signal belongs to the reconstructed scene audio signal. In the above scheme, the decoding end obtains the high-order energy gain coding result from the first code stream, and uses the high-order energy gain coding result to perform energy adjustment on the third audio signal in the N2-order HOA signal. The decoding end adjusts the high-order channel energy of the third audio signal by using the high-order energy gain coding result, so that the decoding quality of the third audio signal is higher.
In a possible manner, the adjusting the higher-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal includes: acquiring higher-order energy of the second audio signal according to the channel energy of the first audio signal and the higher-order energy gain; obtaining a decoding energy scale factor according to the channel energy of the third audio signal and the higher-order energy of the second audio signal; obtaining a decoded higher order energy gain of the third audio signal according to the channel energy of the third audio signal and the channel energy of the first audio signal; and adjusting the decoded high-order energy gain of the third audio signal according to the decoded energy scale factor to obtain the adjusted decoded high-order energy gain.
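The patent states these steps without closed-form formulas. The sketch below is one plausible reading: the higher-order energy of the second audio signal is recovered by inverting the encoder's dB gain (which does follow from the earlier Gain' formula), while the ratio forms chosen for g(i, b) and Gain_dec(i, b) are our assumptions:

```python
import math

def decoded_gain_quantities(e_first: float, e_third: float, gain_db: float):
    """One plausible reading of the decode-side steps (assumptions flagged):

    - higher-order energy of the second signal, inverting the encoder gain:
      E_hi = E(1, b) * 10^(Gain'(i, b) / 10)   (follows from Gain' above)
    - decoded energy scale factor g(i, b): target energy over the
      reconstructed channel energy           (assumed ratio form)
    - decoded higher-order gain Gain_dec(i, b): reconstructed channel
      energy over first-signal channel energy (assumed ratio form)
    """
    e_hi = e_first * 10.0 ** (gain_db / 10.0)
    g = e_hi / e_third
    gain_dec = e_third / e_first
    return e_hi, g, gain_dec

e_hi, g, gain_dec = decoded_gain_quantities(e_first=1.0, e_third=0.5, gain_db=-6.02)
print(round(e_hi, 3), round(g, 3), round(gain_dec, 3))
```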
In a possible manner, the adjusting the decoded higher-order energy gain of the third audio signal according to the decoded energy scale factor to obtain the adjusted decoded higher-order energy gain includes:
The adjusted decoded higher-order energy gain Gain_dec'(i, b) is obtained by:
Gain_dec'(i, b) = w * min(g(i, b), Gain_dec(i, b)) + (1 - w) * g(i, b);
wherein g(i, b) represents the decoded energy scale factor, Gain_dec(i, b) represents the decoded higher-order energy gain of the third audio signal, i is the number of the i-th channel of the third audio signal, b is the frequency band sequence number of the third audio signal, w is a preset adjustment scale threshold, min represents taking the minimum value, and * represents multiplication.
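The blending formula above is simple enough to sketch directly; the function name is ours, and the sample values of g, Gain_dec and w are arbitrary:

```python
def adjusted_decoded_gain(g: float, gain_dec: float, w: float) -> float:
    """Gain_dec'(i, b) = w * min(g, Gain_dec) + (1 - w) * g.

    w is the preset adjustment scale threshold; weighting toward
    min(g, Gain_dec) caps the applied gain so that the higher-order
    channel energy is not over-amplified.
    """
    return w * min(g, gain_dec) + (1.0 - w) * g

# With w = 0.7 the smaller of the two gains dominates:
print(round(adjusted_decoded_gain(g=0.8, gain_dec=0.5, w=0.7), 2))  # 0.59
```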
In a possible manner, the adjusting the third audio signal in the N2-order HOA signal according to the adjusted decoded higher-order energy gain includes: obtaining an attenuation factor according to the frequency band sequence number where the third audio signal is located; and adjusting the third audio signal according to the adjusted decoded higher-order energy gain and the attenuation factor. In the above scheme, after the adjusted decoded higher-order energy gain is obtained, the gain applied to the third audio signal of the current frame may be weighted so that it attenuates with the frequency band sequence number where the third audio signal is located: the attenuation factor is first obtained from that frequency band sequence number, and the adjusted decoded higher-order energy gain is then applied to the higher-order channels of the third audio signal reconstructed for the current frame. This makes the energy of the higher-order channels more uniform and smooth, improving the quality of the reconstructed audio signal.
In a possible manner, after the adjusting the first reconstructed scene audio signal according to the higher order energy gain coding result, the method further includes: acquiring channel energy of a fourth audio signal corresponding to the adjusted third audio signal, wherein the third audio signal comprises an audio signal of a current frame, and the fourth audio signal comprises an audio signal of a frame before the current frame; and adjusting the adjusted third audio signal again according to the channel energy of the fourth audio signal. In the above scheme, the decoding end may further adjust the adjusted third audio signal of the current frame again by using the previous frame of the third audio signal, so as to improve the quality of the reconstructed audio signal.
In a possible manner, the readjusting the adjusted third audio signal according to the channel energy of the fourth audio signal includes: acquiring a channel energy average value of the fourth audio signal and the channel energy of the adjusted third audio signal; acquiring an energy average threshold according to the channel energy average value of the fourth audio signal and the adjusted channel energy of the third audio signal; performing weighted average calculation on the channel energy average value of the fourth audio signal and the adjusted channel energy of the third audio signal according to the energy average threshold value to obtain target energy; acquiring an energy smoothing factor according to the target energy and the channel energy of the adjusted third audio signal; and adjusting the adjusted third audio signal again according to the energy smoothing factor.
In a possible manner, the obtaining the attenuation factor according to the frequency band sequence number where the third audio signal is located includes:
the attenuation factor g'(i, b) is obtained as follows: [formula not reproduced in the source text];
or alternatively: [formula not reproduced in the source text];
wherein i is the number of the i-th channel of the third audio signal, b is the frequency band sequence number where the third audio signal is located, and p is a preset attenuation threshold.
In the above scheme, the attenuation factor g'(i, b) is calculated from the channel number i and the frequency band sequence number b of the third audio signal; with the attenuation factor calculated accurately from these parameters, adjusting the third audio signal using the attenuation factor and the adjusted decoded higher-order energy gain improves the quality of the reconstructed audio signal.
In a possible manner, the obtaining an energy average threshold according to the channel energy average value of the fourth audio signal and the adjusted channel energy of the third audio signal includes:
the energy average threshold k is obtained as follows: [formula not reproduced in the source text];
wherein E_mean(i, b) represents the channel energy average value of the fourth audio signal and E'_dec(i, b) represents the energy of the adjusted third audio signal.
In a possible manner, the obtaining an energy smoothing factor according to the target energy and the channel energy of the adjusted third audio signal includes:
The energy smoothing factor q (i, b) is obtained as follows:
q(i, b) = sqrt(E_target(i, b)) / sqrt(E'_dec(i, b));
wherein E_target(i, b) represents the target energy and E'_dec(i, b) represents the energy of the adjusted third audio signal.
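The smoothing steps above can be sketched end to end for a single channel. Since the formula for the energy average threshold k is not reproduced in the source text, k is treated here as a given weight (an assumption), and the weighted-average form of the target energy is likewise assumed:

```python
import math

def smoothing_factor(e_target: float, e_dec: float) -> float:
    """q(i, b) = sqrt(E_target(i, b)) / sqrt(E'_dec(i, b))."""
    return math.sqrt(e_target) / math.sqrt(e_dec)

def smooth_channel(samples, e_mean_prev, k):
    """Re-adjust one channel of the adjusted third audio signal.

    e_mean_prev: channel energy average of the fourth (previous-frame) signal.
    k: energy average threshold, taken here as a given weight in [0, 1]
       because its formula is not reproduced in the source text.
    """
    e_dec = sum(x * x for x in samples)             # adjusted channel energy E'_dec
    e_target = k * e_mean_prev + (1.0 - k) * e_dec  # assumed weighted-average form
    q = smoothing_factor(e_target, e_dec)
    return [q * x for x in samples]

out = smooth_channel([1.0, -1.0, 1.0, -1.0], e_mean_prev=16.0, k=0.5)
# The smoothed channel energy lands on the target: 0.5*16 + 0.5*4 = 10.
print(round(sum(x * x for x in out), 3))  # 10.0
```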
Compared with other methods for reconstructing scene audio signals in the prior art, the audio quality of the scene audio signals reconstructed based on the virtual speaker signals is higher; therefore, when K is equal to C1, the audio quality of the reconstructed scene audio signal is higher under the same code rate.
When K is smaller than C1, the number of channels of the audio signal encoded by the application is smaller than in the prior art, and the data volume of the attribute information of the target virtual speaker is far smaller than that of the audio signal of one channel; therefore, at the same code rate, the reconstructed scene audio signal obtained by decoding in the application has higher audio quality.
Second, the virtual speaker signal and residual information transmitted in the prior art are converted from the original audio signal (i.e., the scene audio signal to be encoded) rather than being the original audio signal itself, so errors are introduced. The application encodes part of the original audio signal (i.e., the audio signals of K channels of the scene audio signal to be encoded), avoiding these conversion errors, and can therefore further improve the audio quality of the decoded reconstructed scene audio signal; it also avoids fluctuation in the reconstruction quality of the decoded reconstructed scene audio signal, giving high stability.
Furthermore, the prior art encodes and transmits the virtual speaker signals, and because the data volume of the virtual speaker signals is large, the number of target virtual speakers the prior art can select is limited by the bandwidth. The application encodes and transmits the attribute information of the virtual speakers, whose data volume is far smaller than that of the virtual speaker signals; the number of target virtual speakers selected by the application is therefore less limited by the bandwidth. The more target virtual speakers are selected, the higher the quality of the scene audio signal reconstructed based on their virtual speaker signals. Therefore, at the same code rate, the application can select more target virtual speakers than the prior art, so the decoded reconstructed scene audio signal is of higher quality.
In addition, the encoding end and the decoding end of the application do not need residual calculation and superposition operations, so their overall complexity is lower than that of the encoding end and the decoding end in the prior art.
It should be understood that, when the encoding end performs lossy compression on the first audio signal in the scene audio signal, the first reconstructed signal obtained by decoding at the decoding end differs from the first audio signal encoded at the encoding end; when the encoding end performs lossless compression on the first audio signal, the two are identical.
It should likewise be understood that, when the encoding end performs lossy compression on the attribute information of the target virtual speaker, the attribute information obtained by decoding at the decoding end differs from the attribute information encoded at the encoding end; when the encoding end performs lossless compression on the attribute information of the virtual speaker, the two are identical. The application does not distinguish, by name, the attribute information encoded at the encoding end from the attribute information decoded at the decoding end.
Any implementation manner of the second aspect and the second aspect corresponds to any implementation manner of the first aspect and the first aspect, respectively. The technical effects corresponding to the second aspect and any implementation manner of the second aspect may be referred to the technical effects corresponding to the first aspect and any implementation manner of the first aspect, which are not described herein.
In a third aspect, an embodiment of the present application provides a method for generating a code stream, where the method may generate the code stream according to any implementation manner of the first aspect and the first aspect.
Any implementation manner of the third aspect and any implementation manner of the third aspect corresponds to any implementation manner of the first aspect and any implementation manner of the first aspect, respectively. The technical effects corresponding to the third aspect and any implementation manner of the third aspect may be referred to the technical effects corresponding to the first aspect and any implementation manner of the first aspect, which are not described herein.
In a fourth aspect, an embodiment of the present application provides a scene audio coding apparatus, including:
The acquisition module is used for acquiring a scene audio signal to be encoded, wherein the scene audio signal comprises audio signals of C1 channels, and C1 is a positive integer;
the acquisition module is further used for acquiring attribute information of a target virtual speaker corresponding to the scene audio signal;
the acquisition module is further used for acquiring the higher-order energy gain of the scene audio signal;
The coding module is used for coding the high-order energy gain to obtain a high-order energy gain coding result;
The encoding module is further configured to encode a first audio signal in the scene audio signal, attribute information of the target virtual speaker, and the high-order energy gain encoding result, so as to obtain a first code stream; the first audio signal is an audio signal of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
The scene audio coding device of the fourth aspect may perform the steps in any implementation manner of the first aspect and the first aspect, which are not described herein.
Any implementation manner of the fourth aspect and any implementation manner of the fourth aspect corresponds to any implementation manner of the first aspect and any implementation manner of the first aspect, respectively. Technical effects corresponding to any implementation manner of the fourth aspect may be referred to the technical effects corresponding to any implementation manner of the first aspect, and are not described herein.
In a fifth aspect, an embodiment of the present application provides a scene audio decoding apparatus, including:
The code stream receiving module is used for receiving the first code stream;
The decoding module is used for decoding the first code stream to obtain a first reconstruction signal, attribute information of a target virtual speaker and a high-order energy gain coding result, wherein the first reconstruction signal is a reconstruction signal of a first audio signal in a scene audio signal, the scene audio signal comprises audio signals of C1 channels, the first audio signal is an audio signal of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer smaller than or equal to C1;
The virtual speaker signal generation module is used for generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first audio signal;
The scene audio signal reconstruction module is used for reconstructing based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal; the first reconstructed scene audio signal comprises audio signals of C2 channels, and C2 is a positive integer;
And the scene audio signal adjusting module is used for adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result so as to obtain a reconstructed scene audio signal.
The scene audio decoding apparatus of the fifth aspect may perform the steps in any implementation manner of the second aspect and the second aspect, which are not described herein.
Any implementation manner of the fifth aspect and any implementation manner of the fifth aspect corresponds to any implementation manner of the second aspect and any implementation manner of the second aspect, respectively. Technical effects corresponding to any implementation manner of the fifth aspect may be referred to technical effects corresponding to any implementation manner of the second aspect, and will not be described herein.
In a sixth aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor, the memory coupled to the processor; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the method of scene audio coding of the first aspect or any possible implementation of the first aspect.
Any implementation manner of the sixth aspect corresponds to any implementation manner of the first aspect and the first aspect, respectively. Technical effects corresponding to any implementation manner of the sixth aspect may be referred to the technical effects corresponding to any implementation manner of the first aspect, and are not described herein.
In a seventh aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory is coupled to the processor; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the scene audio decoding method in the second aspect or any possible implementation manner of the second aspect.
The seventh aspect and any implementation manner of the seventh aspect correspond to the second aspect and any implementation manner of the second aspect, respectively. For the technical effects of the seventh aspect and any implementation manner of the seventh aspect, refer to the technical effects of the second aspect and the corresponding implementation manner; details are not repeated herein.
In an eighth aspect, embodiments of the present application provide a chip, including one or more interface circuits and one or more processors; the interface circuit is configured to receive signals from a memory of an electronic device and send the signals to the processor, where the signals include computer instructions stored in the memory; when the computer instructions are executed by the processor, the electronic device is caused to perform the scene audio encoding method in the first aspect or any possible implementation manner of the first aspect.
The eighth aspect and any implementation manner of the eighth aspect correspond to the first aspect and any implementation manner of the first aspect, respectively. For the technical effects of the eighth aspect and any implementation manner of the eighth aspect, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated herein.
In a ninth aspect, embodiments of the present application provide a chip, including one or more interface circuits and one or more processors; the interface circuit is configured to receive signals from a memory of an electronic device and send the signals to the processor, where the signals include computer instructions stored in the memory; when the computer instructions are executed by the processor, the electronic device is caused to perform the scene audio decoding method in the second aspect or any possible implementation manner of the second aspect.
The ninth aspect and any implementation manner of the ninth aspect correspond to the second aspect and any implementation manner of the second aspect, respectively. For the technical effects of the ninth aspect and any implementation manner of the ninth aspect, refer to the technical effects of the second aspect and the corresponding implementation manner; details are not repeated herein.
In a tenth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program that, when run on a computer or a processor, causes the computer or the processor to perform the scene audio encoding method in the first aspect or any possible implementation manner of the first aspect.
The tenth aspect and any implementation manner of the tenth aspect correspond to the first aspect and any implementation manner of the first aspect, respectively. For the technical effects of the tenth aspect and any implementation manner of the tenth aspect, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated herein.
In an eleventh aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program that, when run on a computer or a processor, causes the computer or the processor to perform the scene audio decoding method in the second aspect or any possible implementation manner of the second aspect.
The eleventh aspect and any implementation manner of the eleventh aspect correspond to the second aspect and any implementation manner of the second aspect, respectively. For the technical effects of the eleventh aspect and any implementation manner of the eleventh aspect, refer to the technical effects of the second aspect and the corresponding implementation manner; details are not repeated herein.
In a twelfth aspect, embodiments of the present application provide a computer program product including a software program that, when executed by a computer or a processor, causes the computer or the processor to perform the scene audio encoding method in the first aspect or any possible implementation manner of the first aspect.
The twelfth aspect and any implementation manner of the twelfth aspect correspond to the first aspect and any implementation manner of the first aspect, respectively. For the technical effects of the twelfth aspect and any implementation manner of the twelfth aspect, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated herein.
In a thirteenth aspect, embodiments of the present application provide a computer program product including a software program that, when executed by a computer or a processor, causes the computer or the processor to perform the scene audio decoding method in the second aspect or any possible implementation manner of the second aspect.
The thirteenth aspect and any implementation manner of the thirteenth aspect correspond to the second aspect and any implementation manner of the second aspect, respectively. For the technical effects of the thirteenth aspect and any implementation manner of the thirteenth aspect, refer to the technical effects of the second aspect and the corresponding implementation manner; details are not repeated herein.
In a fourteenth aspect, an embodiment of the present application provides an apparatus for storing a code stream, including a receiver and at least one storage medium; the receiver is configured to receive the code stream; the at least one storage medium is configured to store the code stream; and the code stream is generated according to the first aspect or any implementation manner of the first aspect.
The fourteenth aspect and any implementation manner of the fourteenth aspect correspond to the first aspect and any implementation manner of the first aspect, respectively. For the technical effects of the fourteenth aspect and any implementation manner of the fourteenth aspect, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated herein.
In a fifteenth aspect, an embodiment of the present application provides an apparatus for transmitting a code stream, including a transmitter and at least one storage medium; the at least one storage medium is configured to store a code stream generated according to the first aspect or any implementation manner of the first aspect; the transmitter is configured to obtain the code stream from the storage medium and send the code stream to an end-side device through a transmission medium.
The fifteenth aspect and any implementation manner of the fifteenth aspect correspond to the first aspect and any implementation manner of the first aspect, respectively. For the technical effects of the fifteenth aspect and any implementation manner of the fifteenth aspect, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated herein.
In a sixteenth aspect, an embodiment of the present application provides a system for distributing a code stream, the system including at least one storage medium, configured to store at least one code stream generated according to the first aspect or any implementation manner of the first aspect, and a streaming media device, configured to obtain a target code stream from the at least one storage medium and send the target code stream to an end-side device, where the streaming media device includes a content server or a content distribution server.
The sixteenth aspect and any implementation manner of the sixteenth aspect correspond to the first aspect and any implementation manner of the first aspect, respectively. For the technical effects of the sixteenth aspect and any implementation manner of the sixteenth aspect, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated herein.
Drawings
FIG. 1a is a schematic diagram of an exemplary application scenario;
FIG. 1b is a schematic diagram of an exemplary application scenario;
FIG. 2a is a schematic diagram of an exemplary encoding process;
FIG. 2b is a schematic diagram of an exemplary candidate virtual speaker distribution;
FIG. 3 is a schematic diagram of an exemplary decoding process;
FIG. 4 is a schematic diagram of an exemplary encoding process;
FIG. 5 is a schematic diagram of an exemplary decoding process;
FIG. 6a is a schematic diagram of an exemplary encoding end;
FIG. 6b is a schematic diagram illustrating the structure of a decoding end;
FIG. 7 is a schematic diagram of a structure of an exemplary scene audio encoding apparatus;
FIG. 8 is a schematic diagram of a structure of an exemplary scene audio decoding apparatus;
FIG. 9 is a schematic diagram of a structure of an exemplary apparatus.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. It is evident that the described embodiments are some rather than all of the embodiments of the application. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the application without creative effort shall fall within the protection scope of the application.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may represent the following three cases: only A exists, both A and B exist, and only B exists.
The terms "first", "second", and the like in the description and claims of the embodiments of the application are used to distinguish between different objects, and not necessarily to describe a particular order of the objects. For example, a first target object and a second target object are used to distinguish between different target objects, not to describe a particular order of the target objects.
In the embodiments of the application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the application should not be construed as being preferred or more advantageous than other embodiments or designs. Rather, such words are intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, unless otherwise indicated, "a plurality" means two or more. For example, a plurality of processing units refers to two or more processing units, and a plurality of systems refers to two or more systems.
For clarity and conciseness in the description of the embodiments below, a brief description of the related art will be given first.
Sound (sound) is a continuous wave generated by the vibration of an object. An object that vibrates and emits sound waves is called a sound source. While a sound wave propagates through a medium (e.g., air, a solid, or a liquid), the auditory organs of humans and animals can perceive the sound.
The characteristics of a sound wave include pitch, sound intensity, and timbre. Pitch indicates how high or low a sound is. Sound intensity indicates how loud a sound is; it may also be referred to as loudness or volume, and is measured in decibels (dB). Timbre is also referred to as sound quality.
The frequency of a sound wave determines the pitch: the higher the frequency, the higher the pitch. The number of times an object vibrates in one second is called the frequency, measured in hertz (Hz). The frequencies that the human ear can recognize are between 20 Hz and 20,000 Hz.
The amplitude of a sound wave determines the sound intensity: the larger the amplitude, the greater the intensity; and the closer to the sound source, the greater the intensity.
The waveform of the sound wave determines the tone. The waveform of the acoustic wave includes square wave, sawtooth wave, sine wave, pulse wave and the like.
According to the characteristics of sound waves, sounds can be classified into regular sounds and irregular sounds. An irregular sound is a sound emitted by a sound source that vibrates irregularly, for example, noise that affects people's work, study, or rest. A regular sound is a sound emitted by a sound source that vibrates regularly; regular sounds include speech and musical tones. When represented electrically, a regular sound is an analog signal that varies continuously in the time-frequency domain. Such an analog signal may be referred to as an audio signal. An audio signal is an information carrier that carries speech, music, and sound effects.
Since human hearing has the ability to discern the position distribution of sound sources in space, the listener can perceive the azimuth of the sound in addition to the pitch, intensity and timbre of the sound when hearing the sound in space.
As people pay increasing attention to auditory experience and demand higher quality, three-dimensional audio technology has emerged to enhance the sense of depth, presence, and spatial perception of sound. With this technology, the listener not only perceives sound from sound sources at the front, rear, left, and right, but also perceives being surrounded by the spatial sound field (sound field) generated by those sound sources and the sensation of sound spreading around, thereby creating an "immersive" effect of being in a theatre, a concert hall, or another venue.
The scene audio signal in the embodiments of the application may refer to a signal for describing a sound field. The scene audio signal may include an HOA signal (which may be a three-dimensional HOA signal or a two-dimensional HOA signal, also called a planar HOA signal) and a three-dimensional audio signal, where the three-dimensional audio signal refers to an audio signal other than the HOA signal in the scene audio signal. The HOA signal is used as an example below.
It is known that a sound wave propagates in an ideal medium with wave number k = w/c and angular frequency w = 2πf, where f is the sound wave frequency and c is the speed of sound. The sound pressure p satisfies equation (1):

$$\nabla^{2}p + k^{2}p = 0 \tag{1}$$

where $\nabla^{2}$ is the Laplace operator.
The spatial system outside the human ear is assumed to be a sphere, with the listener at the center of the sphere. Sound transmitted from outside the sphere has a projection on the spherical surface, sound from outside the sphere is filtered out, and the sound sources are assumed to be distributed on the spherical surface; the sound field generated by the sound sources on the spherical surface is then used to fit the sound field generated by the original sound sources. That is, three-dimensional audio technology is a method of fitting the sound field. Specifically, equation (1) is solved in the spherical coordinate system; in the passive spherical region, its solution is the following equation (2):

$$p(r,\theta,\varphi,k)=\sum_{m=0}^{\infty}\sum_{n=0}^{m}\sum_{\sigma=\pm 1}(2m+1)\,j^{m}\,j_{m}(kr)\,Y_{m,n}^{\sigma}(\theta,\varphi)\,s\,Y_{m,n}^{\sigma}(\theta_{s},\varphi_{s}) \tag{2}$$
where r denotes the sphere radius, θ denotes the horizontal angle (or azimuth) information, φ denotes the pitch angle (or elevation) information, k denotes the wave number, s denotes the amplitude of an ideal plane wave, and m denotes the order number of the HOA signal (or called the order of the HOA signal). $j_{m}(kr)$ is the spherical Bessel function, also called the radial basis function, where the first j denotes the imaginary unit; the factor $(2m+1)\,j^{m}\,j_{m}(kr)$ does not change with angle. $Y_{m,n}^{\sigma}(\theta,\varphi)$ is the spherical harmonic in the direction $(\theta,\varphi)$, and $Y_{m,n}^{\sigma}(\theta_{s},\varphi_{s})$ is the spherical harmonic in the sound source direction. The HOA signal satisfies equation (3):

$$B_{m,n}^{\sigma} = s \cdot Y_{m,n}^{\sigma}(\theta_{s},\varphi_{s}) \tag{3}$$
Substituting equation (3) into equation (2), equation (2) may be rewritten as equation (4):

$$p(r,\theta,\varphi,k)=\sum_{m=0}^{N}\sum_{n=0}^{m}\sum_{\sigma=\pm 1}(2m+1)\,j^{m}\,j_{m}(kr)\,Y_{m,n}^{\sigma}(\theta,\varphi)\,B_{m,n}^{\sigma} \tag{4}$$
where m is truncated at the N-th term, i.e., m = N, so that $B_{m,n}^{\sigma}$ serves as an approximate description of the sound field; in this case, $B_{m,n}^{\sigma}$ may be referred to as the HOA coefficients (which may be used to represent the N-order HOA signal). The sound field refers to the region of a medium in which sound waves exist. N is an integer greater than or equal to 1.
A scene audio signal is an information carrier that carries the spatial position information of the sound sources in a sound field and describes the sound field heard by a listener in space. Equation (4) shows that the sound field can be expanded on the spherical surface in terms of spherical harmonics, i.e., the sound field can be decomposed into a superposition of a plurality of plane waves. Thus, the sound field described by the HOA signal can be expressed as a superposition of a plurality of plane waves and reconstructed from the HOA coefficients.
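As a concrete illustration of the expansion above, the plane-wave coefficients of equation (3) can be sketched in code. The sketch below is a hedged assumption: it uses real-valued first-order (FOA) spherical harmonics in the ACN/SN3D ("AmbiX") convention, which is one common Ambisonics convention and not necessarily the exact normalization used in this application.

```python
# Sketch of equation (3) at first order: the HOA coefficients of one ideal
# plane wave are the source amplitude s times the spherical-harmonic values
# in the source direction. ACN/SN3D (AmbiX) real harmonics are an assumption.
import math

def foa_encode(s, azimuth, elevation):
    """Return the 4 FOA channel coefficients (W, Y, Z, X in ACN order)."""
    w = s * 1.0                                       # Y_0^0 (omni)
    y = s * math.sin(azimuth) * math.cos(elevation)   # Y_1^-1
    z = s * math.sin(elevation)                       # Y_1^0
    x = s * math.cos(azimuth) * math.cos(elevation)   # Y_1^1
    return [w, y, z, x]

def num_channels(order):
    """An N-order HOA signal has (N + 1)^2 channels."""
    return (order + 1) ** 2

coeffs = foa_encode(s=1.0, azimuth=0.0, elevation=0.0)  # source straight ahead
print(coeffs)           # [1.0, 0.0, 0.0, 1.0]
print(num_channels(3))  # 16
```

The `num_channels` helper reflects the channel count stated below for N1-order HOA signals: 16 channels at order 3 and 25 at order 4.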
The HOA signal to be encoded in the embodiments of the present application may refer to an N1-order HOA signal, which may be represented by HOA coefficients or Ambisonic (stereo reverberation) coefficients, where N1 is an integer greater than or equal to 1 (when N1 is equal to 1, the 1-order HOA signal may be referred to as an FOA (First Order Ambisonics) signal). The N1-order HOA signal includes audio signals of (N1+1)² channels.
Fig. 1a is a schematic diagram of an exemplary application scenario. Shown in fig. 1a is a codec scene of a scene audio signal.
Referring to fig. 1a, an exemplary first electronic device may include a first audio acquisition module, a first scene audio encoding module, a first channel encoding module, a first channel decoding module, a first scene audio decoding module, and a first audio playback module. It should be understood that the first electronic device may include more or fewer modules than shown in fig. 1a, which is not limited in the present application.
Referring to fig. 1a, the second electronic device may, for example, include a second audio acquisition module, a second scene audio encoding module, a second channel encoding module, a second channel decoding module, a second scene audio decoding module, and a second audio playback module. It should be understood that the second electronic device may include more or fewer modules than shown in fig. 1a, which is not limited in the present application.
Illustratively, the process of the first electronic device encoding and transmitting the scene audio signal to the second electronic device, decoding and audio playback by the second electronic device may be as follows: the first audio acquisition module can acquire audio and output scene audio signals to the first scene audio coding module. Then, the first scene audio coding module can code the scene audio signal and output a code stream to the first channel coding module. The first channel coding module may then perform channel coding on the code stream, and transmit the code stream after channel coding to the second electronic device through the wireless or wired network communication device. Then, the second channel decoding module of the second electronic device may perform channel decoding on the received data to obtain a code stream and output the code stream to the second scene audio decoding module. Then, the second scene audio decoding module can decode the code stream to obtain a reconstructed scene audio signal; the reconstructed scene audio signal is then output to a second audio playback module for audio playback by the second audio playback module.
It should be noted that the second audio playback module may perform post-processing on the reconstructed scene audio signal, such as audio rendering (e.g., converting a reconstructed scene audio signal containing audio signals of (N1+1)² channels into an audio signal whose number of channels equals the number of speakers in the second electronic device), loudness normalization, user interaction, audio format conversion, or denoising, to convert the reconstructed scene audio signal into an audio signal suitable for playback by the speakers of the second electronic device.
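The audio-rendering step mentioned above can be sketched as follows. This is a minimal mode-matching renderer under stated assumptions (a first-order signal, a horizontal square speaker layout, AmbiX-style real harmonics, and a pseudo-inverse decoding matrix); the actual renderer in a device may differ substantially.

```python
# Hedged sketch: converting an HOA signal into one feed per physical speaker
# by mode matching. The matrix Y holds the spherical-harmonic values in each
# speaker direction; its pseudo-inverse is the decoding matrix.
import numpy as np

def render_foa(hoa_frames, speaker_azimuths):
    """hoa_frames: (samples, 4) FOA signal; returns (samples, n_speakers)."""
    # Re-encoding matrix: rows = channels (W, Y, Z, X), cols = speakers;
    # horizontal-only layout (elevation 0), AmbiX-style real harmonics.
    Y = np.array([np.ones(len(speaker_azimuths)),
                  np.sin(speaker_azimuths),
                  np.zeros(len(speaker_azimuths)),
                  np.cos(speaker_azimuths)])
    D = np.linalg.pinv(Y)        # decoding matrix: speakers x 4 channels
    return hoa_frames @ D.T

azimuths = np.radians([45.0, 135.0, 225.0, 315.0])   # square speaker layout
frames = np.zeros((8, 4)); frames[:, 0] = 1.0        # pure omni (W) content
feeds = render_foa(frames, azimuths)
print(feeds.shape)  # (8, 4): one column per speaker
```

For pure omni content on this symmetric layout, the pseudo-inverse spreads the signal equally (gain 0.25) across the four speakers, which is the minimum-norm least-squares behavior of `pinv`.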
It should be understood that the process of encoding and transmitting the scene audio signal to the first electronic device, decoding and playing back the scene audio signal by the first electronic device is similar to the process of transmitting the scene audio signal to the second electronic device by the first electronic device and playing back the scene audio signal by the second electronic device, which is not described herein.
By way of example, the first electronic device and the second electronic device may each include, but are not limited to: personal computers, computer workstations, smart phones, tablet computers, servers, smart cameras, smart cars or other types of cellular phones, media consumption devices, wearable devices, set-top boxes, game consoles, and the like.
The present application is particularly applicable to VR (Virtual Reality)/AR (Augmented Reality) scenes, for example. In one possible approach, the first electronic device is a server and the second electronic device is a VR/AR device. In one possible approach, the second electronic device is a server and the first electronic device is a VR/AR device.
The first scene audio coding module and the second scene audio coding module may be, for example, scene audio encoders. The first and second scene audio decoding modules may be scene audio decoders.
For example, when a scene audio signal is encoded by a first electronic device, the second electronic device reconstructs the scene audio signal, the first electronic device may be referred to as an encoding side, and the second electronic device may be referred to as a decoding side. When the scene audio signal is encoded by the second electronic device, the first electronic device reconstructs the scene audio signal, the second electronic device may be referred to as an encoding side, and the first electronic device may be referred to as a decoding side.
Fig. 1b is a schematic view of an exemplary application scenario. Shown in fig. 1b is a transcoded scene of a scene audio signal.
Referring to fig. 1b (1), an exemplary wireless or core network device may include: channel decoding module, other audio decoding module, scene audio coding module and channel coding module. Wherein a wireless or core network device may be used for audio transcoding.
By way of example, the specific application scenario of fig. 1b (1) may be as follows: the first electronic device is provided with only another audio encoding module and no scene audio encoding module, while the second electronic device is provided with only a scene audio decoding module and no other audio decoding module. In order for the second electronic device to decode and play back the scene audio signal encoded by the first electronic device with the other audio encoding module, the wireless or core network device may be used for transcoding.
Specifically, the first electronic device encodes the scene audio signal using the other audio encoding module to obtain a first code stream, and sends the first code stream to the wireless or core network device after channel encoding. The channel decoding module of the wireless or core network device then performs channel decoding and outputs the channel-decoded first code stream to the other audio decoding module. The other audio decoding module decodes the first code stream to obtain the scene audio signal and outputs it to the scene audio encoding module. The scene audio encoding module then encodes the scene audio signal to obtain a second code stream and outputs it to the channel encoding module, which channel-encodes the second code stream and sends it to the second electronic device. In this way, the second electronic device can perform channel decoding to obtain the second code stream, invoke the scene audio decoding module to obtain a reconstructed scene audio signal, and then play back the reconstructed scene audio signal.
Referring to fig. 1b (2), an exemplary wireless or core network device may include: channel decoding module, scene audio decoding module, other audio coding module and channel coding module. Wherein a wireless or core network device may be used for audio transcoding.
By way of example, the specific application scenario of fig. 1b (2) may be as follows: the first electronic device is provided with only a scene audio encoding module and no other audio encoding module, while the second electronic device is provided with only another audio decoding module and no scene audio decoding module. In order for the second electronic device to decode and play back the scene audio signal encoded by the first electronic device with the scene audio encoding module, the wireless or core network device may be used for transcoding.
Specifically, the first electronic device encodes the scene audio signal using the scene audio encoding module to obtain a first code stream, and sends the first code stream to the wireless or core network device after channel encoding. The channel decoding module of the wireless or core network device then performs channel decoding and outputs the channel-decoded first code stream to the scene audio decoding module. The scene audio decoding module decodes the first code stream to obtain the scene audio signal and outputs it to the other audio encoding module. The other audio encoding module then encodes the scene audio signal to obtain a second code stream and outputs it to the channel encoding module, which channel-encodes the second code stream and sends it to the second electronic device. In this way, the second electronic device can perform channel decoding to obtain the second code stream, invoke the other audio decoding module to obtain a reconstructed scene audio signal, and then play back the reconstructed scene audio signal.
The following describes a codec process of a scene audio signal.
Fig. 2a is a schematic diagram of an exemplary encoding process.
S201, obtaining a scene audio signal to be encoded, wherein the scene audio signal comprises audio signals of C1 channels, and C1 is a positive integer.
Illustratively, when the scene audio signal is an HOA signal, the HOA signal may be an N1-order HOA signal, i.e., m in the above equation (3) is truncated at N1.
Illustratively, the N1-order HOA signal may include audio signals of C1 channels, where C1 = (N1+1)². For example, when N1 = 3, the N1-order HOA signal includes audio signals of 16 channels; when N1 = 4, the N1-order HOA signal includes audio signals of 25 channels.
S202, acquiring attribute information of a target virtual speaker corresponding to a scene audio signal.
Illustratively, a target virtual speaker is selected from a plurality of candidate virtual speakers based on the scene audio signal, and attribute information of the target virtual speaker is acquired.
S203, obtaining the higher-order energy gain of the scene audio signal.
Illustratively, the feature information of the HOA signal is obtained from the HOA signal to be encoded, and the high-order energy gain is obtained through the feature information of the HOA signal, where the high-order energy gain may be used to indicate the energy gain of the high-order channel signal of the scene audio signal.
The scene audio signal includes audio signals of C1 channels, the first audio signal is an audio signal of K channels in the scene audio signal, K is a positive integer less than or equal to C1, and the value of K is not limited.
Illustratively, the scene audio signal is an N1-order HOA signal, and the N1-order HOA signal includes a first audio signal and a second audio signal, where the second audio signal is the audio signal in the N1-order HOA signal other than the first audio signal, and C1 is equal to the square of (N1+1).
In one possible manner, N1 = 3 and K = 10. The N1-order HOA signal includes audio signals of the 1st to 16th channels, the first audio signal is the audio signals of the 1st to 10th channels in the N1-order HOA signal, and the second audio signal is the audio signals of the 11th to 16th channels in the N1-order HOA signal.
Illustratively, N1 = 3 and K = 9. The N1-order HOA signal includes audio signals of the 1st to 16th channels, the first audio signal is the audio signals of the 1st to 9th channels in the N1-order HOA signal, and the second audio signal is the audio signals of the 10th to 16th channels in the N1-order HOA signal.
Illustratively, N1 = 3 and K = 8. The N1-order HOA signal includes audio signals of the 1st to 16th channels, the first audio signal is the audio signals of the 1st to 6th, 8th, and 9th channels in the N1-order HOA signal, and the second audio signal is the audio signals of the 7th and 10th to 16th channels in the N1-order HOA signal.
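The channel partitions in the examples above can be sketched as follows, including the non-contiguous case (channels 1 to 6, 8, and 9). The helper name and the array layout are illustrative only; channel numbers are 1-based as in the text.

```python
# Illustrative sketch: split an N1-order HOA signal into the first audio
# signal (K selected channels) and the second audio signal (the rest).
import numpy as np

def split_channels(hoa, first_channel_numbers):
    """hoa: (C1, samples) array; returns (first_signal, second_signal)."""
    c1 = hoa.shape[0]
    first_idx = [ch - 1 for ch in first_channel_numbers]       # 1-based -> 0-based
    second_idx = [i for i in range(c1) if i not in first_idx]  # remaining channels
    return hoa[first_idx, :], hoa[second_idx, :]

n1 = 3
hoa = np.arange(16 * 4, dtype=float).reshape((n1 + 1) ** 2, 4)  # 16 channels
first, second = split_channels(hoa, [1, 2, 3, 4, 5, 6, 8, 9])   # K = 8
print(first.shape, second.shape)  # (8, 4) (8, 4)
```

Here the second audio signal ends up holding channels 7 and 10 to 16, matching the third example above.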
In one possible implementation, obtaining a higher-order energy gain of a scene audio signal includes:
acquiring the high-order energy gain according to feature information of the second audio signal and feature information of the first audio signal.
Illustratively, the scene audio signal includes the first audio signal and the second audio signal, and feature information of the second audio signal and feature information of the first audio signal are acquired respectively; the feature information corresponding to the scene audio signal includes, but is not limited to, gain information and diffuseness information. The high-order energy gain of the scene audio signal may be acquired from the feature information of the second audio signal and the feature information of the first audio signal.
For example, gain information Gain (i, b) of a second audio signal in the scene audio signal may be calculated with reference to the following formula:
Gain(i,b)=E(i,b)/E(1,b)
where i is the number of the i-th channel included in the second audio signal in the scene audio signal (this number may also be referred to as the channel number), b is the frequency band number of the second audio signal, E(i, b) is the energy of the i-th channel of the second audio signal in the b-th frequency band, and E(1, b) is the energy of the reference channel of the first audio signal in the b-th frequency band; for example, the reference channel of the first audio signal may specifically be the 1st channel of the N1-order HOA signal.
These steps may be performed on a whole frame signal or on a subframe, and may be performed in the full band or in a sub-band.
Illustratively, after Gain(i, b) is calculated, the logarithmic (dB-domain) gain Gain'(i, b) is calculated as follows:
Gain'(i, b) = 10 * log10(Gain(i, b)).
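The gain computation above can be sketched as follows. The band edges, spectrum values, and helper names are hypothetical; the sketch only illustrates per-band channel energies E(i, b), the ratio Gain(i, b) = E(i, b) / E(1, b) against the reference channel of the first audio signal, and the 10·log10 conversion to dB.

```python
# Hedged sketch of the high-order energy gain computation per channel and band.
import numpy as np

def band_energy(channel, band_edges, b):
    """Energy of one channel in band b (sum of squared spectrum bins)."""
    lo, hi = band_edges[b], band_edges[b + 1]
    return float(np.sum(channel[lo:hi] ** 2))

def higher_order_gain_db(spectra, band_edges, i, b):
    """spectra: (channels, bins); channel 0 is the reference (1st) channel."""
    e_i = band_energy(spectra[i], band_edges, b)     # E(i, b)
    e_ref = band_energy(spectra[0], band_edges, b)   # E(1, b)
    gain = e_i / e_ref                               # Gain(i, b)
    return 10.0 * np.log10(gain)                     # Gain'(i, b) in dB

edges = [0, 8, 16, 32]                              # three hypothetical bands
spectra = np.ones((16, 32)); spectra[10] *= 0.5     # channel 11 at half amplitude
print(round(higher_order_gain_db(spectra, edges, i=10, b=0), 2))  # -6.02
```

Half the amplitude means a quarter of the energy, hence roughly -6 dB, which matches the printed value.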
S204, encoding the high-order energy gain to obtain a high-order energy gain encoding result.
After the coding end obtains the high-order energy gain of the scene audio signal, the high-order energy gain can be coded to generate a high-order energy gain coding result. The function of the high-order energy gain is to adjust the high-order channel energy at the decoding end, so that the encoding and decoding quality of the HOA signal is higher.
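One plausible way to encode the dB-domain gain is uniform scalar quantization to an integer index written into the bitstream. The patent does not specify a particular quantizer, so the step size and range below are assumptions made purely for illustration.

```python
# Hedged sketch: uniform scalar quantization of a dB-domain gain value at the
# encoder, and the matching dequantization at the decoder. Step/range assumed.
def quantize_gain_db(gain_db, step=0.5, lo=-30.0, hi=30.0):
    """Clamp to [lo, hi] and return the integer quantization index."""
    clamped = min(max(gain_db, lo), hi)
    return round((clamped - lo) / step)

def dequantize_gain_db(index, step=0.5, lo=-30.0):
    """Recover the approximate dB gain from the transmitted index."""
    return lo + index * step

idx = quantize_gain_db(-6.02)
print(idx, dequantize_gain_db(idx))  # 48 -6.0
```

The round trip reproduces the gain to within half a quantization step, which is the usual trade-off between bit cost and the accuracy of the energy adjustment at the decoder.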
S205, encoding the first audio signal in the scene audio signal, the attribute information of the target virtual speaker, and the high-order energy gain encoding result to obtain a first code stream; the first audio signal is the audio signals of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
It should be noted that a virtual speaker is a virtualized speaker rather than a real physical speaker.
For example, as can be learned from the foregoing description, the scene audio signal may be expressed as a superposition of a plurality of plane waves; therefore, a target virtual speaker for simulating the sound source in the scene audio signal may be determined, so that the virtual speaker signal corresponding to the target virtual speaker can be used to reconstruct the scene audio signal in the subsequent decoding process.
In one possible way, a plurality of candidate virtual speakers with different positions may be provided on the spherical surface; next, a target virtual speaker may be selected from the plurality of candidate virtual speakers that matches a sound source location in the scene audio signal.
Fig. 2b is a schematic diagram of an exemplary candidate virtual speaker distribution. In fig. 2b, a plurality of candidate virtual speakers may be uniformly distributed on a sphere, with a point on the sphere representing a candidate virtual speaker.
It should be noted that the present application does not limit the number and distribution of the candidate virtual speakers, which may be set as required and are described in detail later.
For example, a target virtual speaker whose position corresponds to a sound source position in the scene audio signal may be selected from the plurality of candidate virtual speakers based on the scene audio signal; the number of the target virtual speakers may be one or more, and the present application is not limited thereto.
In one possible approach, the target virtual speaker may be preset.
It should be understood that the application is not limited in the manner in which the target virtual speaker is determined.
Illustratively, in one possible approach, during decoding, the scene audio signal may be reconstructed from the virtual speaker signal; however, directly transmitting the virtual speaker signal of the target virtual speaker increases the code rate; and the virtual speaker signal of the target virtual speaker may be generated based on the attribute information of the target virtual speaker and the scene audio signal of a part or all of the channels; therefore, the attribute information of the target virtual speaker can be acquired, and the audio signals of K channels in the scene audio signals can be acquired as first audio signals; and then encoding the first audio signal, the attribute information of the target virtual speaker and the high-order energy gain encoding result to obtain a first code stream.
For example, operations such as down-mixing, transformation, quantization, entropy coding and the like may be performed on the attribute information of the first audio signal and the target virtual speaker to obtain a first code stream, and in addition, a higher-order energy gain coding result may be written into the first code stream. That is, the first code stream may include encoded data of the first audio signal in the scene audio signal, encoded data of attribute information of the target virtual speaker, and a higher-order energy gain encoding result.
Compared with other methods for reconstructing scene audio signals in the prior art, the audio quality of the scene audio signals reconstructed based on the virtual speaker signals is higher; therefore, when K is equal to C1, the audio quality of the scene audio signal reconstructed by the method is higher under the same code rate.
When K is smaller than C1, in the process of encoding the scene audio signals, the number of channels of the audio signals encoded by the application is smaller than that encoded by the prior art, and the data size of the attribute information of the target virtual speaker is also far smaller than that of the audio signal of one channel; therefore, the application has a lower coding rate on the premise of reaching the same quality.
In addition, in the prior art, the scene audio signal is converted into the virtual speaker signal and the residual signal and then encoded, and the encoding end directly encodes the audio signal of part of channels in the scene audio signal without calculating the virtual speaker signal and the residual signal, so that the encoding complexity of the encoding end is lower.
Fig. 3 is a schematic diagram of an exemplary decoding process. Fig. 3 is a decoding process corresponding to the encoding process of fig. 2.
S301, a first code stream is received.
S302, decoding the first code stream to obtain a first reconstruction signal and attribute information of a target virtual speaker.
Illustratively, the encoded data of the first audio signal in the scene audio signal included in the first code stream may be decoded, and a first reconstructed signal may be obtained; that is, the first reconstructed signal is a reconstructed signal of the first audio signal. And decoding the encoded data of the attribute information of the target virtual speaker contained in the first code stream, thereby obtaining the attribute information of the target virtual speaker.
It should be understood that, when the encoding end performs lossy compression on the first audio signal in the scene audio signal, the first reconstructed signal obtained by decoding by the decoding end and the first audio signal encoded by the encoding end have differences. When the encoding end performs lossless compression on the first audio signal, the first reconstruction signal obtained by decoding by the decoding end is identical to the first audio signal encoded by the encoding end.
It should be understood that, when the encoding end performs lossy compression on the attribute information of the target virtual speaker, the attribute information obtained by decoding by the decoding end and the attribute information encoded by the encoding end have a difference. When the encoding end performs lossless compression on the attribute information of the virtual speaker, the attribute information obtained by decoding by the decoding end is the same as the attribute information encoded by the encoding end. The application does not distinguish between the attribute information encoded by the encoding end and the attribute information decoded by the decoding end from the name.
S303, generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first reconstruction signal.
S304, reconstructing based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal. The first reconstructed scene audio signal comprises audio signals of C2 channels, C2 being a positive integer.
For example, a scene audio signal can be reconstructed from a virtual speaker signal; therefore, the virtual speaker signal corresponding to the target virtual speaker may first be generated based on the attribute information of the target virtual speaker and the first reconstruction signal. One target virtual speaker corresponds to one virtual speaker signal, and the virtual speaker signal is a plane wave. Then, reconstruction is performed based on the attribute information of the target virtual speaker and the virtual speaker signal to generate the first reconstructed scene audio signal.
Illustratively, when the scene audio signal is an HOA signal, the reconstructed first reconstructed scene audio signal may also be an HOA signal, which may be an N2-order HOA signal, where N2 is a positive integer. Illustratively, the N2-order HOA signal may comprise an audio signal of C2 channels, C2 = (N2+1)^2.
Illustratively, the order N2 of the first reconstructed scene audio signal may be greater than or equal to the order N1 of the scene audio signal in the embodiment of fig. 2a; correspondingly, the number of channels C2 of the audio signal comprised by the first reconstructed scene audio signal may be greater than or equal to the number of channels C1 of the audio signal comprised by the scene audio signal in the embodiment of fig. 2a.
Illustratively, the scene audio signal is an N1-order HOA signal, the N1-order HOA signal comprising a second audio signal, the second audio signal being the audio signal of the N1-order HOA signal other than the first audio signal, C1 being equal to the square of (N1+1); and/or,
the first reconstructed scene audio signal is an N2-order HOA signal, the N2-order HOA signal comprising a third audio signal, the third audio signal being a reconstructed signal of the N2-order HOA signal corresponding to each channel of the second audio signal, C2 being equal to the square of (N2+1).
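As a small illustration of the channel-count relation above (a sketch for the reader, not part of the patent's method), the number of channels of an N-order HOA signal is the square of (N+1):

```python
def hoa_channel_count(order: int) -> int:
    """Number of channels of an N-order HOA signal: (N + 1)^2."""
    return (order + 1) ** 2

# C1 for an N1-order input and C2 for an N2-order reconstruction:
print(hoa_channel_count(1))  # 4
print(hoa_channel_count(3))  # 16
```

For example, a 1st-order HOA signal has C = 4 channels (the W channel plus three 1st-order channels), and a 3rd-order HOA signal has 16.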
In one possible way, the first reconstructed scene audio signal may be directly used as the final decoding result.
S305, adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result to obtain a reconstructed scene audio signal.
The decoding end obtains a high-order energy gain coding result from the first code stream, and energy adjustment is carried out on the first reconstruction scene audio signal by using the high-order energy gain coding result. The decoding end adjusts the high-order channel energy of the first reconstructed scene audio signal by using the high-order energy gain coding result, so that the decoding quality of the scene audio signal is higher.
Compared with other methods for reconstructing scene audio signals in the prior art, the audio quality of the scene audio signals reconstructed based on the virtual speaker signals is higher; therefore, when K is equal to C1, the audio quality of the reconstructed scene audio signal is higher under the same code rate.
When K is smaller than C1, in the process of encoding the scene audio signals, the channel number of the audio signals encoded by the method is smaller than that of the audio signals encoded by the prior art, and the data size of the attribute information of the target virtual speaker is far smaller than that of the audio signals of one channel; therefore, on the premise of the same code rate, the audio quality of the reconstructed scene audio signal obtained by decoding of the application is higher.
Secondly, since the virtual speaker signal and residual information transmitted by the prior art encoding are converted from the original audio signal (i.e., the scene audio signal to be encoded), errors are introduced instead of the original audio signal; the application encodes part of original audio signals (namely, the audio signals of K channels in the scene audio signals to be encoded), avoids the introduction of errors, and can further improve the audio quality of the reconstructed scene audio signals obtained by decoding; and the fluctuation of reconstruction quality of the reconstructed scene audio signals obtained by decoding can be avoided, and the stability is high.
Furthermore, since the prior art encodes and transmits virtual speaker signals, and the amount of data of the virtual speaker signals is large, the number of target virtual speakers selected by the prior art is limited by the bandwidth. The application encodes and transmits attribute information of the virtual speaker, and the data volume of the attribute information is far smaller than that of the virtual speaker signal; the number of target virtual speakers selected by the present application is therefore less bandwidth limited. The more the number of the target virtual speakers is selected, the higher the quality of the reconstructed scene audio signal is based on the virtual speaker signals of the target virtual speakers. Therefore, compared with the prior art, under the condition of the same code rate, the method can select more target virtual speakers, so that the quality of the reconstructed scene audio signals obtained by decoding is higher.
In addition, compared with the encoding end and the decoding end in the prior art, the encoding end and the decoding end do not need residual error and superposition operation, so that the comprehensive complexity of the encoding end and the decoding end is lower than that of the encoding end and the decoding end in the prior art. Because the first code stream sent by the encoding end comprises the encoding result of the high-order energy gain, the high-order energy gain can be used for adjusting the energy of the high-order channel at the decoding end, so that the encoding and decoding quality of the scene audio signal is higher.
The following describes the encoding process of the higher-order energy gain in the encoding process and the adjusting process of the audio signal by the higher-order energy gain in the decoding process.
Fig. 4 is a schematic diagram of an exemplary encoding process.
S401, acquiring a scene audio signal to be encoded, wherein the scene audio signal comprises audio signals of C1 channels, and C1 is a positive integer.
For example, S401 may refer to the description of S201 above, and will not be described herein.
S402, acquiring attribute information of a target virtual speaker corresponding to the scene audio signal.
In one possible approach, the attribute information of the target virtual speaker is generated based on the position information of the target virtual speaker. In one possible manner, the position information of the target virtual speaker (including pitch angle information and horizontal angle information) may be used as the attribute information of the target virtual speaker. In another possible manner, a position index corresponding to the position information of the target virtual speaker is used as the attribute information of the target virtual speaker, the position index including a pitch angle index (which may be used to uniquely identify the pitch angle information) and a horizontal angle index (which may be used to uniquely identify the horizontal angle information).
In one possible approach, a virtual speaker index (e.g., virtual speaker identification) of the target virtual speaker may be used as the attribute information of the target virtual speaker. Wherein the virtual speaker indexes are in one-to-one correspondence with the position information.
In one possible manner, the virtual speaker coefficient of the target virtual speaker may be set as the attribute information of the target virtual speaker. For example, C2 virtual speaker coefficients of the target virtual speaker may be determined, and the C2 virtual speaker coefficients of the target virtual speaker may be used as attribute information of the target virtual speaker; wherein the C2 virtual speaker coefficients of the target virtual speaker are in one-to-one correspondence with the audio signals of the C2 channel number included in the first reconstructed scene audio signal.
The data volume of the virtual speaker coefficient is far greater than that of the position information, the index of the position information, or the virtual speaker index; which of these four kinds of information is used as the attribute information of the target virtual speaker may therefore be decided according to the bandwidth. For example, when the bandwidth is large, the virtual speaker coefficient may be used as the attribute information of the target virtual speaker; the decoding end then does not need to calculate the virtual speaker coefficient of the target virtual speaker, which saves computation at the decoding end. When the bandwidth is small, any one of the position information, the index of the position information, and the virtual speaker index may be used as the attribute information of the target virtual speaker; in this way, the code rate can be saved. It should be understood that which of the position information, the index of the position information, the virtual speaker index, and the virtual speaker coefficient is adopted as the attribute information of the target virtual speaker may also be set in advance; the application is not limited in this regard.
S403, acquiring the energy gain of the first audio signal and the energy gain of the second audio signal.
The feature information corresponding to the scene audio signal includes gain information, the scene audio signal includes a first audio signal and a second audio signal, and an energy gain E (1, b) of the first audio signal and an energy gain E (i, b) of the second audio signal are calculated, respectively.
S404, obtaining a higher-order energy gain according to the energy gain of the first audio signal and the energy gain of the second audio signal.
After the coding end obtains the high-order energy gain of the scene audio signal, the high-order energy gain can be coded to generate a high-order energy gain coding result. The function of the high-order energy gain is to adjust the high-order channel energy at the decoding end, so that the encoding and decoding quality of the HOA signal is higher.
Illustratively, obtaining the higher order energy gain from the energy gain of the first audio signal and the energy gain of the second audio signal comprises:
the higher-order energy Gain' (i, b) is obtained by:
Gain’(i,b)=10*log10(E(i,b)/E(1,b));
Wherein log10 represents the base-10 logarithm, * represents multiplication, E(1, b) is the channel energy of the first audio signal, E(i, b) is the channel energy of the second audio signal, i is the number of the i-th channel of the second audio signal, and b is the frequency band number of the second audio signal.
The characteristic information of the second audio signal may be, for example, the higher-order energy gain of the N1-order HOA signal, specifically the energy ratio of each channel of the second audio signal to the W channel (the 1st channel of the N1-order HOA signal), which may specifically be the channel of the first audio signal.
For example, the feature information of the second audio signal may be acquired with reference to the following steps:
Time-frequency transformation is performed on the N1-order HOA signal: the time-domain N1-order HOA signal is transformed to obtain a frequency-domain N1-order HOA signal.
The W channel energy E (1, b) and the channel energy E (i, b) of the second audio signal are calculated, wherein i is the channel number of the second audio signal.
The calculation of the higher order energy Gain' (i, b) may use the following formula:
Gain(i,b)=E(i,b)/E(1,b);
Gain’(i,b)=10*log10(Gain(i,b))。
S405, quantizing the high-order energy gain to obtain the quantized high-order energy gain.
S406, entropy coding is carried out on the quantized high-order energy gain so as to obtain a coding result of the high-order energy gain.
The method comprises the steps of obtaining the characteristic information of the second audio signal in the scene audio signal, obtaining the high-order energy gain of the scene audio signal from that characteristic information, and then sequentially quantizing and entropy coding the high-order energy gain.
Illustratively, scalar quantization may be employed to quantize the higher order energy gain.
Entropy encoding is performed on the quantized higher-order energy gain. The entropy encoding method is not limited.
Illustratively, the higher-order energy gain is differentially encoded, then the number of entropy encoded bits is estimated, and if the estimated number of bits is less than the fixed-length encoding, the higher-order energy gain is variable-length encoded, such as Huffman encoding; otherwise, the high-order energy gain is subjected to fixed-length coding.
After the high-order energy gain coding result is obtained, the coding result is written into the code stream.
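The quantize-then-choose-coding decision of S405-S406 might be sketched as below. The quantization step size, the fixed-length bit width, and the bit-cost model for the variable-length code are all illustrative assumptions standing in for a real Huffman table; they are not values from the patent:

```python
def quantize_gains(gains, step=0.5):
    """Scalar quantization of higher-order energy gains (illustrative step size)."""
    return [round(g / step) for g in gains]

def choose_entropy_coding(q, fixed_bits_per_value=6):
    """Differentially code the quantized gains, estimate the variable-length
    bit count, and fall back to fixed-length coding when that is cheaper."""
    diffs = [q[0]] + [q[k] - q[k - 1] for k in range(1, len(q))]
    # First value fixed-length; each difference costs 1 + 2*|d| bits
    # (a unary-like stand-in for a Huffman table).
    var_bits = fixed_bits_per_value + sum(1 + 2 * abs(d) for d in diffs[1:])
    fixed_bits = fixed_bits_per_value * len(q)
    return ("variable" if var_bits < fixed_bits else "fixed"), diffs

mode, diffs = choose_entropy_coding(quantize_gains([-6.0, -6.5, -6.5, -7.0]))
print(mode, diffs)  # variable [-12, -1, 0, -1]
```

Because neighboring bands have similar gains, the differences are small and variable-length coding usually wins; a frame with erratic gains would fall back to fixed-length coding.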
S407, encoding the attribute information of the first audio signal and the target virtual speaker in the scene audio signal and the high-order energy gain encoding result to obtain a first code stream.
It should be appreciated that the number of channels of the audio signal included in the first audio signal may be determined according to requirements and bandwidth, and the present application is not limited in this respect.
For example, S407 may refer to the description of S205 above, and will not be described herein.
In the embodiment of the application, the coding end can calculate the energy ratio of the higher-order channel and the W channel so as to obtain a higher-order energy gain coding result, and then selects Huffman coding or direct coding according to the bit number estimation of the difference result between subframes. Therefore, the first code stream sent by the encoding end comprises a high-order energy gain encoding result, so that the high-order energy gain can be used for adjusting the energy of the high-order channel at the decoding end, and the encoding and decoding quality of the scene audio signal is higher.
Fig. 5 is a schematic diagram of an exemplary decoding process. Fig. 5 is a decoding process corresponding to the encoding process of fig. 4.
S501, a first code stream is received.
S502, decoding the first code stream to obtain attribute information of the first reconstruction signal and the target virtual speaker and a high-order energy gain coding result.
S503, generating a virtual speaker signal corresponding to a target virtual speaker based on attribute information of the target virtual speaker and a first audio signal;
S504, reconstructing based on attribute information of a target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal; the first reconstructed scene audio signal comprises audio signals of C2 channels, and C2 is a positive integer.
Illustratively, S501 to S504 may refer to the descriptions of S301 to S304, and are not described herein.
For example, S305 described above may refer to the descriptions of S505 to S508.
S505, performing entropy decoding on the high-order energy gain coding result to obtain the entropy decoded high-order energy gain.
S506, performing inverse quantization on the entropy decoded high-order energy gain to obtain the high-order energy gain.
Illustratively, the high-order energy gain encoding result is read from the first code stream. Entropy decoding is performed on the higher-order energy gain encoding result. The entropy decoding method is the inverse process of the entropy coding at the coding end.
Illustratively, if the encoding end employs fixed-length encoding, the decoding end uses the corresponding fixed-length decoding; if the encoding end employs variable-length encoding, the decoding end uses the corresponding variable-length decoding, such as Huffman decoding.
Inverse quantization is then performed on the entropy decoding result; the inverse quantization method is the inverse process of the quantization method at the encoding end.
S507, adjusting the high-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal to obtain the adjusted decoded high-order energy gain.
After the decoding end performs signal reconstruction to obtain the first reconstructed scene audio signal, it determines the first audio signal and a third audio signal from the first reconstructed scene audio signal, where the third audio signal is the reconstructed signal in the N2-order HOA signal corresponding to each channel of the second audio signal. The characteristic information of the second audio signal is then determined according to the characteristic information of the third audio signal. Finally, the higher-order energy gain is adjusted according to the characteristic information of the second audio signal and the characteristic information of the first audio signal to obtain the adjusted decoded higher-order energy gain. Adjusting the higher-order energy gain makes the higher-order channel energy more uniform and smooth, so the quality of the reconstructed audio signal is better.
Illustratively, S507 adjusts the higher-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal, including:
S5071, obtaining the higher-order energy of the second audio signal according to the channel energy of the first audio signal and the higher-order energy gain;
The decoding end obtains a high-order energy gain coding result from the first code stream, entropy decodes the high-order energy gain coding result, and dequantizes the high-order energy gain coding result to obtain the high-order energy gain. And estimating the energy of the second audio signal according to the channel energy and the higher-order energy gain of the first audio signal so as to determine the higher-order energy of the second audio signal.
Illustratively, the first reconstructed scene audio signal is an N2-order HOA signal, the N2-order HOA signal is subjected to time-frequency transformation, and the time-domain N2-order HOA signal is transformed to obtain a frequency-domain N2-order HOA signal.
The higher-order energy E_Ref(i, b) of the second audio signal is calculated; the following formula may be used:
E_Ref(i,b)=E_dec(1,b)*10^(Gain’(i,b)/10)
Wherein e_dec (1, b) is the channel energy of the b-th frequency band of the first audio signal in the N2-order HOA signal, i is the channel number corresponding to the second audio signal, gain' (i, b) is the higher-order energy Gain, and b is the frequency band number of the first audio signal.
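A minimal sketch of this estimate (the numeric values are illustrative):

```python
def estimate_higher_order_energy(e_dec_w, gain_db):
    """E_Ref(i, b) = E_dec(1, b) * 10^(Gain'(i, b) / 10): estimate the
    higher-order energy of the second audio signal from the decoded
    W-channel (channel 1) energy and the decoded higher-order energy gain."""
    return e_dec_w * 10.0 ** (gain_db / 10.0)

# With W-channel energy 4.0 and a gain of about -6.02 dB (an energy ratio of 1/4):
print(estimate_higher_order_energy(4.0, -6.0206))  # ≈ 1.0
```

Note this is exactly the inverse of the encoder-side gain formula Gain'(i, b) = 10 * log10(E(i, b) / E(1, b)), so a lossless gain would reproduce the original channel energy.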
S5072, obtaining a decoding energy scale factor according to the channel energy of the third audio signal and the higher-order energy of the second audio signal.
Specifically, the third audio signal is a reconstructed signal corresponding to each channel of the second audio signal in the N2-order HOA signal, and the decoded energy scale factor is obtained by performing energy scale calculation on the third audio signal and the second audio signal.
Illustratively, the decoding energy scaling factor g (i, b) is calculated, using the following formula:
g(i,b)=sqrt(E_Ref(i,b))/sqrt(E_dec(i,b))
Where sqrt() denotes the square-root operation, E_dec(i, b) is the channel energy of the b-th frequency band of the third audio signal, i is the channel number corresponding to the third audio signal, and E_Ref(i, b) is the higher-order energy of the b-th frequency band of the second audio signal.
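The scale-factor step can be sketched directly from the formula (example energies are illustrative):

```python
import math

def decoding_energy_scale_factor(e_ref, e_dec):
    """g(i, b) = sqrt(E_Ref(i, b)) / sqrt(E_dec(i, b)): amplitude-domain ratio
    between the estimated target energy and the reconstructed channel energy."""
    return math.sqrt(e_ref) / math.sqrt(e_dec)

# Reconstructed energy 4.0 but estimated target energy 1.0 -> scale amplitude by 0.5.
print(decoding_energy_scale_factor(1.0, 4.0))  # 0.5
```

The square roots convert the energy ratio into an amplitude gain, which is why g(i, b) can later be applied multiplicatively to the signal samples.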
S5073, obtaining a decoded higher order energy gain of the third audio signal according to the channel energy of the third audio signal and the channel energy of the first audio signal.
And taking the channel energy of the first audio signal as a reference, and performing gain calculation on the channel energy of the third audio signal to obtain a decoded higher-order energy gain of the third audio signal.
Illustratively, the decoded higher-order energy Gain gain_dec (i, b) is calculated, and the following formula can be employed:
Gain_dec(i,b)=E_dec(i,b)/E_dec(1,b)
where E_dec(1, b) is the channel energy of the b-th frequency band of the first audio signal in the N2-order HOA signal, and E_dec(i, b) is the channel energy of the b-th frequency band of the i-th channel of the third audio signal.
S5074, adjusting the decoded higher-order energy gain of the third audio signal according to the decoded energy scale factor to obtain an adjusted decoded higher-order energy gain.
Specifically, in order to make the energy of the higher-order channel more uniform and smooth, the decoded higher-order energy gain of the third audio signal is adjusted by using the decoded energy scaling factor, and the adjusted decoded higher-order energy gain is determined. After the decoding energy scale factor is used for adjustment, the energy of the higher-order channel is more uniform and smooth, and the quality of the reconstructed audio signal is better.
Illustratively, adjusting the decoded higher order energy gain of the third audio signal according to the decoded energy scale factor to obtain an adjusted decoded higher order energy gain comprises:
The adjusted decoded higher order energy Gain gain_dec' (i, b) is obtained by:
Gain_dec’(i,b)=w*min(g(i,b),Gain_dec(i,b))+(1-w)*g(i,b);
Wherein g(i, b) represents the decoding energy scale factor, Gain_dec(i, b) represents the decoded higher-order energy gain of the b-th frequency band of the third audio signal, i is the number of the i-th channel of the third audio signal, b is the frequency band sequence number of the third audio signal, w is a preset adjustment scale threshold, min represents a minimum-value operation, and * represents a multiplication operation.
For example, min(a, b) is the minimum value of a and b; w is the adjustment scale threshold and may take various values, for example, w is 0.25.
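The weighted-minimum adjustment of S5074 can be sketched as follows, using the example value w = 0.25 from the text (the gain values themselves are illustrative):

```python
def adjusted_decoded_gain(g, gain_dec, w=0.25):
    """Gain_dec'(i, b) = w * min(g(i, b), Gain_dec(i, b)) + (1 - w) * g(i, b),
    with w = 0.25 as the adjustment scale threshold given in the text."""
    return w * min(g, gain_dec) + (1.0 - w) * g

# When the decoded gain undershoots the scale factor, the minimum pulls the
# result slightly below g; otherwise the result stays at g.
print(adjusted_decoded_gain(0.9, 0.6))  # 0.25*0.6 + 0.75*0.9 = 0.825
print(adjusted_decoded_gain(0.5, 0.8))  # 0.25*0.5 + 0.75*0.5 = 0.5
```

The min() term keeps the adjustment conservative: the result is dominated by the scale factor g(i, b) but is nudged downward when the decoded gain is smaller, which smooths out energy overshoots.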
S508, adjusting the third audio signal in the N2-order HOA signal according to the adjusted decoded high-order energy gain to obtain an adjusted third audio signal.
The decoding end obtains a high-order energy gain coding result from the first code stream, and energy adjustment is carried out on a third audio signal in the N2-order HOA signal by using the high-order energy gain coding result. The decoding end adjusts the high-order channel energy of the third audio signal by using the high-order energy gain coding result, so that the decoding quality of the third audio signal is higher.
The third audio signal is a channel audio signal corresponding to each channel of the second audio signal in the N2-order HOA signal.
For example, the third audio signal may be adjusted based on the feature information corresponding to the second audio signal in the N1-order HOA signal, so as to improve the quality of the N2-order HOA signal.
Illustratively, S508 adjusts a third audio signal in the N2-order HOA signal according to the adjusted decoded higher order energy gain, comprising:
S5081, obtaining an attenuation factor according to the frequency band sequence number of the third audio signal.
For example, the decoding end may obtain an attenuation factor according to the frequency band sequence number of the third audio signal; the attenuation factor attenuates with the frequency band sequence number of the reconstructed signal and may be used to adjust the first reconstructed scene audio signal, so that the quality of the reconstructed scene audio signal is higher.
S5082, adjusting the third audio signal according to the adjusted decoded higher-order energy gain to obtain an adjusted third audio signal, wherein the adjusted third audio signal belongs to the reconstructed scene audio signal.
After the adjusted decoded high-order energy gain is obtained, the gain of the third audio signal of the current frame can be weighted; the gain attenuates with the frequency band sequence number of the third audio signal, so an attenuation factor can be obtained according to the frequency band sequence number of the third audio signal. The adjusted decoded high-order energy gain is then applied to the high-order channels of the third audio signal reconstructed for the current frame, so that the energy of the high-order channels is more uniform and smooth, and the quality of the reconstructed audio signal is improved.
Illustratively, the third audio signal is adjusted using the adjusted decoded higher-order energy gain Gain_dec'(i, b).
For example, adjustments may be made with reference to the following formulas:
X’(i,b)=X(i,b)*Gain_dec’(i,b)*g’(i,b);
wherein X(i, b) is the third audio signal before adjustment, X'(i, b) is the third audio signal after adjustment, and g'(i, b) is the attenuation factor obtained in S5081.
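The per-band sample adjustment can be sketched as below. Note that the source text does not reproduce the attenuation-factor formula; the decaying factor g'(i, b) = P ** b used here is a hypothetical stand-in, built only from the facts that g' attenuates with the subband number b and that P = 0.99 is the attenuation threshold mentioned in the text:

```python
def adjust_channel_band(x, gain_dec_adj, b, P=0.99):
    """X'(i, b) = X(i, b) * Gain_dec'(i, b) * g'(i, b).
    g'(i, b) = P ** b is a HYPOTHETICAL attenuation factor (the patent's
    actual formula is not reproduced in the source text)."""
    g_att = P ** b  # hypothetical attenuation factor g'(i, b)
    return x * gain_dec_adj * g_att

print(adjust_channel_band(2.0, 0.5, b=0))  # 1.0 (no attenuation in subband 0)
```

Whatever its exact form, the attenuation factor damps the gain correction in higher subbands, so the energy adjustment is applied most strongly where the gain estimate is most reliable.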
In a possible manner, S5081 obtains an attenuation factor according to a frequency band sequence number where the third audio signal is located, including:
the attenuation factor g'(i, b) is obtained as follows:

[formula not reproduced in the source text]

or alternatively:

[formula not reproduced in the source text]

where i is the number of the i-th channel of the third audio signal, b is the frequency band sequence number of the third audio signal, P is a preset attenuation threshold, and * represents a multiplication operation.
Illustratively, b is the frequency band sequence number of the third audio signal, which may also be referred to as the subband number, b = 0, 1, 2, …, 11. P is an attenuation threshold; for example, P is set to 0.99.
With the above method of calculating the attenuation factor g'(i, b), where b is the frequency band sequence number of the third audio signal, the attenuation factor can be calculated accurately from the above parameters, so that the quality of the reconstructed audio signal is improved when the attenuation factor and the adjusted decoded higher-order energy gain are used to adjust the third audio signal.
Illustratively, after S5082 (adjusting the third audio signal in the N2-order HOA signal according to the adjusted decoded higher-order energy gain), the method further comprises:
S5083, obtaining channel energy of a fourth audio signal corresponding to the adjusted third audio signal, wherein the third audio signal comprises an audio signal of a current frame, and the fourth audio signal comprises an audio signal of a frame before the current frame;
S5084, adjusting the adjusted third audio signal again according to the channel energy of the fourth audio signal.
The decoding end may further adjust the adjusted third audio signal of the current frame again by using a previous frame, so that the quality of the reconstructed audio signal is improved. The third audio signal comprises an audio signal of the current frame, and the fourth audio signal comprises an audio signal of a frame preceding the current frame; the preceding frame may be adjacent to the current frame, or it may be a non-adjacent earlier frame. The channel energy of the fourth audio signal can be used for adjusting the third audio signal. For example, the decoding end performs linear weighting on the sub-bands of the higher-order channels of the current frame and of the previous 2 frames of the third audio signal, so as to obtain energy-smoothed higher-order channels for the current frame.
Illustratively, S5084, adjusting the adjusted third audio signal again according to the channel energy of the fourth audio signal, includes:
S50841, obtaining a channel energy average value of the fourth audio signal and the adjusted channel energy of the third audio signal;
wherein the channel energy average of the fourth audio signal may be an average of all channel energies of the fourth audio signal.
S50842, acquiring an energy average threshold according to the channel energy average value of the fourth audio signal and the adjusted channel energy of the third audio signal.
The energy average threshold is a threshold calculated from the channel energies of the third audio signal and the fourth audio signal, respectively.
Illustratively, obtaining the energy average threshold from the channel energy average value of the fourth audio signal and the channel energy of the third audio signal comprises:
the energy average threshold k is obtained by:
[formula omitted in the source]
wherein E_mean(i, b) represents the channel energy average value of the fourth audio signal and E'_dec(i, b) represents the energy of the adjusted third audio signal.
S50843, carrying out weighted average calculation on the channel energy average value of the fourth audio signal and the adjusted channel energy of the third audio signal according to the energy average threshold value so as to obtain target energy;
the target energy E_target (i, b) is calculated, and the following formula can be used:
E_target(i,b)=k*E_mean(i,b)+(1-k)*E’_dec(i,b);
where E_mean(i, b) is the channel energy average value of the previous frame and E'_dec(i, b) is the energy of the adjusted third audio signal.
S50844, obtaining an energy smoothing factor according to the target energy and the adjusted channel energy of the third audio signal;
the energy smoothing factor may be used for adjustment of the third audio signal such that the decoding quality of the third audio signal is higher.
Illustratively, the obtaining an energy smoothing factor according to the target energy and the channel energy of the adjusted third audio signal comprises:
The energy smoothing factor q (i, b) is obtained as follows:
q(i,b)=sqrt(E_target(i,b))/sqrt(E’_dec(i,b));
where E_target(i, b) represents the target energy and E'_dec(i, b) represents the energy of the adjusted third audio signal.
And S50845, adjusting the adjusted third audio signal again according to the energy smoothing factor.
The adjusted third audio signal is re-adjusted by using the energy smoothing factor q(i, b), thereby further improving the decoding quality of the third audio signal.
By way of example, the third audio signal may be adjusted with reference to the following formula:
X”(i,b)=X’(i,b)*q(i,b);
For example, after the re-adjusted third audio signal is obtained, the average value of the energy of the previous frame may also be updated with the energy of the adjusted third audio signal.
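Steps S50841 to S50845 can be sketched as follows. Two assumptions are made because the text does not fix them: the per-sub-band channel energy E'_dec(i, b) is taken as the squared magnitude of the adjusted signal, and the energy average threshold k is passed in as an input, since its exact formula is given only as an image in the publication:

```python
import numpy as np

def smooth_energy(X_adj, E_mean, k):
    # Steps S50841-S50845, per channel i and sub-band b, over
    # (channels, bands)-shaped arrays.
    E_dec = X_adj ** 2                       # assumed energy: squared magnitude
    E_target = k * E_mean + (1 - k) * E_dec  # weighted average -> target energy
    q = np.sqrt(E_target) / np.sqrt(E_dec)   # energy smoothing factor q(i, b)
    X_smooth = X_adj * q                     # X''(i, b) = X'(i, b) * q(i, b)
    return X_smooth, E_dec                   # E_dec may update the running mean

X_adj = np.full((2, 12), 2.0)     # adjusted third audio signal (energy 4)
E_mean = np.full((2, 12), 16.0)   # previous-frame channel energy average
X_s, new_mean = smooth_energy(X_adj, E_mean, k=0.5)
```

In this sketch the smoothed signal's energy lands between the previous-frame average and the current adjusted energy, which is the "energy smoothing" behavior the text describes.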
For example, the input signal at the encoding end is a 3rd-order HOA signal that includes audio signals of 16 channels; the first audio signal is the audio signals of the 1st to 5th channels, the 7th channel and the 9th to 10th channels, and the second audio signal is the audio signals of the 6th channel, the 8th channel and the 11th to 16th channels. The code stream produced by the encoding end has two implementation modes: 1. the code stream does not include the high-order energy gain coding result; 2. the code stream includes the high-order energy gain coding result. For the scene audio decoding method executed by the decoding end, there are three implementation modes: 1. when the received code stream does not include the high-order energy gain coding result, the decoding end reconstructs the scene audio signal in the code stream; 2. when the received code stream includes the high-order energy gain coding result, the decoding end reconstructs the scene audio signal in the code stream and adjusts the reconstructed scene audio signal according to the high-order energy gain coding result, without applying the attenuation factor; 3. in the embodiment of the application, when the received code stream includes the high-order energy gain coding result, the decoding end reconstructs the scene audio signal in the code stream and adjusts the reconstructed scene audio signal according to both the high-order energy gain coding result and the attenuation factor, so as to obtain the reconstructed scene audio signal.
By analyzing the signal quality of the reconstructed scene audio signals, it can be seen that the decoded HOA signal that does not carry the higher-order energy gain coding result has the poorest quality; the decoded HOA signal that carries the higher-order energy gain coding result but does not adjust the reconstructed scene audio signal by the attenuation factor has moderate quality; and the decoded HOA signal that carries the higher-order energy gain coding result and adjusts the reconstructed scene audio signal by the attenuation factor has the best quality.
According to the above analysis, the decoding end in the embodiment of the application can adjust the reconstructed scene audio signal according to the high-order energy gain coding result, so that the high-order channel energy of the reconstructed scene audio signal is more uniform and smooth, and the quality of the reconstructed scene audio signal is better. For example, the attenuation factor attenuates with both the frequency band and the Ambisonic order of the reconstructed scene audio signal, so that the encoding and decoding quality of the HOA signal is effectively improved.
Fig. 6a is a schematic diagram of an exemplary encoding end.
Referring to fig. 6a, an exemplary encoding side may include a configuration unit, a virtual speaker generation unit, a target speaker generation unit, and a core encoder. It should be understood that fig. 6a is only an example of the present application; the encoding end of the present application may include more or fewer modules than those shown in fig. 6a, which are not described herein.
Illustratively, the configuration unit may be configured to determine configuration information of the candidate virtual speaker.
The virtual speaker generating unit may be configured to generate a plurality of candidate virtual speakers according to configuration information of the candidate virtual speakers and determine virtual speaker coefficients corresponding to the candidate virtual speakers.
The target speaker generating unit may be configured to select a target virtual speaker from a plurality of candidate virtual speakers according to the scene audio signal and the plurality of sets of virtual speaker coefficients, and determine attribute information of the target virtual speaker.
Illustratively, the core encoder may be configured to obtain a higher-order energy gain of the scene audio signal, and obtain a higher-order energy gain encoding result; and encoding the first audio signal, the attribute information of the target virtual speaker and the high-order energy gain encoding result in the scene audio signal.
For example, the scene audio coding module in fig. 1a and 1b may include the configuration unit, the virtual speaker generation unit, the target speaker generation unit, and the core encoder of fig. 6a; or only a core encoder.
Fig. 6b is a schematic diagram illustrating the structure of a decoding end.
Referring to fig. 6b, an exemplary decoding side may include a core decoder, a virtual speaker coefficient generation unit, a virtual speaker signal generation unit, a reconstruction unit, and a signal adjustment unit. It should be understood that fig. 6b is only an example of the present application; the decoding end of the present application may include more or fewer modules than those shown in fig. 6b, which are not described herein.
The core decoder may be used to decode the first code stream to obtain the first reconstructed signal, the attribute information of the target virtual speaker, and the higher-order energy gain encoding result.
The virtual speaker coefficient generation unit may be configured to determine the virtual speaker coefficient based on attribute information of the target virtual speaker.
The virtual speaker signal generation unit may be configured to generate the virtual speaker signal based on the first reconstructed signal and the virtual speaker coefficient.
Illustratively, the reconstruction unit may be configured to reconstruct based on the virtual speaker signal and the attribute information to obtain a first reconstructed scene audio signal.
The signal adjustment unit may be configured to determine the attenuation factor according to a frequency band sequence number of a reconstructed signal in the first reconstructed scene audio signal and/or an order of the first reconstructed scene audio signal; and adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result to obtain a reconstructed scene audio signal.
For example, the scene audio decoding module in fig. 1a and 1b described above may include the core decoder, the virtual speaker coefficient generation unit, the virtual speaker signal generation unit, the reconstruction unit, and the signal adjustment unit of fig. 6b; or only a core decoder.
Fig. 7 is a schematic diagram of a structure of an exemplary scene audio encoding apparatus. The scene audio coding device in fig. 7 can be used to perform the coding method of the foregoing embodiment, so the advantages achieved by the device can be referred to the advantages of the corresponding method provided above, and will not be described herein. Wherein, the scene audio coding device may include:
An obtaining module 701, configured to obtain a scene audio signal to be encoded, where the scene audio signal includes audio signals of C1 channels, and C1 is a positive integer;
the acquisition module is further used for acquiring attribute information of a target virtual speaker corresponding to the scene audio signal;
the acquisition module is further used for acquiring the higher-order energy gain of the scene audio signal;
The encoding module 702 is configured to encode the high-order energy gain to obtain a high-order energy gain encoding result;
The encoding module is further configured to encode a first audio signal in the scene audio signal, attribute information of the target virtual speaker, and the high-order energy gain encoding result, so as to obtain a first code stream; the first audio signal is an audio signal of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
Fig. 8 is a schematic diagram of a structure of an exemplary scene audio decoding apparatus. The scene audio decoding apparatus in fig. 8 may be used to perform the decoding method of the foregoing embodiment, so the advantages achieved by the apparatus may refer to the advantages of the corresponding method provided above, and will not be described herein. Wherein, the scene audio decoding apparatus may include:
A code stream receiving module 801, configured to receive a first code stream;
the decoding module 802 is configured to decode the first code stream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a higher-order energy gain encoding result, where the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal includes audio signals of C1 channels, the first audio signal is an audio signal of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1;
A virtual speaker signal generating module 803, configured to generate a virtual speaker signal corresponding to the target virtual speaker based on the attribute information and the first audio signal;
A scene audio signal reconstruction module 804, configured to reconstruct based on attribute information of the target virtual speaker and the virtual speaker signal, so as to obtain a first reconstructed scene audio signal; the first reconstructed scene audio signal comprises audio signals of C2 channels, and C2 is a positive integer;
The scene audio signal adjustment module 805 is configured to adjust the first reconstructed scene audio signal according to the higher-order energy gain encoding result, so as to obtain a reconstructed scene audio signal.
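The module flow of the decoding apparatus above can be wired together as in the following sketch. The class name and the stub callables are hypothetical illustrations of the data flow between modules 802 to 805, not an API from the application:

```python
# Hypothetical wiring of the decoding modules of fig. 8.
class SceneAudioDecoder:
    def __init__(self, core_decoder, speaker_signal_gen, reconstructor, adjuster):
        self.core_decoder = core_decoder              # decoding module 802
        self.speaker_signal_gen = speaker_signal_gen  # virtual speaker signal generating module 803
        self.reconstructor = reconstructor            # scene audio signal reconstruction module 804
        self.adjuster = adjuster                      # scene audio signal adjustment module 805

    def decode(self, first_code_stream):
        # Decode the code stream into the first reconstructed signal, the
        # target virtual speaker attribute information, and the
        # higher-order energy gain encoding result.
        first_rec, attrs, gain_result = self.core_decoder(first_code_stream)
        # Generate the virtual speaker signal from the attributes and signal.
        speaker_sig = self.speaker_signal_gen(attrs, first_rec)
        # Reconstruct the first reconstructed scene audio signal.
        first_scene = self.reconstructor(attrs, speaker_sig)
        # Adjust it with the higher-order energy gain encoding result.
        return self.adjuster(first_scene, gain_result)

# Stub modules standing in for real DSP components.
decoder = SceneAudioDecoder(
    core_decoder=lambda cs: ("first_rec", "attrs", "gain"),
    speaker_signal_gen=lambda attrs, rec: ("speaker_sig", attrs, rec),
    reconstructor=lambda attrs, sig: ("first_scene", sig),
    adjuster=lambda scene, gain: ("reconstructed_scene", scene, gain),
)
result = decoder.decode("first_code_stream")
```

The stubs only trace the ordering of the four steps; in a real decoder each callable would operate on audio buffers rather than strings.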
In one example, fig. 9 shows a schematic block diagram of an apparatus 900 of an embodiment of the application. The apparatus 900 may comprise a processor 901 and transceiver/transceiver pins 902, and optionally a memory 903.
The various components of apparatus 900 are coupled together by a bus 904, wherein bus 904 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are referred to in the figures as bus 904.
Optionally, the memory 903 may be used to store instructions in the foregoing method embodiments. The processor 901 is operable to execute instructions in the memory 903 and control the receive pin to receive signals and the transmit pin to transmit signals.
The apparatus 900 may be an electronic device or a chip of an electronic device in the above-described method embodiments.
For all relevant content of each step in the above method embodiments, reference may be made to the functional description of the corresponding functional module, which is not repeated here.
The present embodiment also provides a chip comprising one or more interface circuits and one or more processors; the interface circuit is used for receiving signals from a memory of an electronic device and sending the signals to the processor, wherein the signals comprise computer instructions stored in the memory; when the computer instructions are executed by the processor, the electronic device performs the methods of the embodiments described above. The interface circuit may be, for example, the transceiver 902 in fig. 9.
The present embodiment also provides a computer-readable storage medium having stored therein computer instructions that, when executed on an electronic device, cause the electronic device to perform the above-described related method steps to implement the scene audio codec method in the above-described embodiments.
The present embodiment also provides a computer program product which, when run on a computer, causes the computer to perform the above-described related steps to implement the scene audio codec method in the above-described embodiments.
The embodiment also provides a device for storing a code stream, comprising: a receiver for receiving a code stream; and at least one storage medium for storing the code stream, wherein the code stream is generated according to the scene audio coding method in the above embodiment.
The embodiment of the application provides a device for transmitting a code stream, comprising: a transmitter and at least one storage medium, the storage medium being used for storing a code stream generated according to the scene audio coding method in the above embodiment; the transmitter is used for acquiring the code stream from the storage medium and transmitting the code stream to an end-side device through a transmission medium.
The embodiment of the application provides a system for distributing a code stream, comprising: at least one storage medium for storing at least one code stream; and a streaming media device for acquiring a target code stream from the at least one storage medium and sending the target code stream to an end-side device, wherein the streaming media device includes a content server or a content distribution server.
In addition, embodiments of the present application also provide an apparatus, which may be embodied as a chip, component or module, which may include a processor and a memory coupled to each other; the memory is used for storing computer-executable instructions, and when the device is operated, the processor can execute the computer-executable instructions stored in the memory, so that the chip executes the scene audio coding and decoding method in the method embodiments.
The electronic device, the computer readable storage medium, the computer program product or the chip provided in this embodiment are used to execute the corresponding method provided above, so that the beneficial effects thereof can be referred to the beneficial effects in the corresponding method provided above, and will not be described herein.
It will be appreciated by those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
Any of the various embodiments of the application, as well as any features within the same embodiment, may be freely combined. Any combination of the above is within the scope of the application.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.
The steps of a method or algorithm described in connection with the present disclosure may be embodied in hardware, or may be embodied in software instructions executed by a processor. The software instructions may be comprised of corresponding software modules that may be stored in random access memory (Random Access Memory, RAM), flash memory, read-only memory (Read Only Memory, ROM), erasable programmable read-only memory (Erasable Programmable ROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer-readable storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.

Claims (14)

1. A method of decoding scene audio, the method comprising:
Receiving a first code stream;
Decoding the first code stream to obtain a first reconstruction signal, attribute information of a target virtual speaker and a high-order energy gain coding result, wherein the first reconstruction signal is a reconstruction signal of a first audio signal in a scene audio signal, the scene audio signal comprises audio signals of C1 channels, the first audio signal is an audio signal of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer smaller than or equal to C1;
Generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first audio signal;
Reconstructing based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal; the first reconstructed scene audio signal comprises audio signals of C2 channels, and C2 is a positive integer;
And adjusting the first reconstructed scene audio signal according to the high-order energy gain coding result to obtain a reconstructed scene audio signal.
2. The method of claim 1, wherein the step of determining the position of the substrate comprises,
The scene audio signal is an N1-order higher-order ambisonic (HOA) signal, the N1-order HOA signal comprising a second audio signal, the second audio signal being an audio signal of the N1-order HOA signal other than the first audio signal, C1 being equal to the square of (N1+1); and/or,
The first reconstructed scene audio signal is an N2-order HOA signal, the N2-order HOA signal includes a third audio signal, the third audio signal is a reconstructed signal corresponding to each channel of the second audio signal in the N2-order HOA signal, and C2 is equal to the square of (N2+1).
3. The method of claim 2, wherein adjusting the first reconstructed scene audio signal based on the higher order energy gain encoding result to obtain a reconstructed scene audio signal comprises:
entropy decoding is carried out on the high-order energy gain coding result so as to obtain high-order energy gain after entropy decoding;
performing inverse quantization on the entropy decoded high-order energy gain to obtain the high-order energy gain;
adjusting the higher-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal to obtain an adjusted decoded higher-order energy gain;
And adjusting a third audio signal in the N2-order HOA signal according to the adjusted decoded high-order energy gain to obtain an adjusted third audio signal, wherein the adjusted third audio signal belongs to the reconstructed scene audio signal.
4. A method according to claim 3, wherein said adjusting the higher order energy gain based on the characteristic information of the second audio signal and the characteristic information of the first audio signal comprises:
acquiring higher-order energy of the second audio signal according to the channel energy of the first audio signal and the higher-order energy gain;
obtaining a decoding energy scale factor according to the channel energy of the third audio signal and the higher-order energy of the second audio signal;
Obtaining a decoded higher order energy gain of the third audio signal according to the channel energy of the third audio signal and the channel energy of the first audio signal;
And adjusting the decoded high-order energy gain of the third audio signal according to the decoded energy scale factor to obtain the adjusted decoded high-order energy gain.
5. The method of claim 4, wherein adjusting the decoded higher order energy gain of the third audio signal according to the decoded energy scale factor to obtain the adjusted decoded higher order energy gain comprises:
The adjusted decoded higher-order energy gain Gain_dec'(i, b) is obtained by:
Gain_dec’(i,b)=w*min(g(i,b),Gain_dec(i,b))+(1-w)*g(i,b);
wherein g(i, b) represents the decoded energy scale factor, Gain_dec(i, b) represents the decoded higher-order energy gain of the third audio signal, i is the number of the ith channel of the third audio signal, b is the frequency band sequence number of the third audio signal, w is a preset adjustment scale threshold, min represents a minimum value operation, and * represents a multiplication operation.
6. The method according to any of claims 3 to 5, wherein said adjusting a third audio signal of the N2-order HOA signal according to the adjusted decoded higher order energy gain comprises:
obtaining an attenuation factor according to the frequency band sequence number of the third audio signal;
And adjusting the third audio signal according to the adjusted decoded high-order energy gain and the attenuation factor.
7. The method according to any one of claims 3 to 6, wherein after said adjusting the first reconstructed scene audio signal according to the higher order energy gain coding result, the method further comprises:
acquiring channel energy of a fourth audio signal corresponding to the adjusted third audio signal, wherein the third audio signal comprises an audio signal of a current frame, and the fourth audio signal comprises an audio signal of a frame before the current frame;
And adjusting the adjusted third audio signal again according to the channel energy of the fourth audio signal.
8. The method of claim 7, wherein the readjusting the adjusted third audio signal according to the channel energy of the fourth audio signal comprises:
Acquiring a channel energy average value of the fourth audio signal and the channel energy of the adjusted third audio signal;
Acquiring an energy average threshold according to the channel energy average value of the fourth audio signal and the adjusted channel energy of the third audio signal;
performing weighted average calculation on the channel energy average value of the fourth audio signal and the adjusted channel energy of the third audio signal according to the energy average threshold value to obtain target energy;
Acquiring an energy smoothing factor according to the target energy and the channel energy of the adjusted third audio signal;
And adjusting the adjusted third audio signal again according to the energy smoothing factor.
9. The method of claim 6, wherein the obtaining the attenuation factor according to the frequency band sequence number of the third audio signal comprises:
the attenuation factor g'(i, b) is obtained as follows:
[formula omitted in the source]
or alternatively:
[formula omitted in the source]
wherein i is the number of the ith channel of the third audio signal, b is the frequency band sequence number of the third audio signal, and p is a preset attenuation threshold.
10. A scene audio decoding device, the device comprising:
a code stream receiving module, configured to receive a first code stream;
a decoding module, configured to decode the first code stream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a high-order energy gain coding result, wherein the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal comprises audio signals of C1 channels, the first audio signal comprises audio signals of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1;
a virtual speaker signal generation module, configured to generate a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first audio signal;
a scene audio signal reconstruction module, configured to perform reconstruction based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal, wherein the first reconstructed scene audio signal comprises audio signals of C2 channels, and C2 is a positive integer; and
a scene audio signal adjusting module, configured to adjust the first reconstructed scene audio signal according to the high-order energy gain coding result to obtain a reconstructed scene audio signal.
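The five modules of claim 10 form a linear pipeline. The class below is a structural sketch of that pipeline only; every method body is a dummy placeholder (the patent's actual decoding, speaker-signal generation, and gain-adjustment algorithms are not disclosed in this text), and all names are illustrative.

```python
class SceneAudioDecoder:
    """Structural sketch of the device of claim 10. All internals are
    placeholders, not the patent's actual algorithms."""

    def receive_bitstream(self, bitstream):
        # code stream receiving module
        return bitstream

    def decode(self, bitstream):
        # decoding module: yields the first reconstructed signal, the
        # target virtual speaker's attribute info, and the high-order
        # energy gain coding result (dummy values here)
        return {"first_recon": [0.0], "spk_attr": {"azimuth": 0.0}, "gain": 1.0}

    def generate_speaker_signal(self, spk_attr, first_recon):
        # virtual speaker signal generation module
        return list(first_recon)

    def reconstruct_scene(self, spk_attr, spk_signal):
        # scene audio signal reconstruction module
        return list(spk_signal)

    def adjust(self, scene, gain):
        # scene audio signal adjusting module
        return [gain * s for s in scene]

    def run(self, bitstream):
        bs = self.receive_bitstream(bitstream)
        d = self.decode(bs)
        spk = self.generate_speaker_signal(d["spk_attr"], d["first_recon"])
        scene = self.reconstruct_scene(d["spk_attr"], spk)
        return self.adjust(scene, d["gain"])
```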
11. An electronic device, comprising:
a memory and a processor, the memory being coupled to the processor; wherein
the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the scene audio decoding method of any one of claims 1 to 9.
12. A chip comprising one or more interface circuits and one or more processors; the interface circuit is configured to receive a signal from a memory of an electronic device and to send the signal to the processor, the signal including computer instructions stored in the memory; the computer instructions, when executed by the processor, cause the electronic device to perform the scene audio decoding method of any one of claims 1 to 9.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when run on a computer or a processor, causes the computer or the processor to perform the scene audio decoding method according to any one of claims 1 to 9.
14. A computer program product, characterized in that it contains a software program which, when executed by a computer or processor, causes the steps of the method according to any one of claims 1 to 9 to be performed.
CN202310614158.7A 2023-01-06 2023-05-27 Scene audio decoding method and electronic equipment Pending CN118314908A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310030731X 2023-01-06

Publications (1)

Publication Number Publication Date
CN118314908A true CN118314908A (en) 2024-07-09


Legal Events

Date Code Title Description
PB01 Publication