WO2022253187A1 - Method and apparatus for processing a three-dimensional audio signal - Google Patents
Method and apparatus for processing a three-dimensional audio signal
- Publication number
- WO2022253187A1 (PCT/CN2022/096025)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sound field
- signal
- current frame
- sound
- channels
- Prior art date
Classifications
- G10L19/00 — Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008 — Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
- G10L19/02 — Analysis-synthesis coding using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/04 — Analysis-synthesis coding using predictive techniques
- G10L19/20 — Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding
- G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
- H04S3/008 — Systems employing more than two channels in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
- H04S7/30 — Control circuits for electronic adaptation of the sound field
- H04S2400/15 — Aspects of sound capture and related signal processing for recording or reproduction
- H04S2420/03 — Application of parametric coding in stereophonic audio systems
- H04S2420/07 — Synergistic effects of band splitting and sub-band processing
- H04S2420/11 — Application of ambisonics in stereophonic audio systems
Definitions
- The present application relates to the technical field of audio processing, and in particular to a method and apparatus for processing three-dimensional audio signals.
- Three-dimensional audio technology has been widely used in wireless communication voice, virtual reality/augmented reality and media audio.
- Three-dimensional audio technology acquires, processes, transmits, and renders sound events and three-dimensional sound field information from the real world.
- Three-dimensional audio gives sound a strong sense of space, envelopment, and immersion, offering listeners an "immersive" auditory experience.
- Higher-order ambisonics (HOA) is independent of the speaker layout in the recording, encoding, and playback stages, and HOA-format data can be rotated during playback, which gives it high flexibility in three-dimensional audio playback; it has therefore received extensive attention and research.
- An acquisition device (such as a microphone) collects a large amount of data to record the three-dimensional sound field information and transmits the three-dimensional audio signal to a playback device (such as a speaker or earphone) so that the playback device can play it. Because the three-dimensional sound field information involves a large amount of data, storing it requires considerable storage space and transmitting the three-dimensional audio signal requires high bandwidth. To address this, the three-dimensional audio signal can be compressed, and the compressed data stored or transmitted.
- An encoder can use multiple pre-configured virtual speakers to encode the 3D audio signal, but it cannot classify the 3D audio signal before encoding it, so the 3D audio signal cannot be effectively identified.
- Embodiments of the present application provide a method and apparatus for processing a 3D audio signal, which classify the sound field of the 3D audio signal so that the 3D audio signal can be accurately identified.
- An embodiment of the present application provides a method for processing a three-dimensional audio signal, including: linearly decomposing the current frame of the three-dimensional audio signal to obtain a linear decomposition result; obtaining, according to the linear decomposition result, a sound field classification parameter corresponding to the current frame; and determining a sound field classification result of the current frame according to the sound field classification parameter.
- In this solution, the current frame of the three-dimensional audio signal is first linearly decomposed to obtain the linear decomposition result; the sound field classification parameter corresponding to the current frame is then obtained from the linear decomposition result; finally, the sound field classification result of the current frame is determined from the sound field classification parameter.
- The sound field classification result thus realizes sound field classification of the current frame: the embodiment classifies the sound field of the 3D audio signal so that the 3D audio signal can be accurately identified.
- The three-dimensional audio signal includes a higher-order ambisonics (HOA) signal or a first-order ambisonics (FOA) signal.
- Linearly decomposing the current frame of the 3D audio signal to obtain a linear decomposition result includes: performing singular value decomposition on the current frame to obtain singular values corresponding to the current frame, where the linear decomposition result includes the singular values; or performing principal component analysis on the current frame to obtain first eigenvalues corresponding to the current frame, where the linear decomposition result includes the first eigenvalues; or performing independent component analysis on the current frame to obtain second eigenvalues corresponding to the current frame, where the linear decomposition result includes the second eigenvalues.
- The linear decomposition may thus be singular value decomposition, principal component analysis (yielding the first eigenvalues), or independent component analysis (yielding the second eigenvalues).
- Any of these realizes the linear decomposition of the current frame and provides a linear analysis result for the subsequent channel judgment.
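As an illustrative sketch (not the patent's implementation), the singular-value variant of this step can be written with NumPy; the frame layout (4 FOA channels × 960 samples) and all variable names are assumptions:

```python
import numpy as np

# Hypothetical current frame: 4 FOA channels x 960 time-domain samples.
rng = np.random.default_rng(0)
current_frame = rng.standard_normal((4, 960))

# Linear decomposition via SVD; the singular values serve as the
# linear decomposition result used for sound field classification.
singular_values = np.linalg.svd(current_frame, compute_uv=False)

print(singular_values.shape)  # one singular value per channel: (4,)
```

The singular values are returned in descending order, which is what makes the consecutive-ratio parameters of the next step well defined.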
- Obtaining the sound field classification parameter corresponding to the current frame according to the linear decomposition result includes: obtaining the ratio of the i-th linear analysis result of the current frame to the (i+1)-th linear analysis result of the current frame, where i is a positive integer; and obtaining the i-th sound field classification parameter corresponding to the current frame according to the ratio.
- The i-th and (i+1)-th linear analysis results are two consecutive linear analysis results of the current frame.
- The encoding end can calculate the sound field classification parameter corresponding to the current frame from the linear decomposition result. For example, when the current frame has multiple linear analysis results, two consecutive results among them are denoted the i-th and (i+1)-th linear analysis results, and the ratio of the i-th result to the (i+1)-th result can be calculated; the specific value of i is not limited. The i-th sound field classification parameter corresponding to the current frame is then obtained from this ratio.
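A minimal sketch of this step, assuming the linear analysis results are singular values sorted in descending order (the example values are illustrative):

```python
import numpy as np

# Singular values of the current frame, in descending order (example data).
linear_results = np.array([9.0, 3.0, 1.5, 1.0])

# i-th sound field classification parameter: ratio of the i-th linear
# analysis result to the (i+1)-th linear analysis result.
classification_params = linear_results[:-1] / linear_results[1:]

print(classification_params)  # ratios 9/3, 3/1.5, 1.5/1
```

A large ratio means one component dominates the next, hinting at a distinct point source; ratios near 1 suggest energy spread evenly across components.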
- The sound field classification result includes a sound field type, and determining the sound field classification result of the current frame according to the sound field classification parameters includes: when the values of the plurality of sound field classification parameters all satisfy a preset diffuse sound source judgment condition, determining that the sound field type is a diffuse sound field; or, when at least one of the values of the plurality of sound field classification parameters satisfies a preset dissimilar sound source judgment condition, determining that the sound field type is a dissimilar sound field.
- The sound field type may thus be either a dissimilar sound field or a diffuse sound field.
- The diffuse sound source judgment condition and the dissimilar sound source judgment condition are preset: the former is used to judge whether the sound field type is a diffuse sound field, and the latter whether it is a dissimilar sound field.
- The diffuse sound source judgment condition includes: the value of the sound field classification parameter is less than a preset dissimilar sound source judgment threshold. The dissimilar sound source judgment condition includes: the value of the sound field classification parameter is greater than or equal to the preset dissimilar sound source judgment threshold.
- The dissimilar sound source judgment threshold may be preset; its specific value is not limited.
- Since the diffuse condition requires the parameter value to be below the threshold, the sound field type is determined to be a diffuse sound field when the values of all the sound field classification parameters are less than the preset threshold.
- Since the dissimilar condition requires the parameter value to be greater than or equal to the threshold, the sound field type is determined to be a dissimilar sound field when at least one of the values of the plurality of sound field classification parameters is greater than or equal to the preset threshold.
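The two judgment conditions above can be sketched as follows; the threshold value is a placeholder, not a value taken from the patent:

```python
def classify_sound_field(params, threshold=2.0):
    """Return the sound field type for one frame.

    Diffuse: every classification parameter is below the threshold.
    Dissimilar: at least one parameter is >= the threshold.
    The default threshold is an illustrative placeholder.
    """
    if all(p < threshold for p in params):
        return "diffuse"
    return "dissimilar"

print(classify_sound_field([1.1, 1.3, 1.0]))  # diffuse
print(classify_sound_field([3.0, 1.2, 1.0]))  # dissimilar
```

Note the boundary: a parameter exactly equal to the threshold already satisfies the dissimilar condition, matching the "greater than or equal to" wording.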
- The sound field classification result includes the sound field type, or the number of dissimilar sound sources together with the sound field type; determining the sound field classification result of the current frame according to the sound field classification parameters then includes: obtaining the number of dissimilar sound sources corresponding to the current frame according to the values of the plurality of sound field classification parameters, and determining the sound field type according to that number.
- The encoding end can obtain the number of dissimilar sound sources corresponding to the current frame from the values of the multiple sound field classification parameters.
- Dissimilar sound sources are point sound sources with different positions and/or directions; the number of such sources contained in the current frame is called the number of dissimilar sound sources.
- The sound field of the current frame can then be classified according to this number: once the number of dissimilar sound sources corresponding to the current frame is obtained, the sound field type corresponding to the current frame can be determined by analyzing it.
- The sound field classification result may instead include only the number of dissimilar sound sources; determining the sound field classification result of the current frame according to the sound field classification parameters then includes: obtaining the number of dissimilar sound sources corresponding to the current frame according to the values of the plurality of sound field classification parameters.
- Determining the sound field type according to the number of dissimilar sound sources corresponding to the current frame includes: when the number of dissimilar sound sources satisfies a first preset condition, determining that the sound field type is a first sound field type; when it does not, determining that the sound field type is a second sound field type; the number of dissimilar sound sources corresponding to the first sound field type differs from that corresponding to the second sound field type.
- Sound field types can thus be divided into two kinds according to the number of dissimilar sound sources: the first sound field type and the second sound field type.
- The encoding end obtains the first preset condition and judges whether the number of dissimilar sound sources satisfies it: if so, the sound field type is determined to be the first sound field type; if not, the second sound field type.
- Judging whether the number of dissimilar sound sources satisfies the first preset condition thus divides the sound field type of the current frame, so that it can be accurately identified as belonging to the first or the second sound field type.
- The first preset condition includes that the number of dissimilar sound sources is greater than a first threshold and less than a second threshold, where the second threshold is greater than the first threshold; or, the first preset condition includes that the number of dissimilar sound sources is not greater than the first threshold or not less than the second threshold.
- The specific values of the first and second thresholds are not limited and may be set according to the application scenario.
- Because the second threshold is greater than the first, the two thresholds form a preset range; the first preset condition may be that the number of dissimilar sound sources lies within that range, or that it lies outside it.
- By comparing the number of dissimilar sound sources against the first and second thresholds, it can be judged whether the first preset condition is satisfied, so that the sound field type of the current frame can be accurately identified as the first or the second sound field type.
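Under the assumption that the first preset condition is "the count lies strictly between the two thresholds", the check can be sketched as follows; the threshold values are illustrative placeholders, since the patent leaves them open:

```python
def sound_field_type(num_sources, first_threshold=1, second_threshold=5):
    """Classify a frame by its number of dissimilar sound sources.

    Returns 1 (first sound field type) when the first preset condition
    holds, i.e. first_threshold < num_sources < second_threshold,
    otherwise 2 (second sound field type). Threshold defaults are
    placeholders, not values from the patent.
    """
    if first_threshold < num_sources < second_threshold:
        return 1
    return 2

print(sound_field_type(3))  # within the preset range -> first type
print(sound_field_type(7))  # outside the preset range -> second type
```

The alternative condition in the text (count not greater than the first threshold or not less than the second) is simply the negation, so it swaps which branch returns which type.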
- the method further includes: determining a coding mode corresponding to the current frame according to the sound field classification result.
- the encoding end may determine the encoding mode corresponding to the current frame according to the sound field classification result.
- the encoding mode refers to the mode adopted when encoding the current frame of the 3D audio signal.
- An appropriate encoding mode is selected for each sound field classification result of the current frame, and the current frame is encoded in that mode, improving the compression efficiency and auditory quality of the audio signal.
- Determining the encoding mode corresponding to the current frame according to the sound field classification result includes: when the sound field classification result includes the number of dissimilar sound sources (alone or together with the sound field type), determining the encoding mode corresponding to the current frame according to the number of dissimilar sound sources; or, when the sound field classification result includes the sound field type (alone or together with the number of dissimilar sound sources), determining the encoding mode according to the sound field type; or, when the sound field classification result includes both the number of dissimilar sound sources and the sound field type, determining the encoding mode according to the number of dissimilar sound sources and the sound field type.
- The encoding end can thus determine the encoding mode corresponding to the current frame from the number of dissimilar sound sources and/or the sound field type; a mode determined from the sound field classification result is adapted to the current frame of the three-dimensional audio signal, which improves coding efficiency.
- Determining the encoding mode corresponding to the current frame according to the number of dissimilar sound sources includes: when the number of dissimilar sound sources satisfies a second preset condition, determining that the encoding mode is a first encoding mode; when it does not, determining that the encoding mode is a second encoding mode; the first encoding mode is an HOA encoding mode based on virtual speaker selection, the second encoding mode is an HOA encoding mode based on virtual speaker selection or an HOA encoding mode based on directional audio coding, and the two encoding modes are different.
- Encoding modes can thus be divided into two kinds according to the number of dissimilar sound sources: the first encoding mode and the second encoding mode.
- The encoding end obtains the second preset condition and judges whether the number of dissimilar sound sources satisfies it: if so, the encoding mode is determined to be the first encoding mode; if not, the second encoding mode.
- Judging whether the number of dissimilar sound sources satisfies the second preset condition thus determines the encoding mode of the current frame, so that it can be accurately identified as the first or the second encoding mode.
- The second preset condition includes that the number of dissimilar sound sources is greater than a first threshold and less than a second threshold, where the second threshold is greater than the first threshold; or, the second preset condition includes that the number of dissimilar sound sources is not greater than the first threshold or not less than the second threshold.
- Determining the encoding mode corresponding to the current frame according to the sound field type includes: when the sound field type is a dissimilar sound field, determining that the encoding mode is the HOA encoding mode based on virtual speaker selection; when the sound field type is a diffuse sound field, determining that the encoding mode is the HOA encoding mode based on directional audio coding.
- Determining the encoding mode corresponding to the current frame according to the sound field classification result includes: determining an initial encoding mode corresponding to the current frame according to the sound field classification result of the current frame; acquiring the sliding window where the current frame is located, the sliding window including the initial encoding mode of the current frame and the encoding modes of the N-1 frames before the current frame, where N is the length of the sliding window; and determining the encoding mode of the current frame according to the initial encoding mode of the current frame and the encoding modes of the N-1 frames.
- The initial encoding mode of the current frame is corrected through the sliding window to obtain the encoding mode of the current frame, which ensures that the encoding mode does not switch frequently between consecutive frames and improves encoding efficiency.
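One plausible realization of this correction is a majority vote over the window; the patent does not fix the voting rule, so the rule, the window length, and the mode labels below are all assumptions:

```python
from collections import Counter

def smooth_coding_mode(initial_mode, previous_modes, window_len=4):
    """Correct the current frame's initial coding mode with a sliding
    window holding the modes of the previous N-1 frames plus the
    current frame's initial mode (majority vote is an assumed rule)."""
    window = list(previous_modes)[-(window_len - 1):] + [initial_mode]
    mode, _count = Counter(window).most_common(1)[0]
    return mode

# A single deviating frame is pulled back to the dominant recent mode,
# which is exactly what prevents frequent mode switching.
print(smooth_coding_mode("dirac", ["speaker", "speaker", "speaker"]))
```

Because the current frame's own vote is included, a genuine sustained change of sound field still wins the vote after a couple of frames.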
- the method further includes: determining a coding parameter corresponding to the current frame according to the sound field classification result.
- the encoding end may determine the encoding parameters corresponding to the current frame according to the sound field classification result.
- the encoding parameters refer to the parameters used when encoding the current frame of the three-dimensional audio signal.
- appropriate encoding parameters are selected for different sound field classification results of the current frame, so as to use the encoding parameters to encode the current frame, thereby improving the compression efficiency and auditory quality of the audio signal.
- The encoding parameters include at least one of the following: the number of channels of the virtual speaker signal, the number of channels of the residual signal, the number of coding bits of the virtual speaker signal, the number of coding bits of the residual signal, or the number of voting rounds for the best-matching-speaker search; the virtual speaker signal and the residual signal are signals generated from the three-dimensional audio signal.
- The number of voting rounds satisfies 1 ≤ I ≤ d, where I is the number of voting rounds and d is the number of dissimilar sound sources included in the sound field classification result.
- The encoder determines the number of voting rounds for the best-matching-speaker search according to the number of dissimilar sound sources in the current frame, keeping it less than or equal to that number, so that the number of voting rounds matches the actual sound field classification of the current frame; this solves the problem of determining the number of voting rounds for the best-matching-speaker search when encoding the current frame.
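A sketch of choosing the number of voting rounds I under the constraint 1 ≤ I ≤ d; the cap parameter is illustrative, not from the patent:

```python
def voting_rounds(num_dissimilar_sources, cap=4):
    """Number of voting rounds I for the best-matching-speaker search,
    kept within 1 <= I <= d, where d is the dissimilar source count
    from the sound field classification result. The upper cap is an
    assumed implementation limit, not part of the patent text."""
    return max(1, min(num_dissimilar_sources, cap))

print(voting_rounds(2))   # as many rounds as sources
print(voting_rounds(10))  # clamped to the cap
```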
- The number of channels of the virtual speaker signal refers to the number of channels used to transmit the virtual speaker signal.
- It can be determined from the number of dissimilar sound sources and the sound field type; in some cases it is determined to be 1, which improves the coding efficiency of the current frame.
- Here min denotes the minimum-value operation: the minimum of S and PF is taken as the number of channels of the virtual speaker signal, so that this number matches the actual sound field classification of the current frame; this solves the problem of determining the number of channels of the virtual speaker signal when encoding the current frame.
- the channel number of the virtual speaker signal after obtaining the channel number of the virtual speaker signal, it can be calculated according to the sum of the preset channel number of the residual signal and the preset channel number of the virtual speaker signal, and the preset channel number of the residual signal
- the number of channels of the residual signal, the value of the PR can be preset by the encoder, the value of R can be obtained through the above max(C-1, PR) calculation formula, the preset number of channels of the residual signal
- the sum of the number of channels of the preset virtual speaker signal is preset at the encoding end.
- the above C may also be simply referred to as the total number of transmission channels.
- similarly, after the number of channels of the virtual speaker signal is obtained, the number of channels of the residual signal can be calculated according to the number of channels of the virtual speaker signal and the sum of the preset number of channels of the residual signal and the preset number of channels of the virtual speaker signal; this sum is preset by the encoding end.
- the above C may also be simply referred to as the total number of transmission channels.
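The channel-count rules above can be sketched as follows. This is only a hedged sketch: the names `S` (number of dissimilar sound sources), `PF` (preset virtual speaker channels), `PR` (preset residual channels), and `C = PF + PR` follow the text, while the rule `max(C - F, PR)` is an assumed generalization of the `max(C-1, PR)` formula that the text states for the case where the virtual speaker signal has one channel.

```python
def transmission_channel_counts(S, PF, PR):
    # C: total number of transmission channels, preset at the encoding end as
    # the sum of the preset virtual-speaker and residual channel counts.
    C = PF + PR
    # F: number of virtual speaker signal channels = min(S, PF).
    F = min(S, PF)
    # R: number of residual signal channels; assumed generalization of the
    # max(C - 1, PR) rule the text states for F = 1.
    R = max(C - F, PR)
    return F, R, C
```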
- the sound field classification result includes the number of different sound sources, or the sound field classification result includes the number of different sound sources and the type of sound field;
- the ratio of the number of encoded bits of the virtual speaker signal to the number of encoded bits of the transmission channel is obtained;
- the number of encoded bits of the residual signal is obtained from the ratio of the number of encoded bits of the virtual speaker signal to the number of encoded bits of the transmission channel; wherein the number of encoded bits of the transmission channel includes the number of encoded bits of the virtual speaker signal and the number of encoded bits of the residual signal.
- when the number of dissimilar sound sources is less than or equal to the number of channels of the virtual speaker signal, the ratio of the number of encoded bits of the virtual speaker signal to the number of encoded bits of the transmission channel is obtained by increasing the initial ratio of the number of encoded bits of the virtual speaker signal to the number of encoded bits of the transmission channel.
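A minimal sketch of this ratio adjustment; the boost amount is a hypothetical choice, since the text says only that the initial ratio is increased, not by how much.

```python
def speaker_bit_ratio(num_sources, speaker_channels, initial_ratio, boost=0.1):
    # When the dissimilar sound sources fit within the virtual speaker
    # channels, the speaker signal represents the sound field well, so its
    # share of the transmission-channel bits is raised above the initial ratio.
    if num_sources <= speaker_channels:
        return min(1.0, initial_ratio + boost)
    return initial_ratio
```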
- the method further includes: encoding the current frame and the sound field classification result, and writing them into a code stream.
- the embodiment of the present application also provides a method for processing a three-dimensional audio signal, including: receiving a code stream; decoding the code stream to obtain the sound field classification result of the current frame; and obtaining the decoded 3D audio signal of the current frame according to the sound field classification result.
- the sound field classification result can be used to decode the current frame in the code stream, so the decoding end decodes with a decoding method that matches the sound field of the current frame, thereby obtaining the three-dimensional audio signal sent by the encoding end and realizing transmission of the audio signal from the encoding end to the decoding end.
- the obtaining of the decoded 3D audio signal of the current frame according to the sound field classification result includes: determining the decoding mode of the current frame according to the sound field classification result; and decoding the current frame according to the decoding mode to obtain the decoded 3D audio signal of the current frame.
- the determining of the decoding mode of the current frame according to the sound field classification result includes: when the sound field classification result includes the number of different sound sources, or includes the number of different sound sources and the sound field type, determining the decoding mode of the current frame according to the number of different sound sources; or, when the sound field classification result includes the sound field type, or includes the number of different sound sources and the sound field type, determining the decoding mode of the current frame according to the sound field type; or, when the sound field classification result includes the number of different sound sources and the sound field type, determining the decoding mode of the current frame according to both the number of different sound sources and the sound field type.
- the determining of the decoding mode corresponding to the current frame according to the number of different sound sources includes: when the number of different sound sources satisfies a preset condition, determining that the decoding mode is the first decoding mode; when the number of heterogeneous sound sources does not satisfy the preset condition, determining that the decoding mode is the second decoding mode; wherein the first decoding mode is an HOA decoding mode based on virtual speaker selection or an HOA decoding mode based on directional audio coding, the second decoding mode is an HOA decoding mode based on virtual speaker selection or an HOA decoding mode based on directional audio coding, and the first decoding mode and the second decoding mode are different decoding modes.
- the preset condition includes that the number of dissimilar sound sources is greater than a first threshold and less than a second threshold, where the second threshold is greater than the first threshold; or, the preset condition includes that the number of dissimilar sound sources is not greater than the first threshold or not less than the second threshold, where the second threshold is greater than the first threshold.
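The preset-condition test can be sketched as follows; the mode names are placeholders, and per the text each mode is one of the virtual-speaker-selection-based or directional-audio-coding-based HOA decoding modes.

```python
def select_decoding_mode(num_sources, first_threshold, second_threshold):
    # Preset condition: the number of dissimilar sound sources lies strictly
    # between the first and second thresholds (second > first).
    if first_threshold < num_sources < second_threshold:
        return "first_decoding_mode"
    # Otherwise (not greater than the first, or not less than the second).
    return "second_decoding_mode"
```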
- the obtaining of the decoded 3D audio signal of the current frame according to the sound field classification result includes: determining the decoding parameters of the current frame according to the sound field classification result; and decoding the current frame according to the decoding parameters to obtain the decoded 3D audio signal of the current frame.
- the decoding parameters include at least one of the following: the number of channels of the virtual speaker signal, the number of channels of the residual signal, the number of decoding bits of the virtual speaker signal, or the number of decoding bits of the residual signal; wherein the virtual speaker signal and the residual signal are obtained by decoding the code stream.
- the sound field classification result includes the number of different sound sources, or the sound field classification result includes the number of different sound sources and the type of sound field;
- the ratio of the number of decoded bits of the virtual speaker signal to the number of decoded bits of the transmission channel is obtained;
- the number of decoded bits of the residual signal is obtained from the ratio of the number of decoded bits of the virtual speaker signal to the number of decoded bits of the transmission channel; wherein the number of decoded bits of the transmission channel includes the number of decoded bits of the virtual speaker signal and the number of decoded bits of the residual signal.
- when the number of distinct sound sources is less than or equal to the number of channels of the virtual speaker signal, the ratio of the number of decoded bits of the virtual speaker signal to the number of decoded bits of the transmission channel is obtained by increasing the initial ratio of the number of decoded bits of the virtual speaker signal to the number of decoded bits of the transmission channel.
- the embodiment of the present application also provides a three-dimensional audio signal processing device, including: a linear analysis module, configured to linearly decompose the three-dimensional audio signal to obtain a linear decomposition result; a parameter generation module, configured to acquire the sound field classification parameters corresponding to the current frame from the linear decomposition result; and a sound field classification module, configured to determine the sound field classification result of the current frame according to the sound field classification parameters.
- the constituent modules of the three-dimensional audio signal processing device can also perform the steps described in the aforementioned first aspect and its various possible implementations; for details, see the description in the aforementioned first aspect and its various possible implementations.
- the embodiment of the present application also provides a three-dimensional audio signal processing device, including: a receiving module, configured to receive a code stream; a decoding module, configured to decode the code stream to obtain the sound field classification result of the current frame; and a generating module, configured to obtain the decoded 3D audio signal of the current frame according to the sound field classification result.
- the components of the three-dimensional audio signal processing device can also perform the steps described in the aforementioned second aspect and various possible implementations.
- the number of encoded bits of the virtual speaker signal satisfies the following relationship:
- core_numbit is the number of encoded bits of the virtual speaker signal;
- fac1 is the weighting factor allocated to the encoded bits of the virtual speaker signal;
- fac2 is the weighting factor allocated to the encoded bits of the residual signal;
- round represents rounding down;
- F is the number of channels of the virtual speaker signal;
- R is the number of channels of the residual signal;
- numbit is the sum of the number of encoded bits of the virtual speaker signal and the number of encoded bits of the residual signal.
- the number of encoded bits of the residual signal satisfies the following relationship:
- res_numbit is the number of encoded bits of the residual signal;
- core_numbit is the number of encoded bits of the virtual speaker signal;
- numbit is the sum of the number of encoded bits of the virtual speaker signal and the number of encoded bits of the residual signal.
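The relationships themselves appear as formula images in the original publication and are not reproduced in this text; a plausible form that is consistent with the variable definitions above — offered only as an assumption — weights each channel group by its factor and rounds down.

```python
import math

def allocate_bits(numbit, fac1, fac2, F, R):
    # Assumed form: core_numbit = round(numbit * fac1 * F / (fac1 * F + fac2 * R)),
    # with "round" meaning rounding down (floor), per the variable list.
    core_numbit = math.floor(numbit * fac1 * F / (fac1 * F + fac2 * R))
    # res_numbit follows from numbit being the sum of the two bit counts.
    res_numbit = numbit - core_numbit
    return core_numbit, res_numbit
```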
- the number of encoded bits of the residual signal satisfies the following relationship:
- res_numbit is the number of encoded bits of the residual signal;
- fac1 is the weighting factor allocated to the encoded bits of the virtual speaker signal;
- fac2 is the weighting factor allocated to the encoded bits of the residual signal;
- round represents rounding down;
- F is the number of channels of the virtual speaker signal;
- R is the number of channels of the residual signal;
- numbit is the sum of the number of encoded bits of the virtual speaker signal and the number of encoded bits of the residual signal.
- the number of encoded bits of the virtual speaker signal satisfies the following relationship:
- core_numbit is the number of encoded bits of the virtual speaker signal;
- res_numbit is the number of encoded bits of the residual signal;
- numbit is the sum of the number of encoded bits of the virtual speaker signal and the number of encoded bits of the residual signal.
- the number of encoded bits of each virtual speaker signal satisfies the following relationship:
- core_ch_numbit is the number of encoded bits of each virtual speaker signal;
- fac1 is the weighting factor allocated to the encoded bits of the virtual speaker signal;
- fac2 is the weighting factor allocated to the encoded bits of the residual signal;
- round represents rounding down;
- F is the number of channels of the virtual speaker signal;
- R is the number of channels of the residual signal;
- numbit is the sum of the number of encoded bits of the virtual speaker signal and the number of encoded bits of the residual signal;
- res_numbit is the number of encoded bits of each residual signal, and the remaining variables are as defined above.
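Likewise, the per-channel relationships are not reproduced here; a form consistent with the variable list above — again an assumption — gives each channel its factor-weighted share of the total.

```python
import math

def per_channel_bits(numbit, fac1, fac2, F, R):
    # Assumed per-channel form: each virtual speaker channel receives a
    # fac1-weighted share, and each residual channel a fac2-weighted share,
    # of the total numbit; floor matches "rounding down" in the text.
    core_ch_numbit = math.floor(numbit * fac1 / (fac1 * F + fac2 * R))
    res_ch_numbit = math.floor(numbit * fac2 / (fac1 * F + fac2 * R))
    return core_ch_numbit, res_ch_numbit
```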
- the embodiment of the present application provides a computer-readable storage medium that stores instructions which, when run on a computer, cause the computer to execute the method described in the first aspect or the second aspect.
- an embodiment of the present application provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the method described in the first aspect or the second aspect.
- the embodiment of the present application provides a computer-readable storage medium, including the code stream generated by the method described in the foregoing first aspect.
- the embodiment of the present application provides a communication device, which may include an entity such as a terminal device or a chip; the communication device includes a processor and a memory; the memory is used to store instructions; the processor is used to execute the instructions in the memory, causing the communication device to execute the method described in any one of the aforementioned first aspect or second aspect.
- the present application provides a chip system, which includes a processor configured to support an audio encoder or an audio decoder in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
- the chip system further includes a memory, and the memory is used for storing necessary program instructions and data of the audio encoder or audio decoder.
- the chip system may consist of a chip, or may include a chip and other discrete devices.
- the current frame of the three-dimensional audio signal is first linearly decomposed to obtain the linear decomposition result; then the sound field classification parameter corresponding to the current frame is obtained according to the linear decomposition result; finally, the sound field classification result of the current frame is determined according to the sound field classification parameter.
- since the linear decomposition result of the current frame is obtained by linearly decomposing the current frame of the three-dimensional audio signal, and the sound field classification parameter corresponding to the current frame is then obtained from the linear decomposition result, the sound field classification result of the current frame is determined by the sound field classification parameter, and the sound field classification of the current frame can be realized through the sound field classification result.
- the embodiment of the present application classifies the sound field of the 3D audio signal, so that the 3D audio signal can be accurately identified.
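A minimal sketch of the three-step classification described above (linear decomposition, parameter extraction, classification). SVD is used here as the linear analysis, although the claims also mention principal component analysis and independent component analysis, and the energy-ratio parameter and its threshold are hypothetical illustrations, not the embodiment's actual parameters.

```python
import numpy as np

def classify_sound_field(frame, energy_ratio_threshold=0.9):
    # frame: (channels, samples) array holding the current frame of the
    # three-dimensional (HOA) audio signal.
    # Step 1: linear decomposition of the current frame (SVD here).
    singular_values = np.linalg.svd(frame, compute_uv=False)
    # Step 2: sound field classification parameter from the decomposition
    # result (hypothetical choice: cumulative energy ratio of components).
    energy = singular_values ** 2
    cumulative = np.cumsum(energy) / np.sum(energy)
    # Step 3: sound field classification result: the number of dissimilar
    # sound sources is taken as the number of components needed to reach
    # the energy threshold, and a coarse sound field type is derived.
    num_sources = int(np.searchsorted(cumulative, energy_ratio_threshold)) + 1
    sound_field_type = "heterogeneous" if num_sources > 1 else "dominant_single_source"
    return num_sources, sound_field_type
```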
- FIG. 1 is a schematic diagram of the composition and structure of an audio processing system provided by an embodiment of the present application
- FIG. 2a is a schematic diagram of an audio encoder and an audio decoder provided in an embodiment of the present application applied to a terminal device;
- FIG. 2b is a schematic diagram of an audio encoder provided by an embodiment of the present application applied to a wireless device or a core network device;
- FIG. 2c is a schematic diagram of an audio decoder provided by an embodiment of the present application applied to a wireless device or a core network device;
- FIG. 3a is a schematic diagram of a multi-channel encoder and a multi-channel decoder provided in an embodiment of the present application applied to a terminal device;
- FIG. 3b is a schematic diagram of a multi-channel encoder provided by an embodiment of the present application applied to a wireless device or a core network device;
- FIG. 3c is a schematic diagram of a multi-channel decoder provided in an embodiment of the present application applied to a wireless device or a core network device;
- FIG. 4 is a schematic diagram of a method for processing a three-dimensional audio signal provided in an embodiment of the present application
- FIG. 5 is a schematic diagram of a method for processing a three-dimensional audio signal provided in an embodiment of the present application
- FIG. 6 is a schematic diagram of a method for processing a three-dimensional audio signal provided in an embodiment of the present application.
- FIG. 7 is a schematic diagram of a method for processing a three-dimensional audio signal provided in an embodiment of the present application.
- FIG. 8 is a schematic diagram of an encoding process of a hybrid HOA encoder provided in an embodiment of the present application.
- FIG. 9 is a schematic flowchart of determining a coding mode of an HOA signal provided by an embodiment of the present application.
- FIG. 10 is a schematic diagram of a decoding process of a hybrid HOA decoder provided in an embodiment of the present application.
- FIG. 11 is a schematic diagram of an encoding process of an MP-based HOA encoder provided in an embodiment of the present application.
- FIG. 12 is a schematic diagram of the composition and structure of an audio encoding device provided by an embodiment of the present application.
- FIG. 13 is a schematic diagram of the composition and structure of an audio decoding device provided by an embodiment of the present application.
- FIG. 14 is a schematic diagram of the composition and structure of another audio encoding device provided by the embodiment of the present application.
- FIG. 15 is a schematic diagram of the composition and structure of another audio decoding device provided by an embodiment of the present application.
- Sound is a continuous wave produced by the vibration of an object. An object that vibrates and emits sound waves is called a sound source. When sound waves propagate through a medium (such as air, a solid, or a liquid), the auditory organs of humans or animals can perceive the sound.
- The characteristics of sound waves include pitch, sound intensity, and timbre.
- Pitch indicates how high or low a sound is.
- Sound intensity indicates the volume of a sound; it can also be called loudness or volume.
- The unit of sound intensity is the decibel (dB). Timbre is also called tone quality.
- the frequency of sound waves determines the pitch of the sound. The higher the frequency, the higher the pitch.
- the number of times an object vibrates within one second is called frequency, and the unit of frequency is hertz (Hz).
- the frequency of sound that can be recognized by the human ear is between 20Hz and 20000Hz.
- the amplitude of the sound wave determines the intensity of the sound. The greater the amplitude, the greater the sound intensity. The closer the distance to the sound source, the greater the sound intensity.
- the waveform of the sound wave determines the timbre.
- the waveforms of sound waves include square waves, sawtooth waves, sine waves, and pulse waves.
- sounds can be divided into regular sounds and irregular sounds.
- An irregular sound refers to a sound produced by a sound source vibrating irregularly, for example, noise that affects people's work, study, and rest.
- a regular sound refers to a sound produced by a sound source vibrating regularly. Regular sounds include speech and musical tones.
- A regular sound is an analog signal that changes continuously in the time-frequency domain. Such an analog signal may be referred to as an audio signal (acoustic signal).
- An audio signal is an information carrier that carries speech, music and sound effects.
- the human sense of hearing can distinguish the location and distribution of sound sources in space; when a listener hears sound in a space, he can not only perceive the pitch, intensity, and timbre of the sound, but also perceive its direction.
- Three-dimensional audio technology assumes that the space outside the human ear is a system, and that the signal received at the eardrum is the three-dimensional audio signal output by this system after it filters the sound from the sound source.
- if the system outside the human ear is defined as a system impulse response h(n), and any sound source is defined as x(n), then the signal received at the eardrum is the convolution of x(n) and h(n).
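This model can be written directly as a discrete convolution:

```python
import numpy as np

def eardrum_signal(x, h):
    # Signal received at the eardrum: convolution of the sound source x(n)
    # with the impulse response h(n) of the system outside the human ear.
    return np.convolve(x, h)
```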
- the three-dimensional audio signal described in the embodiment of the present application may refer to a higher order ambisonics (higher order ambisonics, HOA) signal or a first order ambisonics (first order ambisonics, FOA) signal.
- Three-dimensional audio may also be called spatial audio, three-dimensional sound field reconstruction, virtual 3D audio, or binaural audio.
- the sound pressure p satisfies formula (1), where ∇² is the Laplacian operator.
- if the space system outside the human ear is regarded as a sphere with the listener at its center, the sound from outside the sphere has a projection on the sphere, and the sound outside the sphere is filtered out.
- assuming that the sound sources are distributed on the sphere, the sound field generated by the sound sources on the sphere is used to fit the sound field generated by the original sound source; that is, three-dimensional audio technology is a method of fitting the sound field.
- formula (1) is solved in the spherical coordinate system; in the passive spherical region, the solution of formula (1) is the following formula (2).
- r represents the radius of the sphere;
- θ represents the horizontal angle;
- k represents the wave number;
- s represents the amplitude of the ideal plane wave;
- m represents the order number of the three-dimensional audio signal (or the order number of the HOA signal);
- the spherical harmonic terms in formula (2) represent, respectively, the spherical harmonics of the corresponding direction and the spherical harmonics of the sound source direction.
- the three-dimensional audio signal coefficients satisfy formula (3).
- formula (3) can be transformed into formula (4).
- N is an integer greater than or equal to 1.
- the value of N is an integer ranging from 2 to 6.
- the coefficients of the 3D audio signal described in the embodiments of the present application may refer to HOA coefficients or ambient stereo (ambisonic) coefficients.
- the three-dimensional audio signal is an information carrier carrying the spatial position information of the sound source in the sound field, and describes the sound field of the listener in the space.
- Formula (4) shows that the sound field can be expanded on the spherical surface according to the spherical harmonic function, that is, the sound field can be decomposed into the superposition of multiple plane waves. Therefore, the sound field described by the three-dimensional audio signal can be expressed by the superposition of multiple plane waves, and the sound field can be reconstructed through the coefficients of the three-dimensional audio signal.
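One consequence of this spherical harmonic expansion is the channel count: an order-N ambisonics signal carries (N + 1)² coefficient channels, one per spherical harmonic up to order N, which is why the data volume grows quickly with the order.

```python
def hoa_channel_count(order):
    # An order-N ambisonics (HOA) signal carries (N + 1)^2 coefficient
    # channels, one per spherical harmonic up to order N.
    return (order + 1) ** 2
```

For the orders N = 2 to 6 mentioned above, this gives 9 to 49 channels.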
- the HOA signal includes a large amount of data describing the spatial information of the sound field; if the acquisition device (such as a microphone) transmits the three-dimensional audio signal directly to a playback device (such as a speaker), a large bandwidth needs to be consumed.
- the encoder can compress and encode the three-dimensional audio signal using the spatial squeezed surround audio coding (S3AC) method, the directional audio coding (DirAC) method, or a coding method based on virtual speaker selection to obtain a code stream, and transmit the code stream to the playback device; the coding method based on virtual speaker selection may also be referred to as the match projection (MP) coding method, and the coding method based on virtual speaker selection is described as an example below.
- the playback device decodes the code stream, reconstructs the three-dimensional audio signal, and plays the reconstructed three-dimensional audio signal. Therefore, the amount of data transmitted to the playback device and the bandwidth occupation of the three-dimensional audio signal are reduced.
- the sound field classification of the 3D audio signal can be realized through linear decomposition of the 3D audio signal, so that the sound field classification can be performed accurately and the sound field classification result of the current frame can be obtained.
- the embodiment of the present application provides an audio coding technology, especially a three-dimensional audio coding technology for three-dimensional audio signals, and specifically provides a coding technology that uses fewer channels to represent three-dimensional audio signals, so as to improve the traditional audio coding system.
- Audio coding (commonly referred to simply as coding) includes two parts: audio encoding and audio decoding. Audio encoding is performed on the source side and involves processing (e.g., compressing) the raw audio to reduce the amount of data needed to represent it, for more efficient storage and/or transmission. Audio decoding is performed at the destination and includes the inverse processing relative to the encoder to reconstruct the original audio. The encoding part and the decoding part are also collectively referred to as coding.
- the implementation of the embodiment of the present application will be described in detail below with reference to the accompanying drawings.
- the technical solution of the embodiment of the present application can be applied to various audio processing systems, as shown in FIG. 1 , which is a schematic diagram of the composition and structure of the audio processing system provided by the embodiment of the present application.
- the audio processing system 100 may include: an audio encoding device 101 and an audio decoding device 102 .
- the audio encoding device 101 can be used to generate a code stream, which is then transmitted to the audio decoding device 102 through an audio transmission channel; the audio decoding device 102 receives the code stream, performs its audio decoding function, and finally obtains the reconstructed signal.
- the audio encoding device can be applied to various terminal devices that require audio communication, wireless devices that require transcoding, and core network devices; for example, the audio encoding device can be the audio encoder of such a terminal device, wireless device, or core network device.
- likewise, the audio decoding device can be applied to various terminal devices that require audio communication, wireless devices that require transcoding, and core network devices; for example, the audio decoding device can be the audio decoder of such a terminal device, wireless device, or core network device.
- the audio encoder may be used in a radio access network, a media gateway of a core network, a transcoding device, a media resource server, a mobile terminal, a fixed network terminal, and the like; the audio encoder may also be an audio encoder in virtual reality (VR) streaming services.
- For a VR streaming service, the end-to-end audio signal processing flow includes: the audio signal A passes through an acquisition module and then a preprocessing operation (audio preprocessing); the preprocessing includes filtering out the low-frequency part of the signal, with 20 Hz or 50 Hz as the dividing point, and extracting the orientation information in the signal; the signal is then encoded (audio encoding), packaged (file/segment encapsulation), and sent (delivery) to the decoding end; the decoding end first unpacks (file/segment decapsulation), then decodes (audio decoding), and performs binaural rendering (audio rendering) on the decoded signal; the rendered signal is mapped onto the listener's headphones, which may be standalone headphones or headphones on a glasses device.
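The low-frequency filtering in the preprocessing step can be sketched with a simple one-pole high-pass at the 20 Hz or 50 Hz dividing point; the filter topology here is an illustrative choice, not one specified by the text.

```python
import math

def highpass(samples, cutoff_hz=20.0, sample_rate_hz=48000.0):
    # One-pole high-pass: removes content below the cutoff (the 20 Hz or
    # 50 Hz dividing point) while passing the audible band largely unchanged.
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    out, prev_in, prev_out = [], 0.0, 0.0
    for s in samples:
        y = a * (prev_out + s - prev_in)
        out.append(y)
        prev_in, prev_out = s, y
    return out
```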
- FIG. 2a is a schematic diagram of an audio encoder and an audio decoder provided in the embodiment of the present application applied to a terminal device.
- Each terminal device may include: an audio encoder, a channel encoder, an audio decoder, and a channel decoder.
- the channel encoder is used for channel coding the audio signal
- the channel decoder is used for channel decoding the audio signal.
- the first terminal device 20 may include: a first audio encoder 201 , a first channel encoder 202 , a first audio decoder 203 , and a first channel decoder 204 .
- the second terminal device 21 may include: a second audio decoder 211 , a second channel decoder 212 , a second audio encoder 213 , and a second channel encoder 214 .
- the first terminal device 20 is connected to a wireless or wired first network communication device 22, the first network communication device 22 is connected to a wireless or wired second network communication device 23 through a digital channel, and the second terminal device 21 is connected to the wireless or wired second network communication device 23.
- the second network communication device 23 may generally refer to signal transmission equipment, such as communication base stations, data exchange equipment, and the like.
- the terminal device serving as the sending end first collects audio, performs audio encoding on the collected audio signal, then performs channel encoding, and transmits the result in a digital channel through a wireless network or a core network.
- the terminal device serving as the receiving end performs channel decoding on the received signal to obtain the code stream, then recovers the audio signal through audio decoding, and plays it back.
- the wireless device or the core network device 25 includes: a channel decoder 251, another audio decoder 252, an audio encoder 253 provided in the embodiment of the present application, and a channel encoder 254, wherein the other audio decoder 252 refers to an audio decoder other than the audio decoder provided in the embodiment of the present application.
- the channel decoder 251 is first used to perform channel decoding on the signal entering the device, then the other audio decoder 252 is used for audio decoding, and then the audio encoder 253 provided by the embodiment of the present application is used to encode the audio signal.
- the channel coder 254 is used to perform channel coding on the audio signal, and the channel coding is completed before transmission.
- the other audio decoder 252 performs audio decoding on the code stream decoded by the channel decoder 251 .
- FIG. 2c is a schematic diagram of an audio decoder provided by the embodiment of the present application applied to a wireless device or a core network device.
- the wireless device or the core network device 25 includes: a channel decoder 251, an audio decoder 255 provided in the embodiment of the present application, another audio encoder 256, and a channel encoder 254, wherein the other audio encoder 256 refers to an audio encoder other than the audio encoder provided in the embodiment of the present application.
- the signal entering the device is first channel-decoded by the channel decoder 251, then the received audio coded stream is decoded using the audio decoder 255, and then other audio encoders 256 are used to Perform audio encoding, and finally use the channel encoder 254 to perform channel encoding on the audio signal, and then transmit it after completing the channel encoding.
- the wireless device refers to equipment related to radio frequency in communication
- the core network device refers to equipment related to core network in communication.
- the audio coding device can be applied to various terminal devices that require audio communication, wireless devices that require transcoding, and core network devices.
- the audio coding device can be a multi-channel encoder of the above-mentioned terminal device, wireless device, or core network device.
- the audio decoding device can be applied to various terminal devices that require audio communication, wireless devices that require transcoding, and core network devices.
- the audio decoding device can be a multi-channel decoder of the above-mentioned terminal device, wireless device, or core network device.
- a schematic diagram of the application of the multi-channel encoder and multi-channel decoder provided by the embodiment of the present application to the terminal equipment may include: a multi-channel encoder, a channel encoder, a multi-channel decoder, and a channel decoder.
- the multi-channel encoder may execute the audio encoding method provided in the embodiment of the present application
- the multi-channel decoder may execute the audio decoding method provided in the embodiment of the present application.
- the channel encoder is used to perform channel coding on the multi-channel signal
- the channel decoder is used to perform channel decoding on the multi-channel signal.
- the first terminal device 30 may include: a first multi-channel encoder 301 , a first channel encoder 302 , a first multi-channel decoder 303 , and a first channel decoder 304 .
- the second terminal device 31 may include: a second multi-channel decoder 311 , a second channel decoder 312 , a second multi-channel encoder 313 , and a second channel encoder 314 .
- the first terminal device 30 is connected to a wireless or wired first network communication device 32, the first network communication device 32 is connected to a wireless or wired second network communication device 33 through a digital channel, and the second terminal device 31 is connected to the wireless or wired second network communication device 33.
- the foregoing wireless or wired network communication equipment may generally refer to signal transmission equipment, such as communication base stations, data exchange equipment, and the like.
- the terminal device as the sending end performs multi-channel coding on the collected multi-channel signal, and then performs channel coding, and then transmits it in a digital channel through a wireless network or a core network.
- the terminal device as the receiving end performs channel decoding according to the received signal to obtain the coded stream of the multi-channel signal, and then restores the multi-channel signal through multi-channel decoding, and the terminal device as the receiving end plays it back.
- FIG. 3b is a schematic diagram of a multi-channel encoder applied to a wireless device or a core network device provided by the embodiment of the present application, wherein the wireless device or the core network device 35 includes: a channel decoder 351, another audio decoder 352, the multi-channel encoder 353, and a channel encoder 354, which are similar to those in FIG. 2b and will not be repeated here.
- FIG. 3c is a schematic diagram of a multi-channel decoder applied to a wireless device or a core network device provided by the embodiment of the present application, wherein the wireless device or the core network device 35 includes: a channel decoder 351, a multi-channel decoder 355, another audio encoder 356, and a channel encoder 354, which are similar to those in FIG. 2c and will not be repeated here.
- the audio encoding process can be a part of the multi-channel encoder, and the audio decoding process can be a part of the multi-channel decoder.
- performing multi-channel encoding on the collected multi-channel signal may be: processing the multi-channel signal to obtain an audio signal, and then encoding the obtained audio signal according to the method provided in the embodiment of the present application; the decoding end decodes the audio signal from the code stream of the multi-channel signal, and recovers the multi-channel signal after the up-mixing process. Therefore, the embodiments of the present application may also be applied to multi-channel encoders and multi-channel decoders in terminal devices, wireless devices, and core network devices. In wireless or core network equipment, if transcoding needs to be implemented, the corresponding multi-channel encoding processing needs to be performed.
- the method can be executed by a terminal device, for example, the terminal device can be an audio encoding device (hereinafter referred to as an encoding terminal or an encoder).
- the terminal device may also be a three-dimensional audio signal processing device.
- the processing method of the three-dimensional audio signal mainly includes the following:
- the encoding end may acquire a three-dimensional audio signal
- the three-dimensional audio signal may be a scene audio signal.
- the three-dimensional audio signal may be a time-domain signal or a frequency-domain signal.
- the 3D audio signal may also be a down-sampled signal.
- the three-dimensional audio signal includes: a high-order ambisonic HOA signal, or a first-order ambisonic FOA signal.
- the three-dimensional audio signal may also be other types of signals, and this is only an example of the present application, and is not intended to limit the embodiment of the present application.
- the 3D audio signal may be a time-domain HOA signal or a frequency-domain HOA signal.
- the 3D audio signal may include all channels of the HOA signal, or may include some HOA channels (for example, FOA channels).
- the three-dimensional audio signal may be all sample points of the HOA signal, or may be 1/Q downsampling points after the HOA signal to be analyzed is downsampled. Among them, Q is the downsampling interval, and 1/Q is the downsampling rate.
- the 3D audio signal includes multiple frames. Next, the processing of one frame in the 3D audio signal is taken as an example. If this frame is the current frame, there is a previous frame before the current frame and a next frame after the current frame.
- the processing method of other frames of the 3D audio signal except the current frame in the embodiment of the present application is similar to the processing method of the current frame, and the processing of the current frame will be used as an example in the following.
- the current frame of the 3D audio signal is acquired, the current frame is linearly decomposed first, and the linear decomposition result of the current frame can be obtained through the linear decomposition.
- There are many ways of performing linear decomposition, which will be described in detail next.
- step 401 performs linear decomposition on the current frame of the three-dimensional audio signal to obtain a linear decomposition result, including:
- A1. Perform singular value decomposition on the current frame to obtain singular values corresponding to the current frame, wherein the linear decomposition result includes: the singular values;
- A2. Perform principal component analysis on the current frame to obtain the first eigenvalue corresponding to the current frame, wherein the linear decomposition result includes: the first eigenvalue;
- A3. Perform independent component analysis on the current frame to obtain a second eigenvalue corresponding to the current frame, wherein the linear decomposition result includes: the second eigenvalue.
- linear decomposition may include at least one of the following: singular value decomposition (singular value decomposition, SVD), principal component analysis (principal component analysis, PCA), independent component analysis (independent component analysis, ICA).
- the linear decomposition may be singular value decomposition.
- a matrix A is formed by the HOA signal.
- the matrix A is a matrix of L*K, where L is equal to the number of channels of the HOA signal, and K is the number of signal points of the HOA signal of each channel in the current frame.
- the number of signal points may include: the number of frequency points, or the number of sample points in the time domain, or the number of frequency points or sample points after downsampling.
- Singular value decomposition is performed on the matrix A to satisfy the following relationship: A = U*Σ*V^T, where:
- U is an L*L matrix
- V is a K*K matrix
- the superscript T denotes the transpose of the matrix V
- * means multiplication.
- Σ is an L*K diagonal matrix
- each element on the main diagonal is the singular value of the matrix A obtained by singular value decomposition, and the elements outside the main diagonal are all 0.
- K is the number of signal points after downsampling of the HOA signal of each channel of the current frame, for example, the number of signal points may be the number of samples or frequency points.
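As an illustrative sketch of the decomposition above (a non-normative example; numpy and the specific frame dimensions are assumptions of this sketch), the L*K frame matrix can be factored with SVD and reconstructed from U, Σ and V^T:

```python
import numpy as np

# Hypothetical FOA frame: L = 4 channels, K = 8 signal points per channel
# (samples or frequency bins, possibly after 1/Q downsampling).
L, K = 4, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((L, K))  # matrix A formed by the HOA signal of the current frame

# A = U * Sigma * V^T: U is L*L, V is K*K, and Sigma is an L*K matrix whose
# main-diagonal entries are the singular values (numpy returns them descending).
U, s, Vt = np.linalg.svd(A)
Sigma = np.zeros((L, K))
Sigma[:L, :L] = np.diag(s)

# The factorization reconstructs A exactly (up to numerical precision).
assert np.allclose(U @ Sigma @ Vt, A)
```

The singular values in `s` are the linear decomposition results used later when computing the sound field classification parameters.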
- the linear decomposition may also be principal component analysis to obtain eigenvalues.
- the eigenvalues obtained through principal component analysis are defined as first eigenvalues. The specific implementation of principal component analysis will not be repeated here.
- the linear decomposition may also be independent component analysis to obtain the second eigenvalue.
- the specific implementation of independent component analysis will not be repeated here.
- the linear decomposition of the current frame can be realized through any of the above-mentioned implementation manners of A1 to A3, so that various types of linear decomposition results can be obtained.
- after the encoding end obtains the linear decomposition result of the current frame, it analyzes the linear decomposition result to obtain the sound field classification parameter corresponding to the current frame.
- the sound field classification parameter is obtained by analyzing the linear decomposition result of the current frame.
- the sound field classification parameter is used to determine the sound field classification result of the current frame. According to different specific implementations of the linear decomposition result, the sound field classification parameters may be implemented in multiple ways.
- when there are two linear decomposition results, one sound field classification parameter is obtained.
- more generally, when there are N linear decomposition results, N-1 sound field classification parameters are obtained; the value of N is not limited.
- step 402 obtains the sound field classification parameters corresponding to the current frame according to the linear decomposition result, including:
- the encoding end can calculate the sound field classification parameter corresponding to the current frame according to the linear decomposition result. For example, when there are multiple linear decomposition results of the current frame, two consecutive results among them are denoted as the i-th linear decomposition result and the (i+1)-th linear decomposition result; the ratio of the i-th linear decomposition result to the (i+1)-th linear decomposition result of the current frame can then be calculated. The specific value of i is not limited.
- the i-th linear decomposition result and the (i+1)-th linear decomposition result are two consecutive linear decomposition results of the current frame.
- the i-th sound field classification parameter corresponding to the current frame can be obtained using the ratio of the i-th linear decomposition result to the (i+1)-th linear decomposition result of the current frame. It follows that the ratio of the i-th to the (i+1)-th result yields the i-th sound field classification parameter, the ratio of the (i+1)-th to the (i+2)-th result yields the (i+1)-th sound field classification parameter, and so on; there is a correspondence between the linear decomposition results and the sound field classification parameters.
- the ratio of the i-th linear decomposition result to the (i+1)-th linear decomposition result can be used directly as the i-th sound field classification parameter, but this is not limited: after obtaining the ratio, various calculations can be performed on it to obtain the i-th sound field classification parameter. For example, the ratio may be multiplied by a preset adjustment factor to obtain the i-th sound field classification parameter.
- singular values can be obtained through singular value decomposition, and the ratio parameter between two adjacent singular values can be calculated as the sound field classification parameter.
- temp[i] satisfies: temp[i] = v[i]/v[i+1], where v[i] is the i-th singular value of the current frame and v[i+1] is the (i+1)-th singular value.
- the sound field classification parameters can be determined according to the eigenvalues.
- the calculation method of this sound field classification parameter is similar to that of the ratio temp between singular values described above: the ratio between two consecutive eigenvalues obtained by linear decomposition can also be calculated as the sound field classification parameter.
- when there are more than two linear decomposition results, the sound field classification parameter is a vector; otherwise, the sound field classification parameter is a scalar.
- for example, for v[i]: if v has only two elements, the calculated temp[i] is a scalar, that is, there is only one temp value; if v has more than two elements, the calculated temp[i] is a vector, and there are at least two elements in temp.
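A minimal sketch of this parameter computation (the function name and the adjustment factor are assumptions of this sketch, with the factor defaulting to 1): N singular values yield N-1 sound field classification parameters, each the ratio of adjacent singular values.

```python
import numpy as np

def sound_field_params(singular_values, factor=1.0):
    """Ratio of each singular value v[i] to its successor v[i+1], optionally
    scaled by a preset adjustment factor; N values yield N-1 parameters."""
    v = np.asarray(singular_values, dtype=float)
    eps = 1e-12  # guard against division by zero
    return factor * v[:-1] / (v[1:] + eps)

temp = sound_field_params([10.0, 5.0, 0.1])   # three singular values
# temp is a vector with two elements (roughly [2.0, 50.0])
scalar = sound_field_params([10.0, 5.0])      # two singular values
# scalar holds a single element, i.e. the parameter degenerates to a scalar
```

This mirrors the scalar/vector distinction above: two linear decomposition results give one parameter, more than two give a vector of parameters.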
- after the encoding end obtains the sound field classification parameter corresponding to the current frame, it can perform sound field classification on the current frame according to the sound field classification parameter, because the sound field classification parameter corresponding to the current frame indicates the parameters required for classifying the sound field of the current frame; the sound field classification result of the current frame can therefore be obtained based on the sound field classification parameter.
- the sound field classification result may include at least one of the following: sound field type, and number of different sound sources.
- the sound field type refers to the sound field type of the current frame determined after the sound field classification of the current frame.
- the sound field types may be divided into a first sound field type and a second sound field type, or into a first sound field type, a second sound field type, a third sound field type, and so on. Specifically, how many types the sound field is divided into can be determined based on the application scenario.
- the sound field type may include heterogeneous sound field and diffuse sound field.
- the heterogeneous sound field refers to the presence of point sound sources with different positions and/or directions in the sound field
- the diffuse sound field refers to the sound field that does not contain heterogeneous sound sources.
- point sound sources with different positions and/or directions are heterogeneous sound sources
- sound fields containing heterogeneous sound sources are heterogeneous sound fields
- sound fields without heterogeneous sound sources are diffuse sound fields.
- the different sound sources are point sound sources with different positions and/or directions, and the number of different sound sources included in the current frame is called the number of different sound sources.
- the sound field of the current frame can also be classified by the number of different sound sources.
- the sound field classification results include: sound field type;
- Step 403 determines the sound field classification result of the current frame according to the sound field classification parameters, including:
- when the values of the sound field classification parameters all satisfy a preset diffuse sound source determination condition, the sound field type is determined to be a diffuse sound field;
- when at least one of the values of the sound field classification parameters satisfies a preset heterogeneous sound source determination condition, the sound field type is determined to be a heterogeneous sound field.
- the sound field type may include a heterogeneous sound field and a diffuse sound field.
- a diffuse sound source determination condition and a heterogeneous sound source determination condition are preset: the diffuse sound source determination condition is used to judge whether the sound field type is a diffuse sound field, and the heterogeneous sound source determination condition is used to judge whether the sound field type is a heterogeneous sound field. After the multiple sound field classification parameters of the current frame are obtained, the judgment is made according to the values of the multiple sound field classification parameters and the above preset conditions.
- the specific diffuse sound source determination condition and heterogeneous sound source determination condition are not limited here.
- after the encoding end obtains the multiple sound field classification parameters, when the values of the multiple sound field classification parameters all satisfy the preset diffuse sound source determination condition, the sound field type is determined to be a diffuse sound field. For example, there are N sound field classification parameters corresponding to the current frame, and only when the values of all N sound field classification parameters meet the preset diffuse sound source determination condition is the sound field type of the current frame determined to be a diffuse sound field.
- when at least one of the values of the multiple sound field classification parameters satisfies the preset heterogeneous sound source determination condition, the sound field type is determined to be a heterogeneous sound field. For example, there are N sound field classification parameters corresponding to the current frame, and as long as at least one of the N sound field classification parameters satisfies the preset heterogeneous sound source determination condition, the sound field type is determined to be a heterogeneous sound field.
- the diffuse sound source determination condition includes: the value of the sound field classification parameter is less than a preset heterogeneous sound source determination threshold;
- the heterogeneous sound source determination condition includes: the value of the sound field classification parameter is greater than or equal to the preset heterogeneous sound source determination threshold.
- the heterogeneous sound source determination threshold may be a preset threshold, and its specific value is not limited.
- since the diffuse sound source determination condition is that the value of the sound field classification parameter is less than the preset heterogeneous sound source determination threshold, the sound field type is determined to be a diffuse sound field when the values of the multiple sound field classification parameters are all less than that threshold.
- since the heterogeneous sound source determination condition is that the value of the sound field classification parameter is greater than or equal to the preset heterogeneous sound source determination threshold, the sound field type is determined to be a heterogeneous sound field when at least one of the values of the multiple sound field classification parameters is greater than or equal to that threshold.
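The all-versus-any decision described above can be sketched as follows (the function and label names are illustrative assumptions, not terms from the embodiment):

```python
def classify_sound_field(params, het_threshold):
    """Diffuse sound field if every sound field classification parameter is
    below the heterogeneous sound source determination threshold; heterogeneous
    sound field as soon as at least one parameter reaches that threshold."""
    if all(p < het_threshold for p in params):
        return "diffuse"
    return "heterogeneous"

# All parameters below the threshold -> diffuse; any at or above it -> heterogeneous.
assert classify_sound_field([2.0, 3.5], het_threshold=30) == "diffuse"
assert classify_sound_field([2.0, 50.0], het_threshold=30) == "heterogeneous"
```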
- the sound field classification result includes: sound field type; or, the sound field classification result includes: the number of different sound sources and the sound field type;
- Step 403 determines the sound field classification result of the current frame according to the sound field classification parameters, including:
- the encoder can obtain the number of different sound sources corresponding to the current frame through the values of multiple sound field classification parameters.
- the different sound sources are different in position and/or direction
- the number of dissimilar sound sources included in the current frame is called the number of dissimilar sound sources.
- the sound field of the current frame can be classified according to the number of different sound sources.
- the sound field type corresponding to the current frame can be determined by analyzing the number of distinct sound sources corresponding to the current frame.
- the sound field classification results include: the number of different sound sources;
- Step 403 determines the sound field classification result of the current frame according to the sound field classification parameters, including:
- the encoder can obtain the number of different sound sources corresponding to the current frame through the values of multiple sound field classification parameters.
- the different sound sources are different in position and/or direction
- the number of dissimilar sound sources included in the current frame is called the number of dissimilar sound sources.
- the aforementioned step C1 or D1 obtains the number of different sound sources corresponding to the current frame according to the values of multiple sound field classification parameters, including:
- the encoding end can estimate the number of different sound sources and determine the sound field type according to the sound field classification parameters.
- the sound field type may include a heterogeneous sound field and a diffuse sound field. A heterogeneous sound field refers to a sound field in which point sound sources that differ in position and/or direction are present; a diffuse sound field is one that does not contain heterogeneous sound sources.
- when the values of the sound field classification parameters are all less than a preset threshold, the sound field type is a diffuse sound field;
- when at least one of the values of the sound field classification parameters is greater than or equal to the preset threshold, the sound field type is a heterogeneous sound field.
- for example, the threshold is denoted TH1 when the ratio temp[i] between the singular values is used as the sound field classification parameter.
- the value of TH1 may be a constant, for example, 30 or 100; the value of TH1 is not limited in this embodiment of the present application.
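One plausible reading of the source-count estimation (an assumption of this sketch, since the text does not spell out the counting rule) is that each ratio temp[i] reaching TH1 marks one singular value dominating its successor, i.e. one candidate heterogeneous sound source:

```python
def estimate_sources(temp, th1=30):
    """Hypothetical estimator: count the ratios temp[i] >= TH1 as candidate
    heterogeneous point sources. Zero such ratios corresponds to a diffuse
    sound field; one or more corresponds to a heterogeneous sound field."""
    count = sum(1 for t in temp if t >= th1)
    field_type = "heterogeneous" if count > 0 else "diffuse"
    return count, field_type

# With TH1 = 30: one dominant ratio -> one estimated source, heterogeneous field.
assert estimate_sources([50.0, 2.0]) == (1, "heterogeneous")
assert estimate_sources([2.0, 3.0]) == (0, "diffuse")
```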
- the aforementioned step C2 determines the sound field type according to the number of dissimilar sound sources corresponding to the current frame, including:
- when the number of heterogeneous sound sources satisfies a first preset condition, the sound field type is determined to be the first sound field type; otherwise, the sound field type is determined to be the second sound field type;
- the number of different sound sources corresponding to the first sound field type is different from the number of different sound sources corresponding to the second sound field type.
- the sound field types may be divided into two types according to the number of different sound sources: a first sound field type and a second sound field type.
- the encoding end obtains the first preset condition and judges whether the number of heterogeneous sound sources meets it: when the number of heterogeneous sound sources satisfies the first preset condition, the sound field type is determined to be the first sound field type; when the number of heterogeneous sound sources does not satisfy the first preset condition, the sound field type is determined to be the second sound field type.
- the division of the sound field type of the current frame can be realized by judging whether the number of heterogeneous sound sources satisfies the first preset condition, so that it can be accurately identified whether the sound field type of the current frame belongs to the first sound field type or the second sound field type.
- the first preset condition includes that the number of heterogeneous sound sources is greater than a first threshold and less than a second threshold, wherein the second threshold is greater than the first threshold;
- the first preset condition includes that the number of heterogeneous sound sources is not greater than a first threshold or not less than a second threshold, wherein the second threshold is greater than the first threshold.
- since the second threshold is greater than the first threshold, the first threshold and the second threshold can form a preset range; the first preset condition can then be that the number of heterogeneous sound sources is within the preset range, or that the number of heterogeneous sound sources is outside the preset range.
- by comparing the number of heterogeneous sound sources with the first threshold and the second threshold in the above-mentioned first preset condition, it can be judged whether the number of heterogeneous sound sources satisfies the first preset condition, so that it can be accurately identified whether the sound field type of the current frame belongs to the first sound field type or the second sound field type.
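Assuming the in-range form of the first preset condition described above (an assumption of this sketch; the condition could equally be the out-of-range form), the type decision can be sketched as:

```python
def sound_field_type(num_sources, first_threshold, second_threshold):
    """First sound field type when the number of heterogeneous sound sources
    lies strictly between the two thresholds (second > first); otherwise the
    second sound field type."""
    if first_threshold < num_sources < second_threshold:
        return "first"
    return "second"

assert sound_field_type(3, first_threshold=1, second_threshold=5) == "first"
assert sound_field_type(0, first_threshold=1, second_threshold=5) == "second"
assert sound_field_type(5, first_threshold=1, second_threshold=5) == "second"
```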
- determining the sound field classification result of the current frame according to the sound field classification parameter may further include: determining the sound field classification result of the current frame according to the sound field classification parameter and other parameters characterizing the characteristics of the three-dimensional audio signal.
- other parameters that characterize the characteristics of the three-dimensional audio signal have multiple implementation methods.
- other parameters that characterize the characteristics of the three-dimensional audio signal may include at least one of the following: an energy ratio parameter of the three-dimensional audio signal, high-frequency and low-frequency characteristic analysis parameters of the three-dimensional audio signal, and the like.
- a method for processing a three-dimensional audio signal mainly includes the following:
- the encoding end may perform the aforementioned steps 501 to 503. After the encoding end obtains the sound field classification result of the current frame, the encoding end may determine the encoding mode corresponding to the current frame according to the sound field classification result.
- the encoding mode refers to the mode adopted when encoding the current frame of the 3D audio signal. There are many encoding modes, and different encoding modes may be adopted according to the sound field classification results of the current frame. In this embodiment of the present application, an appropriate encoding mode is selected for different sound field classification results of the current frame, so as to use the encoding mode to encode the current frame to improve the compression efficiency and auditory quality of the audio signal.
- step 503 determines the encoding mode corresponding to the current frame according to the sound field classification result, including:
- the sound field classification result includes the number of different sound sources, or the sound field classification result includes the number of different sound sources and the type of sound field, determine the encoding mode corresponding to the current frame according to the number of different sound sources;
- the sound field classification result includes the sound field type, or the sound field classification result includes the number of different sound sources and the sound field type, determine the encoding mode corresponding to the current frame according to the sound field type;
- the sound field classification result includes the number of different sound sources and the type of sound field
- in step E1, after the encoding end acquires the number of heterogeneous sound sources in the current frame, the number of heterogeneous sound sources can be used to determine the encoding mode corresponding to the current frame.
- in step E2, after the encoding end obtains the sound field type of the current frame, the sound field type can be used to determine the encoding mode corresponding to the current frame.
- in step E3, after the encoding end obtains both the number of heterogeneous sound sources and the sound field type of the current frame, the number of heterogeneous sound sources and the sound field type can be used together to determine the encoding mode corresponding to the current frame.
- the encoding end can determine the encoding mode corresponding to the current frame through the number of heterogeneous sound sources and/or the sound field type; the encoding mode is thus determined according to the sound field classification result of the current frame, so that the determined encoding mode is adapted to the current frame of the three-dimensional audio signal and the coding efficiency can be improved.
- step E1 determining the encoding mode corresponding to the current frame according to the number of different sound sources includes:
- when the number of heterogeneous sound sources satisfies a second preset condition, the encoding mode is determined to be the first encoding mode;
- when the number of heterogeneous sound sources does not satisfy the second preset condition, the encoding mode is determined to be the second encoding mode;
- the first coding mode is the HOA coding mode based on virtual speaker selection or the HOA coding mode based on directional audio coding
- the second coding mode is the HOA coding mode based on virtual speaker selection or the HOA coding mode based on directional audio coding
- the first encoding mode and the second encoding mode are different encoding modes.
- the HOA coding mode based on virtual speaker selection may also be referred to as the HOA coding mode based on matching projection (MP).
- the coding modes can be divided into two types according to the number of different sound sources: a first coding mode and a second coding mode.
- the encoding end obtains the second preset condition and judges whether the number of heterogeneous sound sources satisfies it: when the number of heterogeneous sound sources satisfies the second preset condition, the encoding mode is determined to be the first encoding mode; when the number of heterogeneous sound sources does not satisfy the second preset condition, the encoding mode is determined to be the second encoding mode.
- the coding mode of the current frame can be determined by judging whether the number of heterogeneous sound sources satisfies the second preset condition, so that it can be accurately identified whether the coding mode of the current frame is the first coding mode or the second coding mode.
- for example, the first coding mode is the HOA coding mode based on virtual speaker selection, and the second coding mode is the HOA coding mode based on directional audio coding;
- alternatively, the first coding mode is the HOA coding mode based on directional audio coding, and the second coding mode is the HOA coding mode based on virtual speaker selection. The specific implementations of the first coding mode and the second coding mode can be determined according to the application scenario.
- the sound field classification result may determine the encoding mode selected by the encoding end.
- the sound field classification result can be used to determine the encoding mode of the HOA signal.
- the encoding mode is determined according to the sound field type: the HOA signal belonging to the heterogeneous sound field is suitable for encoding by the encoder corresponding to the encoding mode A, and the HOA signal belonging to the diffuse sound field is suitable for encoding by the encoder corresponding to the encoding mode B.
- the encoding mode is determined according to the number of different sound sources: when the number of different sound sources satisfies the decision condition for using the encoding mode X, the encoder corresponding to the encoding mode X is used for encoding.
- the coding mode may also be determined according to both the sound field type and the number of heterogeneous sound sources: when the sound field type is a diffuse sound field, the encoder corresponding to coding mode C is used for encoding; when the sound field type is a heterogeneous sound field and the number of heterogeneous sound sources satisfies the decision condition for using coding mode X, the encoder corresponding to coding mode X is used for encoding.
- Coding mode A, coding mode B, coding mode C, and coding mode X may include multiple different coding modes.
- different sound field classification results correspond to different coding modes, which are not limited in this embodiment of the application.
- the coding mode X may be coding mode 1 when the number of different sound sources is less than a preset threshold, and coding mode 2 when the number of different sound sources is greater than or equal to the preset threshold.
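The combined decision above might be sketched as follows (the mode labels and the source-count threshold are illustrative assumptions of this sketch, not the embodiment's normative values):

```python
def select_coding_mode(field_type, num_sources, source_threshold=2):
    """Diffuse frames use coding mode C; heterogeneous frames that satisfy the
    decision condition for mode X pick coding mode 1 or coding mode 2 depending
    on whether the number of heterogeneous sources stays below a preset
    threshold."""
    if field_type == "diffuse":
        return "mode_C"
    if num_sources < source_threshold:
        return "mode_X1"   # e.g. HOA coding based on virtual speaker selection (MP)
    return "mode_X2"       # e.g. HOA coding based on directional audio coding

assert select_coding_mode("diffuse", 0) == "mode_C"
assert select_coding_mode("heterogeneous", 1) == "mode_X1"
assert select_coding_mode("heterogeneous", 4) == "mode_X2"
```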
- the second preset condition includes that the number of heterogeneous sound sources is greater than the first threshold and less than the second threshold, wherein the second threshold is greater than the first threshold;
- the second preset condition includes that the number of heterogeneous sound sources is not greater than a first threshold or not less than a second threshold, wherein the second threshold is greater than the first threshold.
- the second threshold is greater than the first threshold, so the first threshold and the second threshold can form a preset range; the second preset condition can then be that the number of heterogeneous sound sources is within the preset range, or that the number of heterogeneous sound sources is outside the preset range.
- using the first threshold and the second threshold in the above second preset condition, the number of heterogeneous sound sources can be evaluated to determine whether it satisfies the second preset condition, so that it can be accurately identified whether the sound field type of the current frame belongs to the first sound field type or the second sound field type.
- the first preset condition is a condition set for identifying different sound field types
- the second preset condition is a condition set for identifying different encoding modes
- the first preset condition and the second preset condition may include the same condition content or different condition content. That is, the first preset condition and the second preset condition may be different preset conditions, or they may be the same preset condition. However, considering that there may be differences in actual use, the terms "first" and "second" are used to distinguish them.
- step E2 determines the encoding mode corresponding to the current frame according to the sound field type, including:
- the encoding mode is the HOA encoding mode based on virtual speaker selection
- the coding mode is an HOA coding mode based on directional audio coding.
- for a sound field with fewer heterogeneous sound sources, the compression efficiency of the HOA coding mode based on directional audio coding is not as good as that of the HOA coding mode based on virtual speaker selection.
- for a diffuse sound field, the compression efficiency of the HOA coding mode based on virtual speaker selection is not as good as that of the HOA coding mode based on directional audio coding.
- therefore, when the sound field type is a heterogeneous sound field, the encoding mode is determined to be the HOA encoding mode based on virtual speaker selection, and when the sound field type is a diffuse sound field, the encoding mode is determined to be the HOA encoding mode based on directional audio coding,
- a corresponding encoding mode may be selected according to the sound field classification result of the current frame, so as to meet the requirement of obtaining maximum compression efficiency for different types of audio signals.
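The mode decision described in steps E1 to E3 can be sketched as a small selector. This is an illustrative sketch only: the string labels, the threshold value, and the tie-breaking choices are assumptions, not values fixed by this description.

```python
def select_encoding_mode(field_type, num_sources, threshold=3):
    """Choose an HOA encoding mode from the sound field classification
    result (hypothetical labels; threshold is an assumed decision value)."""
    if field_type == "diffuse":
        # diffuse sound fields favour directional-audio-coding (DirAC) style coding
        return "HOA_DIRAC"
    # heterogeneous sound field: few distinct sources favour virtual speaker selection
    if num_sources < threshold:
        return "HOA_VIRTUAL_SPEAKER"
    return "HOA_DIRAC"
```

A call such as `select_encoding_mode("heterogeneous", 2)` would pick the virtual-speaker-selection mode, matching the principle that each signal type is routed to the encoder with the better compression efficiency.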
- the aforementioned step 503 determines the encoding mode corresponding to the current frame according to the sound field classification result, including:
- the sliding window includes: the initial encoding mode of the current frame, and the encoding modes of the N-1 frames before the current frame, where N is the length of the sliding window;
- F3. Determine the coding mode of the current frame according to the initial coding mode of the current frame and the coding modes of the N-1 frames.
- the initial coding mode may be the coding mode determined according to the sound field classification result; for example, the coding mode of the current frame may be determined according to any one of the aforementioned steps E1 to E3, and that coding mode may be used as the initial coding mode in F1.
- after the initial encoding mode is obtained, the sliding window is populated according to the current frame and the window size of the sliding window. The sliding window includes the initial encoding mode of the current frame and the encoding modes of the N-1 frames before the current frame, where N is the number of frames included in the sliding window.
- the encoding mode of the current frame is determined according to the encoding modes corresponding to the N frames in the sliding window, and the encoding mode of the current frame obtained in step F3 may be the encoding mode used when encoding the current frame.
- the initial encoding mode of the current frame is corrected through a sliding window to obtain the encoding mode of the current frame, so as to ensure that the encoding modes between consecutive frames do not switch frequently, and improve encoding efficiency.
- a processing method may be to store encoder selection identifiers with a length of N frames in the sliding window, where the N frames include the encoder selection identifiers of the current frame and the previous N-1 frames; when the encoder selection identifiers accumulate to a specified threshold, the coding type indicator of the current frame is updated.
- other post-processing may also be used to correct the current frame.
- the initial coding mode is used as the initial classification
- the initial classification is modified according to the speech classification result of the audio signal, the signal-to-noise ratio and other characteristics, and the modified result is used as the final result of the coding mode.
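The hangover correction in F1 to F3 can be sketched as follows. The window length, the vote threshold, and the fallback to the previous frame's mode are assumptions made for illustration; the description only requires that switching not occur until enough identifiers accumulate.

```python
from collections import deque

def hangover_decide(window, initial_mode, n=8, threshold=5):
    """Correct the current frame's initial coding mode using a sliding
    window of recent decisions (sketch; n and threshold are assumed values)."""
    window.append(initial_mode)      # window now holds up to N entries
    while len(window) > n:
        window.popleft()             # keep only the last N frames
    votes = sum(1 for m in window if m == initial_mode)
    if votes >= threshold:
        return initial_mode          # enough accumulated agreement: accept
    # otherwise keep the previous frame's mode to avoid frequent switching
    return window[-2] if len(window) > 1 else initial_mode
```

With a window of seven frames coded in mode "A", a single frame classified as "B" would not flip the mode, which is exactly the stability the hangover process is meant to provide.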
- a method for processing a three-dimensional audio signal mainly includes the following:
- steps 601 to 603 are similar to the implementation manners of the steps 401 to 403 in the foregoing embodiments, and the detailed description of steps 601 to 603 will not be given here.
- the encoding end may perform the aforementioned steps 601 to 603. After the encoding end acquires the sound field classification result of the current frame, the encoding end may determine the encoding parameters corresponding to the current frame according to the sound field classification result.
- the encoding parameters refer to the parameters used when encoding the current frame of the three-dimensional audio signal. There are various encoding parameters, and different encoding parameters may be adopted according to the sound field classification results of the current frame. In the embodiment of the present application, appropriate encoding parameters are selected for different sound field classification results of the current frame, so as to use the encoding parameters to encode the current frame, thereby improving the compression efficiency and auditory quality of the audio signal.
- the encoding parameters include at least one of the following: the number of channels of the virtual speaker signal, the number of channels of the residual signal, the number of encoded bits of the virtual speaker signal, the number of encoded bits of the residual signal, or the number of voting rounds for the best matching speaker search;
- the virtual loudspeaker signal and the residual signal are signals generated according to the three-dimensional audio signal.
- the encoding end can determine the encoding parameters of the current frame according to the sound field classification result of the current frame, so that the encoding parameters can be used to encode the current frame.
- the coding parameters include at least one of the following: the number of channels of the virtual speaker signal, the number of channels of the residual signal, the number of coding bits of the virtual speaker signal, the number of coding bits of the residual signal, or the number of voting rounds for the best matching speaker search.
- the number of channels may also be referred to as the number of transmission channels, the number of channels is the number of transmission channels allocated during signal encoding, and the number of encoded bits is the number of encoded bits allocated during signal encoding.
- a method for selecting a virtual speaker uses the virtual speaker coefficients of the current frame to vote for each virtual speaker in the candidate virtual speaker set, and selects the virtual speaker of the current frame according to the voting values, thereby reducing the computational complexity of the virtual speaker search and the computational burden of the encoder.
- the number of voting rounds for the best matching speaker search refers to the number of voting rounds that need to be performed when searching for the best matching speaker. In a possible implementation, the number of voting rounds can be preconfigured, or can be determined based on the sound field classification result of the current frame. For example, the number of voting rounds for the best matching speaker search is the number of voting rounds for the virtual speaker search in the process of determining the virtual speaker signal according to the three-dimensional audio signal.
- the virtual speaker signal and the residual signal in the embodiment of the present application are signals generated according to the three-dimensional audio signal.
- An example is as follows: select the first target virtual speaker from the preset virtual speaker set according to the first scene audio signal; generate a virtual speaker signal according to the first scene audio signal and the attribute information of the first target virtual speaker; obtain a second scene audio signal by using the attribute information of the first target virtual speaker and the first virtual speaker signal; and generate a residual signal according to the first scene audio signal and the second scene audio signal.
- the number of voting rounds satisfies the following relationship: I ≤ d, where:
- I is the number of voting rounds;
- d is the number of heterogeneous sound sources included in the sound field classification result.
- the encoding end determines the number of voting rounds for the best matching speaker search according to the number of heterogeneous sound sources in the current frame, and the number of voting rounds is less than or equal to the number of heterogeneous sound sources in the current frame, so that the number of voting rounds conforms to the actual sound field of the current frame.
- this conforms to the actual situation of the sound field classification of the current frame, and solves the problem of determining the number of voting rounds for the best matching speaker search when encoding the current frame.
- the number of voting rounds I should follow the following principles: the minimum number of voting rounds is 1; the maximum number of voting rounds cannot exceed the total number of speakers; and the maximum number of voting rounds cannot exceed the number of virtual speaker signal channels. For example, the total number of speakers may be 1024 speakers obtained by the virtual speaker set generation unit in the encoder, and the number of virtual speaker signal channels is the number of virtual speaker signals to be transmitted by the encoder, that is, the N transmission channels corresponding to the N best matching speakers; usually the number of virtual speaker signal channels is less than the total number of speakers.
- the method for estimating the number of voting rounds is as follows. According to the number of different sound sources in the sound field of the current frame obtained from the sound field classification result, the number of voting rounds I for searching for the best matching speaker is determined.
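The constraints above (at least 1 round, at most the number of heterogeneous sources, never more than the virtual speaker channel count or the total speaker count) can be combined into one estimate. This is a sketch of one way to satisfy those bounds, not the patent's exact formula.

```python
def voting_rounds(num_sources, num_speaker_channels, total_speakers=1024):
    """Estimate the number of voting rounds I for the best-matching
    speaker search from the sound field classification result (sketch)."""
    i = max(1, num_sources)                    # minimum of one voting round
    # cap by the virtual speaker signal channels and the total speaker count
    i = min(i, num_speaker_channels, total_speakers)
    return i
```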
- the sound field classification result includes the number of different sound sources and the type of sound field
- the number of channels of the virtual speaker signal satisfies the following relationship:
- F is the number of channels of the virtual speaker signal
- S is the number of heterogeneous sound sources
- PF is the number of virtual speaker signal channels preset by the encoder
- the number of channels of the virtual speaker signal satisfies the following relationship: F = 1, where
- F is the number of channels of the virtual speaker signal.
- the number of channels of the virtual speaker signal refers to the number of channels used to transmit the virtual speaker signal.
- the number of channels of the virtual speaker signal can be determined by the number of heterogeneous sound sources and the sound field type. In the above calculation method, when the sound field type is a diffuse sound field, the number of channels of the virtual speaker signal is determined to be 1, so that the coding efficiency of the current frame can be improved.
- min denotes the minimum-value operation, that is, the minimum of S and PF is taken as the number of channels of the virtual speaker signal, so that the number of channels of the virtual speaker signal conforms to the actual sound field classification of the current frame; this solves the problem of determining the number of channels of the virtual speaker signal when encoding the current frame.
- the number of channels of the residual signal satisfies the following relationship: R = max(C - 1, PR), where
- the PR is the number of residual signal channels preset by the encoder
- the C is the sum of the number of channels of the residual signal preset by the encoder and the number of virtual speaker signal channels preset by the encoder;
- the number of channels of the residual signal satisfies the following relationship: R = C - F, where
- the R represents the number of channels of the residual signal
- the C is the sum of the number of residual signal channels preset by the encoder and the number of virtual speaker signal channels preset by the encoder
- the F is the number of channels of the virtual speaker signal described above.
- the number of channels of the residual signal can be calculated according to the sum of the preset number of residual signal channels and the preset number of virtual speaker signal channels, and the preset number of residual signal channels. The value of PR can be preset by the encoder, and the value of R can be obtained through the above max(C-1, PR) calculation formula. The sum of the preset number of residual signal channels and the preset number of virtual speaker signal channels is preset at the encoding end.
- the above C may also be simply referred to as the total number of transmission channels.
- after the number of channels of the virtual speaker signal is obtained, the number of channels of the residual signal is calculated from the sum of the preset number of residual signal channels and the preset number of virtual speaker signal channels, together with the number of channels of the virtual speaker signal; this sum is preset by the encoding end.
- the above C may also be simply referred to as the total number of transmission channels.
- the sound field classification result includes the number of different sound sources
- the number of channels of the virtual loudspeaker signal satisfies the following relationship: F = min(S, PF), where
- F is the number of channels of the virtual speaker signal
- S is the number of heterogeneous sound sources
- PF is the number of channels of the virtual speaker signal preset by the encoder.
- the number of channels of the virtual speaker signal refers to the number of channels used to transmit the virtual speaker signal, and the number of channels of the virtual speaker signal can be determined by the number of different sound sources.
- min denotes the minimum-value operation, that is, the minimum of S and PF is taken as the number of channels of the virtual speaker signal, so that the number of channels of the virtual speaker signal conforms to the actual sound field classification of the current frame; this solves the problem of determining the number of channels of the virtual speaker signal when encoding the current frame.
- the number of channels of the residual signal satisfies the following relationship: R = C - F, where
- R represents the number of channels of the residual signal
- C is the sum of the number of channels of the residual signal preset by the encoder and the number of channels of the virtual speaker signal preset by the encoder
- F is the number of channels of the virtual speaker signal.
- C is the sum of the aforementioned PF and PR.
- the number of channels of the residual signal can be calculated according to the sum of the preset number of residual signal channels and the preset number of virtual speaker signal channels, together with the number of channels of the virtual speaker signal;
- the sum of the preset number of residual signal channels and the preset number of virtual speaker signal channels is preset by the encoding end.
- the above C may also be simply referred to as the total number of transmission channels.
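The channel-count relations above can be gathered into one sketch. The pairing of R = max(C - 1, PR) with the diffuse case (where F = 1) is an assumption consistent with the surrounding text; PF and PR are encoder presets and C = PF + PR is the total number of transmission channels.

```python
def transport_channels(num_sources, field_type, pf, pr):
    """Derive the virtual speaker channel count F and the residual
    channel count R from the sound field classification result
    (sketch; pf/pr are hypothetical encoder presets)."""
    c = pf + pr                      # total number of transmission channels
    if field_type == "diffuse":
        f = 1                        # one speaker channel for a diffuse field
        r = max(c - 1, pr)           # R = max(C - 1, PR)
    else:
        f = min(num_sources, pf)     # F = min(S, PF)
        r = c - f                    # R = C - F
    return f, r
```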
- the sound field classification result includes the number of different sound sources, or the sound field classification result includes the number of different sound sources and the type of sound field;
- the number of coded bits of the virtual loudspeaker signal is obtained by the ratio of the number of coded bits of the virtual loudspeaker signal to the number of coded bits of the transmission channel;
- the number of coded bits of the residual signal is obtained by the ratio of the number of coded bits of the virtual loudspeaker signal to the number of coded bits of the transmission channel;
- the number of encoded bits of the transmission channel includes the number of encoded bits of the virtual speaker signal and the number of encoded bits of the residual signal; when the number of heterogeneous sound sources is less than or equal to the number of channels of the virtual speaker signal, the ratio of the number of coding bits of the virtual speaker signal to the number of coding bits of the transmission channel is obtained by increasing the initial ratio of the number of coding bits of the virtual speaker signal to the number of coding bits of the transmission channel.
- the encoding end presets the initial ratio of the number of encoding bits of the virtual speaker signal to the number of encoding bits of the transmission channel. The encoding end obtains the number of heterogeneous sound sources and judges whether it is less than or equal to the number of channels of the virtual speaker signal; if so, the initial ratio is increased, and the increased ratio is taken as the ratio of the number of encoding bits of the virtual speaker signal to the number of encoding bits of the transmission channel. This ratio can then be used to calculate the number of encoding bits of the virtual speaker signal, and correspondingly the number of encoding bits of the residual signal, so that both conform to the actual sound field classification of the current frame; this solves the problem of determining the number of encoding bits of the virtual speaker signal and of the residual signal when encoding the current frame.
- the encoding end determines the bit allocation of the virtual speaker signal and the residual signal according to the sound field classification result, divides the transmission channel signals into a virtual speaker signal group and a residual signal group, and takes the preset ratio of the virtual speaker signal group as the initial ratio of the number of encoded bits of the virtual speaker signal to the number of encoded bits of the transmission channel.
- when the number of heterogeneous sound sources ≤ the number of channels of the virtual speaker signal, the initial ratio of the number of coded bits of the virtual speaker signal to the number of coded bits of the transmission channel is increased, and the increased ratio is taken as the ratio of the number of coded bits of the virtual speaker signal to the number of coded bits of the transmission channel.
- the increased ratio is equal to the sum of the preset adjustment value and the initial ratio.
- the ratio of the number of coding bits of the residual signal to the number of coding bits of the transmission channel = 1.0 - the ratio of the number of coding bits of the virtual speaker signal to the number of coding bits of the transmission channel.
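The bit-allocation rule above can be sketched as follows. The initial ratio of 0.6 and the adjustment of 0.1 are hypothetical values chosen for illustration; only the mechanism (raise the speaker-group ratio when the source count does not exceed F, give the residual group the complement) comes from the text.

```python
def split_transport_bits(total_bits, num_sources, f,
                         init_ratio=0.6, adjust=0.1):
    """Split the transmission-channel bit budget between the virtual
    speaker signal group and the residual signal group (sketch;
    init_ratio and adjust are assumed preset values)."""
    ratio = init_ratio
    if num_sources <= f:
        # increased ratio = preset adjustment value + initial ratio
        ratio = init_ratio + adjust
    speaker_bits = round(total_bits * ratio)
    residual_bits = total_bits - speaker_bits   # residual ratio = 1.0 - speaker ratio
    return speaker_bits, residual_bits
```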
- the method performed at the encoding end may also include the following steps:
- the sound field classification result can be encoded into the code stream, and the encoding end sends the code stream to the decoding end, so that the decoding end can obtain the sound field classification result from the code stream. By parsing the code stream, the decoding end obtains the sound field classification result carried in it; through the sound field classification result, the decoding end can obtain the sound field distribution of the current frame, so that the current frame can be decoded to obtain a three-dimensional audio signal.
- encoding the current frame and the sound field classification result may specifically include directly encoding the current frame, or first processing the current frame and, after obtaining the virtual speaker signal and the residual signal, encoding the virtual speaker signal and the residual signal.
- the encoding end may specifically be a core encoder, and the core encoder encodes the virtual speaker signal, residual signal, and sound field classification results to obtain a code stream.
- the code stream may also be referred to as an audio signal coded code stream.
- the processing method of the three-dimensional audio signal provided by the embodiment of the present application may include an audio encoding method and an audio decoding method, wherein the audio encoding method is performed by an audio encoding device, the audio decoding method is performed by an audio decoding device, and the audio encoding device and the audio decoding device can communicate with each other.
- the aforementioned Figures 4 to 6 are executed by the audio encoding device.
- the processing method of the three-dimensional audio signal performed by the audio decoding device (hereinafter referred to as the decoding end), as shown in FIG. 7, mainly includes the following steps:
- the decoding end receives the code stream from the encoding end.
- the code stream carries the sound field classification result.
- the decoding end parses the code stream, and obtains the sound field classification result of the current frame from the code stream, and the sound field classification result is obtained by the encoding end according to the above-mentioned embodiments shown in FIG. 4 to FIG. 6 .
- the decoding end uses the sound field classification result to parse the code stream to obtain the decoded 3D audio signal of the current frame.
- the decoding process of the current frame is not limited.
- the decoding end can decode the current frame by means of the sound field classification result, so the decoding end uses a decoding method that matches the sound field of the current frame. In this way the three-dimensional audio signal sent by the encoding end is obtained, and the transmission of the audio signal from the encoding end to the decoding end is realized.
- the decoding end can determine the decoding mode and/or decoding parameters consistent with the encoding end according to the sound field classification result transmitted in the code stream, which reduces the number of encoded bits.
- obtaining the decoded 3D audio signal of the current frame according to the sound field classification result in step 703 includes:
- the decoding mode corresponds to the encoding mode in the aforementioned embodiment, and the implementation of step G1 is similar to that of step 504 in the aforementioned embodiment, and will not be repeated here.
- the decoding end can decode the code stream according to the decoding mode to obtain the decoded 3D audio signal of the current frame.
- step G1 determines the decoding mode of the current frame according to the sound field classification result, including:
- the sound field classification result includes the number of different sound sources, or the sound field classification result includes the number of different sound sources and the type of sound field, determine the decoding mode of the current frame according to the number of different sound sources;
- the sound field classification result includes the sound field type, or the sound field classification result includes the number of different sound sources and the sound field type, determine the decoding mode of the current frame according to the sound field type;
- the sound field classification result includes the number of different sound sources and the type of sound field
- determining the decoding mode corresponding to the current frame according to the number of heterogeneous sound sources includes:
- the decoding mode is the first decoding mode
- the decoding mode is a second decoding mode
- the first decoding mode is the HOA decoding mode based on virtual speaker selection or the HOA decoding mode based on directional audio coding
- the second decoding mode is the HOA decoding mode based on virtual speaker selection or the HOA decoding mode based on directional audio coding
- the first decoding mode and the second decoding mode are different decoding modes.
- the preset condition is a condition set by the decoding end to identify different decoding modes, and the implementation of the preset condition is not limited.
- the preset condition includes that the number of heterogeneous sound sources is greater than a first threshold and less than a second threshold, wherein the second threshold is greater than the first threshold;
- the preset condition includes that the number of dissimilar sound sources is not greater than a first threshold or not less than a second threshold, wherein the second threshold is greater than the first threshold.
- step 703 obtains the decoded 3D audio signal of the current frame according to the sound field classification result, including:
- the decoding parameters correspond to the encoding parameters in the foregoing embodiments, and the implementation of step H1 is similar to that of step 604 in the foregoing embodiments, and details are not repeated here.
- the decoding end may decode the code stream according to the decoding parameter, so as to obtain the decoded 3D audio signal of the current frame.
- the decoding parameters include at least one of the following: the number of channels of the virtual speaker signal, the number of channels of the residual signal, the number of decoding bits of the virtual speaker signal, or the number of decoding bits of the residual signal;
- the virtual loudspeaker signal and the residual signal are obtained by decoding the code stream.
- the sound field classification result includes the number of different sound sources and the type of sound field
- the number of channels of the virtual loudspeaker signal satisfies the following relationship: F = min(S, PF), where
- the F is the number of channels of the virtual speaker signal
- the S is the number of the heterogeneous sound sources
- the PF is the number of channels of the virtual speaker signal preset by the decoder
- the number of channels of the virtual loudspeaker signal satisfies the following relationship: F = 1, where
- the F is the number of channels of the virtual speaker signal.
- the number of channels of the residual signal satisfies the following relationship: R = max(C - 1, PR), where
- the PR is the number of residual signal channels preset by the decoder
- the C is the sum of the number of channels of the residual signal preset by the decoder and the number of virtual speaker signal channels preset by the decoder
- the number of channels of the residual signal satisfies the following relationship: R = C - F, where
- the R represents the channel number of the residual signal
- the C is the sum of the residual signal channel number preset by the decoder and the virtual speaker signal channel number preset by the decoder
- the F is the number of channels of the virtual speaker signal described above.
- the number of virtual speaker signal channels preset by the decoder is equal to the number of virtual speaker signal channels preset by the encoder; similarly, the number of residual signal channels preset by the decoder is equal to the number of residual signal channels preset by the encoder.
- the sound field classification result includes the number of different sound sources
- the number of channels of the virtual loudspeaker signal satisfies the following relationship: F = min(S, PF), where
- the F is the number of channels of the virtual speaker signal
- the S is the number of the heterogeneous sound sources
- the PF is the number of channels of the virtual speaker signal preset by the decoder.
- the number of channels of the residual signal satisfies the following relationship: R = C - F, where
- the R represents the number of channels of the residual signal
- the C is the sum of the number of channels of the residual signal preset by the decoder and the number of channels of the virtual speaker signal preset by the decoder
- the F is the number of channels of the virtual speaker signal described above.
- the sound field classification result includes the number of different sound sources, or the sound field classification result includes the number of different sound sources and the type of sound field;
- the decoding bit number of the virtual loudspeaker signal is obtained by the ratio of the decoding bit number of the virtual loudspeaker signal to the decoding bit number of the transmission channel;
- the number of decoding bits of the residual signal is obtained by the ratio of the number of decoding bits of the virtual speaker signal to the number of decoding bits of the transmission channel;
- the number of decoding bits of the transmission channel includes the number of decoding bits of the virtual speaker signal and the number of decoding bits of the residual signal, and when the number of dissimilar sound sources is less than or equal to the number of channels of the virtual speaker signal,
- the ratio of the number of decoded bits of the virtual speaker signal to the number of decoded bits of the transmission channel is obtained by increasing the initial ratio of the number of decoded bits of the virtual speaker signal to the number of decoded bits of the transmission channel.
- the following takes the case where the three-dimensional audio signal is an HOA signal as an example.
- the sound field classification method of the HOA signal in the embodiment of the present application is applied to a hybrid HOA encoder.
- the basic process of encoding is shown in Figure 8.
- the HOA signal is classified to determine whether the HOA signal to be encoded in the current frame is suitable for the HOA encoding scheme based on virtual speaker selection or for the HOA encoding scheme based on directional audio coding (DirAC), and the HOA encoding mode of the current frame is determined according to the sound field classification result.
- the HOA encoder includes an encoder selection unit, through which the sound field classification of the HOA signal to be encoded is performed and the encoding mode of the current frame is determined; according to the encoding mode, encoder A or encoder B is selected for encoding to obtain the final encoded stream.
- encoder A and encoder B represent different types of encoders, and each encoder is adapted to a sound field type of the current frame. When encoding is performed using an encoder adapted to the sound field type, it can Increase the compression ratio of the signal.
- the specific process of classifying the sound field of the HOA signal to be encoded and determining the encoding mode includes:
- the sound field classification is performed on the HOA signal to be coded to obtain the sound field classification result.
- the encoding mode of the current frame is determined.
- the encoding mode of the current frame is used to indicate the selection mode of the encoder of the current frame.
- the criterion for determining the coder selection flag may be determined according to the sound field type of the HOA signal applicable to coder A and coder B.
- the signal type processed by encoder A is an HOA signal with a heterogeneous sound field in which the number of distinct sound sources is less than 3;
- the signal type processed by encoder B is an HOA signal with a diffuse sound field, or an HOA signal in which the number of distinct sound sources is greater than or equal to 3.
- a sliding window (hangover) process can also be performed on the sound field classification result to ensure that the coding modes between consecutive frames do not switch frequently.
- a processing method may be to store encoder selection identifiers with a length of N frames in the sliding window, where the N frames include the encoder selection identifiers of the current frame and the previous N-1 frames; when the encoder selection identifiers accumulate to a specified threshold, the encoding type identifier of the current frame is updated.
- other processing may also be used to correct the sound field classification result.
- the process of determining the encoding mode of the HOA signal mainly includes:
- downsampling the HOA signal to be analyzed is an optional step.
- the HOA signal to be analyzed may be a time-domain HOA signal or a frequency-domain HOA signal.
- the HOA signal to be analyzed may include all channels, or the HOA signal to be analyzed may also include some HOA channels (such as FOA channels).
- the HOA signal to be analyzed may be all samples, or 1/Q downsampling points, for example, 1/120 downsampling points are used in this embodiment.
- the order of the HOA signal in the current frame is 3, the number of channels of the HOA signal is 16, and the frame length of the current frame is 20 milliseconds (ms); that is, each channel of the current frame to be encoded contains 960 samples.
- after 1/120 downsampling, the number of sampling points contained in each channel signal to be analyzed is 8. That is, the HOA signal to be analyzed has 16 channels in total, each channel having 8 samples, which constitute the input signal for sound field type analysis, that is, the HOA signal to be analyzed.
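The construction of the analysis input described above can be sketched as follows; the function name and the use of NumPy slicing are illustrative assumptions, not part of the embodiment.

```python
import numpy as np

def build_analysis_input(hoa_frame: np.ndarray, q: int = 120) -> np.ndarray:
    """Take every q-th sample of each channel to form the analysis input.

    hoa_frame: (channels, samples) time-domain HOA frame, e.g. (16, 960)
    for a 3rd-order HOA signal with a 20 ms frame, as in the embodiment.
    """
    return hoa_frame[:, ::q]

frame = np.random.randn(16, 960)           # 3rd-order HOA: 16 channels x 960 samples
analysis_in = build_analysis_input(frame)  # 1/120 downsampling -> 8 samples/channel
```

With `q = 120`, the 16-by-960 frame is reduced to the 16-by-8 analysis input described in the text.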
- the sound field type is obtained by analyzing the number of heterogeneous sound sources in the HOA signal.
- the sound field type analysis in this embodiment of the present application may be to linearly decompose the HOA signal, obtain a linear decomposition result through the linear decomposition, and then obtain a sound field classification result through the linear decomposition result.
- the number of different sound sources can be obtained according to the linear decomposition result.
- the linear decomposition result may include eigenvalues, and the number of dissimilar sound sources is estimated by the ratio between eigenvalues, specifically including:
- L is equal to the number of channels of the HOA signal, and K is the number of signal points of each channel of the current frame; for example, the number of signal points may be the number of frequency points.
- the judgment threshold of dissimilar sound sources is 100, and the number n of dissimilar sound sources can be estimated in the following way:
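The estimation of the number n of dissimilar sound sources from eigenvalue ratios can be sketched as follows. The use of singular value decomposition of the L-by-K analysis matrix to obtain the eigenvalues, and the function name, are assumptions for illustration; only the ratio test against the threshold of 100 comes from the text.

```python
import numpy as np

def estimate_num_sources(x: np.ndarray, thresh: float = 100.0) -> int:
    """Estimate the number of distinct sound sources from eigenvalue ratios.

    x: (L, K) analysis input (L = HOA channels, K = signal points per channel).
    The eigenvalues are taken as the squared singular values of x; n is the
    position of the first consecutive-eigenvalue ratio that reaches the
    judgment threshold (100 in the embodiment).
    """
    s = np.linalg.svd(x, compute_uv=False)  # singular values, descending
    eig = s ** 2                            # eigenvalues of x @ x.T
    for i in range(len(eig) - 1):
        if eig[i + 1] <= 0 or eig[i] / eig[i + 1] >= thresh:
            return i + 1                    # first large drop -> i+1 dominant sources
    return len(eig)                         # no clear drop: diffuse-like field
```

For a frame dominated by one direction the first ratio is large and n = 1; when no ratio reaches the threshold, no dominant source stands out.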
- the expected encoding mode is encoding mode 1;
- the coding mode is expected to be coding mode 2.
- encoding mode 1 may be an HOA encoding scheme based on virtual speaker selection.
- Coding mode 2 may be an HOA coding scheme based on directional audio DirAC.
- next, the actual encoding mode is determined, for example, by using a sliding window. When the number of frames in the sliding window whose expected encoding mode is encoding mode 2 accumulates to a specified threshold, the actual encoding mode of the current frame adopts encoding mode 2; otherwise, the actual encoding mode of the current frame adopts encoding mode 1.
- for example, when the frames whose expected encoding mode is encoding mode 2 accumulate to 7 frames in the sliding window, the actual encoding mode of the current frame is determined as encoding mode 2.
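The sliding-window (hangover) decision above can be sketched as follows. The threshold of 7 frames comes from the text; the window length of 10 frames and the class interface are assumed example values.

```python
from collections import deque

class ModeHangover:
    """Sliding-window (hangover) smoothing of the expected encoding mode.

    Keeps the expected modes of the last n frames; the actual mode of the
    current frame is mode 2 only when mode-2 frames in the window reach the
    threshold (7 in the embodiment), otherwise mode 1. The window length
    n = 10 is an assumed value, not given in the text.
    """
    def __init__(self, n: int = 10, threshold: int = 7):
        self.win = deque(maxlen=n)          # expected modes of the last n frames
        self.threshold = threshold

    def decide(self, expected_mode: int) -> int:
        self.win.append(expected_mode)
        return 2 if list(self.win).count(2) >= self.threshold else 1
```

This prevents frequent switching between coding modes across consecutive frames, as required by the hangover processing.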
- the basic decoding process of a hybrid HOA decoder corresponding to the encoding end is shown in Figure 10: the decoding end obtains the code stream from the encoding end, and then analyzes the HOA decoding mode of the current frame according to the code stream. According to the HOA decoding mode of the current frame, a corresponding decoding scheme is selected for decoding to obtain a reconstructed HOA signal.
- the decoder selection unit may be included in the decoding end, and the code stream is analyzed by the decoder selection unit to determine the decoding mode; according to the decoding mode, decoder A or decoder B is selected for decoding to obtain the reconstructed HOA signal.
- decoder A and decoder B represent different types of decoders, and each decoder is adapted to a sound field type of the current frame.
- the sound field classification result of the HOA signal to be coded and the coding mode determined according to the sound field classification result can match the signal types suitable for different coding modes, so that different types of signals can obtain the maximum compression efficiency.
- the encoding end may include: a virtual speaker configuration unit, a code analysis unit, a virtual speaker set generation unit, a virtual speaker selection unit, a virtual speaker signal generation unit, a core encoder processing unit, a signal reconstruction unit, a residual signal generation unit, a selection unit, and a signal compensation unit.
- the virtual speaker configuration unit is configured to configure the virtual speakers in the virtual speaker set to obtain multiple virtual speakers.
- the virtual speaker configuration unit outputs virtual speaker configuration parameters according to the encoder configuration information.
- Encoder configuration information includes but is not limited to: the HOA order, the encoding bit rate, user-defined information, etc.
- Virtual speaker configuration parameters include but are not limited to: the number of virtual speakers, the HOA order of the virtual speakers, the position coordinates of the virtual speakers, etc.
- the virtual speaker configuration parameters output by the virtual speaker configuration unit are used as the input of the virtual speaker set generation unit.
- the encoding analysis unit is used for encoding analysis of the HOA signal to be encoded, such as analyzing the sound field distribution of the HOA signal to be encoded, including the number of sound sources, directionality, and dispersion of the HOA signal to be encoded, as one of the judgment conditions for deciding how to select the target virtual speaker.
- the encoding analysis unit may not be included in the encoding end, that is, the encoding end may not analyze the input signal, and a default configuration is used to determine how to select the target virtual speaker.
- the encoder obtains the HOA signal to be encoded; for example, an HOA signal recorded by an actual acquisition device or an HOA signal synthesized from artificial audio objects can be used as the input of the encoder, and the HOA signal to be encoded input to the encoder can be a time-domain HOA signal or a frequency-domain HOA signal.
- the virtual speaker set generating unit is configured to generate a virtual speaker set, which may include a plurality of virtual speakers; the virtual speakers in the virtual speaker set may also be referred to as "candidate virtual speakers".
- the virtual loudspeaker set generating unit generates specified candidate virtual loudspeaker HOA coefficients according to virtual loudspeaker configuration parameters. Generating the candidate virtual speaker HOA coefficients requires the coordinates of the candidate virtual speaker (i.e. position coordinates or position information) and the HOA order of the candidate virtual speaker.
- the methods for determining the coordinates of the candidate virtual speakers include but are not limited to: generating K virtual speakers according to an equidistance rule, or generating K non-uniformly distributed candidate virtual speakers according to the principle of auditory perception; a method for generating a fixed number of uniformly distributed virtual speakers is exemplified below.
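One common way to place a fixed number of roughly uniformly distributed points on a sphere is a Fibonacci spiral; the text does not specify the exact construction, so the following is only an illustrative sketch of such a generator.

```python
import math

def uniform_speaker_positions(k: int):
    """Generate k roughly uniformly distributed virtual speaker positions.

    A Fibonacci-spiral sketch on the unit sphere; the embodiment only
    states that a fixed number of uniformly distributed speakers is
    generated, not this particular construction.
    Returns a list of (azimuth, elevation) pairs in radians.
    """
    golden = math.pi * (3.0 - math.sqrt(5.0))  # golden angle increment
    positions = []
    for i in range(k):
        z = 1.0 - 2.0 * (i + 0.5) / k          # heights uniform in [-1, 1]
        elevation = math.asin(z)
        azimuth = (i * golden) % (2.0 * math.pi)
        positions.append((azimuth, elevation))
    return positions
```

For example, `uniform_speaker_positions(1024)` yields the kind of fixed 1024-speaker set mentioned later for the voting-based search.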
- the HOA coefficients of candidate virtual speakers output by the virtual speaker set generation unit are used as the input of the virtual speaker selection unit.
- a virtual speaker selection unit, configured to select a target virtual speaker from a plurality of candidate virtual speakers in the virtual speaker set according to the HOA signal to be encoded; the target virtual speaker may be called a "virtual speaker that matches the HOA signal to be encoded", or a matching virtual speaker for short.
- the virtual speaker selection unit matches the HOA signal to be encoded with the candidate virtual speaker HOA coefficients output by the virtual speaker set generation unit, and selects a specified matching virtual speaker.
- the sound field classification is performed on the HOA signal to be coded, the sound field classification result is obtained, and the coding parameters are determined according to the sound field classification result.
- the coding analysis unit performs coding analysis according to the HOA signal to be coded, and the analysis may include: performing sound field classification according to the HOA signal to be coded.
- the sound field classification method is detailed in the foregoing embodiments and will not be repeated here.
- the encoding parameters are determined.
- the encoding parameters may include at least one of the number of channels of the virtual speaker signal, the number of channels of the residual signal, and the number of voting rounds for the best matching speaker search in the HOA encoding scheme based on virtual speaker selection.
- the virtual speaker selection unit matches the HOA coefficients to be encoded with the candidate virtual speaker HOA coefficients output by the virtual speaker set generation unit according to the determined number of voting rounds for the best matching speaker search and the number of channels of the virtual speaker signal, selects the best matching virtual speakers, and obtains the HOA coefficients of the matching virtual speakers.
- the number of best matching virtual speakers is equal to the number of channels of the virtual speaker signal.
- the virtual speaker selection unit adopts a voting-based best matching speaker search method to match the HOA coefficients to be encoded with the candidate virtual speaker HOA coefficients output by the virtual speaker set generation unit and select the best matching virtual speakers; the number of voting rounds I for the best matching speaker search can be determined according to the sound field classification result.
- the number of voting rounds I should follow the following principles.
- the minimum number of voting rounds is 1, and the maximum value cannot exceed the total number of speakers (for example, the 1024 speakers obtained by the virtual speaker set generation unit) or the number of virtual speaker signal channels (the number of channels to be transmitted by the encoder, i.e., the N transmission channels corresponding to the N best matching speakers); usually, the number of virtual speaker signal channels is less than the total number of speakers.
- the number of voting rounds is estimated as follows:
- the number I of voting rounds for speaker selection is determined.
- the number of channels of the virtual loudspeaker signal and the number of channels of the residual signal are determined according to the sound field type.
- the embodiment of the present application provides a method for selecting the number of channels F of an adaptive virtual speaker signal:
- F = min(S, PF), where S is the number of heterogeneous sound sources in the sound field, and PF is the number of virtual speaker signal channels preset by the encoder.
- the embodiment of the present application provides a method for selecting the channel number R of the adaptive residual signal:
- R = max(C - F, PR), where C is the preset total number of transmission channels, PR is the preset number of residual signal channels of the encoder, and F is the number of channels of the virtual speaker signal.
- C is the sum of PF and PR.
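The adaptive selection of F and R can be sketched as follows. F = min(S, PF) follows the text directly; the residual-channel rule R = max(C - F, PR) with C = PF + PR is a reconstruction of the garbled formula and should be treated as an assumption.

```python
def adaptive_channel_counts(s: int, pf: int, pr: int):
    """Select the number of virtual speaker channels F and residual channels R.

    s:  number of heterogeneous sound sources in the sound field
    pf: preset number of virtual speaker signal channels
    pr: preset number of residual signal channels
    The R formula is an assumed reconstruction, not verbatim from the text.
    """
    c = pf + pr            # preset total number of transmission channels
    f = min(s, pf)         # no more speaker channels than distinct sources
    r = max(c - f, pr)     # remaining transmission channels carry residual
    return f, r
```

For example, with 2 sources, PF = 4, and PR = 2, this yields F = 2 and R = 4, so the total of 6 transmission channels is preserved.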
- the energy of the residual signal is low at this time, so more bits can be allocated to the channels of the virtual speaker signal.
- the virtual speaker signal and the residual signal are divided into two groups, that is, the virtual speaker signal group and the residual signal group.
- when the number of heterogeneous sound sources ≤ the number of channels of the virtual speaker signal, the preset allocation ratio of the virtual speaker signal group is increased by a preset value, and the increased allocation ratio is used as the allocation ratio of the virtual speaker signal group.
- allocation ratio of the residual signal group = 1.0 - allocation ratio of the virtual loudspeaker signal group.
- a virtual loudspeaker signal generation unit calculates a virtual loudspeaker signal by using the HOA coefficients to be encoded and the matching virtual loudspeaker HOA coefficients.
- Signal reconstruction unit: reconstructs the HOA signal by using the virtual loudspeaker signal and the matching virtual loudspeaker HOA coefficients.
- Residual signal generating unit: according to the number of channels of the residual signal determined in step 1, the residual signal is calculated from the HOA coefficients to be encoded and the reconstructed HOA signal output by the HOA signal reconstruction unit.
- the selection unit pre-allocates all available bits to the virtual speaker signal and the residual signal to be transmitted, and the obtained bit pre-allocation information is used to guide the core encoder to process.
- Core encoder processing unit: performs core encoder processing on the transmission channel and outputs the transmission code stream.
- the transmission channel includes a virtual speaker signal channel and a residual signal channel.
- the encoding parameters are determined.
- the encoding parameters may also include at least one of bit allocation of the virtual speaker signal and bit allocation of the residual signal in the HOA encoding scheme selected based on the virtual speaker. If the sound field classification result is used to determine the bit allocation of the virtual speaker signal and the bit allocation of the residual signal, it is necessary to determine the bit allocation of the virtual speaker signal and the residual signal according to the sound field classification result.
- the bit allocation method for determining the virtual speaker signal and the residual signal according to the sound field classification result is as follows: assume that the number of channels of the virtual speaker signal is F, the number of channels of the residual signal is R, and the total number of bits available for encoding the virtual speaker signal and the residual signal is numbit.
- One way is to first determine the total number of bits encoded by the virtual loudspeaker signal and the total number of bits encoded by the residual signal, and then determine the number of encoded bits of each channel.
- the total number of bits for encoding the virtual loudspeaker signal is: core_numbit = numbit × fac1/(fac1 + fac2), where
- fac1 is the weighting factor assigned to the coding bits of the virtual loudspeaker signal, and
- fac2 is the weighting factor assigned to the coding bits of the residual signal.
- the total number of bits for coding the residual signal is res_numbit = numbit - core_numbit.
- the coded bits of each channel of the virtual speaker signal are allocated according to the bit allocation criterion of the virtual speaker signal, and the coded bits of each channel of the residual signal are allocated according to the bit allocation criterion of the residual signal.
- Another way is to first determine the total number of bits for encoding the residual signal: res_numbit = numbit × fac2/(fac1 + fac2), where
- fac1 is the weighting factor assigned to the coding bits of the virtual loudspeaker signal, and
- fac2 is the weighting factor assigned to the coding bits of the residual signal.
- the total number of bits for coding the virtual loudspeaker signal is core_numbit = numbit - res_numbit.
- the coded bits of each channel of the virtual speaker signal are allocated according to the bit allocation criterion of the virtual speaker signal, and the coded bits of each channel of the residual signal are allocated according to the bit allocation criterion of the residual signal.
- the number of encoded bits per channel can also be determined directly.
- the number of bits for encoding each channel of the virtual speaker signal is: numbit × fac1/(F × fac1 + R × fac2);
- the number of bits for encoding each channel of the residual signal is: numbit × fac2/(F × fac1 + R × fac2).
- the final bit allocation result used to encode the virtual loudspeaker signal and the residual signal may be determined after adjusting the bit allocation result obtained by the above method.
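The first bit-splitting method can be sketched as follows. The core_numbit expression numbit × fac1/(fac1 + fac2) is a reconstruction consistent with res_numbit = numbit - core_numbit, so treat the exact formula, the function name, and the integer per-channel division as assumptions.

```python
def split_bits(numbit: int, f: int, r: int, fac1: float, fac2: float):
    """Split the total bit budget between virtual speaker and residual signals.

    numbit: total bits for both signals; f, r: channel counts;
    fac1, fac2: weighting factors for speaker and residual coding bits.
    core_numbit's formula is an assumed reconstruction; res_numbit takes
    the remainder, and per-channel bits divide each group's total evenly.
    """
    core_numbit = int(numbit * fac1 / (fac1 + fac2))  # virtual speaker total
    res_numbit = numbit - core_numbit                  # residual signal total
    per_spk = core_numbit // f if f else 0             # bits per speaker channel
    per_res = res_numbit // r if r else 0              # bits per residual channel
    return core_numbit, res_numbit, per_spk, per_res
```

For example, with numbit = 1000, F = 4, R = 2, fac1 = 3.0, fac2 = 1.0, the speaker group receives 750 bits and the residual group 250, before any final adjustment.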
- after obtaining the bit allocation result for encoding the virtual speaker signal and the residual signal, the core encoder processing unit encodes the virtual speaker signal and the residual signal according to the bit allocation result.
- the encoding parameters include at least one of: the number of channels of the virtual speaker signal, the number of channels of the residual signal, the bit allocation of the virtual speaker signal, the bit allocation of the residual signal, and the number of voting rounds for the best matching speaker search in the HOA encoding scheme based on virtual speaker selection.
- for details of the encoding parameters, reference may be made to the foregoing content, which will not be repeated here.
- the embodiment of the present application classifies the sound field of the HOA signal to be encoded, so as to select an appropriate encoding mode and/or encoding parameter according to the different characteristics of the HOA signal to be encoded and encode the HOA signal accordingly, thereby improving compression efficiency and auditory quality.
- a three-dimensional audio signal processing device provided by the embodiment of the present application, for example, the three-dimensional audio signal processing device is specifically an audio coding device 1200, which may include: a linear analysis module 1201, a parameter generation module 1202 and sound field classification module 1203, wherein,
- a linear analysis module configured to linearly decompose the three-dimensional audio signal to obtain a linear decomposition result
- a parameter generation module configured to obtain sound field classification parameters corresponding to the current frame according to the linear decomposition result
- a sound field classification module configured to determine the sound field classification result of the current frame according to the sound field classification parameters.
- the three-dimensional audio signal includes: a high-order ambisonics HOA signal, or a first-order ambisonics FOA signal.
- the linear analysis module is configured to: perform singular value decomposition on the current frame to obtain the singular values corresponding to the current frame, wherein the linear decomposition result includes the singular values; or, perform principal component analysis on the current frame to obtain the first eigenvalue corresponding to the current frame, wherein the linear decomposition result includes the first eigenvalue; or, perform independent component analysis on the current frame to obtain the second eigenvalue corresponding to the current frame, wherein the linear decomposition result includes the second eigenvalue.
- the parameter generating module is configured to obtain the ratio of the i-th linear analysis result of the current frame to the i+1-th linear analysis result of the current frame, wherein the i is a positive integer; according to the ratio Obtain the i-th sound field classification parameter corresponding to the current frame.
- the i-th linear analysis result and the i+1-th linear analysis result are two consecutive linear analysis results of the current frame.
- the sound field classification result includes: the sound field type; the sound field classification module is configured to determine that the sound field type is a diffuse sound field when the values of the plurality of sound field classification parameters all satisfy a preset diffuse sound source judgment condition; or, to determine that the sound field type is a heterogeneous sound field when at least one of the values of the plurality of sound field classification parameters satisfies a preset dissimilar sound source judgment condition.
- the diffuse sound source judgment condition includes: the value of the sound field classification parameter is less than a preset dissimilar sound source judgment threshold; and the dissimilar sound source judgment condition includes: the value of the sound field classification parameter is greater than or equal to the preset dissimilar sound source judgment threshold.
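The two judgment conditions above can be sketched as follows; the function name is illustrative, and the default threshold reuses the judgment threshold of 100 from the earlier embodiment.

```python
def classify_sound_field(params, thresh: float = 100.0) -> str:
    """Classify the sound field from the sound field classification parameters.

    Per the text: if every parameter is below the dissimilar-source judgment
    threshold, the field is diffuse; if at least one parameter reaches the
    threshold, the field is heterogeneous.
    """
    if any(p >= thresh for p in params):
        return "heterogeneous"   # at least one dominant source detected
    return "diffuse"             # no parameter indicates a distinct source
```

Note that the two conditions are complements of each other, so exactly one of the two sound field types is always selected.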
- the sound field classification result includes: sound field type; or, the sound field classification result includes: the number of different sound sources and the sound field type;
- the sound field classification module is configured to obtain the number of different sound sources corresponding to the current frame according to the values of the plurality of sound field classification parameters; and determine the sound field type according to the number of different sound sources corresponding to the current frame.
- the sound field classification results include: the number of different sound sources;
- the sound field classification module is configured to obtain the number of different sound sources corresponding to the current frame according to the values of the plurality of sound field classification parameters.
- the K is the number of signal points corresponding to each channel of the current frame
- the min represents the minimum value operation
- the determining the sound field type according to the number of dissimilar sound sources corresponding to the current frame includes:
- the number of dissimilar sound sources corresponding to the first sound field type is different from the number of dissimilar sound sources corresponding to the second sound field type.
- the first preset condition includes that the number of dissimilar sound sources is greater than a first threshold and less than a second threshold, wherein the second threshold is greater than the first threshold;
- the first preset condition includes that the number of dissimilar sound sources is not greater than a first threshold or not less than a second threshold, wherein the second threshold is greater than the first threshold.
- the audio encoding device further includes: an encoding mode determination module (not shown in FIG. 12), and the encoding mode determination module is used to determine the encoding mode corresponding to the current frame according to the sound field classification result.
- the coding mode determination module is configured to: when the sound field classification result includes the number of different sound sources, or includes the number of different sound sources and the sound field type, determine the encoding mode corresponding to the current frame according to the number of different sound sources; or, when the sound field classification result includes the sound field type, or includes the number of different sound sources and the sound field type, determine the encoding mode corresponding to the current frame according to the sound field type; or, when the sound field classification result includes the number of different sound sources and the sound field type, determine the encoding mode corresponding to the current frame according to the number of different sound sources and the sound field type.
- the encoding mode determining module is configured to determine that the encoding mode is the first encoding mode when the number of the dissimilarity sound sources satisfies a second preset condition; when the dissimilarity When the number of sound sources does not meet the second preset condition, determine that the encoding mode is the second encoding mode;
- the first coding mode is the HOA coding mode based on virtual speaker selection or the HOA coding mode based on directional audio coding
- the second coding mode is the HOA coding mode based on virtual speaker selection or the HOA coding mode based on directional audio coding
- the first coding mode and the second coding mode are different coding modes.
- the second preset condition includes that the number of dissimilar sound sources is greater than a first threshold and less than a second threshold, wherein the second threshold is greater than the first threshold; or,
- the second preset condition includes that the number of dissimilar sound sources is not greater than a first threshold or not less than a second threshold, wherein the second threshold is greater than the first threshold.
- the encoding mode determining module is configured to determine that the encoding mode is the HOA encoding mode based on virtual speaker selection when the sound field type is a heterogeneous sound field, and to determine that the encoding mode is the HOA encoding mode based on directional audio coding when the sound field type is a diffuse sound field.
- the coding mode determination module is configured to: determine the initial coding mode corresponding to the current frame according to the sound field classification result of the current frame; obtain the sliding window where the current frame is located, the sliding window including the initial coding mode of the current frame and the coding modes of the N-1 frames before the current frame, where N is the length of the sliding window; and determine the encoding mode of the current frame according to the initial encoding mode of the current frame and the encoding modes of the N-1 frames.
- the audio encoding device further includes: an encoding parameter determination module (not shown in FIG. 12), and the encoding parameter determination module is used to determine the encoding parameter corresponding to the current frame according to the sound field classification result.
- the encoding parameters include at least one of the following: the number of channels of the virtual speaker signal, the number of channels of the residual signal, the number of encoded bits of the virtual speaker signal, the number of encoded bits of the residual signal, or the number of voting rounds for the best matching speaker search;
- the virtual speaker signal and the residual signal are signals generated according to the three-dimensional audio signal.
- the number of voting rounds satisfies the following relationship:
- the I is the number of voting rounds
- the d is the number of dissimilar sound sources included in the sound field classification result.
- the sound field classification result includes the number of different sound sources and the type of sound field
- the number of channels of the virtual loudspeaker signal satisfies the following relationship: F = min(S, PF), where
- the F is the number of channels of the virtual speaker signal
- the S is the number of the heterogeneous sound sources
- the PF is the number of channels of the virtual speaker signal preset by the encoder
- the number of channels of the virtual loudspeaker signal satisfies the following relationship:
- the F is the number of channels of the virtual speaker signal.
- the number of channels of the residual signal satisfies the following relationship:
- the PR is the number of residual signal channels preset by the encoder
- the C is the sum of the number of channels of the residual signal preset by the encoder and the number of virtual speaker signal channels preset by the encoder;
- the number of channels of the residual signal satisfies the following relationship:
- the R represents the number of channels of the residual signal
- the C is the sum of the number of residual signal channels preset by the encoder and the number of virtual speaker signal channels preset by the encoder
- the F is the number of channels of the virtual speaker signal described above.
- the sound field classification result includes the number of different sound sources
- the number of channels of the virtual loudspeaker signal satisfies the following relationship: F = min(S, PF), where
- the F is the number of channels of the virtual speaker signal
- the S is the number of the heterogeneous sound sources
- the PF is the number of channels of the virtual speaker signal preset by the encoder.
- the number of channels of the residual signal satisfies the following relationship:
- the R represents the number of channels of the residual signal
- the C is the sum of the number of channels of the residual signal preset by the encoder and the number of channels of the virtual speaker signal preset by the encoder
- the F is the channel number of the virtual speaker signal
- the sound field classification result includes the number of different sound sources, or the sound field classification result includes the number of different sound sources and the type of sound field;
- the number of encoded bits of the virtual loudspeaker signal is obtained through the ratio of the number of encoded bits of the virtual loudspeaker signal to the number of encoded bits of the transmission channel;
- the number of encoded bits of the residual signal is obtained through the ratio of the number of encoded bits of the virtual loudspeaker signal to the number of encoded bits of the transmission channel;
- the number of encoded bits of the transmission channel includes the number of encoded bits of the virtual speaker signal and the number of encoded bits of the residual signal, and when the number of distinct sound sources is less than or equal to the number of channels of the virtual speaker signal,
- the ratio of the number of coded bits of the virtual speaker signal to the number of coded bits of the transmission channel is obtained by increasing the initial ratio of the number of coded bits of the virtual speaker signal to the number of coded bits of the transmission channel.
- the audio encoding device further includes: an encoding module (not shown in FIG. 12), and the encoding module is used to encode the current frame and the sound field classification result and write them into the code stream.
- the current frame of the three-dimensional audio signal is linearly decomposed to obtain a linear decomposition result; the sound field classification parameter corresponding to the current frame is then obtained from the linear decomposition result; finally, the sound field classification result of the current frame is determined from the sound field classification parameter.
- since, in this embodiment of the present application, the linear decomposition result of the current frame is obtained by linearly decomposing the current frame of the three-dimensional audio signal, the sound field classification parameter corresponding to the current frame is obtained from that linear decomposition result, and the sound field classification result of the current frame is determined from the sound field classification parameter, sound field classification of the current frame can be realized through the sound field classification result.
- this embodiment of the present application classifies the sound field of the three-dimensional audio signal, so that the three-dimensional audio signal can be accurately identified.
- a processing device for a three-dimensional audio signal provided by an embodiment of the present application, for example an audio decoding device 1300, may include: a receiving module 1301, a decoding module 1302 and a signal generation module 1303, wherein:
- the receiving module is configured to receive a code stream;
- the decoding module is configured to decode the code stream to obtain the sound field classification result of the current frame;
- the signal generation module is configured to obtain the decoded three-dimensional audio signal of the current frame according to the sound field classification result.
- the signal generating module is configured to determine a decoding mode of the current frame according to the sound field classification result; and obtain a decoded 3D audio signal of the current frame according to the decoding mode.
- the signal generation module is configured to: when the sound field classification result includes the number of distinct sound sources, or the sound field classification result includes the number of distinct sound sources and the sound field type, determine the decoding mode of the current frame according to the number of distinct sound sources; or, when the sound field classification result includes the sound field type, or the sound field classification result includes the number of distinct sound sources and the sound field type, determine the decoding mode of the current frame according to the sound field type; or, when the sound field classification result includes the number of distinct sound sources and the sound field type, determine the decoding mode of the current frame according to the number of distinct sound sources and the sound field type.
- the signal generation module is configured to determine that the decoding mode is a first decoding mode when the number of distinct sound sources satisfies a preset condition, and to determine that the decoding mode is a second decoding mode when the number of distinct sound sources does not satisfy the preset condition;
- the first decoding mode is an HOA decoding mode based on virtual speaker selection or an HOA decoding mode based on directional audio coding;
- the second decoding mode is an HOA decoding mode based on virtual speaker selection or an HOA decoding mode based on directional audio coding;
- the first decoding mode and the second decoding mode are different decoding modes.
- the preset condition includes that the number of distinct sound sources is greater than a first threshold and less than a second threshold, wherein the second threshold is greater than the first threshold;
- or the preset condition includes that the number of distinct sound sources is not greater than the first threshold or not less than the second threshold, wherein the second threshold is greater than the first threshold.
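The preset-condition check described above can be sketched directly. The mode labels are placeholders; which concrete HOA mode plays which role is left open by the text:

```python
def select_decoding_mode(num_distinct_sources, first_threshold, second_threshold,
                         mode_1="hoa_virtual_speaker_selection",
                         mode_2="hoa_directional_audio_coding"):
    """Return the first decoding mode when first_threshold < S < second_threshold,
    and the second decoding mode otherwise (mode labels are illustrative)."""
    assert second_threshold > first_threshold  # per the stated condition
    if first_threshold < num_distinct_sources < second_threshold:
        return mode_1
    return mode_2
```

For instance, with thresholds 1 and 5, three distinct sources select the first mode, while six (or exactly one) fall through to the second.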
- the signal generating module is configured to determine decoding parameters of the current frame according to the sound field classification result; and obtain a decoded 3D audio signal of the current frame according to the decoding parameters.
- the decoding parameters include at least one of the following: the number of channels of the virtual speaker signal, the number of channels of the residual signal, the number of decoding bits of the virtual speaker signal, or the number of decoding bits of the residual signal;
- the virtual loudspeaker signal and the residual signal are obtained by decoding the code stream.
- the sound field classification result includes the number of different sound sources and the type of sound field
- when the sound field type is a distinct sound field, the number of channels of the virtual speaker signal satisfies the following relationship: F = min(S, PF), where F is the number of channels of the virtual speaker signal, S is the number of distinct sound sources, and PF is the number of virtual speaker signal channels preset by the decoder;
- when the sound field type is a diffuse sound field, the number of channels of the virtual speaker signal satisfies the following relationship: F = 1, where F is the number of channels of the virtual speaker signal.
- when the sound field type is a diffuse sound field, the number of channels of the residual signal satisfies the following relationship: R = max(C - 1, PR), where PR is the number of residual signal channels preset by the decoder, and C is the sum of the number of residual signal channels preset by the decoder and the number of virtual speaker signal channels preset by the decoder;
- when the sound field type is a distinct sound field, the number of channels of the residual signal satisfies the following relationship: R = C - F, where R denotes the number of channels of the residual signal, C is the sum of the number of residual signal channels preset by the decoder and the number of virtual speaker signal channels preset by the decoder, and F is the number of channels of the virtual speaker signal.
- the sound field classification result includes the number of different sound sources
- the number of channels of the virtual speaker signal satisfies the following relationship: F = min(S, PF), where F is the number of channels of the virtual speaker signal, S is the number of distinct sound sources, and PF is the number of virtual speaker signal channels preset by the decoder.
- the number of channels of the residual signal satisfies the following relationship: R = C - F, where R denotes the number of channels of the residual signal, C is the sum of the number of residual signal channels preset by the decoder and the number of virtual speaker signal channels preset by the decoder, and F is the number of channels of the virtual speaker signal.
- the sound field classification result includes the number of different sound sources, or the sound field classification result includes the number of different sound sources and the type of sound field;
- the number of decoding bits of the virtual speaker signal is obtained from the ratio of the number of decoding bits of the virtual speaker signal to the number of decoding bits of the transmission channel;
- the number of decoding bits of the residual signal is obtained from the ratio of the number of decoding bits of the virtual speaker signal to the number of decoding bits of the transmission channel;
- the number of decoding bits of the transmission channel includes the number of decoding bits of the virtual speaker signal and the number of decoding bits of the residual signal; when the number of distinct sound sources is less than or equal to the number of channels of the virtual speaker signal, the ratio of the number of decoding bits of the virtual speaker signal to the number of decoding bits of the transmission channel is obtained by increasing the initial ratio of the number of decoding bits of the virtual speaker signal to the number of decoding bits of the transmission channel.
- the sound field classification result can be used to decode the current frame in the code stream, so the decoding end decodes with a decoding method that matches the sound field of the current frame, thereby obtaining the three-dimensional audio signal sent by the encoding end and realizing the transmission of the audio signal from the encoding end to the decoding end.
- an embodiment of the present application also provides a computer storage medium, wherein the computer storage medium stores a program which, when executed, performs some or all of the steps described in the above method embodiments.
- the audio coding device 1400 includes:
- a receiver 1401 , a transmitter 1402 , a processor 1403 and a memory 1404 (the number of processors 1403 in the audio encoding device 1400 can be one or more, one processor is taken as an example in FIG. 14 ).
- the receiver 1401 , the transmitter 1402 , the processor 1403 and the memory 1404 may be connected through a bus or in other ways, where connection through a bus is taken as an example in FIG. 14 .
- the memory 1404 may include read-only memory and random-access memory, and provides instructions and data to the processor 1403 .
- a part of the memory 1404 may also include a non-volatile random access memory (non-volatile random access memory, NVRAM).
- the memory 1404 stores operating systems and operating instructions, executable modules or data structures, or their subsets, or their extended sets, wherein the operating instructions may include various operating instructions for implementing various operations.
- the operating system may include various system programs for implementing various basic services and processing hardware-based tasks.
- the processor 1403 controls the operation of the audio encoding device, and the processor 1403 may also be called a central processing unit (central processing unit, CPU).
- various components of the audio encoding device are coupled together through a bus system, wherein the bus system may include a power bus, a control bus, and a status signal bus, etc. in addition to a data bus.
- the various buses are referred to as bus systems in the figures.
- the methods disclosed in the foregoing embodiments of the present application may be applied to the processor 1403 or implemented by the processor 1403 .
- the processor 1403 may be an integrated circuit chip, which has a signal processing capability. In the implementation process, each step of the above method may be implemented by an integrated logic circuit of hardware in the processor 1403 or instructions in the form of software.
- the above-mentioned processor 1403 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
- the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
- the software module can be located in a storage medium mature in the field, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or register.
- the storage medium is located in the memory 1404, and the processor 1403 reads the information in the memory 1404, and completes the steps of the above method in combination with its hardware.
- the receiver 1401 can be used to receive input digital or character information, and generate signal input related to the relevant settings and function control of the audio encoding device.
- the transmitter 1402 can include a display device such as a display screen, and the transmitter 1402 can be used to output digital or character information through an external interface.
- the processor 1403 is configured to execute the methods performed by the audio coding apparatus shown in FIG. 4 to FIG. 6 in the foregoing embodiment.
- the audio decoding device 1500 includes:
- a receiver 1501, a transmitter 1502, a processor 1503, and a memory 1504 (the number of processors 1503 in the audio decoding device 1500 can be one or more, one processor is taken as an example in FIG. 15 ).
- the receiver 1501 , the transmitter 1502 , the processor 1503 and the memory 1504 may be connected through a bus or in other ways, wherein connection through a bus is taken as an example in FIG. 15 .
- the memory 1504 may include read-only memory and random-access memory, and provides instructions and data to the processor 1503 . A portion of memory 1504 may also include NVRAM.
- the memory 1504 stores operating systems and operating instructions, executable modules or data structures, or their subsets, or their extended sets, wherein the operating instructions may include various operating instructions for implementing various operations.
- the operating system may include various system programs for implementing various basic services and processing hardware-based tasks.
- the processor 1503 controls the operation of the audio decoding device, and the processor 1503 may also be called a CPU.
- various components of the audio decoding device are coupled together through a bus system, wherein the bus system may include a power bus, a control bus, and a status signal bus, etc. in addition to a data bus.
- the various buses are referred to as bus systems in the figures.
- the methods disclosed in the foregoing embodiments of the present application may be applied to the processor 1503 or implemented by the processor 1503 .
- the processor 1503 may be an integrated circuit chip, which has a signal processing capability. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 1503 or instructions in the form of software.
- the aforementioned processor 1503 may be a general processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
- a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
- the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
- the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
- the storage medium is located in the memory 1504, and the processor 1503 reads the information in the memory 1504, and completes the steps of the above method in combination with its hardware.
- the processor 1503 is configured to execute the method performed by the audio decoding device shown in FIG. 7 of the foregoing embodiment.
- when the audio encoding device or the audio decoding device is a chip in a terminal, the chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit.
- the processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip in the terminal executes the audio encoding method of any implementation of the above first aspect, or the audio decoding method of any implementation of the second aspect.
- the storage unit is a storage unit in the chip, such as a register, a cache, etc.
- the storage unit may also be a storage unit in the terminal located outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
- the processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling program execution of the method of the first aspect or the second aspect.
- the device embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- the connection relationship between the modules indicates that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines.
- the essence of the technical solution of this application, or the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk or optical disk, and includes several instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) execute the methods described in the embodiments of the present application.
- all or part of them may be implemented by software, hardware, firmware or any combination thereof.
- when implemented using software, it may be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave).
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrating one or more available media.
- the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (Solid State Disk, SSD)), etc.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Mathematical Physics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Stereophonic System (AREA)
Abstract
Description
Claims (46)
- A three-dimensional audio signal processing method, comprising: performing linear decomposition on a current frame of a three-dimensional audio signal to obtain a linear decomposition result; obtaining a sound field classification parameter corresponding to the current frame according to the linear decomposition result; and determining a sound field classification result of the current frame according to the sound field classification parameter.
- The method according to claim 1, wherein the three-dimensional audio signal comprises a higher-order ambisonics (HOA) signal or a first-order ambisonics (FOA) signal.
- The method according to claim 1 or 2, wherein performing linear decomposition on the current frame of the three-dimensional audio signal to obtain the linear decomposition result comprises: performing singular value decomposition on the current frame to obtain singular values corresponding to the current frame, wherein the linear decomposition result comprises the singular values; or performing principal component analysis on the current frame to obtain first eigenvalues corresponding to the current frame, wherein the linear decomposition result comprises the first eigenvalues; or performing independent component analysis on the current frame to obtain second eigenvalues corresponding to the current frame, wherein the linear decomposition result comprises the second eigenvalues.
- The method according to any one of claims 1 to 3, wherein there are a plurality of linear decomposition results and a plurality of sound field classification parameters; and obtaining the sound field classification parameter corresponding to the current frame according to the linear decomposition result comprises: obtaining a ratio of an i-th linear analysis result of the current frame to an (i+1)-th linear analysis result of the current frame, wherein i is a positive integer; and obtaining an i-th sound field classification parameter corresponding to the current frame according to the ratio.
- The method according to any one of claims 1 to 4, wherein there are a plurality of sound field classification parameters, and the sound field classification result comprises a sound field type; determining the sound field classification result of the current frame according to the sound field classification parameters comprises: when the values of all of the plurality of sound field classification parameters satisfy a preset diffuse-source decision condition, determining that the sound field type is a diffuse sound field; or, when at least one of the values of the plurality of sound field classification parameters satisfies a preset distinct-source decision condition, determining that the sound field type is a distinct sound field.
- The method according to claim 5, wherein the diffuse-source decision condition comprises: the value of the sound field classification parameter is less than a preset distinct-source decision threshold; or the distinct-source decision condition comprises: the value of the sound field classification parameter is greater than or equal to the preset distinct-source decision threshold.
- The method according to any one of claims 1 to 4, wherein there are a plurality of sound field classification parameters, and the sound field classification result comprises a sound field type, or the sound field classification result comprises a number of distinct sound sources and a sound field type; determining the sound field classification result of the current frame according to the sound field classification parameters comprises: obtaining the number of distinct sound sources corresponding to the current frame according to the values of the plurality of sound field classification parameters; and determining the sound field type according to the number of distinct sound sources corresponding to the current frame.
- The method according to any one of claims 1 to 4, wherein there are a plurality of sound field classification parameters, and the sound field classification result comprises a number of distinct sound sources; determining the sound field classification result of the current frame according to the sound field classification parameters comprises: obtaining the number of distinct sound sources corresponding to the current frame according to the values of the plurality of sound field classification parameters.
- The method according to claim 7 or 8, wherein the plurality of sound field classification parameters are temp[i], i = 0, 1, …, min(L, K) − 2, where L denotes the number of channels of the current frame, K is the number of signal points corresponding to each channel of the current frame, and min denotes a minimum operation; and obtaining the number of distinct sound sources corresponding to the current frame according to the values of the plurality of sound field classification parameters comprises: starting from i = 0, performing the following judgment procedure in sequence: judging whether temp[i] is greater than a preset distinct-source decision threshold; when temp[i] is less than the distinct-source decision threshold in the current judgment procedure, updating the value of i to i + 1 and continuing with the next judgment procedure; or, when temp[i] is greater than or equal to the distinct-source decision threshold in the current judgment procedure, terminating the judgment procedure and determining that i of the current judgment procedure plus 1 equals the number of distinct sound sources.
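The judgment flow in the claim above maps directly to a short loop. The behavior when no parameter reaches the threshold is not specified in the claim; returning the full parameter count here is an assumption:

```python
def count_distinct_sources(temp, threshold):
    """Scan temp[i] from i = 0 upward; at the first value that is greater than
    or equal to the distinct-source decision threshold, stop and report i + 1
    as the number of distinct sound sources."""
    for i, value in enumerate(temp):
        if value >= threshold:
            return i + 1
    return len(temp)  # assumption: fallback when no value reaches the threshold
```

For example, with parameters [0.5, 0.7, 3.0, 0.2] and threshold 2.0, the scan stops at i = 2 and reports three distinct sources.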
- The method according to claim 7, wherein determining the sound field type according to the number of distinct sound sources corresponding to the current frame comprises: when the number of distinct sound sources satisfies a first preset condition, determining that the sound field type is a first sound field type; and when the number of distinct sound sources does not satisfy the first preset condition, determining that the sound field type is a second sound field type, wherein the number of distinct sound sources corresponding to the first sound field type differs from the number of distinct sound sources corresponding to the second sound field type.
- The method according to claim 10, wherein the first preset condition comprises that the number of distinct sound sources is greater than a first threshold and less than a second threshold, the second threshold being greater than the first threshold; or the first preset condition comprises that the number of distinct sound sources is not greater than a first threshold or not less than a second threshold, the second threshold being greater than the first threshold.
- The method according to any one of claims 1 to 11, further comprising: determining a coding mode corresponding to the current frame according to the sound field classification result.
- The method according to claim 12, wherein determining the coding mode corresponding to the current frame according to the sound field classification result comprises: when the sound field classification result comprises the number of distinct sound sources, or comprises the number of distinct sound sources and the sound field type, determining the coding mode corresponding to the current frame according to the number of distinct sound sources; or, when the sound field classification result comprises the sound field type, or comprises the number of distinct sound sources and the sound field type, determining the coding mode corresponding to the current frame according to the sound field type; or, when the sound field classification result comprises the number of distinct sound sources and the sound field type, determining the coding mode corresponding to the current frame according to the number of distinct sound sources and the sound field type.
- The method according to claim 13, wherein determining the coding mode corresponding to the current frame according to the number of distinct sound sources comprises: when the number of distinct sound sources satisfies a second preset condition, determining that the coding mode is a first coding mode; and when the number of distinct sound sources does not satisfy the second preset condition, determining that the coding mode is a second coding mode, wherein the first coding mode is an HOA coding mode based on virtual speaker selection or an HOA coding mode based on directional audio coding, the second coding mode is an HOA coding mode based on virtual speaker selection or an HOA coding mode based on directional audio coding, and the first coding mode and the second coding mode are different coding modes.
- The method according to claim 14, wherein the second preset condition comprises that the number of distinct sound sources is greater than a first threshold and less than a second threshold, the second threshold being greater than the first threshold; or the second preset condition comprises that the number of distinct sound sources is not greater than a first threshold or not less than a second threshold, the second threshold being greater than the first threshold.
- The method according to claim 13, wherein determining the coding mode corresponding to the current frame according to the sound field type comprises: when the sound field type is a distinct sound field, determining that the coding mode is an HOA coding mode based on virtual speaker selection; and when the sound field type is a diffuse sound field, determining that the coding mode is an HOA coding mode based on directional audio coding.
- The method according to claim 12, wherein determining the coding mode corresponding to the current frame according to the sound field classification result comprises: determining an initial coding mode corresponding to the current frame according to the sound field classification result of the current frame; obtaining a sliding window in which the current frame is located, the sliding window comprising the initial coding mode of the current frame and the coding modes of the N − 1 frames preceding the current frame, N being the length of the sliding window; and determining the coding mode of the current frame according to the initial coding mode of the current frame and the coding modes of the N − 1 frames within the sliding window.
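The sliding-window claim above does not spell out how the window's modes are combined into the final mode; a majority vote over the window is one plausible choice, sketched here with illustrative names:

```python
from collections import Counter

def smooth_coding_mode(initial_mode, previous_modes):
    """Sliding window = coding modes of the previous N-1 frames plus the
    current frame's initial mode; pick the most frequent mode in the window
    (majority vote is an assumption, not stated in the claim)."""
    window = list(previous_modes) + [initial_mode]
    mode, _ = Counter(window).most_common(1)[0]
    return mode
```

Smoothing of this kind prevents the coding mode from flipping on a single outlier frame.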
- The method according to any one of claims 1 to 17, further comprising: determining coding parameters corresponding to the current frame according to the sound field classification result.
- The method according to claim 18, wherein the coding parameters comprise at least one of: the number of channels of a virtual speaker signal, the number of channels of a residual signal, the number of coded bits of the virtual speaker signal, the number of coded bits of the residual signal, or the number of voting rounds of the best-matching speaker search, wherein the virtual speaker signal and the residual signal are generated from the three-dimensional audio signal.
- The method according to claim 19, wherein the number of voting rounds satisfies the following relationship: 1 ≤ I ≤ d, where I is the number of voting rounds and d is the number of distinct sound sources comprised in the sound field classification result.
- The method according to claim 19 or 20, wherein the sound field classification result comprises the number of distinct sound sources and the sound field type; when the sound field type is a distinct sound field, the number of channels of the virtual speaker signal satisfies: F = min(S, PF), where F is the number of channels of the virtual speaker signal, S is the number of distinct sound sources, and PF is the number of virtual speaker signal channels preset by the encoder; or, when the sound field type is a diffuse sound field, the number of channels of the virtual speaker signal satisfies: F = 1, where F is the number of channels of the virtual speaker signal.
- The method according to any one of claims 19 to 21, wherein when the sound field type is a diffuse sound field, the number of channels of the residual signal satisfies: R = max(C - 1, PR), where PR is the number of residual signal channels preset by the encoder and C is the sum of the number of residual signal channels preset by the encoder and the number of virtual speaker signal channels preset by the encoder; or, when the sound field type is a distinct sound field, the number of channels of the residual signal satisfies: R = C - F, where R denotes the number of channels of the residual signal, C is the sum of the number of residual signal channels preset by the encoder and the number of virtual speaker signal channels preset by the encoder, and F is the number of channels of the virtual speaker signal.
- The method according to claim 19 or 20, wherein the sound field classification result comprises the number of distinct sound sources; the number of channels of the virtual speaker signal satisfies: F = min(S, PF), where F is the number of channels of the virtual speaker signal, S is the number of distinct sound sources, and PF is the number of virtual speaker signal channels preset by the encoder.
- The method according to claim 19, 20, 21 or 23, wherein the number of channels of the residual signal satisfies: R = C - F, where R denotes the number of channels of the residual signal, C is the sum of the number of residual signal channels preset by the encoder and the number of virtual speaker signal channels preset by the encoder, and F is the number of channels of the virtual speaker signal.
- The method according to any one of claims 19 to 24, wherein the sound field classification result comprises the number of distinct sound sources, or comprises the number of distinct sound sources and the sound field type; the number of coded bits of the virtual speaker signal is obtained from the ratio of the number of coded bits of the virtual speaker signal to the number of coded bits of the transmission channel; the number of coded bits of the residual signal is obtained from the ratio of the number of coded bits of the virtual speaker signal to the number of coded bits of the transmission channel; and the number of coded bits of the transmission channel comprises the number of coded bits of the virtual speaker signal and the number of coded bits of the residual signal, wherein when the number of distinct sound sources is less than or equal to the number of channels of the virtual speaker signal, the ratio of the number of coded bits of the virtual speaker signal to the number of coded bits of the transmission channel is obtained by increasing an initial ratio of the number of coded bits of the virtual speaker signal to the number of coded bits of the transmission channel.
- The method according to any one of claims 1 to 25, further comprising: encoding the current frame and the sound field classification result, and writing them into a bitstream.
- A three-dimensional audio signal processing method, comprising: receiving a bitstream; decoding the bitstream to obtain a sound field classification result of a current frame; and obtaining a decoded three-dimensional audio signal of the current frame according to the sound field classification result.
- The method according to claim 27, wherein obtaining the decoded three-dimensional audio signal of the current frame according to the sound field classification result comprises: determining a decoding mode of the current frame according to the sound field classification result; and obtaining the decoded three-dimensional audio signal of the current frame according to the decoding mode.
- The method according to claim 28, wherein determining the decoding mode of the current frame according to the sound field classification result comprises: when the sound field classification result comprises the number of distinct sound sources, or comprises the number of distinct sound sources and the sound field type, determining the decoding mode of the current frame according to the number of distinct sound sources; or, when the sound field classification result comprises the sound field type, or comprises the number of distinct sound sources and the sound field type, determining the decoding mode of the current frame according to the sound field type; or, when the sound field classification result comprises the number of distinct sound sources and the sound field type, determining the decoding mode of the current frame according to the number of distinct sound sources and the sound field type.
- The method according to claim 29, wherein determining the decoding mode corresponding to the current frame according to the number of distinct sound sources comprises: when the number of distinct sound sources satisfies a preset condition, determining that the decoding mode is a first decoding mode; and when the number of distinct sound sources does not satisfy the preset condition, determining that the decoding mode is a second decoding mode, wherein the first decoding mode is an HOA decoding mode based on virtual speaker selection or an HOA decoding mode based on directional audio coding, the second decoding mode is an HOA decoding mode based on virtual speaker selection or an HOA decoding mode based on directional audio coding, and the first decoding mode and the second decoding mode are different decoding modes.
- The method according to claim 30, wherein the preset condition comprises that the number of distinct sound sources is greater than a first threshold and less than a second threshold, the second threshold being greater than the first threshold; or the preset condition comprises that the number of distinct sound sources is not greater than a first threshold or not less than a second threshold, the second threshold being greater than the first threshold.
- The method according to claim 27, wherein obtaining the decoded three-dimensional audio signal of the current frame according to the sound field classification result comprises: determining decoding parameters of the current frame according to the sound field classification result; and obtaining the decoded three-dimensional audio signal of the current frame according to the decoding parameters.
- The method according to claim 32, wherein the decoding parameters comprise at least one of: the number of channels of a virtual speaker signal, the number of channels of a residual signal, the number of decoding bits of the virtual speaker signal, or the number of decoding bits of the residual signal, wherein the virtual speaker signal and the residual signal are obtained by decoding the bitstream.
- The method according to claim 33, wherein the sound field classification result comprises the number of distinct sound sources and the sound field type; when the sound field type is a distinct sound field, the number of channels of the virtual speaker signal satisfies: F = min(S, PF), where F is the number of channels of the virtual speaker signal, S is the number of distinct sound sources, and PF is the number of virtual speaker signal channels preset by the decoder; or, when the sound field type is a diffuse sound field, the number of channels of the virtual speaker signal satisfies: F = 1, where F is the number of channels of the virtual speaker signal.
- The method according to claim 33 or 34, wherein when the sound field type is a diffuse sound field, the number of channels of the residual signal satisfies: R = max(C - 1, PR), where PR is the number of residual signal channels preset by the decoder and C is the sum of the number of residual signal channels preset by the decoder and the number of virtual speaker signal channels preset by the decoder; or, when the sound field type is a distinct sound field, the number of channels of the residual signal satisfies: R = C - F, where R denotes the number of channels of the residual signal, C is the sum of the number of residual signal channels preset by the decoder and the number of virtual speaker signal channels preset by the decoder, and F is the number of channels of the virtual speaker signal.
- The method according to claim 33 or 35, wherein the sound field classification result comprises the number of distinct sound sources; the number of channels of the virtual speaker signal satisfies: F = min(S, PF), where F is the number of channels of the virtual speaker signal, S is the number of distinct sound sources, and PF is the number of virtual speaker signal channels preset by the decoder.
- The method according to any one of claims 33 to 36, wherein the number of channels of the residual signal satisfies: R = C - F, where R denotes the number of channels of the residual signal, C is the sum of the number of residual signal channels preset by the decoder and the number of virtual speaker signal channels preset by the decoder, and F is the number of channels of the virtual speaker signal.
- The method according to any one of claims 33 to 37, wherein the sound field classification result comprises the number of distinct sound sources, or comprises the number of distinct sound sources and the sound field type; the number of decoding bits of the virtual speaker signal is obtained from the ratio of the number of decoding bits of the virtual speaker signal to the number of decoding bits of the transmission channel; the number of decoding bits of the residual signal is obtained from the ratio of the number of decoding bits of the virtual speaker signal to the number of decoding bits of the transmission channel; and the number of decoding bits of the transmission channel comprises the number of decoding bits of the virtual speaker signal and the number of decoding bits of the residual signal, wherein when the number of distinct sound sources is less than or equal to the number of channels of the virtual speaker signal, the ratio of the number of decoding bits of the virtual speaker signal to the number of decoding bits of the transmission channel is obtained by increasing an initial ratio of the number of decoding bits of the virtual speaker signal to the number of decoding bits of the transmission channel.
- A three-dimensional audio signal processing apparatus, comprising: a linear analysis module configured to perform linear decomposition on a three-dimensional audio signal to obtain a linear decomposition result; a parameter generation module configured to obtain a sound field classification parameter corresponding to the current frame according to the linear decomposition result; and a sound field classification module configured to determine a sound field classification result of the current frame according to the sound field classification parameter.
- A three-dimensional audio signal processing apparatus, comprising: a receiving module configured to receive a bitstream; a decoding module configured to decode the bitstream to obtain a sound field classification result of a current frame; and a signal generation module configured to obtain a decoded three-dimensional audio signal of the current frame according to the sound field classification result.
- A three-dimensional audio signal processing apparatus, comprising at least one processor, wherein the at least one processor is configured to be coupled to a memory, and to read and execute instructions in the memory to implement the method according to any one of claims 1 to 26.
- The three-dimensional audio signal processing apparatus according to claim 41, further comprising: the memory.
- A three-dimensional audio signal processing apparatus, comprising at least one processor, wherein the at least one processor is configured to be coupled to a memory, and to read and execute instructions in the memory to implement the method according to any one of claims 27 to 38.
- The three-dimensional audio signal processing apparatus according to claim 43, wherein the audio decoding apparatus further comprises: the memory.
- A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 26 or 27 to 38.
- A computer-readable storage medium comprising a bitstream generated by the method according to any one of claims 1 to 26.
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA3221992A CA3221992A1 (en) | 2021-05-31 | 2022-05-30 | Three-dimensional audio signal processing method and apparatus |
EP22815232.8A EP4332964A1 (en) | 2021-05-31 | 2022-05-30 | Method and apparatus for processing three-dimensional audio signal |
JP2023573612A JP2024521204A (ja) | 2021-05-31 | 2022-05-30 | 三次元音声信号処理方法および装置 |
BR112023025071A BR112023025071A2 (pt) | 2021-05-31 | 2022-05-30 | Método e aparelho de processamento de sinal de áudio tridimensional |
KR1020237044256A KR20240012519A (ko) | 2021-05-31 | 2022-05-30 | 3차원 오디오 신호를 처리하기 위한 방법 및 장치 |
US18/521,944 US20240105187A1 (en) | 2021-05-31 | 2023-11-28 | Three-dimensional audio signal processing method and apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110602507.4 | 2021-05-31 | ||
CN202110602507.4A CN115938388A (zh) | 2021-05-31 | 2021-05-31 | 一种三维音频信号的处理方法和装置 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/521,944 Continuation US20240105187A1 (en) | 2021-05-31 | 2023-11-28 | Three-dimensional audio signal processing method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022253187A1 true WO2022253187A1 (zh) | 2022-12-08 |
Family
ID=84322803
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/096025 WO2022253187A1 (zh) | 2021-05-31 | 2022-05-30 | 一种三维音频信号的处理方法和装置 |
Country Status (8)
Country | Link |
---|---|
US (1) | US20240105187A1 (zh) |
EP (1) | EP4332964A1 (zh) |
JP (1) | JP2024521204A (zh) |
KR (1) | KR20240012519A (zh) |
CN (1) | CN115938388A (zh) |
BR (1) | BR112023025071A2 (zh) |
CA (1) | CA3221992A1 (zh) |
WO (1) | WO2022253187A1 (zh) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105144752A (zh) * | 2013-04-29 | 2015-12-09 | 汤姆逊许可公司 | 对更高阶高保真度立体声响复制表示进行压缩和解压缩的方法和装置 |
CN105981410A (zh) * | 2013-11-28 | 2016-09-28 | 杜比国际公司 | 使用奇异值分解进行高阶高保真立体声编码和解码的方法和装置 |
CN106463121A (zh) * | 2014-05-16 | 2017-02-22 | 高通股份有限公司 | 较高阶立体混响信号压缩 |
WO2020210084A1 (en) * | 2019-04-09 | 2020-10-15 | Facebook Technologies, Llc | Acoustic transfer function personalization using sound scene analysis and beamforming |
-
2021
- 2021-05-31 CN CN202110602507.4A patent/CN115938388A/zh active Pending
-
2022
- 2022-05-30 EP EP22815232.8A patent/EP4332964A1/en active Pending
- 2022-05-30 CA CA3221992A patent/CA3221992A1/en active Pending
- 2022-05-30 WO PCT/CN2022/096025 patent/WO2022253187A1/zh active Application Filing
- 2022-05-30 JP JP2023573612A patent/JP2024521204A/ja active Pending
- 2022-05-30 KR KR1020237044256A patent/KR20240012519A/ko unknown
- 2022-05-30 BR BR112023025071A patent/BR112023025071A2/pt unknown
-
2023
- 2023-11-28 US US18/521,944 patent/US20240105187A1/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105144752A (zh) * | 2013-04-29 | 2015-12-09 | 汤姆逊许可公司 | 对更高阶高保真度立体声响复制表示进行压缩和解压缩的方法和装置 |
CN105981410A (zh) * | 2013-11-28 | 2016-09-28 | 杜比国际公司 | 使用奇异值分解进行高阶高保真立体声编码和解码的方法和装置 |
CN106463121A (zh) * | 2014-05-16 | 2017-02-22 | 高通股份有限公司 | 较高阶立体混响信号压缩 |
WO2020210084A1 (en) * | 2019-04-09 | 2020-10-15 | Facebook Technologies, Llc | Acoustic transfer function personalization using sound scene analysis and beamforming |
Also Published As
Publication number | Publication date |
---|---|
JP2024521204A (ja) | 2024-05-28 |
CN115938388A (zh) | 2023-04-07 |
EP4332964A1 (en) | 2024-03-06 |
BR112023025071A2 (pt) | 2024-02-27 |
CA3221992A1 (en) | 2022-12-08 |
KR20240012519A (ko) | 2024-01-29 |
US20240105187A1 (en) | 2024-03-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2010125228A1 (en) | Encoding of multiview audio signals | |
US20230298600A1 (en) | Audio encoding and decoding method and apparatus | |
US20240087580A1 (en) | Three-dimensional audio signal coding method and apparatus, and encoder | |
WO2022262576A1 (zh) | 三维音频信号编码方法、装置、编码器和系统 | |
WO2022253187A1 (zh) | 一种三维音频信号的处理方法和装置 | |
WO2022257824A1 (zh) | 一种三维音频信号的处理方法和装置 | |
WO2022242483A1 (zh) | 三维音频信号编码方法、装置和编码器 | |
US20240087578A1 (en) | Three-dimensional audio signal coding method and apparatus, and encoder | |
WO2022242479A1 (zh) | 三维音频信号编码方法、装置和编码器 | |
US20240177721A1 (en) | Audio signal encoding and decoding method and apparatus | |
TWI844036B (zh) | 三維音訊訊號編碼方法、裝置、編碼器、系統、電腦程式和電腦可讀儲存介質 | |
US20240169998A1 (en) | Multi-Channel Signal Encoding and Decoding Method and Apparatus | |
CN115346537A (zh) | 一种音频编码、解码方法及装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22815232 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2023573612 Country of ref document: JP Ref document number: 3221992 Country of ref document: CA |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2022815232 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2022815232 Country of ref document: EP Effective date: 20231201 |
|
REG | Reference to national code |
Ref country code: BR Ref legal event code: B01A Ref document number: 112023025071 Country of ref document: BR |
|
ENP | Entry into the national phase |
Ref document number: 20237044256 Country of ref document: KR Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1020237044256 Country of ref document: KR |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2023134905 Country of ref document: RU |
|
ENP | Entry into the national phase |
Ref document number: 112023025071 Country of ref document: BR Kind code of ref document: A2 Effective date: 20231129 |