WO2018208560A1 - Processing of a multi-channel spatial audio format input signal - Google Patents

Processing of a multi-channel spatial audio format input signal Download PDF

Info

Publication number
WO2018208560A1
WO2018208560A1 (PCT/US2018/030680)
Authority
WO
WIPO (PCT)
Prior art keywords
spatial
audio signal
signal
format
object location
Prior art date
Application number
PCT/US2018/030680
Other languages
French (fr)
Inventor
David S. Mcgrath
Original Assignee
Dolby Laboratories Licensing Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corporation filed Critical Dolby Laboratories Licensing Corporation
Priority to EP18722375.5A (patent EP3622509B1)
Priority to CN201880041822.0A (patent CN110800048B)
Priority to US16/611,843 (patent US10893373B2)
Priority to JP2019561833A (patent JP7224302B2)
Publication of WO2018208560A1

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L19/173 - Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S3/00 - Systems employing more than two channels, e.g. quadraphonic
    • H04S3/008 - Systems employing more than two channels, e.g. quadraphonic, in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S2400/00 - Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 - Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S2420/00 - Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/03 - Application of parametric coding in stereophonic audio systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S2420/00 - Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/07 - Synergistic effects of band splitting and sub-band processing
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S2420/00 - Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/11 - Application of ambisonics in stereophonic audio systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S3/00 - Systems employing more than two channels, e.g. quadraphonic
    • H04S3/02 - Systems employing more than two channels, e.g. quadraphonic, of the matrix type, i.e. in which input signals are combined algebraically, e.g. after having been phase shifted with respect to each other

Definitions

  • the present disclosure relates to immersive audio format conversion, including conversion of a spatial audio format (for example, Ambisonics, Higher Order Ambisonics, or B-format) to an object-based format (for example, Dolby's Atmos format).
  • An aspect of the document relates to a method of processing a multi-channel, spatial format input audio signal (i.e., an audio signal in a spatial format (spatial audio format) which includes multiple channels).
  • the spatial format may be Ambisonics, Higher Order Ambisonics (HOA), or B-format, for example.
  • the method may include analyzing the input audio signal to determine a plurality of object locations of audio objects included in the input audio signal.
  • the object locations may be spatial locations, e.g., indicated by 3-vectors in Cartesian or spherical coordinates. Alternatively, the object locations may be indicated in two dimensions, depending on the application.
  • the method may further include, for each of a plurality of frequency subbands of the input audio signal, determining, for each object location, a mixing gain for that frequency subband and that object location.
  • the method may include applying a time-to-frequency transform to the input audio signal and arranging the resulting frequency coefficients into frequency subbands.
  • the method may include applying a filterbank to the input audio signal.
  • the mixing gains may be referred to as object gains.
  • the method may further include, for each frequency subband, generating, for each object location, a frequency subband output signal based on the input audio signal, the mixing gain for that frequency subband and that object location, and a spatial mapping function of the spatial format.
  • the spatial mapping function may be a spatial decoding function, for example spatial decoding function DS(loc).
  • the method may yet further include, for each object location, generating an output signal by summing over the frequency subband output signals for that object location. The sum may be a weighted sum.
  • the object locations may be output as object location metadata (e.g., object location metadata indicative of the object locations may be generated and output).
  • the output signals may be referred to as object signals or object channels.
  • the above processing may be performed for each predetermined period of time (e.g., for each time-block, or each transformation window of a time-to-frequency transform).
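As a rough illustration of the flow just described, the following Python sketch wires the four steps together for one time-block. All function names (analyze_object_locations, mixing_gain, decode_vector) are hypothetical placeholders rather than names from this document, and the sketch assumes subband signals have already been obtained via a time-to-frequency transform or filterbank.

```python
import numpy as np

def convert_block(S_sub, analyze_object_locations, mixing_gain, decode_vector):
    """S_sub: list over subbands b of (n_s, n_t) arrays (spatial channels x time).
    Returns (object locations, object output signals of shape (n_o, n_t))."""
    locs = analyze_object_locations(S_sub)        # plurality of object locations
    n_o, n_t = len(locs), S_sub[0].shape[1]
    T = np.zeros((n_o, n_t), dtype=complex)
    for b, Sb in enumerate(S_sub):                # for each frequency subband...
        for o, loc in enumerate(locs):            # ...and each object location
            g = mixing_gain(b, loc, Sb)           # mixing gain for (subband, location)
            d = decode_vector(loc)                # spatial decoding function DS(loc)
            T[o] += g * (d @ Sb)                  # subband output, summed over bands
    return locs, T
```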
  • the proposed method applies a subband-based approach for determining the audio object signals. Configured as such, the proposed method can provide clear panning/steering decisions per subband.
  • the mixing gains for the object locations may be frequency-dependent.
  • the spatial format may define a plurality of channels.
  • the spatial mapping function may be a spatial decoding function of the spatial format for extracting an audio signal at a given location, from the plurality of the channels of the spatial format.
  • "At a given location" shall mean "incident from the given location", for example.
  • a spatial panning function of the spatial format may be a function for mapping a source signal at a source location to the plurality of channels defined by the spatial format.
  • "At a source location" shall mean "incident from the source location", for example.
  • Mapping may be referred to as panning.
  • the spatial decoding function may be defined such that successive application of the spatial panning function and the spatial decoding function yields unity gain for all locations on the unit sphere.
  • the spatial decoding function may be further defined such that the average decoded power is minimized.
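A decoding function with exactly these two properties (unity gain through the panner/decoder chain, minimum average decoded power) can be sketched as a constrained least-squares problem. The sketch below is a minimal illustration, assuming a real-valued panning function (as in Ambisonics) and a reasonably dense set of sample directions; it is not taken verbatim from this document.

```python
import numpy as np

def decode_vector(ps, loc, sample_dirs):
    """Row vector d with d @ ps(loc) == 1, minimizing diffuse decoded power.
    ps: panning function mapping a unit vector to an (n_s,) array.
    sample_dirs: unit vectors approximately uniform over the sphere."""
    B = np.stack([ps(v) for v in sample_dirs], axis=1)   # (n_s, n_dirs)
    Sigma = (B @ B.T) / B.shape[1]       # average outer product (diffuse power)
    p = ps(loc)
    w = np.linalg.solve(Sigma, p)        # Sigma^{-1} PS(loc)
    return w / (p @ w)                   # scaled for unity gain at loc
```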
  • determining the mixing gain for a given frequency subband and a given object location may be based on the given object location and a covariance matrix of the input audio signal in the given frequency subband.
  • the mixing gain for the given frequency subband and the given object location may depend on a steering function for the input audio signal in the given frequency subband, evaluated at the given object location.
  • the steering function may be based on the covariance matrix of the input audio signal in the given frequency subband.
  • determining the mixing gain for the given frequency subband and the given object location may be further based on a change rate of the given object location over time.
  • the mixing gain may be attenuated in dependence on the change rate of the given object location. For instance, the mixing gain may be attenuated if the change rate is high, and may not be attenuated for a static object location.
  • generating, for each frequency subband and for each object location, the frequency subband output signal may involve applying a gain matrix and a spatial decoding matrix to the input audio signal. The gain matrix and the spatial decoding matrix may be successively applied.
  • the gain matrix may include the determined mixing gains for that frequency subband.
  • the gain matrix may be a diagonal matrix, with the mixing gains as its diagonal elements, appropriately ordered.
  • the spatial decoding matrix may include a plurality of mapping vectors, one for each object location. Each mapping vector may be obtained by evaluating the spatial decoding function at a respective object location.
  • the spatial decoding function may be a vector-valued function (e.g., yielding a 1 × n_s row vector if the multi-channel, spatial format input audio signal is defined as an n_s × 1 column vector).
  • the method may further include re-encoding the plurality of output signals into the spatial format to obtain a multi-channel, spatial format audio object signal.
  • the method may yet further include subtracting the audio object signal from the input audio signal to obtain a multi-channel, spatial format residual audio signal.
  • the spatial format residual signal may be output together with the output signals and location metadata, if any.
  • the method may further include applying a downmix to the residual audio signal to obtain a downmixed residual audio signal.
  • the number of channels of the downmixed residual audio signal may be smaller than the number of channels of the input audio signal.
  • the downmixed spatial format residual signal may be output together with the output signals and location metadata, if any.
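The re-encode/subtract/downmix chain of the three preceding items can be summarized in a few lines. This is a hedged sketch; E and R stand for an object-encoding (panning) matrix and a residual downmix matrix as described later in this document.

```python
import numpy as np

def downmixed_residual(S, T, E, R):
    """S: (n_s, n_t) spatial input; T: (n_o, n_t) object output signals;
    E: (n_s, n_o) object-encoding matrix; R: (n_r, n_s) downmix, n_r < n_s."""
    S_obj = E @ T        # multi-channel, spatial format audio object signal
    S_res = S - S_obj    # multi-channel, spatial format residual audio signal
    return R @ S_res     # downmixed residual audio signal
```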
  • analyzing the input audio signal may involve, for each frequency subband, determining a set of one or more dominant directions of sound arrival.
  • Analyzing the input audio signal may further involve determining a union of the sets of the one or more dominant directions for the plurality of frequency subbands.
  • Analyzing the input audio signal may yet further involve applying a clustering algorithm to the union of the sets to determine the plurality of object locations.
  • In some examples, determining the set of dominant directions of sound arrival may involve at least one of: extracting elements from the covariance matrix of the input audio signal in the frequency subband, and determining local maxima of a projection function of the input audio signal in the frequency subband.
  • The projection function may be based on the covariance matrix of the input audio signal and a spatial panning function of the spatial format.
  • each dominant direction may have an associated weight.
  • the clustering algorithm may perform weighted clustering of the dominant directions.
  • Each weight may be indicative of a confidence value for its dominant direction, for example.
  • the confidence value may indicate a likelihood of whether an audio object is actually located at the object location.
  • the clustering algorithm may be one of a k-means algorithm, a weighted k-means algorithm, an expectation-maximization algorithm, and a weighted mean algorithm.
  • the method may further include generating object location metadata indicative of the object locations.
  • the object location metadata may be output together with the output signals and the (downmixed) spatial format residual signal, if any.
  • Another aspect of the document relates to an apparatus for processing a multi-channel, spatial format input audio signal.
  • the apparatus may include a processor.
  • the processor may be adapted to analyze the input audio signal to determine a plurality of object locations of audio objects included in the input audio signal.
  • the processor may be further adapted to, for each of a plurality of frequency subbands of the input audio signal, determine, for each object location, a mixing gain for that frequency subband and that object location.
  • the processor may be further adapted to, for each frequency subband, generate, for each object location, a frequency subband output signal based on the input audio signal, the mixing gain for that frequency subband and that object location, and a spatial mapping function of the spatial format.
  • the processor may be yet further adapted to, for each object location, generate an output signal by summing over the frequency subband output signals for that object location.
  • the apparatus may further comprise a memory coupled to the processor. The memory may store respective instructions for execution by the processor.
  • Another aspect of the document relates to a software program. The software program may be adapted for execution on a processor and for performing the method steps outlined in the present document when carried out on the processor.
  • Another aspect of the document relates to a storage medium.
  • the storage medium may comprise a software program adapted for execution on a processor and for performing the method steps outlined in the present document when carried out on the processor.
  • Another aspect of the document relates to a computer program. The computer program may comprise executable instructions for performing the method steps outlined in the present document when executed on a computer.
  • Another aspect of the present document relates to a method for processing a multichannel, spatial audio format input signal, the method comprising determining object location metadata based on the received spatial audio format input signal; and extracting object audio signals based on the received spatial audio format input signal.
  • the extracting of object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.
  • Each extracted audio object signal may have a corresponding object location metadata.
  • the object location metadata may be indicative of the direction-of-arrival of an object.
  • the object location metadata may be derived from statistics of the received spatial audio format input signal.
  • the object location metadata may change from time to time.
  • the object audio signals may be determined based on a linear mixing matrix in each of a number of sub-bands of the received spatial audio format input signal.
  • the residual signal may be a multi-channel residual signal that may be composed of a number of channels that is less than a number of channels of the received spatial audio format input signal.
  • the extracting of object audio signals may involve subtracting the contribution of the said object audio signals from the said spatial audio format input signal.
  • the extracting of object audio signals may also include determining linear mixing matrix coefficients that may be used by subsequent processing to create the one or more object audio signals and the residual signal.
  • the matrix coefficients may be different for each frequency band.
  • Another aspect of the present document relates to an apparatus for processing a multichannel, spatial audio format input signal, the apparatus comprising a processor for determining object location metadata based on the received spatial audio format input signal; and an extractor for extracting object audio signals based on the received spatial audio format input signal, wherein the extracting object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.
  • Fig. 1 is an exemplary conceptual block diagram illustrating an aspect of the present invention;
  • Fig. 2 is an exemplary conceptual block diagram illustrating an aspect of the present invention relating to frequency-domain transforms;
  • Fig. 3 illustrates an exemplary diagram of frequency-domain banding gains;
  • Fig. 4 illustrates an exemplary diagram of a time-window for covariance calculation, win_b(k);
  • Fig. 5 shows a flow chart of an exemplary method for converting a spatial audio format (for example, Ambisonics, HOA, or B-format) to an object-based audio format (for example, Dolby's Atmos format).
  • Fig. 6 shows a flow chart of another example of a method for converting a spatial audio format to an object-based audio format
  • Fig. 7 is a flow chart of an example of a method that implements steps of the method of Fig. 6;
  • Fig. 8 is a flow chart of an example of a method that may be performed in conjunction with the method of Fig. 6.
  • Fig. 1 is an exemplary conceptual block diagram illustrating an exemplary system 100 of the present invention.
  • the system 100 includes an n_s-channel Spatial Audio Format 101 that may be an input received by the system 100.
  • the Spatial Audio Format 101 may be a B-format, an Ambisonics format, or an HOA format.
  • the output of the system 100 may include: the object location metadata 111, the n_o object audio signals 112, and the residual audio signals 113.
  • the system 100 may include a first processing block 102 for determining object locations and a second processing block 103 for extracting object audio signals.
  • Block 102 may output the object location metadata 111 and may provide object location information to block 103 for further processing.
  • Block 103 may be configured to process the Spatial Audio signal (input audio signal) 101, to extract n_o audio signals (output signals, object signals, or object channels) 112 that represent the n_o audio objects (with locations determined by block 102, where 1 ≤ o ≤ n_o).
  • the n_r-channel residual audio signal (spatial format residual signal) 113 may also be output by block 103.
  • Fig. 2 illustrates an exemplary conceptual block diagram illustrating an aspect of the present invention relating to frequency-domain transforms.
  • the input and output audio signals are processed in the Frequency Domain (for example, by using CQMF transformed signals).
  • the variables shown in Fig. 2 may be defined as follows: Indices:
  • Fig. 2 shows the transformations into and out of the frequency domain.
  • the CQMF and CQMF⁻¹ transforms are shown, but other frequency-domain transformations are known in the art and may be applicable in this situation.
  • a filterbank may be applied to the input audio signal, for example.
  • Fig. 2 illustrates a system 200 that includes receiving an input signal (e.g., a multi-channel, spatial format input audio signal, or input audio signal for short).
  • the input signal may include an input signal s_i(t) for each channel i (201). That is, the input signal may comprise a plurality of channels. The plurality of channels are defined by the spatial format.
  • the input signal for channel i (201) may be transformed into the frequency domain by a CQMF transform 202 that outputs S_i(k, f) (frequency-domain input for channel i) 203.
  • the frequency-domain input for channel i (203) may be provided to Blocks 204 and 205.
  • Block 204 may perform functionality similar to block 102 of Fig. 1 and may output the location of object o (211).
  • Block 204 may provide object location information to block 205 for further processing.
  • Block 205 may perform functionality similar to block 103 of Fig. 1.
  • Block 205 may output T_o(k, f) (frequency-domain output for object o) 212, which may then be transformed by a CQMF⁻¹ transform from the frequency domain to the time domain to determine t_o(t) (output signal for object o) 213.
  • Block 205 may further output U_r(k, f) (frequency-domain output residual channel r) 214, which may then be transformed by a CQMF⁻¹ transform from the frequency domain to the time domain to determine u_r(t) (output residual channel r) 215.
  • the frequency-domain transformation is carried out at regular time intervals, so that the transformed signal, S_i(k, f), at block k, is a frequency-domain representation of the input signal in a time interval centred around the corresponding block time.
  • the frequency-domain processing is carried out on a number, n_b, of bands. This is achieved by allocating the set of frequency bins to n_b bands. This grouping may be achieved via a set of n_b gain vectors (see Fig. 3).
  • the Spatial Audio input may define a plurality of n_s channels.
  • the Spatial Audio input is analysed by first computing the covariance matrix of the n_s Spatial Audio signals.
  • the covariance matrix may be determined by block 102 of Fig. 1 and block 204 of Fig. 2.
  • the covariance is computed in each frequency band (frequency subband), b, for each time-block, k.
  • Arranging the n_s frequency-domain input signals into a column vector provides S(k, f) = [S_1(k, f), …, S_{n_s}(k, f)]^T.
  • the covariance (covariance matrix) of the input audio signal may be computed as C_b(k) = Σ_{k'} Σ_f win_b(k − k') band_b(f) S(k', f) S(k', f)^*, where the (·)^* operator denotes the complex-conjugate transpose.
  • C_b(k), for block k, is an [n_s × n_s] matrix, computed from the sum (weighted sum) of the outer products S(k', f) S(k', f)^* of the input audio signal in the frequency domain.
  • the weighting functions (if any), win_b(k − k') and band_b(f), may be chosen so as to apply greater weights to frequency bins around band b and time-blocks around block k.
  • the power and normalized covariance may be calculated, for example, as p_b(k) = tr(C_b(k)) and C̃_b(k) = C_b(k) / p_b(k).
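The following sketch shows one plausible implementation of the banded covariance described above, with a causal time window and per-bin band gains; the trace-based power and normalization at the end are an assumption consistent with the text, not a verbatim reproduction of this document's equations.

```python
import numpy as np

def banded_covariance(S, win, band_gain, k):
    """S: (n_s, n_blocks, n_bins) frequency-domain input S_i(k, f).
    win: time weights win_b(k - k'); band_gain: (n_bins,) weights band_b(f).
    Returns C_b(k) as a weighted sum of outer products, plus its normalization."""
    n_s = S.shape[0]
    C = np.zeros((n_s, n_s), dtype=complex)
    for dk, w_t in enumerate(win):
        kk = k - dk
        if kk < 0:
            break
        X = S[:, kk, :] * np.sqrt(band_gain)   # emphasize bins of band b
        C += w_t * (X @ X.conj().T)            # outer products summed over f
    p = np.trace(C).real                       # band power (assumed: trace)
    return C, C / max(p, 1e-12)                # covariance, normalized covariance
```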
  • the Spatial Audio Input signal is assumed to contain auditory elements (where element c consists of the signal sig_c(t) panned to location loc_c(t)) that are combined according to a panning rule:
  • the Spatial Input Format is defined by the panning function, PS, which takes a unit-vector as input and produces a column vector of length n_s as output.
  • the spatial format defines a plurality of channels (e.g., n_s channels).
  • the panning function (or spatial panning function) is a function for mapping (panning) a source signal at a source location (e.g., incident from the source location) to the plurality of channels defined by the spatial format, as shown in the above example.
  • the panning function (spatial panning function) implements a respective panning rule. Analogous statements apply to the panning function (e.g., panning function PR) of the Residual Output signal described below.
  • the Residual Output signal is assumed to contain auditory elements that are combined according to a panning rule, wherein the panning function, PR, takes a unit-vector as input and produces a column vector of length n_r as output. Note that these panning functions, PS() and PR(), define the characteristics of the Spatial Input Signal and Residual Output Signal respectively, but this does not mean that these signals are necessarily constructed according to the method of Equation 7.
  • In addition to the Spatial Input Format panning function (e.g., PS), it is also useful to derive a Spatial Input Format decoding function (spatial decoding function), DS, which takes a unit vector as input and returns a row-vector of length n_s as output.
  • the function DS(loc) should be defined so as to provide a row-vector suitable for extracting a single audio signal from the multi-channel Spatial Input Signal, corresponding with the audio components around the direction specified by loc.
  • the panner/decoder combination may be configured to provide unity gain, DS(loc) × PS(loc) = 1, for all locations loc on the unit sphere.
  • the average decoded power (integrated over the unit-sphere) may be minimised.
  • the Spatial Input Signal contains audio components that are panned according to the 2nd-order Ambisonics panning rules, as per the panning function shown in Equation 10:
  • the optimal decoding function may be determined as follows:
  • the decoding function DS is an example of a spatial decoding function of the spatial format in the context of the present disclosure.
  • the spatial decoding function of the spatial format is a function for extracting an audio signal at a given location loc (e.g., incident from the given location), from the plurality of channels defined by the spatial format.
  • the spatial decoding function may be defined (e.g., determined, calculated) such that successive application of the spatial panning function (e.g., PS) and the spatial decoding function (e.g., DS) yields unity gain for all locations on the unit sphere.
  • the spatial decoding function may be further defined (e.g., determined, calculated) such that the average decoded power is minimized. Next, the steering function will be described.
  • the Spatial Audio Input signal is assumed to be composed of multiple audio components with respective incident directions of arrival, and hence it is desirable to have a method for estimating the proportion of audio signal that appears in a particular direction, by inspection of the Covariance Matrix.
  • the steering function Steer defined below can provide such an estimate.
  • Some complex Spatial Input Signals will contain a large number of audio components, and the finite spatial resolution of the Spatial Input Format panning function will mean that there may be some fraction of the total Audio Input power that is considered to be "diffuse" (meaning that this fraction of the signal is considered to be spread uniformly in all directions).
  • a function (the steering function) may be defined to estimate, from the covariance matrix, the proportion of the input signal power that is biased towards a given direction.
  • the steering function is based on (e.g., depends on) the covariance matrix C of the input audio signal.
  • This projection function will take on a larger value whenever the normalized covariance matrix corresponds to an input signal with large signal components in the direction near v. Likewise, this projection function will take on a smaller value whenever the normalized covariance matrix corresponds to an input signal with no dominant audio components in the direction near v.
  • this projection function may be used to estimate the proportion of the input signal that is biased towards direction v, by forming a monotonic mapping from the projection function to form the steering function.
  • the diffuse power in the vicinity of the direction v may be determined as follows:
  • the steering function will take on the value 1.0 whenever the Input Spatial Signal is composed entirely of audio components at location v, and it will take on the value 0.0 when the Input Spatial Signal appears to contain no bias towards the direction v.
  • the steering function may be normalized to numerical ranges different from the range [0.0, 1.0].
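One way to realize such a steering function is to project the normalized covariance onto the panning vector for the query direction, and then map the result monotonically so that a lone source at v gives 1.0 and a fully diffuse field gives 0.0. The sketch below is an illustrative assumption, not this document's exact mapping.

```python
import numpy as np

def steer(C, ps, v, sample_dirs):
    """Estimate, in [0, 1], the signal proportion biased towards direction v.
    C: (n_s, n_s) covariance; ps: panning function; sample_dirs: sphere samples."""
    Cn = C / max(np.trace(C).real, 1e-12)               # normalized covariance
    def proj(u):
        p = ps(u)
        return (p.conj() @ Cn @ p).real / (p.conj() @ p).real
    diffuse = np.mean([proj(u) for u in sample_dirs])   # diffuse-field baseline
    # Monotonic mapping: 0 at the diffuse baseline, 1 for a lone source at v.
    return float(np.clip((proj(v) - diffuse) / (1.0 - diffuse + 1e-9), 0.0, 1.0))
```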
  • the Spatial Input Format is a first order Ambisonics format, defined by the panning function:
  • the Residual Output signal may be defined in terms of the same spatial format as the Spatial Input Format (so that the panning functions are the same: PR = PS).
  • the Residual Output signal may be determined by block 103 of Fig. 1 and block 205 of Fig. 2.
  • In this case, a residual downmix matrix R equal to the identity matrix may be defined.
  • Alternatively, the Residual Output signal will be composed of a smaller number of channels than the Spatial Input signal: n_r < n_s.
  • the panning function that defines the residual format will be different to the spatial input panning function.
  • R may be chosen to provide a linear transformation from PS() to PR() (as examples of the spatial panning function of the spatial format and the residual format):
  • For example, R below is the residual downmix matrix that would be applied if the Spatial Input Format is 3rd-order Ambisonics and the Residual Format is 1st-order Ambisonics:
  • Alternatively, R may be chosen to provide a "least-error" mapping. For example, given a set of n_b unit vectors that are approximately uniformly distributed over the unit sphere, a pair of matrices may be formed by stacking together n_b column vectors, where B_S is an [n_s × n_b] array of Spatial Input panning vectors, and B_R is an [n_r × n_b] array of Residual Output panning vectors.
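Given those two stacked matrices, the "least-error" R is the matrix minimizing the Frobenius error between R B_S and B_R, which can be computed with a pseudo-inverse. A minimal sketch, assuming real-valued panning functions:

```python
import numpy as np

def residual_downmix_matrix(ps, pr, dirs):
    """Least-error R with R @ ps(v) ~= pr(v) over the sampled directions.
    ps: input panning, (n_s,) per direction; pr: residual panning, (n_r,)."""
    B_S = np.stack([ps(v) for v in dirs], axis=1)   # (n_s, n_b) input panning vectors
    B_R = np.stack([pr(v) for v in dirs], axis=1)   # (n_r, n_b) residual panning vectors
    return B_R @ np.linalg.pinv(B_S)                # (n_r, n_s) downmix matrix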
  • the processing of method 600 may be performed at each time block k, for example. That is, method 600 may be performed for each predetermined period of time (e.g., for each transformation window of a time-to-frequency transform).
  • the multi-channel, spatial format input audio signal may be an audio signal in a spatial format (spatial audio format) and may comprise multiple channels.
  • the spatial format (spatial audio format) may be, but is not limited to, Ambisonics, HOA, or B-format.
  • At step S610, the input audio signal is analyzed to determine a plurality of object locations of audio objects included in the input audio signal. For example, locations of n_o objects may be determined. This may involve performing the DOL process described below.
  • This step may be performed by either a subband-based approach or a broadband approach.
  • At step S620, for each frequency subband and each object location, a mixing gain is determined for that frequency subband and that object location.
  • the method may further include a step of applying a time-to-frequency transform to a time-domain input audio signal.
  • At step S630, for each frequency subband and each object location, a frequency subband output signal is generated based on the input audio signal, the mixing gain for that frequency subband and that object location, and a spatial mapping function of the spatial format.
  • the spatial mapping function may be the spatial decoding function (e.g., spatial decoding function DS(loc)).
  • At step S640, for each object location, an output signal is generated by summing over the frequency subband output signals for that object location.
  • the object locations may be output as object location metadata.
  • this step may further comprise generating object location metadata indicative of the object locations.
  • the object location metadata may be output together with the output signals.
  • the method may further include a step of applying an inverse time-to-frequency transform to the frequency-domain output signals.
  • the DOL process determines object locations from the Spatial Audio input signal, as follows.
  • At step S710, for each frequency subband, a set of one or more dominant directions of sound arrival is determined. This may involve performing process DOL1 described below.
  • DOL1: For each band, b, determine a set, V_b, of dominant sound-arrival directions. Each dominant sound-arrival direction may have an associated weighting factor indicative of the "confidence" assigned to the respective direction vector.
  • the first step (1), DOL1, may be achieved by a number of different methods. Some alternatives are, for example: DOL1(a):
  • DOL1(b): For some commonly used spatial formats, a single dominant direction of arrival may be determined from the elements of the Covariance matrix.
  • If the Spatial Input Format is a first-order Ambisonics format, defined by the panning function given earlier, then an estimate may be made of the dominant direction of arrival in band b by extracting three elements from the Covariance matrix and then normalizing to form a unit-vector:
  • the processing of DOL1(b) may be said to relate to an example of extracting elements from the covariance matrix of the input audio signal in the relevant frequency subband.
  • DOL1(c): The dominant directions of arrival for band b may be determined by finding all of the local maxima of the projection function:
  • One example method which may be used to search for local maxima operates by refining an initial estimate by a gradient-search method, so as to maximise the value of the projection function. The initial estimates may be found, for example, by evaluating the projection function over a coarse grid of candidate directions (see the sketch below).
  • determining the set of dominant directions of sound arrival may involve at least one of extracting elements from a covariance matrix of the input audio signal in the relevant frequency subband, and determining local maxima of a projection function of the input audio signal in the frequency subband.
  • the projection function may be based on the covariance matrix (e.g., normalized covariance matrix) of the input audio signal and a spatial panning function of the spatial format, for example.
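A coarse-to-fine search of this kind might look as follows; the grid size, step length, and number of seeds are illustrative assumptions.

```python
import numpy as np

def local_maxima_directions(proj, coarse_dirs, n_refine=20, step=0.05, n_seeds=3):
    """proj: projection function of a (near-)unit 3-vector.
    coarse_dirs: candidate unit vectors; returns refined direction estimates."""
    seeds = sorted(coarse_dirs, key=proj, reverse=True)[:n_seeds]
    refined = []
    for v in seeds:
        v = np.asarray(v, dtype=float)
        for _ in range(n_refine):
            # Numerical gradient ascent, re-projected onto the unit sphere.
            g = np.array([(proj(v + step * e) - proj(v - step * e)) / (2 * step)
                          for e in np.eye(3)])
            v = v + step * g
            v /= np.linalg.norm(v)
        refined.append(v)
    return refined
```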
  • At step S720, a union of the sets of the one or more dominant directions for the plurality of frequency subbands is determined. This may involve performing process DOL2 described below. DOL2: From the collection of the dominant sound-arrival directions, form the union of the dominant sound-arrival direction sets of all bands: V = ∪_b V_b.
  • the methods (DOL1(a), DOL1(b) and DOL1(c)) outlined above may be used to determine a set of dominant sound-arrival directions for band b. For each of these directions, a corresponding "confidence factor" (w_{b,1}, w_{b,2}, …) may be determined, indicating how much weighting should be given to each dominant sound-arrival direction.
  • the weighting may be calculated by combining together a number of factors, as follows:
  • In Equation 35, one function provides a "loudness" weighting factor, while the function Steer() provides a "directional-steering" weighting factor that is responsive to the degree to which the input signal contains power in the relevant direction.
  • At step S730, a clustering algorithm is applied to the union of the sets to determine the plurality of object locations. This may involve performing process DOL3 described below.
  • DOL3: Determine the n_o object directions from the weighted set of dominant sound-arrival directions:
  • DOL3 will then determine a number (n_o) of object locations. This can be achieved by a clustering algorithm. If the dominant directions have associated weights, the clustering algorithm may perform weighted clustering of the dominant directions.
  • DOL3(a): A weighted k-means algorithm (see, e.g., Steinley, "K-means clustering: A half-century synthesis", British Journal of Mathematical and Statistical Psychology 59.1 (2006): 1-34) may be used to find a set of n_o centroids, by clustering the set of directions into n_o subsets. This set of centroids is then normalized and permuted to create the set of object locations, according to:
  • DOL3(b): Other clustering algorithms, such as Expectation-Maximization, may be used.
  • the clustering algorithm in step S730 may be one of a k-means algorithm, a weighted k-means algorithm, an expectation-maximization algorithm, and a weighted mean algorithm, for example (a weighted k-means sketch is given below).
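For concreteness, here is a minimal weighted k-means over unit vectors, using the dot product as the similarity measure and re-normalizing centroids onto the unit sphere; the initialization and iteration count are illustrative assumptions.

```python
import numpy as np

def weighted_kmeans_directions(dirs, weights, n_o, n_iter=50, seed=0):
    """dirs: (n, 3) dominant sound-arrival directions (unit vectors, n >= n_o);
    weights: (n,) confidence factors; returns (n_o, 3) object locations."""
    rng = np.random.default_rng(seed)
    V, w = np.asarray(dirs, float), np.asarray(weights, float)
    C = V[rng.choice(len(V), n_o, replace=False)].copy()   # initial centroids
    for _ in range(n_iter):
        labels = np.argmax(V @ C.T, axis=1)        # assign to nearest centroid
        for o in range(n_o):
            m = labels == o
            if m.any():
                c = (w[m, None] * V[m]).sum(axis=0)    # confidence-weighted mean
                C[o] = c / (np.linalg.norm(c) + 1e-12)
    return C
```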
  • Fig. 8 is a flow chart of an example of a method 800 that may optionally be performed in conjunction with the method 600 of Fig. 6, for example after step S640.
  • At step S810, the plurality of output signals are re-encoded into the spatial format to obtain a multi-channel, spatial format audio object signal.
  • At step S820, the audio object signal is subtracted from the input audio signal to obtain a multi-channel, spatial format residual audio signal.
  • At step S830, a downmix is applied to the residual audio signal to obtain a downmixed residual audio signal.
  • the number of channels of the downmixed residual audio signal may be smaller than the number of channels of the input audio signal.
  • Step S830 may be optional.
  • the DOL process determines the locations of n_o objects (o ∈ [1, n_o]) at each time-block k. Based on these object locations, the spatial audio input signals are processed (e.g., at blocks 103 or 205) to form a set of n_o object output signals and n_r residual output signals.
  • This process may be referred to by the shorthand name EOS, and in some embodiments, this process is achieved (e.g., at each time-block k) by the steps EOS1 to EOS6:
  • EOS1: Determine the [n_o × n_s] object-decoding matrix D by stacking n_o row-vectors, one row per object location, each obtained by evaluating DS at that object location:
  • the object-decoding matrix D is an example of a spatial decoding matrix.
  • the spatial decoding matrix includes a plurality of mapping vectors, one mapping vector for each object location. Each of these mapping vectors may be obtained by evaluating a spatial decoding function at the respective object location.
  • the spatial decoding function may be a vector-valued function (e.g., yielding a 1 × n_s row vector if the multi-channel, spatial format input audio signal is defined as an n_s × 1 column vector). EOS2: Determine the [n_s × n_o] object-encoding matrix E by stacking n_o column-vectors:
  • the object-encoding matrix E is an example of a spatial panning matrix.
  • the spatial panning matrix includes a plurality of mapping vectors, one mapping vector for each object location. Each of these mapping vectors may be obtained by evaluating a spatial panning function at the respective object location.
  • the spatial panning function may be a vector-valued function (e.g., yielding an n_s × 1 column vector if the multi-channel, spatial format input audio signal is defined as an n_s × 1 column vector).
  • the object gain matrix G_b may be referred to as a gain matrix in the following.
  • This gain matrix includes the determined mixing gains for frequency subband b.
  • it is a diagonal matrix that has the mixing gains (one for each object location, appropriately ordered) as its diagonal elements.
  • process EOS3 determines, for each frequency subband and for each object location, a mixing gain (e.g., frequency dependent mixing gain) for that frequency subband and that object location.
  • process EOS3 is an example of an implementation of step S620 of method 600 described above.
  • determining the mixing gain for a given frequency subband and a given object location may be based on the given object location and the covariance matrix (e.g., normalized covariance matrix) of the input audio signal in the given frequency subband.
  • Dependence on the covariance matrix may be through the steering function, which is based on (e.g., depends on) the covariance matrix C (or the normalized covariance matrix) of the input audio signal.
  • EOS4: Compute the frequency-domain object output signals, T(k, f), by applying the object decoding matrix and the object gain matrix to the spatial input signals, S(k, f), and by summing over the frequency subbands b:
  • the frequency-domain object output signals, T(k, f), may be referred to as frequency subband output signals.
  • the sum may be a weighted sum, for example.
  • Process EOS4 is an example of an implementation of steps S630 and S640 of method 600 described above.
  • generating the frequency subband output signal for a frequency subband and an object location at step S630 may involve applying a gain matrix (e.g., matrix G b ) and a spatial decoding matrix (e.g., matrix D) to the input audio signal. Therein, the gain matrix and the spatial decoding matrix may be successively applied.
  • EOS5: Compute the frequency-domain residual spatial signals by re-encoding the object output signals, T(k, f), and subtracting this re-encoded signal from the spatial input. Then determine the [n_r × n_s] residual downmix matrix R (for example, via the method of Equation 29), and compute the frequency-domain residual output signals by transforming the residual spatial signals via this residual downmix matrix:
  • process EOS5 is an example of an implementation of steps S810, S820, and S830 of method 800 described above.
  • Re-encoding the plurality of output signals into the spatial format may thus be based on the spatial panning matrix (e.g., matrix E).
  • re-encoding the plurality of output signals into the spatial format may involve applying the spatial panning matrix (e.g., matrix E) to a vector of the plurality of output signals.
  • Applying a downmix to the residual audio signal (e.g., S') may involve applying a downmix matrix (e.g., downmix matrix R) to the residual audio signal.
  • the first 2 steps in the EOS process, EOS1 and EOS2, involve the calculation of matrix coefficients, suitable for extracting object-audio signals from the spatial audio input (using the D matrix), and re-encoding these objects back into the spatial audio format (using the E matrix).
  • These matrices are formed by using the PS() and DS() functions. Examples of these functions (for the case where the input spatial audio format is 2nd-order Ambisonics) are given in Equations 10 and 11. A per-block sketch of the EOS pipeline is given below.
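The sketch below strings the EOS steps together for a single time-block, under the assumptions used throughout this document (real-valued panning, bands that partition the frequency bins); ds, ps, and gains are placeholders for the decoding function, panning function, and EOS3 mixing gains.

```python
import numpy as np

def extract_objects_block(S_sub, locs, ds, ps, gains, R):
    """S_sub: list over bands b of (n_s, n_f) arrays; locs: object locations;
    gains[b][o]: mixing gain for band b, object o; R: (n_r, n_s) downmix."""
    D = np.stack([ds(loc) for loc in locs])            # EOS1: object-decoding matrix
    E = np.stack([ps(loc) for loc in locs], axis=1)    # EOS2: object-encoding matrix
    n_o = len(locs)
    T_bands, U_bands = [], []
    for b, Sb in enumerate(S_sub):
        G_b = np.diag([gains[b][o] for o in range(n_o)])   # EOS3: gain matrix
        T_b = G_b @ D @ Sb                                  # EOS4: object outputs
        S_res = Sb - E @ T_b                                # EOS5: residual spatial
        T_bands.append(T_b)
        U_bands.append(R @ S_res)                           # downmixed residual
    return T_bands, U_bands
```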
  • the EOS3 step may be implemented in a number of ways. Some alternative methods are: EOS3(a): The object gains may be computed using the steering function.
  • the steering function is used to indicate what proportion of the input signal power lies in the direction of each object location.
  • a mixing gain (e.g., frequency dependent mixing gain) for each frequency subband and for each object location can be determined (e.g., calculated).
  • determining the mixing gain for a given frequency subband and a given object location may be based on the given object location and the covariance matrix (e.g., normalized covariance matrix) of the input audio signal in the given frequency subband.
  • Dependence on the covariance matrix may be through the steering function, which is based on (e.g., depends on) the covariance matrix C (or the normalized covariance matrix) of the input audio signal. That is, the mixing gain for the given frequency subband and the given object location may depend on the steering function for the input audio signal in the given frequency band, evaluated at the given object location.
  • determining the mixing gain for the given frequency subband and the given object location may be further based on a change rate of the given object location over time.
  • the mixing gain may be attenuated in dependence on the change rate of the given object location.
  • the object gains may be computed by combining a number of gain-factors (each of which is generally a real value in the range [0, 1]). For example, one such gain-factor may be computed to be approximately equal to 1 whenever the object location is static, and approximately equal to 0 when the object location is "jumping" significantly in the region around time-block k.
  • This gain-factor is intended to attenuate the object amplitude whenever an object location is changing rapidly, as may occur when a new object "appears" at time-block k in a location where no object existed during time-block k − 1.
  • A suitable value for the associated constant is 0.5, and in general it will be chosen to lie between 0.05 and 1.
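A gain-factor with this behavior can be built from the block-to-block movement of the object location; the Gaussian shape and the constant alpha below are illustrative assumptions consistent with the stated range.

```python
import numpy as np

def location_change_gain(loc_k, loc_km1, alpha=0.5):
    """~1 for a static object location, ~0 when the location jumps
    significantly between time-blocks k-1 and k. alpha in [0.05, 1)."""
    jump = np.linalg.norm(np.asarray(loc_k) - np.asarray(loc_km1))
    return float(np.exp(-(jump / alpha) ** 2))   # smooth, monotonic attenuation
```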
  • Fig. 5 illustrates an exemplary method 500 in accordance with present principles.
  • Method 500 includes, at 501, receiving spatial audio information.
  • the spatial audio information may be consistent with the n_s-channel Spatial Audio Format 101 shown in Fig. 1 and s_i(t) (input signal for channel i) 201 shown in Fig. 2.
  • At 502, object locations may be determined based on the received spatial audio information. For example, the object locations may be determined as described in connection with block 102 shown in Fig. 1 and block 204 shown in Fig. 2.
  • Block 502 may output object location metadata 504.
  • the object location metadata 504 may be similar to the object location metadata 111 shown in Fig. 1 and the location of object o (211) shown in Fig. 2.
  • At 503, object audio signals may be extracted based on the received spatial audio information.
  • the object audio signals may be extracted as described in connection with block 103 shown in Fig. 1 and block 205 shown in Fig. 2.
  • Block 503 may output object audio signals 505.
  • the object audio signals 505 may be similar to the object audio signals 112 shown in Fig. 1 and the output signal for object o (213) shown in Fig. 2.
  • Block 503 may further output residual audio signals 506.
  • the residual audio signals 506 may be similar to the residual audio signals 113 shown in Fig. 1 and the output residual channel r (215) shown in Fig. 2.
  • the apparatus may comprise a processor adapted to perform any of the processes described above, e.g., the steps of methods 600, 700, and 800, as well as their respective implementations DOL1 to DOL3 and EOS1 to EOS5.
  • Such apparatus may further comprise a memory coupled to the processor, the memory storing respective instructions for execution by the processor.
  • the methods and systems described in the present document may be implemented as software, firmware and/or hardware. Certain components may e.g. be implemented as software running on a digital signal processor or microprocessor. Other components may e.g. be implemented as hardware and/or as application specific integrated circuits.
  • the signals encountered in the described methods and systems may be stored on media such as random access memory or optical storage media. They may be transferred via networks, such as radio networks, satellite networks, wireless networks or wireline networks, e.g. the Internet. Typical devices making use of the methods and systems described in the present document are portable electronic devices or other consumer equipment which are used to store and/or render audio signals.
  • a first EEE relates to a method for processing a multi-channel, spatial audio format input signal.
  • the method comprises determining object location metadata based on the received spatial audio format input signal, and extracting object audio signals based on the received spatial audio format input signal.
  • the extracting object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.
  • a second EEE relates to a method according to the first EEE, wherein each extracted audio object signal has a corresponding object location metadata.
  • a third EEE relates to a method according to the first or second EEEs, wherein the object location metadata is indicative of the direction-of-arrival of an object.
  • a fourth EEE relates to a method according to any one of the first to third EEEs, wherein the object location metadata is derived from statistics of the received spatial audio format input signal.
  • a fifth EEE relates to a method according to any one of the first to fourth EEEs, wherein the object location metadata changes from time to time.
  • a sixth EEE relates to a method according to any one of the first to fifth EEEs, wherein the object audio signals are determined based on a linear mixing matrix in each of a number of sub-bands of the received spatial audio format input signal.
  • a seventh EEE relates to a method according to any one of the first to sixth EEEs, wherein the residual signal is a multi-channel residual signal.
  • An eighth EEE relates to a method according to the seventh EEE, wherein the multi-channel residual signal is composed of a number of channels that is less than a number of channels of the received spatial audio format input signal.
  • a ninth EEE relates to a method according to any one of the first to eighth EEEs, wherein extracting object audio signals includes subtracting the contribution of the said object audio signals from the said spatial audio format input signal.
  • a tenth EEE relates to a method according to any one of the first to ninth EEEs, wherein extracting object audio signals includes determining linear mixing matrix coefficients that may be used by subsequent processing to create the one or more object audio signals and the residual signal.
  • An eleventh EEE relates to a method according to any one of the first to tenth EEEs, wherein the matrix coefficients are different for each frequency band.
  • a twelfth EEE relates to an apparatus for processing a multi-channel, spatial audio format input signal.
  • the apparatus comprises a processor for determining object location metadata based on the received spatial audio format input signal, and an extractor for extracting object audio signals based on the received spatial audio format input signal.
  • the extracting object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Stereophonic System (AREA)

Abstract

Apparatus, computer readable media and methods for processing a multi-channel, spatial audio format input signal. For example, one such method comprises determining object location metadata based on the received spatial audio format input signal; and extracting object audio signals based on the received spatial audio format input signal, wherein the extracting object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.

Description

PROCESSING OF A MULTI-CHANNEL SPATIAL AUDIO FORMAT INPUT SIGNAL
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of priority from U.S. Provisional Patent Application No. 62/598,068 filed on December 13, 2017, European Patent Application No.
17179315.1 filed July 3, 2017, and U.S. Provisional Patent Application No. 62/503,657 filed May 9, 2017, each of which is incorporated herein by reference. TECHNICAL FIELD
The present disclosure relates to immersive audio format conversion, including conversion of a spatial audio format (for example, Ambisonics, Higher Order
Ambisonics, or B-format) to an object-based format (for example Dolby's Atmos format).
SUMMARY
The present document addresses the technical problem of converting a spatial audio format (for example, Ambisonics, Higher Order Ambisonics, or B-format) to an object-based format (e.g., Dolby's Atmos format). In this regard, the term "spatial audio format", as used throughout the specification and claims, particularly relates to audio formats providing loudspeaker-independent signals which represent directional characteristics of a sound field recorded at one or more locations. Moreover, the term "object-based format", as used throughout the specification and claims, particularly relates to audio formats providing loudspeaker-independent signals which represent sound sources.
An aspect of the document relates to a method of processing a multi-channel, spatial format input audio signal (i.e., an audio signal in a spatial format (spatial audio format) which includes multiple channels). The spatial format (spatial audio format) may be Ambisonics, Higher Order Ambisonics (HOA), or B-format, for example. The method may include analyzing the input audio signal to determine a plurality of object locations of audio objects included in the input audio signal. The object locations may be spatial locations, e.g., indicated by 3-vectors in Cartesian or spherical coordinates. Alternatively, the object locations may be indicated in two dimensions, depending on the application.
The method may further include, for each of a plurality of frequency subbands of the input audio signal, determining, for each object location, a mixing gain for that frequency subband and that object location. To this end, the method may include applying a time-to-frequency transform to the input audio signal and arranging the resulting frequency coefficients into frequency subbands. Alternatively, the method may include applying a filterbank to the input audio signal. The mixing gains may be referred to as object gains. The method may further include, for each frequency subband, generating, for each object location, a frequency subband output signal based on the input audio signal, the mixing gain for that frequency subband and that object location, and a spatial mapping function of the spatial format. The spatial mapping function may be a spatial decoding function, for example spatial decoding function DS(loc). The method may yet further include, for each object location, generating an output signal by summing over the frequency subband output signals for that object location. The sum may be a weighted sum. The object locations may be output as object location metadata (e.g., object location metadata indicative of the object locations may be generated and output). The output signals may be referred to as object signals or object channels. The above processing may be performed for each predetermined period of time (e.g., for each time-block, or each transformation window of a time-to-frequency transform).
Typically, known approaches for format conversion from a spatial format to an object- based format apply a broadband approach when extracting audio object signals associated with a set of dominant directions. By contrast, the proposed method applies a subband-based approach for determining the audio object signals. Configured as such, the proposed method can provide clear panning/steering decisions per subband.
Thereby, increased discreteness in directions of audio objects can be achieved, and there is less "smearing" in the resulting audio objects. For example, after determining the dominant directions (possibly using a broadband approach or using a subband-based approach), it may turn out that a certain audio object is panned to one dominant direction in a first frequency subband, but is panned to another dominant direction in a second frequency subband. This different panning behavior of the audio object in different subbands would not be captured by known approaches for format conversion, resulting in decreased discreteness of directivity and increased smearing.
In some examples, the mixing gains for the object locations may be frequency-dependent.
In some examples, the spatial format may define a plurality of channels. Then, the spatial mapping function may be a spatial decoding function of the spatial format for extracting an audio signal at a given location, from the plurality of the channels of the spatial format. "At a given location" shall mean "incident from the given location", for example.
In some examples, a spatial panning function of the spatial format may be a function for mapping a source signal at a source location to the plurality of channels defined by the spatial format. "At a source location" shall mean "incident from the source location", for example. Mapping may be referred to as panning. The spatial decoding function may be defined such that successive application of the spatial panning function and the spatial decoding function yields unity gain for all locations on the unit sphere. The spatial decoding function may be further defined such that the average decoded power is minimized.
In some examples, determining the mixing gain for a given frequency subband and a given object location may be based on the given object location and a covariance matrix of the input audio signal in the given frequency subband.
In some examples, the mixing gain for the given frequency subband and the given object location may depend on a steering function for the input audio signal in the given frequency subband, evaluated at the given object location. In some examples, the steering function may be based on the covariance matrix of the input audio signal in the given frequency subband.
In some examples, determining the mixing gain for the given frequency subband and the given object location may be further based on a change rate of the given object location over time. The mixing gain may be attenuated in dependence on the change rate of the given object location. For instance, the mixing gain may be attenuated if the change rate is high, and may not be attenuated for a static object location. In some examples, generating, for each frequency subband and for each object location, the frequency subband output signal may involve applying a gain matrix and a spatial decoding matrix to the input audio signal. The gain matrix and the spatial decoding matrix may be successively applied. The gain matrix may include the determined mixing gains for that frequency subband. For example, the gain matrix may be a diagonal matrix, with the mixing gains as its diagonal elements, appropriately ordered. The spatial decoding matrix may include a plurality of mapping vectors, one for each object location. Each mapping vector may be obtained by evaluating the spatial decoding function at a respective object location. For example, the spatial decoding function may be a vector-valued function (e.g., yielding a 1 × n_s row vector if the multi-channel, spatial format input audio signal is defined as an n_s × 1 column vector).
In some examples, the method may further include re-encoding the plurality of output signals into the spatial format to obtain a multi-channel, spatial format audio object signal. The method may yet further include subtracting the audio object signal from the input audio signal to obtain a multi-channel, spatial format residual audio signal. The spatial format residual signal may be output together with the output signals and location metadata, if any.
In some examples, the method may further include applying a downmix to the residual audio signal to obtain a downmixed residual audio signal. The number of channels of the downmixed residual audio signal may be smaller than the number of channels of the input audio signal. The downmixed spatial format residual signal may be output together with the output signals and location metadata, if any.
In some examples, analyzing the input audio signal may involve, for each frequency subband, determining a set of one or more dominant directions of sound arrival.
Analyzing the input audio signal may further involve determining a union of the sets of the one or more dominant directions for the plurality of frequency subbands. Analyzing the input audio signal may yet further involve applying a clustering algorithm to the union of the sets to determine the plurality of object locations. In some examples, determining the set of dominant directions of sound arrival may involve at least one of: extracting elements from the covariance matrix of the input audio signal in the frequency subband, and determining local maxima of a projection function of the input audio signal in the frequency subband. The projection function may be based on the covariance matrix of the input audio signal and a spatial panning function of the spatial format.
In some examples, each dominant direction may have an associated weight. Then, the clustering algorithm may perform weighted clustering of the dominant directions. Each weight may be indicative of a confidence value for its dominant direction, for example. The confidence value may indicate a likelihood of whether an audio object is actually located at the object location.
In some examples, the clustering algorithm may be one of a k-means algorithm, a weighted k-means algorithm, an expectation-maximization algorithm, and a weighted mean algorithm.
In some examples, the method may further include generating object location metadata indicative of the object locations. The object location metadata may be output together with the output signals and the (downmixed) spatial format residual signal, if any.

Another aspect of the document relates to an apparatus for processing a multi-channel, spatial format input audio signal. The apparatus may include a processor. The processor may be adapted to analyze the input audio signal to determine a plurality of object locations of audio objects included in the input audio signal. The processor may be further adapted to, for each of a plurality of frequency subbands of the input audio signal, determine, for each object location, a mixing gain for that frequency subband and that object location. The processor may be further adapted to, for each frequency subband, generate, for each object location, a frequency subband output signal based on the input audio signal, the mixing gain for that frequency subband and that object location, and a spatial mapping function of the spatial format. The processor may be yet further adapted to, for each object location, generate an output signal by summing over the frequency subband output signals for that object location. The apparatus may further comprise a memory coupled to the processor. The memory may store respective instructions for execution by the processor.
Another aspect of the document relates to a software program. The software program may be adapted for execution on a processor and for performing the method steps outlined in the present document when carried out on the processor. Another aspect of the document relates to a storage medium. The storage medium may comprise a software program adapted for execution on a processor and for performing the method steps outlined in the present document when carried out on the processor.
Another aspect of the document relates to a computer program product. The computer program may comprise executable instructions for performing the method steps outlined in the present document when executed on a computer.
Another aspect of the present document relates to a method for processing a multi-channel, spatial audio format input signal, the method comprising determining object location metadata based on the received spatial audio format input signal; and extracting object audio signals based on the received spatial audio format input signal. The extracting of object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.
Each extracted audio object signal may have a corresponding object location metadata.
The object location metadata may be indicative of the direction-of-arrival of an object. The object location metadata may be derived from statistics of the received spatial audio format input signal. The object location metadata may change from time to time.
The object audio signals may be determined based on a linear mixing matrix in each of a number of sub-bands of the received spatial audio format input signal. The residual signal may be a multi-channel residual signal that may be composed of a number of channels that is less than a number of channels of the received spatial audio format input signal.
The extracting of object audio signals may be determined by subtracting the contribution of said object audio signals from said spatial audio format input signal. The extracting of object audio signals may also include determining linear mixing matrix coefficients that may be used by subsequent processing to create the one or more object audio signals and the residual signal. The matrix coefficients may be different for each frequency band.
Another aspect of the present document relates to an apparatus for processing a multichannel, spatial audio format input signal, the apparatus comprising a processor for determining object location metadata based on the received spatial audio format input signal; and an extractor for extracting object audio signals based on the received spatial audio format input signal, wherein the extracting object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.
It should be noted that the methods and systems, including their embodiments as outlined in the present patent application, may be used stand-alone or in combination with the other methods and systems disclosed in this document. Furthermore, all aspects of the methods and systems outlined in the present patent application may be arbitrarily combined. In particular, the features of the claims may be combined with one another in an arbitrary manner.

BRIEF DESCRIPTION OF THE DRAWINGS
The invention is explained below in an exemplary manner with reference to the accompanying drawings, wherein
Fig. 1 illustrates an exemplary conceptual block diagram illustrating an aspect of the present invention;
Fig. 2 illustrates an exemplary conceptual block diagram illustrating an aspect of the present invention relating to frequency-domain transforms;
Fig. 3 illustrates an exemplary diagram of Frequency-domain Banding Gains, band_b(f);
Fig. 4 illustrates an exemplary diagram of a Time-window for covariance calculation, win_b(k);
Fig. 5 shows a flow chart of an exemplary method for converting a spatial audio format (for example, Ambisonics, HOA, or B-format) to an object-based audio format (for example, Dolby's Atmos format);
Fig. 6 shows a flow chart of another example of a method for converting a spatial audio format to an object-based audio format;
Fig. 7 is a flow chart of an example of a method that implements steps of the method of Fig. 6; and
Fig. 8 is a flow chart of an example of a method that may be performed in conjunction with the method of Fig. 6.
DETAILED DESCRIPTION
Fig. 1 illustrates an exemplary conceptual block diagram of a system 100 according to the present invention. The system 100 receives as input an n_s-channel Spatial Audio Format signal 101. The Spatial Audio Format 101 may be a B-format, an Ambisonics format, or an HOA format. The output of the system 100 may include:
• n_o audio output channels, representing n_o audio objects;
• Location data, specifying the time-varying location of the n_o objects;
• A set of n_r residual audio channels, representing the original soundfield with the n_o objects removed.
The system 100 may include a first processing block 102 for determining object locations and a second processing block 103 for extracting object audio signals. Block 102 may be configured to include processing for analyzing the Spatial Audio signal 101 and determining the location of a number (n_o) of objects, at regular instances in time (defined by the time-interval, τ_m). That is, the processing may be performed for each predetermined period of time. For example, the location of object o (1 ≤ o ≤ n_o) at time t = kτ_m is given by the 3-vector:

$$\vec{v}_o(k) = \big( x_o(k),\ y_o(k),\ z_o(k) \big)^T$$
Depending on the application (e.g., for planar configurations), the location of object o (1 ≤ o ≤ n_o) at time t = kτ_m may be given by a 2-vector. Block 102 may output the object location metadata 111 and may provide object location information to block 103 for further processing.
Block 103 may be configured to include processing for processing the Spatial Audio signal (input audio signal) 101 to extract n_o audio signals (output signals, object signals, or object channels) 112 that represent the n_o audio objects (with locations defined by $\vec{v}_o(k)$, where 1 ≤ o ≤ n_o). The n_r-channel residual audio signal (spatial format residual audio signal or downmixed spatial format residual audio signal) 113 is also provided as output of this second stage.
Fig. 2 illustrates an exemplary conceptual block diagram illustrating an aspect of the present invention relating to frequency-domain transforms. In a preferred embodiment, the input and output audio signals are processed in the Frequency Domain (for example, by using CQMF-transformed signals). The indices used in Fig. 2 and in the following are: i for the input channel, k for the time-block, f for the frequency bin, b for the frequency band, o for the object, and r for the residual channel.
Fig. 2 shows the transformations into and out of the frequency domain. In this Figure, the CQMF and CQMF⁻¹ transforms are shown, but other frequency-domain transformations are known in the art and may be applicable in this situation. Also, a filterbank may be applied to the input audio signal, for example.
In one example, Fig. 2 illustrates a system 200 that receives an input signal (e.g., a multi-channel, spatial format input audio signal, or input audio signal for short). The input signal may include an input signal s_i(t) for each channel i, 201. That is, the input signal may comprise a plurality of channels. The plurality of channels are defined by the spatial format. The input signal for channel i 201 may be transformed into the frequency domain by a CQMF transform 202 that outputs S_i(k, f) (frequency-domain input for channel i) 203. The frequency-domain input for channel i 203 may be provided to blocks 204 and 205. Block 204 may perform functionality similar to block 102 of Fig. 1 and may output $\vec{v}_o(k)$ (location of object o) 211. The output 211 may be a set of outputs (e.g., for o = 1, 2, ..., n_o). Block 204 may provide object location information to block 205 for further processing. Block 205 may perform functionality similar to block 103 of Fig. 1. Block 205 may output T_o(k, f) (frequency-domain output for object o) 212, which may then be transformed by a CQMF⁻¹ transform from the frequency domain to the time domain to determine t_o(t) (output signal for object o) 213. Block 205 may further output U_r(k, f) (frequency-domain output residual channel r) 214, which may then be transformed by a CQMF⁻¹ transform from the frequency domain to the time domain to determine u_r(t) (output residual channel r) 215.
The frequency-domain transformation is carried out at regular time intervals, τ_f, so that the transformed signal, S_i(k, f), at block k, is a frequency-domain representation of the input signal in a time interval centred around the time:

$$t = k\,\tau_f$$
In some embodiments, the frequency-domain processing is carried out on a number, n_b, of bands. This is achieved by allocating the set of n_f frequency bins, f ∈ {1, ..., n_f}, to n_b bands. This grouping may be achieved via a set of n_b gain vectors, band_b(f), as shown in Figure 3. In this example, n_f = 64 and n_b = 13.
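The banding gains plotted in Figure 3 are not reproduced in this text. As a purely illustrative sketch, the following Python builds a hypothetical set of n_b = 13 overlapping triangular bands over n_f = 64 bins; the log-spaced edges and triangular shapes are assumptions, not the patent's actual gains.

```python
import numpy as np

def make_banding_gains(nf=64, nb=13):
    """Build a (nb, nf) matrix of banding gains band_b(f).

    Hypothetical triangular bands with log-spaced edges; each row peaks
    at 1.0 at the band centre and falls to 0.0 at the band edges.
    """
    edges = np.geomspace(1, nf, nb + 1)      # log-spaced band edges (in bins)
    gains = np.zeros((nb, nf))
    bins = np.arange(1, nf + 1)
    for b in range(nb):
        lo, hi = edges[b], edges[b + 1]
        centre = np.sqrt(lo * hi)
        up = (bins - lo) / (centre - lo + 1e-9)
        down = (hi - bins) / (hi - centre + 1e-9)
        gains[b] = np.clip(np.minimum(up, down), 0.0, 1.0)
    return gains

gains = make_banding_gains()
print(gains.shape)  # (13, 64)
```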
The Spatial Audio input (input audio signal) may define a plurality of n_s channels. In some embodiments, the Spatial Audio input is analysed by first computing the covariance matrix of the n_s Spatial Audio signals. The covariance matrix may be determined by block 102 of Fig. 1 and block 204 of Fig. 2. In the example described here, the covariance is computed in each frequency band (frequency subband), b, for each time-block, k. Arranging the n_s frequency-domain input signals into a column vector provides:

$$S(k, f) = \big[ S_1(k, f),\ S_2(k, f),\ \dots,\ S_{n_s}(k, f) \big]^T \qquad \text{(Equation 3)}$$

As a non-limiting example, the covariance (covariance matrix) of the input audio signal may be computed as follows:

$$C_b(k) = \sum_{k' \le k} \sum_{f} win_b(k - k')\, band_b(f)\ S(k', f)\, S(k', f)^*$$
where the * operator denotes the complex-conjugate transpose.
In general, the covariance, C_b(k), for block k, is a [n_s × n_s] matrix, computed from the sum (weighted sum) of the outer products $S(k', f)\, S(k', f)^*$ of the input audio signal in the frequency domain. The weighting functions (if any), win_b(k − k') and band_b(f), may be chosen so as to apply greater weights to frequency bins around band b and time-blocks around block k.
A typical time-window, win_b(k), is shown in Figure 4. In this example, win_b(k) = 0 for k < 0, ensuring that the covariance calculation is causal (so the calculation of the covariance for block k depends only on the frequency-domain input signal at block k or earlier).
The power and normalized covariance may be calculated as follows:

$$P_b(k) = tr\big( C_b(k) \big), \qquad \bar{C}_b(k) = \frac{C_b(k)}{tr\big( C_b(k) \big)}$$

where tr(·) denotes the trace of the matrix.
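A minimal numpy sketch of the covariance computation as reconstructed above (a windowed, band-weighted sum of outer products, followed by the band power and normalized covariance); the block-history buffering scheme and the window values are assumptions:

```python
import numpy as np

def banded_covariance(S_hist, band_gain, win):
    """Covariance C_b(k) for one band as a weighted sum of outer products.

    S_hist    : (K, ns, nf) complex frequency-domain blocks, most recent last.
    band_gain : (nf,) banding gains band_b(f).
    win       : (K,) causal time-window weights, oldest weight first.
    """
    K, ns, nf = S_hist.shape
    C = np.zeros((ns, ns), dtype=complex)
    for k in range(K):
        for f in range(nf):
            v = S_hist[k, :, f:f + 1]          # (ns, 1) column vector S(k', f)
            C += win[k] * band_gain[f] * (v @ v.conj().T)
    return C

def power_and_normalized(C):
    P = np.real(np.trace(C))                   # band power P_b(k)
    return P, C / max(P, 1e-12)                # normalized covariance C̄_b(k)

rng = np.random.default_rng(0)
S_hist = rng.standard_normal((4, 9, 64)) + 1j * rng.standard_normal((4, 9, 64))
C = banded_covariance(S_hist, np.ones(64), np.array([0.1, 0.2, 0.3, 0.4]))
P, Cbar = power_and_normalized(C)
print(P, Cbar.shape)  # scalar power, (9, 9) matrix
```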
Next, the Panning Functions that define the Input Format and the Residual Format will be described. The Spatial Audio Input signal is assumed to contain auditory elements (where element c consists of the signal sig_c(t) panned to location loc_c(t)) that are combined according to a panning rule:

$$s(t) = \sum_c PS\big( loc_c(t) \big)\, sig_c(t) \qquad \text{(Equation 7)}$$

so that the Spatial Input Format is defined by the panning function, PS, which takes a unit-vector as input and produces a column vector of length n_s as output.
In general, the spatial format (spatial audio format) defines a plurality of channels (e.g., ns. channels). The panning function (or spatial panning function) is a function for mapping (panning) a source signal at a source location (e.g., incident from the source location) to the plurality of channels defined by the spatial format, as shown in the above example. At this, the panning function (spatial panning function) implements a respective panning rule. Analogous statements apply to the panning function (e.g., panning function PR) of the Residual Output signal described below.
Similarly, the Residual Output signal is assumed to contain auditory elements that are combined according to a panning rule, defined by the panning function, PR, which takes a unit-vector as input and produces a column vector of length n_r as output. Note that these panning functions, PS() and PR(), define the characteristics of the Spatial Input Signal and Residual Output Signal respectively, but this does not mean that these signals are necessarily constructed according to the method of Equation 7. In some embodiments, the number of channels n_r of the Residual Output signal and the number of channels n_s of the Spatial Input Signal may be equal: n_r = n_s.
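As a concrete illustration of the Equation 7 construction, the following sketch pans two static auditory elements into an n_s = 4 channel signal, under an assumed first-order panning convention PS(v) = [1, x, y, z] (the patent's actual PS for each format is given by its own equations):

```python
import numpy as np

def foa_pan(v):
    """Assumed first-order panning vector PS(v) = [1, x, y, z]."""
    return np.array([1.0, v[0], v[1], v[2]])

# Equation 7-style construction: two panned auditory elements, mixed
t = np.arange(48000) / 48000.0
sig1 = np.sin(2 * np.pi * 440 * t)           # element 1
sig2 = np.sin(2 * np.pi * 311 * t)           # element 2
loc1 = np.array([1.0, 0.0, 0.0])             # unit-vector locations
loc2 = np.array([0.0, 0.0, 1.0])
s = np.outer(foa_pan(loc1), sig1) + np.outer(foa_pan(loc2), sig2)
print(s.shape)                               # (4, 48000): ns channels
```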
Next, the Input Decoding Function will be described.
Given the Spatial Input Format panning function, PS, it is also useful to derive a Spatial Input Format decoding function (spatial decoding function), DS, which takes a unit vector as input and returns a row-vector of length n_s as output. The function DS(loc) should be defined so as to provide a row-vector suitable for extracting a single audio signal from the multi-channel Spatial Input Signal, corresponding with the audio components around the direction specified by loc.
Generally, the panner/decoder combination may be configured to provide unity gain:

$$DS(loc) \cdot PS(loc) = 1 \quad \text{for all unit vectors } loc$$

Moreover, the average decoded power (integrated over the unit-sphere) may be minimised:

$$\oint_{S^2} \big| DS(loc) \cdot PS(\vec{v}) \big|^2\, d\vec{v} \;\rightarrow\; \min$$
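Under these two constraints the decoder can be computed numerically for any direction. The sketch below assumes the first-order panning convention PS(v) = [1, x, y, z] and uses an MVDR-style closed form for the constrained minimization; the patent states the constraints, not this particular construction.

```python
import numpy as np

def foa_pan(v):
    """Assumed first-order panning vector PS(v) = [1, x, y, z]."""
    return np.array([1.0, v[0], v[1], v[2]])

def diffuse_covariance(pan, dim, n=5000, seed=0):
    """Monte-Carlo estimate of the diffuse-field covariance of a panner."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    C = np.zeros((dim, dim))
    for v in dirs:
        p = pan(v)[:, None]
        C += p @ p.T
    return C / n

def min_power_decoder(pan, loc, C_diff):
    """Row vector DS(loc) minimizing average decoded power subject to
    unity gain DS(loc)·PS(loc) = 1 (an assumed MVDR-style closed form)."""
    p = pan(loc)
    w = np.linalg.solve(C_diff, p)
    return w / (p @ w)

C_diff = diffuse_covariance(foa_pan, dim=4)
loc = np.array([0.0, 1.0, 0.0])
ds = min_power_decoder(foa_pan, loc, C_diff)
print(ds @ foa_pan(loc))   # ~1.0: unity gain at loc
```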
Assuming, for example, that the Spatial Input Signal contains audio components that are panned according to the 2nd-order Ambisonics panning rules, as per the panning function shown in Equation 10 (which maps a unit vector to a column vector of nine spherical-harmonic gains), the optimal decoding function may be determined as per Equation 11.
The decoding function DS is an example of a spatial decoding function of the spatial format in the context of the present disclosure. In general, the spatial decoding function of the spatial format is a function for extracting an audio signal at a given location loc (e.g., incident from the given location) from the plurality of channels defined by the spatial format. The spatial decoding function may be defined (e.g., determined, calculated) such that successive application of the spatial panning function (e.g., PS) and the spatial decoding function (e.g., DS) yields unity gain for all locations on the unit sphere. The spatial decoding function may be further defined (e.g., determined, calculated) such that the average decoded power is minimized. Next, the steering function will be described.
The Spatial Audio Input signal is assumed to be composed of multiple audio components with respective incident directions of arrival, and hence it is desirable to have a method for estimating the proportion of the audio signal that appears in a particular direction, by inspection of the Covariance Matrix. The steering function, Steer, defined below, can provide such an estimate.
Some complex Spatial Input Signals will contain a large number of audio components, and the finite spatial resolution of the Spatial Input Format panning function will mean that there may be some fraction of the total Audio Input power that is considered to be "diffuse" (meaning that this fraction of the signal is considered to be spread uniformly in all directions). Hence, for any given direction of arrival v, it is desirable to be able to make an estimation of the amount of the Spatial Audio Input signal that is present in the region around the vector v, excluding the estimated diffuse amount.
A function (the steering function), $Steer(\bar{C}, \vec{v})$, may be defined such that the function will take on the value 1.0 whenever the Input Spatial Signal is composed entirely of audio components at location $\vec{v}$, and will take on the value 0.0 when the Input Spatial Signal appears to contain no bias towards the direction $\vec{v}$.
In general, the steering function is based on (e.g., depends on) the covariance matrix C of the input audio signal. Also, the steering function may be normalized to numerical ranges different from the range [0.0, 1.0].
It is common to estimate the fraction of the power in a specific direction, $\vec{v}$, in a soundfield with normalized covariance $\bar{C}$, by using the projection function:

$$proj(\bar{C}, \vec{v}) = DS(\vec{v})\ \bar{C}\ DS(\vec{v})^*$$
This projection function will take on a larger value whenever the normalized covariance matrix corresponds to an input signal with large signal components in directions near $\vec{v}$. Likewise, this projection function will take on a smaller value whenever the normalized covariance matrix corresponds to an input signal with no dominant audio components in directions near $\vec{v}$.
Hence, this projection function may be used to estimate the proportion of the input signal that is biased towards direction $\vec{v}$, by forming a monotonic mapping from the projection function to the steering function, $Steer(\bar{C}, \vec{v})$.
In order to determine this monotonic mapping, the expected value of the function $proj(\bar{C}, \vec{v})$ should first be estimated for two hypothetical use cases: (1) when the input signal contains a diffuse soundfield, and (2) when the input signal contains a single sound component in the direction of $\vec{v}$. The following explanation leads to the definition of the function $Steer(\bar{C}, \vec{v})$, as described in connection with Equations 20 and 21, based on the DiffusePower and SteerPower, as defined in Equations 16 and 19 below.
Given any input panning function (e.g., PS()), it is possible to determine the average covariance (representing the covariance of a diffuse soundfield):

$$C_{diff} = \frac{1}{4\pi} \oint_{S^2} PS(\vec{v})\, PS(\vec{v})^*\ d\vec{v}$$

The normalized covariance for a diffuse soundfield may be computed as follows:

$$\bar{C}_{diff} = \frac{C_{diff}}{tr(C_{diff})}$$
When the projection function is applied to a diffuse soundfield, the diffuse power in the vicinity of the direction $\vec{v}$ may be determined as follows:

$$DiffusePower(\vec{v}) = proj(\bar{C}_{diff}, \vec{v}) \qquad \text{(Equation 16)}$$
Typically, $DiffusePower(\vec{v})$ will be a real constant that is independent of the direction, $\vec{v}$, and hence it may be precomputed, being derived only from the definition of the soundfield input panning function and decode function, PS() and DS() (as examples of the spatial panning function and the spatial decoding function).
Assuming that a spatial input signal is composed of a single audio component that is located at direction $\vec{v}$, then the resulting covariance matrix will be:

$$C_{\vec{v}} = PS(\vec{v})\, PS(\vec{v})^*$$

and the normalized covariance will be:

$$\bar{C}_{\vec{v}} = \frac{C_{\vec{v}}}{tr(C_{\vec{v}})}$$

and hence the proj() function can be applied to determine the SteerPower:

$$SteerPower(\vec{v}) = proj(\bar{C}_{\vec{v}}, \vec{v}) \qquad \text{(Equation 19)}$$
Typically, $SteerPower(\vec{v})$ will be a real constant, and hence it may be precomputed, being derived only from the definition of the soundfield input panning function and decode function, PS() and DS() (as examples of the spatial panning function and the spatial decoding function). An estimate of the degree to which the Input Spatial Signal contains a dominant signal from the direction $\vec{v}$ is formed by computing the scaled-projection function, and thence the steering function:

$$sproj(\bar{C}, \vec{v}) = \frac{proj(\bar{C}, \vec{v}) - DiffusePower(\vec{v})}{SteerPower(\vec{v}) - DiffusePower(\vec{v})} \qquad \text{(Equation 20)}$$

$$Steer(\bar{C}, \vec{v}) = \min\!\big( 1,\ \max( 0,\ sproj(\bar{C}, \vec{v}) ) \big) \qquad \text{(Equation 21)}$$
Generally speaking, the steering function, $Steer(\bar{C}, \vec{v})$, will take on the value 1.0 whenever the Input Spatial Signal is composed entirely of audio components at location $\vec{v}$, and it will take on the value 0.0 when the Input Spatial Signal appears to contain no bias towards the direction $\vec{v}$. As noted above, the steering function may be normalized to numerical ranges different from the range [0.0, 1.0].
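To make the above concrete, the following sketch evaluates proj, DiffusePower, SteerPower, and the steering function for an assumed first-order convention with a unity-gain decoder DS(v) = ½[1, x, y, z]; the clamped linear rescaling mirrors the reconstruction of Equations 20 and 21 above.

```python
import numpy as np

def foa_pan(v):   # assumed first-order convention [1, x, y, z]
    return np.array([1.0, *v])

def foa_dec(v):   # a unity-gain (not necessarily minimum-power) decoder
    return 0.5 * np.array([1.0, *v])

def proj(C_bar, v):
    """Projection DS(v) · C̄ · DS(v)* of the normalized covariance."""
    d = foa_dec(v)
    return float(np.real(d @ C_bar @ d.conj()))

def steer(C_bar, v, C_bar_diff):
    """Sketch of the steering function: the projection rescaled so a
    diffuse field maps to 0 and a single component at v maps to 1,
    then clamped to [0, 1] (the clamping is an assumed detail)."""
    diffuse_power = proj(C_bar_diff, v)
    p = foa_pan(v)[:, None]
    C_single = (p @ p.T) / np.trace(p @ p.T)
    steer_power = proj(C_single, v)
    s = (proj(C_bar, v) - diffuse_power) / (steer_power - diffuse_power)
    return min(1.0, max(0.0, s))

# diffuse normalized covariance for this format: diag(1/2, 1/6, 1/6, 1/6)
C_bar_diff = np.diag([0.5, 1 / 6, 1 / 6, 1 / 6])
v = np.array([0.0, 1.0, 0.0])
p = foa_pan(v)[:, None]
C_single = (p @ p.T) / np.trace(p @ p.T)
print(steer(C_single, v, C_bar_diff))    # 1.0 (pure component at v)
print(steer(C_bar_diff, v, C_bar_diff))  # 0.0 (diffuse field)
```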
In some embodiments, when the Spatial Input Format is a first-order Ambisonics format, defined by the panning function:

$$PS\big( (x, y, z)^T \big) = (1,\ x,\ y,\ z)^T$$

and a suitable decoding function is:

$$DS\big( (x, y, z)^T \big) = \tfrac{1}{2}\,(1,\ x,\ y,\ z)$$

then, with DiffusePower = 1/6 and SteerPower = 1/2 for this format, the Steer() function may be defined as:

$$Steer(\bar{C}, \vec{v}) = \min\!\big( 1,\ \max( 0,\ 3\, proj(\bar{C}, \vec{v}) - \tfrac{1}{2} ) \big)$$
Next, the Residual Format will be described. In some embodiments, the Residual Output signal may be defined in terms of the same spatial format as the Spatial Input Format, so that the panning functions are the same: $PR(\vec{v}) = PS(\vec{v})$. The Residual Output signal may be determined by block 103 of Fig. 1 and block 205 of Fig. 2. In this case the number of residual channels will be equal to the number of input channels: n_r = n_s. Furthermore, in this case, a residual downmix matrix R = I (the [n_s × n_s] identity matrix) may be defined.
In some embodiments, the Residual Output signal will be composed of a smaller number of channels than the Spatial Input signal: n_r < n_s. In this case, the panning function that defines the residual format will be different from the spatial input panning function. In addition, it is desirable to form a [n_r × n_s] mixdown matrix, R, suitable for converting an n_s-channel Spatial Input signal to an n_r-channel residual output signal.
Preferably, R may be chosen to provide a linear transformation from PS() to PR() (as examples of the spatial panning functions of the spatial format and the residual format):

$$PR(\vec{v}) = R \cdot PS(\vec{v}) \quad \text{for all unit vectors } \vec{v} \qquad \text{(Equation 25)}$$
An example of a matrix, R, defined as per Equation 25, is the residual downmix matrix that would be applied if the Spatial Input Format is 3rd-order Ambisonics (n_s = 16) and the Residual Format is 1st-order Ambisonics (n_r = 4): with the first-order components occupying the first four channels, R = [ I_4 | 0 ], the [4 × 16] matrix that retains the four first-order channels and discards the rest.
Alternatively, R may be chosen to provide a "least-error" mapping. For example, given a set, {$\vec{v}_1, \dots, \vec{v}_{n_b}$}, of n_b unit vectors that are approximately uniformly spread over the unit-sphere, a pair of matrices may be formed by stacking together n_b column vectors:

$$B_S = \big[ PS(\vec{v}_1)\ \cdots\ PS(\vec{v}_{n_b}) \big], \qquad B_R = \big[ PR(\vec{v}_1)\ \cdots\ PR(\vec{v}_{n_b}) \big]$$

where B_S is a [n_s × n_b] array of Spatial Input panning vectors, and B_R is a [n_r × n_b] array of Residual Output panning vectors.

A suitable choice for the residual downmix matrix, R, is given by:

$$R = B_R\, B_S^{+} \qquad \text{(Equation 29)}$$

where $B_S^{+}$ indicates the pseudo-inverse of the B_S matrix.
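A minimal numpy sketch of the least-error construction of Equation 29; the Fibonacci-sphere direction sampling and the one-channel residual format are assumptions chosen for brevity:

```python
import numpy as np

def fibonacci_sphere(n):
    """n roughly uniform unit vectors (a common construction; assumed)."""
    i = np.arange(n) + 0.5
    phi = np.arccos(1 - 2 * i / n)
    theta = np.pi * (1 + 5 ** 0.5) * i
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)], axis=1)

def residual_downmix(pan_in, pan_res, n_dirs=256):
    """Least-error mixdown R = B_R · pinv(B_S) (Equation 29)."""
    dirs = fibonacci_sphere(n_dirs)
    B_S = np.stack([pan_in(v) for v in dirs], axis=1)    # (ns, n_dirs)
    B_R = np.stack([pan_res(v) for v in dirs], axis=1)   # (nr, n_dirs)
    return B_R @ np.linalg.pinv(B_S)

def foa_pan(v):   # assumed 1st-order panning [1, x, y, z]
    return np.array([1.0, *v])

def omni_pan(v):  # a hypothetical 1-channel "residual" format
    return np.array([1.0])

R = residual_downmix(foa_pan, omni_pan)
print(R.shape, np.round(R, 3))  # (1, 4), close to [1, 0, 0, 0]
```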
Next, an example of a method 600 of processing a multi-channel, spatial format input audio signal according to embodiments of the disclosure will be described with reference to Fig. 6. The method may use any of the concepts described above. The processing of method 600 may be performed at each time block k, for example. That is, method 600 may be performed for each predetermined period of time (e.g., for each transformation window of a time-to-frequency transform). The multi-channel, spatial format input audio signal may be an audio signal in a spatial format (spatial audio format) and may comprise multiple channels. The spatial format (spatial audio format) may be, but is not limited to, Ambisonics, HOA, or B-format.
At step S610, the input audio signal is analyzed to determine a plurality of object locations of audio objects included in the input audio signal. For example, locations $\vec{v}_o(k)$ of n_o objects may be determined. This may involve performing a scene analysis of the input audio signal. This step may be performed by either a subband-based approach or a broadband approach.

At step S620, for each of a plurality of frequency subbands of the input audio signal, and for each object location, a mixing gain is determined for that frequency subband and that object location. Prior to this step, the method may further include a step of applying a time-to-frequency transform to a time-domain input audio signal. At step S630, for each frequency subband, and for each object location, a frequency subband output signal is generated based on the input audio signal, the mixing gain for that frequency subband and that object location, and a spatial mapping function of the spatial format. The spatial mapping function may be the spatial decoding function (e.g., spatial decoding function DS). At step S640, for each object location, an output signal is generated by summing over the frequency subband output signals for that object location. Further, the object locations may be output as object location metadata. Thus, this step may further comprise generating object location metadata indicative of the object locations. The object location metadata may be output together with the output signals. The method may further include a step of applying an inverse time-to-frequency transform to the frequency-domain output signals.

Non-limiting examples of processing that may be used for the analyzing of the input audio signal at step S610, i.e., the determination of object locations, will now be described with reference to Fig. 7. This processing may be performed by/at blocks 102 of Fig. 1 and 204 of Fig. 2, for example. It is a goal of the invention to determine the locations, $\vec{v}_o(k)$, of dominant audio objects within the soundfield (as represented by the Spatial Audio input signal s_i(t) at the time around t = kτ_m). This process may be referred to by the shorthand name DOL, and in some embodiments, this process is achieved (e.g., at each time-block k) by the steps DOL1, DOL2 and DOL3.
At step S710, for each frequency subband, a set of one or more dominant directions of sound arrival is determined. This may involve performing process DOL1 described below.
DOL1: For each band, b, determine a set, V_b, of dominant sound-arrival directions, $\vec{v}_{b,1}, \vec{v}_{b,2}, \dots$. Each dominant sound-arrival direction may have an associated weighting factor, w_{b,j}, indicative of the "confidence" assigned to the respective direction vector:

$$V_b = \big\{ (\vec{v}_{b,1}, w_{b,1}),\ (\vec{v}_{b,2}, w_{b,2}),\ \dots \big\}$$
The first step, DOL1, may be achieved by a number of different methods. Some alternatives are, for example:

DOL1(a): The MUSIC algorithm, which is known in the art (see, for example, Schmidt, R.O., "Multiple Emitter Location and Signal Parameter Estimation," IEEE Trans. Antennas and Propagation, Vol. AP-34 (March 1986), pp. 276-280), may be used to determine a number of dominant directions of arrival, $\vec{v}_{b,1}, \vec{v}_{b,2}, \dots$.
DOL1(b): For some commonly used spatial formats, a single dominant direction of arrival may be determined from the elements of the Covariance matrix. In some embodiments, when the Spatial Input Format is a first-order Ambisonics format, defined by the panning function $PS((x, y, z)^T) = (1, x, y, z)^T$ given above, then an estimate may be made for the dominant direction of arrival in band b by extracting three elements from the Covariance matrix, and then normalizing to form a unit-vector:

$$\vec{v}_{b,1} = \frac{\big( C_b(k)_{2,1},\ C_b(k)_{3,1},\ C_b(k)_{4,1} \big)^T}{\big\| \big( C_b(k)_{2,1},\ C_b(k)_{3,1},\ C_b(k)_{4,1} \big)^T \big\|}$$
The processing of DOL1(b) may be said to relate to an example of extracting elements from the covariance matrix of the input audio signal in the relevant frequency subband.
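As an illustration of DOL1(b), the following sketch recovers a single dominant direction from a first-order covariance matrix; the channel ordering [W, X, Y, Z] and the panning convention above are assumptions:

```python
import numpy as np

def foa_dominant_direction(C):
    """Estimate the dominant direction from a first-order Ambisonics
    covariance matrix by taking the W-X, W-Y, W-Z cross terms and
    normalizing (channel ordering [W, X, Y, Z] assumed)."""
    g = np.real(np.array([C[1, 0], C[2, 0], C[3, 0]]))
    n = np.linalg.norm(g)
    return g / n if n > 0 else np.array([0.0, 0.0, 1.0])

# single source at v0 -> covariance is PS(v0) PS(v0)^T
v0 = np.array([0.6, 0.8, 0.0])
p = np.array([1.0, *v0])[:, None]
print(foa_dominant_direction(p @ p.T))   # ~[0.6, 0.8, 0.0]
```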
DOL1(c): The dominant directions of arrival for band b may be determined by finding all of the local maxima of the projection function $proj(\bar{C}_b(k), \vec{v})$. One example method, which may be used to search for the local maxima, operates by refining an initial estimate by a gradient-search method, so as to maximise the value of $proj(\bar{C}_b(k), \vec{v})$.
The initial estimates may be found by:
- Selecting a number of random directions as starting points
- Taking each of the dominant directions (for this band, b) from the previous time-block, k − 1, as starting points
Accordingly, determining the set of dominant directions of sound arrival may involve at least one of extracting elements from a covariance matrix of the input audio signal in the relevant frequency subband, and determining local maxima of a projection function of the input audio signal in the frequency subband. The projection function may be based on the covariance matrix (e.g., normalized covariance matrix) of the input audio signal and a spatial panning function of the spatial format, for example.
At step S720, a union of the sets of the one or more dominant directions for the plurality of frequency subbands is determined. This may involve performing process DOL2 described below. DOL2: From the collection of the dominant sound-arrival directions, form the union of the dominant sound-arrival direction sets of all bands:

$$V = \bigcup_{b=1}^{n_b} V_b$$
The methods (DOL1(a), DOL1(b) and DOL1(c)) outlined above may be used to determine a set of dominant sound-arrival directions, $\vec{v}_{b,1}, \vec{v}_{b,2}, \dots$, for band b. For each of these dominant sound-arrival-directions, a corresponding "confidence factor" (w_{b,1}, w_{b,2}, ...) may be determined, indicating how much weighting should be given to each dominant sound-arrival-direction.
In the most general case, the weighting may be calculated by combining together a number of factors, as follows:

$$w_{b,j} = loudness_b(k) \times Steer\big( \bar{C}_b(k), \vec{v}_{b,j} \big) \qquad \text{(Equation 35)}$$

In Equation 35, the function loudness_b(k) provides a "loudness" weighting factor that is responsive to the power of the input signal in band b at time-block, k. For example, an approximation to the specific loudness of the audio signal in band b (an increasing function of the band power P_b(k)) may be used. Likewise, in Equation 35, the function Steer() provides a "directional-steering" weighting factor that is responsive to the degree to which the input signal contains power in the direction $\vec{v}_{b,j}$.
For each band b, the dominant sound-arrival directions ($\vec{v}_{b,1}, \vec{v}_{b,2}, \dots$) and their associated weights (w_{b,1}, w_{b,2}, ...) have been defined (as per the algorithm step DOL1). Next, as per algorithm step DOL2, the directions and weights for all bands are combined together to form a single set of directions and weights (referred to as $\vec{u}_i$ and $w_i$ respectively):

$$\big\{ (\vec{u}_1, w_1),\ (\vec{u}_2, w_2),\ \dots \big\} = \bigcup_{b=1}^{n_b} V_b$$
At step S730, a clustering algorithm is applied to the union of the sets to determine the plurality of object locations. This may involve performing process DOL3 described below. DOL3: Determine the n_o object directions, $\vec{v}_1(k), \dots, \vec{v}_{n_o}(k)$, from the weighted set of dominant sound-arrival directions $(\vec{u}_i, w_i)$.
Algorithm step DOL3 will then determine a number (n0) of object locations. This can be achieved by a clustering algorithm. If the dominant directions have associated weights, the clustering algorithm may perform weighted clustering of the dominant directions. Some alternative methods for DOL3 are, for example:
DOL3(a): The weighted k-means algorithm (for example, as described by Steinley, Douglas, "K-means clustering: A half-century synthesis," British Journal of Mathematical and Statistical Psychology 59.1 (2006): 1-34) may be used to find a set of n_o centroids, $\vec{c}_1, \dots, \vec{c}_{n_o}$, by clustering the set of directions $\vec{u}_i$ into n_o subsets. This set of centroids is then normalized and permuted to create the set of object locations, according to:

$$\vec{v}_o(k) = \frac{\vec{c}_{\pi(o)}}{\big\| \vec{c}_{\pi(o)} \big\|}$$

where the permutation, π, is performed so as to minimise the block-to-block object position change:

$$\sum_{o=1}^{n_o} \big\| \vec{v}_o(k) - \vec{v}_o(k-1) \big\|$$
DOL3(b): Other clustering algorithms, such as Expectation-Maximization, may be used.
DOL3(c): In the special case when n_o = 1, the weighted mean of the dominant sound-arrival directions may be used:

$$\vec{m} = \sum_i w_i\, \vec{u}_i$$

and then normalized:

$$\vec{v}_1(k) = \frac{\vec{m}}{\|\vec{m}\|}$$
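A minimal sketch of the weighted clustering used in DOL3(a), operating on unit vectors and re-normalizing centroids onto the sphere after each update; the initialization, iteration count, and omission of the block-to-block permutation step are simplifications:

```python
import numpy as np

def weighted_kmeans_directions(dirs, weights, n_obj, iters=50, seed=0):
    """Minimal weighted k-means over unit vectors (a sketch of DOL3(a))."""
    rng = np.random.default_rng(seed)
    cents = dirs[rng.choice(len(dirs), size=n_obj, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmax(dirs @ cents.T, axis=1)  # nearest by dot product
        for c in range(n_obj):
            m = labels == c
            if m.any():
                v = (weights[m, None] * dirs[m]).sum(axis=0)
                n = np.linalg.norm(v)
                if n > 0:
                    cents[c] = v / n               # re-normalize onto sphere
    return cents

# two clusters of noisy directions around +x and +z
rng = np.random.default_rng(1)
a = rng.normal([1, 0, 0], 0.1, (40, 3))
b = rng.normal([0, 0, 1], 0.1, (40, 3))
dirs = np.vstack([a, b])
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(weighted_kmeans_directions(dirs, np.ones(len(dirs)), n_obj=2))
```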
Accordingly, the clustering algorithm in step S730 may be one of a k-means algorithm, a weighted k-means algorithm, an expectation-maximization algorithm, and a weighted mean algorithm, for example.

Fig. 8 is a flow chart of an example of a method 800 that may optionally be performed in conjunction with the method 600 of Fig. 6, for example after step S640.
At step S810, the plurality of output signals are re-encoded into the spatial format to obtain a multi-channel, spatial format audio object signal.
At step S820, the audio object signal is subtracted from the input audio signal to obtain a multi-channel, spatial format residual audio signal.
At step S830, a downmix is applied to the residual audio signal to obtain a downmixed residual audio signal. Therein, the number of channels of the downmixed residual audio signal may be smaller than the number of channels of the input audio signal. Step S830 may be optional.
Processing relating to extraction of object audio signals that may be used for implementing steps S620, S630, and S640 will be described next. This processing may be performed by/at blocks 103 of Fig. 1 and 205 of Fig. 2, for example. The DOL process (DOL1 to DOL3 described above) determines the locations, $\vec{v}_o(k)$, of n_o objects (o ∈ [1, n_o]), at each time-block, k. Based on these object locations, the spatial audio input signals are processed (e.g., at blocks 103 or 205) to form a set of n_o object output signals and n_r residual output signals. This process may be referred to by the shorthand name EOS, and in some embodiments, this process is achieved (e.g., at each time-block k) by the steps EOS1 to EOS6:
EOS1: Determine the [n_o × n_s] object-decoding matrix by stacking n_o row-vectors:

$$D = \begin{bmatrix} DS\big( \vec{v}_1(k) \big) \\ \vdots \\ DS\big( \vec{v}_{n_o}(k) \big) \end{bmatrix}$$

The object-decoding matrix D is an example of a spatial decoding matrix. In general, the spatial decoding matrix includes a plurality of mapping vectors (e.g., vectors $DS(\vec{v}_o(k))$), one mapping vector for each object location. Each of these mapping vectors may be obtained by evaluating a spatial decoding function at the respective object location. The spatial decoding function may be a vector-valued function (e.g., yielding a 1 × n_s row vector if the multi-channel, spatial format input audio signal is defined as an n_s × 1 column vector).
EOS2: Determine the [n_s × n_o] object-encoding matrix by stacking n_o column-vectors:

$$E = \big[ PS\big( \vec{v}_1(k) \big)\ \cdots\ PS\big( \vec{v}_{n_o}(k) \big) \big]$$

The object-encoding matrix E is an example of a spatial panning matrix. In general, the spatial panning matrix includes a plurality of mapping vectors (e.g., vectors $PS(\vec{v}_o(k))$), one mapping vector for each object location. Each of these mapping vectors may be obtained by evaluating a spatial panning function at the respective object location. The spatial panning function may be a vector-valued function (e.g., yielding an n_s × 1 column vector if the multi-channel, spatial format input audio signal is defined as an n_s × 1 column vector).
EOS3: For each band, b ∈ [1, n_b], and for each output object, o ∈ [1, n_o], determine the object gain, g_{b,o}. These object or mixing gains may be frequency-dependent. In some embodiments:

$$g_{b,o} = Steer\big( \bar{C}_b(k), \vec{v}_o(k) \big)$$

Arrange these object gain coefficients to form the object gain matrix, G_b (this is an [n_o × n_o] diagonal matrix):

$$G_b = \mathrm{diag}\big( g_{b,1},\ g_{b,2},\ \dots,\ g_{b,n_o} \big)$$
The object gain matrix Gb may be referred to as a gain matrix in the following. This gain matrix includes the determined mixing gains for frequency subband b. In more detail, it is a diagonal matrix that has the mixing gains (one for each object location, appropriately ordered) as its diagonal elements.
Thus, process EOS3 determines, for each frequency subband and for each object location, a mixing gain (e.g., frequency-dependent mixing gain) for that frequency subband and that object location. As such, process EOS3 is an example of an implementation of step S620 of method 600 described above. In general, determining the mixing gain for a given frequency subband and a given object location may be based on the given object location and the covariance matrix (e.g., normalized covariance matrix) of the input audio signal in the given frequency subband. Dependence on the covariance matrix may be through the steering function, $Steer(\bar{C}, \vec{v})$, which is based on (e.g., depends on) the covariance matrix C (or the normalized covariance matrix $\bar{C}$) of the input audio signal. That is, the mixing gain for the given frequency subband and the given object location may depend on the steering function for the input audio signal in the given frequency subband, evaluated at the given object location.

EOS4: Compute the frequency-domain object output signals, T(k, f), by applying the object-decoding matrix and the object gain matrix to the spatial input signals, S(k, f), and by summing over the frequency subbands b:

$$T(k, f) = \sum_{b=1}^{n_b} band_b(f)\ G_b\, D\ S(k, f)$$

(refer to Equation 3 for the definition of S(k, f)). The frequency-domain object output signals, T(k, f), may be referred to as frequency subband output signals. The sum may be a weighted sum, for example.
Process EOS4 is an example of an implementation of steps S630 and S640 of method 600 described above.
In general, generating the frequency subband output signal for a frequency subband and an object location at step S630 may involve applying a gain matrix (e.g., matrix Gb) and a spatial decoding matrix (e.g., matrix D) to the input audio signal. Therein, the gain matrix and the spatial decoding matrix may be successively applied.
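Putting EOS1 through EOS4 together, the following toy sketch builds D, E, and G_b and accumulates the band-weighted object outputs T(k, f) for a single time-block; the panning convention, banding gains, and mixing gains are stand-in assumptions:

```python
import numpy as np

def foa_pan(v):   # assumed first-order convention [1, x, y, z]
    return np.array([1.0, *v])

def foa_dec(v):   # matching unity-gain decoder
    return 0.5 * np.array([1.0, *v])

rng = np.random.default_rng(0)
nf, nb, ns = 64, 13, 4
obj_locs = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])]
no = len(obj_locs)

D = np.stack([foa_dec(v) for v in obj_locs])          # EOS1: (no, ns)
E = np.stack([foa_pan(v) for v in obj_locs], axis=1)  # EOS2: (ns, no)

band = rng.uniform(0, 1, (nb, nf))                    # stand-in banding gains
g = rng.uniform(0, 1, (nb, no))                       # stand-in gains g_{b,o}
S = rng.standard_normal((ns, nf)) + 1j * rng.standard_normal((ns, nf))

T = np.zeros((no, nf), dtype=complex)                 # EOS4: T(k, f)
for b in range(nb):
    Gb = np.diag(g[b])                                # EOS3: gain matrix G_b
    T += band[b] * (Gb @ D @ S)                       # band-weighted sum
print(T.shape)                                        # (2, 64)
```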
EOS5: Compute the frequency-domain residual spatial signals by re-encoding the object output signals, T(k, f), and subtracting this re-encoded signal from the spatial input:

$$S'(k, f) = S(k, f) - E\ T(k, f)$$

EOS6: Determine the [n_r × n_s] residual downmix matrix, R (for example, via the method of Equation 29), and compute the frequency-domain residual output signals by transforming the residual spatial signals via this residual downmix matrix:

$$U(k, f) = R\ S'(k, f)$$
As such, processes EOS5 and EOS6 are examples of an implementation of steps S810, S820, and S830 of method 800 described above. Re-encoding the plurality of output signals into the spatial format may thus be based on the spatial panning matrix (e.g., matrix E). For example, re-encoding the plurality of output signals into the spatial format may involve applying the spatial panning matrix (e.g., matrix E) to a vector of the plurality of output signals. Applying a downmix to the residual audio signal (e.g., S') may involve applying a downmix matrix (e.g., downmix matrix R) to the residual audio signal.
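A matching sketch of EOS5 and EOS6, subtracting the re-encoded objects from the spatial input and applying a hypothetical [2 × 4] residual downmix; all signals here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
ns, no, nf = 4, 2, 64
S = rng.standard_normal((ns, nf)) + 0j        # spatial input S(k, f)
T = rng.standard_normal((no, nf)) + 0j        # object outputs T(k, f)
E = rng.standard_normal((ns, no))             # object-encoding matrix (EOS2)

S_res = S - E @ T                             # EOS5: remove re-encoded objects
R = np.hstack([np.eye(2), np.zeros((2, 2))])  # hypothetical [2 x 4] downmix
U = R @ S_res                                 # EOS6: residual outputs U(k, f)
print(U.shape)                                # (2, 64)
```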
The first two steps in the EOS process, EOS1 and EOS2, involve the calculation of matrix coefficients suitable for extracting object-audio signals from the spatial audio input (using the D matrix), and re-encoding these objects back into the spatial audio format (using the E matrix). These matrices are formed by using the PS() and DS() functions. Examples of these functions (for the case where the input spatial audio format is 2nd-order Ambisonics) are given in Equations 10 and 11.
The EOS3 step may be implemented in a number of ways. Some alternative methods are:

EOS3(a): The object gains, g_{b,o}, may be computed using the method of Equation 51:

$$g_{b,o} = Steer\big( \bar{C}_b(k), \vec{v}_o(k) \big) \qquad \text{(Equation 51)}$$

In this embodiment, the function Steer() is used to indicate what proportion of the spatial input signal is present in the direction, $\vec{v}_o(k)$.
Thereby, a mixing gain (e.g., frequency-dependent mixing gain) for each frequency subband and for each object location can be determined (e.g., calculated). In general, determining the mixing gain for a given frequency subband and a given object location may be based on the given object location and the covariance matrix (e.g., normalized covariance matrix) of the input audio signal in the given frequency subband. Dependence on the covariance matrix may be through the steering function, $Steer(\bar{C}, \vec{v})$, which is based on (e.g., depends on) the covariance matrix C (or the normalized covariance matrix $\bar{C}$) of the input audio signal. That is, the mixing gain for the given frequency subband and the given object location may depend on the steering function for the input audio signal in the given frequency subband, evaluated at the given object location.
EOS3(b): In general, determining the mixing gain for the given frequency subband and the given object location may be further based on a change rate of the given object location over time. For example, the mixing gain may be attenuated in dependence on the change rate of the given object location.
In other words, the object gains may be computed by combining a number of gain-factors (each of which is generally a real value in the range [0, 1]). For example:

$$g_{b,o} = Steer\big( \bar{C}_b(k), \vec{v}_o(k) \big) \times g^{jump}_o(k)$$

where $g^{jump}_o(k)$ is computed to be a gain factor that is approximately equal to 1 whenever the object location is static ($\vec{v}_o(k) = \vec{v}_o(k-1)$), and approximately equal to 0 when the object location is "jumping" significantly in the region around time-block k (for example, when $\|\vec{v}_o(k) - \vec{v}_o(k-1)\| > \alpha$ for some threshold α).

The gain-factor $g^{jump}_o(k)$ is intended to attenuate the object amplitude whenever an object location is changing rapidly, as may occur when a new object "appears" at time-block k in a location where no object existed during time-block k − 1.

In some embodiments, $g^{jump}_o(k)$ is computed by first computing the jump value:

$$jump_o(k) = \big\| \vec{v}_o(k) - \vec{v}_o(k-1) \big\|$$

and then computing:

$$g^{jump}_o(k) = \max\Big( 0,\ 1 - \frac{jump_o(k)}{\alpha} \Big)$$

In some embodiments, a suitable value for α is 0.5, and in general α will be chosen such that 0.05 ≤ α ≤ 1.
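A small sketch of the jump-attenuation gain as reconstructed above; the clamped linear ramp is an assumption consistent with the stated behavior (1 for a static location, falling to 0 as the jump approaches the threshold α):

```python
import numpy as np

def jump_gain(v_now, v_prev, alpha=0.5):
    """Attenuation for fast-moving objects: ~1 when the location is
    static, 0 once the block-to-block jump reaches the threshold alpha
    (the exact mapping is an assumption)."""
    jump = np.linalg.norm(v_now - v_prev)
    return float(np.clip(1.0 - jump / alpha, 0.0, 1.0))

print(jump_gain(np.array([1.0, 0, 0]), np.array([1.0, 0, 0])))  # 1.0
print(jump_gain(np.array([1.0, 0, 0]), np.array([0, 1.0, 0])))  # 0.0
```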
Fig. 5 illustrates an exemplary method 500 in accordance with present principles. Method 500 includes, at 501, receiving spatial audio information. The spatial audio information may be consistent with the n_s-channel Spatial Audio Format 101 shown in Fig. 1 and s_i(t) (input signal for channel i) 201 shown in Fig. 2. At 502, object locations may be determined based on the received spatial audio information. For example, the object locations may be determined as described in connection with blocks 102 shown in Fig. 1 and 204 shown in Fig. 2. Block 502 may output object location metadata 504. The object location metadata 504 may be similar to the object location metadata 111 shown in Fig. 1 and $\vec{v}_o(k)$ (location of object o) 211 shown in Fig. 2.

At 503, object audio signals may be extracted based on the received spatial audio information. For example, the object audio signals may be extracted as described in connection with blocks 103 shown in Fig. 1 and 205 shown in Fig. 2. Block 503 may output object audio signals 505. The object audio signals 505 may be similar to the object audio signals 112 shown in Fig. 1 and the output signal for object o, t_o(t), 213 shown in Fig. 2. Block 503 may further output residual audio signals 506. The residual audio signals 506 may be similar to the residual audio signals 113 shown in Fig. 1 and the output residual channel r, u_r(t), 215 shown in Fig. 2.
Methods of processing multi-channel, spatial format input audio signals have been described above. It is understood that the present disclosure likewise relates to apparatus for processing multi-channel, spatial format input audio signals. The apparatus may comprise a processor adapted to perform any of the processes described above, e.g., the steps of methods 600, 700, and 800, as well as their respective implementations DOL1 to DOL3 and EOS1 to EOS6. Such apparatus may further comprise a memory coupled to the processor, the memory storing respective instructions for execution by the processor.
Various modifications to the implementations described in this disclosure may be readily apparent to those having ordinary skill in the art. The general principles defined herein may be applied to other implementations without departing from the spirit or scope of this disclosure. Thus, the claims are not intended to be limited to the implementations shown herein, but are to be accorded the widest scope consistent with this disclosure, the principles and the novel features disclosed herein.
The methods and systems described in the present document may be implemented as software, firmware and/or hardware. Certain components may, e.g., be implemented as software running on a digital signal processor or microprocessor. Other components may, e.g., be implemented as hardware and/or as application-specific integrated circuits. The signals encountered in the described methods and systems may be stored on media such as random access memory or optical storage media. They may be transferred via networks, such as radio networks, satellite networks, wireless networks or wireline networks, e.g. the Internet. Typical devices making use of the methods and systems described in the present document are portable electronic devices or other consumer equipment which are used to store and/or render audio signals.
Further implementation examples of the present invention are summarized in the enumerated example embodiments (EEEs) that are listed below. A first EEE relates to a method for processing a multi-channel, spatial audio format input signal. The method comprises determining object location metadata based on the received spatial audio format input signal, and extracting object audio signals based on the received spatial audio format input signal. The extracting of object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.
A second EEE relates to a method according to the first EEE, wherein each extracted audio object signal has a corresponding object location metadata.
A third EEE relates to a method according to the first or second EEEs, wherein the object location metadata is indicative of the direction-of-arrival of an object. A fourth EEE relates to a method according to any one of the first to third EEEs, wherein the object location metadata is derived from statistics of the received spatial audio format input signal.
A fifth EEE relates to a method according to any one of the first to fourth EEEs, wherein the object location metadata is changing from time to time. A sixth EEE relates to a method according to any one of the first to fifth EEEs, wherein the object audio signals are determined based on a linear mixing matrix in each of a number of sub-bands of the received spatial audio format input signal.
A seventh EEE relates to a method according to any one of the first to sixth EEEs, wherein the residual signal is a multi-channel residual signal.
An eighth EEE relates to a method according to the seventh EEE, wherein the multi-channel residual signal is composed of a number of channels that is less than a number of channels of the received spatial audio format input signal.
A ninth EEE relates to a method according to any one of the first to eighth EEEs, wherein extracting object audio signals is determined by subtracting the contribution of said object audio signals from said spatial audio format input signal.
A tenth EEE relates to a method according to any one of the first to ninth EEEs, wherein extracting object audio signals includes determining linear mixing matrix coefficients that may be used by subsequent processing to create the one or more object audio signals and the residual signal.
An eleventh EEE relates to a method according to any one of the first to tenth EEEs, wherein the matrix coefficients are different for each frequency band.
A twelfth EEE relates to an apparatus for processing a multi-channel, spatial audio format input signal. The apparatus comprises a processor for determining object location metadata based on the received spatial audio format input signal, and an extractor for extracting object audio signals based on the received spatial audio format input signal. The extracting object audio signals based on the received spatial audio format input signal includes determining object audio signals and residual audio signals.

Claims

1. A method for processing a multi-channel, spatial format input audio signal, the method comprising determining object locations based on the input audio signal; and extracting object audio signals from the input audio signal based on the determined object locations, wherein the determining object locations comprises determining, for each of a number of frequency subbands, one or more dominant sound-arrival-directions.
2. The method according to claim 1, wherein the extracting object audio signals from the input audio signal based on the determined object locations comprises: for each of the number of frequency subbands of the input audio signal, determining, for each object location, a mixing gain for that frequency subband and that object location; for each of the number of frequency subbands, generating, for each object location, a frequency subband output signal based on the input audio signal, the mixing gain for that frequency subband and that object location, and a spatial mapping function of the spatial format; and for each object location, generating an output signal by summing over the frequency subband output signals for that object location.
3. The method according to claim 2, wherein the mixing gains for the object locations are frequency-dependent.
4. The method according to claim 2 or 3, wherein the spatial format defines a plurality of channels; and the spatial mapping function is a spatial decoding function of the spatial format for extracting an audio signal at a given location, from the plurality of the channels of the spatial format.
5. The method according to claim 4, wherein a spatial panning function of the spatial format is a function for mapping a source signal at a source location to the plurality of channels defined by the spatial format; and the spatial decoding function is defined such that successive application of the spatial panning function and the spatial decoding function yields unity gain for all locations on the unit sphere.
6. The method according to claim 2, wherein determining the mixing gain for a given frequency subband and a given object location is based on the given object location and a covariance matrix of the input audio signal in the given frequency subband.
7. The method according to claim 6, wherein the mixing gain for the given frequency subband and the given object location depends on a steering function for the input audio signal in the given frequency subband, evaluated at the given object location.
8. The method according to claim 7, wherein the steering function is based on a covariance matrix of the input audio signal in the given frequency subband.
9. The method according to any one of claims 6 to 8, wherein determining the mixing gain for the given frequency subband and the given object location is further based on a change rate of the given object location over time, wherein the mixing gain is attenuated in dependence on the change rate of the given object location.
10. The method according to claim 2, wherein generating, for each frequency subband and for each object location, the frequency subband output signal involves: applying a gain matrix and a spatial decoding matrix to the input audio signal, wherein the gain matrix includes the determined mixing gains for that frequency subband; and the spatial decoding matrix includes a plurality of mapping vectors, one for each object location, wherein each mapping vector is obtained by evaluating the spatial decoding function at a respective object location.
11. The method according to claim 1, further comprising: re-encoding the plurality of output signals into the spatial format to obtain a multi-channel, spatial format audio object signal; and subtracting the audio object signal from the input audio signal to obtain a multi-channel, spatial format residual audio signal.
12. The method according to claim 11, further comprising: applying a downmix to the residual audio signal to obtain a downmixed residual audio signal, wherein the number of channels of the downmixed residual audio signal is smaller than the number of channels of the input audio signal.
13. The method according to claim 1, wherein the determining object locations further comprises: determining a union of sets of dominant sound-arrival-directions for the number of frequency subbands; and applying a clustering algorithm to the union to determine the plurality of object locations.
14. The method according to claim 13, wherein determining the set of dominant directions of sound-arrival involves at least one of: extracting elements from a covariance matrix of the input audio signal in the frequency subband; and determining local maxima of a projection function of the audio input signal in the frequency subband, wherein the projection function is based on the covariance matrix of the audio input signal and a spatial panning function of the spatial format.
15. The method according to claim 13 or 14, wherein each dominant direction has an associated weight; and the clustering algorithm performs weighted clustering of the dominant directions.
16. The method according to any one of claims 13 to 15, wherein the clustering algorithm is one of: a k-means algorithm, a weighted k-means algorithm, an expectation-maximization algorithm, and a weighted mean algorithm.
17. The method according to any one of claims 1 to 16, further comprising: generating object location metadata indicative of the object locations.
18. The method of any preceding claim, wherein the object audio signals are determined based on a linear mixing matrix in each of the number of sub-bands of the received spatial audio format input signal.
19. The method of claim 18, wherein the matrix coefficients are different for each frequency band.
20. The method of any preceding claim, wherein extracting the object audio signals involves subtracting the contribution of said object audio signals from said input audio signal.
21. An apparatus for processing a multi-channel, spatial format input audio signal, the apparatus comprising a processor adapted to: analyze the input audio signal to determine a plurality of object locations of audio objects included in the input audio signal, wherein the analyzing comprises determining, for each of a number of frequency subbands, one or more dominant sound-arrival-directions; for each of the number of frequency subbands of the input audio signal, determine, for each object location, a mixing gain for that frequency subband and that object location; for each frequency subband of the number of frequency subbands, generate, for each object location, a frequency subband output signal based on the input audio signal, the mixing gain for that frequency subband and that object location, and a spatial mapping function of the spatial format; and for each object location, generate an output signal by summing over the frequency subband output signals for that object location.
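Finally, a sketch chaining the helper sketches above into the processing chain of apparatus claim 21. The filterbank that produces `x_subbands`, the reuse of `panning_vector` as a stand-in for the spatial mapping function, and all tuning constants are assumptions:

```python
import numpy as np

def process(x_subbands):
    """End-to-end sketch. x_subbands: list of (channels, samples) arrays,
    one per frequency subband. Returns object locations and, per object
    location, the output signal summed over subbands."""
    covs = [xb @ xb.conj().T / xb.shape[1] for xb in x_subbands]
    dirs, weights = zip(*(dominant_directions(c) for c in covs))
    locs = object_locations(list(dirs), list(weights))     # analysis step
    az = np.arctan2(locs[:, 1], locs[:, 0])                # back to azimuths
    out = 0.0
    for xb, cov in zip(x_subbands, covs):
        gains = np.array([steering_gain(cov, a) for a in az])  # per-subband gains
        D = np.stack([panning_vector(a) for a in az])          # decoding matrix
        out = out + subband_object_outputs(xb, gains, D)       # sum over subbands
    return locs, out
```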
PCT/US2018/030680 2017-05-09 2018-05-02 Processing of a multi-channel spatial audio format input signal WO2018208560A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP18722375.5A EP3622509B1 (en) 2017-05-09 2018-05-02 Processing of a multi-channel spatial audio format input signal
CN201880041822.0A CN110800048B (en) 2017-05-09 2018-05-02 Processing of multichannel spatial audio format input signals
US16/611,843 US10893373B2 (en) 2017-05-09 2018-05-02 Processing of a multi-channel spatial audio format input signal
JP2019561833A JP7224302B2 (en) 2017-05-09 2018-05-02 Processing of multi-channel spatial audio format input signals

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201762503657P 2017-05-09 2017-05-09
US62/503,657 2017-05-09
EP17179315.1 2017-07-03
EP17179315 2017-07-03
US201762598068P 2017-12-13 2017-12-13
US62/598,068 2017-12-13

Publications (1)

Publication Number Publication Date
WO2018208560A1 (en) 2018-11-15

Family

ID=59285047

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2018/030680 WO2018208560A1 (en) 2017-05-09 2018-05-02 Processing of a multi-channel spatial audio format input signal

Country Status (1)

Country Link
WO (1) WO2018208560A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2578625A (en) * 2018-11-01 2020-05-20 Nokia Technologies Oy Apparatus, methods and computer programs for encoding spatial metadata
WO2020140658A1 (en) * 2018-12-31 2020-07-09 深圳市华讯方舟太赫兹科技有限公司 Direction of arrival estimation method and apparatus, radar, and readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2249334A1 (en) * 2009-05-08 2010-11-10 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio format transcoder
US20100329466A1 (en) * 2009-06-25 2010-12-30 Berges Allmenndigitale Radgivningstjeneste Device and method for converting spatial audio signal
EP2469741A1 (en) * 2010-12-21 2012-06-27 Thomson Licensing Method and apparatus for encoding and decoding successive frames of an ambisonics representation of a 2- or 3-dimensional sound field
WO2016133785A1 (en) * 2015-02-16 2016-08-25 Dolby Laboratories Licensing Corporation Separating audio sources

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SCHMIDT, R.O.: "Multiple Emitter Location and Signal Parameter Estimation", IEEE TRANS. ANTENNAS PROPAGATION, vol. AP-34, March 1986 (1986-03-01), pages 276-280, XP000644956, DOI: 10.1109/TAP.1986.1143830
STEINLEY, DOUGLAS: "K-means clustering: A half-century synthesis", BRITISH JOURNAL OF MATHEMATICAL AND STATISTICAL PSYCHOLOGY, vol. 59, no. 1, 2006, pages 1-34

Similar Documents

Publication Publication Date Title
US10893373B2 (en) Processing of a multi-channel spatial audio format input signal
US11832080B2 (en) Spatial audio parameters and associated spatial audio playback
US10650836B2 (en) Decomposing audio signals
US9786288B2 (en) Audio object extraction
US10818302B2 (en) Audio source separation
EP2792168A1 (en) Audio processing method and audio processing apparatus
EP2543199B1 (en) Method and apparatus for upmixing a two-channel audio signal
CN106233382A (en) A kind of signal processing apparatus that several input audio signals are carried out dereverberation
US10827295B2 (en) Method and apparatus for generating 3D audio content from two-channel stereo content
CN110771181B (en) Method, system and device for converting a spatial audio format into a loudspeaker signal
JP2020519950A5 (en)
WO2018208560A1 (en) Processing of a multi-channel spatial audio format input signal
KR20170101614A (en) Apparatus and method for synthesizing separated sound source
EP3869826A1 (en) Signal processing device and method, and program
JP6815956B2 (en) Filter coefficient calculator, its method, and program
KR101825949B1 (en) Apparatus for location estimation of sound source with source separation and method thereof
CN108028988B (en) Apparatus and method for processing internal channel of low complexity format conversion
US20220392462A1 (en) Multichannel audio encode and decode using directional metadata
US12051427B2 (en) Determining corrections to be applied to a multichannel audio signal, associated coding and decoding
CN109074811A (en) Audio-source separation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 18722375
    Country of ref document: EP
    Kind code of ref document: A1
DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
ENP Entry into the national phase
    Ref document number: 2019561833
    Country of ref document: JP
    Kind code of ref document: A
NENP Non-entry into the national phase
    Ref country code: DE
ENP Entry into the national phase
    Ref document number: 2018722375
    Country of ref document: EP
    Effective date: 20191209