GB2584630A - Audio processing - Google Patents
- Publication number
- GB2584630A (application GB1907601.7A)
- Authority
- GB
- United Kingdom
- Legal status
- Withdrawn
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S3/00—Systems employing more than two channels, e.g. quadraphonic
- H04S3/002—Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
- H04S3/004—For headphones
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
- H04S7/305—Electronic adaptation of stereophonic audio signals to reverberation of the listening space
- H04S7/306—For headphones
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R2420/00—Details of connection covered by H04R, not provided for in its groups
- H04R2420/01—Input selection or mixing for amplifiers or loudspeakers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/01—Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
Abstract
A stereo-widening spatial audio headphone system receives a first (e.g. direct) signal component encoding a first part of a spatial audio image and a second (e.g. ambient) signal component, comprising multiple input channels (e.g. 5.1 or 7.1 surround sound), encoding another part of the image. It then cross-channel mixes at least some of the multiple input channels of the second signal component while enabling the first signal component to bypass the cross-channel mixing when rendering. This avoids the comb-filtering coloration and degradation that arise when multiple filtering operations are applied separately to each left/right channel.
Description
Audio processing
TECHNICAL FIELD
The example and non-limiting embodiments of the present invention relate to processing of audio signals. In particular, various embodiments of the present invention relate to modification of a spatial image represented by a multi-channel audio signal, such as a two-channel stereo signal.
BACKGROUND
So-called stereo widening is a technique known in the art for enhancing the perceivable spatial audio image of a stereophonic audio signal when reproduced via audio output devices. Such a technique aims at processing a stereophonic audio signal such that the reproduced sound is not only perceived as originating from directions localized between the audio output devices, but at least part of the sound field is perceived as if it originated from directions that are not between the audio output devices, thereby widening the perceivable width of the spatial audio image from that conveyed in the stereophonic audio signal. Herein, we refer to such a spatial audio image as a widened or enlarged spatial audio image.
While outlined above via references to a two-channel stereophonic audio signal, stereo widening may be applied to multi-channel audio signals that have more than two channels, such as 5.1-channel or 7.1-channel surround sound, for playback via a pair of audio output devices. In some contexts, the term virtual surround is applied to refer to a processed audio signal that conveys a spatial audio image originally conveyed in a multi-channel surround audio signal. Hence, even though the term stereo widening is predominantly applied throughout this disclosure, this term should be construed broadly, encompassing a technique for processing the spatial audio image conveyed in a multi-channel audio signal (i.e. a two-channel stereophonic audio signal or a surround sound of more than two channels) to provide audio playback at a widened spatial audio image.
For brevity and clarity of description, in this disclosure we use the term multi-channel audio signal to refer to audio signals that have two or more channels. Moreover, the term stereo signal is used to refer to a stereophonic audio signal and the term surround signal is used to refer to a multi-channel audio signal having more than two channels.
When applied to a stereo signal, stereo widening techniques known in the art typically involve adding a processed (e.g. filtered) version of the contralateral channel signal to each of the left and right channel signals of the stereo signal in order to derive an output stereo signal having a widened spatial audio image (referred to in the following as a widened stereo signal). In other words, a processed version of the right channel signal of the stereo signal is added to the left channel signal of the stereo signal to create the left channel of a widened stereo signal, and a processed version of the left channel signal of the stereo signal is added to the right channel signal of the stereo signal to create the right channel of the widened stereo signal. Moreover, the procedure of deriving the widened stereo signal may further involve pre-filtering (or otherwise processing) each of the left and right channel signals of the stereo signal prior to adding the respective processed contralateral signals thereto, in order to preserve a desired frequency response in the widened stereo signal.
Along the lines described above, stereo widening readily generalizes into widening the spatial audio image of a multi-channel input audio signal, thereby deriving an output multi-channel audio signal having a widened spatial audio image (referred to in the following as a widened multi-channel signal). In this regard, the processing involves creating the left channel of the widened multi-channel audio signal as a sum of (first) filtered versions of channels of the multi-channel input audio signal and creating the right channel of the widened multi-channel audio signal as a sum of (second) filtered versions of channels of the multi-channel input audio signal. Herein, a dedicated predefined filter may be provided for each pair of an input channel (a channel of the multi-channel input signal) and an output channel (left and right). As an example in this regard, the left and right channel signals of the widened multi-channel signal, $S_{out,left}$ and $S_{out,right}$ respectively, may be defined on basis of channels of a multi-channel audio signal $S$ according to the equation (1):

$$S_{out,left}(b,n) = \sum_i S(i,b,n)\,H_{left}(i,b), \qquad S_{out,right}(b,n) = \sum_i S(i,b,n)\,H_{right}(i,b) \qquad (1)$$

where $S(i,b,n)$ denotes frequency bin $b$ in time frame $n$ of channel $i$ of the multi-channel signal $S$, $H_{left}(i,b)$ denotes a filter for filtering frequency bin $b$ of channel $i$ of the multi-channel signal $S$ to create a respective channel component for creation of the left channel signal $S_{out,left}(b,n)$, and $H_{right}(i,b)$ denotes a filter for filtering frequency bin $b$ of channel $i$ of the multi-channel signal $S$ to create a respective channel component for creation of the right channel signal $S_{out,right}(b,n)$.

A challenge involved in stereo widening is degraded timbre in the central part of the spatial audio image. In many real-life stereo signals the central part of the spatial audio image includes perceptually important audio content; e.g. in the case of music, the voice of the vocalist is typically rendered in the center of the spatial audio image. A sound component that is in the center of the spatial audio image is rendered by reproducing the same signal in both channels of the stereo signal and hence via both audio output devices. When stereo widening is applied to such an input stereo signal (e.g. according to the equation (1) above), each channel of the resulting widened stereo signal involves the outcome of two filtering operations carried out for the channels of the input stereo signal. This may result in a comb filtering effect, which may cause differences in the perceived timbre, which may be referred to as 'coloration' of the sound. Moreover, the comb filtering effect may further result in degradation of the engagement of the sound source.
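To make equation (1) concrete, below is a minimal sketch of the per-bin filtering and summation, assuming the signal is already available as an STFT-domain array; the array layout and function name are illustrative, not taken from the patent.

```python
import numpy as np

def widen_multichannel(S, H_left, H_right):
    """Per-bin widening filters of equation (1).

    S       : complex array (channels, bins, frames), S(i, b, n)
    H_left  : complex array (channels, bins), H_left(i, b)
    H_right : complex array (channels, bins), H_right(i, b)
    Returns S_out_left and S_out_right, each of shape (bins, frames).
    """
    # Sum over input channels i of S(i, b, n) * H(i, b), for each ear.
    S_out_left = np.einsum('ibn,ib->bn', S, H_left)
    S_out_right = np.einsum('ibn,ib->bn', S, H_right)
    return S_out_left, S_out_right
```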
In some circumstances, the audio output devices are part of a headphone apparatus that comprises a left audio output device that is worn at, over or in a left ear of a user and a right audio output device that is worn at, over or in a right ear of a user.
Normal playback of stereo audio via headphones may cause the sound to be perceived by a user inside the user's head. The stereo panning cues position the sound in between the ears, inside the head.
To address this, loudspeaker virtualization methods are used to process the audio signals so that the perception for a user listening via headphones is similar to the perception for a user listening via loudspeakers. This can be achieved by filtering the audio signals using appropriate head-related transfer functions (HRTFs) or binaural room impulse responses (BRIRs).
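As an illustration of such loudspeaker virtualization, here is a minimal sketch using time-domain head-related impulse responses (HRIRs, the time-domain counterparts of HRTFs); the assumption of equal channel lengths, the `mode='same'` truncation and all names are illustrative simplifications, not the patent's implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def virtualize(channels, hrirs):
    """Binaural loudspeaker virtualization: filter each virtual
    loudspeaker feed with its head-related impulse response pair
    and sum the results per ear.

    channels : list of 1-D arrays of equal length, one per loudspeaker
    hrirs    : list of (hrir_left, hrir_right) pairs, same order
    """
    left = sum(fftconvolve(ch, hl, mode='same') for ch, (hl, _) in zip(channels, hrirs))
    right = sum(fftconvolve(ch, hr, mode='same') for ch, (_, hr) in zip(channels, hrirs))
    return np.stack([left, right])  # two-channel headphone signal
```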
SUMMARY
According to various, but not necessarily all, examples there is provided an apparatus for processing an input audio signal comprising multiple channels, the apparatus comprising: means for deriving, based on the input audio signal, a first signal component, comprising at least one input channel, and a second signal component, comprising multiple input channels, wherein the first signal component is dependent upon at least a first portion of a spatial audio image conveyed by the input audio signal, and the second signal component is dependent upon at least a second portion of the spatial audio image that is different to the first portion; cross-channel mixing means for cross-channel mixing of a plurality of input channels; means for directing the second signal component to the cross-channel mixing means for cross-channel mixing of at least some of the multiple input channels of the second signal component to produce a modified second signal component; bypass means for enabling the first signal component to bypass the cross-channel mixing means; and means for combining the first signal component and the modified second signal component into an output audio signal comprising two output channels configured for rendering by headphone apparatus.
In some but not necessarily all examples, the cross-channel mixing means for cross-channel mixing of a plurality of input channels comprises means for applying head related transfer functions to each one of the plurality of input channels before mixing those channels to produce a modified second signal component comprising two output channels, wherein the head related transfer function applied to an input channel that is mixed to provide an output channel is dependent upon an identity of the input channel and an identity of the output channel.
In some but not necessarily all examples, the cross-channel mixing means for cross-channel mixing of a plurality of input channels comprises means for applying a headphone filter to each one of the plurality of input channels before mixing those channels to produce a modified second signal component comprising two output channels, wherein the headphone filter applied to an input channel that is mixed to provide an output channel is dependent upon an identity of the input channel and an identity of the output channel, wherein the headphone filter for an input channel mixes a direct version of the input channel with an ambient version of the input channel.
In some but not necessarily all examples, the relative gain of the direct version of the input channel compared to the ambient version of the input channel in a mix in the headphone filter is a user-controllable parameter.
In some but not necessarily all examples, the headphone filter for an input channel mixes a single-path direct version of the input channel with a multiple-path ambient version of the input channel; wherein a head related transfer function is used to form the single-path direct version of the input channel; wherein, an indirect path filter is used in combination with a head related transfer function for each path of the multiple paths, to form the multiple-path ambient version of the input channel. In some but not necessarily all examples, the indirect path filter comprises decorrelation means or reverberation means.
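A minimal sketch of such a headphone filter follows, assuming one direct HRIR and a set of indirect-path HRIRs per input channel; the noise-burst decorrelator and the 0.7 default gain are illustrative placeholders (the `direct_gain` parameter stands in for the user-controllable direct-to-ambient balance), not values from the patent.

```python
import numpy as np
from scipy.signal import fftconvolve

def decorrelate(ch, seed):
    """Toy decorrelator: convolve with a short, exponentially decaying
    noise burst (a stand-in for a proper indirect path filter)."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(512) * np.exp(-np.arange(512) / 128.0)
    return fftconvolve(ch, h, mode='same')

def headphone_filter(ch, hrir_direct, indirect_hrirs, direct_gain=0.7):
    """Mix a single-path direct version of an input channel with a
    multiple-path ambient version, one ear's worth of output."""
    direct = fftconvolve(ch, hrir_direct, mode='same')
    # Each indirect path: decorrelation filter, then its own HRIR.
    ambient = sum(fftconvolve(decorrelate(ch, i), h, mode='same')
                  for i, h in enumerate(indirect_hrirs))
    return direct_gain * direct + (1.0 - direct_gain) * ambient
```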
In some but not necessarily all examples, the cross-channel mixing means is configured to cause stereo-widening for headphone apparatus such that a width of a spatial audio image associated with the modified second signal component is greater than a width of a spatial audio image associated with the second signal component before cross-channel mixing of the second signal component.
In some but not necessarily all examples, the first portion is front and central relative to a user of the headphone apparatus, and the second portion is peripheral relative to the user of headphone apparatus and does not overlap the first portion.
In some but not necessarily all examples, the first and second portions are contiguous.
In some but not necessarily all examples, the bypass means enables components of the input audio signal that represent a sound source that is coherent between two stereo channels and is positioned to front and center, to bypass the cross-channel mixing means.
In some but not necessarily all examples, a control input controls one or more of: the first portion and/or the second portion; decomposition of the input signal into the first component and the second component; relative gain of the first component and the second component; widening of the second component; ratio of direct to ambient gain during widening of the second component; panning of the first component; whether there is or is not panning of the first component; panning extent of the first component; and energy-based temporal smoothing.
In some but not necessarily all examples, when the input audio signal comprises a same sound source that is repeated at different positions: when the sound source is positioned at a first position that is relatively front and central to a user of the headphone apparatus, the sound source is rendered at the headphone apparatus without interaural time differences and without frequency-dependent interaural level differences; and when the sound source is repeated at a second position that is relatively peripheral and is not front and central to the user, the sound source is rendered at the headphone apparatus with interaural time differences and with frequency-dependent interaural level differences.
In some but not necessarily all examples, there is provided a system comprising the apparatus and a headphone apparatus configured for receiving and rendering the output audio signal.
In some but not necessarily all examples, the apparatus is configured as a headphone apparatus for rendering the output audio signal.
According to various, but not necessarily all, examples there is provided a method for processing an input audio signal comprising multiple input channels, the method comprising: deriving, based on the input audio signal, a first signal component, comprising at least one input channel, and a second signal component, comprising multiple input channels, wherein the first signal component is dependent upon at least a first portion of a spatial audio image conveyed by the input audio signal, and the second signal component is dependent upon at least a second portion of the spatial audio image that is different to the first portion; cross-channel mixing of at least some of the multiple input channels of the second signal component to produce a modified second signal component while enabling the first signal component to bypass cross-channel mixing; and combining the first signal component and the modified second signal component into an output audio signal comprising two output channels configured for rendering by headphone apparatus.
According to various, but not necessarily all, examples there is provided an apparatus for processing an input audio signal comprising multiple input channels, the apparatus comprising at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: derive, based on the input audio signal, a first signal component, comprising at least one input channel, and a second signal component, comprising multiple input channels, wherein the first signal component is dependent upon at least a first portion of a spatial audio image conveyed by the input audio signal, and the second signal component is dependent upon at least a second portion of the spatial audio image that is different to the first portion; perform cross-channel mixing of at least some of the multiple input channels of the second signal component to produce a modified second signal component while enabling the first signal component to bypass cross-channel mixing; and combine the first signal component and the modified second signal component into an output audio signal comprising two output channels configured for rendering by headphone apparatus.
According to various, but not necessarily all, examples there is provided a computer program comprising computer readable program code configured to cause a computer to: derive, based on an input audio signal, a first signal component, comprising at least one input channel, and a second signal component, comprising multiple input channels, wherein the first signal component is dependent upon at least a first portion of a spatial audio image conveyed by the input audio signal, and the second signal component is dependent upon at least a second portion of the spatial audio image that is different to the first portion; perform cross-channel mixing of at least some of the multiple input channels of the second signal component to produce a modified second signal component while enabling the first signal component to bypass cross-channel mixing.
According to various, but not necessarily all, examples there is provided an apparatus for processing an input audio signal comprising multiple channels to produce a two-channel output audio signal configured for rendering by headphone apparatus to produce a spatial audio image, the apparatus comprising: means for processing an input audio signal comprising multiple channels to produce a two-channel output audio signal configured for rendering by headphone apparatus; means for spatially processing the input audio signal to add at peripheral positions, but not at central positions, of the spatial audio image positionally-dependent interaural time differences measurable between coherent audio events in both of the channels of the output audio signal and frequency-dependent and positionally-dependent interaural level differences measurable between coherent audio events in both of the channels of the output audio signal.
In some but not necessarily all examples, the means for deriving the first and second signal components is arranged to derive, on basis of the input audio signal, the first signal component that represents coherent sounds of the spatial audio image that reside within the first portion of the spatial audio image; and derive, on basis of the input audio signal, the second signal component that represents coherent sounds of the spatial audio image that reside within the second portion of the spatial audio image and outside the first portion of the spatial audio image and non-coherent sounds of the spatial audio image.
In some but not necessarily all examples, the first portion of the spatial audio image comprises one or more angular ranges that define a set of sound arrival directions within the spatial audio image.
In some but not necessarily all examples, said one or more angular ranges comprise an angular range that defines a range of sound arrival directions centered around a front direction of the spatial audio image.
In some but not necessarily all examples, the means for deriving the first and second signal components comprises a means for deriving, on basis of the input audio signal, for a plurality of frequency sub-bands, a respective coherence value that is descriptive of coherence between channels of the input audio signal in the respective frequency sub-band; a means for deriving, on basis of estimated sound arrival directions in view of the first portion of the spatial audio image, for said plurality of frequency sub-bands, a respective directional coefficient that is indicative of a relationship between the estimated sound arrival direction and the first portion of the spatial audio image in the respective frequency sub-band; a means for deriving, on basis of said coherence values and directional coefficients, for said plurality of frequency sub-bands, respective decomposition coefficients; and a means for decomposing the input audio signal into the first and second signal components using said decomposition coefficients.
In some but not necessarily all examples, the means for deriving the directional coefficients is arranged to, for said plurality of frequency sub-bands, set the directional coefficient for a frequency sub-band to a non-zero value in response to the estimated sound arrival direction for said frequency sub-band residing within the first portion of the spatial audio image, and set the directional coefficient for a frequency sub-band to a zero value in response to the estimated sound arrival direction for said frequency sub-band residing within the second portion of the spatial audio image.
In some but not necessarily all examples, the means for determining the decomposition coefficients is arranged to derive, for said plurality of frequency sub-bands, the respective decomposition coefficient as the product of the coherence value and the directional coefficient derived for the respective frequency sub-band.
In some but not necessarily all examples, the means for decomposing the input audio signal is arranged to, for said plurality of frequency sub-bands, derive the first signal component in each frequency sub-band as a product of the input audio signal in the respective frequency sub-band and a first scaling coefficient that increases with increasing value of the decomposition coefficient derived for the respective frequency sub-band; and derive the second signal component in each frequency sub-band as a product of the input audio signal in the respective frequency sub-band and a second scaling coefficient that decreases with increasing value of the decomposition coefficient derived for the respective frequency sub-band.
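The decomposition described in the preceding paragraphs might be sketched as follows; the square-root gain pair is one energy-preserving choice and an assumption, as are the array layouts and the bin-to-band mapping.

```python
import numpy as np

def decompose(S, coherence, directional, band_of_bin):
    """Split the transform-domain input into the first (focus) and
    second (non-focus) signal components.

    S           : complex array (channels, bins, frames)
    coherence   : array (bands, frames), gamma(k, n) in [0, 1]
    directional : array (bands, frames), e.g. 1 inside the focus range,
                  0 outside (or a soft value in between)
    band_of_bin : int array (bins,), sub-band index k of each bin b
    """
    d = np.clip(coherence * directional, 0.0, 1.0)  # decomposition coefficient
    g_first = np.sqrt(d)          # increases with increasing d
    g_second = np.sqrt(1.0 - d)   # decreases with increasing d
    # Expand per-band gains to per-bin gains and scale the input.
    first = S * g_first[band_of_bin, :]
    second = S * g_second[band_of_bin, :]
    return first, second
```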
In some but not necessarily all examples, the apparatus comprises a means for delaying the first signal component by a predefined time delay prior to combining the first signal component with the modified second signal component, thereby creating a delayed first signal component that is temporally aligned with the modified second signal component.
In some but not necessarily all examples, the apparatus comprises a means for modifying the first signal component prior to combining the first signal component with the modified second signal component, wherein the modification comprises generating, on basis of the first signal component, a modified first signal component wherein one or more sound sources represented by the first signal component are panned in the spatial audio image. In some but not necessarily all examples, each of said multiple input channels comprises two channels.
According to various, but not necessarily all, embodiments there is provided examples as claimed in the appended claims.
According to an example embodiment, a computer program is provided, the computer program comprising computer readable program code configured to cause performing at least a method according to the example embodiment described in the foregoing when said program code is executed on a computing apparatus.
The computer program according to an example embodiment may be embodied on a volatile or a non-volatile computer-readable record medium, for example as a computer program product comprising at least one computer-readable non-transitory medium having program code stored thereon, which, when executed by an apparatus, causes the apparatus at least to perform the operations described hereinbefore for the computer program according to an example embodiment of the invention.
The exemplifying embodiments of the invention presented in this patent application are not to be interpreted to pose limitations to the applicability of the appended claims. The verb "to comprise" and its derivatives are used in this patent application as an open limitation that does not exclude the existence of also unrecited features. The features described hereinafter are mutually freely combinable unless explicitly stated otherwise.
Some features of the invention are set forth in the appended claims. Aspects of the invention, however, both as to its construction and its method of operation, together with additional objects and advantages thereof, will be best understood from the following description of some example embodiments when read in connection with the accompanying drawings.
DEFINITIONS
A headphone apparatus is an apparatus that has a left audio output device that is worn at, over or in a left ear of a user and a right audio output device that is worn at, over or in a right ear of a user. The audio heard in the left ear by the user is dependent upon audio output by the left audio output device and is not dependent upon audio output by the right audio output device. The audio heard in the right ear by the user is dependent upon audio output by the right audio output device and is not dependent upon audio output by the left audio output device. The headphone apparatus receives input signals wirelessly or over a wired connection. In some but not necessarily all examples, the headphone apparatus comprises acoustic isolators that isolate the ears of the user from external environmental sounds. In some examples, the headphone apparatus can comprise 'cans' that cover the user's ears and provide at least some acoustic isolation. In some examples, the headphone apparatus can comprise deformable 'buds' that fit snugly inside the user's ears and provide at least some acoustic isolation. Each audio output device comprises a transducer that converts a received electrical signal to an acoustic pressure wave or a vibration.
multi-channel audio signal: in this disclosure we use the term multi-channel audio signal to refer to audio signals that have two or more channels.
stereo signal: the term stereo signal is used to refer to a stereophonic audio signal.
surround sound signal: the term surround signal is used to refer to a multi-channel audio signal having more than two channels.
BRIEF DESCRIPTION OF FIGURES
The embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, where Figure 1A illustrates a block diagram of some elements of an audio processing system for headphones according to an example; Figure 1B illustrates a block diagram of some elements of an audio processing system for headphones according to an example; Figure 2 illustrates a block diagram of some elements of a device that may be applied to implement the audio processing system for headphones according to an example; Figure 3 illustrates a block diagram of some elements of a signal decomposer according to an example; Figure 4 illustrates a block diagram of some elements of a re-panner for headphones according to an example; Figure 5 illustrates a block diagram of some elements of a stereo widening processor for headphones according to an example; Figure 6 illustrates a flow chart depicting a method for audio processing for headphones according to an example; and Figure 7 illustrates a block diagram of some elements of an apparatus according to an example.
DESCRIPTION OF SOME EMBODIMENTS
In the following examples there is disclosed an apparatus 100, 100', 50 for processing an input audio signal 101 comprising multiple channels, the apparatus 100, 100', 50 comprising: means 104 for deriving, based on the input audio signal 101, a first signal component 105-1, comprising at least one input channel, and a second signal component 105-2, comprising multiple input channels, wherein the first signal component 105-1 is dependent upon at least a first portion of a spatial audio image conveyed by the input audio signal 101, and the second signal component 105-2 is dependent upon at least a second portion of the spatial audio image that is different to the first portion; cross-channel mixing means 112, 112' for cross-channel mixing of a plurality of input channels; means 104 for directing the second signal component 105-2 to the cross-channel mixing means 112, 112' for cross-channel mixing of at least some of the multiple input channels of the second signal component 105-2 to produce a modified second signal component 113, 113'; bypass means 104, 106 for enabling the first signal component 105-1 to bypass the cross-channel mixing means 112, 112'; and means 114, 114' for combining the first signal component 111, 111' and the modified second signal component 113, 113' into an output audio signal 115 comprising two output channels configured for rendering by headphone apparatus 20.
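Before turning to the figures, the essence of this arrangement can be sketched in a few lines; here the cross-channel mixing is collapsed to a single illustrative gain (the embodiments described below use HRTF-based filtering instead), and the function and parameter names are assumptions for illustration only.

```python
import numpy as np

def combine_components(first, second, cross_gain=0.5):
    """Sketch of the core signal flow: the second (peripheral) component
    is cross-channel mixed, the first (front-and-central) component
    bypasses that mixing, and both are summed into the two-channel
    headphone output.

    first, second : arrays of shape (2, samples), stereo signal components
    """
    left_in, right_in = second
    mixed = np.stack([left_in + cross_gain * right_in,   # left output
                      right_in + cross_gain * left_in])  # right output
    return first + mixed  # bypass path combined at the output
```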
Figure 1A illustrates a block diagram of some components and/or entities of an audio processing system 100 that may serve as a framework for various embodiments of the audio processing technique described in the present disclosure. The audio processing system 100 obtains a stereophonic audio signal as an input signal 101 and provides a stereophonic audio signal having an at least partially widened spatial audio image as an output signal 115. The input signal 101 and the output signal 115 are referred to in the following as a stereo signal 101 and a widened stereo signal 115, respectively. In the following examples that pertain to the audio processing system 100, each of these signals is assumed to be a respective two-channel stereophonic audio signal unless explicitly stated otherwise. Moreover, each of the intermediate audio signals derived on basis of the input signal 101 is likewise a respective two-channel audio signal unless explicitly stated otherwise.
Nevertheless, the audio processing system 100 readily generalizes into one that enables processing of a spatial audio signal (i.e. a multi-channel audio signal with more than two channels, such as a 5.1-channel spatial audio signal or a 7.1-channel spatial audio signal), some aspects of which are also described in the examples provided in the following.
The audio processing system 100 may further receive a control input 10 and an indication 12 of target sound source (virtual loudspeaker) positions.
The audio processing system 100 according to the example illustrated in Figure 1A comprises a transform entity (or a transformer) 102 for converting the stereo audio signal 101 from time domain into a transform domain stereo signal 103, a signal decomposer 104 for deriving, based on the transform-domain stereo signal 103, a first signal component 105-1 that represents a focus portion of the spatial audio image and a second signal component 105-2 that represents a non-focus portion of the spatial audio image, a re-panner 106 for generating, on basis of the first signal component 105-1, a modified first signal component 107, where one or more sound sources represented in the focus portion of the spatial audio image are repositioned in dependence of the target configuration, an inverse transform entity 108-1 for converting the modified first signal component 107 from the transform domain to a time-domain modified first signal component 109-1, an inverse transform entity 108-2 for converting the second signal component 105-2 from the transform domain to a time-domain second signal component 109-2, a delay element 110 for delaying the modified first signal component 109-1 by a predefined time delay, a stereo widening (for headphones) processor 112 for generating, on basis of the second signal component 109-2, a modified second signal component 113 where the width of a spatial audio image is extended from that of the second signal component 109-2, and a signal combiner 114 for combining the delayed first signal component 111 and the modified second signal component 113 into a widened stereo signal 115 that conveys a partially extended spatial audio image.
Figure 1B illustrates a block diagram of some components and/or entities of an audio processing system 100', which is a variation of the audio processing system 100 illustrated in Figure 1A. In the audio processing system 100', the differences to the audio processing system 100 are that the inverse transform entities 108-1 and 108-2 are omitted, the delay element 110 is replaced with the optional delay element 110' for delaying the modified first signal component 107 into a delayed modified first signal component 111', the stereo widening processor 112 is replaced with a stereo widening processor 112' for generating, on basis of the transform-domain second signal component 105-2, a modified (transform-domain) second signal component 113', and the signal combiner 114 is replaced with a signal combiner 114' for combining the delayed modified first signal component 111' and the modified second signal component 113' into a widened stereo signal 115' in the transform domain. Moreover, the audio processing system 100' comprises a transform entity 108' for converting the widened stereo signal 115' from the transform domain into a time-domain widened stereo signal 115. In case the optional delay element 110' is omitted, the signal combiner 114' receives the modified first signal component 107 (instead of the delayed version thereof) and operates to combine the modified first signal component 107 with the modified second signal component 113' to create the transform-domain widened stereo signal 115'.
In the following, the audio processing technique described in the present disclosure is predominantly described via examples that pertain to the audio processing system 100 according to the example of Figure 1A and entities thereof, whereas the audio processing system 100' and entities thereof are separately described where applicable. In further examples, the audio processing system 100 or the audio processing system 100' may include further entities and/or some entities depicted in Figures 1A and 1B may be omitted or combined with other entities. In particular, Figures 1A and 1B, as well as the subsequent Figures 2 to 5 serve to illustrate logical components of a respective entity and hence do not impose structural limitations concerning implementation of the respective entity but, for example, respective hardware means, respective software means or a respective combination of hardware means and software means may be applied to implement any of the logical components of an entity separately from the other logical components of that entity, to implement any sub-combination of two or more logical components of an entity, or to implement all logical components of an entity in combination.
The audio processing system 100, 100' may be implemented by one or more computing devices and the resulting widened stereo signal 115 may be provided for playback via headphone apparatus. Typically, the audio processing system 100, 100' is implemented in a computing device of any type, e.g. a portable handheld device, a desktop computer, a server device, etc. Examples of portable handheld devices include a mobile phone, a media player device, a tablet computer, a laptop computer, etc. The computing device can also be used to play back the widened stereo signal 115 via headphone apparatus. In another example, the audio processing system 100, 100' is provided in the headphone apparatus and the playback of the widened stereo signal 115 is provided in the headphone apparatus. In a further example, a first part of the audio processing system 100, 100' is provided in a first device, whereas a second part of the audio processing system 100, 100' and the playback of the widened stereo signal 115 is provided in the headphone apparatus.
Figure 2 illustrates a block diagram of some components and/or entities of a portable handheld device 50 that implements the audio processing system 100 or the audio processing system 100'. For brevity and clarity of description, in the following description it is assumed that the elements of the audio processing system 100, 100' and the playback of the resulting widened stereo signal are provided in the device 50. The device 50 further comprises a memory device 52 for storing information, e.g. the stereo signal 101, and a communication interface 54 for communicating with other devices and possibly receiving the stereo signal 101 therefrom. The device 50, optionally, further comprises an audio preprocessor 56 that may be useable for preprocessing the stereo signal 101 read from the memory 52 or received via the communication interface 54 before providing it to the audio processing system 100, 100'. The audio preprocessor 56 may, for example, carry out decoding of an audio signal stored in an encoded format into a time domain stereo audio signal 101.
Still referring to Figure 2, the audio processing system 100, 100' may further receive the first control input 10 and indication 12 together with the stereo signal 101 from or via the audio preprocessor 56.
The control input 10 is used to control the signal decomposition 104 and/or the re-panning 106 and/or the stereo-widening 112, 112'. More details are provided in the following description.
The indication 12 indicates the target sound source (virtual loudspeaker) positions. Effectively this means the positions of loudspeakers if the input audio signal would be reproduced by loudspeakers.
The virtual loudspeaker positions typically match the loudspeaker format of the input audio signals. For stereo input signals the virtual loudspeaker positions could, e.g., correspond to loudspeaker angles of +/-30 degrees with respect to the front direction. For multichannel audio signals, e.g. for 5.1, these angles are typically 0, +/-30 and +/-110 degrees. However, in practice, the virtual loudspeaker positions may have any meaningful values. The target sound source position indication may also be provided by other means (e.g. via a user interface), be a hardcoded value, or be omitted. In at least some examples, the indication 12 is used to control the signal decomposition 104. In some but not necessarily all examples, it can be used for the stereo-widening 112.
The audio processing system 100, 100' provides the widened stereo signal 115 derived therein to an interface for communicating to headphone apparatus 20 for rendering.
The headphone apparatus 20 is an apparatus that has a left audio output device 21 that is worn at, over or in a left ear of a user and a right audio output device 22 that is worn at, over or in a right ear of a user. The audio heard in the left ear by the user is dependent upon audio output by the left audio output device 21 and is not dependent upon audio output by the right audio output device 22. The audio heard in the right ear by the user is dependent upon audio output by the right audio output device 22 and is not dependent upon audio output by the left audio output device 21. The headphone apparatus 20 receives input signals wirelessly or over a wired connection. In some but not necessarily all examples, the headphone apparatus 20 comprises acoustic isolators 23 that isolate the ears of the user from external environmental sounds. In some examples, the headphone apparatus can comprise left and right 'cans' 23 that cover the user's ears, house the respective audio output devices 21, 22 and provide at least some acoustic isolation. In some examples, the headphone apparatus can comprise deformable 'buds' that fit snugly inside the respective left and right ears of the user, surround the respective audio output devices 21, 22 and provide at least some acoustic isolation.
Each audio output device 21, 22 comprises a transducer that converts a received electrical signal to an acoustic pressure wave or a vibration.
The stereo signal 101 may be received at the signal processing system 100, 100' e.g. by reading the stereo signal from a memory or from a mass storage device in the device 50. In another example, the stereo signal is obtained via communication interface (such as a network interface) from another device that stores the stereo signal in a memory or from a mass storage device provided therein. The widened stereo signal 115 may be provided for rendering by headphone apparatus 20. Additionally or alternatively, the widened stereo signal 115 may be stored in the memory or the mass storage device in the device 50 and/or provided via a communication interface to another device for storage therein.
The information 12 that defines the virtual loudspeaker positions may be used to control stereo widening processing such that audio sources are perceived at desired positions, which may also be at positions outside the physical locations of the headphones. The processing may include maintaining some portions (such as the focus portion of the spatial audio image) in between the physical locations of the headphones.
The audio processing system 100, 100' may be arranged to process the stereo signal 101 arranged into a sequence of input frames, each input frame including a respective segment of digital audio signal for each of the channels, provided as a respective time series of input samples at a predefined sampling frequency. In a typical example, the audio processing system 100, 100' employs a fixed predefined frame length. In other examples, the frame length may be a selectable frame length that may be selected from a plurality of predefined frame lengths, or the frame length may be an adjustable frame length that may be selected from a predefined range of frame lengths. A frame length may be defined as the number of samples L included in the frame for each channel of the stereo signal 101, which at the predefined sampling frequency maps to a corresponding duration in time. As an example in this regard, the audio processing system 100, 100' may employ a fixed frame length of 20 milliseconds (ms), which at a sampling frequency of 8, 16, 32 or 48 kHz results in a frame of L=160, L=320, L=640 and L=960 samples per channel, respectively. The frames may be non-overlapping or they may be partially overlapping. These values, however, serve as non-limiting examples and frame lengths and/or sampling frequencies different from these examples may be employed instead, depending e.g. on the desired audio bandwidth, on the desired framing delay and/or on the available processing capacity.
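The mapping from frame duration to samples per channel is a one-line computation:

```python
# L = frame duration x sampling frequency, e.g. a fixed 20 ms frame:
for fs in (8000, 16000, 32000, 48000):
    L = int(0.020 * fs)
    print(f"{fs} Hz -> L = {L} samples per channel")
# prints 160, 320, 640 and 960, matching the values above
```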
Referring back to Figures 1A and 1B, the audio processing system 100, 100' may comprise the transform entity 102 that is arranged to convert the stereo signal 101 from the time domain into a transform-domain stereo signal 103. Typically, the transform domain involves a frequency domain. In an example, the transform entity 102 employs a short-time discrete Fourier transform (STFT) to convert each channel of the stereo signal 101 into a respective channel of the transform-domain stereo signal 103 using a predefined analysis window length (e.g. 20 milliseconds). In another example, the transform entity 102 employs an (analysis) complex-modulated quadrature-mirror filter (QMF) bank for time-to-frequency-domain conversion. The STFT and the QMF bank serve as non-limiting examples in this regard and in further examples any suitable transform technique known in the art may be employed for creating the transform-domain stereo signal 103.
The transform entity 102 may further divide each of the channels into a plurality of frequency sub-bands, thereby resulting in the transform-domain stereo signal 103 that provides a respective time-frequency representation for each channel of the stereo signal 101. A given frequency band in a given frame may be referred to as a time-frequency tile. The number of frequency sub-bands and the respective bandwidths of the frequency sub-bands may be selected e.g. in accordance with the desired frequency resolution and/or the available computing power. In an example, the sub-band structure involves 24 frequency sub-bands according to the Bark scale, an equivalent rectangular bandwidth (ERB) scale or a third-octave band scale known in the art. In other examples, a different number of frequency sub-bands that have the same or different bandwidths may be employed. A specific example in this regard is a single frequency sub-band that covers the input spectrum in its entirety or a continuous subset thereof.
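A sketch of such a time-frequency analysis follows, assuming an STFT with a 20 ms window and an illustrative set of sub-band edge frequencies (the patent allows Bark, ERB or third-octave scales, among others); the function and parameter names are assumptions.

```python
import numpy as np
from scipy.signal import stft

def to_time_frequency(x, fs, band_edges_hz):
    """STFT each channel and map frequency bins to analysis sub-bands.

    x             : array (channels, samples), time-domain stereo signal
    band_edges_hz : ascending sub-band edge frequencies in Hz
    Returns S of shape (channels, bins, frames) and band_of_bin (bins,),
    the sub-band index k that each frequency bin b belongs to.
    """
    f, t, S = stft(x, fs=fs, nperseg=int(0.020 * fs))  # 20 ms analysis window
    band_of_bin = np.searchsorted(band_edges_hz, f, side='right') - 1
    band_of_bin = np.clip(band_of_bin, 0, len(band_edges_hz) - 2)
    return S, band_of_bin
```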
A time-frequency tile that represents frequency bin b in time frame n of channel i of the transform-domain stereo signal 103 may be denoted as S(i, b, n). The channel i represents a single virtual loudspeaker or an input channel. The transform-domain stereo signal 103, e.g. the time-frequency tiles S(i, b, n), are passed to the signal decomposer 104 for decomposition into the first signal component 105-1 and the second signal component 105-2 therein. As described in the foregoing, a plurality of consecutive frequency bins may be grouped into a frequency sub-band, thereby providing a plurality of frequency sub-bands k = 0, ..., K-1. For each frequency sub-band k, the lowest bin (i.e. the frequency bin that represents the lowest frequency in that frequency sub-band) may be denoted as b_k,low and the highest bin (i.e. the frequency bin that represents the highest frequency in that frequency sub-band) may be denoted as b_k,high.
Referring back to Figures 1A and 1B, the audio processing system 100, 100' may comprise the signal decomposer 104 that is arranged to derive, based on the transform-domain stereo signal 103, the first signal component 105-1 and the second signal component 105-2. In the following, the first signal component 105-1 is referred to as a signal component that represents the focus portion of the spatial audio image and the second signal component 105-2 is referred to as a signal component that represents the non-focus portion of the spatial audio image. The focus portion represents those parts of the audio image that are front and central and can be considered as 'frontness'. The non-focus portion represents those parts of the audio image that are not represented by the focus portion (not front and central) and may hence be referred to as a 'peripheral' portion of the spatial audio image. Herein, the decomposition procedure does not change the number of channels and hence in the present example each of the first signal component 105-1 and the second signal component 105-2 is provided as a respective two-channel audio signal. It should be noted that the terms focus portion and non-focus portion as used in this disclosure are designations assigned to spatial sub-portions of the spatial audio image represented by the stereo signal 101, while these designations as such do not imply any specific processing to be applied (or having been applied) to the underlying stereo signal 101 or the transform-domain stereo signal 103 e.g. to actively emphasize or de-emphasize any portion of the spatial audio image represented by the stereo signal 101.
The signal decomposer 104 may derive, on basis of the transform-domain stereo signal 103, the first signal component 105-1 that represents those coherent sounds of the spatial audio image that are within a predefined focus range, such sounds hence constituting the focus portion of the spatial audio image. The focus range can be defined by the control input 10.
In contrast, the signal decomposer 104 may derive, on basis of the transform-domain stereo signal 103, the second signal component 105-2 that represents coherent sound sources or sound components of the spatial audio image that are outside the predefined focus range and all non-coherent sound sources of the spatial audio image, such sound sources or components hence constituting the non-focus portion of the spatial audio image. Hence, the signal decomposer 104 decomposes the sound field represented by the stereo signal 101 into the first signal component 105-1 that is excluded from subsequent stereo widening processing and into the second signal component 105-2 that is subsequently subjected to the stereo widening processing.
Figure 3 illustrates a block diagram of some components and/or entities of the signal decomposer 104 according to an example. The signal decomposer 104 may be, conceptually, divided into a decomposition analyzer 104a and a signal divider 126, as illustrated in Figure 3. In the following, entities of the signal decomposer 104 according to the example of Figure 3 are described in more detail. In other examples, the signal decomposer 104 may include further entities and/or some entities depicted in Figure 3 may be omitted or combined with other entities. The signal decomposer 104 may comprise a coherence analyzer 116 for estimating, on basis of the transform-domain stereo signal 103, coherence values 117 that are descriptive of coherence between the channels of the transform-domain stereo signal 103. The coherence values 117 are provided to a decomposition coefficient determiner 124 for further processing therein.
Computation of the coherence values 117 may involve deriving a respective coherence value $\gamma(k,n)$ for a plurality of frequency sub-bands $k$ in a plurality of time frames $n$ based on the time-frequency tiles $S(i,b,n)$ that represent the transform-domain stereo signal 103. As an example, the coherence values 117 may be computed e.g. according to the equation (3):

$$\gamma(k,n) = \frac{\sum_{b=b_{k,low}}^{b_{k,high}} \mathrm{Re}\left(S^{*}(1,b,n)\,S(2,b,n)\right)}{\sum_{b=b_{k,low}}^{b_{k,high}} \left|S(1,b,n)\right|\left|S(2,b,n)\right|} \qquad (3)$$

where $\mathrm{Re}$ denotes the real part operator and $*$ denotes the complex conjugate.
The term $\gamma(k,n)$ has a large value when the audio of the channels is dominated by an audio event that is common to both channels. A common audio event will typically cause a similar complex phasor distribution across the frequency bins $b$ in both channels: for all frequency bins inside a frequency band, the phase is the same in both channels in the case of full coherence (i.e., $\gamma(k,n) = 1$).
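Equation (3) might be implemented per sub-band as in the following sketch; the bin-to-band mapping (as produced by the earlier time-frequency sketch) and the small denominator guard are assumptions.

```python
import numpy as np

def coherence(S, band_of_bin, num_bands):
    """Inter-channel coherence gamma(k, n) per equation (3).

    S : complex array (2, bins, frames), transform-domain stereo signal.
    """
    cross = np.real(np.conj(S[0]) * S[1])   # Re(S*(1,b,n) S(2,b,n))
    mags = np.abs(S[0]) * np.abs(S[1])      # |S(1,b,n)| |S(2,b,n)|
    num = np.zeros((num_bands, S.shape[2]))
    den = np.zeros_like(num)
    np.add.at(num, band_of_bin, cross)      # sum bins b_k,low .. b_k,high
    np.add.at(den, band_of_bin, mags)
    return num / np.maximum(den, 1e-12)     # guard against silent bands
```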
Still referring to Figure 3, the signal decomposer 104 may comprise the energy estimator 118 for estimating the energy of the transform-domain stereo signal 103. The energy values 119 are provided to a direction estimator 120 for direction angle estimation therein.
Computation of the energy values 119 may involve deriving a respective energy value $E(i,k,n)$ for a plurality of frequency sub-bands $k$ in a plurality of audio channels $i$ in a plurality of time frames $n$ based on the time-frequency tiles $S(i,b,n)$. As an example, the energy values $E(i,k,n)$ may be computed e.g. according to the equation (4):

$$E(i,k,n) = \sum_{b=b_{k,low}}^{b_{k,high}} \left|S(i,b,n)\right|^{2} \qquad (4)$$

Still referring to Figure 3, the signal decomposer 104 may comprise the direction estimator 120 for estimating the perceivable arrival direction of the sound represented by the stereo signal 101 based on the energy values 119 in view of a target virtual loudspeaker configuration applied in the stereo signal 101. The direction estimation may comprise computation of direction angles 121 based on the energy values in view of the target virtual loudspeaker positions, which direction angles 121 are provided to a focus estimator 122 for further analysis therein.
The target sound source (virtual loudspeaker) configuration may also be referred to as channel configuration (of the stereo signal 101). This information may be obtained, for example, from metadata 12 that accompanies the stereo signal 101, e.g. metadata included in an audio container within which the stereo signal 101 is stored. In another example, the information defining the target virtual loudspeaker configuration applied in the stereo signal 101 may be received (as user input) 12 via a user interface of the device 50. The target virtual loudspeaker configuration may be defined by indicating, for each channel of the stereo signal 101, a respective target virtual loudspeaker position with respect to an assumed listening point. As an example, a target position for a virtual loudspeaker may comprise a target direction, which may be defined as an angle with respect to a reference direction (e.g. a front direction). Hence, for example in case of a two-channel stereo signal, the target virtual loudspeaker configuration may be defined as respective target angles α_in(1) and α_in(2) with respect to the front direction for the left and right virtual loudspeakers. The target angles α_in(i) with respect to the front direction may be, alternatively, indicated by a single target angle α_in, which defines the absolute value of the target angles with respect to the front direction, e.g. such that α_in(1) = α_in and α_in(2) = −α_in. In a further example, no indication 12 is received in the audio processing system 100, 100' and the elements of the audio processing system 100, 100' that make use of the information that defines the target virtual loudspeaker configuration applied in the stereo signal 101 (the signal decomposer 104, the re-panner 106) apply predefined information in this regard instead. An example in this regard involves applying a fixed predefined target virtual loudspeaker configuration. Another example involves selecting one of a plurality of predefined target virtual loudspeaker configurations in dependence of the number of audio channels in the received stereo signal 101. Non-limiting examples in this regard include selecting, in response to a two-channel signal 101 (which is hence assumed as a two-channel stereophonic audio signal), a target virtual loudspeaker configuration where the channels are positioned at ±30 degrees with respect to the front direction and/or selecting, in response to a six-channel signal (that is hence assumed to represent a 5.1-channel surround signal), a target virtual loudspeaker configuration where the channels are positioned at target angles α_in(i) of 0 degrees, ±30 degrees and ±110 degrees with respect to the front direction and complemented with a low frequency effects (LFE) channel.
The direction estimator 120 is configured to estimate the perceivable arrival direction of the sound represented by the stereo signal 101. The direction estimation may involve deriving a respective direction angle 121, θ(k,n), for a plurality of frequency sub-bands k in a plurality of time frames n based on the estimated energies E(i,k,n) and the target virtual loudspeaker positions α_in(i), the direction angles 121, θ(k,n), thereby indicating the estimated perceived arrival direction of the sound in frequency sub-bands of input frames. The direction estimation may be carried out, for example, using the tangent law according to the equations (5) and (6), where an underlying assumption is that sound sources in the sound field represented by the stereo signal 101 are arranged (to a significant extent) in their desired spatial positions using amplitude panning:

$$\theta(k,n) = \arctan\left(\tan(\alpha_{in})\,\frac{g_1 - g_2}{g_1 + g_2}\right) \qquad (5)$$

where

$$g_1 = \sqrt{E(1,k,n)}, \quad g_2 = \sqrt{E(2,k,n)} \qquad (6)$$

and α_in denotes the absolute value of the target angles α_in(1) and α_in(2) that define, respectively, the target positions of the left and right virtual loudspeakers with respect to the front direction, which in this example are positioned symmetrically (and equidistantly) with respect to the front direction. In other examples, the target positions of the left and right virtual loudspeakers may be positioned non-symmetrically with respect to the front direction (e.g. such that |α_in(1)| ≠ |α_in(2)|). Modification of the equation (5) such that it accounts for this aspect is a straightforward task for a person skilled in the art.
For example, the modification of the equation (5) in the case of non-symmetric (virtual) loudspeaker positions can be performed as follows. First, half of the angle between the loudspeakers is computed:

$$\alpha_0 = \frac{\lvert \alpha_{in}(1) - \alpha_{in}(2)\rvert}{2}$$

Next, the center point between the loudspeakers is computed:

$$\alpha_c = \frac{\alpha_{in}(1) + \alpha_{in}(2)}{2}$$

Using these values, the equation (5) can be represented for non-symmetric cases as

$$\theta(k,n) = \arctan\left(\tan(\alpha_0)\,\frac{g_1 - g_2}{g_1 + g_2}\right) + \alpha_c$$

where g_1 and g_2 are computed as in equation (6).
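A minimal sketch of the direction estimation of equations (5) and (6), covering the non-symmetric generalization above (the symmetric case corresponds to α_c = 0); angles in radians, array shapes are assumptions for the example:

```python
import numpy as np

def estimate_direction(E, alpha_1, alpha_2):
    """Sketch of equations (5)-(6): per-band direction angle theta(k, n).

    E: energies of shape (2 channels, K sub-bands, N frames).
    alpha_1, alpha_2: target virtual loudspeaker angles in radians.
    """
    g1, g2 = np.sqrt(E[0]), np.sqrt(E[1])
    alpha_0 = abs(alpha_1 - alpha_2) / 2.0   # half the angle between loudspeakers
    alpha_c = (alpha_1 + alpha_2) / 2.0      # center point between loudspeakers
    ratio = (g1 - g2) / np.maximum(g1 + g2, 1e-12)
    return np.arctan(np.tan(alpha_0) * ratio) + alpha_c
```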
Still referring to Figure 3, the signal decomposer 104 may comprise the focus estimator 122 for determining one or more focus coefficients 123 based on the estimated perceivable arrival direction of the sound represented by the stereo signal 101 (direction angles 121) in view of a defined focus range within the spatial audio image, where the focus coefficients 123 are indicative of the relationship between the estimated arrival direction of the sound (direction angles 121) and the focus range. The focus range may be defined, for example, as a single angular range or as two or more angular sub-ranges in the spatial audio image. In other words, the focus range may be defined as a set of arrival directions of the sound within the spatial audio image. The focus range can be defined by the control input 10.
The focus coefficients 123 may be derived by the focus estimator 122 based at least in part on the direction angles 121. The focus estimator 122 may optionally further receive the indication 12 of the target virtual loudspeaker configuration applied in the stereo signal 101, and compute the focus coefficients 123 further in view of this information. The focus coefficients 123 are provided for the decomposition coefficient determiner 124 for further processing therein.
Typically, the one or more angular ranges of the focus range define a set of arrival directions that cover a defined portion around the center of the spatial audio image, thereby rendering the focus estimation as a 'frontness' estimation. The focus estimation may involve deriving a respective focus (frontness) coefficient χ(k,n) for a plurality of frequency sub-bands k in a plurality of time frames n based on the direction angles 121, θ(k,n), e.g. according to the equation (7):

$$\chi(k,n) = \begin{cases} 1, & \lvert\theta(k,n)\rvert < \theta_{Th1} \\ 1 - \dfrac{\lvert\theta(k,n)\rvert - \theta_{Th1}}{\theta_{Th2} - \theta_{Th1}}, & \theta_{Th1} \le \lvert\theta(k,n)\rvert \le \theta_{Th2} \\ 0, & \lvert\theta(k,n)\rvert > \theta_{Th2} \end{cases} \qquad (7)$$

In the equation (7), the first threshold value θ_Th1 and the second threshold value θ_Th2, where θ_Th1 < θ_Th2, serve to define a primary (center) angular focus range (between angles −θ_Th1 and θ_Th1 around the front direction), a secondary angular focus range (from −θ_Th2 to −θ_Th1 and from θ_Th1 to θ_Th2 with respect to the front direction) and a non-focus range (outside −θ_Th2 and θ_Th2 with respect to the front direction). The coefficients defining the focus range, θ_Th1 and θ_Th2, can be defined by the control input 10.
As a non-limiting example, the first and second threshold values may be set to θ_Th1 = 5° and θ_Th2 = 15°, whereas in other examples different threshold values θ_Th1 and θ_Th2 may be applied instead. Focus estimation according to the equation (7) hence applies a focus range that includes two angular ranges (i.e. the primary angular focus range and the secondary angular focus range) and sets the focus coefficient χ(k,n) to unity in response to a sound source direction residing within the primary angular focus range and sets the focus coefficient χ(k,n) to zero in response to the sound source direction residing outside the focus range, whereas a predefined function of sound source direction is applied to set the focus coefficient χ(k,n) to a value between unity and zero in response to the sound source direction residing within the secondary angular focus range. In general, the focus coefficient χ(k,n) is set to a non-zero value in response to the sound source direction residing within the focus range and the focus coefficient χ(k,n) is set to a zero value in response to the perceived sound source direction, direction angles 121, θ(k,n), residing outside the focus range. In an example, the equation (7) may be modified such that no secondary angular focus range is applied and hence only a single threshold may be applied to define the limit(s) between the focus range and the non-focus range.
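The piecewise focus coefficient of equation (7) reduces to a clipped linear roll-off, which the following hedged sketch implements (threshold defaults are taken from the example above):

```python
import numpy as np

def focus_coefficient(theta, th1=np.deg2rad(5.0), th2=np.deg2rad(15.0)):
    """Sketch of equation (7): chi = 1 in the primary range, 0 outside the
    focus range, linear transition in the secondary range."""
    return np.clip(1.0 - (np.abs(theta) - th1) / (th2 - th1), 0.0, 1.0)
```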
Along the lines described in the foregoing, the focus range may be defined as one or more contiguous, non-overlapping angular focus ranges. As an example, the focus range may include a single defined angular range or two or more defined angular ranges.
According to another example, at least one of the focus ranges is selectable, e.g. such that an angular focus range may be selected or adjusted (e.g. via selection or adjustment of one or more threshold values that define the respective angular focus range) in dependence of the target (or assumed) virtual loudspeaker configuration associated with the stereo input signal 101, 12, and the focus range parameter present in the control input 10. For example, the control information could be used to control how large a portion (or which angles) of the sound image will be sent to widening.
Still referring to Figure 3, the signal decomposer 104 may comprise the decomposition coefficient determiner 124 for deriving decomposition coefficients 125 based on the coherence values 117 and the focus coefficients 123. The decomposition coefficients 125 are provided for the signal divider 126 for decomposition of the transform-domain stereo signal 103 therein.
The signal divider 126 is configured to derive, based on the transform-domain stereo signal 103 and the decomposition coefficients 125, the first signal component 105-1 that represents the focus portion of the spatial audio image and the second signal component 105-2 that represents the non-focus portion (e.g. a 'peripheral' portion) of the spatial audio image.
The decomposition coefficient determination aims at providing a high value for a decomposition coefficient β(k,n) for a frequency sub-band k and frame n that exhibits relatively high coherence between the channels of the stereo signal 101 and that conveys a directional sound component that is within the focus portion of the spatial audio image (see description of the focus estimator 122 in the foregoing). In this regard, the decomposition coefficient determination may involve deriving a respective decomposition coefficient β(k,n) for a plurality of frequency sub-bands k in a plurality of time frames n based on the respective coherence value γ(k,n) and the respective focus coefficient χ(k,n), e.g. according to the equation (8):

$$\beta(k,n) = \gamma(k,n)\,\chi(k,n) \qquad (8)$$

In an example, the decomposition coefficients β(k,n) may be applied as such as the decomposition coefficients 125 that are provided for the signal divider 126 for decomposition of the transform-domain stereo signal 103 therein.
In another example, energy-based temporal smoothing is applied to the decomposition coefficient β(k,n) obtained from the equation (8) in order to derive smoothed decomposition coefficients β'(k,n), which may be provided for the signal divider 126 to be applied for decomposition of the transform-domain stereo signal 103 therein. Smoothing of the decomposition coefficients results in slower variations over time in sub-portions of the spatial audio image assigned to the first signal component 105-1 and the second signal component 105-2, which may enable improved perceivable quality in the resulting widened stereo signal 115 via avoidance of small-scale fluctuations in the spatial audio image therein. A weighting that provides the energy-based temporal smoothing may be provided, for example, according to the equation (9a):

$$\beta'(k,n) = A(k,n)\,/\,B(k,n), \qquad (9a)$$

where

$$A(k,n) = a\,E(k,n)\,\beta(k,n) + b\,A(k,n-1) \quad \text{and} \quad B(k,n) = a\,E(k,n) + b\,B(k,n-1), \qquad (9b)$$

where E(k,n) denotes the total energy of the transform-domain stereo signal 103 for a frequency sub-band k in time frame n (derivable e.g. based on the energies E(i,k,n) derived using the equation (4)) and a and b (where, preferably, a + b = 1) denote predefined weighting factors. The weighting factors for energy-based temporal smoothing (a and b) can be defined by the control input 10. As a non-limiting example, values a = 0.2 and b = 0.8 may be applied, whereas in other examples other values in the range from 0 to 1 may be applied instead.
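The combination of equations (8), (9a) and (9b) amounts to an energy-weighted first-order recursion over frames; a sketch under the same assumptions as the earlier examples (array shapes are illustrative):

```python
import numpy as np

def smoothed_decomposition(gamma, chi, E_total, a=0.2, b=0.8):
    """Sketch of equations (8)-(9): beta = gamma * chi, then energy-based
    recursive smoothing across time frames.

    gamma, chi, E_total: arrays of shape (K sub-bands, N frames).
    """
    beta = gamma * chi                                # equation (8)
    out = np.zeros_like(beta)
    A = np.zeros(beta.shape[0])
    B = np.zeros(beta.shape[0])
    for n in range(beta.shape[1]):
        A = a * E_total[:, n] * beta[:, n] + b * A    # equation (9b)
        B = a * E_total[:, n] + b * B
        out[:, n] = A / np.maximum(B, 1e-12)          # equation (9a)
    return out
```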
Still referring to Figure 3, the signal decomposer 104 may comprise the signal divider 126 for deriving, based on the transform-domain stereo signal 103 and the decomposition coefficients 125, the first signal component 105-1 that represents the focus portion of the spatial audio image and the second signal component 105-2 that represents the non-focus portion (e.g. a 'peripheral' portion) of the spatial audio image.
As an example, the signal decomposition may be carried out for a plurality of frequency sub-bands k in a plurality of channels i in a plurality of time frames n based on the time-frequency tiles S(i,b,n), according to the equation (10a):

$$S_{sw}(i,b,n) = S(i,b,n)\,\bigl(1 - \beta(b,n)\bigr)^{p}, \quad S_{dr}(i,b,n) = S(i,b,n)\,\beta(b,n)^{p}, \qquad (10a)$$

where S_dr(i,b,n) denotes frequency bin b in time frame n of channel i of the first signal component 105-1 that represents the focus portion of the spatial audio image, S_sw(i,b,n) denotes frequency bin b in time frame n of channel i of the second signal component 105-2 that represents the non-focus portion (e.g. a 'peripheral' portion) of the spatial audio image, p denotes a predefined constant parameter (e.g. p = 0.5 or 1), and β(b,n) is equal to the decomposition coefficient β(k,n) for each frequency bin b within the frequency sub-band k.
The signal divider 126 creates the first signal component 105-1 that represents the focus portion of the spatial audio image and the second signal component 105-2 that represents the non-focus portion (e.g. a 'peripheral' portion) of the spatial audio image, but it does not necessarily place a time-frequency tile S(i,b,n) exclusively into either the first signal component 105-1 or the second signal component 105-2. It can, as in this example, scale or weight the contribution of a time-frequency tile S(i,b,n) more heavily in one of the first signal component 105-1 or the second signal component 105-2 dependent upon the decomposition coefficients β(k,n).
The scaling coefficient β(b,n)^p in the equation (10a) may be replaced with another scaling coefficient that increases with increasing value of the decomposition coefficient β(b,n) (and decreases with decreasing value of the decomposition coefficient β(b,n)), and the scaling coefficient (1 − β(b,n))^p in the equation (10a) may be replaced with another scaling coefficient that decreases with increasing value of the decomposition coefficient β(b,n) (and increases with decreasing value of the decomposition coefficient β(b,n)).
In another example, the signal decomposition may be carried out for a plurality of frequency sub-bands k in a plurality of channels i in a plurality of time frames n based on the time-frequency tiles S(i,b,n), according to the equation (10b):

$$S_{sw}(i,b,n) = \begin{cases} S(i,b,n), & \beta(b,n) \le \beta_{Th} \\ 0, & \beta(b,n) > \beta_{Th} \end{cases} \qquad S_{dr}(i,b,n) = \begin{cases} 0, & \beta(b,n) \le \beta_{Th} \\ S(i,b,n), & \beta(b,n) > \beta_{Th} \end{cases} \qquad (10b)$$

wherein β_Th denotes a defined threshold value that has a value in the range from 0 to 1, e.g. β_Th = 0.5. The signal decomposition parameter β_Th can be defined by the control input 10. If applying the equation (10b), the temporal smoothing of the decomposition coefficients 125 described in the foregoing and/or temporal smoothing of the resulting signal components S_sw(i,b,n) and S_dr(i,b,n) may be advantageous for improved perceivable quality of the resulting widened stereo signal 115. The decomposition coefficients β(k,n) according to the equation (8) are derived on a time-frequency tile basis, whereas the equations (10a) and (10b) apply the decomposition coefficients β(b,n) on a frequency bin basis. In this regard, the decomposition coefficient β(k,n) derived for a frequency sub-band k may be applied for each frequency bin b within the frequency sub-band k.
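Both the soft split of equation (10a) and the hard split of equation (10b) can be expressed compactly; the following sketch assumes the decomposition coefficients have already been expanded from sub-bands k to bins b:

```python
import numpy as np

def divide_signal(S, beta_bins, p=0.5, hard=False, beta_th=0.5):
    """Sketch of equations (10a)/(10b): split S(i, b, n) into the focus
    component S_dr and the non-focus component S_sw.

    S: complex STFT of shape (channels, bins, frames).
    beta_bins: beta(b, n), shape (bins, frames), broadcast over channels.
    """
    if hard:                                    # equation (10b)
        mask = (beta_bins > beta_th).astype(float)
        S_dr, S_sw = S * mask, S * (1.0 - mask)
    else:                                       # equation (10a)
        S_dr, S_sw = S * beta_bins ** p, S * (1.0 - beta_bins) ** p
    return S_dr, S_sw
```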
Consequently, the transform-domain stereo signal 103 is divided, in each time-frequency tile S(i,b,n), into the first signal component 105-1 that represents sound components positioned in the focus portion of the spatial audio image represented by the stereo signal 101 and into the second signal component 105-2 that represents sound components positioned outside the focus portion of the spatial audio image represented by the stereo signal 101. The first signal component 105-1 is subsequently provided for playback without applying stereo widening thereto, whereas the second signal component 105-2 is subsequently provided for playback after being subjected to stereo widening.
Referring back to Figures 1A and 1B, the audio processing system 100, 100' may comprise the re-panner 106 that is arranged to generate a modified first signal component 107 on basis of the first signal component 105-1, wherein one or more sound sources represented by the first signal component 105-1 are repositioned in the spatial audio image.
Figure 4 illustrates a block diagram of some components and/or entities of the re-panner 106 according to an example. In the following, entities of the re-panner 106 according to the example of Figure 4 are described in more detail. In other examples, the re-panner 106 may include further entities and/or some entities depicted in Figure 4 may be omitted or combined with other entities.

The re-panner 106 may comprise an energy estimator 128 for estimating energy of the first signal component 105-1. The energy values 129 are provided for a direction estimator 130 and for a re-panning gain determiner 136 for further processing therein. The energy value computation may involve deriving a respective energy value E_dr(i,k,n) for a plurality of frequency sub-bands k in a plurality of audio channels i (a plurality of virtual loudspeakers) in a plurality of time frames n based on the time-frequency tiles S_dr(i,b,n). As an example, the energy values E_dr(i,k,n) may be computed e.g. according to the equation (11):

$$E_{dr}(i,k,n) = \sum_{b=b_{k,low}}^{b_{k,high}} \lvert S_{dr}(i,b,n)\rvert^{2} \qquad (11)$$

In another example, the energy values 119 computed in the energy estimator 118 (e.g. according to the equation (4)) may be re-used in the re-panner 106, thereby dispensing with a dedicated energy estimator 128 in the re-panner 106. Even though the energy estimator 118 of the signal decomposer 104 estimates the energy values 119 based on the transform-domain stereo signal 103 instead of the first signal component 105-1, the energy values 119 enable correct operation of the direction estimator 130 and the re-panning gain determiner 136.
Still referring to Figure 4, the re-panner 106 may comprise the direction estimator 130 for estimating perceivable arrival direction of the sound represented by the first signal component 105-1 based on the energy values 129 in view of the target virtual loudspeaker configuration applied in the stereo signal 101. The direction estimation may comprise computation of direction angles 131 based on the energy values 129 in view of the target virtual loudspeaker positions, which direction angles 131 are provided for a direction adjuster 132 for further processing therein.
The direction estimation may involve deriving a respective direction angle 131, θ_dr(k,n), for a plurality of frequency sub-bands k in a plurality of time frames n based on the estimated energies E_dr(i,k,n) and the positions α_in(i) of the target virtual loudspeakers. The direction angles 131, θ_dr(k,n), indicate the estimated perceived arrival direction (direction angle 131) of the sound in frequency sub-bands of the first signal component 105-1. The direction estimation may be carried out, for example, according to the equations (12) and (13):

$$\theta_{dr}(k,n) = \arctan\left(\tan(\alpha_{in})\,\frac{g_{1,dr} - g_{2,dr}}{g_{1,dr} + g_{2,dr}}\right) \qquad (12)$$

where

$$g_{1,dr} = \sqrt{E_{dr}(1,k,n)} \text{ for a first virtual loudspeaker, and } g_{2,dr} = \sqrt{E_{dr}(2,k,n)} \text{ for a second virtual loudspeaker.} \qquad (13)$$

In another example, the direction angles 121 computed in the direction estimator 120 (e.g. according to the equations (5) and (6)) may be re-used in the re-panner 106, thereby dispensing with a dedicated direction estimator 130 in the re-panner 106. Even though the direction estimator 120 of the signal decomposer 104 estimates the direction angles 121 based on the energy values 119 derived from the transform-domain stereo signal 103 instead of the first signal component 105-1, the sound source positions are the same or substantially the same and hence the direction angles 121 enable correct operation of the direction adjuster 132.

Still referring to Figure 4, the re-panner 106 may comprise the direction adjuster 132 for modifying the estimated perceivable arrival direction (direction angle 131) of the sound represented by the first signal component 105-1. The direction adjuster 132 may derive modified direction angles 133 based on the direction angles 131. The modified direction angles 133 are provided for a panning gain determiner 134 for further processing therein.
The direction adjustment may comprise mapping the currently estimated perceivable arrival direction, direction angles 131, into respective modified direction angles 133 that represent new adjusted perceivable arrival direction of the sound in view of the control information 10.
The mapping between the currently estimated perceivable arrival directions, direction angles 131, and the new adjusted perceivable arrival directions, modified direction angles 133, may be provided by determining a mapping coefficient ρ which may be applied for deriving a respective modified direction angle θ'(k,n) for a plurality of frequency sub-bands k in a plurality of time frames n, e.g. according to the equation (15):

$$\theta'(k,n) = \rho\,\theta(k,n). \qquad (15)$$

The value of the mapping coefficient ρ for panning can be defined explicitly by the control input 10. If the stereo widening 112 'widens' the signal 105-2 by a certain amount, then the re-panner 106 widens the signal 105-1 via re-panning by the same amount. As a practical example, the stereo widening 112 may widen the signal so that a sound source originally at the location of 5 degrees is perceived after the widening at the location corresponding to 10 degrees in the original signals. Hence, the control information 10 may have information saying that re-panning by the factor 2 (ρ = 2) is needed, so that the positions of the re-panned audio 107 match with the positions of the stereo widened audio 113.
The determination of the mapping coefficient ρ and derivation of the modified direction angles θ'(k,n) according to the equations (14) and (15) serves as a non-limiting example and a different procedure for deriving the modified direction angles 133 may be applied instead.
Still referring to Figure 4, the re-panner 106 may comprise the panning gain determiner 134 for computing a set of panning gains 135 on basis of the modified direction angles 133. The panning gain determination may comprise, for example, using the vector base amplitude panning (VBAP) technique known in the art to compute a respective panning gain g'(i,k,n) for a plurality of frequency sub-bands k in a plurality of audio channels i in a plurality of time frames n based on the modified direction angles θ'(k,n).
For example, the panning gains g'(i,k,n) may be derived based on the tangent law:

$$A = \frac{\tan\theta'(k,n)}{\tan\alpha_{in}}, \quad B = \frac{1 + A}{1 - A},$$

$$g'(1,k,n) = \sqrt{\frac{B^{2}}{1 + B^{2}}}, \quad g'(2,k,n) = \sqrt{1 - g'(1,k,n)^{2}}.$$
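A hedged sketch of the scaling in equation (15) followed by the tangent-law gains above (the clipping of A is an added practical guard, not part of the equations):

```python
import numpy as np

def panning_gains(theta, rho=2.0, alpha_in=np.deg2rad(30.0)):
    """Sketch of equation (15) plus the tangent-law panning gains:
    theta' = rho * theta, then energy-normalised gains g'(1), g'(2)."""
    A = np.tan(rho * theta) / np.tan(alpha_in)
    A = np.clip(A, -0.999, 0.999)            # keep B finite at full panning
    B = (1.0 + A) / (1.0 - A)
    g1 = np.sqrt(B ** 2 / (1.0 + B ** 2))
    g2 = np.sqrt(np.maximum(1.0 - g1 ** 2, 0.0))
    return g1, g2
```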
Still referring to Figure 4, the re-panner 106 may comprise the re-panning gain determiner 136 for deriving re-panning gains 137 based on the panning gains 135 and the energy values 129. The re-panning gains 137 are provided for a re-panning processor 138 for derivation of a modified first signal component 107 therein.
The re-panning gain determination procedure may comprise computing a respective total energy E_s(k,n) for a plurality of frequency sub-bands k in a plurality of time frames n, e.g. according to the equation (18):

$$E_s(k,n) = \sum_{i} E_{dr}(i,k,n) \qquad (18)$$

The re-panning gain determination may further comprise computing a respective target energy E_t(i,k,n) for a plurality of frequency sub-bands k in a plurality of audio channels i in a plurality of time frames n based on the total energies E_s(k,n) and the panning gains g'(i,k,n), e.g. according to the equation (19):

$$E_t(i,k,n) = g'(i,k,n)^{2}\,E_s(k,n) \qquad (19)$$

The target energies E_t(i,k,n) may be applied with the energy values E_dr(i,k,n) to derive a respective re-panning gain g_r(i,k,n) for a plurality of frequency sub-bands k in a plurality of audio channels i in a plurality of time frames n, e.g. according to the equation (20):

$$g_r(i,k,n) = \sqrt{E_t(i,k,n)\,/\,E_{dr}(i,k,n)} \qquad (20)$$

In an example, the re-panning gains g_r(i,k,n) obtained from the equation (20) may be applied as such as the re-panning gains 137 that are provided for the re-panning processor 138 for derivation of the modified first signal component 107 therein. In another example, energy-based temporal smoothing is applied to the re-panning gains g_r(i,k,n) obtained from the equation (20) in order to derive smoothed re-panning gains g'_r(i,k,n), which may be provided for the re-panning processor 138 to be applied for re-panning therein. Smoothing of the re-panning gains g_r(i,k,n) results in slower variations over time within the sub-portion of the spatial audio image assigned to the first signal component 105-1, which may enable improved perceivable quality in the resulting widened stereo signal 115 via avoidance of small-scale fluctuations in the respective portion of the widened spatial audio image therein.
Still referring to Figure 4, the re-panner 106 may comprise the re-panning processor 138 for deriving the modified first signal component 107 on basis of the first signal component 105-1 in dependence of the re-panning gains 137. In the resulting modified first signal component 107 the sound sources in the focus portion of the spatial audio image are repositioned (i.e. re-panned) in accordance with the modified direction angles 133 derived in the direction adjuster 132, to account for (possible) differences between direct reproduction of stereo signals over headphones and reproduction of stereo widening 112 processed stereo signals over headphones. The channels of the modified first signal component 107 are provided to an inverse transform entity 108-1 for conversion from the transform domain to the time domain therein.
The procedure for deriving the modified first signal component 107 may comprise deriving a respective time-frequency tile S_dr,rp(i,b,n) for a plurality of frequency bins b in a plurality of audio channels i in a plurality of time frames n based on the corresponding time-frequency tiles S_dr(i,b,n) of the first signal component 105-1 in dependence of the re-panning gains g_r(i,b,n), e.g. according to the equation (21):

$$S_{dr,rp}(i,b,n) = g_r(i,b,n)\,S_{dr}(i,b,n). \qquad (21)$$

The re-panning gains g_r(i,k,n) according to the equation (20) are derived on a time-frequency tile basis, whereas the equation (21) applies the re-panning gains g_r(i,b,n) on a frequency bin basis. In this regard, the re-panning gain g_r(i,k,n) derived for a frequency sub-band k may be applied for each frequency bin b within the frequency sub-band k.
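Equations (18) to (21) together implement an energy-matching correction; a sketch, with array shapes assumed as in the previous examples and a hypothetical helper expand_to_bins for the band-to-bin expansion described above:

```python
import numpy as np

def repan(S_dr, E_dr, g_pan, expand_to_bins):
    """Sketch of equations (18)-(21): derive re-panning gains that move each
    channel's energy towards the panning target, then apply them per bin.

    S_dr: focus component, shape (2, bins, frames).
    E_dr, g_pan: per-band energies and panning gains, shape (2, K, frames).
    expand_to_bins: assumed helper mapping (2, K, N) band values to (2, bins, N).
    """
    E_s = E_dr.sum(axis=0, keepdims=True)            # equation (18)
    E_t = g_pan ** 2 * E_s                           # equation (19)
    g_r = np.sqrt(E_t / np.maximum(E_dr, 1e-12))     # equation (20)
    return expand_to_bins(g_r) * S_dr                # equation (21)
```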
In other examples, the re-panning can apply to each time-frequency tile S(i,b,n) different combinations of a controlled gain g_r(i,b,n), a controlled reverberation or decorrelation and, optionally, controlled delays to produce the channels of the modified first signal component 107. The reverberation or decorrelation is typically added only at a low level.
In some embodiments, the modified first signal component 107 may be divided to two paths (e.g., using a variable received in the control information 10). The signal in the second path is processed using reverberation or decorrelation. The signal in the first path is passed forward without processing and without any cross-channel mixing. The signals in the two paths are combined, e.g., by summing them.
Referring back to Figure 1A, the audio processing system may comprise the inverse transform entity 108-1 that is arranged to transform the channels of the modified first signal component 107 from the transform domain (back) to the time domain, thereby providing a time-domain modified first signal component 109-1. Along similar lines, the audio processing system 100 may comprise an inverse transform entity 108-2 that is arranged to transform channels of the second signal component 105-2 from the transform domain (back) to the time domain, thereby providing a time-domain second signal component 109-2. Both the inverse transform entity 108-1 and the inverse transform entity 108-2 make use of an applicable inverse transform that inverts the time-to-transform-domain conversion carried out in the transform entity 102. As non-limiting examples in this regard, the inverse transform entities 108-1, 108-2 may apply an inverse STFT or a (synthesis) QMF bank to provide the inverse transform. The resulting time-domain modified first signal component 109-1 may be denoted as s_dr(i,m) and the resulting time-domain second signal component 109-2 may be denoted as s_sw(i,m), where i denotes the channel and m denotes a time index (i.e. a sample index).
Referring back to Figure 1B, as described in the foregoing, in the audio processing system 100' the inverse transform entities 108-1, 108-2 are omitted, and the modified first signal component 107 is provided as a transform-domain signal to the (optional) delay element 110' and the transform-domain second signal component 105-2 is provided as a transform-domain signal to the stereo widening processor 112'.
Referring back to Figure 1A, the audio processing system 100 may comprise the stereo widening processor 112 that is arranged to generate, on basis of the second signal component 109-2, the modified second signal component 113 where the width of a spatial audio image is extended from that represented by the second signal component 109-2. The stereo widening processor 112 may apply any stereo widening technique known in the art to extend the width of the spatial audio image. In an example, the stereo widening processor 112 processes the second signal component s_sw(i,m) into the modified second signal component s'_sw(i,m), where the second signal component s_sw(i,m) and the modified second signal component s'_sw(i,m) are respective time-domain signals.
Stereo widening techniques can involve adding a processed (e.g. filtered) version of a contralateral channel signal to each of the left and right channel signals of the stereo signal in order to derive an output stereo signal having a widened spatial audio image (a widened stereo signal). In other words, a processed version of the right channel signal of the stereo signal is added to the left channel signal of the stereo signal to create the left channel of a widened stereo signal and a processed version of the left channel signal of the stereo signal is added to the right channel signal of the stereo signal to create the right channel of the widened stereo signal. The procedure of deriving the widened stereo signal may further involve pre-filtering (or otherwise processing) each of the left and right channel signals of the stereo signal prior to adding the respective processed contralateral signals thereto in order to preserve desired frequency response in the widened stereo signal.
Along the lines described above, stereo widening readily generalizes into widening the spatial audio image of a multi-channel input audio signal, thereby deriving an output multi-channel audio signal having a widened spatial audio image (a widened multi-channel signal). In this regard, the processing involves creating the left channel of the widened multi-channel audio signal as a sum of (first) filtered versions of channels of the multi-channel input audio signal and creating the right channel of the widened multi-channel audio signal as a sum of (second) filtered versions of channels of the multi-channel input audio signal. A dedicated predefined filter may be provided for each pair of an input channel (channels of the multi-channel input signal) and an output channel (left and right). As an example in this regard, the left and right channel signals of the widened multi-channel signal, S_out,left and S_out,right respectively, may be defined on basis of channels of a multi-channel audio signal S according to the equation (1):

$$S_{out,left}(b,n) = \sum_{i} S(i,b,n)\,H_{left}(i,b), \quad S_{out,right}(b,n) = \sum_{i} S(i,b,n)\,H_{right}(i,b), \qquad (1)$$

where S(i,b,n) denotes frequency bin b in time frame n of channel i of the multi-channel signal S, H_left(i,b) denotes a filter for filtering frequency bin b of channel i of the multi-channel signal S to create a respective channel component for creation of the left channel signal S_out,left(b,n), and H_right(i,b) denotes a filter for filtering frequency bin b of channel i of the multi-channel signal S to create a respective channel component for creation of the right channel signal S_out,right(b,n). H_left(i,b) and H_right(i,b) are a directional filter pair.
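The summation in equation (1) is a per-bin filter-and-mix; a sketch assuming per-bin complex filter gains (equivalent time-domain FIR filtering would be an alternative formulation):

```python
import numpy as np

def widen_multichannel(S, H_left, H_right):
    """Sketch of equation (1): mix filtered input channels into two outputs.

    S: complex STFT of shape (channels, bins, frames).
    H_left, H_right: complex filter gains of shape (channels, bins).
    """
    out_left = np.einsum('ibn,ib->bn', S, H_left)    # sum over channels i
    out_right = np.einsum('ibn,ib->bn', S, H_right)
    return out_left, out_right
```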
In stereo widening for headphones, the filters H_left(i,b) and H_right(i,b) can include HRTFs, or HRTFs (or BRIRs) can be used later in the processing chain. In stereo widening for headphones, the filter H_left(i,b) could be HRTFs to 90 degrees (i.e. to the left). The filter H_right(i,b) could be HRTFs to −90 degrees (i.e. to the right).
In stereo widening for headphones, the filter H_left(i,b) can comprise a direct (dry) part and an ambient part comprising one or more indirect (wet) paths:

$$H_{left}(i,b) = r^{\frac{1}{2}}\,H_{left,direct}(i,b) + (1 - r)^{\frac{1}{2}}\,H_{left,ambient}(i,b)$$

where r is the ratio between the direct and ambient parts.
The direct to ambient ratio r can be defined by the control input 10.
The direct part filter H_left,direct(i,b) can be HRTFs to 90 degrees (i.e. to the left).
The indirect part filter H_left,ambient(i,b) can represent, for each time-frequency tile S(i,b,n), different indirect paths that each have a controlled gain, a controlled reverberation or decorrelation and, optionally, a controlled delay. Each different indirect path is processed using a respective HRTF. The directions of the HRTFs are typically selected so that they cover several directions around the listener, creating a perception of envelopment and/or spaciousness. The filters of the different indirect paths are typically combined into the single filter H_left,ambient(i,b) before they are applied.
Likewise, the filter H_right(i,b) can comprise a direct (dry) part and an ambient part comprising one or more indirect (wet) paths:
$$H_{right}(i,b) = r^{\frac{1}{2}}\,H_{right,direct}(i,b) + (1 - r)^{\frac{1}{2}}\,H_{right,ambient}(i,b)$$

where r is the ratio between the direct and ambient parts.
The direct part filter H_right,direct(i,b) can be HRTFs to −90 degrees (i.e. to the right).
The indirect part filter H_right,ambient(i,b) can represent, for each time-frequency tile S(i,b,n), different indirect paths that each have a controlled gain, a controlled reverberation or decorrelation and, optionally, a controlled delay. Each different indirect path is processed using a respective HRTF. The directions of the HRTFs are typically selected so that they cover several directions around the listener, creating a perception of envelopment and/or spaciousness. The filters of the different indirect paths are typically combined into the single filter H_right,ambient(i,b) before they are applied.
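The direct/ambient construction is the same weighted sum for both ears; a brief sketch (the filter arrays are assumed to be pre-combined per bin as described above):

```python
import numpy as np

def direct_ambient_filter(H_direct, H_ambient, r=0.7):
    """Sketch of the dry/wet filter mix: H = sqrt(r) * H_direct
    + sqrt(1 - r) * H_ambient, with direct-to-ambient ratio r in [0, 1]."""
    return np.sqrt(r) * H_direct + np.sqrt(1.0 - r) * H_ambient
```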
The target virtual loudspeaker position indication 12 may optionally be provided to the stereo widening block 112. The indicated virtual loudspeaker positions may then be used to select corresponding HRTFs for the H_left and H_right filters, e.g. for a stereo signal HRTFs to ±30 degrees may be selected by default. However, in order to produce a maximally strong widening effect for a stereo signal, HRTFs to ±90 degrees may be selected instead. To generalize, the stereo widening block 112 may map the indicated virtual loudspeaker positions to modified positions (for a stronger widening effect) which are then used to derive the filters H_left and H_right.
Figure 5 illustrates a block diagram of some components and/or entities of the stereo widening processor 112 according to a non-limiting example.
The stereo widening processor 112 is configured to provide cross-channel mixing means for applying a headphone filter H_LL, H_RL, H_LR, H_RR to each one of the plurality of input channels before mixing those channels to produce a modified second signal component 113 comprising two output channels (LEFT, RIGHT), wherein the headphone filter H_mn applied to an input channel that is mixed to provide an output channel is dependent upon an identity of the output channel m and an identity of the input channel n.
The headphone filter H_mn can comprise a head related transfer function dependent upon an identity of the output channel m and an identity of the input channel n.
The headphone filter H_mn for an input channel n can be configured to mix a direct-rendering version of the input channel with an ambient-rendering version of the input channel. The relative gain of the direct version of the input channel compared to the ambient version of the input channel in a mix in the headphone filter can be controlled via a user-controllable parameter r. The headphone filter for an input channel can be configured to mix a single-path direct version of the input channel with a multiple-path ambient version of the input channel, where a head related transfer function is used to form the single-path direct version of the input channel and an indirect path filter is used in combination with a head related transfer function for each path of the multiple paths, to form the multiple-path ambient version of the input channel. The indirect path filter can comprise decorrelation means or reverberation means.
The cross-channel mixing causes stereo-widening for headphone apparatus such that a width of a spatial audio image associated with the modified second signal component is greater than a width of a spatial audio image associated with the second signal component before cross-channel mixing of the second signal component.
In this example, four filters H_LL, H_RL, H_LR and H_RR are applied to create the widened spatial audio image: the left channel of the modified second signal component 113 is created as a sum of the left channel of the second signal component 109-2 filtered by the filter H_LL and the right channel of the second signal component 109-2 filtered by the filter H_LR, whereas the right channel of the modified second signal component 113 is created as a sum of the left channel of the second signal component 109-2 filtered by the filter H_RL and the right channel of the second signal component 109-2 filtered by the filter H_RR. In the example of Figure 5, the stereo widening procedure is carried out on basis of the time-domain second signal component 109-2. In other examples, the stereo widening procedure (e.g. one that makes use of the filtering structure of Figure 5) may be carried out in the transform domain. In this alternative example, the order of the inverse transform entity 108-2 and the stereo widening processor 112 is changed.
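The four-filter structure of Figure 5 is, per frequency bin, a 2x2 mixing matrix; a hedged sketch with the filters expressed as per-bin gains (array shapes are assumptions for the example):

```python
import numpy as np

def widen_stereo(S_left, S_right, H_LL, H_RL, H_LR, H_RR):
    """Sketch of the Figure 5 structure: each output channel sums a
    same-side and a cross-side filtered input.

    S_left, S_right: complex spectra of shape (bins, frames).
    H_xy: complex filter gains of shape (bins,).
    """
    out_left = H_LL[:, None] * S_left + H_LR[:, None] * S_right
    out_right = H_RL[:, None] * S_left + H_RR[:, None] * S_right
    return out_left, out_right
```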
In an example, the stereo widening processor 112 may be provided with a dedicated set of filters H_LL, H_RL, H_LR and H_RR that is designed to produce a desired extent of stereo widening for a target virtual loudspeaker configuration. In another example, the stereo widening processor 112 may be provided with a plurality of sets of filters H_LL, H_RL, H_LR and H_RR, each set designed to produce a desired extent of stereo widening for a respective target virtual loudspeaker configuration. In the latter example, the set of filters is selected in dependence of the indicated target virtual loudspeaker configuration. In a scenario with a plurality of sets of filters, the stereo widening processor 112 may dynamically switch between sets of filters e.g. in response to a change in the indicated virtual loudspeaker positions. There are various ways for designing a set of filters H_LL, H_RL, H_LR and H_RR.
In stereo widening for headphones, the filter H_LL can be the filter H_left(left,b) described above, the filter H_LR can be the filter H_left(right,b) described above, the filter H_RR can be the filter H_right(right,b) described above, and the filter H_RL can be the filter H_right(left,b) described above.
The stereo widening performed by the stereo widening processor 112 can be performed in the time domain (FIG 1A) or the transform domain (FIG 1B).
Referring back to Figure 1A, the audio processing system 100 may comprise the delay element 110 that is arranged to delay the modified first signal component 109-1 by a predefined time delay, thereby creating a delayed first signal component 111. The time delay is selected such that it matches or substantially matches the delay resulting from the stereo widening processing applied in the stereo widening processor 112, thereby keeping the delayed first signal component 111 temporally aligned with the modified second signal component 113. In an example, the delay element 110 processes the modified first signal component s_dr(i,m) into the delayed first signal component s'_dr(i,m). In the example of Figure 1A, the time delay is applied in the time domain. In an alternative example, the order of the inverse transform entity 108-1 and the delay element 110 may be changed, thereby resulting in application of the predefined time delay in the transform domain.
Referring back to Figure 1B, as described in the foregoing, in the audio processing system 100' the delay element 110' is optional and, if included, it is arranged to operate in the transform domain, in other words to apply the predefined time delay to the modified first signal component 107 to create the delayed modified first signal component 111' in the transform domain for provision to the signal combiner 114' as a transform-domain signal. It will be appreciated from the foregoing that if one wants to create a perception of a sound source outside the headphones, stereo widening 112 is needed (using, e.g., HRTFs). However, in between the headphones, the sound can be positioned without stereo widening, e.g., re-panning can be used to position sound sources in between the headphones (sounds cannot be positioned outside the headphones with this method). However, the focus portion contains sounds only near the center, so positioning them in between the headphones is sufficient. The peripheral portion 113 may contain sound sources perceived also outside the headphone positions. The focus portion 111 does not contain sound sources perceived outside the headphone positions, but they may still be wider than they originally were.
Referring back to Figure 1A, the audio processing system 100 may comprise the signal combiner 114 that is arranged to combine the delayed first signal component 111 and the modified second signal component 113 into the widened stereo signal 115, where the width of the spatial audio image is partially extended (in the peripheral but not necessarily the front focus portions) from that of the stereo signal 101. As examples in this regard, the widened stereo signal 115 may be derived as a sum, as an average or as another linear combination of the delayed first signal component 111 and the modified second signal component 113, e.g. according to the equation (22):

$$s_{out}(i,m) = s'_{sw}(i,m) + s'_{dr}(i,m), \qquad (22)$$

where s_out(i,m) denotes the widened stereo signal 115.
Referring back to Figure 1B, as described in the foregoing, in the audio processing system 100' the signal combiner 114' is arranged to operate in the transform domain, in other words to combine the (transform-domain) delayed modified first signal component 111' with the (transform-domain) modified second signal component 113' into the (transform-domain) widened stereo signal 115' for provision to the inverse transform entity 108'. The inverse transform entity 108' is arranged to convert the (transform-domain) widened stereo signal 115' from the transform domain into the (time-domain) widened stereo signal 115. The inverse transform entity 108' may carry out the conversion in a similar manner as described in the foregoing in context of the inverse transform entities 108-1, 108-2.
Each of the exemplifying audio processing systems 100, 100' described in the foregoing via a number of examples may be further varied in a number of ways. In the following, non-limiting examples in this regard are described.
In the foregoing, the description of elements of the audio processing systems 100, 100' refers to processing of relevant audio signals in a plurality of frequency sub-bands k. In an example, the processing of the audio signal in each element of the audio processing systems 100, 100' is carried out across (all) frequency sub-bands k. In other examples, in at least some elements of the audio processing systems 100, 100' the processing of the audio signal is carried out in a limited number of frequency sub-bands k. As examples in this regard, the processing in a certain element of the audio processing system 100, 100' may be carried out for a predefined number of lowest frequency sub-bands k, for a predefined number of highest frequency sub-bands k, or for a predefined subset of frequency sub-bands k in the middle of the frequency range such that a first predefined number of lowest frequency sub-bands k and a second predefined number of highest frequency sub-bands k is excluded from the processing. The frequency sub-bands k excluded from the processing (e.g. ones at the lower end of the frequency range and/or ones at the higher end of the frequency range) may be passed unmodified from an input to an output of the respective element. As a non-limiting example concerning elements of the audio processing systems 100, 100' where the processing may be carried out only for a limited subset of frequency sub-bands k, one or both of the re-panner 106 and the stereo widening processor 112, 112' may process the respective input signal only in a respective desired sub-range of frequencies, e.g. in a predefined number of lowest frequency sub-bands k or in a predefined subset of frequency sub-bands k in the middle of the frequency range.
In another example, as already described in the foregoing, the input audio signal 101 may comprise a multi-channel signal different from a two-channel stereophonic audio signal, e.g. a surround signal. For example, in case the input audio signal 101 comprises a 5.1-channel surround signal, the audio processing technique(s) described in the foregoing with references to the left and right channels of the stereo signal 101 may be applied to the front left and front right channels of the 5.1-channel surround signal to derive the left and right channels of the output audio signal 115. The other channels of the 5.1-channel surround signal may be processed e.g. such that the center channel of the 5.1-channel surround signal scaled by a predefined gain factor (e.g. by one having value √0.5) is added to the left and right channels of the output audio signal 115 obtained from the audio processing system 100, 100', whereas the rear left and right channels of the 5.1-channel surround signal may be processed using a conventional stereo widening technique that makes use of widening filter(s) (utilizing, e.g., HRTFs or BRIRs) that correspond(s) to respective target positions of the left and right rear loudspeakers (e.g. ±110 degrees with respect to the front direction). The LFE channel of the 5.1-channel surround signal may be added to the center signal of the 5.1-channel surround signal prior to adding the scaled version thereof to the left and right channels of the output audio signal 115.
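As a loose sketch of the 5.1 routing described above (the channel naming, the helper functions widen_front and widen_rear, and the √0.5 centre gain are assumptions made for illustration):

```python
import numpy as np

def process_5_1(ch, widen_front, widen_rear):
    """Hypothetical 5.1 routing: front L/R through the system of Figure 1A,
    rear L/R through a conventional widener, centre + LFE added to both
    outputs with an assumed gain of sqrt(0.5).

    ch: dict of time-domain channel arrays ('FL', 'FR', 'C', 'LFE', 'RL', 'RR').
    """
    out_l, out_r = widen_front(ch['FL'], ch['FR'])
    rear_l, rear_r = widen_rear(ch['RL'], ch['RR'])
    center = ch['C'] + ch['LFE']          # fold LFE into the centre first
    g = np.sqrt(0.5)
    return out_l + rear_l + g * center, out_r + rear_r + g * center
```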
In another example, as already described in the foregoing, the input audio signal 101 may comprise N spatially distributed channels that are processed to produce a two-channel audio signal 115 processed specifically for playback via a headphone apparatus. The mixing of M channels to produce a first signal component 111, 111' of the two-channel stereophonic audio signal 115 can occur at the re-panner 106. The mixing of M' channels to produce a second signal component 113, 113' of the two-channel stereophonic audio signal 115 can occur at the stereo widening processor for headphone apparatus 112.
Audio events (sound objects) may move within the sound image. When an audio event (sound object) is positioned within the focus range the audio event is rendered via the first signal component 111, 111' of the two-channel stereophonic audio signal 115. When an audio event is positioned within the non-focus, peripheral range the audio event is rendered via the second signal component 113, 113' of the two-channel stereophonic audio signal 115.
In another example, additionally or alternatively, the audio processing system 100, 100' may enable adjusting the balance between the contribution from the first signal component 105-1 and the second signal component 105-2 in the resulting widened stereo signal 115. This may be provided, for example, by applying respective different scaling gains to the first signal component 105-1 (or a derivative thereof) and to the second signal component 105-2 (or a derivative thereof). In this regard, respective scaling gains may be applied e.g. in the signal combiner 114, 114' to scale the signal components derived from the first and second signal components 105-1, 105-2 accordingly, or in the signal divider 126 to scale the first and second signal components 105-1, 105-2 accordingly. A single respective scaling gain may be defined for scaling the first and second signal components 105-1, 105-2 (or a respective derivative thereof) across all frequency sub-bands or in a predefined sub-set of frequency sub-bands. Alternatively or additionally, different scaling gains may be applied across the frequency sub-bands, thereby enabling adjustment of the balance between the contribution from the first and second signal components 105-1, 105-2 only on some of the frequency sub-bands and/or adjusting the balance differently at different frequency sub-bands.
In a further example, alternatively or additionally, the audio processing system 100, 100' may enable scaling of one or both of the first signal component 105-1 and the second signal component 105-2 (or respective derivatives thereof) independently of each other, thereby enabling equalization (across frequency sub-bands) for one or both of the first and second signal components. This may be provided, for example, by applying respective equalization gains to the first signal component 105-1 (or a derivative thereof) and to the second signal component 105-2 (or a derivative thereof). A dedicated equalization gain may be defined for one or more frequency sub-bands for the first signal component 105-1 and/or for the second signal component 105-2. In this regard, for each of the first and second signal components 105-1, 105-2, a respective equalization gain may be applied e.g. in the signal divider 126 or in the signal combiner 114, 114' to scale a respective frequency sub-band of the respective one of the first and second signal components 105-1, 105-2 (or a respective derivative thereof). For a certain frequency sub-band, the equalization gain may be the same for both the first and second signal components 105-1, 105-2 or different equalization gains may be applied for the first and second signal components 105-1, 105-2.
Operation of the audio processing system 100, 100' described in the foregoing via multiple examples enables adaptively decomposing the stereo signal 101 into the first signal component 105-1 that represents the focus portion of the spatial audio image and that is provided for playback without application of stereo widening thereto and into the second signal component 105-2 that represents the peripheral (non-focus) portion of the spatial audio image that is subjected to the stereo widening processing. In particular, since the decomposition is carried out on basis of audio content conveyed by the stereo signal 101 on a frame-by-frame basis, the audio processing system 100, 100' enables both adaptation for relatively static spatial audio images of different characteristics and adaptation to changes in the spatial audio image over time.
The disclosed stereo widening technique that relies on excluding coherent sound sources within the focus portion of the spatial audio image from the stereo widening processing and applies the stereo widening processing predominantly to coherent sounds that are outside the focus portion and to non-coherent sounds (such as ambience) enables improved timbre and reduced 'coloration' of sounds that are within the focus portion while still providing a large extent of perceivable stereo widening.
In the previous examples, the control input 10 can have one or more different functions. The parameters of the decomposition process can be defined by the control input 10. The control input 10 can, for example, define the focus range used in the analysis for dividing the signals into focus (i.e. front center) and non-focus (i.e. side) signals. The focus range can, for example, be defined via θ_Th1 and θ_Th2 or β_Th. The signal decomposition parameter β_Th can, for example, be defined by the control input 10. The control input 10 can, for example, control relative gains between the peripheral signals 113, 113' that are widened and the frontal signals 111, 111' that are not. For example, it can in some examples control a relative gain ratio of peripheral to frontal.
The parameters of the widening process can, for example, be defined by the control input 10. The control input 10 can, for example, control the direct to ambient ratio r used in widening. The parameters may include, for example, the directions to which the non-focus sounds are processed (for example with the help of HRTF processing), and/or the amount of ambience (for example reverb) added to sound for increasing the 'widening' effect or the perceived externalization. Processing the non-focus sounds to different virtual directions is not necessary; one embodiment of the invention can be such that the non-focus sounds are processed only using a reverb, a decorrelator or other methods which increase the externalization of the non-focus sounds.
The control input 10 can for example control explicitly or implicitly whether or not panning occurs. For example, panning may not occur if the focus range is narrow. For example, panning may not occur if the relative gain ratio of peripheral to frontal is small.
The value of the mapping coefficient ρ that controls the panning extent can, for example, be defined explicitly by the control input 10 or can be controlled via the definition of the focus range. The overpan factor ρ can be used for modifying the front center sector (i.e. focus sounds) within which the focus signal is perceived (for example, it can be made to sound wider than in the original signal). The control input 10 can also be another parameter or a set of parameters which modify where the focus sounds are heard in the left-right panning dimension.
The weighting factors for energy-based temporal smoothing (a and b) can, for example, be defined by the control input 10.
All, part or none of the control input can, for example, be controlled by user input.
The control input 10 can for example comprise parameters for controlling the focus sounds (e.g. for adding ambience to produce better externalization to front sounds).
The control input 10 can, for example, comprise parameters that define multiple analysis sectors (for the decomposition part) and multiple virtual speaker directions (used in the stereo widening block). Non-focus sounds may be divided into more sectors than just left and right (outside of the focus range). There may be several angular regions outside of the focus range, which may be processed separately, e.g. to different directions or with different amounts of ambience.
Components of the audio processing system 100, 100' may be arranged to operate, for example, in accordance with a method 200 illustrated by a flowchart depicted in Figure 6. The method 200 serves as a method for processing an input audio signal comprising a multichannel audio signal that represents a spatial audio image.
The method 200 comprises, at block 202: deriving, based on the input audio signal 101, a first signal component 105-1 comprising at least one input channel and a second signal component 105-2 comprising multiple input channels, wherein the first signal component 105-1 is dependent upon at least a first (focus) portion of a spatial audio image conveyed by the input audio signal 101, and the second signal component 105-2 is dependent upon at least a second (non-focus) portion of the spatial audio image that is different to the first (focus) portion.
The method 200 further comprises, at block 204, cross-channel mixing of at least some of the multiple input channels of the second signal component 105-2 to produce a modified second signal component 113 while enabling the first signal component to bypass cross-channel mixing.

The method 200 further comprises, at block 206, combining the first signal component 105-1 and the modified second signal component 113 into an output audio signal 115 comprising two output channels configured for rendering by headphone apparatus.

The method 200 may be varied in a number of ways, for example in view of the examples concerning operation of the audio processing system 100 and/or the audio processing system 100' described in the foregoing. The cross-channel mixing enables a width of the spatial audio image to be extended from that of the second signal component 105-2.
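Purely to tie blocks 202 to 206 together, the following sketch chains the hypothetical helpers from the earlier sketches (decompose, widen_channel, repan). It is schematic glue under the same assumptions as those sketches, not the system of Figure 6, and it treats signals as plain time-domain arrays for brevity.

```python
import numpy as np

def process(left, right, hrir_l, hrir_r, p=1.5, r=0.7):
    # Block 202: derive focus (first) and non-focus (second) components.
    (focus_l, focus_r), (side_l, side_r) = decompose(left, right)
    # Block 204: cross-channel mix (widen) only the non-focus component;
    # swapping the HRIRs mirrors the virtual direction for the right side,
    # assuming left/right-symmetric HRIRs.
    wl1, wr1 = widen_channel(side_l, hrir_l, hrir_r, r)
    wl2, wr2 = widen_channel(side_r, hrir_r, hrir_l, r)
    n = min(len(wl1), len(wl2))
    out_l = wl1[:n] + wl2[:n]
    out_r = wr1[:n] + wr2[:n]
    # Block 206: the focus component bypasses widening (it is only re-panned)
    # and is combined with the widened non-focus component.
    fl, fr = repan(focus_l, focus_r, p)
    out_l[: len(fl)] += fl
    out_r[: len(fr)] += fr
    return out_l, out_r
```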
Figure 7 illustrates a block diagram of some components of an exemplifying apparatus 300. The apparatus 300 may comprise further components, elements or portions that are not depicted in Figure 7. The apparatus 300 may be employed e.g. in implementing one or more components described in the foregoing in context of the audio processing system 100, 100'. The apparatus 300 may implement, for example, the device 50 or one or more components thereof.
The apparatus 300 comprises a processor 316 and a memory 315 for storing data and computer program code 317. The memory 315 and a portion of the computer program code 317 stored therein may be further arranged to, with the processor 316, implement at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing system 100, 100'.
The apparatus 300 comprises a communication portion 312 for communication with other devices. The communication portion 312 comprises at least one communication apparatus that enables wired or wireless communication with other apparatuses. A communication apparatus of the communication portion 312 may also be referred to as a respective communication means.
The apparatus 300 may further comprise user I/O (input/output) components 318 that may be arranged, possibly together with the processor 316 and a portion of the computer program code 317, to provide a user interface for receiving input from a user of the apparatus 300 and/or providing output to the user of the apparatus 300 to control at least some aspects of operation of the audio processing system 100, 100' implemented by the apparatus 300. The user I/O components 318 may comprise hardware components such as a display, a touchscreen, a touchpad, a mouse, a keyboard, and/or an arrangement of one or more keys or buttons, etc. The user I/O components 318 may be also referred to as peripherals. The processor 316 may be arranged to control operation of the apparatus 300 e.g. in accordance with a portion of the computer program code 317 and possibly further in accordance with the user input received via the user I/O components 318 and/or in accordance with information received via the communication portion 312.
Although the processor 316 is depicted as a single component, it may be implemented as one or more separate processing components. Similarly, although the memory 315 is depicted as a single component, it may be implemented as one or more separate components, some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
The computer program code 317 stored in the memory 315, may comprise computer-executable instructions that control one or more aspects of operation of the apparatus 300 when loaded into the processor 316. As an example, the computer-executable instructions may be provided as one or more sequences of one or more instructions. The processor 316 is able to load and execute the computer program code 317 by reading the one or more sequences of one or more instructions included therein from the memory 315. The one or more sequences of one or more instructions may be configured to, when executed by the processor 316, cause the apparatus 300 to carry out at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing system 100, 100'.
Hence, the apparatus 300 may comprise at least one processor 316 and at least one memory 315 including the computer program code 317 for one or more programs, the at least one memory 315 and the computer program code 317 configured to, with the at least one processor 316, cause the apparatus 300 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing system 100, 100'.
The computer program(s) stored in the memory 315 may be provided e.g. as a respective computer program product comprising at least one computer-readable non-transitory medium having the computer program code 317 stored thereon, which computer program code, when executed by the apparatus 300, causes the apparatus 300 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing system 100, 100'. The computer-readable non-transitory medium may comprise a memory device or a record medium such as a CD-ROM, a DVD, a Blu-ray disc or another article of manufacture that tangibly embodies the computer program. As another example, the computer program may be provided as a signal configured to reliably transfer the computer program.
Reference(s) to a processor should not be understood to encompass only programmable processors, but also dedicated circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processors, etc. Features described in the preceding description may be used in combinations other than the combinations explicitly described.
In at least some of the preceding examples, when the input audio signal 101 comprises a same sound source that is repeated at different positions, the sound source is rendered at the headphone apparatus 20 without interaural time differences and without frequency-dependent interaural level differences when it is positioned at a first position that is relatively front and central to a user of the headphone apparatus 20, and is rendered at the headphone apparatus 20 with interaural time differences and with frequency-dependent interaural level differences when it is repeated at a second position that is relatively peripheral and is not front and central to the user of the headphone apparatus 20.
The stereo-widening (for headphones) processor 112, 112' spatially processes the input audio signal 101 to add, at peripheral positions but not at central positions of the spatial audio image, positionally-dependent interaural time differences measurable between coherent audio events in both of the channels of the output audio signal, and frequency-dependent and positionally-dependent interaural level differences measurable between coherent audio events in both of the channels of the output audio signal.
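One way such a property could be checked (an assumed verification aid, not part of the described system) is to estimate the interaural time difference between the two output channels from the peak of their cross-correlation: a front-central test source should yield a lag near zero samples, while a peripheral one should not.

```python
import numpy as np

def estimate_itd_samples(out_l, out_r):
    """Estimate the interaural time difference (in samples) between channels."""
    n = min(len(out_l), len(out_r))
    corr = np.correlate(out_l[:n], out_r[:n], mode="full")
    # Zero lag sits at index n - 1; a positive lag here means the left
    # channel is delayed relative to the right channel.
    return int(np.argmax(np.abs(corr))) - (n - 1)
```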
In the foregoing examples, there is a bypass initiated by the signal decomposer 104 and provided via a bypass route comprising the re-panner 106, thus enabling the first signal component 105-1 to bypass the stereo-widening (for headphones) processor 112, 112'. In some but not necessarily all examples, the bypass enables components of the input audio signal 101 that represent a sound source that is coherent between two stereo channels and is positioned to front and center, to bypass cross-channel mixing at the stereo-widening (for headphones) processor 112, 112'.
In at least some of the above examples, the first (focus) portion is front and central relative to a user of the headphone apparatus, and the second (non-focus) portion is peripheral relative to the user of the headphone apparatus. In at least some of the above examples, the first focus portion does not overlap the second non-focus portion. In at least some of the above examples, the first focus portion and the second non-focus portion are contiguous.
Although the above description discusses an implementation in which there is a central first (focus) portion and two second (non-focus) portions to left and right, split by the first focus portion, other arrangements of the first portion and the second portion are possible. Reference to a portion may, for example, reference a single portion or multiple portions.
Where the second portion comprises multiple portions, then different spatial audio processing can be applied to each of the second portions. For example, different control inputs may be used for different second portions. The same control inputs may be used for different second portions that are disposed symmetrically either side of a central direction. For example, different cross-channel mixing may be used for different second portions to achieve different widening effects. The same cross-channel mixing may be used for different second portions that are disposed symmetrically either side of a central direction. For example, different direct to ambient ratios r may be used for different second portions to achieve different effects. The same direct to ambient ratio r may be used for different second portions that are disposed symmetrically either side of a central direction.
Where the first portion comprises multiple portions, then different processing, e.g. re-panning, can be applied to each of the first portions.
In the foregoing examples, the first (focus) portion is fixed in the audio image, and the audio image is oriented with respect to the headphone apparatus when the headphone apparatus moves. In other examples, the audio image is oriented with respect to the 'world' and is processed to rotate when the headphone apparatus rotates. In this example, the first (focus) portion can be fixed in the audio image when the headphone apparatus moves, or alternatively can rotate with the headphone apparatus. The headphone apparatus 20 can comprise circuitry for tracking its orientation.
In some examples the apparatus 100, 100' is separate to the headphone apparatus 20, for example as illustrated in Figure 3. In other examples, the apparatus 100, 100' is part of the headphone apparatus 20.

In at least some of the examples described above, audio is divided into two paths: central and side sound. For central sounds, timbre is important, so the processing is designed to preserve it, and HRTF processing is avoided. The central sounds can be widened by, for example, "re-panning", which does not degrade timbre and provides some widening, even though it cannot create sources outside the headphones. For side sounds, a very wide perception is the most important property. Hence, HRTFs are used to obtain that effect (and to provide sound sources outside the headphones). This degrades the timbre, but that is accepted as a trade-off in order to obtain the maximal wideness. While timbre is kept for central sounds, it is desirable to make them wide; side sounds are made very wide.
Although in the foregoing some functions have been described with reference to certain features and/or elements, those functions may be performable by other features and/or elements whether described or not. Although features have been described with reference to certain embodiments, those features may also be present in other embodiments whether described or not.
Claims (18)
- Claims 1. An apparatus for processing an input audio signal comprising multiple channels, the apparatus comprising: means for deriving, based on the input audio signal, a first signal component, comprising at least one input channel, and a second signal component, comprising multiple input channels, wherein the first signal component is dependent upon at least a first portion of a spatial audio image conveyed by the input audio signal, and the second signal component is dependent upon at least a second portion of the spatial audio image that is different to the first portion; cross-channel mixing means for cross-channel mixing of a plurality of input channels; means for directing the second signal component to the cross-channel mixing means for cross-channel mixing of at least some of the multiple input channels of the second signal component to produce a modified second signal component; bypass means for enabling the first signal component to bypass the cross-channel mixing means; and means for combining the first signal component and the modified second signal component into an output audio signal comprising two output channels configured for rendering by headphone apparatus.
- 2. An apparatus as claimed in claim 1, wherein the cross-channel mixing means for cross-channel mixing of a plurality of input channels comprises means for applying head related transfer functions to each one of the plurality of input channels before mixing those channels to produce a modified second signal component comprising two output channels, wherein the head related transfer function applied to an input channel that is mixed to provide an output channel is dependent upon an identity of the input channel and an identity of the output channel.
- 3. An apparatus as claimed in claim 1 or 2, wherein the cross-channel mixing means for cross-channel mixing of a plurality of input channels comprises means for applying a headphone filter to each one of the plurality of input channels before mixing those channels to produce a modified second signal component comprising two output channels, wherein the headphone filter applied to an input channel that is mixed to provide an output channel is dependent upon an identity of the input channel and an identity of the output channel, wherein the headphone filter for an input channel mixes a direct version of the input channel with an ambient version of the input channel.
- 4. An apparatus as claimed in claim 3, wherein the relative gain of the direct version of the input channel compared to the ambient version of the input channel in a mix in the headphone filter is a user-controllable parameter.
- 5. An apparatus as claimed in claim 3 or 4, wherein the headphone filter for an input channel mixes a single-path direct version of the input channel with a multiple-path ambient version of the input channel; wherein a head related transfer function is used to form the single-path direct version of the input channel; wherein an indirect path filter is used in combination with a head related transfer function for each path of the multiple paths, to form the multiple-path ambient version of the input channel.
- 6. An apparatus as claimed in claim 5, wherein the indirect path filter comprises decorrelation means or reverberation means.
- 7. An apparatus as claimed in any preceding claim, wherein the cross-channel mixing causes stereo-widening for headphone apparatus such that a width of a spatial audio image associated with the modified second signal component is greater than a width of a spatial audio image associated with the second signal component before cross-channel mixing of the second signal component.
- 8. An apparatus as claimed in any preceding claim, wherein the first portion is front and central relative to a user of the headphone apparatus, and the second portion is peripheral relative to the user of headphone apparatus and does not overlap the first portion.
- 9. An apparatus as claimed in any preceding claim, wherein the first and second portions are contiguous.
- 10. An apparatus as claimed in any preceding claim, wherein the bypass means enables components of the input audio signal that represent a sound source that is coherent between two stereo channels and is positioned to front and center, to bypass the cross-channel mixing means.
- 11. An apparatus as claimed in any preceding claim, wherein a control input controls one or more of: the first portion and/or the second portion; decomposition of the input signal into the first component and the second component; relative gain of the first component and the second component; widening of the second component; a ratio of direct to ambient gain during widening of the second component; panning of the first component; whether there is or is not panning of the first component; panning extent of the first component; and energy-based temporal smoothing.
- 12. An apparatus as claimed in any preceding claim, wherein when the input audio signal comprises a same sound source that is repeated at different positions, and that is rendered at the headphone apparatus without interaural time difference and without frequency dependent interaural level differences, when the sound source of the input audio signal is positioned at a first position that is relatively front and central to a user of the headphone apparatus, then the sound source is rendered at the headphone apparatus with interaural time differences and with frequency dependent interaural level differences when the sound source of the input audio signal is repeated at a second position that is relatively peripheral and is not front and central to a user of the headphone apparatus.
- 13. A system comprising the apparatus as claimed in any preceding claim and a headphone apparatus configured for receiving and rendering the output audio signal.
- 14. An apparatus as claimed in any of claims 1 to 12, configured as a headphone apparatus for rendering the output audio signal.
- 15. A method for processing an input audio signal comprising multiple input channels, the method comprising: deriving, based on the input audio signal, a first signal component, comprising at least one input channel, and a second signal component, comprising multiple input channels, wherein the first signal component is dependent upon at least a first portion of a spatial audio image conveyed by the input audio signal, and the second signal component is dependent upon at least a second portion of the spatial audio image that is different to the first portion; cross-channel mixing of at least some of the multiple input channels of the second signal component to produce a modified second signal component while enabling the first signal component to bypass cross-channel mixing; and combining the first signal component and the modified second signal component into an output audio signal comprising two output channels configured for rendering by headphone apparatus.
- 16. An apparatus for processing an input audio signal comprising multiple input channels, the apparatus comprising at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to: derive, based on the input audio signal, a first signal component, comprising at least one input channel, and a second signal component, comprising multiple input channels, wherein the first signal component is dependent upon at least a first portion of a spatial audio image conveyed by the input audio signal, and the second signal component is dependent upon at least a second portion of the spatial audio image that is different to the first portion; perform cross-channel mixing of at least some of the multiple input channels of the second signal component to produce a modified second signal component while enabling the first signal component to bypass cross-channel mixing; and combine the first signal component and the modified second signal component into an output audio signal comprising two output channels configured for rendering by headphone apparatus.
- 17. A computer program comprising computer readable program code configured to cause a computer to: derive, based on an input audio signal, a first signal component, comprising at least one input channel, and a second signal component, comprising multiple input channels, wherein the first signal component is dependent upon at least a first portion of a spatial audio image conveyed by the input audio signal, and the second signal component is dependent upon at least a second portion of the spatial audio image that is different to the first portion; and perform cross-channel mixing of at least some of the multiple input channels of the second signal component to produce a modified second signal component while enabling the first signal component to bypass cross-channel mixing.
- 18. An apparatus for processing an input audio signal comprising multiple channels to produce a two-channel output audio signal configured for rendering by headphone apparatus to produce a spatial audio image, the apparatus comprising: means for processing an input audio signal comprising multiple channels to produce a two-channel output audio signal configured for rendering by headphone apparatus; means for spatially processing the input audio signal to add at peripheral positions, but not at central positions, of the spatial audio image positionally-dependent interaural time differences measurable between coherent audio events in both of the channels of the output audio signal and frequency-dependent and positionally-dependent interaural level differences measurable between coherent audio events in both of the channels of the output audio signal.