AUDIO RENDERING SYSTEM AND METHOD THEREFOR
FIELD OF THE INVENTION
The invention relates to an audio rendering system and in particular, but not exclusively, to a spatial audio rendering system such as a surround sound audio rendering system.
BACKGROUND OF THE INVENTION
Multi-channel audio rendering, and in particular multi-channel spatial sound rendering, beyond simple stereo has become commonplace through applications such as surround sound home cinema systems. Typically such systems use loudspeakers positioned at specific spatial positions relative to a listening position. For example, a 5.1 home cinema system provides spatial sound via five loudspeakers being positioned with one speaker directly in front of the listening position (the center channel), one speaker to the front left of the listening position, one speaker to the front right of the listening position, one speaker to the rear left of the listening position, and one speaker to the rear right of the listening position. In addition, a non-spatial low frequency speaker is often provided.
Such conventional systems are based on the reproduction of audio signals at specific nominal positions relative to the listening position. One speaker is typically provided for each audio channel and therefore speakers must be positioned at locations corresponding to the predetermined or nominal positions for the system.
In many audio systems, such as spatial multi-channel and especially surround sound systems, there is a desire to provide a more involving user experience. This may for example be achieved by introducing additional speakers that can be positioned at new positions thereby providing a more encapsulating sound rendering for a listener at a given listening position. However, as content is often provided in a specific and typically legacy driven format, the audio rendering system may in many such applications be required to generate new channels from the received signal. For example, for a stereo signal, it may be desirable to derive channels that can be rendered from the side or behind the listening position. For a five channel surround sound system, it may be desirable to generate a sixth and seventh channel e.g. for rendering from an elevated position or to the side of the listener.
Thus, audio rendering systems may perform an upmixing of one or more input channels to generate additional channels. The system may accordingly employ an algorithm which synthesizes additional loudspeaker driving signals from a given input audio signal.
However, a critical issue for such upmixing is that spatial or other distortions should not be introduced, and that the resulting rendered audio stage should still be perceived as natural. Specifically, it is desirable that a more involving and encapsulating sound experience is provided without this resulting in e.g. spatially well-defined sound sources changing their perceived position.
Although a number of algorithms and approaches have been proposed for upmixing of audio to synthesize new channels, these tend not to provide optimal performance. Specifically, most rendering systems generating and rendering synthesized channels tend either to provide a less than optimal immersive experience and/or to introduce spatial distortions to spatially well-defined sound sources.
Hence, an improved audio rendering approach would be advantageous and in particular an audio rendering approach that allows upmixing to synthesize one or more additional channels. Especially an audio rendering approach allowing for increased flexibility, reduced complexity, an improved user experience, a more encapsulating sound experience, reduced spatial distortions, and/or improved performance would be
advantageous.
SUMMARY OF THE INVENTION
Accordingly, the invention seeks to preferably mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination.
According to an aspect of the invention there is provided an audio rendering system comprising: an audio renderer; a first speaker arrangement coupled to the audio renderer and arranged to render audio to a listening position, the first speaker arrangement having a directional radiation pattern with a direction from the first speaker arrangement to the listening position being within a 3dB beamwidth of a main lobe of the first speaker arrangement; and a second speaker arrangement coupled to the audio renderer and arranged to render audio to the listening position, the second speaker arrangement having a directional radiation pattern with a direction from the second speaker arrangement to the listening position being outside a 3dB beamwidth of a main lobe of the second speaker arrangement; wherein the audio renderer comprises: a receiver for receiving a multi-channel audio signal; a correlation estimator for generating a correlation measure for a first channel signal and a
second channel signal of the multi-channel audio signal; an upmixer for upmixing the first channel signal to a first audio signal and a second audio signal in response to the correlation measure, the second audio signal corresponding to a more diffuse sound than the first audio signal; a first driver for driving the first speaker arrangement from the first audio signal; and a second driver for driving the second speaker arrangement from the second audio signal.
The invention may provide an improved user experience to a listener. In particular, a more encapsulating and immersive user experience may often be achieved. In many scenarios an extended sound stage can be perceived. The sound stage may be perceived as natural and spatial distortions of spatially well-defined positions may be reduced. In particular, the combination of an upmixing based on the correlation/coherence between two channels combined with the rendering using non-reflected and reflected paths may provide an improved perceived sound stage expansion in many implementations. Specifically, it may allow for a spatial expansion of ambient sound that is typically perceived as not having strong spatial cues while at the same time allowing specific and well defined individual spatial sound sources to appear unmodified. The approach may specifically result in an audio rendering which expands the general ambient sound to be perceived to increasingly surround the user without changing the specific sound sources in the sound stage.
Specifically, the diffuse sound may be spatially expanded to provide a more embracing sound stage without introducing spatial distortions or errors to non-diffuse/direct sound.
In many embodiments and for many audio signals, the approach may be able to deliver both clearly localizable sounds as well as a very enveloping ambient sound. This may typically be achieved without the need for any user interaction.
In many embodiments the first and second channels may specifically be a left front and right front channel of a stereo or a surround sound setup. In many embodiments the first and second channels may specifically be a left surround and right surround channel of a surround sound setup. The upmixing which is applied to the first channel signal may also be applied to the second channel signal.
The directional radiation pattern from the two speaker arrangements may be substantially the same or may be different. The beamwidth of the main lobe may in some embodiments be relatively narrow (say ±20°) or may e.g. in other embodiments be relatively broad (say ±120°). In some embodiments, the first speaker arrangement may have a directional radiation pattern which has two (or more) substantially equal lobes in which case either of these main lobes may comprise the direction to the listening position within their 3dB beamwidth. In some embodiments, the second speaker arrangement may have a directional radiation pattern which has two (or more) substantially equal lobes in which case neither of these main lobes comprises the direction to the listening position within their 3dB beamwidth. For example, for a second speaker arrangement being implemented by a bipolar speaker, neither of the lobes will include the direction to the listening position within their 3 dB beamwidth.
The first speaker arrangement may in use render audio to the listening position predominantly along non-reflected acoustic paths. The first speaker may specifically be arranged such that more than half of the audio energy reaching the listening position from the first speaker arrangement within the first 20 ms after the first wavefront does so via one or more direct paths. Some of the sound within the 20 ms may possibly reach the listening position through reflected acoustic paths but more than half of the audio energy reaching the listening position from the first speaker arrangement in this time interval will in many embodiments and scenarios not be reflected. Sound outside the 20 ms time interval will typically be reverberant sound with few and weak spatial cues. The reverberant sound tends to be dependent only on room acoustics and not on the speaker setup and arrangements.
The second speaker arrangement may in use render audio to the listening position predominantly along reflected acoustic paths. The second speaker may specifically be arranged such that more than half of the audio energy reaching the listening position from the second speaker arrangement within the first 20 ms after the first wavefront does not do so via one or more direct paths. Some of the sound within the 20 ms may possibly reach the listening position through direct non-reflected acoustic paths but more than half of the audio energy reaching the listening position from the second speaker arrangement in this time interval will in many embodiments and scenarios be reflected at least once. Typical reflections may be off the walls, ceiling or floor of the room in which the rendering system is located.
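Purely as an illustrative sketch (not part of the above description) of how such an early-energy criterion could be evaluated for a given setup, the following Python fragment estimates the fraction of the energy within the first 20 ms after the first wavefront that arrives via the direct path, given a measured room impulse response. The detection threshold, the 2 ms direct-path window and the synthetic response are assumptions of the example.

```python
import numpy as np

def direct_energy_fraction(rir, fs, early_ms=20.0, direct_ms=2.0):
    """Fraction of the energy in the first `early_ms` after the first
    wavefront that lies in a short window around the direct arrival."""
    # Locate the first wavefront: first sample exceeding a small
    # threshold relative to the peak of the impulse response.
    threshold = 0.05 * np.max(np.abs(rir))
    first = int(np.argmax(np.abs(rir) > threshold))
    early = rir[first:first + int(early_ms * 1e-3 * fs)]
    direct = early[:int(direct_ms * 1e-3 * fs)]
    early_energy = np.sum(early ** 2)
    return np.sum(direct ** 2) / early_energy if early_energy > 0 else 0.0

# Synthetic example: a direct spike plus one strong reflection 8 ms later.
fs = 48000
rir = np.zeros(fs)
rir[100] = 1.0                      # direct path
rir[100 + int(0.008 * fs)] = 0.7    # reflection
print(direct_energy_fraction(rir, fs))  # > 0.5: predominantly direct
```

Under this sketch's assumptions, a first speaker arrangement would be expected to yield a fraction above one half and a second speaker arrangement a fraction below one half.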
The second audio signal can correspond to a more diffuse sound than the first audio signal in that the second audio signal has a higher proportion of signal components for which the correlation measure indicates a lower correlation between the first channel signal and the second channel signal than for the first audio signal. The second audio signal can correspond to a more uncorrelated sound (between the first and second channel) than the first audio signal. When referring to the first and second audio signal representing or
corresponding to more or less diffuse sound, this reference may be considered with reference to the audio scene represented by the input multi-channel signal. This audio scene may
represent an audio environment with a number of spatially well defined (point like) sources as well as more diffuse sound components that are not spatially well-defined. The second audio signal can correspond to more diffuse sound than the first audio signal in that it contains a higher proportion of the audio energy of the diffuse sound of the input multi-channel/captured audio scene than does the first audio signal. Similarly, the first audio signal can correspond to less diffuse sound than the second audio signal by it containing a higher proportion of the audio energy of the spatially well-defined audio sources of the input multi-channel/captured audio scene than does the second audio signal. Thus, when referring to a signal representing a degree of diffuseness this may relate to the characteristic of the sound components it contains from the original multi-channel signal and thus from the captured audio scene. The term diffuseness/non-diffuseness may when referring to the signals generally correspond to terms such as directional/non-directional, localizable/non-localizable and/or foreground/background.
The first audio signal may predominantly include sound components of the first channel signal corresponding to spatially specific audio sources (such as point-like sources) whereas the second audio signal may predominantly include sound components of the first channel signal corresponding to spatially non-specific ambient sound. Specifically, the second audio signal may predominantly reflect background sounds whereas the first audio signal may predominantly reflect specific foreground sound sources.
In accordance with an optional feature of the invention, the audio renderer is arranged to divide the first channel signal into a plurality of time-frequency intervals; and the correlation estimator is arranged to generate a correlation value for each time-frequency interval; and the upmixer is arranged to generate the second audio signal by, for each time-frequency interval, weighting a signal value of the first channel signal for the time-frequency interval by a first weight being a monotonically decreasing function of the correlation value for the time-frequency interval.
This may provide a particularly advantageous approach. In particular, it may provide an efficient separation of sound components which are highly correlated between channels and sound components that are not highly correlated. The approach may allow an effective generation of a second audio signal which corresponds to diffuse sound components of the first audio channel.
In accordance with an optional feature of the invention, the upmixer is further arranged to generate the first audio signal by, for each time-frequency interval, weighting the signal value of the first channel signal for the time-frequency interval by a second weight being a monotonically increasing function of the correlation value for the time-frequency interval.
This may provide a particularly advantageous approach. In particular, it may provide an efficient separation of sound components which are highly correlated between channels and sound components that are not highly correlated. The approach may allow an effective generation of a first audio signal which corresponds to non-diffuse sound components of the first audio channel, and a second audio signal which corresponds to diffuse sound components of the first audio channel.
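As a minimal sketch of such complementary per-tile weightings (the linear weight functions below are merely one possible monotone pair and an assumption of the example):

```python
import numpy as np

def split_tile(x_tile, corr):
    """Split one time-frequency tile of the first channel signal:
    the first audio signal is weighted by a monotonically increasing
    function of the correlation value, the second (more diffuse)
    audio signal by a monotonically decreasing one."""
    corr = float(np.clip(corr, 0.0, 1.0))
    return corr * x_tile, (1.0 - corr) * x_tile

first, second = split_tile(0.8 + 0.1j, corr=0.9)  # mostly first/non-diffuse
```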
In accordance with an optional feature of the invention, the upmixer is further arranged to determine the weight in response to an energy difference estimate for the first channel signal and the second channel signal.
The approach may e.g. allow an improved separation into diffuse and non-diffuse sound. Specifically, it may provide improved consideration for spatially well-defined (e.g. point-like) sources that are panned to one of the first and second channels, i.e. for which the energy is predominantly located in one of the channels.
The energy difference may be evaluated in individual time-frequency intervals, over a group of time-frequency intervals or over all frequencies.
The gain may be determined as a function of the energy difference and may specifically be a monotonically decreasing function of the energy difference.
In accordance with an optional feature of the invention, the correlation estimator is arranged to determine the correlation value for a time-frequency interval in response to a frequency averaging of correlation values of a plurality of time-frequency intervals.
This may provide improved performance and may in particular, in many embodiments and for many signals, reduce distortion caused by the upmixing of the first channel signal.
In accordance with an optional feature of the invention, the upmixer is further arranged to determine the weight in response to an audio content characteristic for the multichannel signal.
This may provide an improved user experience in many embodiments. For example, it may provide an improved adaptation of the rendering of diffuse and non-diffuse sound of the specific audio signal. For example, a sound stage more appropriate for the audio content may be generated.
In accordance with an optional feature of the invention, the audio renderer may be arranged to modify a rendering property of the first audio signal independently of the second audio signal.
This may provide an improved user experience in many embodiments. For example, it may provide an improved adaptation of the rendering of diffuse and non-diffuse sound of the specific audio signal. For example, a sound stage more appropriate for the audio content may be generated.
In accordance with an optional feature of the invention, the rendering property is an audio level for the first audio signal.
This may provide an improved user experience in many embodiments. For example, it may allow the balance between ambient background sound and foreground sound sources to be dynamically varied.
Alternatively or additionally, the audio renderer may be arranged to modify an audio level of the second audio signal independently of the first audio signal.
In accordance with an optional feature of the invention, the rendering property is a spatial audio radiation pattern property.
This may provide an improved user experience in many embodiments. In particular, it may allow the audio radiation pattern to be individually optimized for the rendering of ambient background sound and foreground point-like sound sources. The audio radiation pattern property may be a property of a beam pattern/shape, e.g. of a speaker array used with a dynamically variable beamformer.
Alternatively or additionally, the audio renderer may be arranged to modify a spatial audio radiation pattern property for the second audio signal independently of the first audio signal.
In accordance with an optional feature of the invention, the directional radiation pattern for the second speaker arrangement has a notch in the direction of the listening position.
This may provide an improved user experience by providing an improved perception of the rendered diffuse sound components. The second speaker arrangement may specifically be an audio array controlled by a beamformer comprised in the second driver. The beamformer may be arranged to (possibly dynamically) steer a null in the direction of the listening position.
In accordance with an optional feature of the invention, the second speaker arrangement comprises a bipolar speaker arrangement.
This may allow advantageous performance while maintaining a low complexity implementation.
In accordance with an optional feature of the invention, the first speaker arrangement and the second speaker arrangement are comprised in one speaker enclosure.
This may provide a practical implementation and may in many cases be advantageous to a user as only one speaker enclosure needs to be positioned in the audio environment. The two speaker arrangements may be implemented by separate sets of one or more drive units angled in different directions. As another example, the first and second speaker arrangements may be implemented by a single audio array driven by a different beamformer for each of the first and second audio signals with the beamformers generating beams in different directions.
In accordance with an optional feature of the invention, the multi-channel audio signal is a spatial multi-channel signal having each channel associated with a nominal position of a spatial speaker configuration, and wherein the second speaker arrangement is located at a different position than the nominal position.
This may provide an improved user experience and a more encapsulating sound rendering in many embodiments. In particular, it may provide the perception of a larger sound stage while still maintaining the positions of point like audio sources.
In accordance with an optional feature of the invention, the second driver is associated with an elevated speaker position.
This may provide an improved user experience and a more encapsulating sound rendering in many embodiments. In particular, it may provide the perception of a larger sound stage while still maintaining the positions of point like audio sources.
According to an aspect of the invention there is provided a method of rendering audio.
These and other aspects, features and advantages of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be described, by way of example only, with reference to the drawings, in which
Fig. 1 illustrates an example of elements of an audio rendering system in accordance with some embodiments of the invention;
Fig. 2 illustrates an example of elements of an audio renderer in accordance with some embodiments of the invention; and
FIG. 3 illustrates an example of a correlation measure between two channels of a multi-channel audio signal.
DETAILED DESCRIPTION OF SOME EMBODIMENTS OF THE INVENTION
The following description focuses on embodiments of the invention applicable to a spatial surround sound system and in particular to a surround sound home cinema system. However, it will be appreciated that the invention is not limited to this application but may be applied in many other audio rendering systems.
Fig. 1 illustrates an example of an audio rendering system in accordance with some embodiments of the invention.
In the system, an audio renderer 101 receives a multi-channel signal which in the specific example is a five channel spatial multi-channel signal. The multi-channel signal may be a conventional five channel signal with spatial channels associated with loudspeakers positioned at specific spatial positions relative to a listening position 103. For example, a 5.1 home cinema system provides spatial sound via five loudspeakers being positioned with one speaker 105 directly in front of the listening position (the center channel), one speaker 107 to the front left of the listening position, one speaker 109 to the front right of the listening position, one speaker 111 to the rear left of the listening position, and one speaker 113 to the rear right of the listening position. In addition, a non-spatial low frequency effects channel may be provided and rendered via a low frequency speaker (not shown).
The system of Fig. 1 may thus provide a spatial sound experience to a listener at the listening position 103. However, rather than merely provide a conventional five channel rendering, the system of Fig. 1 is further arranged to synthesize additional channels from the received signals. Specifically, the system of Fig. 1 may decompose one channel into two channels and render the two channels from two different speaker arrangements.
In the specific example, the left front channel is decomposed into a first signal and a second signal, where the first signal drives a first speaker 107 which specifically may be positioned at the nominal position for the left front channel and the second signal drives a second speaker 115 which may be collocated with the first speaker 107 or may be positioned elsewhere.
In the example, the right front channel is decomposed in a similar way and thus an additional speaker arrangement 117 is used to render the additional signal.
In the system, the signal in each of the two front side channels is thus divided into two different signals. Furthermore, one of the generated signals predominantly corresponds to non-diffuse sound, such as sound from specific (point like) sound sources, whereas the other signal predominantly corresponds to more diffuse sound. The
differentiation and the decomposition is based on an evaluation of the correlation between different channels of the multi-channel audio signal. Specifically, point-like sources tend to exhibit a high degree of correlation between channels whereas diffuse sounds, such as those originating from e.g. reverberation effects, non-directional noise etc., tend not to exhibit a high degree of correlation. In the system, the individual characteristics of the two channels are further emphasized by the rendering of the different signals being different. Specifically, the non-diffuse signal is to a large extent rendered via direct acoustic paths whereas the diffuse signal is to a large extent rendered via indirect paths.
The system may specifically apply a blind decomposition algorithm which seeks to differentiate between ambient and more diffuse background sound and specific spatially well-defined foreground sound sources.
For example, an audio scene in a movie may often consist of sound sources that are at the foreground to the listener (like dialogue and some effects) and sound elements that are at a larger distance or at the background (environmental sounds and sometimes background music). The latter type of sound will typically be diffuse sound with few specific spatial cues.
In general, the blind decomposition of the content in this way would be extremely difficult without additional cues. However, in many stereo and multi-channel recordings the original mixture has already been created in such a way that the foreground and background sound elements are mixed in different ways. In particular, foreground sound elements typically appear in only one or two loudspeakers in which case they have a large signal-level cross-correlation at zero time-lag. On the other hand, background sound elements are typically placed in two or more loudspeakers and they are typically weakly zero-lag cross-correlated between pairs of channels. Some foreground sounds which are panned predominantly to one channel may also exhibit a low cross-correlation but as will be described later, such scenarios may explicitly be detected and compensated for.
In the system of Fig. 1 a correlation based decomposition of a signal is used and this can often achieve that two signals can be generated where one predominantly corresponds to diffuse background sounds whereas the other predominantly corresponds to non-diffuse foreground sound elements.
In the system, the two signals can be rendered by speaker arrangements having different directional radiation patterns. Specifically, the speaker arrangement rendering the foreground signals may be aimed directly at the listening position 103, i.e. the listening position may fall within the (3dB) beamwidth of the mainlobe of the speaker arrangement. In contrast, the speaker arrangement rendering the background signals may be aimed away from the listening position 103. For this speaker arrangement, the listening position may thus be outside the (3dB) beamwidth of the mainlobe. Such an arrangement may thus ensure that the proportion of sound rendered via direct acoustic paths relative to the proportion of sound rendered via reflected acoustic paths is much higher for the rendering of foreground objects than for the rendering of background objects. Thus, the relative diffuseness of the rendered sound is increased for background objects relative to the foreground objects.
Specifically, once the decomposed audio signals are generated, the foreground signals can be reproduced by a speaker arrangement which for a non-reverberant time interval (of 20 ms) predominantly renders the signal to the listening position 103 via direct acoustic paths thereby providing strong spatial cues resulting in clearly localizable sound images. On the other hand, the background signals may be rendered by a speaker
arrangement that for a non-reverberant time interval (of 20 ms) predominantly renders the audio to the listening position 103 via reflected paths thereby providing an increase in the diffuseness of the background sound. This may be particularly advantageous in many embodiments as the additional background channel can be used to provide a more encapsulating listening experience with sound being perceived to come from many directions. Thus, a perception of a larger sound stage may be achieved while at the same time using the foreground channel to ensure that the positions of specific foreground sound sources do not change.
Fig. 2 illustrates an example of some elements of the system of Fig. 1 which are related to the generation of two output channels from one input channel. Specifically, the figure may be considered to illustrate elements for the left front channel of Fig. 1 but it will be appreciated that the approach is equally applicable to the right front channel and indeed to any audio signal which is upmixed to two output channels that are then differently rendered.
Fig. 2 illustrates the audio renderer 101 comprising a receiver 201 that is arranged to receive the multi-channel signal.
The receiver 201 is coupled to an upmixer 203 which is fed one of the channel signals of the multi-channel signal, and in the specific example it is fed the left front channel. The upmixer 203 is arranged to upmix the received signal to generate two output signals. The second signal comprises a higher proportion of diffuse sound than the first signal. Thus, the upmixer 203 may divide the input signal into sound components that predominantly correspond to diffuse or non-spatially well-defined sound sources, and sound components that are not diffuse but typically are spatially relatively well defined. The first signal may typically predominantly correspond to specific foreground elements whereas the second signal may typically correspond to background sound. The two signals will henceforth be referred to as a foreground signal and a background signal.
In the system the decomposition into the foreground and background signals is performed by considering the correlation between two channels of the multi-channel signal. The approach may specifically exploit that diffuse/background signals tend to be generated to have low correlation between different channels of the multi-channel signal whereas point-like/specific foreground objects tend to have a high correlation. The upmixer 203 may thus decompose the signal by seeking to direct sound components with high correlation to the foreground signal and sound components with low correlation to the background signal. Thus, the foreground signal may comprise a higher concentration of correlated sound components than the background signal.
The foreground signal is fed to a first driver 205 which is coupled to the upmixer 203 and an external speaker arrangement 107 (henceforth referred to as the foreground speaker 107) which may comprise one or more speaker drivers/audio transducers.
The background signal is fed to a second driver 207 which is coupled to the upmixer 203 and an external speaker arrangement 115 (henceforth referred to as the background speaker 115) which may comprise one or more speaker drivers/audio transducers.
Thus, the two generated signals are rendered independently using different speaker arrangements (also for brevity referred to as speakers although it will be appreciated that these may comprise a plurality of speaker drivers, and may indeed share some speaker drivers e.g. using an audio array and beamforming to render the channels).
Furthermore, the individual speakers are arranged to provide a rendering which is particularly suitable for the specific type of audio signal rendered. Thus, the characteristics of the speakers are such that they provide a particularly advantageous rendering for the individual characteristics of the two generated signals.
In the system, both the foreground speaker arrangement 107 and the background speaker arrangement 115 are directional speakers and thus have a directional
radiation pattern (e.g. given as a relative gain as a function of angle of radiation). The directional radiation pattern has a mainlobe for which the maximum radiation level (the maximum gain) is achieved. The beamwidth of such a mainlobe may be determined as the 3dB beamwidth given as the width of the beam between the two points at which the radiation (power) level (the gain) has dropped to 3dB lower than the maximum radiation level (gain). For some speaker arrangements (such as a bipolar speaker) the radiation pattern may exhibit a plurality of identical lobes (i.e. there may be more than one mainlobe).
In the system of Fig. 1, the foreground speaker arrangement 107 is arranged such that the listening position 103 falls within the 3dB beamwidth of the mainlobe (or of any of the mainlobes in case there are more than one). In contrast, the background speaker arrangement 115 is arranged such that the listening position 103 does not fall within the 3dB beamwidth of the mainlobe (or of all the mainlobes in case there are more than one). This arrangement may specifically allow the rendering of the foreground signal being
predominantly along direct acoustic paths (with an initial non-reverberant time interval as will be described in the following) whereas the background signal is predominantly rendered along reflected acoustic paths (again within the non-reverberant time interval).
Specifically in the system of FIG. 1, the foreground speaker 107 is arranged to render audio to a listening position which within a non-reverberant time interval of 20 ms is predominantly rendered along non-reflected acoustic paths from the foreground speaker 107 to the listening position 103. Thus, at least half of the audio energy within the first 20 ms after the first wave front from the foreground speaker 107 reaches the listening position 103 via direct, non-reflected paths. Indeed, in many scenarios, at least 75% or even 90% of the sound energy may be via direct paths. Such a direct rendering provides a listener with strong spatial cues that allow sound components rendered from the foreground speaker 107 to be perceived to originate from the position of the foreground speaker 107. This, together with corresponding sound components from the other spatial channels (and especially from the front right and the center channels), provides a panning effect that allows specific spatially well-defined audio elements to be positioned in the sound scene and to be perceived as sound sources with specific well defined positions.
The background speaker 115 is in contrast arranged to render audio to a listening position 103 which within a non-reverberant time interval of 20 ms is
predominantly rendered along reflected acoustic paths from the background speaker 115 to the listening position 103. Thus, at least half of the early audio energy (within the 20 ms of the first wavefront) from the background speaker 115 reaches the listening position 103 via
non-direct, reflected paths. Indeed, in many scenarios at least 75% or even 90% of the sound energy may be via reflected paths. The reflections may occur off walls, floor, ceiling, obstacles etc. in the room in which the system is located.
Such an indirect rendering results in the rendered audio being spread in both time and space, and it will reduce the amount of spatial cues relating to the speaker position which is provided to a listener. The listener may instead perceive sound that is spread and with a more pronounced diffuse characteristic. Thus, the use of reflected sound enhances the diffuse nature of the background signal which corresponds to the more diffuse background or ambient sounds. Such diffuse sound is particularly suitable to provide the listener with a perception of a larger and more encapsulating sound scene without introducing e.g. the perception of phantom or moved audio sources.
In sound rendering systems, a significant part of the rendered energy reaches the listening position as reverberant signal components. Such a reverberation tail of the acoustic transfer function from a speaker to a listening position may be relatively long and difficult to estimate. Furthermore, the reverberant propagation tends to be independent of the specific speaker setup and is generally predominantly dependent on room characteristics.
The reverberant tail provides very limited spatial cues to the listener. In the system of Fig. 2, the differentiation between the renderings of the two speaker arrangements is used to provide a differentiation in the spatial perception. Accordingly, they are arranged to provide very different rendering for the initial non-reverberant time interval and the characteristics for the reverberant tail are less significant. Therefore, the speaker
arrangements are arranged to provide a very different rendering within a non-reverberant time interval which is defined as a 20 ms propagation time difference interval. However, outside of this 20 ms propagation time difference interval, the rendering characteristics are not considered significant and indeed may be the same for the two rendering systems.
Thus, the two speaker arrangements are arranged such that the audio reaching the listening position 103 within 20 ms of the first wave front does so via direct paths for the first speaker arrangement 107. Equivalently, the first 20 ms from the earliest non-zero value of the acoustic transfer function (i.e. from when the first wave front reaches the listening position) from the speaker arrangements to the listening position 103 is for the foreground speaker 107 predominantly a result of direct acoustic paths and for the background speaker 115 predominantly a result of reflected paths.
In the following, references to the differentiations in rendering and the difference between the rendering of the foreground speaker 107 and the background speaker
115 may for brevity not explicitly refer to the characteristics being for this 20 ms time interval, but it will be appreciated that the references to e.g. rendering being predominantly via direct or indirect acoustic paths are to be considered within this time interval.
The upmixer 203 is arranged to generate the foreground and background signals based on an evaluation of the correlation of the channel being upmixed (in the specific example the left front channel) with another channel. Specifically, a correlation measure which is indicative of the correlation between the channel being upmixed and another channel is used by the upmixer to synthesize the new signals.
Accordingly, the audio renderer 101 comprises a correlation estimator 213 which is arranged to generate a correlation measure for the signal of the channel being upmixed and the signal of another channel. In the example where the channel being considered is the left front channel, the correlation measure may typically and in many scenarios advantageously be indicative of the correlation of the left front channel to the right front channel. For examples where the channel being considered is the left surround channel, the correlation measure may typically and in many scenarios advantageously be indicative of the correlation of the left surround channel to the right surround channel. The correlations are of course equally appropriate for the right front and right surround channels respectively.
In the example of Fig. 2, the correlation estimator 213 is arranged to generate the correlation measure by performing a direct correlation. The correlation measure may comprise a specific correlation value for each of a plurality of time-frequency intervals, also referred to as time-frequency tiles. Indeed, the upmixing of the signal may be performed in time-frequency tiles and the correlation measure may provide a correlation value for each time-frequency tile.
In some embodiments, the resolution of the correlation measure may be lower than that of the time-frequency tiles of the upmixing. For example, a correlation value may be provided for each of a number of perceptual significance bands, such as for each of a number of ERB bands. Each perceptual significance band may cover a plurality of time-frequency tiles.
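One possible low-complexity realization of such a per-band correlation value, assuming the commonly used Glasberg-Moore ERB-rate scale (an assumption of this sketch rather than a requirement of the description), could be:

```python
import numpy as np

def erb_band_average(corr_bins, fs, n_fft):
    """Average per-bin correlation values (ndarray) within ERB-rate
    bands so that a single correlation value is used per band."""
    freqs = np.arange(len(corr_bins)) * fs / n_fft
    # Glasberg & Moore ERB-rate scale mapping frequency to band index.
    band = np.floor(21.4 * np.log10(1.0 + 0.00437 * freqs)).astype(int)
    averaged = np.empty_like(corr_bins)
    for b in np.unique(band):
        mask = band == b
        averaged[mask] = corr_bins[mask].mean()
    return averaged
```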
The correlation measure may be fed to the upmixer 203 which can proceed to determine gains for respectively the foreground and the background signal. Specifically, the input signal may be segmented and converted to the frequency domain. For each frequency domain value (FFT bin value) in the time segment (i.e. for each time-frequency tile), the upmixer 203 may generate a foreground signal value by multiplying it by a foreground gain derived from the correlation value for the corresponding time-frequency tile. The foreground gain may increase for increasing correlation. As a result a frequency domain signal is generated that comprises a high weighting of the correlated components of the input signal.
Similarly, for each frequency domain value (FFT bin value) in the time segment (i.e. for each time-frequency tile), the upmixer 203 may generate a background signal value by multiplying it by a background gain derived from the correlation value for the corresponding time-frequency tile. The background gain may decrease for increasing correlation. As a result a frequency domain signal is generated that comprises a low weighting of the correlated components of the input signal.
The two generated frequency signals may then be converted back to the time domain to provide the background and foreground signals.
The upmixer 203 may specifically determine the foreground gain and the background gain to exactly or approximately maintain the overall energy level of the signals (specifically the sum, or the sum of the square, of the gains may be set to one). The upmixer 203 may furthermore be arranged to provide a frequency domain smoothing of the gains which may improve the perceived sound quality.
In more detail, the input signal may be given by the short-time input signal vector $\mathbf{x}(n) = [x(n), x(n-1), \ldots, x(n-K+1)]^T$ or the spectrum vector obtained using the discrete Fourier transform: $\mathbf{X}(n,\omega) = \mathbf{F}\,\mathbf{w}\,\mathbf{x}(n)$, where $\mathbf{F}$ is a matrix of Fourier basis functions and the window function $\mathbf{w}$ is a diagonal matrix with, e.g., Hanning window coefficients on the diagonal and zeroes elsewhere.
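A minimal numerical counterpart of this windowed transform might look as follows (the frame length and window choice are illustrative only, and the ordering convention of the signal vector is glossed over):

```python
import numpy as np

K = 1024                 # frame length (illustrative)
win = np.hanning(K)      # diagonal of the window matrix

def spectrum(x, n):
    """X(n, omega): windowed DFT of the K most recent samples of the
    one-dimensional signal array x at time index n."""
    frame = x[n - K + 1:n + 1]
    return np.fft.rfft(win * frame)
```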
In the specific example, both the left front and the right front channels are upmixed and thus the upmixing is applied to a stereo signal
$$\mathbf{X}(n,\omega) = [X_1(n,\omega),\; X_2(n,\omega)]^T$$
The upmixing of such a stereo vector signal to an M-channel vector signal
$$\mathbf{Y}(n,\omega) = [Y_0(n,\omega),\;\ldots,\;Y_{M-1}(n,\omega)]^T$$
can be performed separately for each transform component. For the frequency component $\omega$, the upmixed vector signal is given by
$$\mathbf{Y}(n,\omega) = \mathbf{G}(n,\omega)\,\mathbf{X}(n,\omega)$$
where $\mathbf{G}(n,\omega)$ is a matrix operation.
The filter matrix can in the specific example be written in the following form:
$$\mathbf{G}(n,\omega) = \begin{bmatrix} g_{11}(n,\omega) & 0 \\ 0 & g_{22}(n,\omega) \\ g_{31}(n,\omega) & 0 \\ 0 & g_{42}(n,\omega) \end{bmatrix}$$
This matrix does not mix left and right channels (zeroes in the matrix). This is a design choice, and it will be appreciated that it is also possible to design algorithms where the cross-channel terms are non-zero resulting in mixing between the two sides. This may typically be more interesting for the synthesis of the background channels than for the synthesis of the foreground channels.
The gains of the matrix are determined from the correlation measure.
Furthermore, the weights for the foreground signals (i.e. $g_{11}$ and $g_{22}$) are determined as monotonically increasing functions of the correlation measure (and specifically of the correlation value in that time-frequency tile). Thus, the allocation of the signal energy of a specific time-frequency tile into the foreground signal increases the more the two spatial channels are correlated. It will be appreciated that the gains may also depend on other parameters and considerations but that the relationship to the correlation value will be monotonically increasing.
The weights for the background signals (i.e. $g_{31}$ and $g_{42}$) are determined as monotonically decreasing functions of the correlation measure (and specifically of the correlation value in that time-frequency tile). Thus, the allocation of the signal energy of a specific time-frequency tile into the background signal increases the less the two spatial channels are correlated, i.e. the more it corresponds to diffuse sound. It will be appreciated that the gains may also depend on other parameters and considerations but that the relationship to the correlation value will be monotonically decreasing.
Thus, the upmixer 203 decomposes the side front signals into signal components that are correlated and signal components that are not correlated, and thus typically into diffuse ambient sound and non-diffuse foreground sound.
The correlation estimator 213 determines the correlation values, which in the specific example are between the two front channels. For two input data sequences the correlation coefficient can be defined as:
$$C = \frac{\langle X_1(n,\omega),\, X_2(n,\omega)\rangle}{\sqrt{\langle X_1(n,\omega),\, X_1(n,\omega)\rangle\,\langle X_2(n,\omega),\, X_2(n,\omega)\rangle}}$$
where $\langle\cdot,\cdot\rangle$ denotes the computation of an expected value of the inner product of the two data sets over the variable $n$. When the value of the correlation coefficient $C$ approaches one, it may be said that the content is coherent in the two channels.
The signal power and the product of the two input channels can be obtained in each frequency bin as follows:
$$\phi_{ij}(n,\omega) = X_i(n,\omega)\,X_j(n,\omega)^{*} \quad (i,j = 1,2)$$
where * denotes the complex conjugate. Given these instantaneous quantities, a time direction filtering may be applied, e.g. using a first-order integrator with an adaptation parameter $\lambda_1$, resulting in a sliding-window estimate given by:
$$\bar{\phi}_{ij}(n,\omega) = \lambda_1\,\phi_{ij}(n,\omega) + (1-\lambda_1)\,\bar{\phi}_{ij}(n-1,\omega)$$
The correlation value for each time-frequency tile may then be determined as:
$$w(n,\omega) = \frac{\left|\bar{\phi}_{12}(n,\omega)\right|}{\sqrt{\bar{\phi}_{11}(n,\omega)\,\bar{\phi}_{22}(n,\omega)}}$$
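A sketch of this recursive estimation is shown below; the adaptation parameter value, the dictionary-based state and the small regularization constant guarding against division by zero are assumptions of the example:

```python
import numpy as np

lam1 = 0.1    # adaptation parameter lambda_1 (illustrative)
phi = {}      # smoothed phi_11, phi_22, phi_12 estimates

def correlation_tiles(X1, X2):
    """Per-bin correlation value w(n, omega) from first-order
    integrated (cross-)power estimates; X1, X2 are the current-frame
    spectra of the two channels."""
    inst = {'11': X1 * np.conj(X1),
            '22': X2 * np.conj(X2),
            '12': X1 * np.conj(X2)}
    for key, value in inst.items():
        prev = phi.get(key, value)            # initialize on first frame
        phi[key] = lam1 * value + (1 - lam1) * prev
    return np.abs(phi['12']) / np.sqrt(np.abs(phi['11'] * phi['22']) + 1e-12)
```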
Any highly variable function in the frequency domain, such as often results from the (frequency) bin-by-bin operation, can create a significant amount of audible artifacts when applied as a gain function for audio signal processing. The black solid line in Fig. 3 shows an example of such a weighting (gain) function, which is in fact the correlation values obtained according to the equations above. Although each value on this curve may represent the desired functionality of the weighting function, an additional averaging process in the frequency direction may improve audio quality substantially in many scenarios.
In the system of Fig. 2, the correlation estimator 213 is therefore further arranged to determine the correlation value for a given time-frequency interval in response to a (weighted) frequency averaging of correlation values of a plurality of time-frequency intervals. Thus, a spectral smoothing can be performed.
Accordingly, the smoothed correlation values may be determined as:
$$\bar{w}(n,\omega) = S[w(n,\omega)]$$
where $S[\cdot]$ indicates a suitable frequency smoothing function. For example, a triangular or square smoothing function may be applied. As a low complexity example, the smoothing function $S$ may simply determine the average of the unsmoothed correlation value for the current time-frequency tile and the $N$ surrounding (in the frequency domain) unsmoothed correlation values.
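By way of example only, such a low-complexity square smoothing could be realized as:

```python
import numpy as np

def smooth_over_frequency(values, n=4):
    """Frequency smoothing S[.]: average each value with its n
    neighbouring bins on either side (square window; n illustrative)."""
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    return np.convolve(values, kernel, mode='same')
```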
The individual gain coefficients $g_{pk}(n,\omega)$, $k = 1,2$, $p = 1,\ldots,4$ may then for example be determined as:
$$g_{11}(n,\omega) = g_{22}(n,\omega) = \bar{w}(n,\omega)$$
$$g_{31}(n,\omega) = g_{42}(n,\omega) = 1 - \bar{w}(n,\omega)$$
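Combining the above, one illustrative per-frame application of the gain matrix is sketched below (the subsequent inverse transform and overlap-add, which yield the time-domain foreground and background signals, are omitted for brevity):

```python
def upmix_frame(X1, X2, w_bar):
    """Apply the 4x2 gain matrix bin by bin: foreground left/right and
    background left/right spectra from the smoothed correlation w_bar."""
    g_fg = w_bar           # g11 = g22
    g_bg = 1.0 - w_bar     # g31 = g42
    return g_fg * X1, g_fg * X2, g_bg * X1, g_bg * X2
```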
In some embodiments, other parameters or characteristics may be taken into account when determining the gains. Specifically, in the system of FIG. 2, the correlation estimator 213 may optionally determine the gain in response to an energy difference estimate for the channels.
Specifically, an important special case that may be considered is when a strong spatially well-defined sound source is concentrated in one speaker, e.g. when the sound
source is hard-panned to the left or right. In this case the correlation coefficient will also become small, which would indicate to the system that the corresponding time-frequency region is likely to be ambient diffuse sound. This is usually not desired as the extreme side panned content is typically intended to be on the extreme sides in the stereo image rather than rendered diffusely.
For example, there are several examples of movie audio tracks with voices of characters moving over the stage from left to right (or right to left). If the movement starts or ends at the extreme right or left panning directions, a simple correlation-based separation may lead to the voice abruptly becoming diffuse and ambient at the start or end of the movement, which is an artifact that is very easy to notice. This also applies to many other dynamic spatial effects based on amplitude panning.
The system may specifically seek to address such an issue. This is in the example done by adapting the gains in response to an energy difference between the channels.
Specifically, an additional weight function $h(n,\omega)$ for the gains may be determined on the basis of estimates of signal energy differences between the two channels.
First, the amplitude difference between the two input channels is computed at each frequency bin:
$$E(n,\omega) = \log(\phi_{11}(n,\omega)) - \log(\phi_{22}(n,\omega))$$
Then in each frame, we apply time integration and the frequency domain smoothing to the obtained estimates to update the weight function $h(n,\omega)$ such that:
$$h(n,\omega) = S[\lambda_2\,h(n-1,\omega) + (1-\lambda_2)\,E(n,\omega)]$$
The function $h(n,\omega)$ is positive in spectral regions where channel 1 dominates and negative in regions where the other channel has more energy. Finally, the positive and negative values of $h(n,\omega)$ are mapped separately to the range $[0, 1]$ by $h(n,\omega) = f(h(n,\omega))$, e.g. using the logistic function:
$$f(x) = \frac{2}{1+\exp(-\chi x)} - 1$$
The parameter $\chi$ of this mapping function is typically $\chi = 0.6$. With the value $\chi = 0.0$ we actually obtain the method with no elimination of the hard-panned side signals. The value of the parameter can be chosen freely.
Finally, the actual gains can be computed as follows:
$$g_{11}(n,\omega) = \begin{cases} \bar{w}(n,\omega)^{\alpha}, & \text{if } h(n,\omega) > 0 \\ \bar{w}(n,\omega), & \text{if } h(n,\omega) \le 0 \end{cases}$$
$$g_{22}(n,\omega) = \begin{cases} \bar{w}(n,\omega)^{\alpha}, & \text{if } h(n,\omega) < 0 \\ \bar{w}(n,\omega), & \text{if } h(n,\omega) \ge 0 \end{cases}$$
$$g_{31}(n,\omega) = \begin{cases} (1-\bar{w}(n,\omega))\,(1-|h(n,\omega)|)^{\alpha}, & \text{if } h(n,\omega) > 0 \\ 1-\bar{w}(n,\omega), & \text{if } h(n,\omega) \le 0 \end{cases}$$
$$g_{42}(n,\omega) = \begin{cases} (1-\bar{w}(n,\omega))\,(1-|h(n,\omega)|)^{\alpha}, & \text{if } h(n,\omega) < 0 \\ 1-\bar{w}(n,\omega), & \text{if } h(n,\omega) \ge 0 \end{cases}$$
where $\alpha$ is an energy normalization term.
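The following sketch combines the energy-difference steps; the parameter values, the regularization constants and the use of the magnitude of h in the suppression term follow the reconstruction above and are assumptions of the example rather than definitive choices:

```python
import numpy as np

lam2, chi, alpha = 0.1, 0.6, 0.5   # illustrative parameter choices
h_state = None                     # integrated (unmapped) h estimate

def smooth_over_frequency(values, n=4):
    # Same square smoothing S[.] as in the earlier sketch.
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    return np.convolve(values, kernel, mode='same')

def panning_weight(phi11, phi22):
    """Update the channel-dominance function h(n, omega) and map it
    through the logistic function to the interval (-1, 1)."""
    global h_state
    E = np.log(phi11 + 1e-12) - np.log(phi22 + 1e-12)
    h_state = E if h_state is None else \
        smooth_over_frequency(lam2 * h_state + (1 - lam2) * E)
    return 2.0 / (1.0 + np.exp(-chi * h_state)) - 1.0

def compensated_gains(w_bar, h):
    """Piecewise gains keeping hard-panned content in the foreground."""
    damp = (1.0 - np.abs(h)) ** alpha
    g11 = np.where(h > 0, w_bar ** alpha, w_bar)
    g22 = np.where(h < 0, w_bar ** alpha, w_bar)
    g31 = np.where(h > 0, (1.0 - w_bar) * damp, 1.0 - w_bar)
    g42 = np.where(h < 0, (1.0 - w_bar) * damp, 1.0 - w_bar)
    return g11, g22, g31, g42
```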
The system thus separates out components that are likely to correspond to diffuse ambient/background sounds and components that are likely to correspond to non-diffuse foreground sources, thereby providing an upmixing to two distinct channels with characteristic properties.
In many systems, the audio renderer 101 may be arranged to individually adapt properties of the rendering for the two channels. Thus, the audio renderer 101 can change or set a rendering property for one of the signals independently of the setting for the other signal. The rendering signal processing may specifically be adapted by means of a user control e.g. for controlling the applicable limits of the degree of diffuseness.
As an example, the audio renderer 101 can set the audio level for one of the signals independently of the other signal. For example, the volume for the background signal
relative to the volume for the foreground signal may be modified and set to provide a desirable audio experience. Thus, in the system, the volumes of the background and foreground may be set individually for the two front side signals. This may provide an improved user experience in many scenarios. For example, it may allow emphasis of the dialogue relative to the background sound thereby e.g. aiding users with hearing difficulties.
As another example, the system may change the spatial rendering characteristics individually for the two signals. Thus, rather than rendering the diffuse background sound and the direct foreground sound for the front side channels in the same way as for a traditional system, the system can render the individual types of sound differently, and especially can render the foreground sound such that it provides strong spatial cues relating to the position of the speaker whereas the background sound is rendered via reflected paths thereby not providing strong spatial cues about the position of the speaker rendering the sound.
Furthermore, in some embodiments the radiation pattern (e.g. the beam pattern) for one of the speakers 107, 115 may be dynamically adaptable. For example, one of the speakers 107, 115 may be implemented using a speaker array with a dynamically adaptable beamformer. Indeed, in some embodiments, the same audio array may together with different beamformers render both the background signal and the foreground signal, i.e. both speaker arrangements 107, 115 may be implemented by the same audio array but using different beamform parameters to provide rendering in different directions.
In cases with dynamic beamforming, the system may individually steer the audio rendering in different directions for the two signals. For example, the system may track the position of a listener e.g. using a video based head tracking system. The beamform parameters may then be individually adapted for the two signals based on the position of the user. E.g. for the foreground signal the beamform weights can be set to direct a maximum of the beam-shape in the direction of the listening position whereas for the background signal the beamform weights can be set to direct a null in the direction of the listening position.
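As a hedged illustration of such null steering, the sketch below computes frequency-dependent weights for a simple two-driver far-field model such that the pair cancels exactly in the listener direction; the driver spacing, sample rate and plane-wave model are assumptions of the example, and a practical array would use more elements and calibrated responses.

```python
import numpy as np

C_SOUND = 343.0   # speed of sound in m/s

def null_steering_weights(freqs, theta_null, d=0.1):
    """Per-frequency weights (w1, w2) for two drivers spaced d metres
    apart such that their far-field sum is zero at angle theta_null
    (in radians from broadside)."""
    phase = 2.0 * np.pi * freqs * d * np.sin(theta_null) / C_SOUND
    w1 = np.ones_like(freqs, dtype=complex)
    w2 = -np.exp(1j * phase)   # cancels w1 exactly at theta_null
    return w1, w2

# Example: null towards a listener at 30 degrees for a 1024-bin block.
freqs = np.fft.rfftfreq(1024, d=1.0 / 48000.0)
w1, w2 = null_steering_weights(freqs, theta_null=np.deg2rad(30.0))
```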
It will be appreciated that different implementations of the speaker
arrangements 107, 115 can be used dependent on the specific preferences, requirements and restrictions of the individual embodiment.
Indeed, as mentioned, the speaker arrangements may be implemented as one or two audio arrays driven by two different beamformers (or equivalently the same physical beamforming functionality using different beamform weights). The beamform weights may in some embodiments be fixed thereby providing a fixed radiation pattern. The audio array
may in such cases be angled to provide strong direct paths to the listening position for the foreground signal but not for the background signal. Rather, the array may be angled to provide a notch (and typically a null) of the beam-pattern in the direction of the listening position.
In other embodiments a more low complexity approach may be used. For example, the speaker used for the foreground signal may be a conventional speaker driver directed towards the listening position. The background speaker arrangement may be a conventional speaker driver which is directed away from the listening position and typically towards a wall for providing suitable reflections.
In many embodiments, the two speakers 115, 107 can be comprised in a single speaker enclosure with the arrangements being such that the radiation patterns are in different directions. Specifically, a foreground speaker may be positioned in a front-firing
configuration whereas the background speaker may be positioned in a side-firing
configuration. When the speaker enclosure is positioned in the nominal position and angled towards the listening position, the foreground speaker will predominantly render the audio along direct paths whereas the background speaker will typically render the audio via reflections off e.g. a wall to the side of the speaker.
In many embodiments the background speaker arrangement 115 may be implemented by a bipolar speaker arrangement. Thus, two drivers may be fed the same drive signal but with a 180° phase difference and with the two drivers being directed in opposite directions. This approach will generate a strong sound radiation in two opposite directions with a null in-between. The null can be directed towards the listening position. This arrangement provides low complexity, and thus low cost, implementation yet can provide a strong rendering of the background signal in several directions thereby providing many different reflections. Furthermore, the direct path audio rendering can be minimized.
Accordingly, a diffuse rendering of the background signal can be achieved via a low cost implementation. The approach may be particularly suitable for implementations in a single enclosure with the two drivers of the bipolar arrangement being arranged in a side-firing configuration with a third driver used for rendering the foreground signal being arranged in a front-firing configuration.
In some embodiments, both the background and the foreground signals may be rendered from the same position, and indeed from the nominal or reference position associated with the spatial audio channel from which they are generated. Such approaches may specifically use a single speaker enclosure comprising both speaker arrangements.
However, in other embodiments, at least one of the generated signals may be rendered from a different position. Specifically, in many embodiments, the foreground signal may be rendered from the reference or nominal position of the channel that was upmixed. This ensures that the positions of the foreground objects in the audio stage are not modified. However, the background signal may be rendered from a different position than the foreground signal and specifically from another position than the nominal position of the channel being upmixed. This may provide an expanded sound stage and may in particular provide a perception of a substantially larger sound stage.
In particular, the background signals may be rendered from elevated speakers thereby providing a sound stage which extends outside the horizontal plane normally associated with rendering configurations.
In some embodiments, a similar effect may be achieved using (at least partially) up-firing speaker drivers for the background signal, where the up-firing speakers are provided in the same enclosure as the speaker driver(s) for the foreground signal.
For systems which allow reproduction of elevated sounds (e.g. above the listener), the approach can be adapted to generate appropriate signals for such elevated speakers. In most cases, the available media such as discs or broadcasts do not contain dedicated height signals. To overcome this, the described upmixing algorithm may be used. Existing solutions often generate height signals that are not uncorrelated with the other channels, thereby potentially elevating the complete sonic image including the principal sound sources. This is not favorable since the desired location of these sources is in most cases in the horizontal plane, and rendering them from the elevated positions will introduce a position offset from the horizontal plane. Other solutions avoid this issue by generating height signals with a rather low audio level. In both cases, the possible advantages of elevated loudspeakers are not fully used. However, the described approach can be used to extract audio signal components that predominantly correspond to more diffuse background sound. The corresponding signal can then be reproduced through e.g. elevated loudspeakers, thereby increasing the sonic envelopment and sense of realism, while not introducing disturbing artifacts such as localization shifts.
In some embodiments, the described approach may be applied to a plurality of the channels of the channel set. For example, the approach described for the front left and right channels may also be applied to the surround left and right channels. Thus, as a specific example, the system may accept five input signals, such as the spatial channels of a 5.1 surround sound signal, and may output nine loudspeaker signals, namely center, directional left/right/surround-left/surround-right, and diffuse left/right/surround-left/surround-right.
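Purely for illustration, the routing of such a five-in/nine-out configuration may be sketched as follows (Python; the channel names and the decompose() helper are hypothetical placeholders for the decomposition described above):

def upmix_5_to_9(signals, decompose):
    # signals: dict mapping channel name -> sample array.
    # decompose: function splitting a channel into (foreground, background).
    out = {"center": signals["center"]}   # center channel passed through
    for ch in ("left", "right", "surround_left", "surround_right"):
        fg, bg = decompose(signals[ch])
        out["directional_" + ch] = fg     # rendered from the nominal position
        out["diffuse_" + ch] = bg         # rendered e.g. from elevated speakers
    return out                            # nine loudspeaker signals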
In some embodiments, the decomposed signals may be recombined for at least one of the output signals. Specifically, the output signal for the speaker at the nominal position may be generated as a combination of the foreground signal and the background signal. This recombination may allow the diffuse background sounds to be rendered not only from the second speakers (e.g. elevated speakers) but also from the original positions. However, typically the relative level of the background signal components will be reduced with respect to the original signal to compensate for the rendering being along direct paths and for the additional rendering of background sounds which is provided by the additional speakers.
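As a sketch of this recombination (the attenuation factor is a hypothetical example value; the text only specifies that the background level is reduced relative to the original signal):

def recompose(foreground, background, bg_level=0.5):
    # Output for the speaker at the nominal position: the foreground signal
    # plus the background signal at a reduced relative level, since the
    # background is additionally rendered by the extra (e.g. elevated) speakers.
    return foreground + bg_level * background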
In some embodiments, the upmixer 203 is further arranged to determine the gains used to decompose the input signal into the background signal and foreground signal in response to an audio content characteristic for the received multi-channel signal.
Indeed, by modifying the gain factors, the balance between the direct and ambient channels can be adjusted and this may specifically be used to automatically adapt the processing depending on the audio content.
The audio content may for example be characterized by metadata describing the content. For example, if the audio corresponds to the audio of e.g. television programs, metadata may be provided to describe whether the audio is sound from e.g. a football game (a few foreground sources with significant diffuse background sound, such as the ambient sound of the crowd), from a discussion program (a few foreground sound sources with typically very little background sound), etc. The gains may be adjusted depending on such values. For example, for each content category a scale factor may be stored which scales the gains for the background and foreground decomposition in opposite directions.
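A sketch of such metadata-driven adaptation (the categories and scale factors below are invented for illustration; the text only states that a stored per-category scale factor adjusts the two gains in opposite directions):

CATEGORY_SCALE = {
    "football": 1.3,     # significant diffuse crowd sound: favour background
    "discussion": 0.7,   # few sources, little background: favour foreground
    "default": 1.0,
}

def adapt_gains(g_foreground, g_background, category):
    s = CATEGORY_SCALE.get(category, CATEGORY_SCALE["default"])
    # Scale the decomposition gains in opposite directions.
    return g_foreground / s, g_background * s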
In some embodiments, the adaptation may be in response to a characteristic of the audio signal itself, such as an averaged frequency response, the relative signal energies of the individual channels of the multi-channel signal, etc.
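One conceivable signal-driven variant (a purely hypothetical heuristic, not specified in the text) could use the relative energy of the surround channels as a crude proxy for how diffuse the content is:

import numpy as np

def channel_energy(x):
    return float(np.mean(np.square(x)))

def diffuseness_proxy(front_channels, surround_channels):
    # Ratio of surround energy to total energy, in [0, 1]; a higher value
    # suggests more diffuse content and could increase the background gain.
    e_front = sum(channel_energy(x) for x in front_channels)
    e_surround = sum(channel_energy(x) for x in surround_channels)
    return e_surround / (e_front + e_surround + 1e-12)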
It will be appreciated that the above description has, for clarity, described embodiments of the invention with reference to different functional circuits, units and processors. However, it will be apparent that any suitable distribution of functionality between different functional circuits, units or processors may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processors or controllers may be performed by the same processor or controller. Hence, references to specific functional units or circuits are only to be seen as references to suitable means for providing the described functionality rather than indicative of a strict logical or physical structure or organization.
The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these. The invention may optionally be
implemented at least partly as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units, circuits and processors.
Although the present invention has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the invention. In the claims, the term comprising does not exclude the presence of other elements or steps.
Furthermore, although individually listed, a plurality of means, elements, circuits or method steps may be implemented by e.g. a single circuit, unit or processor.
Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also the inclusion of a feature in one category of claims does not imply a limitation to this category but rather indicates that the feature is equally applicable to other claim categories as appropriate.
Furthermore, the order of features in the claims does not imply any specific order in which the features must be worked, and in particular the order of individual steps in a method claim does not imply that the steps must be performed in this order. Rather, the steps may be performed in any suitable order. In addition, singular references do not exclude a plurality. Thus references to "a", "an", "first", "second" etc. do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.