EP4240026A1

EP4240026A1 - Audio rendering

Info

Publication number: EP4240026A1
Application number: EP22159713.1A
Authority: EP
Inventors: Miikka Tapani Vilermo; Lasse Juhani Laaksonen; Arto Juhani Lehtiniemi
Original assignee: Nokia Technologies Oy
Current assignee: Nokia Technologies Oy
Priority date: 2022-03-02
Filing date: 2022-03-02
Publication date: 2023-09-06

Abstract

Examples of the disclosure make use of symmetries and other properties of sound scenes to reduce processing requirements for providing spatial audio and reduce the effects latencies introduced by the processing requirements. In some examples of the disclosure an encoder apparatus can comprise means for providing at least one audio signal. The at least one audio signal can comprise at least one audio object. The at least one audio object is located at a first angular direction but is rendered to a second angular direction within the at least one audio signal. The means are also for providing metadata indicative of the first angular direction of the at least one audio object.

Description

TECHNOLOGICAL FIELD

Examples of the disclosure relate to audio rendering. Some relate to reducing processing requirements for spatial audio rendering.

BACKGROUND

Spatial audio enables spatial properties of a sound scene to be reproduced for a user so that the user can perceive the spatial properties. This can provide an immersive audio experience for a user or could be used for other applications. Delays or latencies within the processing of the spatial audio can reduce the quality of the audio experience for a listener.

BRIEF SUMMARY

According to various, but not necessarily all, examples of the disclosure there may be provided an encoder apparatus comprising means for:

providing at least one audio signal wherein the at least one audio signal comprises at least one audio object and wherein the at least one audio object is located at a first angular direction but is rendered to a second angular direction within the at least one audio signal; and
providing metadata indicative of the first angular direction of the at least one audio object.

The second angular direction may be independent of a head orientation of a listener.
The second angular direction may be a predefined angular direction.
The second angular direction may be ninety degrees relative to a reference.
The second angular direction may be less than ninety degrees relative to a reference.
The reference may comprise the first angular direction.
According to various, but not necessarily all, examples of the disclosure there may be provided a method comprising:

According to various, but not necessarily all, examples of the disclosure there may be provided a computer program comprising computer program instructions that, when executed by processing circuitry, cause:

According to various, but not necessarily all, examples of the disclosure there may be provided a decoder apparatus comprising means for:

obtaining at least one audio signal wherein the at least one audio signal comprises at least one audio object wherein the at least one audio object is located at a first angular direction but is rendered to a second angular direction within the at least one audio signal; and
obtaining metadata indicative of the first angular direction of the at least one audio object;
determining a head orientation of a listener; and
generating a rendering of the at least one audio object for the head orientation of the listener based on the rendering of the at least one audio object to the second angular direction and a mapping of the rendering of the at least one audio object to a third angular direction.

The mapping of the at least one audio object to a third angular direction may comprise rendering the at least one audio object to a third angular direction wherein the third angular direction is at a predetermined angular position relative to the second angular direction.
The generating of the rendering of the at least one audio object for head orientation of the listener may comprise mixing of the rendering of the at least one audio object to the second angular direction and the rendering of the audio object to the third angular direction.
The mixing may be weighted based on the head orientation of the listener.
The generating of the rendering of the at least one audio object for head orientation of the listener may be performed in the time domain.
According to various, but not necessarily all, examples of the disclosure there may be provided a method comprising:

According to various, but not necessarily all, examples of the disclosure there may be provided an apparatus comprising means for:

determining a field of view for at least a first listener and a second listener;
determining whether an audio object is located within the field of view of the first listener and the second listener;
rendering the audio object for the first listener and the second listener wherein the rendering of the audio object is the same for both the first listener and the second listener if the audio object is outside of the field of view and the rendering of the audio object is different for the first listener and the second listener if the audio object is inside the field of view.

The rendering of the audio object for the second listener may be generated based on the rendering of the audio object for the first listener.
The rendering of the audio object for the second listener may be generated by mapping the rendering of the audio object for the first listener to a different angular orientation.
The rendering may comprise at least one of: binaural rendering, stereo rendering.
The field of view may be determined based on a device configured to be viewed by the first listener and the second listener.

BRIEF DESCRIPTION

Some examples will now be described with reference to the accompanying drawings in which:

FIGS. 1A to 1E shows symmetrical relationships between different audio signals;
FIG. 2 shows an example method;
FIG. 3 shows an example method;
FIG. 4 shows an example listener and audio object;
FIG. 5 shows an example listener and audio object;
FIG. 6 shows an example listener and audio object;
FIG. 7 shows an example system;
FIG. 8 shows an example method;
FIG. 9. shows a plurality of listeners and audio objects; and
FIG. 10 shows an example apparatus.

DETAILED DESCRIPTION

Spatial audio enables spatial properties of a sound scene to be reproduced for a user so that the user can perceive the spatial properties of the original sound scene. In audio rendering systems in which a listener can move their head, the rendering of the spatial audio has to be adjusted to take into account the changes in the head position of the listener. This adjustment has to take place in real time. If there are significant latencies in the processing of the spatial audio this can lead to delays that can be perceptible to the listener. These delays can reduce the audio quality and make the spatial audio sound unrealistic. If there is a significant delay a listener in a mediated reality environment, or other spatial audio application could have difficulties in determining which audio corresponds to a visual object. Examples of the disclosure make use of symmetries and other properties of the sound scenes to reduce the processing requirements for providing spatial audio and so reduces the effects of these latencies.
Figs. 1A to 1E show symmetrical relationships between different audio signals. Examples of the disclosure can make use of these relationships so as to reduce the processing required for rendering spatial audio.
Fig. 1A shows a listener 101 and an audio object 103. The listener 101 is listening to spatial audio using a head set 105 or other suitable audio device. The head set 105 can be configured to provide binaural audio to the listener 101. The head set 105 provides a right signal R to the right ear of the listener 101 and a left signal L to the left ear of the listener 101. The right signal R and the left signal L can comprise binaural signals or any other suitable type of signals.
In the example of Fig. 1A the audio object 103 is located in front of the listener 101 and slightly to the right of the listener 101. The right signal R and the left signal L can be rendered so that this location of the audio object 103 can be perceived by the listener 101.
Fig. 1B shows another arrangement for the listener 101 and the audio object 103. This arrangement has a symmetrical relationship to the arrangement that is shown in Fig. 1A. In the arrangement of Fig. 1B the audio object 103 has been reflected about the dashed line. The audio object 103 is now positioned in front of the listener 101 and to the left of the listener 101 rather than the right of the listener 101.
In Fig. 1B the head set 105 provides a right signal R' to the right ear of the listener 101 and a left signal L' to the left ear of the listener 101. The right signal R' and the left signal L' can comprise binaural signals or any other suitable type of signals.
The audio scenes represented in Figs. 1A and 1B are mirror images of each other. This means that the first left signal L is the same as the second right signal R' and the first right signal R is the same as the second left signal L'.
That is:
L=R' and R=L'
Therefore, if the signals L and R are available then the signals R' and L' can be obtained from the original signals L and R. The audio scene shown in Fig. 1B can be recreated by swapping over the signals used for the audio scene in Fig. 1A.
Fig. 1C shows another arrangement for a listener 101 and an audio object 103. In the example of Fig. 1C the listener 101 is facing to the right and the audio object 103 is positioned directly to the left of them. In this scenario a first left signal L is provided to the left ear of the listener 101 and a first right signal R is provided to the right ear of the listener 101.
In the example of Fig. 1D the listener 101 has rotated through 180° compared to the arrangement shown in Fig. 1C. In this arrangement the listener 101 is facing to the left and the audio object 103 is positioned directly to the right of them. In this scenario a second left signal L' is provided to the left ear of the listener 101 and a second right signal R' is provided to the right ear of the listener 101.
The symmetry of the arrangements shown in Figs. 1C and 1D means that, in an analogous manner to that shown in Figs. 1A and 1B, the first left signal L is the same as the second right signal R' and the first right signal R is the same as the second left signal L'. The signals could be swapped for each other in this situation. That is, the audio scene shown in Fig. 1D can be rendered by swapping the signals used for rendering the audio scene shown in Fig. 1C.
Fig. 1E shows a further arrangement in which the listener 101 has rotated through 90° compared to the arrangements shown in Figs. 1C and 1D. The listener 101 in this arrangement is facing directly towards the audio object 103. In this scenario a third left signal L" is provided to the left ear of the listener 101 and a third right signal R" is provided to the right ear of the listener 101.
The audio scene shown in Fig. 1E can be approximated by mixing the audio scenes from Fig. 1C and the audio scene from Fig. 1D. In this arrangement the third left signal L" can be generated from a mix of the first left signal L and the second left signal L'. Similarly, the third right signal R" can be generated from a mix of the first right signal R and the second right signal R'.
In this example $L " = (L + L') / 2$
and $R " = (R + R') / 2$
Similar approximation can also be used in scenarios in which the listener 101 is not facing directly at the audio object 103 but could be facing an angle intermediate between those shown in Figs. 1C, 1D and 1E. In such cases the signals needed for the left and right ear could still be obtained by mixing the first left signal L and the second left signal L' and the first right signal R and the second right signal R' respectively. However in such cases the signals that are mixed wouldn't be mixed equally and different factors could be applied to the respective signals. The different factors would be dependent upon the actual angular position of the audio object 103.
Figs. 1A to 1E therefore show that if you can obtain first left signals L and first right signals R for an audio object 103 you can use these to generate the left and right signals that can be used for other angular orientations for a listener 101. These approximations are sufficiently accurate to provide adequate spatial audio rendering for the listener 101. Examples of the disclosure make use of these spatial relationships to reduce the processing required for the spatial audio rendering. This reduction can help to reduce latencies and provide an improved spatial audio for the listener 101.
Fig. 2 shows an example method that can be used to implement some examples of the disclosure. The method of Fig. 2 could be implemented using an encoder apparatus. The encoder apparatus could be provided within an encoder 705 as shown in Fig. 7 or in any other suitable system.
The method comprises, at block 201, providing an audio signal. The audio signal comprises at least one audio object 103.
The audio object 103 is located at a first angular direction. The first angular direction can be determined relative to a device, a user, a coordinate system or any other suitable reference point. The angular position of the audio object 103 is determined by the audio scene that is represented by the audio signals. For example, it can be determined by the position of the microphones that capture the audio from the audio object 103.
In the audio signal the audio object 103 is rendered to a second angular direction. The second angular direction is different to the first angular direction. For instance, if the angular object 103 is positioned at thirty degrees, then this is the first angular direction. The second angular direction could be a different angular direction such as ninety degrees or sixty degrees or any other suitable direction.
The angle that is used for the second angular direction can be independent of a head orientation of a listener 101. In some examples the head orientation of the listener does not need to be known when the audio signal is being generated. This can mean that the encoding apparatus does not need to perform head tracking or obtain any head tracking data.
The second angular direction can be a predefined angular direction. The predefined angular direction can be defined relative to a reference coordinate system or axis, relative to a head orientation of the listener 101, relative to the actual angular direction of one or more audio objects 103 or relative to any other suitable references.
In some examples the second angular direction is ninety degrees relative to a reference. The reference could be the first angular direction that represents the actual position of the audio object 103. In some examples, the reference could be a direction that is determined to be a front facing or forward direction. In other examples the reference could be less than ninety degrees relative to the reference.
The rendering of the audio object 103 provides information that enables the audio object to be rendered as though it is positioned at the second angular direction. For instance, if the second angular direction is ninety degrees, then the rendering of the audio signal to the second angular direction would enable a left signal and right signal to be generated as though the audio object 103 is positioned at ninety degrees.
The audio signal that is provided at block 201 therefore provides rendering information that enables rendering of the audio object 103 at the second angular direction. The rendering information is not determined for the actual or correct angular direction of the audio object 103. In some examples the angular direction of the rendering information and the actual or correct angular direction of the audio object 103 will be different. In some examples it could happen that the listener 101 is positioned so that the angular direction of the rendering information and the actual or correct angular direction of the audio object 103 are the same, or substantially the same. In such cases examples of the disclosure could still apply.
The rendering information could comprise any information that indicates how the audio signals 201 comprising the audio object 103 should be processed in order to produce an audio output with the audio object 103 at the second angular direction. The rendering information could comprise metadata, mixing matrices or any other suitable type of information.
The method also comprises, at block 203, providing metadata where the metadata is indicative of the first angular direction of the audio object 103. The metadata can be indicative of the actual angular position of the audio object 103 within the audio scene.
The metadata can be provided with the audio signal. The metadata can be provided in any suitable format.
The audio signal and the metadata can be associated together. For example, they can be encoded into a single bitstream. The combination of the metadata indicative of the first angular direction of the audio object 103 and the audio signal with the audio object 103 rendered to a second angular direction comprise sufficient information to enable the audio object 103 to be rendered to the correct angular location for any angular orientation of the head of the listener 101.
In the above-described example, only one audio object 103 is referred to. In other examples the audio signal can comprise a plurality of audio objects. Each of the audio objects 103 can be rendered to a second angular position. The second angular position can be different for the different audio objects 103. In some examples the second angular direction can be the same for two or more of the audio objects 103. In examples where the audio signal comprises a plurality of audio objects 103 the metadata comprises information about the actual angular direction for each of the audio objects within the audio signal. The number of audio objects 103 and the angular direction of the audio objects 103 can be determined by the audio scene that is represented by the audio signals.
The audio signal and the metadata can be provided to a decoder apparatus. The decoder apparatus can be provided in an audio playback device. The audio playback device can be configured to decode the audio signals and process the decoded audio signal to enable spatial audio playback. The spatial audio playback could comprise binaural audio or any other suitable type of audio.
Fig. 3 shows another example method that can be used to implement some examples of the disclosure. The method of Fig. 3 could be implemented using a decoder apparatus. The decoder apparatus could be provided within a decoder 709 as shown in Fig. 7 or in any other suitable system. The method of Fig. 3 could be performed by an apparatus that has received the audio signal and metadata generated using the method of Fig. 2, or any other suitable method.
The method comprises, at block 301 obtaining an audio signal. The audio signal can be an audio signal as provided using the method of Fig. 2, or any other suitable method. The audio signal comprises at least one audio object 103 where the at least one audio object 103 is located at a first angular direction in the actual audio scene but is rendered to a second angular direction within the audio signal. The second angular direction is not the correct angular direction of the audio object 103.
The method also comprises, at block 303, obtaining metadata indicative of the first angular direction of the audio object 103. This metadata enables the decoder to obtain information about the actual angular direction of the audio object 103 within an actual audio scene.
At block 305 the method comprises determining a head orientation of the listener 101. The head orientation of the listener 101 can comprise an indication of the direction in which the listener 101 is facing. The head orientation of the listener 101 can be provided in a single plane, for example the head orientation can comprise an indication of an azimuthal angle indicative of the direction the listener 101 is facing.
Any suitable means and/or processes can be used to determine the head orientation of the listener 101. For example, a headset or earphones can comprise one or more sensors that can be configured to detect movement of the head of the listener 101 and so can enable the orientation of the head of the listener 101 to be determined. In other examples a head tracking device could be provided that is separate to a playback or decoding device.
At block 307 the method comprises generating a rendering of the audio object 103 for the head orientation of the listener 101. The rendering enables the audio object 103 to be perceived at the correct angular orientation. That is, the rendering enables the audio object 103 to be perceived to be at the first angular direction corresponding to the actual orientation of the audio object 103.
To enable the rendering of the audio object 103 at the correct angular orientation the rendering of the audio object 103 to the second angular direction and a mapping of the rendering of the audio object 103 to a third angular direction are used. The rendering of the audio object 103 to the second angular direction can be the rendering that is received with the audio signal. The mapping of the rendering of audio object 103 to a third angular direction can comprise any suitable rendering of the audio object 103 to a third angular direction.
The third angular direction can be at a predetermined angular position relative to the second angular direction. The rendering of the audio object to a third angular direction can be determined by making use of spatial relationships between the second angular direction and the third angular direction. For example, a symmetrical relationship can be used to allow a mapping between left signals and right signals for the respective angular directions. For instance, the third angular direction can be selected so that the left signal for the third angular direction is the same as the right signal for the second angular direction. Similarly, the right signal for the third angular direction would be the same as the left signal for the second angular direction. For instance, if the second angular direction is determined to be 90° to the right of a reference the third angular direction could be 90° to the left of the reference. This could generate a symmetrical arrangement.
The angles that are used for the second angular direction and the third angular direction can be independent of the orientation of the head of the listener 101. This can enable the same second angular direction and the third angular direction to be used for different orientations of the head of the listener 101. This can enable the same second angular direction and third angular direction to be used as the listener 101 moves their head.
To generate the rendering of the audio object 103 for head orientation of the listener 101 the rendering of the audio object to the second angular direction and the rendering of the audio object 103 to the third angular direction are mixed. Any suitable process can be used to mix the respective renderings. The mixing can be weighted based on the head orientation of the listener 101. The weighting can be such that a larger factor is applied to angular direction that is closest to the actual angular orientation of the audio object 103.
In examples of the disclosure the generating of the rendering of the audio object 103 for head orientation of the listener 101 can be performed in the time domain. This means that there is no need to transform the audio signals to a frequency domain. This can reduce latency of the processing of the audio signals. This can also reduce the processing requirements for determining the rendering for the audio objects as a listener moves their head.
Fig. 4 schematically shows an example listener 101 and audio object 103 and a respective first angular direction 401 and second angular direction 403.
The first angular direction 401 is the actual angular direction of the audio object 103. In this example the listener 101 is facing towards the audio object 103 so that the first angular direction 401 is directly ahead of the listener 101. The audio object 103 could be positioned at different angular directions in other examples of the disclosure.
In this example the first angular direction is given by the angle ω with respect to a reference 405. In this case the reference 405 is an arbitrary axis that can be fixed within a coordinate system. Other references could be used in other examples of the disclosure.
The second angular direction 403 is the direction to which the audio object 103 is rendered within the audio signal. In this example the second angular direction 403 is a rotation of 90° clockwise from the first angular direction 401. In this case the angle that is used for the second angular direction 403 is dependent upon the first angular direction 401. In other cases, the angle that is used for the second angular 403 direction can be independent of the first angular direction 401. For instance, the second angular direction 403 could be determined by a reference or point in a coordinate system. In the example of Fig. 4 the second angular direction 403 is a rotation of 90° clockwise from the first angular direction 401 but other angular relationships could be used in other examples.
In examples of the disclosure the audio signal that is provided would comprise a rendering of the audio object 103 to the second angular direction 403. The rendering to the second angular direction 403 is such that, if that rendering was used, the listener 101 would perceive the audio object 103 to be located at the second angular direction 403 instead of the first angular direction 401. The audio signal is provided with metadata indicative of the first angular direction 401.
Fig. 5 schematically shows the example listener 101 and audio object 103 and the respective first angular direction 401, second angular direction 403 and third angular direction 501.
The first angular direction 401 and the second angular direction 403 are as shown in Fig. 4. The third angular direction 501 is a direction to which the rendering of the audio object 103 can be remapped.
In the example of Fig. 5 the third angular direction 501 is selected to provide a symmetrical arrangement around the first angular direction 401. The symmetrical arrangement can be obtained by selecting an angle that is the same size as the angle between the first angular direction 401 and the second angular direction 403 but is in a different direction. In this case the second angular direction 403 direction is a rotation of 90° clockwise from the first angular direction 401 and so the third angular direction 501 is direction is a rotation of 90° anticlockwise from the first angular direction 401. Other sized angles and references could be used in other examples of the disclosure.
In examples of the disclosure a rendering of the audio object 103 to the third angular direction 501 is generated by mapping the rendering to the second angular direction 403 to the third angular direction 501. The mapping makes use of the symmetrical properties of the signals. In this example the mapping to the third angular direction 501 can be generated by swapping the left and right signals that are used for the rendering to the second angular direction 403.
For example, the rendering to the second angular direction 403 can comprise A where A is a spatial audio signal. The spatial audio signals could comprise binaural or stereo signal or any other suitable type of signal. The rendering to the second angular direction 403 can therefore comprise A left channel and A right channel signals. The rendering to the third angular direction 501 would comprise spatial audio signal B. the spatial audio signal B would comprise B left channel and B right channel signals. The B left channel and B right channel signals would be generated by swapping the left and right channels for the audio signal A. That is

B left channel = A right channel
B right channel = A left channel

Fig. 6 schematically shows the example listener 101 and audio object 103 and the respective first angular direction 401, second angular direction 403, third angular direction 501 and the reference 405.
In this case the head orientation of the listener 101 is indicated by angle ϕ. In this example the listener 101 is facing toward the audio object 103 but that does not need to be the case in other implementations of the disclosure. The listener 101 could be facing in other directions in other examples of the disclosure and/or the audio object 103 could be in a different position.
A rendering for the audio object 103 for the head orientation of the listener 101 is generated based on the rendering of the audio object 103 to the second angular direction 403 and the rendering or mapping of the audio object 103 to the third angular direction 501. In this example the rendering of the audio object 103 for head orientation of the listener 101 can be generated by mixing the rendering of the audio object 103 to the second angular direction 403 and the rendering of the audio object 103 to the third angular direction 501.
A suitable mixing equation could be: $Audio played to listner = ‖ \frac{90 - (ω - ϕ)}{180} ‖ A + ‖ \frac{90 + (ω - ϕ)}{180} ‖ B$
Where ω is the first angular direction of the audio object 103, ϕ is the orientation of the head of the listener 101, A is a spatial audio signal rendered to the second angular direction 403, and B is a spatial audio signal rendered to the third angular direction 501. In this case B is the same as A but with the left and right channels swapped.
The use of the absolute values in this example mixing equation enable the mixing equation to be used for any orientation of the head of the listener 101. For instance, the listener could be facing towards the back so that the audio object 103 is behind the listener 101. This approximation can make use of the fact that spatialised audio from an angle of Y degrees is very similar to spatialised sound from 180-Y degrees. This simplification can provide spatial audio of adequate quality.
In the above-described examples and equations it has been assumed that the azimuth angles rotate in clockwise direction. The equations still work if they rotate in an anti-clockwise direction.
In the examples of Figs. 4 to 6 the second angular direction has been selected as 90° to the side of the first angular direction 401. Any angle θ could be used in other examples.
In such cases the mixing equations that are used could be: $Audio played to listener = ‖ \frac{θ - (ω - ϕ)}{2 θ} ‖ A + ‖ \frac{θ + (ω - ϕ)}{2 θ} ‖ B$
Other mixing equations can be used in other examples. The mixing equations can be configured so that the relative emphasis used for the rendering of the audio object 103 to the second angular direction 403 and the rendering of the audio object 103 to the third angular direction 501 is dependent upon whether the head orientation of the listener 101 is closer to the second angular direction 403 or the third angular direction 501. That is, if the head orientation of the listener 101 is closer to the second angular direction 403 then the rendering to the second angular direction 403 would be given a higher weighting than the rendering to the third angular direction 501. Conversely if the head orientation of the listener 101 is closer to the third angular direction 501 then the rendering to the third angular direction 501 would be given a higher weighting than the rendering to the second angular direction 403.
The examples described herein have been restricted to determining the orientation of the head of the listener 101 in the horizontal plane. This can be sufficient to cover many applications that use head tracking. It is possible to extend the examples of the disclosure to cover rotations of the head of the listener 101 out of the plane. For example, if the listener 101 tilts their head. To account for the listener 101 tilting their head additional audio signals can be provided. In such examples, an audio signal can be provided with a rendering at an angle above the angular position of the audio object 103. This can then be mapped to an angular direction below the audio object 103 and the two renderings can be mixed as appropriate to take into account any tilting of the head of the listener 101.
Examples of the disclosure can also be used for audio scenes that comprise a plurality of audio object 103. In such cases each audio object 103 can be treated separately so that an audio signal and rendering to a second angular direction 403 can be provided for each audio object 103. The renderings to the third angular direction 501 can be determined for each of the audio objects 103 and the signals for each audio object 103 can be mixed depending on the head orientation of the listener 101 and then summed for playback to the listener 101.
The audio signals that are used can also comprise other audio in addition to the audio objects 103. For example, the other audio could comprise ambient or non-directional or other types of audio. The ambient or non-directional audio can be rendered using an assumed head orientation of the listener 101. The assumed head orientation could be a front facing direction or any other suitable direction. The ambient or non-directional audio can be rendered without applying head tracking because it can be assumed to be the same in all directions. The non-directional audio could comprise music tracks or background audio or any other suitable type of audio.
Examples of the disclosure provide for improved spatial audio. The calculations that are used in examples of the disclosure do not require very much data to be sent and also don't require very much processing to be carried out, compared to processing the frequency domain. This reduction in the processing requirements mean that the processing could be carried out by small processors within headphones or headsets of other suitable playback devices. This can provide a significant reduction in latencies introduced by the processing compared to systems in which all the processing is done by a central processing device, such as a mobile phone, and then transmitted to the headphones or other playback device.
Also, in examples of the disclosure, all of the processing can be performed in the time domain. This also provides for a reduction in latency.
Examples of the disclosure can also be used for scenarios in which a plurality of listeners 101 are listening to the same audio scene but are each using their own headsets for play back of the audio. In examples of the disclosure the head tracking can be performed by the playback device and the data sent from the encoding device to the playback device is independent of the orientation of the head of the listener 101. This means that the encoding device can send the same audio signal and metadata to each decoding device. This can reduce the processing requirements for the encoding device.
Fig. 7 schematically shows an example system 701 that can be used to implement examples of the disclosure. The system 701 comprises an encoder 705 and a decoder 709. The encoder 705 and the decoder 709 can be in different devices. For instance, the encoder 705 could be provided in an audio capture device or a network device and the decoder 709 could be provided in a headset or other device configured to enable audio playback. In some examples the encoder 705 and the decoder 709 could be in the same device. For instance, a device can be configured for both audio capture and audio playback.
The system 701 is configured so that the encoder 705 is configured to obtain an input comprising audio signals 703. The audio signals 703 can be obtained from two or more microphones configured to capture spatial audio or from any other suitable source
The encoder 705 can comprise any means that can be configured to encode the audio signals 703 to provide a bitstream 707 as an output. In some examples of the disclosure the bitstream 707 can comprise the audio signal with an audio object 103 rendered to a second angular direction 403 and also metadata indicative of the first angular direction 401 of the audio object 103.
The bitstream 707 can be transmitted from a device comprising the encoder 705 to a device comprising the encoder 709. The bitstream 707 can be transmitted using any suitable means. In some examples the bitstream 707 can be transmitted using wireless connections. The wireless connections could comprise a low-power means such as Bluetooth or any other appropriate means.
In some examples the encoder 709 and the decoder 709 could be in the same device and so the bitstream 707 can be stored in the device comprising the encoder 705 and can be retrieved and decoded by a decoder 709 when appropriate.
The decoder 709 can be configured to receive the bitstream 707 as an input. The decoder 709 comprises means that can be configured to decode the bitstream 707.
The decoder 707 can also comprise means for determining a head orientation for the listener.
In some examples the decoder 707 can be configured to generate a rendering of the audio object 103 at a third angular direction 501 and use the rendering of the audio object 103 to the second angular direction 403 and the rendering of the audio object to the third angular direction 501 to generate a rendering of the audio object 103 for the head orientation of the listener 101.
The decoder 709 provides the spatial audio output 711 for the listener. The spatial audio output 711 can comprise binaural audio or any other suitable type of audio.
Fig. 8 shows another example method that could be used in some examples of the disclosure. The method of Fig. 8 could be implemented using an encoder apparatus or a decoder apparatus or any other suitable type of apparatus. The method of Fig. 8 can be used where two or more listeners 101A, 101B are rendering the same content but are each associated with their own playback device. For example, two or more 101A, 101B could be viewing the same content on a screen or other display but could each be wearing their own headset or earphones.
The method comprises, at block 801, determining a field of view for at least a first listener 101A and a second listener 101B. The field of view can be determined based on information such as the locations of the listeners 101A, 101B and the directions in which they are facing. In some examples the field of view could be determined based on a device that is shared by the listeners 101. For instance, it could be determined that the listeners 101A, 101B are sharing a display or screen so that both listeners 101A, 101B are viewing the same display or screen. In such cases the position of the display or screen could be used to determine the field of view of the listeners 101.
The field of view can be determined to comprise an angular range. The angular range can be derived with respect to any suitable reference. The reference could be the positions of the listeners 101A, 101B, the position of one or more devices and/or any other suitable object. The field of view can have an angular range of about 30° or any other suitable range.
At block 803 the method comprises determining whether an audio object 103 is located within the field of view. The method can determine whether an audio object 103 is located within the field of view of each of the plurality of listeners 101A, 101B. Where there are more than two listeners 101 it can be determined whether or not the audio object 103 is within a shared field of view of a subset of the listeners 101.
To determine whether or not an audio object 103 is within the field of view the angular direction of the audio object 103 can be determined. The angular direction of the audio object 103 can be determined based on metadata provided with an audio signal or by using any suitable process. Once the angular direction of the audio object 103 has been determined it can be determined whether or not this angular direction falls within the field view as determined at block 801.
At block 805 the audio object 103 is rendered for the both the first listener 101A and the second listener 101B. The rendering can comprise binaural rendering, stereo rendering or any other suitable type of spatial audio rendering. The rendering of the audio object 103 can be based upon whether or not the audio object 103 is within the field of view of the listeners 101A, 101B. The rendering of the audio object 103 can be the same for both the first listener 101A and the second listener 101B if the audio object 103 is outside of the field of view. The rendering of the audio object 103 can be different for the first listener 101A and the second listener 101B if the audio object 103 is inside the field of view.
In some examples, the rendering of the audio object 103 for the second listener 101B can be generated based on the rendering of the audio object 103 for the first listener 101A. Processes similar to those shown in Figs. 2 to 6 can be used to generate the rendering of the audio object 103 for the second listener 101B based on the rendering of the audio object 103 for the first listener 101A. In such examples, from the point of view of the second listener 101B, the rendering for the first listener 101A would be a rendering to the second angular direction 403. This could be processed using the methods described above to provide a rendering to the correct angular direction for the second listener 101B. For instance, the rendering of the audio object 103 for the second listener 101B can be generated by mapping the rendering of the audio object 103 for the first listener 101A to a different angular orientation and mixing the respective renderings as appropriate based on the angular direction of the audio object 103 relative to the second listener 101B.
In other examples the rendering for the first listener 101A and the second listener 101B can be generated by summing and averaging binauralized or stereo audio objects 103. An audio signal based on the summed, or averaged, audio objects 103 can then be sent to each of the playback devices of the listeners 101A, 101B. This can enable different renderings for each of the different listeners 101A, 101B.
Fig. 9 shows a plurality of listeners 101A, 101B and audio objects 103A, 103B. The method of Fig. 8 could be used to provide spatial audio for the listeners 101A, 101B in Fig. 9.
The plurality of listeners 101A, 101B are each listening to spatial audio using their own playback device. In this example the playback device comprises a headset 105A, 105B. Each listener 101A, 101B within the plurality of listeners 101 is associated with a different headset 105A, 105B.
In the example of Fig. 9 two listeners 101A, 101B are shown. In other examples there could be more than two listeners 101A, 101B. Similarly, Fig. 9 also shows two audio objects 103A, 103B. The audio scene could comprise other numbers of audio objects 103 in other examples of the disclosure.
Each of the listeners 101A, 101B are listening to the same audio content. For instance, in the example of Fig. 9 each of the listeners 101A, 101B are viewing a device 901. The device 901 could be displaying images or video content and the audio content could be corresponding to the images displayed on the display of the device 901. The listeners 101A, 101B can be listening to the same audio scene via the headsets 105A, 105B. However, because the listeners 101A, 101B are in different positions and can be facing in different orientations the spatial aspects of the audio scene can be different for each of the listeners 101A, 101B.
In this example the listeners 101A, 101B are positioned adjacent to each other so that the second listener 101B is to the right of the first listener 101A. Each of the listeners 101A, 101B is facing towards the device 901. The first listener 101A is facing to the front and slightly to the right to view the device 901 and the second listener 101B is facing to the front and slightly to the left to view the device 901. Other arrangements of the listeners 101A, 101B could be used in other examples of the disclosure.
A field of view of the listeners 101A, 101B can be defined using any suitable parameters. In some examples the field of view could be defined so that it covers the device 901 that is being viewed by each of the listeners 101A, 101B. in some examples the field of view could be determined based on the direction in which the listeners 101A, 101B are facing.
In this example the field of view can cover an angular region that covers the span of the device 901. The field of view can have an angular range of about 30° or any other suitable range.
The example of Fig. 9 also comprises a first audio object 103A and a second audio object 103B. The first audio object 103A and the second audio object 103B are positioned in different locations around the listeners 101A, 101B.
In the example of Fig. 9 the first audio object 103A is positioned at an angular direction that would be behind the listeners 101A, 101B. Both the first listener 101A and the second listener 101B are facing away from the first audio object 103A. In this case the first audio object 103A would be determined to not be within the field of view of the listeners 101A, 101B.
The second audio object 103B is positioned in front the listeners 101A, 101B. In this example the second audio object 103B is positioned at an angular direction that overlaps with the angular direction of the device 901. Both the first listener 101A and the second listener 101B are facing towards from the second audio object 103B. In this case the second audio object 103B would be determined to be within the field of view of the listeners 101A, 101B.
This means that the first audio object 103A could be rendered to be the same for both the first listener 101A and the second listener 101B. This does not necessarily recreate an accurate representation of the audio scene. However, to create spatial audio of an adequate quality it is sufficiently accurate for the listeners 101A, 101B to perceive that the audio object 101A, 101B is behind them. The precise angular direction of audio objects 103A, 103B behind the listeners 10A, 101B is not as important as the angular direction of audio objects 103A, 103B that are in front of the listeners 101A, 101B because the listeners 101A, 101B cannot see the audio objects 103A that are behind them.
In this example the audio object that is outside of the filed of view of the listeners 101A, 101B is positioned behind the listeners 101A, 101B. In other examples the audio object could be to the far left or the far right. Such an object would be perceived to be the left of both of the listeners 101A, 101B or to the right of both of the listeners 101A, 101B respectively. Therefore, in such cases the audio object 103 could be approximated to be at the same angular direction for each of the listeners 101A, 101B.This can enable the same rendering to be used for each of the listeners 101A, 101B.
Conversely the second audio object 103B could be rendered differently for each of the listeners 101A, 101B. The second audio object 103B is positioned slightly to the right of the first listener 101A but slightly to the left of the second listener 101B. This means that the second audio object 103B would have spatial parameters that would be perceived differently by each of the listeners 101A, 101B. To enable these different spatial parameters to be taken into account the second audio object 103B would be rendered differently for each of the listeners 101A, 101B. The rendering of the second audio object 103B that is within the field of view would take into account the angular direction between each of the respective listeners 101A, 101B and the audio object 103B. In some examples the methods shown in Figs. 2 to 6 and described above could be used to enable the rendering of the second audio object 103B for the different listeners 103A, 103B.
In some examples the different rendering for the different listeners 101A, 101B could take into account head tracking. For instance, if the first listener 101A moves their head this can change the rendering of the audio object 103B for that first listener 101A but not for the second listener 101B.
Examples of the disclosure therefore provide for an efficient method of processing audio for a plurality of listeners 101A, 101B while still providing spatial audio of a sufficient quality. Using the same rendering for audio objects 103 that are outside of the field of view of the listeners 101A, 101B can reduce the processing requirements. However, using different rendering for audio objects 103 that are within the field of view can ensure that adequate spatial audio quality is provided.
Fig. 10 schematically shows an example apparatus 1001 that could be used in some examples of the disclosure. The apparatus 1001 could comprise a controller apparatus and could be provided within decoder device 705 or an encoder device 709 as shown in Fig. 7, or in any other suitable type of device. In the example of Fig. 10 the apparatus 1001 comprises at least one processor 1003 and at least one memory 1005. It is to be appreciated that the apparatus 1001 could comprise additional components that are not shown in Fig. 10.
In the example of Fig. 10 the apparatus 1001 can be implemented as processing circuitry. In some examples the apparatus 1001 can be implemented in hardware alone, have certain aspects in software including firmware alone or can be a combination of hardware and software (including firmware).
As illustrated in Fig. 10 the apparatus 1001 can be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 1007 in a general-purpose or special-purpose processor 1003 that can be stored on a computer readable storage medium (disk, memory etc.) to be executed by such a processor 1003.
The processor 1003 is configured to read from and write to the memory 1005. The processor 1003 can also comprise an output interface via which data and/or commands are output by the processor 1003 and an input interface via which data and/or commands are input to the processor 1003.
The memory 1005 is configured to store a computer program 1007 comprising computer program instructions (computer program code 1009) that controls the operation of the apparatus 1001 when loaded into the processor 1003. The computer program instructions, of the computer program 1007, provide the logic and routines that enables the apparatus 1001 to perform methods as illustrated in Figs. 2, 3 and 9. The processor 1003 by reading the memory 1005 is able to load and execute the computer program 1007.
The apparatus 1001 could be comprised within an encoder apparatus. In such examples the apparatus 1001 therefore comprises: at least one processor 1003; and at least one memory 1005 including computer program code 1009, the at least one memory 1005 and the computer program code 1009 configured to, with the at least one processor 1003, cause the apparatus 1001 at least to perform:

The apparatus 1001 could be comprised within a decoder apparatus. In such examples the apparatus 1001 therefore comprises: at least one processor 1003; and at least one memory 1005 including computer program code 1009, the at least one memory 1005 and the computer program code 1009 configured to, with the at least one processor 1003, cause the apparatus 1001 at least to perform:

In some examples the apparatus 1001 can comprise: at least one processor 1003; and at least one memory 1005 including computer program code 1009, the at least one memory 1005 and the computer program code 1009 configured to, with the at least one processor 1003, cause the apparatus 1001 at least to perform:

As illustrated in Fig. 10 the computer program 1009 can arrive at the apparatus 1001 via any suitable delivery mechanism 1011. The delivery mechanism 1011 can be, for example, a machine readable medium, a computer-readable medium, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a Compact Disc Read-Only Memory (CD-ROM) or a Digital Versatile Disc (DVD) or a solid-state memory, an article of manufacture that comprises or tangibly embodies the computer program 1007. The delivery mechanism can be a signal configured to reliably transfer the computer program 1007. The apparatus 1001 can propagate or transmit the computer program 1007 as a computer data signal. In some examples the computer program 1007 can be transmitted to the apparatus 1001 using a wireless protocol such as Bluetooth, Bluetooth Low Energy, Bluetooth Smart, 6LoWPan (IP_v6 over low power personal area networks) ZigBee, ANT+, near field communication (NFC), Radio frequency identification, wireless local area network (wireless LAN) or any other suitable protocol.
The computer program 1007 can comprise computer program instructions for causing an apparatus 1001 to perform at least the following:

The computer program 1007 can comprise computer program instructions for causing an apparatus 1001 to perform at least the following:

The computer program instructions can be comprised in a computer program 1007, a non-transitory computer readable medium, a computer program product, a machine readable medium. In some but not necessarily all examples, the computer program instructions can be distributed over more than one computer program 1007.
Although the memory 1005 is illustrated as a single component/circuitry it can be implemented as one or more separate components/circuitry some or all of which can be integrated/removable and/or can provide permanent/semi-permanent/ dynamic/cached storage.
Although the processor 1003 is illustrated as a single component/circuitry it can be implemented as one or more separate components/circuitry some or all of which can be integrated/removable. The processor 1003 can be a single core or multi-core processor.
References to "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc. or a "controller", "computer", "processor" etc. should be understood to encompass not only computers having different architectures such as single /multi- processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.
As used in this application, the term "circuitry" can refer to one or more or all of the following:

(a) hardware-only circuitry implementations (such as implementations in only analog and/or digital circuitry) and
(b) combinations of hardware circuits and software, such as (as applicable):
1. (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and
2. (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions and
(c) hardware circuit(s) and or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g. firmware) for operation, but the software might not be present when it is not needed for operation.

This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
The blocks illustrated in the Figs. 2, 3 and 9 can represent steps in a method and/or sections of code in the computer program 1007. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks and the order and arrangement of the block can be varied. Furthermore, it can be possible for some blocks to be omitted.
The term 'comprise' is used in this document with an inclusive not an exclusive meaning. That is any reference to X comprising Y indicates that X may comprise only one Y or may comprise more than one Y. If it is intended to use 'comprise' with an exclusive meaning then it will be made clear in the context by referring to "comprising only one..." or by using "consisting".
In this description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term 'example' or 'for example' or 'can' or 'may' in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus 'example', 'for example', 'can' or 'may' refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.
Although examples have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the claims.
Features described in the preceding description may be used in combinations other than the combinations explicitly described above.
Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.
Although features have been described with reference to certain examples, those features may also be present in other examples whether described or not.
The term 'a' or 'the' is used in this document with an inclusive not an exclusive meaning. That is any reference to X comprising a/the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use 'a' or 'the' with an exclusive meaning then it will be made clear in the context. In some circumstances the use of 'at least one' or 'one or more' may be used to emphasis an inclusive meaning but the absence of these terms should not be taken to infer any exclusive meaning.
The presence of a feature (or combination of features) in a claim is a reference to that feature or (combination of features) itself and also to features that achieve substantially the same technical effect (equivalent features). The equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way. The equivalent features include, for example, features that perform substantially the same function, in substantially the same way to achieve substantially the same result.
In this description, reference has been made to various examples using adjectives or adjectival phrases to describe characteristics of the examples. Such a description of a characteristic in relation to an example indicates that the characteristic is present in some examples exactly as described and is present in other examples substantially as described.
Whilst endeavoring in the foregoing specification to draw attention to those features believed to be of importance it should be understood that the Applicant may seek protection via the claims in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings whether or not emphasis has been placed thereon.

Claims

An encoder apparatus comprising means for:
providing at least one audio signal wherein the at least one audio signal comprises at least one audio object and wherein the at least one audio object is located at a first angular direction but is rendered to a second angular direction within the at least one audio signal; and

providing metadata indicative of the first angular direction of the at least one audio object.
An encoder apparatus as claimed in claim 1 wherein the second angular direction is independent of a head orientation of a listener.
An encoder apparatus as claimed in any preceding claim wherein the second angular direction is a predefined angular direction.
An encoder apparatus as claimed in any preceding claim wherein the second angular direction is ninety degrees relative to a reference.
An encoder apparatus as claimed in any of claims 1 to 3 wherein the second angular direction is less than ninety degrees relative to a reference.
An encoder apparatus as claimed in any of claims 4 or 5 wherein the reference comprises the first angular direction.
A method comprising:
providing at least one audio signal wherein the at least one audio signal comprises at least one audio object and wherein the at least one audio object is located at a first angular direction but is rendered to a second angular direction within the at least one audio signal; and

providing metadata indicative of the first angular direction of the at least one audio object.
A computer program comprising computer program instructions that, when executed by processing circuitry, cause:
providing at least one audio signal wherein the at least one audio signal comprises at least one audio object and wherein the at least one audio object is located at a first angular direction but is rendered to a second angular direction within the at least one audio signal; and

providing metadata indicative of the first angular direction of the at least one audio object.
A decoder apparatus comprising means for:
obtaining at least one audio signal wherein the at least one audio signal comprises at least one audio object wherein the at least one audio object is located at a first angular direction but is rendered to a second angular direction within the at least one audio signal; and

obtaining metadata indicative of the first angular direction of the at least one audio object;

determining a head orientation of a listener; and

generating a rendering of the at least one audio object for the head orientation of the listener based on the rendering of the at least one audio object to the second angular direction and a mapping of the rendering of the at least one audio object to a third angular direction.
A decoder apparatus as claimed in claim 9 wherein the mapping of the at least one audio object to a third angular direction comprises rendering the at least one audio object to a third angular direction wherein the third angular direction is at a predetermined angular position relative to the second angular direction.
A decoder apparatus as claimed in claim 10 wherein the generating of the rendering of the at least one audio object for head orientation of the listener comprises mixing of the rendering of the at least one audio object to the second angular direction and the rendering of the at least one audio object to the third angular direction.
A decoder apparatus as claimed in claim 11 wherein the mixing is weighted based on the head orientation of the listener.
A decoder apparatus as claimed in any of claims 9 to 12 wherein the generating of the rendering of the at least one audio object for head orientation of the listener is performed in the time domain.
A method comprising:
obtaining at least one audio signal wherein the at least one audio signal comprises at least one audio object wherein the at least one audio object is located at a first angular direction but is rendered to a second angular direction within the at least one audio signal; and

obtaining metadata indicative of the first angular direction of the at least one audio object;

determining a head orientation of a listener; and

generating a rendering of the at least one audio object for the head orientation of the listener based on the rendering of the at least one audio object to the second angular direction and a mapping of the rendering of the at least one audio object to a third angular direction.
A computer program comprising computer program instructions that, when executed by processing circuitry, cause:
obtaining at least one audio signal wherein the at least one audio signal comprises at least one audio object wherein the at least one audio object is located at a first angular direction but is rendered to a second angular direction within the at least one audio signal; and

obtaining metadata indicative of the first angular direction of the at least one audio object;

determining a head orientation of a listener; and

generating a rendering of the at least one audio object for the head orientation of the listener based on the rendering of the at least one audio object to the second angular direction and a mapping of the rendering of the at least one audio object to a third angular direction.