GB2556922A - Methods and apparatuses relating to location data indicative of a location of a source of an audio component - Google Patents

Methods and apparatuses relating to location data indicative of a location of a source of an audio component

Info

Publication number
GB2556922A
GB2556922A · GB1620008.1A · GB201620008A
Authority
GB
United Kingdom
Prior art keywords
captured
location data
region
locations
unreliable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1620008.1A
Other versions
GB201620008D0 (en)
Inventor
Cricri Francesco
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1620008.1A priority Critical patent/GB2556922A/en
Publication of GB201620008D0 publication Critical patent/GB201620008D0/en
Publication of GB2556922A publication Critical patent/GB2556922A/en
Withdrawn legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 Tracking of listener position or orientation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/006 Mixed reality
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/18 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/18 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N 7/183 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a single remote source
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2201/00 Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R 2201/40 Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Studio Devices (AREA)

Abstract

In response to determining that a portion of location data indicative of a location within a scene of a source (26, figure 1) of a captured audio component is unreliable, a region of a captured video component into which to zoom is determined such that the source of the captured audio component is within the determined region. The method may be used when rendering in a virtual reality system, where spatial audio mixing (SAM) is used with positioning detection technology, such as High Accuracy Indoor Positioning (HAIP), for determining the position of actors or other audio sources in the captured scene. The actors may wear a radio tag which is continuously tracked by an antenna generally co-located with a virtual reality camera (22, figure 1).

Description

(54) Title of the Invention: Methods and apparatuses relating to location data indicative of a location of a source of an audio component
Abstract Title: Location of a source of an audio component by zooming
In response to determining that a portion of location data indicative of a location within a scene of a source (26, figure 1) of a captured audio component is unreliable, a region of a captured video component into which to zoom is determined such that the source of the captured audio component is within the determined region. The method may be used when rendering in a virtual reality system, where spatial audio mixing (SAM) is used with positioning detection technology, such as High Accuracy Indoor Positioning (HAIP), for determining the position of actors or other audio sources in the captured scene. The actors may wear a radio tag which is continuously tracked by an antenna generally co-located with a virtual reality camera (22, figure 1).
(Drawing sheets omitted from this text extraction: Figures 1 to 8 on seven sheets; legible annotations include "Figure 3", "Zooming Region" and "Audio-Visual".)
Application No. GB1620008.1
RTM
Date :16 May 2017
Intellectual
Property
Office
The following terms are registered trade marks and should be read as such wherever they occur in this document:
OZO (Pages 1, 6 & 22)
Bluetooth (Page 20)
IEEE (Page 20)
Wi-Fi (Page 20)
Ricoh (Page 22)
Intellectual Property Office is an operating name of the Patent Office www.gov.uk/ipo
Methods and Apparatuses Relating to Location Data Indicative of a Location of a Source of an Audio Component
Field
This specification relates generally to responding to unreliability of location data indicative of a location of a source of an audio component of audio-visual content.
Background
Production of virtual reality (VR) videos involves obtaining both visual and audio (audio-visual) information from a scene being recorded. In addition, spatial audio mixing (SAM) can be used with positioning detection technology, such as High Accuracy Indoor Positioning (HAIP) radio technology, being used for determining the position of actors or other audio sources in the captured scene. For example, the actors may wear a radio tag which is continuously tracked by an antenna generally co-located with the VR camera, one example of which is Nokia’s OZO camera. Additionally, the actors may wear a close-up microphone (also known as a lavalier microphone) to capture close-up audio from the actor. SAM allows for rendering the audio captured by close-up microphones with correct spatial information.
Summary
According to a first aspect the specification describes a method comprising in response to determining that a portion of location data indicative of a location within a scene of a source of a captured audio component is unreliable, determining a region of a captured video component into which to zoom such that the source of the captured audio component is within the determined region.
The method may further comprise, in response to determining that the portion of location data is unreliable, associating the captured audio component with a range of locations within the scene, wherein the association between the captured audio component and the range of locations is for use during rendering of the captured audio component such that, when the audio component is rendered, it appears to originate from the range of locations.
The method may further comprise selecting the locations of the range of locations such that they are all within the determined region.
The method may further comprise selecting the range of locations such that some of the locations are within the determined region and some of the locations are outside the determined region.
The locations in the range of locations may be dispersed around a video capture device which captures the video component.
The method may further comprise detecting missing data in the location data, and in response, determining that the portion of location data including the missing data is unreliable.
The method may further comprise detecting fluctuating data in the location data, and in response, determining that the portion of location data including the fluctuating data is unreliable.
The method may further comprise determining the region of the captured video component into which to zoom based on a reliable portion of the location data which precedes the portion of location data that is determined to be unreliable.
The method may further comprise determining the region of the captured video component into which to zoom based on a reliable portion of location data which follows the portion of location data that is determined to be unreliable.
The method may further comprise, in response to determining that the portion of location data is unreliable, enabling visual object detection based on visual content captured by a visual content capture device.
The method may further comprise determining the region of the captured video component into which to zoom based on visual detection of the audio source within the captured visual content.
The method may further comprise performing visual detection of the audio source at predetermined time intervals while the location data is unreliable.
The method may further comprise determining the region of the captured video component into which to zoom based on a location of the audio source determined based on audio data captured using a microphone array.
According to a second aspect, this specification describes apparatus configured to perform any method as described with reference to the first aspect.
In a third aspect, this specification describes computer-readable instructions, which when executed by computing apparatus, cause computing apparatus to perform any method as described with reference to the first aspect.
In a fourth aspect, this specification describes apparatus comprising at least one processor and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to, in response to determining that a portion of location data indicative of a location within a scene of a source of a captured audio component is unreliable, determine a region of a captured video component into which to zoom such that the source of the captured audio component is within the determined region.
The computer program code, when executed by the at least one processor, may cause the apparatus to, in response to determining that the portion of location data is unreliable, associate the captured audio component with a range of locations within the scene, wherein the association between the captured audio component and the range of locations is for use during rendering of the captured audio component such that, when the audio component is rendered, it appears to originate from the range of locations.
The computer program code, when executed by the at least one processor, may cause the apparatus to select the locations of the range of locations such that they are all within the determined region.
The computer program code, when executed by the at least one processor, may cause the apparatus to select the range of locations such that some of the locations are within the determined region and some of the locations are outside the determined region.
The locations in the range of locations may be dispersed around a video capture device which captures the video component.
The computer program code, when executed by the at least one processor, may cause the apparatus to detect missing data in the location data, and in response, determine that the portion of location data including the missing data is unreliable.
The computer program code, when executed by the at least one processor, may cause the apparatus to detect fluctuating data in the location data, and in response, determine that the portion of location data including the fluctuating data is unreliable.
The computer program code, when executed by the at least one processor, may cause the apparatus to determine the region of the captured video component into which to zoom based on a reliable portion of the location data which precedes the portion of location data that is determined to be unreliable.
The computer program code, when executed by the at least one processor, may cause the apparatus to determine the region of the captured video component into which to zoom based on a reliable portion of location data which follows the portion of location data that is determined to be unreliable.
The computer program code, when executed by the at least one processor, may cause the apparatus to, in response to determining that the portion of location data is unreliable, enable visual object detection based on visual content captured by a visual content capture device.
The computer program code, when executed by the at least one processor, may cause the apparatus to determine the region of the captured video component into which to zoom based on visual detection of the audio source within the captured visual content.
The computer program code, when executed by the at least one processor, may cause the apparatus to perform visual detection of the audio source at predetermined time intervals while the location data is unreliable.
The computer program code, when executed by the at least one processor, may cause the apparatus to determine the region of the captured video component into which to zoom based on a location of the audio source determined based on audio data captured using a microphone array.
According to a fifth aspect, this specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causes performance of at least: in response to determining that a portion of location data indicative of a location within a scene of a source of a captured audio component is unreliable, determining a region of a captured video component into which to zoom such that the source of the captured audio component is within the determined region. The computer-readable code stored on the medium of the fifth aspect may further cause performance of any of the operations described with reference to the method of the first aspect.
According to a sixth aspect, this specification describes apparatus comprising means for, in response to determining that a portion of location data indicative of a location within a scene of a source of a captured audio component is unreliable, determining a region of a captured video component into which to zoom such that the source of the captured audio component is within the determined region. The apparatus of the sixth aspect may further comprise means for causing performance of any of the operations described with reference to the method of the first aspect.
Brief Description of the Figures
For a more complete understanding of the methods, apparatuses and computer-readable instructions described herein, reference is now made to the following description taken in connection with the accompanying drawings in which:
Figure 1 is a schematic illustration of a virtual reality capture system;
Figure 2 is a schematic illustration of a zooming operation within a virtual reality context;
Figure 3 is a flow chart illustrating an example of operations, at least some of which may be performed by one or more components of the virtual reality capture system;
Figures 4 to 6 illustrate various examples of the determination of both a zooming region into which to zoom video content and a range of locations within a captured scene with which to associate corresponding audio data;
Figure 7 is a schematic illustration of an example configuration of an AV processing apparatus which may form part of the virtual reality capture system of Figure 1; and
Figure 8 is a computer-readable memory medium upon which computer readable code may be stored.
Detailed description
In the description and drawings, like reference numerals may refer to like elements throughout.
Figure 1 is a schematic illustration of system 1 (a “virtual reality capture system”) for producing audio and video content. The examples described herein primarily relate to producing spatial audio content and virtual reality (VR) video of actors performing a scene. It will be appreciated, however, that the principles described herein relate to many different scenarios, such as but not limited to recording a musical performance, recording a presenter presenting a TV programme, or an interviewer performing an interview.
The VR system 1 includes a video capture device 22 for capturing one or more video components of VR audio-visual content of a scene. The video capture device 22 may comprise a virtual reality camera, which is configured to capture video from a wide field of view. For instance, the camera may be configured to capture content from a spherical field of view, as is the case with Nokia’s OZO.
In the example of Figure 1, an audio component of VR audio-visual content of a scene may also be obtained by the VR capture system 1. As will be appreciated, the audio and video components are captured simultaneously. The audio component may be the audio data that is captured by a single audio capture device 12, such as a Lavalier microphone. As such, the audio component may be one of many audio components which contribute to the audio-visual content. In the example of Figure 1, the source of the audio component is the voice of an actor 26. However, as will of course be appreciated, the audio capture device 12 may alternatively be worn by, carried by, or otherwise co-located with, a different audio source 26, such as but not limited to a musical instrument. In any case, the audio capture device 12 is associated with a source 26 of the audio component.
Also depicted in Figure 1 is at least a portion of a VR content playback system 2. As will be appreciated, the playback system 2 may be remote from the capture system 1; for instance, it may be in the user’s home. In some examples, however, the playback system 2 may be in the same location as the capture system 1, and may be worn, for instance, by a director of the content capture. The depicted portions of the playback system 2 include a display device 28, in this example in the form of a head-mounted display, and an audio content provision device 30, in this example in the form of a binaural headset. The depicted portion of the playback system 2 is being worn by a user 32.
Although not shown in Figure 1, the playback system 2 may further include a server apparatus for receiving data derived from the VR capture system 1 and causing it to be provided to the user 32 via the display device 28 and the audio content provision device 30. In other examples, this functionality may be performed by the display device 28, which may itself be a computing device (such as but not limited to a smartphone).
The playback system 2 is configured to render video and audio content derived from the audio capture system for consumption by the user 32.
Returning now to the capture of the VR content, in the example of Figure 1, the source
26 of audio content is able to move or be moved within the scene. To enable rendering of the audio component produced by the source 26 with the correct spatial information, the VR capture system 1 may be configured to determine or estimate a position of the source 26. The VR capture system 1 may determine or estimate the position of the audio source 26 based on a location of a positioning tag 18 associated with the audio source 26. For example, a positioning tag 18, in the form of, for instance, a radio tag, may be worn or carried by the audio source 26. In any case, the positioning tag 18 is associated with the source of the audio component 26 and the audio capture device 12.
The VR capture system 1 may further comprise a positioning apparatus 16 positioned at the location from which the visual content of the scene is being captured. The positioning apparatus 16 is configured to receive signals from the radio tag 18 in order for the position of the audio source to be determined or estimated. By repeatedly or continuously determining the position of the audio capture device it is possible to track the location of the audio source 26 as it is moved. In this way, location data representing the position of the audio source over time may be determined. In some examples, the position of the tag 18 may be estimated using a system such as Nokia’s high-accuracy indoor positioning (HAIP) radio technology, which makes use of a phased antenna array to estimate the location of a tag based on an angle of arrival of radio signals at the phased antenna array. However, as will of course be appreciated, any suitable mechanism for radio positioning may be used. The positioning apparatus 16 may form part of the VR system 1 or alternatively may simply provide an output
(including, for instance, estimated positions of one or more radio tags) for use by the VR capture system 1.
The VR capture system 1 includes an audio-visual content processing apparatus 10 (hereafter AV processing apparatus). The AV processing apparatus 10 may be configured to associate the captured audio component with the estimated position in the scene of the source of the captured audio component (such as the voice of the actor 26). As discussed above, the estimated position of the source 26 may have been determined using the positioning tag 18 associated with the source of the audio component.
Radio positioning can be prone to errors of different types. One type of error may be missing portions of location data. For example, a portion of the location data may be missing for an interval of time. This may occur, for instance, because the positioning apparatus 24 temporarily stops receiving radio signals from the radio tag 18. This may happen for example because of occlusions to the direct signal path from the radio tag to the antenna, or because of interference. When this occurs, it may seem, based on the location data, that the audio source “disappears” for a short period of time only to “reappear” at a later time at a new position, once the radio signals are once again being received. Another type of error may be fluctuation of position indicated by the location data. This may occur for instance due to reflections or occlusion of the radio signals communicated between the radio tag and positioning apparatus 24. When the audio is rendered based on the missing or fluctuating location data, the missing or fluctuating location data may cause a user to hear the captured audio component at a location which does not match the location of the visual representation of the audio source 26. Regardless of the type of error that has occurred, when an error has occurred, the resulting location data may be said to be “unreliable”.
In order to reduce the extent of a mismatch between the audio and visual information, the AV processing apparatus 10 is configured to determine if a portion of location data indicative of a location within the scene of a source of a captured audio component is unreliable. Based on the outcome of this determination, it may respond accordingly.
As discussed above, a determination that the location data is unreliable may result from, for instance, a determination that a portion of the location data is missing or that a portion of location data is fluctuating.
If the AV processing apparatus 10 determines that a portion of location data indicative of a location within the scene of a source of a captured audio component is reliable (is not unreliable), the AV processing apparatus 10 responds by associating the captured audio component with the estimated position of the audio source 26. Accordingly, the captured audio component is associated with the estimated position when the estimated position is not determined to be unreliable, thereby to enable the captured audio component to be rendered such that it appears to come from the estimated position.
However, if the AV processing apparatus 10 determines that a portion of location data indicative of a location within the scene of a source of a captured audio component is unreliable, then it responds by determining a region (a “zooming region”) of the captured video component(s) into which to zoom such that the source of the captured audio component is within the determined region. Accordingly, in response to determining that the location data is unreliable, the video data can, when rendered for a user, be caused to zoom into the zooming region, thereby to show the source of the audio component (which is within the zooming region) such that the user appears to be closer to the source of the audio component. Put another way, the audio source appears larger and closer from the point of view of the user when the video component is viewed.
Zooming, within the context of immersive or virtual reality video, may be understood to mean that the virtual position of the user is moved closer to the location of the audio source in virtual reality space. Within this context, the zooming region may be the field of view of the user within the video content following the zooming operation. The zooming region is determined such that the source of the audio component is within this field of view of the user after the zooming operation has been performed (assuming, of course, that the direction in the real world in which the user is facing is the same before and after the zooming operation).
In view of the above, within the context of immersive (or VR) video, the AV processing apparatus 10 could be said to respond to a determination that the portion of location data indicative of the location of the audio source is unreliable by determining a virtual position of the user which is closer to a virtual region (in VR space) in which the audio source is estimated to be located. Zooming may then comprise moving the user to the new virtual position. In some instances, the AV processing apparatus 10 may also
determine an adjusted virtual orientation of the user such that the adjusted virtual orientation is directly towards the virtual region (in VR space) in which the audio source is estimated to be located. This may occur, for instance, when the virtual orientation of the user, prior to the zooming operation, is not directly towards the virtual region.
This change of virtual position (and orientation) of the user in virtual space is illustrated in Figure 2, in which the user (denoted by numeral 28-1) is initially located at a first location in virtual space LV1 and has a first orientation OV1. The source of an audio component 26 is currently within the field of view of the user. As discussed above, when the AV processing apparatus 10 determines that the location of the source 26 is unreliable (hence, the source being shown in dashed lines in Figure 2), it responds by determining a new location LV2 of the user (denoted by numeral 28-2) in virtual space, the new location being closer to a region R1 within which the source 26 is estimated to be present. As such, when the user is consuming the VR AV content, the user is automatically moved to that virtual location LV2 such that the user appears to be nearer to the source 26. In addition, in the example of Figure 2, a new virtual orientation OV2 of the user 28-2 is also determined and, when the content is consumed, the user’s virtual orientation may also be automatically adjusted to the new orientation.
This may occur, for instance, when the new location is such that the estimated region would not be within the user’s field of view if the user’s original virtual orientation was maintained. In other examples, however, the new location LV3 may be determined such that the estimated region R1 is within the user’s field of view, even without adjusting the virtual orientation of the user (as illustrated by reference 28-3).
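By way of illustration only, the following is a minimal sketch (not taken from the patent) of how a playback or processing component might compute the new virtual location and orientation described above, assuming 2D coordinates in VR space; the function name, the target_distance parameter and the reorient flag are hypothetical choices made for the example.

```python
import math

def zoom_user_toward_region(user_pos, user_yaw, region_centre, target_distance,
                            reorient=True):
    """Move the virtual user position closer to the region in which the audio
    source is estimated to lie, optionally adjusting the virtual orientation.

    user_pos, region_centre: (x, y) coordinates in VR space (assumed 2D here).
    user_yaw: current orientation in radians.
    target_distance: desired distance from the region centre after zooming.
    """
    dx = region_centre[0] - user_pos[0]
    dy = region_centre[1] - user_pos[1]
    dist = math.hypot(dx, dy)
    if dist <= target_distance:
        return user_pos, user_yaw  # already close enough, no zoom needed

    # New position lies on the line towards the region centre, target_distance away.
    scale = (dist - target_distance) / dist
    new_pos = (user_pos[0] + scale * dx, user_pos[1] + scale * dy)

    # Optionally turn the user to face the region centre directly.
    new_yaw = math.atan2(dy, dx) if reorient else user_yaw
    return new_pos, new_yaw

# Example: user at the origin facing +x, source region estimated around (4, 3).
print(zoom_user_toward_region((0.0, 0.0), 0.0, (4.0, 3.0), 1.0))
```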
Returning now to Figure 1, in response to determining that the portion of location data is unreliable, the AV processing apparatus 10 may also be configured to associate the captured audio component with a range of locations within the scene. In this way, the captured audio component is able to be rendered such that it appears to come from that range of locations. Accordingly, the user is caused to see the source of the audio component at a closer distance, and also to hear the audio source as coming from a wider range of angles relative to the user. This more closely reflects reality in which, when an audio source is nearby, the audio is observed as coming from a wider angle than when the source is further away. Put another way, the perceptual location of a nearby audio source may be blurred, whereas an audio source which is far away may be heard as a point source.
In some examples, the range of locations may be determined such that they are within the determined zooming region. Accordingly, when the audio component is rendered by the AV processing device 10 or another audio playback device, the audio component appears to originate from the area corresponding to the determined zooming region. In other examples, the range of locations may extend over a different range of angles. For instance, the range of locations associated with the audio component may extend over a 360 degree angular range about the video capture device 22. Accordingly, when the audio component is rendered for a user the audio component appears to surround the user (put another way, to originate from all around the user).
In some examples, the zooming region (and also the estimated region discussed with reference to Figure 2) may be determined based on the final reliable location data before the portion of location data determined to be unreliable. Put another way, the zooming region may be determined based on the location data immediately prior to the unreliable location data. In this case, the zooming region may be determined as including the final location indicated by the location data before it became unreliable. Determining the zooming region on this basis may be performed offline, after capture of the audio and video components. Alternatively, it may be performed during live streaming of the audio and video components.
In other examples, the zooming region (and also the estimated region discussed with reference to Figure 2) may be determined based on the final reliable location before the unreliable portion of location data and based on the first reliable location after the unreliable portion of location data. In such examples, the zooming region includes both the final reliable location before the unreliable portion of location data and the first reliable location after the unreliable portion of location data. A determination of the zooming region on this basis may be performed offline after recording the audio and video components. Alternatively, such a determination may be performed during streaming of the audio and video components. However, where it is used during live streaming, a delay is introduced between capture and streaming, such that the unreliable data portion can be observed by the system from beginning to end. In this way, the last reliable location before the portion of unreliable location data and the first reliable location after the portion of unreliable location data can be determined.
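As an illustrative sketch of the offline or delayed-streaming case just described, the hypothetical helper below scans a time-ordered location stream (with None marking missing samples) and returns the last reliable sample before a gap and the first reliable sample after it; the data layout and function name are assumptions made for this example rather than anything defined in the specification.

```python
def bracketing_reliable_locations(samples):
    """Given a time-ordered list of (timestamp, location) samples where location is
    None when the positioning data is missing, return (last_before, first_after) for
    the first gap found, or None if there is no gap.

    Intended for offline processing or delayed streaming, where the whole unreliable
    portion can be observed from beginning to end before the zooming region is chosen.
    """
    last_reliable = None
    in_gap = False
    for timestamp, location in samples:
        if location is None:
            in_gap = True
            continue
        if in_gap:
            return last_reliable, (timestamp, location)
        last_reliable = (timestamp, location)
    return None

samples = [(0.0, (5, 1)), (0.1, (5, 1.2)), (0.2, None), (0.3, None), (0.4, (4, 3))]
print(bracketing_reliable_locations(samples))
```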
In other examples, the zooming region may be determined based on a visually detected location of the audio source. For example, in response to determining that a portion of location data is unreliable, the AV processing apparatus 10 may cause visual detection using video capture apparatus 22 to be enabled. In this case, the zooming region of the captured video component may be determined to include the visually detected location of the audio source. In response to a plurality of bodies being visually detected in the scene (or in a relevant part of the scene), the visually detected audio source 26 may be determined to be the body closest to the final location before the unreliable portion of location data. The visual detection may be performed at given intervals of time during a period in which the location data is unreliable. By lowering the rate at which visual detection is performed, the processing of the visual data required for visual detection of the audio source may be reduced. This determination may be performed during live streaming of the audio and video components.
In other examples, the zooming region of the captured video component may be determined based on a location of the audio source determined based on captured audio data. For example, a microphone array may be arranged at the video capture device 22. The location of the audio source may be determined based on the audio data received by the microphone array, using known techniques for tracking audio sources (for instance, as described in PCT/EP2016/051709).
The functionality described in general terms above will now be discussed in more detail with respect to Figures 3 to 8.
Figure 3 is a flow chart illustrating various operations which may be performed by the one or more components of the VR capture system 1, for instance the AV processing apparatus 10, in conjunction with the audio-visual content playback system 2.
In operation S100, the AV processing apparatus 10 may receive location data indicative of a location within a scene of a source of a captured audio component. For instance, the AV processing apparatus 10 may estimate the location of, or receive information identifying an estimated location of, a radio tag which corresponds to an audio capture device 12 associated with the source 26 of a captured audio component. The location data may be determined using any suitable signal processing methods to improve accuracy and reliability of the location detection, such as, for example, averaging or filtering a plurality of detected positions.
In operation S110, the AV processing apparatus 10 may determine whether a portion of location data is unreliable, for instance because it is fluctuating or missing. Missing data may be detected through analysing the rate of received position data from the radio tag. Fluctuating (or erroneous) data can be detected in several ways and the method proposed herein is not limited to any specific detection technique. For example, the event of fluctuating data can be detected by analysing one or more features extracted from the positioning data, such as rate of change, extent of change and velocity.
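A minimal sketch of one possible detection scheme follows, flagging a portion of location data as unreliable when samples stop arriving at the expected rate (missing data) or when the implied velocity is implausibly high (fluctuating data); the thresholds and the function name are illustrative assumptions rather than values taken from the specification.

```python
import math

def classify_location_data(samples, max_gap_s=0.5, max_speed_m_s=10.0):
    """Label each consecutive pair of position samples as reliable or unreliable.

    samples: time-ordered list of (timestamp_s, (x, y)) tuples from the radio tag.
    A pair is flagged unreliable when samples stop arriving at the expected rate
    (gap longer than max_gap_s) or when the implied velocity exceeds a plausible
    value for the tracked source (fluctuating / erroneous data).
    """
    labels = []
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        dt = t1 - t0
        speed = math.hypot(p1[0] - p0[0], p1[1] - p0[1]) / dt if dt > 0 else float("inf")
        unreliable = dt > max_gap_s or speed > max_speed_m_s
        labels.append((t1, "unreliable" if unreliable else "reliable"))
    return labels

samples = [(0.0, (1.0, 2.0)), (0.1, (1.1, 2.0)), (0.9, (6.0, 7.0)), (1.0, (6.1, 7.0))]
print(classify_location_data(samples))
```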
If a portion of location data is determined to be unreliable, the AV processing apparatus 10 may proceed to operation S120. In operation S120, the AV processing apparatus 10 may determine a region (a zooming region) of a captured video component into which to zoom such that the source of the captured audio component is within the determined region. Accordingly, when the video component is viewed by a user 32, the user 32 will view the source of the captured audio component as being zoomed in. Accordingly, the source of the captured audio component will appear larger and closer from the point of view of the user.
However, if the location data is determined to be reliable, the AV processing apparatus may instead proceed to operation S140. In operation S140, the captured audio component is associated with locations indicated by the location data. In this way, when the audio component is rendered for consumption by the user 32, it appears to arrive from the locations indicated by the location data.
At operation S130, following from operation S120, the AV processing apparatus 10 associates the captured audio component with a determined range of locations within the scene, such that, when the audio component is rendered for consumption by the user 32, it appears to arrive from the determined locations in the range. The range of locations may be determined in various ways, for instance as described in more detail with reference to Figures 4 and 5.
The rendering area or range may be implemented by use of a Head-Related Transfer Function (HRTF) which characterises the reception of a sound from multiple points instead of just one point, for example from all the points in an area of interest. This can be achieved by modelling a specific HRTF in an anechoic chamber where the audio
source may be a distributed source such as an array of loudspeakers, or by just re-applying the same HRTF in multiple points.
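The following sketch illustrates one way the same idea could be approximated in code: a mono audio component is spread over several directions by applying a head-related impulse response at multiple azimuths and summing the results. The hrir_for_azimuth lookup is a placeholder for whatever HRTF set is available; it is not an API defined by the patent or by any particular library, and the whole block is an assumption-laden illustration rather than the rendering method of the specification.

```python
import numpy as np

def render_spread_source(mono, azimuths_deg, hrir_for_azimuth):
    """Render a mono audio component so that it appears to originate from a range of
    directions rather than a point, by applying a binaural impulse response at several
    azimuths and summing the results."""
    length = len(mono) + max(len(hrir_for_azimuth(a)[0]) for a in azimuths_deg) - 1
    out = np.zeros((2, length))
    for az in azimuths_deg:
        hl, hr = hrir_for_azimuth(az)  # assumed lookup into a measured HRTF set
        out[0, :len(mono) + len(hl) - 1] += np.convolve(mono, hl)
        out[1, :len(mono) + len(hr) - 1] += np.convolve(mono, hr)
    return out / len(azimuths_deg)

# Toy usage with dummy single-tap responses; a real system would use measured HRIRs.
dummy = lambda az: (np.array([1.0]), np.array([0.8]))
binaural = render_spread_source(np.random.randn(480), np.linspace(-30, 30, 5), dummy)
```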
At operation S150, following either of operations S130 or S140, the AV processing apparatus 10 may cause storage of the audio component, the associated locations, the video component and the zooming region (if such was determined). Although not shown in Figure 3, the stored data may be retrieved at a later time for transmission and/or playback.
Finally, at operation S160 following from operation S150, the captured audio-visual content is rendered for consumption by the user 32 via the playback system 2. The captured audio component is rendered simultaneously with the captured video component. As discussed above, the audio component is rendered such that the captured audio component appears to come from the locations with which it was associated at operations S130 or S140. Accordingly, if the audio component corresponds with an unreliable portion of location data, the user will experience the captured audio component as coming from a range of locations within the scene. In such a situation, the video component is rendered such that it is zoomed in to the determined zooming region. Alternatively, if the audio component corresponds with a reliable portion of location data, the video component will not be zoomed in and the user will experience the captured audio as coming from a point source.
Figure 4 illustrates an example of a way in which the zooming region might be determined such that the source of the captured audio component is within the determined region. The zooming region may thus correspond to a region in which the source 26 is estimated to be present. The figure also illustrates an example of how the audio component might be associated with a range of locations within the scene.
Specifically, in the example of Figure 4, the zooming region for the captured video component is determined based on the last reliable location of the source 26. As such, during playback the video component is zoomed into a region which includes the last reliable location of the source 26. In Figure 4, the zooming region is represented by the dashed lines originating from the video capture device.
In addition to zooming in to the source 26 of the audio component, the AV processing apparatus 10 is configured to associate corresponding captured audio data with a
determined range of locations within the scene. Therefore, when the captured audio component is rendered to be heard by a user, the audio component appears to come from a range of angular directions. In the example of Figure 4, the captured audio component is associated with a range of locations surrounding the last reliable location of the source 26. In this example, the range of angles and the zooming region correspond. As such, when rendered for the user, the captured audio component appears to originate from the region into which the VR capture system has zoomed. In such examples, both the zooming region and the range of directions are determined based on the location of the source 26 of the audio component and a predetermined field of view, which defines the zoom level into which the video is zoomed.
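As a hedged illustration of the Figure 4 approach, the sketch below derives both the zooming region and the associated angular range from the last reliable location and a predetermined field of view; the 60 degree default and the function name are arbitrary choices made for the example, not values from the patent.

```python
import math

def region_from_last_reliable(camera_pos, last_reliable, fov_deg=60.0):
    """Determine both the zooming region and the angular range with which to
    associate the audio, centred on the last reliable location of the source.

    Returns (centre_azimuth_deg, min_azimuth_deg, max_azimuth_deg); fov_deg is the
    predetermined field of view that defines the zoom level.
    """
    centre = math.degrees(math.atan2(last_reliable[1] - camera_pos[1],
                                     last_reliable[0] - camera_pos[0]))
    half = fov_deg / 2.0
    return centre, centre - half, centre + half

print(region_from_last_reliable((0.0, 0.0), (3.0, 3.0)))  # centred on 45 degrees
```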
Figure 5 illustrates another example of a way in which the zooming region might be determined such that the source of the captured audio component is within the determined region. Specifically, in this example, the zooming region is determined based on the last reliable location before the portion of unreliable location data and the first reliable location after the portion of unreliable location data.
The zooming region may be determined such that both the last reliable location before the portion of unreliable location data and the first reliable location after the portion of unreliable location data are within the determined region. Therefore, during playback the video component may be zoomed into a region of the captured video component such that both the last reliable location before the portion of unreliable location data and the first reliable location after the portion of unreliable location data are visible when viewed by a user. In such examples, the field of view of the zooming region is determined based on the reliable locations occurring either side of the unreliable location data portion. For instance, the field of view may be determined so as to encompass a predetermined angular separation either side of the two reliable locations.
In addition, in the example of Figure 5, the audio component corresponding to the unreliable portion of location data has been associated with a range of locations within the scene. As such, when the captured audio component is rendered for the user, the audio component appears to come from a range of angular directions. In the example of Figure 5, the captured audio component is associated with a range of locations which are determined based on the last reliable location of the source 26 before the portion of unreliable location data and the first reliable location after the portion of unreliable data. More specifically, in the example of Figure 5, the range of locations is determined
as extending between those two locations. As such, although the range of locations falls within the zooming region, the field of view of the zooming region and the range of locations associated with the audio data may not have the exact same boundaries. In other examples, however, the boundaries of the zooming region and the range of locations associated with the audio component may correspond exactly. In such examples, the range of locations may be determined similarly to the field of view of the zooming region so as to include a predetermined angular separation either side of the two reliable locations.
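A corresponding sketch for the Figure 5 approach is given below: the field of view is chosen to contain the azimuths of both reliable locations plus a predetermined angular separation either side. The margin value and function name are assumptions, and angle wrap-around across +/-180 degrees is ignored for simplicity.

```python
import math

def region_from_bracketing_locations(camera_pos, last_before, first_after,
                                     margin_deg=10.0):
    """Determine a zooming region whose field of view contains both the last reliable
    location before the unreliable portion and the first reliable location after it,
    extended by a predetermined angular separation either side.

    Returns (min_azimuth_deg, max_azimuth_deg); the audio range of locations can
    reuse the same interval or the narrower span between the two points.
    """
    def azimuth(p):
        return math.degrees(math.atan2(p[1] - camera_pos[1], p[0] - camera_pos[0]))

    a_before, a_after = azimuth(last_before), azimuth(first_after)
    lo, hi = min(a_before, a_after), max(a_before, a_after)
    return lo - margin_deg, hi + margin_deg

print(region_from_bracketing_locations((0, 0), (5, 1), (4, 3)))
```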
In yet other examples, however, although the zooming region is determined based on two reliable locations, the range of locations for the audio component may be determined such that the rendered audio appears to come from a wider angular range of locations, for example as illustrated in Figure 6.
The example implementation described with reference to Figure 5 may be suitable for being performed post-production, when the system is able to determine both the final reliable location before the portion of unreliable location data and the first reliable location after the portion of unreliable data. This implementation may also be suitable for being performed during streaming, if a sufficient delay is provided between capture and streaming such that both the final reliable location before the portion of unreliable location data and the first reliable location after the portion of unreliable data can be determined prior to streaming.
Figure 6 illustrates an example in which the captured audio component is associated with a range of locations such that when the audio is rendered, it appears to come from angular directions extending beyond the zooming region. For instance, the range of locations may be 360 degrees surrounding the listener. In this example, the zoom region may be determined as described with reference to either of Figures 4 and 5.
As an alternative to the examples described above with reference to Figures 4 to 6, the zooming region may be determined based on a visually detected location of the source 26 of the captured audio component. In such examples, in response to determining that a portion of location data is unreliable, the AV processing apparatus 10 may be configured to cause visual object detection for objects in the captured visual content to be enabled, thereby to determine the location of the source 26 of the captured audio component. The visual object detection may be run continuously, for instance based on
the captured video content. Alternatively, the visual object detection may be run periodically with a certain frequency, in order to reduce computation load. For example, visual detection may be run every 30 seconds. However, it will be appreciated that visual detection may be performed at any desired time intervals other than every
30 seconds. In some examples, the visual object detection may be performed by a visual content capture device other than the video capture device 22. For instance, it may be performed using one or more still cameras.
The AV processing apparatus 10 may thus determine the zooming region based on the visually detected location of the audio source. Accordingly, the zooming region may be determined such that the visually detected location is within the determined region. Therefore, when the user is consuming the VR content, the video component may be zoomed in to a region of the captured video component which includes the visually detected location.
If more than one body is visually detected, the AV processing apparatus may be configured to select the position of the body which is nearest to the last reliable location as the location of the source 26 of the captured audio component.
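The following sketch shows how the periodic visual fallback described above might be organised: the detector is invoked only at fixed intervals while the location data remains unreliable, and when several bodies are detected the one nearest the last reliable location is selected. The detector itself is abstracted behind a callable, and every name here is hypothetical.

```python
import math

def nearest_detected_body(detections, last_reliable):
    """Select, from visually detected bodies, the one whose position is closest to the
    last reliable location of the audio source."""
    return min(detections, key=lambda p: math.hypot(p[0] - last_reliable[0],
                                                    p[1] - last_reliable[1]))

def visual_fallback_locations(detect_bodies, last_reliable, period_s, duration_s):
    """Run the (comparatively expensive) visual detection only every period_s seconds
    while the positioning data remains unreliable, to limit computational load."""
    locations = []
    t = 0.0
    while t < duration_s:
        bodies = detect_bodies(t)  # detect_bodies is a stand-in for any object detector
        if bodies:
            locations.append((t, nearest_detected_body(bodies, last_reliable)))
        t += period_s
    return locations

# Toy usage with a stand-in detector returning two bodies at fixed positions.
fake_detector = lambda t: [(2.0, 1.0), (6.0, 4.0)]
print(visual_fallback_locations(fake_detector, (5.5, 3.5), period_s=30.0, duration_s=90.0))
```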
As an alternative to determination based on location data or visual detection, the zooming region may be determined based on a location of the source 26 determined according to captured audio data. For example, the VR capture system 1 may include an array of microphones (not shown) located at (for instance, integrated with) the video capture device 22. The microphones may receive audio data from at least one audio source in the scene. The audio data captured by the microphones may be analysed in order to determine a location of the source 26 of the captured audio component. The AV processing apparatus 10 may then be configured to determine the zooming region based on the location determined according to the audio data captured by the microphone array.
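For completeness, here is a very rough sketch of an audio-based alternative using a single microphone pair of such an array: the inter-channel delay found by cross-correlation is converted to an azimuth from which a zooming region could be centred. This far-field, single-source simplification is offered only as an illustration and is not the tracking method referenced above (PCT/EP2016/051709); all names and parameter values are assumptions.

```python
import numpy as np

def estimate_azimuth_from_pair(left, right, fs, mic_spacing_m, speed_of_sound=343.0):
    """Rough direction-of-arrival estimate for a single source from two microphones:
    the inter-channel delay found by cross-correlation is converted to an azimuth."""
    corr = np.correlate(left, right, mode="full")
    delay_samples = np.argmax(corr) - (len(right) - 1)
    delay_s = delay_samples / fs
    # Clamp to the physically possible range before taking the arcsine.
    sin_az = np.clip(delay_s * speed_of_sound / mic_spacing_m, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_az))

# Toy usage: the same noise signal, shifted by two samples between the channels.
fs, sig = 48000, np.random.randn(4800)
print(estimate_azimuth_from_pair(sig, np.roll(sig, 2), fs, 0.2))
```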
In addition to determining the zooming region based on the visually detected location or the location determined based on the captured audio data, the AV processing apparatus may also be configured to associate a portion of the audio component, which corresponds with unreliable location data, with a range of locations such that, when the captured audio component is rendered for the user, the audio component appears to
come from a range of angular directions. The range of locations may be determined in any of the ways described herein.
Figure 7 is a schematic block diagram of an example configuration of an AV processing apparatus 10 such as described with reference to Figures 1 to 6. The AV processing apparatus may form part of, or be generally co-located with, the video capture device 22 or, alternatively, may be located remotely from the video capture device 22.
The AV processing apparatus 10 comprises memory 30 and processing circuitry 34.
The memory 30 may comprise any combination of different types of memory. In the example of Figure 7, the memory comprises one or more read-only memory (ROM) media 32 and one or more random access memory (RAM) media 31.
The AV processing apparatus 10 may further comprise one or more input interfaces 35 which may be configured to receive signals from the positioning apparatus 16, the audio capture device 12, and the camera device 22. These signals may be received and/or processed in (near) real-time or instead may be received from storage and processed at a later time after capture of the data (audio, visual and location). The processing circuitry 34 may be configured to process the signals received via the input interface 35 to determine whether a portion of location data indicative of a location within a scene of a source of a captured audio component is reliable, and to determine a region of a captured video component into which to zoom such that the source of the captured audio component is within the determined region.
The AV processing apparatus 10 may further comprise one or more output interfaces 36 configured for outputting spatial audio, and for outputting VR video content. For instance, the VR video content and spatial audio content may be output directly or indirectly to the playback system 2 and/or to a storage device (not shown) for storage.
Although not shown in Figure 7, the AV processing apparatus 10 may in some examples include the positioning apparatus 16. In such examples, one or more of the input interfaces 35 may be configured to receive signals from the radio tag 18 in such a way as to enable the position of the radio tag to be estimated.
The memory 30 described with reference to Figure 7 may have computer readable instructions stored thereon 32A, which when executed by the processing circuitry 34
causes the processing circuitry 34 to cause performance of various ones of the operations described above. The processing circuitry 34 described above with reference to Figure 7 may be of any suitable composition and may include one or more processors 34A of any suitable type or suitable combination of types. For example, the processing circuitry 34 may be a programmable processor that interprets computer program instructions and processes data. The processing circuitry 34 may include plural programmable processors. Alternatively, the processing circuitry 34 may be, for example, programmable hardware with embedded firmware. The processing circuitry 34 may be termed processing means. The processing circuitry 34 may alternatively or additionally include one or more Application Specific Integrated Circuits (ASICs). In some instances, processing circuitry 34 may be referred to as computing apparatus.
The processing circuitry 34 described with reference to Figure 7 is coupled to the memory 30 (or one or more storage devices) and is operable to read/write data to/from the memory. The memory may comprise a single memory unit or a plurality of memory units 32 upon which the computer-readable instructions (or code) 32A is stored. For example, the memory 30 may comprise both volatile memory 31 and non-volatile memory 32. For example, the computer readable instructions 32A may be stored in the non-volatile memory 32 and may be executed by the processing circuitry 34 using the volatile memory 31 for temporary storage of data or data and instructions. Examples of volatile memory include RAM, DRAM, and SDRAM etc. Examples of non-volatile memory include ROM, PROM, EEPROM, flash memory, optical storage, magnetic storage, etc. The memories 30 in general may be referred to as non-transitory computer readable memory media.
The term ‘memory’, in addition to covering memory comprising both non-volatile memory and volatile memory, may also cover one or more volatile memories only, one or more non-volatile memories only, or one or more volatile memories and one or more non-volatile memories.
The computer readable instructions 32A described herein with reference to Figure 7 may be pre-programmed into the AV processing apparatus 10. Alternatively, the computer readable instructions 32A may arrive at the AV processing apparatus 10 via an electromagnetic carrier signal or may be copied from a physical entity such as a computer program product, a memory device or a record medium 40 such as a CD-ROM or DVD. The computer readable instructions 32A may provide the logic and
routines that enable the AV processing apparatus 10 to perform the functionalities described above. The combination of computer-readable instructions stored on memory (of any of the types described above) may be referred to as a computer program product.
Figure 8 illustrates an example of a computer-readable medium 40 with computer-readable instructions stored thereon. The computer-readable instructions, when executed by a processor, may cause any one or any combination of the operations described above to be performed.
Where applicable, wireless communication capability of the AV processing apparatus 10 may be provided by a single integrated circuit. It may alternatively be provided by a set of integrated circuits (i.e. a chipset). The wireless communication capability may alternatively be provided by a hardwired, application-specific integrated circuit (ASIC).
Communication between the apparatuses/devices comprising the AV processing apparatus 10 may be provided using any suitable protocol, including but not limited to a Bluetooth protocol (for instance, in accordance or backwards compatible with Bluetooth Core Specification Version 4.2) or an IEEE 802.11 protocol such as Wi-Fi.
As will be appreciated, the AV processing apparatus 10 described herein may include various hardware components which may not have been shown in the Figures since they may not have direct interaction with embodiments of the invention.
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a “memory” or “computer-readable medium” may be any non-transitory media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
Reference to, where relevant, “computer-readable storage medium”, “computer program product”, “tangibly embodied computer program” etc., or a “processor” or “processing circuitry” etc. should be understood to encompass not only computers
having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays FPGA, application specific circuits ASIC, signal processing devices and other devices. References to computer program, instructions, code etc.
should be understood to encompass software for a programmable processor or firmware such as the programmable content of a hardware device, whether as instructions for a processor or as configuration settings for a fixed function device, gate array, programmable logic device, etc.
As used in this application, the term ‘circuitry’ refers to all of the following: (a) hardware-only circuit implementations (such as implementations in only analogue and/or digital circuitry) and (b) to combinations of circuits and software (and/or firmware), such as (as applicable): (i) to a combination of processor(s) or (ii) to portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) and (c) to circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
This definition of ‘circuitry’ applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term “circuitry” would also cover an implementation of merely a processor (or multiple processors) or portion of a processor and its (or their) accompanying software and/or firmware. The term “circuitry” would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or other network device.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagram of Figure 3 is an example only and that various operations depicted therein may be omitted, reordered and/or combined.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
As used herein, virtual reality (VR) content may cover, but is not limited to, computer-generated VR content, content captured by a presence capture device such as Nokia’s OZO camera or Ricoh’s Theta, and a combination of computer-generated and presence-device captured content. Indeed, VR content may cover any type or combination of types of immersive media (or multimedia) content.
It is also noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention as defined in the appended claims.

Claims (30)

1. A method comprising:
    in response to determining that a portion of location data indicative of a location within a scene of a source of a captured audio component is unreliable, determining a region of a captured video component into which to zoom such that the source of the captured audio component is within the determined region.
2. A method according to claim 1, further comprising:
    in response to determining that the portion of location data is unreliable, associating the captured audio component with a range of locations within the scene, wherein the association between the captured audio component and the range of locations is for use during rendering of the captured audio component such that, when the audio component is rendered, it appears to originate from the range of locations.
3. A method according to claim 1 or 2, comprising selecting the locations of the range of locations such that they are all within the determined region.
4. A method according to claim 1 or 2, comprising selecting the range of locations such that some of the locations are within the determined region and some of the locations are outside the determined region.
5. The method of claim 4, wherein the locations in the range of locations are dispersed around a video capture device which captures the video component.
6. A method according to any preceding claim, further comprising:
    detecting missing data in the location data; and
    in response, determining that the portion of location data including the missing data is unreliable.
7. A method according to any of claims 1 to 5, further comprising:
    detecting fluctuating data in the location data; and
    in response, determining that the portion of location data including the fluctuating data is unreliable.
8. A method according to any preceding claim, further comprising:
    determining the region of the captured video component into which to zoom based on a reliable portion of the location data which precedes the portion of location data that is determined to be unreliable.
9. A method according to any preceding claim, further comprising:
    determining the region of the captured video component into which to zoom based on a reliable portion of location data which follows the portion of location data that is determined to be unreliable.
10. A method according to any of claims 1 to 7, further comprising:
    in response to determining that the portion of location data is unreliable, enabling visual object detection based on visual content captured by a visual content capture device.
11. A method according to claim 10, further comprising:
    determining the region of the captured video component into which to zoom based on visual detection of the audio source within the captured visual content.
12. A method according to claim 11, further comprising:
    performing visual detection of the audio source at predetermined time intervals while the location data is unreliable.
13. A method according to any of claims 1 to 7, further comprising:
    determining the region of the captured video component into which to zoom based on a location of the audio source determined based on audio data captured using a microphone array.
14. Apparatus configured to perform the method according to any of claims 1 to 13.
15. Computer-readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform a method according to any of claims 1 to 13.
16. Apparatus comprising:
    at least one processor; and
    at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to:
    in response to determining that a portion of location data indicative of a location within a scene of a source of a captured audio component is unreliable, determine a region of a captured video component into which to zoom such that the source of the captured audio component is within the determined region.
17. Apparatus according to claim 16, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
    in response to determining that the portion of location data is unreliable, associate the captured audio component with a range of locations within the scene, wherein the association between the captured audio component and the range of locations is for use during rendering of the captured audio component such that, when the audio component is rendered, it appears to originate from the range of locations.
18. Apparatus according to claim 16 or 17, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
    select the locations of the range of locations such that they are all within the determined region.
19. Apparatus according to claim 16 or 17, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
    select the range of locations such that some of the locations are within the determined region and some of the locations are outside the determined region.
20. Apparatus according to claim 19, wherein the locations in the range of locations are dispersed around a video capture device which captures the video component.
21. Apparatus according to any of claims 16 to 20, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
    detect missing data in the location data; and
    in response, determine that the portion of location data including the missing data is unreliable.
22. Apparatus according to any of claims 16 to 20, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
    detect fluctuating data in the location data; and
    in response, determine that the portion of location data including the fluctuating data is unreliable.
23. Apparatus according to any of claims 16 to 22, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
    determine the region of the captured video component into which to zoom based on a reliable portion of the location data which precedes the portion of location data that is determined to be unreliable.
24. Apparatus according to any of claims 16 to 23, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
    determine the region of the captured video component into which to zoom based on a reliable portion of location data which follows the portion of location data that is determined to be unreliable.
25. Apparatus according to any of claims 16 to 22, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
    in response to determining that the portion of location data is unreliable, enable visual object detection based on visual content captured by a visual content capture device.
26. Apparatus according to claim 25, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
    determine the region of the captured video component into which to zoom based on visual detection of the audio source within the captured visual content.
27. Apparatus according to claim 26, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
    perform visual detection of the audio source at predetermined time intervals while the location data is unreliable.
28. Apparatus according to any of claims 16 to 22, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
    determine the region of the captured video component into which to zoom based on a location of the audio source determined based on audio data captured using a microphone array.
29. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causes performance of at least:
    in response to determining that a portion of location data indicative of a location within a scene of a source of a captured audio component is unreliable, determining a region of a captured video component into which to zoom such that the source of the captured audio component is within the determined region.
30. Apparatus comprising:
    means for, in response to determining that a portion of location data indicative of a location within a scene of a source of a captured audio component is unreliable, determining a region of a captured video component into which to zoom such that the source of the captured audio component is within the determined region.
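By way of illustration only, the sketch below shows one way the behaviour recited in claims 1, 2 and 6 to 8 might be expressed in code: a portion of location data is treated as unreliable if it contains missing or strongly fluctuating samples, a zoom region is then derived from the last reliable location, and a range of rendering locations is spread across that region. The data structures, thresholds and function names (LocationSample, portion_is_unreliable, zoom_region, locations_for_rendering) are assumptions made for this example and are not taken from the application.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LocationSample:
    """One timestamped location estimate for the source of the captured audio component."""
    time: float
    position: Optional[Tuple[float, float]]  # None models a missing sample


def portion_is_unreliable(window: List[LocationSample], max_jump: float = 2.0) -> bool:
    """Treat a portion of location data as unreliable if it contains missing
    samples (cf. claim 6) or fluctuates by more than max_jump between
    consecutive samples (cf. claim 7). The threshold is an assumed value."""
    if any(sample.position is None for sample in window):
        return True
    for prev, curr in zip(window, window[1:]):
        if math.dist(prev.position, curr.position) > max_jump:
            return True
    return False


def zoom_region(last_reliable: Tuple[float, float],
                half_width: float = 1.5,
                half_height: float = 1.5) -> Tuple[float, float, float, float]:
    """Determine a region of the captured video component into which to zoom,
    centred on the last reliable location (cf. claim 8) so that the source of
    the captured audio component lies within the determined region (cf. claim 1)."""
    x, y = last_reliable
    return (x - half_width, y - half_height, x + half_width, y + half_height)


def locations_for_rendering(region: Tuple[float, float, float, float],
                            count: int = 5) -> List[Tuple[float, float]]:
    """Spread a range of candidate locations across the region; associating the
    audio component with this range lets it appear, when rendered, to originate
    from the range of locations rather than from a single point (cf. claim 2)."""
    x0, y0, x1, y1 = region
    step = (x1 - x0) / (count - 1)
    return [(x0 + i * step, (y0 + y1) / 2) for i in range(count)]
```

For example, if the last reliable estimate placed the source at (4.0, 2.0), zoom_region((4.0, 2.0)) yields the rectangle (2.5, 0.5, 5.5, 3.5) and locations_for_rendering spreads five rendering locations across its width.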
Intellectual Property Office. Application No: GB1620008.1. Examiner: Mr Conal Clynch
GB1620008.1A 2016-11-25 2016-11-25 Methods and apparatuses relating to location data indicative of a location of a source of an audio component Withdrawn GB2556922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1620008.1A GB2556922A (en) 2016-11-25 2016-11-25 Methods and apparatuses relating to location data indicative of a location of a source of an audio component

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1620008.1A GB2556922A (en) 2016-11-25 2016-11-25 Methods and apparatuses relating to location data indicative of a location of a source of an audio component

Publications (2)

Publication Number Publication Date
GB201620008D0 GB201620008D0 (en) 2017-01-11
GB2556922A true GB2556922A (en) 2018-06-13

Family

ID=58073342

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1620008.1A Withdrawn GB2556922A (en) 2016-11-25 2016-11-25 Methods and apparatuses relating to location data indicative of a location of a source of an audio component

Country Status (1)

Country Link
GB (1) GB2556922A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020086357A1 (en) 2018-10-24 2020-04-30 Otto Engineering, Inc. Directional awareness audio communications system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014189816A1 (en) * 2013-05-24 2014-11-27 Motorola Mobility Llc Voice controlled audio recording or transmission apparatus with keyword filtering
US8913103B1 (en) * 2012-02-01 2014-12-16 Google Inc. Method and apparatus for focus-of-attention control
GB2540224A (en) * 2015-07-08 2017-01-11 Nokia Technologies Oy Multi-apparatus distributed media capture for playback control

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8913103B1 (en) * 2012-02-01 2014-12-16 Google Inc. Method and apparatus for focus-of-attention control
WO2014189816A1 (en) * 2013-05-24 2014-11-27 Motorola Mobility Llc Voice controlled audio recording or transmission apparatus with keyword filtering
GB2540224A (en) * 2015-07-08 2017-01-11 Nokia Technologies Oy Multi-apparatus distributed media capture for playback control

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020086357A1 (en) 2018-10-24 2020-04-30 Otto Engineering, Inc. Directional awareness audio communications system
EP3870991A4 (en) * 2018-10-24 2022-08-17 Otto Engineering Inc. Directional awareness audio communications system
US11671783B2 (en) 2018-10-24 2023-06-06 Otto Engineering, Inc. Directional awareness audio communications system

Also Published As

Publication number Publication date
GB201620008D0 (en) 2017-01-11

Similar Documents

Publication Publication Date Title
US10410680B2 (en) Automatic generation of video and directional audio from spherical content
US10425761B2 (en) Audio volume handling
US10468066B2 (en) Video content selection
US10419712B2 (en) Flexible spatial audio capture apparatus
US11631422B2 (en) Methods, apparatuses and computer programs relating to spatial audio
US20140337742A1 (en) Method, an apparatus and a computer program for determination of an audio track
WO2013088208A1 (en) An audio scene alignment apparatus
TWI588590B (en) Video generating system and method thereof
US11342001B2 (en) Audio and video processing
US9195740B2 (en) Audio scene selection apparatus
CN108781310B (en) Method, apparatus, device, medium for selecting an audio stream of a video to be enhanced
EP2704421A1 (en) System for guiding users in crowdsourced video services
CN113596240B (en) Recording method, recording device, electronic equipment and computer readable medium
US20150271599A1 (en) Shared audio scene apparatus
CN112839165B (en) Method and device for realizing face tracking camera shooting, computer equipment and storage medium
US10524074B2 (en) Intelligent audio rendering
GB2556922A (en) Methods and apparatuses relating to location data indicative of a location of a source of an audio component
WO2023164814A1 (en) Media apparatus and control method and device therefor, and target tracking method and device
US10405123B2 (en) Methods and apparatuses relating to an estimated position of an audio capture device
JP2016092691A (en) Image processing device, method for controlling the same, program, and storage medium
US8947536B2 (en) Automatic failover video coverage of digital video sensing and recording devices
JP2024041721A (en) video conference call
GB2536203A (en) An apparatus

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)