GB2553351A - Salient object detection - Google Patents


Info

Publication number
GB2553351A
Authority
GB
United Kingdom
Prior art keywords
scene
objects
audio
data
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1615006.2A
Other versions
GB201615006D0 (en)
Inventor
Saarinen Jukka
Cricri Francesco
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1615006.2A priority Critical patent/GB2553351A/en
Publication of GB201615006D0 publication Critical patent/GB201615006D0/en
Publication of GB2553351A publication Critical patent/GB2553351A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

A method and apparatus configured to identify a plurality of objects 2.1 in a captured video and/or audio scene and to process the scene by: removing an object 2.2 from the scene; measuring the effect of removing said object using received data 2.3; reintroducing said object into the scene 2.5; and repeating the said steps for one or more other objects in the scene in turn 2.6. The saliency level is then determined for each object based on the measured effect 2.8. The effect may be measured by the interactions of the remaining objects, and the received data may be from sensors of an external observer, possibly using a VR headset.

Description

(54) Title of the Invention: Salient object detection
Abstract Title: Saliency detection by measuring the effect of removing an object
Figure GB2553351A_D0001 (Fig. 2)

[Drawing sheets 1/10 to 10/10, containing Figures 1 to 14, are image-only; no further text is recoverable from them. See the Brief Description of the Drawings below.]

The following terms are registered trade marks and should be read as such wherever they occur in this document: Bluetooth; Nokia OZO.
Salient Object Detection
Field
This specification relates to methods and systems for salient object detection in video and/or audio content. The specification particularly relates, but is not limited to, salient object detection in captured scenes, e.g. for virtual reality, augmented reality and spatial video and/or audio applications.
Background
In multimedia applications, there has been much activity directed to enabling computers to understand content in a manner that is analogous to humans. To a human consumer of an image, video clip or audio sound, some parts (e.g. objects) will draw their attention or interest over others. These parts are termed ‘salient’. One open problem is for computers to automatically infer the salient parts for subsequent use.
In the context of images, saliency is an abstract concept that generally relates to the contrast between certain image regions and the image background, as well as the content within the image regions. Some researchers have explored visual saliency models, and machine learning techniques, such as neural networks, probabilistic graphical models, sparse coding and kernel machines, have been adapted to the field of visual saliency detection. These models learn saliency from given scenes automatically, but can lack robustness and be affected by, e.g. complex backgrounds.
More specifically, one known problem relates to detecting the true saliency of image regions within a single image. Most known approaches directly apply, or manually modify, existing unsupervised saliency detection algorithms for single-image saliency detection. These do not yield promising results, as unsupervised algorithms tend to lack robustness and can be influenced by complex backgrounds. Another problem relates to developing an improved mechanism for exploring the importance levels of salient objects among multiple related images. Most known approaches focus on exploring the homogeneity based on low-level features, e.g. colour, texture or corner descriptors. Homogeneity based on higher-level concepts is not captured. Low-level features are easily influenced by, for example, a variation in luminance, shape or viewpoint, leading to unsatisfactory results.
Summary
According to one aspect, a method comprises: 1) identifying a plurality of objects in a captured video and/or audio scene; 2) processing the scene by performing: i) removing an object from the scene; and ii) measuring the effect of removing said object using received data; iii) reintroducing said object into the scene; iv) repeating i) to iii) for one or more other objects in the scene in turn; and 3) determining the saliency level for each object based on the measured effect from 2)ii).
The method may further comprise 4) identifying using the saliency levels determined in 3) a subset of the objects as being more salient than one or more other objects in the scene.
The subset may comprise one or more objects for which the measurement is above a predetermined threshold.
2)ii) may comprise measuring the difference in the functionality and/or interaction of the one or more non-removed objects with trained data over a time frame.
The trained data may comprise one or more trained data models, for example a neural network, pattern recognition model and/or natural language processing model.
The trained data model may comprise a motion model representing the expected motion of one or more objects over time and wherein 2)ii) may comprise measuring the difference between movement of the non-removed objects over the time frame, and the expected movement determined by the motion model.
The captured scene may be a video and audio scene represented by video and audio data for the objects within a space, wherein 2)ii) may comprise for at least part of the scene which comprises the removed object, using trained data to measure if the interaction between two or more objects which move relative to one another over the time frame is realistic.
2)i) may comprise removing the audio and video data for the removed object, and 2)ii) may comprise applying audio produced at the time the two or more objects spatially interact to the trained data to measure if the interaction is realistic based on the amount of the audio attributable to the non-removed objects.
2)ii) may be performed for a series of time frames of the captured scene leading up to, and including, a spatial contact or collision between the removed object and one or more non-removed objects.
The trained model may receive the video data resulting from 2)i) and the audio data for the overall scene at, and/or leading up to, the time of interaction and wherein a binary decision may be generated as to whether or not an interaction is realistic.
The trained model may receive video data resulting from 2)i) and may generate therefrom expected audio for the overall scene over a series of time frames, and wherein a difference is measured between the received audio and the expected audio.
The captured scene may contain speech audio data from two or more audio sources captured over a time frame, wherein 2)i) may comprise silencing the speech audio from one audio source, and 2)ii) may comprise using a natural language processing model to measure the degree to which the speech audio for the non-silenced audio sources over the time frame conform to a recognised pattern.
The method may further comprise: generating audio and/or video content representing a time period which includes the scene resulting from 2)i); providing the content for output to one or more external observers; and wherein 2)ii) may comprise measuring the effect of removing the object using data received from one or more sensors associated with said external observers.
The sensor data may include one or more of heart rate data, temperature data, sweat data and gaze direction data.
The method of any preceding definition may be performed in a post-processing phase of virtual reality, augmented reality and/or spatial audio content.
The identified subset of objects in 4) may be rendered in a post-processed virtual space and the remaining objects are removed and/or filtered to limit their movement and/or playback volume in the rendered virtual space.
3) may comprise ranking or categorising objects based on their saliency levels.
According to another aspect, there is provided a computer program comprising instructions that when executed by a computer apparatus control it to perform the method of any preceding definition.
According to another aspect, there is provided apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured with the processor to cause the apparatus to: 1) identify a plurality of objects in a captured video and/or audio scene; 2) process the scene by performing: i) removing an object from the scene; and ii) measuring the effect of removing said object using received data; iii) reintroducing said object into the scene; iv) repeating i) to iii) for one or more other objects in the scene in turn; and 3) determine the saliency level for each object based on the measured effect from 2)ii).
According to another aspect, there is provided a non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising: 1) identifying a plurality of objects in a captured video and/or audio scene; 2) processing the scene by performing: i) removing an object from the scene; and ii) measuring the effect of removing said object using received data; iii) reintroducing said object into the scene; iv) repeating i) to iii) for one or more other objects in the scene in turn; and 3) determining the saliency level for each object based on the measured effect from 2)ii).
Brief Description of the Drawings
Embodiments will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
Figure 1 is a block diagram showing an example of a post-processing system according to embodiments in relation to a video and audio capture scenario;
Figure 2 is an overview flow diagram showing an example of processing steps of embodiments;
Figures 3a - 3c are pictorial illustrations indicating how the Figure 2 method may be performed on an example scene comprising objects;
Figure 4 is an example of a top plan view of a Virtual Reality (VR) capture scenario, including the post-processing system shown in Figure 1;
Figure 5 is an example of a schematic diagram of components of the post-processing system shown in Figures 1 and 4;
Figure 6 is an example of a flow diagram showing one embodiment of performing a measuring step shown in the Figure 2 flow diagram;
Figure 7 is an example of a flow diagram showing processing steps of the Figure 6 embodiment in greater detail;
Figure 8 is an example of a flow diagram showing alternative processing steps of the Figure 6 embodiment in greater detail;
Figures 9a - 9d are pictorial illustrations indicating how the Figure 6 method may be performed on an example scene comprising objects;
Figures 10a - 10c are pictorial illustrations indicating how the Figure 6 method may be performed on a different example scene comprising objects;
Figures 11a - lid are pictorial illustrations indicating how the Figure 6 method may be performed on a different example scene comprising speech objects;
Figure 12 is an example of a flow diagram showing a second embodiment of performing a measuring step shown in the Figure 2 flow diagram;
Figure 13 is a schematic diagram of observing users in relation to the post-processing system of Figures 1 and 4 for providing reaction data from sensors; and
Figure 14 is an example of a flow diagram showing processing steps of the Figure 13 embodiment.
Detailed Description of Preferred Embodiments
Embodiments herein relate to methods and systems for detecting saliency in any one of audio content, video content or content which comprises both audio and video. The audio and/or video content can be captured using any system or means. The methods and systems may also be applied to static images.
In some embodiments, the audio and video may comprise spatial audio and/or video, for example captured by a system employing an array of cameras and/or microphones.
One such example is Nokia’s OZO system which captures in real-time audio and video with a 360 degree azimuthal field-of-view to capture a spherical world. Alternatively, or additionally, spatial audio may be captured using close-up microphones, such as Lavalier microphones, which are carried by audio sources within a capture space and which may have associated positioning means, e.g. radio positioning tags, to indicate to an external system the location of the audio sources in order to compute the spatial audio. HAIP (High Accuracy Indoor Positioning) positioning tags are one example.
Embodiments may relate to a post-processing phase performed on captured video and/or audio content, for example performed on captured data for a Virtual Reality (VR) or Augmented Reality (AR) space.
For example, in a captured VR space comprising multiple audio sources which emit sound, it may be desirable to filter certain, less salient, sources in favour of more salient sources so as to avoid overwhelming or confusing the user consuming the content.
For example, in a captured or artificially generated video space, it may be desirable to attach artificially-generated sounds to certain objects which are determined to be salient. For example, in some situations, not all salient objects in a captured space will be wearing positioning tags (which are generally associated with human audio sources) and so sounds may be attached in the post-processing phase when saliency is identified.
For example, in three-dimensional (3D) rendering applications, it may be desirable to render in the post-processing phase salient objects with higher accuracy or resolution than those considered less salient.
In overview, there are described post-processing methods and systems which receive captured content, identify a plurality of objects within the captured content, which can be audio and/or video objects, and which then employ a so-called silencing framework to silence each object in turn and then measure, or score, the effect of said removal on one or more of the other (non-silenced) objects. The measure, or score, is indicative of the saliency. The process may be repeated for other objects in the scene, and, in some embodiments, the measures are ranked, categorised or arranged to derive relative saliency for all objects in the scene which can be used for certain other functions. For example, an object having a saliency measure which is above a predetermined threshold may be treated as ‘salient’ and other objects, below said threshold, as ‘non-salient’, with certain later functions being performed only for salient objects. For example, the saliency measure may be used for content categorisation and indexing purposes, whereby an archive of content is stored and searchable based on the detected saliency within the content.
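By way of illustration, a minimal sketch of the silencing framework described above is given below. The helper functions remove_object, reinstate_object and measure_effect are hypothetical placeholders for the content-editing and measurement stages discussed later, and the threshold value is an assumed example rather than a prescribed parameter.

```python
# Minimal sketch of the iterative silencing framework (not a definitive
# implementation). remove_object, reinstate_object and measure_effect are
# hypothetical placeholders for the editing and measurement stages.
def score_saliency(scene, objects, remove_object, reinstate_object,
                   measure_effect, threshold=0.5):
    scores = {}
    for obj in objects:
        modified = remove_object(scene, obj)     # silence the object (step 2.2)
        scores[obj] = measure_effect(modified)   # score the effect (steps 2.3, 2.8)
        scene = reinstate_object(modified, obj)  # reintroduce it (step 2.5)
    # Objects whose removal has a large measured effect are treated as salient.
    salient = {obj for obj, s in scores.items() if s >= threshold}
    return scores, salient
```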
In the context of video data, silencing may mean removing the object from the video, and may involve changing the object’s pixels so that they more closely match the background pixels, for example by using video data from other cameras with different viewing angles to ‘fill-in’ the background. In the context of audio data, silencing may mean literally silencing, or reducing, the audio associated with the object.
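Where no alternative camera view is available, one way to approximate the ‘fill-in’ of background pixels is image inpainting. The sketch below uses OpenCV’s inpaint function; the choice of inpainting rather than multi-camera background substitution, and the bounding-box mask, are assumptions made for illustration only.

```python
import cv2
import numpy as np

def silence_video_object(frame, bbox):
    """Silence a video object by inpainting its bounding box with
    background-like pixels. frame: HxWx3 image; bbox: (x, y, w, h)."""
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    x, y, w, h = bbox
    mask[y:y + h, x:x + w] = 255   # region to be replaced
    return cv2.inpaint(frame, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
```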
Referring to Figure 1, a post-processing system 1 is shown in the context of an exemplary application, which is that of VR/AR capture of a real-world capture space 3. The capture space 3 comprises plural objects 5 which may emit sounds and/or may move in space. A capture device 6, which may be Nokia’s OZO camera (mentioned above), captures video and audio data and provides it to the post-processing system 1 as data streams, generally indicated 7. Positioning data may also be sent from the capture space 3 to the post-processing system 1. The positioning data may be derived from tags carried by the objects 5, for example HAIP tags, and/or by one or more cameras within the capture space 3. The post-processing system 1 may be configured to determine, using the video and/or audio data, and in some cases the positioning data, a measure of each object’s saliency, or relative saliency, by means of the above-mentioned silencing framework. In some embodiments, the measure may then be used within the post-processing system (or other post-processing systems 11) for one or more other functions, for example to filter what is presented to a user 9 consuming the post-processed capture data through a VR headset 10. In some embodiments, the measure may be used for related functions, such as indexing and/or categorisation of content. Other post-processing tasks may be performed in the post-processing system 1 associated with VR/AR content processing.
Referring to Figure 2, an overview of the processing steps which are performed by the post-processing system 1 is shown as a flow chart.
In a first step 2.1, a plurality of objects are identified from within the capture space 3. Object identification may use known techniques. For example, each object may have an associated identification or positioning tag, e.g. a HAIP tag. For example, image processing techniques may be employed to identify objects 5 from image or video data from within the captured space 3. The objects 5 may be foreground objects, e.g. as distinct from the background scene. Therefore, foreground detection, which involves segmenting regions of image pixels as distinct from others, may use one or more known techniques. For example, a learned or trained model may be employed for object identification in step 2.1. For example, an object proposal model may be employed, which uses known techniques to identify regions and positions of an image which are delineated from other regions to generate candidate objects. The regions may be surrounded by a bounding box. The regions may be identified using learned or trained data, e.g. in a trained model such as an artificial neural network.
Further information on object identification is disclosed in Uijlings, J.R.R., Van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M. (2013) Selective Search for Object Recognition, International Journal of Computer Vision. See https://ivi.fnwi.uva.nl/isis/publications/bibtexbrowser.php?key=UijlingsIJCV2013&bib=all.bib. Object detection is also disclosed in Kuo, W., Hariharan, B., Malik, J. (2015) DeepBox: Learning Objectness with Convolutional Networks, Proceedings of the IEEE International Conference on Computer Vision. See http://people.eecs.berkeley.edu/~wckuo/KuoICCV2015.pdf
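For illustration, candidate objects and bounding boxes for step 2.1 could be obtained with an off-the-shelf pre-trained detector. The specific model used below (a torchvision Faster R-CNN) and the score threshold are assumptions; the specification only requires some trained object-proposal or detection model.

```python
import torch
import torchvision

# Assumed off-the-shelf detector; any object-proposal model could be substituted.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

def propose_objects(image, score_threshold=0.7):
    """image: 3xHxW float tensor in [0, 1]; returns candidate boxes and labels."""
    with torch.no_grad():
        out = detector([image])[0]
    keep = out["scores"] > score_threshold
    return out["boxes"][keep], out["labels"][keep]
```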
The positions of the identified objects may be determined within the capture space 3. As noted above, this may involve using positioning tags, e.g. HAIP tags, carried by objects, or may involve video processing techniques using the video data from one or more cameras within the capture space 3.
In a subsequent step 2.2, having identified a plurality of objects, a first one of the objects is removed, or silenced. In other words, the content is processed so that the object ‘disappears’ from the captured scene, either in the visual and/or audio sense. In the context of video data, this may mean making the object effectively invisible, or something approaching invisible, which may involve replacing the object’s pixels (or all pixels within the bounding box) with pixels approximating to those in the background or with pixels with a predetermined appearance, for example a predetermined colour. For example, if multiple cameras are available within the capture space 3, the background pixels can be computed/approximated using data from the other cameras. In the context of audio data, this may mean literally silencing the audio received from the object, whether through a close-up microphone, or from the directional position of the object.
In a subsequent step 2.3, the effect of the removal is measured. More specifically, the effect of the removal may be measured in relation to the other, non-removed objects. As will be explained below, this may involve measuring how the removal affects and/or influences the functionality or interaction of the other, non-removed objects in the scene. For example, this may involve measuring how the removal affects the coherence of the remaining objects in terms of their appearance, their movement, their sound or the semantics of the communication among them.
This may be an entirely or partially automatic method, not involving human interaction. Alternatively, or additionally, this may involve measuring how the removal is perceived by observers, for example the effect produced on human consumers of the modified content.
The measured effect can take different alternative forms, for example a binary measurement or a more graduated measurement. Each measurement may represent a saliency level, e.g. Salient; Non-Salient, or 0 - 100% Salient, to give examples. A predetermined threshold may be used to make subsequent decisions on the measurement or score, i.e. to assign the measurement or score to one of a number of predetermined saliency levels. Any measuring or quantifying method for indicating in data form the relative saliency of one object over another may be used. The resulting measurement or quantifying parameter may be termed a ‘score’.
In step 2.4, if there are further objects in the captured scene, the process moves to step 2.5 in which the removed object is re-introduced, i.e. effectively replaced in the scene, and another object, not already removed, is selected in step 2.6. The selection step 2.6 may be ordered or randomised. The process then returns to step 2.2 whereby the new object is removed and the effect of the removal measured as before. The process ends in step 2.7 when all objects in the scene have been removed and analysed.
During each occurrence of step 2.3, the measurement result may be stored in step 2.8 in memory of the post-processing system 1, or external memory, which may yield an overall saliency table for the scene being analysed. The results may produce a more accurate saliency measurement and/or show which objects are more salient than others. Objects may be assigned a saliency level, which may simply be ‘salient’ or ‘non-salient’ or may use a greater number of levels, e.g. ‘0% salient’, ‘10% salient’, ‘100% salient’ etc.
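As an illustrative sketch of step 2.8, the stored measurements might be turned into a ranked saliency table and coarse saliency levels as follows; the level boundaries are assumed values, not prescribed by the specification.

```python
def build_saliency_table(scores, levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """scores: dict of object id -> measured effect of its removal (0..1).
    Returns a most-salient-first ranking and a coarse level per object."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    level = {obj: max(l for l in levels if s >= l) for obj, s in scores.items()}
    return ranking, level
```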
The above process is demonstrated by way of a visual example in Figure 3. Referring to Figure 3a, a captured space 15 is shown which contains three objects 17, 18, 19 which have been identified using any of the above methods. The objects 17 - 19 may be visual and/or audio objects. For ease of explanation, we shall assume for this example that the objects 17 - 19 are video objects captured by a camera.
In Figure 3a, the first object 17 is selected and, as shown in the modified scene 20, is removed by replacing its pixels with those corresponding to the background, leaving only the second and third objects 18, 19 (step 2.2) in the modified scene. The effect produced on the said remaining second and third objects 18, 19 is measured and stored (steps 2.3, 2.8) and the process repeated for the other objects as shown in Figures 3b and 3c respectively.
Referring now to Figure 4, a further, more detailed example will now be described. Figure 4 is an overview of a VR capture scenario 23 shown together with a post-processing system 25 which may (or may not) have an associated user interface 26. The Figure shows in plan view a real-world space 27 which may be any real-world scenario.
A capture device 28, e.g. Nokia’s OZO device, for video and spatial audio capture is supported on a floor of the space 27 in front of multiple objects 29, 30, 31, examples of which will be described below.
The position of the capture device 28 is known, e.g. through predetermined positional data or signals derived from a positioning tag (not shown) on the capture device. The capture device 28 may comprise a microphone array configured to provide spatial audio capture.
The objects 29 - 31 may be visible and/or audio objects, e.g. objects generating a sound such as speech or music. Each object 29 - 31 may have an associated close-up microphone providing audio signals. The audio sources 29 - 31 may also carry a positioning tag which can be any module capable of indicating through data its respective spatial position to the post-processing system 25. For example, the positioning tag may be a high accuracy indoor positioning (HAIP) tag which works in association with one or more HAIP locators 33 within the space 27. HAIP systems use Bluetooth Low Energy (BLE) communication between the tags and the one or more locators 33. For example, there may be four HAIP locators 33 mounted on, or placed relative to, the capture device 28. A respective HAIP locator 33 may be to the front, left, back and right of the capture device 28. Each tag sends BLE signals from which the HAIP locators derive the tag, and therefore audio source, location. In general, such direction of arrival (DoA) positioning systems are based on (i) a known location and orientation of the or each locator, and (ii) measurement of the DoA angle of the signal from the respective tag towards the locators in the locators’ local co-ordinate system. Based on the location and angle information from one or more locators, the position of the tag can be calculated using geometry.
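As a simplified, two-dimensional sketch of that geometry, the position of a tag can be estimated from two direction-of-arrival bearings by intersecting the corresponding bearing lines; the assumption here is that the DoA angles have already been converted from each locator's local co-ordinate system into a common world frame.

```python
import numpy as np

def position_from_doa(p1, angle1, p2, angle2):
    """Estimate a 2D tag position from two locators' DoA bearings.
    p1, p2: known locator positions; angle1, angle2: world-frame DoA angles
    in radians. Solves p1 + t1*d1 = p2 + t2*d2 for the intersection point."""
    d1 = np.array([np.cos(angle1), np.sin(angle1)])
    d2 = np.array([np.cos(angle2), np.sin(angle2)])
    A = np.column_stack([d1, -d2])          # singular if the bearings are parallel
    t = np.linalg.solve(A, np.asarray(p2, float) - np.asarray(p1, float))
    return np.asarray(p1, float) + t[0] * d1
```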
The post-processing system 25 is a processing system which may have an associated user interface (UI) 26. As shown in Figure 4, it receives as input from the capture device 28 spatial audio and video data, and positioning data, through a signal line 135. Alternatively, the positioning data can be received from the HAIP locator(s) 33. The post-processing system 25 may also receive as input from each of the objects 29 - 31 audio data and positioning data from the respective positioning tags, or the HAIP locator(s) 33, through separate signal lines 37. The post-processing system 25 may generate spatial audio data for output to a user device 39, such as a VR headset with video and audio output.
The input audio data may be multichannel audio in loudspeaker format, e.g. stereo signals, 4.0 signals, 5.1 signals, Dolby Atmos (RTM) signals or the like. Instead of loudspeaker format audio, the input can be in the multi-microphone signal format, such as the raw eight-signal input from the OZO VR camera, if used for the capture device 28.
In some embodiments, the capture space 27 also includes one or more further cameras 40 which may be used to ‘fill-in’ background for removed objects 29 - 31, as will be explained later on. Video data generated by said cameras 40 is provided to the post-processing system 25.
Figure 5 shows an example schematic diagram of components of the post-processing system 25. The post-processing system 25 has a controller 41, a touch sensitive display 43 comprised of a display part 45 and a tactile interface part 47, hardware keys 51, a memory 53, RAM 55 and an input interface 57. The controller 41 is connected to each of the other components in order to control operation thereof. The display part 45 and the hardware keys 51 are optional.
The memory 53 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 53 stores, amongst other things, an operating system 59 and one or more software applications 61. The RAM 55 is used by the controller 41 for the temporary storage of data. The operating system 59 may contain code which, when executed by the controller 41 in conjunction with RAM 55, controls operation of each of the hardware components of the terminal.
The controller 41 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor (including processor circuitry), or plural processors (each including processor circuitry).
The input interface 57 receives video and audio data from the capture device 28, such as Nokia’s OZO (RTM) device, and audio data from each of the objects 29 - 31. The input interface 57 may also receive the positioning data from (or derived from) the positioning tags on each of the capture device 28 and the objects 29 - 31, from which can be made an accurate determination of their respective positions in the real world space 27.
In some embodiments, one or more of the software applications 61 is configured to provide video and distributed spatial audio capture, mixing and rendering to generate a VR environment, or virtual space, including the rendered spatial audio. The capture data may represent a non-zero capture period which may comprise multiple frames of data captured over a number of seconds, minutes or hours.
The same or a different software application 61 may be configured to perform the Figure 2 method, i.e. saliency determination, which determination may be used prior to the mixing and rendering stages, or for some further post-processing stage. The remainder of this description will focus on this aspect of the software application 61.
Referring back to Figure 2, the step (2.3) of measuring the effect of removing, or silencing, an object (step 2.2) may be performed in one or a number of different ways.
Referring to Figure 6, in a first embodiment of step 2.3 a determination is made as to how the removal affects or influences other objects in terms of functionality or interaction. This may be an entirely automatic processing stage performed by the software application 61.
Referring to Figure 7, a more specific example of the Figure 6 embodiment is shown in flow diagram form. In a first step 7.1, data from the modified scene is received. In a subsequent step 7.2 the said data is applied to, or processed by, a trained model to generate a classification result in step 7.3. The trained model may be a neural network, or plural neural networks, or similar. The classification result in step 7.3 can be, for example, a binary result, e.g. realistic or non-realistic, or a graduated result, e.g. 0% realistic, 5% realistic, 60% realistic and so on.
It will be appreciated that a neural network is an example of a computational model with an input and an output and which performs a certain computational task. The input to the neural network may be represented by the input data on which to perform the task. The output may be the desired task, or an intermediate result which may enable a subsequent module or modules to perform the task. There are two main categories of neural networks, namely discriminative and generative. Discriminative networks allow for obtaining information about the posterior probability of the underlying factors of variations (such as classes) given the input data. Examples of such networks are convolutional neural networks, used for performing classification. The input may be raw data, or features extracted from raw data, and the output may be the estimated class. Generative networks allow for obtaining information about the joint probability distribution of the data and the underlying factors of variation, or the likelihood of the data given factors of variation. These neural networks are usually used either for pre-training a classification model, or for generating artificial data such as artificial images. A neural network may be defined by two sets of numerical values: hyper-parameters (or topology) and parameters (or weights). The hyper-parameters are values which may be set by the human practitioner based on his or her experience and thus they are usually not learnt. They define the general structure of the network, such as number of layers, the number of units per layer, the type of connection (dense or sparse), activation functions, etc. There is some research ongoing for learning the hyper-parameters from training data, but the most common method is to test out values within pre-defined ranges, i.e. by using grid-search methods.
The parameters are the component of the neural network which is usually learnt. The parameters are set during a so-called training phase in an iterative manner by running an optimization routine on training data. This optimization usually consists of finding the combination of weights for which the objective function is optimized. The objective function can be a classification-error function, and thus the optimization may be a minimization problem. For example, one basic minimization routine is a Stochastic Gradient Descent (SGD) routine, which uses the gradient of the objective function (computed for example by the Back-propagation algorithm, Monte-Carlo methods or Reinforcement learning techniques) for updating the weights. Once the hyper-parameters and weights have been set, the neural network is completely defined by those two types of data, and it is ready for deployment. References made hereinafter to “neural network” may refer to these types of data, or subsets thereof.
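For illustration only, a toy SGD training loop of the kind described above might look as follows, here fitting a simple logistic-regression classifier with a cross-entropy objective; the learning rate and epoch count are assumed values.

```python
import numpy as np

def sgd_train(X, y, lr=0.1, epochs=20, seed=0):
    """Toy stochastic gradient descent: X is NxD, y holds labels in {0, 1}."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))  # predicted probability
            grad = p - y[i]                            # d(cross-entropy)/d(logit)
            w -= lr * grad * X[i]                      # weight update
            b -= lr * grad
    return w, b
```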
In the present example, the trained model can be a neural network trained as a motion model which determines whether or not the motion of objects in the modified scene is coherent, or realistic. For example, the motion model may be a pattern recognition model which is pre-trained on a corpus/dataset of videos of objects which move without disturbance.
Referring to Figure 9, an example is illustrated in terms of captured video data based on the Figure 4 scenario 23. Figure 9a shows the captured video data 50 comprising the three objects 29 - 31 moving over a non-zero time period. The arrows and dashed lines indicate the respective movement and end positions of the objects 29’ - 31’. In the example, the second and third objects 30,31 collide during the time period and deflect in different directions.
Following the Figure 2 method, in a first iteration of the silencing framework, Figure 9b shows the modified scene 52 in which the first object 29 is removed and replaced by background pixels. The modified scene 52 is then applied to a pattern recognition model 55 which generates a measure of how realistic the resulting video motions are.
In this case, the motion is considered realistic, or 100% realistic.
Figure 9c shows a subsequent iteration in which the second object 30 is removed and replaced by background pixels. The modified scene 53 is applied to the pattern recognition model 55 and, in this example, the determination is one of non-realism (or 0% realism) based on the motion of the third object 31 not being coherent with the rest of the scene, i.e. because said object changes direction without any apparent interaction. Figure 9d shows the subsequent iteration in which the third object 31 is removed and replaced by background pixels. The modified scene 54 is similarly applied as before and, in this example, the determination is one of non-realism (or 0% realism) based on the motion of the second object 30 not being coherent with the rest of the scene, i.e. because said object changes direction without any apparent interaction.
Overall, therefore, a low or zero realism score is indicative that the removed object is salient. In the shown example, the second and third objects 30, 31 are determined as salient.
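The trained motion model itself is not reproduced here, but a crude hand-written stand-in conveys the idea behind Figures 9a to 9d: a sharp change of direction is only treated as ‘realistic’ if another tracked object was nearby at that moment. All thresholds below are assumed values chosen for illustration.

```python
import numpy as np

def motion_coherence(tracks, angle_thresh_deg=45.0, contact_radius=10.0):
    """tracks: dict of object id -> Tx2 positions. Returns 1.0 if every sharp
    turn coincides with a nearby object (apparent interaction), else 0.0."""
    cos_thresh = np.cos(np.radians(angle_thresh_deg))
    for oid, pos in tracks.items():
        v = np.diff(pos, axis=0)                      # per-frame velocities
        for t in range(1, len(v)):
            n0, n1 = np.linalg.norm(v[t - 1]), np.linalg.norm(v[t])
            if n0 < 1e-6 or n1 < 1e-6:
                continue                              # object effectively static
            if np.dot(v[t - 1], v[t]) / (n0 * n1) < cos_thresh:  # sharp turn
                others = [p[t] for o, p in tracks.items() if o != oid]
                if not others or min(np.linalg.norm(pos[t] - q)
                                     for q in others) > contact_radius:
                    return 0.0                        # unexplained deflection
    return 1.0
```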
Referring to Figure 10, another example is illustrated in terms of captured video and audio data. Figure 10a shows captured video data 60 comprising a fork object 65 moving towards, and striking, a plate object 66 during the capture time period. The striking interaction produces a noise, indicated by reference numeral 67, which is not attributable to the objects 65, 66 themselves, but only to their interaction.
Following the Figure 2 method, in a first iteration of the silencing framework, Figure 10b shows the modified scene 62 in which the plate object 66 is removed and replaced by background pixels. The modified scene 62 may then be applied to a classification model 68 which has been trained to distinguish between a realistic sound and a non-realistic sound, e.g. using a binary classification. The input data to the model may be the video data for the modified scene 62 concatenated with the proposed audio data. In the case of Figure 10b, the video data corresponding to the moving fork object 65 is inputted together with the audio data corresponding to the noise 67. In this case, the classification model 68 may determine that the noise is unrealistic in the context of the fork object 65 not producing the sound itself, and not interacting with any other object, and therefore the output is a non-realistic score, e.g. 0% realism, or a low realism score. Similarly, for Figure 10c, the video data corresponding to the static plate 66 is inputted together with the audio data corresponding to the noise 67. In this case, the classification model 68 may determine that the noise 67 is again unrealistic in the context of the plate object 66 not producing the sound itself, and not interacting with any other object. Again, the output from the classification model 68 may be a non-realistic score or a low-realism score.
Overall, therefore, a low or zero realism score is indicative that the removed object is salient. In the shown example, both objects 65, 66 are determined as salient.
Referring to Figure 8, in some embodiments for the Figure 6 embodiment, a generative trained model may be used (e.g. a restricted Boltzmann machine, a generative adversarial network, a variational auto-encoder etc.) wherein the input is one form of data from the modified scene, e.g. the video data (step 8.1), which is processed by the model in step 8.2 to produce an expected form of other data, e.g. the expected audio data. The expected form of the other data is then compared with corresponding data from the modified scene, e.g. the audio data (step 8.3), to generate a distance function or measure to determine the degree of similarity between the expected sound and the actual sound (step 8.4).
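One concrete, assumed instantiation of the distance measure in step 8.4 is a Euclidean distance between log-mel spectrograms of the expected and captured audio; the specification does not mandate this particular feature or metric, and the librosa calls below are illustrative.

```python
import numpy as np
import librosa

def audio_distance(expected, actual, sr=48000, n_mels=64):
    """Mean squared distance between log-mel spectrograms of the model's
    expected audio and the actually captured audio (both mono waveforms)."""
    def logmel(y):
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        return librosa.power_to_db(S, ref=np.max)
    a, b = logmel(expected), logmel(actual)
    n = min(a.shape[1], b.shape[1])          # align to the shorter clip
    return float(np.mean((a[:, :n] - b[:, :n]) ** 2))
```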
In some embodiments for step 6.1, only audio data is processed. For example, the audio data may be speech data. In this case, the object identification stage in step 2.1 may comprise sound-source separation. The subsequent processing stage may use a language model. For example, if the captured content contains speech, e.g. a conversation, then a measure may be made of how removal of one object affects the interaction between, or with, the remaining object(s). A natural language processing (NLP) model is one such example. For example, the model may determine whether the speech makes sense or not, e.g. following a question and answer pattern, or sounds natural or not. Consequently, it can be computed whether the speech associated with the removed object is important, or salient, to the scene.
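As one assumed way of realising such an NLP check, the perplexity of a pre-trained causal language model over the remaining (non-silenced) utterances can serve as a naturalness score; the choice of GPT-2 and the Hugging Face transformers library here is purely illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

def dialogue_perplexity(turns):
    """turns: ordered list of the non-silenced utterances. Lower perplexity
    suggests the remaining dialogue still follows a natural language pattern;
    a large jump after silencing a speaker marks that speaker as salient."""
    ids = tokenizer(" ".join(turns), return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss      # mean per-token cross-entropy
    return float(torch.exp(loss))
```

Applied to the Figure 11 example described next, the iterations silencing objects 72 or 73 would be expected to yield a markedly higher perplexity than those silencing objects 74 or 75.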
To illustrate, in Figure 11 an example is shown to demonstrate how the Figure 2 method may be applied to a captured audio scene containing speech. Figure 11a shows the captured scene 70 which comprises a first (human) object 72 in conversation with a second (human) object 73 over a non-zero time period. In the background, third and fourth (human) objects 74, 75 are observing and generate noise in the background which may also be speech or some other background noise, e.g. clapping or cheering. For ease of explanation, the time period covers four exchanges of speech between the first and second objects 72, 73, with the background noise being clapping noise covering the entire time period, for example:
Speech 1 (Object 72): “Hello George, how are you enjoying the show?”
Speech 2 (Object 73): “It is great to be here, what did you think?”
Speech 3 (Object 72): “It was ok. I missed the first part, but the second part was good”
Speech 4 (Object 73): “That’s a shame, the first part was the best part”.
In the first iteration shown in Figure 11a, the first object 72 is silenced. In this audio context, silenced means that the speech is muted. The resulting audio data which is fed into the NLP model 77 is the following, with the clapping noise in the background:
“It is great to be here, what did you think?”
<Pause>
“That’s a shame, the first part was the best part”.
The NLP model 77 may compute that this speech pattern does not conform to a recognised natural language pattern, and hence determine low, or zero realism.
In the second iteration shown in Figure 11b, the second object 73 is silenced. The resulting audio data which is fed into the NLP model 77 is the following, with the clapping noise in the background:
“Hello George, how are you enjoying the show?” <Pause>
“It was ok. I missed the first part, but the second part was good”
The NLP model 77 may compute that this speech pattern does not conform to a recognised natural language pattern, and hence determine low, or zero realism.
In the third and fourth iterations (shown collectively in Figure 11c for ease of illustration) the third and fourth objects 74, 75 are respectively silenced. The resulting audio data which is fed into the NLP model 77 follows the original speech pattern without background clapping, i.e.
“Hello George, how are you enjoying the show?” “It is great to be here, what did you think?” “It was ok. I missed the first part, but the second part was good” “That’s a shame, the first part was the best part”.
The NLP model 77 may compute that this speech pattern does conform to a recognised natural language pattern, and hence determine a high realism score.
Therefore, the removed objects which resulted in a low realism score are determined to be salient, i.e. the first and second objects 72, 73.
Referring to Figure 12, in a second embodiment of step 2.3 a determination is made as to how the removal affects the behaviour of observers, e.g. external human users when consuming the modified content. This embodiment introduces external observers into the loop, to feed back data responsive to them consuming, i.e. viewing and/or listening to, the modified content during each iteration of the Figure 2 process. The term “behaviour” may refer to, but is not limited to, a quantifiable reaction, e.g. heart rate, perspiration (sweat), temperature, spatial position or gaze direction, to give some examples.
Referring to Figure 13, a practical implementation for deriving data indicative of observer behaviour is shown. One or more external human users 80 are shown in relation to either the Figure 1 or Figure 4 post-processing system 1, 25. Each of the users 80 may receive the video and/or audio data as modified during the Figure 2 process to iteratively remove or silence the objects in turn. More specifically, the external users 80 may be wearing a VR headset 81 which provides audio through headphones and video through a pair of video screens. Further, the external users may be carrying one or more sensors 82, which may be one or more of a heart rate sensor, a sweat sensor, or indeed any sensor capable of sensing a bodily or biological reaction over time. The data generated by the one or more such sensors is provided to the post-processing system 1, 25 by means of a wired connection, or by wireless means, and this sensor data may even be transmitted wirelessly or remotely over a network provided it is possible to correlate the sensor data with the data being consumed.
In some embodiments, the gaze direction of the external users 80 may be the sensor data, or additional sensor data. The gaze direction may utilise one or more cameras provided within the VR headset 81 each of which is directed at the user’s eyes and monitors movement in all directions away from a central, reference, position, which is usually set or estimated when the user’s eyes are looking directly forwards.
The general process employed by the software application 61 is indicated in Figure 14. In a first step 14.1, the modified content is provided to the or each external user. The modified content can be only audio content, only video content, or a combination of both, and may also include spatial content. In step 14.2 data is received from the or each sensor carried or worn by the external users. In step 14.3 the sensor data is processed to quantify or generate a measurement of the user’s reaction to a particular object being removed or silenced.
In this respect, it will be appreciated that removal of an object that is salient may cause confusion to the external observer, who cannot make sense of, e.g. what they are seeing or hearing. Confusion may cause a detectable increase in heart rate and/or perspiration and/or cause the user’s gaze direction to change as they attempt to comprehend what is being conveyed to them. The degree to which the change occurs can be quantified and interpreted by the software application 61 as a measure of saliency for the removed object in this case. Where the reaction or change measure is above a predetermined threshold, a removed object may be classified as salient.
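A simple, assumed way of quantifying such a reaction from the sensor data of Figure 13 is sketched below; the weighting of the two cues and the threshold are illustrative only and are not prescribed by the specification.

```python
import numpy as np

def reaction_measure(heart_rate_bpm, gaze_xy, baseline_bpm, threshold=1.0):
    """heart_rate_bpm: samples recorded while the modified content is shown;
    gaze_xy: Nx2 gaze directions over the same window; baseline_bpm: the
    observer's heart rate for the unmodified content."""
    hr_change = max(np.mean(heart_rate_bpm) - baseline_bpm, 0.0) / baseline_bpm
    gaze_spread = float(np.mean(np.std(np.asarray(gaze_xy, float), axis=0)))
    score = 10.0 * hr_change + gaze_spread       # assumed weighting of cues
    return score, score > threshold              # (measure, classify as salient)
```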
In some embodiments, a combination of both the Figure 6 and Figure 12 methods may be used, which may be in parallel to produce a combined saliency determination, or which may be cascaded, one after the other, where the later method augments or refines the earlier method.
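A sketch of how the two measures might be fused, either in parallel or in cascade, is given below; the blending weight and gating value are assumptions.

```python
def combined_saliency(model_score, observer_score, weight=0.5,
                      cascade=False, gate=0.3):
    """Fuse the automatic (Figure 6) and observer-based (Figure 12) measures."""
    if cascade:
        # Observer data only refines objects the automatic stage already flags.
        return observer_score if model_score >= gate else model_score
    return weight * model_score + (1.0 - weight) * observer_score
```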
The above systems and methods may therefore provide improved determination of salient objects over non-salient objects by using an iterative silencing framework. The systems and methods may also provide a way of quantifying saliency, e.g. in terms of a score or percentage. The systems and methods have use in the fields of spatial audio mixing and 3D object reconstruction. Also, salient objects are a building block for a larger number of applications which may form part of a further post-processing module or stage. For example, one application is in searching and browsing of salient objects and events in a large corpus of audio-visual data. For example, we may wish to search for all images where a certain object is a salient object rather than an object with a minor role in the scene. Also, we may wish to browse to a point in a video clip where a certain object becomes a salient object. Another useful application is the spatio-temporal summarisation of videos, particularly in the case of 360 degree or spherical video (such as Nokia’s OZO video) due to its large spatial extent. In order to generate a spatio-temporal summary, all salient objects may be detected and inserted in a final compilation of video portions which may represent the final summary.
It will be appreciated that the above described embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present application.
Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof and during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.

Claims

    1. A method comprising:
    1) identifying a plurality of objects in a captured video and/or audio scene;
    2) processing the scene by performing:
    i) removing an object from the scene; and ii) measuring the effect of removing said object using received data;
    iii) reintroducing said object into the scene;
    iv) repeating i) to iii) for one or more other objects in the scene in turn;
    and
    3) determining the saliency level for each object based on the measured effect from 2)ii).
    2. The method of claim 1, further comprising 4) identifying using the saliency levels determined in 3) a subset of the objects as being more salient than one or more other objects in the scene.
    3. The method of claim 2, wherein the subset comprises one or more objects for which the measurement is above a predetermined threshold.
    4. The method of any preceding claim, wherein 2)ii) comprises measuring the difference in the functionality and/or interaction of the one or more non-removed objects with trained data over a time frame.
    5. The method of claim 4, wherein the trained data comprises one or more trained data models, for example a neural network, pattern recognition model and/or natural language processing model.
    6. The method of claim 5, wherein the trained data model comprises a motion model representing the expected motion of one or more objects over time and wherein 2)ii) comprises measuring the difference between movement of the non-removed objects over the time frame, and the expected movement determined by the motion model.
    7. The method of any of claims 4 to 6, wherein the captured scene is a video and audio scene represented by video and audio data for the objects within a space, wherein 2)ii) comprises for at least part of the scene which comprises the removed object, using trained data to measure if the interaction between two or more objects which move relative to one another over the time frame is realistic.
    8. The method of claim 7, wherein 2)i) comprises removing the audio and video data for the removed object, and 2)ii) comprises applying audio produced at the time the two or more objects spatially interact to the trained data to measure if the interaction is realistic based on the amount of the audio attributable to the non-removed objects.
    9. The method of claim 8, wherein 2)ii) is performed for a series of time frames of the captured scene leading up to, and including, a spatial contact or collision between the removed object and one or more non-removed objects.
    10. The method of claim 8 or claim 9, wherein the trained model receives the video data resulting from 2)i) and the audio data for the overall scene at, and/or leading up to, the time of interaction and wherein a binary decision is generated as to whether or not an interaction is realistic.
    11. The method of claim 8 or claim 9, wherein the trained model receives video data resulting from 2)i) and generates therefrom expected audio for the overall scene over a series of time frames, and wherein a difference is measured between the received audio and the expected audio.
    12. The method of any preceding claim, wherein the captured scene contains speech audio data from two or more audio sources captured over a time frame, wherein 2)i) comprises silencing the speech audio from one audio source, and 2)ii) comprises using a natural language processing model to measure the degree to which the speech audio for the non-silenced audio sources over the time frame conform to a recognised pattern.
    13. The method of any preceding claim, further comprising: generating audio and/or video content representing a time period which includes the scene resulting from 2)i); providing the content for output to one or more external observers; and wherein 2)ii) comprises measuring the effect of removing the object using data received from one or more sensors associated with said external observers.
    14. The method of claim 13, wherein the sensor data includes one or more of heart rate data, temperature data, sweat data and gaze direction data.
    15. The method of any preceding claim, performed in a post-processing phase of virtual reality, augmented reality and/or spatial audio content.
    16. The method of claim 2 or any claim dependent thereon, wherein the identified subset of objects in 4) are rendered in a post-processed virtual space and the remaining objects are removed and/or filtered to limit their movement and/or playback volume in the rendered virtual space.
    17. The method of any preceding claim, wherein 3) comprises ranking or categorising objects based on their saliency levels.
    18. A computer program comprising instructions that when executed by a computer apparatus control it to perform the method of any of claims 1 to 17.
    19. Apparatus configured to perform the method of any of claims 1 to 17.
    20. Apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured with the processor to cause the apparatus to:
    1) identify a plurality of objects in a captured video and/or audio scene;
    2) process the scene by performing:
    i) removing an object from the scene;
    ii) measuring the effect of removing said object using received data;
    iii) reintroducing said object into the scene;
    iv) repeating i) to iii) for one or more other objects in the scene in turn;
    and
    3) determine the saliency level for each object based on the measured effect from 2)ii).
    21. A non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising:
    1) identifying a plurality of objects in a captured video and/or audio scene;
    2) processing the scene by performing:
    i) removing an object from the scene;
    ii) measuring the effect of removing said object using received data;
    iii) reintroducing said object into the scene;
    iv) repeating i) to iii) for one or more other objects in the scene in turn;
    and
    3) determining the saliency level for each object based on the measured effect from 2)ii).
    Application No: GB1615006.2 Examiner: Matthew Khong
GB1615006.2A 2016-09-05 2016-09-05 Salient object detection Withdrawn GB2553351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1615006.2A GB2553351A (en) 2016-09-05 2016-09-05 Salient object detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1615006.2A GB2553351A (en) 2016-09-05 2016-09-05 Salient object detection

Publications (2)

Publication Number Publication Date
GB201615006D0 GB201615006D0 (en) 2016-10-19
GB2553351A true GB2553351A (en) 2018-03-07

Family

ID=57140099

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1615006.2A Withdrawn GB2553351A (en) 2016-09-05 2016-09-05 Salient object detection

Country Status (1)

Country Link
GB (1) GB2553351A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102017219274A1 (en) * 2017-10-26 2019-05-02 Robert Bosch Gmbh Method and apparatus for improving the robustness of a machine learning system
CN112507805A (en) * 2020-11-18 2021-03-16 深圳市银星智能科技股份有限公司 Scene recognition method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110287811A1 (en) * 2010-05-21 2011-11-24 Nokia Corporation Method and apparatus for an augmented reality x-ray

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110287811A1 (en) * 2010-05-21 2011-11-24 Nokia Corporation Method and apparatus for an augmented reality x-ray

Also Published As

Publication number Publication date
GB201615006D0 (en) 2016-10-19

Similar Documents

Publication Publication Date Title
CN109816589B (en) Method and apparatus for generating cartoon style conversion model
US10096122B1 (en) Segmentation of object image data from background image data
CN108140032B (en) Apparatus and method for automatic video summarization
CN110785767B (en) Compact linguistics-free facial expression embedding and novel triple training scheme
JP6592183B2 (en) monitoring
US11113842B2 (en) Method and apparatus with gaze estimation
US9875445B2 (en) Dynamic hybrid models for multimodal analysis
Katsaggelos et al. Audiovisual fusion: Challenges and new approaches
US10217027B2 (en) Recognition training apparatus, recognition training method, and storage medium
US20190156204A1 (en) Training a neural network model
CN112889108B (en) Speech classification using audiovisual data
CN110073369B (en) Unsupervised learning technique for time difference model
CN116171473A (en) Bimodal relationship network for audio-visual event localization
JP2019508805A (en) Semantic Segmentation Based on Global Optimization
KR20180049786A (en) Data recognition model construction apparatus and method for constructing data recognition model thereof, and data recognition apparatus and method for recognizing data thereof
Srivastava et al. UAV surveillance for violence detection and individual identification
CN113408566A (en) Target detection method and related equipment
Goudelis et al. Fall detection using history triple features
Fei et al. Flow-pose Net: An effective two-stream network for fall detection
US20200294507A1 (en) Pose-invariant Visual Speech Recognition Using A Single View Input
GB2553351A (en) Salient object detection
KR20190103222A (en) Automated Activity-Time Training Techniques
CN112215112A (en) Method and system for generating neural network model for hand motion recognition
KR102057837B1 (en) Apparatus and method for fabric pattern generation based on artificial intelligence
JP6590477B2 (en) Information processing apparatus, information processing method, and program

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)