WO2018234622A1 - A method for detecting events-of-interest - Google Patents

A method for detecting events-of-interest

Info

Publication number
WO2018234622A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
video
panorama video
user responses
temporal
Prior art date
Application number
PCT/FI2018/050426
Other languages
French (fr)
Inventor
Lixin Fan
Yu You
Tinghuai Wang
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2018234622A1 publication Critical patent/WO2018234622A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N 21/4728 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region
    • G PHYSICS
    • G02 OPTICS
    • G02B OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B 27/00 Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B 27/01 Head-up displays
    • G02B 27/017 Head mounted
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • G06V 40/193 Preprocessing; Feature extraction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 Processing image signals
    • H04N 13/172 Processing image signals comprising non-image signal components, e.g. headers or format information
    • H04N 13/178 Metadata, e.g. disparity information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/204 Image signal generators using stereoscopic image cameras
    • H04N 13/243 Image signal generators using stereoscopic image cameras using three or more 2D image sensors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/296 Synchronisation thereof; Control thereof
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/442 Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N 21/44213 Monitoring of end-user related data
    • H04N 21/44222 Analytics of user selections, e.g. selection of programs or purchase activity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/816 Monomedia components thereof involving special video data, e.g. 3D video
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/84 Generation or processing of descriptive data, e.g. content descriptors
    • H04N 21/8405 Generation or processing of descriptive data, e.g. content descriptors represented by keywords
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 Control of cameras or camera modules
    • H04N 23/698 Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
    • G PHYSICS
    • G02 OPTICS
    • G02B OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B 27/00 Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B 27/01 Head-up displays
    • G02B 27/0101 Head-up displays characterised by optical features
    • G02B 2027/0132 Head-up displays characterised by optical features comprising binocular systems
    • G02B 2027/0134 Head-up displays characterised by optical features comprising binocular systems of stereoscopic type
    • G PHYSICS
    • G02 OPTICS
    • G02B OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B 27/00 Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B 27/01 Head-up displays
    • G02B 27/0179 Display position adjusting means not related to the information to be displayed
    • G02B 2027/0187 Display position adjusting means not related to the information to be displayed slaved to motion of at least a part of the body of the user, e.g. head, eye
    • G PHYSICS
    • G03 PHOTOGRAPHY; CINEMATOGRAPHY; ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ELECTROGRAPHY; HOLOGRAPHY
    • G03B APPARATUS OR ARRANGEMENTS FOR TAKING PHOTOGRAPHS OR FOR PROJECTING OR VIEWING THEM; APPARATUS OR ARRANGEMENTS EMPLOYING ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ACCESSORIES THEREFOR
    • G03B 37/00 Panoramic or wide-screen photography; Photographing extended surfaces, e.g. for surveying; Photographing internal surfaces, e.g. of pipe

Definitions

  • Figure 5 shows an example of how a heatmap can be used to further visualize EOIs within a frame, i.e. at a particular moment in time.
  • the metadata relating to 4D EOIs, created as described above, may be utilized in a wide variety of Augmented Reality/Virtual Reality (AR/VR) applications.
  • a 4D video control bar may be included on the display, for example below the displayed 360° panorama video, as shown in Figure 6.
  • the 4D video control bar 600 may comprise a horizontally aligned timeline of the playback and a plurality of EOIs 602a - 602h along the timeline as vertically aligned bars.
  • various types of predetermined EOIs may be indicated e.g.
  • such a 4D video control bar can be used to preview and navigate EOIs, for example before a user indulges in the lengthy video content (a data-structure sketch of such control bar entries is given after this list).
  • the 4D video control bar also facilitates the selection of interesting viewing angles, FOVs, depths, etc., which are crucial for a good 360-degree video viewing experience.
  • the 4D video control bar allows interacting with the EOIs, such as previewing an EOI or fast-forwarding to an EOI, in both time and 3D space.
  • an interactive 3D sphere 604 may be rendered with certain 3D viewing locations highlighted.
  • the highlighted locations 606, 608 are associated with identified EOIs, and are clickable by users in the 4D video control bar as the vertically aligned bars, as shown in Figure 6.
  • users may rotate the 3D sphere to a different viewpoint.
  • Figure 6 illustrates how, after clicking a particular 3D location, the video playback view has switched to the alternative location and viewpoint of the EOI.
  • the 3D sphere may disappear automatically and the playback continues from the view of the selected location.
  • additional information, such as details of the place or event, or advertisements, can be rendered to fuse into the video content, together with the 3D information of the EOI regions.
  • the geometric boundary of the EOI determines the placement of additional information and can be used as the anchor for various placement decisions. For example, the placement can be in front of the EOI in the direction of the viewing angle from the current viewer and always facing the viewer.
  • the embodiments enable detecting EOIs in 360-degree panorama video not only in terms of their temporal location, but also in terms of their 3D spatial location, i.e. resulting in so-called 4D EOIs.
  • the embodiments utilize crowd-sourced video viewing and interaction behaviors, thereby providing statistically more reliable results about the relevant 4D EOIs.
  • the various embodiments may be implemented in hardware or special purpose circuits or any combination thereof. While various embodiments may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the implementation may include a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform the various embodiments or a subset of them.
  • the implementation may include a computer program embodied on a non-transitory computer readable medium, the computer program comprising instructions causing, when executed on at least one processor, at least one apparatus to perform the various embodiments or a subset of them.
  • an apparatus may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the terminal device to carry out the features of an embodiment.
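As referenced in the control bar bullet above, the following is a minimal, purely illustrative sketch of how 4D EOI entries backing such a control bar might be represented; the class, the field names and the player methods (seek, set_viewpoint) are editorial assumptions, not part of the disclosure.

    from dataclasses import dataclass

    @dataclass
    class EOIMarker:
        label: str                # e.g. "602a", one of the bars on the timeline
        time_s: float             # temporal location along the playback timeline
        viewing_position: tuple   # (x, y, z) location highlighted on the 3D sphere
        viewing_angle_deg: float  # suggested viewing direction at that location
        depth_m: float            # depth of the EOI region

    def on_marker_clicked(player, marker):
        """Hypothetical handler: fast-forward to the EOI in both time and 3D space."""
        player.seek(marker.time_s)
        player.set_viewpoint(marker.viewing_position, marker.viewing_angle_deg)

    markers = [
        EOIMarker("602a", 12.5, (0.0, 1.6, 0.0), 45.0, 3.2),
        EOIMarker("602b", 87.0, (1.2, 1.6, -0.5), 210.0, 6.8),
    ]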

Abstract

A method comprising: providing at least one three-dimensional (3D) panorama video to be viewed by at least one user; recording, in response to said at least one user starting to view the at least one 3D panorama video, a plurality of user responses during viewing the at least one 3D panorama video; analysing the plurality of user responses; extracting, from the analysed user responses, one or more predetermined behaviors, said behavior comprising a temporal sequence of user responses; and creating metadata for said at least one 3D panorama video, the metadata comprising at least information indicative of the one or more predetermined behaviors and their spatial and temporal data.

Description

A METHOD FOR DETECTING EVENTS-OF-INTEREST
Field
Various embodiments relate to panorama videos, and more particularly to a method for detecting events-of-interest.
Background of the invention
The availability of panoramic 360-degree video has gradually increased, and such content is nowadays hosted on major public video sites, such as YouTube®. A panoramic video typically supports a 360-degree field of view horizontally and over 180 degrees vertically. A panoramic 360-degree video is delivered to client applications running on platforms that support the rendering of 360-degree panorama video. The video is typically delivered to the user compressed in an equirectangular or cube map projection.
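For illustration only (this mapping is standard practice and not specific to the disclosure), the sketch below shows how a unit view direction maps to pixel coordinates in an equirectangular panorama frame:

    import numpy as np

    def direction_to_equirectangular(direction, width, height):
        """Map a unit 3D view direction to (u, v) pixel coordinates
        in an equirectangular panorama of size width x height."""
        x, y, z = direction / np.linalg.norm(direction)
        lon = np.arctan2(x, z)   # longitude, -pi .. pi
        lat = np.arcsin(y)       # latitude, -pi/2 .. pi/2
        u = (lon / (2 * np.pi) + 0.5) * width
        v = (0.5 - lat / np.pi) * height
        return u, v

    # Looking straight ahead lands in the middle of the frame.
    print(direction_to_equirectangular(np.array([0.0, 0.0, 1.0]), 3840, 1920))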
When playing back the 360-degree panorama video, viewers may pan 360 degrees within a horizontal or vertical (or any directional) loop to watch the scene. The 360-degree panorama video may be viewed on an immersive multimedia display unit, for example a head mounted display with video player software, whereupon the viewer may experience being immersed in the scene shown on the display. On the other hand, 360-degree panorama video may be rendered for display on an ordinary 2D display, provided that the playback device supports rendering of 360-degree panorama video.
In a variety of contexts, a video footage may contain multiple highlights or scenes that could be of interest to viewers, i.e. so-called events of interest (EOIs). If the temporal locations of such EOIs within the video are known, the user may preview or forward the playback to such a location. However, the spatial information contained in a 360-degree panorama video, i.e. viewing position, viewing direction, field-of-view and depth, provides a huge variety in what could be shown on the display. Therefore, providing the mere temporal locations of the EOIs is insufficient if good coverage of EOIs is desired.
Summary of the invention
Now there has been invented an improved method and technical equipment implementing the method, by which the above problems are at least alleviated. Various aspects of the invention include a method, an apparatus and a computer program, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims. According to a first aspect, there is disclosed a method comprising: providing at least one three-dimensional (3D) panorama video to be viewed by at least one user; recording, in response to said at least one user starting to view the at least one 3D panorama video, a plurality of user responses during viewing the at least one 3D panorama video; analysing the plurality of user responses; extracting, from the analysed user responses, one or more predetermined behaviors, said behavior comprising a temporal sequence of user responses; and creating metadata for said at least one 3D panorama video, the metadata comprising at least information indicative of the one or more predetermined behaviors and their spatial and temporal data.
According to an embodiment, said behavior comprises one or more of the following:
- change of a viewing angle or a field-of-view (FOV);
- change of a viewing position;
- change of focus in terms of depth;
- pausing or rewinding or forwarding of the video playback;
- adding a comment to the video.
According to an embodiment, the at least one 3D panorama video is associated with a log file for storing the plurality of user responses.
According to an embodiment, the plurality of user responses are recorded from a plurality of users.
According to an embodiment, the method further comprises extracting key features from said user responses; and forming feature vectors associated with said behaviors on the basis of the key features, said feature vectors comprising 3D spatial data and temporal data of said behavior.
According to an embodiment, the key features comprise one or more of the following:
- eye coordinates of the user;
- center coordinates of the field-of-view (FOV);
- starting and/or ending time of the user response;
- frequently used words in comments.
According to an embodiment, the method further comprises clustering feature vectors having similar spectral properties; and identifying the behavior underlying a cluster of feature vectors.
According to an embodiment, the method further comprises projecting at least one cluster of feature vectors to a depth map of each frame of said 3D panorama video; and generating a heatmap of each depth map.
According to an embodiment, the method further comprises providing a file comprising said 3D panorama video with a temporal control bar, wherein the one or more predetermined behaviors contained by said 3D panorama video are indicated according to their temporal location.
According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to carry out the method of any of the embodiments.
According to a third aspect, there is provided a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform the method according to any of the above embodiments.
These and other aspects of the invention and the embodiments related thereto will become apparent in view of the detailed disclosure of the embodiments further below.
List of drawings
In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which
Fig. 1a shows an example of a multi-camera system as a simplified block diagram, in accordance with an embodiment;
Fig. 1b shows a perspective view of a multi-camera system, in accordance with an embodiment;
Fig. 2 shows an example of a video playback apparatus as a simplified block diagram, in accordance with an embodiment;
Fig. 3 shows a flow chart of an event detection method according to an embodiment of the invention;
Fig. 4 illustrates an example of a user watching a 360-degree 3D panorama video;
Fig. 5 illustrates an example of a heatmap visualizing EOIs within a frame according to an embodiment of the invention;
Fig. 6 illustrates an example of a video control bar application according to an embodiment of the invention.
Description of embodiments
The following embodiments are exemplary. Although the specification may refer to "an", "one", or "some" embodiment(s) in several locations, this does not necessarily mean that each such reference is to the same embodiment(s), or that the feature only applies to a single embodiment. Single features of different embodiments may also be combined to provide other embodiments.
When 360-degree stereo panorama video is viewed on an immersive multimedia display unit, for example a head mounted display with video player software, the video player may be able to create an effect similar to the viewer moving in the immersed space, as in the real world. The forward and/or backward motion of the user's head may make the video look more realistic, as objects in the foreground appear to move slightly in relation to background objects when the head is moved.
Figure 1a illustrates an example of a multi-camera system 100, which may be able to capture and produce 360-degree stereo panorama video. The multi-camera system 100 comprises two or more camera units 102. In this example, the number of camera units 102 is eight, but it may also be less or more than eight. Each camera unit 102 is located at a different location in the multi-camera system, and may have a different orientation with respect to the other camera units 102, so that they may capture a part of the 360-degree scene from different viewpoints substantially simultaneously. A pair of camera units 102 of the multi-camera system 100 may correspond with left and right eye viewpoints at a time. As an example, the camera units 102 may have an omnidirectional constellation, so that the system has a 360° viewing angle in 3D space. In other words, such a multi-camera system 100 may be able to see each direction of a scene, so that each spot of the scene around the multi-camera system 100 can be viewed by at least one camera unit 102 or a pair of camera units 102.
Without losing generality, any two camera units 102 of the multi-camera system 100 may be regarded as a pair of camera units 102. Hence, a multi-camera system of two cameras may have only one pair of camera units, a multi-camera system of three cameras may have three pairs of camera units, a multi-camera system of four cameras may have six pairs of camera units, etc. Generally, a multi-camera system 100 comprising N camera units 102, where N is an integer greater than one, may have N(N-1)/2 pairs of camera units 102. Accordingly, images captured by the camera units 102 at a certain time may be considered as N(N-1)/2 pairs of captured images.
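As a small illustration of the N(N-1)/2 pairing (an editorial sketch, not part of the original text), the camera pairs can be enumerated as follows:

    from itertools import combinations

    def camera_pairs(n_cameras):
        """Return all unordered pairs of camera unit indices."""
        return list(combinations(range(n_cameras), 2))

    # For the eight-camera system of Figure 1a: 8 * 7 / 2 = 28 pairs of captured images.
    pairs = camera_pairs(8)
    assert len(pairs) == 8 * (8 - 1) // 2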
The multi-camera system 100 of Figure 1a may also comprise a processor 104 for controlling operations of the multi-camera system 100. There may also be a memory 106 for storing data and computer code to be executed by the processor 104, and a transceiver 108 for communicating with, for example, a communication network and/or other devices in a wireless and/or wired manner. The multi-camera system 100 may further comprise a user interface (UI) 110 for displaying information to the user, for generating audio signals, and/or for receiving user inputs. However, the multi-camera system 100 need not comprise each feature mentioned above, or may comprise other features as well. For example, there may be electric and/or mechanical elements for adjusting and/or controlling optics of the camera units 102 (not shown).
Figure 1a also illustrates some operational elements which may be implemented, for example, as computer code which can be executed in the processor 104, in hardware, or both, to perform a desired function. An optical flow estimation element 114 may perform optical flow estimation for pairs of images of different camera units 102. Transform vectors or other information indicative of an amount of interpolation/extrapolation to be applied to different parts of a viewport may have been stored in the memory 106, or they may be calculated e.g. as a function of the location of the pixel in question. The operation of the elements will be described later in more detail. It should be noted that there may also be other operational elements in the multi-camera system 100 than those depicted in Figure 1a.
Figure 1b shows a perspective view of the multi-camera system 100, in accordance with an embodiment. In Figure 1b seven camera units 102a-102g can be seen, but the multi-camera system 100 may comprise even more camera units which are not visible from this perspective view. Figure 1b also shows two microphones 112a, 112b, but the apparatus may also comprise one microphone or more than two microphones.
In accordance with an embodiment, the multi-camera system 100 may be controlled by another device, wherein the multi-camera system 100 and the other device may communicate with each other and a user may use a user interface of the other device for entering commands, parameters, etc. and the user may be provided with information from the multi-camera system 100 via the user interface of the other device.
Some terminology regarding the multi-camera system 100 will now be shortly described. A viewport is a part of the scene which is displayed by a head mounted display at a time. Both left and right eye images may have overlapping, but slightly different viewports. A camera space, or camera coordinates, stands for a coordinate system of an individual camera unit 102 whereas a world space, or world coordinates, stands for a coordinate system of the multi-camera system 100 as a whole. An optical flow may be used to describe how objects, surfaces, and edges in a visual scene move or transform, when an observing point moves from a location of one camera to a location of another camera. In some embodiments, there need not be any actual movement but it may virtually be determined how the view of the scene might change when a viewing point is moved from one camera unit to another camera unit.
Figure 2 shows an example of a video playback apparatus 200 as a simplified block diagram, in accordance with an embodiment. A non-limiting example of the video playback apparatus 200 is an immersive display unit. An example of the immersive display unit includes, but is not limited to, a head mounted display. The video playback apparatus 200 may comprise, for example, one or two displays 202 for video playback. When two displays are used, a first display 202a may display images for a left eye and a second display 202b may display images for a right eye, in accordance with an embodiment. In case of only one display 202, that display 202 may be used to display images for the left eye on the left side of the display 202 and images for the right eye on the right side of the display 202. While describing various embodiments further below, it is assumed that the one or two displays are configured to create one or more images surrounding the viewer partly or completely, such as a head-mounted display with a head tracking system, or a cylindrical or spherical display curving in 2D or in 3D at least partly, but possibly 360°, around the viewer.
The video playback apparatus 200 may be provided with encoded data streams via a communication interface 204, and a processor 206 may perform control operations for the video playback apparatus 200 and may also reconstruct video streams for displaying on the basis of the received encoded data streams. There may also be a decoder 208 for decoding received data streams and a memory 210 for storing data and computer code. In an embodiment, the decoder 208 is implemented, for example, as software code, which can be executed by the processor 206 to perform the desired function, in hardware, or in both. The video playback apparatus 200 may further comprise a user input 212 for receiving e.g. user instructions or inputs. The video playback apparatus 200 may comprise an image modification unit 214 which may perform image modification on the basis of modification information provided by a user input and at least one image modification function, and transform image elements, as will be described later in this specification. In an embodiment, the image modification unit 214 can be implemented, for example, as software code, which can be executed by the processor to perform the desired function, in hardware, or in both.
It should be noted that the video playback device 200 need not comprise each of the above elements, or may also comprise other elements. For example, the decoding element 208 may be a separate device, wherein that device may perform decoding operations and provide the decoded data stream to the video playback device 200 for further processing and displaying of the decoded video streams. In a variety of contexts, a video footage may contain multiple highlights or scenes that could be of interest to viewers, i.e. so-called events of interest (EOIs). If the temporal locations of such EOIs within the video are known, the user may preview or forward the playback to such a location. However, the spatial information contained in a 360-degree panorama video, i.e. viewing position, viewing direction, field-of-view and depth, provides a huge variety in what could be shown on the display.
In the following, the method for detecting events of interest (EOI) in accordance with an embodiment is described in more detail with reference to the flow diagram of Figure 3. Video information to be processed may have been captured and processed by two or more camera units 102 to obtain a panorama video, for example a 360-degree panorama video. The playback of the video may be carried out, for example, using the immersive multimedia display unit disclosed in Figure 2, such as a head mounted display with video player software. Alternatively, the playback of the video may be carried out by an ordinary 2D display device supporting rendering of 360-degree panorama video.
In the method, at least one three-dimensional (3D) panorama video is provided to be viewed (300) by at least one user, and in response to said at least one user starting to view the at least one 3D panorama video, a plurality of user responses are recorded (302) during viewing the at least one 3D panorama video. The plurality of user responses are analysed (304), and from the analysed user responses, one or more predetermined behaviors are extracted (306), said behavior comprising a temporal sequence of user responses. Metadata for said at least one 3D panorama video is created (308), the metadata comprising at least information indicative of the one or more predetermined behaviors and their spatial and temporal data.
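By way of example only, the metadata created in step 308 could take a form along the following lines; the field names and values are editorial assumptions used to illustrate "information indicative of the behaviors and their spatial and temporal data":

    # Hypothetical metadata record for one extracted behavior (not a format
    # prescribed by the embodiments).
    eoi_metadata = {
        "video_id": "panorama_clip_001",              # assumed identifier
        "behaviors": [
            {
                "type": "change_of_viewing_angle",    # one of the predetermined behaviors
                "start_frame": 1200,                  # temporal data
                "end_frame": 1275,
                "viewing_position": [0.0, 1.6, 0.0],  # 3D spatial data: x, y, z
                "fov_center_px": [512, 288],
                "depth": 4.2,                         # from the frame's depth map
            },
        ],
    }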
Each user, when viewing the 3D panorama video, is more interested in some events of the video or details appearing on the video frames, commonly referred to as EOIs, than in other events or details. Thus, by gathering data from repeated views of the 3D panorama video, either by a single user or by multiple users, statistical analysis can be carried out to determine which EOIs are more significant and which are less significant. For gathering the data, some user responses, i.e. actions carried out by the user when viewing the video, are detected and recorded during viewing, and the user responses are analysed.
One purpose of the analysis is to identify temporally succeeding user responses, which can be referred to as a behavior of the user with regard to viewing the video. In other words, a particular temporal sequence of user responses can be identified as a particular viewing behavior. Metadata relating to the identified behavior is created, wherein the metadata comprises at least information indicative of the behavior and its spatial and temporal data. Thus, metadata relating to one or more EOIs is gathered, and as this is repeated several times, either by a single user or by multiple users, so-called crowd-sourced metadata relating to users' behavior is created.
On the basis of the metadata, the more significant EOIs can be identified in terms of their spatial and temporal location. The spatial and temporal information may be utilized in previewing and playing back 360-degree video. It is noted that the 3D spatial information together with the time location/duration of an EOI constitutes a four-dimensional entity, which may be referred to as a 4D EOI. The temporal location or duration may be determined in terms of the frame number(s) associated with the EOI. According to an embodiment, said behavior comprises one or more of the following (a simple sketch of how recorded response sequences might map to these behavior types is given after the list):
- change of a viewing angle or a field-of-view (FOV);
- change of a viewing position;
- change of focus in terms of depth;
- pausing or rewinding or forwarding of the video playback;
- adding a comment to the video.
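As referenced above, the following is a minimal sketch of how a temporal sequence of recorded responses might be mapped to one of the listed behavior types; the action names and the precedence order are assumptions made for illustration:

    def classify_behavior(responses):
        """responses: time-ordered list of dicts, each with an 'action' key."""
        actions = [r["action"] for r in responses]
        if any(a in ("pause", "rewind", "forward") for a in actions):
            return "pausing_rewinding_or_forwarding"
        if "comment" in actions:
            return "adding_a_comment"
        if "fov_change" in actions or "viewing_angle_change" in actions:
            return "change_of_viewing_angle_or_fov"
        if "position_change" in actions:
            return "change_of_viewing_position"
        if "depth_refocus" in actions:
            return "change_of_focus_in_depth"
        return "no_predetermined_behavior"

    example = [{"action": "viewing_angle_change", "t": 12.0},
               {"action": "fov_change", "t": 13.5}]
    print(classify_behavior(example))  # -> "change_of_viewing_angle_or_fov"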
An example of a user watching a 360-degree 3D panorama video is shown in Figure 4. The user A is currently watching frame n of the 360-degree 3D panorama video. Frame n is displayed at a particular temporal location tn along the playback timeline. The viewpoint of user A is located at a particular position (xu, yu, zu) in relation to the displayed 3D scenery. From that position, the user has a field-of-view (FOV) to a part of the frame (shown as a rectangle), where the FOV is horizontally limited by the viewing angle. The object shown in the part of frame n that user A is watching is located at a particular depth d0, which can be determined from the depth map of frame n. The parameters described above may be considered an initial user response upon viewing the 360-degree 3D panorama video.
Now, starting from the initial user response, the user may carry out further user responses, and depending on the further user responses, these sequences of user responses may be determined as different behaviors of the user. The user may change the viewing angle, e.g. by rotating the viewing angle at the current position (xu, yu, zu). The user may change the field-of-view (FOV), e.g. by raising or lowering the gaze. The user may change the viewing position in terms of one or more of the coordinates (xu, yu, zu). The user may change the focus in terms of depth by focusing on an object having a depth value higher or lower than d0.
The user may also pause the playback of the video at a particular temporal location. Moreover, the user may rewind or forward the playback of the video to a particular temporal location.
The user may also write a comment relating to the video. Herein, for example commentary sharing systems enabling viewers to add comments on top of an uploaded video, such as Danmu, can be utilized.
The above user responses are such that they represent various behaviors, which presumably indicate some kind of interest of the user towards an event or a detail of the video. Therefore they can be regarded as predetermined behaviors, which are intended to be identified through the analysis of the user responses.
According to an embodiment, the at least one 3D panorama video is associated with a log file for storing the plurality of user responses. Thus, the recorded user responses, as well as the behaviors extracted on the basis of the user responses, may be stored in the log file.
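One possible layout for a log file entry is sketched below; the JSON-lines format and the field names are editorial assumptions, as the embodiment only requires that the responses are stored:

    import json

    log_entry = {
        "user_id": "user_042",            # illustrative identifier
        "action": "pause",                # the recorded user response
        "start_time_s": 73.2,
        "end_time_s": 78.9,
        "frame": 1830,
        "viewing_position": [0.1, 1.6, -0.3],
        "fov_center_px": [960, 540],
        "gaze_depth_m": 3.7,
    }
    with open("panorama_responses.log", "a") as f:
        f.write(json.dumps(log_entry) + "\n")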
According to an embodiment, the plurality of user responses are recorded from a plurality of users. Even though the user responses may be gathered from a single user, statistically more relevant data can be obtained if the user responses are recorded from a plurality of users, thereby better corresponding to a crowd-sourced analysis of EOIs.
According to an embodiment, the method further comprises extracting key features from said user responses and forming feature vectors associated with said behaviors on the basis of the key features, said feature vectors comprising 3D spatial data and temporal data of said behavior.
According to an embodiment, the key features comprise one or more of the following:
- eye coordinates of the user;
- center coordinates of the field-of-view (FOV);
- starting and/or ending time of the user response;
- frequently used words in comments.
The eye coordinates of the user may be measured either by head tracking, e.g. on a Head-Mounted Display (HMD), or by an eye tracking system arranged to measure either the point of gaze or the motion of an eye relative to the head. The eye tracking system may be implemented in functional connection with the playback device. The eye coordinates may comprise the X and Y coordinates of the frame, e.g. in terms of pixels, blocks or macroblocks, and the Z coordinate as the depth value of the object the user is watching.
The center coordinates of the FOV may be measured in relation to the frame, e.g. as the X and Y coordinates of the frame in terms of pixels, blocks or macroblocks, and the Z coordinate as the most prominent depth value in the FOV.
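The following sketch, which is not part of the original disclosure and in which the field names, the normalisation choices and the Python language are assumptions, illustrates one way the key features of a single logged user response could be combined into a normalised spatio-temporal feature vector.

```python
import numpy as np

def make_feature_vector(response: dict, frame_width: int, frame_height: int,
                        max_depth: float, video_duration: float) -> np.ndarray:
    """Form a normalised spatial + temporal feature vector from the key features
    of one user response: FOV centre (or eye) coordinates and the starting and
    ending time of the response."""
    x, y, z = response["fov_center"]             # X, Y in pixels, Z as a depth value
    t_start = response["time_start"]             # start time of the response (seconds)
    t_end = response.get("time_end", t_start)    # end time, defaulting to the start time
    return np.array([
        x / frame_width,                         # normalise spatial coordinates to [0, 1]
        y / frame_height,
        z / max_depth,
        t_start / video_duration,                # normalise temporal coordinates to [0, 1]
        t_end / video_duration,
    ])
```

Normalising all components to a common range is one plausible way to keep spatial and temporal distances comparable before clustering.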
An example of extracting the key features of the user responses is shown in Table 1 below.
[Table 1: example key features extracted from the user responses — the table is provided as an image in the original document and is not reproduced in this text.]
Herein, for user responses which last for a certain period, the data regarding the key features is continuously recorded. For example, if the user makes comments or pauses the playback, this will result in a continuous recording of eye/center coordinates, as well as of the start and end times of the user response. For other actions, e.g. forwarding/rewinding or changing the viewing angle, the end time is more informative for indicating the temporal distribution of the event of interest.
According to an embodiment, the method further comprises clustering feature vectors having similar spectral properties, and identifying the behavior underlying a cluster of feature vectors. The feature vectors may be represented in two-dimensional spatio-temporal coordinates. The feature vectors may be clustered together, for example, such that a maximum distance between any points of feature vectors in the cluster is defined and a group of feature vectors within the maximum distance are clustered together. Herein, while spectral clustering may provide advantages in terms of connectivity criteria which are suitable for spatial-temporal data, a combination of k-means and mean-shift clustering algorithms could alternatively be used for each group. From each cluster, sequences of user responses are identified, since similar sequences typically correspond to at least one predetermined user behavior.
For each user behavior group, the spatial-temporal data of certain associated actions is collected, vectorised and clustered to form spatial-temporal sequences or trajectories corresponding to the underlying favorable user behavior. For example, all data associated with the action of 'viewing angle change', such as start/end time and eye/center coordinates, may be collected. The data is properly vectorised and normalized before being clustered using spectral clustering. Each spatial-temporal cluster then corresponds to a group of favorable user behavior.
According to an embodiment, the method further comprises projecting at least one cluster of feature vectors to a depth map of each frame of said 3D panorama video, and generating a heatmap of each depth map. The tracking of view orientations and their changes, carried out by the users over time, provides good insight into how users experience the 360° panorama. The clusters of feature vectors may be further distinguished by projecting them onto the depth map of each frame, thereby providing information regarding variations of the depth of EOIs experienced by the users. As the number of feature vectors of a particular cluster at a particular depth level increases, these depth-specific clusters may form regions that can be visualized as gradients of color in the form of a heatmap.
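As a non-authoritative sketch of the clustering step described above, the spatio-temporal feature vectors gathered for one action type could, for example, be grouped with spectral clustering; the number of clusters and the affinity choice below are assumptions, and k-means or mean-shift could be substituted as noted in the text.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_action_group(feature_vectors: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    """Cluster the normalised spatio-temporal feature vectors collected for one
    action type (e.g. 'viewing angle change'). Each resulting cluster is a
    candidate group of favorable user behavior, i.e. a candidate 4D EOI."""
    clustering = SpectralClustering(
        n_clusters=n_clusters,
        affinity="nearest_neighbors",   # graph-based affinity reflects connectivity criteria
        assign_labels="kmeans",
        random_state=0,
    )
    # Returns one cluster label per feature vector.
    return clustering.fit_predict(feature_vectors)
```

A graph-based affinity is used here because it tolerates elongated spatio-temporal trajectories better than a purely distance-threshold grouping would.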
Figure 5 shows an example of how a heatmap can be used for further visualizing EOIs within a frame, i.e. at a particular moment in time. A sequential encoding of the frame-specific heatmaps, possibly applying interpolation, creates a heatmap video representation over time, which may provide enhanced visualization and intuition about how the users experience the 360° panorama video over time.
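A minimal sketch, under assumed parameters (depth tolerance and smoothing radius), of how the points of one cluster could be projected onto a frame's depth map and accumulated into a per-frame heatmap; sequentially encoding such frames would yield the heatmap video representation mentioned above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def frame_heatmap(cluster_points: np.ndarray, depth_map: np.ndarray,
                  depth_tolerance: float = 0.5, sigma: float = 15.0) -> np.ndarray:
    """Accumulate the (x, y, depth) points of one cluster that fall on this frame
    onto its depth map and smooth the counts into a heatmap. Points whose depth
    disagrees with the depth map beyond the tolerance are discarded, so that
    depth-specific clusters remain distinguishable."""
    height, width = depth_map.shape
    heat = np.zeros((height, width), dtype=float)
    for x, y, d in cluster_points:
        col, row = int(round(x)), int(round(y))
        if 0 <= col < width and 0 <= row < height and abs(depth_map[row, col] - d) <= depth_tolerance:
            heat[row, col] += 1.0
    heat = gaussian_filter(heat, sigma=sigma)        # spread counts into colour gradients
    return heat / heat.max() if heat.max() > 0 else heat
```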
The metadata relating to 4D EOIs, created as described above, may be utilized in a wide variety of Augmented Reality/Virtual Reality (AR/VR) applications. As an example, an application relating to a 4D video control bar is illustrated below. Nevertheless, potential applications based on the 4D EOI detection method as described herein are by no means limited to the following use case.
A 4D video control bar may be included on the display, for example below the displayed 360° panorama video, as shown in Figure 6. The 4D video control bar 600 may comprise a horizontally aligned timeline of the playback and a plurality of EOIs 602a - 602h along the timeline as vertically aligned bars. Various types of predetermined EOIs may be indicated e.g. using different colors of bars. Such a 4D video control bar can be used to preview and navigate EOIs, for example before a user delves into lengthy video content. In contrast to an existing video control bar, which allows users to shift the playback back and forth at an arbitrary time point, the 4D video control bar also facilitates the selection of interesting viewing angles, FOVs, depths, etc., which are crucial for a good experience of 360-degree video viewing.
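Purely as an illustration of the kind of metadata such a 4D video control bar could consume, the following sketch (the field names are hypothetical and not taken from the disclosure) pairs each detected EOI's timeline span with its 3D viewing parameters and shows a simple query for previewing the EOIs at a given point of the timeline.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EventOfInterest4D:
    start_frame: int                       # temporal location on the control bar timeline
    end_frame: int
    position: Tuple[float, float, float]   # suggested 3D viewing position
    view_direction: Tuple[float, float]    # suggested viewing angle (yaw, pitch)
    depth: float                           # dominant depth of the EOI region
    behavior_type: str                     # e.g. "fov_change", "pause", "comment"

def eois_at_frame(eois: List[EventOfInterest4D], frame: int) -> List[EventOfInterest4D]:
    """Return the EOIs whose temporal span covers the given frame, e.g. so their
    3D locations can be highlighted when the user previews that point of the
    timeline on the control bar."""
    return [e for e in eois if e.start_frame <= frame <= e.end_frame]
```

The behavior_type field could, for instance, drive the different bar colors mentioned above.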
The 4D video control bar allows interacting with the EOIs, such as previewing an EOI or fast-forwarding to an EOI, in both time and 3D space. For example, when hovering over one specific time 602f on the control bar, e.g. with a mouse pointer, an interactive 3D sphere 604 may be rendered with certain 3D viewing locations highlighted. The highlighted locations 606, 608 are associated with identified EOIs, and are clickable by users in the 4D video control bar as the vertically aligned bars, as shown in Figure 6. In case multiple 3D locations are distributed on different sides (e.g. facets of buildings), users may rotate the 3D sphere to a different viewpoint. Figure 6 illustrates how, after clicking a particular 3D location, the video playback view switches to the alternative location and viewpoint of the EOI. The 3D sphere may disappear automatically and the playback continues from the view of the selected location.
With the EOI information, the inclusion of additional information in the views can be controlled and targeted easily. Without breaking the viewing experience, additional information, such as details of the place or event, or advertisements, can be rendered to fuse into the video content, together with the 3D information of the EOI regions. The geometric boundary of the EOI determines the placement of the additional information and can be used as the anchor for various placement decisions. For example, the placement can be in front of the EOI in the direction of the viewing angle from the current viewer and always facing the viewer.
The above embodiments may provide various advantages. The embodiments enable detecting EOIs in 360-degree panorama video not only in terms of their temporal location, but also in terms of their 3D spatial location, i.e. resulting in so-called 4D EOIs. The embodiments utilize crowd-sourced video viewing and interaction behaviors, thereby providing statistically more reliable results about the relevant 4D EOIs.
In general, the various embodiments may be implemented in hardware or special purpose circuits or any combination thereof. While various embodiments may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. Embodiments of the invention may be practiced in various components such as integrated circuit modules.
A skilled man appreciates that any of the embodiments described above may be implemented as a combination with one or more of the other embodiments, unless there is explicitly or implicitly stated that certain embodiments are only alternatives to each other.
The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. Thus, the implementation may include a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform the various embodiments or a subset of them. Additionally or alternatively, the implementation may include a computer program embodied on a non-transitory computer readable medium, the computer program comprising instructions causing, when executed on at least one processor, at least one apparatus to perform the various embodiments or a subset of them. For example, an apparatus may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the terminal device to carry out the features of an embodiment.
The above-presented embodiments are not limiting, but they can be modified within the scope of the appended claims.

Claims:
1. A method comprising:
providing at least one three-dimensional (3D) panorama video to be viewed by at least one user;
recording, in response to said at least one user starting to view the at least one 3D panorama video, a plurality of user responses during viewing the at least one 3D panorama video;
analysing the plurality of user responses;
extracting, from the analysed user responses, one or more predetermined behaviors, said behavior comprising a temporal sequence of user responses; and
creating metadata for said at least one 3D panorama video, the metadata comprising at least information indicative of the one or more predetermined behaviors and their spatial and temporal data.
2. The method according to claim 1, wherein said behavior comprises one or more of the following:
- change of a viewing angle or a field-of-view (FOV);
- change of a viewing position;
- change of focus in terms of depth;
- pausing or rewinding or forwarding of the video playback;
- adding a comment to the video.
3. The method according to claim 1 or 2, wherein the at least one 3D panorama video is associated with a log file for storing the plurality of user responses.
4. The method according to any preceding claim, wherein the plurality of user responses are recorded from a plurality of users.
5. The method according to any preceding claim, further comprising extracting key features from said user responses; and
forming feature vectors associated with said behaviors on the basis of the key features, said feature vectors comprising 3D spatial data and temporal data of said behavior.
6. The method according to claim 5, wherein the key features comprise one or more of the following:
- eye coordinates of the user;
- center coordinates of the field-of-view (FOV);
- starting and/or ending time of the user response;
- frequently used words in comments.
7. The method according to claim 5 or 6, further comprising
clustering feature vectors having similar spectral properties; and
identifying the behavior underlying a cluster of feature vectors.
8. The method according to claim 7, further comprising:
projecting at least one cluster of feature vectors to a depth map of each frame of said 3D panorama video; and
generating a heatmap of each depth map.
9. The method according to any preceding claim, further comprising providing a file comprising said 3D panorama video with a temporal control bar, wherein the one or more predetermined behaviors contained by said 3D panorama video are indicated according to their temporal location.
10. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to carry out the method of any of claims 1 - 9.
11. A computer program embodied on a non-transitory computer readable medium, the computer program comprising instructions causing, when executed on at least one processor, at least one apparatus to carry out the method of any of claims 1 - 9.
PCT/FI2018/050426 2017-06-21 2018-06-07 A method for detecting events-of-interest WO2018234622A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20175582 2017-06-21
FI20175582 2017-06-21

Publications (1)

Publication Number Publication Date
WO2018234622A1 true WO2018234622A1 (en) 2018-12-27

Family

ID=64735516

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2018/050426 WO2018234622A1 (en) 2017-06-21 2018-06-07 A method for detecting events-of-interest

Country Status (1)

Country Link
WO (1) WO2018234622A1 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140270692A1 (en) * 2013-03-18 2014-09-18 Nintendo Co., Ltd. Storage medium storing information processing program, information processing device, information processing system, panoramic video display method, and storage medium storing control data
US20160104508A1 (en) * 2014-10-10 2016-04-14 Samsung Electronics Co., Ltd. Video editing using contextual data and content discovery using clusters
US20160300392A1 (en) * 2015-04-10 2016-10-13 VR Global, Inc. Systems, media, and methods for providing improved virtual reality tours and associated analytics
EP3112985A1 (en) * 2015-06-30 2017-01-04 Nokia Technologies Oy An apparatus for video output and associated methods
US20170061686A1 (en) * 2015-08-28 2017-03-02 Hai Yu Stage view presentation method and system
US20170076429A1 (en) * 2015-09-16 2017-03-16 Google Inc. General spherical capture methods
US9781342B1 (en) * 2015-10-22 2017-10-03 Gopro, Inc. System and method for identifying comment clusters for panoramic content segments
US9762851B1 (en) * 2016-05-31 2017-09-12 Microsoft Technology Licensing, Llc Shared experience with contextual augmentation

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019240647A1 (en) * 2018-06-14 2019-12-19 Telefonaktiebolaget Lm Ericsson (Publ) System and method for providing 360 degrees immersive video based on gaze vector information
CN111309147A (en) * 2020-02-12 2020-06-19 咪咕视讯科技有限公司 Panoramic video playing method and device and storage medium
WO2022199441A1 (en) * 2021-03-23 2022-09-29 影石创新科技股份有限公司 360-degree video playback method and apparatus, computer device, and storage medium

Similar Documents

Publication Publication Date Title
US11924394B2 (en) Methods and apparatus for receiving and/or using reduced resolution images
US11653065B2 (en) Content based stream splitting of video data
JP6708689B2 (en) 3D gameplay sharing
CN109416931B (en) Apparatus and method for gaze tracking
US10440407B2 (en) Adaptive control for immersive experience delivery
JP6410918B2 (en) System and method for use in playback of panoramic video content
US20190335166A1 (en) Deriving 3d volumetric level of interest data for 3d scenes from viewer consumption data
US11748870B2 (en) Video quality measurement for virtual cameras in volumetric immersive media
US20160198140A1 (en) System and method for preemptive and adaptive 360 degree immersive video streaming
JP7459870B2 (en) Image processing device, image processing method, and program
CN105939497B (en) Media streaming system and media streaming method
US10764493B2 (en) Display method and electronic device
JPWO2017169369A1 (en) Information processing apparatus, information processing method, and program
US20210274145A1 (en) Methods, systems, and media for generating and rendering immersive video content
WO2018234622A1 (en) A method for detecting events-of-interest
EP3236336A1 (en) Virtual reality causal summary content
JP2016163342A (en) Method for distributing or broadcasting three-dimensional shape information
US10732706B2 (en) Provision of virtual reality content
CN110730340B (en) Virtual audience display method, system and storage medium based on lens transformation
US11622099B2 (en) Information-processing apparatus, method of processing information, and program
JP2022051978A (en) Image processing device, image processing method, and program
US20230353717A1 (en) Image processing system, image processing method, and storage medium
JP2008187678A (en) Video generating apparatus and video generating program
Foote et al. One-man-band: A touch screen interface for producing live multi-camera sports broadcasts
WO2018004933A1 (en) Apparatus and method for gaze tracking

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18821310

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18821310

Country of ref document: EP

Kind code of ref document: A1