WO2018234622A1 - A method for detecting events-of-interest - Google Patents

A method for detecting events-of-interest

Info

Publication number
WO2018234622A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
video
panorama video
user responses
temporal
Prior art date
Application number
PCT/FI2018/050426
Other languages
French (fr)
Inventor
Lixin Fan
Yu You
Tinghuai Wang
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2018234622A1 publication Critical patent/WO2018234622A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N 21/4728 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region
    • G PHYSICS
    • G02 OPTICS
    • G02B OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B 27/00 Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B 27/01 Head-up displays
    • G02B 27/017 Head mounted
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • G06V 40/193 Preprocessing; Feature extraction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 Processing image signals
    • H04N 13/172 Processing image signals comprising non-image signal components, e.g. headers or format information
    • H04N 13/178 Metadata, e.g. disparity information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/204 Image signal generators using stereoscopic image cameras
    • H04N 13/243 Image signal generators using stereoscopic image cameras using three or more 2D image sensors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 Image signal generators
    • H04N 13/296 Synchronisation thereof; Control thereof
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/442 Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N 21/44213 Monitoring of end-user related data
    • H04N 21/44222 Analytics of user selections, e.g. selection of programs or purchase activity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/816 Monomedia components thereof involving special video data, e.g. 3D video
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/84 Generation or processing of descriptive data, e.g. content descriptors
    • H04N 21/8405 Generation or processing of descriptive data, e.g. content descriptors represented by keywords
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 Control of cameras or camera modules
    • H04N 23/698 Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
    • G PHYSICS
    • G02 OPTICS
    • G02B OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B 27/00 Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B 27/01 Head-up displays
    • G02B 27/0101 Head-up displays characterised by optical features
    • G02B 2027/0132 Head-up displays characterised by optical features comprising binocular systems
    • G02B 2027/0134 Head-up displays characterised by optical features comprising binocular systems of stereoscopic type
    • G PHYSICS
    • G02 OPTICS
    • G02B OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B 27/00 Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
    • G02B 27/01 Head-up displays
    • G02B 27/0179 Display position adjusting means not related to the information to be displayed
    • G02B 2027/0187 Display position adjusting means not related to the information to be displayed slaved to motion of at least a part of the body of the user, e.g. head, eye
    • G PHYSICS
    • G03 PHOTOGRAPHY; CINEMATOGRAPHY; ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ELECTROGRAPHY; HOLOGRAPHY
    • G03B APPARATUS OR ARRANGEMENTS FOR TAKING PHOTOGRAPHS OR FOR PROJECTING OR VIEWING THEM; APPARATUS OR ARRANGEMENTS EMPLOYING ANALOGOUS TECHNIQUES USING WAVES OTHER THAN OPTICAL WAVES; ACCESSORIES THEREFOR
    • G03B 37/00 Panoramic or wide-screen photography; Photographing extended surfaces, e.g. for surveying; Photographing internal surfaces, e.g. of pipe

Definitions

  • Figure 5 shows an example of how a heatmap can be used to further visualize EOIs within a frame, i.e. at a particular moment in time.
  • the metadata relating to 4D EOIs, created as described above, may be utilized in a wide variety of Augmented Reality/Virtual Reality (AR/VR) applications.
  • a 4D video control bar may be included on the display, for example below the displayed 360° panorama video, as shown in Figure 6.
  • the 4D video control bar 600 may comprise a horizontally aligned timeline of the playback and a plurality of EOIs 602a - 602h along the timeline as vertically aligned bars.
  • various types of predetermined EOIs may be indicated e.g.
  • such a 4D video control bar can be used to preview and navigate EOIs, for example before a user indulges in the lengthy video content (a data-structure sketch of such control bar entries is given after this list).
  • the 4D video control bar also facilitates the selection of interesting viewing angles, FOVs, depths, etc., which are crucial for a good 360-degree video viewing experience.
  • the 4D video control bar allows interacting with the EOIs, such as previewing an EOI or fast-forwarding to an EOI, in both time and 3D space.
  • an interactive 3D sphere 604 may be rendered with certain 3D viewing locations highlighted.
  • the highlighted locations 606, 608 are associated with identified EOIs, and are clickable by users in the 4D video control bar as the vertically aligned bars, as shown in Figure 6.
  • users may rotate the 3D sphere to a different viewpoint.
  • Figure 6 illustrates how, after clicking a particular 3D location, the video playback view has switched to the alternative location and viewpoint of the EOI.
  • the 3D sphere may disappear automatically and the playback continues from the view of the selected location.
  • additional information, such as details of the place or event, or advertisements, can be rendered to fuse into the video content, together with the 3D information of the EOI regions.
  • the geometric boundary of the EOI determines the placement of additional information and can be used as the anchor for various placement decisions. For example, the placement can be in front of the EOI in the direction of the viewing angle from the current viewer and always facing the viewer.
  • the embodiments enable detecting EOIs in 360-degree panorama video not only in terms of their temporal location, but also in terms of their 3D spatial location, i.e. resulting in so-called 4D EOIs.
  • the embodiments utilize crowd-sourced video viewing and interaction behaviors, thereby providing statistically more reliable results about the relevant 4D EOIs.
  • the various embodiments may be implemented in hardware or special purpose circuits or any combination thereof. While various embodiments may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the implementation may include a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform the various embodiments or a subset of them.
  • the implementation may include a computer program embodied on a non-transitory computer readable medium, the computer program comprising instructions causing, when executed on at least one processor, at least one apparatus to perform the various embodiments or a subset of them.
  • an apparatus may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the terminal device to carry out the features of an embodiment.
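As referenced in the control bar bullet above, the following is a minimal, purely illustrative sketch of how 4D EOI entries backing such a control bar might be represented; the class, the field names and the player methods (seek, set_viewpoint) are editorial assumptions, not part of the disclosure.

    from dataclasses import dataclass

    @dataclass
    class EOIMarker:
        label: str                # e.g. "602a", one of the bars on the timeline
        time_s: float             # temporal location along the playback timeline
        viewing_position: tuple   # (x, y, z) location highlighted on the 3D sphere
        viewing_angle_deg: float  # suggested viewing direction at that location
        depth_m: float            # depth of the EOI region

    def on_marker_clicked(player, marker):
        """Hypothetical handler: fast-forward to the EOI in both time and 3D space."""
        player.seek(marker.time_s)
        player.set_viewpoint(marker.viewing_position, marker.viewing_angle_deg)

    markers = [
        EOIMarker("602a", 12.5, (0.0, 1.6, 0.0), 45.0, 3.2),
        EOIMarker("602b", 87.0, (1.2, 1.6, -0.5), 210.0, 6.8),
    ]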

Abstract

A method comprising: providing at least one three-dimensional (3D) panorama video to be viewed by at least one user; recording, in response to said at least one user starting to view the at least one 3D panorama video, a plurality of user responses during viewing the at least one 3D panorama video; analysing the plurality of user responses; extracting, from the analysed user responses, one or more predetermined behaviors, said behavior comprising a temporal sequence of user responses; and creating metadata for said at least one 3D panorama video, the metadata comprising at least information indicative of the one or more predetermined behaviors and their spatial and temporal data.

Description

A METHOD FOR DETECTING EVENTS-OF-INTEREST
Field
Various embodiments relate to panorama videos, and more particularly to a method for detecting events-of-interest.
Background of the invention
The availability of panoramic 360-degree video has gradually increased, and such content is nowadays hosted on major public video sites, such as YouTube®. A panoramic video typically supports a 360-degree field of view horizontally and over 180 degrees vertically. A panoramic 360-degree video is delivered to client applications running on platforms that support the rendering of 360-degree panorama video. The video is typically delivered to the user compressed in an equirectangular or cube map projection.
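For illustration only (this mapping is standard practice and not specific to the disclosure), the sketch below shows how a unit view direction maps to pixel coordinates in an equirectangular panorama frame:

    import numpy as np

    def direction_to_equirectangular(direction, width, height):
        """Map a unit 3D view direction to (u, v) pixel coordinates
        in an equirectangular panorama of size width x height."""
        x, y, z = direction / np.linalg.norm(direction)
        lon = np.arctan2(x, z)   # longitude, -pi .. pi
        lat = np.arcsin(y)       # latitude, -pi/2 .. pi/2
        u = (lon / (2 * np.pi) + 0.5) * width
        v = (0.5 - lat / np.pi) * height
        return u, v

    # Looking straight ahead lands in the middle of the frame.
    print(direction_to_equirectangular(np.array([0.0, 0.0, 1.0]), 3840, 1920))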
When playing back the 360-degree panorama video, viewers may pan 360 degrees within a horizontal or vertical (or any directional) loop to watch the scene. The 360-degree panorama video may be viewed on an immersive multimedia display unit, for example a head mounted display with video player software, whereupon the viewer may experience being immersed in the scene shown on the display. On the other hand, 360-degree panorama video may be rendered for display on an ordinary 2D display, provided that the playback device supports rendering of 360-degree panorama video.
In a variety of contexts, a video footage may contain multiple highlights or scenes that could be of interest to viewers, i.e. so-called events of interest (EOIs). If the temporal locations of such EOIs within the video are known, the user may preview or forward the playback to such a location. However, the spatial information contained in a 360-degree panorama video, i.e. viewing position, viewing direction, field-of-view and depth, provides a huge variety in what could be shown on the display. Therefore, providing the mere temporal locations of the EOIs is insufficient if good coverage of EOIs is desired.
Summary of the invention
Now there has been invented an improved method and technical equipment implementing the method, by which the above problems are at least alleviated. Various aspects of the invention include a method, an apparatus and a computer program, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims. According to a first aspect, there is disclosed a method comprising: providing at least one three-dimensional (3D) panorama video to be viewed by at least one user; recording, in response to said at least one user starting to view the at least one 3D panorama video, a plurality of user responses during viewing the at least one 3D panorama video; analysing the plurality of user responses; extracting, from the analysed user responses, one or more predetermined behaviors, said behavior comprising a temporal sequence of user responses; and creating metadata for said at least one 3D panorama video, the metadata comprising at least information indicative of the one or more predetermined behaviors and their spatial and temporal data.
According to an embodiment, said behavior comprises one or more of the following:
- change of a viewing angle or a field-of-view (FOV);
- change of a viewing position;
- change of focus in terms of depth;
- pausing or rewinding or forwarding of the video playback;
- adding a comment to the video.
According to an embodiment, the at least one 3D panorama video is associated with a log file for storing the plurality of user responses.
According to an embodiment, the plurality of user responses are recorded from a plurality of users.
According to an embodiment, the method further comprises extracting key features from said user responses; and forming feature vectors associated with said behaviors on the basis of the key features, said feature vectors comprising 3D spatial data and temporal data of said behavior.
According to an embodiment, the key features comprise one or more of the following:
- eye coordinates of the user;
- center coordinates of the field-of-view (FOV);
- starting and/or ending time of the user response;
- frequently used words in comments.
According to an embodiment, the method further comprises clustering feature vectors having similar spectral properties; and identifying the behavior underlying a cluster of feature vectors.
According to an embodiment, the method further comprises projecting at least one cluster of feature vectors to a depth map of each frame of said 3D panorama video; and generating a heatmap of each depth map.
According to an embodiment, the method further comprises providing a file comprising said 3D panorama video with a temporal control bar, wherein the one or more predetermined behaviors contained by said 3D panorama video are indicated according to their temporal location.
According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to carry out the method of any of the embodiments.
According to a third aspect, there is provided a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform the method according to any of the above embodiments.
These and other aspects of the invention and the embodiments related thereto will become apparent in view of the detailed disclosure of the embodiments further below.
List of drawings
In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which
Fig. 1a shows an example of a multi-camera system as a simplified block diagram, in accordance with an embodiment;
Fig. 1b shows a perspective view of a multi-camera system, in accordance with an embodiment;
Fig. 2 shows an example of a video playback apparatus as a simplified block diagram, in accordance with an embodiment;
Fig. 3 shows a flow chart of an event detection method according to an embodiment of the invention;
Fig. 4 illustrates an example of a user watching a 360-degree 3D panorama video;
Fig. 5 illustrates an example of a heatmap visualizing EOIs within a frame according to an embodiment of the invention;
Fig. 6 illustrates an example of a video control bar application according to an embodiment of the invention.
Description of embodiments
The following embodiments are exemplary. Although the specification may refer to "an", "one", or "some" embodiment(s) in several locations, this does not necessarily mean that each such reference is to the same embodiment(s), or that the feature only applies to a single embodiment. Single features of different embodiments may also be combined to provide other embodiments.
When 360-degree stereo panorama video is viewed on an immersive multimedia display unit, for example a head mounted display with video player software, the video player may be able to create an effect similar to the viewer moving in the immersed space, as in the real world. The forward and/or backward motion of the user's head may make the video look more realistic, as objects in the foreground appear to move slightly in relation to background objects when the head is moved.
Figure 1a illustrates an example of a multi-camera system 100, which may be able to capture and produce 360-degree stereo panorama video. The multi-camera system 100 comprises two or more camera units 102. In this example, the number of camera units 102 is eight, but it may also be less or more than eight. Each camera unit 102 is located at a different location in the multi-camera system, and may have a different orientation with respect to the other camera units 102, so that they may capture a part of the 360-degree scene from different viewpoints substantially simultaneously. A pair of camera units 102 of the multi-camera system 100 may correspond with left and right eye viewpoints at a time. As an example, the camera units 102 may have an omnidirectional constellation, so that the system has a 360° viewing angle in 3D space. In other words, such a multi-camera system 100 may be able to see each direction of a scene, so that each spot of the scene around the multi-camera system 100 can be viewed by at least one camera unit 102 or a pair of camera units 102.
Without losing generality, any two camera units 102 of the multi-camera system 100 may be regarded as a pair of camera units 102. Hence, a multi-camera system of two cameras may have only one pair of camera units, a multi-camera system of three cameras may have three pairs of camera units, a multi-camera system of four cameras may have six pairs of camera units, etc. Generally, a multi-camera system 100 comprising N camera units 102, where N is an integer greater than one, may have N(N-1)/2 pairs of camera units 102. Accordingly, images captured by the camera units 102 at a certain time may be considered as N(N-1)/2 pairs of captured images.
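As a small illustration of the N(N-1)/2 pairing (an editorial sketch, not part of the original text), the camera pairs can be enumerated as follows:

    from itertools import combinations

    def camera_pairs(n_cameras):
        """Return all unordered pairs of camera unit indices."""
        return list(combinations(range(n_cameras), 2))

    # For the eight-camera system of Figure 1a: 8 * 7 / 2 = 28 pairs of captured images.
    pairs = camera_pairs(8)
    assert len(pairs) == 8 * (8 - 1) // 2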
The multi-camera system 100 of Figure 1a may also comprise a processor 104 for controlling operations of the multi-camera system 100. There may also be a memory 106 for storing data and computer code to be executed by the processor 104, and a transceiver 108 for communicating with, for example, a communication network and/or other devices in a wireless and/or wired manner. The multi-camera system 100 may further comprise a user interface (UI) 110 for displaying information to the user, for generating audio signals, and/or for receiving user inputs. However, the multi-camera system 100 need not comprise each feature mentioned above, or may comprise other features as well. For example, there may be electric and/or mechanical elements for adjusting and/or controlling optics of the camera units 102 (not shown).
Figure 1a also illustrates some operational elements which may be implemented, for example, as computer code which can be executed in the processor 104, in hardware, or both, to perform a desired function. An optical flow estimation element 114 may perform optical flow estimation for pairs of images of different camera units 102. Transform vectors or other information indicative of an amount of interpolation/extrapolation to be applied to different parts of a viewport may have been stored in the memory 106, or they may be calculated e.g. as a function of the location of the pixel in question. The operation of the elements will be described later in more detail. It should be noted that there may also be other operational elements in the multi-camera system 100 than those depicted in Figure 1a.
Figure 1b shows a perspective view of the multi-camera system 100, in accordance with an embodiment. In Figure 1b seven camera units 102a-102g can be seen, but the multi-camera system 100 may comprise even more camera units which are not visible from this perspective view. Figure 1b also shows two microphones 112a, 112b, but the apparatus may also comprise one microphone or more than two microphones.
In accordance with an embodiment, the multi-camera system 100 may be controlled by another device, wherein the multi-camera system 100 and the other device may communicate with each other and a user may use a user interface of the other device for entering commands, parameters, etc. and the user may be provided with information from the multi-camera system 100 via the user interface of the other device.
Some terminology regarding the multi-camera system 100 will now be shortly described. A viewport is a part of the scene which is displayed by a head mounted display at a time. Both left and right eye images may have overlapping, but slightly different viewports. A camera space, or camera coordinates, stands for a coordinate system of an individual camera unit 102 whereas a world space, or world coordinates, stands for a coordinate system of the multi-camera system 100 as a whole. An optical flow may be used to describe how objects, surfaces, and edges in a visual scene move or transform, when an observing point moves from a location of one camera to a location of another camera. In some embodiments, there need not be any actual movement but it may virtually be determined how the view of the scene might change when a viewing point is moved from one camera unit to another camera unit.
Figure 2 shows an example of a video playback apparatus 200 as a simplified block diagram, in accordance with an embodiment. A non-limiting example of the video playback apparatus 200 is an immersive display unit. An example of the immersive display unit includes, but is not limited to, a head mounted display. The video playback apparatus 200 may comprise, for example, one or two displays 202 for video playback. When two displays are used, a first display 202a may display images for a left eye and a second display 202b may display images for a right eye, in accordance with an embodiment. In case of only one display 202, that display 202 may be used to display images for the left eye on the left side of the display 202 and images for the right eye on the right side of the display 202. While describing various embodiments further below, it is assumed that the one or two displays are configured to create one or more images surrounding the viewer partly or completely, such as a head-mounted display with a head tracking system, or a cylindrical or spherical display curving in 2D or in 3D at least partly, but possibly 360°, around the viewer.
The video playback apparatus 200 may be provided with encoded data streams via a communication interface 204, and a processor 206 may perform control operations for the video playback apparatus 200 and may also reconstruct video streams for displaying on the basis of the received encoded data streams. There may also be a decoder 208 for decoding received data streams and a memory 210 for storing data and computer code. In an embodiment, the decoder 208 is implemented, for example, as software code, which can be executed by the processor 206 to perform the desired function, in hardware, or in both. The video playback apparatus 200 may further comprise a user input 212 for receiving e.g. user instructions or inputs. The video playback apparatus 200 may comprise an image modification unit 214 which may perform image modification on the basis of modification information provided by a user input and at least one image modification function, and transform image elements, as will be described later in this specification. In an embodiment, the image modification unit 214 can be implemented, for example, as software code, which can be executed by the processor to perform the desired function, in hardware, or in both.
It should be noted that the video playback device 200 need not comprise each of the above elements, or may also comprise other elements. For example, the decoding element 208 may be a separate device, wherein that device may perform decoding operations and provide the decoded data stream to the video playback device 200 for further processing and displaying of the decoded video streams. In a variety of contexts, a video footage may contain multiple highlights or scenes that could be of interest to viewers, i.e. so-called events of interest (EOIs). If the temporal locations of such EOIs within the video are known, the user may preview or forward the playback to such a location. However, the spatial information contained in a 360-degree panorama video, i.e. viewing position, viewing direction, field-of-view and depth, provides a huge variety in what could be shown on the display.
In the following, the method for detecting events of interest (EOI) in accordance with an embodiment is described in more detail with reference to the flow diagram of Figure 3. Video information to be processed may have been captured and processed by two or more camera units 102 to obtain a panorama video, for example a 360-degree panorama video. The playback of the video may be carried out, for example, using the immersive multimedia display unit disclosed in Figure 2, such as a head mounted display with video player software. Alternatively, the playback of the video may be carried out by an ordinary 2D display device supporting rendering of 360-degree panorama video.
In the method, at least one three-dimensional (3D) panorama video is provided to be viewed (300) by at least one user, and in response to said at least one user starting to view the at least one 3D panorama video, a plurality of user responses are recorded (302) during viewing the at least one 3D panorama video. The plurality of user responses are analysed (304), and from the analysed user responses, one or more predetermined behaviors are extracted (306), said behavior comprising a temporal sequence of user responses. Metadata for said at least one 3D panorama video is created (308), the metadata comprising at least information indicative of the one or more predetermined behaviors and their spatial and temporal data.
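By way of example only, the metadata created in step 308 could take a form along the following lines; the field names and values are editorial assumptions used to illustrate "information indicative of the behaviors and their spatial and temporal data":

    # Hypothetical metadata record for one extracted behavior (not a format
    # prescribed by the embodiments).
    eoi_metadata = {
        "video_id": "panorama_clip_001",              # assumed identifier
        "behaviors": [
            {
                "type": "change_of_viewing_angle",    # one of the predetermined behaviors
                "start_frame": 1200,                  # temporal data
                "end_frame": 1275,
                "viewing_position": [0.0, 1.6, 0.0],  # 3D spatial data: x, y, z
                "fov_center_px": [512, 288],
                "depth": 4.2,                         # from the frame's depth map
            },
        ],
    }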
Each user, when viewing the 3D panorama video, is more interested in some events of the video or details appearing on the video frames, commonly referred to as EOIs, than in other events or details. Thus, by gathering data from repeated views of the 3D panorama video, either by a single user or by multiple users, statistical analysis can be carried out to determine which EOIs are more significant and which are less significant. For gathering the data, some user responses, i.e. actions carried out by the user when viewing the video, are detected and recorded during viewing, and the user responses are analysed.
One purpose of the analysis is to identify temporally succeeding user responses, which can be referred to as a behavior of the user with regard to viewing the video. In other words, a particular temporal sequence of user responses can be identified as a particular viewing behavior. Metadata relating to the identified behavior is created, wherein the metadata comprises at least information indicative of the behavior and its spatial and temporal data. Thus, metadata relating to one or more EOIs is gathered, and as this is repeated several times, either by a single user or by multiple users, so-called crowd-sourced metadata relating to users' behavior is created.
On the basis of the metadata, the more significant EOIs can be identified in terms of their spatial and temporal location. The spatial and temporal information may be utilized in previewing and playing back 360-degree video. It is noted that the 3D spatial information together with the time location/duration of an EOI constitutes a four-dimensional entity, which may be referred to as a 4D EOI. The temporal location or duration may be determined in terms of the frame number(s) associated with the EOI. According to an embodiment, said behavior comprises one or more of the following (a simple sketch of how recorded response sequences might map to these behavior types is given after the list):
- change of a viewing angle or a field-of-view (FOV);
- change of a viewing position;
- change of focus in terms of depth;
- pausing or rewinding or forwarding of the video playback;
- adding a comment to the video.
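As referenced above, the following is a minimal sketch of how a temporal sequence of recorded responses might be mapped to one of the listed behavior types; the action names and the precedence order are assumptions made for illustration:

    def classify_behavior(responses):
        """responses: time-ordered list of dicts, each with an 'action' key."""
        actions = [r["action"] for r in responses]
        if any(a in ("pause", "rewind", "forward") for a in actions):
            return "pausing_rewinding_or_forwarding"
        if "comment" in actions:
            return "adding_a_comment"
        if "fov_change" in actions or "viewing_angle_change" in actions:
            return "change_of_viewing_angle_or_fov"
        if "position_change" in actions:
            return "change_of_viewing_position"
        if "depth_refocus" in actions:
            return "change_of_focus_in_depth"
        return "no_predetermined_behavior"

    example = [{"action": "viewing_angle_change", "t": 12.0},
               {"action": "fov_change", "t": 13.5}]
    print(classify_behavior(example))  # -> "change_of_viewing_angle_or_fov"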
An example of a user watching a 360-degree 3D panorama video is shown in Figure 4. The user A is currently watching frame n of the 360-degree 3D panorama video. Frame n is displayed at a particular temporal location tn along the playback timeline. The viewpoint of user A is located at a particular position (xu, yu, zu) in relation to the displayed 3D scenery. From that position, the user has a field-of-view (FOV) to a part of the frame (shown as a rectangle), where the FOV is horizontally limited by the viewing angle. The object shown in the part of frame n that user A is watching is located at a particular depth d0, which can be determined from the depth map of frame n. The parameters described above may be considered an initial user response upon viewing the 360-degree 3D panorama video.
Now, starting from the initial user response, the user may carry out further user responses, and depending on the further user responses, these sequences of user responses may be determined as different behaviors of the user. The user may change the viewing angle, e.g. by rotating the viewing angle at the current position (xu, yu, zu). The user may change the field-of-view (FOV), e.g. by raising or lowering the gaze. The user may change the viewing position in terms of one or more of the coordinates (xu, yu, zu). The user may change the focus in terms of depth by focusing on an object having a depth value higher or lower than d0.
The user may also pause the playback of the video at a particular temporal location. Moreover, the user may rewind or forward the playback of the video to a particular temporal location.
The user may also write a comment relating to the video. Herein, for example commentary sharing systems enabling viewers to add comments on top of an uploaded video, such as Danmu, can be utilized.
The above user responses are such that they represent various behaviors, which presumably indicate some kind of interest of the user towards an event or a detail of the video. Therefore they can be regarded as predetermined behaviors, which are intended to be identified through the analysis of the user responses.
According to an embodiment, the at least one 3D panorama video is associated with a log file for storing the plurality of user responses. Thus, the recorded user responses, as well as the behaviors extracted on the basis of the user responses, may be stored in the log file.
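One possible layout for a log file entry is sketched below; the JSON-lines format and the field names are editorial assumptions, as the embodiment only requires that the responses are stored:

    import json

    log_entry = {
        "user_id": "user_042",            # illustrative identifier
        "action": "pause",                # the recorded user response
        "start_time_s": 73.2,
        "end_time_s": 78.9,
        "frame": 1830,
        "viewing_position": [0.1, 1.6, -0.3],
        "fov_center_px": [960, 540],
        "gaze_depth_m": 3.7,
    }
    with open("panorama_responses.log", "a") as f:
        f.write(json.dumps(log_entry) + "\n")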
According to an embodiment, the plurality of user responses are recorded from a plurality of users. Even though the user responses may be gathered from a single user, statistically more relevant data can be obtained if the user responses are recorded from a plurality of users, thereby better corresponding to a crowd-sourced analysis of EOIs.
According to an embodiment, the method further comprises extracting key features from said user responses and forming feature vectors associated with said behaviors on the basis of the key features, said feature vectors comprising 3D spatial data and temporal data of said behavior.
According to an embodiment, the key features comprise one or more of the following:
- eye coordinates of the user;
- center coordinates of the field-of-view (FOV);
- starting and/or ending time of the user response;
- frequently used words in comments.
The eye coordinates of the user may be measured either by head tracking, e.g. on a Head-Mounted Display (HMD), or by an eye tracking system arranged to measure either the point of gaze or the motion of an eye relative to the head. The eye tracking system may be implemented in functional connection with the playback device. The eye coordinates may comprise the X and Y coordinates of the frame, e.g. in terms of pixels, blocks or macroblocks, and the Z coordinate as the depth value of the object the user is watching.
The center coordinates of the FOV may be measured in relation to the frame, e.g. as the X and Y coordinates of the frame in terms of pixels, blocks or macroblocks, and the Z coordinate as the most prominent depth value in the FOV.
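The following sketch, which is not part of the original disclosure and in which the field names, the normalisation choices and the Python language are assumptions, illustrates one way the key features of a single logged user response could be combined into a normalised spatio-temporal feature vector.

```python
import numpy as np

def make_feature_vector(response: dict, frame_width: int, frame_height: int,
                        max_depth: float, video_duration: float) -> np.ndarray:
    """Form a normalised spatial + temporal feature vector from the key features
    of one user response: FOV centre (or eye) coordinates and the starting and
    ending time of the response."""
    x, y, z = response["fov_center"]             # X, Y in pixels, Z as a depth value
    t_start = response["time_start"]             # start time of the response (seconds)
    t_end = response.get("time_end", t_start)    # end time, defaulting to the start time
    return np.array([
        x / frame_width,                         # normalise spatial coordinates to [0, 1]
        y / frame_height,
        z / max_depth,
        t_start / video_duration,                # normalise temporal coordinates to [0, 1]
        t_end / video_duration,
    ])
```

Normalising all components to a common range is one plausible way to keep spatial and temporal distances comparable before clustering.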
An example of extracting the key features of the user responses is shown in Table 1 below.
[Table 1: example key features extracted from the user responses — the table is provided as an image in the original document and is not reproduced in this text.]
Herein, for user responses which last for a certain period, the data regarding the key features is continuously recorded. For example, if the user makes comments or pauses the playback, this will result in a continuous recording of eye/center coordinates, as well as of the start and end times of the user response. For other actions, e.g. forwarding/rewinding or changing the viewing angle, the end time is more informative for indicating the temporal distribution of the event of interest.
According to an embodiment, the method further comprises clustering feature vectors having similar spectral properties, and identifying the behavior underlying a cluster of feature vectors. The feature vectors may be represented in two-dimensional spatio-temporal coordinates. The feature vectors may be clustered together, for example, such that a maximum distance between any points of feature vectors in the cluster is defined and a group of feature vectors within the maximum distance are clustered together. Herein, while spectral clustering may provide advantages in terms of connectivity criteria which are suitable for spatial-temporal data, a combination of k-means and mean-shift clustering algorithms could alternatively be used for each group. From each cluster, sequences of user responses are identified, since similar sequences typically correspond to at least one predetermined user behavior.
For each user behavior group, the spatial-temporal data of certain associated actions is collected, vectorised and clustered to form spatial-temporal sequences or trajectories corresponding to the underlying favorable user behavior. For example, all data associated with the action of 'viewing angle change', such as start/end time and eye/center coordinates, may be collected. The data is properly vectorised and normalized before being clustered using spectral clustering. Each spatial-temporal cluster then corresponds to a group of favorable user behavior.
According to an embodiment, the method further comprises projecting at least one cluster of feature vectors to a depth map of each frame of said 3D panorama video, and generating a heatmap of each depth map. The tracking of view orientations and their changes, carried out by the users over time, provides good insight into how users experience the 360° panorama. The clusters of feature vectors may be further distinguished by projecting them onto the depth map of each frame, thereby providing information regarding variations of the depth of EOIs experienced by the users. As the number of feature vectors of a particular cluster at a particular depth level increases, these depth-specific clusters may form regions that can be visualized as gradients of color in the form of a heatmap.
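As a non-authoritative sketch of the clustering step described above, the spatio-temporal feature vectors gathered for one action type could, for example, be grouped with spectral clustering; the number of clusters and the affinity choice below are assumptions, and k-means or mean-shift could be substituted as noted in the text.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_action_group(feature_vectors: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    """Cluster the normalised spatio-temporal feature vectors collected for one
    action type (e.g. 'viewing angle change'). Each resulting cluster is a
    candidate group of favorable user behavior, i.e. a candidate 4D EOI."""
    clustering = SpectralClustering(
        n_clusters=n_clusters,
        affinity="nearest_neighbors",   # graph-based affinity reflects connectivity criteria
        assign_labels="kmeans",
        random_state=0,
    )
    # Returns one cluster label per feature vector.
    return clustering.fit_predict(feature_vectors)
```

A graph-based affinity is used here because it tolerates elongated spatio-temporal trajectories better than a purely distance-threshold grouping would.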
Figure 5 shows an example of how a heatmap can be used for further visualizing EOIs within a frame, i.e. at a particular moment in time. A sequential encoding of the frame-specific heatmaps, possibly applying interpolation, creates a heatmap video representation over time, which may provide enhanced visualization and intuition about how the users experience the 360° panorama video over time.
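A minimal sketch, under assumed parameters (depth tolerance and smoothing radius), of how the points of one cluster could be projected onto a frame's depth map and accumulated into a per-frame heatmap; sequentially encoding such frames would yield the heatmap video representation mentioned above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def frame_heatmap(cluster_points: np.ndarray, depth_map: np.ndarray,
                  depth_tolerance: float = 0.5, sigma: float = 15.0) -> np.ndarray:
    """Accumulate the (x, y, depth) points of one cluster that fall on this frame
    onto its depth map and smooth the counts into a heatmap. Points whose depth
    disagrees with the depth map beyond the tolerance are discarded, so that
    depth-specific clusters remain distinguishable."""
    height, width = depth_map.shape
    heat = np.zeros((height, width), dtype=float)
    for x, y, d in cluster_points:
        col, row = int(round(x)), int(round(y))
        if 0 <= col < width and 0 <= row < height and abs(depth_map[row, col] - d) <= depth_tolerance:
            heat[row, col] += 1.0
    heat = gaussian_filter(heat, sigma=sigma)        # spread counts into colour gradients
    return heat / heat.max() if heat.max() > 0 else heat
```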
The metadata relating to 4D EOIs, created as described above, may be utilized in a wide variety of Augmented Reality/Virtual Reality (AR/VR) applications. As an example, an application relating to a 4D video control bar is illustrated below. Nevertheless, potential applications based on the 4D EOI detection method as described herein are by no means limited to the following use case.
A 4D video control bar may be included on the display, for example below the displayed 360° panorama video, as shown in Figure 6. The 4D video control bar 600 may comprise a horizontally aligned timeline of the playback and a plurality of EOIs 602a - 602h along the timeline as vertically aligned bars. Various types of predetermined EOIs may be indicated e.g. using different colors of bars. Such a 4D video control bar can be used to preview and navigate EOIs, for example before a user delves into lengthy video content. In contrast to an existing video control bar, which allows users to shift the playback back and forth at an arbitrary time point, the 4D video control bar also facilitates the selection of interesting viewing angles, FOVs, depths, etc., which are crucial for a good experience of 360-degree video viewing.
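Purely as an illustration of the kind of metadata such a 4D video control bar could consume, the following sketch (the field names are hypothetical and not taken from the disclosure) pairs each detected EOI's timeline span with its 3D viewing parameters and shows a simple query for previewing the EOIs at a given point of the timeline.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EventOfInterest4D:
    start_frame: int                       # temporal location on the control bar timeline
    end_frame: int
    position: Tuple[float, float, float]   # suggested 3D viewing position
    view_direction: Tuple[float, float]    # suggested viewing angle (yaw, pitch)
    depth: float                           # dominant depth of the EOI region
    behavior_type: str                     # e.g. "fov_change", "pause", "comment"

def eois_at_frame(eois: List[EventOfInterest4D], frame: int) -> List[EventOfInterest4D]:
    """Return the EOIs whose temporal span covers the given frame, e.g. so their
    3D locations can be highlighted when the user previews that point of the
    timeline on the control bar."""
    return [e for e in eois if e.start_frame <= frame <= e.end_frame]
```

The behavior_type field could, for instance, drive the different bar colors mentioned above.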
The 4D video control bar allows interacting with the EOIs, such as previewing an EOI or fast-forwarding to an EOI, in both time and 3D space. For example, when hovering over one specific time 602f on the control bar, e.g. with a mouse pointer, an interactive 3D sphere 604 may be rendered with certain 3D viewing locations highlighted. The highlighted locations 606, 608 are associated with identified EOIs, and are clickable by users in the 4D video control bar as the vertically aligned bars, as shown in Figure 6. In case multiple 3D locations are distributed on different sides (e.g. facets of buildings), users may rotate the 3D sphere to a different viewpoint. Figure 6 illustrates how, after clicking a particular 3D location, the video playback view switches to the alternative location and viewpoint of the EOI. The 3D sphere may disappear automatically and the playback continues from the view of the selected location.
With the EOI information, the inclusion of additional information in the views can be controlled and targeted easily. Without breaking the viewing experience, additional information, such as details of the place or event, or advertisements, can be rendered to fuse into the video content, together with the 3D information of the EOI regions. The geometric boundary of the EOI determines the placement of the additional information and can be used as the anchor for various placement decisions. For example, the placement can be in front of the EOI in the direction of the viewing angle from the current viewer and always facing the viewer.
The above embodiments may provide various advantages. The embodiments enable detecting EOIs in 360-degree panorama video not only in terms of their temporal location, but also in terms of their 3D spatial location, i.e. resulting in so-called 4D EOIs. The embodiments utilize crowd-sourced video viewing and interaction behaviors, thereby providing statistically more reliable results about the relevant 4D EOIs.
In general, the various embodiments may be implemented in hardware or special purpose circuits or any combination thereof. While various embodiments may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. Embodiments of the invention may be practiced in various components such as integrated circuit modules.
A skilled man appreciates that any of the embodiments described above may be implemented as a combination with one or more of the other embodiments, unless there is explicitly or implicitly stated that certain embodiments are only alternatives to each other.
The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. Thus, the implementation may include a computer readable storage medium stored with code thereon for use by an apparatus, which when executed by a processor, causes the apparatus to perform the various embodiments or a subset of them. Additionally or alternatively, the implementation may include a computer program embodied on a non-transitory computer readable medium, the computer program comprising instructions causing, when executed on at least one processor, at least one apparatus to perform the various embodiments or a subset of them. For example, an apparatus may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the terminal device to carry out the features of an embodiment.
The above-presented embodiments are not limiting, but they can be modified within the scope of the appended claims.

Claims:
1. A method comprising:
providing at least one three-dimensional (3D) panorama video to be viewed by at least one user;
recording, in response to said at least one user starting to view the at least one 3D panorama video, a plurality of user responses during viewing the at least one 3D panorama video;
analysing the plurality of user responses;
extracting, from the analysed user responses, one or more predetermined behaviors, said behavior comprising a temporal sequence of user responses; and
creating metadata for said at least one 3D panorama video, the metadata comprising at least information indicative of the one or more predetermined behaviors and their spatial and temporal data.
2. The method according to claim 1, wherein said behavior comprises one or more of the following:
- change of a viewing angle or a field-of-view (FOV);
- change of a viewing position;
- change of focus in terms of depth;
- pausing or rewinding or forwarding of the video playback;
- adding a comment to the video.
3. The method according to claim 1 or 2, wherein the at least one 3D panorama video is associated with a log file for storing the plurality of user responses.
4. The method according to any preceding claim, wherein the plurality of user responses are recorded from a plurality of users.
5. The method according to any preceding claim, further comprising extracting key features from said user responses; and
forming feature vectors associated with said behaviors on the basis of the key features, said feature vectors comprising 3D spatial data and temporal data of said behavior.
6. The method according to claim 5, wherein the key features comprise one or more of the following:
- eye coordinates of the user;
- center coordinates of the field-of-view (FOV);
- starting and/or ending time of the user response;
- frequently used words in comments.
7. The method according to claim 5 or 6, further comprising
clustering feature vectors having similar spectral properties; and
identifying the behavior underlying a cluster of feature vectors.
8. The method according to claim 7, further comprising:
projecting at least one cluster of feature vectors to a depth map of each frame of said 3D panorama video; and
generating a heatmap of each depth map.
9. The method according to any preceding claim, further comprising providing a file comprising said 3D panorama video with a temporal control bar, wherein the one or more predetermined behaviors contained by said 3D panorama video are indicated according to their temporal location.
10. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to carry out the method of any of claims 1 - 9.
11. A computer program embodied on a non-transitory computer readable medium, the computer program comprising instructions causing, when executed on at least one processor, at least one apparatus to carry out the method of any of claims 1 - 9.
PCT/FI2018/050426 2017-06-21 2018-06-07 A method for detecting events-of-interest WO2018234622A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20175582 2017-06-21
FI20175582 2017-06-21

Publications (1)

Publication Number Publication Date
WO2018234622A1 true WO2018234622A1 (en) 2018-12-27

Family

ID=64735516

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2018/050426 WO2018234622A1 (en) 2017-06-21 2018-06-07 A method for detecting events-of-interest

Country Status (1)

Country Link
WO (1) WO2018234622A1 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140270692A1 (en) * 2013-03-18 2014-09-18 Nintendo Co., Ltd. Storage medium storing information processing program, information processing device, information processing system, panoramic video display method, and storage medium storing control data
US20160104508A1 (en) * 2014-10-10 2016-04-14 Samsung Electronics Co., Ltd. Video editing using contextual data and content discovery using clusters
US20160300392A1 (en) * 2015-04-10 2016-10-13 VR Global, Inc. Systems, media, and methods for providing improved virtual reality tours and associated analytics
EP3112985A1 (en) * 2015-06-30 2017-01-04 Nokia Technologies Oy An apparatus for video output and associated methods
US20170061686A1 (en) * 2015-08-28 2017-03-02 Hai Yu Stage view presentation method and system
US20170076429A1 (en) * 2015-09-16 2017-03-16 Google Inc. General spherical capture methods
US9781342B1 (en) * 2015-10-22 2017-10-03 Gopro, Inc. System and method for identifying comment clusters for panoramic content segments
US9762851B1 (en) * 2016-05-31 2017-09-12 Microsoft Technology Licensing, Llc Shared experience with contextual augmentation

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019240647A1 (en) * 2018-06-14 2019-12-19 Telefonaktiebolaget Lm Ericsson (Publ) System and method for providing 360 degrees immersive video based on gaze vector information
CN111309147A (en) * 2020-02-12 2020-06-19 咪咕视讯科技有限公司 Panoramic video playing method and device and storage medium
WO2022199441A1 (en) * 2021-03-23 2022-09-29 影石创新科技股份有限公司 360-degree video playback method and apparatus, computer device, and storage medium

Similar Documents

Publication Publication Date Title
US11924394B2 (en) Methods and apparatus for receiving and/or using reduced resolution images
US11653065B2 (en) Content based stream splitting of video data
JP6708689B2 (en) 3D gameplay sharing
CN109416931B (en) Apparatus and method for gaze tracking
US10440407B2 (en) Adaptive control for immersive experience delivery
JP6410918B2 (en) System and method for use in playback of panoramic video content
US20190335166A1 (en) Deriving 3d volumetric level of interest data for 3d scenes from viewer consumption data
US11748870B2 (en) Video quality measurement for virtual cameras in volumetric immersive media
US20160198140A1 (en) System and method for preemptive and adaptive 360 degree immersive video streaming
JP7459870B2 (en) Image processing device, image processing method, and program
CN105939497B (en) Media streaming system and media streaming method
US10764493B2 (en) Display method and electronic device
JPWO2017169369A1 (en) Information processing apparatus, information processing method, and program
US20210274145A1 (en) Methods, systems, and media for generating and rendering immersive video content
WO2018234622A1 (en) A method for detecting events-of-interest
EP3236336A1 (en) Virtual reality causal summary content
JP2016163342A (en) Method for distributing or broadcasting three-dimensional shape information
US10732706B2 (en) Provision of virtual reality content
CN110730340B (en) Virtual audience display method, system and storage medium based on lens transformation
US11622099B2 (en) Information-processing apparatus, method of processing information, and program
JP2022051978A (en) Image processing device, image processing method, and program
US20230353717A1 (en) Image processing system, image processing method, and storage medium
JP2008187678A (en) Video generating apparatus and video generating program
Foote et al. One-man-band: A touch screen interface for producing live multi-camera sports broadcasts
WO2018004933A1 (en) Apparatus and method for gaze tracking

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18821310

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18821310

Country of ref document: EP

Kind code of ref document: A1