WO2019030551A1 - Method for applying metadata to immersive media files - Google Patents

Method for applying metadata to immersive media files

Info

Publication number
WO2019030551A1
WO2019030551A1 (PCT/IB2017/054839)
Authority
WO
WIPO (PCT)
Prior art keywords
frame
interest
metadata
data
immersive media
Prior art date
2017-08-08
Application number
PCT/IB2017/054839
Other languages
French (fr)
Inventor
Mark MILSTEIN
Original Assignee
Milstein Mark
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-08-08
Filing date
2017-08-08
Publication date
2019-02-14
Application filed by Milstein Mark
Priority to PCT/IB2017/054839
Publication of WO2019030551A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for applying metadata to immersive media files, comprising the following steps: - providing an immersive media file comprising at least one frame coding immersive media content, - adding a spatial grid overlay on the at least one frame, said spatial grid overlay defining latitudes and longitudes on a spheroid surface, - identifying an object of interest in the immersive media content, - determining frame sequence data of a first reference frame in which said object of interest is present, - determining latitude and longitude data of a point associated with the object of interest in the spatial grid overlay of the first reference frame, - providing information data associated with the object of interest - generating frame and location based metadata comprising the frame sequence data, the latitude and longitude data and the information data, and - applying the generated metadata to the immersive media file.

Description

Method for applying metadata to immersive media files
The present invention relates to a method for applying metadata to immersive media files
Film and video have existed as media for over a century and over half a century, respectively. While the use and popularity of both physical forms of these media have waned in recent years, their digital incarnations have exploded in popularity owing to the availability of low-cost, easily procured cameras, editing software and high-speed internet. This combination has led to the creation of a large number of online stock media marketplaces where professional and amateur videographers can place their content for licensing to third-party users.
Until early 2017, these online stock media marketplaces limited the length of the video clips which could be uploaded and hosted on their platforms to a maximum, on average, of 45 seconds. This relatively short length resulted in nearly all stock video focusing on a single event, scene or moment.
This maximum length was changed by nearly all major stock media marketplaces in early 2017 to approximately three minutes.
This more than tripling of the maximum time length now allowed videographers to increase the variety of subjects or events covered in those videos from a single event to multiple events and scenes.
This in turn meant that traditional means of keywording that content, i.e. applying metadata intended to guarantee online and internal discoverability of those assets, were no longer valid. Traditional means can be described as providing those assets with a single, short, concise description, sometimes called a caption, comprising no fewer than five and no more than 12 words, as well as providing the asset with as many keywords as necessary to guarantee its discoverability in a standard search query. The presentation and retrieval of this metadata is controlled by a set of standards developed by the International Press Telecommunications Council (IPTC).
Metadata is a set of descriptive information about a file. Video and audio files automatically include basic metadata properties, such as date, duration, and file type.
While the above described scheme of applying metadata to video was appropriate for short form content, it immediately lost its ability to provide accurate results for a new form of media which could display multiple events, audio and dialog, as well as various scenarios over a considerable length of time.
In parallel to these events, immersive video is becoming increasingly popular, which raises new kinds of challenges.
Immersive videos, also known as 360-degree videos or spherical videos, are video recordings where a view in every direction is recorded at the same time, shot using an omnidirectional camera or a collection of cameras. Virtual reality and augmented reality applications are also commonly based on immersive videos.
The inventors have realized that this new, immersive media also requires a new means of applying metadata to make them discoverable.
Therefore, a need exists to make discoverable these new, long form videos, as well as accommodate a unique, newly formed immersive media.
It is an object of the present invention to overcome the problems associated with the prior art.
These objects are achieved by a method for applying metadata to immersive media files, comprising the steps of:
- providing an immersive media file comprising at least one frame coding immersive media content,
- adding a spatial grid overlay on the at least one frame, said spatial grid overlay defining latitudes and longitudes on a spheroid surface,
- identifying an object of interest in the immersive media content,
- determining frame sequence data of a first reference frame in which said object of interest is present,
- determining latitude and longitude data of a point associated with the object of interest in the spatial grid overlay of the first reference frame,
- providing information data associated with the object of interest
- generating frame and location based metadata comprising the frame sequence data, the latitude and longitude data and the information data, and
- applying the generated metadata to the immersive media file.
Further advantageous embodiments of the invention are defined in the attached dependent claims.
Further details of the invention will be apparent from the accompanying figures and exemplary embodiments.
Fig. 1 is a flow diagram of a preferred method according to the invention.
Fig. 2 is an illustration of applying a spatial grid overlay to a media file.
Fig. 3 is an illustration of identifying an object of interest in a media stream.
Figs. 4 to 10 are screenshots of an exemplary embodiment being carried out using Adobe Premiere, wherein the screenshots show the following steps:
Fig. 4: Opening a video file in Adobe Premiere.
Fig. 5: Applying a spatial grid overlay to the video file.
Fig. 5a: Enlarged video frame with spatial grid overlay taken from the screenshot in Fig. 5.
Fig. 6: Adding a comment marker (information data) to an object of interest identified in the video stream.
Fig. 6a: Enlarged marker editor panel taken from the screenshot in Fig. 6.
Fig. 7: Inputting frame sequence data (time) of first frame wherein object of interest appears.
Fig. 7a: Enlarged marker data input box taken from the screenshot in Fig. 7.
Fig. 8: Inputting frame sequence data (time) of second frame wherein object of interest disappears.
Fig. 8a: Enlarged marker data input box taken from the screenshot in Fig. 8.
Fig. 9: Inputting coordinates of object of interest within the first frame.
Fig. 9a: Enlarged marker data input box taken from the screenshot in Fig. 9.
Fig. 10: Inputting coordinates of object of interest within the second frame.
Fig. 10a: Enlarged marker data input box taken from the screenshot in Fig. 10.
According to the preferred method illustrated in Fig. 1 an immersive media file 10 is provided in Step 100, which immersive media file 10 comprises at least one frame 12 coding immersive media content. In the illustrated example the immersive media file 10 is an immersive video comprising a plurality of frames 12. The frames 12 comprise electronically coded still images, which, when displayed subsequently at a given frame rate, produce a sense of motion.
In Step 102 the frames 12 are preferably processed by performing image recognition on the frames by an artificial intelligence. One suitable artificial intelligence is Clarifai (www.clarifai.com). Clarifai's visual recognition model processes the frames 12 of the video file 10 in real-time and returns predictions on what is in the still images coded in the frames. These predictions take the form of keywords with associated probabilities indicating how likely it is that the given keyword accurately describes the content.
Other artificial intelligence programs work similarly to produce keywords relating to the media file 10. In the context of the present invention the term keyword is understood to mean one or more words describing the image, including conceptual description.
The keywords may be associated with the whole media file 10 or only part of the frames 12 if the keywords relate to objects or concepts appearing in those frames 12.
In Step 104 the keywords provided by the artificial intelligence are preferably reviewed by a human editor. In order to facilitate reviewing, a probability threshold may be defined such that keywords having a probability lower than the given threshold are automatically discarded without review. The keywords which are deemed inappropriate by the human reviewer for the frames 12 with which the keyword is associated are also discarded. Keywords which are not discarded are used for providing information data, as will be explained in the following.
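By way of illustration, a minimal sketch of this pre-filtering step is given below. The response format of the recognition service is not specified in the description, so a generic list of (keyword, probability) pairs is assumed, and the threshold value is purely illustrative.

```python
# Minimal sketch of the keyword pre-filtering described above.
# The exact structure returned by the recognition service is not specified
# here, so a generic list of (keyword, probability) pairs is assumed;
# names such as PROBABILITY_THRESHOLD are illustrative.

PROBABILITY_THRESHOLD = 0.85  # assumed cut-off; tune per workflow

def prefilter_keywords(predictions):
    """Discard keywords whose probability falls below the threshold.

    predictions: iterable of (keyword, probability) pairs, e.g.
                 [("airplane", 0.97), ("boat", 0.41)]
    Returns the keywords that remain for human review.
    """
    return [kw for kw, prob in predictions if prob >= PROBABILITY_THRESHOLD]

# Example: only "airplane" and "river" survive automatic filtering;
# a human editor then reviews (and possibly discards) these as well.
kept = prefilter_keywords([("airplane", 0.97), ("river", 0.91), ("boat", 0.41)])
print(kept)  # ['airplane', 'river']
```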
In Step 106 a spatial grid overlay 14 is added on the frames 12, which spatial grid overlay defines latitudes and longitudes on a spheroid surface. Preferably each still image coded in each frame is an equirectangular projection (also called spherical projection) where the stitched image shows a 360° horizontal by 180° vertical field of view, i.e. the whole sphere. Panoramas in this projection are meant to be viewed as though the image is wrapped into a sphere and viewed from within. In this case the spatial grid overlay 14 is a Cartesian coordinate system wherein the two perpendicular X and Y coordinate axes define the longitudes and latitudes, the longitudes ranging from -180° to +180° and the latitudes ranging from -90° to +90°. By reading the numbers from the X and Y axes any object can be located within a 360° space. The grid is preferably custom designed with a 2:1 aspect ratio to match various camera standards.
The application of the spatial grid overlay 14 to a given frame 12 is illustrated in Fig. 2. As a result, an annotated frame 12 is produced, which comprises both the original image content and the spatial grid overlay 14. In another exemplary embodiment the spatial grid overlay 14 is a virtual overlay which is not visible to the user and which is used by a program to calculate the spatial location of a point or area selected by the user.
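The calculation performed by such a virtual overlay can be sketched as follows for an equirectangular (2:1) frame; the function name and the pixel convention (origin at the top-left corner) are assumptions made for illustration only.

```python
# Hypothetical sketch of the "virtual overlay" calculation: mapping a pixel
# selected in an equirectangular (2:1) frame to the longitude and latitude
# defined by the spatial grid overlay (-180..+180, -90..+90).

def pixel_to_lonlat(x, y, width, height):
    """Convert pixel coordinates (origin at top-left) to (longitude, latitude).

    x in [0, width) maps to longitude in [-180, +180);
    y in [0, height) maps to latitude from +90 (top) down to -90 (bottom).
    """
    lon = (x / width) * 360.0 - 180.0
    lat = 90.0 - (y / height) * 180.0
    return lon, lat

# Example for an HD equirectangular frame (1920x960):
print(pixel_to_lonlat(960, 480, 1920, 960))   # (0.0, 0.0) - image centre
print(pixel_to_lonlat(0, 0, 1920, 960))       # (-180.0, 90.0) - top-left corner
```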
Once the spatial grid overlay 14 is applied, an object of interest 16 is identified in Step 108 in the immersive media content. For example in Fig. 3 the object of interest 16 is an airplane flying over a river. The object of interest may be identified by a human editor or by an artificial intelligence and possibly reviewed by a human editor.
In order to provide time-based (or frame-based) metadata, in Step 110 a reference frame is determined for the object of interest 16.
The object of interest 16 may be present in a single frame 12 or a few frames 12 which, having regard to the frame rate of the video, are displayed practically at a single time instance. In this case it is sufficient to identify a single (first) reference frame in Step 110 in which the object of interest 16 is present.
Alternatively, the object of interest 16 may be present in a plurality of frames 12 which, having regard to the frame rate of the video, are displayed during an observable length of time. In this case, preferably, a first reference frame 12a is determined in Step 110 in which said object of interest 16 is present for the first time in a frame sequence 18, as illustrated in Fig. 3. Preferably, a second reference frame 12b is further determined in Step 112 in which said object of interest 16 is present for the last time in the frame sequence 18. For an object of interest 16 that disappears only temporarily and reappears later in the video, further reference frames may be determined marking the points of reappearance and disappearance.
All of the above mentioned reference frames 12a, 12b can be defined by determining their frame sequence data. The frame sequence data corresponds to a position of the reference frame 12a, 12b in the frame sequence 18. This position can be identified via the number of the frame 12 in the frame sequence. For example in Fig. 3 the first reference frame 12a is the n-th frame 12 in the frame sequence 18, meaning that the number of the first reference frame 12a is n, while the second reference frame 12b is the (n+m)-th frame 12 in the frame sequence 18, meaning that the number of the second reference frame 12b is n+m. It is also possible to identify the position of the reference frame 12a, 12b as a time point at which the given frame 12 is displayed when the video file 10 is being played by a media player. In the present example the first reference frame 12a is displayed at time t1 and the second reference frame 12b is displayed at time t2 when the video file 10 is played by a media player. This also means that the frame rate of the video file 10 is m/(t2-t1) frames per second. Since time and frame sequence number are linked via the frame rate, time-based metadata and frame-based metadata are used herein as synonyms.
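The link between frame-based and time-based metadata can be illustrated with a short sketch; the 25 fps frame rate used below is only the example value mentioned in the description, and the frame numbers are illustrative.

```python
# Illustration of the link between frame-based and time-based metadata.
# The frame rate below (25 fps) and the frame numbers are example values.

FRAME_RATE = 25.0  # frames per second

def frame_to_seconds(frame_number, frame_rate=FRAME_RATE):
    """Time (in seconds) at which a given frame is displayed."""
    return frame_number / frame_rate

def seconds_to_frame(t, frame_rate=FRAME_RATE):
    """Frame number displayed at time t (in seconds)."""
    return round(t * frame_rate)

# If the first reference frame is frame n = 250 and the second is frame
# n + m = 375, the object is visible from t1 = 10 s to t2 = 15 s, and
# m / (t2 - t1) = 125 / 5 = 25 gives back the frame rate.
print(frame_to_seconds(250), frame_to_seconds(375))  # 10.0 15.0
```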
A typical video frame rate is 25 frames/sec. Since a human eye cannot distinguish between subsequently played frames 12 within a fraction of a second, the first reference frame 12a and the second reference frame 12b need not be strictly the first and last frames in which the object of interest 16 appears and disappears respectively but in fact any frames 12 can be selected which would be perceived by a human viewer as being close to the frame of appearance and disappearance.
If it is not necessary to mark the point of appearance and disappearance any frame can be used as the first reference frame 12a and no second reference frame 12b is recorded. Even in this case it is advantageous to use the frame 12 of the first appearance as the first reference frame 12a since a user wishing to find an object of interest within a video file 10 can easily play the video from the first reference frame 12a and hence view all the subsequent frames 12, but it is more difficult to trace back the first appearance if that is upstream of the first reference frame 12a.
If the object of interest 16 is present for a very short time perceived by a human as a single instant, then any one of the frames 12 containing the object of interest 16 can be used as the first reference frame 12a and no further reference frames are needed.
In order to produce location-based metadata, in Step 114 latitude and longitude data of a point (pixel) associated with the object of interest 16 is determined in the spatial grid overlay 14 of the first reference frame 12a. The latitude data corresponds to the Y coordinate (distance measured from the X axis) of the pixel, ranging from -90° to +90°, and the longitude data corresponds to the X coordinate (distance measured from the Y axis) of the pixel, ranging from -180° to +180°. When the image is wrapped into a sphere and viewed from within, the X and Y coordinates give the longitudinal and latitudinal position of the given pixel within the sphere.
The pixel associated with the object of interest 16 may be any pixel from the pixels making up the object of interest 16 or a pixel in the vicinity of the pixels making up the object of interest 16. In a preferred embodiment the pixel associated with the object of interest 16 is approximately at the center of the object of interest 16 even if this pixel is outside of the object of interest 16 (e.g. in the case of a ring the center of the ring is not part of the ring).
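A possible way of selecting such a point is sketched below: the centre of a hypothetical bounding box drawn around the object, which may itself lie outside the object (as in the ring example). The sketch reuses the pixel_to_lonlat helper from the earlier illustration; all numbers are illustrative.

```python
# Sketch of picking the point associated with the object of interest:
# here the centre of a (hypothetical) bounding box around the object,
# which may itself lie outside the object (e.g. the centre of a ring).
# Reuses the pixel_to_lonlat helper from the earlier sketch.

def bbox_center(x_min, y_min, x_max, y_max):
    """Centre pixel of an axis-aligned bounding box."""
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0

# Example: an airplane bounded by (1200, 300)-(1400, 380) in a 1920x960 frame.
cx, cy = bbox_center(1200, 300, 1400, 380)
print(pixel_to_lonlat(cx, cy, 1920, 960))  # longitude/latitude of the centre
```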
In case a second reference frame 12b has also been defined, latitude and longitude data is preferably determined in the same way in Step 116 in the second reference frame 12b as well.
Steps 110, 112, 114 and 116 can be carried out in any order.
It is also possible to track the location of the object of interest 16 through the subsequent frames 12 of the video file 10. This can be achieved by determining in further steps one or more intermediate reference frames 12c together with their frame sequence data (e.g. frame number or time) and by determining latitude and longitude data of a point associated with the object of interest 16 in the spatial grid overlay 14 of each intermediate reference frame 12c. The more intermediate reference frames 12c there are, the more accurate the object tracking. This is particularly useful if it would otherwise be difficult to track the motion of the object of interest 16 in the 360-degree view of the video content when the video file 10 is played.
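A sketch of how intermediate reference frames could be derived is given below. Linear interpolation between the first and second reference positions is assumed purely for illustration; in practice the coordinates for each intermediate reference frame 12c would be read from the grid overlay itself.

```python
# Sketch of deriving intermediate reference frames for object tracking.
# Linear interpolation between the "In" and "Out" positions is assumed only
# for illustration; real coordinates would be read from the grid overlay.

def interpolate_positions(frame_in, lonlat_in, frame_out, lonlat_out, step):
    """Yield (frame_number, (lon, lat)) for intermediate reference frames."""
    span = frame_out - frame_in
    for f in range(frame_in + step, frame_out, step):
        t = (f - frame_in) / span
        lon = lonlat_in[0] + t * (lonlat_out[0] - lonlat_in[0])
        lat = lonlat_in[1] + t * (lonlat_out[1] - lonlat_in[1])
        yield f, (lon, lat)

# Example: airplane enters at frame 250 at (45, 37) and leaves at frame 375
# at (32, 54); one intermediate reference frame every 25 frames (one second).
for frame, pos in interpolate_positions(250, (45.0, 37.0), 375, (32.0, 54.0), 25):
    print(frame, pos)
```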
In Step 118 information data is provided, which is associated with the object of interest 16. In the context of the present invention associated information data codes information related to the object of interest 16. In the example illustrated in Fig. 3 such information data could code the following comment: "Airplane flying over the river". If keywords have been generated by an artificial intelligence, these keywords can be used to provide the information data. For example if the artificial intelligence has generated the keyword "airplane" the information data provided in Step 118 could code the single word "airplane".
Step 118 could be performed together with any one of Steps 110, 112, 114 and 116 or it could be carried out separately.
Following the previously described steps it is now possible to generate in Step 120 frame and location based metadata comprising the frame sequence data, the latitude and longitude data and the information data. In the present example the frame and location based metadata would include the comment "Airplane flying over the river", would indicate the time (or frame number) of the first reference frame 12a where the airplane (as object of interest 16) appears as well as the longitudinal and latitudinal location of the appearance within the spherical image, and would further indicate the time (or frame number) of the second reference frame 12b where the airplane disappears as well as the longitudinal and latitudinal location of its disappearance.
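The resulting frame and location based metadata for the airplane example could be represented, for instance, as follows; the field names and the frame and coordinate values are illustrative only, as the description does not prescribe a particular serialisation.

```python
# Sketch of the frame and location based metadata record for the airplane
# example. Field names and values are illustrative only.

from dataclasses import dataclass, asdict

@dataclass
class FrameLocationMetadata:
    comment: str       # information data
    in_frame: int      # frame sequence data of the first reference frame
    in_lon: float      # longitude of the object in the first reference frame
    in_lat: float      # latitude of the object in the first reference frame
    out_frame: int     # frame sequence data of the second reference frame
    out_lon: float     # longitude of the object in the second reference frame
    out_lat: float     # latitude of the object in the second reference frame

record = FrameLocationMetadata(
    comment="Airplane flying over the river",
    in_frame=250, in_lon=45.0, in_lat=37.0,
    out_frame=375, out_lon=32.0, out_lat=54.0,
)
print(asdict(record))
```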
In Step 122 the generated metadata is applied to the immersive media file 10.
The generated metadata can be applied to the immersive video file 10 in known ways. One common way of applying metadata is by storing the generated metadata in the immersive media file's 10 .xmp metadata fields in the form of XMP metadata. This is generally referred to as embedded XMP metadata.
Nowadays all motion content - standard and immersive - stores metadata using the Extensible Metadata Platform (XMP). The Extensible Metadata Platform is an ISO standard, originally created by Adobe Systems Inc., for the creation, processing and interchange of standardized and custom metadata for digital documents and data sets.
XMP is built on XML, which facilitates the exchange of metadata across a variety of applications and publishing workflows.
Metadata in most other formats (such as Exif, GPS, and TIFF) automatically transfers to XMP so it can be more easily viewed and managed.
In some cases, XMP metadata is stored directly in source files. If a particular file format doesn't support XMP, however, metadata is stored in a separate sidecar file. Hence, another possibility for applying the generated metadata is to create an XMP sidecar file, which can then be handled together with the immersive video file 10.
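A minimal sketch of writing such an XMP sidecar file is shown below. The packet wrapper, RDF structure and dc:subject bag follow the XMP specification, but placing the frame and location marker text in dc:description is an assumption made here for illustration; a real workflow might define a custom namespace instead.

```python
# Minimal sketch of writing an XMP sidecar (.xmp) file. The wrapper, RDF
# structure and dc:subject bag follow the XMP specification; storing the
# frame/location marker text in dc:description is an assumption.

from xml.sax.saxutils import escape

def write_xmp_sidecar(path, keywords, marker_text):
    subject_items = "\n".join(
        f"     <rdf:li>{escape(kw)}</rdf:li>" for kw in keywords
    )
    xmp = f"""<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:subject>
    <rdf:Bag>
{subject_items}
    </rdf:Bag>
   </dc:subject>
   <dc:description>
    <rdf:Alt>
     <rdf:li xml:lang="x-default">{escape(marker_text)}</rdf:li>
    </rdf:Alt>
   </dc:description>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
"""
    with open(path, "w", encoding="utf-8") as f:
        f.write(xmp)

write_xmp_sidecar(
    "immersive_video.xmp",
    keywords=["airplane", "Danube", "360-degree"],
    marker_text="In 00:00:10 I(x45;y37) / Out 00:00:15 O(x32;y54) "
                "Airplane flying over the river",
)
```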
The inventors have realized that the above described Step 100 and Steps 104 to 120 can be carried out for example in Adobe Premiere, which is a timeline-based video editing application developed by Adobe Systems.
In order to perform the method according to the invention using Adobe Premiere, in Step 100 the user opens Adobe Premiere and creates a new project. The user then selects from the media browser the desired immersive video file 10 and drags it over to the project panel to import the video file 10 for editing (see Fig. 4).
In the present embodiment the spatial grid overlay 14 is applied as a .png file which precisely matches the dimensions of the frames in the immersive video file 10. For example a .png file of 1920x960 pixels is used to process HD video, while a file of 3840x1920 pixels is used to process 4K video. In Step 104 the user also drags the spatial grid overlay 14 (.png file) into the project panel of Adobe Premiere. The immersive video file 10 and the spatial grid overlay 14 (.png file) are both added to a timeline of the project such that the spatial grid overlay 14 is on top of the video content (Figs. 5 and 5a).
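A transparent grid overlay of this kind can also be generated programmatically, for example with Pillow as sketched below; the 10° line spacing and colours are assumptions, since the description only requires that the overlay match the frame dimensions and the 2:1 ratio.

```python
# Sketch of generating a transparent 2:1 spatial grid overlay as a .png,
# e.g. 1920x960 for HD or 3840x1920 for 4K, using Pillow. The 10-degree
# line spacing and the colours are assumptions.

from PIL import Image, ImageDraw

def make_grid_overlay(width, height, step_deg=10, path="grid_overlay.png"):
    img = Image.new("RGBA", (width, height), (0, 0, 0, 0))  # fully transparent
    draw = ImageDraw.Draw(img)
    # longitude lines: -180..+180 mapped across the image width
    for lon in range(-180, 181, step_deg):
        x = round((lon + 180) / 360 * (width - 1))
        draw.line([(x, 0), (x, height - 1)], fill=(255, 255, 255, 128), width=1)
    # latitude lines: +90 (top) .. -90 (bottom) mapped across the height
    for lat in range(-90, 91, step_deg):
        y = round((90 - lat) / 180 * (height - 1))
        draw.line([(0, y), (width - 1, y)], fill=(255, 255, 255, 128), width=1)
    img.save(path)

make_grid_overlay(1920, 960)                                # HD overlay
make_grid_overlay(3840, 1920, path="grid_overlay_4k.png")   # 4K overlay
```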
In Step 108 the user identifies the object of interest 16 in the media content as described above. In the present example the object of interest 16 is an airplane flying over the river Danube. If an artificial intelligence has been used to generate keywords for the video content these keywords can help to identify the possible objects of interest 16. For example if the keyword "airplane" is included in the list of keywords the user need only locate the airplane in the video and add any further description that he or she may find useful, e.g. "flying over the Danube".
Adobe Premiere has a built-in marker function for applying notes to video files. The inventors have realized that Adobe Premiere's marker function is suitable for embedding keywords and information in a video's .xmp metadata fields, and as a result making those keywords and information universally searchable. The inventors have further recognized that the marker function is suitable for carrying out Steps 110, 112, 114, 116 and 118 of the method according to the invention.
The user opens the program panel in the Adobe Premiere project and navigates the cursor to the first reference frame 12a where the object of interest 16 appears first in the video content. The user then adds a comment marker by opening a marker editor panel 20 and selecting the marker type "Comment Marker".
The comment markers are provided for tagging objects within the video, and contain the following fields:
- a "Name" field, for adding a title (e.g. type of comment)
- a "Comment" field for adding keywords,
- "In" field for the time code when the object appears in the video
- "Out" field for the time code when the object disappears in the video. After having selected the option "Comment Marker" in the marker's editor panel 20 the user inputs keywords (e.g. "Airplane flying over the Danube") in the "Comments" field 22 of the editor panel 20 (Step 1 18). See Figs. 6 and 6a.
The user navigates his cursor on the timeline to where the object of interest 16 he is tagging appears in the video stream and copies the time code to an "In" field 26 of a marker data input box 24 (Step 110). See Figs. 7 and 7a.
After this, the user navigates his cursor on the timeline to where the object of interest 16 he is tagging disappears in the video stream and copies the time code to an "Out" field 28 of the marker data input box 24 (Step 112). See Figs. 8 and 8a.
Next, the user determines the spatial coordinates of the object of interest 16. The user goes to the "In" frame determined by the marker, reads the coordinates of a selected point (preferably a center point) of the object of interest 16 (the airplane) from the grid overlay 14 and enters them into the "Name" field 30 of the marker data input box 24 (Step 114) using the following format: I(x45;y37). If necessary the user can zoom into the selected frame 12 to better determine the coordinates. See Figs. 9 and 9a. The user then goes to the "Out" frame determined by the marker, reads the coordinates of preferably the same point or a point in the vicinity of the first selected point of the object of interest 16 (the airplane) from the grid overlay 14 and enters them into the "Name" field 30 of the marker data input box 24 (Step 116) using the following format: O(x32;y54). See Figs. 10 and 10a.
The marker is then saved and ready for applying to the video file 10. It is further possible to add other types of built-in markers, e.g. segmentation markers, which can be used to mark the starting time and ending time of a scene of interest in the immersive media content. The process is similar to what is described above, the difference being that there is no spatial latitude and longitude data, hence the "Name" field of the segmentation marker can be used for other purposes (e.g. it may also be used to enter comments, including keywords).
The generated metadata may also relate to audio information, such as the start and end of certain audio content (e.g. background music or sound). Certain audio information may be associated with a location, e.g. a speech or dialog may be associated with the location of the person(s) speaking within the 360-degree video. In this case the same "Comment" type of markers can be used as described above.
In order to "embed" the markers added to any video using the previous steps, the video has to be exported. Using this process saves all of the markers into the video's .xmp metadata fields. The video resulting from the exporting process will contain all the markers (with the keywords) that the user has entered to the timeline as clip markers.
The markers can also be exported separately into a comma-separated values (CSV) file or other similar file format, which allows the user to work with the markers' content outside Adobe Premiere and separately from the video file 10.
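A sketch of processing such an export outside Adobe Premiere is given below. The exact column names produced by Premiere's marker export are not specified in the description, so "Marker Name", "Description", "In" and "Out" are assumed; the I(x..;y..) / O(x..;y..) notation is the one used in the embodiment above.

```python
# Sketch of post-processing a marker export outside Adobe Premiere.
# The column names are assumptions; the I(x..;y..) / O(x..;y..) coordinate
# notation is the one used in the embodiment described above.

import csv
import re

COORD_RE = re.compile(r"([IO])\(x(-?\d+);y(-?\d+)\)")

def parse_coordinates(name_field):
    """Return {'I': (lon, lat), 'O': (lon, lat)} found in a marker name field."""
    return {
        tag: (float(lon), float(lat))
        for tag, lon, lat in COORD_RE.findall(name_field)
    }

def read_markers(csv_path):
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {
                "keywords": row.get("Description", ""),
                "in": row.get("In", ""),
                "out": row.get("Out", ""),
                "coords": parse_coordinates(row.get("Marker Name", "")),
            }

print(parse_coordinates("I(x45;y37) O(x32;y54)"))
# {'I': (45.0, 37.0), 'O': (32.0, 54.0)}
```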
The XMP sidecar and the relevant marker lines within it can be read if the .xmp metadata is exported from the video using e.g. ExifTool.
ExifTool is free and open-source software for reading, writing and manipulating image, audio and video metadata. ExifTool has a graphical user interface with limited functions, but the available options are sufficient for exporting the XMP metadata from the video file 10.
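One possible way of extracting the embedded XMP packet into a sidecar file with ExifTool's command line is sketched below; the file paths are illustrative.

```python
# One possible way to dump the embedded XMP block from the exported video
# into a sidecar file using ExifTool's command line ("-xmp -b" writes the
# raw XMP packet); paths are illustrative.

import subprocess

def export_xmp_sidecar(video_path, sidecar_path):
    result = subprocess.run(
        ["exiftool", "-xmp", "-b", video_path],
        check=True,
        capture_output=True,
    )
    with open(sidecar_path, "wb") as f:
        f.write(result.stdout)

export_xmp_sidecar("immersive_video.mp4", "immersive_video.xmp")
```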
By exporting the markers into a CSV file, the user has the option to work with these tags outside of an Adobe product and independently of the video file 10 itself.
The end result of the method according to the invention is an immersive (360-degree) video tagged with metadata that allows it to be discovered by using frame- and location-based metadata markers.
It is further possible to export the .xmp metadata in other convenient file formats, e.g. as an .xls file. XLS is a file extension for a spreadsheet file format created by Microsoft for use with Microsoft Excel. The frame sequence data, the latitude and longitude data and the information data are more easily viewed in a spreadsheet format and can be conveniently handled by other programs.
The method according to the present invention can also be applied to other types of media, e.g. image files.
According to another preferred method an immersive media file 10 is provided which electronically codes a single still image. In this case the single still image is regarded as the single frame 12 of the media file 10. The above described steps can be carried out similarly by treating the media file 10 as a one-frame-long video.
Various modifications to the above disclosed embodiments will be apparent to a person skilled in the art without departing from the scope of protection determined by the attached claims.

Claims

1. Method for applying metadata to immersive media files, comprising the following steps:
- providing an immersive media file comprising at least one frame coding immersive media content,
- adding a spatial grid overlay on the at least one frame, said spatial grid overlay defining latitudes and longitudes on a spheroid surface,
- identifying an object of interest in the immersive media content,
- determining frame sequence data of a first reference frame in which said object of interest is present,
- determining latitude and longitude data of a point associated with the object of interest in the spatial grid overlay of the first reference frame,
- providing information data associated with the object of interest
- generating frame and location based metadata comprising the frame sequence data, the latitude and longitude data and the information data, and
- applying the generated metadata to the immersive media file.
2. The method according to claim 1, wherein the immersive media file comprises more than one frame forming a frame sequence and the frame sequence data corresponds to a position of a frame in the frame sequence, and the first reference frame is a frame in which said object of interest is present for the first time in the frame sequence, the method further comprising the steps of:
- determining frame sequence data of a second reference frame in which said object of interest is present for the last time following the first reference frame in the frame sequence,
- determining latitude and longitude data of a point associated with the object of interest in the spatial grid overlay of the second reference frame, and
- generating frame and location based metadata further comprising said frame sequence data of the second reference frame and said latitude and longitude data in the spatial grid overlay of the second reference frame.
3. The method according to claims 1 or 2, further comprising:
- processing the at least one frame by performing image recognition on the at least one frame by suitable artificial intelligence,
- providing keywords for the at least one processed frame via the artificial intelligence,
- performing human review of the keywords,
- discarding the frame based keywords which are deemed inappropriate by the human reviewer for the at least one frame,
- using the keywords that are not discarded for providing information data.
4. The method according to any one of claims 1 to 3, further comprising:
- determining at least one frame sequence data of an intermediate reference frame, different from the first reference frame and the second reference frame, in which said object of interest is present,
- determining latitude and longitude data of a point associated with the object of interest in the spatial grid overlay of the intermediate reference frame, and
- generating frame and location based metadata further comprising said frame sequence data of the intermediate reference frame and said latitude and longitude data in the spatial grid overlay of the intermediate reference frame.
5. The method according to any one of claims 1 to 4, further comprising:
- identifying a scene of interest in the immersive media content,
- determining frame sequence data of a first scene frame where said scene of interest starts,
- determining frame sequence data of a second scene frame where said scene of interest ends,
- providing information data associated with the scene of interest,
- generating frame and location based metadata further comprising said frame sequence data of the first scene frame, and said frame sequence data of the second scene frame, and said information data associated with the scene of interest, and
- applying the generated metadata to the immersive media file.
6. The method according to any one of claims 1 to 5, further comprising applying the generated metadata to the immersive media file by storing the generated metadata in the immersive media file in the form of embedded XMP metadata.
7. The method according to any one of claims 1 to 6, further comprising applying the generated metadata to the immersive media file by storing the generated metadata in an XMP sidecar file for the immersive media file.
8. The method according to any one of claims 1 to 7, wherein the object of interest is audio information and the point associated with the object of interest is the point associated with an object in the media content, which is identified as the source of the audio information.
9. The method according to claim 5, wherein the scene of interest is a scene during which given audio information can be heard when the media file is played.
PCT/IB2017/054839 2017-08-08 2017-08-08 Method for applying metadata to immersive media files WO2019030551A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2017/054839 WO2019030551A1 (en) 2017-08-08 2017-08-08 Method for applying metadata to immersive media files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2017/054839 WO2019030551A1 (en) 2017-08-08 2017-08-08 Method for applying metadata to immersive media files

Publications (1)

Publication Number Publication Date
WO2019030551A1 (en) 2019-02-14

Family

ID=59887323

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2017/054839 WO2019030551A1 (en) 2017-08-08 2017-08-08 Method for applying metadata to immersive media files

Country Status (1)

Country Link
WO (1) WO2019030551A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6711590B1 (en) * 1998-07-10 2004-03-23 Canon Kabushiki Kaisha Linking metadata with a time-sequential digital signal
US20030149983A1 (en) * 2002-02-06 2003-08-07 Markel Steven O. Tracking moving objects on video with interactive access points
US20170084084A1 (en) * 2015-09-22 2017-03-23 Thrillbox, Inc Mapping of user interaction within a virtual reality environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAORAN YI ET AL: "Automatic Generation of MPEG-7 Compliant XML Document for Motion Trajectory Descriptor in Sports Video", MULTIMEDIA TOOLS AND APPLICATIONS, KLUWER ACADEMIC PUBLISHERS, BO, vol. 26, no. 2, 1 June 2005 (2005-06-01), pages 191 - 206, XP019213867, ISSN: 1573-7721, DOI: 10.1007/S11042-005-0450-8 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11954150B2 (en) * 2018-04-20 2024-04-09 Samsung Electronics Co., Ltd. Electronic device and method for controlling the electronic device thereof

Similar Documents

Publication Publication Date Title
JP7123122B2 (en) Navigating Video Scenes Using Cognitive Insights
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
US7908556B2 (en) Method and system for media landmark identification
US9462175B2 (en) Digital annotation-based visual recognition book pronunciation system and related method of operation
JP6013363B2 (en) Computerized method and device for annotating at least one feature of an image of a view
KR101887548B1 (en) Method and apparatus of processing media file for augmented reality services
US7945142B2 (en) Audio/visual editing tool
JP5510167B2 (en) Video search system and computer program therefor
US20110304774A1 (en) Contextual tagging of recorded data
US7921116B2 (en) Highly meaningful multimedia metadata creation and associations
US8966372B2 (en) Systems and methods for performing geotagging during video playback
US20160004911A1 (en) Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics
JP2018197865A (en) Geo-tagging of voice record
CN109063123B (en) Method and system for adding annotations to panoramic video
US8135724B2 (en) Digital media recasting
TW201113825A (en) Video content-aware advertisement placement
WO2016142638A1 (en) Anonymous live image search
US20210117471A1 (en) Method and system for automatically generating a video from an online product representation
KR20160044981A (en) Video processing apparatus and method of operations thereof
KR20090093904A (en) Apparatus and method for scene variation robust multimedia image analysis, and system for multimedia editing based on objects
CN104954640A (en) Camera device, video auto-tagging method and non-transitory computer readable medium thereof
US11126856B2 (en) Contextualized video segment selection for video-filled text
WO2019030551A1 (en) Method for applying metadata to immersive media files
Chen Storyboard-based accurate automatic summary video editing system
KR101947553B1 (en) Apparatus and Method for video edit based on object

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17768230

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17768230

Country of ref document: EP

Kind code of ref document: A1