WO2019030551A1 - Method for applying metadata to immersive media files - Google Patents

Method for applying metadata to immersive media files

Info

Publication number
WO2019030551A1
WO2019030551A1 (PCT/IB2017/054839)
Authority
WO
WIPO (PCT)
Prior art keywords
frame
interest
metadata
data
immersive media
Prior art date
2017-08-08
Application number
PCT/IB2017/054839
Other languages
French (fr)
Inventor
Mark MILSTEIN
Original Assignee
Milstein Mark
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-08-08
Filing date
2017-08-08
Publication date
2019-02-14
Application filed by Milstein Mark
Priority to PCT/IB2017/054839
Publication of WO2019030551A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for applying metadata to immersive media files, comprising the following steps: - providing an immersive media file comprising at least one frame coding immersive media content, - adding a spatial grid overlay on the at least one frame, said spatial grid overlay defining latitudes and longitudes on a spheroid surface, - identifying an object of interest in the immersive media content, - determining frame sequence data of a first reference frame in which said object of interest is present, - determining latitude and longitude data of a point associated with the object of interest in the spatial grid overlay of the first reference frame, - providing information data associated with the object of interest - generating frame and location based metadata comprising the frame sequence data, the latitude and longitude data and the information data, and - applying the generated metadata to the immersive media file.

Description

Method for applying metadata to immersive media files
The present invention relates to a method for applying metadata to immersive media files
Film and video have existed as media for over a century and over half a century, respectively. While the use and popularity of both physical forms of these media have waned in recent years, their digital incarnations have exploded in popularity owing to the availability of low-cost, easily procured cameras, editing software and high-speed internet. This combination has led to the creation of a large number of online stock media marketplaces where professional and amateur videographers can place their content for licensing to third-party users.
Until early 2017, these online stock media marketplaces limited the length of the video clips which could be uploaded and hosted on their platforms to a maximum, on average, of 45 seconds. This relatively short length resulted in nearly all stock video focusing on a single event, scene or moment.
This maximum length was changed by nearly all major stock media marketplaces in early 2017 to approximately three minutes.
This more than tripling of the maximum time length now allowed videographers to increase the variety of subjects or events covered in those videos from a single event to multiple events and scenes.
This in turn meant that traditional means of keywording that content, i.e. applying metadata intended to guarantee online and internal discoverability of those assets, were no longer valid. Traditional means can be described as providing those assets with a single, short, concise description, sometimes called a caption, comprising no fewer than five and no more than 12 words, as well as providing the asset with as many keywords as necessary to guarantee its discoverability in a standard search query. The presentation and retrieval of this metadata is controlled by a set of standards developed by the International Press Telecommunications Council (IPTC).
Metadata is a set of descriptive information about a file. Video and audio files automatically include basic metadata properties, such as date, duration, and file type.
While the above described scheme of applying metadata to video was appropriate for short form content, it immediately lost its ability to provide accurate results for a new form of media which could display multiple events, audio and dialog, as well as various scenarios over a considerable length of time.
In parallel to these events, immersive video is becoming increasingly popular, which raises new kinds of challenges.
Immersive videos, also known as 360-degree videos or spherical videos, are video recordings where a view in every direction is recorded at the same time, shot using an omnidirectional camera or a collection of cameras. Virtual reality and augmented reality applications are also commonly based on immersive videos.
The inventors have realized that this new, immersive media also requires a new means of applying metadata to make them discoverable.
Therefore, a need exists to make discoverable these new, long form videos, as well as accommodate a unique, newly formed immersive media.
It is an object of the present invention to overcome the problems associated with the prior art.
These objects are achieved by a method for applying metadata to immersive media files, comprising the steps of:
- providing an immersive media file comprising at least one frame coding immersive media content,
- adding a spatial grid overlay on the at least one frame, said spatial grid overlay defining latitudes and longitudes on a spheroid surface,
- identifying an object of interest in the immersive media content,
- determining frame sequence data of a first reference frame in which said object of interest is present,
- determining latitude and longitude data of a point associated with the object of interest in the spatial grid overlay of the first reference frame,
- providing information data associated with the object of interest
- generating frame and location based metadata comprising the frame sequence data, the latitude and longitude data and the information data, and
- applying the generated metadata to the immersive media file.
Further advantageous embodiments of the invention are defined in the attached dependent claims.
Further details of the invention will be apparent from the accompanying figures and exemplary embodiments.
Fig. 1 is a flow diagram of a preferred method according to the invention.
Fig. 2 is an illustration of applying a spatial grid overlay to a media file.
Fig. 3 is an illustration of identifying an object of interest in a media stream.
Figs. 4 to 10 are screenshots of an exemplary embodiment being carried out using Adobe Premiere, wherein the screenshots show the following steps:
Fig. 4: Opening a video file in Adobe Premiere.
Fig. 5: Applying a spatial grid overlay to the video file.
Fig. 5a: Enlarged video frame with spatial grid overlay taken from the screenshot in Fig. 5.
Fig. 6: Adding a comment marker (information data) to an object of interest identified in the video stream.
Fig. 6a: Enlarged marker editor panel taken from the screenshot in Fig. 6.
Fig. 7: Inputting frame sequence data (time) of first frame wherein object of interest appears.
Fig. 7a: Enlarged marker data input box taken from the screenshot in Fig. 7.
Fig. 8: Inputting frame sequence data (time) of second frame wherein object of interest disappears.
Fig. 8a: Enlarged marker data input box taken from the screenshot in Fig. 8.
Fig. 9: Inputting coordinates of object of interest within the first frame.
Fig. 9a: Enlarged marker data input box taken from the screenshot in Fig. 9.
Fig. 10: Inputting coordinates of object of interest within the second frame.
Fig. 10a: Enlarged marker data input box taken from the screenshot in Fig. 10.
According to the preferred method illustrated in Fig. 1 an immersive media file 10 is provided in Step 100, which immersive media file 10 comprises at least one frame 12 coding immersive media content. In the illustrated example the immersive media file 10 is an immersive video comprising a plurality of frames 12. The frames 12 comprise electronically coded still images, which, when displayed subsequently at a given frame rate, produce a sense of motion.
In Step 102 the frames 12 are preferably processed by performing image recognition on the frames by an artificial intelligence. One suitable artificial intelligence is Clarifai (www.clarifai.com). Clarifai's visual recognition model processes the frames 12 of the video file 10 in real-time and returns predictions on what is in the still images coded in the frames. These predictions take the form of keywords with associated probabilities indicating how likely it is that the given keyword accurately describes the content.
Other artificial intelligence programs work similarly to produce keywords relating to the media file 10. In the context of the present invention the term keyword is understood to mean one or more words describing the image, including conceptual description.
The keywords may be associated with the whole media file 10 or only part of the frames 12 if the keywords relate to objects or concepts appearing in those frames 12.
In Step 104 the keywords provided by the artificial intelligence are preferably reviewed by a human editor. In order to facilitate reviewing, a probability threshold may be defined such that keywords having a probability lower than the given threshold are automatically discarded without review. The keywords which are deemed inappropriate by the human reviewer for the frames 12 with which the keyword is associated are also discarded. Keywords which are not discarded are used for providing information data, as will be explained in the following.
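By way of illustration, a minimal sketch of this pre-filtering step is given below. The response format of the recognition service is not specified in the description, so a generic list of (keyword, probability) pairs is assumed, and the threshold value is purely illustrative.

```python
# Minimal sketch of the keyword pre-filtering described above.
# The exact structure returned by the recognition service is not specified
# here, so a generic list of (keyword, probability) pairs is assumed;
# names such as PROBABILITY_THRESHOLD are illustrative.

PROBABILITY_THRESHOLD = 0.85  # assumed cut-off; tune per workflow

def prefilter_keywords(predictions):
    """Discard keywords whose probability falls below the threshold.

    predictions: iterable of (keyword, probability) pairs, e.g.
                 [("airplane", 0.97), ("boat", 0.41)]
    Returns the keywords that remain for human review.
    """
    return [kw for kw, prob in predictions if prob >= PROBABILITY_THRESHOLD]

# Example: only "airplane" and "river" survive automatic filtering;
# a human editor then reviews (and possibly discards) these as well.
kept = prefilter_keywords([("airplane", 0.97), ("river", 0.91), ("boat", 0.41)])
print(kept)  # ['airplane', 'river']
```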
In Step 106 a spatial grid overlay 14 is added on the frames 12, which spatial grid overlay defines latitudes and longitudes on a spheroid surface. Preferably each still image coded in each frame is an equirectangular projection (also called spherical projection) where the stitched image shows a 360° horizontal by 180° vertical field of view, i.e. the whole sphere. Panoramas in this projection are meant to be viewed as though the image is wrapped into a sphere and viewed from within. In this case the spatial grid overlay 14 is a Cartesian coordinate system wherein the two perpendicular X and Y coordinate axes define the longitudes and latitudes, the longitudes ranging from -180° to +180° and the latitudes ranging from -90° to +90°. By reading the numbers from the X and Y axes any object can be located within a 360° space. The grid is preferably custom designed with a 2:1 aspect ratio to match various camera standards.
The application of the spatial grid overlay 14 to a given frame 12 is illustrated in Fig. 2. As a result, an annotated frame 12 is produced, which comprises both the original image content and the spatial grid overlay 14. In another exemplary embodiment the spatial grid overlay 14 is a virtual overlay which is not visible to the user and which is used by a program to calculate the spatial location of a point or area selected by the user.
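The calculation performed by such a virtual overlay can be sketched as follows for an equirectangular (2:1) frame; the function name and the pixel convention (origin at the top-left corner) are assumptions made for illustration only.

```python
# Hypothetical sketch of the "virtual overlay" calculation: mapping a pixel
# selected in an equirectangular (2:1) frame to the longitude and latitude
# defined by the spatial grid overlay (-180..+180, -90..+90).

def pixel_to_lonlat(x, y, width, height):
    """Convert pixel coordinates (origin at top-left) to (longitude, latitude).

    x in [0, width) maps to longitude in [-180, +180);
    y in [0, height) maps to latitude from +90 (top) down to -90 (bottom).
    """
    lon = (x / width) * 360.0 - 180.0
    lat = 90.0 - (y / height) * 180.0
    return lon, lat

# Example for an HD equirectangular frame (1920x960):
print(pixel_to_lonlat(960, 480, 1920, 960))   # (0.0, 0.0) - image centre
print(pixel_to_lonlat(0, 0, 1920, 960))       # (-180.0, 90.0) - top-left corner
```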
Once the spatial grid overlay 14 is applied, an object of interest 16 is identified in Step 108 in the immersive media content. For example in Fig. 3 the object of interest 16 is an airplane flying over a river. The object of interest may be identified by a human editor or by an artificial intelligence and possibly reviewed by a human editor.
In order to provide time-based (or frame-based) metadata, in Step 110 a reference frame is determined for the object of interest 16.
The object of interest 16 may be present in a single frame 12 or a few frames 12 which, having regard to the frame rate of the video, are displayed practically at a single time instance. In this case it is sufficient to identify a single (first) reference frame in Step 110 in which the object of interest 16 is present.
Alternatively, the object of interest 16 may be present in a plurality of frames 12 which, having regard to the frame rate of the video, are displayed during an observable length of time. In this case, preferably, a first reference frame 12a is determined in Step 110 in which said object of interest 16 is present for the first time in a frame sequence 18, as illustrated in Fig. 3. Preferably, a second reference frame 12b is further determined in Step 112 in which said object of interest 16 is present for the last time in the frame sequence 18. For an object of interest 16 that disappears only temporarily and reappears later in the video, further reference frames may be determined marking the points of reappearance and disappearance.
All of the above mentioned reference frames 12a, 12b can be defined by determining their frame sequence data. The frame sequence data corresponds to a position of the reference frame 12a, 12b in the frame sequence 18. This position can be identified via the number of the frame 12 in the frame sequence. For example in Fig. 3 the first reference frame 12a is the n-th frame 12 in the frame sequence 18, meaning that the number of the first reference frame 12a is n, while the second reference frame 12b is the (n+m)-th frame 12 in the frame sequence 18, meaning that the number of the second reference frame 12b is n+m. It is also possible to identify the position of the reference frame 12a, 12b as a time point at which the given frame 12 is displayed when the video file 10 is being played by a media player. In the present example the first reference frame 12a is displayed at time t1 and the second reference frame 12b is displayed at time t2 when the video file 10 is played by a media player. This also means that the frame rate of the video file 10 is m/(t2-t1) frames per second. Since time and frame sequence number are linked via the frame rate, time-based metadata and frame-based metadata are used herein as synonyms.
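The link between frame-based and time-based metadata can be illustrated with a short sketch; the 25 fps frame rate used below is only the example value mentioned in the description, and the frame numbers are illustrative.

```python
# Illustration of the link between frame-based and time-based metadata.
# The frame rate below (25 fps) and the frame numbers are example values.

FRAME_RATE = 25.0  # frames per second

def frame_to_seconds(frame_number, frame_rate=FRAME_RATE):
    """Time (in seconds) at which a given frame is displayed."""
    return frame_number / frame_rate

def seconds_to_frame(t, frame_rate=FRAME_RATE):
    """Frame number displayed at time t (in seconds)."""
    return round(t * frame_rate)

# If the first reference frame is frame n = 250 and the second is frame
# n + m = 375, the object is visible from t1 = 10 s to t2 = 15 s, and
# m / (t2 - t1) = 125 / 5 = 25 gives back the frame rate.
print(frame_to_seconds(250), frame_to_seconds(375))  # 10.0 15.0
```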
A typical video frame rate is 25 frames/sec. Since a human eye cannot distinguish between subsequently played frames 12 within a fraction of a second, the first reference frame 12a and the second reference frame 12b need not be strictly the first and last frames in which the object of interest 16 appears and disappears respectively but in fact any frames 12 can be selected which would be perceived by a human viewer as being close to the frame of appearance and disappearance.
If it is not necessary to mark the point of appearance and disappearance any frame can be used as the first reference frame 12a and no second reference frame 12b is recorded. Even in this case it is advantageous to use the frame 12 of the first appearance as the first reference frame 12a since a user wishing to find an object of interest within a video file 10 can easily play the video from the first reference frame 12a and hence view all the subsequent frames 12, but it is more difficult to trace back the first appearance if that is upstream of the first reference frame 12a.
If the object of interest 16 is present for a very short time perceived by a human as a single instant, then any one of the frames 12 containing the object of interest 16 can be used as the first reference frame 12a and no further reference frames are needed.
In order to produce location-based metadata, in Step 114 latitude and longitude data of a point (pixel) associated with the object of interest 16 is determined in the spatial grid overlay 14 of the first reference frame 12a. The latitude data corresponds to the Y coordinate (distance measured from the X axis) of the pixel, ranging from -90° to +90°, and the longitude data corresponds to the X coordinate (distance measured from the Y axis) of the pixel, ranging from -180° to +180°. When the image is wrapped into a sphere and viewed from within, the X and Y coordinates give the longitudinal and latitudinal position of the given pixel within the sphere.
The pixel associated with the object of interest 16 may be any pixel from the pixels making up the object of interest 16 or a pixel in the vicinity of the pixels making up the object of interest 16. In a preferred embodiment the pixel associated with the object of interest 16 is approximately at the center of the object of interest 16 even if this pixel is outside of the object of interest 16 (e.g. in the case of a ring the center of the ring is not part of the ring).
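A possible way of selecting such a point is sketched below: the centre of a hypothetical bounding box drawn around the object, which may itself lie outside the object (as in the ring example). The sketch reuses the pixel_to_lonlat helper from the earlier illustration; all numbers are illustrative.

```python
# Sketch of picking the point associated with the object of interest:
# here the centre of a (hypothetical) bounding box around the object,
# which may itself lie outside the object (e.g. the centre of a ring).
# Reuses the pixel_to_lonlat helper from the earlier sketch.

def bbox_center(x_min, y_min, x_max, y_max):
    """Centre pixel of an axis-aligned bounding box."""
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0

# Example: an airplane bounded by (1200, 300)-(1400, 380) in a 1920x960 frame.
cx, cy = bbox_center(1200, 300, 1400, 380)
print(pixel_to_lonlat(cx, cy, 1920, 960))  # longitude/latitude of the centre
```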
In case a second reference frame 12b has also been defined, latitude and longitude data is preferably determined in the same way in Step 116 in the second reference frame 12b as well.
Steps 110, 112, 114 and 116 can be carried out in any order.
It is also possible to track the location of the object of interest 16 through the subsequent frames 12 of the video file 10. This can be achieved by determining in further steps one or more intermediate reference frames 12c together with their frame sequence data (e.g. frame number or time) and by determining latitude and longitude data of a point associated with the object of interest 16 in the spatial grid overlay 14 of each intermediate reference frame 12c. The more intermediate reference frames 12c there are, the more accurate the object tracking. This is particularly useful if it would otherwise be difficult to track the motion of the object of interest 16 in the 360-degree view of the video content when the video file 10 is played.
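A sketch of how intermediate reference frames could be derived is given below. Linear interpolation between the first and second reference positions is assumed purely for illustration; in practice the coordinates for each intermediate reference frame 12c would be read from the grid overlay itself.

```python
# Sketch of deriving intermediate reference frames for object tracking.
# Linear interpolation between the "In" and "Out" positions is assumed only
# for illustration; real coordinates would be read from the grid overlay.

def interpolate_positions(frame_in, lonlat_in, frame_out, lonlat_out, step):
    """Yield (frame_number, (lon, lat)) for intermediate reference frames."""
    span = frame_out - frame_in
    for f in range(frame_in + step, frame_out, step):
        t = (f - frame_in) / span
        lon = lonlat_in[0] + t * (lonlat_out[0] - lonlat_in[0])
        lat = lonlat_in[1] + t * (lonlat_out[1] - lonlat_in[1])
        yield f, (lon, lat)

# Example: airplane enters at frame 250 at (45, 37) and leaves at frame 375
# at (32, 54); one intermediate reference frame every 25 frames (one second).
for frame, pos in interpolate_positions(250, (45.0, 37.0), 375, (32.0, 54.0), 25):
    print(frame, pos)
```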
In Step 118 information data is provided, which is associated with the object of interest 16. In the context of the present invention associated information data codes information related to the object of interest 16. In the example illustrated in Fig. 3 such information data could code the following comment: "Airplane flying over the river". If keywords have been generated by an artificial intelligence, these keywords can be used to provide the information data. For example if the artificial intelligence has generated the keyword "airplane" the information data provided in Step 118 could code the single word "airplane".
Step 118 could be performed together with any one of Steps 110, 112, 114 and 116 or it could be carried out separately.
Following the previously described steps it is now possible to generate in Step 120 frame and location based metadata comprising the frame sequence data, the latitude and longitude data and the information data. In the present example the frame and location based metadata would include the comment "Airplane flying over the river", would indicate the time (or frame number) of the first reference frame 12a where the airplane (as object of interest 16) appears as well as the longitudinal and latitudinal location of the appearance within the spherical image, and would further indicate the time (or frame number) of the second reference frame 12b where the airplane disappears as well as the longitudinal and latitudinal location of its disappearance.
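The resulting frame and location based metadata for the airplane example could be represented, for instance, as follows; the field names and the frame and coordinate values are illustrative only, as the description does not prescribe a particular serialisation.

```python
# Sketch of the frame and location based metadata record for the airplane
# example. Field names and values are illustrative only.

from dataclasses import dataclass, asdict

@dataclass
class FrameLocationMetadata:
    comment: str       # information data
    in_frame: int      # frame sequence data of the first reference frame
    in_lon: float      # longitude of the object in the first reference frame
    in_lat: float      # latitude of the object in the first reference frame
    out_frame: int     # frame sequence data of the second reference frame
    out_lon: float     # longitude of the object in the second reference frame
    out_lat: float     # latitude of the object in the second reference frame

record = FrameLocationMetadata(
    comment="Airplane flying over the river",
    in_frame=250, in_lon=45.0, in_lat=37.0,
    out_frame=375, out_lon=32.0, out_lat=54.0,
)
print(asdict(record))
```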
In Step 122 the generated metadata is applied to the immersive media file 10.
The generated metadata can be applied to the immersive video file 10 in known ways. One common way of applying metadata is by storing the generated metadata in the immersive media file's 10 .xmp metadata fields in the form of XMP metadata. This is generally referred to as embedded XMP metadata.
Nowadays all motion content - standard and immersive - stores metadata using the Extensible Metadata Platform (XMP). The Extensible Metadata Platform is an ISO standard, originally created by Adobe Systems Inc., for the creation, processing and interchange of standardized and custom metadata for digital documents and data sets.
XMP is built on XML, which facilitates the exchange of metadata across a variety of applications and publishing workflows.
Metadata in most other formats (such as Exif, GPS, and TIFF) automatically transfers to XMP so it can be more easily viewed and managed.
In some cases, XMP metadata is stored directly in source files. If a particular file format doesn't support XMP, however, metadata is stored in a separate sidecar file. Hence, another possibility for applying the generated metadata is to create an XMP sidecar file, which can then be handled together with the immersive video file 10.
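A minimal sketch of writing such an XMP sidecar file is shown below. The packet wrapper, RDF structure and dc:subject bag follow the XMP specification, but placing the frame and location marker text in dc:description is an assumption made here for illustration; a real workflow might define a custom namespace instead.

```python
# Minimal sketch of writing an XMP sidecar (.xmp) file. The wrapper, RDF
# structure and dc:subject bag follow the XMP specification; storing the
# frame/location marker text in dc:description is an assumption.

from xml.sax.saxutils import escape

def write_xmp_sidecar(path, keywords, marker_text):
    subject_items = "\n".join(
        f"     <rdf:li>{escape(kw)}</rdf:li>" for kw in keywords
    )
    xmp = f"""<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:subject>
    <rdf:Bag>
{subject_items}
    </rdf:Bag>
   </dc:subject>
   <dc:description>
    <rdf:Alt>
     <rdf:li xml:lang="x-default">{escape(marker_text)}</rdf:li>
    </rdf:Alt>
   </dc:description>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
"""
    with open(path, "w", encoding="utf-8") as f:
        f.write(xmp)

write_xmp_sidecar(
    "immersive_video.xmp",
    keywords=["airplane", "Danube", "360-degree"],
    marker_text="In 00:00:10 I(x45;y37) / Out 00:00:15 O(x32;y54) "
                "Airplane flying over the river",
)
```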
The inventors have realized that the above described Step 100 and Steps 104 to 120 can be carried out for example in Adobe Premiere, which is a timeline-based video editing application developed by Adobe Systems.
In order to perform the method according to the invention using Adobe Premiere, in Step 100 the user opens Adobe Premiere and creates a new project. The user then selects from the media browser the desired immersive video file 10 and drags it over to the project panel to import the video file 10 for editing (see Fig. 4).
In the present embodiment the spatial grid overlay 14 is applied as a .png file which precisely matches the dimensions of the frames in the immersive video file 10. For example a .png file of 1920x960 pixels is used to process HD video, while a file of 3840x1920 pixels is used to process 4K video. In Step 104 the user also drags the spatial grid overlay 14 (.png file) into the project panel of Adobe Premiere. The immersive video file 10 and the spatial grid overlay 14 (.png file) are both added to a timeline of the project such that the spatial grid overlay 14 is on top of the video content (Figs. 5 and 5a).
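A transparent grid overlay of this kind can also be generated programmatically, for example with Pillow as sketched below; the 10° line spacing and colours are assumptions, since the description only requires that the overlay match the frame dimensions and the 2:1 ratio.

```python
# Sketch of generating a transparent 2:1 spatial grid overlay as a .png,
# e.g. 1920x960 for HD or 3840x1920 for 4K, using Pillow. The 10-degree
# line spacing and the colours are assumptions.

from PIL import Image, ImageDraw

def make_grid_overlay(width, height, step_deg=10, path="grid_overlay.png"):
    img = Image.new("RGBA", (width, height), (0, 0, 0, 0))  # fully transparent
    draw = ImageDraw.Draw(img)
    # longitude lines: -180..+180 mapped across the image width
    for lon in range(-180, 181, step_deg):
        x = round((lon + 180) / 360 * (width - 1))
        draw.line([(x, 0), (x, height - 1)], fill=(255, 255, 255, 128), width=1)
    # latitude lines: +90 (top) .. -90 (bottom) mapped across the height
    for lat in range(-90, 91, step_deg):
        y = round((90 - lat) / 180 * (height - 1))
        draw.line([(0, y), (width - 1, y)], fill=(255, 255, 255, 128), width=1)
    img.save(path)

make_grid_overlay(1920, 960)                                # HD overlay
make_grid_overlay(3840, 1920, path="grid_overlay_4k.png")   # 4K overlay
```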
In Step 108 the user identifies the object of interest 16 in the media content as described above. In the present example the object of interest 16 is an airplane flying over the river Danube. If an artificial intelligence has been used to generate keywords for the video content these keywords can help to identify the possible objects of interest 16. For example if the keyword "airplane" is included in the list of keywords the user need only locate the airplane in the video and add any further description that he or she may find useful, e.g. "flying over the Danube".
Adobe Premiere has a built-in marker function for applying notes to video files. The inventors have realized that Adobe Premiere's marker function is suitable for embedding keywords and information in a video's .xmp metadata fields, and as a result making those keywords and information universally searchable. The inventors have further recognized that the marker function is suitable for carrying out Steps 110, 112, 114, 116 and 118 of the method according to the invention.
The user opens the program panel in the Adobe Premiere project and navigates the cursor to the first reference frame 12a where the object of interest 16 appears first in the video content. The user then adds a comment marker by opening a marker editor panel 20 and selecting the marker type "Comment Marker".
The comment markers are provided for tagging objects within the video, and contain the following fields:
- a "Name" field, for adding a title (e.g. type of comment)
- a "Comment" field for adding keywords,
- "In" field for the time code when the object appears in the video
- "Out" field for the time code when the object disappears in the video. After having selected the option "Comment Marker" in the marker's editor panel 20 the user inputs keywords (e.g. "Airplane flying over the Danube") in the "Comments" field 22 of the editor panel 20 (Step 1 18). See Figs. 6 and 6a.
The user navigates his cursor on the timeline to where the object of interest 16 he is tagging appears in the video stream and copies the time code to an "In" field 26 of a marker data input box 24 (Step 110). See Figs. 7 and 7a.
After this, the user navigates his cursor on the timeline to where the object of interest 16 he is tagging disappears in the video stream and copies the time code to an "Out" field 28 of the marker data input box 24 (Step 112). See Figs. 8 and 8a.
Next, the user determines the spatial coordinates of the object of interest 16. The user goes to the "In" frame determined by the marker, reads the coordinates of a selected point (preferably a center point) of the object of interest 16 (the airplane) from the grid overlay 14 and enters them into the "Name" field 30 of the marker data input box 24 (Step 114) using the following format: I(x45;y37). If necessary the user can zoom into the selected frame 12 to better determine the coordinates. See Figs. 9 and 9a. The user then goes to the "Out" frame determined by the marker, reads the coordinates of preferably the same point or a point in the vicinity of the first selected point of the object of interest 16 (the airplane) from the grid overlay 14 and enters them into the "Name" field 30 of the marker data input box 24 (Step 116) using the following format: O(x32;y54). See Figs. 10 and 10a.
The marker is then saved and ready for applying to the video file 10. It is further possible to add other types of built-in markers, e.g. segmentation markers, which can be used to mark the starting time and ending time of a scene of interest in the immersive media content. The process is similar to what is described above, the difference being that there is no spatial latitude and longitude data, hence the "Name" field of the segmentation marker can be used for other purposes (e.g. it may also be used to enter comments, including keywords).
The generated metadata may also relate to audio information, such as the start and end of certain audio content (e.g. background music or sound). Certain audio information may be associated with a location, e.g. a speech or dialog may be associated with the location of the person(s) speaking within the 360-degree video. In this case the same "Comment" type of markers can be used as described above.
In order to "embed" the markers added to any video using the previous steps, the video has to be exported. Using this process saves all of the markers into the video's .xmp metadata fields. The video resulting from the exporting process will contain all the markers (with the keywords) that the user has entered to the timeline as clip markers.
The markers can also be exported separately into a comma-separated values (CSV) file or other similar file format, which allows the user to work with the markers' content outside Adobe Premiere and separately from the video file 10.
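A sketch of processing such an export outside Adobe Premiere is given below. The exact column names produced by Premiere's marker export are not specified in the description, so "Marker Name", "Description", "In" and "Out" are assumed; the I(x..;y..) / O(x..;y..) notation is the one used in the embodiment above.

```python
# Sketch of post-processing a marker export outside Adobe Premiere.
# The column names are assumptions; the I(x..;y..) / O(x..;y..) coordinate
# notation is the one used in the embodiment described above.

import csv
import re

COORD_RE = re.compile(r"([IO])\(x(-?\d+);y(-?\d+)\)")

def parse_coordinates(name_field):
    """Return {'I': (lon, lat), 'O': (lon, lat)} found in a marker name field."""
    return {
        tag: (float(lon), float(lat))
        for tag, lon, lat in COORD_RE.findall(name_field)
    }

def read_markers(csv_path):
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {
                "keywords": row.get("Description", ""),
                "in": row.get("In", ""),
                "out": row.get("Out", ""),
                "coords": parse_coordinates(row.get("Marker Name", "")),
            }

print(parse_coordinates("I(x45;y37) O(x32;y54)"))
# {'I': (45.0, 37.0), 'O': (32.0, 54.0)}
```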
The XMP sidecar and the relevant marker lines within it can be read if the .xmp metadata is exported from the video using e.g. ExifTool.
ExifTool is free and open-source software for reading, writing and manipulating image, audio and video metadata. ExifTool has a graphical user interface with limited functions, but the available options are sufficient for exporting the XMP metadata from the video file 10.
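One possible way of extracting the embedded XMP packet into a sidecar file with ExifTool's command line is sketched below; the file paths are illustrative.

```python
# One possible way to dump the embedded XMP block from the exported video
# into a sidecar file using ExifTool's command line ("-xmp -b" writes the
# raw XMP packet); paths are illustrative.

import subprocess

def export_xmp_sidecar(video_path, sidecar_path):
    result = subprocess.run(
        ["exiftool", "-xmp", "-b", video_path],
        check=True,
        capture_output=True,
    )
    with open(sidecar_path, "wb") as f:
        f.write(result.stdout)

export_xmp_sidecar("immersive_video.mp4", "immersive_video.xmp")
```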
By exporting the markers into a CSV file, the user has the option to work with these tags outside of an Adobe product and independently of the video file 10 itself.
The end result of the method according to the invention is an immersive (360-degree) video tagged with metadata that allows it to be discovered by using frame- and location-based metadata markers.
It is further possible to export the .xmp metadata in other convenient file formats, e.g. as an .xls file. XLS is a file extension for a spreadsheet file format created by Microsoft for use with Microsoft Excel. The frame sequence data, the latitude and longitude data and the information data are more easily viewed in a spreadsheet format and can be conveniently handled by other programs.
The method according to the present invention can also be applied to other types of media, e.g. image files.
According to another preferred method an immersive media file 10 is provided which electronically codes a single still image. In this case the single still image is regarded as the single frame 12 of the media file 10. The above described steps can be carried out similarly by treating the media file 10 as a one-frame-long video.
Various modifications to the above disclosed embodiments will be apparent to a person skilled in the art without departing from the scope of protection determined by the attached claims.

Claims

1. Method for applying metadata to immersive media files, comprising the following steps:
- providing an immersive media file comprising at least one frame coding immersive media content,
- adding a spatial grid overlay on the at least one frame, said spatial grid overlay defining latitudes and longitudes on a spheroid surface,
- identifying an object of interest in the immersive media content,
- determining frame sequence data of a first reference frame in which said object of interest is present,
- determining latitude and longitude data of a point associated with the object of interest in the spatial grid overlay of the first reference frame,
- providing information data associated with the object of interest
- generating frame and location based metadata comprising the frame sequence data, the latitude and longitude data and the information data, and
- applying the generated metadata to the immersive media file.
2. The method according to claim 1, wherein the immersive media file comprises more than one frame forming a frame sequence and the frame sequence data corresponds to a position of a frame in the frame sequence, and the first reference frame is a frame in which said object of interest is present for the first time in the frame sequence, the method further comprising the steps of:
- determining frame sequence data of a second reference frame in which said object of interest is present for the last time following the first reference frame in the frame sequence,
- determining latitude and longitude data of a point associated with the object of interest in the spatial grid overlay of the second reference frame, and
- generating frame and location based metadata further comprising said frame sequence data of the second reference frame and said latitude and longitude data in the spatial grid overlay of the second reference frame.
3. The method according to claims 1 or 2, further comprising:
- processing the at least one frame by performing image recognition on the at least one frame by suitable artificial intelligence,
- providing keywords for the at least one processed frame via the artificial intelligence,
- performing human review of the keywords,
- discarding the frame based keywords which are deemed inappropriate by the human reviewer for the at least one frame,
- using the keywords that are not discarded for providing information data.
4. The method according to any one of claims 1 to 3, further comprising:
- determining at least one frame sequence data of an intermediate reference frame, different from the first reference frame and the second reference frame, in which said object of interest is present,
- determining latitude and longitude data of a point associated with the object of interest in the spatial grid overlay of the intermediate reference frame, and
- generating frame and location based metadata further comprising said frame sequence data of the intermediate reference frame and said latitude and longitude data in the spatial grid overlay of the intermediate reference frame.
5. The method according to any one of claims 1 to 4, further comprising:
- identifying a scene of interest in the immersive media content,
- determining frame sequence data of a first scene frame where said scene of interest starts,
- determining frame sequence data of a second scene frame where said scene of interest ends,
- providing information data associated with the scene of interest,
- generating frame and location based metadata further comprising said frame sequence data of the first scene frame, and said frame sequence data of the second scene frame, and said information data associated with the scene of interest, and
- applying the generated metadata to the immersive media file.
6. The method according to any one of claims 1 to 5, further comprising applying the generated metadata to the immersive media file by storing the generated metadata in the immersive media file in the form of embedded XMP metadata.
7. The method according to any one of claims 1 to 6, further comprising applying the generated metadata to the immersive media file by storing the generated metadata in an XMP sidecar file for the immersive media file.
8. The method according to any one of claims 1 to 7, wherein the object of interest is audio information and the point associated with the object of interest is the point associated with an object in the media content, which is identified as the source of the audio information.
9. The method according to claim 5, wherein the scene of interest is a scene during which given audio information can be heard when the media file is played.
PCT/IB2017/054839 2017-08-08 2017-08-08 Method for applying metadata to immersive media files WO2019030551A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2017/054839 WO2019030551A1 (en) 2017-08-08 2017-08-08 Method for applying metadata to immersive media files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2017/054839 WO2019030551A1 (en) 2017-08-08 2017-08-08 Method for applying metadata to immersive media files

Publications (1)

Publication Number Publication Date
WO2019030551A1 (en) 2019-02-14

Family

ID=59887323

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2017/054839 WO2019030551A1 (en) 2017-08-08 2017-08-08 Method for applying metadata to immersive media files

Country Status (1)

Country Link
WO (1) WO2019030551A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6711590B1 (en) * 1998-07-10 2004-03-23 Canon Kabushiki Kaisha Linking metadata with a time-sequential digital signal
US20030149983A1 (en) * 2002-02-06 2003-08-07 Markel Steven O. Tracking moving objects on video with interactive access points
US20170084084A1 (en) * 2015-09-22 2017-03-23 Thrillbox, Inc Mapping of user interaction within a virtual reality environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAORAN YI ET AL: "Automatic Generation of MPEG-7 Compliant XML Document for Motion Trajectory Descriptor in Sports Video", MULTIMEDIA TOOLS AND APPLICATIONS, KLUWER ACADEMIC PUBLISHERS, BO, vol. 26, no. 2, 1 June 2005 (2005-06-01), pages 191 - 206, XP019213867, ISSN: 1573-7721, DOI: 10.1007/S11042-005-0450-8 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11954150B2 (en) * 2018-04-20 2024-04-09 Samsung Electronics Co., Ltd. Electronic device and method for controlling the electronic device thereof

Similar Documents

Publication Publication Date Title
JP7123122B2 (en) Navigating Video Scenes Using Cognitive Insights
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
US7908556B2 (en) Method and system for media landmark identification
US9462175B2 (en) Digital annotation-based visual recognition book pronunciation system and related method of operation
JP6013363B2 (en) Computerized method and device for annotating at least one feature of an image of a view
KR101887548B1 (en) Method and apparatus of processing media file for augmented reality services
US7945142B2 (en) Audio/visual editing tool
JP5510167B2 (en) Video search system and computer program therefor
US20110304774A1 (en) Contextual tagging of recorded data
US7921116B2 (en) Highly meaningful multimedia metadata creation and associations
US8966372B2 (en) Systems and methods for performing geotagging during video playback
US20160004911A1 (en) Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics
JP2018197865A (en) Geo-tagging of voice record
CN109063123B (en) Method and system for adding annotations to panoramic video
US8135724B2 (en) Digital media recasting
TW201113825A (en) Video content-aware advertisement placement
WO2016142638A1 (en) Anonymous live image search
US20210117471A1 (en) Method and system for automatically generating a video from an online product representation
KR20160044981A (en) Video processing apparatus and method of operations thereof
KR20090093904A (en) Apparatus and method for scene variation robust multimedia image analysis, and system for multimedia editing based on objects
CN104954640A (en) Camera device, video auto-tagging method and non-transitory computer readable medium thereof
US11126856B2 (en) Contextualized video segment selection for video-filled text
WO2019030551A1 (en) Method for applying metadata to immersive media files
Chen Storyboard-based accurate automatic summary video editing system
KR101947553B1 (en) Apparatus and Method for video edit based on object

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17768230

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17768230

Country of ref document: EP

Kind code of ref document: A1