WO2021102953A1 - Multi-view video processing method and apparatus - Google Patents

Multi-view video processing method and apparatus

Info

Publication number
WO2021102953A1
Authority
WO
WIPO (PCT)
Prior art keywords
media
view
views
basic
track
Prior art date
Application number
PCT/CN2019/122095
Other languages
French (fr)
Inventor
Qiuting LI
Cheng Huang
Original Assignee
ZTE Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corporation
Priority to CN201980101328.3A (patent CN114556962B)
Priority to PCT/CN2019/122095 (patent WO2021102953A1)
Priority to KR1020227020559A (patent KR102647019B1)
Priority to JP2022531493A (patent JP2023504097A)
Priority to EP19954380.2A (patent EP4066513A4)
Publication of WO2021102953A1
Priority to US17/752,828 (patent US20220286659A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/111 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N13/117 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/156 Mixing image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/128 Adjusting depth or disparity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/161 Encoding, multiplexing or demultiplexing different image signal components
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/189 Recording image signals; Reproducing recorded image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/194 Transmission of image signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20 Image signal generators
    • H04N13/275 Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals
    • H04N13/279 Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals the virtual viewpoint locations being selected by the viewers or determined by tracking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/21805 Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442 Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213 Monitoring of end-user related data
    • H04N21/44218 Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/8146 Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
    • H04N21/8153 Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics comprising still images, e.g. texture, background image
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N2013/0074 Stereoscopic image analysis
    • H04N2013/0081 Depth or disparity estimation from stereoscopic image signals

Definitions

  • This patent document relates generally to immersive technologies.
  • immersive technology is revolutionizing the way we work and live.
  • Immersive technology is applied in several areas, including marketing and advertising, healthcare, education, and gaming.
  • the rapid development of new immersive technologies requires effectively processing video and image data.
  • additional limited translational movements (typically, head movements)
  • a method includes requesting, by a user, media files from a server according to the current viewing position, viewing direction and viewport of the user, receiving, by the user, the media files from the server according to the current viewing position, the viewing direction and viewport of the user, extracting patches of one or more atlases, and synthesizing the visual content in the current viewport of the user, and obtaining, by the user, three-dimensional stereoscopic video content in the current viewport of the user.
  • a method of constructing media content includes placing (or extracting) a plurality of media samples into (or from) a media file associated with a plurality of views including one or more basic views and one or more additional views such that each media sample corresponds to one or more of the plurality of views and includes at least one of a texture component or a depth component of the corresponding view, determining a basic view media track corresponding to the one or more basic views and one or more additional view media tracks such that each media track includes one or more indicators for describing information about the corresponding view, and constructing the media content from the plurality of media samples based on the one or more indicators by grouping the plurality of media samples into one or more sample groups each of which is associated with one basic view, or constructing the media content from the plurality of media samples based on the one or more indicators by grouping the multiple media tracks in which the plurality of media samples are placed.
  • a method of constructing media content includes placing (or extracting) a plurality of media samples into (or from) a media file associated with a plurality of views including a plurality of basic views and a plurality of additional views such that each media sample corresponds to one of the plurality of views and includes at least one of a texture component or a depth component associated with the corresponding view, determining a plurality of basic view media tracks corresponding to the plurality of basic views, respectively, and a plurality of additional view media tracks corresponding to the plurality of additional views, respectively, such that each media track includes one or more indicators for describing information about the corresponding view, and constructing the media content from the plurality of media samples based on the one or more indicators by grouping the plurality of media samples into one or more media sample groups each of which is associated with at least one basic view.
  • a method of constructing media content includes placing (or extracting), into (or from) a media file associated with a plurality of views, camera information including camera parameters corresponding to the plurality of views according to a viewing direction, a viewing position, and a viewing window, selecting, based on the camera parameter information, media metadata from the media file, and constructing the media content based on the media metadata.
  • Some embodiments may preferably implement these methods as follows:
  • a base view includes a view for predicting other views.
  • each basic view corresponds to a basic view atlas.
  • the one or more indicators include an indicator for grouping, into a sample group, the plurality of samples corresponding to atlases of two or more basic views.
  • the one or more indicators include an indicator for indicating whether each media track contains the texture component or the depth component or both.
  • the one or more indicators include an atlas attribute indicator to define which part of the texture component and the depth component is contained in the media track.
  • each media track includes one or more views stored as an atlas mode.
  • an atlas contains an aggregation of one or more patches from one or more views.
  • the one or more indicators include an indicator describing the number of patches in each atlas.
  • the one or more indicators include a patch rotation indicator to indicate a rotation angle at which a patch maps from the atlas to the corresponding view.
  • the one or more indicators include a file level grouping indicator to describe a grouping type to group different media tracks containing different views.
  • the file level grouping indicator includes grouping information between a basic view and one or more additional views.
  • the file level grouping indicator includes the number of entities in a group.
  • each media track includes an indicator for identifying each of the plurality of views as a basic view or an additional view.
  • each of the plurality of media tracks is from a patch of the plurality of views, and wherein each view corresponds to one camera.
  • the plurality of views of the user includes at least one basic view and at least one additional view associated with the at least one basic view.
  • the camera information includes a media track group to indicate that the media data in the media track group is to be used to decode images within a certain space range corresponding to the media track group.
  • the above-described methods are embodied in the form of processor-executable code and stored in a computer-readable program medium.
  • a device that is configured or operable to perform the above-described methods is disclosed.
  • FIG. 1 is a diagram showing an example of content collection of 3DOF+ based on some embodiments of the disclosed technology.
  • FIG. 2 is a diagram showing an example of a video data processing system based on some embodiments of the disclosed technology.
  • FIG. 3 is a flowchart showing an example of a video data processing method based on some embodiments of the disclosed technology.
  • FIG. 4A is a diagram showing an example of a basic structure of a media file based on some embodiments of the disclosed technology.
  • FIG. 4B is a diagram showing another example of the basic structure of the media file based on some embodiments of the disclosed technology.
  • FIG. 5 is a diagram showing an example of a multi-layer structure in a media file based on some embodiments of the disclosed technology where only one media track includes one or more basic views.
  • FIG. 6 is a diagram showing an example of a multi-layer track structure in a media file based on some embodiments of the disclosed technology.
  • Immersive media uses audio and video technologies to let users experience a highly realistic virtual space environment through sight and sound.
  • the immersive experience mainly supports panoramic video.
  • the user can watch the 360-degree video by freely rotating the head while wearing a head-mounted display device, that is, a three-degree-of-freedom (3DOF) immersive experience.
  • 3DOF: three-degree-of-freedom
  • 6DOF: (partial) six-degree-of-freedom
  • the user's head can also move within a limited range, viewing more detail, such as occluded visual content.
  • 3DOF plus video can be supported by capturing the original video with multiple cameras arranged in different shapes in a space and then algorithmically synthesizing a three-dimensional panoramic video, which accommodates the limited movement of the user's head in the immersive media.
  • multi-view video typically has a large amount of redundant data, resulting in a waste of storage and computing resources.
  • it is possible to reduce redundancy by removing the visual content that overlaps between multiple views at the same time instant, and to represent the stereoscopic video within the viewport by views with texture and depth components.
  • When the user's terminal device receives the immersive media content, it needs to render the content according to the user's viewing position, the viewing direction, the properties of the acquisition device and other information, especially while the user is moving, so as to ensure the best immersive experience for the user.
  • there is a lack of a corresponding media representation for 3DOF plus media content, which affects the effective rendering and playback of the media content.
  • FIG. 1 is a diagram showing an example of video capture of 3DOF plus based on some embodiments of the disclosed technology.
  • multiple cameras are deployed in different locations in the space.
  • multiple cameras are arranged in parallel in a curved configuration or in a rectangular configuration, and images are simultaneously captured by the multiple cameras.
  • the videos captured by different cameras can have similar content.
  • the contents collected from the multi-view videos are synthesized to produce three-dimensional panoramic videos. According to the similarity and difference of multi-view acquisition content, the user can move within a limited range in the process of viewing immersive media to see more details, such as occluded visual content.
  • FIG. 2 is a diagram showing an example of a video data processing system based on some embodiments of the disclosed technology.
  • the video data processing system includes a media server 10 and a user terminal 20.
  • the media server 10 includes a memory module 101 to store media files and a transceiver module 102 to communicate with the user terminal 20.
  • the memory module 101 is used to store media files.
  • the transceiver module 102 is configured to receive a request message from the user terminal 20 and/or send the stored media files to the user terminal 20. In one example, the transceiver module 102 may receive the request message and transmit the media files through a wireless network and/or a wired network.
  • the user terminal 20 includes a transceiver module 201, a de-encapsulation and decoding module 202, a media processing module 203, and a media playback module 204.
  • the transceiver module 201 is configured to receive the media files from the media server 10 and send requests such as media file requests to the media server 10.
  • the de-encapsulation and decoding module 202 is used to unpack and decode the media files received by the transceiver module 201.
  • the media processing module 203 performs video processing such as video synthesis and rendering on a set of multi-view atlases decoded by the de-encapsulation and decoding module 202 according to the current viewing state of the user (e.g., the position of the user, the viewing direction, the viewport of the user).
  • the media playback module 204 is used to play back visual content corresponding to the user's current viewing state.
  • FIG. 3 is a flowchart showing an example of a video data processing method based on some embodiments of the disclosed technology.
  • the video data processing method includes, at step S301, requesting, by a user, media files from a server according to the current viewing position, viewing direction and viewport of the user.
  • the request for the corresponding media files may include the user’s real-time feedback to watch the view desired by the user.
  • a media processing module implemented in the user terminal may use one or more sensors and associated signal processing devices to convert the user’s real-time feedback into a user’s request.
  • the video data processing method includes, at step S302, receiving, by the user, the media files from the server according to the current viewing position, the viewing direction and the viewport of the user, extracting patches of one or more atlases according to the current viewing position, the viewing direction and the viewport of the user, and synthesizing the visual content in the current viewport of the user.
  • the media processing module may reconstruct a video in the viewport of the user according to the real-time viewing position and direction of the user, and transmit the reconstructed video to a media playback module or a display terminal.
  • the video data processing method includes, at step S303, obtaining, by the user, reconstructed three-dimensional stereoscopic video content according to the current viewing position, viewing direction and viewport of the user.
  • the user uses the display terminal or the media playback module to view the reconstructed video.
  • the video data processing method may be applied to 3DOF plus video data in a file based on the ISO (International Organization for Standardization) basic media file format.
  • the ISO basic media file format may include, among others, a restricted scheme information box, a track reference box, and a track group box, which are defined in ISO/IEC JTC1/SC29/WG11 Moving Picture Experts Group (MPEG) MPEG-4 Part 12, ISO Base Media File Format.
  • the projection and packaging steps and the basic format of the omnidirectional video can be based on the MPEG-I Part 2 OMAF (omnidirectional media format) developed by the Moving Picture Experts Group (MPEG) of ISO/IEC JTC1/SC29/WG11.
  • MPEG: Moving Picture Experts Group
  • the ISO basic file format represented by mp4 files includes a plurality of boxes, each of which has a particular type and length and can be considered as a data object.
  • a box that contains another box is called a container box.
  • a "ftyp" type box is placed at the beginning of the file.
  • the "ftyp" type box is used as a flag for indicating the file format and contains some information about the file.
  • the "ftyp" type box is followed by a "moov" type box, which is a container box with metadata information for the media.
  • the media data for MP4 files is contained in one or more "mdat" type boxes, each of which is a container box.
  • MP4 files do not contain such "mdat" type boxes when the media data refers to other files in its entirety.
  • the structure of the media data is described by metadata.
  • MP4 files may contain a "meta" type box, which is also a container box that describes some generic or additional untimed metadata.
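  • The box layout described above can be walked with a few lines of code. The following is a minimal sketch, assuming the standard ISO base media file format box header (a 32-bit big-endian size followed by a four-character type, with size 1 signalling a 64-bit size and size 0 meaning "to the end of the container"). The names iter_boxes and make_box are hypothetical helpers introduced here for illustration and are not part of the patent text.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(data: bytes, offset: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (box_type, payload_start, payload_end) for each box in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A 64-bit "largesize" field follows the type.
            size = struct.unpack_from(">Q", data, offset + 8)[0]
            header = 16
        elif size == 0:
            # The box extends to the end of the enclosing container.
            size = end - offset
        if size < header or offset + size > end:
            raise ValueError(f"corrupt box at offset {offset}")
        yield box_type.decode("ascii"), offset + header, offset + size
        offset += size


def make_box(box_type: str, payload: bytes = b"") -> bytes:
    """Build a box with a 32-bit size field (hypothetical helper for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload


# Toy file: "ftyp" first, then a "moov" container holding one "trak", then "mdat".
demo_file = (
    make_box("ftyp", b"isom" + bytes(4))
    + make_box("moov", make_box("trak"))
    + make_box("mdat", b"\x00" * 4)
)

top_level = [t for t, _, _ in iter_boxes(demo_file)]
moov_start, moov_end = next(
    (s, e) for t, s, e in iter_boxes(demo_file) if t == "moov"
)
moov_children = [t for t, _, _ in iter_boxes(demo_file, moov_start, moov_end)]
```

  • Because a container box simply holds further boxes in its payload, the same iterator is reused for the "moov" payload to list its children.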
  • a timed metadata track (a mechanism in the ISO Base Media File Format, ISOBMFF) is used to establish timed metadata associated with particular samples.
  • Timed metadata is less coupled with media data and is usually descriptive.
  • FIG. 4A is a diagram showing an example of a basic structure of a media file where the texture component and the depth component of an atlas are placed into the same media track based on some embodiments of the disclosed technology.
  • FIG. 4B is a diagram showing another example of a basic structure of a media file where the texture component and the depth component of an atlas are placed into different media tracks based on some embodiments of the disclosed technology.
  • a plurality of views may be described by using a layered structure that includes a base layer and one or more additional layers.
  • the media file may include one or more media tracks for describing media metadata of the plurality of views where one media track corresponds to one layer.
  • the view of the video may be represented by one or more atlases that are divided into texture components and depth components, and texture components and/or depth components are contained in one or more (e.g., 1 or 2) media tracks corresponding to an atlas.
  • the atlas brings together patches from one or more views, each patch being either a complete view or a trimmed piece of a view. If the content, size or quantity of the patches changes, the information regarding the patches is described using timed metadata.
  • static information related to a plurality of views such as a camera parameter list is described in a box in a media track or a box in a media file.
  • each camera corresponds to one view.
  • the atlas in one media track corresponds to one or more views.
  • the list of camera parameters corresponding to the views is described in the media track.
  • Each patch in the atlas is derived from a view of the video captured by a camera. According to the user's viewing position and viewing direction, the patches in the user's field of view are selected based on the properties of each patch described in the media track, such that the user can see the media content of the current viewport by synthesizing one or more selected patches.
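  • The patch selection step can be illustrated with a small sketch. The text does not state the selection criterion, so the code below assumes one for illustration only: a patch is kept when the forward direction of the camera of its source view lies within a fixed angle of the user's viewing direction. The Patch class, the camera_dirs table, the select_patches function and the 45-degree threshold are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Patch:
    patch_id: int
    view_id: int  # the view (camera) from which the patch was cut


def angle_between(a: Vec3, b: Vec3) -> float:
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def select_patches(patches: List[Patch], camera_dirs: Dict[int, Vec3],
                   viewing_dir: Vec3, max_angle_deg: float = 45.0) -> List[Patch]:
    """Keep patches whose source camera roughly faces the viewing direction.

    Assumed criterion (not from the source): angular distance between the
    camera's forward vector and the user's viewing direction.
    """
    return [
        p for p in patches
        if angle_between(camera_dirs[p.view_id], viewing_dir) <= max_angle_deg
    ]


camera_dirs = {0: (0.0, 0.0, 1.0), 1: (1.0, 0.0, 0.0), 2: (0.0, 0.0, -1.0)}
patches = [Patch(10, 0), Patch(11, 0), Patch(12, 1), Patch(13, 2)]
selected = select_patches(patches, camera_dirs, viewing_dir=(0.2, 0.0, 1.0))
selected_ids = [p.patch_id for p in selected]
```

  • A real player would also take the viewing position and the viewport size into account; the angular test here only stands in for whatever per-patch properties the media track actually carries.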
  • the method for processing video data includes: determining one or more atlases in the one or more media tracks corresponding to the views of the multi-view video in the viewport selected by the user, wherein the one or more atlases include the texture component and the depth component of the one or more views; and generating the multi-view video for the user to view in the viewport based on the atlases and the camera parameters used for shooting the multi-view video.
  • the determining of the one or more atlases in the one or more media tracks corresponding to the views of the multi-view video selected by the user includes determining a basic view of the multi-view video and one or more additional views related to the view of the multi-view video selected by the user; and based on the one or more atlases corresponding to the base view and the one or more additional views of the multi-view video in the media tracks, determining one or more atlases corresponding to the views of the multi-view video selected by the user in the media tracks.
  • FIG. 5 is a diagram showing an example of a multi-layer structure in a media file based on some embodiments of the disclosed technology where only one media track, serving as the base layer, contains one or more basic views.
  • the correlation between different views can be used for view prediction.
  • Views acquired by multiple cameras in the same space may have different degrees of correlation.
  • the view can be used to predict its nearby view.
  • for views with low correlation, however, the contents of the views are independent (or uncorrelated) and cannot be predicted from each other, while inter-frame prediction of basic views is possible.
  • the media file includes media data associated with one or more basic views, and each basic view corresponds to an atlas.
  • the disclosed technology can be implemented in various embodiments to provide a hierarchical media description structure that defines a certain media track as the base layer, used to describe and store the one or more atlases corresponding to one or more basic views.
  • the atlases corresponding to the other related additional views are described and stored in one or more other media tracks such as the media tracks of the additional layer, and the view of user viewport is synthesized from the combination of patches in atlases in the base layer and the additional layers.
  • three-dimensional panoramic video can be synthesized, allowing the user to see partially obscured visual content by changing the position and orientation of the headset.
  • a sample group of type “oinf” indicates the atlas of the basic view in the base layer, and the atlas of the view does not refer to other atlases, but it can be referenced by other atlases in the additional layers.
  • a sample group with group type “oinf” may be referred to by a media track with media track reference type “oref.”
  • version indicates the version of the box;
  • grouping_type indicates the type of the sample grouping and links it to its sample group description entry with the same value for grouping_type;
  • grouping_type_parameter indicates the subtype of the grouping;
  • entry_count indicates the number of entries in the following table;
  • sample_count indicates the number of consecutive samples belonging to the same sample group;
  • group_description_index indicates the index of the sample group entry describing the samples in this group.
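  • The fields listed above match the SampleToGroupBox of the ISO base media file format. The following is a minimal sketch of reading them, assuming the field widths from that standard (all 32-bit, with grouping_type_parameter present only when version is 1), since the syntax table itself is not reproduced in this text. The function names parse_sbgp and group_index_per_sample are hypothetical.

```python
import struct
from typing import List, Optional, Tuple


def parse_sbgp(payload: bytes) -> Tuple[int, str, Optional[int], List[Tuple[int, int]]]:
    """Parse a SampleToGroupBox payload (the bytes after the 8-byte box header)."""
    version = payload[0]  # FullBox header: 1 byte version + 3 bytes flags
    pos = 4
    grouping_type = payload[pos:pos + 4].decode("ascii")
    pos += 4
    grouping_type_parameter = None
    if version == 1:
        (grouping_type_parameter,) = struct.unpack_from(">I", payload, pos)
        pos += 4
    (entry_count,) = struct.unpack_from(">I", payload, pos)
    pos += 4
    entries = []
    for _ in range(entry_count):
        sample_count, group_description_index = struct.unpack_from(">II", payload, pos)
        entries.append((sample_count, group_description_index))
        pos += 8
    return version, grouping_type, grouping_type_parameter, entries


def group_index_per_sample(entries: List[Tuple[int, int]]) -> List[int]:
    """Expand the run-length table into one group_description_index per sample."""
    per_sample: List[int] = []
    for sample_count, group_description_index in entries:
        per_sample.extend([group_description_index] * sample_count)
    return per_sample


# Demo payload: version 0, grouping type "oinf", two runs (3 samples, then 2).
demo_payload = (
    bytes([0, 0, 0, 0])
    + b"oinf"
    + struct.pack(">I", 2)
    + struct.pack(">II", 3, 1)
    + struct.pack(">II", 2, 2)
)
version, grouping_type, grouping_type_parameter, entries = parse_sbgp(demo_payload)
per_sample = group_index_per_sample(entries)
```

  • The run-length form keeps the box small: consecutive samples that share a group are described by a single (sample_count, group_description_index) pair.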
  • these identifiers can be described as follows:
  • group_id indicates the identifier of the group to which the current sample belongs;
  • view_idx indicates the index of the basic view corresponding to the current sample, that is, the view to which the sample corresponds;
  • camera_idx indicates the index of the camera corresponding to the basic view;
  • number_ref_views indicates the number of additional views that reference the set of basic view atlases;
  • additional_view_idx indicates the index of an additional view associated with the basic view in the group;
  • track_id indicates the ID of the media track in which the samples corresponding to the associated additional view are located;
  • texture_or_depth_flag indicates whether the media track containing the samples of the associated additional view includes the texture component, the depth component, or both. When it is equal to 0, the media track includes both; when it is equal to 1, it includes only the texture component;
  • depth_track_id indicates the ID of the media track in which the depth component of the associated additional view is located.
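  • To show how a player might use these identifiers, the sketch below models one “oinf” group entry in memory and collects the media tracks needed for a chosen set of additional views. The field names follow the list above; the class names, the tracks_for_group function, and the assumption that depth_track_id is only consulted when texture_or_depth_flag equals 1 are illustrative and not taken from the source.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class AdditionalViewRef:
    additional_view_idx: int
    track_id: int
    texture_or_depth_flag: int            # 0: texture and depth, 1: texture only
    depth_track_id: Optional[int] = None  # assumed to be used only when the flag is 1


@dataclass
class BasicViewGroupEntry:
    group_id: int
    view_idx: int
    camera_idx: int
    ref_views: List[AdditionalViewRef] = field(default_factory=list)

    @property
    def number_ref_views(self) -> int:
        return len(self.ref_views)


def tracks_for_group(entry: BasicViewGroupEntry, wanted_views: Set[int]) -> Set[int]:
    """Return the IDs of the tracks holding the wanted additional views."""
    tracks: Set[int] = set()
    for ref in entry.ref_views:
        if ref.additional_view_idx not in wanted_views:
            continue
        tracks.add(ref.track_id)
        # A texture-only track needs its separate depth track as well.
        if ref.texture_or_depth_flag == 1 and ref.depth_track_id is not None:
            tracks.add(ref.depth_track_id)
    return tracks


entry = BasicViewGroupEntry(
    group_id=1,
    view_idx=0,
    camera_idx=0,
    ref_views=[
        AdditionalViewRef(additional_view_idx=1, track_id=2, texture_or_depth_flag=0),
        AdditionalViewRef(additional_view_idx=2, track_id=3, texture_or_depth_flag=1,
                          depth_track_id=4),
        AdditionalViewRef(additional_view_idx=3, track_id=5, texture_or_depth_flag=0),
    ],
)
needed_tracks = sorted(tracks_for_group(entry, wanted_views={1, 2}))
```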
  • the samples in the media track are atlases that contain patches from one or more views.
  • each view may correspond to one atlas.
  • one or more views may correspond to one atlas.
  • the atlas includes a texture component and a depth component.
  • the texture component and the depth component of an atlas may be respectively placed in two media tracks.
  • the texture component and the depth component of an atlas may be placed in the same media track.
  • An “AtlasAttributeBox” describes which components of the atlas are contained in the media track.
  • texture_in_track indicates whether the media track contains a texture component;
  • depth_in_track indicates whether the media track contains a depth component;
  • atlas_id indicates the identifier of the atlas to which the texture component and/or depth component in the media track belongs.
  • the two media tracks belong to one media track group. For example, track 1 has the texture component of atlas 1 and track 2 has the depth component of atlas 1, so the components of track 1 and track 2 belong to the same atlas.
  • two media tracks are assigned to one media track group by defining a TrackGroupTypeBox whose track_group_type is “atls”; the media track group is identified by track_group_id, which may be the same as the atlas identification index (atlas_id) of the atlas.
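  • The pairing of texture and depth tracks through a shared atlas identifier can be sketched as follows. This is an in-memory illustration only: the AtlasAttribute class mirrors the three fields of the AtlasAttributeBox described above, while the tracks_by_atlas function and the dictionary layout of its result are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AtlasAttribute:
    track_id: int
    atlas_id: int
    texture_in_track: bool
    depth_in_track: bool


def tracks_by_atlas(attrs: List[AtlasAttribute]) -> Dict[int, Dict[str, List[int]]]:
    """Group track IDs by atlas_id, split into texture and depth carriers."""
    groups: Dict[int, Dict[str, List[int]]] = defaultdict(
        lambda: {"texture": [], "depth": []}
    )
    for attr in attrs:
        if attr.texture_in_track:
            groups[attr.atlas_id]["texture"].append(attr.track_id)
        if attr.depth_in_track:
            groups[attr.atlas_id]["depth"].append(attr.track_id)
    return dict(groups)


attrs = [
    # Atlas 1 is split across two tracks (texture in track 1, depth in track 2).
    AtlasAttribute(track_id=1, atlas_id=1, texture_in_track=True, depth_in_track=False),
    AtlasAttribute(track_id=2, atlas_id=1, texture_in_track=False, depth_in_track=True),
    # Atlas 2 keeps both components in a single track.
    AtlasAttribute(track_id=3, atlas_id=2, texture_in_track=True, depth_in_track=True),
]
atlas_tracks = tracks_by_atlas(attrs)
```

  • When track_group_id is set equal to atlas_id, as the text allows, the same grouping could be obtained from the “atls” track groups instead of from the attribute boxes.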
  • the texture and depth component samples are grouped, and the type “texture and depth group” is defined as the texture and depth sample group that describes whether each sample belongs to the texture component or the depth component.
  • these identifiers can be described as follows:
  • texture_or_depth indicates whether the samples of the group belong to the texture component or the depth component. When texture_or_depth is equal to 0, the sample is grouped into the texture component, and when texture_or_depth is equal to 1, the sample is grouped into the depth component;
  • number_basicview_flag indicates whether the samples in the media track correspond to multiple basic views. When it is equal to 0, the samples correspond to only one basic view and are not grouped by basic view; when it is equal to 1, the samples correspond to multiple basic views;
  • group_id, for the samples of multiple basic views in the media track, indicates the grouping identifier of the sample group based on the basic views.
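  • As an illustration of the texture and depth sample group, the sketch below sorts the samples of one track by component and, when several basic views share the track, by group_id. The TextureDepthGroupEntry class and the classify_samples function are hypothetical, and using key 0 for tracks with a single basic view is an assumption made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextureDepthGroupEntry:
    texture_or_depth: int       # 0: texture component, 1: depth component
    number_basicview_flag: int  # 0: one basic view, 1: multiple basic views
    group_id: int = 0           # only used when number_basicview_flag is 1


def classify_samples(
    entries_per_sample: List[TextureDepthGroupEntry],
) -> Dict[int, Dict[str, List[int]]]:
    """Map group_id -> {"texture": [...], "depth": [...]} of sample indices."""
    result: Dict[int, Dict[str, List[int]]] = defaultdict(
        lambda: {"texture": [], "depth": []}
    )
    for sample_index, entry in enumerate(entries_per_sample):
        # Assumption: samples of a single-basic-view track all go under key 0.
        key = entry.group_id if entry.number_basicview_flag == 1 else 0
        component = "texture" if entry.texture_or_depth == 0 else "depth"
        result[key][component].append(sample_index)
    return dict(result)


samples = [
    TextureDepthGroupEntry(texture_or_depth=0, number_basicview_flag=1, group_id=1),
    TextureDepthGroupEntry(texture_or_depth=1, number_basicview_flag=1, group_id=1),
    TextureDepthGroupEntry(texture_or_depth=0, number_basicview_flag=1, group_id=2),
    TextureDepthGroupEntry(texture_or_depth=1, number_basicview_flag=1, group_id=2),
]
classified = classify_samples(samples)
```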
  • the view identifier box (ViewIdentifierBox) defined in the ISO basic file format is "vwid" and describes the views contained in each media track, including the identity of the view, other reference views corresponding to the view, and so on. It can indicate whether the media track contains a texture component or a depth component, or both.
  • Based on the atlases and the samples in the media tracks, the user selects the atlas in one or more media tracks according to the current viewing position, the viewing direction, and the viewport to synthesize the image that the user desires to view.
  • the patches in the set are from video captured by one or more cameras corresponding to one or more views.
  • the view to which each patch belongs and the position of each patch in the view are described to facilitate selection of the media track and its corresponding atlas.
  • the parameters of the box to the atlas according to the ISO basic file format, optionally in one of the following ways.
  • the patches are collected from different views, and the position in the view is different and variable, and the patch information for all atlases is dynamically described by timed metadata.
  • An atlas information list is identified from the timed metadata track according to the sample entry type, and the atlas information list timed metadata track is used to indicate the location information of each patch in the atlas and in the corresponding view, together with the corresponding view information.
  • the purpose of the atlas information list timed metadata track is indicated by the track sample entry type, which is “atpl, ” as defined below.
  • AtlasParamsListSampleEntry extends MetaDataSampleEntry ('atpl') {
  • number_atlases indicates the number of atlases
  • Each sample in the timed metadata track is corresponding to information regarding all atlases at some point.
  • the following example illustrates the sample format of the atlas information list timed metadata track, which refers to the track sample entry type “atpl” above and indicates the atlas information for each sample.
  • Atlas_id indicates the identifier of the atlas
  • view_id indicates a view identifier, which a patch belongs to
  • group_id indicates the group identifier of the grouping based on a basic view and its associated additional views; when group_id is equal to 0, there is only one basic view, and the views do not need to be grouped;
  • “patch_width_in_view” and “patch_height_in_view” indicate the width and height of the patch in the view, in luma samples, respectively;
  • “patch_pos_in_view_x” and “patch_pos_in_view_y” indicate the X-axis and Y-axis coordinates of the patch in the view, in luma samples, respectively;
  • patch_rotation indicates the rotation angle at which a patch maps from the atlas to the view. When patch_rotation is 0, it indicates there is no rotation, and when patch_rotation is 1, it indicates a clockwise 90-degree rotation, and other values greater than 1 are temporarily reserved.
  • the atlas information describes the atlas place in the other media track
  • the timed metadata track defined above can link to one or more other media tracks by using the TrackReferenceBox with “cdsc” referenceType.
  • the sample table box includes a new defined box (Atlas Parameters Box) to indicate the atlas information of samples in the media track one by one.
  • sample_count indicates the number of consecutive samples over a period of time
  • number_patches indicates the number of patches included in the atlas
  • view_id indicates the view identifier, which this patch belongs to
  • “patch_pos_in_atlas_x” and “patch_pos_in_atlas_y” indicate the X-axis and Y-axis coordinates of the patch in the atlas, respectively;
  • “patch_rotation” indicates the angle of rotation of the patch from the atlas to the view image. When it is equal to 0, it indicates that there is no rotation; when it is equal to 1, it indicates a clockwise rotation of 90 degrees; other values greater than 1 are temporarily reserved
  • FIG. 6 is a diagram showing an example of a multi-layer track structure in a media file based on some embodiments of the disclosed technology, where there are multiple basic views in the media file and each basic view can be used as a base view to predict other views.
  • a media track is defined as a base layer for storage of one basic view, and describes the operation point information of all basic views.
  • the other basic view and the corresponding view of the additional view are stored in one or more other media tracks.
  • a three-dimensional stereoscopic video can be synthesized, so that the user can see the partially occluded visual content through the displacement of the head.
  • the type “oinf” (operating points information) sample group in the media track of the base layer is defined to describe the operating points information in all basic views. All the additional layer media tracks refer to the base layer media track by referring to a media track type indicator “oref. ”
  • a basic view can be distinguished from other basic views and their associated additional views by including an indication at the media track level or at the file level.
  • the distinction between basic and additional views is described at the media track level.
  • In a media track that contains an atlas belonging to a basic view, the media track is described as a media track that contains a basic view by defining a box of type 'invw' in the Media Information Box.
  • basic_view_idx is an identifier of the basic view whose corresponding atlas is in the media track
  • “camera_idx” indicates an identifier of the camera corresponding to the basic view
  • additional_view_idx indicates an identifier of an additional view that refers to the basic view
  • Track_id indicates an identifier of the media track that contains the atlas corresponding to an additional view associated with the basic view
  • Texture_or_depth_flag indicates whether the media track that contains the samples associated with the additional views includes the texture component and/or the depth component; if it is 0, the track includes both, and if it is 1, the track includes only the texture component;
  • depth_track_id indicates the ID of the media track in which the depth component corresponding to the associated additional views is located.
  • a track group box with a track group type (track_group_type) of “cstg” is defined to describe the relationship between the atlases corresponding to basic views and their associated additional views.
  • group descriptions are provided at the file level for the media tracks where the basic view is located, and for the media tracks whose additional views refer to the associated basic views.
  • EntityToGroupBox in the ISO base file format is extended to define a grouping type (grouping_type) of “asvw” (associated view group) to group media tracks containing atlases from associated views.
  • the group members are described to distinguish the media track in which the basic view is located, and it can be said that the track contains texture components and/or depth components of the atlas.
  • group_id indicates the group identifier for the grouping described in each box
  • entity_id indicates the entity identifier of the media track in which the atlas that the view belongs to is placed, and is equal to track_id;
  • “is_basic_view” indicates whether a group member media track in a group includes a basic view. When it is equal to 1, the group member media track contains a basic view; when it is equal to 0, the group member media track does not contain a basic view. Only one group member has a value of 1;
  • is_texture_in_entity indicates whether the media track of a certain group member in the group contains the texture component of the atlas that the view belongs to;
  • is_depth_in_entity indicates whether the media track of a group member in a group contains the depth component of the atlas that the view belongs to.
  • the atlases of the multiple basic views are stored in separate bitstreams.
  • one bitstream contains only one atlas corresponding to one basic view and the atlas corresponding to other views associated with that basic view.
  • a layered structure can still be used.
  • the user after receiving the media stream file and decoding, the user selects the media content corresponding to the basic view and its associated additional view according to the user's viewing direction, viewing position, and viewport.
  • the corresponding camera information are considered in the process to complete the rendering of the selected content.
  • the camera for multiple views is described in detail.
  • the camera parameter information is intrinsic information and position information
  • the data box is defined at the file level for description.
  • the camera parameter list box is identified according to the box type, and the camera parameters corresponding to all the patches in the media file are described, and the box is included in the “meta” type box.
  • view_idx is an identifier of the view captured by the camera, by which the camera corresponding to the view can be found;
  • “camera_pos_x,” “camera_pos_y,” and “camera_pos_z” indicate the position of the camera in the global reference coordinate system
  • “projection_type” indicates the projection method of the camera. When it is equal to 0, it indicates ERP projection; when it is equal to 1, it indicates CMP projection; and when it is equal to 2, it indicates perspective projection. The value ranges from 0 to 255, of which 3 to 255 can be defined in the future as needed;
  • projection_plane_width and “projection_plane_height” indicate the horizontal and vertical resolutions of the camera projection plane, which are indicated by the horizontal and vertical resolutions of the encoded brightness sampling;
  • erp_phi_min and “erp_phi_max” indicate the latitude range (minimum and maximum) of the ERP projection
  • erp_theta_min and “erp_theta_max” indicate the longitude range (minimum and maximum) of the ERP projection
  • “perspective_focal_hor” and “perspective_focal_ver” indicate, in luma sample position units, the horizontal and vertical components, respectively, of the focal of a perspective projection;
  • “perspective_center_hor” and “perspective_center_ver” indicate, in luma sample positions, the horizontal and vertical coordinates, respectively, of the principal point of a perspective projection.
  • Quantization_law indicates the type of depth quantization method of the camera
  • depth_near, depth_far indicates the minimum and maximum normalized disparity values, respectively
  • “cubic_map_type” indicates the type of CMP projection. When it is equal to 0, the indication is full projection mode (6 package faces) ; when it is greater than 0, it is temporarily reserved; and
  • “rotation_yaw,” “rotation_pitch,” and “rotation_roll” specify the yaw angle, pitch angle, and roll angle for the camera to rotate along the X, Y, and Z axes, respectively.
  • each of the media tracks is from a patch of one or more views, and each view corresponds to one camera, and the camera corresponding to the view in each media track is described.
  • the camera parameter list box is identified according to the box type and is contained in the "schi" type box, which is the Scheme Information Box describing the restricted scheme information, where the scheme type supports atlas mode (scheme_type is “altv”).
  • Example of the Camera Parameters Box is described below.
  • the atlas of the basic view are stored in different media tracks, and the atlas of each basic view is used as the base view, and the view of the associated additional view serves as an additional view.
  • the camera corresponding to the basic view and its associated other views can be described using the Atlas Camera Parameter Box (AltalsCameraParametersListBox), which is included in the "schi" type box, that is, the Scheme Information Box.
  • the Scheme Information Box is used to describe restricted information.
  • the media track in which the atlas of other related views is located defines the track with the track reference type “sbas” to associate the media track of the basic view with other media track of related additional views.
  • media tracks associated with basic views and other views related to them may be defined using a track group type “cstg. ”
  • the disclosed technology can be implemented in some embodiments to put the atlas of views into media tracks in a situation where multiple basic views exist.
  • the texture component and depth component are put into one media track, or in another implementation, the texture component and depth component are put into different media tracks.
  • the multi-view video processing method implemented based on some embodiments of the disclosed technology may include defining the structure of media tracks and related metadata and describing the metadata of data and the atlas information of views put in the media tracks.
  • the multi-view video processing method implemented based on some embodiments of the disclosed technology may include capturing video based on multiple views, coding the video, packing the coded video with its metadata into a media file, and sending the media file directly or according to user’s selection. The user can accept the media file and extract the metadata and samples needed, decode pictures in the media file and synthesize the decoded pictures to see the synthesized content in his viewport.
  • a video data processing method may include determining one or more atlases in a media track corresponding to views of a multi-viewpoint video in a viewpoint of the user.
  • the one or more atlases include texture information and depth information of the views.
  • the video data processing method may also include, based on the atlas and the camera parameters corresponding to the multi-view video, generating the multi-viewpoint video in the viewpoint of the user.
  • the determining of the one or more atlases in the media track corresponding to the view of the multi-viewpoint video in the viewpoint of the user includes determining a base view of the multi-viewpoint video and an additional view related to the base view of the multi-viewpoint video selected by the user, and based on the atlas corresponding to the base view and the additional view of the multi-viewpoint video in the media track, determining one or more atlases corresponding to the view of the multi-view video selected by the user in the media track.
  • the determining of the one or more atlases in the media track corresponding to the view of the multi-viewpoint video in the viewpoint of the user further includes identifying an entity group data box in the media track based on a second group type, and the entity group data box indicates the base view of the multi-viewpoint video. In one example, the entity group data box indicates the group identification of the entity group.
  • the determining of the one or more atlases in the media track corresponding to the view of the multi-viewpoint video in the viewpoint of the user further includes identifying a basic view data box in the entry of the video track sample, and the basic view data box indicates the base view identification of the multi-view point video.
  • a sample group data box is in the media track and its corresponding sample group description data box is identified based on a first group type, and the sample group entry in the sample group description data box indicates the multi-viewpoint video base view ID.
  • the sample group entry indicates a grouping identifier of the sample group.
  • the video data processing method further includes indicating the identifier of one or more additional views associated with the base view identifier of the multi-viewpoint video indicated in the sample group description data box or the entity group data box or a basic view data box. In one example, the video data processing method further includes using an identifier of the media track where the atlas corresponding to the view is located.
  • the base view indicated by the basic view data box and the atlas corresponding to the additional view identified by the base view identifier of the multi-viewpoint video indicated by the basic view data box are located.
  • the media track is characterized, according to the type of media track, by identifying the grouping of the media track in which the atlas corresponding to the base view is located and the media track in which the atlas corresponding to the additional view is located.
  • an atlas list timed metadata track is identified according to the first sample entry type, and the atlas list timed metadata track is used to indicate image block information of the atlas in the media track.
  • patch information of the atlas in the media track includes at least one of the following: the identification of the atlas; the number of patches of the atlas; the view identification of the patch source of the atlas; the width and height of the patch of the atlas in the view corresponding to the atlas; the coordinates of the patch in the atlas X-axis and Y-axis; the coordinates of the patch of the atlas in the view corresponding to the atlas of the X-axis and the Y-axis; the patches of the atlas are mapped from the atlas to the corresponding rotation angle of the view.
  • the video data processing method further includes identifying an atlas attribute data box, which indicates whether the atlas contains texture information and/or depth information and indicates the identity of the atlas.
  • a video track belonging to the texture information containing the atlas and a video track containing depth information of the atlas are identified based on a third track group type.
  • the camera parameter may be used to identify a file-level camera parameter list data box and indicate camera parameter information for shooting the multi-viewpoint view.
  • the atlas camera parameter list data box, identified from the restricted scheme information, is used to indicate the camera parameter information of the view corresponding to the atlas in the media track.
  • the camera parameter may include at least one of the following: coordinate information for camera location; coordinate rotation information for camera location; camera mapping format information; camera depth quantification information.
  • a computer-readable medium may include removable and non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVDs), etc. Therefore, the computer-readable media can include non-transitory storage media.
  • program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • Computer- or processor-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
  • a hardware circuit implementation can include discrete analog and/or digital components that are, for example, integrated as part of a printed circuit board.
  • the disclosed components or modules can be implemented as an Application Specific Integrated Circuit (ASIC) and/or as a Field Programmable Gate Array (FPGA) device.
  • DSP digital signal processor
  • the various components or sub-components within each module may be implemented in software, hardware or firmware.
  • the connectivity between the modules and/or components within the modules may be provided using any one of the connectivity methods and media that is known in the art, including, but not limited to, communications over the Internet, wired, or wireless networks using the appropriate protocols.
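
The patch fields listed above (patch_pos_in_atlas_x/y, patch_pos_in_view_x/y, patch_width_in_view, patch_height_in_view, and patch_rotation) can be illustrated with a minimal Python sketch that copies one patch from an atlas back into its view. This is not the document's implementation: the `Patch` class, the `copy_patch_to_view` and `rotate_cw` helpers, and the list-of-rows image layout are hypothetical, and the sketch assumes that a patch rotated by 90 degrees occupies a height-by-width footprint in the atlas.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[int]]  # rows of luma samples (hypothetical in-memory layout)


@dataclass
class Patch:
    view_id: int
    patch_width_in_view: int
    patch_height_in_view: int
    patch_pos_in_view_x: int
    patch_pos_in_view_y: int
    patch_pos_in_atlas_x: int
    patch_pos_in_atlas_y: int
    patch_rotation: int  # 0: no rotation, 1: 90 degrees clockwise, >1: reserved


def rotate_cw(block: Image) -> Image:
    """Rotate a 2-D block of samples 90 degrees clockwise."""
    return [list(row) for row in zip(*block[::-1])]


def copy_patch_to_view(atlas: Image, view: Image, p: Patch) -> None:
    """Copy one patch from the atlas into the view, applying patch_rotation."""
    if p.patch_rotation == 0:
        w, h = p.patch_width_in_view, p.patch_height_in_view
    elif p.patch_rotation == 1:
        # Assumption: the rotated patch is stored height x width in the atlas.
        w, h = p.patch_height_in_view, p.patch_width_in_view
    else:
        raise ValueError("patch_rotation values greater than 1 are reserved")

    x0, y0 = p.patch_pos_in_atlas_x, p.patch_pos_in_atlas_y
    block = [row[x0:x0 + w] for row in atlas[y0:y0 + h]]
    if p.patch_rotation == 1:
        block = rotate_cw(block)

    # Write the (possibly rotated) block at the patch position in the view.
    for dy, row in enumerate(block):
        for dx, value in enumerate(row):
            view[p.patch_pos_in_view_y + dy][p.patch_pos_in_view_x + dx] = value
```

A renderer would call `copy_patch_to_view` once per patch of each selected atlas, for the texture and depth components alike, before view synthesis.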

Abstract

Methods, apparatus, and systems for effectively reducing media content transmissions and efficiently rendering immersive media contents are disclosed. In one example aspect, a method includes requesting, by a user, media files from a server according to the current viewing position and viewing direction of the user, receiving, by the user, the media files from the server according to the current viewing position and the viewing direction of the user, extracting a patch of an atlas, and synthesizing the visual content in the current window area of the user, and obtaining, by the user, three-dimensional stereoscopic video content according to the current viewing position and viewing direction of the user.

Description

MULTI-VIEW VIDEO PROCESSING METHOD AND APPARATUS
TECHNICAL FIELD
This patent document relates generally to immersive technologies.
BACKGROUND
With the recent development of human-computer interaction and artificial intelligence technologies, immersive technology is revolutionizing the way we work and live. Immersive technology is applied in several areas, including marketing and advertising, healthcare, education, and gaming. The rapid development of new immersive technologies requires effectively processing video and image data.
SUMMARY
Disclosed are methods, devices and applications pertaining to representations of media data supporting 3DoF with additional limited translational movements (typically, head movements) along X, Y and Z axes, which effectively reduces the media content transmission and efficiently renders the immersive media content.
In an example embodiment of the disclosed technology, a method includes requesting, by a user, media files from a server according to the current viewing position, viewing direction and viewport of the user, receiving, by the user, the media files from the server according to the current viewing position, the viewing direction and viewport of the user, extracting patches of one or more atlases, and synthesizing the visual content in the current viewport of the user, and obtaining, by the user, three-dimensional stereoscopic video content in the current viewport of the user.
In another example embodiment of the disclosed technology, a method of constructing media content includes placing (or extracting) a plurality of media samples into (or from) a media file associated with a plurality of views including one or more basic views and one or more additional views such that each media sample corresponds to one or more of the plurality of views and includes at least one of a texture component or a depth component of the corresponding view, determining a basic view media track corresponding to the one or more basic views and one or more additional view media tracks such that each media track includes one or more indicators for describing information about the corresponding view, and constructing the media content from the plurality of media samples based on the one or more indicators by grouping the plurality of media samples into one or more sample groups each of which is associated with one basic view, or constructing the media content from the plurality of media samples based on the one or more indicators by grouping the multiple media tracks in which the plurality of media samples are placed.
In yet another example embodiment of the disclosed technology, a method of constructing media content includes placing (or extracting) a plurality of media samples into (or from) a media file associated with a plurality of views including a plurality of basic views and a plurality of additional views such that each media sample corresponds to one of the plurality of views and includes at least one of a texture component or a depth component associated with the corresponding view, determining a plurality of basic view media tracks corresponding to the plurality of basic views, respectively, and a plurality of additional view media tracks corresponding to the plurality of additional views, respectively, such that each media track includes one or more indicators for describing information about the corresponding view, and constructing the media content from the plurality of media samples based on the one or more indicators by grouping the plurality of media samples into one or more media sample groups each of which is associated with at least one basic view.
In yet another example embodiment of the disclosed technology, a method of constructing media content includes placing (or extracting) , into (or from) a media file associated with a plurality of views, camera information including camera parameters corresponding to the plurality of views according to a viewing direction, a viewing position, and a viewing window, selecting, based on the camera parameter information, media metadata from the media file, and constructing the media content based on the media metadata.
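As a rough illustration of selecting content from camera parameters and a viewing position, the following Python sketch ranks views by the distance between each camera and the viewer. The selection criterion is an assumption made for illustration only, since this document does not prescribe one; the field names view_idx and camera_pos_x/y/z follow the camera parameter description elsewhere in this document, while `CameraParams` and `nearest_views` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CameraParams:
    view_idx: int        # identifier of the view captured by this camera
    camera_pos_x: float  # camera position in the global reference coordinate system
    camera_pos_y: float
    camera_pos_z: float


def nearest_views(cameras: List[CameraParams],
                  viewing_position: Tuple[float, float, float],
                  count: int) -> List[int]:
    """Return the view_idx of the `count` cameras closest to the viewer.

    Assumption: Euclidean distance to the viewing position is used as the
    ranking criterion; a real player may also weigh the viewing direction
    and the viewport.
    """
    ranked = sorted(
        cameras,
        key=lambda c: math.dist(
            (c.camera_pos_x, c.camera_pos_y, c.camera_pos_z), viewing_position
        ),
    )
    return [c.view_idx for c in ranked[:count]]
```

The returned view identifiers could then be used to pick the media tracks and atlases to fetch and decode.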
Some embodiments may preferably implement these methods as follows:
The above methods, wherein a base view includes a view for predicting other views.
The above methods, wherein each basic view corresponds to a basic view atlas.
The above methods, wherein a patch is acquired based on the basic view atlas.
The above methods, wherein the media content is constructed based on a combination of one or more basic views and one or more additional views.
The above methods, wherein the one or more indicators include an indicator for grouping, into a sample group, the plurality of samples corresponding to atlases of two or more basic views.
The above methods, wherein the one or more indicators include an indicator for indicating whether each media track contains the texture component or the depth component or both.
The above methods, wherein the one or more indicators include an atlas attribute indicator to define which part of the texture component and the depth component is contained in the media track.
The above methods, wherein the one or more indicators include a view identifier to describe the corresponding view. In an implementation, each media track includes one or more views stored as an atlas mode.
The above methods, wherein the constructing of the media content includes combining patches from different views. In an implementation, an atlas contains an aggregation of one or more patches from one or more views.
The above methods, wherein the one or more indicators include an indicator describing the number of patches in each atlas.
The above methods, wherein the one or more indicators include a patch rotation indicator to indicate a rotation angle at which a patch maps from the atlas to the corresponding view.
The above methods, wherein the one or more indicators include a file level grouping indicator to describe a grouping type to group different media tracks containing different views.
The above methods, wherein the file level grouping indicator includes grouping information between a basic view and one or more additional views.
The above methods, wherein the file level grouping indicator includes the number of entities in a group.
The above methods, wherein each media track includes an indicator for identifying each of the plurality of views as a basic view or an additional view.
The above methods, wherein the camera information is extracted on a media file basis.
The above methods, wherein the camera information is extracted on a media track basis.
The above methods, wherein each of the plurality of media tracks is from a patch of the plurality of views, and wherein each view corresponds to one camera.
The above methods, wherein the plurality of views of the user includes at least one basic view and at least one additional view associated with the at least one basic view.
The above methods, wherein the media data of the at least one basic view is used as a base image of at least one additional view associated with the at least one basic view.
The above methods, wherein the plurality of views of the user includes two or more basic views stored in different media tracks, respectively.
The above methods, wherein the camera information includes a media track group to indicate that the media data in the media track group is to be used to decode images within a certain space range corresponding to the media track group.
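The file-level grouping of a basic view with its additional views can be sketched in Python as follows. The field names group_id, entity_id, is_basic_view, is_texture_in_entity, and is_depth_in_entity follow the entity group description elsewhere in this document, including the rule that only one group member is marked as the basic view; the `GroupEntity` and `AssociatedViewGroup` classes and their methods are hypothetical and do not reproduce the box syntax.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroupEntity:
    entity_id: int              # equal to the track_id of the member media track
    is_basic_view: bool         # True for the single basic-view member
    is_texture_in_entity: bool  # track carries the atlas texture component
    is_depth_in_entity: bool    # track carries the atlas depth component


@dataclass
class AssociatedViewGroup:
    group_id: int
    entities: List[GroupEntity]

    @property
    def num_entities(self) -> int:
        """Number of entities (media tracks) in the group."""
        return len(self.entities)

    def basic_view_track(self) -> int:
        """Return the entity_id of the one member flagged as the basic view."""
        basics = [e.entity_id for e in self.entities if e.is_basic_view]
        if len(basics) != 1:
            raise ValueError("exactly one group member must be a basic view")
        return basics[0]

    def additional_view_tracks(self) -> List[int]:
        """Return the entity_ids of the members that hold additional views."""
        return [e.entity_id for e in self.entities if not e.is_basic_view]
```

A reader of the file could use `basic_view_track()` to locate the track to decode first and `additional_view_tracks()` to find the tracks that depend on it.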
In yet another example embodiment of the disclosed technology, the above-described methods are embodied in the form of processor-executable code and stored in a computer-readable program medium.
In yet another example embodiment of the disclosed technology, a device that is configured or operable to perform the above-described methods is disclosed.
The above and other aspects and their implementations are described in greater detail in the drawings, the descriptions, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram showing an example of content collection of 3DOF+ based on some embodiments of the disclosed technology.
FIG. 2 is a diagram showing an example of a video data processing system based on some embodiments of the disclosed technology.
FIG. 3 is a flowchart showing an example of a video data processing method based on some embodiments of the disclosed technology.
FIG. 4A is a diagram showing an example of a basic structure of a media file based on some embodiments of the disclosed technology.
FIG. 4B is a diagram showing another example of the basic structure of the media file based on some embodiments of the disclosed technology.
FIG. 5 is a diagram showing an example of a multi-layer structure in a media file based on some embodiments of the disclosed technology where only one media track includes one or more basic views.
FIG. 6 is a diagram showing an example of a multi-layer track structure in a media file based on some embodiments of the disclosed technology.
DETAILED DESCRIPTION
Immersive media enables users to experience a highly realistic virtual space environment through audio and video technologies. At present, the immersive experience mainly supports panoramic video. For example, by freely rotating the head while wearing a head-mounted display device, the user can watch 360-degree video, that is, the three-degree-of-freedom (3DOF) immersive experience. For videos that support enhanced three-degree-of-freedom (3DOF plus) and partial six-degree-of-freedom (6DOF), the user's head can also move within a limited range to view more detail, such as occluded visual content.
3DOF plus video can be supported by capturing the original video with multiple cameras arranged in different shapes in a space and then algorithmically synthesizing a three-dimensional panoramic video, which accommodates limited movement of the user's head within the immersive media. Because of the multi-camera deployment, multi-view video typically has a large amount of redundant data, resulting in a waste of storage and computing resources. Currently, in the content production phase, it is possible to reduce redundancy by removing the visual content of overlapping parts of multiple views at the same time and to represent stereo video within the viewport by views with texture and depth components. When the user's terminal device receives the immersive media content, it needs to render the content according to the user's viewing position, the viewing direction, the acquisition device's properties, and other information, especially when rendering during movement, so as to ensure the best immersive experience for the user. At present, there is a lack of a corresponding media representation for 3DOF plus media content, which affects the effective rendering and playback of the media content.
FIG. 1 is a diagram showing an example of video capture of 3DOF plus based on some embodiments of the disclosed technology. In order to support the 3DOF plus scenario, multiple cameras are deployed in different locations in the space. For example, multiple cameras are arranged in parallel in a curved configuration or in a rectangular configuration, and images are simultaneously captured by the multiple cameras. Depending on the location and orientation of the camera deployment, the videos captured by different cameras can have similar content.
The contents collected from the multi-view videos are synthesized to produce three-dimensional panoramic videos. According to the similarity and difference of multi-view acquisition content, the user can move within a limited range in the process of viewing immersive media to see more details, such as occluded visual content.
FIG. 2 is a diagram showing an example of a video data processing system based on some embodiments of the disclosed technology. In some embodiments of the disclosed technology, the video data processing system includes a media server 10 and a user terminal 20.
In some implementations, the media server 10 includes a memory module 101 to store media files and a transceiver module 102 to communicate with the user terminal 20. In some implementations, the transceiver module 102 is configured to receive a request message from the user terminal 20 and/or send the stored media files to the user terminal 20. In one example, the transceiver module 102 may receive the request message and transmit the media files through a wireless network and/or a wired network.
In some implementations, the user terminal 20 includes a transceiver module 201, a de-encapsulation and decoding module 202, a media processing module 203, and a media playback module 204. In some implementations, the transceiver module 201 is configured to receive the media files from the media server 10 and send requests such as media file requests to the media server 10. The de-encapsulation and decoding module 202 is used to unpack and decode the media files received by the transceiver module 201. In some implementations, the media processing module 203 performs video processing such as video synthesis and rendering on a set of multi-view atlases decoded by the de-encapsulation and decoding module 202 according to the current viewing state of the user (e.g., the position of the user, the viewing direction, the viewport of the user). The media playback module 204 is used to play back visual content corresponding to the user's current viewing state.
FIG. 3 is a flowchart showing an example of a video data processing method based on some embodiments of the disclosed technology.
In some embodiments of the disclosed technology, the video data processing method includes, at step S301, requesting, by a user, media files from a server according to the current viewing position, viewing direction and viewport of the user. In one example, the request for the corresponding media files may include the user’s real-time feedback to watch the view desired by the user. In one example, a media processing module implemented in the user terminal may use one or more sensors and associated signal processing devices to convert the user’s real-time feedback into a user’s request.
In some embodiments of the disclosed technology, the video data processing method includes, at step S302, receiving, by the user, the media files from the server according to the current viewing position, the viewing direction and the viewport of the user, extracting patches of one or more atlases according to the current viewing position, the viewing direction and the viewport of the user, and synthesizing the visual content in the current viewport of the user. In one example, the media processing module may reconstruct a video in the viewport of the user according to the real-time viewing position and direction of the user, and transmit the reconstructed video to a media playback module or a display terminal.
In some embodiments of the disclosed technology, the video data processing method includes, at step S303, obtaining, by the user, reconstructed three-dimensional stereoscopic video content according to the current viewing position, viewing direction and viewport of the user. In one example, the user uses the display terminal or the media playback module to view the reconstructed video.
In some embodiments of the disclosed technology, the video data processing method may be applied to 3DOF plus video data in a file based on the ISO (International Organization for Standardization) basic media file format. The ISO basic media file format may include, among others, a restricted scheme information box, a track reference box, and a track group box, which are defined in the ISO/IEC JTC1/SC29/WG11 Moving Picture Experts Group (MPEG) MPEG-4 Part 12 ISO Base Media File Format. The projection, packaging steps and basic format of the omnidirectional video can be operated based on the MPEG-I Part 2 OMAF (omnidirectional media format) developed by the Moving Picture Experts Group (MPEG) of ISO/IEC JTC1/SC29/WG11.
In some embodiments of the disclosed technology, all the data in the ISO basic file format is contained in boxes (Box). In one example, the ISO basic file format represented by mp4 files includes a plurality of boxes, each of which has a particular type and length and can be considered as a data object. A box that contains other boxes is called a container box. In one example, a "ftyp" type box is placed at the beginning of the file. In one example, the "ftyp" type box is used as a flag for indicating the file format and contains some information about the file.
In some embodiments of the disclosed technology, the "ftyp" type box is followed by a "moov" type box, which is a container box with metadata information for the media. In an implementation, the media data for MP4 files is contained in one or more "mdat" type boxes, each of which is a container box. In another implementation, MP4 files do not contain such "mdat" type boxes when the media data refers to other files in its entirety. In some embodiments of the disclosed technology, the structure of the media data is described by metadata. In some implementations, for further metadata descriptions with respect to the media, MP4 files may contain a "meta" type box, which is also a container box that describes some generic or additional untimed metadata.
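As a non-limiting illustration of the box structure described above, the following Python sketch walks the boxes in a byte buffer. It assumes the conventional ISO base media file format box header (a 32-bit big-endian size followed by a four-character type, with a size of 1 signalling a 64-bit size and a size of 0 signalling that the box runs to the end of its container). The names iter_boxes and make_box are hypothetical helpers introduced only for this example.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(data: bytes, start: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (box_type, payload_start, payload_end) for each box in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A 64-bit "largesize" field follows the type field.
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:
            # The box extends to the end of the enclosing container.
            size = end - pos
        if size < header or pos + size > end:
            raise ValueError(f"malformed box at offset {pos}")
        yield box_type.decode("latin-1"), pos + header, pos + size
        pos += size


def make_box(box_type: str, payload: bytes = b"") -> bytes:
    """Build a box with a 32-bit size field (hypothetical test helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("latin-1")) + payload


# A tiny synthetic file: "ftyp", then a "moov" container holding one "trak", then "mdat".
sample_file = (make_box("ftyp", b"isom\x00\x00\x00\x00")
               + make_box("moov", make_box("trak"))
               + make_box("mdat", b"\x00" * 4))
top_level = [box_type for box_type, _, _ in iter_boxes(sample_file)]
moov_start, moov_end = next((s, e) for t, s, e in iter_boxes(sample_file) if t == "moov")
children = [box_type for box_type, _, _ in iter_boxes(sample_file, moov_start, moov_end)]
```

Because a container box is simply a box whose payload is itself a sequence of boxes, the same function can be reapplied to the payload range of "moov" to list its children.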
In some embodiments of the disclosed technology, a timed metadata track (e.g., a mechanism in the ISO Base Media File Format, ISOBMFF) is used to establish timed metadata associated with particular samples. Timed metadata is less coupled with media data and is usually descriptive.
FIG. 4A is a diagram showing an example of a basic structure of a media file where texture component and depth component of an atlas are placed into the same media track based on some embodiments of the disclosed technology.
FIG. 4B is a diagram showing another example of a basic structure of a media file where texture component and depth component of an atlas are placed into the different media tracks based on some embodiments of the disclosed technology.
In some embodiments of the disclosed technology, a plurality of views may be described by using a layered structure that includes a base layer and one or more additional layers. The media file may include one or more media tracks for describing media metadata of the plurality of views where one media track corresponds to one layer.
As shown in FIG. 4B, the view of the video may be represented by one or more atlases that are divided into texture components and depth components, and the texture components and/or depth components are contained in one or more (e.g., 1 or 2) media tracks corresponding to an atlas. The atlas brings together patches from one or more views, where each patch is either a complete view or a trimmed portion of a view. If the content, size, or quantity of the patches changes, the information regarding the patches is described using timed metadata.
In an implementation, static information related to a plurality of views, such as a camera parameter list, is described in a box in a media track or a box in a media file.
In another implementation, a box is defined in the “meta” type box to describe each camera one by one, where each camera corresponds to one view.
In another implementation, in the case where the atlas in one media track corresponds to one or more views, the list of camera parameters corresponding to those views is described in the media track.
Each patch in the atlas is derived from a view of the video captured by the camera. According to the user's viewing position and viewing direction, the patch in the field of view of user is selected based on the properties of each patch described in the media track, such that the user can see the media content of current viewport by synthesizing one or more selected patches.
In some embodiments of the disclosed technology, the method for processing video data includes: determining one or more atlases in the one or more media tracks corresponding to the views of the multi-view video in the viewport selected by the user, wherein the one or more atlases include the texture component and the depth component of the one or more views; and generating the multi-view video in the viewport for the user to view, based on the atlases and the parameters of the cameras used for shooting the multi-view video.
In some embodiments of the disclosed technology, the determining of the one or more atlases in the one or more media tracks corresponding to the views of the multi-view video selected by the user includes determining a basic view of the multi-view video and one or more additional views related to the view of the multi-view video selected by the user; and based on the one or more atlases corresponding to the basic view and the one or more additional views of the multi-view video in the media tracks, determining one or more atlases corresponding to the views of the multi-view video selected by the user in the media tracks.
FIG. 5 is a diagram showing an example of a multi-layer structure in a media file based on some embodiments of the disclosed technology where only one media track, serving as the base layer, contains one or more basic views.
In some embodiments of the disclosed technology, the correlation between different views can be used for view prediction. Views acquired by multiple cameras in the same space may have different degrees of correlation. For highly correlated views, a view can be used to predict its nearby views. For views with low correlation, however, the contents of the views are independent (or uncorrelated) and cannot be predicted from each other, while inter-frame prediction of basic views is possible.
In some embodiments of the disclosed technology, the media file includes media data associated with one or more basic views, and each basic view corresponds to an atlas. In that regard, the disclosed technology can be implemented in various embodiments to provide a hierarchical media description structure that defines a certain media track as the base layer, used to describe and store the one or more atlases corresponding to one or more basic views. The atlases corresponding to the other related additional views are described and stored in one or more other media tracks such as the media tracks of the additional layer, and the view of user viewport is synthesized from the combination of patches in atlases in the base layer and the additional layers.
Through the multi-layer atlas, three-dimensional panoramic video can be synthesized, allowing the user to see partially obscured visual content by changing the position and orientation of the headset.
In the media track of the base layer, a sample group of type “oinf” (operating points information) indicates the atlas of the basic view in the base layer, and the atlas of the view does not refer to other atlases, but it can be referenced by other atlases in the additional layers. In one or more media tracks corresponding to the additional layer including the atlases of additional views, a sample group with group type “oinf” may be referred to by a media track with media track reference type “oref.”
When there are multiple basic views in the media track of the base layer, the samples are grouped based on the different basic views, and the sample group type is defined as “invw” to indicate that the grouped samples belong to a basic view.
Figure PCTCN2019122095-appb-000001
Figure PCTCN2019122095-appb-000002
The specific syntax is as follows:
“version” indicates the version of the box;
“grouping_type” indicates the type of the sample grouping and links it to its sample group description entry with the same value for grouping_type;
“grouping_type_parameter” indicates the subtype of the grouping;
“entry_count” indicates the number of entries in the following table;
“sample_count” indicates the number of consecutive samples belonging to the same sample group; and
“group_description_index” indicates the index of the sample group entry describing the samples in this group.
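As a non-limiting illustration, the (sample_count, group_description_index) pairs described above form a run-length table that can be expanded into one group index per sample. The following Python sketch assumes the usual ISO base media file format convention that an index of 0 means the sample belongs to no group of this type; the function name expand_sample_to_group is hypothetical.

```python
from typing import List, Tuple


def expand_sample_to_group(entries: List[Tuple[int, int]],
                           total_samples: int) -> List[int]:
    """Expand run-length (sample_count, group_description_index) entries.

    Returns one group_description_index per sample. Samples not covered
    by the table receive 0, which is taken here to mean "no group".
    """
    indices: List[int] = []
    for sample_count, group_description_index in entries:
        indices.extend([group_description_index] * sample_count)
    if len(indices) > total_samples:
        raise ValueError("table describes more samples than the track contains")
    indices.extend([0] * (total_samples - len(indices)))
    return indices


# Two samples mapped to entry 1 (e.g., one basic view), three to entry 2, one ungrouped.
per_sample = expand_sample_to_group([(2, 1), (3, 2)], total_samples=6)
```

A reader can then look up the sample group entry for any sample by indexing the sample group description with the returned value.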
For grouping samples based on basic views, a sample group entry of type “bsvw” (basic view) is defined, which describes the identifier of the basic view to which the samples in the group correspond, the identifier of the camera corresponding to the basic view, and the other additional views associated with the basic view.
As an example, these identifiers can be described as follows:
Figure PCTCN2019122095-appb-000003
Figure PCTCN2019122095-appb-000004
The syntax is defined as follows:
“group_id” indicates the group identifier to which the current sample belongs;
“view_idx” indicates the index identifier of the basic view to which the current sample corresponds;
“camera_idx” indicates the index identifier of the camera corresponding to the basic view;
“num_ref_views” indicates the number of other additional views that reference the set of basic view atlases;
“additional_view_idx” indicates the index identifier of an additional view associated with the basic view in the group;
“track_id” indicates the ID of the media track in which the samples corresponding to the associated additional view are located;
“texture_or_depth_flag” indicates whether the media track containing the samples corresponding to the associated additional view includes the texture component and/or the depth component. When it is equal to 0, the media track includes both components; when it is equal to 1, the media track includes only the texture component; and
“depth_track_id” indicates the ID of the media track in which the depth component corresponding to the associated additional view is located.
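As a non-limiting illustration of how a player might use these fields, the following Python sketch models a basic view sample group entry in memory and collects the media tracks needed to fetch the additional views associated with a basic view. The class and function names are hypothetical, and the in-memory layout is an assumption rather than the binary syntax of the entry.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class AdditionalViewRef:
    additional_view_idx: int
    track_id: int
    texture_or_depth_flag: int  # 0: track holds texture and depth; 1: texture only
    depth_track_id: Optional[int] = None  # only meaningful when the flag is 1


@dataclass
class BasicViewGroupEntry:
    group_id: int
    view_idx: int
    camera_idx: int
    additional_views: List[AdditionalViewRef] = field(default_factory=list)

    @property
    def num_ref_views(self) -> int:
        return len(self.additional_views)


def tracks_for_additional_views(entry: BasicViewGroupEntry) -> Set[int]:
    """Collect the IDs of all tracks carrying the associated additional views."""
    tracks: Set[int] = set()
    for ref in entry.additional_views:
        tracks.add(ref.track_id)
        if ref.texture_or_depth_flag == 1:
            # Texture and depth are split, so the depth track is needed too.
            if ref.depth_track_id is None:
                raise ValueError("depth_track_id missing for texture-only track")
            tracks.add(ref.depth_track_id)
    return tracks


entry = BasicViewGroupEntry(
    group_id=1, view_idx=0, camera_idx=0,
    additional_views=[
        AdditionalViewRef(additional_view_idx=1, track_id=2, texture_or_depth_flag=0),
        AdditionalViewRef(additional_view_idx=2, track_id=3, texture_or_depth_flag=1,
                          depth_track_id=4),
    ],
)
needed_tracks = tracks_for_additional_views(entry)
```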
In some embodiments of the disclosed technology, the samples in the media track are atlases that contain patches from one or more views. In one example, each view may correspond to one atlas. In another example, one or more views may correspond to one atlas. The atlas includes a texture component and a depth component. In an example hierarchical structure, the texture component and the depth component of an atlas may be respectively placed in two media tracks. In another example hierarchical structure, the texture component and the depth component of an atlas may be placed in the same media track. An “AtlasAttributeBox” describes which components of the atlas are contained in the media track.
Box type: 'atat'
Container: Scheme Information Box ('schi')
Mandatory (per an item): Yes, when scheme_type is equal to 'altv'
Quantity (per an item): one
aligned(8) class AtlasAttributeBox() extends FullBox('atat', 0, 0) {
    unsigned int(8) texture_in_track;
    unsigned int(8) depth_in_track;
    unsigned int(8) atlas_id;
}
The syntax is defined as follows:
“texture_in_track” indicates whether the media track contains a texture component;
“depth_in_track” indicates whether the media track contains a depth component; and
“atlas_id” indicates the identifier of the atlas which the texture component and/or depth component belongs to in the media track.
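As a non-limiting illustration, the payload of the AtlasAttributeBox defined above can be read as follows. The Python sketch assumes the payload starts with the standard FullBox fields (an 8-bit version followed by 24-bit flags) and that any non-zero value of texture_in_track or depth_in_track means the component is present; the names AtlasAttributes and parse_atlas_attribute_payload are hypothetical.

```python
import struct
from typing import NamedTuple


class AtlasAttributes(NamedTuple):
    version: int
    flags: int
    texture_in_track: bool
    depth_in_track: bool
    atlas_id: int


def parse_atlas_attribute_payload(payload: bytes) -> AtlasAttributes:
    """Parse the bytes that follow the size/type header of an 'atat' box."""
    if len(payload) < 7:
        raise ValueError("'atat' payload must be at least 7 bytes")
    # FullBox header: version (8 bits) + flags (24 bits), then three 8-bit fields.
    version_and_flags, texture, depth, atlas_id = struct.unpack_from(">IBBB", payload, 0)
    return AtlasAttributes(
        version=version_and_flags >> 24,
        flags=version_and_flags & 0xFFFFFF,
        texture_in_track=bool(texture),  # assumption: non-zero means present
        depth_in_track=bool(depth),
        atlas_id=atlas_id,
    )


# Version 0, flags 0, texture present, depth absent, atlas 3.
attrs = parse_atlas_attribute_payload(bytes([0, 0, 0, 0, 1, 0, 3]))
```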
In an embodiment of the disclosed technology, when the texture component and the depth component of the atlas are placed in two media tracks, the two media tracks belong to one media track group. For example, track 1 has the texture component of atlas 1 and track 2 has the depth component of atlas 1, where the components in track 1 and track 2 belong to the same atlas. In one example, the two media tracks are assigned to one media track group by defining a TrackGroupTypeBox whose track_group_type is “atls,” and the media track group is identified by “track_group_id,” which may be the same as the atlas identification index (atlas_id) of the atlas.
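As a non-limiting illustration, a reader could pair the texture and depth tracks of each atlas by their shared track_group_id as sketched below in Python. The tuple layout and the function name pair_atlas_tracks are assumptions made for this example only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical per-track summary:
# (track_id, track_group_id, texture_in_track, depth_in_track)
TrackInfo = Tuple[int, int, bool, bool]


def pair_atlas_tracks(tracks: Iterable[TrackInfo]) -> Dict[int, Dict[str, int]]:
    """Map each track_group_id to the tracks holding its texture and depth."""
    groups: Dict[int, Dict[str, int]] = defaultdict(dict)
    for track_id, track_group_id, has_texture, has_depth in tracks:
        if has_texture:
            groups[track_group_id]["texture"] = track_id
        if has_depth:
            groups[track_group_id]["depth"] = track_id
    return dict(groups)


pairs = pair_atlas_tracks([
    (1, 1, True, False),   # track 1: texture of atlas 1
    (2, 1, False, True),   # track 2: depth of atlas 1
    (3, 2, True, True),    # track 3: texture and depth of atlas 2
])
```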
In another embodiment of the disclosed technology, when both the texture and depth components are placed in one media track, the texture and depth component samples are grouped, and a sample group type, the “texture and depth group,” is defined to describe whether each sample belongs to the texture component or the depth component.
As an example, these identifiers can be described as follows:
Figure PCTCN2019122095-appb-000005
The specific syntax is defined as follows:
“texture_or_depth” indicates whether the samples of the group belong to the texture component or the depth component. When texture_or_depth is equal to 0, the samples are grouped into the texture component, and when texture_or_depth is equal to 1, the samples are grouped into the depth component;
“num_basicview_flag” indicates whether the samples in the media track correspond to multiple basic views. When it is equal to 0, the samples correspond to only one basic view and are not grouped by basic view; when it is equal to 1, the samples correspond to multiple basic views; and
“group_id,” for the samples of multiple basic views in the media track, indicates the grouping identifier of the sample group based on the basic views.
In some embodiments of the disclosed technology, the view identifier box (ViewIdentifierBox) defined in the ISO basic file format, of type "vwid", describes the views contained in each media track, including the identity of the view, other reference views corresponding to the view, and so on. It can indicate whether the media track contains a texture component or a depth component, or both.
Based on the atlases, i.e., the samples in the media tracks, the user selects the atlases in one or more media tracks according to the current viewing position, the viewing direction, and the viewport to synthesize the image that the user desires to view. The patches in an atlas are from video captured by one or more cameras corresponding to one or more views. In some implementations, the view to which each patch belongs and the position of each patch in the view are described to facilitate selection of the media track and its corresponding atlas. The parameters of the atlas are defined in a box according to the ISO basic file format, optionally in one of the following ways.
In some embodiments of the disclosed technology, in each frame sample, the patches are collected from different views and their positions in the views are different and variable, so the patch information for all atlases is dynamically described by timed metadata.
An atlas information list is identified from the timed metadata track according to the sample entry type, and the atlas information list timed metadata track is used to indicate the location information of each patch in the atlas and in the corresponding view, as well as the corresponding view information.
Specifically, the purpose of the atlas information list timed metadata track is indicated by the track sample entry type, which is “atpl,” as defined below.
AtlasParamsListSampleEntry extends MetaDataSampleEntry('atpl') {
    unsigned int(8) num_atlases;
    unsigned int(8) num_views;
    bit(7) reserved;
}
The specific syntax is defined as follows:
“num_atlases” indicates the number of atlases; and
“num_views” indicates the number of views.
Each sample in the timed metadata track corresponds to the information regarding all atlases at a given point in time. The following example illustrates a sample format of the atlas information, which refers to the track sample entry type “atpl” above and indicates the atlas information for each sample.
Figure PCTCN2019122095-appb-000006
Figure PCTCN2019122095-appb-000007
The specific syntax is defined as follows:
“atlas_id” indicates the identifier of the atlas;
“num_patches” indicates the number of patches contained in the atlas;
“view_id” indicates a view identifier, which a patch belongs to;
“group_id” indicates the group identifier of the grouping based on a basic view and its associated additional views. When group_id is equal to 0, there is only one basic view, and the views do not need to be grouped;
“patch_width_in_view” and “patch_height_in_view” indicate the width and height of the patch in the view, respectively, in units of luma samples;
“patch_pos_in_atlas_x” and “patch_pos_in_atlas_y” indicate the X-axis and Y-axis coordinates of the patch in the atlas, respectively;
“patch_pos_in_view_x” and “patch_pos_in_view_y” indicate the X-axis and Y-axis coordinates of the patch in the view, respectively, in units of luma samples; and
“patch_rotation” indicates the rotation angle at which a patch maps from the atlas to the view. When patch_rotation is 0, it indicates there is no rotation, and when patch_rotation is 1, it indicates a clockwise 90-degree rotation, and other values greater than 1 are temporarily reserved.
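As a non-limiting illustration of how these patch parameters may be used during synthesis, the following Python sketch maps a luma sample position in the atlas back to its position in the source view. It assumes that image coordinates grow rightward and downward, that patch_width_in_view and patch_height_in_view give the unrotated patch size in the view, and that a rotated patch occupies a height-by-width footprint in the atlas. The Patch class and the atlas_to_view function are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Patch:
    view_id: int
    patch_width_in_view: int
    patch_height_in_view: int
    patch_pos_in_atlas_x: int
    patch_pos_in_atlas_y: int
    patch_pos_in_view_x: int
    patch_pos_in_view_y: int
    patch_rotation: int = 0  # 0: none, 1: 90 degrees clockwise (atlas -> view)


def atlas_to_view(patch: Patch, atlas_x: int, atlas_y: int) -> Tuple[int, int]:
    """Map an atlas sample position inside the patch to view coordinates."""
    u = atlas_x - patch.patch_pos_in_atlas_x  # column inside the atlas footprint
    v = atlas_y - patch.patch_pos_in_atlas_y  # row inside the atlas footprint
    w, h = patch.patch_width_in_view, patch.patch_height_in_view
    if patch.patch_rotation == 0:
        if not (0 <= u < w and 0 <= v < h):
            raise ValueError("position lies outside the patch")
        dx, dy = u, v
    elif patch.patch_rotation == 1:
        # Assumed footprint in the atlas: h columns by w rows.
        if not (0 <= u < h and 0 <= v < w):
            raise ValueError("position lies outside the patch")
        # A 90-degree clockwise turn moves the top-left sample to the top-right.
        dx, dy = w - 1 - v, u
    else:
        raise ValueError("patch_rotation values greater than 1 are reserved")
    return patch.patch_pos_in_view_x + dx, patch.patch_pos_in_view_y + dy


plain = Patch(0, 4, 2, 10, 20, 100, 200, patch_rotation=0)
rotated = Patch(0, 4, 2, 10, 20, 100, 200, patch_rotation=1)
```

With these example values, the top-left atlas sample of the rotated patch lands on the top-right sample of the patch in the view, which is what a clockwise quarter turn implies under the stated assumptions.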
In some embodiments of the disclosed technology, when the atlas information describes atlases placed in other media tracks, the timed metadata track defined above can be linked to the one or more other media tracks by using the TrackReferenceBox with the “cdsc” reference type.
In some embodiments of the disclosed technology, the sample table box (SampleTableBox) includes a newly defined box (Atlas Parameters Box) to indicate the atlas information of the samples in the media track one by one.
Figure PCTCN2019122095-appb-000008
The specific syntax is defined as follows:
“sample_count” indicates the number of consecutive samples over a period of time;
“num_patches” indicates the number of patches included in the atlas;
“view_id” indicates the view identifier, which this patch belongs to;
“patch_width_in_view” and “patch_height_in_view” indicate the width and height of the patch in the view, respectively, in units of luma samples;
“patch_pos_in_atlas_x” and “patch_pos_in_atlas_y” indicate the X-axis and Y-axis coordinates of the patch in the atlas, respectively;
“patch_pos_in_view_x” and “patch_pos_in_view_y” indicate the X-axis and Y-axis coordinates of the patch in the view, respectively, in units of luma samples; and
“patch_rotation” indicates the angle of rotation of the patch from the atlas to the view image. When it is equal to 0, it indicates that there is no rotation; when it is equal to 1, it indicates a clockwise rotation of 90 degrees; other values greater than 1 are temporarily reserved.
FIG. 6 is a diagram showing an example of a multi-layer track structure in a media file based on some embodiments of the disclosed technology where there are multiple basic views in the media file, and each basic view can be used as a base view to predict other views. Based on the layered track description structure, a media track is defined as a base layer that stores one basic view and describes the operating point information of all basic views. The other basic views and their corresponding additional views are stored in one or more other media tracks. Through the multiple views, a three-dimensional stereoscopic video can be synthesized, so that the user can see the partially occluded visual content through the displacement of the head.
The type “oinf” (operating points information) sample group in the media track of the base layer is defined to describe the operating points information of all basic views. All the additional layer media tracks refer to the base layer media track by using the track reference type “oref.”
A basic view can be distinguished from other basic views and their associated additional views by including an indication at the media track level or at the file level.
In some embodiments of the disclosed technology, the distinction between basic and additional views is described at the media track level. A media track that contains an atlas belonging to a basic view is described as containing a basic view by defining a box of type 'invw' in the Media Information Box.
Figure PCTCN2019122095-appb-000009
Figure PCTCN2019122095-appb-000010
The specific syntax is defined as follows:
“basic_view_idx” indicates an identifier of the basic view whose corresponding atlas is in the media track;
“camera_idx” indicates an identifier of the camera corresponding to the basic view;
“additional_view_idx” indicates an identifier of an additional view that refers to the basic view;
“track_id” indicates an identifier of the media track that contains the atlas corresponding to an additional view associated with the basic view;
“texture_or_depth_flag” indicates whether the media track containing the samples corresponding to the associated additional view includes the texture component and/or the depth component. When it is equal to 0, the media track includes both components; when it is equal to 1, the media track includes only the texture component; and
“depth_track_id” indicates the ID of the media track in which the depth component corresponding to the associated additional view is located.
For atlases corresponding to basic views and their associated additional views referring to the basic views, a track group box with a track group type (track_group_type) of “cstg” is defined to describe the relationship between the atlases corresponding to the basic views and their associated additional views.
In some embodiments of the disclosed technology, group descriptions are provided at the file level for the media tracks in which the basic view is located and for the media tracks whose additional views refer to the associated basic view.
The EntityToGroupBox in the ISO base file format is extended to define a grouping type (grouping_type) of “asvw” (associated view group) to group media tracks containing atlases from associated views. The group members are described to distinguish the media track in which the basic view is located and to indicate whether the track contains the texture component and/or the depth component of the atlas. In one group, only one media track contains a basic view, and the other views are associated with that basic view.
Figure PCTCN2019122095-appb-000011
The syntax is as follows:
“group_id” indicates the group identifier for the grouping described in each box;
“num_entities_in_group” indicates the number of entities in the group;
“entity_id” indicates the entity identifier of the media track in which the atlas that the view belongs to is placed, and is equal to track_id;
“is_basic_view” indicates whether a group member media track in a group includes a basic view. When it is equal to 1, the group member media track contains a basic view; when it is equal to 0, the group member media track does not contain a basic view. Only one group member has a value of 1;
“is_texture_in_entity” indicates whether the media track of a certain group member in the group contains the texture component of the atlas that the view belongs to; and
“is_depth_in_entity” indicates whether the media track of a group member in a group contains the depth component of the atlas that the view belongs to.
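As a non-limiting illustration of the constraint that only one member of an “asvw” group carries a basic view, the following Python sketch locates the basic view track of a group and rejects groups that violate the constraint. The class AssociatedViewEntity and the function basic_view_track are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AssociatedViewEntity:
    entity_id: int              # equal to the track_id of the member track
    is_basic_view: bool
    is_texture_in_entity: bool
    is_depth_in_entity: bool


def basic_view_track(entities: List[AssociatedViewEntity]) -> int:
    """Return the entity_id of the single member that contains the basic view."""
    basics = [e.entity_id for e in entities if e.is_basic_view]
    if len(basics) != 1:
        raise ValueError(f"expected exactly one basic view, found {len(basics)}")
    return basics[0]


group = [
    AssociatedViewEntity(1, True, True, True),     # basic view, texture and depth
    AssociatedViewEntity(2, False, True, False),   # additional view, texture only
    AssociatedViewEntity(3, False, False, True),   # additional view, depth only
]
base_track_id = basic_view_track(group)
```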
In some implementations, the atlases of the multiple basic views are stored in separate bitstreams. For example, one bitstream contains only one atlas corresponding to one basic view and the atlas corresponding to other views associated with that basic view. In this case, a layered structure can still be used.
In some embodiments of the disclosed technology, after receiving and decoding the media stream file, the user selects the media content corresponding to the basic view and its associated additional views according to the user's viewing direction, viewing position, and viewport. The corresponding camera information is considered in the process to complete the rendering of the selected content.
In some embodiments of the disclosed technology, by defining a box that describes the information of the camera parameters, the camera for multiple views is described in detail.
In some embodiments of the disclosed technology, the camera parameter information is intrinsic information and position information, and the data box is defined at the file level for description. The camera parameter list box is identified according to the box type, and the camera parameters corresponding to all the patches in the media file are described, and the box is included in the “meta” type box.
Figure PCTCN2019122095-appb-000012
Figure PCTCN2019122095-appb-000013
Figure PCTCN2019122095-appb-000014
The specific syntax is defined as follows:
“num_cameras” indicates the number of cameras;
“view_idx” is an identifier of the view captured by the camera, by which the camera corresponding to the view can be found;
“camera_pos_x,” “camera_pos_y,” and “camera_pos_z” indicate the position of the camera in the global reference coordinate system;
“projection_type” indicates the projection method of the camera. A value of 0 indicates the ERP projection mode, a value of 1 indicates the CMP projection mode, and a value of 2 indicates perspective projection. The value ranges from 0 to 255, of which 3 to 255 can be defined in the future as needed;
“projection_plane_width” and “projection_plane_height” indicate the horizontal and vertical resolutions of the camera projection plane, expressed as the horizontal and vertical resolutions of the encoded luma samples;
“erp_phi_min” and “erp_phi_max” indicate the latitude range (minimum and maximum) of the ERP projection;
“erp_theta_min” and “erp_theta_max” indicate the longitude range (minimum and maximum) of the ERP projection;
“perspective_focal_hor” and “perspective_focal_ver” indicate, in luma sample position units, the horizontal and vertical components, respectively, of the focal length of a perspective projection;
“perspective_center_hor” and “perspective_center_ver” indicate, in luma sample positions, the horizontal and vertical coordinates, respectively, of the principal point of a perspective projection;
“quantization_law” indicates the type of depth quantization method of the camera;
“depth_near” and “depth_far” indicate the minimum and maximum normalized disparity values, respectively;
“cubic_map_type” indicates the type of CMP projection. When it is equal to 0, it indicates the full projection mode (6 packed faces); values greater than 0 are temporarily reserved; and
“rotation_yaw,” “rotation_pitch,” and “rotation_roll” specify the yaw angle, pitch angle, and roll angle for the camera to rotate along the X, Y, and Z axes, respectively.
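As a non-limiting illustration of how the perspective parameters may be used, the following Python sketch projects a three-dimensional point onto the projection plane with a conventional pinhole model. It assumes a camera that looks along its positive Z axis with X to the right and Y downward, and it ignores the yaw, pitch, and roll rotation for brevity; these conventions and the function name project_perspective are assumptions and are not specified by the fields above.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def project_perspective(point: Vec3,
                        camera_pos: Vec3,
                        perspective_focal_hor: float,
                        perspective_focal_ver: float,
                        perspective_center_hor: float,
                        perspective_center_ver: float) -> Tuple[float, float]:
    """Project a world point to luma sample coordinates (unrotated camera)."""
    # Express the point relative to the camera position.
    x = point[0] - camera_pos[0]
    y = point[1] - camera_pos[1]
    z = point[2] - camera_pos[2]
    if z <= 0:
        raise ValueError("point is not in front of the camera")
    # Pinhole model: scale by focal length over depth, then shift by the principal point.
    px = perspective_center_hor + perspective_focal_hor * x / z
    py = perspective_center_ver + perspective_focal_ver * y / z
    return px, py


pixel = project_perspective(
    point=(1.0, 2.0, 4.0), camera_pos=(0.0, 0.0, 0.0),
    perspective_focal_hor=100.0, perspective_focal_ver=100.0,
    perspective_center_hor=320.0, perspective_center_ver=240.0,
)
```

A complete renderer would first apply the camera rotation to the relative position before the division by depth.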
In some embodiments of the disclosed technology, each media track contains patches from one or more views, each view corresponds to one camera, and the camera corresponding to each view in the media track is described.
The camera parameter list box is identified according to the box type and is contained in the "schi" type box, which is the Scheme Information Box describing the restricted scheme information, where the scheme type supports the atlas mode (scheme_type is “altv”). An example of the Camera Parameters Box is described below.
Figure PCTCN2019122095-appb-000015
Figure PCTCN2019122095-appb-000016
In some embodiments of the disclosed technology, there are two or more basic views, depending on the deployment mode of the cameras, and the atlases of the basic views are stored in different media tracks. The atlas of each basic view is used as the base view, and the views associated with it serve as additional views. In the media track where a basic view is located, the cameras corresponding to the basic view and its associated other views can be described using the Atlas Camera Parameter Box (AtlasCameraParametersListBox), which is included in the "schi" type box, i.e., the Scheme Information Box used to describe restricted scheme information. The media track in which the atlas of a related additional view is located defines a track reference of type “sbas” to associate the media track of the basic view with the media tracks of the related additional views.
In some embodiments of the disclosed technology, media tracks associated with basic views and other views related to them may be defined using a track group type “cstg.”
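The two association mechanisms described above (an “sbas” track reference pointing at the basic view track, or a shared “cstg” track group) can be sketched as a lookup over already-parsed track descriptors. This is a minimal sketch: the `Track` record, its field names, and the function `tracks_for_basic_view` are hypothetical, and it assumes each additional view track carries at most one “sbas” reference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Track:
    track_id: int
    is_basic_view: bool
    sbas_ref: Optional[int] = None       # track_id referenced with type 'sbas'
    cstg_group_id: Optional[int] = None  # track_group_id of a 'cstg' group


def tracks_for_basic_view(tracks: List[Track], basic_track_id: int) -> List[int]:
    """Return the basic view track followed by its associated tracks."""
    basic = next((t for t in tracks
                  if t.track_id == basic_track_id and t.is_basic_view), None)
    if basic is None:
        raise ValueError(f"track {basic_track_id} is not a basic view track")
    selected = [basic.track_id]
    for t in tracks:
        if t.track_id == basic.track_id:
            continue
        # Association through an 'sbas' track reference to the basic track.
        by_reference = t.sbas_ref == basic.track_id
        # Association through membership in the same 'cstg' track group.
        by_group = (basic.cstg_group_id is not None
                    and t.cstg_group_id == basic.cstg_group_id)
        if by_reference or by_group:
            selected.append(t.track_id)
    return selected
```

A player that has chosen one basic view could use such a lookup to decide which additional view tracks to fetch and decode alongside it.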
The disclosed technology can be implemented in some embodiments to put the atlas of views into media tracks in a situation where multiple basic views exist. In one implementation, the texture component and depth component are put into one media track; in another implementation, the texture component and depth component are put into different media tracks. The multi-view video processing method implemented based on some embodiments of the disclosed technology may include defining the structure of media tracks and related metadata and describing the metadata of data and the atlas information of views put in the media tracks. The multi-view video processing method implemented based on some embodiments of the disclosed technology may include capturing video based on multiple views, coding the video, packing the coded video with its metadata into a media file, and sending the media file directly or according to the user’s selection. The user can receive the media file, extract the metadata and samples needed, decode pictures in the media file, and synthesize the decoded pictures to see the synthesized content in the user’s viewport.
In some embodiments of the disclosed technology, a video data processing method may include determining one or more atlases in a media track corresponding to views of a multi-viewpoint video in a viewpoint of the user. The one or more atlases include texture information and depth information of the views.
In some embodiments of the disclosed technology, the video data processing method may also include, based on the atlas and the camera parameters corresponding to the multi-view video, generating the multi-viewpoint video in the viewpoint of the user.
In some embodiments of the disclosed technology, the determining of the one or more atlases in the media track corresponding to the view of the multi-viewpoint video in the viewpoint of the user includes determining a base view of the multi-viewpoint video and an additional view related to the base view of the multi-viewpoint video selected by the user, and based on the atlas corresponding to the base view and the additional view of the multi-viewpoint video in the media track, determining one or more atlases corresponding to the view of the multi-view video selected by the user in the media track.
In one example, the determining of the one or more atlases in the media track corresponding to the view of the multi-viewpoint video in the viewpoint of the user further includes identifying an entity group data box in the media track based on a second group type, and the entity group data box indicates the base view of the multi-viewpoint video. In one example, the entity group data box indicates the group identification of the entity group.
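For orientation, the sketch below reads the generic fields of an entity-to-group box as laid out in the ISO base media file format: a box header, a FullBox version/flags word, `group_id`, `num_entities_in_group`, and the list of `entity_id` values. The grouping type `demo` is a placeholder, since the actual four-character code of the second group type is not stated in this passage, and any fields specific to the basic view signaling are omitted. The helper names are hypothetical.

```python
import struct


def parse_entity_to_group_box(data: bytes) -> dict:
    """Parse the generic part of an EntityToGroupBox (32-bit box size only)."""
    # Box header: 32-bit size and a four-character type (the grouping type).
    size, grouping_type = struct.unpack_from(">I4s", data, 0)
    # FullBox header: 8-bit version and 24-bit flags in one 32-bit word.
    (version_flags,) = struct.unpack_from(">I", data, 8)
    group_id, num_entities = struct.unpack_from(">II", data, 12)
    entity_ids = list(struct.unpack_from(f">{num_entities}I", data, 20))
    return {
        "size": size,
        "grouping_type": grouping_type.decode("ascii"),
        "version": version_flags >> 24,
        "flags": version_flags & 0xFFFFFF,
        "group_id": group_id,
        "entity_ids": entity_ids,
    }


def build_example_box(grouping_type: bytes, group_id: int, entity_ids) -> bytes:
    """Serialize a box with version 0 and flags 0, for testing the parser."""
    payload = struct.pack(">I", 0)
    payload += struct.pack(">II", group_id, len(entity_ids))
    payload += struct.pack(f">{len(entity_ids)}I", *entity_ids)
    return struct.pack(">I4s", 8 + len(payload), grouping_type) + payload
```

In this reading, the `entity_id` values would be the identifiers of the media tracks that belong to the group, and `group_id` would be the group identification mentioned above.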
In one example, the determining of the one or more atlases in the media track corresponding to the view of the multi-viewpoint video in the viewpoint of the user further includes identifying a basic view data box in the entry of the video track sample, and the basic view data box indicates the base view identification of the multi-viewpoint video.
In some embodiments of the disclosed technology, a sample group data box is in the media track and its corresponding sample group description data box is identified based on a first group type, and the sample group entry in the sample group description data box indicates the multi-viewpoint video base view ID. In one example, the sample group entry indicates a grouping identifier of the sample group.
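The relationship between the sample group data box and its description box can be sketched as follows. In the ISO base media file format, the sample-to-group box stores runs of `(sample_count, group_description_index)`, where the index is 1-based into the description box and 0 means the sample belongs to no group of that type. The list `sgpd_base_view_ids`, holding one base view ID per sample group entry, and the function name are assumptions for illustration; fragment-local description indices are ignored.

```python
from typing import List, Optional, Tuple


def base_view_for_sample(
    sample_index: int,
    sbgp_runs: List[Tuple[int, int]],
    sgpd_base_view_ids: List[int],
) -> Optional[int]:
    """Return the base view ID for a 0-based sample index, or None.

    sbgp_runs holds (sample_count, group_description_index) pairs in
    track order, as carried by the sample-to-group box.
    """
    first = 0  # index of the first sample covered by the current run
    for sample_count, description_index in sbgp_runs:
        if sample_index < first + sample_count:
            if description_index == 0:
                return None  # sample is not a member of any group
            return sgpd_base_view_ids[description_index - 1]
        first += sample_count
    return None  # samples past the last run belong to no group
```

For example, with runs `[(2, 1), (3, 0), (1, 2)]` and entries `[7, 9]`, the first two samples map to base view 7, the next three to no group, and the sixth to base view 9.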
In some embodiments of the disclosed technology, the video data processing method further includes indicating the identifier of one or more additional views associated with the base view identifier of the multi-viewpoint video indicated in the sample group description data box, the entity group data box, or a basic view data box. In one example, the video data processing method further includes using an identifier of the media track where the atlas corresponding to the view is located.
In some embodiments of the disclosed technology, the media tracks are identified in which the atlas corresponding to the base view indicated by the basic view data box, and the atlas corresponding to the additional view associated with the base view identifier of the multi-viewpoint video indicated by the basic view data box, are located. In one example, the grouping of the media track in which the atlas corresponding to the base view is located and the media track in which the atlas corresponding to the additional view is located is identified according to a track group type.
In some embodiments of the disclosed technology, an atlas list timed metadata track is identified according to the first sample entry type, and the atlas list timed metadata track is used to indicate patch (image block) information of the atlas in the media track.
In some embodiments of the disclosed technology, patch information of the atlas in the media track includes at least one of the following: the identification of the atlas; the number of patches of the atlas; the identification of the view from which a patch of the atlas originates; the width and height of the patch in the view corresponding to the atlas; the X-axis and Y-axis coordinates of the patch in the atlas; the X-axis and Y-axis coordinates of the patch in the view corresponding to the atlas; and the rotation angle at which the patch is mapped from the atlas to the corresponding view.
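To make the role of these fields concrete, the sketch below maps a pixel position in the atlas back to its position in the source view. The `Patch` record and its field names are hypothetical, and the rotation convention is an assumption: the angle is taken to be a multiple of 90 degrees and to describe a clockwise rotation applied when the patch was packed into the atlas. The actual semantics of the rotation field would come from the patch syntax, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Patch:
    view_id: int       # view the patch was taken from
    width: int         # patch width in the view
    height: int        # patch height in the view
    atlas_x: int       # top-left corner of the patch in the atlas
    atlas_y: int
    view_x: int        # top-left corner of the patch in the view
    view_y: int
    rotation: int = 0  # degrees clockwise, multiple of 90 (assumed)


def atlas_to_view(patch: Patch, ax: int, ay: int) -> Tuple[int, int]:
    """Map atlas pixel (ax, ay) to view coordinates, undoing the rotation."""
    dx, dy = ax - patch.atlas_x, ay - patch.atlas_y
    w, h = patch.width, patch.height
    if patch.rotation == 0:
        u, v = dx, dy
    elif patch.rotation == 90:
        u, v = dy, h - 1 - dx
    elif patch.rotation == 180:
        u, v = w - 1 - dx, h - 1 - dy
    elif patch.rotation == 270:
        u, v = w - 1 - dy, dx
    else:
        raise ValueError("rotation must be 0, 90, 180, or 270 in this sketch")
    if not (0 <= u < w and 0 <= v < h):
        raise ValueError("atlas pixel lies outside the patch")
    return patch.view_x + u, patch.view_y + v
```

A renderer could apply such a mapping to every pixel of a decoded patch to rebuild the texture and depth planes of a view before synthesis.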
In some embodiments of the disclosed technology, the video data processing method further includes identifying an atlas attribute data box, which indicates whether the atlas contains texture information and/or depth information and indicates the identity of the atlas. Alternatively, a video track containing texture information of the atlas and a video track containing depth information of the atlas are identified based on a third track group type.
In some embodiments of the disclosed technology, the camera parameters may be indicated by a file-level camera parameter list data box, which is identified and which indicates the camera parameter information used for shooting the multi-viewpoint video. Alternatively, an atlas camera parameter list data box, identified in the restricted scheme information, is used to indicate the camera parameter information of the view corresponding to the atlas in the media track.
In some embodiments of the disclosed technology, the camera parameters may include at least one of the following: coordinate information for the camera location; coordinate rotation information for the camera location; camera mapping format information; and camera depth quantization information.
Some of the embodiments described herein are described in the general context of methods or processes, which may be implemented in one embodiment by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable and non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVDs), etc. Therefore, the computer-readable media can include non-transitory storage media. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer- or processor-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
Some of the disclosed embodiments can be implemented as devices or modules using hardware circuits, software, or combinations thereof. For example, a hardware circuit implementation can include discrete analog and/or digital components that are, for example, integrated as part of a printed circuit board. Alternatively, or additionally, the disclosed components or modules can be implemented as an Application Specific Integrated Circuit (ASIC) and/or as a Field Programmable Gate Array (FPGA) device. Some implementations may additionally or alternatively include a digital signal processor (DSP) that is a specialized microprocessor with an architecture optimized for the operational needs of digital signal processing associated with the disclosed functionalities of this application. Similarly, the various components or sub-components within each module may be implemented in software, hardware or firmware. The connectivity between the modules and/or components within the modules may be provided using any one of the connectivity methods and media that is known in the art, including, but not limited to, communications over the Internet, wired, or wireless networks using the appropriate protocols.
While this document contains many specifics, these should not be construed as limitations on the scope of an invention that is claimed or of what may be claimed, but rather as descriptions of features specific to particular embodiments. Certain features that are described in this document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination. Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results.
Only a few implementations and examples are described and other implementations, enhancements and variations can be made based on what is described and illustrated in this disclosure.

Claims (31)

  1. A method of constructing media content, comprising:
    placing a plurality of media samples into a media file associated with a plurality of views including one or more basic views and one or more additional views such that each media sample corresponds to one or more of the plurality of views and includes at least one of a texture component or a depth component of the corresponding view;
    determining a basic view media track corresponding to the one or more basic views and one or more additional view media tracks such that each media track includes one or more indicators for describing information about the corresponding view; and
    constructing the media content from the plurality of media samples based on the one or more indicators by grouping the plurality of media samples into one or more media sample groups each of which is associated with one basic view or by grouping a plurality of media tracks in which the plurality of media samples is placed.
  2. The method of claim 1, wherein each basic view corresponds to a basic view atlas.
  3. The method of claim 2, wherein a patch is acquired based on the basic view atlas.
  4. The method of claim 2, wherein the media content is constructed based on a combination of one or more basic views and one or more additional views.
  5. The method of claim 4, wherein the one or more indicators include an indicator for grouping, into a sample group, the plurality of media samples corresponding to two or more basic views.
  6. The method of claim 1, wherein the one or more indicators include an indicator for indicating whether each media track contains the texture component or the depth component or both.
  7. The method of claim 1, wherein the one or more indicators include an indicator for indicating whether each media track contains the texture component or the depth component or both.
  8. The method of claim 1, wherein the one or more indicators include an atlas attribute indicator to define which part of the texture component and the depth component is contained in the media track.
  9. The method of claim 1, wherein the one or more indicators include a view identifier to describe the corresponding view.
  10. The method of claim 1, wherein the constructing of the media content includes combining patches from different media samples.
  11. The method of claim 10, wherein the one or more indicators include an indicator describing the number of patches in each media sample.
  12. The method of claim 10, wherein the one or more indicators include a patch rotation indicator to indicate a rotation angle at which a patch maps from view atlases to media samples associated with the plurality of views.
  13. A method of constructing media content, comprising:
    placing a plurality of media samples into a media file associated with a plurality of views including a plurality of basic views and a plurality of additional views such that each media sample corresponds to one of the plurality of views and includes at least one of a texture component or a depth component associated with the corresponding view;
    determining a plurality of basic view media tracks corresponding to the plurality of basic views, respectively, and a plurality of additional view media tracks corresponding to the plurality of additional views, respectively, such that each media track includes one or more indicators for describing information about the corresponding view; and
    constructing the media content from the plurality of media samples based on the one or more indicators by grouping the plurality of media samples into one or more media sample groups each of which is associated with at least one basic view.
  14. The method of claim 13, wherein an image acquired based on the one or more basic views is used as a base image for predicting other images.
  15. The method of claim 14, wherein each basic view corresponds to a basic view atlas.
  16. The method of claim 15, wherein the image is acquired based on the basic view atlas.
  17. The method of claim 15, wherein each basic view atlas is stored in a separate code stream.
  18. The method of claim 13, wherein the one or more indicators include an entity level grouping indicator to describe a grouping type to group different media tracks containing different views.
  19. The method of claim 18, wherein the entity level grouping indicator includes grouping information between a basic view and an additional view.
  20. The method of claim 18, wherein the entity level grouping indicator includes the number of entities in a group.
  21. The method of claim 13, wherein each media track includes an indicator for identifying the plurality of views as a basic view or an additional view.
  22. A method of constructing media content, comprising:
    placing, into a media file associated with a plurality of views, camera information including camera parameters corresponding to the plurality of views according to a viewing direction, a viewing position, and a viewing window;
    selecting, based on the camera parameter information, media metadata from the media file; and
    constructing the media content based on the media metadata.
  23. The method of claim 22, wherein the camera information is extracted on a media file basis.
  24. The method of claim 22, wherein the camera information is extracted on a media track basis.
  25. The method of any of claims 22 to 24, wherein each of the plurality of media tracks is from a patch of the plurality of views, and wherein each view corresponds to one camera.
  26. The method of any of claims 22 to 24, wherein the plurality of views of the user includes at least one basic view and at least one additional view associated with the at least one basic view.
  27. The method of claim 26, wherein the media metadata of the at least one basic view is used as a base image of at least one additional view associated with the at least one basic view.
  28. The method of any of claims 22 to 24, wherein the plurality of views of the user includes two or more basic views stored in different media tracks, respectively.
  29. The method of any of claims 22 to 24, wherein the camera information includes a media track group to indicate that the media data in the media track group is to be used to decode images within a certain space range corresponding to the media track group.
  30. An apparatus for wireless communication, comprising a memory and a processor, wherein the processor reads code from the memory and implements a method recited in any of claims 1 to 29.
  31. A computer readable program storage medium having code stored thereon, the code, when executed by a processor, causing the processor to implement a method recited in any of claims 1 to 29.
PCT/CN2019/122095 2019-11-29 2019-11-29 Multi-view video processing method and apparatus WO2021102953A1 (en)

Priority Applications (6)

Application Number Priority Date Filing Date Title
CN201980101328.3A CN114556962B (en) 2019-11-29 2019-11-29 Multi-view video processing method and device
PCT/CN2019/122095 WO2021102953A1 (en) 2019-11-29 2019-11-29 Multi-view video processing method and apparatus
KR1020227020559A KR102647019B1 (en) 2019-11-29 2019-11-29 Multi-view video processing method and apparatus
JP2022531493A JP2023504097A (en) 2019-11-29 2019-11-29 Multi-view video processing method and apparatus
EP19954380.2A EP4066513A4 (en) 2019-11-29 2019-11-29 Multi-view video processing method and apparatus
US17/752,828 US20220286659A1 (en) 2019-11-29 2022-05-24 Multi-view video processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/122095 WO2021102953A1 (en) 2019-11-29 2019-11-29 Multi-view video processing method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/752,828 Continuation US20220286659A1 (en) 2019-11-29 2022-05-24 Multi-view video processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2021102953A1 true WO2021102953A1 (en) 2021-06-03

Family

ID=76129052

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122095 WO2021102953A1 (en) 2019-11-29 2019-11-29 Multi-view video processing method and apparatus

Country Status (6)

Country Link
US (1) US20220286659A1 (en)
EP (1) EP4066513A4 (en)
JP (1) JP2023504097A (en)
KR (1) KR102647019B1 (en)
CN (1) CN114556962B (en)
WO (1) WO2021102953A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120016917A1 (en) * 2009-04-09 2012-01-19 Clinton Priddle Media container file management
US20140192151A1 (en) 2013-01-04 2014-07-10 Qualcomm Incorporated Indication of presence of texture and depth views in tracks for multiview coding plus depth
US20160232939A1 (en) * 2015-02-10 2016-08-11 Nokia Technologies Oy Method, an apparatus and a computer program product for processing image sequence tracks
US20180077210A1 (en) * 2016-09-09 2018-03-15 Nokia Technologies Oy Method and apparatus for controlled observation point and orientation selection audiovisual content
WO2018177819A1 (en) * 2017-03-27 2018-10-04 Canon Kabushiki Kaisha Method for generating media data
US20190306519A1 (en) * 2018-04-03 2019-10-03 Mediatek Singapore Pte. Ltd. Methods and apparatus for encoding and decoding overlay compositions

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2527786B (en) * 2014-07-01 2016-10-26 Canon Kk Method, device, and computer program for encapsulating HEVC layered media data
US20160373771A1 (en) * 2015-06-18 2016-12-22 Qualcomm Incorporated Design of tracks and operation point signaling in layered hevc file format
US10880570B2 (en) * 2016-10-05 2020-12-29 Qualcomm Incorporated Systems and methods of adaptively determining template size for illumination compensation
US10389994B2 (en) * 2016-11-28 2019-08-20 Sony Corporation Decoder-centric UV codec for free-viewpoint video streaming
EP3349182A1 (en) * 2017-01-13 2018-07-18 Thomson Licensing Method, apparatus and stream for immersive video format
US10515477B2 (en) * 2018-02-06 2019-12-24 A9.Com, Inc. Photorealistic three dimensional texturing using canonical views and a two-stage approach
US11212506B2 (en) * 2018-07-31 2021-12-28 Intel Corporation Reduced rendering of six-degree of freedom video
US10887574B2 (en) * 2018-07-31 2021-01-05 Intel Corporation Selective packing of patches for immersive video
WO2020076058A1 (en) * 2018-10-08 2020-04-16 삼성전자 주식회사 Method and apparatus for generating media file comprising 3-dimensional video content, and method and apparatus for replaying 3-dimensional video content
KR20200143287A (en) * 2019-06-14 2020-12-23 한국전자통신연구원 Method and apparatus for encoding/decoding image and recording medium for storing bitstream
US11432009B2 (en) * 2019-07-02 2022-08-30 Intel Corporation Techniques for encoding and decoding immersive video
US11627314B2 (en) * 2019-09-27 2023-04-11 Apple Inc. Video-based point cloud compression with non-normative smoothing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4066513A4

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114697631A (en) * 2022-04-26 2022-07-01 腾讯科技(深圳)有限公司 Immersion medium processing method, device, equipment and storage medium
CN114697631B (en) * 2022-04-26 2023-03-21 腾讯科技(深圳)有限公司 Immersion medium processing method, device, equipment and storage medium
WO2023207119A1 (en) * 2022-04-26 2023-11-02 腾讯科技(深圳)有限公司 Immersive media processing method and apparatus, device, and storage medium

Also Published As

Publication number Publication date
KR102647019B1 (en) 2024-03-12
JP2023504097A (en) 2023-02-01
EP4066513A4 (en) 2023-01-25
CN114556962B (en) 2024-01-30
EP4066513A1 (en) 2022-10-05
CN114556962A (en) 2022-05-27
KR20220101169A (en) 2022-07-19
US20220286659A1 (en) 2022-09-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19954380; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2022531493; Country of ref document: JP; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 20227020559; Country of ref document: KR; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2019954380; Country of ref document: EP; Effective date: 20220629)