CN116643643A - Data processing method, device and equipment for immersion medium and storage medium - Google Patents

Data processing method, device and equipment for immersion medium and storage medium

Info

Publication number
CN116643643A
CN116643643A (application CN202210138696.9A)
Authority
CN
China
Prior art keywords
camera
freeview
video
group
track
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210138696.9A
Other languages
Chinese (zh)
Inventor
胡颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210138696.9A
Publication of CN116643643A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/75 Determining position or orientation of objects or cameras using feature-based methods involving models

Abstract

The embodiments of the present application disclose a data processing method, apparatus, device and storage medium for immersive media. The method comprises: acquiring a media file of the immersive media, where the media file contains presentation indication information of a free-view video, the free-view video is generated based on image data acquired by one or more camera groups, and the presentation indication information of the free-view video contains indication information of the camera groups; and presenting the free-view video according to the presentation indication information of the free-view video. In this way, the cameras that acquire the image data of the free-view video are grouped, and the presentation indication information indicates how to render the free-view video by declaring the indication information of the camera groups, so that presentation of the free-view video can be supported more comprehensively.

Description

Data processing method, device and equipment for immersion medium and storage medium
Technical Field
The present application relates to the field of computer technologies, in particular to the field of Virtual Reality (VR) technologies, and more particularly to a data processing method for immersive media, a data processing apparatus for immersive media, a computer device, and a computer-readable storage medium.
Background
Free-view video is an immersive medium generated based on image data acquired by multiple cameras; it contains different viewing angles and supports multi-degree-of-freedom interaction by the viewer. Presentation of the free-view video is achieved by rendering the image data acquired by the cameras according to the presentation indication information of the free-view video. However, it has been found in practice that the presentation indication information of free-view video in the related art can hardly support presentation of free-view video comprehensively.
Disclosure of Invention
The embodiments of the present application provide a data processing method, apparatus, device and storage medium for immersive media, which can support presentation of free-view video more comprehensively.
In one aspect, an embodiment of the present application provides a data processing method for immersive media, including:
acquiring a media file of the immersive media, where the media file contains presentation indication information of a free-view video, and the free-view video is generated based on image data acquired by one or more camera groups; the presentation indication information of the free-view video contains indication information of the camera groups;
and presenting the free-view video according to the presentation indication information of the free-view video.
In one aspect, an embodiment of the present application provides a data processing method for immersive media, including:
acquiring image data acquired by one or more camera groups, and encoding the image data into a free-view video;
adding presentation indication information to the free-view video according to the application form of the free-view video; the presentation indication information of the free-view video contains indication information of the camera groups;
and encapsulating the free-view video and the presentation indication information of the free-view video into a media file of the immersive media.
In one aspect, an embodiment of the present application provides a data processing apparatus for immersive media, including:
an acquisition unit, configured to acquire a media file of the immersive media, where the media file contains presentation indication information of a free-view video, and the free-view video is generated based on image data acquired by one or more camera groups; the presentation indication information of the free-view video contains indication information of the camera groups;
and a processing unit, configured to present the free-view video according to the presentation indication information of the free-view video.
In one aspect, an embodiment of the present application provides a data processing apparatus for immersive media, including:
an acquisition unit, configured to acquire image data acquired by one or more camera groups and encode the image data into a free-view video;
and a processing unit, configured to add presentation indication information to the free-view video according to the application form of the free-view video, where the presentation indication information of the free-view video contains indication information of the camera groups, and to encapsulate the free-view video and the presentation indication information of the free-view video into a media file of the immersive media.
Accordingly, the present application provides a computer device, including:
a processor, adapted to load and execute a computer program;
and a computer-readable storage medium having a computer program stored therein, where the computer program, when executed by the processor, implements the data processing method for immersive media described above.
Accordingly, the present application provides a computer-readable storage medium storing a computer program adapted to be loaded by a processor to perform the data processing method for immersive media described above.
Accordingly, the present application provides a computer program product or computer program, including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the data processing method for immersive media described above.
In the embodiments of the present application, a media file of immersive media is acquired; the media file contains presentation indication information of a free-view video, the free-view video is generated based on image data acquired by one or more camera groups, and the presentation indication information of the free-view video contains indication information of the camera groups; the free-view video is presented according to its presentation indication information. In this way, the cameras acquiring the image data of the free-view video are grouped, and the presentation indication information indicates how to render the free-view video by declaring the indication information of the camera groups, so that presentation of the free-view video can be supported more comprehensively.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1a shows a schematic diagram of a 6DoF provided by an exemplary embodiment of the present application;
FIG. 1b shows a schematic diagram of a 3DoF provided by an exemplary embodiment of the present application;
FIG. 1c shows a schematic diagram of a 3DoF+ provided by an exemplary embodiment of the present application;
FIG. 1d illustrates a schematic diagram of the flow of immersive media from acquisition to consumption provided by an exemplary embodiment of the present application;
FIG. 1e is a diagram illustrating a free view video data representation provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a flow chart of a method of data processing of immersion media provided by an exemplary embodiment of the present application;
FIG. 3 illustrates a flow chart of another method of data processing of immersion media provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of an immersion medium data processing apparatus according to an exemplary embodiment of the present application;
FIG. 5 illustrates a schematic diagram of another immersion medium data processing apparatus provided in accordance with an exemplary embodiment of the present application;
FIG. 6 shows a schematic structural diagram of a content consumption device according to an exemplary embodiment of the present application;
FIG. 7 shows a schematic structural diagram of a content production device according to an exemplary embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of protection of the present application.
The following describes some technical terms related to embodiments of the present application:
1. Immersive media:
Immersive media refers to media files that provide immersive media content, enabling viewers immersed in that content to obtain sensory experiences, such as visual and auditory experiences, as if in the real world. Immersive media can be categorized into 6DoF (Degree of Freedom) immersive media, 3DoF immersive media, and 3DoF+ immersive media.
2. Free view video:
Free-view video, also called multi-view video, is an immersive medium generated based on image data acquired by multiple cameras; it contains different viewing angles and supports multi-degree-of-freedom interaction by the viewer. For example: if a free-view video supports 3DoF+ interaction, it is a 3DoF+ immersive medium. For another example: if a free-view video supports 6DoF interaction, it is a 6DoF immersive medium.
3. Track (Track):
A track is a collection of media data in the media file encapsulation process. A media file may consist of one or more tracks; for example, a media file may contain a video track, an audio track, and a subtitle track.
4. Sample (Sample):
A sample is the encapsulation unit in the media file encapsulation process, and one track consists of a number of samples. For example, a video track may consist of a number of samples, one sample typically being one video frame.
5. DoF (Degree of Freedom, degrees of freedom):
DoF in the present application refers to the degrees of freedom supported when a viewer watches an immersive medium (e.g., free-view video) and interacts with its content, and may include 3DoF (three degrees of freedom), 3DoF+ and 6DoF (six degrees of freedom). 3DoF refers to the three degrees of freedom of the viewer's head rotating around the x, y, and z axes. 3DoF+ means that, on the basis of three degrees of freedom, the viewer also has limited freedom of movement along the x, y, and z axes. 6DoF means that, on the basis of three degrees of freedom, the viewer is also free to move along the x, y, and z axes.
6. ISOBMFF (ISO Based Media File Format, a media file format based on the ISO standard): an encapsulation standard for media files; a typical ISOBMFF file is an MP4 file.
7. DASH (Dynamic Adaptive Streaming over HTTP): an adaptive bit-rate technology that enables high-quality streaming media to be delivered over the Internet via conventional HTTP web servers.
8. MPD (Media Presentation Description): media presentation description signaling in DASH, used to describe media segment information in a media file.
9. Representation: a combination of one or more media components in DASH; for example, a video file of a certain resolution may be regarded as one Representation.
10. Adaptation Set: a collection of one or more video streams in DASH; one Adaptation Set may contain multiple Representations.
The embodiments of the present application relate to data processing technology for immersive media; some concepts in the data processing of immersive media are introduced first. In particular, the following embodiments of the present application take free-view video as the example of immersive media.
FIG. 1a shows a schematic diagram of 6DoF provided by an exemplary embodiment of the present application. 6DoF is divided into window 6DoF, omnidirectional 6DoF, and 6DoF. Window 6DoF means that the viewer's rotational movement around the X-axis and Y-axis is restricted, and translation along the Z-axis is restricted; for example, the viewer cannot see the scene outside the window frame and cannot pass through the window. Omnidirectional 6DoF means that the viewer's rotational movement around the X-axis, Y-axis, and Z-axis is restricted; for example, the viewer cannot freely traverse the three-dimensional 360-degree VR content in a restricted movement area. 6DoF means that the viewer can translate freely along the X-axis, Y-axis, and Z-axis; for example, the viewer can walk freely in the three-dimensional 360-degree VR content. Similar to 6DoF, there are also 3DoF and 3DoF+ production techniques. FIG. 1b shows a schematic diagram of 3DoF provided by an exemplary embodiment of the present application; as shown in FIG. 1b, 3DoF means that the viewer is fixed at the center point of a three-dimensional space, and the viewer's head rotates around the X-axis, Y-axis, and Z-axis to view the picture provided by the media content. FIG. 1c shows a schematic diagram of 3DoF+ according to an exemplary embodiment of the present application. As shown in FIG. 1c, 3DoF+ means that, when the virtual scene provided by the immersive media has certain depth information, the viewer's head can move in a limited space on the basis of 3DoF to view the picture provided by the media content.
FIG. 1d illustrates a schematic diagram of the flow of immersive media from acquisition to consumption provided by an exemplary embodiment of the present application; as shown in FIG. 1d, the flow includes:
(1) Video acquisition: free-view video is usually shot by a camera array composed of multiple cameras capturing the same three-dimensional scene from multiple angles, forming texture information (color information, etc.) and depth information (spatial distance information, etc.) of the scene. The content production device can then form the immersive media consumed by the viewer side (e.g., 3DoF immersive media, 3DoF+ immersive media, or 6DoF immersive media) according to the position information of each virtual viewpoint in the free-view video, combining the texture information and depth information from the different cameras.
(2) After the immersive media is obtained, the content production device compression-encodes it to obtain the free-view video; for example, the content production device may compression-encode the immersive media using the AVS3 encoding technique, the HEVC encoding technique, or the like.
(3) After the free-view video is obtained, the content production device encapsulates its data streams. Specifically, the content production device encapsulates the audio bitstream and video bitstream in a file container according to the file format of the immersive media (such as ISOBMFF (ISO Base Media File Format)) to form a media file resource of the immersive media, where the media file resource may be a media file, or media segments forming a media file, of the immersive media; and, according to the file format requirements of the immersive media, records the metadata of the media file resource using media presentation description information (Media Presentation Description, MPD). Here, metadata is a general term for information related to the presentation of the immersive media; the metadata may include description information of the media content, description information of windows, signaling information related to the presentation of the media content, and so on.
(4) The content production device transmits the encapsulated file of the free-view video to the content consumption device. The transmission may be based on various transmission protocols, which may include, but are not limited to: DASH (Dynamic Adaptive Streaming over HTTP), HLS (HTTP Live Streaming), SMTP (Smart Media Transport Protocol), TCP (Transmission Control Protocol), and the like.
(5) After acquiring the encapsulated file of the free-view video provided by the content production device, the content consumption device decapsulates it. The file decapsulation process on the content consumption device side is the inverse of the file encapsulation process on the content production device side: the content consumption device decapsulates the media file resource according to the file format requirements of the immersive media to obtain the audio bitstream and the video bitstream.
(6) After the encapsulated file of the free-view video is decapsulated, the content consumption device decodes the free-view video to obtain the immersive media. The decoding of the video bitstream by the content consumption device includes: (1) decoding the video bitstream to obtain a planar projection image; (2) performing reconstruction processing on the projection image based on the media presentation description information to convert it into a 3D image, where reconstruction refers to re-projecting the two-dimensional projection image into 3D space.
(7) The content consumption device presents the corresponding immersive media according to the viewer's virtual viewpoint when watching. The content consumption device renders the audio content obtained by audio decoding and the 3D image obtained by video decoding according to the metadata related to rendering and windows in the media presentation description information, and the playing output of the 3D image is realized once rendering is completed. In particular, if the 3DoF or 3DoF+ production technique is adopted, the content consumption device renders the 3D image mainly based on the current viewpoint, parallax, and depth information; if the 6DoF production technique is adopted, the content consumption device renders the 3D image within the window mainly based on the current viewpoint. Here, a viewpoint refers to the viewing position of the viewer, parallax refers to the difference in line of sight produced by the viewer's two eyes or by movement, and a window refers to the viewing area.
The content production device and the content consumption device may together constitute an immersive media system. A content production device refers to a computer device used by the provider of the immersive media (e.g., its content producer); it may be a terminal (e.g., a PC (Personal Computer), a smart mobile device (e.g., a smartphone), etc.) or a server. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. A content consumption device refers to a computer device used by a user of the immersive media (e.g., a viewer); it may be a terminal (e.g., a PC (Personal Computer), a smart mobile device (e.g., a smartphone), a VR device (e.g., a VR headset, VR glasses, etc.), a smart appliance, a vehicle-mounted terminal, an aircraft, etc.).
It can be appreciated that the data processing technology for immersive media in the present application can be implemented by means of cloud technology; for example, a cloud server may be used as the content production device. Cloud technology refers to a hosting technology that unifies hardware, software, network and other resources in a wide area network or local area network to realize computation, storage, processing and sharing of data.
In practical applications, free-view video data may be expressed in different ways. FIG. 1e shows a schematic diagram of a free-view video data representation according to an exemplary embodiment of the present application. As shown in FIG. 1e, in the present application, the free-view video data consists of image data acquired by multiple cameras and the corresponding free-view video content description metadata; that is, the present application expresses free-view video data through the image data acquired by multiple cameras and the corresponding free-view video content description metadata. The image data acquired by the multiple cameras includes: texture maps acquired by the multiple cameras and the depth maps corresponding to those texture maps.
Under this expression of free-view video data, the flow of the immersive media of the present application from acquisition to consumption is as follows:
(1) For the content production device: first, texture maps and depth maps are acquired by the multiple cameras of a multi-camera matrix array; then, the texture map information acquired by the multiple cameras and the corresponding depth map information are stitched together. Note that if the background information of the free-view video needs to be consumed later, the background information should also be stitched into the image frame. The stitched multi-camera depth map and texture map information is then encoded in a planar video compression mode, and the metadata description information of the above process is encapsulated into the video file.
(2) For the content consumption device: first, the video file is decapsulated; the depth map and texture map information of each camera is then decoded from the video file according to the metadata information; finally, according to the free-view video application, an algorithm is applied that combines the depth map information and texture map information to synthesize the free-view virtual viewpoint requested by the viewer of the immersive media.
In practice, the content production device may instruct the content consumption device on how to consume the media file of the immersive media through data boxes. A data box refers to a data block or object containing metadata, i.e., the metadata of the corresponding media content is contained in the data box. The immersive media may include multiple data boxes, for example a rotation data box, an overlay information data box, a media file format data box, and so on. The presentation indication information of the free-view video may be configured in a media file format data box of the immersive media, for example in the free-view data box. In streaming scenarios, the description information corresponding to the presentation indication information may be configured in a transmission signaling file of the immersive media, such as in a free-view camera descriptor. The presentation indication information of the immersive media may be configured in the free-view data box (AvsFreeViewInfoBox), the content description information, and the free-view boundary data box (AvsFreeViewBoundaryBox). Depending on the encoding standard (e.g., AVS) of the immersive media, the syntax of the free-view data box (AvsFreeViewInfoBox) can be seen in Table 1 below:
Table 1
The semantics of the syntax shown in Table 1 above are as follows:
In the free-view data box (AvsFreeViewInfoBox), codec_index is used to indicate the codec independence between the texture map and the depth map corresponding to each camera in the track; the specific indication manner is given in Table 2:
Table 2
depth_padding_size indicates the guard band width of the depth map. texture_padding_size indicates the guard band width of the texture map. camera_count indicates the number of all cameras that acquire image data of the free-view video. ExtCameraInfoStruct() indicates the extrinsic information of a camera.
IntCameraInfoStruct() indicates the intrinsic information of a camera. camera_resolution_x and camera_resolution_y indicate the resolution width and height, respectively, of the texture map and depth map acquired by the camera. depth_downsamplefactor indicates the downsampling factor of the depth map; the actual resolution width and height of the depth map are 1/2^depth_downsamplefactor of the camera acquisition resolution width and height.
depth_vetex_x and depth_vetex_y indicate the x and y component values of the offset of the top-left vertex of the depth map relative to the origin of the plane frame (e.g., the top-left vertex of the plane frame). texture_vetex_x and texture_vetex_y indicate the x and y component values of the offset of the top-left vertex of the texture map relative to the origin of the plane frame (e.g., the top-left vertex of the plane frame).
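Since the body of Table 1 is not reproduced above, the following is a minimal sketch of what the AvsFreeViewInfoBox syntax could look like, reconstructed from the field semantics described here; the box type 'afvi', the field bit widths, and the per-camera loop structure are assumptions, not the normative definition:

    aligned(8) class AvsFreeViewInfoBox extends FullBox('afvi', 0, 0) {
        unsigned int(8)  codec_index;            // codec independence of texture/depth maps, see Table 2
        unsigned int(8)  depth_padding_size;     // guard band width of the depth map
        unsigned int(8)  texture_padding_size;   // guard band width of the texture map
        unsigned int(8)  camera_count;           // number of cameras acquiring image data
        for (i = 0; i < camera_count; i++) {
            ExtCameraInfoStruct();               // camera extrinsic information
            IntCameraInfoStruct();               // camera intrinsic information
            unsigned int(32) camera_resolution_x;    // width of texture/depth maps
            unsigned int(32) camera_resolution_y;    // height of texture/depth maps
            unsigned int(8)  depth_downsamplefactor; // depth map downsampled by 1/2^factor
            unsigned int(32) depth_vetex_x;      // x offset of depth map top-left vertex
            unsigned int(32) depth_vetex_y;      // y offset of depth map top-left vertex
            unsigned int(32) texture_vetex_x;    // x offset of texture map top-left vertex
            unsigned int(32) texture_vetex_y;    // y offset of texture map top-left vertex
        }
    }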
In the camera extrinsic parameter information structure (ExtCameraInfoStruct), camera_pos_present indicates whether the camera position parameters are expressed: when this field takes the value 1, camera position parameters exist in the media file; when it takes the value 0, no camera position parameters exist in the media file. camera_ori_present indicates whether the camera orientation parameters are expressed: when this field takes the value 1, camera orientation parameters exist in the media file; when it takes the value 0, no camera orientation parameters exist in the media file.
camera_pos_x, camera_pos_y and camera_pos_z indicate the x-, y- and z-coordinates, respectively, of the camera position in the global reference frame, expressed in units of 2^-16 meters.
cam_quat_x, cam_quat_y and cam_quat_z indicate the x, y and z components, respectively, of the camera rotation quaternion. The fields cam_quat_x, cam_quat_y and cam_quat_z take floating-point values in the range [-1, 1]. When a component of the rotation information is missing, it defaults to 0. Each component is computed as follows:
qX = cam_quat_x ÷ 2^30, qY = cam_quat_y ÷ 2^30, qZ = cam_quat_z ÷ 2^30
The fourth component qW can be derived from the mathematical properties of the quaternion:
qW = Sqrt(1 - (qX^2 + qY^2 + qZ^2))
The quaternion (w, x, y, z) represents a rotation around the axis (x, y, z) by the angle 2 × cos^-1(w) = 2 × sin^-1(Sqrt(x^2 + y^2 + z^2)).
In the camera intrinsic parameter information structure (IntCameraInfoStruct), camera_id indicates the corresponding camera identifier. camera_depth_present indicates whether the camera depth parameters are expressed: when the camera_depth_present field takes the value 1, camera depth parameters exist in the media file; when it takes the value 0, they do not. camera_type indicates the projection algorithm type of the camera: when the camera_type field takes the value 0, the projection algorithm type is ERP (Equirectangular Projection); the value 1 denotes perspective projection; the value 2 denotes orthogonal projection; the value 3 denotes the free-view pinhole camera model; the value 4 denotes the free-view fisheye camera model; the remaining values are reserved.
erp_horizontal_fov indicates the longitude range of the window region in the horizontal direction for ERP projection; the value range of the erp_horizontal_fov field is (0, 2π), in radians. erp_vertical_fov indicates the latitude range of the window region in the vertical direction for ERP projection; the value range of the erp_vertical_fov field is (0, π), in radians.
transparent_horizontal_fov indicates the horizontal range of the window region for perspective projection; the value range of this field is (0, π), in radians. transparent_aspect_ratio indicates the aspect ratio (horizontal/vertical) of the window for perspective projection; the field takes a 32-bit floating-point value, parsed in accordance with the IEEE 754 standard.
ortho_horizontal_size indicates the horizontal size of the window, in meters, for orthogonal projection; the field takes a 32-bit floating-point value, parsed in accordance with the IEEE 754 standard. ortho_aspect_ratio indicates the aspect ratio (horizontal/vertical) of the window for orthogonal projection; the field takes a 32-bit floating-point value, parsed in accordance with the IEEE 754 standard.
camera_focal_length_x and camera_focal_length_y indicate the x and y components, respectively, of the camera focal length for the free-view pinhole camera model.
camera_principle_point_x and camera_principle_point_y indicate the x and y components, respectively, of the offset of the camera's main optical axis in the image coordinate system for the free-view fisheye camera model.
camera_near_depth and camera_far_depth indicate the near-plane and far-plane depth (or distance), respectively, of the view frustum associated with the camera, expressed in units of 2^-16 meters.
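Likewise, a sketch of the two camera parameter structures, assembled from the semantics above; the field widths and the conditional layout are assumptions:

    aligned(8) class ExtCameraInfoStruct() {
        unsigned int(1) camera_pos_present;      // 1: position parameters present
        unsigned int(1) camera_ori_present;      // 1: orientation parameters present
        bit(6) reserved;
        if (camera_pos_present == 1) {
            signed int(32) camera_pos_x;         // units of 2^-16 meters
            signed int(32) camera_pos_y;
            signed int(32) camera_pos_z;
        }
        if (camera_ori_present == 1) {
            signed int(32) cam_quat_x;           // quaternion component, qX = cam_quat_x / 2^30
            signed int(32) cam_quat_y;
            signed int(32) cam_quat_z;
        }
    }

    aligned(8) class IntCameraInfoStruct() {
        unsigned int(32) camera_id;
        unsigned int(1)  camera_depth_present;
        bit(7) reserved;
        unsigned int(8)  camera_type;            // 0: ERP, 1: perspective, 2: orthogonal, 3: pinhole, 4: fisheye
        if (camera_type == 0) {
            unsigned int(32) erp_horizontal_fov;         // radians, (0, 2π)
            unsigned int(32) erp_vertical_fov;           // radians, (0, π)
        } else if (camera_type == 1) {
            unsigned int(32) transparent_horizontal_fov; // radians, (0, π)
            float(32)        transparent_aspect_ratio;   // IEEE 754
        } else if (camera_type == 2) {
            float(32) ortho_horizontal_size;     // meters, IEEE 754
            float(32) ortho_aspect_ratio;        // IEEE 754
        } else if (camera_type == 3 || camera_type == 4) {
            unsigned int(32) camera_focal_length_x;
            unsigned int(32) camera_focal_length_y;
            unsigned int(32) camera_principle_point_x;
            unsigned int(32) camera_principle_point_y;
        }
        if (camera_depth_present == 1) {
            unsigned int(32) camera_near_depth;  // units of 2^-16 meters
            unsigned int(32) camera_far_depth;   // units of 2^-16 meters
        }
    }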
Depending on the encoding standard (e.g., AVS) of the immersive media, the syntax of the content description information of the immersive media can be seen in Table 3 below:
Table 3
For the 3DPoint referred to in Table 3 above: x, y and z respectively represent the x-, y- and z-coordinate values of the 3D point in the Cartesian coordinate system. The semantics of the illustrated syntax are as follows:
Cuboid region structure (CuboidRegionStruct): cuboid_dx, cuboid_dy and cuboid_dz indicate the dimensions of the cuboid subregion along the x-, y- and z-axes, respectively, in the Cartesian coordinate system relative to the anchor point.
Sphere structure (SpheroidStruct): radius_x, radius_y and radius_z indicate the radius of the sphere subregion in the x-, y- and z-dimensions, respectively, in the Cartesian coordinate system.
Halfspace structure (HalfspaceStruct): normal_x, normal_y and normal_z indicate the x, y and z components, respectively, of the plane normal of the halfspace subregion in the Cartesian coordinate system. distance indicates the distance from the origin along the normal vector to the halfspace plane.
Cylinder structure (CylinderStruct): radius_x indicates the radius of the cylinder subregion in the x-dimension in the Cartesian coordinate system, radius_y indicates its radius in the y-dimension, and height indicates the height of the cylinder subregion in the z-dimension.
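As the body of Table 3 is likewise not reproduced, here is a sketch of the region structures that follows directly from the semantics above; the field widths are assumptions:

    aligned(8) class 3DPoint() {
        signed int(32) x;    // x-coordinate in the Cartesian coordinate system
        signed int(32) y;    // y-coordinate
        signed int(32) z;    // z-coordinate
    }

    aligned(8) class CuboidRegionStruct() {
        signed int(32) cuboid_dx;   // extent along the x-axis relative to the anchor point
        signed int(32) cuboid_dy;   // extent along the y-axis
        signed int(32) cuboid_dz;   // extent along the z-axis
    }

    aligned(8) class SpheroidStruct() {
        signed int(32) radius_x;    // radius in the x-dimension
        signed int(32) radius_y;    // radius in the y-dimension
        signed int(32) radius_z;    // radius in the z-dimension
    }

    aligned(8) class HalfspaceStruct() {
        signed int(32) normal_x;    // plane normal, x component
        signed int(32) normal_y;    // plane normal, y component
        signed int(32) normal_z;    // plane normal, z component
        signed int(32) distance;    // distance from origin along the normal to the plane
    }

    aligned(8) class CylinderStruct() {
        signed int(32) radius_x;    // radius in the x-dimension
        signed int(32) radius_y;    // radius in the y-dimension
        signed int(32) height;      // height in the z-dimension
    }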
The free-view boundary data box (AvsFreeViewBoundaryBox) is used to indicate the scene boundary information of the free-view video. Depending on the encoding standard (e.g., AVS) of the immersive media, the syntax of the free-view boundary data box (AvsFreeViewBoundaryBox) can be seen in Table 4 below:
Table 4
The semantics of the free-view boundary data box (AvsFreeViewBoundaryBox) syntax shown in Table 4 above are as follows:
boundary_camera_count indicates the number of cameras that constitute the boundary of the free-view video scene. camera_id indicates the identifier of a camera forming the scene boundary; when the position, orientation and window region of a boundary camera have already been declared in the free-view data box (AvsFreeViewInfoBox), the relevant information of the boundary camera can be determined by the camera identifier alone. boundary_space_present indicates whether an additional scene rendering boundary exists: when the boundary_space_present field takes the value 0, the scene rendering boundary is determined by the parameters of the boundary cameras, whose shooting range forms the scene rendering boundary; when the field takes the value 1, the scene rendering boundary is determined by boundary_space, which indicates the extent of the scene boundary. If the region corresponding to boundary_space is larger than the region formed by the boundary cameras, the region beyond the shooting range of the boundary cameras is rendered by the content consumption device according to the content shot by the boundary cameras. boundary_space_type indicates the type of scene boundary; the value meanings of the boundary_space_type field are given in Table 5 below:
Table 5
default_origin_point indicates the origin of the scene: when the default_origin_point field takes the value 1, the scene boundary uses the coordinate origin (0, 0, 0) as the scene origin; when it takes the value 0, the scene boundary uses a designated point as the scene origin.
boundary_exceeded_switch indicates the processing method when the viewing behavior of the viewer exceeds the scene boundary: when the boundary_exceeded_switch field takes the value 0, a virtual window region based on the origin and oriented to (0, 0, 0) is rendered for the viewer; when it takes the value 1, the content region shot by the camera corresponding to dest_camera_id is rendered for the viewer; when it takes the value 2, a virtual window region based on the position and orientation specified by dest_virtual_camera is rendered for the viewer.
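A minimal sketch of the AvsFreeViewBoundaryBox syntax of Table 4, reconstructed from the semantics above; the box type, field widths, the field name origin_point, and the exact conditional layout are assumptions:

    aligned(8) class AvsFreeViewBoundaryBox extends FullBox('afvb', 0, 0) {
        unsigned int(8) boundary_camera_count;        // cameras forming the scene boundary
        for (i = 0; i < boundary_camera_count; i++)
            unsigned int(32) camera_id;               // declared in AvsFreeViewInfoBox
        unsigned int(1) boundary_space_present;
        unsigned int(1) default_origin_point;
        bit(6) reserved;
        if (boundary_space_present == 1) {
            unsigned int(8) boundary_space_type;      // scene boundary type, see Table 5
            // boundary_space: one region structure (e.g., CuboidRegionStruct,
            // SpheroidStruct, HalfspaceStruct or CylinderStruct) selected by
            // boundary_space_type
        }
        if (default_origin_point == 0)
            3DPoint origin_point;                     // designated scene origin (hypothetical name)
        unsigned int(8) boundary_exceeded_switch;     // handling when viewing exceeds the boundary
        if (boundary_exceeded_switch == 1)
            unsigned int(32) dest_camera_id;          // camera whose content is rendered
        if (boundary_exceeded_switch == 2)
            ExtCameraInfoStruct dest_virtual_camera;  // position/orientation to render
    }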
As can be seen from Tables 1, 3 and 4 above, the presentation indication information of the free-view video indicates the camera-related parameter information and the scene-boundary-related information of the free-view video. Some problems remain: (1) only a single scene boundary is considered; the above presentation indication information does not cover application scenarios in which multiple cameras respectively generate free-view videos; (2) scenarios in which ordinary video and free-view video are consumed in association are not supported. For specific applications of free-view video, the present application provides a new data processing method for immersive media. The technology provided by the present application can: (1) indicate the camera groups constituting different independent scenes and the corresponding scene boundary information; (2) indicate the association relationship between ordinary video and free-view video; (3) indicate the timed metadata information of the free-view video; (4) generate corresponding transmission signaling according to the information in (1)-(3) to guide the content consumption device in selecting resources during free-view video consumption.
The data processing flow of immersive media in the present application is as follows:
For the content production device: on the one hand, the content production device generates a free-view video bitstream and encapsulates it into a media file, and, according to the application form of the free-view video, adds corresponding metadata information to the free-view video file. This may specifically include at least one of the following: a) if the free-view video contains multiple independent free-view scenes, the camera groups are divided according to the different free-view scenes, and the camera information contained in each camera group and the scene boundary information corresponding to each camera group are indicated; b) if the free-view video is presented as an enhancement of an ordinary video, the association relationship between the ordinary video and the free-view video is indicated; c) if the free-view video is only presented at specific times, the presentation information of the free-view video is indicated through timed metadata.
On the other hand, if the content production device supports streaming, the free-view video is sliced into media segments suitable for streaming (according to the specification of the transmission standard), and corresponding transmission signaling is generated, in which at least one of the following is indicated: a) the camera group information; b) if the free-view video is presented as an enhancement of an ordinary video, the association relationship between the ordinary video and the free-view video; c) if the free-view video is only presented at specific times, the timed metadata is also sliced for streaming and transmitted to the content consumption device. After generating the corresponding transmission signaling, the content production device transmits the signaling file to the content consumption device.
For the content consumption device: depending on the application form, it may download the complete immersive media file and then play it locally, or establish streaming transmission with the content production device and adaptively select a suitable free-view video stream for consumption according to the transmission signaling.
To support the implementation flow described above, the present application adds several descriptive fields at the system level; the relevant fields are defined below to support free-view video applications, exemplified in the form of extended ISOBMFF data boxes and DASH signaling. The free-view camera group data box (AvsFreeViewCameraGroupInfoBox) is used to indicate information of the camera groups contained in the current track (i.e., the track in which the AvsFreeViewCameraGroupInfoBox is located, i.e., the track being parsed). The content shot by all cameras in one camera group together forms a free-view video scene, and all cameras in one camera group share the same coordinate system. The extension of the free-view camera group data box (AvsFreeViewCameraGroupInfoBox) is shown in Table 6 below:
Table 6
The semantics of the free-view camera group data box (AvsFreeViewCameraGroupInfoBox) syntax shown in Table 6 above are as follows:
The same-camera-group information flag field same_group_info_flag indicates whether all cameras corresponding to the image data contained in the current track belong to the one or more camera groups corresponding to the current track. When this field takes the first set value (e.g., same_group_info_flag = 1), the image data contained in the current track all belong to the one or more camera groups corresponding to the current track; when it takes the second set value (e.g., same_group_info_flag = 0), at least one camera among the cameras corresponding to the image data contained in the current track belongs to a camera group other than the one or more camera groups corresponding to the current track. For example, suppose the free-view video is encapsulated in at least one track, the at least one track includes a first track containing image data of the free-view video, the first track corresponds to N camera groups, each camera group contains one or more cameras, N is a positive integer, and the free-view camera group data box is encapsulated in the first track. When the same-camera-group information flag field in the free-view camera group data box takes the first set value, the image data contained in the first track all belong to the N camera groups corresponding to the first track; when it takes the second set value, at least one camera among the cameras corresponding to the image data contained in the first track belongs to a camera group other than the N camera groups.
The camera group number field camera_group_num indicates the number of camera groups corresponding to the current track. The camera group identification field camera_group_id indicates the identifier of a camera group.
The identifier of each camera group corresponds to a complete camera group flag field complete_group_flag; the complete_group_flag corresponding to the current camera group (i.e., the camera group indicated by the camera_group_id field, i.e., the camera group being parsed) indicates whether the current track contains the image data acquired by all cameras in the current camera group. When the complete camera group flag field takes the first set value (e.g., complete_group_flag = 1), the current track contains the image data acquired by all cameras in the current camera group; when it takes the second set value (e.g., complete_group_flag = 0), the current track contains the image data acquired by only some of the cameras in the current camera group.
Further, when the same-camera-group information flag field takes the second set value (same_group_info_flag = 0), the free-view camera group data box further includes a camera number field camera_num and a camera identification field camera_id, where camera_num indicates the number of all cameras constituting the current camera group, and camera_id indicates the identifier of a camera.
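A sketch of the AvsFreeViewCameraGroupInfoBox of Table 6, assembled from the field semantics above; the box type, the bit widths, and placing camera_num/camera_id inside the per-group loop are assumptions:

    aligned(8) class AvsFreeViewCameraGroupInfoBox extends FullBox('fcgi', 0, 0) {
        unsigned int(1) same_group_info_flag;     // 1: all cameras belong to the track's groups
        bit(7) reserved;
        unsigned int(8) camera_group_num;         // number of camera groups for this track
        for (i = 0; i < camera_group_num; i++) {
            unsigned int(32) camera_group_id;     // identifier of the camera group
            unsigned int(1)  complete_group_flag; // 1: track holds data of all group cameras
            bit(7) reserved;
            if (same_group_info_flag == 0) {
                unsigned int(8) camera_num;       // cameras constituting this group
                for (j = 0; j < camera_num; j++)
                    unsigned int(32) camera_id;   // identifier of a camera
            }
        }
    }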
The extension of the free-view boundary data box (AvsFreeViewBoundaryBox) is shown in Table 7 below:
Table 7
The semantics of the free-view boundary data box (AvsFreeViewBoundaryBox) syntax shown in Table 7 above are as follows: camera_group_flag indicates the type of presentation boundary corresponding to the current image data. When camera_group_flag takes the first set value (e.g., 1), the current free-view boundary data box indicates the boundary information of one or more camera groups; when it takes the second set value (e.g., 0), the current free-view boundary data box indicates the boundary information of the current free-view video. The camera group number field camera_group_num indicates the number of camera groups corresponding to the current track. The camera group identification field camera_group_id indicates the identifier of a camera group. The free-view boundary information structure (AvsFreeViewBoundaryInfoStruct) is used to indicate the scene boundary information.
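A sketch of the extended boundary box of Table 7 under the same caveats; AvsFreeViewBoundaryInfoStruct is taken to carry the per-scene boundary fields sketched above for Table 4:

    aligned(8) class AvsFreeViewBoundaryBox extends FullBox('afvb', 0, 0) {
        unsigned int(1) camera_group_flag;       // 1: boundaries of camera groups; 0: whole video
        bit(7) reserved;
        if (camera_group_flag == 1) {
            unsigned int(8) camera_group_num;
            for (i = 0; i < camera_group_num; i++) {
                unsigned int(32) camera_group_id;
                AvsFreeViewBoundaryInfoStruct(); // boundary information of this camera group
            }
        } else {
            AvsFreeViewBoundaryInfoStruct();     // boundary information of the free-view video
        }
    }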
In practical applications, there are joint consumption scenarios of free-view video and non-free-view video; that is, when watching, the viewer of the immersive media is presented with the non-free-view video by default, and the free-view video is presented only in certain segments of the video (such as highlights). In this case, the non-free-view video and the free-view video need to be associated to indicate the relationship between them. In one embodiment, the association of the non-free-view video and the free-view video may be indicated by a free-view association group data box (AvsFreeViewAssociationGroupBox), whose extension is shown in Table 8 below:
Table 8
The semantics of the free-view association group data box (AvsFreeViewAssociationGroupBox) syntax shown in Table 8 above are as follows: when the free-view video flag field freeview_video_flag takes the first set value (e.g., 1), the free-view video is currently presented; when it takes the second set value (e.g., 0), the non-free-view video is currently presented.
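Only freeview_video_flag is described for Table 8; a minimal sketch follows, assuming the box is an entity group in the ISOBMFF EntityToGroupBox style with a hypothetical grouping type 'afag':

    aligned(8) class AvsFreeViewAssociationGroupBox extends EntityToGroupBox('afag', 0, 0) {
        // EntityToGroupBox carries group_id and num_entities_in_group
        for (i = 0; i < num_entities_in_group; i++) {
            unsigned int(1) freeview_video_flag; // 1: entity is the free-view video; 0: non-free-view
            bit(7) reserved;
        }
    }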
In another embodiment, the association of the non-free-view video and the free-view video may also be indicated by a free-view association data box (AvsFreeViewAssociationBox), whose extension is shown in Table 9 below:
Table 9
The semantics of the free-view association data box (AvsFreeViewAssociationBox) syntax shown in Table 9 above may refer to the semantics of the free-view association group data box (AvsFreeViewAssociationGroupBox) syntax in Table 8 above, and are not repeated here.
In yet another embodiment, the free-view video and the non-free-view video may also be associated by a track reference. A track reference of type 'af' is used to indicate the index relationship between the free-view video and the associated non-free-view video, with the track encapsulating the free-view video pointing to the indexed track encapsulating the non-free-view video. That is, there is an index relationship between the track identifiers of the encapsulated free-view video and the encapsulated non-free-view video: the track encapsulating the non-free-view video can be indexed through the track identifier held by the track encapsulating the free-view video; alternatively, the track encapsulating the free-view video can be indexed through the track identifier held by the track encapsulating the non-free-view video.
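A sketch of how the 'af' track reference might appear in the track encapsulating the free-view video, following the ISOBMFF 'tref' convention; the track IDs are illustrative:

    TrackReferenceBox('tref') {
        TrackReferenceTypeBox('af') {
            unsigned int(32) track_IDs[] = { 2 };  // track 2 encapsulates the associated non-free-view video
        }
    }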
In one implementation, one free-view video is encapsulated in multiple video tracks, and these tracks can be associated by a free-view track group. The extension of the free-view track group data box (AvsFreeViewGroupBox) is shown in Table 10 below:
Table 10
The free-view track group data box (AvsFreeViewGroupBox) shown in Table 10 above is identified by the 'a3fg' track group type. Among all tracks containing a track group type data box (TrackGroupTypeBox) of type 'afvg', tracks having the same group ID belong to the same track group. The semantics of the free-view track group data box (AvsFreeViewGroupBox) syntax are as follows:
The camera group flag field camera_group_flag indicates whether multiple camera groups exist in the track group corresponding to the free-view track group data box (i.e., the current track group). When camera_group_flag takes the value 0, all tracks in the current track group together constitute one free-view video, and no multiple camera groups exist in the free-view video; when camera_group_flag takes the value 1, all tracks in the current track group together constitute one free-view video, and the free-view video contains multiple camera groups, where the camera group information corresponding to the current track is indicated by the free-view camera group data box (AvsFreeViewCameraGroupInfoBox) or by the related information in the current data box (such as the camera group number field camera_group_num and the camera group identification field camera_group_id). camera_group_num indicates the number of camera groups corresponding to the current track. camera_group_id indicates the identifier of a camera group.
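A sketch of the AvsFreeViewGroupBox of Table 10, assembled from the semantics above; the bit widths and layout are assumptions, and the track group type is taken from the text as 'a3fg':

    aligned(8) class AvsFreeViewGroupBox extends TrackGroupTypeBox('a3fg') {
        // TrackGroupTypeBox carries track_group_id; tracks with the same ID form one group
        unsigned int(1) camera_group_flag;        // 1: the free-view video contains multiple camera groups
        bit(7) reserved;
        if (camera_group_flag == 1) {
            unsigned int(8) camera_group_num;     // number of camera groups for the current track
            for (i = 0; i < camera_group_num; i++)
                unsigned int(32) camera_group_id; // identifier of a camera group
        }
    }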
In another implementation, the free-view video contains multiple camera groups (i.e., the free-view video is generated based on image data acquired by multiple camera groups), and the presentation time of the camera groups may be indicated by a camera group metadata track containing a camera group sample entry (CameraGroupSampleEntry), whose extension is shown in Table 11 below:
Table 11
The semantics of the camera group sample entry (CameraGroupSampleEntry) syntax shown in Table 11 above are as follows:
When the camera group update flag field camera_group_update_flag corresponding to the current sample takes the first set value (e.g., 1), the camera group information is updated, and the camera group to which the current sample belongs is specified by the camera group identification field (camera_group_id) corresponding to the current sample; when the camera group update flag field corresponding to the current sample takes the second set value (e.g., 0), the camera group information corresponding to the current sample is unchanged.
When the camera group cancellation flag field camera_group_cancel_flag corresponding to the current sample takes the first set value (e.g., 1), the camera group information corresponding to the target sample is cancelled, so that it no longer takes effect. Here, the current sample refers to the sample being parsed; the target sample refers to the history sample, among those parsed before the current sample, whose camera group update flag field takes the first set value (e.g., 1), whose camera group identification field is the same as the camera group identification field (camera_group_id) of the current sample, and for which the difference between its parsing time and the parsing time of the current sample is the minimum of such differences over all the history samples. For example, suppose the current sample has sequence number 10, the parsed history samples include samples numbered 1 to 9 (a smaller number indicating an earlier parsing time), the camera group identification fields of samples 3, 6 and 8 are the same as that of the current sample, camera_group_update_flag takes the value 1 for the history samples numbered 3 and 6, and takes the value 0 for the history sample numbered 8. Then, when camera_group_cancel_flag of the current sample takes the value 1, the camera group information corresponding to the history sample numbered 6 no longer takes effect.
When the camera group update flag field corresponding to the current sample takes the first set value, the camera group cancellation flag field corresponding to the current sample does not take the first set value. When the camera group information corresponding to the current sample remains unchanged: if the camera group information corresponding to the target sample is in effect, the camera group information corresponding to the current sample is in effect; if the camera group information corresponding to the target sample has been cancelled, the camera group information corresponding to the current sample is cancelled.
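Since the flags described above correspond to individual samples, here is a sketch of the timed-metadata sample entry and sample format of Table 11; the sample entry type 'cgse' and all field widths are assumptions:

    aligned(8) class CameraGroupSampleEntry extends MetaDataSampleEntry('cgse') {
        // static configuration, if any, would go here
    }

    aligned(8) class CameraGroupSample() {
        unsigned int(1) camera_group_update_flag;  // 1: group info updated, camera_group_id applies
        unsigned int(1) camera_group_cancel_flag;  // 1: cancels the matching target sample's group info
        bit(6) reserved;
        unsigned int(32) camera_group_id;          // camera group the sample belongs to
    }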
Description information corresponding to the freeview data box (AvsFreeViewInfoBox) is carried in the freeview camera descriptor (AvsFreeViewCamInfo) provided in the embodiment of the present application. The freeview camera descriptor (AvsFreeViewCamInfo) is a SupplementalProperty element whose @schemeIdUri attribute is "urn:avs:ims:2018:av3f". The freeview camera descriptor (AvsFreeViewCamInfo) is contained in a transmission signaling file and is encapsulated at the representation level or the adaptation set level of the media presentation description file of the immersion media; one adaptation set level may contain one or more representation levels. When the description information exists at a target adaptation set level, it describes all the representation levels within that adaptation set level; when the description information exists at a target representation level, it describes only that representation level. The transmission signaling file contains the attributes shown in Table 12 below:
Table 12
Here, O indicates that the corresponding attribute is Optional; CM indicates that the corresponding attribute is Conditionally Mandatory; and M indicates that the corresponding attribute is Mandatory. The extended freeview camera descriptor (AvsFreeViewCamInfo) newly adds the attributes AvsFreeViewCal@camera_group_id and AvsFreeViewCal@boundary_camera_flag, together with the related descriptions of these elements and attributes.
Description information corresponding to the freeview association group data box (AvsFreeViewAssociationGroupBox) is carried in the freeview video association descriptor (AvsFreeViewAssociation) provided in the embodiment of the present application. The freeview video association descriptor (AvsFreeViewAssociation) is a SupplementalProperty element whose @schemeIdUri attribute is "urn:avs:ims:2018:asf". The freeview video association descriptor (AvsFreeViewAssociation) is contained in a transmission signaling file and is encapsulated at the representation level or the adaptation set level of the media presentation description file of the immersion media; one adaptation set level may contain one or more representation levels. When the description information exists at a target adaptation set level, it describes all the representation levels within that adaptation set level; when the description information exists at a target representation level, it describes only that representation level. The transmission signaling file contains the attributes shown in Table 13 below:
Table 13
Here, M indicates that the corresponding attribute is Mandatory. The extended freeview video association descriptor (AvsFreeViewAssociation) newly adds the element AvsFreeViewAssociation and the attributes AvsFreeViewAssociation@group_id and AvsFreeViewAssociation@freeview_video_flag, together with the related descriptions of these elements and attributes.
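As an illustration of how these two descriptors might surface in a DASH media presentation description, the following Python sketch builds the corresponding SupplementalProperty elements with xml.etree.ElementTree. Only the @schemeIdUri values are taken from the text above; the attribute names and their unprefixed form are assumptions, since Tables 12 and 13 are not reproduced here.

import xml.etree.ElementTree as ET

adaptation_set = ET.Element("AdaptationSet")  # descriptors placed at the adaptation set level

# Freeview camera descriptor (AvsFreeViewCamInfo); the two attributes mirror the
# newly added camera_group_id and boundary_camera_flag named above (assumed form).
cam_info = ET.SubElement(adaptation_set, "SupplementalProperty",
                         schemeIdUri="urn:avs:ims:2018:av3f")
cam_info.set("camera_group_id", "1")
cam_info.set("boundary_camera_flag", "0")

# Freeview video association descriptor (AvsFreeViewAssociation).
association = ET.SubElement(adaptation_set, "SupplementalProperty",
                            schemeIdUri="urn:avs:ims:2018:asf")
association.set("group_id", "1")
association.set("freeview_video_flag", "1")  # 1: this resource carries freeview video

print(ET.tostring(adaptation_set, encoding="unicode"))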
In the embodiment of the present application, the freeview boundary data box shown in Table 7 above, in combination with the freeview camera descriptor shown in Table 12, indicates the camera group information of each scene contained in the freeview video and the boundary information of each scene; the freeview relation group data box shown in Table 8 or the freeview relation data box shown in Table 9, in combination with the freeview video association descriptor shown in Table 13, indicates the association relationship between the non-freeview video and the freeview video; the camera group sample entry shown in Table 11 indicates the timed metadata information of the freeview video; and the freeview camera group data box shown in Table 6 and the freeview track group data box shown in Table 10 generate the corresponding transmission signaling to guide the content consumption device in selecting resources during freeview video consumption.
FIG. 2 illustrates a flow chart of a data processing method for immersion media provided by an exemplary embodiment of the present application; the method may be performed by a content consumption device in an immersion media system and comprises the following steps S201-S202:
S201, acquiring a media file of the immersion media.
The media file comprises presentation indication information of the freeview video, the presentation indication information being used to indicate how to present the freeview video; for example, the presentation indication information may indicate the boundary of the freeview video, and so on. The freeview video is generated based on image data collected by one or more camera groups, each camera group comprising one or more cameras. In practical applications, the one or more cameras whose images form the same scene in the freeview video may be divided into the same camera group, or one or more cameras with the same internal camera parameters may be divided into the same camera group. The image data includes texture maps and the depth maps corresponding to the texture maps. The presentation indication information of the freeview video comprises indication information of the camera groups, the indication information of a camera group being used to indicate the number of cameras in the camera group and the camera parameters corresponding to those cameras.
In one embodiment, the media file of the immersion media is sliced into a plurality of media segments. The content consumption device acquires a transmission signaling file of the immersion media, the transmission signaling file containing the description information corresponding to the presentation indication information of the freeview video; after acquiring the transmission signaling file, the content consumption device acquires the media file of the immersion media according to the transmission signaling file. Specifically, the content consumption device may determine, according to the virtual viewing angle of the viewer of the immersion media and the description information in the transmission signaling file, the media segments corresponding to the image data required for viewing, and pull the determined media segments by streaming, as sketched below.
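A minimal sketch of this selection step, under the assumption of a hypothetical signaling model in which each media segment advertises the position of the camera that produced it; the nearest-camera heuristic is only one possible policy and is not mandated by the embodiment.

import math

def pick_segment(segments, viewer_pos):
    # segments: list of dicts like {"url": ..., "camera_pos": (x, y, z)},
    # assumed to be parsed from the transmission signaling file.
    # Returns the segment whose capture camera is closest to the viewer's
    # virtual viewpoint; a real client would also weigh orientation, bitrate, etc.
    return min(segments, key=lambda s: math.dist(s["camera_pos"], viewer_pos))

segments = [
    {"url": "seg_cam1.m4s", "camera_pos": (100, 0, 100)},
    {"url": "seg_cam2.m4s", "camera_pos": (100, 100, 100)},
    {"url": "seg_cam3.m4s", "camera_pos": (0, 0, 100)},
]
print(pick_segment(segments, viewer_pos=(90, 10, 100))["url"])  # -> seg_cam1.m4s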
S202, displaying the freeview video according to the presentation indication information of the freeview video.
The content consumption device determines, from the media file, the image data required for viewing according to the virtual viewing angle of the viewer of the immersion media, and decodes and displays the image data according to the indication information of the camera groups.
In a first embodiment, the indication information of the camera group includes attribute indication information of the camera group. The attribute indication information of the camera group may be used to indicate: whether the current track contains image data collected by all cameras in the current camera group (the current camera group being any one of the camera groups corresponding to the current track); whether all cameras corresponding to the image data contained in the current track belong to the one or more camera groups corresponding to the current track; the number of camera groups in the current track; the identifier of each camera group; the number of cameras in each camera group; and the identifier of each camera in each camera group. The presentation indication information of the freeview video is metadata information, and the metadata information includes a freeview camera group data box (AvsFreeViewCameraGroupInfoBox); the freeview camera group data box contains the attribute indication information of the camera group.
In one embodiment, the freeview video is encapsulated in at least one track, the at least one track comprising a first track that contains image data of the freeview video; the first track corresponds to N camera groups, each camera group comprises one or more cameras, and N is a positive integer; the freeview camera group data box is encapsulated in the first track. The freeview camera group data box comprises a same camera group information flag field (same_group_info_flag), which indicates whether all cameras corresponding to the image data contained in the current track belong to the one or more camera groups corresponding to the current track. When the same camera group information flag field takes a first set value (e.g., same_group_info_flag=1), the image data contained in the first track all belongs to the N camera groups corresponding to the first track; when the same camera group information flag field takes a second set value (e.g., same_group_info_flag=0), at least one of the cameras corresponding to the image data contained in the first track belongs to a camera group other than the N camera groups corresponding to the first track.
The freeview camera group data box further comprises a camera group number field (camera_group_num) and a camera group identification field (camera_group_id). The camera group number field indicates the number of camera groups corresponding to the first track and takes the value N; each value of the camera group identification field indicates the identifier of one of the N camera groups.
The freeview camera group data box further comprises a complete camera group flag field (complete_group_flag), one complete camera group flag field corresponding to the identifier of each camera group. The complete camera group flag field indicates whether the current track contains image data collected by all cameras in the current camera group: when it takes a first set value (e.g., complete_group_flag=1), the current track contains image data collected by all cameras in the current camera group; when it takes a second set value (e.g., complete_group_flag=0), the current track contains image data collected by only some of the cameras in the current camera group.
Further, when the same camera group information flag field takes the second set value (same_group_info_flag=0), the freeview camera group data box further comprises a camera number field (camera_num) and a camera identification field (camera_id). The camera number field indicates the number of cameras contained in the current camera group and takes a value greater than or equal to 1; for example, if camera group A contains 3 cameras, then camera_num=3 for camera group A. The camera identification field indicates the identifier of a camera in the current camera group.
Further, the freeview video includes one or more freeview scenes; within the N camera groups, the image data collected by all cameras in the same camera group belongs to the same freeview scene, and all cameras in the same camera group share the same coordinate system.
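The field layout described above can be summarized with a reading sketch. The Python fragment below assumes a hypothetical bit-reader interface (read_flag, read_uint) as well as assumed field widths and ordering; the normative syntax is given by the corresponding table, so this is a paraphrase rather than an implementation of it.

def parse_camera_group_info_box(r):
    # Paraphrase of AvsFreeViewCameraGroupInfoBox; r is a hypothetical
    # bit reader offering read_flag() and read_uint(n_bits).
    box = {"same_group_info_flag": r.read_flag(),  # 1: all cameras of the track's
                                                   # image data belong to its groups
           "camera_group_num": r.read_uint(8),     # N camera groups in this track
           "groups": []}
    for _ in range(box["camera_group_num"]):
        group = {"camera_group_id": r.read_uint(8),
                 "complete_group_flag": r.read_flag()}  # 1: track holds data from
                                                        # every camera in the group
        if box["same_group_info_flag"] == 0:
            # Cameras are listed explicitly when some cameras belong to
            # groups other than those corresponding to this track.
            group["camera_num"] = r.read_uint(8)
            group["camera_ids"] = [r.read_uint(8) for _ in range(group["camera_num"])]
        box["groups"].append(group)
    return box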
In a second embodiment, the presentation indication information of the freeview video further includes freeview scene boundary indication information. The presentation indication information of the freeview video is metadata information, and the metadata information comprises a freeview boundary data box (AvsFreeViewBoundaryBox); the freeview boundary data box contains the freeview scene boundary indication information. The content consumption device displaying the freeview video according to the presentation indication information of the freeview video comprises: in the process of decoding the image data, determining the display boundary of the image data according to the freeview scene boundary indication information.
In one embodiment, the freeview video is encapsulated in at least one track, the at least one track comprising a second track that contains image data of the freeview video; the second track corresponds to P camera groups, where P is a positive integer; the freeview boundary data box is encapsulated in the second track. The freeview boundary data box comprises a camera group flag field (camera_group_flag), which indicates the type of the display boundary corresponding to the image data in the second track: when the camera group flag field takes a first set value (e.g., camera_group_flag=1), the freeview boundary data box indicates the boundary information corresponding to each of the P camera groups; when the camera group flag field takes a second set value (e.g., camera_group_flag=0), the freeview boundary data box indicates the boundary information of the freeview video. In addition, the freeview boundary data box further comprises a camera group number field (camera_group_num) and a camera group identification field (camera_group_id); the camera group number field indicates the number of camera groups contained in the second track and takes the value P, and the camera group identification field indicates the identifier of each camera group in the second track.
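A companion sketch for the freeview boundary data box, under the same hypothetical reader interface; the boundary payload itself (the spatial extents defined in Table 7, which is not reproduced here) is abstracted as an assumed read_boundary() call.

def parse_boundary_box(r):
    # Paraphrase of the AvsFreeViewBoundaryBox semantics described above.
    camera_group_flag = r.read_flag()
    if camera_group_flag == 1:
        # Boundary information is given per camera group in this track.
        camera_group_num = r.read_uint(8)
        return {"per_group": True,
                "boundaries": {r.read_uint(8): r.read_boundary()  # group id -> boundary
                               for _ in range(camera_group_num)}}
    # camera_group_flag == 0: a single boundary for the freeview video as a whole.
    return {"per_group": False, "boundary": r.read_boundary()}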
In a third embodiment, the freeview video is associated with a jointly consumed non-freeview video, and the presentation indication information of the freeview video further includes association relationship indication information between the freeview video and the non-freeview video. The presentation indication information of the freeview video is metadata information, and the metadata information includes a freeview relation group data box (AvsFreeViewAssociationGroupBox); the freeview relation group data box contains the association relationship indication information between the freeview video and the non-freeview video.
The content consumption device displaying the freeview video according to the presentation indication information of the freeview video comprises: in the process of displaying the image data, switching from the freeview video to the non-freeview video according to the association relationship indication information between the freeview video and the non-freeview video; or switching from the non-freeview video to the freeview video according to the association relationship indication information between the freeview video and the non-freeview video.
In one embodiment, the freeview video is encapsulated in one or more third tracks and the non-freeview video is encapsulated in one or more fourth tracks; the third tracks and the fourth tracks carry the same track group type identifier. The freeview relation group data box comprises the track group type identifier (e.g., afag) and a freeview video flag field (freeview_video_flag): when the freeview video flag field takes a first set value (e.g., freeview_video_flag=1), it indicates that the freeview video in the third tracks needs to be presented; when the freeview video flag field takes a second set value (e.g., freeview_video_flag=0), it indicates that the non-freeview video in the fourth tracks needs to be presented.
In another embodiment, the freeview video is encapsulated in one or more third tracks and the non-freeview video is encapsulated in one or more fourth tracks; the freeview video and the non-freeview video belong to the same entity group. The freeview relation group data box comprises the identifier of the entity group (identity_id) and a freeview video flag field (freeview_video_flag): when the freeview video flag field takes a first set value (e.g., freeview_video_flag=1), it indicates that the freeview video of the entities corresponding to the identifier of the entity group in the third tracks needs to be presented; when the freeview video flag field takes a second set value (e.g., freeview_video_flag=0), it indicates that the non-freeview video of the entities corresponding to the identifier of the entity group in the fourth tracks needs to be presented.
In yet another embodiment, the freeview video is encapsulated in one or more third tracks and the non-freeview video is encapsulated in one or more fourth tracks; an index relationship exists between the track identifiers of the third tracks and the track identifiers of the fourth tracks. The index relationship means that the fourth track can be indexed through the track identifier of the third track, or that the third track can be indexed through the track identifier of the fourth track. For example, if the freeview video is played in the order of track identifiers 1-5, the track encapsulating the freeview video has track identifier 2, and the track encapsulating the non-freeview video has track identifier 3, then while the content consumption device consumes the freeview video in track 2, track identifier 2 can be used to index the track, identified as 3, that encapsulates the non-freeview video.
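All three association mechanisms converge on the same consumption decision: given the flag value, present either the freeview tracks or the non-freeview tracks of the same group. A minimal sketch, assuming a hypothetical in-memory layout in which each track records whether it carries freeview video:

def tracks_to_present(tracks, freeview_video_flag):
    # tracks: list of dicts like {"track_id": ..., "is_freeview": bool},
    # grouped beforehand by track group type identifier, entity group
    # identifier, or track index relationship (whichever mechanism is used).
    # freeview_video_flag: value read from the freeview relation group data box.
    want_freeview = (freeview_video_flag == 1)   # first set value: present freeview
    return [t for t in tracks if t["is_freeview"] == want_freeview]

group = [{"track_id": 3, "is_freeview": True},    # a third track: freeview video
         {"track_id": 4, "is_freeview": False}]   # a fourth track: non-freeview video
print(tracks_to_present(group, freeview_video_flag=0))  # -> the non-freeview track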
In a fourth embodiment, if the freeview video is encapsulated in a plurality of tracks, the presentation indication information of the freeview video further includes association indication information among the plurality of tracks. The presentation indication information of the freeview video is metadata information, the metadata information includes a freeview track group data box (AvsFreeViewGroupBox), and the freeview track group data box contains the association indication information among the plurality of tracks.
In one embodiment, the plurality of tracks is clustered into M track groups, each track carrying a track group identifier; tracks with the same track group identifier belong to the same track group, and M is a positive integer. Each track group corresponds to one freeview track group data box. The freeview video includes one or more freeview scenes, each freeview scene being generated based on image data acquired by one or more camera groups. The freeview track group data box comprises a camera group flag field (camera_group_flag), which indicates whether a plurality of camera groups exists in the track group corresponding to the freeview track group data box.
When the camera group flag field takes a first set value (e.g., camera_group_flag=1), all tracks in the track group corresponding to the freeview track group data box form a second freeview scene, the second freeview scene being generated based on image data acquired by a plurality of camera groups; when the camera group flag field takes a second set value (e.g., camera_group_flag=0), all tracks in the track group corresponding to the freeview track group data box form a first freeview scene, the first freeview scene being generated based on image data acquired by one camera group. The freeview track group data box further comprises a camera group number field (camera_group_num) and a camera group identification field (camera_group_id); the camera group number field indicates the number of camera groups in the track group corresponding to the freeview track group data box and takes a value greater than or equal to 1 (specifically, camera_group_num>1 when camera_group_flag=1, and camera_group_num=1 when camera_group_flag=0). The camera group identification field indicates the identifiers of the camera groups in the track group corresponding to the freeview track group data box.
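The constraint just stated (camera_group_flag=1 implies camera_group_num>1, camera_group_flag=0 implies camera_group_num=1) lends itself to a small validation sketch; the dict layout of the parsed box is again an assumption for the example.

def validate_track_group_box(box):
    # box: hypothetical parse of AvsFreeViewGroupBox, e.g.
    # {"camera_group_flag": 1, "camera_group_ids": [1, 2]}.
    num = len(box["camera_group_ids"])            # camera_group_num
    if box["camera_group_flag"] == 1 and num <= 1:
        raise ValueError("camera_group_flag=1 requires more than one camera group")
    if box["camera_group_flag"] == 0 and num != 1:
        raise ValueError("camera_group_flag=0 requires exactly one camera group")
    return num

print(validate_track_group_box({"camera_group_flag": 1, "camera_group_ids": [1, 2]}))  # -> 2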
In a fifth embodiment, the freeview video is generated based on image data acquired by a plurality of camera groups, and the indication information of the camera groups further includes presentation time indication information for each camera group. The presentation indication information of the freeview video is metadata information, the metadata information comprises a camera group sample entry (CameraGroupSampleEntry), and the camera group sample entry contains the presentation time indication information of each camera group. The content consumption device displaying the freeview video according to the presentation indication information of the freeview video comprises: sequentially decoding and displaying, according to the presentation time indication information of each camera group, the image data corresponding to that camera group.
In one embodiment, the camera group sample entry includes a camera group update flag field (camera_group_update_flag), a camera group identification field (camera_group_id), and a camera group cancel flag field (camera_group_cancel_flag);
when the camera group update flag field (camera_group_update_flag) corresponding to the current sample takes a first set value (e.g., 1), it indicates that the camera group to which the current sample belongs, together with the information of that camera group, is specified by the camera group identification field (camera_group_id) corresponding to the current sample; when the camera group update flag field corresponding to the current sample takes a second set value (e.g., 0), the camera group information corresponding to the current sample remains unchanged.
When the camera group cancel flag field (camera_group_cancel_flag) corresponding to the current sample takes a first set value (e.g., 1), the camera group information corresponding to the target sample is canceled and no longer takes effect. The current sample is the sample being parsed. The target sample is the history sample, among all samples parsed before the current sample, that satisfies all of the following: its camera group update flag field takes the first set value (e.g., 1); its camera group identification field is the same as the camera group identification field (camera_group_id) corresponding to the current sample; and the difference between its parsing time and the parsing time of the current sample is the minimum of the differences between the parsing time of each qualifying history sample and the parsing time of the current sample. For example, suppose the current sample has sequence number 10, the parsed history samples have sequence numbers 1-9 (a smaller sequence number indicating an earlier parsing time), and the camera group identification fields of the samples numbered 3, 6, and 8 are the same as that of the current sample. The camera group update flag field takes the value 1 for the history samples numbered 3 and 6 and the value 0 for the history sample numbered 8. Then, when the camera group cancel flag field of the current sample is 1, the camera group information corresponding to the history sample numbered 6 no longer takes effect.
When the camera group update flag field corresponding to the current sample takes the first set value, the camera group cancel flag field corresponding to the current sample shall not take the first set value. When the camera group information corresponding to the current sample remains unchanged, it follows the state of the target sample: if the camera group information corresponding to the target sample is in the effective state, the camera group information corresponding to the current sample is in the effective state; if the camera group information corresponding to the target sample is in the canceled state, the camera group information corresponding to the current sample is in the canceled state.

In a seventh embodiment, the media file of the immersion media is sliced into a plurality of media segments; the content consumption device acquires a transmission signaling file of the immersion media, the transmission signaling file containing the description information corresponding to the presentation indication information of the freeview video.
In one embodiment, the freeview video includes one or more freeview scenes; the transmission signaling file comprises at least one of a freeview camera descriptor and a freeview association descriptor. The freeview camera descriptor indicates the parameters (such as projection mode and camera position) of each camera in the one or more camera groups; the freeview association descriptor indicates the association relationship between the freeview video and the corresponding non-freeview video.
In another embodiment, the transmission signaling file contains representation levels and adaptation set levels, one adaptation set level including one or more representation levels. When the description information exists at a target adaptation set level, it describes all the representation levels within that adaptation set level; when the description information exists at a target representation level, it describes only that representation level. The description information may include one or more of the freeview camera descriptor and the freeview association descriptor.
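This scoping rule can be paraphrased in a few lines: a descriptor attached to an adaptation set applies to every representation inside it, while a descriptor attached to a representation applies to that representation alone. A short sketch, under an assumed in-memory model of the media presentation description:

def effective_descriptors(adaptation_set, representation):
    # adaptation_set / representation: hypothetical dicts with a "descriptors"
    # list. Adaptation-set-level descriptors cover all contained
    # representations; representation-level descriptors apply only locally.
    return adaptation_set.get("descriptors", []) + representation.get("descriptors", [])

aset = {"descriptors": ["AvsFreeViewCamInfo"]}       # applies to both representations
rep1 = {"descriptors": ["AvsFreeViewAssociation"]}   # applies to rep1 only
rep2 = {"descriptors": []}
print(effective_descriptors(aset, rep1))  # -> ['AvsFreeViewCamInfo', 'AvsFreeViewAssociation']
print(effective_descriptors(aset, rep2))  # -> ['AvsFreeViewCamInfo']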
The embodiment of the present application extends the data boxes and the media presentation description file of the immersion media. The data processing method of the immersion media comprises: acquiring a media file of the immersion media, the media file comprising presentation indication information of the freeview video, the freeview video being generated based on image data acquired by one or more camera groups, and the presentation indication information of the freeview video comprising indication information of the camera groups; and displaying the freeview video according to the presentation indication information of the freeview video. In this way, the cameras that collect the image data of the freeview video are grouped, and the presentation indication information of the freeview video indicates how the freeview video is rendered by declaring the indication information of the camera groups, so that the display of the freeview video can be supported more comprehensively. In addition, the display boundary of a freeview scene is determined through the indication of the freeview boundary data box, and whether to display the freeview video or the non-freeview video is determined through the indication of the view relation group data box or the freeview relation data box.
FIG. 3 illustrates a flow chart of another data processing method for immersion media provided by an exemplary embodiment of the present application; the method may be performed by a content production device in an immersion media system and comprises the following steps S301-S303:
S301, acquiring image data collected by one or more camera groups, and encoding the image data into a freeview video.
For the specific implementation of step S301, reference may be made to the video encoding embodiment in FIG. 1d, which is not described again here.
S302, adding presentation indication information to the freeview video according to the application form of the freeview video.
The presentation indication information of the freeview video includes the indication information of the camera groups. The content production device adding the presentation indication information to the freeview video according to the application form of the freeview video is the inverse process of the content consumption device displaying the freeview video according to the presentation indication information of the freeview video; for the specific manner, reference may be made to the process in FIG. 2 in which the content consumption device displays the freeview video according to the presentation indication information, which is not described in detail here. Taking the freeview camera group data box as an example: if the freeview video is generated based on image data collected by one camera group and the first track contains image data collected by all cameras in that camera group, the content production device sets the complete_group_flag field of the freeview camera group data box to 1, as sketched below.
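To make this production-side configuration concrete, the following sketch populates an in-memory stand-in for the camera group box in the single-group case; the dict layout and helper are assumptions, while the field names and values follow the description above.

def build_single_group_box(group_id=1):
    # Single-group case: the track carries image data from every camera in the
    # group, so complete_group_flag and same_group_info_flag are both set to 1
    # (when same_group_info_flag=1, camera identifiers need not be listed).
    return {"same_group_info_flag": 1,
            "camera_group_num": 1,
            "groups": [{"camera_group_id": group_id, "complete_group_flag": 1}]}

print(build_single_group_box())
# Matches the Track1 values in the multi-group example later in this section:
# {camera_group_num=1; camera_group_id=1; complete_group_flag=1; same_group_info_flag=1}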
S303, encapsulating the freeview video and the presentation indication information of the freeview video into a media file of the immersion media.
For the specific implementation of step S303, reference may be made to the video file encapsulation embodiment in FIG. 1d, which is not described again here.
In one embodiment, the content production device supports streaming. After obtaining the media file of the immersion media, the content production device slices the media file into a plurality of media segments and generates a transmission signaling file of the immersion media, the transmission signaling file containing the description information corresponding to the presentation indication information of the freeview video; the description information may include at least one of the freeview camera descriptor and the freeview association descriptor.
The freeview camera descriptor indicates the parameters of each camera in the one or more camera groups; the freeview association descriptor indicates the association relationship between the freeview video and the corresponding non-freeview video; see Table 12 and Table 13 above for details, which are not repeated here.
The data processing method of the immersion media provided by the present application is described below through two complete examples:
in one embodiment, the freeview video is comprised of images acquired by a single camera group.
Content production device: generates a freeview video bitstream and encapsulates the video bitstream into a media file, adding the corresponding metadata information (i.e., the presentation indication information) to the freeview video file according to the application form of the freeview video, including:
information indicating the parameters of the shooting cameras;
AvsFreeViewInfoBox:
{Camera1:ID=1;Pos=(100,0,100);orientation=(0,0,0)};
{Camera2:ID=2;Pos=(100,100,100);orientation=(0.5,0.5,0)};
{Camera3:ID=3;Pos=(0,0,100);orientation=(0.5,0.5,-0.5)}。
Further, if the content production device supports streaming, the media file is sliced into media segments suitable for streaming (according to the specification of the transmission standard), and the corresponding transmission signaling is generated, the transmission signaling indicating the following information:
indicating camera parameter information and camera group information;
and the corresponding fields of the freeview camera descriptor (AvsFreeViewCamInfo) in the DASH signaling are filled in according to the values of the data boxes in the metadata information.
After the above steps are completed, the content production device transmits the signaling file to the content consumption device.
Content consumption device: in one implementation, the content consumption device downloads the complete file and then locally plays the complete file, and when playing, selects a texture map and a depth map corresponding to a proper camera according to parameters of each camera in a freeview data box (avsfreeviewbox) and a virtual view when a viewer views the complete file, decodes the texture map and the depth map, and synthesizes a corresponding picture.
In another implementation, the content consumption device establishes streaming transmission with the content production device. The content consumption device requests the video streams of the texture maps and depth maps corresponding to suitable cameras according to the parameters of each camera in the freeview camera descriptor (AvsFreeViewCamInfo) of the transmission signaling and the virtual viewing angle of the viewer, and synthesizes the corresponding picture after receiving the corresponding video streams.
In another embodiment, the freeview video is comprised of images acquired by multiple camera groups.
Content production device: generates a freeview video bitstream and encapsulates the video bitstream into a media file, adding the corresponding metadata information to the freeview video file according to the application form of the freeview video, including:
information indicating the parameters of the shooting cameras; and information indicating the camera groups;
Track1:
AvsFreeViewInfoBox:
{Camera1:ID=1;Pos=(100,0,100);orientation=(0,0,0)}
{Camera2:ID=2;Pos=(100,100,100);orientation=(0.5,0.5,0)}
AvsFreeViewCameraGroupInfoBox:
{camera_group_num=1;camera_group_id=1;complete_group_flag=1;same_group_info_flag=1}
AvsFreeViewGroupBox:
{camera_group_flag=1}
Track2:
AvsFreeViewInfoBox:
{Camera3:ID=3;Pos=(0,0,100);orientation=(0.5,0.5,-0.5)}
{Camera4:ID=4;Pos=(0,100,0);orientation=(0.5,0.5,0)}
AvsFreeViewCameraGroupInfoBox:
{camera_group_num=1;camera_group_id=2;complete_group_flag=1;same_group_info_flag=1}
AvsFreeViewGroupBox:
{camera_group_flag=1}
Track3 (metadata track): the freeview video camera group data track.
Further, if the content production device supports streaming, the media file is sliced into media segments suitable for streaming (according to the specification of the transmission standard), and the corresponding transmission signaling is generated, the transmission signaling indicating the following information:
The corresponding fields of the freeview camera descriptor (AvsFreeViewCamInfo) in the DASH signaling, including the camera group identification field camera_group_id, are filled in according to the values of the data boxes in the metadata information.
After the above steps are completed, the content production device transmits the signaling file to the content consumption device.
Content consumption device: in one implementation, the content consumption device downloads the complete file and then plays it locally. During playback, it determines the camera group that should be played at a particular time based on the information indicated in the freeview video camera group data track, and determines the video data tracks corresponding to that camera group from the freeview camera group data box (AvsFreeViewCameraGroupInfoBox) and the freeview track group data box (AvsFreeViewGroupBox). It then selects the texture maps and depth maps corresponding to suitable cameras according to the parameters of each camera in the freeview data box (AvsFreeViewInfoBox) and the virtual viewing angle of the viewer, decodes them, and synthesizes the corresponding picture. Specifically, camera_group_num=1 indicates that the number of camera groups in Track1 is 1; camera_group_id=1 indicates that the identifier of the camera group in Track1 is 1; complete_group_flag=1 indicates that Track1 contains image data collected by all cameras in camera group 1; and same_group_info_flag=1 indicates that the image data contained in Track1 all belongs to camera group 1 corresponding to Track1. In the freeview track group data box, camera_group_flag=1 indicates that a plurality of camera groups exists in the corresponding track group. Track2 is similar to Track1 and is not described again here.
In another implementation, the content consumption device establishes streaming transmission, requests the data of the freeview video camera group data track, and determines the camera group to be played at a particular time based on the information therein. The content consumption device requests the video streams of the texture maps and depth maps corresponding to suitable cameras according to the parameters of each camera in the freeview camera descriptor (AvsFreeViewCamInfo) of the transmission signaling and the virtual viewing angle of the viewer, and synthesizes the corresponding picture after receiving the corresponding video streams.
The embodiment of the present application extends the data boxes and the media presentation description file of the immersion media. The data processing method of the immersion media comprises: acquiring image data collected by one or more camera groups and encoding the image data into a freeview video; adding presentation indication information to the freeview video according to the application form of the freeview video, the presentation indication information of the freeview video comprising indication information of the camera groups; and encapsulating the freeview video and the presentation indication information of the freeview video into a media file of the immersion media. In this way, the cameras corresponding to the freeview video are grouped, and the presentation indication information of the freeview video indicates how the freeview video is rendered by declaring the indication information of the camera groups, so that the display of the freeview video can be supported more comprehensively. In addition, the camera group information of each scene contained in the freeview video and the boundary information of each scene are indicated through the freeview boundary data box; the association relationship between the non-freeview video and the freeview video is indicated through the view relation group data box or the freeview relation data box; the timed metadata information of the freeview video is indicated through the camera group sample entry; and the corresponding transmission signaling is generated through the freeview camera group data box and the freeview track group data box to guide the content consumption device in selecting resources during freeview video consumption.
The methods of the embodiments of the present application are detailed above; to facilitate better implementation of the above solutions, an apparatus of the embodiments of the present application is provided below accordingly.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a data processing apparatus for immersion media according to an exemplary embodiment of the present application. The data processing apparatus of the immersion media may be a computer program (comprising program code) running in a content consumption device; for example, the data processing apparatus of the immersion media may be application software in the content consumption device. As shown in FIG. 4, the data processing apparatus of the immersion media includes an acquisition unit 401 and a processing unit 402.
Referring to fig. 4, the detailed descriptions of the respective units are as follows:
an acquisition unit 401, configured to acquire a media file of the immersion media, the media file including presentation indication information of a freeview video, the freeview video being generated based on image data collected by one or more camera groups, and the presentation indication information of the freeview video including indication information of the camera groups;
the processing unit 402 is configured to display the freeview video according to the presentation indication information of the freeview video.
In one embodiment, the indication information of the camera group includes attribute indication information of the camera group; the presentation indication information of the freeview video is metadata information, and the metadata information comprises a freeview camera group data box; the freeview camera group data box contains the attribute indication information of the camera group.
In one embodiment, the freeview video is encapsulated in at least one track, the at least one track comprising a first track containing image data of the freeview video; the first track corresponds to N camera groups, each camera group comprises one or more cameras, and N is a positive integer; the freeview camera group data box is encapsulated in the first track; the freeview camera group data box further contains a same camera group information flag field;
when the same camera group information flag field takes a first set value, it indicates that the image data contained in the first track all belongs to the N camera groups; when the same camera group information flag field takes a second set value, it indicates that at least one of the cameras corresponding to the image data contained in the first track belongs to a camera group other than the N camera groups.
In one embodiment, the freeview camera group data box includes a camera group number field and a camera group identification field; the camera group number field indicates the number of camera groups corresponding to the first track and takes the value N; each value of the camera group identification field indicates the identifier of one of the N camera groups.
In one embodiment, the freeview camera group data box further includes a complete camera group flag field; when the complete camera group flag field takes a first set value, it indicates that the first track contains image data collected by all cameras in the camera group indicated by the camera group identification field; when the complete camera group flag field takes a second set value, it indicates that the first track contains image data collected by only some of the cameras in the camera group indicated by the camera group identification field.
In one embodiment, when the same camera group information flag field takes the second set value, the freeview camera group data box further includes a camera number field and a camera identification field; the camera number field indicates the number of cameras contained in the camera group indicated by the camera group identification field and takes a value greater than or equal to 1; the camera identification field indicates the identifier of a camera in the camera group indicated by the camera group identification field.
In one implementation, the freeview video includes one or more freeview scenes; in the N camera groups, image data collected by all cameras in the same camera group belong to the same free view angle scene, and all cameras in the same camera group have the same coordinate system.
In one embodiment, the indication information of the camera group further comprises indication information of the camera parameters in the camera group; the metadata information further comprises a freeview data box, and the freeview data box comprises camera parameters and a repeated camera parameter indication field;
when the repeated camera parameter indication field takes a first set value, it indicates that the camera parameters in the freeview data box apply to all cameras; when the repeated camera parameter indication field takes a second set value, it indicates that different cameras in the freeview data box have their own camera parameters.
In one embodiment, the presentation indication information of the freeview video further includes freeview scene boundary indication information.
In one embodiment, the presentation indication information of the freeview video is metadata information, and the metadata information includes a freeview boundary data box; the freeview boundary data box contains the freeview scene boundary indication information.
In one embodiment, the freeview video is encapsulated in at least one track, the at least one track comprising a second track containing image data of the freeview video; the second track corresponds to P camera groups, where P is a positive integer; the freeview boundary data box is encapsulated in the second track;
the freeview boundary data box comprises a camera group flag field, which indicates the type of the display boundary corresponding to the image data in the second track; when the camera group flag field takes a first set value, the freeview boundary data box indicates the boundary information corresponding to the P camera groups; when the camera group flag field takes a second set value, the freeview boundary data box indicates the boundary information of the freeview video;
the freeview boundary data box further comprises a camera group number field and a camera group identification field, wherein the camera group number field is used for indicating the number of camera groups contained in the second track, and the value of the camera group number field is P; the camera group identification field is used to indicate an identifier of each camera group in the second track.
In one embodiment, the freeview video is associated with a jointly consumed non-freeview video, and the presentation indication information of the freeview video further includes association relationship indication information between the freeview video and the non-freeview video.
In one embodiment, the presentation indication information of the freeview video is metadata information, and the metadata information includes a freeview relation group data box; the freeview relation group data box contains the association relationship indication information between the freeview video and the non-freeview video.
In one embodiment, the freeview video is encapsulated in one or more third tracks and the non-freeview video is encapsulated in one or more fourth tracks; the third track and the fourth track are provided with the same track group type identifier;
the freeview relation group data box comprises the track group type identifier and a freeview video flag field; when the freeview video flag field takes a first set value, it indicates that the freeview video in the third track needs to be presented; when the freeview video flag field takes a second set value, it indicates that the non-freeview video in the fourth track needs to be presented.
In one embodiment, the freeview video is encapsulated in one or more third tracks and the non-freeview video is encapsulated in one or more fourth tracks; the free view video and the non-free view video belong to the same entity group;
the freeview relation group data box comprises the identifier of the entity group and a freeview video flag field; when the freeview video flag field takes a first set value, it indicates that the freeview video of the entities corresponding to the identifier of the entity group in the third track needs to be presented; when the freeview video flag field takes a second set value, it indicates that the non-freeview video of the entities corresponding to the identifier of the entity group in the fourth track needs to be presented.
In one embodiment, the freeview video is encapsulated in one or more third tracks and the non-freeview video is encapsulated in one or more fourth tracks; an index relation is arranged between the track identifier of the third track and the track identifier of the fourth track;
the index relationship means that the fourth track can be indexed through the track identifier of the third track, or that the third track can be indexed through the track identifier of the fourth track.
In one embodiment, if the freeview video is encapsulated in a plurality of tracks, the presentation indication information of the freeview video further includes association indication information between the plurality of tracks.
In one embodiment, the presentation indication information of the freeview video is metadata information, the metadata information includes a freeview track group data box, and the freeview track group data box includes association indication information among a plurality of tracks.
In one embodiment, a plurality of tracks are clustered into M track groups, each track carrying a track group identification; the tracks with the same track group identifier belong to the same track group; m is a positive integer.
In one embodiment, each track group corresponds to one freeview track group data box, and the freeview video includes one or more freeview scenes, each freeview scene being generated based on image data acquired by one or more camera groups; the freeview track group data box comprises a camera group flag field, which indicates whether a plurality of camera groups exists in the track group corresponding to the freeview track group data box;
when the camera group flag field takes a first set value, it indicates that all tracks in the track group corresponding to the freeview track group data box form a second freeview scene, the second freeview scene being generated based on image data acquired by a plurality of camera groups; when the camera group flag field takes a second set value, it indicates that all tracks in the track group corresponding to the freeview track group data box form a first freeview scene, the first freeview scene being generated based on image data acquired by one camera group;
the freeview track group data box further comprises a camera group number field and a camera group identification field; the camera group number field indicates the number of camera groups in the track group corresponding to the freeview track group data box and takes a value greater than or equal to 1; the camera group identification field indicates the identifiers of the camera groups in the track group corresponding to the freeview track group data box.
In one embodiment, when the freeview video is generated based on image data acquired by a plurality of camera groups, the indication information of the camera groups further includes presentation time indication information for each camera group.
In one embodiment, the presentation indication information of the freeview video is metadata information, the metadata information includes a camera group sample entry, and the camera group sample entry contains the presentation time indication information of each camera group.
In one embodiment, the camera group sample entry includes a camera group update flag field, a camera group identification field, and a camera group cancel flag field;
when the camera group update flag field corresponding to the current sample takes a first set value, it indicates that the camera group to which the current sample belongs, together with the information of that camera group, is specified by the camera group identification field corresponding to the current sample; when the camera group update flag field corresponding to the current sample takes a second set value, the camera group information corresponding to the current sample remains unchanged;
when the camera group cancel flag field corresponding to the current sample takes a first set value, it indicates that the camera group information corresponding to the target sample is canceled and no longer takes effect; the current sample is the sample being parsed; the target sample is the history sample, among the samples parsed before the current sample, whose camera group update flag field takes the first set value, whose camera group identification field is the same as the camera group identification field corresponding to the current sample, and for which the difference between its parsing time and the parsing time of the current sample is the minimum of the differences between the parsing time of each such history sample and the parsing time of the current sample;
when the camera group update flag field corresponding to the current sample takes the first set value, the camera group cancel flag field corresponding to the current sample shall not take the first set value; when the camera group information corresponding to the current sample remains unchanged, it follows the state of the target sample: if the camera group information corresponding to the target sample is in the effective state, the camera group information corresponding to the current sample is in the effective state; if the camera group information corresponding to the target sample is in the canceled state, the camera group information corresponding to the current sample is in the canceled state.
In one embodiment, the media file of the immersion media is sliced into a plurality of media segments; the acquisition unit 401 is further configured to:
acquire a transmission signaling file of the immersion media, the transmission signaling file containing the description information corresponding to the presentation indication information of the freeview video.
In one implementation, the freeview video includes one or more freeview scenes; the transmission signaling file comprises at least one of the following: a freeview camera descriptor, a freeview association descriptor;
wherein the freeview camera descriptor is used for indicating parameters of each camera in the one or more camera groups; the freeview association descriptor is used for indicating the association relation between the freeview video and the corresponding non-freeview video.
In one embodiment, the transmission signaling file contains a representation hierarchy and an adaptation set hierarchy, the adaptation set hierarchy including one or more representation hierarchies; when the description information exists in the target adaptive set hierarchy, the description information is used for describing all the representation hierarchies in the target adaptive set hierarchy; the description information is used to describe the target presentation hierarchy when it exists in the target presentation hierarchy.
In one embodiment, when acquiring the media file of the immersion media, the acquisition unit 401 is specifically configured to:
determine, according to the virtual viewing angle of the viewer of the immersion media and the description information in the transmission signaling file, the media segments corresponding to the image data required for viewing;
and pull the determined media segments by streaming.
In one embodiment, the processing unit 402 is configured to display the freeview video according to the presentation indication information of the freeview video, specifically configured to:
determining image data required for watching from a media file according to a virtual view angle of a viewer of the immersion medium when watching;
and decoding and displaying the image data according to the indication information of the camera group.
In one embodiment, the presentation indication information of the freeview video further includes freeview scene boundary indication information; the processing unit 402 is configured to display the freeview video according to the presentation indication information of the freeview video, specifically configured to:
In decoding the image data, a display boundary of the image data is determined in accordance with the freeview scene boundary indication information.
In one embodiment, if the freeview video is associated with a jointly consumed non-freeview video, the presentation indication information of the freeview video further includes association relation indication information between the freeview video and the non-freeview video; the processing unit 402 is configured to display the freeview video according to the presentation indication information of the freeview video, and is specifically configured to:
in the process of displaying the image data, switching the freeview video to the non-freeview video according to the association relation indication information between the freeview video and the non-freeview video; or,
switching the non-freeview video to the freeview video according to the association relation indication information between the freeview video and the non-freeview video.
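The switch itself can be as simple as toggling between two associated track identifiers, as in this toy Python sketch; the AssociationInfo structure and its field names are assumptions for illustration, not part of any file format.

```python
# Illustrative toy switch driven by the association indication; names assumed.
from dataclasses import dataclass

@dataclass
class AssociationInfo:
    freeview_track_id: int
    non_freeview_track_id: int

def next_track(current_track_id: int, assoc: AssociationInfo) -> int:
    """Switch between the associated freeview and non-freeview tracks."""
    if current_track_id == assoc.freeview_track_id:
        return assoc.non_freeview_track_id
    return assoc.freeview_track_id

assoc = AssociationInfo(freeview_track_id=3, non_freeview_track_id=7)
print(next_track(3, assoc))  # -> 7 (freeview -> non-freeview)
print(next_track(7, assoc))  # -> 3 (non-freeview -> freeview)
```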
In one embodiment, the image data is acquired by a plurality of camera groups; the indication information of the camera groups also comprises presentation time indication information of each camera group respectively; the processing unit 402 is configured to decode and display the image data according to the indication information of the camera groups, and is specifically configured to:
sequentially decoding and displaying, according to the presentation time indication information of each camera group, the image data corresponding to that camera group.
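For example, the ordering step can be sketched as sorting camera groups by their declared presentation times before decoding; the times and units below are hypothetical.

```python
# Illustrative: order camera groups by their declared presentation times.
from dataclasses import dataclass
from typing import List

@dataclass
class GroupSchedule:
    group_id: int
    start_time: float  # seconds; hypothetical unit

def presentation_order(schedules: List[GroupSchedule]) -> List[int]:
    """Return camera-group ids in the order they should be decoded/shown."""
    return [s.group_id for s in sorted(schedules, key=lambda s: s.start_time)]

print(presentation_order([GroupSchedule(2, 5.0), GroupSchedule(1, 0.0)]))  # [1, 2]
```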
According to one embodiment of the present application, some of the steps involved in the data processing method of the immersion medium shown in fig. 2 may be performed by the units in the data processing apparatus of the immersion medium shown in fig. 4. For example, step S201 shown in fig. 2 may be performed by the acquisition unit 401 shown in fig. 4, and step S202 may be performed by the processing unit 402 shown in fig. 4. The units in the data processing apparatus of the immersion medium shown in fig. 4 may be individually or collectively combined into one or several additional units, or some unit(s) thereof may be further split into a plurality of functionally smaller units, which can achieve the same operation without affecting the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present application, the data processing apparatus of the immersion medium may also include other units, and in practical applications these functions may also be assisted by other units and implemented cooperatively by a plurality of units.
According to another embodiment of the present application, the data processing apparatus of the immersion medium shown in fig. 4 may be constructed by running, on a general-purpose computing device (such as a computer including processing elements such as a central processing unit (CPU) and storage elements such as a random access memory (RAM) and a read-only memory (ROM)), a computer program (including program code) capable of executing the steps involved in the method shown in fig. 2, thereby implementing the data processing method of the immersion medium according to the embodiments of the present application. The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and run on the above computing device through the computer-readable recording medium.
Based on the same inventive concept, the principle and beneficial effects of the data processing device for immersion medium provided in the embodiments of the present application are similar to those of the data processing method for immersion medium in the embodiments of the present application, and may refer to the principle and beneficial effects of implementation of the method, and are not repeated here for brevity.
Referring to fig. 5, fig. 5 is a schematic structural diagram of another data processing apparatus of an immersion medium according to an exemplary embodiment of the present application; the data processing apparatus of the immersion medium may be a computer program (comprising program code) running in the content production device, for example, the data processing apparatus of the immersion medium may be application software in the content production device. As shown in fig. 5, the data processing apparatus of the immersion medium includes an acquisition unit 501 and a processing unit 502. Referring to fig. 5, the detailed descriptions of the respective units are as follows:
An acquiring unit 501, configured to acquire image data acquired by one or more camera groups, and encode the image data into a freeview video;
a processing unit 502, configured to add presentation indication information to the freeview video according to the application form of the freeview video, the presentation indication information of the freeview video including indication information of the camera groups; and to encapsulate the freeview video and the presentation indication information of the freeview video into a media file of the immersion media.
In one embodiment, the processing unit 502 is further configured to:
slicing the media file to obtain a plurality of media fragments; the method comprises the steps of,
generating a transmission signaling file of the immersion medium, wherein the transmission signaling file contains descriptive information corresponding to presentation indication information of the free view video.
According to one embodiment of the present application, some of the steps involved in the data processing method of the immersion medium shown in fig. 3 may be performed by the units in the data processing apparatus of the immersion medium shown in fig. 5. For example, step S301 shown in fig. 3 may be performed by the acquisition unit 501 shown in fig. 5, and steps S302 and S303 may be performed by the processing unit 502 shown in fig. 5. The units in the data processing apparatus of the immersion medium shown in fig. 5 may be individually or collectively combined into one or several additional units, or some unit(s) thereof may be further split into a plurality of functionally smaller units, which can achieve the same operation without affecting the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present application, the data processing apparatus of the immersion medium may also include other units, and in practical applications these functions may also be assisted by other units and implemented cooperatively by a plurality of units.
According to another embodiment of the present application, the data processing apparatus of the immersion medium shown in fig. 5 may be constructed by running, on a general-purpose computing device (such as a computer including processing elements such as a central processing unit (CPU) and storage elements such as a random access memory (RAM) and a read-only memory (ROM)), a computer program (including program code) capable of executing the steps involved in the method shown in fig. 3, thereby implementing the data processing method of the immersion medium according to the embodiments of the present application. The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and run on the above computing device through the computer-readable recording medium.
Based on the same inventive concept, the principle and beneficial effects of the data processing device for immersion medium provided in the embodiments of the present application are similar to those of the data processing method for immersion medium in the embodiments of the present application, and may refer to the principle and beneficial effects of implementation of the method, and are not repeated here for brevity.
Fig. 6 is a schematic structural diagram of a content consumption device according to an exemplary embodiment of the present application; the content consumption device may refer to a computer device used by a user of the immersion media, which may be a terminal (e.g., a PC, a smart mobile device (e.g., a smartphone), or a VR device (e.g., a VR headset, VR glasses, etc.)). As shown in fig. 6, the content consumption device comprises a receiver 601, a processor 602, a memory 603, and a display/playback device 604. Wherein:
The receiver 601 is used to enable the content consumption device to interact with other devices in transmission, and in particular to enable transmission of the immersion media between the content production device and the content consumption device. That is, the content consumption device receives the relevant media resources of the immersion media transmitted by the content production device through the receiver 601.
The processor 602 (or CPU (Central Processing Unit)) is the processing core of the content consumption device; the processor 602 is adapted to implement one or more program instructions, in particular to load and execute the one or more program instructions to implement the flow of the data processing method of the immersion medium shown in fig. 2.
The memory 603 is a memory device in the content consumption device for storing programs and media resources. It will be appreciated that the memory 603 here may include both a built-in storage medium of the content consumption device and an extended storage medium supported by the content consumption device. It should be noted that the memory 603 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory; optionally, it may also be at least one memory located remotely from the aforementioned processor. The memory 603 provides a storage space that stores the operating system of the content consumption device. The storage space also stores a computer program comprising program instructions adapted to be invoked and executed by the processor to perform the steps of the data processing method of the immersion medium. In addition, the memory 603 may also be used to store a three-dimensional image of the immersion media formed after processing by the processor, audio content corresponding to the three-dimensional image, information required for rendering the three-dimensional image and the audio content, and the like.
The display/play device 604 is used for outputting the rendered sound and three-dimensional image.
Referring again to fig. 6, the processor 602 may include a parser 621, a decoder 622, a converter 623, and a renderer 624; wherein:
the parser 621 is configured to decapsulate the encapsulated file of the immersion media from the content production device, specifically to decapsulate the media file resources according to the file format requirements of the immersion media to obtain an audio code stream and a video code stream, and to provide the audio and video code streams to the decoder 622.
The decoder 622 decodes the audio code stream to obtain the audio content and provides it to the renderer for audio rendering. In addition, the decoder 622 decodes the video code stream to obtain a 2D image. According to the metadata provided by the media presentation description information, if the metadata indicates that the immersion media has undergone a region encapsulation process, the 2D image refers to an encapsulated image; if the metadata indicates that the immersion media has not undergone the region encapsulation process, the 2D image refers to a projected image.
The converter 623 is used to convert the 2D image into a 3D image. If the immersion media has undergone the region encapsulation process, the converter 623 first performs region decapsulation on the encapsulated image to obtain a projected image, and then reconstructs the projected image to obtain a 3D image. If the immersion media has not undergone the region encapsulation process, the converter 623 directly reconstructs the projected image into a 3D image.
The renderer 624 is used to render the 3D image and the audio content of the immersion media. Specifically, it renders the audio content and the 3D image according to the rendering- and viewport-related metadata in the media presentation description information, and outputs the rendered result to the display/playback device.
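Put together, the parser-decoder-converter-renderer flow can be sketched as a simple pipeline; everything below (function names, the region_packed flag, the string stand-ins for streams) is illustrative only and does not reflect real codec or container APIs.

```python
# A toy end-to-end flow mirroring parser -> decoder -> converter -> renderer.
def parse(file_bytes: bytes) -> dict:
    # Stand-in for decapsulating a media file into its elementary streams.
    return {"audio": b"...", "video": b"...", "region_packed": True}

def decode(streams: dict) -> dict:
    # Stand-in for audio/video decoding; yields a 2D image representation.
    return {"audio_pcm": "pcm",
            "image_2d": "packed" if streams["region_packed"] else "projected"}

def convert(decoded: dict) -> str:
    # Region-unpack first when needed, then reconstruct a 3D image.
    img = decoded["image_2d"]
    if img == "packed":
        img = "projected"   # region decapsulation
    return "3d_image"       # reconstruction

def render(image_3d: str, audio: str) -> None:
    print(f"rendering {image_3d} with {audio}")

streams = parse(b"\x00")
decoded = decode(streams)
render(convert(decoded), decoded["audio_pcm"])
```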
In one exemplary embodiment, the processor 602 (and in particular the devices contained by the processor) performs the steps of the data processing method of the immersion medium shown in FIG. 2 by invoking one or more instructions in a memory. Specifically, the memory stores one or more first instructions adapted to be loaded by the processor 602 and to perform the steps of:
acquiring a media file of the immersed media, wherein the media file contains presentation indication information of free view video, and the free view video is generated based on image data acquired by one or more camera groups; the display indication information of the free view video comprises indication information of a camera group;
and displaying the freeview video according to the display indication information of the freeview video.
In one embodiment, the indication information of the camera group includes attribute indication information of the camera group; the display indication information of the freeview video is metadata information, and the metadata information includes a freeview camera group data box; the freeview camera group data box contains the attribute indication information of the camera group.
In one embodiment, the freeview video is encapsulated in at least one track, the at least one track comprising a first track containing image data of the freeview video; the first track corresponds to N camera groups, each camera group comprising one or more cameras, N being a positive integer; the freeview camera group data box is encapsulated into the first track; the freeview camera group data box also contains a same camera group information flag field;
when the same camera group information flag field takes the first set value, it indicates that the image data contained in the first track all belong to the N camera groups; when the same camera group information flag field takes the second set value, it indicates that, among the cameras corresponding to the image data contained in the first track, at least one camera belongs to a camera group other than the N camera groups.
In one embodiment, the freeview camera group data box includes a camera group number field and a camera group identification field; the camera group number field is used to indicate the number of camera groups corresponding to the first track, and the value of the camera group number field is N; one value of the camera group identification field is used to indicate the identifier of one of the N camera groups.
In one embodiment, the freeview camera group data box further includes a complete camera group flag field; when the complete camera group flag field takes the first set value, it indicates that the first track contains the image data acquired by all cameras in the camera group indicated by the camera group identification field; when the complete camera group flag field takes the second set value, it indicates that the first track contains the image data acquired by only some of the cameras in the camera group indicated by the camera group identification field.
In one embodiment, when the value of the same camera group information flag field is the second set value, the freeview camera group data box further includes a camera number field and a camera identification field, where the camera number field is used to indicate the number of cameras included in the camera group indicated by the camera group identification field, and the value of the camera number field is greater than or equal to 1; the camera identification field is used to indicate an identifier of a camera in the camera group indicated by the camera group identification field.
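To visualize how these flag and count fields might be consumed together, here is a toy Python parser over a hypothetical byte layout; the real box syntax and field widths are defined by the media file format, so the layout, field order, and widths here are all assumptions of this sketch.

```python
# Hypothetical byte layout loosely following the fields described above.
import struct
from io import BytesIO

def parse_freeview_camera_group_box(buf: BytesIO) -> dict:
    same_group_flag, group_count = struct.unpack(">BB", buf.read(2))
    groups = []
    for _ in range(group_count):
        group_id, full_group_flag = struct.unpack(">IB", buf.read(5))
        cameras = []
        if same_group_flag != 1:          # second set value: mixed groups
            (camera_count,) = struct.unpack(">B", buf.read(1))
            for _ in range(camera_count):
                (camera_id,) = struct.unpack(">I", buf.read(4))
                cameras.append(camera_id)
        groups.append({"group_id": group_id,
                       "full_group": full_group_flag == 1,
                       "camera_ids": cameras})
    return {"same_camera_group_flag": same_group_flag, "groups": groups}

raw = struct.pack(">BBIBBI", 0, 1, 42, 0, 1, 7)   # one partial group, one camera
print(parse_freeview_camera_group_box(BytesIO(raw)))
```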
In one embodiment, the freeview video includes one or more freeview scenes; in the N camera groups, image data collected by all cameras in the same camera group belong to the same free view angle scene, and all cameras in the same camera group have the same coordinate system.
In one embodiment, the indication information of the camera group further includes camera parameter indication information of the cameras in the camera group; the metadata information also includes a freeview data box, and the freeview data box includes camera parameters and a repeated camera parameter indication field;
when the repeated camera parameter indication field takes the first set value, it indicates that the camera parameters in the freeview data box apply to all cameras; when the repeated camera parameter indication field takes the second set value, it indicates that different cameras in the freeview data box have their own respective camera parameters.
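A tiny sketch of the two branches of the repeated camera parameter indication field follows, using 1 as the first set value and illustrative parameter dictionaries; both the flag value and the parameter names are assumptions.

```python
# Toy expansion of the duplicated-parameter flag: one shared parameter set
# versus per-camera parameters.
def expand_camera_params(duplicated_flag: int, params, camera_ids):
    if duplicated_flag == 1:                 # first set value: shared params
        return {cid: params for cid in camera_ids}
    # second set value: params is expected to be a per-camera mapping
    return {cid: params[cid] for cid in camera_ids}

print(expand_camera_params(1, {"focal": 35}, [1, 2]))
print(expand_camera_params(0, {1: {"focal": 35}, 2: {"focal": 50}}, [1, 2]))
```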
In one embodiment, the presentation indication information of the freeview video further includes freeview scene boundary indication information.
In one embodiment, the presentation indication information of the freeview video is metadata information, and the metadata information includes a freeview boundary data box; the freeview boundary data box contains the freeview scene boundary indication information.
In one embodiment, the freeview video is encapsulated in at least one track, the at least one track comprising a second track containing image data of the freeview video; the second track corresponds to P camera groups, P being a positive integer; the freeview boundary data box is encapsulated into the second track;
the freeview boundary data box includes a camera group flag field, which is used to indicate the type of display boundary corresponding to the image data in the second track; when the camera group flag field takes the first set value, it indicates that the freeview boundary data box is used to indicate the boundary information corresponding to the P camera groups; when the camera group flag field takes the second set value, it indicates that the freeview boundary data box is used to indicate the boundary information of the freeview video;
the freeview boundary data box further comprises a camera group number field and a camera group identification field, wherein the camera group number field is used for indicating the number of camera groups contained in the second track, and the value of the camera group number field is P; the camera group identification field is used to indicate an identifier of each camera group in the second track.
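The two boundary modes can be illustrated with a small Python structure; the six-number axis-aligned box used here is just a stand-in for whatever boundary geometry the format actually carries, and the flag value 1 as the first set value is likewise an assumption.

```python
# Toy interpretation of the boundary flag: per-group boundaries versus one
# video-level boundary.
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Box3D = Tuple[float, float, float, float, float, float]  # min/max xyz

@dataclass
class FreeviewBoundaryBox:
    camera_group_flag: int                 # 1: per-group, else: whole video
    group_boundaries: Dict[int, Box3D]
    video_boundary: Optional[Box3D]

def display_boundary(box: FreeviewBoundaryBox, group_id: int) -> Optional[Box3D]:
    if box.camera_group_flag == 1:
        return box.group_boundaries[group_id]
    return box.video_boundary

b = FreeviewBoundaryBox(1, {5: (0, 0, 0, 1, 1, 1)}, None)
print(display_boundary(b, 5))
```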
In one embodiment, if the freeview video is associated with a jointly consumed non-freeview video, the presentation indication information of the freeview video further includes association relation indication information between the freeview video and the non-freeview video.
In one embodiment, the display indication information of the freeview video is metadata information, and the metadata information includes a freeview relation group data box; the freeview relation group data box contains the association relation indication information between the freeview video and the non-freeview video.
In one embodiment, the freeview video is encapsulated in one or more third tracks and the non-freeview video is encapsulated in one or more fourth tracks; the third track and the fourth track are provided with the same track group type identifier;
the freeview relation group data box includes the track group type identifier and a freeview video flag field; when the freeview video flag field takes the first set value, it indicates that the freeview video in the third track needs to be presented; when the freeview video flag field takes the second set value, it indicates that the non-freeview video in the fourth track needs to be presented.
In one embodiment, the freeview video is encapsulated in one or more third tracks and the non-freeview video is encapsulated in one or more fourth tracks; the free view video and the non-free view video belong to the same entity group;
the freeview relation group data box includes the identifier of the entity group and a freeview video flag field; when the freeview video flag field takes the first set value, it indicates that the freeview video of the entity corresponding to the identifier of the entity group in the third track needs to be presented; when the freeview video flag field takes the second set value, it indicates that the non-freeview video of the entity corresponding to the identifier of the entity group in the fourth track needs to be presented.
In one embodiment, the freeview video is encapsulated in one or more third tracks and the non-freeview video is encapsulated in one or more fourth tracks; an index relation is arranged between the track identifier of the third track and the track identifier of the fourth track;
the index relationship means that the fourth track can be indexed through the track identifier of the third track, or that the third track can be indexed through the track identifier of the fourth track.
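Either direction of the index can be resolved with a simple lookup, as sketched below; the track_refs mapping is an assumption of this illustration rather than a defined structure in the file format.

```python
# Toy track index: either direction may carry the reference, per the text.
def resolve_associated_track(track_refs: dict, track_id: int):
    """track_refs maps a track id to the id it indexes (illustrative only)."""
    if track_id in track_refs:
        return track_refs[track_id]
    for src, dst in track_refs.items():   # reverse lookup
        if dst == track_id:
            return src
    return None

print(resolve_associated_track({3: 7}, 3))  # -> 7 (third -> fourth track)
print(resolve_associated_track({3: 7}, 7))  # -> 3 (reverse lookup)
```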
In one embodiment, if the freeview video is encapsulated in a plurality of tracks, the presentation indication information of the freeview video further includes association indication information between the plurality of tracks.
In one embodiment, the presentation indication information of the freeview video is metadata information, the metadata information includes a freeview track group data box, and the freeview track group data box includes association indication information between a plurality of tracks.
In one embodiment, a plurality of tracks are clustered into M track groups, each track carrying a track group identification; the tracks with the same track group identifier belong to the same track group; m is a positive integer.
In one embodiment, each track group corresponds to a freeview track group data box, the freeview video includes one or more freeview scenes, and each freeview scene is generated based on image data acquired by one or more camera groups; the freeview track group data box includes a camera group flag field, which is used to indicate whether a plurality of camera groups exist in the track group corresponding to the freeview track group data box;
when the camera group flag field takes the first set value, it indicates that all tracks in the track group corresponding to the freeview track group data box constitute a first freeview scene, the first freeview scene being generated based on image data acquired by one camera group; when the camera group flag field takes the second set value, it indicates that all tracks in the track group corresponding to the freeview track group data box constitute a second freeview scene, the second freeview scene being generated based on image data acquired by a plurality of camera groups;
the freeview track group data box also includes a camera group number field and a camera group identification field; the camera group number field is used to indicate the number of camera groups in the track group corresponding to the freeview track group data box, and its value is greater than or equal to 1; the camera group identification field is used to indicate the identifiers of the camera groups in the track group corresponding to the freeview track group data box.
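The following sketch groups tracks by a shared track group identifier and derives the camera group flag from how many camera groups the group's tracks cover; the tuple layout and the flag values (1 for a single camera group, 2 for several) are illustrative assumptions.

```python
# Toy grouping of tracks by track_group_id, plus the per-group camera flag.
from collections import defaultdict

def build_track_groups(tracks):
    """tracks: list of (track_id, track_group_id, camera_group_ids)."""
    groups = defaultdict(list)
    for track_id, tg_id, cam_groups in tracks:
        groups[tg_id].append((track_id, cam_groups))
    result = {}
    for tg_id, members in groups.items():
        cam_group_ids = {g for _, cams in members for g in cams}
        result[tg_id] = {
            "tracks": [t for t, _ in members],
            # first set value (single camera group) vs second (multiple groups)
            "camera_group_flag": 1 if len(cam_group_ids) == 1 else 2,
            "camera_group_ids": sorted(cam_group_ids),
        }
    return result

print(build_track_groups([(1, 10, [100]), (2, 10, [100]), (3, 20, [100, 200])]))
```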
In one embodiment, when the freeview video is generated based on image data acquired by a plurality of camera groups, the indication information of the camera groups further includes separate presentation time indication information for each camera group.
In one embodiment, the presentation indication information of the freeview video is metadata information, the metadata information includes a camera group sample entity, and the camera group sample entity includes presentation time indication information of each camera group.
In one embodiment, a camera group sample entity includes a camera group update flag field, a camera group identification field, and a camera group cancel flag field;
when the camera group update flag field corresponding to the current sample takes the first set value, it indicates that the current sample carries camera group information, and the camera group to which the current sample belongs is specified by the camera group identification field corresponding to the current sample; when the camera group update flag field corresponding to the current sample takes the second set value, the camera group information corresponding to the current sample remains unchanged;
when the value of the camera group cancel flag field corresponding to the current sample is the first set value, it indicates that the camera group information corresponding to the target sample is canceled, so that the camera group information corresponding to the target sample no longer takes effect; the current sample refers to the sample being parsed; the target sample refers to the history sample, among the history samples parsed before the current sample, whose camera group update flag field takes the first set value, whose camera group identification field is the same as the camera group identification field corresponding to the current sample, and whose parsing time is closest to that of the current sample, i.e., the difference between the parsing time of the target sample and the parsing time of the current sample is the minimum of the differences between the parsing times of such history samples and the parsing time of the current sample;
when the value of the camera group update flag field corresponding to the current sample is the first set value, the value of the camera group cancel flag field corresponding to the current sample must not be the first set value. When the camera group information corresponding to the current sample remains unchanged: if the camera group information corresponding to the target sample is in the effective state, the camera group information corresponding to the current sample is also in the effective state; if the camera group information corresponding to the target sample is in the canceled state, the camera group information corresponding to the current sample is also in the canceled state.
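To make the update/cancel interplay concrete, here is a toy state tracker that collapses the target-sample lookup into a per-group state table; using 1 as the first set value and a single dictionary of states are simplifying assumptions of this sketch.

```python
# Toy state tracking for the update/cancel flags across parsed samples;
# field names are paraphrased from the description, not normative syntax.
def apply_sample(state: dict, update_flag: int, cancel_flag: int,
                 group_id: int) -> dict:
    """state maps group_id -> 'effective' | 'canceled'."""
    if update_flag == 1:                       # new/updated group info
        state[group_id] = "effective"
    elif cancel_flag == 1:                     # cancel the most recent info
        state[group_id] = "canceled"
    # otherwise: unchanged; the group keeps its previous state
    return state

s = {}
s = apply_sample(s, 1, 0, 9)   # declare group 9
s = apply_sample(s, 0, 0, 9)   # unchanged -> still effective
s = apply_sample(s, 0, 1, 9)   # cancel -> no longer effective
print(s)                       # {9: 'canceled'}
```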
In one embodiment, a media file of an immersion medium is sliced into a plurality of media fragments; the computer program in the memory 603 is loaded by the processor 602 and further performs the steps of:
and acquiring a transmission signaling file of the immersion medium, wherein the transmission signaling file contains description information corresponding to the presentation indication information of the free view video.
In one embodiment, the freeview video includes one or more freeview scenes; the transmission signaling file comprises at least one of the following: a freeview camera descriptor, a freeview association descriptor;
wherein the freeview camera descriptor is used for indicating parameters of each camera in the one or more camera groups; the freeview association descriptor is used for indicating the association relation between the freeview video and the corresponding non-freeview video.
In one embodiment, the transmission signaling file includes a representation hierarchy and an adaptation set hierarchy, the adaptation set hierarchy including one or more representation hierarchies; when the description information exists in a target adaptation set hierarchy, the description information is used for describing all the representation hierarchies in the target adaptation set hierarchy; when the description information exists in a target representation hierarchy, the description information is used for describing that target representation hierarchy.
In one embodiment, the specific implementation of the processor 602 obtaining the media file of the immersion medium is:
determining, according to the virtual viewing angle of the viewer of the immersion media when viewing and the description information in the transmission signaling file, the media fragments corresponding to the image data required for viewing; and
pulling the determined media fragments in a streaming transmission manner.
In one embodiment, the processor 602 displays the freeview video according to the presentation indication information of the freeview video in the following specific manner:
determining, from the media file, the image data required for viewing according to the virtual viewing angle of the viewer of the immersion media when viewing;
and decoding and displaying the image data according to the indication information of the camera group.
In one embodiment, the presentation indication information of the freeview video further includes freeview scene boundary indication information; the processor 602 displays the freeview video according to the presentation indication information of the freeview video, where the specific implementation manner of displaying the freeview video is:
In decoding the image data, a display boundary of the image data is determined in accordance with the freeview scene boundary indication information.
In one embodiment, if the freeview video is associated with a jointly consumed non-freeview video, the presentation indication information of the freeview video further includes association relation indication information between the freeview video and the non-freeview video; the processor 602 displays the freeview video according to the presentation indication information of the freeview video in the following specific manner:
in the process of displaying the image data, switching the freeview video to the non-freeview video according to the association relation indication information between the freeview video and the non-freeview video; or,
switching the non-freeview video to the freeview video according to the association relation indication information between the freeview video and the non-freeview video.
In one embodiment, the image data is acquired by a plurality of camera groups; the indication information of the camera groups also comprises presentation time indication information of each camera group respectively; the specific implementation manner in which the processor 602 decodes and displays the image data according to the indication information of the camera groups is as follows:
sequentially decoding and displaying, according to the presentation time indication information of each camera group, the image data corresponding to that camera group.
Based on the same inventive concept, the principles and beneficial effects of the content consumption device provided in the embodiments of the present application in solving problems are similar to those of the data processing method of the immersion medium in the method embodiments of the present application; reference may be made to the principles and beneficial effects of the implementation of the method, which are not repeated here for brevity.
Fig. 7 is a schematic diagram showing a structure of a content producing apparatus according to an exemplary embodiment of the present application; the content production device may refer to a computer device used by a provider of the immersion medium, which may be a terminal (e.g., a PC, a smart mobile device (e.g., a smart phone), etc.) or a server. As shown in fig. 7, the content production device includes a capture device 701, a processor 702, a memory 703, and a transmitter 704. Wherein:
the capture device 701 is used to capture real-world audio-visual scenes to obtain raw data of the immersion medium (including audio content and video content that remain synchronized in time and space). The capture device 701 may include, but is not limited to: audio device, camera device and sensing device. The audio device may include, among other things, an audio sensor, a microphone, etc. The image pickup apparatus may include a general camera, a stereo camera, a light field camera, and the like. The sensing device may include a laser device, a radar device, etc.
The processor 702 (or CPU (Central Processing Unit)) is the processing core of the content production device; the processor 702 is adapted to implement one or more program instructions, in particular to load and execute the one or more program instructions to implement the flow of the data processing method of the immersion medium shown in fig. 3.
The memory 703 is a memory device in the content production device for storing programs and media resources. It will be appreciated that the memory 703 here may include both a built-in storage medium of the content production device and an extended storage medium supported by the content production device. It should be noted that the memory may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory; optionally, it may also be at least one memory located remotely from the aforementioned processor. The memory provides a storage space that stores the operating system of the content production device. The storage space also stores a computer program comprising program instructions adapted to be invoked and executed by the processor to perform the steps of the data processing method of the immersion medium. In addition, the memory 703 may also be used to store the immersion media file formed after processing by the processor, the immersion media file including media file resources and media presentation description information.
The transmitter 704 is used to enable the content production device to interact with other devices in transmission, and in particular to enable transmission of the immersion media between the content production device and the content playback device. That is, the content production device transmits the relevant media resources of the immersion media to the content playback device through the transmitter 704.
Referring again to fig. 7, the processor 702 may include a converter 721, an encoder 722, and an encapsulator 723; wherein:
the converter 721 is used to perform a series of conversion processes on the captured video content, turning it into content suitable for the video encoding to be performed on the immersion media. The conversion process may include stitching and projection; optionally, the conversion process further includes region encapsulation. The converter 721 may convert the captured 3D video content into 2D images and provide them to the encoder for video encoding.
The encoder 722 is used to encode the captured audio content to form an audio code stream of the immersion media, and to perform video encoding on the 2D images converted by the converter 721 to obtain a video code stream.
The encapsulator 723 is configured to encapsulate the audio code stream and the video code stream in a file container according to the file format of the immersion media (e.g., ISOBMFF) to form the media file resources of the immersion media, where a media file resource may be a complete media file or media file segments of the immersion media; and to record the metadata of the media file resources of the immersion media using the media presentation description information in accordance with the file format requirements of the immersion media. The encapsulated file of the immersion media obtained by the encapsulator can be stored in the memory and provided to the content playback device on demand for presentation of the immersion media.
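The converter-encoder-encapsulator chain can likewise be sketched end to end; the functions below are stand-ins with made-up data, not real encoding or ISOBMFF packaging APIs.

```python
# Toy production flow mirroring converter -> encoder -> encapsulator.
def convert_3d_to_2d(frames_3d):
    return [f"2d({f})" for f in frames_3d]       # stitch/project stand-in

def encode(frames_2d):
    return b"|".join(f.encode() for f in frames_2d)

def encapsulate(video_stream: bytes, audio_stream: bytes) -> dict:
    # Stand-in for packing streams into an ISOBMFF-style container with
    # presentation indication metadata alongside the tracks.
    return {"tracks": {"video": video_stream, "audio": audio_stream},
            "metadata": {"presentation_indication": {"camera_groups": [1]}}}

media_file = encapsulate(encode(convert_3d_to_2d(["f0", "f1"])), b"aac")
print(sorted(media_file["tracks"]))  # ['audio', 'video']
```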
The processor 702, and in particular the devices contained in the processor, performs the steps of the data processing method of the immersion medium shown in fig. 3 by invoking one or more instructions in the memory. Specifically, the memory 703 stores one or more first instructions adapted to be loaded by the processor 702 to perform the following steps:
acquiring image data acquired by one or more camera groups, and encoding the image data into a free view video;
adding presentation indication information for the freeview video according to the application form of the freeview video; the display indication information of the free view video comprises indication information of a camera group;
and packaging the freeview video and the presentation indication information of the freeview video into a media file of the immersion media.
In one embodiment, the computer program in memory 703 is loaded by processor 702 and further performs the steps of:
slicing the media file to obtain a plurality of media fragments; the method comprises the steps of,
generating a transmission signaling file of the immersion medium, wherein the transmission signaling file contains descriptive information corresponding to presentation indication information of the free view video.
Based on the same inventive concept, the principle and beneficial effects of the content creation device provided in the embodiments of the present application for solving the problems are similar to those of the method for processing the immersion medium in the embodiments of the present application, and may refer to the principle and beneficial effects of implementation of the method, which are not described herein for brevity.
Embodiments of the present application also provide a computer readable storage medium having stored therein one or more instructions adapted to be loaded by a processor and to perform the data processing method of the immersion medium of the method embodiments described above.
Embodiments of the present application also provide a computer program product comprising instructions that, when run on a computer, cause the computer to perform the data processing method of the immersion medium of the method embodiments described above.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the computer device to perform the data processing method of the immersion medium described above.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device of the embodiment of the application can be combined, divided and deleted according to actual needs.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, which may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
The above disclosure describes only preferred embodiments of the present application, which are not intended to limit the scope of the claims of the application; those skilled in the art will understand that all or part of the processes for implementing the above embodiments, and equivalent variations made in accordance with the claims of the application, still fall within the scope covered by the application.

Claims (37)

1. A method of data processing of an immersion medium, comprising:
acquiring a media file of an immersed media, wherein the media file contains presentation indication information of a free view video, and the free view video is generated based on image data acquired by one or more camera groups; the display indication information of the free view video comprises indication information of the camera group;
and displaying the free view video according to the display indication information of the free view video.
2. The method of claim 1, wherein the indication information of the camera group includes attribute indication information of the camera group; the display indication information of the freeview video is metadata information, and the metadata information comprises a freeview camera group data box; the freeview camera group data box comprises the attribute indication information of the camera group.
3. The method of claim 2, wherein the freeview video is packaged in at least one track, the at least one track comprising a first track comprising image data of the freeview video; the first track corresponds to N camera groups, each camera group comprises one or more cameras, and N is a positive integer; the freeview camera group data box is packaged into the first track; the freeview camera group data box also comprises a same camera group information flag field;
when the same camera group information flag field takes a first set value, it indicates that the image data contained in the first track all belong to the N camera groups; when the same camera group information flag field takes a second set value, it indicates that, among the cameras corresponding to the image data contained in the first track, at least one camera belongs to a camera group other than the N camera groups.
4. The method of claim 3, wherein the freeview camera group data box includes a camera group number field and a camera group identification field, the camera group number field being used to indicate a number of camera groups corresponding to the first track, the camera group number field having a value of N; a value of the camera group identification field is used to indicate an identifier of one of the N camera groups.
5. The method of claim 4, wherein the freeview camera group data box further includes a complete camera group flag field;
when the complete camera group flag field takes a first set value, it indicates that the first track contains the image data acquired by all cameras in the camera group indicated by the camera group identification field; when the complete camera group flag field takes a second set value, it indicates that the first track contains the image data acquired by only some of the cameras in the camera group indicated by the camera group identification field.
6. The method of claim 4, wherein when the same camera group information flag field has a value of a second set value, the freeview camera group data box further includes a camera number field for indicating the number of cameras included in the camera group indicated by the camera group identification field and a camera identification field, the camera number field having a value of greater than or equal to 1; the camera identification field is used to indicate an identifier of a camera in the camera group indicated by the camera group identification field.
7. The method of claim 3, wherein the freeview video includes one or more freeview scenes; and in the N camera groups, the image data collected by all cameras in the same camera group belong to the same free view angle scene, and all cameras in the same camera group have the same coordinate system.
8. The method of claim 1, wherein the presentation indication information of the freeview video further comprises freeview scene boundary indication information.
9. The method of claim 8, wherein the presentation indication information of the freeview video is metadata information, and the metadata information includes a freeview boundary data box; the freeview boundary data box comprises the freeview scene boundary indication information.
10. The method of claim 9, wherein the freeview video is packaged in at least one track, the at least one track comprising a second track comprising image data of the freeview video; the second track corresponds to P camera groups, and P is a positive integer; the freeview boundary data box is encapsulated into the second track;
the freeview boundary data box comprises a camera group flag field, wherein the camera group flag field is used for indicating the type of display boundary corresponding to the image data in the second track; when the camera group flag field takes a first set value, the freeview boundary data box is used for indicating the boundary information corresponding to the P camera groups; when the camera group flag field takes a second set value, the freeview boundary data box is used for indicating the boundary information of the freeview video;
The freeview boundary data box further comprises a camera group number field and a camera group identification field, wherein the camera group number field is used for indicating the number of camera groups contained in the second track, and the value of the camera group number field is P; the camera group identification field is used to indicate an identifier of each camera group in the second track.
11. The method of claim 1, wherein if the freeview video is associated with a jointly consumed non-freeview video, the presentation indication information of the freeview video further includes association relation indication information between the freeview video and the non-freeview video.
12. The method of claim 11, wherein the presentation indication information of the freeview video is metadata information, and the metadata information includes a freeview relation group data box; the freeview relation group data box comprises the association relation indication information between the freeview video and the non-freeview video.
13. The method of claim 12, wherein the freeview video is packaged in one or more third tracks and the non-freeview video is packaged in one or more fourth tracks; the third track and the fourth track have the same track group type identifier;
the freeview relation group data box comprises the track group type identifier and a freeview video flag field; when the freeview video flag field takes a first set value, it indicates that the freeview video in the third track needs to be presented; when the freeview video flag field takes a second set value, it indicates that the non-freeview video in the fourth track needs to be presented.
14. The method of claim 12, wherein the freeview video is packaged in one or more third tracks and the non-freeview video is packaged in one or more fourth tracks; the free view video and the non-free view video belong to the same entity group;
the freeview relation group data box comprises the identifier of the entity group and a freeview video flag field; when the freeview video flag field takes a first set value, it indicates that the freeview video of the entity corresponding to the identifier of the entity group in the third track needs to be presented; when the freeview video flag field takes a second set value, it indicates that the non-freeview video of the entity corresponding to the identifier of the entity group in the fourth track needs to be presented.
15. The method of claim 12, wherein the freeview video is packaged in one or more third tracks and the non-freeview video is packaged in one or more fourth tracks; an index relation is arranged between the track identifier of the third track and the track identifier of the fourth track;
the index relationship means that the fourth track can be indexed through the track identifier of the third track, or that the third track can be indexed through the track identifier of the fourth track.
16. The method of claim 1, wherein if the freeview video is packaged into a plurality of tracks, the presentation indication information of the freeview video further comprises association indication information between the plurality of tracks.
17. The method of claim 16, wherein the presentation indication information of the freeview video is metadata information, the metadata information including a freeview track group data box, the freeview track group data box including association indication information between the plurality of tracks.
18. The method of claim 17, wherein the plurality of tracks are clustered into M track groups, each track carrying a track group identification; the tracks with the same track group identifier belong to the same track group; m is a positive integer.
19. The method of claim 18, wherein each track group corresponds to a freeview track group data box, the freeview video including one or more freeview scenes, each freeview scene generated based on image data acquired by one or more camera groups; the freeview track group data box comprises a camera group flag field, wherein the camera group flag field is used for indicating whether a plurality of camera groups exist in the track group corresponding to the freeview track group data box;
when the camera group flag field takes a first set value, it indicates that all tracks in the track group corresponding to the freeview track group data box constitute a first freeview scene, the first freeview scene being generated based on image data acquired by one camera group; when the camera group flag field takes a second set value, it indicates that all tracks in the track group corresponding to the freeview track group data box constitute a second freeview scene, the second freeview scene being generated based on image data acquired by a plurality of camera groups;
the freeview track group data box further comprises a camera group number field and a camera group identification field, wherein the camera group number field is used for indicating the number of camera groups in the track group corresponding to the freeview track group data box, and the value of the camera group number field is greater than or equal to 1; the camera group identification field is used for indicating the identifiers of the camera groups in the track group corresponding to the freeview track group data box.
20. The method of claim 1, wherein when the freeview video is generated based on image data acquired by a plurality of camera groups, the indication information of the camera groups further includes a separate presentation time indication information for each camera group.
21. The method of claim 20, wherein the presentation indication information of the freeview video is metadata information, the metadata information including camera group sample entities including the respective presentation time indication information for each camera group.
22. The method of claim 21, wherein the camera group sample entity comprises a camera group update flag field, a camera group identification field, and a camera group cancel flag field;
when the camera group update flag field corresponding to a current sample takes a first set value, it indicates that the current sample carries camera group information, and the camera group to which the current sample belongs is specified by the camera group identification field corresponding to the current sample; when the camera group update flag field corresponding to the current sample takes a second set value, the camera group information corresponding to the current sample remains unchanged;
when the value of the camera group cancel flag field corresponding to the current sample is a first set value, it indicates that the camera group information corresponding to the target sample is canceled, so that the camera group information corresponding to the target sample no longer takes effect; the current sample refers to the sample being parsed; the target sample refers to the history sample, among the history samples parsed before the current sample, whose camera group update flag field takes the first set value, whose camera group identification field is the same as the camera group identification field corresponding to the current sample, and whose parsing time is closest to that of the current sample, i.e., the difference between the parsing time of the target sample and the parsing time of the current sample is the minimum of the differences between the parsing times of such history samples and the parsing time of the current sample;
when the value of the camera group update flag field corresponding to the current sample is a first set value, the value of the camera group cancel flag field corresponding to the current sample is not the first set value. When the camera group information corresponding to the current sample remains unchanged: if the camera group information corresponding to the target sample is in an effective state, the camera group information corresponding to the current sample is also in an effective state; if the camera group information corresponding to the target sample is in a canceled state, the camera group information corresponding to the current sample is also in a canceled state.
23. The method of claim 1, wherein the media file of the immersion medium is sliced into a plurality of media fragments; the method further comprises the steps of:
and acquiring a transmission signaling file of the immersion medium, wherein the transmission signaling file contains description information corresponding to the presentation indication information of the free view video.
24. The method of claim 23, wherein the freeview video includes one or more freeview scenes; the transmission signaling file comprises at least one of the following: a freeview camera descriptor, a freeview association descriptor;
wherein the freeview camera descriptor is used to indicate parameters of each camera in the one or more camera groups; the freeview association descriptor is used for indicating the association relation between the freeview video and the corresponding non-freeview video.
25. The method of claim 23, wherein the transmission signaling file includes a representation hierarchy and an adaptation set hierarchy, the adaptation set hierarchy including one or more representation hierarchies therein; when the description information exists in the target adaptive set hierarchy, the description information is used for describing all the representation hierarchies in the target adaptive set hierarchy; the description information is used for describing the target representation hierarchy when the description information exists in the target representation hierarchy.
26. The method of claim 23, wherein the obtaining the media file of the immersion media comprises:
determining media fragments corresponding to image data required for watching according to the virtual view angle of the viewer of the immersion media during watching and according to the description information in the transmission signaling file;
and pulling the determined media fragments through a streaming transmission mode.
27. The method of any of claims 1-26, wherein displaying the freeview video in accordance with presentation indication information of the freeview video comprises:
determining image data required for watching from the media file according to the virtual view angle of the viewer of the immersion media when watching;
and decoding and displaying the image data according to the indication information of the camera group.
28. The method of claim 27, wherein the presentation indication information of the freeview video further includes the freeview scene boundary indication information; the displaying the freeview video according to the presentation indication information of the freeview video further includes:
and in the process of decoding the image data, determining the display boundary of the image data according to the free view scene boundary indication information.
29. The method of claim 27, wherein, if the freeview video is associated with a non-freeview video for joint consumption, the presentation indication information of the freeview video further comprises association relation indication information between the freeview video and the non-freeview video; the displaying the freeview video according to the presentation indication information of the freeview video further comprises:
switching, in the process of displaying the image data, the freeview video to the non-freeview video according to the association relation indication information between the freeview video and the non-freeview video; or,
switching the non-freeview video to the freeview video according to the association relation indication information between the freeview video and the non-freeview video.
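The association relation is symmetric in use: the same indication drives switching in either direction. A minimal sketch, where the table mapping freeview track identifiers to their non-freeview counterparts is a hypothetical stand-in for whatever the file format signals:

```python
# Hypothetical sketch: the association table maps a freeview track id to its
# jointly consumed non-freeview counterpart; identifiers are illustrative.
def switch_track(current_track: str, associations: dict) -> str:
    """Follow the association relation in either direction."""
    if current_track in associations:                 # freeview -> non-freeview
        return associations[current_track]
    reverse = {v: k for k, v in associations.items()}
    return reverse.get(current_track, current_track)  # non-freeview -> freeview
```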
30. The method of claim 27, wherein the image data is acquired by a plurality of camera groups; the indication information of the camera groups further comprises presentation time indication information of each camera group;
the decoding and displaying the image data according to the indication information of the camera groups further comprises:
sequentially decoding and displaying, according to the presentation time indication information of each camera group, the image data corresponding to the respective camera group.
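Sequential presentation simply orders the camera groups by their signalled presentation times and plays each group's pictures in turn. A sketch assuming hypothetical per-group fields ("start_time", "group_id", "image_data") and caller-supplied decode/display callbacks:

```python
# Hypothetical sketch: per-group fields ("start_time", "group_id",
# "image_data") and the decode/display callbacks are assumptions.
def present_in_order(camera_groups, decode, display):
    """Decode and display each camera group's image data in presentation order."""
    for group in sorted(camera_groups, key=lambda g: g["start_time"]):
        frames = decode(group["image_data"])  # decode this group's pictures
        display(group["group_id"], frames)    # hand them to the renderer in turn
```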
31. A data processing method of an immersion medium, comprising:
acquiring image data acquired by one or more camera groups, and encoding the image data into a freeview video;
adding presentation indication information to the freeview video according to the application form of the freeview video, the presentation indication information of the freeview video comprising indication information of the camera groups;
and encapsulating the freeview video and the presentation indication information of the freeview video into a media file of the immersion media.
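The encapsulation step amounts to serialising the presentation indication metadata alongside the encoded bitstream in the container. The sketch below uses a length-prefixed JSON header purely as a stand-in; a real implementation would write the format's own metadata structures (for example ISOBMFF-style boxes):

```python
# Hypothetical sketch: a length-prefixed JSON header stands in for the
# container's real metadata structures (e.g. ISOBMFF-style boxes).
import json

def encapsulate(encoded_video: bytes, camera_groups: list, path: str) -> None:
    """Write presentation indication info followed by the freeview bitstream."""
    header = json.dumps({"camera_groups": camera_groups}).encode("utf-8")
    with open(path, "wb") as f:
        f.write(len(header).to_bytes(4, "big"))  # metadata length prefix
        f.write(header)                          # presentation indication info
        f.write(encoded_video)                   # encoded freeview samples
```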
32. The method of claim 31, wherein the method further comprises:
slicing the media file to obtain a plurality of media fragments; and
generating a transmission signaling file of the immersion media, wherein the transmission signaling file contains description information corresponding to the presentation indication information of the freeview video.
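On the packaging side, the signaling file carries the description information that claim 25 scopes to the adaptation-set or representation level. A rough sketch of emitting such a file, reusing the hypothetical "FreeViewCamInfo" element from the earlier example; every element and attribute name here is an assumption for illustration:

```python
# Hypothetical sketch: every element and attribute name here, including
# "FreeViewCamInfo", is an assumption for illustration.
import xml.etree.ElementTree as ET

def build_signaling_file(representation_id: str, camera_groups: list) -> str:
    mpd = ET.Element("MPD")
    aset = ET.SubElement(mpd, "AdaptationSet")
    # description information mirroring the presentation indication info
    desc = ET.SubElement(aset, "FreeViewCamInfo")
    desc.set("cameraCount", str(sum(len(g["cameras"]) for g in camera_groups)))
    rep = ET.SubElement(aset, "Representation", id=representation_id)
    ET.SubElement(rep, "SegmentTemplate", media="frag-$Number$.m4s")
    return ET.tostring(mpd, encoding="unicode")
```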
33. A data processing apparatus of an immersion medium, the apparatus comprising:
an acquisition unit, configured to acquire a media file of the immersion media, wherein the media file includes presentation indication information of a freeview video, the freeview video being generated based on image data acquired by one or more camera groups, and the presentation indication information of the freeview video including indication information of the camera groups;
and a processing unit, configured to display the freeview video according to the presentation indication information of the freeview video.
34. A data processing apparatus of an immersion medium, the apparatus comprising:
an acquisition unit, configured to acquire image data acquired by one or more camera groups and encode the image data into a freeview video;
and a processing unit, configured to add presentation indication information to the freeview video according to the application form of the freeview video, the presentation indication information of the freeview video including indication information of the camera groups, and to encapsulate the freeview video and the presentation indication information of the freeview video into a media file of the immersion media.
35. A computer device, comprising: a memory and a processor;
the memory storing a computer program;
the processor being configured to load the computer program to implement the data processing method of the immersion medium of any one of claims 1-30, or to load the computer program to implement the data processing method of the immersion medium of claim 31 or 32.
36. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded by a processor to perform the data processing method of the immersion medium of any one of claims 1-30, or to perform the data processing method of the immersion medium of claim 31 or 32.
37. A computer program product, characterized in that the computer program product comprises a computer program adapted to be loaded by a processor to perform the data processing method of the immersion medium of any one of claims 1-30, or to perform the data processing method of the immersion medium of claim 31 or 32.
CN202210138696.9A 2022-02-15 2022-02-15 Data processing method, device and equipment for immersion medium and storage medium Pending CN116643643A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210138696.9A | 2022-02-15 | 2022-02-15 | Data processing method, device and equipment for immersion medium and storage medium

Publications (1)

Publication Number | Publication Date
CN116643643A | 2023-08-25

Family

ID=87642262

Country Status (1)

Country | Link
CN (1) | CN116643643A (en)

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
REG | Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40091103)