CN115481280A - Data processing method, device and equipment for volume video and readable storage medium


Info

Publication number: CN115481280A
Authority: CN (China)
Prior art keywords: video, viewpoint, viewpoint group, group, target
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN202110660865.0A
Other languages: Chinese (zh)
Inventor: 胡颖 (Hu Ying)
Current Assignee: Tencent Technology (Shenzhen) Co., Ltd. (the listed assignee may be inaccurate)
Original Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202110660865.0A
Publication of CN115481280A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/73: Querying
    • G06F 16/735: Filtering based on additional data, e.g. user or group profiles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783: Retrieval characterised by using metadata automatically derived from the content
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/60: Network streaming of media packets

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computational Linguistics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application provides a data processing method, apparatus, device, and readable storage medium for volumetric video. The method comprises: acquiring the ith viewpoint group among G viewpoint groups of the volumetric video as a target viewpoint group; constructing a mapping relation between the target viewpoint group and target spatial position information based on the target video content corresponding to the target viewpoint group, and writing timing metadata information generated from the mapping relation into an encapsulation data box corresponding to the volumetric video to obtain a first extended data box; and encapsulating the encoded video streams associated with the G viewpoint groups based on the first extended data box to obtain a video media file of the volumetric video, and sending the video media file to a video client, so that when the video client obtains the first extended data box from the video media file, it displays the target video content according to the target spatial position information in the first extended data box. The method and apparatus can improve the decoding presentation efficiency of volumetric video.

Description

Data processing method, device and equipment for volume video and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for processing volume video data.
Background
Volumetric video refers to visual content captured in three-dimensional space, which can provide the user with a viewing experience in multiple degrees of freedom (e.g., 3DoF+, 6DoF). In this context, volumetric video mainly refers to multi-view video (also called multi-viewpoint video) with depth information, shot from multiple angles by multiple sets of camera arrays.
When processing a volumetric video (i.e., a multi-view video), its different viewpoints are usually grouped to obtain a plurality of viewpoint groups, and the media content corresponding to these viewpoint groups is then encoded to obtain a corresponding media package file. However, the inventors found in practice that when a media package file contains a volumetric video with a plurality of viewpoint groups, a video client decoding the encoded stream of the volumetric video must decode the entire encoded stream, so the media content of every one of the viewpoint groups is decoded indiscriminately. This consumes a long video decoding time and reduces the decoding presentation efficiency of the video client.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, data processing equipment and a readable storage medium for volumetric video, which can improve the decoding presentation efficiency of a video client.
In one aspect, an embodiment of the present application provides a method for processing data of a volume video, where the method is performed by a server, and includes:
acquiring G viewpoint groups of a volume video, and taking the ith viewpoint group in the G viewpoint groups as a target viewpoint group; i is a non-negative integer less than G;
based on the target video content corresponding to the target viewpoint group, constructing a mapping relation between the target viewpoint group and target space position information for watching the target video content, and generating timing metadata information corresponding to the target viewpoint group based on the mapping relation;
writing timing metadata information corresponding to the target viewpoint group into a packaging data box corresponding to the volume video to obtain a first expansion data box corresponding to the packaging data box; the first extended data box comprises timing metadata information corresponding to each viewpoint group of G viewpoint groups;
acquiring coded video code streams associated with the G viewpoint groups, and packaging the coded video code streams based on the first extended data box to obtain video media files of the volume videos;
and issuing the video media file to the video client so that the video client displays the target video content corresponding to the target viewpoint group on the video client according to the target spatial position information indicated by the timing metadata information corresponding to the target viewpoint group in the first extended data box when acquiring the first extended data box based on the video media file.
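For ease of understanding, a minimal Python sketch of this server-side flow follows. It is illustrative only: all names (ViewGroup, derive_spatial_position, build_first_extended_box, encapsulate) are hypothetical assumptions, and the dictionaries stand in for real ISOBMFF structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ViewGroup:
    group_id: int  # hypothetical stand-in for one of the G viewpoint groups

@dataclass
class TimedMetadata:
    view_group_id: int
    spatial_position: Tuple[float, float, float]  # target spatial position info for viewing the group

@dataclass
class FirstExtendedDataBox:
    entries: List[TimedMetadata] = field(default_factory=list)  # one entry per viewpoint group

def derive_spatial_position(group: ViewGroup) -> Tuple[float, float, float]:
    # Hypothetical: a real encoder would derive the 3D region or viewing
    # position from the group's corresponding video content.
    return (float(group.group_id), 0.0, 0.0)

def build_first_extended_box(view_groups: List[ViewGroup]) -> FirstExtendedDataBox:
    """Steps S102-S103 in miniature: construct the mapping for every one of the
    G viewpoint groups and write the resulting timing metadata into the box."""
    box = FirstExtendedDataBox()
    for group in view_groups:  # each i-th group serves as the 'target' group in turn
        box.entries.append(TimedMetadata(group.group_id, derive_spatial_position(group)))
    return box

def encapsulate(encoded_streams: Dict[int, bytes], box: FirstExtendedDataBox) -> dict:
    """Step S104 in miniature: package the encoded streams of all G viewpoint
    groups together with the first extended data box into one media file."""
    return {"metadata_box": box, "streams": encoded_streams}  # schematic stand-in for ISOBMFF writing
```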
An embodiment of the present application provides a data processing apparatus for volumetric video, including:
the viewpoint group acquisition module is used for acquiring G viewpoint groups of the volumetric video and taking the ith viewpoint group in the G viewpoint groups as a target viewpoint group; i is a non-negative integer less than G;
the mapping relation construction module is used for constructing a mapping relation between the target viewpoint group and target space position information used for watching the target video content based on the target video content corresponding to the target viewpoint group, and generating timing metadata information corresponding to the target viewpoint group based on the mapping relation;
the timing metadata writing module is used for writing timing metadata information corresponding to the target viewpoint group into a packaging data box corresponding to the volume video to obtain a first extended data box corresponding to the packaging data box; the first extended data box comprises timing metadata information corresponding to each viewpoint group of the G viewpoint groups;
the media file issuing module is used for acquiring coded video code streams associated with the G viewpoint groups and packaging the coded video code streams based on the first extended data box to obtain a video media file of the volume video;
the media file issuing module is further configured to issue the video media file to the video client, so that when the video client acquires the first extended data box based on the video media file, the video client displays target video content corresponding to the target viewpoint group on the video client according to target spatial position information indicated by the timing metadata information corresponding to the target viewpoint group in the first extended data box.
Wherein, viewpoint group acquisition module includes:
the viewpoint acquiring unit is used for acquiring V viewpoints of the volumetric video, and performing viewpoint grouping on the V viewpoints based on the viewpoint dependency relationship among the V viewpoints to obtain G viewpoint groups of the volumetric video; v is used for representing the number of viewpoints of the volume video, and is a positive integer greater than or equal to 2; the viewpoint dependency relationship is determined by the content correlation between the video contents respectively corresponding to each of the V viewpoints.
Wherein, viewpoint group acquisition module includes:
the viewpoint group searching unit is used for acquiring video-related information associated with the volumetric video and searching the video-related information for a specified viewpoint group associated with the content producer; the specified viewpoint group is associated with the shooting intention with which the content producer shot the volumetric video;
and the notification unit is used for, if no specified viewpoint group associated with the content producer is found in the video-related information, acquiring from the G viewpoint groups an ith viewpoint group unrelated to the shooting intention of the content producer as the target viewpoint group, and notifying the mapping relation construction module to execute the steps of constructing, based on the target video content corresponding to the target viewpoint group, the mapping relation between the target viewpoint group and the target spatial position information for viewing the target video content, and generating the timing metadata information corresponding to the target viewpoint group based on the mapping relation.
Wherein, the device still includes:
an invalid field setting module, configured to add a recommended viewpoint group identification field in a viewpoint group metadata sample of the encapsulated data box, set a field value of the recommended viewpoint group identification field as an invalid field value, and use, in the viewpoint group metadata sample of the first extended data box, the recommended viewpoint group identification field having the invalid field value as first field indication information; the first field indication information is used to instruct the video client to acquire timing metadata information of each of the G view groups from the first extended data box.
Wherein, the viewpoint group acquiring module further comprises:
a recommended viewpoint group determining unit, configured to, if a specified viewpoint group associated with a content producer is found in the video association information, take the found specified viewpoint group as a recommended viewpoint group;
a target viewpoint group determining unit for acquiring an ith viewpoint group related to a photographing intention of the content producer from the G viewpoint groups as a target viewpoint group based on the recommended viewpoint group.
Wherein, the device still includes:
the recommendation metadata writing module is used for determining the identifier of the target viewpoint group as a recommendation identifier, taking metadata information for describing the recommendation identifier as recommendation metadata information of the target viewpoint group, and writing the recommendation metadata information into a packaging data box corresponding to the volume video to obtain a second expansion data box corresponding to the packaging data box;
and the packaging processing module is used for acquiring the coded video code streams associated with the G viewpoint groups, packaging the coded video code streams based on the second extended data box to obtain a video media file of the volume video, and sending the video media file to the video client, so that when the video client acquires the second extended data box based on the video media file, the video client displays the target video content corresponding to the target viewpoint group indicated by the recommendation identifier based on the recommendation identifier indicated by the recommendation metadata information in the second extended data box.
Before the recommended metadata writing module executes writing of the recommended metadata information into the encapsulated data box corresponding to the volumetric video, the apparatus further includes:
a valid field setting module, configured to add a recommended viewpoint group identification field associated with the recommended identifier in the viewpoint group metadata samples of the encapsulated data box, set a field value of the recommended viewpoint group identification field as a valid field value, and set, in the viewpoint group metadata samples of the second extended data box, the recommended viewpoint group identification field having the valid field value as second field indication information; and the second field indication information is used for indicating the video client to acquire the recommended metadata information from the second extended data box.
In one aspect, an embodiment of the present application provides a method for processing volumetric video data, where the method is executed by a video client and includes:
receiving a video media file of the volume video sent by the server, and performing decapsulation processing on the video media file to obtain a video coding stream of the volume video and an extended data box corresponding to the video coding stream; the extended data box comprises a recommended viewpoint group identification field;
if the field value of the recommended viewpoint group identification field is an invalid value, acquiring timing metadata information respectively corresponding to each viewpoint group in G viewpoint groups of the volumetric video in a first extended data box of the extended data boxes;
acquiring the ith viewpoint group in the G viewpoint groups, taking the spatial position information of the video client as the spatial position information to be compared, and comparing the spatial position information to be compared with the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group to obtain a comparison result; i is a non-negative integer less than G;
if the comparison result indicates that the spatial position information to be compared is the same as the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group, taking the ith viewpoint group as a matching viewpoint group, and taking the spatial position information indicated by the timing metadata information corresponding to the matching viewpoint group as target spatial position information;
and decoding the video coding stream of the volume video to obtain the matched video content corresponding to the matched viewpoint group based on the mapping relation between the matched viewpoint group and the target spatial position information, and displaying the matched video content corresponding to the matched viewpoint group on the video client.
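Continuing the server-side sketch above, a hedged illustration of this client-side flow follows (again with hypothetical names; real decapsulation and decoding are far more involved):

```python
from typing import Optional, Tuple

def select_and_decode(media_file: dict,
                      client_position: Tuple[float, float, float]) -> Optional[bytes]:
    """Client-side sketch: decapsulate the file, compare the client's spatial
    position with the timing metadata of each viewpoint group, and decode only
    the matching group's sub-stream (partial access)."""
    box, streams = media_file["metadata_box"], media_file["streams"]
    for entry in box.entries:                          # one entry per viewpoint group
        if entry.spatial_position == client_position:  # the comparison step of the method
            return decode(streams[entry.view_group_id])  # decode only the matching viewpoint group
    return None                                        # no viewpoint group matches this position

def decode(stream: bytes) -> bytes:
    # Stand-in for the real volumetric video decoder.
    return stream
```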
An aspect of an embodiment of the present application provides a data processing apparatus for volumetric video, including:
the media file receiving module is used for receiving the video media file of the volume video sent by the server and carrying out decapsulation processing on the video media file to obtain a video coding stream of the volume video and an extended data box corresponding to the video coding stream; the extended data box comprises a recommended viewpoint group identification field;
a timing metadata acquisition module, configured to acquire, in a first extended data box of the extended data boxes, timing metadata information corresponding to each of G viewpoint groups of the volumetric video, if a field value of the recommended viewpoint group identification field is an invalid value;
the information comparison module is used for acquiring the ith viewpoint group in the G viewpoint groups, taking the spatial position information of the video client as the spatial position information to be compared, and comparing the spatial position information to be compared with the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group to obtain a comparison result; i is a non-negative integer less than G;
a target information determining module, configured to, if the comparison result indicates that the spatial position information to be compared is the same as the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group, take the ith viewpoint group as a matching viewpoint group, and take the spatial position information indicated by the timing metadata information corresponding to the matching viewpoint group as target spatial position information;
and the video decoding module is used for decoding the video coding stream of the volume video to obtain the matched video content corresponding to the matched viewpoint group based on the mapping relation between the matched viewpoint group and the target spatial position information, and displaying the matched video content corresponding to the matched viewpoint group on the video client.
Wherein, the timing metadata acquisition module comprises:
a field obtaining unit, configured to obtain, in a first extended data box of the extended data boxes, static viewpoint group metadata fields associated with G viewpoint groups if a field value of the recommended viewpoint group identification field is an invalid value;
and the timing metadata acquisition unit is used for acquiring timing metadata information corresponding to each viewpoint group in the G viewpoint groups respectively based on the first field indication information associated with the recommended viewpoint group identification field if the field value of the static viewpoint group metadata field is a numerical value used for describing that the mapping relation of each viewpoint group in the G viewpoint groups keeps unchanged.
Wherein the static viewpoint group metadata field is deployed in a viewpoint group static metadata box of the first extended data box; if the field value of the static viewpoint group metadata field is a value describing that the mapping relation of each of the G viewpoint groups remains unchanged, the static viewpoint group metadata recorded in the viewpoint group metadata sample entry of the first extended data box includes a viewpoint group identifier for each of the G viewpoint groups.
If the field value of the static viewpoint group metadata field is a value associated with dynamic viewpoint group metadata, where the dynamic viewpoint group metadata describes the viewpoint groups among the G viewpoint groups whose mapping relation changes over time, then the dynamic viewpoint group metadata recorded in the viewpoint group metadata samples corresponding to the viewpoint group metadata sample entry comprises: the identifier of each variable viewpoint group whose mapping relation changes at the sample timestamp, where before the sample timestamp the mapping relation between the variable viewpoint group and its corresponding video content remains unchanged; a variable viewpoint group is a viewpoint group among the G viewpoint groups whose mapping relation changes with time.
Wherein the viewpoint group metadata sample entry and the viewpoint group metadata samples are used to form a viewpoint group timing metadata track of the volumetric video, and the viewpoint group timing metadata track is used to index one or more atlas data tracks associated with the volumetric video.
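As a hedged illustration of the static/dynamic distinction above (field and key names are assumed, since the patent describes the fields only in prose):

```python
from typing import Dict, List, Union

def read_view_group_metadata(sample_entry: dict,
                             samples: List[dict]) -> Dict[str, Union[list, dict]]:
    """A non-zero static field means every group's mapping is constant and the
    sample entry alone lists all G group identifiers; otherwise each timed
    sample carries the identifiers of the groups whose mapping changed at that
    sample's timestamp (the mapping was unchanged before that timestamp)."""
    if sample_entry["static_view_group"]:
        return {"static_group_ids": sample_entry["view_group_ids"]}
    changes = {}
    for sample in samples:  # samples ordered by timestamp
        changes[sample["timestamp"]] = sample["changed_view_group_ids"]
    return {"dynamic_changes": changes}
```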
Wherein the target spatial position information is determined by a decision information type field recorded by the server in the viewpoint group metadata sample entry of the first extended data box; the decision information type field is deployed in the viewpoint group static metadata box of the viewpoint group metadata sample entry. If the decision information type field takes a first value, the target spatial position information mapped to the matching viewpoint group is the three-dimensional spatial region information of the matching video content displayed on the video client; if it takes a second value, the target spatial position information mapped to the matching viewpoint group is the viewing position coordinate information of the user viewing the matching video content on the video client; and if it takes a third value, the target spatial position information mapped to the target viewpoint group is jointly determined by the three-dimensional spatial region information and the viewing position coordinate information.
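A sketch of this three-way decision logic follows; the patent specifies only "first/second/third value", so the concrete constants below are placeholders:

```python
# Hedged reading of the decision information type field.
REGION_3D, VIEWING_POSITION, COMBINED = 1, 2, 3

def resolve_target_spatial_info(decision_type: int, region_3d, viewing_pos):
    """Return the spatial position information that keys the viewpoint-group
    mapping, according to the decision information type field."""
    if decision_type == REGION_3D:         # mapping keyed by the displayed 3D spatial region
        return region_3d
    if decision_type == VIEWING_POSITION:  # mapping keyed by the viewer's position coordinates
        return viewing_pos
    if decision_type == COMBINED:          # both jointly determine the mapping
        return (region_3d, viewing_pos)
    raise ValueError("unknown decision information type")
```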
Wherein, the device still includes:
the information change module is used for taking the mapping relation between the spatial position updating information and the change viewpoint group as an updating relation based on the timing metadata information corresponding to the spatial position updating information in the first extended data box when the spatial position information of the video client is changed from the target spatial position information to the spatial position updating information;
and the updating decoding module is used for decoding the video coding stream of the volumetric video based on the updating relation to obtain the video content corresponding to the changed viewpoint group, and displaying the video content corresponding to the changed viewpoint group on the video client.
Wherein, the device still includes:
a recommended identifier obtaining module, configured to, if a field value of the recommended viewpoint group identification field is a valid value, obtain, in a second extended data box of the extended data boxes, a recommended identifier of the recommended viewpoint group indicated by the recommended metadata information based on second field indication information associated with the recommended viewpoint group identification field;
and the recommended content display module is used for decoding the video coding stream of the volumetric video to obtain the recommended video content corresponding to the recommended viewpoint group and displaying the recommended video content corresponding to the recommended viewpoint group on the video client.
An aspect of an embodiment of the present application provides a computer device, where the computer device includes: a processor and a memory;
the processor is connected to the memory, the memory is used for storing a computer program, and the processor is used for invoking the computer program so that the computer device executes the method in any aspect of the embodiments of the present application.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, the computer program being adapted to be loaded and executed by a processor, so as to enable a computer device having the processor to execute the method in any aspect of the embodiments of the present application.
An aspect of an embodiment of the present application provides a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the method in any aspect of the embodiment of the present application.
In this embodiment, when acquiring the G viewpoint groups of the volumetric video, the server (i.e., the encoding end) may take the ith viewpoint group of the G viewpoint groups as the target viewpoint group, where i is a non-negative integer smaller than G. It should be understood that each of the G viewpoint groups can serve as the target viewpoint group; the ith viewpoint group is used here as an example. The server (i.e., the encoding end) may then construct the mapping relation between the target viewpoint group and the target spatial position information for viewing the target video content, and write the timing metadata information generated from this mapping relation into the encapsulation data box corresponding to the volumetric video, obtaining the first extended data box corresponding to the encapsulation data box. The first extended data box thus contains the timing metadata information corresponding to each of the G viewpoint groups; in other words, the timing metadata information in the first extended data box describes a one-to-one mapping between each viewpoint group and the spatial position information for viewing the corresponding video content. After the server (i.e., the encoding end) encapsulates the encoded video streams associated with the G viewpoint groups based on the first extended data box, a video media file for delivery to the video client (i.e., the decoding end) is obtained. When the video client decapsulates the video media file and obtains the first extended data box carrying the timing metadata information, it can quickly determine the target viewpoint group mapped to the target spatial position information according to the written timing metadata information and the client's current spatial position information (e.g., the target spatial position information), and then decode only the target video content corresponding to the target viewpoint group, so as to play the video content of that specific viewpoint group. In this way, partial access to the volumetric video improves the decoding presentation efficiency of the video client.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is an architecture diagram of a volumetric video system according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a data processing method for volumetric video according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a data processing method for volumetric video according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of a data processing method for volumetric video according to an embodiment of the present application;
Fig. 5 is a schematic flowchart of a data processing method for volumetric video according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a data processing apparatus for volumetric video according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a data processing apparatus for volumetric video according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to Fig. 1, Fig. 1 is an architecture diagram of a volumetric video system according to an embodiment of the present application. As shown in Fig. 1, the volumetric video system includes an encoding device and a decoding device. The encoding device is the computer device used by the provider of the volumetric video; it may be a terminal (such as a PC (Personal Computer) or a smart mobile device such as a smartphone) or a server. The decoding device is the computer device used by the consumer of the volumetric video; it may be a terminal (such as a PC, a smart mobile device such as a smartphone, or a VR device such as a VR headset or VR glasses). The data processing of the volumetric video comprises a data processing process on the encoding device side and a data processing process on the decoding device side.
The data processing process on the encoding device side mainly comprises: (1) the acquisition and production process of the media content of the volumetric video; and (2) the process of volumetric video encoding and file encapsulation. The data processing process on the decoding device side mainly comprises: (1) the file decapsulation and decoding process of the volumetric video; and (2) the volumetric video rendering process. In addition, the transmission of the volumetric video between the encoding device and the decoding device may be based on various transmission protocols, which may include, but are not limited to: DASH (Dynamic Adaptive Streaming over HTTP), HLS (HTTP Live Streaming), SMT (Smart Media Transport), TCP (Transmission Control Protocol), and the like.
The respective processes involved in the data processing of the volume video will be described in detail below with reference to fig. 1.
1. Data processing procedure at the encoding device side:
(1) The acquisition and production process of the media content of the volumetric video.
1) Acquisition of media content for volumetric video.
The media content of the volumetric video is obtained by a capture device capturing a real-world audio-visual scene. In one implementation, the capture device may be a hardware component of the encoding device, for example a microphone, camera, or sensor of a terminal. In another implementation, the capture device may be a hardware device connected to the encoding device, such as a camera connected to a server, which provides the encoding device with an acquisition service for the media content of the volumetric video. The capture devices may include, but are not limited to: audio devices, camera devices, and sensing devices. The audio devices may include audio sensors, microphones, and the like; the camera devices may include ordinary cameras, stereo cameras, light field cameras, and the like; the sensing devices may include laser devices, radar devices, and the like. There may be multiple capture devices, deployed at specific positions in real space to simultaneously capture audio content and video content from different angles within the space, with the captured audio and video content remaining synchronized in both time and space. In the embodiments of the present application, the media content in three-dimensional space that is captured by capture devices deployed at specific positions and that provides a multi-degree-of-freedom viewing experience is referred to as volumetric video.
2) Production of media content for volumetric video.
It should be understood that the process of producing the media content of the volumetric video in the embodiments of the present application can be understood as the content production process of the volumetric video. The content of the volumetric video is mainly produced from content in the form of multi-view video, point cloud data, light fields, and the like, captured by cameras or camera arrays deployed at multiple positions; for example, the encoding device may convert the volumetric video from a three-dimensional representation to a two-dimensional representation. The volumetric video may contain geometry information, attribute information, occupancy map information, atlas data, and the like. Volumetric video generally requires specific processing before encoding; for example, point cloud data needs to be cut and mapped before encoding, and multi-view video generally requires grouping its different viewpoints before encoding to distinguish the primary viewpoint from the secondary viewpoints within each group.
Specifically, (1) the collected and input three-dimensional representation data of the volumetric video (i.e., the point cloud data) is projected onto a two-dimensional plane, usually by orthogonal projection, perspective projection, or ERP projection. The volumetric video projected onto the two-dimensional plane is represented by the data of a geometry component, an occupancy component, and an attribute component, where the data of the geometry component provides the position of each point of the volumetric video in three-dimensional space, the data of the attribute component provides additional attributes of each point (such as texture or material information), and the data of the occupancy component indicates whether the data in the other components is associated with the volumetric video;
(2) the component data of the two-dimensional representation of the volumetric video is processed to generate patches: according to the positions of the volumetric video represented in the geometry component data, the two-dimensional plane region where the two-dimensional representation of the volumetric video is located is divided into a plurality of rectangular regions of different sizes, where one rectangular region is one patch and a patch contains the information necessary to back-project that rectangular region into three-dimensional space;
(3) the patches are packed to generate an atlas: the patches are placed into a two-dimensional grid while ensuring that the valid parts of the patches do not overlap. The patches generated from one volumetric video can be packed into one or more atlases;
(4) the corresponding geometry data, attribute data, and occupancy data are generated based on the atlas data, and the atlas data, geometry data, attribute data, and occupancy data together form the final representation of the volumetric video on the two-dimensional plane.
It should be noted that, in the content production process of the volumetric video, the geometry component, the occupancy component, and the attribute component are each optional.
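As an illustrative sketch of the two-dimensional representation assembled in steps (1) to (4) above (the class and field names are assumptions, not defined by the patent):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TwoDRepresentation:
    """Schematic of the two-dimensional representation assembled in steps
    (1)-(4) above; per the note, each component may be absent."""
    geometry: Optional[bytes] = None    # position of each point in 3D space
    attributes: Optional[bytes] = None  # texture / material info of each point
    occupancy: Optional[bytes] = None   # whether data in the other components is valid
    atlas: Optional[bytes] = None       # patch-packing info needed to back-project to 3D
```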
In addition, it should be noted that, since the capture device may capture panoramic video, after such video is processed by the encoding device and transmitted to the decoding device for corresponding data processing, a user on the decoding device side can view 360-degree video information only by performing certain specific actions (such as head rotation), whereas performing non-specific actions (such as head movement) produces no corresponding change in the video, resulting in a poor VR experience. Therefore, depth information matching the panoramic video must additionally be provided to give the user better immersion and a better VR experience, which involves the 6DoF (Six Degrees of Freedom) production technique. When the user can move relatively freely in the simulated scene, the experience is called 6DoF. When video content of the volumetric video is produced with the 6DoF production technique, the capture device generally uses light field cameras, laser devices, radar devices, and the like to capture point cloud data or light field data in the space.
(2) The process of volume video encoding and file packing.
The captured audio content can be directly audio-encoded to form the audio stream of the volumetric video, and the captured video content can be video-encoded to obtain the video stream of the volumetric video. Note that if the 6DoF production technique is adopted, a specific encoding method (such as point cloud compression based on conventional video coding) needs to be used in the video encoding process. The audio stream and the video stream are encapsulated in a file container according to the file format of the volumetric video (such as ISOBMFF (ISO Base Media File Format)) to form a media file resource of the volumetric video, where the media file resource can be a media file, or media fragments forming a media file, of the volumetric video; and the metadata of the media file resource of the volumetric video is recorded with Media Presentation Description (MPD) information according to the file format requirements of the volumetric video. Here, metadata is a general term for information related to the presentation of the volumetric video, and may include description information of the media content, timing metadata information describing the constructed mapping relation between each viewpoint group and the spatial position information for viewing the media content, description information of a window, signaling information related to the presentation of the media content, and the like. As shown in Fig. 1, the encoding device stores the media presentation description information and the media file resources formed after the data processing process; these may be packaged into a video media file according to a specific media file format for delivery to the decoding device.
Specifically, as shown in Fig. 1, the collected audio is encoded into the corresponding audio stream; the geometry information, attribute information, and occupancy map information of the volumetric video may be encoded in a conventional video encoding manner, while the atlas data of the volumetric video may be entropy-encoded. The encoded media is then encapsulated in a file container in a specific format (e.g., ISOBMFF, HNSS) and combined with metadata describing the media content attributes and window metadata to form a media file, or an initialization segment and media segments, according to a particular media file format. In this embodiment of the present application, when the encoding device forms a media file, or an initialization segment and media segments, according to a specific media file format, these are collectively referred to as the video media file, which the encoding device may then send to the decoding device shown in Fig. 1.
2. Data processing procedure at decoding device side:
(3) The file decapsulation and decoding process of the volume video;
the decoding device may obtain the media file resources of the volumetric video and the corresponding media presentation description information from the encoding device through recommendation of the encoding device or adaptive dynamic according to user requirements on the decoding device side, for example, the decoding device may determine the orientation and position of the user according to the tracking information of the head/eyes/body of the user, and then dynamically request the encoding device to obtain the corresponding media file resources based on the determined orientation and position. The media file resources and the media presentation description information are transmitted from the encoding device to the decoding device via a transmission mechanism (e.g., DASH, SMT). The process of file decapsulation at the decoding device side is the reverse of the process of file encapsulation at the encoding device side, and the decoding device decapsulates the media file resource according to the file format (for example, ISO media file format) requirement of the volumetric video to obtain an audio code stream and a video code stream. The decoding process of the decoding device side is opposite to the encoding process of the encoding device side, and the decoding device performs audio decoding on the audio code stream to restore the audio content; and the decoding equipment performs video decoding on the video code stream to restore the video content.
(4) And (4) a volume video rendering process.
The decoding device renders the audio content obtained by audio decoding and the video content obtained by video decoding according to the rendering-related metadata in the media presentation description information corresponding to the media file resources; once rendering is finished, the image is played and output.
Volumetric video systems support data boxes (Boxes). A data box refers to a data block or object that includes metadata, i.e., the metadata of the corresponding media content is contained in the box. A volumetric video may involve a plurality of data boxes, for example an ISO Base Media File Format data box (ISOBMFF Box), which contains metadata describing the corresponding information when the file is encapsulated, for example the constructed timing metadata information or the constructed recommendation metadata information. The ISOBMFF data box may include a first extended data box and a second extended data box, where the metadata information (e.g., the timing metadata information) provided by the first extended data box is used to describe the correspondence between the viewpoint groups of the volumetric video and the corresponding media content, and the metadata information (e.g., the recommendation metadata information) provided by the second extended data box is used to describe the correspondence between a recommended viewpoint group associated with the content producer and the corresponding media content. For ease of understanding, the first extended data box and the second extended data box may be collectively referred to as extended data boxes in the embodiments of the present application.
In the embodiments of the present application, in order to improve the decoding presentation efficiency of the decoding side, a partial access policy for volumetric video based on the metadata information provided by the above extended data boxes is provided. For example, based on the partial access policy, when the encoding device acquires the viewpoint groups of the volumetric video, it may generate different metadata indication information depending on the determined shooting intention of the viewpoint groups. The decoding device can then adaptively provide users with different decoding and presentation modes according to the different indication information of the metadata provided by the extended data boxes. The shooting intention here characterizes whether, among the viewpoint groups of the volumetric video, there is a viewpoint group matching a specified viewpoint group designated by the content producer.
If no such viewpoint group exists, each viewpoint group of the volumetric video is unrelated to the shooting intention of the content producer. In that case, based on the video content corresponding to each viewpoint group, a mapping relation between each viewpoint group of the volumetric video and spatial position information (e.g., a 3D spatial region, the user position, and the like) may be constructed, so that the constructed mapping relation serves as the basis on which the decoding device selects among the viewpoint groups, and the timing metadata information corresponding to each viewpoint group is generated. When the encoding device writes the generated timing metadata information of each viewpoint group into the encapsulation data box, the first extended data box corresponding to the encapsulation data box is obtained; the encoding device may then encapsulate the encoded video streams (i.e., the media file resources) associated with the viewpoint groups based on the first extended data box to obtain a video media file of the volumetric video (e.g., video media file A). On this basis, when the decoding device acquires this video media file, it may decapsulate it to obtain the first extended data box and the encoded video streams associated with the viewpoint groups. The decoding device can then, according to the timing metadata information of the volumetric video provided by the first extended data box and the current spatial position information of the user (e.g., the user's current viewing position), quickly select the viewpoint group (e.g., viewpoint group 1) whose mapped spatial position information matches the user's current spatial position, and partially decode and present the media content corresponding to viewpoint group 1 from the encoded video streams.
Optionally, if such a specified viewpoint group exists, the found specified viewpoint group may be used directly as the recommended viewpoint group. The encoding device then determines the identifier of the recommended viewpoint group as the recommendation identifier, uses metadata information describing the recommendation identifier as the recommendation metadata information, and writes the recommendation metadata information into the encapsulation data box of the volumetric video to obtain the second extended data box described above. When the encoded video streams associated with the viewpoint groups are acquired, the encoding device may encapsulate them directly based on the second extended data box to obtain another video media file (e.g., video media file B). On this basis, when the decoding device acquires this video media file, it may decapsulate it to obtain the second extended data box and the encoded video streams associated with the viewpoint groups. The decoding device can then quickly determine the recommended viewpoint group from the recommendation identifier indicated by the recommendation metadata information provided by the second extended data box, and decode and present the media content corresponding to the recommended viewpoint group.
The specific process by which the encoding device constructs the timing metadata information of each viewpoint group, or constructs the recommendation metadata information representing the shooting intention of the content producer, based on the partial access policy may refer to the description of the embodiments corresponding to Figs. 2 and 3 below. The specific process by which the decoding device implements partial decoding based on the partial access policy may refer to the description of the embodiments corresponding to Figs. 4 and 5 below.
Further, please refer to fig. 2, wherein fig. 2 is a schematic flow chart of a data processing method of a volume video according to an embodiment of the present application. The method may be performed by an encoding device in a volumetric video system, for example, the encoding device may be a server, and the method may include the following steps S101 to S105:
step S101, obtaining G viewpoint groups of the volume video, and taking the ith viewpoint group in the G viewpoint groups as a target viewpoint group;
wherein i is a non-negative integer less than G;
specifically, when acquiring V viewpoints of the volumetric video, the server may perform viewpoint grouping on the V viewpoints based on a viewpoint dependency relationship between the V viewpoints to obtain G viewpoint groups of the volumetric video; where V is used to represent the number of views of the volumetric video, since the volumetric video here is mainly referred to as multi-view video. Therefore, the number of viewpoints (i.e., V) of the volumetric video is a positive integer greater than or equal to 2; it should be understood that the view dependency herein is determined by the content correlation between the video contents respectively corresponding to each of the V views.
It should be understood that multi-view video is usually shot by a camera array from multiple angles of a scene, forming texture information (color information, etc.) and depth information (spatial distance information, etc.) of the scene; adding the mapping information from 2D planar frames to the 3D rendering space then yields 6DoF media that can be consumed on the user side. Each camera in the camera array corresponds to a camera identifier. In the process of reconstructing the three-dimensional scene, the target view needs to be synthesized and rendered from the volumetric video of one or more viewpoints according to the user's viewing position, direction, and so on, and the volumetric video corresponding to an auxiliary viewpoint needs to be synthesized from the volumetric video data of the base viewpoint.
It can be understood that, in the embodiments of the present application, the corresponding viewpoint information of the volumetric video may be provided through a viewpoint information structure (e.g., viewInfoStruct) of the volumetric video. For example, the viewpoint information structure of the volumetric video may be used to define, according to the identifier of the camera parameter, a viewpoint identifier of each viewpoint in the V viewpoints of the volumetric video, a viewpoint group identifier of a viewpoint group to which each viewpoint belongs, whether each viewpoint carries a valid base viewpoint identifier, and the like.
For ease of understanding, please refer to Table 1, which shows the syntax of the viewpoint information structure of the volumetric video provided by an embodiment of the present application:
TABLE 1
[Table 1 appears as an image in the original publication and is not reproduced here; its fields are described below.]
The semantics of the syntax shown in Table 1 above are as follows: view_id indicates the viewpoint identifier of a viewpoint; view_group_id indicates the viewpoint group identifier of the viewpoint group to which the viewpoint belongs; view_description provides a textual description of the viewpoint, a null-terminated UTF-8 string. When the field value of basic_view_flag (i.e., the base viewpoint identification field) is 1 (a valid value), the current viewpoint is the base viewpoint; conversely, when the field value of basic_view_flag is 0 (an invalid value), the current viewpoint is not the base viewpoint.
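Since the published syntax table is only available as an image, the following Python stand-in reconstructs the structure's fields from the semantics above; field order and types are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ViewInfoStruct:
    """Field-for-field stand-in for Table 1 (the published table is an image,
    so field order and exact types are assumed from the semantics above)."""
    view_id: int           # viewpoint identifier of the viewpoint
    view_group_id: int     # identifier of the viewpoint group this viewpoint belongs to
    view_description: str  # textual description; a null-terminated UTF-8 string in the file
    basic_view_flag: bool  # True (1): this viewpoint is the base viewpoint; False (0): it is not
```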
Further, the server may determine the view dependency relationship between the V views based on the content correlation between the view contents corresponding to each view, and may further perform view grouping on the V views based on the view dependency relationship between the V views to obtain G view groups of the volumetric video.
Here, for ease of understanding, G viewpoint groups comprising viewpoint group 1, viewpoint group 2, viewpoint group 3, viewpoint group 4, and viewpoint group 5 are taken as an example. The embodiments of the present application may provide the viewpoint grouping information of the volumetric video through a viewpoint grouping information structure (e.g., ViewGroupInfoStruct), where one piece of viewpoint grouping information may describe one or more viewpoints. For ease of understanding, please refer to Table 2, which shows the syntax of the viewpoint grouping information structure of the volumetric video provided by an embodiment of the present application:
TABLE 2
[Table 2 appears as an image in the original publication and is not reproduced here; its fields are described below.]
The semantics of the syntax shown in Table 2 above are as follows: view_group_id indicates the viewpoint group identifier of a viewpoint group of the volumetric video; for example, the viewpoint group identifier of viewpoint group 1 above may be identifier A1. view_group_description is a textual description of the viewpoint group, a null-terminated UTF-8 string. num_views indicates the number of viewpoints in the viewpoint group; for example, viewpoint group 1 above contains at least one viewpoint. For ease of understanding, take the number of viewpoints in viewpoint group 1 as 3, where the 3 viewpoints (e.g., viewpoint 1a, viewpoint 1b, and viewpoint 1c) comprise one base viewpoint and two auxiliary viewpoints. view_id indicates the viewpoint identifier of a viewpoint in the viewpoint group; for example, the viewpoint identifiers of the 3 viewpoints in viewpoint group 1 (viewpoint 1a, viewpoint 1b, and viewpoint 1c) may be identifier A11, identifier A12, and identifier A13, i.e., the viewpoint identifier of the jth viewpoint in viewpoint group 1 may be denoted A1j. For any viewpoint group of the divided volumetric video, the number of viewpoints it contains is a positive integer greater than or equal to 1. basic_view_flag is the base viewpoint identification field in the viewpoint grouping information structure; when its field value is 1, the viewpoint is the base viewpoint. For example, if the first viewpoint in viewpoint group 1 (i.e., the viewpoint indicated when j = 0) is viewpoint 1a, and the base viewpoint identification field corresponding to viewpoint 1a is 1, then viewpoint 1a is the base viewpoint; conversely, if the base viewpoint identification field corresponding to viewpoint 1a is 0, then viewpoint 1a is not the base viewpoint but, for example, an auxiliary viewpoint.
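Likewise, a hedged Python stand-in for the viewpoint grouping information structure of Table 2, with layout assumed from the semantics above:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ViewGroupInfoStruct:
    """Stand-in for Table 2 (image in the original; layout assumed). One group
    identifier plus per-viewpoint entries (e.g. group A1 with views A11-A13)."""
    view_group_id: int            # e.g. identifier A1 for viewpoint group 1
    view_group_description: str   # null-terminated UTF-8 string in the file
    num_views: int                # number of viewpoints in the group, >= 1
    view_ids: List[int]           # e.g. [A11, A12, A13]; the j-th entry is A1j
    basic_view_flags: List[bool]  # True marks the base viewpoint (one per group in the example)
```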
Further, it can be understood that when the server obtains the G viewpoint groups of the volumetric video, in order to help the user of the video client select a suitable target viewpoint group, the embodiments of the present application provide a partial access policy for volumetric video, which is applied at the server, the video client, intermediate nodes, and other links.
Before setting the ith viewpoint group of the G viewpoint groups as the target viewpoint group, however, the server needs to first check whether there is, among the G viewpoint groups, a viewpoint group matching a specified viewpoint group associated with the shooting intention of the content producer. It should be understood that the specified viewpoint group here is related to the shooting intention of the content producer who shot the volumetric video.
Specifically, when acquiring the video related information associated with the volumetric video, the server may search the video related information for a specified viewpoint group associated with the content producer. If the content producer specified, when shooting the volumetric video, a viewpoint group (i.e., a specified viewpoint group) conforming to his or her shooting intention, the video related information will carry the specified viewpoint group; conversely, if the content producer did not specify such a viewpoint group when shooting the volumetric video, the video related information will not carry a specified viewpoint group. For example, when shooting the volumetric video, the content producer may collectively refer to the viewpoint groups related to his or her own shooting intention as specified viewpoint groups, and may send to the server, as the video related information associated with the volumetric video, data information representing the shooting intention of the content producer, such as the specified viewpoint group conforming to the shooting intention and/or the camera parameters used for shooting the volumetric video.
Based on the above, when the server acquires the video related information associated with the volumetric video, the server may search the video related information for a specified viewpoint group associated with the content producer. If the server does not find a specified viewpoint group associated with the content producer in the video related information, it may be determined that the G viewpoint groups are all viewpoint groups unrelated to the shooting intention of the content producer; the server may then acquire, from the G viewpoint groups, an ith viewpoint group unrelated to the shooting intention of the content producer as the target viewpoint group, and may further execute the following step S102, so as to construct, according to the target video content corresponding to the target viewpoint group, a mapping relationship between the target viewpoint group and the target spatial position information for viewing the target video content, and to generate the timing metadata information corresponding to the target viewpoint group based on the constructed mapping relationship.
Optionally, if a specified viewpoint group associated with the content producer is found in the video related information, the server may use the found specified viewpoint group as a recommended viewpoint group, and may obtain, from the G viewpoint groups and based on the recommended viewpoint group, an ith viewpoint group related to the shooting intention of the content producer as the target viewpoint group. In this case the target viewpoint group is a viewpoint group related to the shooting intention of the content producer; the server skips the following steps S102 to S104, directly determines the identifier of the target viewpoint group as a recommended identifier, takes the metadata information describing the recommended identifier as the recommended metadata information of the target viewpoint group, and writes the recommended metadata information into the encapsulated data box corresponding to the volumetric video to obtain the second extended data box corresponding to the encapsulated data box. Further, the server may obtain the encoded video streams associated with the G viewpoint groups, may encapsulate the encoded video streams based on the second extended data box to obtain the video media file of the volumetric video, and may send the video media file to the video client, so that when the video client obtains the second extended data box based on the video media file, the video client can quickly display, based on the recommended identifier indicated by the recommended metadata information in the second extended data box, the target video content corresponding to the target viewpoint group indicated by that recommended identifier.
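As a purely illustrative sketch of the two encapsulation paths described above (the function name and dictionary keys below are assumptions of this illustration, not names defined by the embodiment), the server-side branching may look as follows:

    def build_extended_data_box(video_related_info, view_groups, build_mapping):
        # Path taken when the content producer specified a viewpoint group that
        # conforms to the shooting intention: write only the recommended
        # identifier (second extended data box) and skip steps S102 to S104.
        specified = video_related_info.get("specified_view_group")
        if specified is not None:
            return {"recommended_view_group_flag": 1,      # valid value
                    "rcmd_view_group_id": specified}
        # Path taken when no specified viewpoint group is found: construct, per
        # viewpoint group, the mapping to the spatial position information for
        # viewing its content (first extended data box, steps S102 to S103).
        return {"recommended_view_group_flag": 0,          # invalid value
                "view_group_timed_metadata": {
                    g["view_group_id"]: build_mapping(g) for g in view_groups}}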
Step S102, constructing a mapping relation between a target viewpoint group and target space position information for watching the target video content based on the target video content corresponding to the target viewpoint group, and generating timing metadata information corresponding to the target viewpoint group based on the mapping relation;
It should be understood that the server adds several descriptive fields at the system layer under the system architecture of volumetric video described above in fig. 1. For example, some field extensions of the file encapsulation layer are added at the system layer to extend the ISOBMFF data box (i.e., the above-mentioned encapsulated data box). Specifically, in the embodiment of the present application, a view group timing metadata track is added through the extension of the above encapsulated data box. In other words, in the embodiment of the present application, on the basis of the existing encapsulated data box, the encapsulated data box is extended to obtain a corresponding extended data box, and a view group timing metadata track may be added to the extended data box. The view group timing metadata track is determined by a view group metadata sample entry (sample entry for short) and view group metadata samples (sample for short).
It is to be understood that the view group metadata sample entry and the view group metadata samples can be used to construct the view group timing metadata track of the volumetric video, and the view group timing metadata track can be used to index into one or more atlas data tracks associated with the volumetric video.
It should also be understood that the viewpoint group timing metadata track here may be used to indicate the correspondence between different viewpoint groups and the media content in the subsequently encapsulated video media file. For example, the timing metadata track here may be associated directly with the corresponding atlas data tracks, rather than directly with the video component tracks of the volumetric video.
It can be appreciated that the view group timing metadata track can be quickly indexed to the relevant track or track group by a track reference type (e.g., the 'cdtg' string); for example, the atlas data track associated with the view group timing metadata track can be indexed directly through the 'cdtg' string, in which case one view group timing metadata track is used to associate one atlas data track.
Optionally, the viewpoint group timing metadata track may also be indexed to one or more atlas data tracks by another track reference type (e.g., the 'cdsc' string); for example, when one viewpoint group timing metadata track is used to associate with one or more atlas data tracks, the viewpoint group timing metadata track may be used to describe each atlas data track respectively, and the 'cdsc' string here may have an index sharing function, that is, all of these atlas data tracks may be indexed by the viewpoint group timing metadata track. It should also be understood that the samples in each viewpoint group timing metadata track may be marked as sync samples, where the samples in each viewpoint group timing metadata track refer to data boxes carrying metadata information.
It should be understood that the atlas data track may be used to describe the mapping relationship from the 2D plane to 3D space, and that the atlas data track may in turn be indexed to the component tracks by other track reference types, where a component track carries the specific video information such as texture and color. In the embodiment of the present application, however, in order to save computing resources and improve the decoding and rendering efficiency of the subsequent video client, it is emphasized that the server may directly associate the view group timing metadata track with the corresponding atlas data track, so that during decoding the subsequent video client can quickly obtain, through the atlas data track directly associated with the view group timing metadata track, the mapping relationship from the 2D plane to 3D space and reconstruct the two-dimensionally represented volumetric video into three-dimensional space.
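The track indexing described above may be pictured with the following rough sketch (illustrative only; the dictionary-based track layout is an assumption of this sketch). A 'tref' entry of type 'cdtg' or 'cdsc' in the viewpoint group timing metadata track names the track identifiers (or track group identifier) of the atlas data tracks it describes:

    def atlas_tracks_described_by(meta_track, all_tracks):
        # Collect the atlas data tracks whose track ID (or track group ID)
        # is named by the metadata track's 'cdtg'/'cdsc' track references.
        described = []
        for ref_type in ("cdtg", "cdsc"):
            for ref_id in meta_track.get("tref", {}).get(ref_type, []):
                described += [t for t in all_tracks
                              if t.get("kind") == "atlas"
                              and ref_id in (t.get("track_id"), t.get("track_group_id"))]
        return described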
Wherein the definition of the viewpoint group metadata sample entry is as follows:
Sample entry type: 'vgme';
Container: Sample Description Box ('stsd');
Mandatory: No;
Quantity: 0 or 1.
As can be seen from the above definition of the viewpoint group metadata sample entry, the sample entry type of the viewpoint group metadata sample entry is 'vgme', and the viewpoint group metadata sample entry is defined by ViewGroupMetadataSampleEntry. It can be understood that the viewpoint group metadata sample entry here must contain a ViewGroupStaticMetaBox (i.e., a viewpoint group static metadata box), where the viewpoint group static metadata box can be used to describe static viewpoint group metadata, and the static viewpoint group metadata in the viewpoint group static metadata box can be described by a static viewpoint group metadata field. For example, if the field value of the static viewpoint group metadata field is 1, it indicates that the viewpoint group metadata remains unchanged in all samples corresponding to the sample entry, and the fact that the mapping relationships of the viewpoint groups among the G viewpoint groups remain unchanged can then be described through the static viewpoint group metadata recorded in the sample entry. As another example, if the field value of the static viewpoint group metadata field is 0, it indicates that the viewpoint group metadata changes with time. In this case, the viewpoint group metadata that changes with time may be collectively referred to as dynamic viewpoint group metadata, and the viewpoint group metadata that changes with time needs to be recorded in the viewpoint group metadata samples corresponding to the viewpoint group metadata sample entry.
In other words, the metadata information carried by the viewpoint group metadata sample entry is static viewpoint group metadata, which can be used to describe the viewpoint groups (e.g., viewpoint group 1, viewpoint group 2, and viewpoint group 3) whose mapping relationships remain unchanged among the G viewpoint groups, while the metadata information carried by the viewpoint group metadata samples is dynamic viewpoint group metadata, which is used to describe the viewpoint groups (e.g., viewpoint group 4 and viewpoint group 5) whose mapping relationships change with time among the G viewpoint groups. The timing metadata information written in the viewpoint group metadata sample entry and the viewpoint group metadata samples to describe the mapping relationships of the corresponding viewpoint groups is not limited here.
The static viewpoint group metadata field is deployed in the viewpoint group static metadata box of the first extended data box. If the field value of the static viewpoint group metadata field is the value describing that the mapping relationships of the viewpoint groups among the G viewpoint groups remain unchanged, the static viewpoint group metadata recorded in the viewpoint group metadata sample entry of the first extended data box includes the viewpoint group identifier of each of the G viewpoint groups.
For ease of understanding, please refer to Table 3, which shows the syntax of the viewpoint group metadata sample entry provided in the embodiments of the present application:
TABLE 3
[The syntax of Table 3 is reproduced as an image in the original publication; its fields are described below.]
As shown in Table 3 above, the semantics of the viewpoint group metadata sample entry are as follows. ViewPositionStruct indicates the viewing position coordinate information of the user corresponding to the video client in the overall space of the multi-view video; for example, as shown in Table 3, the 3D spatial position of the viewpoint and the GPS position of the viewpoint are defined in ViewPositionStruct. view_position (i.e., the viewing position) indicates the position from which the user views; this structure contains the specific x, y, and z coordinate information of the user in the overall multi-view media space. view_orientation indicates the direction in which the user is looking; this structure contains the rotation information of the user's head. position_range_flag (i.e., the position movement identification field) indicates whether the movement range information of the user is included. It can be understood that if the position movement identification field takes a valid value (e.g., 1), it may be used to indicate that the viewpoint position structure (i.e., ViewPositionStruct in Table 3) contains the movement range information of the user; optionally, if the position movement identification field takes an invalid value (e.g., 0), it may be used to indicate that the viewpoint position structure (i.e., ViewPositionStruct in Table 3) does not contain the movement range information of the user. position_range indicates the movement range of the user along the x, y, and z axes, with view_position as the coordinate origin.
As shown in Table 3, the viewpoint group metadata sample entry must include the extended ViewGroupStaticMetaBox (i.e., the viewpoint group static metadata box, which is deployed in the first extended data box). The first extended data box contains the following extension fields at the file encapsulation level: the type field of the decision information associated with the viewpoint group may be the extension field view_associated_info_type shown in Table 3 above, and the static viewpoint group metadata field may be the extension field static_view_group_meta shown in Table 3 above.
As shown in Table 3, view_associated_info_type indicates the type of the decision information associated with the viewpoint group. In one embodiment, if this field takes the first value (e.g., 0), it indicates that the associated information is the spatial region of the media content viewed by the user, that is, the target spatial position information having a mapping relationship with the matching viewpoint group is the three-dimensional spatial region information of the matching video content displayed on the video client. Optionally, in another embodiment, if this field takes the second value (e.g., 1), it indicates that the associated information is the user viewing position information, that is, the target spatial position information having a mapping relationship with the matching viewpoint group is the viewing position coordinate information of the user viewing the matching video content on the video client. Optionally, in yet another embodiment, if this field takes the third value (e.g., 2), it indicates that the associated information is the combination of the spatial region of the media content viewed by the user and the user viewing position information, that is, the target spatial position information having a mapping relationship with the target viewpoint group is determined by combining the three-dimensional spatial region information and the viewing position coordinate information.
It can thus be seen that the target spatial position information is determined by the type field of the decision information associated with the viewpoint group that the server records in the viewpoint group metadata sample entry of the first extended data box, and that the type field of the decision information is deployed in the viewpoint group static metadata box of the viewpoint group metadata sample entry. If the type field of the decision information is the first value, the target spatial position information having a mapping relationship with the matching viewpoint group is the three-dimensional spatial region information of the matching video content displayed on the video client; if the type field of the decision information is the second value, the target spatial position information having a mapping relationship with the matching viewpoint group is the viewing position coordinate information of the user viewing the matching video content on the video client; and if the type field of the decision information is the third value, the target spatial position information having a mapping relationship with the target viewpoint group is determined by combining the three-dimensional spatial region information and the viewing position coordinate information.
As shown in Table 3, static_view_group_meta (i.e., the static viewpoint group metadata field) indicates whether the viewpoint group metadata corresponding to each of the above G viewpoint groups remains unchanged in all samples corresponding to the sample entry. If the field value of the static viewpoint group metadata field is the value describing that the mapping relationships of the viewpoint groups among the G viewpoint groups remain unchanged (i.e., if the field value is 1), it indicates that the viewpoint group metadata corresponding to the viewpoint groups among the G viewpoint groups remains unchanged in all samples corresponding to the sample entry; accordingly, the static viewpoint group metadata recorded in the viewpoint group metadata sample entry of the first extended data box may include the viewpoint group identifier of each of the G viewpoint groups.
As shown in Table 3 above: view_group_num (i.e., the number of all viewpoint groups recorded in the sample entry, for example, the number G) indicates the number of viewpoint groups; view_group_id (i.e., the viewpoint group identifier of the ith viewpoint group of the G viewpoint groups) indicates the identifier of a viewpoint group; region_num (i.e., the number of spatial regions corresponding to the ith viewpoint group) indicates the number of 3D spatial regions; 3DSpatialRegionStruct[j] (i.e., the 3D spatial region structure) indicates a spatial region of the media content viewed by the user; and ViewPositionStruct (i.e., the viewpoint position structure) indicates the user viewing position information.
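For illustration, the sample entry fields just described may be modeled as follows; the Python class names mirror the structure names in Table 3, while the field layout is an assumption inferred from the semantics above:

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class ViewPositionStruct:  # viewing position of the user in the multi-view media space
        view_position: Tuple[float, float, float]     # x, y, z coordinates
        view_orientation: Tuple[float, float, float]  # rotation information of the user's head
        position_range_flag: int                      # 1: movement range information present
        position_range: Optional[Tuple[float, float, float]] = None  # range along x, y, z

    @dataclass
    class ViewGroupStaticMeta:  # illustrative content of the viewpoint group static metadata box
        view_associated_info_type: int  # 0: 3D region, 1: viewing position, 2: both
        static_view_group_meta: int     # 1: metadata unchanged in all samples of this entry
        view_groups: List[dict] = field(default_factory=list)
        # per entry: view_group_id plus region_num regions and/or a ViewPositionStruct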
For ease of understanding, please refer to Table 4, which shows the syntax of the spatial region information structure provided in the embodiments of the present application. The spatial region information structure contains at least the following two substructures: a 3D spatial region structure (3DSpatialRegionStruct) and a 3D bounding box structure (3DBoundingBoxStruct). The 3D spatial region structure provides the spatial region information of the volumetric video (including the offsets of the spatial region along the x, y, and z axes, and the width, height, and depth of the 3D spatial region), and the 3D bounding box structure provides the bounding box information of the volumetric video. This means that the syntax of the 3D spatial region structure called in Table 3 above can be seen in Table 4 below:
TABLE 4
[The syntax of Table 4 is reproduced as an image in the original publication; its fields are described below.]
As shown in Table 4, the semantics of the spatial region information structure are as follows: 3d_region_id indicates the identifier of a spatial region. x, y, z in the 3D point structure indicate the x, y, and z coordinate values of a 3D point in the Cartesian coordinate system, respectively. cuboid_dx, cuboid_dy, cuboid_dz indicate the dimensions of the cuboid subregion along the x, y, and z axes in the Cartesian coordinate system, relative to the anchor point. anchor indicates a 3D point in the Cartesian coordinate system serving as the anchor of the 3D spatial region. bb_dx, bb_dy, bb_dz indicate the dimensions of the entire volumetric video 3D bounding box along the x, y, and z axes in the Cartesian coordinate system, relative to the origin (0, 0, 0). dimensions_included_flag is an identifier indicating whether the spatial dimensions have been specified; when the value of this field is 1, it indicates that the spatial dimensions indicated by the 3D spatial region structure have been specified.
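A minimal sketch of these fields, together with a point-in-region test built on them, is given below; the containment helper is an addition of this illustration, not a field of Table 4:

    from dataclasses import dataclass

    @dataclass
    class Point3D:
        x: float
        y: float
        z: float

    @dataclass
    class SpatialRegion3D:  # cuboid subregion anchored at 'anchor'
        region_id: int      # 3d_region_id
        anchor: Point3D
        cuboid_dx: float
        cuboid_dy: float
        cuboid_dz: float

    def region_contains(r: SpatialRegion3D, p: Point3D) -> bool:
        # True when p lies inside the cuboid spanned from the anchor point.
        return (r.anchor.x <= p.x <= r.anchor.x + r.cuboid_dx
                and r.anchor.y <= p.y <= r.anchor.y + r.cuboid_dy
                and r.anchor.z <= p.z <= r.anchor.z + r.cuboid_dz)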
The rotation structure in Table 3 provides the rotation information required for converting the local coordinate axes to the global coordinate axes, and the rotation information is expressed as Euler angles or as a quaternion. In the case of stereoscopic panoramic video, the rotation structure is applied to each binocular view. For ease of understanding, please refer to Table 5, which shows the syntax of the rotation structure provided in the embodiments of the present application:
TABLE 5
[The syntax of Table 5 is reproduced as an image in the original publication; its fields are described below.]
As shown in Table 5 above, the semantics of the rotation structure are as follows: 3d_rotation_type indicates the representation type of the rotation information; a field value of 0 indicates that the rotation information is given in the form of Euler angles, a field value of 1 indicates that the rotation information is given in the form of a quaternion, and the remaining values are reserved. rotation_yaw, rotation_pitch, and rotation_roll indicate, respectively, the yaw, pitch, and roll angles of the rotation along the X, Y, and Z axes used to convert the local coordinate axes of the unit sphere to the global coordinate axes, with a precision of 2^-16 relative to the global coordinate axes. rotation_yaw ranges over [-180°*2^16, 180°*2^16 - 1], rotation_pitch ranges over [-90°*2^16, 90°*2^16], and rotation_roll ranges over [-180°*2^16, 180°*2^16 - 1]. rotation_x, rotation_y, rotation_z, and rotation_w indicate the values of the x, y, z, and w components of the quaternion, respectively, and are used for converting the local coordinate axes of the unit sphere to the global coordinate axes.
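For illustration of the 2^16 fixed-point convention above, decoding the Euler-angle fields into degrees may be sketched as follows (the range checks restate the stated value ranges; the function itself is an assumption of this illustration):

    def decode_euler_angles(rotation_yaw: int, rotation_pitch: int, rotation_roll: int):
        # Field values are angles in degrees scaled by 2**16 (precision 2**-16).
        scale = 2 ** 16
        yaw, pitch, roll = rotation_yaw / scale, rotation_pitch / scale, rotation_roll / scale
        assert -180.0 <= yaw <= 180.0 - 1 / scale    # rotation_yaw in [-180*2^16, 180*2^16 - 1]
        assert -90.0 <= pitch <= 90.0                # rotation_pitch in [-90*2^16, 90*2^16]
        assert -180.0 <= roll <= 180.0 - 1 / scale   # rotation_roll in [-180*2^16, 180*2^16 - 1]
        return yaw, pitch, roll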
Optionally, if the field value of the static viewpoint group metadata field deployed in the viewpoint group static metadata box of the first extended data box is the value associated with dynamic viewpoint group metadata (that is, if the field takes the value 0), and the dynamic viewpoint group metadata is used to describe the viewpoint groups whose mapping relationships change with time among the G viewpoint groups, it indicates that among the G viewpoint groups there are viewpoint groups whose viewpoint group metadata changes with time. In this case, the dynamic viewpoint group metadata recorded in the viewpoint group metadata samples corresponding to the viewpoint group metadata sample entry includes the identifier of a variable viewpoint group whose mapping relationship changes at a sample timestamp, and before that sample timestamp the mapping relationship between the variable viewpoint group and the video content corresponding to the variable viewpoint group remains unchanged. A variable viewpoint group is a viewpoint group among the G viewpoint groups whose mapping relationship changes with time.
For example, it can be understood that one viewpoint group metadata sample may correspond to a plurality of viewpoint groups. If, for the viewpoint group 2 described in the viewpoint group metadata sample entry, the field value of the static viewpoint group metadata field changes from a valid value (e.g., 1) to an invalid value (e.g., 0) at a certain sample timestamp (e.g., the play timestamp corresponding to the 10th minute of the volumetric video), the server may record viewpoint group 2, whose mapping relationship changes at the 10th minute, into a viewpoint group metadata sample. This means that the mapping relationship (which may also be referred to as the correspondence relationship) indicated by the timing metadata information of viewpoint group 2 is valid and remains unchanged until the 10th minute, that is, before the 10th minute the timing metadata information of viewpoint group 2 is static viewpoint group metadata. After the 10th minute, however, the timing metadata information of viewpoint group 2 needs to be redefined in the next sample (i.e., the next viewpoint group metadata sample); for example, after the 10th minute, the timing metadata information of viewpoint group 2 may be redefined as the aforementioned dynamic viewpoint group metadata.
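The remains-valid-until-redefined rule in this example may be made concrete with the following sketch (illustrative only): the metadata in force for a viewpoint group at playback time t is that of the latest sample at or before t.

    def metadata_in_force(samples, view_group_id, t):
        # samples: list of (timestamp, {view_group_id: metadata}) sorted by timestamp.
        # A definition stays valid until a later sample redefines it; e.g. viewpoint
        # group 2's mapping recorded at minute 10 leaves the earlier mapping valid
        # for all times before minute 10.
        current = None
        for timestamp, per_group in samples:
            if timestamp > t:
                break
            current = per_group.get(view_group_id, current)
        return current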
A viewpoint group metadata sample is used to indicate the relevant metadata information of the corresponding viewpoint groups, and the metadata of a viewpoint group defined in a previous sample remains unchanged until the metadata information of that viewpoint group is redefined in a subsequent sample. For ease of understanding, please refer to Table 6 below, which shows the syntax of the viewpoint group metadata sample provided in an embodiment of the present application:
TABLE 6
[The syntax of Table 6 is reproduced as an image in the original publication; its fields are described below.]
As shown in Table 6 above, the semantics of the viewpoint group metadata sample are as follows. view_group_num indicates the number of viewpoint groups, contained in the current sample, whose mapping relationships recorded in the viewpoint group metadata sample change with time when the current time is the sample timestamp. For example, for the above 5 (e.g., G = 5) viewpoint groups, if the mapping relationships of 4 (e.g., G1 = 4) viewpoint groups currently change with time, while the mapping relationship of 1 (e.g., G2 = 1) viewpoint group remains unchanged across all samples of the above sample entry, the timing metadata of that 1 viewpoint group may be recorded in the sample entry shown in Table 3 above, and the timing metadata of the currently changing 4 viewpoint groups remains in force until the mapping relationships indicated by the timing metadata information of these viewpoint groups are redefined in the next sample. The numbers G1 and G2 within the G viewpoint groups are not limited here. Before the sample timestamp, the mapping relationships of the G1 viewpoint groups are still valid and remain unchanged. view_group_id indicates the identifier of the ith viewpoint group among the G1 viewpoint groups recorded in the viewpoint group metadata sample. The spatial position information associated with the type field of the decision information is redefined in the sample following the viewpoint group metadata sample based on the field value of the type field of the decision information recorded in the above sample entry. For example, if the field value of the type field of the decision information is the first value (i.e., 0), it indicates that the target spatial position information having a mapping relationship with the matching viewpoint group is the three-dimensional spatial region information of the matching video content displayed on the video client; in this case, region_num indicates the number of 3D spatial regions, and 3DSpatialRegionStruct[j] indicates a spatial region of the media content viewed by the user. Optionally, if the type field of the decision information is the second value, the target spatial position information having a mapping relationship with the matching viewpoint group is the viewing position coordinate information of the user viewing the matching video content on the video client; in this case, ViewPositionStruct indicates the user viewing position information. Optionally, if the type field of the decision information is the third value, the target spatial position information having a mapping relationship with the target viewpoint group is determined by combining the three-dimensional spatial region information and the viewing position coordinate information.
As shown in Table 6, a recommended viewpoint group identification field (i.e., recommended_view_group_flag) is further recorded in the viewpoint group metadata sample. recommended_view_group_flag indicates whether a recommended viewpoint group is contained. A field value of 1 indicates that the sample contains a recommended viewpoint group, which in turn indirectly reflects that a specified viewpoint group related to the shooting intention of the content producer exists among the G viewpoint groups; a field value of 0 indicates that the sample does not contain a recommended viewpoint group, which in turn indirectly reflects that no specified viewpoint group related to the shooting intention of the content producer exists among the G viewpoint groups. rcmd_view_group_id (i.e., the recommended identifier of the recommended viewpoint group) indicates the identifier of the viewpoint group recommended for presentation. It should be understood that the recommended identifier of the recommended viewpoint group may be used to instruct the video client, upon acquiring the second extended data box, to quickly determine the recommended viewpoint group corresponding to the recommended identifier, so that the video content corresponding to the recommended viewpoint group can be quickly decoded from the encoded video stream, further improving the decoding and presentation efficiency of the video client.
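For illustration, the sample-level fields of Table 6 may be modeled as follows (field layout inferred from the semantics above; the class name is an assumption of this sketch):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class ViewGroupMetadataSample:  # illustrative model of the Table 6 fields
        view_group_num: int  # groups whose mapping changes at this sample timestamp
        view_groups: List[dict] = field(default_factory=list)
        # per entry: view_group_id plus, depending on view_associated_info_type,
        # 3D spatial regions and/or a user viewing position
        recommended_view_group_flag: int = 0       # 1: a recommended viewpoint group is contained
        rcmd_view_group_id: Optional[str] = None   # identifier of the recommended viewpoint group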
Step S103, writing the timing metadata information corresponding to the target viewpoint group into a packaging data box corresponding to the volume video to obtain a first extended data box corresponding to the packaging data box;
the first extended data box comprises timing metadata information corresponding to each of G viewpoint groups;
Step S104, acquiring the encoded video streams associated with the G viewpoint groups, and encapsulating the encoded video streams based on the first extended data box to obtain the video media file of the volumetric video;
and step S105, sending the video media file to the video client.
It should be understood that when the server issues the video media file to the video client, the video client may display, when acquiring the first extended data box based on the video media file, the target video content corresponding to the target viewpoint group on the video client according to the target spatial position information indicated by the timing metadata information corresponding to the target viewpoint group in the first extended data box.
It should be understood that, in the embodiment of the present application, in the case that no specified viewpoint group related to the shooting intention of the content producer is found in the video related information, the ith viewpoint group of the G viewpoint groups may be used as the target viewpoint group to construct the mapping relationship between the target viewpoint group and the target spatial position information. It should also be understood that, with reference to the specific process for constructing the mapping relationship between the target viewpoint group and the target spatial position information, a mapping relationship between each of the G viewpoint groups and the spatial position information for viewing the video content of the corresponding viewpoint group may likewise be constructed, and the metadata information describing the corresponding mapping relationships may then be collectively referred to as the timing metadata information corresponding to each viewpoint group, so that the above steps S103 to S104 may be further executed and the first extended data box can intelligently provide the timing metadata information corresponding to each viewpoint group. Thus, when the video client decapsulates the video media file to obtain the first extended data box, the current spatial position information of the video client may be matched with the spatial position information indicated by the timing metadata information corresponding to each of the G viewpoint groups; if the spatial position information matching the current spatial position information of the video client is the target spatial position information, the target video content of the target viewpoint group associated with the target spatial position information may be obtained by partially decoding the encoded video stream associated with the G viewpoint groups, and the target video content of the target viewpoint group may be displayed in the video client. For example, the volumetric video represented two-dimensionally may be reconstructed into three-dimensional space as accurately as possible based on the mapping relationship from the 2D plane to 3D space indicated in the atlas data track corresponding to the target viewpoint group, so as to finally render the three-dimensionally represented target video content on the video client.
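A minimal sketch of the client-side matching just described is given below (illustrative only; the dictionary layout and function name are assumptions of this sketch):

    def select_matching_view_group(view_group_timed_metadata, current_position):
        # view_group_timed_metadata: {view_group_id: spatial position information
        # indicated by that group's timing metadata}. Returns the matching
        # (target) viewpoint group, or None if no group's spatial position
        # information equals the client's current spatial position information.
        for view_group_id, spatial_info in view_group_timed_metadata.items():
            if spatial_info == current_position:
                return view_group_id
        return None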
In this embodiment, when acquiring the G viewpoint groups of the volumetric video, the server (i.e., the encoding end) may use the ith viewpoint group of the G viewpoint groups as the target viewpoint group, where i is a non-negative integer smaller than G. It should be understood that each of the G viewpoint groups may serve as the target viewpoint group; the ith viewpoint group is used here as an example. In this way, the server (i.e., the encoding end) may further construct the mapping relationship between the target viewpoint group and the target spatial position information for viewing the target video content, and may write the timing metadata information corresponding to the target viewpoint group, generated based on that mapping relationship, into the encapsulated data box corresponding to the volumetric video to obtain the first extended data box corresponding to the encapsulated data box. It should be understood that the first extended data box may contain the timing metadata information corresponding to each of the G viewpoint groups; in other words, the timing metadata information in the first extended data box may be used to describe the one-to-one mapping relationship between each viewpoint group and the spatial position information for viewing the corresponding video content. In this way, after the server (i.e., the encoding end) encapsulates the encoded video streams associated with the G viewpoint groups based on the first extended data box, a video media file for delivery to the video client (i.e., the decoding end) can be obtained. Therefore, when the video client decapsulates the video media file to obtain the first extended data box carrying the added timing metadata information, the video client may quickly determine the target viewpoint group mapped to by the target spatial position information according to the timing metadata information written in the first extended data box and the current spatial position information (e.g., the target spatial position information) of the video client, and may then decode in the video client the target video content corresponding to the target viewpoint group, so as to play the video content corresponding to this specific viewpoint group, thereby improving the decoding and presentation efficiency of the video client in the process of partially accessing the volumetric video.
Further, please refer to fig. 3, wherein fig. 3 is a flowchart illustrating a data processing method for a volume video according to an embodiment of the present application. The method may be performed by an encoding device in a volumetric video system, the encoding device may be the server described above, the method may comprise the steps of:
step S201, G viewpoint groups of the volume video are obtained;
specifically, the server may obtain V viewpoints of the volumetric video, and may perform viewpoint grouping on the V viewpoints based on a viewpoint dependency relationship between the V viewpoints to obtain G viewpoint groups of the volumetric video;
V denotes the number of viewpoints of the volumetric video and is a positive integer greater than or equal to 2; the viewpoint dependency relationship is determined by the content correlation between the video contents respectively corresponding to the V viewpoints. That is, in the embodiment of the present application, viewpoints corresponding to video contents with higher content correlation (e.g., content correlation greater than or equal to a correlation threshold) may be classified into the same viewpoint group, and viewpoints corresponding to video contents with lower content correlation (e.g., content correlation less than the correlation threshold) may be classified into different viewpoint groups.
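As an illustration of grouping by content correlation (the greedy strategy and the correlation callback below are assumptions of this sketch, not an algorithm mandated by the embodiment):

    def group_viewpoints(viewpoints, correlation, threshold):
        # A viewpoint joins an existing group when its content correlation with
        # every member reaches the threshold; otherwise it starts a new group.
        groups = []
        for v in viewpoints:
            for g in groups:
                if all(correlation(v, member) >= threshold for member in g):
                    g.append(v)
                    break
            else:
                groups.append([v])
        return groups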
Step S202, acquiring video related information related to the volume video, and searching a specified viewpoint group related to a content producer in the video related information;
wherein the specified viewpoint group is related to a photographing intention of a content producer who photographs the volume video. It should be understood that, in the process of executing step S202, if the specified view group associated with the content producer is not found in the video associated information, the encoding device (i.e., the server) may further execute steps S203-S207, and further write the constructed timing metadata information corresponding to each view group into the encapsulation data box, so as to obtain a first extended data box for instructing to perform file encapsulation on the encoded video code stream according to the above ISOBMFF file encapsulation format. Alternatively, if the specified viewpoint group associated with the content producer is found in the video associated information, the following steps S208 to S211 may be further performed.
Step S203, if the appointed viewpoint group related to the content producer is not found in the video related information, acquiring the ith viewpoint group which is irrelevant to the shooting intention of the content producer from the G viewpoint groups as a target viewpoint group;
step S204, based on the target video content corresponding to the target viewpoint group, a mapping relation between the target viewpoint group and the target space position information for watching the target video content is constructed, and based on the mapping relation, timing metadata information corresponding to the target viewpoint group is generated.
Step S205, writing the timing metadata information corresponding to the target viewpoint group into a packaging data box corresponding to the volumetric video to obtain a first extended data box corresponding to the packaging data box;
the first extended data box contains the timing metadata information corresponding to each of the G viewpoint groups. It should be understood that, in the process of writing the timing metadata information into the encapsulated data box and extending the encapsulated data box, the encoding device (i.e., the server) according to the embodiment of the present application further adds a recommended viewpoint group identification field in the viewpoint group metadata samples of the encapsulated data box, sets the field value of the recommended viewpoint group identification field to the invalid field value, and uses the recommended viewpoint group identification field having the invalid field value as the first field indication information in the viewpoint group metadata samples of the first extended data box; the first field indication information is used to instruct the video client to acquire the timing metadata information of each of the G viewpoint groups from the first extended data box.
Step S206, acquiring coded video code streams associated with the G viewpoint groups, and packaging the coded video code streams based on the first extended data box to obtain video media files of the volume video;
step S207, the video media file is sent to the video client, so that when the video client acquires the first extended data box based on the video media file, the video client displays the target video content corresponding to the target viewpoint group on the video client according to the target spatial position information indicated by the timing metadata information corresponding to the target viewpoint group in the first extended data box.
It should be understood that, for the sake of distinction, in the embodiment of the present application, the video media file packaged in steps S203 to S207 may be referred to as a first video media file, and the video media file packaged in steps S208 to S211 may be referred to as a second video media file.
For a specific implementation manner of steps S203 to S207, reference may be made to the description of the specific process of obtaining the first extended data box and performing encapsulation processing on the encoded video code stream based on the first extended data box in the embodiment corresponding to fig. 2, which will not be described again here.
Optionally, in step S208, if the specified viewpoint group associated with the content producer is found in the video association information, the found specified viewpoint group is used as a recommended viewpoint group;
in step S209, an ith viewpoint group related to the photographing intention of the content producer is acquired from the G viewpoint groups based on the recommended viewpoint group as a target viewpoint group.
Step S210, determining the identifier of the target viewpoint group as a recommended identifier, taking the metadata information describing the recommended identifier as the recommended metadata information of the target viewpoint group, and writing the recommended metadata information into the encapsulated data box corresponding to the volumetric video to obtain the second extended data box corresponding to the encapsulated data box;
for the recommended identifier indicated by the recommended metadata information written in the second extended data box, reference may be made to the description of the viewpoint group metadata sample in the embodiment corresponding to Table 6, which will not be repeated here.
Wherein, it should be understood that, when writing the recommendation metadata information into the encapsulated data box corresponding to the volumetric video, the encoding apparatus (i.e., the above-mentioned server) may further add a recommended viewpoint group identification field associated with the recommendation identifier in the viewpoint group metadata sample of the encapsulated data box, and set the field value of the recommended viewpoint group identification field as the valid field value, and in the viewpoint group metadata sample of the second extended data box, take the recommended viewpoint group identification field having the valid field value as the second field indication information; the second field indication information is used for indicating the video client to acquire recommended metadata information from the second extended data box; the video client is a client in the decoding device corresponding to the encoding device.
Step S211, obtaining coded video code streams associated with the G viewpoint groups, packaging the coded video code streams based on the second extended data box to obtain video media files of the volume video, and sending the video media files to the video client, so that when the video client obtains the second extended data box based on the video media files, the video client displays target video content corresponding to the target viewpoint group indicated by the recommended identifier based on the recommended identifier indicated by the recommended metadata information in the second extended data box.
In the embodiment of the present application, the server may divide the plurality of viewpoints of the volumetric video (i.e., the plurality of viewpoints of the above-described multi-view video) into different viewpoint groups, and may determine whether there is, among these viewpoint groups, a viewpoint group matching the specified viewpoint group associated with the photographer's shooting intention. If there is, the viewpoint group (for example, the ith viewpoint group of the G viewpoint groups) matching the specified viewpoint group associated with the photographer's shooting intention may be regarded as the recommended viewpoint group; in this case the recommended viewpoint group may be regarded as the target viewpoint group, the identifier of the recommended viewpoint group (i.e., the target viewpoint group) may be determined as the recommended identifier, and the metadata information describing the recommended identifier may be used as the recommended metadata information of the recommended viewpoint group (i.e., the target viewpoint group) and written into the encapsulated data box to obtain the second extended data box. The server may then encapsulate the encoded video stream based on the second extended data box to obtain a video media file (i.e., the second video media file) for delivery to the video client, so that when the video client receives this video media file, the video client may directly display, based on the recommended identifier indicated by the recommended metadata information in the second extended data box, the video content corresponding to the recommended viewpoint group indicated by the recommended identifier. Optionally, if no viewpoint group matching a specified viewpoint group associated with the photographer's shooting intention exists among the viewpoint groups, the ith viewpoint group may be set as the target viewpoint group among the G viewpoint groups unrelated to the shooting intention of the content producer; a mapping relationship between the target viewpoint group and the target spatial position information for viewing the target video content may then be established for the target video content corresponding to the target viewpoint group, and the timing metadata information corresponding to the target viewpoint group, generated based on that mapping relationship, may be written into the encapsulated data box corresponding to the volumetric video to obtain the first extended data box.
It should be understood that when the video client acquires a video media file (i.e., the first video media file) encapsulated based on the first extended data box, the first extended data box may be obtained by decapsulation; the spatial position information currently viewed by the user (for example, information such as the spatial region currently viewed by the user and the user position) may then be compared with the spatial position information indicated by the timing metadata information of each viewpoint group in the first extended data box, so as to intelligently select the viewpoint group best matching the currently viewed spatial position information (for example, the best-matching viewpoint group may be the target viewpoint group), decode the file track carrying the target video content corresponding to the target viewpoint group, and render and present the target video content corresponding to the target viewpoint group in the video client.
Further, please refer to fig. 4, where fig. 4 is a flowchart illustrating a data processing method for a volumetric video according to an embodiment of the present application. The method may be performed by a decoding device in a volumetric video system; the decoding device may be a user terminal integrated with the video client. The method may include the following steps:
step S301, receiving a video media file of the volumetric video sent by the server, and performing decapsulation processing on the video media file to obtain a video coding stream of the volumetric video and an extended data box corresponding to the video coding stream;
the extended data box comprises a recommended viewpoint group identification field; the server may be the server in the embodiment corresponding to fig. 2 or fig. 3. It should be understood that the video media file received by the video client running in the user terminal may be a first video media file packaged according to the file packaging format indicated by the first extended data box, or a second video media file packaged according to the file packaging format indicated by the second extended data box. It should be understood that whether the video media file is the first video media file or the second video media file is determined by a field value of the recommended viewpoint group identification field contained in the extended data box. For example, if the field value of the recommended viewpoint group identification field is an invalid value, the received video media file is the first video media file, and the following steps S302 to S305 may be further performed; for another example, if the field value of the recommended viewpoint group identification field is a valid value, the received video media file is the second video media file, and the following steps S306 to S307 may be further performed.
Step S302, if the field value of the recommended viewpoint group identification field is an invalid value, acquiring timing metadata information respectively corresponding to each viewpoint group in G viewpoint groups of the volumetric video in a first extended data box of the extended data boxes;
specifically, if the field value of the recommended viewpoint group identification field is an invalid value (for example, the field value of the recommended viewpoint group identification field is 0), then in a first extended data box of the extended data boxes, static viewpoint group metadata fields associated with G viewpoint groups are obtained; if the field value of the static viewpoint group metadata field is a value for describing that the mapping relationship of each of the G viewpoint groups remains unchanged (for example, the field value of the static viewpoint group metadata field is 1), the timing metadata information respectively corresponding to each of the G viewpoint groups is acquired based on the first field indication information associated with the recommended viewpoint group identification field, so as to further perform the following step S303.
The static viewpoint group metadata field is deployed in the viewpoint group static metadata box of the first extended data box. If the field value of the static viewpoint group metadata field is the value describing that the mapping relationships of the viewpoint groups among the G viewpoint groups remain unchanged, the static viewpoint group metadata recorded in the viewpoint group metadata sample entry of the first extended data box includes the viewpoint group identifier of each of the G viewpoint groups.
If the field value of the static viewpoint group metadata field is the value associated with dynamic viewpoint group metadata, and the dynamic viewpoint group metadata is used to describe the viewpoint groups whose mapping relationships change with time among the G viewpoint groups, then the dynamic viewpoint group metadata recorded in the viewpoint group metadata samples corresponding to the viewpoint group metadata sample entry includes the identifier of a variable viewpoint group whose mapping relationship changes at the sample timestamp, and before the sample timestamp the mapping relationship between the variable viewpoint group and the video content corresponding to the variable viewpoint group remains unchanged. It should be understood that, in the embodiments of the present application, the viewpoint groups whose mapping relationships change with time among the G viewpoint groups may be collectively referred to as variable viewpoint groups.
The viewpoint group metadata sample entry and the viewpoint group metadata samples are used to construct the viewpoint group timing metadata track of the volumetric video, and the viewpoint group timing metadata track is used to index into one or more atlas data tracks associated with the volumetric video. For details, please refer to the description of the first extended data box in the embodiment corresponding to fig. 2.
Step S303, obtaining the ith viewpoint group in the G viewpoint groups, taking the spatial position information of the video client as the spatial position information to be compared, and comparing the spatial position information to be compared with the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group to obtain a comparison result;
wherein i is a non-negative integer less than G;
for example, for each of G viewpoint groups (e.g., the viewpoint group 1, the viewpoint group 2, the viewpoint group 3, the viewpoint group 4, and the viewpoint group 5), spatial position information indicated by timing metadata information corresponding to each viewpoint group may be compared with spatial position information to be compared (e.g., information such as a spatial region currently viewed by a user, user position information, etc.) to obtain sub-comparison results between the spatial position information to be compared and spatial position information associated with each viewpoint group, which may be collectively referred to as comparison results, and then, when the spatial position information to be compared is the same as the spatial position information indicated by timing metadata information corresponding to the ith viewpoint group in the G viewpoint groups, the following step S304 may be performed.
Step S304, if the comparison result indicates that the spatial position information to be compared is the same as the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group, taking the ith viewpoint group as a matching viewpoint group, and taking the spatial position information indicated by the timing metadata information corresponding to the matching viewpoint group as target spatial position information;
the target spatial position information is determined by the type field of the decision information associated with the viewpoint group that the server records in the viewpoint group metadata sample entry of the first extended data box; the type field of the decision information is deployed in the viewpoint group static metadata box of the viewpoint group metadata sample entry. If the type field of the decision information is the first value, the target spatial position information having a mapping relationship with the matching viewpoint group is the three-dimensional spatial region information of the matching video content displayed on the video client; if the type field of the decision information is the second value, the target spatial position information having a mapping relationship with the matching viewpoint group is the viewing position coordinate information of the user viewing the matching video content on the video client; and if the type field of the decision information is the third value, the target spatial position information having a mapping relationship with the target viewpoint group is determined by combining the three-dimensional spatial region information and the viewing position coordinate information.
Step S305, based on the mapping relation between the matching viewpoint group and the target space position information, decoding the video coding stream of the volume video to obtain the matching video content corresponding to the matching viewpoint group, and displaying the matching video content corresponding to the matching viewpoint group on the video client.
Optionally, in step S306, if the field value of the recommended viewpoint group identification field is a valid value, in a second extended data box of the extended data box, based on second field indication information associated with the recommended viewpoint group identification field, a recommended identifier of the recommended viewpoint group indicated by the recommended metadata information is obtained;
step S307, decoding the video coding stream of the volume video to obtain recommended video content corresponding to the recommended viewpoint group, and displaying the recommended video content corresponding to the recommended viewpoint group on the video client.
It can thus be seen that, in the embodiment of the application, different field indication information can be acquired adaptively according to the different metadata information provided in the extended data box. For example, if the obtained field indication information is the first field indication information, the timing metadata information corresponding to each viewpoint group written in the first extended data box may be determined; the spatial position information matching the current spatial position information of the video client may then be intelligently determined according to the timing metadata information corresponding to each viewpoint group and the current spatial position information of the video client obtained in real time, and the spatial position information so determined (for example, the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group) may be used as the basis for selecting a specific viewpoint group, where the selected specific viewpoint group is the matching viewpoint group. Optionally, if the obtained field indication information is the second field indication information, the video content corresponding to the recommended viewpoint group may be obtained by decoding directly according to the recommended viewpoint group indicated by the recommended metadata information corresponding to the recommended viewpoint group written in the second extended data box, so as to further improve the decoding and presentation efficiency of the video client.
Further, please refer to fig. 5, where fig. 5 is a flowchart illustrating a data processing method for volumetric video according to an embodiment of the present application. The method may be performed by a decoding device in a volumetric video system, where the decoding device may be a user terminal integrated with the video client, and the method may include the following steps:
step S401, receiving a video media file of a volumetric video sent by a server, and performing decapsulation processing on the video media file to obtain a video coding stream of the volumetric video and an extended data box corresponding to the video coding stream; the extended data box comprises a recommended viewpoint group identification field;
step S402, if the field value of the recommended viewpoint group identification field is an invalid value, acquiring timing metadata information respectively corresponding to each viewpoint group in G viewpoint groups of the volumetric video in a first extended data box of the extended data boxes;
step S403, obtaining the ith viewpoint group in the G viewpoint groups, taking the spatial position information of the video client as the spatial position information to be compared, and comparing the spatial position information to be compared with the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group to obtain a comparison result;
wherein i is a non-negative integer less than G;
step S404, if the comparison result indicates that the spatial position information to be compared is the same as the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group, taking the ith viewpoint group as a matching viewpoint group, and taking the spatial position information indicated by the timing metadata information corresponding to the matching viewpoint group as target spatial position information;
step S405, based on the mapping relationship between the matching viewpoint group and the target spatial position information, decoding the matching video content corresponding to the matching viewpoint group from the video coding stream of the volumetric video, and displaying the matching video content corresponding to the matching viewpoint group on the video client.
Step S406, when the spatial position information of the video client is changed from the target spatial position information to spatial position update information, taking the mapping relationship between the spatial position update information and the changed viewpoint group as an update relationship, based on the timing metadata information corresponding to the spatial position update information in the first extended data box;
step S407, decoding the video encoded stream of the volumetric video based on the update relationship to obtain the video content corresponding to the changed view group, and displaying the video content corresponding to the changed view group on the video client.
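A self-contained Python sketch of the comparison and switching logic in steps S403 to S407 follows; the ViewpointGroup structure and the helper function are hypothetical stand-ins for the timing metadata described above:

```python
# Illustrative, self-contained sketch of steps S403-S407; the structures and
# the helper below are hypothetical and only mirror the comparison described.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ViewpointGroup:
    group_id: int
    spatial_info: tuple     # spatial position info indicated by its timing metadata

def find_matching_group(groups: List[ViewpointGroup],
                        client_info: tuple) -> Optional[ViewpointGroup]:
    # Steps S403-S404: compare the client's spatial position information with
    # the spatial position indicated by each group's timing metadata.
    for group in groups:                        # i ranges over 0..G-1
        if group.spatial_info == client_info:   # comparison result: identical
            return group                        # -> the matching viewpoint group
    return None

groups = [ViewpointGroup(0, (0.0, 0.0, 1.0)), ViewpointGroup(1, (2.0, 0.0, 1.0))]
current = find_matching_group(groups, (0.0, 0.0, 1.0))   # matching viewpoint group
# Steps S406-S407: the client position changes -> re-resolve (update relationship).
updated = find_matching_group(groups, (2.0, 0.0, 1.0))   # the changed viewpoint group
```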
It should be understood that the first extended data box includes a viewpoint position structure for describing user position information, which contains a view_position structure for indicating the user's viewing position, a view_orientation structure for indicating the user's viewing orientation, and a position_range_flag field for indicating whether movement range information of the user is included. The view_position structure may include the user's specific x, y, z coordinate information in the overall multi-view media space, and the view_orientation structure may include the rotation information of the user's head; for the semantics of the head rotation information, reference may be made to the description of the rotation structure for indicating user rotation information shown in Table 5 above. A field value of 1 (i.e., a valid value) for the position_range_flag field indicates that the viewpoint position structure deployed in the first extended data box contains the user's movement range information; conversely, a field value of 0 (i.e., an invalid value) indicates that the viewpoint position structure deployed in the first extended data box does not contain the user's movement range information.
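The viewpoint position structure described above might be modeled as follows — a sketch in Python whose field names follow the text, while the concrete types and the layout of the movement range information are assumptions:

```python
# Sketch of the viewpoint position structure; field names follow the text,
# while the types and the movement-range layout are assumptions.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ViewOrientation:
    yaw: float    # head rotation, semantics as in the rotation structure (Table 5)
    pitch: float
    roll: float

@dataclass
class ViewpointPosition:
    view_position: Tuple[float, float, float]   # (x, y, z) in the overall multi-view space
    view_orientation: ViewOrientation           # the user's viewing orientation
    position_range_flag: int                    # 1: movement range info present, 0: absent
    movement_range: Optional[tuple] = None      # only meaningful when the flag is 1
```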
Based on this, suppose the user views the video content corresponding to viewpoint group 1 at the target spatial position information (for example, spatial position information 1, indicating that the rotation information of the user at viewing position P1 is rotation information W1). Once the user's head rotates, the rotation information changes, and the spatial position information changes with it. At this time, when the spatial position information of the video client used by the user is changed from the target spatial position information to other spatial position information (for example, spatial position information 2, indicating that the rotation information of the user at viewing position P2 is rotation information W2), the video client may refer to spatial position information 2 as spatial position update information, and may further, based on the timing metadata information corresponding to the spatial position update information in the first extended data box, take the mapping relationship between the spatial position update information and the changed viewpoint group (for example, viewpoint group 2) as an update relationship, so that the video content corresponding to the changed viewpoint group (i.e., viewpoint group 2) can be decoded from the video encoded stream of the volumetric video based on the update relationship and displayed on the video client. Therefore, in the embodiment of the present application, based on the mapping relationship associated with each viewpoint group provided by the first extended data box, switching of viewpoint groups can be performed adaptively according to the viewing requirements of the user, which improves both the decoding and presentation efficiency and the viewing experience of the user during partial access of the volumetric video.
Further, please refer to fig. 6, where fig. 6 is a schematic structural diagram of a data processing apparatus for volumetric video according to an embodiment of the present application. The data processing apparatus for volumetric video may be a computer program (comprising program code) running in the encoding device; for example, it may be application software in the encoding device. The data processing apparatus for volumetric video can be used to execute the steps of the data processing method for volumetric video in the embodiment corresponding to fig. 2 or fig. 3. As shown in fig. 6, the data processing apparatus 1 for volumetric video may include: a viewpoint group acquisition module 11, a mapping relation construction module 12, a timing metadata writing module 13, and a media file issuing module 14. Optionally, the data processing apparatus 1 for volumetric video may further include: an invalid field setting module 15, a recommended metadata writing module 16, an encapsulation processing module 17, and a valid field setting module 18. The modules are described as follows:
a viewpoint group obtaining module 11, configured to obtain G viewpoint groups of the volumetric video, and use an ith viewpoint group of the G viewpoint groups as a target viewpoint group;
wherein i is a non-negative integer less than G;
the viewpoint group acquiring module 11 includes: a viewpoint acquisition unit 111, a viewpoint group search unit 112, a notification unit 113, a recommended viewpoint group determination unit 114, and a target viewpoint group determination unit 115;
a viewpoint obtaining unit 111, configured to obtain V viewpoints of the volumetric video, and perform viewpoint grouping on the V viewpoints based on a viewpoint dependency relationship between the V viewpoints to obtain G viewpoint groups of the volumetric video; v is used for representing the number of viewpoints of the volume video, and is a positive integer greater than or equal to 2; the viewpoint dependency relationship is determined by the content correlation between the video contents respectively corresponding to each of the V viewpoints.
A viewpoint group search unit 112, configured to acquire video related information associated with the volumetric video, and search the video related information for a specified viewpoint group associated with a content producer; the specified viewpoint group is associated with the shooting intention of the content producer shooting the volumetric video;
a notifying unit 113, configured to, if the specified viewpoint group associated with the content producer is not found in the video associated information, acquire an ith viewpoint group irrelevant to the shooting intention of the content producer from the G viewpoint groups as the target viewpoint group, and notify the mapping relation construction module 12 to execute the steps of constructing, based on the target video content corresponding to the target viewpoint group, a mapping relation between the target viewpoint group and target spatial position information for viewing the target video content, and generating timing metadata information corresponding to the target viewpoint group based on the mapping relation.
Optionally, the recommended viewpoint group determining unit 114 is configured to, if a specified viewpoint group associated with a content producer is found in the video association information, take the found specified viewpoint group as the recommended viewpoint group;
a target viewpoint group determining unit 115, configured to acquire, based on the recommended viewpoint group, an ith viewpoint group related to the shooting intention of the content producer from the G viewpoint groups as the target viewpoint group.
It should be understood that, after the data processing apparatus 1 for volumetric video executes the above steps through the notifying unit 113, the following steps can also be executed through the invalid field setting module 15. For specific implementation manners of the viewpoint obtaining unit 111, the viewpoint group searching unit 112, the notifying unit 113, the recommended viewpoint group determining unit 114, and the target viewpoint group determining unit 115, reference may be made to the description of the specific process of obtaining the target viewpoint group in the embodiment corresponding to fig. 2, which will not be repeated here.
An invalid field setting module 15, configured to add a recommended viewpoint group identification field in the viewpoint group metadata samples of the encapsulated data box, set the field value of the recommended viewpoint group identification field to an invalid field value, and, in the viewpoint group metadata samples of the first extended data box, use the recommended viewpoint group identification field having the invalid field value as the first field indication information; the first field indication information is used to instruct the video client to acquire the timing metadata information of each of the G viewpoint groups from the first extended data box.
A mapping relation construction module 12, configured to construct a mapping relation between a target viewpoint group and target spatial position information for viewing the target video content based on target video content corresponding to the target viewpoint group, and generate timing metadata information corresponding to the target viewpoint group based on the mapping relation;
a timing metadata writing module 13, configured to write timing metadata information corresponding to the target viewpoint group into a packaged data box corresponding to the volumetric video, so as to obtain a first extended data box corresponding to the packaged data box; the first extended data box comprises timing metadata information corresponding to each viewpoint group of the G viewpoint groups;
the media file issuing module 14 is configured to obtain encoded video code streams associated with the G viewpoint groups, and perform encapsulation processing on the encoded video code streams based on the first extended data box to obtain a video media file of a volumetric video;
further, the media file issuing module 14 is further configured to issue the video media file to the video client, so that when the video client acquires the first extended data box based on the video media file, the video client displays target video content corresponding to the target viewpoint group on the video client according to the target spatial position information indicated by the timing metadata information corresponding to the target viewpoint group in the first extended data box.
Here, it should be understood that, optionally, after the data processing apparatus 1 for volumetric video performs the above steps through the target viewpoint group determining unit 115, the following steps may also be performed through the recommended metadata writing module 16.
A recommended metadata writing module 16, configured to determine an identifier of the target viewpoint group as a recommended identifier, use metadata information for describing the recommended identifier as recommended metadata information of the target viewpoint group, and write the recommended metadata information into a packaged data box corresponding to the volumetric video to obtain a second extended data box corresponding to the packaged data box;
The encapsulation processing module 17 is configured to acquire the encoded video code streams associated with the G viewpoint groups, encapsulate the encoded video code streams based on the second extended data box to obtain a video media file of the volumetric video, and issue the video media file to the video client, so that when the video client acquires the second extended data box based on the video media file, the video client displays, on the video client, the target video content corresponding to the target viewpoint group indicated by the recommended identifier, based on the recommended identifier indicated by the recommended metadata information in the second extended data box.
Optionally, before the recommended metadata writing module 16 writes the recommended metadata information into the encapsulated data box corresponding to the volume video, the data processing apparatus 1 of the volume video performs the following steps through the valid field setting module 18:
a valid field setting module 18 configured to add a recommended viewpoint group identification field associated with the recommended identifier to the viewpoint group metadata samples of the encapsulated data box, set a field value of the recommended viewpoint group identification field as a valid field value, and set the recommended viewpoint group identification field having the valid field value as second field indication information in the viewpoint group metadata samples of the second extended data box; and the second field indication information is used for indicating the video client to acquire the recommended metadata information from the second extended data box.
For specific implementation manners of the viewpoint group obtaining module 11, the mapping relation construction module 12, the timing metadata writing module 13, and the media file issuing module 14, reference may be made to the descriptions of step S101 to step S105 in the embodiment corresponding to fig. 2, which will not be repeated here. Further, for specific implementation manners of the invalid field setting module 15, the recommended metadata writing module 16, the encapsulation processing module 17, and the valid field setting module 18, reference may be made to the descriptions of step S201 to step S211 in the embodiment corresponding to fig. 3, which will not be repeated here. In addition, beneficial effects achieved by the same method will not be described again.
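For orientation, the server-side flow carried out by modules 11 to 18 can be summarized in the following Python sketch; every function and field name in it is invented for illustration and is not taken from this application:

```python
# Hedged sketch of the server-side flow performed by modules 11-18; all
# function and field names are invented stand-ins.
def group_viewpoints(video):                 # module 11: stub viewpoint grouping
    return video["groups"]

def build_mapping(group):                    # module 12: mapping relation + timing metadata
    return {"group_id": group["id"], "spatial_info": group["spatial_info"]}

def encode(video):                           # stand-in for the volumetric video encoder
    return b"encoded-stream"

def build_media_file(video, recommended_group_id=None):
    groups = group_viewpoints(video)
    if recommended_group_id is None:
        # First extended data box (modules 13, 15): timing metadata per group,
        # recommended viewpoint group identification field set to the invalid value.
        box = {"recommended_flag": 0,
               "viewpoint_groups": [build_mapping(g) for g in groups]}
    else:
        # Second extended data box (modules 16, 18): recommendation identifier,
        # identification field set to the valid value.
        box = {"recommended_flag": 1, "recommended_group_id": recommended_group_id}
    # Modules 14/17: encapsulate the encoded stream with the extended data box.
    return {"stream": encode(video), "extended_box": box}

video = {"groups": [{"id": 0, "spatial_info": (0.0, 0.0, 1.0)}]}
media_file = build_media_file(video)          # first-extended-data-box branch
```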
Further, please refer to fig. 7, where fig. 7 is a schematic structural diagram of a data processing apparatus for volumetric video according to an embodiment of the present application. The data processing apparatus for volumetric video may be a computer program (comprising program code) running in the decoding device; for example, it may be application software in the decoding device. The data processing apparatus for volumetric video can be used to execute the steps of the data processing method for volumetric video in the embodiment corresponding to fig. 4 or fig. 5. Further, referring to fig. 7, the data processing apparatus 2 for volumetric video may include: a media file receiving module 21, a timing metadata obtaining module 22, an information comparing module 23, a target information determining module 24, and a video decoding module 25. Optionally, the data processing apparatus 2 for volumetric video may further include: an information changing module 26, an update decoding module 27, a recommended identifier obtaining module 28, and a recommended content presentation module 29.
The media file receiving module 21 is configured to receive a video media file of the volumetric video sent by the server, and perform decapsulation processing on the video media file to obtain a video coding stream of the volumetric video and an extended data box corresponding to the video coding stream; the extended data box comprises a recommended viewpoint group identification field;
a timing metadata obtaining module 22, configured to, if a field value of the recommended viewpoint group identification field is an invalid value, obtain, in a first extended data box of the extended data boxes, timing metadata information corresponding to each viewpoint group in G viewpoint groups of the volumetric video;
the timing metadata obtaining module 22 includes: a field acquisition unit 221 and a timing metadata acquisition unit 222;
a field obtaining unit 221, configured to obtain, in a first extended data box of the extended data boxes, static viewpoint group metadata fields associated with G viewpoint groups if a field value of the recommended viewpoint group identification field is an invalid value;
a timing metadata obtaining unit 222, configured to, if a field value of the static viewpoint group metadata field is a numerical value used for describing that a mapping relationship of each of the G viewpoint groups remains unchanged, obtain, based on first field indication information associated with the recommended viewpoint group identification field, timing metadata information corresponding to each of the G viewpoint groups, respectively.
For a specific implementation manner of the field obtaining unit 221 and the timing metadata obtaining unit 222, reference may be made to the description of the specific process for obtaining the timing metadata information through the first extended data box in the embodiment corresponding to fig. 4, and details will not be further described here.
Wherein the static viewpoint group metadata field is deployed in a viewpoint group static metadata box of the first extended data box. If the field value of the static viewpoint group metadata field is a value describing that the mapping relationship of each viewpoint group among the G viewpoint groups remains unchanged, the static viewpoint group metadata recorded in the viewpoint group metadata sample entry of the first extended data box includes a viewpoint group identifier for each of the G viewpoint groups.
If the field value of the static viewpoint group metadata field is a value associated with dynamic viewpoint group metadata, where the dynamic viewpoint group metadata is used to describe that among the G viewpoint groups there is a viewpoint group whose mapping relationship changes with time, the dynamic viewpoint group metadata recorded in the viewpoint group metadata samples corresponding to the viewpoint group metadata sample entry includes the identifier of the variable viewpoint group whose mapping relationship changes at the sample timestamp, and the mapping relationship between the variable viewpoint group and the video content corresponding to the variable viewpoint group remains unchanged before the sample timestamp; the variable viewpoint group is a viewpoint group among the G viewpoint groups whose mapping relationship changes with time.
Wherein the viewpoint group metadata sample entry and the viewpoint group metadata samples are used to construct a viewpoint group timing metadata track for the volumetric video, and the viewpoint group timing metadata track is used to index one or more atlas data tracks associated with the volumetric video.
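The relationship between the viewpoint group metadata sample entry, the viewpoint group metadata samples, and the timing metadata track might be pictured as follows — a Python sketch with invented structure names, covering both the static case (all group identifiers in the sample entry) and the dynamic case (per-sample variable viewpoint groups):

```python
# Sketch with invented structure names of a viewpoint group timing metadata
# track: a static entry lists every group identifier, while dynamic samples
# record, per sample timestamp, the variable group whose mapping changes.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ViewpointGroupSampleEntry:      # viewpoint group metadata sample entry
    static: bool                      # static viewpoint group metadata field
    group_ids: List[int] = field(default_factory=list)

@dataclass
class ViewpointGroupSample:           # one viewpoint group metadata sample
    timestamp: float                  # sample timestamp
    changed_group_id: int             # variable viewpoint group at that time

@dataclass
class TimingMetadataTrack:
    entry: ViewpointGroupSampleEntry
    samples: List[ViewpointGroupSample]
    atlas_track_ids: List[int]        # atlas data tracks this track indexes

track = TimingMetadataTrack(
    entry=ViewpointGroupSampleEntry(static=False, group_ids=[0, 1, 2]),
    samples=[ViewpointGroupSample(timestamp=3.2, changed_group_id=1)],
    atlas_track_ids=[101],
)
```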
The information comparison module 23 is configured to obtain an ith viewpoint group of the G viewpoint groups, use spatial position information of the video client as spatial position information to be compared, and compare the spatial position information to be compared with spatial position information indicated by timing metadata information corresponding to the ith viewpoint group to obtain a comparison result; i is a non-negative integer less than G;
a target information determining module 24, configured to, if the comparison result indicates that the spatial position information to be compared is the same as the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group, take the ith viewpoint group as a matching viewpoint group, and take the spatial position information indicated by the timing metadata information corresponding to the matching viewpoint group as target spatial position information;
wherein the target spatial position information is determined by a decision type field associated with the viewpoint group, which is recorded by the server in the viewpoint group metadata sample entry of the first extended data box; the decision type field is deployed in a viewpoint group static metadata box of the viewpoint group metadata sample entry. If the decision type field takes a first value, the target spatial position information having a mapping relationship with the matching viewpoint group is the three-dimensional spatial region information of the matching video content displayed on the video client; if the decision type field takes a second value, the target spatial position information having a mapping relationship with the matching viewpoint group is the viewing position coordinate information of the user viewing the matching video content on the video client; and if the decision type field takes a third value, the target spatial position information having a mapping relationship with the matching viewpoint group is determined by combining the three-dimensional spatial region information and the viewing position coordinate information.
The video decoding module 25 is configured to decode, based on the mapping relationship between the matching viewpoint group and the target spatial position information, the matching video content corresponding to the matching viewpoint group from the video coding stream of the volumetric video, and display the matching video content corresponding to the matching viewpoint group on the video client.
Optionally, the information changing module 26 is configured to, when the spatial location information of the video client is changed from the target spatial location information to spatial location update information, use a mapping relationship between the spatial location update information and the changed viewpoint group as an update relationship based on the timing metadata information corresponding to the spatial location update information in the first extended data box;
The update decoding module 27 is configured to decode the video content corresponding to the changed viewpoint group from the video encoded stream of the volumetric video based on the update relationship, and display the video content corresponding to the changed viewpoint group on the video client.
Optionally, the recommended identifier obtaining module 28 is configured to, if the field value of the recommended viewpoint group identification field is a valid value, obtain, in a second extended data box of the extended data boxes, a recommended identifier of the recommended viewpoint group indicated by the recommended metadata information based on second field indication information associated with the recommended viewpoint group identification field;
The recommended content presentation module 29 is configured to decode the recommended video content corresponding to the recommended viewpoint group from the video coding stream of the volumetric video, and present the recommended video content corresponding to the recommended viewpoint group on the video client.
For specific implementation manners of the media file receiving module 21, the timing metadata obtaining module 22, the information comparing module 23, the target information determining module 24, and the video decoding module 25, reference may be made to the descriptions of step S301 to step S305 in the embodiment corresponding to fig. 4, which will not be repeated here. Optionally, for specific implementation manners of the information changing module 26 and the update decoding module 27, reference may be made to the descriptions of step S406 to step S407 in the embodiment corresponding to fig. 5; for specific implementation manners of the recommended identifier obtaining module 28 and the recommended content presentation module 29, reference may be made to the descriptions of step S306 to step S307 in the embodiment corresponding to fig. 4, which will not be repeated here. In addition, beneficial effects achieved by the same method will not be described again.
Further, please refer to fig. 8, where fig. 8 is a schematic diagram of a computer device according to an embodiment of the present application. The computer device 1000 shown in fig. 8 may include: at least one processor 1001 (such as a CPU), at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002, where the communication bus 1002 is used to enable connection and communication between these components. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 8, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 8, the network interface 1004 is mainly used to provide a network communication function; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 may be configured to call the device control application program stored in the memory 1005, so as to implement the data processing method for volumetric video in the embodiments corresponding to fig. 2, fig. 3, fig. 4, or fig. 5, the data processing apparatus 1 for volumetric video in the embodiment corresponding to fig. 6, and the data processing apparatus 2 for volumetric video in the embodiment corresponding to fig. 7, which will not be repeated here. In addition, beneficial effects achieved by the same method will not be described again.
Further, it should be noted that an embodiment of the present application also provides a computer-readable storage medium storing the computer program executed by the aforementioned computer device 1000. The computer program includes program instructions which, when executed by a processor, can perform the data processing method for volumetric video described in the embodiments corresponding to fig. 2, fig. 3, fig. 4, or fig. 5; details are therefore not repeated here, and beneficial effects achieved by the same method will not be described again. For technical details not disclosed in the embodiments of the computer-readable storage medium involved in the present application, reference is made to the descriptions of the method embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the present application; all equivalent variations and modifications made in accordance with the present application shall fall within its scope.

Claims (19)

1. A method for processing volumetric video data, the method being performed by a server and comprising:
acquiring G viewpoint groups of a volume video, and taking the ith viewpoint group in the G viewpoint groups as a target viewpoint group; i is a non-negative integer less than G;
based on the target video content corresponding to the target viewpoint group, constructing a mapping relation between the target viewpoint group and target space position information for watching the target video content, and generating timing metadata information corresponding to the target viewpoint group based on the mapping relation;
writing timing metadata information corresponding to the target viewpoint group into a packaging data box corresponding to the volume video to obtain a first extended data box corresponding to the packaging data box; the first extended data box comprises timing metadata information corresponding to each viewpoint group of the G viewpoint groups;
acquiring coded video code streams associated with the G viewpoint groups, and packaging the coded video code streams based on the first extended data box to obtain video media files of the volume videos;
and issuing the video media file to a video client so that the video client displays target video content corresponding to the target viewpoint group on the video client according to the target spatial position information indicated by the timing metadata information corresponding to the target viewpoint group in the first extended data box when acquiring the first extended data box based on the video media file.
2. The method of claim 1, wherein the obtaining of the G viewpoint groups of the volumetric video comprises:
acquiring V viewpoints of a volumetric video, and performing viewpoint grouping on the V viewpoints based on the viewpoint dependency relationship among the V viewpoints to obtain G viewpoint groups of the volumetric video; wherein V is used for representing the number of viewpoints of the volume video, and is a positive integer greater than or equal to 2; the viewpoint dependency relationship is determined by the content correlation between the video contents respectively corresponding to each of the V viewpoints.
3. The method according to any one of claims 1 to 2, wherein the regarding an ith viewpoint group of the G viewpoint groups as a target viewpoint group comprises:
acquiring video related information associated with the volume video, and searching a specified viewpoint group associated with a content producer in the video related information; the specified set of viewpoints is associated with a capture intent of the content producer capturing the volumetric video;
if the specified viewpoint group associated with the content producer is not found in the video association information, acquiring an ith viewpoint group irrelevant to the shooting intention of the content producer from the G viewpoint groups as the target viewpoint group, and notifying execution of the steps of constructing, based on the target video content corresponding to the target viewpoint group, a mapping relation between the target viewpoint group and target spatial position information for watching the target video content, and generating timing metadata information corresponding to the target viewpoint group based on the mapping relation.
4. The method of claim 3, further comprising:
adding a recommended viewpoint group identification field in the viewpoint group metadata samples of the encapsulated data box, setting the field value of the recommended viewpoint group identification field as an invalid field value, and taking the recommended viewpoint group identification field with the invalid field value as first field indication information in the viewpoint group metadata samples of the first extended data box; the first field indication information is used for indicating the video client to acquire the timing metadata information of each of the G viewpoint groups from the first extended data box.
5. The method of claim 3, further comprising:
if the appointed viewpoint group associated with the content producer is found in the video associated information, taking the found appointed viewpoint group as a recommended viewpoint group;
acquiring an ith viewpoint group related to the photographing intention of the content producer from the G viewpoint groups based on the recommended viewpoint group as the target viewpoint group.
6. The method of claim 5, further comprising:
determining the identifier of the target viewpoint group as a recommended identifier, taking metadata information for describing the recommended identifier as recommended metadata information of the target viewpoint group, and writing the recommended metadata information into a packaging data box corresponding to the volume video to obtain a second expansion data box corresponding to the packaging data box;
acquiring coded video code streams associated with the G viewpoint groups, performing encapsulation processing on the coded video code streams based on the second extended data box to obtain video media files of the volume video, and sending the video media files to a video client, so that when the video client acquires the second extended data box based on the video media files, the video client displays target video content corresponding to the target viewpoint group indicated by the recommended identifier based on the recommended identifier indicated by the recommended metadata information in the second extended data box on the video client.
7. The method of claim 6, wherein when writing the recommended metadata information into the packed data box corresponding to the volumetric video, the method further comprises:
adding a recommended viewpoint group identification field associated with the recommended identifier in viewpoint group metadata samples of the encapsulated data box, and setting a field value of the recommended viewpoint group identification field as a valid field value, and in viewpoint group metadata samples of the second extended data box, taking the recommended viewpoint group identification field having the valid field value as second field indication information; the second field indication information is used for indicating the video client to acquire the recommended metadata information from the second extended data box.
8. A method for processing volumetric video data, the method being performed by a video client and comprising:
receiving a video media file of a volume video sent by a server, and performing decapsulation processing on the video media file to obtain a video coding stream of the volume video and an extended data box corresponding to the video coding stream; the extended data box comprises a recommended viewpoint group identification field;
if the field value of the recommended viewpoint group identification field is an invalid value, acquiring timing metadata information corresponding to each viewpoint group in G viewpoint groups of the volumetric video in a first extended data box of the extended data boxes;
acquiring the ith viewpoint group in the G viewpoint groups, taking the spatial position information of the video client as the spatial position information to be compared, and comparing the spatial position information to be compared with the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group to obtain a comparison result; i is a non-negative integer less than G;
if the comparison result indicates that the spatial position information to be compared is the same as the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group, taking the ith viewpoint group as a matching viewpoint group, and taking the spatial position information indicated by the timing metadata information corresponding to the matching viewpoint group as target spatial position information;
and decoding the video coding stream of the volumetric video to obtain the matched video content corresponding to the matched viewpoint group based on the mapping relation between the matched viewpoint group and the target spatial position information, and displaying the matched video content corresponding to the matched viewpoint group on the video client.
9. The method according to claim 8, wherein if the field value of the recommended viewpoint group identification field is an invalid value, acquiring, in a first extended data box of the extended data boxes, timing metadata information respectively corresponding to each viewpoint group of the G viewpoint groups of the volumetric video, comprises:
if the field value of the recommended viewpoint group identification field is an invalid value, acquiring static viewpoint group metadata fields associated with the G viewpoint groups in a first extended data box of the extended data boxes;
and if the field value of the static viewpoint group metadata field is a numerical value used for describing that the mapping relation of each viewpoint group in the G viewpoint groups keeps unchanged, acquiring timing metadata information respectively corresponding to each viewpoint group in the G viewpoint groups based on the first field indication information associated with the recommended viewpoint group identification field.
10. The method of claim 9, wherein the static viewpoint group metadata field is deployed in a viewpoint group static metadata box of the first extended data box; and if the field value of the static viewpoint group metadata field is a value describing that the mapping relationship of each viewpoint group among the G viewpoint groups remains unchanged, the static viewpoint group metadata recorded in a viewpoint group metadata sample entry of the first extended data box includes a viewpoint group identifier of each of the G viewpoint groups.
11. The method according to claim 10, wherein, if the field value of the static viewpoint group metadata field is a value associated with dynamic viewpoint group metadata, where the dynamic viewpoint group metadata is used to describe that among the G viewpoint groups there is a viewpoint group whose mapping relationship changes with time, the dynamic viewpoint group metadata recorded in the viewpoint group metadata samples corresponding to the viewpoint group metadata sample entry includes an identifier of the variable viewpoint group whose mapping relationship changes at a sample timestamp, and the mapping relationship between the variable viewpoint group and the video content corresponding to the variable viewpoint group remains unchanged before the sample timestamp; the variable viewpoint group is a viewpoint group among the G viewpoint groups whose mapping relationship changes with time.
12. The method of claim 11, wherein the viewpoint group metadata sample entry and the viewpoint group metadata samples are used to construct a viewpoint group timing metadata track for the volumetric video, and the viewpoint group timing metadata track is used to index one or more atlas data tracks associated with the volumetric video.
13. The method according to any one of claims 8 to 12, wherein the target spatial position information is determined by a decision type field associated with the viewpoint group, recorded by the server in a viewpoint group metadata sample entry of the first extended data box; the decision type field is deployed in a viewpoint group static metadata box of the viewpoint group metadata sample entry;
if the decision type field takes a first value, the target spatial position information having the mapping relationship with the matching viewpoint group is three-dimensional spatial region information of the matching video content displayed on the video client;
if the decision type field takes a second value, the target spatial position information having the mapping relationship with the matching viewpoint group is viewing position coordinate information of a user viewing the matching video content on the video client;
if the decision type field takes a third value, the target spatial position information having the mapping relationship with the matching viewpoint group is determined by combining the three-dimensional spatial region information and the viewing position coordinate information.
14. The method of any one of claims 8 to 12, further comprising:
when the spatial position information of the video client is changed from the target spatial position information to spatial position updating information, taking the mapping relation between the spatial position updating information and a change viewpoint group as an updating relation based on the timing metadata information corresponding to the spatial position updating information in the first extended data box;
and decoding the video content corresponding to the changed viewpoint group from the video coding stream of the volumetric video based on the updating relation, and displaying the video content corresponding to the changed viewpoint group on the video client.
15. The method according to any one of claims 8 to 12, further comprising:
if the field value of the recommended viewpoint group identification field is a valid value, acquiring a recommended identifier of a recommended viewpoint group indicated by recommended metadata information in a second extended data box of the extended data boxes based on second field indication information associated with the recommended viewpoint group identification field;
and decoding the video coding stream of the volume video to obtain recommended video content corresponding to the recommended viewpoint group, and displaying the recommended video content corresponding to the recommended viewpoint group on the video client.
16. A data processing apparatus for volumetric video, comprising:
a viewpoint group acquisition module, configured to acquire G viewpoint groups of a volumetric video and take an ith viewpoint group of the G viewpoint groups as a target viewpoint group; i is a non-negative integer less than G;
a mapping relation construction module, configured to construct a mapping relation between the target viewpoint group and target spatial position information for viewing the target video content based on target video content corresponding to the target viewpoint group, and generate timing metadata information corresponding to the target viewpoint group based on the mapping relation;
a timing metadata writing module, configured to write timing metadata information corresponding to the target viewpoint group into a packaged data box corresponding to the volumetric video, so as to obtain a first extended data box corresponding to the packaged data box; the first extended data box comprises timing metadata information corresponding to each viewpoint group of the G viewpoint groups;
a media file issuing module, configured to obtain encoded video code streams associated with the G view groups, and perform encapsulation processing on the encoded video code streams based on the first extended data box to obtain a video media file of the volumetric video;
the media file issuing module is further configured to issue the video media file to a video client, so that when the video client acquires the first extended data box based on the video media file, the video client displays target video content corresponding to the target viewpoint group on the video client according to the target spatial position information indicated by the timing metadata information corresponding to the target viewpoint group in the first extended data box.
17. A data processing apparatus for volumetric video, comprising:
the media file receiving module is used for receiving a video media file of the volume video sent by the server and carrying out decapsulation processing on the video media file to obtain a video coding stream of the volume video and an extended data box corresponding to the video coding stream; the extended data box comprises a recommended viewpoint group identification field;
a timing metadata acquisition module, configured to acquire, in a first extended data box of the extended data boxes, timing metadata information corresponding to each of G viewpoint groups of the volumetric video, if a field value of the recommended viewpoint group identification field is an invalid value;
the information comparison module is used for acquiring the ith viewpoint group in the G viewpoint groups, taking the spatial position information of the video client as the spatial position information to be compared, and comparing the spatial position information to be compared with the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group to obtain a comparison result; i is a non-negative integer less than G;
a target information determining module, configured to, if the comparison result indicates that the spatial position information to be compared is the same as the spatial position information indicated by the timing metadata information corresponding to the ith viewpoint group, use the ith viewpoint group as a matching viewpoint group, and use the spatial position information indicated by the timing metadata information corresponding to the matching viewpoint group as target spatial position information;
a video decoding module, configured to decode, based on a mapping relationship between the matching viewpoint group and the target spatial position information, to obtain matching video content corresponding to the matching viewpoint group from a video coding stream of the volumetric video, and display the matching video content corresponding to the matching viewpoint group on the video client.
18. A computer device, comprising: a processor, a memory, a network interface;
wherein the processor is connected to the memory and the network interface, the network interface is configured to provide data communication functions, the memory is configured to store a computer program, and the processor is configured to call the computer program to perform the method according to any one of claims 1 to 15.
19. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method according to any one of claims 1-15.
CN202110660865.0A 2021-06-15 2021-06-15 Data processing method, device and equipment for volume video and readable storage medium Pending CN115481280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110660865.0A CN115481280A (en) 2021-06-15 2021-06-15 Data processing method, device and equipment for volume video and readable storage medium

Publications (1)

Publication Number Publication Date
CN115481280A true CN115481280A (en) 2022-12-16

Family

ID=84419347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110660865.0A Pending CN115481280A (en) 2021-06-15 2021-06-15 Data processing method, device and equipment for volume video and readable storage medium

Country Status (1)

Country Link
CN (1) CN115481280A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination