WO2023207119A1 - Immersive media processing method and apparatus, device, and storage medium - Google Patents


Info

Publication number
WO2023207119A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
information
track
auxiliary information
field
Prior art date
Application number
PCT/CN2022/137788
Other languages
French (fr)
Chinese (zh)
Inventor
胡颖 (Hu Ying)
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2023207119A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/2343 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N 21/2353 Processing of additional data specifically adapted to content descriptors, e.g. coding, compressing or processing of metadata
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N 21/462 Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/8126 Monomedia components involving additional data, e.g. news, sports, stocks, weather forecasts

Description

  • the present application relates to the field of computer technology, and in particular, to an immersive media processing method, device, computer equipment, and computer-readable storage medium.
  • An immersive media processing method, executed by a media processing device, includes:
  • obtaining the video bit stream of immersive media captured by multiple cameras, where the information collected by any camera through photography includes: video image information and corresponding video auxiliary information; and
  • encapsulating the video bit stream into M video tracks to obtain a media file of the immersive media, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  • An immersive media processing device is provided, and the device includes:
  • a processing unit, used to obtain the video bit stream of immersive media captured by multiple cameras, where the information collected by any camera through photography includes: video image information and corresponding video auxiliary information; and
  • an encapsulating unit, configured to encapsulate the video bit stream into M video tracks to obtain a media file of the immersive media, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  • An immersive media processing method is provided, which is executed by a media processing device, and the method includes:
  • acquiring the target video track in the media file of immersive media, where the immersive media is captured by multiple cameras, and the information collected by any camera includes: video image information and corresponding video auxiliary information; the media file is obtained by encapsulating the video bit stream of the immersive media into M video tracks, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and
  • decoding the information in the target video track.
  • An immersive media processing device includes:
  • an acquisition unit, used to acquire the target video track in the media file of immersive media, where the immersive media is captured by multiple cameras, and the information collected by any camera through photography includes: video image information and corresponding video auxiliary information; the media file is obtained by encapsulating the video bit stream of the immersive media into M video tracks, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and
  • a decoding unit configured to decode information in the target video track.
  • a media processing device includes an input interface and an output interface.
  • the media processing device further includes:
  • a processor adapted to implement one or more computer-readable instructions; and, a computer-readable storage medium storing one or more computer-readable instructions;
  • the one or more computer-readable instructions are adapted to be loaded by the processor and perform the following steps:
  • obtaining the video bit stream of immersive media captured by multiple cameras, where the information collected by any camera through photography includes: video image information and corresponding video auxiliary information; and
  • encapsulating the video bit stream into M video tracks to obtain a media file of the immersive media, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  • the one or more computer readable instructions are adapted to be loaded by the processor and perform the following steps:
  • acquiring the target video track in the media file of immersive media, where the immersive media is captured by multiple cameras, and the information collected by any camera includes: video image information and corresponding video auxiliary information; the media file is obtained by encapsulating the video bit stream of the immersive media into M video tracks, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and decoding the information in the target video track.
  • a computer-readable storage medium is provided, which stores one or more computer-readable instructions, and the one or more computer-readable instructions are suitable for being loaded by a processor to execute any of the above-mentioned immersive media processing methods.
  • a computer program product which includes computer-readable instructions; when the computer-readable instructions are executed by a processor, any one of the above-mentioned immersive media processing methods is implemented.
  • Figure 1a is a schematic diagram of 3DoF provided by an embodiment of the present application.
  • Figure 1b is a schematic diagram of 3DoF+ provided by an embodiment of the present application.
  • Figure 1c is a schematic diagram of 6DoF provided by an embodiment of the present application.
  • Figure 1d is a schematic diagram of free-view video data provided by an embodiment of the present application.
  • Figure 1e is a schematic diagram of an immersive media processing process provided by an embodiment of the present application.
  • Figure 2 is a schematic flowchart of an immersive media processing method provided by an embodiment of the present application.
  • Figure 3 is a schematic flowchart of an immersive media processing method provided by another embodiment of the present application.
  • Figure 4a is a schematic diagram of a single-track encapsulation method for immersive media provided by an embodiment of the present application.
  • Figure 4b is a schematic diagram of the transmission of a media file of immersive media provided by an embodiment of the present application.
  • Figure 5a is a schematic diagram of a multi-track encapsulation method for immersive media provided by an embodiment of the present application.
  • Figure 5b is a schematic diagram of the transmission of a media file of another immersive media provided by an embodiment of the present application.
  • Figure 6a is a schematic diagram of a multi-track encapsulation method for immersive media provided by another embodiment of the present application.
  • Figure 6b is a schematic diagram of the transmission of a media file of immersive media provided by another embodiment of the present application.
  • Figure 7 is a schematic structural diagram of an immersive media processing device provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of an immersive media processing device provided by another embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a media processing device provided by an embodiment of the present application.
  • immersive media refers to media content that can provide consumers with an immersive experience, so that consumers immersed in the media content can obtain visual and auditory experiences as if they were in the real world.
  • Immersive media can be divided into 3DoF media, 3DoF+ media and 6DoF media according to the degree of freedom (DoF) of consumers when consuming media content; 3DoF media can support users to consume corresponding content through 3DoF (i.e. three degrees of freedom).
  • 3DoF+ media can support users to consume corresponding media content through 3DoF+
  • 6DoF media can support users to consume corresponding media content through 6DoF.
  • the so-called 3DoF refers to the three degrees of freedom with which the user's head can rotate around the X-axis, Y-axis and Z-axis while the user is fixed at the center point of a three-dimensional space, as shown in Figure 1a;
  • the so-called 3DoF+ refers to: on the basis of the three degrees of freedom, the user's head can also move with limited freedom along the X-axis, Y-axis and Z-axis, where limited movement refers to translational movement within a limited space, as shown in Figure 1b;
  • the so-called 6DoF refers to: on the basis of the three degrees of freedom, the consumer can also move freely along the X-axis, Y-axis and Z-axis, where free movement refers to free translational movement, as shown in Figure 1c.
  • the degree of freedom mentioned in the embodiments of this application can be understood as the degree of freedom with which the user is supported in moving, and thereby interacting with the content, when watching immersive media.
  • immersive media captured by multiple cameras can be called free-view video; that is to say, free-view video is a common form of 3DoF+ media and 6DoF media, and can also be called multi-view video, multi-viewpoint video or free-viewpoint video.
  • when capturing free-view video, the cameras collect texture information of the scene (which can also be called texture map information or a texture map, and which can be, for example, color information) and depth information (which can also be called depth map information or a depth map, and which can be, for example, spatial distance information); the texture information and depth information from different cameras are then combined according to the user's location information to constitute the free-view video (i.e. 6DoF media or 3DoF+ media) consumed on the user side.
  • free-view video is a combination of videos from different perspectives in space collected by multiple cameras
  • the embodiments of the present application can express free-view video data as a combination of the following information: the texture map information and depth map information collected by the multiple cameras, and the corresponding metadata; the metadata can include the parameters of the multiple cameras, as well as description information such as the splicing layout and edge protection of the free-view video.
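The composition described above (per-camera texture and depth maps plus metadata) can be sketched as plain data structures. This is an illustrative model only, not the patent's actual encoding; all field names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CameraCapture:
    """Per-camera free-view data: a texture map (e.g. color information)
    and a depth map (e.g. spatial distance information)."""
    camera_id: int
    texture_map: bytes
    depth_map: bytes

@dataclass
class FreeViewMetadata:
    """Metadata accompanying the maps: camera parameters plus description
    information such as splicing layout and edge protection."""
    camera_params: List[dict] = field(default_factory=list)
    splicing_layout: str = ""
    edge_protection: int = 0

@dataclass
class FreeViewVideoData:
    """Free-view video data = captures from multiple cameras + metadata."""
    captures: List[CameraCapture] = field(default_factory=list)
    metadata: FreeViewMetadata = field(default_factory=FreeViewMetadata)
```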
  • the processing process for immersive media may mainly include: video acquisition, video encoding, video file encapsulation, video transmission, video file decapsulation, video decoding and video presentation. Specifically:
  • Video capture is used to convert analog video signals into digital videos and save them in the format of digital video files; that is to say, video capture can convert video signals into binary digital information. Since the binary digital information converted from the video signal is a binary data stream, the binary digital information can also be called a code stream or bit stream (Bitstream) of the video signal.
  • Video encoding refers to the process of converting a file in the original video format into a file in another video format through compression technology. That is to say, video encoding can be achieved through standardized video compression algorithms, such as AVS (Audio Video coding Standard, China's national digital audio and video coding standard) encoding technology, AVS3 (the third-generation audio and video codec standard launched by the AVS standard group) encoding technology, HEVC (High Efficiency Video Coding, the international video coding standard HEVC/H.265) encoding technology, VVC (Versatile Video Coding, the international video coding standard VVC/H.266) encoding technology and other algorithms, or through certain compression algorithms designed for specific content.
  • Video file encapsulation refers to storing the encoded and compressed media bit stream (such as video bit stream) in a file in a certain format according to the encapsulation format (or container, or file container).
  • the embodiments of the present application mainly take the media bit stream as a video bit stream as an example; if the immersive media includes not only video frames but also one or more other kinds of media frames, such as audio frames and subtitle frames, then the media bit stream also includes at least one of an audio bit stream and a subtitle bit stream.
  • Common encapsulation formats include AVI format (Audio Video Interleaved, audio and video interleaved format) or ISOBMFF (ISO Based Media File Format, media file format based on ISO (International Standard Organization, International Organization for Standardization) standards).
  • ISOBMFF is the packaging standard for media files.
  • the most typical ISOBMFF file is the MP4 (Moving Picture Experts Group 4) file.
  • the media sample (which may be referred to as a sample for short) can be used as the encapsulation unit in the file encapsulation process; that is, the media code stream obtained by compression and encoding is encapsulated into one or more media tracks, thereby obtaining the corresponding media files (i.e. video files).
  • a media file can be composed of multiple media tracks.
  • a media file can contain one or more video tracks, one or more audio tracks, and one or more subtitle tracks; the so-called video track refers to the media track used to encapsulate the video bit stream, the so-called audio track refers to the media track used to encapsulate the audio bit stream, and the so-called subtitle track refers to the media track used to encapsulate the subtitle bit stream.
  • the media track mentioned here (which may be referred to as a track for short) refers to a collection of media data in the process of encapsulating media files; a media track may include one or more samples, and during the encapsulation process one or more media frames are encapsulated as a sample.
  • a sample in a video track is usually a video frame
  • a sample in an audio track is usually an audio frame
  • a sample in a subtitle track is usually a subtitle frame, etc.
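The track and sample structure described above is carried in ISOBMFF's size-prefixed box layout. As a rough sketch based on ISO/IEC 14496-12 (not part of the patent), a minimal parser for the top-level box sequence of an MP4/ISOBMFF file looks like this:

```python
import io
import struct

def parse_top_level_boxes(data):
    """Walk the top-level ISOBMFF box sequence: each box begins with a
    32-bit big-endian size (which includes this 8-byte header) followed by
    a 4-character type (ISO/IEC 14496-12). 64-bit 'largesize' boxes
    (size == 1) are not handled in this simplified sketch."""
    stream = io.BytesIO(data)
    boxes = []
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break  # end of data
        size, box_type = struct.unpack(">I4s", header)
        # the box payload is everything after the 8-byte header
        boxes.append((box_type.decode("ascii"), stream.read(size - 8)))
    return boxes
```

In a real MP4 file the 'moov' box would in turn contain one 'trak' box per media track, each parsed recursively with the same size-plus-type rule.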
  • Video file transmission means that after the media production device obtains the media file through the above steps, the media file is transferred to the media consumption device.
  • at the encapsulation level, the media file can be considered to include the media resource corresponding to each media track, where the media resource corresponding to a media track is obtained by encapsulating the corresponding media bit stream into that track. For example, when the video bit stream is encapsulated into one or more video tracks during encapsulation to obtain a media file, each resulting video track encapsulates the corresponding information and yields a corresponding video resource; therefore the media file can be considered to include one or more video tracks and the video resource corresponding to each video track.
  • the transmission process of media files can be carried out based on various transmission protocols; the transmission protocols here can include but are not limited to: DASH (Dynamic Adaptive Streaming over HTTP, dynamic adaptive streaming media transmission based on HTTP) protocol, HLS (HTTP Live Streaming, dynamic code rate adaptive transmission) protocol, SMTP (Smart Media Transport Protocol, intelligent media transmission protocol), TCP (Transmission Control Protocol, transmission control protocol), etc.
  • DASH is an adaptive bitrate streaming technology that enables high-quality streaming media to be delivered over the Internet through traditional HTTP servers; in DASH, media segment information is described by the MPD (media presentation description, the media presentation description signaling in DASH).
  • DASH involves the concept of a Representation; the so-called Representation refers to a combination of one or more media components (media fragments).
  • in some cases, a Representation can be understood as the video resource corresponding to a track; for example, after the video bit stream of a certain resolution is encapsulated into a video track to obtain a video file (i.e., a media file) of that resolution, and that video file includes a video resource, then that video file can be seen as a Representation.
  • DASH also involves the concept of an Adaptation Set; the so-called Adaptation Set refers to a collection of one or more video streams, and an Adaptation Set can contain multiple Representations.
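The Representation/Adaptation Set relationship can be illustrated by generating a minimal MPD skeleton. This is a simplified sketch (namespaces, durations, and segment URLs omitted); the ids and bandwidths are hypothetical example values.

```python
import xml.etree.ElementTree as ET

def build_mpd(representations):
    """Build a minimal illustrative DASH MPD: one Period containing one
    AdaptationSet, which in turn contains one Representation per entry.
    Each entry is a (rep_id, bandwidth, width, height) tuple."""
    mpd = ET.Element("MPD", type="static")
    period = ET.SubElement(mpd, "Period")
    # one AdaptationSet groups alternative encodings of the same content
    aset = ET.SubElement(period, "AdaptationSet", mimeType="video/mp4")
    for rep_id, bandwidth, width, height in representations:
        ET.SubElement(aset, "Representation", id=rep_id,
                      bandwidth=str(bandwidth),
                      width=str(width), height=str(height))
    return ET.tostring(mpd, encoding="unicode")

# two Representations (hypothetical 1080p and 720p encodings of one video)
mpd_xml = build_mpd([("v1080", 5000000, 1920, 1080),
                     ("v720", 2500000, 1280, 720)])
```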
  • Video file decapsulation means that after receiving the media file, the media consumption device decapsulates it. It can be understood that the decapsulation process of the media consumption device is the inverse of the encapsulation process performed by the media production device mentioned above; the media consumption device can decapsulate the media file according to the file format requirements used during encapsulation to obtain the corresponding media bit streams (such as video bit streams), etc.
  • Video decoding refers to decoding the media bit stream to restore the video. It should be understood that the decoding process of the media consumption device is inverse to the encoding process performed by the media production device mentioned above.
  • media production equipment refers to equipment used by providers of immersive media (such as content producers of immersive media), which can be such as smartphones, tablets, laptops, desktop computers, smart speakers, Smart watches, smart cars and other terminal equipment.
  • the media production equipment can also be a server.
  • the server here can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
  • the so-called media consumption device refers to a device used by users of immersive media (for example, users who consume media content of immersive media), which can be such as personal computers, smart phones, VR (Virtual Reality, virtual reality) devices (for example, VR helmets, VR glasses, etc.), smart watches, smart cars and other terminal devices; further, the media consumption device can also install and run a client with media consumption capabilities, thereby consuming media through the client.
  • the media production equipment needs to compress and encode the collected free-view video; during the compression and encoding process, the video compression algorithms involved can be implemented with AVS3 encoding technology, HEVC encoding technology, etc.
  • the encoded data stream (such as a video bit stream) needs to be encapsulated to obtain a media file, so that the media file can be transmitted to a media consumption device.
  • on the media consumption device side, reverse operations such as decapsulation and decoding of the media file are required to finally present the video content of the free-view video (that is, the media content).
  • the production process of free-view video content can roughly include the following steps: (1) capture the scene with a camera array to obtain the texture map information and depth map information collected by the multiple cameras in the camera array; (2) splice the texture map information collected by the multiple cameras with the corresponding depth map information (this step is optional); (3) encode the depth map information and texture map information collected by the multiple cameras with planar video compression to obtain the encoded video bit stream; (4) encapsulate the encoded video bit stream and the metadata into the media file of the free-view video (i.e., the video file).
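The four production steps can be sketched as a toy pipeline. This is a heavily simplified illustration, not the patent's method: splicing is modeled as byte concatenation, "encoding" is an identity placeholder standing in for planar video compression, and a dict stands in for the ISOBMFF media file.

```python
def produce_free_view_file(camera_frames):
    """Toy free-view production pipeline. Each element of `camera_frames`
    is a (texture_bytes, depth_bytes) pair from one camera."""
    # (1) capture: camera_frames already holds per-camera texture/depth maps
    # (2) splice (optional): join each camera's texture and depth maps
    spliced = [texture + depth for texture, depth in camera_frames]
    # (3) "encode": placeholder for planar video compression
    bitstream = b"".join(spliced)
    # (4) encapsulate bitstream plus metadata into a stand-in media file
    return {"bitstream": bitstream,
            "metadata": {"camera_count": len(camera_frames)}}
```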
  • for media consumption devices, the free-view video needs to be reconstructed to a virtual viewpoint based on the media file of the free-view video, so as to render and present the video content corresponding to the virtual viewpoint to the user. Specifically, when reconstructing the free-view video to the virtual viewpoint, the media consumption device can perform the following steps: obtain the media file of the free-view video, and decode from it the depth map information and texture map information collected by one or more cameras; then process the decoded free-view video data, combining the depth map information and texture map information according to a free-view video algorithm (the application's own algorithm can be used) to synthesize the virtual viewpoint corresponding to the target perspective the user has switched to.
  • the texture map information structure is used to indicate: the encoded metadata information of the texture map corresponding to the original camera (ie camera) involved in the free-view video data.
  • the syntax of the texture map information structure can be as follows:
  • aligned(8) indicates alignment by 8 bits;
  • texture_padding_size indicates the guard band width of the texture map;
  • texture_top_left_x indicates the x-coordinate of the upper left corner of the texture map corresponding to the original camera in the video frame plane;
  • texture_top_left_y indicates the y-coordinate of the upper left corner of the texture map corresponding to the original camera in the video frame plane;
  • texture_bottom_right_x indicates the x-coordinate of the lower right corner of the texture map corresponding to the original camera in the video frame plane;
  • texture_bottom_right_y indicates the y-coordinate of the lower right corner of the texture map corresponding to the original camera in the video frame plane.
  • the depth map information structure is used to indicate: the encoded metadata information of the depth map corresponding to the original camera (ie camera) involved in the free-view video data.
  • the syntax of the depth map information structure can be as follows:
  • depth_padding_size indicates the guard band width of the depth map
  • depth_top_left_x indicates the x-coordinate of the upper left corner of the depth map corresponding to the corresponding original camera in the video frame plane
  • depth_top_left_y indicates the y-coordinate of the upper left corner of the depth map corresponding to the original camera in the video frame plane;
  • depth_bottom_right_x indicates the x-coordinate of the lower right corner of the depth map corresponding to the original camera in the video frame plane;
  • depth_bottom_right_y indicates the y-coordinate of the lower right corner of the depth map corresponding to the original camera in the video frame plane.
  • depth_downsample_factor indicates the multiple factor of depth map downsampling.
  • the actual resolution width and height of the depth map are 1/2^depth_downsample_factor of the camera acquisition resolution width and height.
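The downsampling relation above can be written as a one-line helper (a sketch: integer division is assumed here for simplicity, and the capture resolutions are example values).

```python
def depth_map_resolution(capture_width, capture_height, depth_downsample_factor):
    """Actual depth-map resolution when width and height are each scaled
    by 1/2**depth_downsample_factor relative to the capture resolution."""
    scale = 2 ** depth_downsample_factor
    return capture_width // scale, capture_height // scale
```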
  • the existing video coding and decoding standards stipulate that the free-view video track should be represented in the media file as a restricted video, and the SchemeTypeBox (scheme type data box) in the RestrictedSchemeInfoBox (restricted scheme information data box) under the sample entry of this type of track should be set to 'as3f'; however, the encoding of free-view video imposes no restrictions on the video codec.
  • existing video coding and decoding standards also define free-view video encapsulation methods, which can include single-track encapsulation and multi-track encapsulation:
  • the single-track encapsulation of free-view video means that a single track represents the free-view video code stream (that is, the video bit stream of free-view video).
  • the single-track encapsulation method is used to directly encapsulate the free-view video data into the media track container, and the data structure is retained in the code stream as it is.
  • Single-track encapsulated free-view video tracks use a VisualSampleEntry (sample entry), defined in ISO/IEC 14496-12.
  • the sample entry can contain an AvsFreeViewInfoBox (free view information data box) and PayloadInfoBox (load information data box).
  • AvsFreeViewInfoBox indicates the camera information corresponding to the perspective in the free-view video track
  • PayloadInfoBox indicates the texture map and depth map information corresponding to the camera.
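As a rough sketch (not from the patent) of how such a nested sample-entry layout serializes, the helper below writes generic size-prefixed ISOBMFF boxes. 'resv' is the restricted-video sample entry type from ISO/IEC 14496-12; 'afvi' and 'plif' are placeholder 4CCs standing in for AvsFreeViewInfoBox and PayloadInfoBox, whose real four-character codes and payload layouts are defined by the AVS spec. A real VisualSampleEntry also carries fixed fields (data_reference_index, width, height, etc.) before its child boxes, omitted here.

```python
import struct

def box(box_type, payload=b""):
    """Serialize one ISOBMFF box: 32-bit big-endian size (including the
    8-byte header) + 4-character type + payload (ISO/IEC 14496-12)."""
    body = box_type.encode("ascii") + payload
    return struct.pack(">I", 4 + len(body)) + body

# Single-track free-view sample entry: child boxes nested inside 'resv'.
# 'afvi'/'plif' are hypothetical 4CCs; payloads are left empty here.
sample_entry = box("resv", box("afvi") + box("plif"))
```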
  • Multi-track encapsulation of free-view video means: multiple media tracks (such as video tracks) represent the free-view video code stream; multi-track encapsulation can encapsulate the texture map and depth map collected by the same camera into one track, or encapsulate into different tracks, there is no limit to this.
  • the multi-track encapsulation of free-view video can contain two types of media tracks: the texture map track (the video track that encapsulates the texture map, i.e. the texture information) and the depth map track (the video track that encapsulates the depth map, i.e. the depth information); in this case, the texture map track and depth map track corresponding to the same camera need to be associated.
  • specifically, the TrackReferenceBox (track reference data box) defined in ISO/IEC 14496-12 can be used to reference or associate the depth map track with the texture map track corresponding to the same camera, with the reference type indicated by the reference_type field.
  • the multi-track encapsulation method maps the free-view video into the containers of the respective media tracks according to the data type; the tracks belonging to the same camera are associated with each other and time-aligned.
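The track association can be sketched by serializing a TrackReferenceBox ('tref') holding one TrackReferenceTypeBox, per ISO/IEC 14496-12. The reference_type 4CC that the AVS free-view spec assigns to the depth-to-texture association is not given in this text, so 'cdtg' below is a placeholder, and the track ID is an example value.

```python
import struct

def track_reference_box(reference_type, track_ids):
    """Serialize a 'tref' box containing one TrackReferenceTypeBox whose
    type is `reference_type` and whose payload is the list of referenced
    32-bit track IDs (ISO/IEC 14496-12 box layout)."""
    ids = b"".join(struct.pack(">I", tid) for tid in track_ids)
    # inner TrackReferenceTypeBox: size + 4CC + track_id array
    ref_box = struct.pack(">I", 8 + len(ids)) + reference_type.encode("ascii") + ids
    # outer TrackReferenceBox wrapping the typed reference
    return struct.pack(">I", 8 + len(ref_box)) + b"tref" + ref_box

# depth map track referencing texture track with ID 1 (placeholder 4CC)
tref = track_reference_box("cdtg", [1])
```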
  • the texture map track and the depth map track of multi-track encapsulated free-view video both use a VisualSampleEntry, which is defined in ISO/IEC 14496-12; the sample entries of the texture map track and the depth map track both contain an AvsFreeViewInfoBox, and the camera parameter information described in the AvsFreeViewInfoBox should not appear repeatedly in a texture map track and its corresponding depth map track.
  • if the texture map track contains texture maps from multiple cameras, its sample entry contains a PayloadInfoBox describing the texture map information; if the depth map track contains depth maps from multiple cameras, its sample entry contains a PayloadInfoBox describing the depth map information.
  • each sample in the texture map track or depth map track corresponds to the texture map or depth map of one or more cameras.
• When a sample contains texture maps or depth maps of multiple cameras, the sample may contain multiple subsamples. Each subsample corresponds to the texture map or depth map of one camera and is described using the SubSampleInformationBox (subsample information data box). Specifically, when the value of the flags field in the SubSampleInformationBox syntax of the existing video coding and decoding standards is 0, the subsample can be described through the codec_specific_parameters field.
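As an illustration, the subsample description could carry the camera association in the 32-bit codec_specific_parameters field. The bit layout below (camera_id in the upper bits, a texture/depth indicator in the lowest bit) is a hypothetical sketch for illustration, not the layout defined by the standard:

```python
# Hypothetical sketch: packing a camera identifier and a texture/depth
# indicator into the 32-bit codec_specific_parameters field of
# SubSampleInformationBox (when flags == 0). The actual bit layout is
# defined by the relevant standard; this one is an assumption.

def pack_codec_specific_parameters(camera_id: int, is_depth: bool) -> int:
    """Pack (camera_id, is_depth) into a 32-bit value: id in bits 1..16,
    depth/texture indicator in bit 0."""
    assert 0 <= camera_id < (1 << 16)
    return (camera_id << 1) | (1 if is_depth else 0)

def unpack_codec_specific_parameters(value: int) -> tuple:
    """Recover (camera_id, is_depth) from the packed value."""
    return value >> 1, bool(value & 1)
```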
• AvsFreeViewGroupBox can include a track group identification field (track_group_id). Different video tracks that contain content from some of the cameras and correspond to the same free-view video should have the same track_group_id value.
• The free-view data box is used to indicate the camera metadata information corresponding to one or more free viewpoints contained in the corresponding track (i.e., the metadata information of the cameras).
  • the syntax of the free-view data box can be as follows:
• texture_in_track indicates whether the corresponding track contains a texture map; when the value of texture_in_track is 1, it indicates that the corresponding track contains a texture map; when the value of texture_in_track is 0, it indicates that the corresponding track does not contain a texture map.
  • depth_in_track indicates whether the corresponding track contains a depth map; when the value of depth_in_track is 1, it indicates that the corresponding track contains a depth map; when the value of depth_in_track is 0, it indicates that the corresponding track does not contain a depth map.
  • camera_count indicates the number of all cameras collecting video.
  • IntCameraInfoStruct() indicates the internal parameter information of the camera
  • ExtCameraInfoStruct() indicates the external parameter information of the camera.
  • codec_independency indicates the codec independence between the texture maps and depth maps corresponding to each camera in the corresponding track. The values involved can be seen in Table 1 below:
  • the load information data box is used to describe the metadata information of the texture map and depth map corresponding to one or more cameras in the corresponding track.
  • the syntax of the payload information data box can be as follows:
  • texture_info_flag indicates whether to express the metadata information describing the texture map; when the value of texture_info_flag is 1, it expresses the metadata information describing the texture map; when the value is 0, it does not express the metadata information describing the texture map.
  • depth_info_flag indicates whether to express the metadata information describing the depth map; when the value of depth_info_flag is 1, the metadata information describing the depth map is expressed; when the value is 0, the metadata information describing the depth map is not expressed.
  • num_cameras indicates the number of cameras in this media track, and camera_id indicates the identity of the camera.
  • TextureInfostruct() describes the metadata information of the texture map
  • DepthInfostruct() describes the metadata information of the depth map.
  • the free-view track group data box is used to associate various video tracks of the same free-view video.
  • the syntax of the free-view track group data box can be as follows:
  • camera_count is used to indicate the number of cameras corresponding to the free-view texture information and depth information contained in the track.
  • camera_id is used to indicate the camera identifier corresponding to each camera.
  • depth_texture_type is used to indicate the type of texture information or depth information captured by the corresponding camera contained in the track. Its value can be referred to the following Table 2:
• Values of depth_texture_type (Table 2):
•   0 — reserved
•   1 — the track contains texture information captured by the corresponding camera
•   2 — the track contains depth information captured by the corresponding camera
•   3 — the track contains both texture and depth information captured by the corresponding camera
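The four depth_texture_type values of Table 2 can be captured in a small helper, for example (a sketch; the enum and helper names are illustrative and not part of the standard):

```python
from enum import IntEnum

class DepthTextureType(IntEnum):
    """Values of depth_texture_type from Table 2 of the free-view
    track group data box."""
    RESERVED = 0           # reserved
    TEXTURE = 1            # texture information captured by the camera
    DEPTH = 2              # depth information captured by the camera
    TEXTURE_AND_DEPTH = 3  # both texture and depth information

def has_texture(t: DepthTextureType) -> bool:
    """True if the track carries texture information for the camera."""
    return t in (DepthTextureType.TEXTURE, DepthTextureType.TEXTURE_AND_DEPTH)

def has_depth(t: DepthTextureType) -> bool:
    """True if the track carries depth information for the camera."""
    return t in (DepthTextureType.DEPTH, DepthTextureType.TEXTURE_AND_DEPTH)
```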
• The existing video coding and decoding standards indicate the depth information and texture information contained in each video track repeatedly: for example, the free-view information data box, the load information data box, and the free-view track group data box all contain fields indicating whether the track contains texture information or depth information. This causes the media consumption device to parse repeatedly during the decapsulation process, wasting resources.
• The existing video coding and decoding standards indicate the codec independence of free-view videos, that is, whether there are codec dependencies between the texture information and depth information in a video track and between the content captured by each camera.
  • the current indication method is limited to free-view application scenarios and does not have compatibility and scalability.
• In view of this, the embodiments of this application propose a file encapsulation and transmission signaling indication method for free-view video applications and other immersive media applications. Through this method, the following can be achieved: (1) organize free-view video media content in a more optimized manner, preventing the media consumption device from repeatedly parsing the relevant fields in other data boxes used to indicate the information in the video track during decapsulation, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency; (2) indicate the codec dependencies between the various parts of the video content in a more optimized manner, so that the method is not limited to free-view application scenarios, improving compatibility and scalability; (3) based on the information in (1) and (2), generate corresponding transmission signaling to guide resource selection by media consumption devices such as clients during free-view video consumption.
  • this method can be applied to the storage, transmission, encapsulation, and decapsulation process of any immersive media (such as free-view video).
• Taking a free-view video as an example, the specific implementation steps of this method are as follows:
• Step 1: The media production equipment (such as a server) generates the video bitstream of the free-view video and encapsulates the video bitstream into a media file. According to whether the texture information and depth information in the free-view video are in the same video track, and whether the free-view video is encapsulated into multiple video tracks, the texture information, depth information, and camera information contained in the video tracks of the free-view video are indicated according to the following guidelines.
  • N is a positive integer
  • the video media header data box (VideoMediaHeaderBox) in the video track identifies that the video track is accompanied by video auxiliary information (such as depth information) through metadata information.
• If a video track only contains texture information or only depth information, then, considering that texture information is a kind of video image information and depth information is a kind of video auxiliary information, the texture map track (i.e., the video track encapsulating texture information) can be considered a video image information track, and the depth map track can be considered a video auxiliary information track.
• The video image information track mentioned here can be referred to as a video track for short, which refers to a video track that encapsulates video image information; the video auxiliary information track mentioned here can also be called an auxiliary video media track or a video auxiliary track, which refers to a video track that encapsulates video auxiliary information.
• (1) Video media uses the handler type (handler_type) in the HandlerBox (handler data box), as defined in clause 8.4.3 of the relevant standard. The handler type of video media is 'vide', that is, the corresponding field value in the HandlerBox inside the MediaBox of the video media file is 'vide'; the handler type of auxiliary video media is 'auxv', that is, the corresponding field value in the HandlerBox inside the MediaBox of the auxiliary video media file is 'auxv'.
• The difference between the auxiliary video media track (i.e., the video auxiliary information track) and the video track (i.e., the video image information track) is that the auxiliary video media track is usually not used for visual rendering (for example, when the auxiliary video track contains depth information). Therefore, if a video track only contains texture information or only depth information, the distinction between the video image information track and the video auxiliary information track in the existing technology (distinguished by handler_type), together with the related technology (the 'vdep' track reference) and the identification information (such as metadata information) in (1), can be used to distinguish whether the information contained in the video track is texture information or depth information.
• The free-view track group data box is used to associate the multiple video tracks; further, the free-view track group data box can be used to associate the multiple video tracks that encapsulate video image information (such as texture information).
• The load information (i.e., the information in the aforementioned load information data box).
• The type of free-view track group can also be further distinguished based on the classification indicated by the metadata information in (1) (that is, whether the video track carries both video image information (such as texture information) and video auxiliary information (such as depth information)). The type of free-view track group mentioned here can be used to indicate that each video track associated with the free-view track group only encapsulates video image information (such as texture information), or that each video track associated with the free-view track group encapsulates both video image information (such as texture information) and video auxiliary information (such as depth information).
  • video image information such as texture information
  • video auxiliary information such as depth information
• Step 2: The media production device (such as a server) indicates the codec dependency information in the video track based on the file encapsulation structure involved in Step 1 (i.e., whether multi-track encapsulation is used, and whether a single video track contains both texture information and depth information).
• The codec dependency information may include: an indication of the rules for dividing data sets (i.e., video components) within the video, and an indication of whether codec dependencies exist between different data sets under those rules.
  • the codec dependency information may also include an indication of whether there is codec dependence on the texture information and depth information in the single video track, and so on.
  • Step 3 The media production equipment (such as a server) can provide a complete file download service.
• Alternatively, the media production device slices the media file of the free-view video into media segments suitable for streaming transmission (performed according to the specifications of the existing transmission standards), generates the corresponding transmission signaling, and performs Step 4.
  • the transmission signaling mentioned here may include but is not limited to DASH signaling or SMT signaling; and, the signaling description file of the transmission signaling indicates the following information:
  • Step 4 The media production device (such as a server) transmits the signaling description file of the transmission signaling to the user on the media consumption device side.
  • Step 5 The client in the media consumption device can execute the following logic:
• (A) The client can download the complete file (that is, the complete media file) and then play it locally; based on the indications about texture information, depth information, and codec dependency information in the media file, it determines whether the media file can be partially decoded at different stages, so as to optimize playback performance.
  • (B) Establish streaming transmission with media production equipment (such as servers) according to the application form, and adaptively select the appropriate freedom according to the texture information, depth information and codec dependency information indicated in the signaling description file of the transmission signaling. View video streams for consumption.
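The client-side selection in (B) might look like the following sketch, where the representation dictionaries and their keys ('with_depth_info', 'depth_codec_dependency') are hypothetical stand-ins for the attributes carried in the signaling description file:

```python
def select_representation(reps, want_depth: bool):
    """Adaptively pick a video resource from transmission signaling.

    A client that only renders texture prefers a texture-only resource;
    a client that needs depth (e.g. for view synthesis) prefers one that
    carries depth, ideally without codec dependency so the texture part
    can still be decoded independently.
    """
    if not want_depth:
        # Texture-only playback: avoid downloading depth data if possible.
        for r in reps:
            if not r.get('with_depth_info'):
                return r
    else:
        # Prefer depth-carrying resources whose texture decodes independently.
        independent = [r for r in reps
                       if r.get('with_depth_info')
                       and not r.get('depth_codec_dependency')]
        if independent:
            return independent[0]
        for r in reps:
            if r.get('with_depth_info'):
                return r
    return reps[0] if reps else None
```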
  • the embodiment of this application adds several descriptive fields at the system layer based on the existing technology.
  • the relevant fields are defined to support free-view video applications or other immersive media applications, as follows:
  • the ISOBMFF data box may include a video media header data box (VideoMediaHeaderBox).
• The video track uses a video media header data box, which contains general presentation-related information of the video media that is independent of the coding method. The embodiment of the present application extends the video media header data box to indicate whether depth information or other auxiliary information is also included in the video track.
  • the expanded video media header data box can be as follows:
  • the three fields version, graphicsmode and opcolor are the original fields of the video media header data box.
  • version is used to indicate the version number of the video media header data box;
  • graphicsmode is used to indicate the image combination mode of the current video track.
  • a value of 0 in this field indicates that the image combination mode of the current video track is copy mode.
  • Other values can be determined by corresponding standards.
  • opcolor is an array of values for three colors (red, green, blue), used in conjunction with the image combination mode.
• On this basis, the embodiment of the present application can define the flag bit video_with_depth_info_flag (video depth information flag field) in the video media header data box to indicate whether the current video track contains both texture information (image information) and the depth information corresponding to the texture information. When the value of this field is the first value (such as 1), it indicates that the current video track contains both texture information (image information) and the corresponding depth information; when the value of this field is the second value (such as 0), it indicates that the current video track only contains texture information (image information).
  • texture information is a type of video image information
  • depth information is a type of video auxiliary information
  • video image information is not limited to texture information
  • video auxiliary information is not limited to depth information.
• For example, the video auxiliary information may also include parallax information, etc. Therefore, other flag bits can also be defined in the video media header data box to indicate whether the current video track also contains the corresponding auxiliary information.
• Specifically, the following flags can be defined:
• (1) video_with_parallax_info_flag (video parallax information flag field): when the value of this field is the first value (such as 1), it indicates that the current video track contains both video image information and the disparity information corresponding to the video image information; when the value of this field is the second value (such as 0), it indicates that the current video track contains only video image information.
• (2) video_with_auxl_info_flag (video auxiliary information flag field): indicates whether the current video track contains both video image information and the auxiliary information corresponding to the video image information.
• When the value of this field is the first value (such as 1), the current video track contains both video image information and the video auxiliary information corresponding to the video image information; when the value of this field is the second value (such as 0), the current video track only contains video image information.
  • the video auxiliary information here can be depth information, disparity information, or auxiliary information defined by other applications.
• The video media header data box can be further extended to indicate whether there is codec independence between the data content (information) contained in the video track.
  • the expanded video media header data box can be further as follows:
• depth_codec_dependency (depth codec dependency field): when the value of depth_codec_dependency is the third value (such as 1), it indicates that there is a decoding dependency between the texture information (image information) contained in the current video track and the depth information corresponding to the texture information; when the value of depth_codec_dependency is the fourth value (such as 0), it indicates that there is no such decoding dependency.
• Similarly, if the value of video_with_parallax_info_flag equals 1, that is, the current video track contains both video image information and the disparity information corresponding to the video image information, the flag parallax_codec_dependency (disparity codec dependency field) can be further defined in the video media header data box.
  • parallax_codec_dependency when the value of parallax_codec_dependency is the third value (such as the value 1), it indicates that there is a decoding dependency relationship between the video image information contained in the current video track and the disparity information corresponding to the video image information; when the value of parallax_codec_dependency is the fourth value (such as a value of 0), it indicates that there is no decoding dependency between the video image information contained in the current video track and the disparity information corresponding to the video image information.
• auxl_codec_dependency: when the value of auxl_codec_dependency is the third value (such as 1), it indicates that there is a decoding dependency between the video image information contained in the current video track and the video auxiliary information corresponding to the video image information; when the value of auxl_codec_dependency is the fourth value (such as 0), it indicates that there is no such decoding dependency.
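A decapsulator reading the extended video media header could proceed as in the sketch below. The byte layout (one flags byte carrying the three information flags, followed by one dependency byte per set flag) is an assumption for illustration only; the actual box syntax is defined by the standard:

```python
def parse_extended_vmhd_payload(data: bytes) -> dict:
    """Parse a hypothetical serialization of the extended video media
    header fields: bit 2 = video_with_depth_info_flag,
    bit 1 = video_with_parallax_info_flag, bit 0 = video_with_auxl_info_flag;
    for each flag that is set, one codec-dependency byte follows."""
    flags = data[0]
    info = {
        'video_with_depth_info_flag': bool(flags & 0b100),
        'video_with_parallax_info_flag': bool(flags & 0b010),
        'video_with_auxl_info_flag': bool(flags & 0b001),
    }
    pos = 1
    pairs = (('video_with_depth_info_flag', 'depth_codec_dependency'),
             ('video_with_parallax_info_flag', 'parallax_codec_dependency'),
             ('video_with_auxl_info_flag', 'auxl_codec_dependency'))
    for flag, dep in pairs:
        if info[flag]:
            info[dep] = data[pos]  # 1 = decoding dependency, 0 = none
            pos += 1
    return info
```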
• VideoCompositionDependencyInfoBox (video composition dependency information data box): used to indicate the component type of the video content of the immersive media, and whether there is a codec dependency relationship (i.e., encoding and decoding dependency) between different components of the video content of the immersive media.
• The syntax of the video composition dependency information data box can be as follows:
  • composition_codec_dependency (component codec dependency field) is used to indicate whether there is a codec dependency relationship between different components contained in the video content of the immersive media.
• When the value of this field is the fifth value (such as 0), it indicates that there is no codec dependency between different components of the video content of the immersive media; when the value of this field is the sixth value (such as 1), it indicates that codec dependencies exist between different components of the video content of the immersive media.
  • composition_type is used to indicate the component type of the video content of the immersive media (that is, the type of components contained in the video content).
• When the value of this field is the seventh value (such as 0), it indicates that one component includes the video content shot by one camera, that is, the video content of the immersive media consists of video content shot by different cameras. When the value of this field is the eighth value (such as 1), it indicates that the component type of the video content of the immersive media is the binocular-based component type, i.e., the components in the video content of the immersive media are obtained based on binocular division.
• The value of composition_type is not limited to the seventh and eighth values; other values are possible and can be defined by extension of the corresponding standards.
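Putting the two fields together, a parser sketch might look as follows, assuming a hypothetical two-byte payload (one byte per field); the real box syntax is defined by the standard:

```python
def parse_video_composition_dependency(payload: bytes) -> dict:
    """Parse a hypothetical serialization of VideoCompositionDependencyInfoBox:
    byte 0 = composition_codec_dependency (0 = no codec dependency between
    components, 1 = dependency exists); byte 1 = composition_type
    (0 = one component per camera, 1 = binocular division)."""
    composition_codec_dependency = payload[0]
    composition_type = payload[1]
    type_names = {0: 'per-camera', 1: 'binocular'}
    return {
        'composition_codec_dependency': composition_codec_dependency,
        'composition_type': type_names.get(
            composition_type, 'reserved(%d)' % composition_type),
    }
```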
• TrackReferenceTypeBox (track reference type data box): the reference type field (reference_type) in the track reference type data box is extended in the embodiments of this application as follows:
  • reference_type When the reference type field (reference_type) stores the track index 'vdep', it is used to reference or associate the depth map track to the texture map track corresponding to the corresponding camera, and indicates that the depth map track relies on the corresponding texture map track when decoding.
• When the reference type field (reference_type) stores the track index 'vdpi', it is used to reference or associate the depth map track to the texture map track corresponding to the corresponding camera, and indicates that the depth map track does not rely on the corresponding texture map track when decoding.
  • the texture map track is one of the video image information tracks
  • the depth map track is one of the video auxiliary information tracks
  • the video image information track is not limited to the texture map track
• The video auxiliary information track is not limited to the depth map track; the video auxiliary track may also include a disparity information track, etc. Therefore, the track index between the video image information track and the disparity information track, and the track index between the video image information track and the video auxiliary information track, can also be extended as follows:
  • reference_type When the reference type field (reference_type) is stored as the track index of 'vplx', it is used to reference or associate the disparity information track to the corresponding video image information track, and indicates that the disparity information track relies on the corresponding video image information track when decoding.
  • reference_type When the reference type field (reference_type) stores the track index of 'vpli', it is used to reference or associate the disparity information track to the corresponding video image information track, and indicates that the disparity information track does not depend on the corresponding video image information track when decoding.
  • reference_type When the reference type field (reference_type) stores the track index of 'auxl', it is used to reference or associate the video auxiliary information track to the corresponding video image information track, and indicates that the video auxiliary information track relies on the corresponding video image information track when decoding.
• When the reference type field (reference_type) stores the track index 'auxi', it is used to reference or associate the video auxiliary information track to the corresponding video image information track, and indicates that the video auxiliary information track does not rely on the corresponding video image information track when decoding.
• It should be noted that the decoding dependency indications of the above track index types can be interchanged; for example, the track index 'vdep' could instead be used to indicate codec independence and the track index 'vdpi' to indicate codec dependency, and so on.
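The six reference types listed above can be selected mechanically from the kind of auxiliary information and the decoding dependency, for example (the helper name is illustrative):

```python
def choose_reference_type(aux_kind: str, depends_on_image_track: bool) -> str:
    """Map an auxiliary-track kind and its decoding dependency onto the
    four-character track reference types described above:
    'vdep'/'vdpi' for depth, 'vplx'/'vpli' for parallax,
    'auxl'/'auxi' for generic video auxiliary information."""
    table = {
        'depth':     ('vdep', 'vdpi'),
        'parallax':  ('vplx', 'vpli'),
        'auxiliary': ('auxl', 'auxi'),
    }
    dependent, independent = table[aux_kind]
    return dependent if depends_on_image_track else independent
```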
• The existing technology defines the free-view track group as follows: if a free-view video is encapsulated into multiple video tracks, and one video track only contains the texture information and/or depth information corresponding to some of the cameras, these video tracks should be associated through the free-view track group; different video tracks that contain content from some cameras and correspond to the same free-view video should have the same track_group_id. The embodiment of this application expands the definition of the free-view track group as follows:
• The free-view track group is used to associate video tracks that correspond to different cameras and contain texture information, where the encoding and decoding of these different video tracks are independent of each other; different video tracks that contain content from some cameras and correspond to the same free-view video should have the same track_group_id. That is to say, the embodiment of the present application only associates video tracks encapsulating texture information through free-view track groups; considering that texture information is a type of video image information, this can be further generalized to: only video tracks encapsulating video image information are associated through free-view track groups.
• Specifically, the video track that encapsulates video image information may include a free-view track group data box (referred to as the track group data box for short); the track group data box may include a track group identification field (track_group_id) and a camera identification field (camera_id).
  • the track group identification field is used to store the identification information of the track group corresponding to the corresponding video track
  • the camera identification field is used to store the identification information of the camera corresponding to the corresponding video track.
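Associating the video tracks of the same free-view video by track_group_id can then be as simple as the following sketch, where the in-memory track representation (dicts with 'track_id', 'track_group_id', and 'camera_ids') is hypothetical:

```python
from collections import defaultdict

def group_free_view_tracks(tracks):
    """Associate video tracks by their track_group_id, as the free-view
    track group data box prescribes: tracks with the same track_group_id
    belong to the same free-view video."""
    groups = defaultdict(list)
    for t in tracks:
        groups[t['track_group_id']].append(t)
    return dict(groups)
```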
  • the video auxiliary information descriptor (VideoAuxiliaryInfoDescriptor), which is used to describe the relevant attributes of the video auxiliary information of the video resource.
  • the video auxiliary information descriptor is a SupplementalProperty element, and its @schemeIdUri attribute is "urn:avs:ims:2018:av3f".
• The video auxiliary information descriptor may exist at the adaptation set level or the representation level. When it exists at the adaptation set level, it describes the relevant attributes of the video auxiliary information of all representations in the corresponding adaptation set; when it exists at the representation level, it describes the video auxiliary information of the corresponding representation.
  • the specific attributes of the video auxiliary information descriptor can be shown in Table 3 below:
• In Table 3, 'Elements and Attributes' in the first column is short for 'Elements and Attributes for VideoAuxiliaryInfoDescriptor', that is, the elements and attributes of the video auxiliary information descriptor;
  • VideoAuxiliaryInfo is the video auxiliary information field
  • VideoAuxiliaryInfo@withDepthInfo is the video auxiliary information field with depth information
  • VideoAuxiliaryInfo@withParallaxInfo is the video auxiliary information field with parallax information
  • VideoAuxiliaryInfo@withAuxiliaryInfo is the video auxiliary information field with auxiliary information
• VideoAuxiliaryInfo@depthCodecDependency is the depth codec dependency field of the video auxiliary information; VideoAuxiliaryInfo@parallaxCodecDependency is the parallax codec dependency field of the video auxiliary information; VideoAuxiliaryInfo@auxilliaryCodecDependency is the auxiliary codec dependency field of the video auxiliary information.
• In the second column, Use indicates how each attribute is used: O means the corresponding attribute is Optional, CM means Conditional Mandatory (mandatory under certain conditions), and M means Mandatory. In the third column, Data type indicates the data type of each attribute; avs:ims:2018:av31 and xs:bool are two different data types. Description in the fourth column describes each attribute.
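As an illustration, the MPD element for the video auxiliary information descriptor could be assembled as below; the attribute names follow Table 3, while the builder function itself and the boolean serialization are a hypothetical sketch:

```python
import xml.etree.ElementTree as ET

def build_video_auxiliary_descriptor(with_depth: bool,
                                     depth_dep: bool) -> ET.Element:
    """Build a SupplementalProperty element for the video auxiliary
    information descriptor, whose @schemeIdUri attribute is
    "urn:avs:ims:2018:av3f", carrying the withDepthInfo and (when depth
    is present) depthCodecDependency attributes from Table 3."""
    prop = ET.Element('SupplementalProperty',
                      schemeIdUri='urn:avs:ims:2018:av3f')
    aux = ET.SubElement(prop, 'VideoAuxiliaryInfo')
    aux.set('withDepthInfo', 'true' if with_depth else 'false')
    if with_depth:
        aux.set('depthCodecDependency', 'true' if depth_dep else 'false')
    return prop
```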
  • the free-view camera descriptor (AvsFreeViewCamInfoDescriptor), which is used to describe the camera information corresponding to the video resource.
  • the free-view camera descriptor is a SupplementalProperty element, and its @schemeIdUri attribute is "urn:avs:ims:2018:av31".
• The free-view camera descriptor can exist at the adaptation set level or the representation level. When it exists at the adaptation set level, it describes the camera information corresponding to all representations in the corresponding adaptation set; when it exists at the representation level, it describes the camera information corresponding to the corresponding representation.
  • the two fields AvsFreeViewCam@sourceId (free-view camera source identification field) and AvsFreeViewCam@combinationId (free-view camera combination identification field) are fields extended in the free-view camera descriptor according to the embodiment of the present application. Other fields besides these two fields are the original fields in the free-view camera descriptor.
• In the table for the free-view camera descriptor, 'Elements and Attributes' in the first column is short for 'Elements and Attributes for AvsFreeViewCamInfoDescriptor', that is, the elements and attributes of the free-view camera descriptor; the meanings of the entries in the second column are as explained for Table 3; avs:ims:2018:av3l, xs:unsignedInt, and xs:float in the third column are three different data types.
• In addition, the representation (video resource) or adaptation set corresponding to a video track can also be indicated through the preselection (pre-selection set) tool.
  • video resources corresponding to different cameras of the same video content should exist in the same preselection.
• VideoAuxiliaryInfoDescriptor (video auxiliary information descriptor): used to describe the relevant attributes of the video auxiliary information of a video resource.
  • the definition of the video auxiliary information descriptor can be shown in Table 5 below:
  • descriptor_tag (descriptor identifier field) is used to identify the type of descriptor (i.e., video auxiliary information descriptor);
  • descriptor_length (descriptor length field) is used to identify the length of the descriptor (i.e., video auxiliary information descriptor), in bytes.
  • video_with_depth_info_flag (video auxiliary information field with depth information) is used to indicate whether the current media resource (ie, the corresponding video resource) contains both texture information (image information) and depth information corresponding to the texture information (image information).
  • when the value of this field is the ninth value (such as the value 1), it indicates that the current media resource contains both texture information (image information) and the depth information corresponding to the texture information; when the value of this field is the tenth value (such as the value 0), it indicates that the current media resource only contains texture information (image information).
  • when the current media resource contains both texture information and the corresponding depth information, the video auxiliary information descriptor may also include depth_codec_dependency (the depth codec dependency field of the video auxiliary information); otherwise, the video auxiliary information descriptor may include only_depth_info_flag (the only-depth-information flag field of the video auxiliary information).
  • the depth_codec_dependency field is used to indicate whether there is a decoding dependency between the texture information (image information) contained in the current media resource (that is, the corresponding video resource) and the depth information corresponding to the texture information; when the value of this field is the eleventh value (such as the value 1), it indicates that there is a decoding dependency between the texture information contained in the current media resource and the corresponding depth information; when the value of this field is the twelfth value (such as the value 0), it indicates that there is no such decoding dependency.
  • the only_depth_info_flag field is used to indicate which information the current media resource (i.e., the corresponding video resource) contains; when the value of this field is the thirteenth value (such as the value 1), it indicates that the current media resource only contains depth information; when the value of this field is the fourteenth value (such as the value 0), it indicates that the current media resource only contains video image (texture) information.
  • video_with_parallax_info_flag (video auxiliary information field with parallax information) is used to indicate whether the current media resource (ie, the corresponding video resource) contains both video image information and parallax information corresponding to the video image information.
  • when the value of this field is the ninth value (such as the value 1), it indicates that the current media resource contains both video image information and the disparity information corresponding to the video image information; when the value of this field is the tenth value (such as the value 0), it indicates that the current media resource only contains video image information.
  • when the current media resource contains both video image information and the corresponding disparity information, the video auxiliary information descriptor may also include parallax_codec_dependency (the disparity codec dependency field of the video auxiliary information); otherwise, the video auxiliary information descriptor may include only_parallax_info_flag (the only-parallax-information flag field of the video auxiliary information).
  • the parallax_codec_dependency field is used to indicate whether there is a decoding dependency between the video image information contained in the current media resource (i.e., the corresponding video resource) and the disparity information corresponding to the video image information; when the value of this field is the eleventh value (such as the value 1), it indicates that there is a decoding dependency between the video image information contained in the current media resource and the corresponding disparity information; when the value of this field is the twelfth value (such as the value 0), it indicates that there is no such decoding dependency.
  • the only_parallax_info_flag field is used to indicate which information the current media resource (i.e., the corresponding video resource) contains; when the value of this field is the thirteenth value (such as the value 1), it indicates that the current media resource only contains parallax information; when the value of this field is the fourteenth value (such as the value 0), it indicates that the current media resource only contains video image information.
  • video_with_auxl_info_flag (video auxiliary information field with auxiliary information) is used to indicate whether the current media resource (ie, the corresponding video resource) contains both video image information and video auxiliary information corresponding to the video image information.
  • when the value of this field is the ninth value (such as the value 1), it indicates that the current media resource contains both video image information and the video auxiliary information corresponding to the video image information; when the value of this field is the tenth value (such as the value 0), it indicates that the current media resource only contains video image information.
  • the video auxiliary information here can be depth information, disparity information, or auxiliary information defined by other applications.
  • when the current media resource contains both video image information and the corresponding video auxiliary information, the video auxiliary information descriptor may also include auxl_codec_dependency (the auxiliary codec dependency field of the video auxiliary information); otherwise, the video auxiliary information descriptor may include only_auxiliary_info_flag (the only-auxiliary-information flag field of the video auxiliary information).
  • the auxl_codec_dependency field is used to indicate whether there is a decoding dependency between the video image information contained in the current media resource (i.e., the corresponding video resource) and the video auxiliary information corresponding to the video image information; when the value of this field is the eleventh value (such as the value 1), it indicates that there is a decoding dependency between the video image information contained in the current media resource and the corresponding video auxiliary information; when the value of this field is the twelfth value (such as the value 0), it indicates that there is no such decoding dependency.
  • the only_auxiliary_info_flag field is used to indicate which information the current media resource (i.e., the corresponding video resource) contains; when the value of this field is the thirteenth value (such as the value 1), it indicates that the current media resource only contains video auxiliary information; when the value of this field is the fourteenth value (such as the value 0), it indicates that the current media resource only contains video image information.
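The conditional field layout described above (a presence flag followed by either a codec dependency field or an only-information flag) can be sketched as a tiny serializer and parser. The one-byte-per-field layout and the tag value 0x7F are illustrative assumptions; the real descriptor's bit packing is defined by the signaling specification, not shown here.

```python
# Hedged sketch of the depth branch of the video auxiliary information
# descriptor: one byte per field and the tag value are assumptions.
def build_aux_info_descriptor(*, with_depth: bool,
                              depth_codec_dependency: bool = False,
                              only_depth_info: bool = False) -> bytes:
    """When video_with_depth_info_flag == 1 the descriptor carries
    depth_codec_dependency; otherwise it carries only_depth_info_flag."""
    body = bytearray()
    body.append(1 if with_depth else 0)                  # video_with_depth_info_flag
    if with_depth:
        body.append(1 if depth_codec_dependency else 0)  # depth_codec_dependency
    else:
        body.append(1 if only_depth_info else 0)         # only_depth_info_flag
    DESCRIPTOR_TAG = 0x7F  # placeholder tag value (assumption)
    return bytes([DESCRIPTOR_TAG, len(body)]) + bytes(body)

def parse_aux_info_descriptor(buf: bytes) -> dict:
    tag, length = buf[0], buf[1]                         # descriptor_tag, descriptor_length
    with_depth = buf[2] == 1
    out = {"descriptor_tag": tag, "descriptor_length": length,
           "video_with_depth_info_flag": with_depth}
    if with_depth:
        out["depth_codec_dependency"] = buf[3] == 1
    else:
        out["only_depth_info_flag"] = buf[3] == 1
    return out

d = parse_aux_info_descriptor(
    build_aux_info_descriptor(with_depth=True, depth_codec_dependency=True))
print(d)
```

The parallax and generic auxiliary branches (video_with_parallax_info_flag / parallax_codec_dependency / only_parallax_info_flag, and video_with_auxl_info_flag / auxl_codec_dependency / only_auxiliary_info_flag) follow the same pattern.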
  • the immersive media processing method can be executed by a media production device.
  • the immersive media processing method may include the following steps S201-S202:
  • the immersive media may be free-view video, or other immersive media other than free-view video, which is not limited here.
  • immersive media can be captured by multiple cameras; the information collected by any camera includes: video image information and corresponding video auxiliary information.
  • the video image information collected by any camera may include texture information
  • the video auxiliary information collected by any camera may include depth information; of course, it is understandable that the video image information and video auxiliary information are not limited here.
  • the video auxiliary information collected by any camera may include parallax information, or auxiliary information defined by other applications.
  • S202: Encapsulate the video bit stream into M video tracks to obtain a media file of the immersive media.
  • M is a positive integer.
  • the encapsulation method of immersive media is a single-track encapsulation method, that is, the media production device can encapsulate the video bit stream into a video track to obtain the media file of the immersive media.
  • alternatively, the encapsulation method of the immersive media may be a multi-track encapsulation method, that is, the media production device can encapsulate the video bit stream into multiple video tracks to obtain the media file of the immersive media; in this case, the media production device can encapsulate the video image information and corresponding video auxiliary information collected by the same camera into the same video track, or encapsulate the video image information and corresponding video auxiliary information collected by the same camera into different video tracks, and there is no limit to this.
  • one video track can correspond to one or more cameras, and there is no limit to this; for example, the media production device can choose one video track to encapsulate only at least one of the video image information and corresponding video auxiliary information collected by one camera; it can also choose one video track to encapsulate the video image information and corresponding video auxiliary information collected by at least two cameras; or it can choose one video track to encapsulate the video image information collected by at least two cameras, or the video auxiliary information collected by at least two cameras.
  • media production equipment can also perform the following operations:
  • if M=1 (that is, the video bit stream of the immersive media is encapsulated into a single video track), the media production device can add metadata information to the video media header data box of this only video track. If M>1 (that is, the video bit stream of the immersive media is encapsulated into multiple video tracks), then in the case where the video image information and corresponding video auxiliary information collected by the same camera are encapsulated into the same video track, the media production device can add metadata information to the video media header data box of each of the M video tracks; in the case where the video image information and corresponding video auxiliary information collected by the same camera are encapsulated into different video tracks, considering that the M video tracks can be subdivided into one or more video image information tracks and one or more video auxiliary information tracks, and that the video auxiliary information tracks can be referenced or associated to the corresponding video image information tracks, to save resources the media production device may add metadata information only in the video media header data box of each video image information track among the M video tracks.
  • in summary, the media production device can at least add metadata information in the video media header data box of any video track that encapsulates video image information; that is, if the m-th video track among the M video tracks encapsulates the video image information of at least one camera, the video media header data box of the m-th video track includes metadata information, m ∈ [1, M].
  • the metadata information includes a content information flag field; the media production device can set the value of the content information flag field according to whether the video track contains both video image information and corresponding video auxiliary information. Specifically, if the corresponding video track contains both video image information and corresponding video auxiliary information, the media production device can set the value of the corresponding content information flag field to the first value (such as the value 1); if the corresponding video track only contains video image information, the media production device can set the value of the corresponding content information flag field to the second value (such as the value 0).
  • a content information flag field, used to determine whether the video track contains both video image information and corresponding video auxiliary information, is set in the metadata information, which ensures the processing speed of the media consumption device during the decapsulation process and also saves processing resources.
  • the content information flag field may be any of the following: the video depth information flag field (i.e., the aforementioned video_with_depth_info_flag), the video parallax information flag field (i.e., the aforementioned video_with_parallax_info_flag), or the video auxiliary information flag field (i.e., the aforementioned video_with_auxl_info_flag).
  • when the content information flag field is the video depth information flag field, the video image information includes texture information and the video auxiliary information includes depth information; when the content information flag field is the video disparity information flag field, the video auxiliary information includes disparity information; when the content information flag field is the video auxiliary information flag field, the video auxiliary information includes depth information, disparity information, or auxiliary information defined by other applications.
  • the media production device may also choose to add metadata information in the video media header data box of a video track that encapsulates video auxiliary information; in this case, if the corresponding video track contains both video image information and corresponding video auxiliary information, the media production device can set the value of the corresponding content information flag field to the first value (such as the value 1); if the corresponding video track only contains video auxiliary information, the media production device can set the value of the corresponding content information flag field to the second value (such as the value 0).
  • the media production device may not pay attention to whether the video auxiliary information and video image information are encapsulated in the same video track, and directly choose to add metadata information in the video media header data box of each video track;
  • the information status of the video track can be indicated by three values of the content information flag field: if the corresponding video track contains both video image information and corresponding video auxiliary information, the media production device can set the value of the corresponding content information flag field to the first value (such as the value 1); if the corresponding video track only contains video image information, the media production device can set the value of the corresponding content information flag field to the second value (such as the value 0); if the corresponding video track only contains video auxiliary information, the media production device can set the value of the corresponding content information flag field to another value (such as the value 2), and so on.
  • when the content information flag field is the video depth information flag field, the video disparity information flag field, or the video auxiliary information flag field, it is possible to promptly clarify during the encapsulation process which of depth information, disparity information, and other auxiliary information the video auxiliary information is.
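The flag-setting rules above can be sketched as two small helper functions: the two-valued rule (1 = image plus auxiliary information, 0 = image only) and the three-valued variant where another value (e.g. 2) marks an auxiliary-only track. The function names and the boolean track model are illustrative assumptions; the values follow the text.

```python
# Hedged sketch of setting the content information flag field
# (video_with_depth_info_flag / video_with_parallax_info_flag /
# video_with_auxl_info_flag) in the video media header data box.
def content_info_flag(has_image_info: bool, has_aux_info: bool) -> int:
    """Two-valued rule: first value (1) when the track carries both video
    image information and the corresponding auxiliary information,
    second value (0) when it carries only image information."""
    if has_image_info and has_aux_info:
        return 1
    return 0

def content_info_flag_3state(has_image_info: bool, has_aux_info: bool) -> int:
    """Three-valued variant from the text: another value (here 2) marks a
    track that carries only video auxiliary information."""
    if has_image_info and has_aux_info:
        return 1
    if has_image_info:
        return 0
    return 2

print(content_info_flag(True, True), content_info_flag_3state(False, True))
```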
  • the media production device can also add a video information codec dependency field in the video media header data box of the corresponding video track, and set the value of the video information codec dependency field according to whether there is a decoding dependency between the video image information in the corresponding video track and the corresponding video auxiliary information.
  • the video media header data box in the corresponding video track also includes a video information codec dependency field; when the value of the video information codec dependency field is the third value, it indicates that there is a decoding dependency between the video image information in the corresponding video track and the corresponding video auxiliary information; when the value of the video information codec dependency field is the fourth value, it indicates that there is no decoding dependency between the video image information in the corresponding video track and the corresponding video auxiliary information.
  • if there is a decoding dependency between the video image information in the video track and the corresponding video auxiliary information, the value of the video information codec dependency field can be set to the third value (such as the value 1); if there is no decoding dependency between the video image information in the video track and the corresponding video auxiliary information, the value of the video information codec dependency field can be set to the fourth value (such as the value 0).
  • the specific form of the video information codec dependency field may depend on the specific form of the content information flag field: if the content information flag field is the video depth information flag field, the video information codec dependency field is the depth codec dependency field; if the content information flag field is the video disparity information flag field, the video information codec dependency field is the parallax codec dependency field; if the content information flag field is the video auxiliary information flag field, the video information codec dependency field is the auxiliary codec dependency field.
  • the video information codec dependency field is any of the following: a depth codec dependency field, a parallax codec dependency field, or an auxiliary codec dependency field; where: when the video information codec dependency field is the depth codec dependency field, the video image information includes texture information and the video auxiliary information includes depth information; when the video information codec dependency field is the parallax codec dependency field, the video auxiliary information includes disparity information; when the video information codec dependency field is the auxiliary codec dependency field, the video auxiliary information includes depth information, disparity information, or auxiliary information defined by other applications.
  • the depth codec dependency field is the aforementioned depth_codec_dependency
  • the parallax codec dependency field is the aforementioned parallax_codec_dependency
  • the auxiliary codec dependency field is the aforementioned auxl_codec_dependency.
  • when the video information codec dependency field is the depth codec dependency field, the parallax codec dependency field, or the auxiliary codec dependency field, it can be promptly clarified which of depth information, disparity information, and other auxiliary information the auxiliary information in the video track is.
  • the video component dependency information data box is used to indicate: the component type of the video content of the immersive media, and whether there are codec dependencies between different components of the video content of the immersive media.
  • a video component dependency information data box is defined; through this data box, it is possible to clarify the component type of the video content and whether there is a codec dependency relationship between the components, thereby improving codec efficiency.
  • the media production equipment can add a video component dependency information data box to the video track;
  • if a video track encapsulates the video image information collected by at least two cameras, or encapsulates the video auxiliary information collected by at least two cameras, the media production device can add a video component dependency information data box to that video track. It can be seen that if the m-th video track among the M video tracks encapsulates one or more types of information collected by at least two cameras, the m-th video track includes a video component dependency information data box.
  • when the number of video tracks is 1, there is no need for multiple video component dependency information data boxes to indicate the component types of the video content and the codec dependencies between different components.
  • when the number of video tracks is multiple, if a video track encapsulates multiple types of information collected by at least two cameras, the video component dependency information data box needs to be included in that video track to indicate the codec dependencies between the information of the cameras, thereby improving codec efficiency.
  • the video component dependency information data box includes a component codec dependency field; when the value of the component codec dependency field is the fifth value, it indicates that a codec dependency relationship exists between different components of the video content of the immersive media; when the value of the component codec dependency field is the sixth value, it indicates that there is no codec dependency relationship between different components of the video content of the immersive media.
  • the video component dependency information data box may include a component codec dependency field (i.e., the aforementioned composition_codec_dependency); the media production device may set the value of the component codec dependency field according to whether there is a codec dependency relationship between different components of the video content of the immersive media. Specifically, if there is a codec dependency relationship between different components of the video content of the immersive media, the media production device can set the value of the corresponding component codec dependency field to the fifth value (such as the value 1); if there is no codec dependency relationship between different components of the video content of the immersive media, the media production device can set the value of the corresponding component codec dependency field to the sixth value (such as the value 0).
  • the value of the component codec dependency field can accurately reflect whether there is a codec dependency relationship between different components of the video content of the immersive media.
  • the video component dependency information data box includes a component type field; when the value of the component type field is the seventh value, it indicates that the component type of the video content of the immersive media is a camera-based component type, wherein one component includes the video content captured by one camera; when the value of the component type field is the eighth value, it indicates that the component type of the video content of the immersive media is a binocular-based component type, wherein one component includes left-eye video content or right-eye video content.
  • the component type field is the aforementioned composition_type; the media production device can set the value of the component type field according to the component type of the video content of the immersive media (that is, the division rule for the components of the video content of the immersive media). Specifically, if the component type of the video content of the immersive media is the camera-based component type, that is, one component includes the video content shot by one camera, the media production device can set the value of the corresponding component type field to the seventh value (such as the value 0); if the component type of the video content of the immersive media is the binocular-based component type, that is, one component includes left-eye video content or right-eye video content, the media production device can set the value of the corresponding component type field to the eighth value (such as the value 1).
  • through the value of the component type field in the video component dependency information data box, it is possible to accurately distinguish whether the component type of the video content of the immersive media is the camera-based component type or the binocular-based component type.
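The two fields of the video component dependency information data box described above can be sketched as a small class. The class name, the constants, and the helper method are illustrative assumptions; the field names (composition_codec_dependency, composition_type) and the example values (camera-based = 0, binocular-based = 1, dependency present = 1, absent = 0) follow the text.

```python
# Hedged sketch of the video component dependency information data box; the
# real box's fourCC and bit layout are not specified in this passage.
CAMERA_BASED = 0     # seventh value: one component = video shot by one camera
BINOCULAR_BASED = 1  # eighth value: one component = left-eye or right-eye video

class VideoCompositionDependencyInfo:
    def __init__(self, composition_codec_dependency: int, composition_type: int):
        # fifth value (1): components depend on each other when coding/decoding;
        # sixth value (0): components are independent
        self.composition_codec_dependency = composition_codec_dependency
        # CAMERA_BASED or BINOCULAR_BASED
        self.composition_type = composition_type

    def components_independent(self) -> bool:
        """True when different components can be decoded separately."""
        return self.composition_codec_dependency == 0

box = VideoCompositionDependencyInfo(0, CAMERA_BASED)
print(box.components_independent())
```

A media consumption device could use components_independent() to decide whether a single camera's component can be handed to a decoder on its own.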
  • the video track encapsulating video image information among the M video tracks also includes a track group data box; wherein the track group data box includes a track group identification field and a camera identification field;
  • the track group identification field is used to store the identification information of the track group corresponding to the corresponding video track
  • the camera identification field is used to store the identification information of the camera corresponding to the corresponding video track.
  • the track group data box may include at least a track group identification field and a camera identification field; for any video track that encapsulates video image information, the media production device may store the identification information of the track group corresponding to that video track into the track group identification field in the track group data box in the video track, and store the identification information of the camera corresponding to that video track into the camera identification field in the track group data box in the video track.
  • the media production device can accurately store the identification information of the track group and the identification information of the camera in the corresponding identification fields.
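The track group mechanism above can be sketched as follows: tracks whose track group identification matches belong to the same video content, and the camera identification tells which camera each track carries. The class and function names are illustrative assumptions; only the two fields come from the text.

```python
# Hedged sketch of the track group data box fields (track group identification
# field and camera identification field) and how a reader might group tracks.
class TrackGroupDataBox:
    def __init__(self, track_group_id: int, camera_id: int):
        self.track_group_id = track_group_id  # track group identification field
        self.camera_id = camera_id            # camera identification field

def group_tracks_by_content(tracks):
    """Collect the camera ids of all tracks sharing a track group id,
    i.e. all tracks carrying the same video content."""
    groups = {}
    for t in tracks:
        groups.setdefault(t.track_group_id, []).append(t.camera_id)
    return groups

tracks = [TrackGroupDataBox(100, 1), TrackGroupDataBox(100, 2),
          TrackGroupDataBox(200, 1)]
print(group_tracks_by_content(tracks))
```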
  • the media production device can also add a track reference type data box to each video auxiliary information track, so as to reference or associate the video auxiliary information track with the video image information track corresponding to the same camera, and to indicate whether there is a dependency between the video auxiliary information track and the corresponding video image information track during decoding.
  • the M video tracks include: one or more video image information tracks and one or more video auxiliary information tracks; wherein, a video image information track refers to a video track that encapsulates video image information, and a video auxiliary information track refers to a video track that encapsulates video auxiliary information; any video auxiliary information track includes a track reference type data box, which is used to reference or associate the video auxiliary information track to the video image information track corresponding to the same camera, and is used to indicate whether there is a dependency between the video auxiliary information track and the corresponding video image information track during decoding.
  • the video tracks are divided into tracks that encapsulate at least one type of video image information and tracks that encapsulate at least one type of video auxiliary information, and the decoding dependencies indicated by the track reference type data box corresponding to each video auxiliary information track can improve decoding efficiency.
  • the track reference type data box contains a track identification field for storing identification information of the video image information track associated with the video auxiliary information track.
  • the track reference type data box contains a track identification field (track_ID); for any video auxiliary information track, the media production device can store the identification information of the video image information track associated with that video auxiliary information track into the track identification field in the track reference type data box in the video auxiliary information track.
  • the media production device can accurately store the identification information of the video image information track associated with each video auxiliary track.
  • the track reference type data box contains a reference type field; when the reference type field stores the first track index, it indicates that the video auxiliary information track relies on the corresponding video image information track when decoding; when the reference type field stores the second track index, it indicates that the video auxiliary information track does not rely on the corresponding video image information track when decoding.
  • the media production device can select the first track index or the second track index to store into the reference type field in the track reference type data box in the video auxiliary information track, according to whether the video auxiliary information track relies on the corresponding video image information track when decoding. Specifically, if the video auxiliary information track relies on the corresponding video image information track when decoding, the first track index may be stored in the reference type field in the track reference type data box in the video auxiliary information track; if the video auxiliary information track does not rely on the corresponding video image information track when decoding, the second track index may be stored in the reference type field in the track reference type data box in the video auxiliary information track.
  • whether there is a dependency between the video auxiliary information track and the video image information track can be clarified based on the track index stored in the reference type field, so that decoding can be completed accurately and quickly, improving decoding efficiency.
  • the first track index is any one of the following: a first character string, used to indicate that the video auxiliary information in the video auxiliary information track includes depth information; a second character string, used to indicate that the video auxiliary information in the video auxiliary information track includes parallax information; a third character string, used to indicate that the video auxiliary information in the video auxiliary information track includes depth information, disparity information, or auxiliary information defined by other applications. The second track index is any one of the following: a fourth character string, used to indicate that the video auxiliary information in the video auxiliary information track includes depth information; a fifth character string, used to indicate that the video auxiliary information in the video auxiliary information track includes disparity information; a sixth character string, used to indicate that the video auxiliary information in the video auxiliary information track includes depth information, disparity information, or auxiliary information defined by other applications.
  • the first string is the aforementioned 'vdep'.
  • the second string is the aforementioned 'vplx'.
  • the third string is the aforementioned 'auxl'.
  • the fourth string is the aforementioned 'vdpi'.
  • the fifth string is the aforementioned 'vpli'.
  • the sixth string is the aforementioned 'auxi'.
  • setting a corresponding string for each type of video auxiliary information makes it possible to distinguish whether the video auxiliary information indicated by the first track index or the second track index is depth information, disparity information, or other auxiliary information.
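The correspondence above can be sketched in Python as a reading aid (not part of the specification): the six string values 'vdep', 'vplx', 'auxl', 'vdpi', 'vpli', and 'auxi' are taken from the text, while the function and dictionary names are hypothetical.

```python
# First track indexes: the auxiliary information track DEPENDS on the
# corresponding video image information track when decoding.
DEPENDENT_REFERENCE_TYPES = {
    "vdep": "depth information",
    "vplx": "disparity information",
    "auxl": "depth, disparity, or application-defined auxiliary information",
}

# Second track indexes: the auxiliary information track can be decoded
# INDEPENDENTLY of the corresponding video image information track.
INDEPENDENT_REFERENCE_TYPES = {
    "vdpi": "depth information",
    "vpli": "disparity information",
    "auxi": "depth, disparity, or application-defined auxiliary information",
}

def interpret_reference_type(reference_type: str):
    """Return (depends_on_image_track, kind_of_auxiliary_information)."""
    if reference_type in DEPENDENT_REFERENCE_TYPES:
        return True, DEPENDENT_REFERENCE_TYPES[reference_type]
    if reference_type in INDEPENDENT_REFERENCE_TYPES:
        return False, INDEPENDENT_REFERENCE_TYPES[reference_type]
    raise ValueError(f"unknown reference type: {reference_type!r}")
```

For example, a reference type of 'vdep' would be read as depth information whose track must be decoded together with the referenced image track.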
  • the video bitstream of the immersive media is obtained and encapsulated into M video tracks to obtain the media file of the immersive media; in this process, metadata information is encapsulated in the video media header data box of each video track to provide a unified indication of the information in the corresponding video track. This organizes the media content of the immersive media in a more optimized way and avoids the media consumption device repeatedly parsing, during decapsulation, the relevant fields in other data boxes used to indicate the information status of the video track, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency.
  • the method further includes: generating a signaling description file corresponding to the media file; wherein the signaling description file includes at least one video auxiliary information descriptor, and the video auxiliary information descriptor is used to describe the relevant attributes of the video auxiliary information of a video resource.
  • the transmission signaling corresponding to the signaling description file may be DASH signaling or SMT signaling, which is not limited.
  • the media production device may add at least one video auxiliary information descriptor in the signaling description file to describe the relevant attributes of the video auxiliary information of the video resource.
  • when M>1, it indicates that the immersive media is encapsulated in multi-track form, and the video resource or adaptation set corresponding to each video track is indicated through the preselection tool.
  • the preselection tool can effectively and accurately indicate the video resource or adaptation set corresponding to each video track.
  • a signaling description file including at least one video auxiliary information descriptor is generated, so that the relevant attributes of the video auxiliary information can be described in detail and accurately.
  • when the video auxiliary information descriptor is located at the adaptation set level of the signaling description file, it describes the relevant attributes of the video auxiliary information of each video resource in the corresponding adaptation set; when the video auxiliary information descriptor is located at the representation level of the signaling description file, it describes the relevant attributes of the video auxiliary information of the corresponding video resource.
  • the video auxiliary information descriptor can be located at the representation level to describe the relevant attributes of the video auxiliary information of the corresponding video resource.
  • in this case, the number of video auxiliary information descriptors in the signaling description file is the same as the number of video resources (representations) in the media file; alternatively, the video auxiliary information descriptor can be located at the adaptation set level to describe the relevant attributes of the video auxiliary information of each video resource in the corresponding adaptation set.
  • in this case, the number of video auxiliary information descriptors in the signaling description file is the same as the number of adaptation sets in the media file.
  • in this way, the relevant attributes matching each level can be accurately described.
  • the media production device may add a video auxiliary information field in the video auxiliary information descriptor.
  • when the video auxiliary information descriptor is located at the representation level, the media production device can store the relevant attributes of the video auxiliary information of the corresponding video resource into the video auxiliary information field; when the video auxiliary information descriptor is located at the adaptation set level, the media production device can store the relevant attributes of the video auxiliary information of the corresponding adaptation set into the video auxiliary information field.
  • in this way, the relevant attributes of the video auxiliary information of the video resource or of the adaptation set can be stored in the corresponding video auxiliary information field in a timely and accurate manner, improving the accuracy of video auxiliary information storage.
  • the video auxiliary information descriptor includes an auxiliary information indication field; when the value of the auxiliary information indication field is the ninth value, it indicates that the corresponding video resource contains both video image information and corresponding video auxiliary information; when the value of the auxiliary information indication field is the tenth value, it indicates that the corresponding video resource only contains video image information.
  • the media production device can add an auxiliary information indication field in the video auxiliary information descriptor.
  • the media production device may set the value of the auxiliary information indication field according to whether the video resource described by the video auxiliary information descriptor contains both video image information and corresponding video auxiliary information.
  • if the video resource described by the video auxiliary information descriptor contains both video image information and corresponding video auxiliary information, the media production device sets the value of the auxiliary information indication field to the ninth value (such as the value 1); if the video resource described by the video auxiliary information descriptor only contains video image information, the media production device sets the value of the auxiliary information indication field to the tenth value (such as the value 0).
  • through the auxiliary information indication field, it can be determined in time whether the corresponding video resource only contains video image information, or contains both video image information and corresponding video auxiliary information.
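A minimal sketch of how the media production device might set this field, assuming (as the text's examples suggest) the ninth value is 1 and the tenth value is 0; the function name and the dict-based "descriptor" are illustrative, not part of the specification.

```python
def set_auxiliary_info_indication(descriptor: dict,
                                  has_image_info: bool,
                                  has_auxiliary_info: bool) -> dict:
    """Set the auxiliary information indication field of a descriptor."""
    if has_image_info and has_auxiliary_info:
        # Ninth value: the resource carries both image and auxiliary information.
        descriptor["auxiliary_info_indication"] = 1
    else:
        # Tenth value: the resource carries video image information only.
        descriptor["auxiliary_info_indication"] = 0
    return descriptor
```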
  • the auxiliary information indication field is any one of the following: a video auxiliary information field with depth information, a video auxiliary information field with disparity information, or a video auxiliary information field with auxiliary information; wherein: when the auxiliary information indication field is the video auxiliary information field with depth information, the video image information includes texture information and the video auxiliary information includes depth information; when the auxiliary information indication field is the video auxiliary information field with disparity information, the video auxiliary information includes disparity information; when the auxiliary information indication field is the video auxiliary information field with auxiliary information, the video auxiliary information includes depth information, disparity information, or auxiliary information defined by other applications.
  • the video auxiliary information field with depth information is the VideoAuxiliaryInfo@withDepthInfo involved in the aforementioned DASH signaling or the video_with_depth_info_flag involved in SMT signaling
  • the video auxiliary information field with disparity information is the corresponding field involved in the aforementioned DASH signaling.
  • the video auxiliary information field with auxiliary information is VideoAuxiliaryInfo@withAuxiliaryInfo related to the aforementioned DASH signaling or video_with_auxl_info_flag related to the SMT signaling.
  • when the auxiliary information indication field is the video auxiliary information field with depth information, with disparity information, or with auxiliary information, it can be determined in time during the transmission process which of depth information, disparity information, and auxiliary information the video resource contains in addition to its video image information.
  • the media production device may add an auxiliary information codec dependency field in the video auxiliary information descriptor, and set the value of the auxiliary information codec dependency field according to whether there is a decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information.
  • the video auxiliary information descriptor includes an auxiliary information codec dependency field; when the value of the auxiliary information codec dependency field is the eleventh value, it indicates that there is a decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information; when the value of the auxiliary information codec dependency field is the twelfth value, it indicates that there is no decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information.
  • if there is a decoding dependency relationship between the video image information in the video resource described by the video auxiliary information descriptor and the corresponding video auxiliary information, the value of the auxiliary information codec dependency field can be set to the eleventh value (such as the value 1); if there is no such decoding dependency relationship, the value of the auxiliary information codec dependency field can be set to the twelfth value (such as the value 0).
  • by identifying the value of the auxiliary information codec dependency field, it can be quickly determined whether there is a decoding dependency relationship between the video image information in the video resource and the corresponding video auxiliary information, thereby improving decoding efficiency.
  • the specific form of the auxiliary information codec dependency field may depend on the specific form of the auxiliary information indication field: if the auxiliary information indication field is the video auxiliary information field with depth information, the auxiliary information codec dependency field is the depth codec dependency field of the video auxiliary information; if the auxiliary information indication field is the video auxiliary information field with disparity information, the auxiliary information codec dependency field is the disparity codec dependency field of the video auxiliary information; if the auxiliary information indication field is the video auxiliary information field with auxiliary information, the auxiliary information codec dependency field is the auxiliary codec dependency field of the video auxiliary information.
  • the auxiliary information codec dependency field is any one of the following: the depth codec dependency field of the video auxiliary information, the disparity codec dependency field of the video auxiliary information, or the auxiliary codec dependency field of the video auxiliary information; wherein: when the auxiliary information codec dependency field is the depth codec dependency field of the video auxiliary information, the video image information includes texture information and the video auxiliary information includes depth information; when the auxiliary information codec dependency field is the disparity codec dependency field of the video auxiliary information, the video auxiliary information includes disparity information; when the auxiliary information codec dependency field is the auxiliary codec dependency field of the video auxiliary information, the video auxiliary information includes depth information, disparity information, or auxiliary information defined by other applications.
  • the depth codec dependency field of the video auxiliary information is the VideoAuxiliaryInfo@depthCodecDependency involved in the aforementioned DASH signaling or the depth_codec_dependency involved in the SMT signaling
  • the parallax codec dependency field of the video auxiliary information is the VideoAuxiliaryInfo@parallaxCodecDependency involved in the aforementioned DASH signaling.
  • the auxiliary codec dependency field of the video auxiliary information is the VideoAuxiliaryInfo@auxilliaryCodecDependency involved in the aforementioned DASH signaling or the auxl_codec_dependency involved in SMT signaling.
  • when the auxiliary information codec dependency field is the depth codec dependency field, the disparity codec dependency field, or the auxiliary codec dependency field, it can be determined in time during the transmission process which of depth information, disparity information, and auxiliary information the auxiliary information is.
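The pairing between indication fields and codec dependency fields described above can be sketched as a lookup table. The DASH attribute names are the ones given in the text (the disparity pair is omitted because the text does not name the corresponding "with disparity" attribute); the dictionary itself is an illustrative reading aid.

```python
# Hypothetical mapping: the form of the auxiliary information codec dependency
# field follows the form of the auxiliary information indication field.
INDICATION_TO_DEPENDENCY_FIELD = {
    # with-depth indication -> depth codec dependency field
    "VideoAuxiliaryInfo@withDepthInfo": "VideoAuxiliaryInfo@depthCodecDependency",
    # with-auxiliary indication -> auxiliary codec dependency field
    # (attribute spelling as it appears in the text)
    "VideoAuxiliaryInfo@withAuxiliaryInfo": "VideoAuxiliaryInfo@auxilliaryCodecDependency",
}

def dependency_field_for(indication_field: str) -> str:
    """Return the codec dependency field matching an indication field."""
    return INDICATION_TO_DEPENDENCY_FIELD[indication_field]
```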
  • the media production device can also add a video auxiliary identification field to the video auxiliary information descriptor, and set the value of the video auxiliary identification field according to the information type in the corresponding video resource.
  • the video auxiliary information descriptor includes a video auxiliary identification field; when the value of the video auxiliary identification field is the thirteenth value, it indicates that the corresponding video resource only contains video auxiliary information corresponding to video image information; when the value of the video auxiliary identification field is the fourteenth value, it indicates that the corresponding video resource only contains video image information.
  • if the video resource described by the video auxiliary information descriptor only contains video auxiliary information corresponding to video image information, the value of the video auxiliary identification field can be set to the thirteenth value (such as the value 1); if the video resource only contains video image information, the value of the video auxiliary identification field can be set to the fourteenth value (such as the value 0).
  • the value of the video auxiliary identification field can accurately indicate whether the corresponding video resource only contains video auxiliary information or only video image information during the transmission process.
  • the video auxiliary identification field is any one of the following: the depth-information-only flag field of the video auxiliary information, the disparity-information-only flag field of the video auxiliary information, or the auxiliary-information-only flag field of the video auxiliary information; wherein: when the video auxiliary identification field is the depth-information-only flag field of the video auxiliary information, the video image information includes texture information and the video auxiliary information includes depth information; when the video auxiliary identification field is the disparity-information-only flag field of the video auxiliary information, the video auxiliary information includes disparity information; when the video auxiliary identification field is the auxiliary-information-only flag field of the video auxiliary information, the video auxiliary information includes depth information, disparity information, or auxiliary information defined by other applications.
  • the depth-information-only flag field of the video auxiliary information is the VideoAuxiliaryInfo@onlyDepthInfo involved in the aforementioned DASH signaling or the only_depth_info_flag involved in SMT signaling
  • the disparity-information-only flag field of the video auxiliary information is the VideoAuxiliaryInfo@onlyParallaxInfo involved in the aforementioned DASH signaling or the only_parallax_info_flag involved in SMT signaling
  • the auxiliary-information-only flag field of the video auxiliary information is the VideoAuxiliaryInfo@onlyAuxiliaryInfo involved in the aforementioned DASH signaling or the only_auxiliary_info_flag involved in SMT signaling.
  • in this way, the video auxiliary identification field may be the depth-information-only flag field, the disparity-information-only flag field, or the auxiliary-information-only flag field.
  • the signaling description file also includes at least one free-view camera descriptor, and the free-view camera descriptor is used to describe the camera information corresponding to a video resource; wherein the free-view camera descriptor includes a free-view camera source identification field, and the value of the free-view camera source identification field is used to indicate the content identifier to which the corresponding video resource belongs, where video resources corresponding to different cameras under the same video content have the same content identifier; alternatively, the free-view camera descriptor includes a free-view camera combination identification field, and the value of the free-view camera combination identification field is used to indicate the combination identifier to which the corresponding video resource belongs, where video resources corresponding to different cameras under the same video content have the same combination identifier.
  • the transmission signaling is DASH signaling
  • the free-view camera descriptor can be located at the representation level to describe the camera information corresponding to the corresponding video resource.
  • in this case, the number of free-view camera descriptors in the signaling description file is the same as the number of video resources (representations) in the media file; alternatively, the free-view camera descriptor can be located at the adaptation set level to describe the camera information corresponding to each video resource in the corresponding adaptation set.
  • in this case, the number of free-view camera descriptors in the signaling description file is the same as the number of adaptation sets in the media file.
  • the media production device can add a free-view camera source identification field in the free-view camera descriptor, and store the content identifier to which the video resource described by the free-view camera descriptor belongs in the free-view camera source identification field; alternatively, the media production device can add a free-view camera combination identification field in the free-view camera descriptor, and store the combination identifier to which the video resource described by the free-view camera descriptor belongs in the free-view camera combination identification field.
  • the media production device can also add one or more AvsFreeViewCam (free-view camera) fields in the free-view camera descriptor, and store the camera information corresponding to the corresponding representation or adaptation set into AvsFreeViewCam.
  • AvsFreeViewCam@cameraId is the free-view camera identification field
  • other information of each camera corresponding to the video resource can also be added to the free-view camera descriptor, such as the x, y, and z component values of the corresponding camera position, and the x, y, and z component values of the camera focal length.
  • the camera information corresponding to the video resource can be described accurately and comprehensively.
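The shape of the camera information carried by a free-view camera descriptor, as described above, can be sketched with simple dataclasses. The AvsFreeViewCam@cameraId field name is taken from the text; the dataclass and attribute names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FreeViewCamera:
    """One AvsFreeViewCam entry: identifier, position, and focal length."""
    camera_id: int      # AvsFreeViewCam@cameraId (free-view camera identification field)
    pos_x: float = 0.0  # x/y/z component values of the camera position
    pos_y: float = 0.0
    pos_z: float = 0.0
    focal_x: float = 0.0  # x/y/z component values of the camera focal length
    focal_y: float = 0.0
    focal_z: float = 0.0

@dataclass
class FreeViewCamDescriptor:
    """A free-view camera descriptor for one representation or adaptation set."""
    source_id: int  # content identifier (free-view camera source identification field)
    cameras: list = field(default_factory=list)  # one or more AvsFreeViewCam entries
```

Video resources corresponding to different cameras under the same video content would then share the same `source_id`.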
  • corresponding information is added to the signaling description file to indicate whether each video resource in the media file contains both video image information and video auxiliary information, whether there is a decoding dependency relationship between the various pieces of information in a video resource, and whether there is a decoding dependency relationship between different components of the immersive media video content; this enables the client in the media consumption device, during free-view video consumption, to select the corresponding resources for consumption and to decide whether to perform segmented decoding according to the instructions of the signaling description file, which can effectively improve the playback performance of the client.
  • FIG. 3 is a schematic flowchart of another immersive media processing method provided by an embodiment of the present application.
  • the immersive media processing method can be executed by the media consumption device, or by a client in the media consumption device; for ease of explanation, the following description will take the media consumption device executing the immersive media processing method as an example.
  • the immersive media processing method may include the following steps S301-S302:
  • immersive media is captured by multiple cameras.
  • the information collected by any camera includes: video image information and corresponding video auxiliary information; and the media file is obtained by encapsulating the video bitstream of the immersive media into M video tracks.
  • M is a positive integer.
  • among the M video tracks, there is at least one video track whose video media header data box includes metadata information.
  • the metadata information is used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  • the video track encapsulating the video image information among the M video tracks also contains a track group data box.
  • the track group data box includes a track group identification field and a camera identification field; where:
  • the track group identification field is used to store the identification information of the track group corresponding to the corresponding video track
  • the camera identification field is used to store the identification information of the camera corresponding to the corresponding video track.
  • the M video tracks include: one or more video image information tracks and one or more video auxiliary information tracks; any video auxiliary information track includes a track reference type data box.
  • the track reference type data box is used to reference or associate the video auxiliary information track to the video image information track corresponding to the same camera, and to indicate whether there is a dependency between the video auxiliary information track and the corresponding video image information track during decoding.
  • the track reference type data box includes a track identification field, which is used to store identification information of the video image information track associated with the video auxiliary information track.
  • the track reference type data box contains a reference type field; when the reference type field stores the first track index, it indicates that the video auxiliary information track depends on the corresponding video image information track during decoding; when the reference type field stores the second track index, it indicates that the video auxiliary information track does not depend on the corresponding video image information track during decoding.
  • the media consumption device may download the media file of the immersive media from the media production device. After downloading the media file, the media consumption device can decapsulate the media file to obtain M video tracks, and then select a target video track from the M video tracks. Specifically, if the immersive media is a free-view video, the media consumption device can determine a target camera that matches the user's viewing angle and select the video track corresponding to the target camera from the M video tracks as the target video track; the target video track encapsulates one or more of the video image information and video depth information collected by the target camera.
  • the media consumption device can establish streaming transmission with the media production device and obtain, from the media production device, the signaling description file corresponding to the media file of the immersive media; the signaling description file includes at least one video auxiliary information descriptor, and the video auxiliary information descriptor is used to describe the relevant attributes of the video auxiliary information of a video resource.
  • the media consumption device can request the media production device to deliver the target video resource according to the instructions of the signaling description file; after receiving the target video resource, the media consumption device can decapsulate the target video resource to obtain the target video track.
  • the media consumption device can determine the target video resource according to the instructions of the signaling description file and combined with the user's viewing angle, as follows:
  • the media consumption device can select a target camera that matches the user's viewing angle from the multiple cameras based on the user's viewing angle and the camera position information described by each free-view camera descriptor in the signaling description file.
  • then, based on the camera identifier stored in the free-view camera identification field of each free-view camera descriptor, the free-view camera descriptor corresponding to the target camera is determined.
  • next, the video auxiliary information descriptor corresponding to that free-view camera descriptor can be determined from the signaling description file, and based on the value of the auxiliary information indication field in the determined video auxiliary information descriptor, it is determined whether the corresponding video resource contains both video image information and video auxiliary information, so that the target video resource can be determined based on this first determination result.
  • the media consumption device may determine the video resource described by the determined video auxiliary information descriptor as the target video resource.
  • the media consumption device can further determine, based on the value of the auxiliary information codec dependency field in the determined video auxiliary information descriptor, whether there is a decoding dependency relationship between the corresponding video image information and the video auxiliary information, so that the target video resource can be determined based on this second determination result.
  • when the value of the auxiliary information codec dependency field is the eleventh value, it indicates that there is a decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information; when the value of the auxiliary information codec dependency field is the twelfth value, it indicates that there is no decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information.
  • if the second determination result indicates that there is no decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information, the media consumption device may take the video resource described by the determined video auxiliary information descriptor as the target video resource; if the second determination result indicates that there is a decoding dependency relationship between the video image information in the corresponding video resource and the corresponding video auxiliary information, the media consumption device can take the video resource described by the determined video auxiliary information descriptor, as well as other video resources associated with this video resource, as the target video resources.
  • the other video resources here refer to the video resources in which the video auxiliary information corresponding to the video image information of the video resource described by the determined video auxiliary information descriptor is located.
  • the media consumption device can identify, based on the value of the video auxiliary identification field in each video auxiliary information descriptor in the signaling description file, whether each video resource contains video image information or the video auxiliary information corresponding to video image information, thereby determining, according to the identification result, the other video resources associated with the video resource described by the determined video auxiliary information descriptor.
  • when the value of the video auxiliary identification field is the thirteenth value, it indicates that the corresponding video resource only contains video auxiliary information corresponding to video image information; when the value of the video auxiliary identification field is the fourteenth value, it indicates that the corresponding video resource only contains video image information.
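The selection walk-through above can be sketched end to end. This is an illustrative reduction, not the specification: each "resource" is modeled as a dict whose keys stand in for the auxiliary information indication field (`has_both`), the auxiliary information codec dependency field (`codec_dependency`), and the video auxiliary identification field (`only_auxiliary`); the 1/0 value semantics follow the text.

```python
def select_target_resources(resources, target_camera_id):
    """Pick the target video resource(s) for the camera matching the
    user's viewing angle, following the signaling description file."""
    # Step 1: the resource whose free-view camera descriptor matches the
    # target camera (here the first non-auxiliary-only match).
    primary = next(r for r in resources
                   if r["camera_id"] == target_camera_id
                   and not r.get("only_auxiliary"))
    targets = [primary]
    # Step 2: if the resource does not already carry both image and auxiliary
    # information, but its image information has a decoding dependency on
    # separately packaged auxiliary information, also fetch the associated
    # auxiliary-only resources of the same camera.
    if not primary["has_both"] and primary["codec_dependency"]:
        targets += [r for r in resources
                    if r.get("only_auxiliary")
                    and r["camera_id"] == target_camera_id]
    return targets
```

With this sketch, a camera whose image and depth are packaged separately with a decoding dependency yields two target resources, while a camera whose resource already contains both yields one.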
  • the media consumption device can request the media production device to deliver the target video resource in real time based on the obtained signaling description file, and decapsulate the target video resource.
  • in this way, the target video track can be obtained quickly and accurately, and the decapsulation efficiency can be improved.
  • the media consumption device can decode the information in the target video track based on the metadata information in the video media header data box of the target video track.
  • the metadata information may include a content information flag field.
  • when the value of the content information flag field is the first value, it indicates that the corresponding video track contains both video image information and corresponding video auxiliary information; when the value of the content information flag field is the second value, it indicates that the corresponding video track only contains video image information.
  • the media consumption device can first determine, based on the value of the content information flag field in the video media header data box of the target video track, whether the target video track contains both video image information and corresponding video auxiliary information; if it is determined that the target video track only contains video image information, the media consumption device can decode the video image information.
  • the video media header data box in the target video track at this time also includes a video information codec dependency field; when the value of the video information codec dependency field is the third value, it indicates that there is a decoding dependency relationship between the video image information in the corresponding video track and the corresponding video auxiliary information; when the value of the video information codec dependency field is the fourth value, it indicates that there is no decoding dependency relationship between the video image information in the corresponding video track and the corresponding video auxiliary information.
  • the media consumption device can first determine, based on the value of the video information codec dependency field in the video media header data box of the target video track, whether there is a decoding dependency relationship between the video image information in the target video track and the corresponding video auxiliary information, thereby determining whether to perform segmented decoding on the video image information and video auxiliary information in the target video track. Specifically, if there is a decoding dependency, the video image information and video auxiliary information in the target video track can be decoded simultaneously; if there is no decoding dependency, the video image information and video auxiliary information in the target video track can be decoded in segments, that is, texture information is decoded at one stage and depth information at another. Of course, when there is no decoding dependency, the video image information and the video auxiliary information can also be decoded simultaneously, which is not limited here.
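The decoding decision just described can be summarized in a short sketch. The function name is hypothetical, and the concrete numeric values (1 for the first/third values, 0 for the second/fourth values) are assumed by analogy with the example values the text gives for its other flag fields.

```python
def choose_decoding_strategy(content_info_flag: int,
                             codec_dependency: int) -> str:
    """Decide how to decode a target video track from its video media
    header data box fields."""
    if content_info_flag == 0:
        # Second value: the track only contains video image information.
        return "decode image information only"
    if codec_dependency == 1:
        # Third value: decoding dependency exists between image and
        # auxiliary information, so they are decoded together.
        return "decode image and auxiliary information simultaneously"
    # Fourth value: no dependency, so segmented decoding is allowed
    # (e.g. texture at one stage, depth at another).
    return "segmented decoding allowed"
```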
  • when the media consumption device obtains the target video resource through streaming transmission and obtains the target video track from the target video resource, the media consumption device already knows, from the relevant information in the signaling description file, whether the target video resource contains both video image information and video auxiliary information, and, when it contains both, whether there is a decoding dependency between the video image information and the video auxiliary information; therefore, after obtaining the target video track, the media consumption device can directly use the corresponding logic to decode the information in the target video track.
  • the media consumption device can decode the information in the target video track according to the metadata information in the video media header data box of the target video track, and the decoding logic is the same as the above-mentioned decoding logic.
  • the video track may also include a video component dependency information data box.
  • the video component dependency information data box may include a component codec dependency field and a component type field; correspondingly, if the target video track contains the video component dependency information data box, the media consumption device can learn the component type of the video content of the immersive media from the value of the component type field, and can learn from the value of the component codec dependency field whether there is a codec dependency between different components of the video content of the immersive media, so as to determine whether the information corresponding to each camera in the target video track can be decoded independently.
  • when the value of the component codec dependency field is the fifth value, it indicates that there is a codec dependency between different components of the video content of the immersive media; when the value of the component codec dependency field is the sixth value, it indicates that there is no codec dependency between different components of the video content of the immersive media.
  • when the value of the component type field is the seventh value, it indicates that the component type of the video content of the immersive media is a camera-based component type; when the value of the component type field is the eighth value, it indicates that the component type of the video content of the immersive media is a binocular-based component type.
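The interpretation of these two fields can be sketched as follows. The concrete numeric values chosen for the fifth to eighth values (1/0 for the dependency field, 0/1 for the type field, matching the `composition_codec_dependency`/`composition_type` values used in the application scenarios below) are assumptions for illustration, since this description leaves them unspecified.

```python
# Illustrative interpretation of the video component dependency information
# data box. The numeric values below (assumed: fifth value = 1, sixth = 0,
# seventh = 0, eighth = 1) are not fixed by this description.

COMPOSITION_TYPES = {
    0: "camera_based",     # assumed "seventh value": components split per camera
    1: "binocular_based",  # assumed "eighth value": components split per eye
}

def parse_component_dependency_box(box: dict) -> dict:
    """Return the component type and whether components depend on each other."""
    return {
        "composition_type": COMPOSITION_TYPES[box["composition_type"]],
        # assumed "fifth value" = 1 (dependency exists), "sixth value" = 0 (none)
        "components_codec_dependent": box["composition_codec_dependency"] == 1,
    }
```

With camera-based components and no codec dependency, the consumption device may decode the information of each camera independently.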
  • Embodiments of the present application can avoid the media consumption device from repeatedly parsing the relevant fields in other data boxes used to indicate the information in the video track during the decapsulation process, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency.
  • the embodiments of this application exemplarily propose the following three application scenarios:
  • the server generates the video bit stream of the free-view video and encapsulates the video bit stream into a media file F0, where the free-view video bit stream is encapsulated in single-track mode, as shown in Figure 4a. In F0:
  • video_with_depth_info_flag = 1, indicating that the video track is a track containing texture information and depth information;
  • depth_codec_dependency = 0, indicating that the depth information of the video track does not depend on the texture information during decoding;
  • composition_type = 0, indicating that the video content of this video track contains video content shot by different cameras;
  • composition_codec_dependency = 0, indicating that the video content captured by different cameras in the video track does not depend on each other during decoding.
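The field values listed for F0 above can be sketched as a plain mapping. The field names follow the scenario text; the dictionary layout and the helper function are illustrative assumptions, not the normative box format of the media file.

```python
# The single-track metadata of this scenario, expressed as a plain mapping
# for illustration (field names follow the scenario text; the container
# layout is an assumption, not the normative file format).

f0_video_media_header = {
    "video_with_depth_info_flag": 1,    # track contains texture + depth info
    "depth_codec_dependency": 0,        # depth decodes independently of texture
    "composition_type": 0,              # content from different cameras
    "composition_codec_dependency": 0,  # per-camera content decodes independently
}

def can_decode_depth_separately(header: dict) -> bool:
    """With no codec dependency, depth may be decoded in its own stage."""
    return (header["video_with_depth_info_flag"] == 1
            and header["depth_codec_dependency"] == 0)
```

For the F0 of this scenario, depth information may therefore be decoded in a separate stage from the texture information.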
  • server S1 further repackages F0 into a multi-video track free-view file F1.
  • for the processing of F1 in multi-track mode, please refer to the relevant descriptions of application scenarios 2 and 3 below.
  • server S2 provides a download service, making the media file F0 of the free-view video available for users to download without any further processing.
  • server S3 supports streaming transmission: it slices the media file F0 of the free-view video into media segments suitable for streaming transmission (in accordance with the specifications of the existing transmission standards), generates the corresponding transmission signaling, and delivers the signaling description file to the user.
  • the signaling description file of the transmission signaling indicates the following information:
  • VideoAuxiliaryInfo@withDepthInfo = 1, indicating that the video resource contains both texture information (image information) and the depth information corresponding to the texture information (image information);
  • VideoAuxiliaryInfo@depthCodecDependency = 0, indicating that there is no decoding dependency between the texture information (image information) contained in the corresponding video resource and the depth information corresponding to the texture information (image information).
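The signaling indication above can be sketched as follows. The `VideoAuxiliaryInfo` attribute names come from the scenario text; the XML element layout is an assumption for illustration only (actual transmission signaling, such as a DASH MPD, organizes descriptors differently).

```python
# Sketch of reading the VideoAuxiliaryInfo descriptor attributes from a
# signaling description file. Attribute names follow the scenario text;
# the XML layout below is an illustrative assumption.
import xml.etree.ElementTree as ET

SIGNALING = """
<Representation id="free_view_stream">
  <VideoAuxiliaryInfo withDepthInfo="1" depthCodecDependency="0"/>
</Representation>
"""

def read_aux_info(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    aux = root.find("VideoAuxiliaryInfo")
    return {
        "with_depth_info": aux.get("withDepthInfo") == "1",
        "depth_codec_dependency": aux.get("depthCodecDependency") == "1",
    }
```

Having parsed these attributes before requesting segments, the client already knows whether staged decoding of texture and depth is permitted for the selected stream.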
  • Client C1 downloads the complete file F0 and plays it locally.
  • the client can perform partial decoding at different stages, that is, decoding texture information at a certain stage and decoding depth information at a certain stage to optimize the performance of client playback.
  • Client C2 establishes streaming transmission with server S3, and selects the corresponding free-view video stream for consumption based on the texture information, depth information and codec dependency information indicated in the transmission signaling.
  • the client can perform partial decoding at different stages, that is, decoding texture information at a certain stage and decoding depth information at a certain stage to optimize the performance of client playback.
  • in addition, the server can generate multiple media files, each media file corresponding to one free-view video stream; in some of these media files there is a decoding dependency between the video image information and the video auxiliary information, while in the other media files there is no decoding dependency between the video image information and the video auxiliary information.
  • the server generates the video bit stream of the free-view video and encapsulates the video bit stream into a media file F0, where the free-view video bit stream is encapsulated in multi-track mode, as shown in Figure 5a. In F0:
  • Track1~Track4 all contain track group data boxes (AvsFreeViewGroupBox), and track_group_id is 100, indicating that Track1~Track4 are K tracks of the same free-view video containing data shot by different cameras.
  • server S1 provides a download service, making the media file F0 of the free-view video available for users to download without any further processing.
  • server S2 supports streaming transmission: it slices the media file F0 of the free-view video into media segments suitable for streaming transmission (in accordance with the specifications of the existing transmission standards), generates the corresponding transmission signaling, and delivers the signaling description file to the user.
  • the signaling description file of the transmission signaling indicates the following information:
  • VideoAuxiliaryInfo@withDepthInfo = 1, indicating that the video resource contains both texture information (image information) and the depth information corresponding to the texture information (image information);
  • VideoAuxiliaryInfo@depthCodecDependency = 1, indicating that there is a decoding dependency between the texture information (image information) contained in the corresponding video resource and the depth information corresponding to the texture information (image information);
  • AvsFreeViewCamInfo@sourceId indicates that different video resources belong to the same video content
  • AvsFreeViewCamInfo@cameraId and AvsFreeViewCam@camera_pos_x/y/z indicate the camera identifier and camera location information respectively.
  • Client C1 downloads the complete file F0 and plays it locally.
  • according to the indications in the file about the texture information, depth information, codec dependency information and camera information, the client can partially decode only the texture information and depth information corresponding to a specific camera according to the user's viewing operation (in this case, because there is a decoding dependency between the texture information and the depth information, they cannot be decoded in stages), so as to optimize the performance of client playback.
  • Client C2 establishes streaming transmission with server S2 and, according to the indications about the texture information, depth information, codec dependency information and camera information in the transmission signaling, combined with the user's viewing operation, requests the server to transmit only the video resources containing the texture information and depth information corresponding to a specific camera, and then decodes and consumes the corresponding video resources.
  • the server generates the video bit stream of the free-view video and encapsulates the video bit stream into a media file F0, where the free-view video bit stream is encapsulated in multi-track mode, as shown in Figure 6a. In F0:
  • the handler type in the handler box (handlerbox) in Track11~Track14 is 'auxiliary video track', indicating that these video tracks (i.e., video auxiliary information tracks) are used to assist the consumption of certain video image information tracks (i.e., texture map tracks).
  • from the TrackReferenceTypeBox, the correspondence between the video tracks Track1~Track4 and Track11~Track14 can be known.
  • from the track reference type 'vdpi', it can be known that the video auxiliary information contained in Track11~Track14 is depth information, and that the decoding of the depth information does not depend on the corresponding texture information.
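The track-reference relationship described above can be sketched as follows. The 'vdpi' reference type is taken from the description; the in-memory track layout and helper function are illustrative assumptions, not the normative parsing of a TrackReferenceTypeBox.

```python
# Sketch of resolving which auxiliary (depth) track assists which texture
# track via track references. The 'vdpi' reference type comes from the
# description above (depth information whose decoding does not depend on
# the texture information); the track/record layout is an assumption.

def map_aux_to_texture(tracks: dict) -> dict:
    """Map auxiliary track id -> referenced texture track id for 'vdpi' refs."""
    mapping = {}
    for track_id, track in tracks.items():
        for ref_type, ref_ids in track.get("track_references", {}).items():
            if ref_type == "vdpi":
                for texture_id in ref_ids:
                    mapping[track_id] = texture_id
    return mapping

tracks = {
    "Track11": {"track_references": {"vdpi": ["Track1"]}},
    "Track12": {"track_references": {"vdpi": ["Track2"]}},
    "Track1": {},  # texture track: carries no references itself
}
```

With this mapping the client can fetch, for a chosen camera, the texture track plus exactly the depth track that references it.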
  • Track1~Track4 all contain track group data boxes (AvsFreeViewGroupBox), and track_group_id is 100, indicating that Track1~Track4 are K tracks of the same free-view video containing data shot by different cameras.
  • server S1 provides a download service, making the media file F0 of the free-view video available for users to download without any further processing.
  • server S2 supports streaming transmission: it slices the media file F0 of the free-view video into media segments suitable for streaming transmission (in accordance with the specifications of the existing transmission standards), generates the corresponding transmission signaling, and delivers the signaling description file to the user.
  • the signaling description file of the transmission signaling indicates the following information:
  • VideoAuxiliaryInfo@withDepthInfo = 0, indicating that the corresponding video resource contains only texture information;
  • AvsFreeViewCamInfo@sourceId = 100, indicating that different video resources belong to the same video content;
  • AvsFreeViewCamInfo@cameraId indicates the camera identifier, and AvsFreeViewCam@camera_pos_x/y/z indicate the camera position information;
  • VideoAuxiliaryInfo@onlyDepthInfo = 1, indicating that the corresponding video resource contains only depth information;
  • the media resources corresponding to Track1~Track4 can be associated with the media resources corresponding to Track11~Track14.
  • Client C1 downloads the complete file F0 and plays it locally.
  • the client can only partially decode the texture information and depth information corresponding to a specific camera based on the user's viewing operation.
  • the client can also decode the texture information and depth information in stages to optimize the performance of client playback.
  • Client C2 establishes streaming transmission with server S2; according to the indications about the texture information, depth information, codec dependency information and camera information in the transmission signaling, combined with the user's viewing operation, the client requests the server to transmit only the video resources containing the texture information and depth information corresponding to a specific camera, and after receiving the corresponding video resources, the texture information and depth information can be decoded and consumed in stages.
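The client-side flow of this scenario can be sketched as follows. All names below (`select_resources`, the resource records, the decode-order helper) are illustrative assumptions rather than a real streaming API; the staged order is permitted here because, per the 'vdpi' indication, depth decoding does not depend on texture.

```python
# Sketch of the client-side flow: pick the texture and depth resources for
# the camera the user is viewing, then decode them in stages (allowed here
# because depth does not depend on texture). Names are illustrative only.

def select_resources(camera_id: str, resources: list) -> list:
    """Keep only the resources (texture + depth) tied to one camera."""
    return [r for r in resources if r["camera_id"] == camera_id]

def staged_decode_order(selected: list) -> list:
    """Decode texture first, depth in a later stage (no codec dependency)."""
    textures = [r["name"] for r in selected if r["kind"] == "texture"]
    depths = [r["name"] for r in selected if r["kind"] == "depth"]
    return textures + depths

resources = [
    {"name": "tex_cam1", "camera_id": "cam1", "kind": "texture"},
    {"name": "dep_cam1", "camera_id": "cam1", "kind": "depth"},
    {"name": "tex_cam2", "camera_id": "cam2", "kind": "texture"},
]
```

The client would request from the server only the resources returned by `select_resources`, then feed them to the decoder in the order given by `staged_decode_order`.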
  • an immersive media processing device may run on the above-mentioned media production device; the device may be a computer-readable instruction (including program code) running in the media production device.
  • the immersive media processing device can execute the method shown in Figure 2; referring to Figure 7, the immersive media processing device can run the following units:
  • the processing unit 701 is used to obtain the video bit stream of the immersive media, which is captured by multiple cameras; the information collected by any camera through photography includes: video image information and corresponding video auxiliary information;
  • the encapsulating unit 702 is used to encapsulate the video bit stream into M video tracks to obtain a media file of the immersive media; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  • an embodiment of the present application also discloses an immersive media processing device.
  • the immersive media processing device may run on the above-mentioned media consumption device; the device may be a computer-readable instruction (including program code) running in the media consumption device.
  • the immersive media processing device can execute the method shown in Figure 3; referring to Figure 8, the immersive media processing device can run the following units:
  • the acquisition unit 801 is used to acquire the target video track in the media file of the immersive media.
  • the immersive media is captured by multiple cameras.
  • the information collected by any camera through photography includes: video image information and corresponding video auxiliary information; and
  • the media file is obtained by encapsulating the video bit stream of the immersive media into M video tracks; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;
  • the decoding unit 802 is used to decode the information in the target video track.
  • the media file is produced and encapsulated by a media production device; after the video bit stream is encapsulated into M video tracks, the video resources corresponding to each video track are obtained, and the media file includes the video resources corresponding to the M video tracks;
  • when used to obtain the target video track in the media file of the immersive media, the acquisition unit 801 can be specifically configured to:
  • the signaling description file includes at least one video auxiliary information descriptor, and the video auxiliary information descriptor is used to describe the relevant attributes of the video auxiliary information of the video resource;
  • request the media production device to deliver the target video resource; and decapsulate the target video resource to obtain the target video track.
  • each unit in the immersive media processing device shown in Figure 7 or Figure 8 can be separately or wholly combined into one or several other units, or one (or some) of the units can be further divided into multiple functionally smaller units; this can achieve the same operation without affecting the realization of the technical effects of the embodiments of the present application.
  • the above units are divided based on logical functions.
  • the function of one unit can also be realized by multiple units, or the functions of multiple units can be realized by one unit.
  • the immersive media processing device may also include other units; in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented by multiple units in cooperation.
  • according to another embodiment of the present application, the immersive media processing device shown in Figure 7 or Figure 8 can be constructed, and the immersive media processing method of the embodiments of the present application can be implemented, by running computer-readable instructions capable of executing each step of the corresponding method shown in Figure 2 or Figure 3 on a general-purpose computing device, such as a computer, that includes a central processing unit (CPU), a random access memory (RAM), a read-only memory (ROM) and other processing elements and storage elements.
  • the computer-readable instructions can be recorded on, for example, a computer-readable storage medium, loaded into the above-mentioned computing device through the computer-readable storage medium, and run therein.
  • the embodiment of the present application can obtain the video bit stream of the immersive media and encapsulate the video bit stream into M video tracks to obtain the media file of the immersive media; in this process, metadata information can be encapsulated in the video media header data box of at least one video track to provide a unified indication of the information in the corresponding video track. This organizes the media content of the immersive media in a more optimized way and avoids the media consumption device repeatedly parsing, during the decapsulation process, the relevant fields in other data boxes used to indicate the information in the video track, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency.
  • the media processing device at least includes a processor 901, an input interface 902, an output interface 903, and a computer-readable storage medium 904.
  • the processor 901, input interface 902, output interface 903 and computer-readable storage medium 904 in the media processing device can be connected through a bus or other means.
  • the computer-readable storage medium 904 may be stored in the memory of the media processing device; the computer-readable storage medium 904 is used to store computer-readable instructions, the computer-readable instructions include program instructions, and the processor 901 is used to execute the program instructions stored in the computer-readable storage medium 904.
  • the processor 901 (or CPU, Central Processing Unit) is the computing core and control core of the media processing device; it is suitable for implementing one or more computer-readable instructions, and is specifically suitable for loading and executing one or more computer-readable instructions to implement the corresponding method processes or corresponding functions.
  • the processor 901 described in the embodiments of this application can be used to perform the series of immersive media processing steps shown in Figure 2, specifically including: obtaining the video bit stream of the immersive media, where the immersive media is captured by multiple cameras and the information collected by any camera includes video image information and corresponding video auxiliary information; and encapsulating the video bit stream into M video tracks to obtain the media file of the immersive media; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information, and so on.
  • the processor 901 described in the embodiments of this application can also be used to perform the series of immersive media processing steps shown in Figure 3, specifically including: obtaining the target video track in the media file of the immersive media, where the immersive media is captured by multiple cameras and the information collected by any camera includes video image information and corresponding video auxiliary information, and the media file is obtained by encapsulating the video bit stream of the immersive media into M video tracks; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and decoding the information in the target video track, and so on.
  • Embodiments of the present application also provide a computer-readable storage medium (Memory).
  • the computer-readable storage medium is a memory device in a media processing device and is used to store programs and data. It can be understood that the computer-readable storage medium here may include a built-in storage medium in the media processing device, and of course may also include an extended storage medium supported by the media processing device.
  • the computer-readable storage medium provides storage space that stores an operating system of the media processing device. Furthermore, the storage space also stores one or more computer-readable instructions suitable for being loaded and executed by the processor. These computer-readable instructions may be one or more computer programs (including program codes).
  • the computer-readable storage medium here can be a high-speed RAM memory, or a non-volatile memory such as at least one disk memory; optionally, it can also be at least one computer-readable storage medium located remotely from the aforementioned processor.
  • one or more computer-readable instructions stored in the computer-readable storage medium can be loaded and executed by the processor to implement the corresponding steps of the immersive media processing method embodiment shown in Figure 2.
  • one or more computer-readable instructions stored in the computer-readable storage medium may be loaded and executed by the processor to implement the corresponding steps of the immersive media processing method embodiment shown in Figure 3.
  • the embodiment of the present application can obtain the video bit stream of the immersive media and encapsulate the video bit stream into M video tracks to obtain the media file of the immersive media; in this process, metadata information can be encapsulated in the video media header data box of at least one video track to provide a unified indication of the information in the corresponding video track. This organizes the media content of the immersive media in a more optimized way and avoids the media consumption device repeatedly parsing, during the decapsulation process, the relevant fields in other data boxes used to indicate the information in the video track, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency.
  • a computer program product or computer program includes computer readable instructions, and the computer readable instructions are stored in a computer readable storage medium.
  • the processor of the computer device reads the computer-readable instructions from the computer-readable storage medium, and the processor executes the computer-readable instructions, so that the computer device performs the method provided in the various optional implementations of the method embodiment shown in Figure 2 or Figure 3.

Abstract

An immersive media processing method, comprising: obtaining a video bitstream of an immersive media (S201); and encapsulating the video bitstream into M video tracks to obtain a media file of the immersive media (S202), wherein among the M video tracks, a video media header data box in at least one video track comprises metadata information, and the metadata information is used for indicating whether the corresponding video track comprises video image information and corresponding video auxiliary information at the same time.

Description

Immersive media processing method, apparatus, device and storage medium

This application claims priority to the Chinese patent application submitted to the China Patent Office on April 26, 2022, with application number 202210445405.0 and the application name "Immersive media processing method, apparatus, device and storage medium", the entire content of which is incorporated by reference into this application.

Technical Field

This application relates to the field of computer technology, and in particular to an immersive media processing method and apparatus, a computer device, and a computer-readable storage medium.

Background

Although traditional encapsulation technology, after encapsulating immersive media into one or more video tracks, can indicate whether each video track contains video image information and video auxiliary information, it has been found in practice that traditional encapsulation technology suffers from repeated indication of the information in the video tracks.

This causes the media consumption device to repeatedly parse the corresponding fields during the decapsulation process, wasting the processing resources of the media consumption device, and also leads to low decapsulation efficiency and, in turn, low decoding efficiency.

Summary of the Invention
An immersive media processing method is provided, executed by a media processing device, the method including:

obtaining a video bit stream of an immersive media, the immersive media being captured by multiple cameras, where the information collected by any camera through shooting includes: video image information and corresponding video auxiliary information;

encapsulating the video bit stream into M video tracks to obtain a media file of the immersive media; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.

An immersive media processing apparatus is provided, the apparatus including:

a processing unit, configured to obtain a video bit stream of an immersive media, the immersive media being captured by multiple cameras, where the information collected by any camera through shooting includes: video image information and corresponding video auxiliary information;

an encapsulating unit, configured to encapsulate the video bit stream into M video tracks to obtain a media file of the immersive media; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.

An immersive media processing method is provided, executed by a media processing device, the method including:

obtaining a target video track in a media file of an immersive media, the immersive media being captured by multiple cameras, where the information collected by any camera includes: video image information and corresponding video auxiliary information; and the media file being obtained by encapsulating the video bit stream of the immersive media into M video tracks; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;

decoding the information in the target video track.

An immersive media processing apparatus is provided, the apparatus including:

an acquisition unit, configured to obtain a target video track in a media file of an immersive media, the immersive media being captured by multiple cameras, where the information collected by any camera through shooting includes: video image information and corresponding video auxiliary information; and the media file being obtained by encapsulating the video bit stream of the immersive media into M video tracks; where M is a positive integer, and among the M video tracks, the video media header data box in at least one video track includes metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;

a decoding unit, configured to decode the information in the target video track.
A media processing device is provided. The media processing device includes an input interface and an output interface, and further includes:

a processor adapted to implement one or more computer-readable instructions; and a computer-readable storage medium storing one or more computer-readable instructions;

wherein the one or more computer-readable instructions are adapted to be loaded by the processor to perform the following steps:

acquiring a video bitstream of immersive media, the immersive media being captured by multiple cameras, where the information collected by any one camera includes video image information and corresponding video auxiliary information;

encapsulating the video bitstream into M video tracks to obtain a media file of the immersive media, where M is a positive integer, and among the M video tracks there is at least one video track whose video media header data box includes metadata information, the metadata information indicating whether the corresponding video track contains both video image information and the corresponding video auxiliary information.

Alternatively, the one or more computer-readable instructions are adapted to be loaded by the processor to perform the following steps:

acquiring a target video track from a media file of immersive media, the immersive media being captured by multiple cameras, where the information collected by any one camera includes video image information and corresponding video auxiliary information; the media file is obtained by encapsulating the video bitstream of the immersive media into M video tracks, where M is a positive integer, and among the M video tracks there is at least one video track whose video media header data box includes metadata information, the metadata information indicating whether the corresponding video track contains both video image information and the corresponding video auxiliary information;

decoding the information in the target video track.
A computer-readable storage medium is provided, storing one or more computer-readable instructions, the one or more computer-readable instructions being adapted to be loaded by a processor to perform any of the immersive media processing methods mentioned above.

A computer program product is provided, including computer-readable instructions; when the computer-readable instructions are executed by a processor, any of the immersive media processing methods mentioned above is implemented.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the accompanying drawings in the following description show merely some embodiments of this application, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
Figure 1a is a schematic diagram of 3DoF provided by an embodiment of this application;

Figure 1b is a schematic diagram of 3DoF+ provided by an embodiment of this application;

Figure 1c is a schematic diagram of 6DoF provided by an embodiment of this application;

Figure 1d is a schematic diagram of free-view video data provided by an embodiment of this application;

Figure 1e is a schematic diagram of an immersive media processing procedure provided by an embodiment of this application;

Figure 2 is a schematic flowchart of an immersive media processing method provided by an embodiment of this application;

Figure 3 is a schematic flowchart of an immersive media processing method provided by another embodiment of this application;

Figure 4a is a schematic diagram of a single-track encapsulation mode for immersive media provided by an embodiment of this application;

Figure 4b is a schematic diagram of transmission of a media file of immersive media provided by an embodiment of this application;

Figure 5a is a schematic diagram of a multi-track encapsulation mode for immersive media provided by an embodiment of this application;

Figure 5b is another schematic diagram of transmission of a media file of immersive media provided by an embodiment of this application;

Figure 6a is a schematic diagram of a multi-track encapsulation mode for immersive media provided by another embodiment of this application;

Figure 6b is a schematic diagram of transmission of a media file of immersive media provided by another embodiment of this application;

Figure 7 is a schematic structural diagram of an immersive media processing apparatus provided by an embodiment of this application;

Figure 8 is a schematic structural diagram of an immersive media processing apparatus provided by another embodiment of this application;

Figure 9 is a schematic structural diagram of a media processing device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings in the embodiments of this application.
In the embodiments of this application, immersive media refers to media content that provides consumers with an immersive experience, so that a consumer immersed in the media content obtains real-world visual, auditory, and other sensory experiences. According to the degree of freedom (DoF) available to the consumer when consuming the media content, immersive media can be divided into 3DoF media, 3DoF+ media, and 6DoF media: 3DoF media supports users consuming the corresponding media content with 3DoF (i.e., three degrees of freedom), 3DoF+ media supports consumption with 3DoF+, and 6DoF media supports consumption with 6DoF. Here, 3DoF refers to the three degrees of freedom with which the user's head can rotate around the X, Y, and Z axes while the user is fixed at the center point of a three-dimensional space, as shown in Figure 1a. 3DoF+ means that, on the basis of the three rotational degrees of freedom, the user's head can additionally make limited movements along the X, Y, and Z axes, where limited movement refers to translational movement within a confined space, as shown in Figure 1b. 6DoF means that, on the basis of the three rotational degrees of freedom, the consumer can additionally move freely along the X, Y, and Z axes, where free movement refers to unrestricted translational movement, as shown in Figure 1c. It should be noted that the degrees of freedom mentioned in the embodiments of this application can be understood as the degrees of freedom with which user movement and content interaction are supported while the user watches the immersive media.
Immersive media that is captured by multiple cameras, contains different viewing angles, and supports 3DoF+ or 6DoF user interaction can be called free-view video; that is, free-view video is a common form of 3DoF+ media and 6DoF media, and may also be called multi-view video, multi-viewpoint video, or free-viewpoint video. Specifically, a camera array composed of multiple cameras can photograph the same three-dimensional scene from multiple angles to form the texture information of the scene (also called texture map information or texture maps, for example color information) and the depth information (also called depth map information or depth maps, for example spatial distance information). The free-view video consumed on the user side (i.e., 6DoF media or 3DoF+ media) can then be composed by combining the texture information and depth information from different cameras according to the user's position information. Since free-view video is a combination of videos of different spatial viewing angles collected by multiple cameras, to facilitate the expression, storage, compression, and processing of free-view video, the embodiments of this application can express free-view video data as a combination of the following information: the texture information collected by the multiple cameras, the depth information corresponding to that texture information, and the corresponding content description metadata of the free-view video (referred to simply as metadata), as shown in Figure 1d. The metadata can include the parameters of the multiple cameras, as well as description information such as the stitching layout and edge protection of the free-view video.
As shown in Figure 1e, the processing of immersive media (such as free-view video) mainly includes: video acquisition, video encoding, video file encapsulation, video transmission, video file decapsulation, video decoding, and video presentation. Specifically:
(1) Video acquisition is used to convert an analog video signal into digital video and save it in the format of a digital video file; in other words, video acquisition converts a video signal into binary digital information. Since the binary digital information converted from the video signal is a binary data stream, it can also be called the code stream or bitstream of the video signal.
(2) Video encoding refers to the process of converting a file in an original video format into a file in another video format through compression techniques. Video encoding can be implemented with standardized video compression algorithms, such as AVS (Audio Video Coding Standard, the Chinese national digital audio/video codec standard) encoding technology, AVS3 (the third-generation audio/video codec standard issued by the AVS working group) encoding technology, HEVC (High Efficiency Video Coding, the international video coding standard HEVC/H.265) encoding technology, or VVC (Versatile Video Coding, the international video coding standard VVC/H.266) encoding technology, or with specific compression algorithms designed for particular content.
(3) Video file encapsulation refers to storing an encoded and compressed media bitstream (such as a video bitstream) in a file in a certain format according to an encapsulation format (also called a container or file container). It should be noted that the embodiments of this application mainly take the case where the media bitstream is a video bitstream as an example; if the immersive media includes not only video frames but also one or more other kinds of media frames such as audio frames and subtitle frames, the media bitstream further includes at least one of an audio bitstream and a subtitle bitstream, and the implementation in that case follows the embodiments of this application analogously. Common encapsulation formats include the AVI format (Audio Video Interleaved) and ISOBMFF (ISO Based Media File Format, the media file format based on ISO (International Organization for Standardization) standards). ISOBMFF is the encapsulation standard for media files; the most typical ISOBMFF file is the MP4 (Moving Picture Experts Group 4) file. By encapsulating the encoded media stream into a file, the media stream can be stored on a device in a structure that is easier to parse; the information added during video file encapsulation usually helps improve the storage, association, and transmission efficiency of the video file on devices and intermediate transmission nodes, as well as the decoding efficiency of the video file on the video consumption terminal.
Specifically, a media sample (referred to simply as a sample) can serve as the encapsulation unit in the file encapsulation process; that is, the compressed and encoded media stream is encapsulated into one or more media tracks to obtain the corresponding media file (i.e., video file). A media file can therefore consist of multiple media tracks; for example, a media file may contain one or more video tracks, one or more audio tracks, and one or more subtitle tracks. A video track is a media track used to encapsulate a video bitstream, an audio track is a media track used to encapsulate an audio bitstream, and a subtitle track is a media track used to encapsulate a subtitle bitstream. A media track (referred to simply as a track) is a collection of media data in the media file encapsulation process; a media track can include one or many samples, and during encapsulation one or more media frames can be taken as one sample. For example, a sample in a video track is usually a video frame, a sample in an audio track is usually an audio frame, and a sample in a subtitle track is usually a subtitle frame.
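The track/sample relationship described above can be pictured with a minimal in-memory model. This is only a conceptual sketch; the class names and fields are illustrative and are not part of ISOBMFF:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sample:
    """One encapsulation unit, e.g. one encoded video frame."""
    data: bytes

@dataclass
class Track:
    """A collection of samples of one media type (video/audio/subtitle)."""
    handler: str                 # e.g. 'vide', 'soun', 'text'
    samples: List[Sample] = field(default_factory=list)

@dataclass
class MediaFile:
    """A media file composed of one or more media tracks."""
    tracks: List[Track] = field(default_factory=list)

# Encapsulate a video bitstream (here, three placeholder frames, one
# frame per sample) into a single video track of a media file.
frames = [b'frame0', b'frame1', b'frame2']
video_track = Track(handler='vide', samples=[Sample(f) for f in frames])
media_file = MediaFile(tracks=[video_track])
```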
(4) Video file transmission means that, after the media production device obtains the media file through the above steps, it transmits the media file to the media consumption device. It should be noted that the concept of a media track is not involved at the transmission level; at the transmission level, the media file can be regarded as comprising the media resources corresponding to each media track, where the media resource corresponding to a media track is what is obtained after encapsulating the corresponding media bitstream into that track. For example, during encapsulation the video bitstream is encapsulated into one or more video tracks to obtain the media file; at the encapsulation level, what is obtained are video tracks in which the corresponding information is encapsulated, so the media file can be regarded as including those one or more video tracks. At the transmission level, however, what is obtained after encapsulating the video bitstream into one or more video tracks are the video resources corresponding to the individual video tracks, so the media file can be regarded as including the video resources corresponding to each video track.
It should be noted that the transmission of media files can be based on various transmission protocols, including but not limited to: the DASH (Dynamic Adaptive Streaming over HTTP) protocol, the HLS (HTTP Live Streaming) protocol, SMTP (Smart Media Transport Protocol), and TCP (Transmission Control Protocol). DASH is an adaptive bitrate streaming technique that enables high-quality streaming media to be delivered over the Internet through conventional HTTP web servers; in DASH, media segment information is described by the MPD (Media Presentation Description, the media presentation description signaling in DASH). DASH also involves the concept of a Representation: a Representation is a combination of one or more media components (media segments) and can be understood as the video resource corresponding to one track. For example, after a video bitstream of a certain resolution is encapsulated into a video track to obtain a video file (i.e., a media file) of that resolution, that video file includes one video resource and can be regarded as one Representation. Furthermore, DASH involves the concept of an Adaptation Set: an Adaptation Set is a collection of one or more video streams, and one Adaptation Set can contain multiple Representations.
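As an illustration of these DASH concepts, the following sketch builds a minimal MPD with one Adaptation Set containing two Representations, one per encapsulated video track. The Representation ids, bandwidth values, and segment URLs are invented for the example and do not come from the application:

```python
import xml.etree.ElementTree as ET

# One Period -> one AdaptationSet -> two Representations (the same
# content at two bitrates, each backed by one encapsulated video track).
mpd = ET.Element('MPD', xmlns='urn:mpeg:dash:schema:mpd:2011', type='static')
period = ET.SubElement(mpd, 'Period')
aset = ET.SubElement(period, 'AdaptationSet', mimeType='video/mp4')
for rep_id, bandwidth in [('720p', '3000000'), ('1080p', '6000000')]:
    rep = ET.SubElement(aset, 'Representation', id=rep_id, bandwidth=bandwidth)
    ET.SubElement(rep, 'BaseURL').text = f'{rep_id}/stream.mp4'

mpd_text = ET.tostring(mpd, encoding='unicode')
```

A DASH client would pick one Representation from the Adaptation Set based on measured throughput, which is what makes the bitrate adaptive.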
(5) Video file decapsulation means that, after receiving the media file, the media consumption device decapsulates it. The decapsulation process on the media consumption device is the inverse of the encapsulation process performed by the media production device mentioned above; the media consumption device decapsulates the media file according to the file format requirements used at encapsulation time, to obtain the corresponding media bitstream (such as the video bitstream).
(6) Video decoding refers to decoding the media bitstream to restore the video. It should be understood that the decoding process on the media consumption device is the inverse of the encoding process performed by the media production device mentioned above.
In the embodiments of this application, a media production device refers to a device used by an immersive media provider (for example, a content producer of immersive media); it can be a terminal device such as a smartphone, tablet computer, laptop, desktop computer, smart speaker, smartwatch, or smart in-vehicle device. Alternatively, the media production device can be a server; the server here can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. A media consumption device refers to a device used by a user of immersive media (for example, a user who consumes the media content of immersive media); it can be a terminal device such as a personal computer, smartphone, VR (Virtual Reality) device (for example a VR headset or VR glasses), smartwatch, or smart in-vehicle device. Further, the media consumption device can install and run a client with media consumption capabilities, so that media is consumed through that client.
Based on the description of immersive media in Figure 1e above, when the immersive media is free-view video, after the free-view video is acquired, the media production device needs to compress and encode it; the video compression algorithms involved in the compression and encoding process can be implemented with AVS3 encoding technology, HEVC encoding technology, and so on. After the free-view video is compressed and encoded, the encoded data stream (such as the video bitstream) needs to be encapsulated to obtain a media file, so that the media file can be transmitted to the media consumption device. Correspondingly, on the media consumption device side, inverse operations such as decapsulation and decoding of the media file are required before the video content (i.e., media content) of the free-view video can finally be presented.
Further, for the media production device, the production of the video content of the free-view video roughly includes the following steps: ① acquisition through a camera array, to obtain the texture map information and depth map information collected by the multiple cameras in the array; ② stitching of the texture map information collected by the multiple cameras with the corresponding depth map information (this step is optional); ③ encoding of the depth map information and texture map information collected by the multiple cameras using planar video compression, to obtain the encoded video bitstream; ④ encapsulation of the encoded video bitstream and the metadata into the media file (i.e., video file) of the free-view video. For the media consumption device, it needs to reconstruct the free-view video to a virtual viewpoint (also called a virtual viewing angle) according to the media file of the free-view video, so as to render and present to the user the video content corresponding to that virtual viewpoint. Specifically, when reconstructing the free-view video to a virtual viewpoint, the media consumption device can perform the following steps: obtain the media file of the free-view video, and, according to the target viewing angle and the metadata in the media file, decode from the media file the depth map information and texture map information collected by one or more cameras; then process the decoded free-view video data and, according to a free-view video algorithm (the application's own algorithm can be used), combine the depth map information and texture map information to synthesize the virtual viewpoint corresponding to the target viewing angle to which the user has switched.
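The four production-side steps can be sketched as a pipeline of stub functions. The function names and the trivial byte-level "processing" are purely illustrative; real acquisition, stitching, and planar video compression are of course far more involved:

```python
def acquire(num_cameras: int):
    """Step 1: each camera yields a (texture map, depth map) pair."""
    return [(f'tex{i}'.encode(), f'dep{i}'.encode())
            for i in range(num_cameras)]

def stitch(views):
    """Step 2 (optional): stitch per-camera maps into one large frame."""
    return b''.join(tex + dep for tex, dep in views)

def encode(frame: bytes) -> bytes:
    """Step 3: planar video compression (stubbed as identity here)."""
    return frame

def encapsulate(bitstream: bytes, metadata: dict) -> dict:
    """Step 4: wrap the encoded bitstream and metadata into a media file."""
    return {'metadata': metadata, 'bitstream': bitstream}

media_file = encapsulate(encode(stitch(acquire(4))),
                         metadata={'num_cameras': 4})
```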
Conventional video codec standards define the texture map information structure and the depth map information structure involved in free-view video data, as follows:
(1) Texture map information structure:

The texture map information structure is used to indicate the coding metadata information of the texture map corresponding to an original camera (i.e., a camera) involved in the free-view video data. Specifically, the syntax of the texture map information structure (TextureInfostruct) can be as follows:
Figure PCTCN2022137788-appb-000001 (the TextureInfostruct syntax is embedded as an image in the original filing)
The semantics of the syntax of the above texture map information structure are as follows: aligned(8) indicates 8-bit alignment; texture_padding_size indicates the guard band width of the texture map; texture_top_left_x indicates the x coordinate of the top-left corner, in the video frame plane, of the texture map corresponding to the respective original camera; texture_top_left_y indicates the y coordinate of the top-left corner of that texture map in the video frame; texture_bottom_right_x indicates the x coordinate of the bottom-right corner of that texture map in the video frame; texture_bottom_right_y indicates the y coordinate of the bottom-right corner of that texture map in the video frame.
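As an illustration of how the corner fields locate one camera's texture map inside the packed video frame, the following sketch computes the size of the described region. The dataclass mirrors the fields listed above; the helper function and its name are ours, not part of the standard:

```python
from dataclasses import dataclass

@dataclass
class TextureInfo:
    """Fields of TextureInfostruct as described in the semantics above."""
    texture_padding_size: int    # guard band width of the texture map
    texture_top_left_x: int      # top-left corner in the video frame plane
    texture_top_left_y: int
    texture_bottom_right_x: int  # bottom-right corner in the video frame
    texture_bottom_right_y: int

def region_size(info: TextureInfo):
    """Width/height of the rectangle the corner coordinates describe."""
    return (info.texture_bottom_right_x - info.texture_top_left_x,
            info.texture_bottom_right_y - info.texture_top_left_y)

# A texture map occupying the top-left 1920x1080 area of the frame.
info = TextureInfo(0, 0, 0, 1920, 1080)
```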
(2) Depth map information structure:

The depth map information structure is used to indicate the coding metadata information of the depth map corresponding to an original camera (i.e., a camera) involved in the free-view video data. Specifically, the syntax of the depth map information structure (DepthInfostruct) can be as follows:
Figure PCTCN2022137788-appb-000002 (the DepthInfostruct syntax is embedded as an image in the original filing)
The semantics of the syntax of the above depth map information structure are as follows: depth_padding_size indicates the guard band width of the depth map; depth_top_left_x indicates the x coordinate of the top-left corner, in the video frame plane, of the depth map corresponding to the respective original camera; depth_top_left_y indicates the y coordinate of the top-left corner of that depth map in the video frame plane; depth_bottom_right_x indicates the x coordinate of the bottom-right corner of that depth map in the video frame plane; depth_bottom_right_y indicates the y coordinate of the bottom-right corner of that depth map in the video frame plane; depth_downsample_factor indicates the downsampling factor of the depth map: the actual width and height of the depth map are 1/2^depth_downsample_factor of the width and height of the camera's acquisition resolution.
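The effect of depth_downsample_factor on the stored depth map resolution follows directly from the semantics above: each dimension is scaled by 1/2^factor. A small sketch (the helper name is ours):

```python
def depth_map_resolution(capture_w: int, capture_h: int,
                         depth_downsample_factor: int):
    """Actual depth map size: capture resolution scaled by 1/2**factor
    in each dimension, per the depth_downsample_factor semantics."""
    scale = 2 ** depth_downsample_factor
    return capture_w // scale, capture_h // scale

# A camera capturing at 1920x1080 with factor 1 stores a 960x540 depth map;
# factor 0 means the depth map keeps the full acquisition resolution.
half_res = depth_map_resolution(1920, 1080, 1)
full_res = depth_map_resolution(1920, 1080, 0)
```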
In addition, existing video codec standards stipulate that a free-view video track shall be represented in the media file as restricted video, and that scheme_type (the scheme type) of the SchemeTypeBox (scheme type data box) in the RestrictedSchemeInfoBox (restricted scheme information data box) under the sample entry of this type of track shall be set to 'as3f'; the encoding of free-view video, however, places no restriction on the video codec. Besides this, existing video codec standards define two encapsulation modes for free-view video: single-track encapsulation and multi-track encapsulation:
(1) Single-track encapsulation of free-view video:

Single-track encapsulation of free-view video means that the free-view video code stream (i.e., the video bitstream of the free-view video) is represented by a single track. With single-track encapsulation, the data of the free-view video is encapsulated directly into the media track container, and the data structure is kept in the code stream as-is. A single-track-encapsulated free-view video track uses one VisualSampleEntry (sample entry), as defined in ISO/IEC 14496-12. The sample entry can contain one AvsFreeViewInfoBox (free-view information data box) and one PayloadInfoBox (payload information data box). The AvsFreeViewInfoBox indicates the camera information corresponding to the viewing angles in the free-view video track, and the PayloadInfoBox indicates the texture map and depth map information corresponding to the cameras.
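The box nesting just described can be pictured with a small nested-dictionary model. Box names are taken from the text above; the inner fields ('cameras', 'camera_id', 'payloads', 'type') are invented placeholders, and the layout is a conceptual sketch rather than byte-accurate ISOBMFF:

```python
# Sample entry of a single-track free-view video track: one
# VisualSampleEntry holding camera info (AvsFreeViewInfoBox) and
# per-camera texture/depth payload info (PayloadInfoBox).
sample_entry = {
    'VisualSampleEntry': {
        'AvsFreeViewInfoBox': {
            'cameras': [{'camera_id': 0}, {'camera_id': 1}],
        },
        'PayloadInfoBox': {
            'payloads': [{'camera_id': 0, 'type': 'texture+depth'},
                         {'camera_id': 1, 'type': 'texture+depth'}],
        },
    },
}

entry = sample_entry['VisualSampleEntry']
camera_ids = [c['camera_id'] for c in entry['AvsFreeViewInfoBox']['cameras']]
```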
(2) Multi-track encapsulation of free-view video:

Multi-track encapsulation of free-view video means that the free-view video code stream is represented by multiple media tracks (such as video tracks). Multi-track encapsulation may encapsulate the texture map and depth map collected by the same camera into one track, or into different tracks; this is not limited. On this basis, when the texture maps and depth maps are encapsulated into different tracks, the multi-track encapsulation of the free-view video can contain two types of media tracks: texture map tracks (video tracks encapsulating texture maps, i.e., texture information) and depth map tracks (video tracks encapsulating depth maps, i.e., depth information). In this case, the texture map track and the depth map track corresponding to the same camera need to be associated; specifically, the TrackReferenceBox (track reference data box) defined in ISO/IEC 14496-12 can be used to reference, or associate, a depth map track to the texture map track corresponding to the same camera, with reference_type (the reference type field) taking the value 'vdep'. Thus, the multi-track encapsulation mode maps the free-view video into the containers of the respective media tracks according to data type; tracks belonging to the same camera are associated with each other and are time-aligned.
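The 'vdep' association can be sketched as follows: each depth map track carries a track reference of type 'vdep' pointing at the track ID of its camera's texture map track. The track objects here are illustrative stand-ins for real ISOBMFF boxes:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class VideoTrack:
    track_id: int
    kind: str                               # 'texture' or 'depth'
    camera_id: int
    # TrackReferenceBox content: reference_type -> referenced track IDs
    track_references: Dict[str, List[int]] = field(default_factory=dict)

# Texture and depth tracks of the same camera, linked via 'vdep'.
texture = VideoTrack(track_id=1, kind='texture', camera_id=0)
depth = VideoTrack(track_id=2, kind='depth', camera_id=0,
                   track_references={'vdep': [texture.track_id]})

def texture_tracks_for(depth_track, tracks):
    """Resolve a depth track's 'vdep' reference to its texture track(s)."""
    ref_ids = depth_track.track_references.get('vdep', [])
    return [t for t in tracks if t.track_id in ref_ids]
```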
Both the texture-map tracks and depth-map tracks of a multi-track encapsulated free-view video use a VisualSampleEntry, as defined in ISO/IEC 14496-12. The sample entries of the texture-map and depth-map tracks contain an AvsFreeViewInfoBox; the camera parameter information described in the AvsFreeViewInfoBox shall not appear repeatedly in a texture-map track and its corresponding depth-map track. When a texture-map track contains texture maps of multiple cameras, its sample entry contains a PayloadInfoBox describing the texture-map information; when a depth-map track contains depth maps of multiple cameras, its sample entry contains a PayloadInfoBox describing the depth-map information. In addition, each sample in a texture-map or depth-map track corresponds to the texture maps or depth maps of one or more cameras. The camera information corresponding to the samples in a track is described by the AvsFreeViewInfoBox in the VisualSampleEntry. When each sample contains texture maps or depth maps of multiple cameras, each sample may contain multiple subsamples, each subsample corresponding to the texture map or depth map of one camera and described using a SubSampleInformationBox (subsample information data box). Specifically, when the flags field in the SubSampleInformationBox syntax of the existing video coding standards takes the value 0, the subsample may be described through the codec_specific_parameters field.
Furthermore, if a free-view video is encapsulated into multiple video tracks, and a single video track contains the texture information and/or depth information of only some of the cameras, these video tracks shall be associated through a free-view track group; that is, each of these video tracks shall contain an AvsFreeViewGroupBox (free-view track group data box). The AvsFreeViewGroupBox may include a track group identifier field (track_group_id); different video tracks that each contain some of the cameras and correspond to the same free-view video shall have the same track_group_id value.
The above-mentioned free-view information data box, payload information data box, free-view track group data box, and other data boxes are further described below:
① Free-view information data box (AvsFreeViewInfoBox):
The free-view information data box indicates the camera metadata information (i.e., the metadata information of the cameras) corresponding to one or more free views contained in the corresponding track. Specifically, the syntax of the free-view information data box may be as follows:
Box Type: 'afvi'
Container: SchemeInformationBox or SampleEntry
Mandatory: No
Quantity: Zero or one
[Syntax figure PCTCN2022137788-appb-000003 not reproduced in text]
The semantics of the syntax of the free-view information data box are as follows: texture_in_track indicates whether the corresponding track contains texture maps; a value of 1 indicates that the corresponding track contains texture maps, and a value of 0 indicates that it does not. depth_in_track indicates whether the corresponding track contains depth maps; a value of 1 indicates that the corresponding track contains depth maps, and a value of 0 indicates that it does not. camera_count indicates the number of all cameras capturing the video. IntCameraInfoStruct() indicates the intrinsic parameter information of a camera, and ExtCameraInfoStruct() indicates the extrinsic parameter information of a camera. codec_independency indicates the codec independence between the texture maps and depth maps corresponding to the cameras in the corresponding track; its possible values are shown in Table 1 below:
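The syntax figure for this box is not reproduced in the text; based on the semantics just listed, it might be reconstructed as the following ISOBMFF-style sketch, in which the field widths, the reserved bits, and the FullBox base are assumptions:

```
// Hedged reconstruction of AvsFreeViewInfoBox ('afvi') from its semantics.
aligned(8) class AvsFreeViewInfoBox extends FullBox('afvi', 0, 0) {
    unsigned int(1) texture_in_track;    // 1: track contains texture maps
    unsigned int(1) depth_in_track;      // 1: track contains depth maps
    bit(6)          reserved;
    unsigned int(8) codec_independency;  // see Table 1
    unsigned int(8) camera_count;        // number of capturing cameras
    for (i = 0; i < camera_count; i++) {
        IntCameraInfoStruct();           // intrinsic camera parameters
        ExtCameraInfoStruct();           // extrinsic camera parameters
    }
}
```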
Table 1
[Table 1 figure PCTCN2022137788-appb-000004 not reproduced in text]
② Payload information data box (PayloadInfoBox):
The payload information data box describes the metadata information of the texture maps and depth maps corresponding to one or more cameras in the corresponding track. Specifically, the syntax of the payload information data box may be as follows:
Box Type: 'plin'
Container: SampleEntry
Mandatory: No
Quantity: Zero or one
[Syntax figure PCTCN2022137788-appb-000005 not reproduced in text]
The semantics of the syntax of the payload information data box are as follows: texture_info_flag indicates whether metadata information describing the texture maps is expressed; a value of 1 means it is expressed, and a value of 0 means it is not. depth_info_flag indicates whether metadata information describing the depth maps is expressed; a value of 1 means it is expressed, and a value of 0 means it is not. num_cameras indicates the number of cameras of the media track, and camera_id indicates the identifier of a camera. TextureInfostruct() describes the metadata information of a texture map, and DepthInfostruct() describes the metadata information of a depth map.
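As with the previous box, the syntax figure is not reproduced; a possible reconstruction from the semantics, with assumed field widths, reserved bits, and loop structure, is:

```
// Hedged reconstruction of PayloadInfoBox ('plin') from its semantics.
aligned(8) class PayloadInfoBox extends FullBox('plin', 0, 0) {
    unsigned int(1) texture_info_flag;  // 1: texture-map metadata expressed
    unsigned int(1) depth_info_flag;    // 1: depth-map metadata expressed
    bit(6)          reserved;
    unsigned int(8) num_cameras;        // cameras in this media track
    for (i = 0; i < num_cameras; i++) {
        unsigned int(32) camera_id;     // identifier of the camera
        if (texture_info_flag == 1)
            TextureInfostruct();        // texture-map metadata
        if (depth_info_flag == 1)
            DepthInfostruct();          // depth-map metadata
    }
}
```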
③ Free-view track group data box (AvsFreeViewGroupBox):
The free-view track group data box associates the video tracks of the same free-view video. Specifically, the syntax of the free-view track group data box may be as follows:
[Syntax figure PCTCN2022137788-appb-000006 not reproduced in text]
The semantics of the syntax of the free-view track group data box are as follows: camera_count indicates the number of cameras corresponding to the free-view texture information and depth information contained in the track. camera_id indicates the camera identifier of each camera. depth_texture_type indicates the type of texture information or depth information, captured by the corresponding camera, contained in the track; its values are shown in Table 2 below:
Table 2
depth_texture_type value    Meaning
0                           Reserved
1                           Contains texture information captured by the corresponding camera
2                           Contains depth information captured by the corresponding camera
3                           Contains both texture and depth information captured by the corresponding camera
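Based on the semantics and Table 2 above, the missing syntax figure for this box might be sketched as follows; the 'afvg' grouping type and the field widths are hypothetical, and track_group_id is inherited from the standard TrackGroupTypeBox of ISO/IEC 14496-12:

```
// Hedged reconstruction of AvsFreeViewGroupBox from its semantics.
aligned(8) class AvsFreeViewGroupBox extends TrackGroupTypeBox('afvg') {
    // track_group_id is inherited from TrackGroupTypeBox
    unsigned int(8) camera_count;            // cameras covered by this track
    for (i = 0; i < camera_count; i++) {
        unsigned int(32) camera_id;          // identifier of the camera
        unsigned int(8)  depth_texture_type; // see Table 2
    }
}
```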
As can be seen from the above description, the existing video coding standards suffer from redundant indication of the depth information and texture information contained in each video track: the free-view information data box, the payload information data box, and the free-view track group data box each contain fields indicating whether a track contains texture or depth information. This causes the media consumption device to parse the same information repeatedly during decapsulation, wasting resources. Moreover, although the existing video coding standards indicate the codec independence of free-view video, i.e., whether codec dependencies exist between the texture information and depth information within a video track and between the contents captured by the individual cameras, the current indication method is limited to free-view application scenarios and lacks compatibility and scalability. In addition, there is no signaling-level indication of video auxiliary information.
On this basis, for free-view video applications and other immersive media applications, the embodiments of this application propose a file encapsulation and transmission signaling indication method, which achieves the following: ① the media content of a free-view video is organized in a more optimized manner, so that during decapsulation the media consumption device does not repeatedly parse the fields in other data boxes that indicate what information a video track contains, thereby saving processing resources, improving decapsulation efficiency, and in turn improving decoding efficiency; ② the codec dependencies between the parts of the video content are indicated in a more optimized manner, so that the method is not limited to free-view application scenarios and has better compatibility and scalability; ③ based on the information in ① and ②, corresponding transmission signaling is generated to guide the resource selection of media consumption devices, such as clients, during free-view video consumption.
In specific implementations, the method can be applied in the storage, transmission, encapsulation, and decapsulation of any immersive media (such as free-view video). Taking free-view video as an example, the specific implementation steps of the method are as follows:
Step 1: A media production device (such as a server) generates the video bitstream of a free-view video and encapsulates the video bitstream into a media file. Depending on whether the texture information and depth information of the free-view video are in the same video track, and on whether the free-view video is encapsulated into multiple video tracks, the texture information, depth information, and camera information contained in the video tracks of the free-view video are indicated according to the following rules.
(1) If a video track contains both texture information and depth information, the texture information and depth information must correspond to the same N cameras (N being a positive integer); that is, the texture maps and depth maps captured by N cameras are encapsulated in one video track. In this case, the video media header data box (VideoMediaHeaderBox) in the video track identifies, through metadata information, that the video track carries video auxiliary information (such as depth information).
(2) If a video track contains only texture information or only depth information, then, considering that texture information is a kind of video image information and depth information is a kind of video auxiliary information, a texture-map track (i.e., a video track encapsulating texture information) can be regarded as a video image information track, and a depth-map track can be regarded as a video auxiliary information track. A video image information track, referred to simply as a video track, is a video track that encapsulates video image information; a video auxiliary information track, also called an auxiliary video media track or video auxiliary track, is a video track that encapsulates video auxiliary information. Further, the existing video file encapsulation technology stipulates the following. ① Video media uses the handler_type (handler type field) in the HandlerBox (handler data box), as defined in clause 8.4.3 of the relevant standard. ② The handler type of video media is 'vide', i.e., the corresponding field in the HandlerBox within the MediaBox (media data box) of a video media file takes the value 'vide'; the handler type of auxiliary video media is 'auxv', i.e., the corresponding field in the HandlerBox within the MediaBox of an auxiliary video media file takes the value 'auxv'. ③ An auxiliary video media track (i.e., a video auxiliary information track) and a video track (i.e., a video image information track) are coded in the same way but identified with different handler types, and the auxiliary video media track is usually not used for visual presentation (for example, when the auxiliary video track contains depth information). Therefore, if a video track contains only texture information or only depth information, the existing distinction between video image information tracks and video auxiliary information tracks (distinguished by handler_type), together with the existing association mechanism (the 'vdep' track reference) and the identification information (such as the metadata information) in (1), is sufficient to distinguish whether the information contained in the video track is texture information or depth information.
(3) When a free-view video is encapsulated into multiple video tracks, the free-view track group data box is used to associate the multiple video tracks; further, the free-view track group data box may be used to associate the multiple video tracks that encapsulate video image information (such as texture information). At the same time, the payload information (i.e., the information in the aforementioned payload information data box) is included in the free-view track group data box. Optionally, the type of the free-view track group may be further distinguished according to the classification indicated by the metadata information in (1) (i.e., whether a video track carries both video image information (such as texture information) and video auxiliary information (such as depth information)). For example, the type of the free-view track group may indicate that each video track associated with the free-view track group encapsulates only video image information (such as texture information), or that each video track associated with the free-view track group encapsulates both video image information (such as texture information) and video auxiliary information (such as depth information).
Step 2: On the basis of the file encapsulation structure involved in Step 1 (i.e., whether multi-track encapsulation is used, and whether a single video track contains both texture information and depth information), the media production device (such as a server) indicates codec dependency information in the video tracks. The codec dependency information may include: an indication of the rule by which the data sets within the video (i.e., the video components) are divided, and an indication of whether codec dependencies exist between the different data sets under that rule. In addition, when a single video track contains both texture information and depth information, the codec dependency information may further include an indication of whether a codec dependency exists between the texture information and the depth information in the single video track, and so on.
Step 3: The media production device (such as a server) may provide a download service for complete files; in this case, the media production device need not process the media file of the free-view video, and instead provides the complete media file for download by the media consumption device. Of course, if the media production device also supports streaming, it slices the media file of the free-view video into media segments suitable for streaming (following the specifications of the existing transmission standards), generates the corresponding transmission signaling, and performs Step 4. For example, the transmission signaling mentioned here may include, but is not limited to, DASH signaling or SMT signaling; and the signaling description file of the transmission signaling indicates the following information:
(a) the data types (video data/video auxiliary data) contained in the media resources (i.e., video resources), and the index and association relationships between the corresponding media resources;
(b) the codec dependency information within or between the media resources.
Step 4: The media production device (such as a server) transmits the signaling description file of the transmission signaling to the user on the media consumption device side.
Step 5: The client in the media consumption device may execute the following logic:
(A) Depending on the application form, the client may download the complete file (i.e., the complete media file) and play it locally, and, based on the indications of texture information, depth information, and codec dependency information in the media file, determine whether the media file can be partially decoded at different stages, so as to optimize playback performance on the client.
(B) Depending on the application form, the client may establish streaming with the media production device (such as a server) and, based on the texture information, depth information, and codec dependency information indicated in the signaling description file of the transmission signaling, adaptively select a suitable free-view video stream for consumption.
To support the implementation steps of the above method proposed in the embodiments of this application, the embodiments of this application add several descriptive fields at the system layer on the basis of the existing technology. In the following, taking the extension of the existing ISOBMFF data boxes, DASH signaling, and SMT signaling as examples, the relevant fields are defined to support free-view video applications or other immersive media applications, as follows:
I. ISOBMFF data box extensions
1. The ISOBMFF data boxes may include a video media header data box (VideoMediaHeaderBox). A video track uses a video media header data box, which contains general, codec-independent presentation-related information of the video media. The embodiments of this application may extend the video media header data box to indicate, in the video media header data box, whether a video track also contains depth information or other auxiliary information. Specifically, the extended video media header data box may be as follows:
[Syntax figure PCTCN2022137788-appb-000007 not reproduced in text]
Among these, the three fields version, graphicsmode, and opcolor are original fields of the video media header data box. version indicates the version number of the video media header data box. graphicsmode indicates the image composition mode of the current video track; a value of 0 indicates that the image composition mode of the current video track is the copy mode, and other values may be additionally defined by the corresponding standards. opcolor is an array of values for three colors (red, green, blue), used in conjunction with the image composition mode.
Based on the extended video media header data box above, the embodiments of this application may define in the video media header data box a flag bit, video_with_depth_info_flag (video depth information flag field), to indicate whether the current video track contains both texture information (image information) and the depth information corresponding to the texture information (image information). When this field takes a first value (such as 1), it indicates that the current video track contains both texture information (image information) and the corresponding depth information; when this field takes a second value (such as 0), it indicates that the current video track contains only texture information (image information).
By extension, since texture information is a kind of video image information and depth information is a kind of video auxiliary information, while video image information is not limited to texture information and video auxiliary information is not limited to depth information (it may, for example, also include parallax information), other flag bits may also be defined in the video media header data box to indicate whether the current video track also contains the corresponding auxiliary information. For example, one or more of the following flag bits may be defined:
① video_with_parallax_info_flag (video parallax information flag field) indicates whether the current video track contains both video image information and the parallax information corresponding to the video image information. When this field takes a first value (such as 1), it indicates that the current video track contains both video image information and the corresponding parallax information; when this field takes a second value (such as 0), it indicates that the current video track contains only video image information.
② video_with_auxl_info_flag indicates whether the current video track contains both video image information and the auxiliary information corresponding to the video image information. When this field takes a first value (such as 1), the current video track contains both video image information and the corresponding video auxiliary information; when this field takes a second value (such as 0), the current video track contains only video image information. The video auxiliary information here may be depth information, parallax information, or auxiliary information defined by other applications.
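The extended VideoMediaHeaderBox figure referenced above is not reproduced in the text; combining the standard 'vmhd' fields with the three flag bits just described, it might look as follows (the flag ordering and reserved-bit padding are assumptions):

```
// Hedged reconstruction: VideoMediaHeaderBox extended with the three
// auxiliary-information flag bits described above.
aligned(8) class VideoMediaHeaderBox extends FullBox('vmhd', version = 0, 1) {
    template unsigned int(16)    graphicsmode = 0;  // image composition mode
    template unsigned int(16)[3] opcolor = {0, 0, 0};
    unsigned int(1) video_with_depth_info_flag;     // texture + depth in track
    unsigned int(1) video_with_parallax_info_flag;  // image + parallax in track
    unsigned int(1) video_with_auxl_info_flag;      // image + generic aux info
    bit(5)          reserved;
}
```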
2. The video media header data box may be further extended to indicate, in the video media header data box, whether codec independence exists between the data contents (information) contained in the video track. Specifically, taking the first value as 1 and the second value as 0 as an example, the further extended video media header data box may be as follows:
[Syntax figure PCTCN2022137788-appb-000008 not reproduced in text]
The semantics of the syntax of the above extended video media header data box are roughly as follows:
Where video_with_depth_info_flag is defined, if the value of this field equals 1, i.e., the current video track contains both texture information (image information) and the corresponding depth information, a flag bit depth_codec_dependency (depth codec dependency field) may further be defined in the video media header data box. When depth_codec_dependency takes a third value (such as 1), it indicates that a decoding dependency exists between the texture information (image information) contained in the current video track and the corresponding depth information; when depth_codec_dependency takes a fourth value (such as 0), it indicates that no such decoding dependency exists.
Where video_with_parallax_info_flag is defined, if the value of this field equals 1, i.e., the current video track contains both video image information and the corresponding parallax information, a flag bit parallax_codec_dependency (parallax codec dependency field) may further be defined in the video media header data box. When parallax_codec_dependency takes a third value (such as 1), it indicates that a decoding dependency exists between the video image information contained in the current video track and the corresponding parallax information; when parallax_codec_dependency takes a fourth value (such as 0), it indicates that no such decoding dependency exists.
Where video_with_auxl_info_flag is defined, if the value of this field equals 1, i.e., the current video track contains both video image information and the corresponding video auxiliary information, a flag bit auxl_codec_dependency (auxiliary codec dependency field) may further be defined in the video media header data box. When auxl_codec_dependency takes a third value (such as 1), it indicates that a decoding dependency exists between the video image information contained in the current video track and the corresponding video auxiliary information; when auxl_codec_dependency takes a fourth value (such as 0), it indicates that no such decoding dependency exists.
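Putting the flag bits and the conditional dependency bits together, the further extended box might be sketched as follows (again with assumed field widths and padding):

```
// Hedged reconstruction: each *_codec_dependency bit is present only
// when the corresponding *_info_flag equals 1.
aligned(8) class VideoMediaHeaderBox extends FullBox('vmhd', version = 0, 1) {
    template unsigned int(16)    graphicsmode = 0;
    template unsigned int(16)[3] opcolor = {0, 0, 0};
    unsigned int(1) video_with_depth_info_flag;
    unsigned int(1) video_with_parallax_info_flag;
    unsigned int(1) video_with_auxl_info_flag;
    bit(5)          reserved;
    if (video_with_depth_info_flag == 1)
        unsigned int(1) depth_codec_dependency;    // 1: depth depends on texture
    if (video_with_parallax_info_flag == 1)
        unsigned int(1) parallax_codec_dependency; // 1: parallax depends on image
    if (video_with_auxl_info_flag == 1)
        unsigned int(1) auxl_codec_dependency;     // 1: aux depends on image
}
```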
3. Define a video composition dependency information data box (VideoCompositionDependencyInfoBox), which indicates the composition type of the video content of the immersive media, and whether codec dependencies (i.e., dependencies in encoding and decoding) exist between different compositions of that video content. Specifically, the video composition dependency information data box may be as follows:
Figure PCTCN2022137788-appb-000009
Here, composition_codec_dependency (composition codec dependency field) indicates whether codec dependencies exist between the different compositions contained in the video content of the immersive media. When this field takes the fifth value (e.g., 0), it indicates that no codec dependency exists between the different compositions of the video content; when it takes the sixth value (e.g., 1), it indicates that codec dependencies exist between them.
composition_type (composition type field) indicates the composition type of the video content of the immersive media (i.e., the type of the compositions the video content contains). When this field takes the seventh value (e.g., 0), the composition type is camera-based, indicating that the compositions of the video content are divided by camera: one composition comprises the video content captured by one camera, i.e., the video content of the immersive media comprises video content captured by different cameras. When this field takes the eighth value (e.g., 1), the composition type is binocular, indicating that the compositions are divided by eye: one composition comprises the left-eye or right-eye video content, i.e., the video content of the immersive media comprises left-eye and right-eye video content. It should be understood that the value of composition_type is not limited to the seventh and eighth values; other values are possible and may be extended by the corresponding standards.
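Since the box syntax itself is only available as an image in this text, the following is a hypothetical sketch of how a reader of the box might map the two fields to their meanings (using 0/1 for the example values given above):

```python
def interpret_composition_box(composition_codec_dependency: int,
                              composition_type: int) -> tuple[str, str]:
    """Map the two VideoCompositionDependencyInfoBox fields to their
    described meanings; composition_type values beyond 0/1 are treated
    as reserved for extension by the corresponding standards."""
    dependency = ("codec-dependent" if composition_codec_dependency == 1
                  else "codec-independent")
    kind = {0: "camera-based", 1: "binocular"}.get(composition_type, "extended")
    return dependency, kind
```

For example, a box with composition_codec_dependency = 0 and composition_type = 1 describes independently codable left-eye and right-eye compositions.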
4. Extend the track reference type data box (TrackReferenceTypeBox). When the texture information and depth information of a free-view video are encapsulated into different video tracks, the current technology defines a reference type field (reference_type) that, by storing the track index 'vdep', references or associates a depth map track to the texture map track of the same camera; however, this track index only expresses a dependency between the two at presentation time and does not indicate whether a dependency exists between them at decoding time. The embodiments of this application therefore make the following extensions:
When the reference type field (reference_type) stores the track index 'vdep', it references or associates the depth map track to the texture map track of the corresponding camera, and indicates that the depth map track depends on the corresponding texture map track at decoding time.
When the reference type field (reference_type) stores the track index 'vdpi', it references or associates the depth map track to the texture map track of the corresponding camera, and indicates that the depth map track does not depend on the corresponding texture map track at decoding time.
Correspondingly, a texture map track is one kind of video image information track and a depth map track is one kind of video auxiliary information track; video image information tracks are not limited to texture map tracks, and video auxiliary information tracks are not limited to depth map tracks (for example, a video auxiliary track may also be a parallax information track). The track indexes between video image information tracks and parallax information tracks, and between video image information tracks and video auxiliary information tracks, can therefore also be extended as follows:
When the reference type field (reference_type) stores the track index 'vplx', it references or associates the parallax information track to the corresponding video image information track, and indicates that the parallax information track depends on the corresponding video image information track at decoding time.
When the reference type field (reference_type) stores the track index 'vpli', it references or associates the parallax information track to the corresponding video image information track, and indicates that the parallax information track does not depend on the corresponding video image information track at decoding time.
When the reference type field (reference_type) stores the track index 'auxl', it references or associates the video auxiliary information track to the corresponding video image information track, and indicates that the video auxiliary information track depends on the corresponding video image information track at decoding time.
When the reference type field (reference_type) stores the track index 'auxi', it references or associates the video auxiliary information track to the corresponding video image information track, and indicates that the video auxiliary information track does not depend on the corresponding video image information track at decoding time.
It should be noted that, by way of extension, the decoding dependency indications of the above track index types may be interchanged; for example, the track index 'vdep' could instead indicate codec independence and the track index 'vdpi' codec dependence, and so on.
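The six reference_type values above can be collected into a lookup table. This is a hypothetical sketch assuming the default (non-interchanged) assignment described in the text: each entry records which auxiliary track kind the index attaches to a video image track, and whether a decoding dependency is implied.

```python
# reference_type -> (auxiliary track kind, decode-time dependency on the
# referenced video image / texture track)
TRACK_REFERENCE_SEMANTICS = {
    "vdep": ("depth map", True),
    "vdpi": ("depth map", False),
    "vplx": ("parallax information", True),
    "vpli": ("parallax information", False),
    "auxl": ("video auxiliary information", True),
    "auxi": ("video auxiliary information", False),
}

def must_decode_referenced_track_first(reference_type: str) -> bool:
    """True when the auxiliary track cannot be decoded without first
    decoding the video image track it references."""
    return TRACK_REFERENCE_SEMANTICS[reference_type][1]
```

A player could use such a table to decide whether a referenced texture track must be scheduled for decoding before its auxiliary track, or whether the two can be decoded in parallel.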
5. Extend the free-view track group and the free-view track group data box (AvsFreeViewGroupBox). The existing technology defines the free-view track group as follows: if a free-view video is encapsulated into multiple video tracks and each video track contains only the texture information and/or depth information of some of the cameras, these video tracks shall be associated through a free-view track group, and the different video tracks that contain some of the cameras and correspond to the same free-view video shall have the same track_group_id. The embodiments of this application extend this definition as follows:
The free-view track group associates video tracks that correspond to different cameras and contain texture information, and these different video tracks are codec-independent of one another; moreover, the different video tracks that contain some of the cameras and correspond to the same free-view video shall have the same track_group_id. That is, the embodiments of this application only associate, through the free-view track group, the video tracks in which texture information is encapsulated; since texture information is one kind of video image information, this can be further generalized to: only video tracks in which video image information is encapsulated are associated through the free-view track group. In other words, in the embodiments of this application, only a video track that encapsulates video image information (such as texture information) may include the free-view track group data box. The free-view track group data box (track group data box for short) may include a track group identification field (track_group_id) and a camera identification field (camera_id), where the track group identification field stores the identification information of the track group to which the corresponding video track belongs, and the camera identification field stores the identification information of the camera corresponding to that video track.
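As a minimal sketch of the grouping rule (the dict-based track model and key names here are illustrative assumptions, not the file-format layout), a reader could reassemble the cameras of one free-view video by collecting camera_ids per track_group_id, considering only tracks that carry video image information, since only those tracks contain the group data box:

```python
def group_free_view_tracks(tracks: list[dict]) -> dict[int, list[int]]:
    """Collect camera_ids per track_group_id over the video image
    (e.g. texture) tracks; auxiliary-only tracks carry no group box."""
    groups: dict[int, list[int]] = {}
    for track in tracks:
        if track.get("has_image_info"):
            groups.setdefault(track["track_group_id"], []).append(track["camera_id"])
    return groups
```

All tracks of the same free-view video end up under one key, matching the requirement that they share the same track_group_id.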
II. DASH signaling extension
1. Define a video auxiliary information descriptor (VideoAuxiliaryInfoDescriptor) that describes the attributes of the video auxiliary information of a video resource. The video auxiliary information descriptor is a SupplementalProperty element whose @schemeIdUri attribute is "urn:avs:ims:2018:av3f". The descriptor may exist at the adaptation set level or at the representation level. When it exists at the adaptation set level, it describes the video-auxiliary-information attributes of all representations within the corresponding adaptation set; when it exists at the representation level, it describes those of the corresponding representation. The specific attributes of the video auxiliary information descriptor are shown in Table 3 below:
Table 3
Figure PCTCN2022137788-appb-000010
Figure PCTCN2022137788-appb-000011
In Table 3 above, the terms in the first column are as follows: the full name of "Elements and Attributes" is "Elements and Attributes for VideoAuxiliaryInfoDescriptor", i.e., the elements and attributes of the video auxiliary information descriptor. VideoAuxiliaryInfo is the video auxiliary information field; VideoAuxiliaryInfo@withDepthInfo is the with-depth-information field; VideoAuxiliaryInfo@withParallaxInfo is the with-parallax-information field; VideoAuxiliaryInfo@withAuxiliaryInfo is the with-auxiliary-information field; VideoAuxiliaryInfo@depthCodecDependency is the depth codec dependency field; VideoAuxiliaryInfo@parallaxCodecDependency is the parallax codec dependency field; VideoAuxiliaryInfo@auxilliaryCodecDependency is the auxiliary codec dependency field; VideoAuxiliaryInfo@onlyDepthInfo is the depth-only information flag field; VideoAuxiliaryInfo@onlyParallaxInfo is the parallax-only information flag field; and VideoAuxiliaryInfo@onlyAuxiliaryInfo is the auxiliary-only information flag field. In the second column, "Use" indicates the role of each attribute: O means Optional, CM means Conditional Mandatory, and M means Mandatory. The third column gives the data type; avs:ims:2018:av31 and xs:bool are two different data types. The fourth column, "Description", describes each attribute.
2. Extend the free-view camera descriptor (AvsFreeViewCamInfoDescriptor), which describes the camera information corresponding to a video resource. The free-view camera descriptor is a SupplementalProperty element whose @schemeIdUri attribute is "urn:avs:ims:2018:av31". The descriptor may exist at the adaptation set level or at the representation level. When it exists at the adaptation set level, it describes the camera information corresponding to all representations within the corresponding adaptation set; when it exists at the representation level, it describes the camera information corresponding to that representation.
Specifically, the attributes of the extended free-view camera descriptor are roughly as shown in Table 4 below:
Table 4
Figure PCTCN2022137788-appb-000012
Figure PCTCN2022137788-appb-000013
In Table 4 above, the two fields AvsFreeViewCam@sourceId (free-view camera source identification field) and AvsFreeViewCam@combinationId (free-view camera combination identification field) are the fields that the embodiments of this application add to the free-view camera descriptor; the remaining fields are the descriptor's original fields. The full name of "Elements and Attributes" in the first column is "Elements and Attributes for AvsFreeViewCamInfoDescriptor", i.e., the elements and attributes of the free-view camera descriptor. For the meanings of the terms in the second column, see the explanations given for Table 3; avs:ims:2018:av3l, xs:unsignedInt, and xs:float in the third column are three different data types.
Optionally, when immersive media (such as a free-view video) is encapsulated in multi-track form, the representation (video resource) or adaptation set corresponding to a video track may also be indicated through the preselection tool; that is, the video resources corresponding to different cameras of the same video content shall exist in the same preselection.
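A minimal sketch of emitting the DASH descriptor as a SupplementalProperty element follows. Only the schemeIdUri value comes from the text; the attribute names are taken from the Table 3 field names, and their exact serialized casing and value encoding ("true"/"false") are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def make_video_auxiliary_descriptor(with_depth_info: bool,
                                    depth_codec_dependency: bool) -> ET.Element:
    """Build a SupplementalProperty element for the (hypothetically
    serialized) VideoAuxiliaryInfoDescriptor."""
    el = ET.Element("SupplementalProperty",
                    schemeIdUri="urn:avs:ims:2018:av3f")
    el.set("withDepthInfo", "true" if with_depth_info else "false")
    if with_depth_info:
        # The dependency attribute is conditional on the content flag,
        # mirroring the CM (Conditional Mandatory) use in Table 3.
        el.set("depthCodecDependency",
               "true" if depth_codec_dependency else "false")
    return el
```

Such an element would be placed under the AdaptationSet or Representation it describes, matching the two levels at which the descriptor may appear.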
III. SMT signaling extension
Define a video auxiliary information descriptor (VideoAuxiliaryInfoDescriptor), which describes the attributes of the video auxiliary information of a video resource. The definition of the video auxiliary information descriptor is shown in Table 5 below:
Table 5
Figure PCTCN2022137788-appb-000014
Figure PCTCN2022137788-appb-000015
Here, descriptor_tag (descriptor identifier field) identifies the type of the descriptor (i.e., the video auxiliary information descriptor), and descriptor_length (descriptor length field) identifies the length of the descriptor in bytes.
video_with_depth_info_flag (video-with-depth-information field) indicates whether the current media resource (i.e., the corresponding video resource) contains both texture (image) information and the depth information corresponding to it. When this field takes the ninth value (e.g., 1), the current media resource contains both texture (image) information and its corresponding depth information; when it takes the tenth value (e.g., 0), the current media resource contains only texture (image) information.
When video_with_depth_info_flag takes the ninth value (e.g., 1), i.e., the video resource contains both texture information and the corresponding depth information, the video auxiliary information descriptor may further include depth_codec_dependency (the depth codec dependency field of the video auxiliary information); otherwise, the descriptor may include only_depth_info_flag (the depth-only information flag field of the video auxiliary information). The depth_codec_dependency field indicates whether a decoding dependency exists between the texture (image) information contained in the current media resource and its corresponding depth information: the eleventh value (e.g., 1) indicates that a decoding dependency exists, and the twelfth value (e.g., 0) indicates that none exists. The only_depth_info_flag field indicates which image information the current media resource contains: the thirteenth value (e.g., 1) indicates that the current media resource contains only depth information, and the fourteenth value (e.g., 0) indicates that it contains only video image (texture) information.
video_with_parallax_info_flag (video-with-parallax-information field) indicates whether the current media resource (i.e., the corresponding video resource) contains both video image information and the parallax information corresponding to it. When this field takes the ninth value (e.g., 1), the current media resource contains both video image information and its corresponding parallax information; when it takes the tenth value (e.g., 0), the current media resource contains only video image information.
When video_with_parallax_info_flag takes the ninth value (e.g., 1), i.e., the video resource contains both video image information and the corresponding parallax information, the video auxiliary information descriptor may further include parallax_codec_dependency (the parallax codec dependency field of the video auxiliary information); otherwise, the descriptor may include only_parallax_info_flag (the parallax-only information flag field of the video auxiliary information). The parallax_codec_dependency field indicates whether a decoding dependency exists between the video image information contained in the current media resource and its corresponding parallax information: the eleventh value (e.g., 1) indicates that a decoding dependency exists, and the twelfth value (e.g., 0) indicates that none exists. The only_parallax_info_flag field indicates which image information the current media resource contains: the thirteenth value (e.g., 1) indicates that the current media resource contains only parallax information, and the fourteenth value (e.g., 0) indicates that it contains only video image information.
video_with_auxl_info_flag (video-with-auxiliary-information field) indicates whether the current media resource (i.e., the corresponding video resource) contains both video image information and the video auxiliary information corresponding to it. When this field takes the ninth value (e.g., 1), the current media resource contains both video image information and its corresponding video auxiliary information; when it takes the tenth value (e.g., 0), the current media resource contains only video image information. The video auxiliary information here may be depth information, parallax information, or auxiliary information defined by other applications.
When video_with_auxl_info_flag takes the ninth value (e.g., 1), i.e., the video resource contains both video image information and the corresponding video auxiliary information, the video auxiliary information descriptor may further include auxl_codec_dependency (the auxiliary codec dependency field of the video auxiliary information); otherwise, the descriptor may include only_auxiliary_info_flag (the auxiliary-only information flag field of the video auxiliary information). The auxl_codec_dependency field indicates whether a decoding dependency exists between the video image information contained in the current media resource and its corresponding video auxiliary information: the eleventh value (e.g., 1) indicates that a decoding dependency exists, and the twelfth value (e.g., 0) indicates that none exists. The only_auxiliary_info_flag field indicates which image information the current media resource contains: the thirteenth value (e.g., 1) indicates that the current media resource contains only video auxiliary information, and the fourteenth value (e.g., 0) indicates that it contains only video image information.
Based on the above description, the embodiments of this application propose an immersive media processing method, which may be executed by a media production device. Referring to Figure 2, the immersive media processing method may include the following steps S201-S202:
S201: Obtain a video bit stream of immersive media.
In the embodiments of this application, the immersive media may be a free-view video, or immersive media other than a free-view video; this is not limited here. The immersive media may be captured by multiple cameras, and the information captured by any camera includes video image information and the corresponding video auxiliary information. When the immersive media is a free-view video, the video image information captured by any camera may include texture information, and the video auxiliary information captured by any camera may include depth information. It should be understood that the video image information and the video auxiliary information are not limited to these; for example, when the immersive media is another kind of immersive media, the video auxiliary information captured by any camera may include parallax information, or auxiliary information defined by other applications.
S202: Encapsulate the video bit stream into M video tracks to obtain a media file of the immersive media.
Here, M is a positive integer. When M=1, the encapsulation mode of the immersive media is single-track encapsulation; that is, the media production device encapsulates the video bitstream into a single video track to obtain the media file of the immersive media. When M>1, the encapsulation mode of the immersive media is multi-track encapsulation; that is, the media production device encapsulates the video bitstream into multiple video tracks to obtain the media file of the immersive media. In this case, the media production device may encapsulate the video image information and the corresponding video auxiliary information collected by the same camera into the same video track, or into different video tracks; this is not limited here. Further, when the encapsulation mode is multi-track encapsulation, one video track may correspond to one or more cameras, which is likewise not limited. For example, the media production device may choose to have one video track encapsulate only at least one of the video image information and the corresponding video auxiliary information collected by a single camera; or have one video track encapsulate the video image information and corresponding video auxiliary information collected by at least two cameras; or have one video track encapsulate the video image information collected by at least two cameras, or the video auxiliary information collected by at least two cameras.
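The packing options above can be made concrete with a short sketch. This is purely illustrative: the tuple-based in-memory representation of "stream units" and tracks is an assumption for the example, not something mandated by the text.

```python
# Hypothetical sketch of the encapsulation choices described above.
# A stream unit is modeled as (camera_id, kind), where kind is
# "image" (video image information) or "aux" (video auxiliary
# information); a track is simply a list of such units.

def single_track(units):
    """M = 1: the whole bitstream goes into one video track."""
    return [list(units)]

def multi_track_split(units):
    """M > 1, split packing: the image information and the auxiliary
    information of each camera go into separate tracks (one image
    track and one aux track per camera)."""
    tracks = {}
    for camera_id, kind in units:
        tracks.setdefault((camera_id, kind), []).append((camera_id, kind))
    return list(tracks.values())

units = [(1, "image"), (1, "aux"), (2, "image"), (2, "aux")]
assert len(single_track(units)) == 1        # one track holds everything
assert len(multi_track_split(units)) == 4   # two cameras -> four tracks
```

Other layouts permitted by the text (e.g., one track per camera holding both kinds of information, or one track covering several cameras) follow the same pattern with a different grouping key.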
During the encapsulation process, the media production device may also perform the following operations:
(1) According to whether the video image information and the video auxiliary information are in the same video track, add metadata information to the video media header box of at least one of the M video tracks, so as to indicate whether the corresponding video track contains both video image information and the corresponding video auxiliary information. That is, among the M video tracks there is at least one video track whose video media header box includes metadata information, and this metadata information is used to indicate whether the corresponding video track contains both video image information and the corresponding video auxiliary information.
Specifically, if M=1 (that is, the video bitstream of the immersive media is encapsulated into one video track), the media production device may add the metadata information to the video media header box of this single video track. If M>1 (that is, the video bitstream of the immersive media is encapsulated into multiple video tracks), then when the video image information and the corresponding video auxiliary information collected by the same camera are encapsulated into the same video track, the media production device may add the metadata information to the video media header box of each of the M video tracks. When the video image information and the corresponding video auxiliary information collected by the same camera are encapsulated into different video tracks, the M video tracks can be subdivided into one or more video image information tracks and one or more video auxiliary information tracks, and a video auxiliary information track can be referenced or associated to the corresponding video image information track; therefore, to save resources, the media production device may add the metadata information only to the video media header box of each video image information track among the M video tracks.
It can thus be seen that, in some embodiments, the media production device may add the metadata information at least to the video media header box of any video track that encapsulates video image information; that is, if the m-th video track among the M video tracks encapsulates the video image information of at least one camera, the video media header box of the m-th video track includes the metadata information, where m∈[1, M].
In this way, based on the metadata information in the video media header box of each video track, it can be directly determined whether the video track contains both video image information and corresponding video auxiliary information. This avoids the media consumption device repeatedly parsing other boxes during decapsulation to determine whether they contain video image information and corresponding video auxiliary information, saving processing resources.
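The placement rule described above can be sketched as follows. The set-of-kinds representation of a track is an assumption made for the example; the rule itself (single track gets the metadata; otherwise at least every image-information track gets it) comes from the text.

```python
def tracks_needing_metadata(tracks):
    """Return the indices of the tracks whose video media header box
    should carry the metadata information, per the placement rule
    above. A track is modeled as the set of information kinds it
    encapsulates: {"image"}, {"aux"}, or {"image", "aux"}."""
    if len(tracks) == 1:
        return [0]  # M = 1: the only track carries the metadata
    # M > 1: at least every track that encapsulates image information
    return [i for i, kinds in enumerate(tracks) if "image" in kinds]

# Single-track case:
assert tracks_needing_metadata([{"image", "aux"}]) == [0]
# Split multi-track case: only the image information tracks qualify:
assert tracks_needing_metadata([{"image"}, {"aux"}, {"image"}]) == [0, 2]
```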
In some embodiments, the metadata information includes a content information flag field, and the media production device may set the value of the content information flag field according to whether the video track contains both video image information and corresponding video auxiliary information. Specifically, if the corresponding video track contains both video image information and corresponding video auxiliary information, the media production device may set the value of the content information flag field to a first value (for example, 1); if the corresponding video track contains only video image information, the media production device may set the value of the content information flag field to a second value (for example, 0).
In this embodiment, to quickly determine whether a video track contains both video image information and corresponding video auxiliary information, a content information flag field for that purpose is defined in the metadata information. This ensures the processing speed of the media consumption device during decapsulation and also saves processing resources.
In some embodiments, the content information flag field may be any of the following: a video depth information flag field (the aforementioned video_with_depth_info_flag), a video parallax information flag field (the aforementioned video_with_parallax_info_flag), or a video auxiliary information flag field (the aforementioned video_with_auxl_info_flag). When the content information flag field is the video depth information flag field, the video image information includes texture information and the video auxiliary information includes depth information; when the content information flag field is the video parallax information flag field, the video auxiliary information includes parallax information; and when the content information flag field is the video auxiliary information flag field, the video auxiliary information includes depth information, parallax information, or auxiliary information defined by other applications.
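A minimal sketch of this flag-setting logic is shown below. The field names are the ones quoted in the text; the dict-based representation of the video media header box is an assumption made purely for illustration.

```python
# Hypothetical sketch: choosing the content information flag field
# according to the kind of auxiliary information, and setting its
# value per the rule above (first value 1 = both kinds of information
# present; second value 0 = image information only).

AUX_KIND_TO_FLAG = {
    "depth": "video_with_depth_info_flag",
    "parallax": "video_with_parallax_info_flag",
    "generic": "video_with_auxl_info_flag",
}

def set_content_info_flag(media_header: dict, aux_kind: str,
                          has_image: bool, has_aux: bool) -> dict:
    field = AUX_KIND_TO_FLAG[aux_kind]
    media_header[field] = 1 if (has_image and has_aux) else 0
    return media_header

# A track carrying texture + depth information:
hdr = set_content_info_flag({}, "depth", has_image=True, has_aux=True)
assert hdr["video_with_depth_info_flag"] == 1
# A track carrying image information only:
hdr = set_content_info_flag({}, "parallax", has_image=True, has_aux=False)
assert hdr["video_with_parallax_info_flag"] == 0
```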
It should be noted that the above only illustrates, by way of example, how the media production device adds the metadata information; it is not limited thereto. For example, in some embodiments, the media production device may instead choose to add the metadata information to the video media header box of a video track that encapsulates video auxiliary information. In this case, if the corresponding video track contains both video image information and corresponding video auxiliary information, the media production device may set the value of the content information flag field to the first value (e.g., 1); if the corresponding video track contains only video auxiliary information, the media production device may set the value to the second value (e.g., 0). As another example, in some embodiments the media production device may disregard whether the video auxiliary information and the video image information are encapsulated into the same video track, and simply add the metadata information to the video media header box of every video track. In this case, three values of the content information flag field can indicate what the video track carries: if the corresponding video track contains both video image information and corresponding video auxiliary information, the field may be set to the first value (e.g., 1); if it contains only video image information, to the second value (e.g., 0); and if it contains only video auxiliary information, to another value (e.g., 2); and so on.
In this embodiment, by identifying whether the content information flag field is the video depth information flag field, the video parallax information flag field, or the video auxiliary information flag field, it can be promptly determined during the encapsulation process which of depth information, parallax information, and application-defined auxiliary information the auxiliary information is.
(2) For a video track containing the metadata information, if the metadata information indicates that the corresponding video track contains both video image information and corresponding video auxiliary information (e.g., the content information flag field takes the first value), the media production device may further add a video information codec dependency field to the video media header box of the corresponding video track, and may set the value of this field according to whether a decoding dependency exists between the video image information and the corresponding video auxiliary information in that track.
In some embodiments, if the metadata information indicates that the corresponding video track contains both video image information and corresponding video auxiliary information, the video media header box of the corresponding video track further includes a video information codec dependency field. When the value of the video information codec dependency field is a third value, it indicates that a decoding dependency exists between the video image information in the corresponding video track and the corresponding video auxiliary information; when the value is a fourth value, it indicates that no decoding dependency exists between them.
Specifically, if a decoding dependency exists between the video image information in the video track and the corresponding video auxiliary information, the value of the video information codec dependency field may be set to the third value (e.g., 1); if no such dependency exists, the value may be set to the fourth value (e.g., 0).
In this embodiment, by identifying the video information codec dependency field in the video track, it can be quickly determined whether a decoding dependency exists between the video image information in the track and the corresponding video auxiliary information, which improves the efficiency of the encoding and decoding process.
As can be seen from the foregoing, the specific form of the video information codec dependency field may depend on the specific form of the content information flag field: if the content information flag field is the video depth information flag field, the video information codec dependency field is a depth codec dependency field; if it is the video parallax information flag field, the video information codec dependency field is a parallax codec dependency field; and if it is the video auxiliary information flag field, the video information codec dependency field is an auxiliary codec dependency field.
Therefore, in some embodiments, the video information codec dependency field is any of the following: a depth codec dependency field, a parallax codec dependency field, or an auxiliary codec dependency field. When the video information codec dependency field is the depth codec dependency field, the video image information includes texture information and the video auxiliary information includes depth information; when it is the parallax codec dependency field, the video auxiliary information includes parallax information; and when it is the auxiliary codec dependency field, the video auxiliary information includes depth information, parallax information, or auxiliary information defined by other applications.
Here, the depth codec dependency field is the aforementioned depth_codec_dependency, the parallax codec dependency field is the aforementioned parallax_codec_dependency, and the auxiliary codec dependency field is the aforementioned auxl_codec_dependency.
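The pairing between each content information flag field and its codec dependency field can be sketched as below. The field names are those quoted in the text; treating the video media header box as a dict is, again, an assumption for illustration only.

```python
# Hypothetical sketch: each content information flag field has a
# matching codec-dependency field (names as quoted in the text).

FLAG_TO_DEPENDENCY = {
    "video_with_depth_info_flag": "depth_codec_dependency",
    "video_with_parallax_info_flag": "parallax_codec_dependency",
    "video_with_auxl_info_flag": "auxl_codec_dependency",
}

def set_codec_dependency(media_header: dict, flag_field: str,
                         dependent: bool) -> dict:
    # The dependency field is only added when the flag field took the
    # first value (1), i.e. the track carries both kinds of information.
    assert media_header.get(flag_field) == 1
    dep_field = FLAG_TO_DEPENDENCY[flag_field]
    # Third value (1): decoding dependency exists; fourth value (0): none.
    media_header[dep_field] = 1 if dependent else 0
    return media_header

hdr = {"video_with_depth_info_flag": 1}
hdr = set_codec_dependency(hdr, "video_with_depth_info_flag", dependent=True)
assert hdr["depth_codec_dependency"] == 1
```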
In this embodiment, by identifying whether the video information codec dependency field is the depth codec dependency field, the parallax codec dependency field, or the auxiliary codec dependency field, it can be promptly determined which of depth information, parallax information, and application-defined auxiliary information the auxiliary information in the video track is.
(3) According to the component type of the video content of the immersive media, and according to whether a codec dependency exists between different components of that video content, add a video component dependency information box to at least one of the M video tracks.
It can thus be seen that, in some embodiments, at least one of the M video tracks includes a video component dependency information box, where the video component dependency information box is used to indicate the component type of the video content of the immersive media and whether a codec dependency exists between different components of that video content.
The above embodiment defines the video component dependency information box. Through this box, the component type of the video content, and whether a codec dependency exists between components, can be made explicit, thereby improving codec efficiency.
In some embodiments, when M=1, the single video track includes the video component dependency information box; when M>1, if the m-th video track among the M video tracks encapsulates one or more kinds of information collected by at least two cameras, the m-th video track includes the video component dependency information box.
Specifically, if M=1 (that is, the video bitstream of the immersive media is encapsulated into one video track), the media production device may add the video component dependency information box to this single video track. If M>1 (that is, the video bitstream of the immersive media is encapsulated into multiple video tracks), then when a video track corresponds to at least two cameras, the media production device may add the video component dependency information box to each of the M video tracks that encapsulates one or more kinds of information collected by at least two cameras. For example, if a video track encapsulates the video image information and video auxiliary information collected by at least two cameras, the media production device may add the video component dependency information box to that track; likewise, if a video track encapsulates the video image information collected by at least two cameras, or the video auxiliary information collected by at least two cameras, the media production device may add the box to that track. It follows that if the m-th video track among the M video tracks encapsulates one or more kinds of information collected by at least two cameras, the m-th video track includes the video component dependency information box.
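The placement rule for this box can be sketched as follows; the list-of-units representation of a track is an assumption for the example.

```python
def tracks_needing_composition_box(tracks):
    """Indices of the tracks that should carry the video component
    dependency information box, per the rule above. A track is
    modeled as a list of (camera_id, kind) units."""
    if len(tracks) == 1:
        return [0]  # M = 1: the single track gets the box
    out = []
    for i, units in enumerate(tracks):
        cameras = {camera_id for camera_id, _ in units}
        if len(cameras) >= 2:  # track covers at least two cameras
            out.append(i)
    return out

# Track 0 mixes two cameras' image info and needs the box;
# track 1 holds one camera's auxiliary info and does not:
assert tracks_needing_composition_box(
    [[(1, "image"), (2, "image")], [(3, "aux")]]) == [0]
# Single-track case:
assert tracks_needing_composition_box([[(1, "image"), (1, "aux")]]) == [0]
```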
In this embodiment, when the number of video tracks is 1, multiple video component dependency information boxes are not needed to indicate the component type of the video content and the codec dependencies between different components. When there are multiple video tracks, if a video track encapsulates one or more kinds of information collected by at least two cameras, that track needs the video component dependency information box to indicate the codec dependency between the components from those cameras, so as to improve codec efficiency.
In some embodiments, the video component dependency information box includes a component codec dependency field. When the value of the component codec dependency field is a fifth value, it indicates that a codec dependency exists between different components of the video content of the immersive media; when the value is a sixth value, it indicates that no codec dependency exists between different components of the video content of the immersive media.
Here, the video component dependency information box may include the component codec dependency field (the aforementioned composition_codec_dependency), and the media production device may set the value of this field according to whether a codec dependency exists between different components of the video content of the immersive media. Specifically, if a codec dependency exists between different components, the media production device may set the value of the component codec dependency field to the fifth value (e.g., 1); if no such dependency exists, it may set the value to the sixth value (e.g., 0).
In this embodiment, the value of the component codec dependency field accurately reflects whether a codec dependency exists between different components of the video content of the immersive media.
In some embodiments, the video component dependency information box includes a component type field. When the value of the component type field is a seventh value, it indicates that the component type of the video content of the immersive media is a camera-based component type, in which one component includes the video content captured by one camera. When the value of the component type field is an eighth value, it indicates that the component type of the video content of the immersive media is an eye-based (binocular) component type, in which one component includes left-eye video content or right-eye video content.
Here, the component type field is the aforementioned composition_type, and the media production device may set the value of the component type field according to the component type of the video content of the immersive media (that is, the rule by which the video content is divided into components). Specifically, if the component type is the camera-based component type, i.e., one component includes the video content captured by one camera, the media production device may set the value of the component type field to the seventh value (e.g., 0); if the component type is the eye-based component type, i.e., one component includes left-eye video content or right-eye video content, it may set the value to the eighth value (e.g., 1).
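The two fields of this box, with the example values given above, can be sketched as a small structure. The dataclass form is illustrative only; the field names composition_codec_dependency and composition_type are the ones quoted in the text.

```python
# Hypothetical sketch of the video component dependency information
# box fields named in the text.
from dataclasses import dataclass

@dataclass
class VideoCompositionDependencyInfo:
    # fifth value 1: codec dependency exists between components;
    # sixth value 0: no dependency between components
    composition_codec_dependency: int
    # seventh value 0: camera-based components (one component = one
    # camera's content); eighth value 1: eye-based components (one
    # component = left-eye or right-eye content)
    composition_type: int

# Components divided per camera, with a codec dependency between them:
box = VideoCompositionDependencyInfo(composition_codec_dependency=1,
                                     composition_type=0)
assert (box.composition_codec_dependency, box.composition_type) == (1, 0)
```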
In this embodiment, the value of the component type field in the video component dependency information box accurately distinguishes whether the component type of the video content of the immersive media is the camera-based component type or the eye-based component type.
(4) When M>1 (that is, the video bitstream of the immersive media is encapsulated into multiple video tracks), add a track group box to each of the M video tracks that encapsulates video image information, so as to associate with one another all the video tracks among the M video tracks that encapsulate video image information.
In some embodiments, when M>1, each video track among the M video tracks that encapsulates video image information further contains a track group box. The track group box includes a track group identifier field and a camera identifier field: the track group identifier field is used to store the identification information of the track group corresponding to the video track, and the camera identifier field is used to store the identification information of the camera corresponding to the video track.
The track group box may include at least the track group identifier field and the camera identifier field. For any video track that encapsulates video image information, the media production device may store the identification information of the track group corresponding to that video track in the track group identifier field of the track group box in that track, and store the identification information of the camera corresponding to that video track in the camera identifier field of the same box.
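A minimal sketch of these two fields follows. The names track_group_id and camera_id are illustrative assumptions for the identifier fields described in the text, and the dataclass form is not normative.

```python
# Hypothetical sketch of the track group box fields described above.
from dataclasses import dataclass

@dataclass
class TrackGroupBox:
    track_group_id: int  # identification info of the track group
    camera_id: int       # identification info of the corresponding camera

# Three image information tracks, one per camera, associated with
# each other by sharing one track group identifier:
boxes = [TrackGroupBox(track_group_id=100, camera_id=c) for c in (1, 2, 3)]
assert len({b.track_group_id for b in boxes}) == 1   # same group
assert {b.camera_id for b in boxes} == {1, 2, 3}     # distinct cameras
```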
In this embodiment, through the track group box in the video track, the media production device can accurately store the identification information of the track group and the identification information of the camera in the corresponding identifier fields.
(5) When the M video tracks include one or more video image information tracks and one or more video auxiliary information tracks, the media production device may further add a track reference type box to each video auxiliary information track, so as to reference or associate the video auxiliary information track to the video image information track corresponding to the same camera, and to indicate whether a decoding dependency exists between the video auxiliary information track and the corresponding video image information track.
In one embodiment, when M>1, if the video image information and the video auxiliary information of the same camera are encapsulated into different video tracks, the M video tracks include one or more video image information tracks and one or more video auxiliary information tracks. A video image information track is a video track that encapsulates video image information, and a video auxiliary information track is a video track that encapsulates video auxiliary information. Each video auxiliary information track includes a track reference type box, which is used to reference or associate the video auxiliary information track to the video image information track corresponding to the same camera, and to indicate whether a decoding dependency exists between the video auxiliary information track and the corresponding video image information track.
In this embodiment, the M video tracks are divided into at least one track encapsulating video image information and at least one track encapsulating video auxiliary information, and the decoding dependency indicated by the track reference type box of each video auxiliary information track can improve decoding efficiency.
In some embodiments, the track reference type box contains a track identifier field, and the track identifier field is used to store the identification information of the video image information track associated with the video auxiliary information track.
Here, the track reference type box contains a track identifier field (track_ID). For any video auxiliary information track, the media production device may store the identification information of the video image information track associated with that video auxiliary information track in the track identifier field of the track reference type box in that video auxiliary information track.
In this embodiment, through the track identifier field in the track reference type box, the media production device can accurately store the identification information of the video image information track associated with each video auxiliary information track.
In some embodiments, the track reference type box contains a reference type field. When the reference type field stores a first track index, it indicates that the video auxiliary information track depends on the corresponding video image information track when decoded; when the reference type field stores a second track index, it indicates that the video auxiliary information track does not depend on the corresponding video image information track when decoded.
Since the track reference type box contains the reference type field, for any video auxiliary information track the media production device may, according to whether that track depends on the corresponding video image information track when decoded, choose to store either the first track index or the second track index in the reference type field of the track reference type box in that video auxiliary information track. Specifically, if the video auxiliary information track depends on the corresponding video image information track when decoded, the first track index may be stored in the reference type field of the track reference type box in that track; if the video auxiliary information track does not depend on the corresponding video image information track when decoded, the second track index may be stored in that field instead.
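The reference mechanism above can be sketched as follows. Note that this passage refers to the first and second track indexes only as strings without spelling out the concrete codes, so the four-character strings in the sketch are placeholders, not normative values; the dataclass form is likewise an assumption for illustration.

```python
# Hypothetical sketch of the track reference type box usage described
# above: a video auxiliary information track references its image
# information track by track_ID, and the reference type field encodes
# whether a decoding dependency exists.
from dataclasses import dataclass
from typing import List

@dataclass
class TrackReferenceTypeBox:
    reference_type: str   # first or second track index (placeholder codes)
    track_IDs: List[int]  # IDs of the referenced image information tracks

def make_reference(image_track_id: int,
                   depends_on_image: bool) -> TrackReferenceTypeBox:
    # "depN"/"indN" are placeholders: the text only says a first track
    # index (dependent) or a second track index (independent) is stored.
    ref_type = "depN" if depends_on_image else "indN"
    return TrackReferenceTypeBox(reference_type=ref_type,
                                 track_IDs=[image_track_id])

ref = make_reference(7, depends_on_image=True)
assert ref.track_IDs == [7] and ref.reference_type == "depN"
```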
In this embodiment, during decoding, the media consumption device can determine from the track index stored in the reference type field whether a dependency exists between the video auxiliary information track and the video image information track, so that decoding can be completed accurately and quickly, which improves decoding efficiency.
In some embodiments, the first track index is any one of the following: a first character string, which indicates that the video auxiliary information in the video auxiliary information track includes depth information; a second character string, which indicates that the video auxiliary information in the video auxiliary information track includes parallax information; or a third character string, which indicates that the video auxiliary information in the video auxiliary information track includes depth information, parallax information, or auxiliary information defined by other applications. The second track index is any one of the following: a fourth character string, which indicates that the video auxiliary information in the video auxiliary information track includes depth information; a fifth character string, which indicates that the video auxiliary information in the video auxiliary information track includes parallax information; or a sixth character string, which indicates that the video auxiliary information in the video auxiliary information track includes depth information, parallax information, or auxiliary information defined by other applications.
Here, the first character string is the aforementioned 'vdep', the second character string is the aforementioned 'vplx', and the third character string is the aforementioned 'auxl'. The fourth character string is the aforementioned 'vdpi', the fifth character string is the aforementioned 'vpli', and the sixth character string is the aforementioned 'auxi'.
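Under the six four-character codes just listed, the reference type field can be mapped to a (depends-on-image-track, auxiliary-information-kind) pair. The following is a minimal sketch of that mapping; the helper name, return shape, and kind labels are illustrative and not taken from any real parser.

```python
# Sketch: interpreting the reference type field of the track reference type
# data box. The six codes come from the description above; everything else
# (function name, return shape) is illustrative.

FIRST_TRACK_INDEXES = {   # decoding depends on the video image information track
    "vdep": "depth",
    "vplx": "parallax",
    "auxl": "depth/parallax/other-app-defined",
}
SECOND_TRACK_INDEXES = {  # decoding does not depend on the image track
    "vdpi": "depth",
    "vpli": "parallax",
    "auxi": "depth/parallax/other-app-defined",
}

def interpret_reference_type(code):
    """Return (depends_on_image_track, aux_info_kind) for a reference type code."""
    if code in FIRST_TRACK_INDEXES:
        return True, FIRST_TRACK_INDEXES[code]
    if code in SECOND_TRACK_INDEXES:
        return False, SECOND_TRACK_INDEXES[code]
    raise ValueError(f"unknown reference type: {code!r}")
```

For example, a track carrying 'vdep' must be decoded together with its referenced image track, while a track carrying 'vdpi' carries depth information that decodes on its own.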
In this embodiment, for the first track index and the second track index, a distinct character string is defined for each type of video auxiliary information, which effectively distinguishes whether the video auxiliary information indicated by the first or second track index is depth information, parallax information, or application-defined auxiliary information.
In the above immersive media processing method, the video bitstream of the immersive media is obtained and encapsulated into M video tracks to obtain the media file of the immersive media. In this process, metadata information can be encapsulated in the video media header data box of at least one video track to provide a unified indication of the information contained in the corresponding video track. This makes it possible to organize the media content of the immersive media in a more optimized way and saves the media consumption device from repeatedly parsing, during decapsulation, the fields in other data boxes that indicate the information contained in the video track, thereby saving processing resources, improving decapsulation efficiency, and in turn improving decoding efficiency.
In some embodiments, after the video bitstream is encapsulated into the M video tracks, a video resource corresponding to each video track is obtained, and the media file includes the video resources corresponding to the M video tracks. If the transmission mode of the media file is streaming, the method further includes: generating a signaling description file corresponding to the media file, where the signaling description file includes at least one video auxiliary information descriptor used to describe the relevant attributes of the video auxiliary information of a video resource.
The transmission signaling corresponding to the signaling description file may be DASH signaling or SMT signaling, which is not limited here.
Specifically, the media production device may add at least one video auxiliary information descriptor to the signaling description file to describe the relevant attributes of the video auxiliary information of the video resources.
Further, when M>1, the immersive media is encapsulated in multi-track form, and the video resource or adaptation set corresponding to each video track is indicated by a preselection tool.
In this way, when there are multiple video tracks, the preselection tool can effectively and accurately indicate the video resource or adaptation set corresponding to each video track.
In this embodiment, if it is determined that the transmission mode of the media file is streaming, a signaling description file including at least one video auxiliary information descriptor is generated, so that the attributes associated with the video auxiliary information can be described in detail and accurately.
In some embodiments, when a video auxiliary information descriptor is located at the adaptation set level of the signaling description file, it describes the relevant attributes of the video auxiliary information of each video resource in the corresponding adaptation set; when a video auxiliary information descriptor is located at the representation level of the signaling description file, it describes the relevant attributes of the video auxiliary information of the corresponding video resource.
It should be noted that, when the transmission signaling is DASH signaling, as follows from the aforementioned extensions to DASH signaling, a video auxiliary information descriptor may be located at the representation level to describe the relevant attributes of the video auxiliary information of the corresponding video resource; in this case, the number of video auxiliary information descriptors in the signaling description file equals the number of video resources (representations) in the media file. Alternatively, a video auxiliary information descriptor may be located at the adaptation set level to describe the relevant attributes of the video auxiliary information of each video resource in the corresponding adaptation set; in this case, the number of video auxiliary information descriptors in the signaling description file equals the number of adaptation sets in the media file.
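The relationship between descriptor placement and descriptor count can be sketched as follows. The model below (a plain dict of adaptation sets to representation lists, and the helper name) is purely illustrative and not from any real DASH library.

```python
# Hypothetical MPD model: one descriptor per representation when placed at
# the representation level, one per adaptation set when placed at the
# adaptation set level. Names here are illustrative only.

def attach_descriptors(adaptation_sets, level):
    """Attach one VideoAuxiliaryInfo descriptor per unit at the given level.

    Returns a list of (scope_id, descriptor) pairs.
    """
    descriptors = []
    for aset_id, representations in adaptation_sets.items():
        if level == "adaptation_set":
            # One descriptor covering every representation in this set.
            descriptors.append((aset_id, {"descriptor": "VideoAuxiliaryInfo"}))
        elif level == "representation":
            # One descriptor per individual video resource.
            for rep_id in representations:
                descriptors.append((rep_id, {"descriptor": "VideoAuxiliaryInfo"}))
        else:
            raise ValueError("level must be 'representation' or 'adaptation_set'")
    return descriptors

mpd = {"as0": ["rep0", "rep1"], "as1": ["rep2"]}
rep_level = attach_descriptors(mpd, "representation")   # 3 descriptors, one per representation
set_level = attach_descriptors(mpd, "adaptation_set")   # 2 descriptors, one per adaptation set
```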
In this embodiment, by determining the level at which a video auxiliary information descriptor is placed in the signaling description file, the relevant attributes matching that level can be described accurately.
In some embodiments, the media production device may add a video auxiliary information field to the video auxiliary information descriptor. When the video auxiliary information descriptor is located at the representation level, the media production device may store the relevant attributes of the video auxiliary information of the corresponding video resource in that field; when the descriptor is located at the adaptation set level, the media production device may store the relevant attributes of the video auxiliary information of the corresponding adaptation set in that field.
In this embodiment, by means of the video auxiliary information field in the video auxiliary information descriptor, the relevant attributes of the video auxiliary information of a video resource or of an adaptation set can be stored in the corresponding field promptly and accurately, which improves the accuracy of storing video auxiliary information.
In some embodiments, the video auxiliary information descriptor includes an auxiliary information indication field. When the value of the auxiliary information indication field is a ninth value, it indicates that the corresponding video resource contains both video image information and the corresponding video auxiliary information; when the value of the auxiliary information indication field is a tenth value, it indicates that the corresponding video resource contains only video image information.
Here, the media production device may add the auxiliary information indication field to the video auxiliary information descriptor, and may set its value according to whether the video resource described by the descriptor contains both video image information and the corresponding video auxiliary information.
Specifically, if the video resource described by the video auxiliary information descriptor contains both video image information and the corresponding video auxiliary information, the media production device sets the auxiliary information indication field to the ninth value (for example, 1); if the video resource contains only video image information, the media production device sets the field to the tenth value (for example, 0).
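A minimal sketch of setting this field, assuming the example values 1 and 0 mentioned above; the dict-based resource model and helper name are illustrative, not a real MPD API.

```python
# Sketch: deriving the auxiliary information indication field value.
# NINTH_VALUE / TENTH_VALUE follow the example values (1 and 0) above;
# the "resource" dict shape is purely illustrative.

NINTH_VALUE = 1   # resource carries video image info AND video auxiliary info
TENTH_VALUE = 0   # resource carries video image info only

def auxiliary_info_indication(resource):
    """Return the value to store in the auxiliary information indication field."""
    has_image = resource.get("video_image_info", False)
    has_aux = resource.get("video_auxiliary_info", False)
    if has_image and has_aux:
        return NINTH_VALUE
    return TENTH_VALUE

# A texture+depth resource vs. a texture-only resource:
both = auxiliary_info_indication(
    {"video_image_info": True, "video_auxiliary_info": True})    # → 1
image_only = auxiliary_info_indication(
    {"video_image_info": True, "video_auxiliary_info": False})   # → 0
```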
In this embodiment, the value of the auxiliary information indication field makes it possible to determine promptly whether the corresponding video resource contains only video image information, or both video image information and the corresponding video auxiliary information.
In some embodiments, the auxiliary information indication field is any one of the following: a video auxiliary information field with depth information, a video auxiliary information field with parallax information, or a video auxiliary information field with auxiliary information. When the auxiliary information indication field is the video auxiliary information field with depth information, the video image information includes texture information and the video auxiliary information includes depth information; when the auxiliary information indication field is the video auxiliary information field with parallax information, the video auxiliary information includes parallax information; when the auxiliary information indication field is the video auxiliary information field with auxiliary information, the video auxiliary information includes depth information, parallax information, or auxiliary information defined by other applications.
Here, the video auxiliary information field with depth information is VideoAuxiliaryInfo@withDepthInfo in the aforementioned DASH signaling or video_with_depth_info_flag in SMT signaling; the video auxiliary information field with parallax information is VideoAuxiliaryInfo@withParallaxInfo in the aforementioned DASH signaling or video_with_parallax_info_flag in the aforementioned SMT signaling; and the video auxiliary information field with auxiliary information is VideoAuxiliaryInfo@withAuxiliaryInfo in the aforementioned DASH signaling or video_with_auxl_info_flag in SMT signaling.
In this embodiment, by identifying whether the auxiliary information indication field is the video auxiliary information field with depth information, the video auxiliary information field with parallax information, or the video auxiliary information field with auxiliary information, it can be promptly determined during transmission that, in addition to the video image information of the video resource, the resource also contains one of depth information, parallax information, and auxiliary information.
Further, if the video resource described by the video auxiliary information descriptor contains both video image information and the corresponding video auxiliary information, that is, the auxiliary information indication field takes the ninth value, the media production device may add an auxiliary information codec dependency field to the video auxiliary information descriptor and set its value according to whether a decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information.
In some embodiments, if the video resource described by the video auxiliary information descriptor contains both video image information and the corresponding video auxiliary information, the video auxiliary information descriptor includes an auxiliary information codec dependency field. When the value of the auxiliary information codec dependency field is an eleventh value, it indicates that a decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information; when the value of the field is a twelfth value, it indicates that no such decoding dependency exists.
Specifically, if a decoding dependency exists between the video image information in the video resource described by the descriptor and the corresponding video auxiliary information, the auxiliary information codec dependency field may be set to the eleventh value (for example, 1); if no such dependency exists, the field may be set to the twelfth value (for example, 0).
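On the consumption side, this field tells a client which elementary streams it must hand to the decoder when it wants the auxiliary information. The sketch below assumes, as in the track case described earlier, that the dependency runs from the auxiliary information to the image information; the function and its inputs are illustrative, not from any real player API.

```python
# Sketch: using the auxiliary information codec dependency field on the
# client. Example values follow the text above (1 = dependency, 0 = none);
# the stream labels and function name are illustrative.

ELEVENTH_VALUE = 1  # auxiliary info decoding depends on the image bitstream
TWELFTH_VALUE = 0   # auxiliary info decodes independently

def streams_to_feed_decoder(want_aux_info, codec_dependency):
    """Return which streams of the resource must be handed to the decoder."""
    if not want_aux_info:
        # The image information is always consumable on its own.
        return ["video_image"]
    if codec_dependency == ELEVENTH_VALUE:
        # Dependency present: both must be decoded together.
        return ["video_image", "video_auxiliary"]
    # No dependency: the auxiliary information can be decoded by itself.
    return ["video_auxiliary"]
```

This is the decision that lets a client skip segmented decoding when no dependency is signaled.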
In this embodiment, by identifying the value of the auxiliary information codec dependency field, it can be quickly determined whether a decoding dependency exists between the video image information and the corresponding video auxiliary information in the video resource, which improves decoding efficiency.
As can be seen from the foregoing, the specific form of the auxiliary information codec dependency field may depend on the specific form of the auxiliary information indication field. If the auxiliary information indication field is the video auxiliary information field with depth information, the auxiliary information codec dependency field is the depth codec dependency field of the video auxiliary information; if the auxiliary information indication field is the video auxiliary information field with parallax information, the auxiliary information codec dependency field is the parallax codec dependency field of the video auxiliary information; if the auxiliary information indication field is the video auxiliary information field with auxiliary information, the auxiliary information codec dependency field is the auxiliary codec dependency field of the video auxiliary information.
Therefore, in some embodiments, the auxiliary information codec dependency field is any one of the following: the depth codec dependency field of the video auxiliary information, the parallax codec dependency field of the video auxiliary information, or the auxiliary codec dependency field of the video auxiliary information. When the auxiliary information codec dependency field is the depth codec dependency field of the video auxiliary information, the video image information includes texture information and the video auxiliary information includes depth information; when it is the parallax codec dependency field of the video auxiliary information, the video auxiliary information includes parallax information; when it is the auxiliary codec dependency field of the video auxiliary information, the video auxiliary information includes depth information, parallax information, or auxiliary information defined by other applications.
Here, the depth codec dependency field of the video auxiliary information is VideoAuxiliaryInfo@depthCodecDependency in the aforementioned DASH signaling or depth_codec_dependency in SMT signaling; the parallax codec dependency field of the video auxiliary information is VideoAuxiliaryInfo@parallaxCodecDependency in the aforementioned DASH signaling or parallax_codec_dependency in the aforementioned SMT signaling; and the auxiliary codec dependency field of the video auxiliary information is VideoAuxiliaryInfo@auxilliaryCodecDependency in DASH signaling or auxl_codec_dependency in SMT signaling.
In this embodiment, by identifying whether the auxiliary information codec dependency field is the depth codec dependency field, the parallax codec dependency field, or the auxiliary codec dependency field, it can be promptly determined during transmission whether the auxiliary information is depth information, parallax information, or application-defined auxiliary information.
In some embodiments, the media production device may also add a video auxiliary identification field to the video auxiliary information descriptor and set its value according to the type of information in the corresponding video resource.
In some embodiments, the video auxiliary information descriptor includes a video auxiliary identification field. When the value of the video auxiliary identification field is a thirteenth value, it indicates that the corresponding video resource contains only the video auxiliary information corresponding to the video image information; when the value of the field is a fourteenth value, it indicates that the corresponding video resource contains only video image information.
Specifically, if the video resource described by the video auxiliary information descriptor contains only the video auxiliary information corresponding to the video image information, the video auxiliary identification field may be set to the thirteenth value (for example, 1); if the video resource contains only video image information, the field may be set to the fourteenth value (for example, 0).
In this embodiment, the value of the video auxiliary identification field can accurately indicate during transmission whether the corresponding video resource contains only video auxiliary information or only video image information.
In some embodiments, the video auxiliary identification field is any one of the following: the depth-information-only flag field of the video auxiliary information, the parallax-information-only flag field of the video auxiliary information, or the auxiliary-information-only flag field of the video auxiliary information. When the video auxiliary identification field is the depth-information-only flag field, the video image information includes texture information and the video auxiliary information includes depth information; when it is the parallax-information-only flag field, the video auxiliary information includes parallax information; when it is the auxiliary-information-only flag field, the video auxiliary information includes depth information, parallax information, or auxiliary information defined by other applications.
Here, the depth-information-only flag field of the video auxiliary information is VideoAuxiliaryInfo@onlyDepthInfo in the aforementioned DASH signaling or only_depth_info_flag in SMT signaling; the parallax-information-only flag field is VideoAuxiliaryInfo@onlyParallaxInfo in the aforementioned DASH signaling or only_parallax_info_flag in SMT signaling; and the auxiliary-information-only flag field is VideoAuxiliaryInfo@onlyAuxiliaryInfo in the aforementioned DASH signaling or only_auxiliary_info_flag in the aforementioned SMT signaling.
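The DASH attribute / SMT flag pairs named in this description can be summarized in one lookup table. The dict below only restates the field names already given in the text (including the 'auxilliaryCodecDependency' spelling as written); the dict layout and semantic keys are illustrative.

```python
# Summary sketch: DASH signaling attribute vs. SMT signaling flag for each
# semantic field named in this description. Only the names are from the
# text; the table structure itself is illustrative.

FIELD_NAME_MAP = {
    # semantic field: (DASH signaling attribute, SMT signaling flag)
    "with_depth_info":     ("VideoAuxiliaryInfo@withDepthInfo",     "video_with_depth_info_flag"),
    "with_parallax_info":  ("VideoAuxiliaryInfo@withParallaxInfo",  "video_with_parallax_info_flag"),
    "with_auxiliary_info": ("VideoAuxiliaryInfo@withAuxiliaryInfo", "video_with_auxl_info_flag"),
    "depth_codec_dependency":     ("VideoAuxiliaryInfo@depthCodecDependency",     "depth_codec_dependency"),
    "parallax_codec_dependency":  ("VideoAuxiliaryInfo@parallaxCodecDependency",  "parallax_codec_dependency"),
    "auxiliary_codec_dependency": ("VideoAuxiliaryInfo@auxilliaryCodecDependency", "auxl_codec_dependency"),
    "only_depth_info":     ("VideoAuxiliaryInfo@onlyDepthInfo",     "only_depth_info_flag"),
    "only_parallax_info":  ("VideoAuxiliaryInfo@onlyParallaxInfo",  "only_parallax_info_flag"),
    "only_auxiliary_info": ("VideoAuxiliaryInfo@onlyAuxiliaryInfo", "only_auxiliary_info_flag"),
}
```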
In this embodiment, depending on whether the video auxiliary identification field is the depth-information-only flag field, the parallax-information-only flag field, or the auxiliary-information-only flag field, it is constrained that only one type of auxiliary information is involved in the transmission.
In some embodiments, the signaling description file further includes at least one free-view camera descriptor used to describe the camera information corresponding to a video resource. The free-view camera descriptor may include a free-view camera source identification field, whose value indicates the content identifier to which the corresponding video resource belongs, where the video resources corresponding to different cameras of the same video content share the same content identifier; alternatively, the free-view camera descriptor may include a free-view camera combination identification field, whose value indicates the combination identifier to which the corresponding video resource belongs, where the video resources corresponding to different cameras of the same video content share the same combination identifier.
Here, when the transmission signaling is DASH signaling, as follows from the aforementioned extensions to DASH signaling, a free-view camera descriptor may be located at the representation level to describe the camera information corresponding to the respective video resource; in this case, the number of free-view camera descriptors in the signaling description file equals the number of video resources (representations) in the media file. Alternatively, a free-view camera descriptor may be located at the adaptation set level to describe the camera information corresponding to each video resource in the corresponding adaptation set; in this case, the number of free-view camera descriptors in the signaling description file equals the number of adaptation sets in the media file.
Specifically, the media production device may add a free-view camera source identification field to the free-view camera descriptor and store in it the content identifier to which the video resource described by the descriptor belongs; alternatively, the media production device may add a free-view camera combination identification field to the free-view camera descriptor and store in it the combination identifier to which the video resource described by the descriptor belongs. In addition, the media production device may add one or more AvsFreeViewCam (free-view camera) fields to the free-view camera descriptor and store in them the camera information corresponding to the respective representation or adaptation set. It may also add AvsFreeViewCam@cameraId (a free-view camera identification field) to the free-view camera descriptor and store in it the camera identifier of each camera corresponding to the video resource described by the descriptor. Furthermore, according to the fields mentioned in the aforementioned Table 4, other information about each camera corresponding to the video resource may also be added to the free-view camera descriptor, such as the x, y, and z component values of the camera position and the x, y, and z component values of the camera focal length.
In this embodiment, by extending the free-view camera descriptor with the free-view camera source identification field and the free-view camera combination identification field, the camera information corresponding to a video resource can be described accurately and comprehensively.
Therefore, when the transmission mode of the media file is streaming, adding corresponding information to the signaling description file to indicate whether each video resource in the media file contains both video image information and video auxiliary information, whether decoding dependencies exist between the various kinds of information in a video resource, and whether decoding dependencies exist between different components of the video content of the immersive media, enables the client on the media consumption device, during free-view video consumption, to select the appropriate resources for consumption according to the indications in the signaling file and to decide whether to perform segmented decoding, which can effectively improve the playback performance of the client.
Please refer to Figure 3, which is a schematic flowchart of another immersive media processing method provided by an embodiment of the present application. This immersive media processing method may be executed by a media consumption device, or by a client on the media consumption device; for ease of description, the following takes the media consumption device executing the method as an example. As shown in Figure 3, the immersive media processing method may include the following steps S301-S302:
S301: Obtain a target video track in a media file of immersive media.
其中，沉浸媒体通过多个相机拍摄得到，任一相机通过拍摄所采集到的信息包括：视频图像信息和相应的视频辅助信息；且媒体文件是通过将沉浸媒体的视频位流封装到M个视频轨道中得到的。M为正整数，在M个视频轨道中，存在至少一个视频轨道中的视频媒体头数据盒包括元数据信息，该元数据信息用于指示相应视频轨道是否同时包含视频图像信息和相应的视频辅助信息。可选的，在M>1的情况下，M个视频轨道中封装了视频图像信息的视频轨道还包含轨道组数据盒，该轨道组数据盒包括轨道组标识字段和相机标识字段；其中，轨道组标识字段用于存储相应视频轨道所对应的轨道组的标识信息，相机标识字段用于存储相应视频轨道对应的相机的标识信息。The immersive media is captured by multiple cameras, and the information collected by any camera includes video image information and corresponding video auxiliary information; the media file is obtained by encapsulating the video bitstream of the immersive media into M video tracks, where M is a positive integer. Among the M video tracks, at least one video track has a video media header data box that includes metadata information, and the metadata information indicates whether the corresponding video track contains both video image information and the corresponding video auxiliary information. Optionally, when M>1, each video track that encapsulates video image information among the M video tracks further contains a track group data box, which includes a track group identification field and a camera identification field; the track group identification field stores the identification information of the track group corresponding to the video track, and the camera identification field stores the identification information of the camera corresponding to the video track.
在M>1的情况下，若同一个相机的视频图像信息和视频辅助信息被封装到不同的视频轨道，则M个视频轨道包括：一个或多个视频图像信息轨道和一个或多个视频辅助信息轨道；任一个视频辅助信息轨道包括轨道参考类型数据盒，轨道参考类型数据盒用于将视频辅助信息轨道参考或关联到相同相机对应的视频图像信息轨道，且用于指示视频辅助信息轨道和对应的视频图像信息轨道，在解码时是否存在依赖关系。具体的，轨道参考类型数据盒包含轨道标识字段，该轨道标识字段用于存储与视频辅助信息轨道所关联的视频图像信息轨道的标识信息。另外，轨道参考类型数据盒包含参考类型字段；当参考类型字段存储第一轨道索引时，指示视频辅助信息轨道在解码时，依赖对应的视频图像信息轨道；当参考类型字段存储第二轨道索引时，指示视频辅助信息轨道在解码时，不依赖对应的视频图像信息轨道。When M>1, if the video image information and the video auxiliary information of the same camera are encapsulated into different video tracks, the M video tracks include one or more video image information tracks and one or more video auxiliary information tracks. Any video auxiliary information track includes a track reference type data box, which references or associates the video auxiliary information track with the video image information track corresponding to the same camera, and indicates whether a decoding dependency exists between the video auxiliary information track and the corresponding video image information track. Specifically, the track reference type data box contains a track identification field, which stores the identification information of the video image information track associated with the video auxiliary information track. In addition, the track reference type data box contains a reference type field; when the reference type field stores the first track index, it indicates that the video auxiliary information track depends on the corresponding video image information track during decoding; when the reference type field stores the second track index, it indicates that the video auxiliary information track does not depend on the corresponding video image information track during decoding.
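The reference-type decision described above can be sketched as follows. This is only an illustrative sketch: the two track index tag values used here are assumptions for the example (only 'vdpi' appears later in this document, as the independent case), and the actual values are defined by the encapsulation standard.

```python
# Illustrative sketch of the track-reference decision described above.
# The two tag strings are placeholders for the "first" and "second"
# track index values; the real values come from the encapsulation standard.

REF_DEPENDENT = "vdpd"    # assumed tag: aux track depends on the image track
REF_INDEPENDENT = "vdpi"  # assumed tag: aux track decodes independently

def aux_track_depends_on_image(reference_type: str) -> bool:
    """True if the video auxiliary information track depends on its
    associated video image information track during decoding."""
    if reference_type == REF_DEPENDENT:
        return True
    if reference_type == REF_INDEPENDENT:
        return False
    raise ValueError(f"unknown reference type: {reference_type!r}")
```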
在步骤S301的一些实施例中,若媒体制作设备支持完整文件的下载服务,则媒体消费设备可从媒体制作设备处下载沉浸媒体的媒体文件。在下载媒体文件后,媒体消费设备可对媒体文件进行解封装处理,得到M个视频轨道。然后,媒体消费设备可从M个视频轨道中选取目标视频轨道。具体的,若沉浸媒体是自由视角视频,则媒体消费设备可确定与用户角度相匹配的目标相机,并从M个视频轨道中与目标相机对应的视频轨道作为目标视频轨道,目标视频轨道中封装有目标相机所采集到的视频图像信息和视频深度信息中的一种或多种。In some embodiments of step S301, if the media production device supports a complete file download service, the media consumption device may download the media file of the immersive media from the media production device. After downloading the media file, the media consumption device can decapsulate the media file to obtain M video tracks. The media consumption device can then select a target video track from the M video tracks. Specifically, if the immersive media is a free-view video, the media consumption device can determine a target camera that matches the user's angle, and select the video track corresponding to the target camera from the M video tracks as the target video track, and encapsulate it in the target video track. There is one or more of video image information and video depth information collected by the target camera.
在步骤S301的一些实施例中，若媒体制作设备支持流化传输，则媒体消费设备可与媒体制作设备建立流化传输，并从媒体制作设备处获取沉浸媒体的媒体文件对应的信令描述文件，该信令描述文件中包括至少一个视频辅助信息描述子，视频辅助信息描述子用于描述视频资源的视频辅助信息的相关属性。然后，媒体消费设备可根据信令描述文件的指示，请求媒体制作设备下发目标视频资源；在接收到目标视频资源后，媒体消费设备可对目标视频资源进行解封装处理，得到目标视频轨道。其中，当沉浸媒体为自由视角视频时，媒体消费设备可根据信令描述文件的指示，并结合用户角度来确定目标视频资源，具体如下：In some embodiments of step S301, if the media production device supports streaming transmission, the media consumption device can establish a streaming session with the media production device and obtain from it the signaling description file corresponding to the media file of the immersive media. The signaling description file includes at least one video auxiliary information descriptor, which describes the relevant attributes of the video auxiliary information of a video resource. Then, the media consumption device can request the media production device to deliver the target video resource according to the indications in the signaling description file; after receiving the target video resource, the media consumption device can decapsulate the target video resource to obtain the target video track. When the immersive media is a free-view video, the media consumption device can determine the target video resource according to the indications in the signaling description file combined with the user's viewing angle, as follows:
媒体消费设备可根据用户角度以及信令描述文件中的各个自由视角相机描述子所描述的相机位置信息，从多个相机中选择出与用户角度相匹配的目标相机，并基于信令描述文件中的各个自由视角相机描述子中的自由视角相机标识字段所存储的相机标识符，确定出目标相机对应的自由视角相机描述子。然后，可从信令描述文件中确定与自由视角相机描述子对应的视频辅助信息描述子，并根据确定出的视频辅助信息描述子中的辅助信息指示字段的取值，确定相应视频资源中是否同时包含视频图像信息和视频辅助信息，从而可根据第一确定结果确定目标视频资源。The media consumption device can select, from the multiple cameras, a target camera that matches the user's viewing angle based on the user's angle and the camera position information described by each free-view camera descriptor in the signaling description file, and then determine the free-view camera descriptor corresponding to the target camera based on the camera identifier stored in the free-view camera identification field of each free-view camera descriptor. Next, the video auxiliary information descriptor corresponding to that free-view camera descriptor can be determined from the signaling description file, and based on the value of the auxiliary information indication field in the determined video auxiliary information descriptor, it can be determined whether the corresponding video resource contains both video image information and video auxiliary information, so that the target video resource can be determined based on this first determination result.
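The camera-matching step above can be sketched as a simple nearest-position heuristic. Both the heuristic and the field layout (`camera_id`, `x`, `y`, `z`) are assumptions for illustration; the actual matching policy is left to the client implementation.

```python
import math

def pick_target_camera(user_pos, cameras):
    """Choose the camera whose position is closest to the user viewpoint.

    user_pos: (x, y, z) tuple for the user's viewing position.
    cameras:  list of dicts with assumed keys "camera_id", "x", "y", "z",
              mirroring the camera identifier and position information
              carried by the free-view camera descriptors.
    """
    def dist(cam):
        return math.dist(user_pos, (cam["x"], cam["y"], cam["z"]))
    return min(cameras, key=dist)["camera_id"]
```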
由前述可知，当辅助信息指示字段的取值为第九数值时，指示相应视频资源同时包含视频图像信息和相应的视频辅助信息；当辅助信息指示字段的取值为第十数值时，指示相应视频资源仅包含视频图像信息。那么相应的，若第一确定结果指示相应视频资源中同时包含视频图像信息和视频辅助信息，则媒体消费设备可将确定出的视频辅助信息描述子所描述的视频资源确定为目标视频资源。若第一确定结果指示相应视频资源中仅包含视频图像信息，则由前述可知，此情况下的视频辅助信息描述子包括辅助信息编解码依赖性字段，那么媒体消费设备可进一步根据确定出的视频辅助信息描述子中的辅助信息编解码依赖性字段的取值，确定相应的视频图像信息和视频辅助信息是否存在解码依赖关系，从而可根据第二确定结果确定目标视频资源。As described above, when the auxiliary information indication field takes the ninth value, it indicates that the corresponding video resource contains both video image information and the corresponding video auxiliary information; when it takes the tenth value, it indicates that the corresponding video resource contains only video image information. Accordingly, if the first determination result indicates that the corresponding video resource contains both video image information and video auxiliary information, the media consumption device may determine the video resource described by the determined video auxiliary information descriptor as the target video resource. If the first determination result indicates that the corresponding video resource contains only video image information, then, as described above, the video auxiliary information descriptor in this case includes the auxiliary information codec dependency field, and the media consumption device can further determine, based on the value of that field, whether a decoding dependency exists between the corresponding video image information and video auxiliary information, so that the target video resource can be determined based on this second determination result.
由前述可知，当辅助信息编解码依赖性字段的取值为第十一数值时，指示相应视频资源中的视频图像信息和相应的视频辅助信息之间存在解码依赖关系；当辅助信息编解码依赖性字段的取值为第十二数值时，指示相应视频资源中的视频图像信息和相应的视频辅助信息之间不存在解码依赖关系。那么相应的，若第二确定结果指示相应视频资源中的视频图像信息和相应的视频辅助信息之间不存在解码依赖关系，则媒体消费设备可将确定出的视频辅助信息描述子所描述的视频资源作为目标视频资源；若第二确定结果指示相应视频资源中的视频图像信息和相应的视频辅助信息之间存在解码依赖关系，则媒体消费设备可将确定出的视频辅助信息描述子所描述的视频资源，以及与该视频资源关联的其他视频资源，均作为目标视频资源。此处的其他视频资源是指：与确定出的视频辅助信息描述子所描述的视频资源中的视频图像信息相对应的视频辅助信息所在的视频资源。As described above, when the auxiliary information codec dependency field takes the eleventh value, it indicates that a decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information; when it takes the twelfth value, it indicates that no such decoding dependency exists. Accordingly, if the second determination result indicates that no decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information, the media consumption device may take the video resource described by the determined video auxiliary information descriptor as the target video resource; if the second determination result indicates that such a decoding dependency exists, the media consumption device may take both the video resource described by the determined video auxiliary information descriptor and the other video resources associated with it as the target video resources. The other video resources here refer to the video resources that carry the video auxiliary information corresponding to the video image information in the video resource described by the determined video auxiliary information descriptor.
其中，媒体消费设备可根据信令描述文件中的各个视频辅助信息描述子中的辅助信息标识字段的取值，来识别各个视频资源包含的是视频图像信息，还是视频图像信息对应的视频辅助信息，从而根据识别结果确定与确定出的视频辅助信息描述子所描述的视频资源关联的其他视频资源。其中，当辅助信息标识字段的取值为第十三数值时，指示相应视频资源仅包含视频图像信息对应的视频辅助信息；当辅助信息标识字段的取值为第十四数值时，指示相应视频资源仅包含视频图像信息。The media consumption device can identify, based on the value of the auxiliary information identification field in each video auxiliary information descriptor in the signaling description file, whether each video resource contains video image information or the video auxiliary information corresponding to video image information, and thereby determine the other video resources associated with the video resource described by the determined video auxiliary information descriptor. When the auxiliary information identification field takes the thirteenth value, it indicates that the corresponding video resource contains only the video auxiliary information corresponding to video image information; when it takes the fourteenth value, it indicates that the corresponding video resource contains only video image information.
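The resource-selection flow described in the last few paragraphs can be sketched as follows. The field names mirror the descriptor attributes in the text (with/without depth info, codec dependency), but the dict layout and the helper signature are assumptions, not the normative signaling format.

```python
def select_target_resources(descriptor, associated_depth_resource=None):
    """Pick the resources a client should request for one target camera.

    descriptor: assumed dict for one video auxiliary information descriptor,
        with keys "resource_id", "with_depth_info" (1/0) and, when
        with_depth_info == 0, "depth_codec_dependency" (1/0).
    associated_depth_resource: id of the separate resource that carries the
        corresponding auxiliary (depth) information, if any.
    """
    targets = [descriptor["resource_id"]]
    if descriptor["with_depth_info"] == 1:
        # Image and auxiliary information travel in one resource.
        return targets
    if descriptor.get("depth_codec_dependency") == 1 and associated_depth_resource:
        # Depth lives in a separate resource and decoding depends on it:
        # request the associated resource as well.
        targets.append(associated_depth_resource)
    return targets
```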
在本实施例中，媒体消费设备通过与媒体制作设备建立流化传输，能够根据获取得到的信令描述文件，实时请求媒体制作设备下发目标视频资源，并通过对目标视频资源进行解封装处理，能够迅速且准确地得到目标视频轨道，提高了解封装的效率。In this embodiment, by establishing streaming transmission with the media production device, the media consumption device can request the media production device in real time to deliver the target video resource based on the obtained signaling description file, and by decapsulating the target video resource, it can obtain the target video track quickly and accurately, improving decapsulation efficiency.
S302,对目标视频轨道中的信息进行解码处理。S302. Decode the information in the target video track.
在一些实施例中，若媒体消费设备是通过下载整个媒体文件的方式，获取到目标视频轨道的，则媒体消费设备可根据目标视频轨道的视频媒体头数据盒中的元数据信息，对目标视频轨道中的信息进行解码处理。具体的，元数据信息可包括内容信息标志字段，由前述可知，当内容信息标志字段的取值为第一数值时，指示相应视频轨道同时包含视频图像信息和相应的视频辅助信息；当内容信息标志字段的取值为第二数值时，指示相应视频轨道仅包含视频图像信息。因此，媒体消费设备可先根据目标视频轨道中的视频媒体头数据盒中的内容信息标志字段的取值，确定目标视频轨道中是否同时包含视频图像信息和相应的视频辅助信息；若确定目标视频轨道仅包含视频图像信息，则媒体消费设备可对该视频图像信息进行解码处理。In some embodiments, if the media consumption device obtains the target video track by downloading the entire media file, the media consumption device can decode the information in the target video track based on the metadata information in the video media header data box of the target video track. Specifically, the metadata information may include a content information flag field. As described above, when the content information flag field takes the first value, it indicates that the corresponding video track contains both video image information and the corresponding video auxiliary information; when it takes the second value, it indicates that the corresponding video track contains only video image information. Therefore, the media consumption device can first determine, based on the value of the content information flag field in the video media header data box of the target video track, whether the target video track contains both video image information and the corresponding video auxiliary information; if the target video track contains only video image information, the media consumption device can decode that video image information.
若确定目标视频轨道同时包含视频图像信息和对应的视频辅助信息，则由前述可知，此时的目标视频轨道中的视频媒体头数据盒还包括视频信息编解码依赖性字段，且当视频信息编解码依赖性字段的取值为第三数值时，指示相应视频轨道中的视频图像信息和相应的视频辅助信息之间存在解码依赖关系，当视频信息编解码依赖性字段的取值为第四数值时，指示相应视频轨道中的视频图像信息和相应的视频辅助信息之间不存在解码依赖关系。媒体消费设备可先根据目标视频轨道中的视频媒体头数据盒中的视频信息编解码依赖性字段的取值，确定目标视频轨道中的视频图像信息和相应的视频辅助信息之间是否存在解码依赖关系，从而确定是否对目标视频轨道中的视频图像信息和视频辅助信息进行分段解码处理。具体的，若存在解码依赖关系，则可对目标视频轨道中的视频图像信息和视频辅助信息进行同时解码处理；若不存在解码依赖关系，则可对目标视频轨道中的视频图像信息和视频辅助信息进行分段解码处理，即某一阶段解码纹理信息、某一阶段解码深度信息。当然，当不存在解码依赖关系时，也可对视频图像信息和视频辅助信息进行同时解码处理，对此不作限定。If it is determined that the target video track contains both video image information and the corresponding video auxiliary information, then, as described above, the video media header data box in the target video track further includes a video information codec dependency field. When this field takes the third value, it indicates that a decoding dependency exists between the video image information in the corresponding video track and the corresponding video auxiliary information; when it takes the fourth value, it indicates that no such decoding dependency exists. The media consumption device can first determine, based on the value of the video information codec dependency field in the video media header data box of the target video track, whether a decoding dependency exists between the video image information and the corresponding video auxiliary information in the target video track, and thereby decide whether to decode the video image information and the video auxiliary information in the target video track in stages. Specifically, if a decoding dependency exists, the video image information and the video auxiliary information in the target video track can be decoded jointly; if no decoding dependency exists, they can be decoded in stages, that is, texture information is decoded at one stage and depth information at another. Of course, when no decoding dependency exists, the video image information and the video auxiliary information can also be decoded jointly, which is not limited here.
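The decode-scheduling decision above can be summarized in a small sketch. The flag names follow the fields discussed in this document (`video_with_depth_info_flag`, `depth_codec_dependency`), but the function and its return strings are illustrative, not part of the specification.

```python
def plan_decoding(video_with_depth_info_flag, depth_codec_dependency):
    """Return a coarse decoding plan for one video track.

    video_with_depth_info_flag: 1 if the track carries both image (texture)
        and auxiliary (depth) information, 0 if image only.
    depth_codec_dependency: 1 if depth decoding depends on texture.
    """
    if video_with_depth_info_flag == 0:
        return "decode image only"
    if depth_codec_dependency == 1:
        # Texture and depth must be decoded together.
        return "decode image and depth jointly"
    # Independent coding: the client may decode texture and depth in
    # separate stages (joint decoding also remains possible).
    return "decode image and depth in stages"
```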
在另一些实施例中，若媒体消费设备是通过流化传输的方式获取目标视频资源，并根据目标视频资源获取到目标视频轨道的，由于媒体消费设备在获取目标视频资源时，已根据信令描述文件中的相关信息知道了目标视频资源是否同时包含视频图像信息和视频辅助信息，以及知道了在目标视频资源同时包含视频图像信息和视频辅助信息的情况下，该视频图像信息和视频辅助信息之间是否存在解码依赖关系，因此，媒体消费设备在获取到目标视频轨道后，可直接采用相应逻辑对目标视频轨道中的信息进行解码处理。当然在其他实施例中，媒体消费设备可根据目标视频轨道的视频媒体头数据盒中的元数据信息，对目标视频轨道中的信息进行解码处理，其解码逻辑与上述的解码逻辑相同。In other embodiments, if the media consumption device obtains the target video resource through streaming transmission and obtains the target video track from the target video resource, then, when requesting the target video resource, the media consumption device already knows from the relevant information in the signaling description file whether the target video resource contains both video image information and video auxiliary information and, if so, whether a decoding dependency exists between them. Therefore, after obtaining the target video track, the media consumption device can directly apply the corresponding logic to decode the information in the target video track. Of course, in other embodiments, the media consumption device may decode the information in the target video track based on the metadata information in the video media header data box of the target video track, using the same decoding logic as described above.
可选的，由前述可知，当一个视频轨道对应至少两个相机时，视频轨道中还可包括视频成分依赖信息数据盒，该视频成分依赖信息数据盒可包括成分编解码依赖性字段和成分类型字段；那么相应的，若目标视频轨道中包含该视频成分依赖信息数据盒，则媒体消费设备还可根据该视频成分依赖信息数据盒中的成分类型字段的取值获知沉浸媒体的视频内容的成分类型，以及根据成分编解码依赖性字段的取值来获知沉浸媒体的视频内容的不同成分之间是否存在编解码依赖关系，从而结合获知结果确定是否可对目标视频轨道中的各个相机对应的信息进行独立解码。其中，当成分编解码依赖性字段的取值为第五数值时，指示沉浸媒体的视频内容的不同成分之间存在编解码依赖关系；当成分编解码依赖性字段的取值为第六数值时，指示沉浸媒体的视频内容的不同成分之间不存在编解码依赖关系。当成分类型字段的取值为第七数值时，指示沉浸媒体的视频内容的成分类型为基于相机的成分类型；当成分类型字段的取值为第八数值时，指示所述沉浸媒体的视频内容的成分类型为基于双目的成分类型。Optionally, as described above, when one video track corresponds to at least two cameras, the video track may further include a video composition dependency information data box, which may include a composition codec dependency field and a composition type field. Accordingly, if the target video track contains this data box, the media consumption device can learn the composition type of the video content of the immersive media from the value of the composition type field, and learn from the value of the composition codec dependency field whether a codec dependency exists between different compositions of the video content of the immersive media, and can then combine these results to determine whether the information corresponding to each camera in the target video track can be decoded independently. When the composition codec dependency field takes the fifth value, it indicates that a codec dependency exists between different compositions of the video content of the immersive media; when it takes the sixth value, it indicates that no such dependency exists. When the composition type field takes the seventh value, it indicates that the composition type of the video content of the immersive media is a camera-based composition type; when it takes the eighth value, it indicates that the composition type is a binocular-based composition type.
本申请实施例可避免媒体消费设备在解封装过程中重复解析其他数据盒中用于指示视频轨道内的信息情况的相关字段,从而节省处理资源,提升解封装效率,进而提升解码效率。Embodiments of the present application can avoid the media consumption device from repeatedly parsing the relevant fields in other data boxes used to indicate the information in the video track during the decapsulation process, thereby saving processing resources, improving decapsulation efficiency, and thus improving decoding efficiency.
基于上述描述，以沉浸媒体为自由视角视频为例，且以媒体制作设备为服务器，且由媒体消费设备中的客户端进行媒体消费为例，本申请实施例示例性地提出了如下三种应用场景：Based on the above description, taking the immersive media being a free-view video as an example, with the media production device being a server and media consumption being performed by the client in the media consumption device, the embodiments of this application exemplarily propose the following three application scenarios:
(一)应用场景1:单轨封装:(1) Application scenario 1: Monorail packaging:
1、服务器：生成自由视角视频的视频位流，并将该视频位流封装为媒体文件F0，且自由视角视频位流以单轨模式进行封装，如图4a。1. Server: generates the video bitstream of the free-view video and encapsulates it into a media file F0, and the free-view video bitstream is encapsulated in single-track mode, as shown in Figure 4a, where:
video_with_depth_info_flag=1,表明该视频轨道为包含纹理信息和深度信息的轨道;video_with_depth_info_flag=1, indicating that the video track is a track containing texture information and depth information;
depth_codec_dependency=0,表明该视频轨道的深度信息在解码时不依赖纹理信息;depth_codec_dependency=0, indicating that the depth information of the video track does not depend on texture information during decoding;
composition_type=0,表明该视频轨道的视频内容包含不同相机拍摄的视频内容;composition_type=0, indicating that the video content of this video track contains video content shot by different cameras;
composition_codec_dependency=0,表明该视频轨道中不同相机拍摄的视频内容之间在解码时并不相互依赖。composition_codec_dependency=0, indicating that the video content captured by different cameras in the video track does not depend on each other during decoding.
2、服务器,如图4b所示:2. Server, as shown in Figure 4b:
服务器S1根据1中的信息,进一步将F0重新封装为多视频轨道的自由视角文件F1,对多轨模式的处理可参见后续应用场景2和应用场景3的相关描述。Based on the information in 1, server S1 further repackages F0 into a multi-video track free-view file F1. For the processing of multi-track mode, please refer to the relevant descriptions of subsequent application scenarios 2 and 3.
服务器S2提供下载服务,对自由视角视频的媒体文件F0不做处理,供用户下载。The server S2 provides a download service and does not process the media file F0 of the free-angle video for users to download.
服务器S3支持流化传输，将自由视角视频的媒体文件F0切片为适合流化传输的媒体片段（按照现有传输标准的规范进行），并生成相应的传输信令，并将传输信令的信令描述文件传输给用户。其中，传输信令的信令描述文件中指示以下信息：Server S3 supports streaming transmission; it slices the media file F0 of the free-view video into media segments suitable for streaming (in accordance with existing transmission standards), generates the corresponding transmission signaling, and transmits the signaling description file of the transmission signaling to the user. The signaling description file of the transmission signaling indicates the following information:
VideoAuxiliaryInfo@withDepthInfo=1,表明视频资源同时包含纹理信息(图像信息)以及纹理信息(图像信息)对应的深度信息;VideoAuxiliaryInfo@withDepthInfo=1, indicating that the video resource contains both texture information (image information) and depth information corresponding to the texture information (image information);
VideoAuxiliaryInfo@depthCodecDependency=0,表明相应视频资源中包含的纹理信息(图像信息)和纹理信息(图像信息)对应的深度信息之间不存在解码依赖关系。VideoAuxiliaryInfo@depthCodecDependency=0, indicating that there is no decoding dependency between the texture information (image information) contained in the corresponding video resource and the depth information corresponding to the texture information (image information).
3、客户端:3. Client:
a)客户端C1下载完整文件F0后本地播放。根据文件中关于纹理、深度信息以及编解码依赖信息的指示,客户端可以在不同的阶段进行部分解码,即某一阶段解码纹理信息、某一阶段解码深度信息,以优化客户端播放的性能。a) Client C1 downloads the complete file F0 and plays it locally. According to the instructions in the file about texture, depth information and codec dependency information, the client can perform partial decoding at different stages, that is, decoding texture information at a certain stage and decoding depth information at a certain stage to optimize the performance of client playback.
b)客户端C2与服务器S3建立流化传输，根据传输信令中指示的纹理信息、深度信息以及编解码依赖信息，选择相应的自由视角视频流进行消费。同时，客户端可以在不同的阶段进行部分解码，即某一阶段解码纹理信息、某一阶段解码深度信息，以优化客户端播放的性能。其中，服务器可生成多个媒体文件，一个媒体文件对应一个自由视角视频流，一部分媒体文件中的视频图像信息和视频辅助信息之间存在解码依赖，另一部分媒体文件中的视频图像信息和视频辅助信息之间不存在解码依赖。b) Client C2 establishes streaming transmission with server S3 and, based on the texture information, depth information, and codec dependency information indicated in the transmission signaling, selects the corresponding free-view video stream for consumption. Meanwhile, the client can perform partial decoding at different stages, that is, decoding texture information at one stage and depth information at another, to optimize the client's playback performance. The server may generate multiple media files, each corresponding to one free-view video stream; in some of the media files a decoding dependency exists between the video image information and the video auxiliary information, while in the others no such decoding dependency exists.
(二)应用场景2:多轨封装,且每个视频轨道包含一个相机拍摄的纹理信息和深度信息:(2) Application scenario 2: Multi-track packaging, and each video track contains texture information and depth information captured by a camera:
1、服务器：生成自由视角视频的视频位流并将该视频位流封装为媒体文件F0，且自由视角视频位流以多轨模式进行封装，如图5a。1. Server: generates the video bitstream of the free-view video and encapsulates it into a media file F0, and the free-view video bitstream is encapsulated in multi-track mode, as shown in Figure 5a, where:
其中Track1~Track4中的video_with_depth_info_flag=1,表明这些视频轨道均为包含深度信息的视频轨道;depth_codec_dependency=1,表明这些视频轨道的深度信息均在解码时依赖纹理信息。Among them, video_with_depth_info_flag=1 in Track1~Track4 indicates that these video tracks are all video tracks containing depth information; depth_codec_dependency=1 indicates that the depth information of these video tracks depends on texture information during decoding.
并且,Track1~Track4中都包含轨道组数据盒(AvsFreeViewGroupBox),且track_group_id均为100,说明Track1~Track4为同一个自由视角视频的包含不同相机拍摄数据的K个轨道。Moreover, Track1~Track4 all contain track group data boxes (AvsFreeViewGroupBox), and track_group_id is 100, indicating that Track1~Track4 are K tracks of the same free-view video containing different camera shooting data.
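The track grouping described above (Track1~Track4 sharing track_group_id 100) can be sketched as a simple grouping pass over the parsed track metadata. The dict layout is an assumption for illustration; in practice the fields come from each track's AvsFreeViewGroupBox.

```python
from collections import defaultdict

def group_free_view_tracks(tracks):
    """Group tracks of the same free-view video by track_group_id.

    tracks: list of assumed dicts with keys "track_group_id" and
        "camera_id", mirroring the track group data box fields.
    Returns a mapping track_group_id -> list of camera_ids.
    """
    groups = defaultdict(list)
    for t in tracks:
        groups[t["track_group_id"]].append(t["camera_id"])
    return dict(groups)
```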
2、服务器,如图5b所示:2. Server, as shown in Figure 5b:
服务器S1提供下载服务,对自由视角视频的媒体文件F0不做处理,供用户下载。The server S1 provides a download service and does not process the media file F0 of the free-angle video for users to download.
服务器S2支持流化传输，将自由视角视频的媒体文件F0切片为适合流化传输的媒体片段（按照现有传输标准的规范进行），并生成相应的传输信令，并将传输信令的信令描述文件传输给用户。其中，传输信令的信令描述文件中指示以下信息：Server S2 supports streaming transmission; it slices the media file F0 of the free-view video into media segments suitable for streaming (in accordance with existing transmission standards), generates the corresponding transmission signaling, and transmits the signaling description file of the transmission signaling to the user. The signaling description file of the transmission signaling indicates the following information:
VideoAuxiliaryInfo@withDepthInfo=1,表明视频资源同时包含纹理信息(图像信息)以及纹理信息(图像信息)对应的深度信息;VideoAuxiliaryInfo@withDepthInfo=1, indicating that the video resource contains both texture information (image information) and depth information corresponding to the texture information (image information);
VideoAuxiliaryInfo@depthCodecDependency=1,表明相应视频资源中包含的纹理信息(图像信息)和纹理信息(图像信息)对应的深度信息之间存在解码依赖关系;VideoAuxiliaryInfo@depthCodecDependency=1, indicating that there is a decoding dependency between the texture information (image information) contained in the corresponding video resource and the depth information corresponding to the texture information (image information);
同时,AvsFreeViewCamInfo@sourceId指示不同视频资源属于同一个视频内容,AvsFreeViewCamInfo@cameraId和AvsFreeViewCam@camera_pos_x/y/z,分别指示相机标识符以及相机位置信息。At the same time, AvsFreeViewCamInfo@sourceId indicates that different video resources belong to the same video content, and AvsFreeViewCamInfo@cameraId and AvsFreeViewCam@camera_pos_x/y/z indicate the camera identifier and camera location information respectively.
3、客户端:3. Client:
a)客户端C1下载完整文件F0后本地播放。根据文件中关于纹理、深度信息、编解码依赖信息以及相机信息的指示，客户端可以根据用户的观看操作，仅对特定相机对应的纹理信息和深度信息进行部分解码（此时由于纹理深度信息相互依赖，因此无法分阶段解码），以优化客户端播放的性能。a) Client C1 downloads the complete file F0 and plays it locally. Based on the indications in the file about texture information, depth information, codec dependency information, and camera information, the client can, according to the user's viewing operation, partially decode only the texture information and depth information corresponding to a specific camera (in this case, since the texture and depth information depend on each other, they cannot be decoded in stages), to optimize the client's playback performance.
b)客户端C2与服务器S2建立流化传输，根据传输信令中关于纹理、深度信息、编解码依赖信息以及相机信息的指示，结合用户的观看操作，请求服务器仅对包含特定相机对应的纹理信息和深度信息的视频资源进行传输，并在接收到相应视频资源后进行解码消费。b) Client C2 establishes streaming transmission with server S2 and, based on the indications about texture information, depth information, codec dependency information, and camera information in the transmission signaling, combined with the user's viewing operation, requests the server to transmit only the video resources containing the texture information and depth information corresponding to the specific camera, and decodes and consumes the corresponding video resources after receiving them.
(三)应用场景3:多轨封装,且每个视频轨道包含一个相机拍摄的纹理信息或者深度信息:(3) Application scenario 3: Multi-track packaging, and each video track contains texture information or depth information captured by a camera:
1、服务器：生成自由视角视频的视频位流并将该视频位流封装为媒体文件F0，且自由视角视频位流以多轨模式进行封装，如图6a。1. Server: generates the video bitstream of the free-view video and encapsulates it into a media file F0, and the free-view video bitstream is encapsulated in multi-track mode, as shown in Figure 6a, where:
其中Track1~Track4中的video_with_depth_info_flag=0，表明这些视频轨道均只包含纹理信息。Track11~Track14中handler box中的处理类型为辅助视频轨道，表明这些视频轨道（即视频辅助信息轨道）均用于辅助某个视频图像信息轨道（即纹理图轨道）进行消费。通过TrackReferenceTypeBox中的信息，可以知道Track1~Track4、Track11~Track14这些视频轨道之间的对应关系。且通过轨道索引类型'vdpi'可以知道Track11~Track14包含的视频辅助信息为深度信息，且深度信息的解码不依赖于对应的纹理信息。Here, video_with_depth_info_flag=0 in Track1~Track4 indicates that these video tracks contain only texture information. The handler type in the handler box of Track11~Track14 is auxiliary video track, indicating that these video tracks (i.e., video auxiliary information tracks) are used to assist the consumption of a certain video image information track (i.e., texture map track). Through the information in the TrackReferenceTypeBox, the correspondence between the tracks Track1~Track4 and Track11~Track14 can be determined. Moreover, the track index type 'vdpi' indicates that the video auxiliary information contained in Track11~Track14 is depth information, and that decoding the depth information does not depend on the corresponding texture information.
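The texture-to-depth track correspondence established through the TrackReferenceTypeBox can be sketched as follows. The flat dict layout is an assumption for illustration; in practice the track_id and referenced track_id come from parsing each auxiliary track's reference box, and 'vdpi' is the track reference type named in the text.

```python
def map_depth_to_texture(aux_tracks):
    """Map each depth (auxiliary) track to its referenced texture track.

    aux_tracks: list of assumed dicts with keys "track_id",
        "ref_track_id" (the associated image track), and "ref_type"
        (the track reference type from the TrackReferenceTypeBox).
    Only 'vdpi'-type references are kept, i.e. independently decodable
    depth tracks as in this scenario.
    """
    return {t["track_id"]: t["ref_track_id"]
            for t in aux_tracks
            if t.get("ref_type") == "vdpi"}
```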
并且,Track1~Track4中都包含轨道组数据盒(AvsFreeViewGroupBox),且track_group_id均为100,说明Track1~Track4为同一个自由视角视频的包含不同相机拍摄数据的K个轨道。Moreover, Track1~Track4 all contain track group data boxes (AvsFreeViewGroupBox), and track_group_id is 100, indicating that Track1~Track4 are K tracks of the same free-view video containing different camera shooting data.
2. Server, as shown in Figure 6b:
Server S1 provides a download service; it makes the media file F0 of the free-view video available for users to download without any processing.
Server S2 supports streaming transmission: it slices the media file F0 of the free-view video into media segments suitable for streaming (following the specifications of the existing transport standards), generates the corresponding transport signaling, and delivers the signaling description file of the transport signaling to the user. The signaling description file indicates the following information:
Media resources corresponding to Track1~Track4:
VideoAuxiliaryInfo@withDepthInfo=0, indicating that the corresponding video resources contain only texture information;
AvsFreeViewCamInfo@sourceId=100, indicating that the different video resources belong to the same video content;
AvsFreeViewCamInfo@cameraId and AvsFreeViewCam@camera_pos_x/y/z indicate the camera identifier and the camera position information.
Media resources corresponding to Track11~Track14:
VideoAuxiliaryInfo@onlyDepthInfo=1, indicating that the corresponding video resources contain only depth information;
Meanwhile, through the @dependencyId attribute defined in the prior art, the media resources corresponding to Track1~Track4 can be associated with the media resources corresponding to Track11~Track14.
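As a rough sketch of the signaling entries server S2 would generate, the following models each media resource as a flat attribute dictionary. The attribute names follow this document; the dictionary layout and the helper function are illustrative assumptions, not the actual signaling file format.

```python
# Hypothetical model of the signaling description entries for scenario 3.
representations = []
for cam in range(1, 5):
    representations.append({
        "id": f"texture_cam{cam}",
        "VideoAuxiliaryInfo@withDepthInfo": 0,   # texture only
        "AvsFreeViewCamInfo@sourceId": 100,       # same video content
        "AvsFreeViewCamInfo@cameraId": cam,
    })
    representations.append({
        "id": f"depth_cam{cam}",
        "VideoAuxiliaryInfo@onlyDepthInfo": 1,    # depth only
        "@dependencyId": f"texture_cam{cam}",     # associates depth with its texture
    })

def resources_for_camera(cam_id):
    """Find the texture resource for a camera plus every resource depending on it."""
    tex = next(r for r in representations
               if r.get("AvsFreeViewCamInfo@cameraId") == cam_id)
    deps = [r for r in representations if r.get("@dependencyId") == tex["id"]]
    return [tex] + deps

print([r["id"] for r in resources_for_camera(3)])  # ['texture_cam3', 'depth_cam3']
```

The point of the @dependencyId association is visible in `resources_for_camera`: given only a camera identifier, the client can resolve both the texture resource and its paired depth resource without parsing the media file itself.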
3. Client:
a) Client C1 downloads the complete file F0 and plays it locally. Based on the indications in the file regarding texture information, depth information, codec dependency information, and camera information, the client can partially decode only the texture information and depth information corresponding to specific cameras according to the user's viewing operations. Moreover, since the texture information and the corresponding depth information are coded independently, the client can also decode the texture information and the depth information in stages, so as to optimize playback performance on the client side.
b) Client C2 establishes streaming transmission with server S3. Based on the indications in the transport signaling regarding texture information, depth information, codec dependency information, and camera information, and in combination with the user's viewing operations, it requests the server to transmit only the video resources containing the texture information and depth information corresponding to specific cameras; after receiving those video resources, it can decode and consume the texture information and depth information in stages.
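The per-camera selection and staged decoding described for the clients can be sketched as below. All names are hypothetical stand-ins for the client's real request and decode APIs; only the two-stage structure (texture first, depth afterwards, exploiting their codec independence) reflects the document.

```python
# Illustrative client flow for scenario 3: select only the resources of the
# camera being viewed, then consume texture and depth in separate stages.
def consume_camera(camera_id, resources):
    # Stage 0: keep only the resources tied to the viewed camera
    # (possible because the signaling exposes per-camera information).
    wanted = [r for r in resources if r["camera_id"] == camera_id]
    # Stage 1: decode texture first, so an image can be presented immediately.
    texture = [r["payload"] for r in wanted if r["kind"] == "texture"]
    # Stage 2: decode depth afterwards; no codec dependency on the texture.
    depth = [r["payload"] for r in wanted if r["kind"] == "depth"]
    return texture, depth

resources = [
    {"camera_id": c, "kind": k, "payload": f"{k}{c}"}
    for c in (1, 2) for k in ("texture", "depth")
]
tex, dep = consume_camera(2, resources)
print(tex, dep)  # ['texture2'] ['depth2']
```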
It should be understood that, although the steps in the flowcharts involved in the above embodiments are displayed sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not subject to a strict order restriction, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts involved in the above embodiments may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; the execution order of these sub-steps or stages is not necessarily sequential either, and they may be executed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
Based on the above description of the immersive media processing method shown in Figure 2, an embodiment of the present application further discloses an immersive media processing apparatus. The immersive media processing apparatus may be a computer-readable instruction (including program code) running on the above-mentioned media production device. The apparatus may execute the method shown in Figure 2; referring to Figure 7, the immersive media processing apparatus may run the following units:
a processing unit 701, configured to acquire a video bitstream of immersive media, the immersive media being captured by multiple cameras, the information collected by any camera through capture including video image information and corresponding video auxiliary information;
an encapsulation unit 702, configured to encapsulate the video bitstream into M video tracks to obtain a media file of the immersive media, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
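A minimal sketch of how the two units of Figure 7 could fit together is given below. The class and method names are assumptions for illustration; only the division of labor (acquire per-camera bitstream, then encapsulate with a per-track flag in the video media header box) mirrors the document.

```python
# Hypothetical sketch of the production-side apparatus of Figure 7.
class ProcessingUnit:                     # unit 701
    def acquire_bitstream(self, cameras):
        # Each camera contributes video image info plus video auxiliary info.
        return [{"camera": c, "image": f"img{c}", "aux": f"depth{c}"}
                for c in cameras]

class EncapsulationUnit:                  # unit 702
    def encapsulate(self, bitstream, split_tracks=False):
        tracks = []
        for item in bitstream:
            if split_tracks:
                # Image and auxiliary info go into separate tracks.
                tracks.append({"vmhd": {"with_aux_info": 0},
                               "data": [item["image"]]})
                tracks.append({"handler": "auxv", "data": [item["aux"]]})
            else:
                # One track carries both kinds; flag it in the media header box.
                tracks.append({"vmhd": {"with_aux_info": 1},
                               "data": [item["image"], item["aux"]]})
        return tracks

media_file = EncapsulationUnit().encapsulate(
    ProcessingUnit().acquire_bitstream([1, 2]))
print(len(media_file), media_file[0]["vmhd"]["with_aux_info"])  # 2 1
```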
Based on the above description of the immersive media processing method shown in Figure 3, an embodiment of the present application further discloses an immersive media processing apparatus. The immersive media processing apparatus may be a computer-readable instruction (including program code) running on the above-mentioned media consumption device. The apparatus may execute the method shown in Figure 3; referring to Figure 8, the immersive media processing apparatus may run the following units:
an acquisition unit 801, configured to acquire a target video track in a media file of immersive media, the immersive media being captured by multiple cameras, the information collected by any camera through capture including video image information and corresponding video auxiliary information, and the media file being obtained by encapsulating the video bitstream of the immersive media into M video tracks, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;
a decoding unit 802, configured to decode the information in the target video track.
In some embodiments, the media file is produced and encapsulated by a media production device; after the video bitstream is encapsulated into the M video tracks, a video resource corresponding to each video track is obtained, and the media file includes the video resources corresponding to the M video tracks.
Correspondingly, when acquiring the target video track in the media file of the immersive media, the acquisition unit 801 may be specifically configured to:
establish streaming transmission with the media production device;
obtain, from the media production device, a signaling description file corresponding to the media file of the immersive media, the signaling description file including at least one video auxiliary information descriptor, the video auxiliary information descriptor being used to describe relevant attributes of the video auxiliary information of a video resource;
request, according to the indications of the signaling description file, the media production device to deliver a target video resource, and decapsulate the target video resource to obtain the target video track.
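The streaming path of acquisition unit 801 can be sketched as three steps: fetch the signaling description file, use the video auxiliary information descriptors to choose a target resource, then decapsulate it into a track. The function names and the descriptor dictionary shape below are illustrative assumptions.

```python
# Hedged sketch of the consumption-side acquisition flow (unit 801).
def fetch_signaling():
    # Stand-in for receiving the signaling description file from the producer.
    return [
        {"resource": "R1", "VideoAuxiliaryInfo": {"withDepthInfo": 1}},
        {"resource": "R2", "VideoAuxiliaryInfo": {"withDepthInfo": 0}},
    ]

def pick_target(signaling, want_depth):
    # Choose a resource whose descriptor matches the client's needs.
    for entry in signaling:
        if entry["VideoAuxiliaryInfo"]["withDepthInfo"] == int(want_depth):
            return entry["resource"]
    return None

def decapsulate(resource):
    # Stand-in for unpacking the delivered video resource into its track.
    return {"track": f"track_of_{resource}"}

target = pick_target(fetch_signaling(), want_depth=True)
track = decapsulate(target)
print(target, track["track"])  # R1 track_of_R1
```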
According to another embodiment of the present application, the units in the immersive media processing apparatus shown in Figure 7 or Figure 8 may be separately or entirely combined into one or several additional units, or one or some of the units may be further split into multiple functionally smaller units; this can achieve the same operations without affecting the realization of the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may also be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the immersive media processing apparatus may also include other units; in practical applications, these functions may also be realized with the assistance of other units, and may be realized by multiple units in cooperation.
According to another embodiment of the present application, the immersive media processing apparatus shown in Figure 7 or Figure 8 may be constructed, and the immersive media processing method of the embodiments of the present application may be implemented, by running computer-readable instructions (including program code) capable of executing the steps of the corresponding method shown in Figure 2 or Figure 3 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM), and a read-only memory (ROM). The computer-readable instructions may be recorded on, for example, a computer-readable storage medium, loaded into the above computing device through the computer-readable storage medium, and run therein.
In the embodiments of the present application, a video bitstream of immersive media can be acquired and encapsulated into M video tracks to obtain a media file of the immersive media. In this process, metadata information can be encapsulated in the video media header data box of at least one video track to give a unified indication of the information contained in the corresponding video track. This makes it possible to organize the media content of immersive media in a more optimized way and prevents the media consumption device from repeatedly parsing, during decapsulation, the fields in other data boxes that indicate the information contained in a video track, thereby saving processing resources, improving decapsulation efficiency, and in turn improving decoding efficiency.
Based on the description of the above method embodiments and apparatus embodiments, an embodiment of the present application further provides a media processing device, which may be the above-mentioned media production device or media consumption device. Referring to Figure 9, the media processing device includes at least a processor 901, an input interface 902, an output interface 903, and a computer-readable storage medium 904, which may be connected by a bus or in other ways. The computer-readable storage medium 904 may be stored in a memory of the media processing device and is used to store computer-readable instructions, which include program instructions; the processor 901 is used to execute the program instructions stored in the computer-readable storage medium 904. The processor 901 (or CPU, Central Processing Unit) is the computing core and control core of the media processing device; it is adapted to implement one or more computer-readable instructions, and is specifically adapted to load and execute one or more computer-readable instructions so as to implement the corresponding method flows or corresponding functions.
In one embodiment, when the media processing device is a media production device, the processor 901 described in the embodiments of the present application may be used to perform a series of immersive media processing operations shown in Figure 2, specifically including: acquiring a video bitstream of immersive media, the immersive media being captured by multiple cameras, the information collected by any camera through capture including video image information and corresponding video auxiliary information; and encapsulating the video bitstream into M video tracks to obtain a media file of the immersive media, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and so on.
In another embodiment, when the media processing device is a media consumption device, the processor 901 described in the embodiments of the present application may be used to perform a series of immersive media processing operations shown in Figure 3, specifically including: acquiring a target video track in a media file of immersive media, the immersive media being captured by multiple cameras, the information collected by any camera through capture including video image information and corresponding video auxiliary information, and the media file being obtained by encapsulating the video bitstream of the immersive media into M video tracks, where M is a positive integer and, among the M video tracks, the video media header data box of at least one video track includes metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information; and decoding the information in the target video track; and so on.
An embodiment of the present application further provides a computer-readable storage medium (memory), which is a memory device in the media processing device used to store programs and data. It can be understood that the computer-readable storage medium here may include a built-in storage medium in the media processing device, and may of course also include an extended storage medium supported by the media processing device. The computer-readable storage medium provides storage space, which stores the operating system of the media processing device. Furthermore, one or more computer-readable instructions suitable for being loaded and executed by the processor are also stored in the storage space; these computer-readable instructions may be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here may be a high-speed RAM memory, or a non-volatile memory such as at least one magnetic disk memory; optionally, it may also be at least one computer-readable storage medium located remotely from the aforementioned processor.
In one embodiment, one or more computer-readable instructions stored in the computer-readable storage medium may be loaded and executed by the processor to implement the corresponding steps of the method in the above embodiment of the immersive media processing method shown in Figure 2. In another embodiment, one or more computer-readable instructions stored in the computer-readable storage medium may be loaded and executed by the processor to implement the corresponding steps of the method in the above embodiment of the immersive media processing method shown in Figure 3.
In the embodiments of the present application, a video bitstream of immersive media can be acquired and encapsulated into M video tracks to obtain a media file of the immersive media. In this process, metadata information can be encapsulated in the video media header data box of at least one video track to give a unified indication of the information contained in the corresponding video track. This makes it possible to organize the media content of immersive media in a more optimized way and prevents the media consumption device from repeatedly parsing, during decapsulation, the fields in other data boxes that indicate the information contained in a video track, thereby saving processing resources, improving decapsulation efficiency, and in turn improving decoding efficiency.
It should be noted that, according to one aspect of the present application, a computer program product or computer program is also provided, which includes computer-readable instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer-readable instructions from the computer-readable storage medium, and the processor executes the computer-readable instructions, causing the computer device to perform the methods provided in the various optional implementations of the method embodiments shown in Figure 2 or Figure 3.
Furthermore, it should be understood that the above disclosure is merely a preferred embodiment of the present application and certainly cannot be used to limit the scope of the rights of the present application; therefore, equivalent changes made according to the claims of the present application still fall within the scope covered by the present application.
The technical features of the above embodiments may be combined in any way. For brevity of description, not all possible combinations of the technical features of the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, the combination should be regarded as falling within the scope of this specification.
The above embodiments merely express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent application. It should be noted that, for a person of ordinary skill in the art, several modifications and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims (31)

  1. An immersive media processing method, performed by a media processing device, comprising:
    acquiring a video bitstream of immersive media, the immersive media being captured by multiple cameras, the information collected by any camera through capture comprising video image information and corresponding video auxiliary information;
    encapsulating the video bitstream into M video tracks to obtain a media file of the immersive media, wherein M is a positive integer and, among the M video tracks, a video media header data box of at least one video track comprises metadata information used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
  2. The method of claim 1, wherein, if an m-th video track among the M video tracks encapsulates video image information of at least one camera, the video media header data box of the m-th video track comprises metadata information, m∈[1,M].
  3. The method of claim 2, wherein the metadata information comprises a content information flag field;
    when the value of the content information flag field is a first value, it indicates that the corresponding video track contains both video image information and corresponding video auxiliary information;
    when the value of the content information flag field is a second value, it indicates that the corresponding video track contains only video image information.
  4. The method of claim 3, wherein the content information flag field is any one of the following: a video depth information flag field, a video disparity information flag field, or a video auxiliary information flag field, wherein:
    when the content information flag field is a video depth information flag field, the video image information comprises texture information and the video auxiliary information comprises depth information;
    when the content information flag field is a video disparity information flag field, the video auxiliary information comprises disparity information;
    when the content information flag field is a video auxiliary information flag field, the video auxiliary information comprises depth information, disparity information, or auxiliary information defined by other applications.
  5. The method of any one of claims 1-4, wherein, if the metadata information indicates that the corresponding video track contains both video image information and corresponding video auxiliary information, the video media header data box in the corresponding video track further comprises a video information codec dependency field;
    when the value of the video information codec dependency field is a third value, it indicates that a decoding dependency exists between the video image information in the corresponding video track and the corresponding video auxiliary information;
    when the value of the video information codec dependency field is a fourth value, it indicates that no decoding dependency exists between the video image information in the corresponding video track and the corresponding video auxiliary information.
  6. The method of claim 5, wherein the video information codec dependency field is any one of the following: a depth codec dependency field, a disparity codec dependency field, or an auxiliary codec dependency field, wherein:
    when the video information codec dependency field is a depth codec dependency field, the video image information comprises texture information and the video auxiliary information comprises depth information;
    when the video information codec dependency field is a disparity codec dependency field, the video auxiliary information comprises disparity information;
    when the video information codec dependency field is an auxiliary codec dependency field, the video auxiliary information comprises depth information, disparity information, or auxiliary information defined by other applications.
  7. The method of any one of claims 1-4, wherein, among the M video tracks, at least one video track comprises a video component dependency information data box;
    wherein the video component dependency information data box is used to indicate: the component type of the video content of the immersive media, and whether a codec dependency exists between different components of the video content of the immersive media.
  8. The method of claim 7, wherein, when M=1, the M video tracks comprise the video component dependency information data box;
    when M>1, if an m-th video track among the M video tracks encapsulates one or more kinds of information collected by at least two cameras, the m-th video track comprises the video component dependency information data box.
  9. The method of claim 7, wherein the video component dependency information data box comprises a component codec dependency field;
    when the value of the component codec dependency field is a fifth value, it indicates that a codec dependency exists between different components of the video content of the immersive media;
    when the value of the component codec dependency field is a sixth value, it indicates that no codec dependency exists between different components of the video content of the immersive media.
  10. The method of claim 7, wherein the video component dependency information data box comprises a component type field;
    when the value of the component type field is a seventh value, it indicates that the component type of the video content of the immersive media is a camera-based component type, wherein one component comprises the video content captured by one camera;
    when the value of the component type field is an eighth value, it indicates that the component type of the video content of the immersive media is a binocular-based component type, wherein one component comprises left-eye video content or right-eye video content.
  11. The method of any one of claims 1-4, wherein, in the case of M>1, the video tracks among the M video tracks that encapsulate video image information further contain a track group data box;
    wherein the track group data box comprises a track group identifier field and a camera identifier field; the track group identifier field is used to store identification information of the track group corresponding to the respective video track, and the camera identifier field is used to store identification information of the camera corresponding to the respective video track.
  12. The method of any one of claims 1-4, wherein, in the case of M>1, if the video image information and the video auxiliary information of the same camera are encapsulated into different video tracks, the M video tracks comprise: one or more video image information tracks and one or more video auxiliary information tracks;
    wherein a video image information track is a video track that encapsulates video image information, and a video auxiliary information track is a video track that encapsulates video auxiliary information;
    any video auxiliary information track comprises a track reference type data box, the track reference type data box being used to reference or associate the video auxiliary information track to the video image information track corresponding to the same camera, and to indicate whether a dependency exists between the video auxiliary information track and the corresponding video image information track during decoding.
13. The method according to claim 12, wherein the track reference type data box comprises a track identification field, the track identification field being used to store identification information of the video image information track associated with the video auxiliary information track.
14. The method according to claim 12, wherein the track reference type data box comprises a reference type field;
    when the reference type field stores a first track index, it indicates that the video auxiliary information track depends on the corresponding video image information track during decoding;
    when the reference type field stores a second track index, it indicates that the video auxiliary information track does not depend on the corresponding video image information track during decoding.
15. The method according to claim 14, wherein the first track index is any one of the following:
    a first character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises depth information;
    a second character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises disparity information;
    a third character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises depth information, disparity information, or auxiliary information defined by another application;
    and the second track index is any one of the following:
    a fourth character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises depth information;
    a fifth character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises disparity information;
    a sixth character string, used to indicate that the video auxiliary information in the video auxiliary information track comprises depth information, disparity information, or auxiliary information defined by another application.
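Claims 14 and 15 leave the six track-index character strings abstract. As an illustration only, a parser might map hypothetical four-character reference-type codes to a (decode-dependency, auxiliary-information kind) pair as sketched below; none of these codes appear in the claims, and the split into "1…"/"0…" prefixes is an invented naming convention.

```python
# Hypothetical reference-type codes -> (decode-dependent?, kind of aux info).
# The claims only speak of "first" .. "sixth" character strings.
REFERENCE_TYPES = {
    "1dep": (True,  "depth"),       # first string
    "1dis": (True,  "disparity"),   # second string
    "1aux": (True,  "auxiliary"),   # third string
    "0dep": (False, "depth"),       # fourth string
    "0dis": (False, "disparity"),   # fifth string
    "0aux": (False, "auxiliary"),   # sixth string
}

def decoding_dependency(reference_type: str) -> bool:
    """True if the auxiliary track must be decoded together with
    its associated video image information track (claim 14)."""
    depends, _kind = REFERENCE_TYPES[reference_type]
    return depends
```

A player could use `decoding_dependency` to decide whether the auxiliary track can be skipped without fetching the image track's decoder state.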
16. The method according to any one of claims 1 to 4, wherein after the video bitstream is encapsulated into the M video tracks, a video resource corresponding to each video track is obtained, and the media file comprises the video resources corresponding to the M video tracks;
    if the media file is transmitted by streaming, the method further comprises: generating a signaling description file corresponding to the media file;
    wherein the signaling description file comprises at least one video auxiliary information descriptor, the video auxiliary information descriptor being used to describe attributes related to the video auxiliary information of a video resource.
17. The method according to claim 16, wherein when a video auxiliary information descriptor is located at the adaptation set level of the signaling description file, it describes the attributes related to the video auxiliary information of each video resource in the corresponding adaptation set;
    when a video auxiliary information descriptor is located at the representation level of the signaling description file, it describes the attributes related to the video auxiliary information of the corresponding video resource.
18. The method according to claim 16, wherein the video auxiliary information descriptor comprises a video auxiliary information field;
    when the video auxiliary information descriptor is located at the representation level, the video auxiliary information field is used to store the attributes related to the video auxiliary information of the corresponding video resource;
    when the video auxiliary information descriptor is located at the adaptation set level, the video auxiliary information field is used to store the attributes related to the video auxiliary information of the corresponding adaptation set.
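Claims 17 and 18 place the same descriptor at two levels of the signaling description file with different scopes. A minimal sketch of that placement, using DASH-like element names (`MPD`, `AdaptationSet`, `Representation`) as assumptions and an invented `schemeIdUri` and attribute name, might look like this:

```python
import xml.etree.ElementTree as ET

def add_aux_descriptor(parent: ET.Element, aux_info: str) -> ET.Element:
    """Attach a video-auxiliary-information descriptor to either an
    AdaptationSet or a Representation element. The schemeIdUri and the
    'auxInfo' attribute are hypothetical placeholders."""
    desc = ET.SubElement(parent, "Descriptor")
    desc.set("schemeIdUri", "urn:example:video-aux-info")  # hypothetical
    desc.set("auxInfo", aux_info)
    return desc

mpd = ET.Element("MPD")
aset = ET.SubElement(mpd, "AdaptationSet")
rep = ET.SubElement(aset, "Representation", id="r0")

# Adaptation-set level: the attributes apply to every resource inside the set.
add_aux_descriptor(aset, "depth")
# Representation level: the attributes apply to that one video resource only.
add_aux_descriptor(rep, "disparity")
```

The scope rule of claim 17 falls out of the tree position alone: a client walking the document applies the set-level descriptor to all child representations unless a representation-level one overrides it.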
19. The method according to claim 16, wherein the video auxiliary information descriptor comprises an auxiliary information indication field;
    when the value of the auxiliary information indication field is a ninth value, it indicates that the corresponding video resource contains both video image information and corresponding video auxiliary information;
    when the value of the auxiliary information indication field is a tenth value, it indicates that the corresponding video resource contains only video image information.
20. The method according to claim 19, wherein the auxiliary information indication field is any one of the following: a video auxiliary information field with depth information, a video auxiliary information field with disparity information, or a video auxiliary information field with auxiliary information; wherein:
    when the auxiliary information indication field is the video auxiliary information field with depth information, the video image information comprises texture information and the video auxiliary information comprises depth information;
    when the auxiliary information indication field is the video auxiliary information field with disparity information, the video auxiliary information comprises disparity information;
    when the auxiliary information indication field is the video auxiliary information field with auxiliary information, the video auxiliary information comprises depth information, disparity information, or auxiliary information defined by another application.
21. The method according to claim 16, wherein if the video resource described by the video auxiliary information descriptor contains both video image information and corresponding video auxiliary information, the video auxiliary information descriptor comprises an auxiliary information codec dependency field;
    when the value of the auxiliary information codec dependency field is an eleventh value, it indicates that a decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information;
    when the value of the auxiliary information codec dependency field is a twelfth value, it indicates that no decoding dependency exists between the video image information in the corresponding video resource and the corresponding video auxiliary information.
22. The method according to claim 21, wherein the auxiliary information codec dependency field is any one of the following: a depth codec dependency field of the video auxiliary information, a disparity codec dependency field of the video auxiliary information, or an auxiliary codec dependency field of the video auxiliary information; wherein:
    when the auxiliary information codec dependency field is the depth codec dependency field of the video auxiliary information, the video image information comprises texture information and the video auxiliary information comprises depth information;
    when the auxiliary information codec dependency field is the disparity codec dependency field of the video auxiliary information, the video auxiliary information comprises disparity information;
    when the auxiliary information codec dependency field is the auxiliary codec dependency field of the video auxiliary information, the video auxiliary information comprises depth information, disparity information, or auxiliary information defined by another application.
23. The method according to claim 16, wherein the video auxiliary information descriptor comprises a video auxiliary identification field;
    when the value of the video auxiliary identification field is a thirteenth value, it indicates that the corresponding video resource contains only the video auxiliary information corresponding to video image information;
    when the value of the video auxiliary identification field is a fourteenth value, it indicates that the corresponding video resource contains only video image information.
24. The method according to claim 23, wherein the video auxiliary identification field is any one of the following: a depth-information-only flag field of the video auxiliary information, a disparity-information-only flag field of the video auxiliary information, or an auxiliary-information-only flag field of the video auxiliary information; wherein:
    when the video auxiliary identification field is the depth-information-only flag field of the video auxiliary information, the video image information comprises texture information and the video auxiliary information comprises depth information;
    when the video auxiliary identification field is the disparity-information-only flag field of the video auxiliary information, the video auxiliary information comprises disparity information;
    when the video auxiliary identification field is the auxiliary-information-only flag field of the video auxiliary information, the video auxiliary information comprises depth information, disparity information, or auxiliary information defined by another application.
25. The method according to claim 16, wherein the signaling description file further comprises at least one free-view camera descriptor, the free-view camera descriptor being used to describe camera information corresponding to a video resource;
    wherein the free-view camera descriptor comprises a free-view camera source identification field, the value of the free-view camera source identification field being used to indicate the content identifier to which the corresponding video resource belongs, and the video resources corresponding to different cameras under the same video content having the same content identifier;
    or, the free-view camera descriptor comprises a free-view camera combination identification field, the value of the free-view camera combination identification field being used to indicate the combination identifier to which the corresponding video resource belongs, and the video resources corresponding to different cameras under the same video content having the same combination identifier.
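The grouping rule in claim 25 (all cameras of one piece of content share one source or combination identifier) can be sketched in a few lines; the dictionary layout of a "video resource" below is an illustrative stand-in for the real descriptor, not part of the claim.

```python
from collections import defaultdict

def group_by_content(resources):
    """Group video resources by the identifier carried in the free-view
    camera descriptor: per claim 25, resources of different cameras
    under the same video content share one identifier."""
    groups = defaultdict(list)
    for res in resources:
        groups[res["source_id"]].append(res["camera_id"])
    return dict(groups)

# Hypothetical resources: two cameras of content A, one camera of content B.
resources = [
    {"source_id": "contentA", "camera_id": 0},
    {"source_id": "contentA", "camera_id": 1},
    {"source_id": "contentB", "camera_id": 0},
]
```

A free-view client would use the resulting groups to know which camera streams can be combined into one navigable scene.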
26. The method according to claim 16, wherein when M > 1, indicating that the immersive media is encapsulated in multi-track form, the video resource or adaptation set corresponding to each video track is indicated by a preselection tool.
27. A method for processing immersive media, performed by a media processing device, comprising:
    obtaining a target video track in a media file of immersive media, the immersive media being captured by a plurality of cameras, wherein the information collected by any camera through capturing comprises video image information and corresponding video auxiliary information; and the media file is obtained by encapsulating a video bitstream of the immersive media into M video tracks, where M is a positive integer, and among the M video tracks there is at least one video track whose video media header data box comprises metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;
    decoding the information in the target video track.
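The selection step in claim 27 hinges on the metadata flag in the video media header. A minimal client-side sketch, with a plain dict standing in for the real media header data box (the field name `contains_image_and_aux` is an assumption), could read:

```python
def select_target_track(tracks, want_auxiliary: bool):
    """Pick a target track per claim 27: inspect the flag in each track's
    video media header that says whether the track carries both video
    image information and video auxiliary information."""
    for track in tracks:
        has_both = track["media_header"]["contains_image_and_aux"]
        if has_both == want_auxiliary:
            return track
    return None

# Hypothetical two-track file: track 1 image-only, track 2 image + auxiliary.
tracks = [
    {"id": 1, "media_header": {"contains_image_and_aux": False}},
    {"id": 2, "media_header": {"contains_image_and_aux": True}},
]
```

A player that needs depth or disparity data would pass `want_auxiliary=True` and hand the returned track to its decoder; one that only renders texture would pass `False`.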
28. The method according to claim 27, wherein the media file is produced and encapsulated by a media production device; after the video bitstream is encapsulated into the M video tracks, a video resource corresponding to each video track is obtained, and the media file comprises the video resources corresponding to the M video tracks; and the obtaining a target video track in a media file of immersive media comprises:
    establishing streaming transmission with the media production device;
    obtaining, from the media production device, a signaling description file corresponding to the media file of the immersive media, the signaling description file comprising at least one video auxiliary information descriptor, the video auxiliary information descriptor being used to describe attributes related to the video auxiliary information of a video resource;
    requesting, according to the indication of the signaling description file, the media production device to deliver a target video resource; and decapsulating the target video resource to obtain the target video track.
29. An apparatus for processing immersive media, comprising:
    a processing unit, configured to obtain a video bitstream of immersive media, the immersive media being captured by a plurality of cameras, wherein the information collected by any camera through capturing comprises video image information and corresponding video auxiliary information;
    an encapsulation unit, configured to encapsulate the video bitstream into M video tracks to obtain a media file of the immersive media, where M is a positive integer, and among the M video tracks there is at least one video track whose video media header data box comprises metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information.
30. An apparatus for processing immersive media, comprising:
    an obtaining unit, configured to obtain a target video track in a media file of immersive media, the immersive media being captured by a plurality of cameras, wherein the information collected by any camera through capturing comprises video image information and corresponding video auxiliary information; and the media file is obtained by encapsulating a video bitstream of the immersive media into M video tracks, where M is a positive integer, and among the M video tracks there is at least one video track whose video media header data box comprises metadata information, the metadata information being used to indicate whether the corresponding video track contains both video image information and corresponding video auxiliary information;
    a decoding unit, configured to decode the information in the target video track.
31. A media processing device, comprising an input interface and an output interface, and further comprising:
    a processor, adapted to execute one or more computer-readable instructions; and a computer-readable storage medium storing one or more computer-readable instructions;
    wherein the one or more computer-readable instructions are adapted to be loaded by the processor to perform the method for processing immersive media according to any one of claims 1 to 26; or the one or more instructions are adapted to be loaded by the processor to perform the method for processing immersive media according to claim 27 or 28.
PCT/CN2022/137788 2022-04-26 2022-12-09 Immersive media processing method and apparatus, device, and storage medium WO2023207119A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210445405.0 2022-04-26
CN202210445405.0A CN114697631B (en) 2022-04-26 2022-04-26 Immersion medium processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023207119A1 true WO2023207119A1 (en) 2023-11-02

Family

ID=82144412

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/137788 WO2023207119A1 (en) 2022-04-26 2022-12-09 Immersive media processing method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (2) CN116962756A (en)
WO (1) WO2023207119A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116962756A (en) * 2022-04-26 2023-10-27 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for processing immersion medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018058773A1 (en) * 2016-09-30 2018-04-05 华为技术有限公司 Video data processing method and apparatus
CN108696471A (en) * 2017-02-21 2018-10-23 科通环宇(北京)科技有限公司 Code stream packaging method, code stream, coding/decoding method and device based on AVS2
WO2020070379A1 (en) * 2018-10-03 2020-04-09 Nokia Technologies Oy Method and apparatus for storage and signaling of compressed point clouds
WO2021102953A1 (en) * 2019-11-29 2021-06-03 Zte Corporation Multi-view video processing method and apparatus
CN114697631A (en) * 2022-04-26 2022-07-01 腾讯科技(深圳)有限公司 Immersion medium processing method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2989798B1 (en) * 2013-04-23 2018-10-03 Dolby Laboratories Licensing Corporation Transmitting display management metadata over hdmi
WO2021064293A1 (en) * 2019-10-02 2021-04-08 Nokia Technologies Oy Method and apparatus for storage and signaling of sub-sample entry descriptions


Also Published As

Publication number Publication date
CN116962756A (en) 2023-10-27
CN114697631A (en) 2022-07-01
CN114697631B (en) 2023-03-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22939932

Country of ref document: EP

Kind code of ref document: A1