WO2023153473A1

WO2023153473A1 - Media processing device, transmitting device and receiving device

Info

Publication number: WO2023153473A1
Application number: PCT/JP2023/004364
Authority: WO
Inventors: 秀一青木
Original assignee: 日本放送協会
Priority date: 2022-02-10
Filing date: 2023-02-09
Publication date: 2023-08-17

Abstract

This media processing device is provided with: a renderer that generates, on the basis of viewpoint information, a specific content including at least content having a degree of freedom of viewpoint; an output unit that outputs the specific content generated by the renderer to a user terminal; and an acquisition unit that acquires, from a transmitting device, importance information related to each of two or more objects included in the specific content.

Description

Media processing device, transmitting device and receiving device

The present disclosure relates to media processing devices, transmitting devices and receiving devices.

Conventionally, mechanisms for transmitting content such as 360° video and 3D objects have been proposed (for example, Non-Patent Document 1). Known mechanisms of this kind include 3DoF+ (Degree of Freedom), which allows the user to move their head in a sitting position, and 6DoF, which allows the user to freely move their viewpoint. In such a scheme, the positional relationship between the 360° video and the 3D objects is indicated by the scene description.

According to one aspect of the disclosure, a renderer that generates specific content including at least content having a degree of freedom of viewpoint based on viewpoint information; an output unit that outputs the specific content generated by the renderer to a user terminal; a media processing device, comprising: an acquisition unit that acquires importance information about each of two or more objects included in a specific content from a transmission device.

One aspect of the disclosure includes a transmission unit that transmits a configuration of content having a degree of freedom of viewpoint, and the transmission unit transmits importance information regarding each of two or more objects included in specific content that includes at least the content. It is a transmitting device that

One aspect of the disclosure includes a receiving unit that receives a configuration of content having a degree of freedom of viewpoint, and the receiving unit receives importance information regarding each of two or more objects included in specific content that includes at least the content. It is a receiving device that receives

FIG. 1 is a diagram showing a transmission system 10 according to an embodiment. FIG. 2 is a block diagram showing a media processing device 200 and a user terminal 300 according to an embodiment. FIG. 3 is a diagram for explaining the second content according to the embodiment. FIG. 4 is a diagram showing a method for viewing specific content according to the embodiment. FIG. 5 is a diagram for explaining Operation Example 1. FIG. FIG. 6 is a diagram for explaining Operation Example 2. FIG. FIG. 7 is a diagram for explaining Operation Example 2. FIG. FIG. 8 is a diagram for explaining Operation Example 3. FIG. FIG. 9 is a diagram for explaining Operation Example 3. FIG. FIG. 10 is a diagram for explaining Operation Example 3. FIG. FIG. 11 is a diagram for explaining Operation Example 3. FIG. FIG. 12 is a diagram for explaining Operation Example 4. FIG. FIG. 13 is a diagram for explaining Operation Example 4. FIG. FIG. 14 is a diagram for explaining Operation Example 4. FIG. FIG. 15 is a diagram for explaining Operation Example 5. FIG. FIG. 16 is a diagram for explaining Operation Example 5. FIG. FIG. 17 is a diagram for explaining Operation Example 5. FIG. FIG. 18 is a diagram for explaining Operation Example 5. FIG. FIG. 19 is a diagram for explaining the first method according to Modification 1. FIG. FIG. 20 is a diagram for explaining the second method according to Modification 1. FIG.

Next, an embodiment of the present invention will be described. In addition, in the following description of the drawings, the same or similar parts are denoted by the same or similar reference numerals. However, it should be noted that the drawings are schematic, and the ratio of each dimension is different from the actual one.

Therefore, specific dimensions should be determined with reference to the following explanation. In addition, it goes without saying that there are portions with different dimensional relationships and ratios between the drawings.

[Summary of Disclosure]
A media processing device according to an overview of the disclosure includes a renderer that generates specific content including at least content having a degree of freedom of viewpoint based on viewpoint information, and an output that outputs the specific content generated by the renderer to a user terminal. and an acquisition unit configured to acquire importance information about each of two or more objects included in the specific content from a transmission device.

In the outline of the disclosure, a media processing device receives importance information about each of two or more objects included in specific content from a transmitting device. According to such a configuration, by introducing a mechanism for setting the importance of each object such as 360° video and 3D objects, while suppressing transmission traffic, each object included in specific content can be displayed appropriately. can do.

It should be noted that the specific content generated by the media processing device is generated based on viewpoint information, and that the user terminal side can handle the video included in the specific content as 2D video.

[Embodiment]
(transmission system)
A transmission system according to an embodiment will be described below. FIG. 1 is a diagram showing a transmission system 10 according to an embodiment. As shown in FIG. 1, the digital wireless transmission system comprises a transmitting device 100, a media processing device 200 and a user terminal 300. FIG.

In the embodiment, the transmission device 100 transmits to the media processing device 200 the first content without viewpoint freedom and the second content with viewpoint freedom. Furthermore, the transmitting device 100 transmits to the media processing device 200 first control information accompanying the first content and second control information accompanying the second content.

The first content may include at least one of 2D video and audio. The first content and the first control information may be transmitted in the first scheme. The first method may be a method conforming to ISO/IEC 23008-1 (hereinafter referred to as MMT (MPEG Media Transport)). A case in which the first method is MMTP (MMT Protocol) conforming to MMT is exemplified below. The first control information may be called MMT-SI (Signaling Information).

The second content may include 360° video and 3D objects. The second content and second control information may be transmitted by the second method. Alternatively, it may be transmitted by a protocol such as HTTP (Hyper Text Transfer Protocol). The second content may conform to 3DoF+ (Degree of Freedom), in which the user can move his or her head in a sitting position, or 6DoF, in which the user can move freely. Since the second content has a degree of freedom of viewpoint, it may include two or more 360° videos and two or more 3D objects at the same time (frame). The second control information may be called scene description.

Here, the second control information may be transmitted by the first method described above. That is, the second control information may be transmitted by the same first scheme (eg, MMTP) as the first control information. Alternatively, it may be transmitted using a protocol such as HTTP.

Transmission from the transmission device 100 to the media processing device 200 is not particularly limited, but may be transmission using satellite broadcasting, transmission using the Internet network, or mobile communication network. may be used for transmission.

Although not particularly limited, the transmission system may be a digital wireless transmission system. The digital wireless transmission system may be a system used for 4K and 8K satellite broadcasting.

The media processing device 200 generates specific content including at least the second content described above based on the viewpoint information received from the user terminal 300, and transmits the generated specific content to the user terminal 300. Although not particularly limited, the transmission of the specific content may be transmission using the Internet network or transmission using a mobile communication network.

The user terminal 300 may be a user terminal such as a smartphone, tablet terminal, or head-mounted display. As shown in FIG. 1 , two or more user terminals 300 may be provided as user terminals 300 . In other words, two or more user terminals 300 may request the media processing device 200 to generate specific content. Each user terminal 300 may transmit different viewpoint information to the media processing device 200 .

(Media processing device and user terminal)
A media processing device and a user terminal according to embodiments will be described below. FIG. 2 is a block diagram showing a media processing device 200 and a user terminal 300 according to an embodiment.

First, the media processing device 200 has a reception unit 210, a renderer 220, and an encoding processing unit 230.

The reception unit 210 receives viewpoint information. In the embodiment, the receiving unit 210 constitutes a receiving unit that receives viewpoint information from the user terminal 300. FIG. The viewpoint information includes an information element indicating the viewpoint position of the user of the user terminal 300 and an information element indicating the line-of-sight direction of the user of the user terminal 300 .

The renderer 220 generates specific content including at least the second content based on the viewpoint information. Since the specific content is generated based on viewpoint information, it may include one 360° video and one 3D object at the same time (frame). In the following, the specific content will be exemplified for the case where the first content is included in addition to the second content.

As shown in FIG. 2, the renderer 220 generates first content including 2D video and audio as part of specific content based on first control information (MMT-SI). Viewpoint information is not required in generating the first content.

Specifically, the renderer 220 acquires 2D video, audio, and MMT-SI in the form of MMTP packets in which 2D video, audio, and MMT-SI are packetized.

For example, MMTP packets are stored in IP (Internet Protocol) packets. IP packets may be transmitted using UDP (User Datagram Protocol) or may be transmitted using TCP (Transmission Control Protocol).

Here, the first content is processed in units (hereafter referred to as MPU; Media Processing Unit) separated by a certain time width. MPU includes one or more access units. An access unit may also be treated as an MFU (Media Fragment Unit). An MFU for 2D video may be referred to as a NAL (Network Abstraction Layer) unit, and an MFU for audio may be referred to as an MHAS (MPEG-H 3D Audio Stream) packet.

　MMT-SI includes a PA (Package Access) message, and the PA message includes an MPT (MMT Package Table) that lists the first content. Furthermore, MMT-SI includes an MPU timestamp descriptor that indicates the presentation time of the first content. The MPU Timestamp Descriptor may represent the presentation time of the MPU, ie the time of the first presented access unit in the MPU.

The MPU timestamp descriptor may be generated using UTC (Coordinated Universal Time) as the reference time. As the reference time, TAI (International Atomic Time) may be used, or the time provided by GPS (Global Positioning System) may be used. The reference time may be the time provided by an NTP (Network Time Protocol) server or the time provided by a PTP (Precision Time Protocol) server.

Second, the renderer 220 generates second content including 360° video and 3D objects as part of the specific content based on the second control information (scene description). Viewpoint information is used in generating the second content.

Specifically, the renderer 220 may acquire the scene description in the form of MMTP packets in which the scene description is packetized. The acquisition method of 360° video and 3D object is not particularly limited.

　360° video may be converted to 2D video by projective transformation such as ERP (Equirectangular projection) or cube map. Metadata indicating the type of projective transformation applied to the 360° video may be added. A 3D object may be encoded in a mesh format. ISO/IEC 14496-16 “Animation framework extension (AFX)” may be used as mesh format encoding. 3D objects may be encoded in point cloud format. ISO/IEC 23090-5 “Video-based Point Cloud Compression” may be used as point cloud format encoding.

Here, the second content is grouped into one file in units separated by a certain time width. The constant time width may be 500ms. For example, if the frame rate is 60fps (frame per second), one file contains 30 frames.

A scene description is generated for each file and includes information specifying 360° video and 3D objects for each frame. For example, the scene description contains an information element (object_name) indicating the name of the 3D object in the frame, an information element (frame_number) indicating the frame number, an information element (translation_object) indicating the position of the 3D object in the frame, and a rotation of the 3D object in the frame. and an information element (rotation_object) indicating the size of the 3D object in the frame (scale_object).

Third, the renderer 220 outputs specific content including the first content and the second content to the encoding processing unit 230. The renderer 220 may output the presentation time of the specific content to the encoding processing unit 230 together with the specific content.

Here, the presentation time of the specific content may be corrected based on the delay time between the media processing device 200 and the user terminal 300. Specifically, the renderer 220 provides the content from the media processing device 200 to the user terminal 300 based on the presentation time (T) and the delay time (ΔT) of the specific content provided from the transmission device 100 to the media processing device 200. The presentation time (T'=T+ΔT) of the specific content may be calculated. The delay time (ΔT) may be a predetermined value in the media processing device 200, or may be a different value for each user terminal 300. FIG.

Fourth, the renderer 220 may constitute a transmission unit that transmits the viewpoint information used to generate the specific content to the user terminal 300 . Viewpoint information used to generate specific content may be transmitted from the encoding processing unit 230 to the user terminal 300 .

For example, the transmission method for viewpoint information and specific content may be MMTP or HTTP. When MMTP is used as the specific content transmission method, the viewpoint information may be stored as metadata in OMAF (Omnidirectional Media Format) defined by ISO/IEC 23090-2.

The encoding processing unit 230 encodes the specific content generated by the renderer 220. In the embodiment, the encoding processing unit 230 may be an example of a transmitting unit that transmits specific content to the user terminal 300. FIG.

Furthermore, the encoding processing unit 230 may encode the presentation time of the specific content. The encoding processing unit 230 may transmit the information element indicating the presentation time to the user terminal 300 together with the specific content.

Here, as the compression encoding method used by the encoding processing unit 230, any compression encoding method can be used. For example, the compression encoding method may be HEVC (High Efficiency Video Coding) or VVC (Versatile Video Coding).

As described above, the second content included in the specific content is generated based on viewpoint information, so the video included in the specific content can be treated as a 2D video that does not have a degree of freedom of viewpoint.

For example, the transmission control method used to start and end viewing of specific content may include RTSP (Real Time Streaming Protocol). The transmission method may be MMTP or HTTP. When MMTP is used as the transmission method, specific content may be stored in OMAF defined in ISO/IEC 23090-2.

As shown in FIG. 2, the user terminal 300 has a detection unit 310, a decoding processing unit 320, and a renderer 330.

The detection unit 310 detects the user's viewpoint position and line-of-sight direction. The detection unit 310 may include an acceleration sensor and may include a GPS (Global Positioning System) sensor. The detection unit 310 may include a user I/F (eg, touch sensor, keyboard, mouse, controller, etc.) manually input by the user. The detection unit 310 may transmit viewpoint information (viewpoint position and line-of-sight direction) to the media processing device 200 . The detection unit 310 may output viewpoint information (viewport) to the renderer 330 .

The decoding processing unit 320 decodes the specific content received from the media processing device 200. The decoding processing unit 320 may decode the presentation time received from the media processing device 200 . The decoding processing unit 320 may output the specific content to the renderer 330 and may output the presentation time to the renderer 330 .

The renderer 330 outputs the specific content decrypted by the decryption processing unit 320. The renderer 330 may output the specific content based on the presentation time decoded by the decoding processing unit 320. FIG. For example, the renderer 330 may output video content included in the specific content to a display and output audio content included in the specific content to a speaker.

Here, the renderer 330 may generate specific content in which the viewpoint position and line-of-sight direction are corrected based on the difference between the viewpoint information received from the media processing device 200 and the viewpoint information input from the detection unit 310. .

(second content)
The second content according to the embodiment will be described below. Here, the second content at t=0, t=1 and t=2 will be described. The time intervals of t=0, t=1 and t=2 are not particularly limited.

For example, as shown in FIG. 3, at t=0, a 360° video may be displayed without displaying a 3D object. A 360° image may be considered as a background image of a 3D object. At t=1, the 3D object may be displayed in a form superimposed on the 360° video. Furthermore, at t=1, the position and rotation of the 3D object superimposed on the 360° image may be changed.

The above scene description includes information elements indicating the position, rotation and size of the 3D object for each of t=0, t=1 and t=2, and appropriately superimposes the 3D object on the 360° video. can be done.

(Viewing method)
A viewing method according to the embodiment will be described below. Here, viewing of specific content including first content and second content is exemplified.

As shown in FIG. 4, in step S11, the user terminal 300 transmits RTSP SETUP to the media processing device. RTSP SETUP is a message to start viewing specific content.

Here, the RTSP SETUP includes the IP address of the user terminal 300, standby port number, content identification information (content ID), and the like. RTSP SETUP may include information on the capabilities of the user terminal 300 to view specific content. Capability information may include frame rate, display resolution, and the like. The display resolution may include a field of view (FoV). The capability information may include an information element indicating the encoding scheme and compression scheme that the user terminal 300 supports.

Here, a case is exemplified where the capability information of the user terminal 300 is directly notified to the media processing device 200, but the embodiment is not limited to this. The capability information of the user terminal 300 may be notified to the transmission device 100 and then transmitted from the transmission device 100 to the media processing device 200 .

At step S12, the media processing device 200 transmits a response to RTSP SETUP. Here, an ACK indicating that the RTSP SETUP has been accepted is sent as a response.

In step S21, the user terminal 300 transmits initial viewpoint information to the media processing device 200. The initial viewpoint information may be transmitted in the form of MMT-SI.

In step S22, the media processing device 200 generates initial specific content based on the initial viewpoint information (rendering process). For example, the media processing device 200 generates second content to be included in the initial specific content based on the initial viewpoint information and the scene description.

Here, the media processing device 200 may generate the initial specific content using a viewport that is wider than the display resolution of the user terminal 300 . For example, the range wider than the display resolution may be a range of +20% display resolution in the horizontal direction and +20% display resolution in the vertical direction.

The media processing device 200 applies the compression encoding method to the initial specific content. Although not particularly limited, the compression encoding method may be HEVC or VVC.

In step S23, the media processing device 200 transmits the initial specific content corresponding to the initial viewpoint information to the user terminal 300. The media processing device 200 transmits the presentation time of the initial specific content to the user terminal 300 . As described above, the presentation time (T') provided to the user terminal 300 may be determined based on the delay time (ΔT).

When using a different value for each user terminal 300 as the delay time (ΔT), it is possible to specify it on the media processing device 200 side by including the RTSP SETUP transmission time in the RTSP SETUP described above.

The user terminal 300 outputs specific content based on the presentation time (T'). Based on the difference between the viewpoint information received from the media processing device 200 and the viewpoint information input from the detection unit 310, the user terminal 300 may generate specific content in which the viewpoint position and line-of-sight direction are corrected.

In step S31, the user terminal 300 transmits viewpoint information to the media processing device 200. Viewpoint information may be transmitted in the form of MMT-SI. Here, the user terminal 300 may transmit viewpoint information at a predetermined cycle (eg, 500 ms), or may transmit viewpoint information in accordance with a change in at least one of the viewpoint position and line-of-sight direction.

In step S32, the media processing device 200 generates specific content based on the viewpoint information received in step S31 (rendering processing).

At step S33, the media processing device 200 transmits to the user terminal 300 the specific content corresponding to the viewpoint information received at step S31.

The processing of steps S31 to S33 is the same as the processing of steps S21 to S23, except that the viewpoint information received in step S31 is used instead of the initial viewpoint information. Therefore, the details of the processing in steps S31 to S33 are omitted. The processing of steps S31 to S33 may be repeated at a predetermined cycle, or may be repeated each time the user's viewpoint position or line-of-sight direction is changed.

In step S41, the user terminal 300 transmits RTSP TEARDOWN to the media processing device. RTSP TEARDOWN is a message to end viewing of specific content.

At step S42, the media processing device 200 transmits a response to RTSP TEARDOWN. Here, an ACK indicating that RTSP TEARDOWN has been accepted is sent as a response.

Although FIG. 4 illustrates the case where steps S11 and S12 are executed based on RTSP, the embodiment is not limited to this. Steps S11 and S12 may be executed based on MMTP or HTTP.

Similarly, the case where steps S41 and S42 are executed based on RTSP has been exemplified, but the embodiment is not limited to this. Steps S41 and S42 may be executed based on MMTP or HTTP.

Although FIG. 4 exemplifies the case where steps S31 to S33 are executed based on MMTP, the embodiment is not limited to this. Steps S31-S33 may be performed in other ways (eg, HTTP) based.

Similarly, the case where steps S41 to S43 are executed based on MMTP has been exemplified, but the embodiment is not limited to this. Steps S41-S43 may also be performed on a different basis (eg, HTTP).

(Operation example 1)
The above-described embodiment may include Operation Example 1 shown below. In operation example 1, the media processing device 200 transmits viewpoint information used in generating the specific content to the user terminal 300 in association with the sequence number added to the specific content.

Specifically, the media processing device 200 (renderer 220) includes 360° video and 3D objects as part of the specific content based on the second control information (scene description), as in the above-described embodiment. Generate secondary content. Viewpoint information received from the user terminal 300 is used in generating the second content.

In operation example 1, the renderer 220 associates the viewpoint information used in generating the specific content (here, the second content) with the sequence number. The renderer 220 transmits viewpoint information used in generating the specific content to the user terminal 300 in association with the sequence number added to the specific content. Viewpoint information may be stored in a VP (View Port) message. A VP message may have the format of MMT-SI as specified in ISO/IEC 23008-1, ARIB STD-B60, ARIB TR-B39, etc. A VP message may be sent every frame. The VP message may be sent to the user terminal 300 as a message related to MMTP (MMT-SI).

Although not particularly limited, the VP message may have the data structure shown in FIG. As shown in FIG. 5, the VP message may include message_id, version, length, fov, viewpoint_pos_x, viewpoint_pos_y, viewpoint_pos_z, viewpoint_yaw, viewpoint_pitch, viewpoint_roll, viewport_width, viewport_height, mpu_sequence_number_flag, mpu_sequence_number, and so on.

　message_id is identification information indicating a VP message. message_id may be 0x0204.

　version is information indicating the version of the MMTP protocol. version may be 0x00.

"length" is information indicating the length of the VP message.

"fov" is information indicating the field of view.

　viewpoint_pos_x is information indicating the x-coordinate of the viewpoint position. viewpoint_pos_x is an example of viewpoint information used in generating specific content.

"viewpoint_pos_y" is information indicating the y-coordinate of the viewpoint position. viewpoint_pos_y is an example of viewpoint information used in generating specific content.

　viewpoint_pos_z is information indicating the z-coordinate of the viewpoint position. viewpoint_pos_z is an example of viewpoint information used in generating specific content.

"viewpoint_yaw" is information indicating the yaw of the viewpoint position. viewpoint_yaw is an example of viewpoint information used in generating specific content.

"viewpoint_pitch" is information indicating the pitch of viewpoint positions. viewpoint_pitch is an example of viewpoint information used in generating specific content.

"viewpoint_roll" is information indicating the roll of the viewpoint position. viewpoint_roll is an example of viewpoint information used in generating specific content.

viewport_width is information indicating the width of the display area (specific content).

viewport_height is information indicating the height of the display area (specific content).

mpu_sequence_number_flag is information indicating whether or not the mpu_sequence_number field exists. For example, if the mpu_sequence_number_flag is 1, the mpu_sequence_number field may exist, and if the mpu_sequence_number_flag is 0, the mpu_sequence_number field may not exist.

　mpu_sequence_number is the MPU sequence number of the video corresponding to the specific content indicated by the VP message. mpu_sequence_number is an example of a sequence number added to specific content.

Here, viewpoint_pos_x, viewpoint_pos_y, and viewpoint_pos_z are examples of information elements that indicate the user's viewpoint position in a three-dimensional space configured by the scene description. viewpoint_yaw, viewpoint_pitch, and viewpoint_roll are examples of information elements that indicate the direction of the user's line of sight in the three-dimensional space configured by the scene description. viewport_width and viewport_height are examples of information elements indicating the number of pixels of video included in specific content.

Under this premise, the media processing device 200 and the user terminal 300 may perform the following operations.

First, the media processing device 200 (renderer 220) may specify viewpoint_pos_x, viewpoint_pos_y, and viewpoint_pos_z based on the viewpoint position used to generate the specific content. The renderer 220 may identify the viewpoint_yaw, viewpoint_pitch, and viewpoint_roll based on the viewing direction used to generate the particular content. The renderer 220 may determine viewport_width based on the number of pixels in the horizontal direction of the specific content, and determine viewport_height based on the number of pixels in the vertical direction of the specific content.

The media processing device 200 (encoding processing unit 230) may compress and encode the specific content generated by the renderer 220 and transmit the specific content. Here, the encoding processing unit 230 may transmit the VP message shown in FIG. 5 to the user terminal 300. FIG. That is, the encoding processing unit 230 may transmit the viewpoint information used in generating the specific content to the user terminal 300 in association with the sequence number added to the specific content.

Second, user terminal 300 (decryption processing unit 320) constitutes a receiving unit that receives viewpoint information used in generating specific content from media processing device 200 in association with the sequence number added to the specific content. may That is, the decoding processing unit 320 may receive the VP message (MMT-SI) shown in FIG.

Based on the sequence number (mpu_sequence_number), the user terminal 300 (renderer 330) may associate the video that constitutes the decoded specific content with the viewpoint information included in the VP message. Based on the difference between the viewpoint information detected by the detector 310 and the viewpoint information included in the VP message, the renderer 330 renders the detector 310 from the display area defined by the information (viewport_width and viewport_height) included in the VP message. You may specify the image|video of the range specified by the viewpoint information detected by. Renderer 330 may display the identified video.

(Operation example 2)
The above-described embodiment may include Operation Example 2 shown below. In operation example 2, as shown in FIG. 6, a case is assumed in which a first user terminal 400 that feeds back viewpoint information and a second user terminal 500 that does not feed back viewpoint information coexist. The first user terminal 400 may be a terminal such as a head mounted display. The first user terminal may have the same functions as the user terminal 300 described above. The second user terminal 500 may be a terminal such as a volumetric display.

Specifically, in operation example 2, as shown in FIG. 6, the media processing device 200 (renderer 220) may configure a receiving unit that receives content configuration and recommended viewport information from the transmitting device 100. . The configuration of content may be considered to include 2D video, audio, 360° video, and 3D objects. Content configuration may be considered to include MMT-SI and scene description. The recommended viewport information may be considered as an example of specific viewpoint information. The recommended viewport information is information that defines (recommended) the position, direction, and angle of view in which the video is viewed in the 3D space configured by the specific content (the 3D space constructed by the scene description). good too. Recommended viewport information is information indicating at least one of an information element indicating the viewpoint position in the 3D space configured by the specific content and an information element indicating the line-of-sight direction in the 3D space configured by the specific content. may contain elements. It may be considered that the recommended viewport information is mainly viewpoint information used by the second user terminal 500 .

In the following, unless explicitly stated, the viewpoint information used in generating the specific content may include the specific viewpoint information (recommended viewport information) received from the transmission device 100, or the viewpoint information received from the user terminal 300. may contain.

Although not particularly limited, the recommended viewport information may be included in the scene description in the manner shown in FIG. As shown in FIG. 7, the recommended viewport information may include camera_orientation, frame_number, translation, and yfov.

　camera_orientation is information that indicates the direction in which the video is viewed in the 3D space constructed by the scene description. camera_orientation may be considered synonymous with viewing direction.

　frame_number is information indicating the frame number of the video to which camera_orientation, translation, and yfov are applied. camera_orientation may be considered as an example of specific viewpoint information.

　Translation is information that indicates the position where the video is viewed in the 3D space constructed by the scene description. You may think that translation is synonymous with viewpoint position. You may think that translation is an example of specific viewpoint information. For example, in FIG. 7, when the frame number is 0, the viewpoint position is [0, 0, -50], and when the frame number is 2505, the viewpoint position is [0, 0, -75]. A moving case is illustrated.

　yfov is information that indicates the viewing angle of the video in the 3D space constructed by the scene description.

Although not particularly limited, camera_orientation and translation may be given by the creator of the content. Alternatively, assuming a case where content is captured by a camera, camera_orientation and translation may be automatically given by the GPS and sensor provided in the camera.

Here, the translation is an example of an information element that indicates the viewpoint position in the 3D space configured by the specific content. camera_orientation is an example of an information element that indicates the line-of-sight direction in a three-dimensional space configured by specific content.

First, the media processing device 200 (renderer 220) may generate specific content based on specific viewpoint information (recommended viewport information). The media processing device 200 (encoding processing unit 230) may perform compression encoding of the specific content generated by the renderer 220 and transmit the specific content. Here, the encoding processing unit 230 may transmit the specific content generated based on the specific viewpoint information to the first user terminal 400, and transmit the specific content generated based on the specific viewpoint information to the second user terminal. You may send to 500.

Second, the media processing device 200 (reception unit 210) may receive viewpoint information from the first user terminal 400 when the user terminal is the first user terminal 400 that feeds back viewpoint information. The media processing device (renderer 220 ) may generate specific content based on the viewpoint information received from the first user terminal 400 . In such a case, the media processing device 200 (accepting unit 210) may receive the reset signal from the first user terminal 400. FIG. The media processing device (renderer 220) may generate specific content based on the specific viewpoint information (recommended viewport information) in response to the reset signal.

In such a case, the first user terminal 400 may have the same configuration as the user terminal 300. However, first user terminal 400 (detection unit 310) may have a function of detecting a reset signal. The detection unit 310 may detect a user operation of inputting a reset signal. The detector 310 may send a reset signal to the media processing device 200 .

Third, the media processing device 200 (renderer 220) generates specific content based on specific viewpoint information (recommended viewport information) when the user terminal is the second user terminal 500 that does not feed back viewpoint information. good too.

In such a case, the second user terminal 500 may not have the detection unit 310 that detects viewpoint information. Second user terminal 500 may not have renderer 330 . Second user terminal 500 may have the same configuration as user terminal 300 except that it does not have detector 310 and renderer 330 .

(Operation example 3)
The above-described embodiment may include Operation Example 3 shown below. In operation example 3, the media processing device 200 (for example, the selection unit 260 to be described later) selects two or more streams whose quality differs depending on the direction of the 3D object based on the viewpoint information, as the stream quality information regarding the 3D object included in the specific content. A receiving unit may be configured to receive quality information about each of the streams from the transmitting device 100 .

Specifically, in Operation Example 3, as shown in FIG. 8, the media processing device 200 has a selector 260 in addition to the configuration shown in FIG. The selection unit 260 receives the scene description and the 3D objects from the sending device 100 . The selection unit 260 inputs a stream (3D object) selected from two or more streams to the renderer 220 . The selection unit 260 may request the transmission device 100 to transmit the selected stream. In addition, when assuming a case where the media processing device 200 transmits specific content to a plurality of user terminals 300, the selection unit 260 selects the transmission device 100 to transmit a stream required by each of the plurality of user terminals 300. , or may request the transmitting apparatus 100 to transmit all streams.

Here, the selection unit 260 receives, from the transmission device 100, quality information about each of two or more streams whose quality differs depending on the direction of the 3D object based on the viewpoint information, as the quality information of the streams about the 3D object.

The quality information may be information indicating the relative quality of each face that constitutes the bounding box of the 3D object. A bounding box may be represented by a three-dimensional rectangle that projects the 3D object. For example, a bounding box may be defined by vertex A through vertex H, as shown in FIG. In such a case, each face of the bounding box is face #1 represented by vertices A, B, F, E, face #2 represented by vertices B, C, G, F, vertices A, B, Face #3 represented by C, D, Face #4 represented by vertices E, F, G, H, Face #5 represented by vertices A, D, H, E, Vertices D, C, G, Includes face #6, denoted by H.

In such a case, assuming the case of viewing a 3D object in a 3D space constructed by the scene description, it is assumed that three aspects will be mainly observed. In other words, the remaining three faces are assumed to be less observed.

In operation example 3, two or more streams with different quality depending on the orientation of the 3D object based on the viewpoint information are prepared as streams related to the 3D object included in the specific content.

Although not particularly limited, the quality information may be included in the scene description in the manner shown in FIG. In FIG. 10, six streams are illustrated as streams with different orientations of 3D objects. The quality information may be expressed in the form "quality" [#1,#2,#3,#4,#5,#6]. In [ ], #1 to #6 mean the quality index of plane #1 to plane #6. The quality index may take values ranging from 1-9. A higher quality index value may mean a higher quality. For example, in the stream identified by ”id”=”1”, #1,#2,#3 have high quality (”8”) and #4,#5,#6 have high quality (”3”). low. In the stream identified by "id"="2", #1,#2,#3 have low quality ("3") and #4,#5,#6 have high quality ("8").

Under this premise, the media processing device 200 may perform the following operations. In the following, selection of a stream (3D object) selected from two or more streams will be mainly described.

In mode 1, as shown in the upper part of FIG. 11, the media processing device 200 (the selection unit 260) specifies the vertex (for example, vertex B) closest to the user's viewpoint position, and then selects the vertex having the closest vertex. For example, face #1 represented by vertices A,B,F,E, face #2 represented by vertices B,C,G,F, face represented by vertices A,B,C,D #3) may be specified. The selection unit 260 may select the stream with the maximum sum of the quality indexes of the identified three faces (in the example shown in FIG. 10, the stream identified by “id”=“1”).

Note that in mode 1, the vertex closest to the user's viewpoint position is specified based on the viewpoint information. may be considered to select the stream to be sent to the .

Mode 2 may be a mode that is applied when a reduced display or enlarged display of a 3D object is executed. For example, as shown in the middle part of FIG. 11, the media processing device 200 (selection unit 260) identifies the vertex (for example, vertex B) closest to the user's viewpoint position, and then selects three surfaces having the closest vertex. (For example, surface #1 represented by vertices A, B, F, E, surface #2 represented by vertices B, C, G, F, surface #3 represented by vertices A, B, C, D ) may be specified. In the case where a reduced display of the 3D object is performed, the pixels of the 3D object are thinned out. Therefore, the selection unit 260 may select the stream that minimizes the sum of the quality indexes of the identified three faces (in the example shown in FIG. 10, the stream identified by “id”=“2”). . On the other hand, in cases where a magnified display of the 3D object is performed, the pixels of the 3D object are interpolated. Therefore, the selection unit 260 may select the stream that maximizes the sum of the quality indexes of the identified three surfaces (in the example shown in FIG. 10, the stream identified by “id”=“1”). .

Note that in mode 2, since the vertex closest to the user's viewpoint position is specified based on the viewpoint information, the selection unit 260 selects the user terminal 300 from the two or more streams based on the viewpoint information and the quality information. may be considered to select the stream to be sent to the .

Mode 3 may be applied when two 3D objects (3D object #1 and 3D object #2) overlap in the direction of the user's line of sight. Here, selection of a stream for 3D object #1 will be described. For example, as shown in the lower part of FIG. 11, the media processing device 200 (the selection unit 260) identifies the vertex (for example, vertex B) closest to the user's viewpoint position, and then selects three surfaces having the closest vertex. (For example, surface #1 represented by vertices A, B, F, E, surface #2 represented by vertices B, C, G, F, surface #3 represented by vertices A, B, C, D ) may be specified. Here, the 3D object #2 overlaps on the line segment connecting the vertex (for example, vertex B) closest to the user's viewpoint position and the user's viewpoint position, and the identified three faces are overlapped by the 3D object #2. blocked. Therefore, the selection unit 260 may select the stream that minimizes the sum of the quality indexes of the identified three faces (in the example shown in FIG. 10, the stream identified by “id”=“2”). .

Note that in mode 3, since the vertex closest to the user's viewpoint position is specified based on the viewpoint information, the selection unit 260 selects the user terminal 300 from the two or more streams based on the viewpoint information and the quality information. may be considered to select the stream to be sent to the . Furthermore, in mode 3, the overlapping of two 3D objects is specified based on the viewpoint information and the arrangement information of the 3D objects. It may be considered that a stream to be transmitted to the user terminal 300 is selected from among the streams. 3D object placement information (for example, "rotation_object", "scale_object", "translation_object" shown in Fig. 10) may be included in the scene description, and the 3D object placement information is "link_area" shown in Fig. 10. In other words, the selection unit 260 may receive the placement information of the 3D object in the 3D space from the transmission device 100. FIG.

It should be noted that in the quality information shown in FIG. 10, the total sum of the quality indexes of the six planes is the same for each stream. However, embodiments are not so limited. The sum of the six-sided quality indices may differ between two or more streams.

(Operation example 4)
The above-described embodiment may include Operation Example 4 shown below. Here, in addition to operation example 3, operation example 4 includes the following operation. In operation example 4, the media processing device 200 (for example, the selection unit 270 to be described later) may configure a reception unit that receives importance information regarding each of two or more objects included in the specific content from the transmission device 100. .

Specifically, in operation example 4, as shown in FIG. 12, the media processing device 200 has a selection unit 270 in addition to the configuration shown in FIG. The selection unit 270 receives the scene description, 3D objects and 360° video from the transmission device 100 . The selection unit 270 inputs a stream (3D object) selected from two or more streams to the renderer 220 .

Here, the selection unit 270 receives importance information about each of the two or more objects from the transmission device 100. FIG. Objects may include 3D objects and 360° videos.

For example, the importance information may be information indicating relative importance between two or more objects. For example, as shown in FIG. 13, consider a case where object A (background), object B (person), and object C (dog) exist in a three-dimensional space constructed by scene description. Object A (background) is an example of a 360° video, and object B (person) and object C (dog) are examples of 3D objects. In such a case, the importance information may be information indicating the relative importance among each of object A (background), object B (person) and object C (dog).

Although not particularly limited, the importance information may be included in the scene description in the manner shown in FIG. In FIG. 14, the importance information may be represented by weight. weight may take values in the range 1-9. A larger value of weight may mean a higher degree of importance. In FIG. 14, object A (background) identified by "object_id" = "0" has the highest weight ("9"), and object B (person) identified by "object_id" = "1" has the highest weight ( "3") is the lowest, and the weight of object C (dog) identified by "object_id" = "2" ("8") is higher than the weight of object B (person) and the weight of object A (background). A low case is illustrated.

First, the media processing device 200 (selection unit 270) selects the highest quality stream for the 3D object with the highest importance. The stream selection method may be the same as in Operation Example 3. For example, since the importance of object C (dog) is higher than the importance of object B (person), the selection unit 270 selects three planes of object C (dog) that have vertices closest to the user's viewpoint position. Select the stream with the highest sum of quality indices.

Second, the media processing device 200 (selection unit 270) selects the lowest quality stream for 3D objects other than the 3D object with the highest importance. Subsequently, the selection unit 270 replaces the lowest quality stream with the highest quality stream within the range where the specific condition is satisfied, in descending order of importance of the 3D object. The specific conditions may include a first condition that the bandwidth of the line from the transmission device 100 to the media processing device 200 is equal to or less than a threshold, and may include a second condition that the processing load of the media processing device 200 is equal to or less than the threshold. A specific condition may be defined by a combination of a first condition and a second condition. For example, since the importance of object B (human) is lower than the importance of object C (dog), the selection unit 270 selects a high-quality stream for object B (human) within a range that satisfies a specific condition. select.

As described above, media processing device 200 (selection unit 270) selects a stream to be transmitted to user terminal 300 from among two or more streams based on viewpoint information, quality information, and importance information. good. It may be considered that the media processing device 200 (selection unit 270) selects a stream to be transmitted to the user terminal 300 from among two or more streams based on viewpoint information, quality information, arrangement information, and importance information.

Note that, in Operation Example 4, a case where one stream exists for 360° video was exemplified. However, embodiments are not so limited. For 360° video, there may also be two or more streams with different qualities.

In Operation Example 4, there are two or more streams with different quality depending on the orientation of the 3D object based on the viewpoint information. However, embodiments are not so limited. For a 3D object, there may be two or more streams with different quality regardless of the orientation of the 3D object.

(Operation example 5)
The above-described embodiment may include Operation Example 3 shown below. In operation example 3, the media processing device 200 (for example, the renderer 220) generates an information element that defines the movement range of the user's viewpoint position in a three-dimensional space configured by specific content (three-dimensional space constructed by scene description). from the transmitting device 100 may be configured.

First, in operation example 5, the information element may include an information element (hereinafter referred to as the first information element) that restricts movement of the user's viewpoint position to the inside of the 3D object included in the specific content. For example, as shown in FIG. 15, in a case where a 3D object is arranged in a 3D space constructed by a scene description, movement of the viewpoint position inside the 3D object may be restricted. However, in cases where the 3D object is a building, cases where another scene exists inside the 3D object, etc., movement of the viewpoint position to the inside of the 3D object may be permitted.

Second, in operation example 5, the information element may include an information element that restricts movement of the user's viewpoint position to the outside of the three-dimensional space (hereinafter referred to as the second information element). For example, as shown in FIG. 16, the three-dimensional space may be defined by a combination of cuboids and spheroids. The number of cuboids defining the three-dimensional space may be two or more, and the number of spheroids defining the three-dimensional space may be two or more. However, there may be cases where movement of the user's viewpoint position to the outside of the three-dimensional space is permitted.

Although not particularly limited, the first information element may be included in the scene description in the manner shown in FIG. In FIG. 17, the first information element may be represented by viewing_inside_object_flag. viewing_inside_object_flag may be set for each 3D object. For example, when viewing_inside_object_flag is "0", movement of the viewpoint position into the 3D object may be restricted, and when viewing_inside_object_flag is "1", movement of the viewpoint position into the 3D object may be permitted. .

Although not particularly limited, the second information element may be included in the scene description in the manner shown in FIG. In FIG. 18, the second information element may include information elements (cuboid_center_x, cuboid_center_y, cuboid_center_z, cuboid_size_x, cuboid_size_y, cuboid_size_z) that define a cuboid defining a three-dimensional space. cuboid_center_x, cuboid_center_y, and cuboid_center_z are information elements that indicate the center position of the cuboid, and cuboid_size_x, cuboid_size_y, and cuboid_size_z are information elements that indicate the size of the cuboid. The second information element may include information elements (spheroid_center_x, spheroid_center_y, spheroid_center_z, spheroid_size_x, spheroid_size_y, spheroid_size_z) defining a spheroid defining a three-dimensional space. spheroid_center_x, spheroid_center_y, spheroid_center_z are information elements indicating the center position of the spheroid, and spheroid_size_x, spheroid_size_y, spheroid_size_z are information elements indicating the size of the spheroid. Note that cuboid_enable may be an information element indicating whether or not a three-dimensional space is defined by a cuboid, and spheroid_enable may be an information element indicating whether or not a three-dimensional space is defined by a spheroid. FIG. 18 illustrates a case where a three-dimensional space is defined by two cuboids and two spheroids.

Under this premise, the media processing device 200 (renderer 220) may perform the following operations.

First, when the user's viewpoint position moves outside the movement range, the renderer 220 may generate specific content using the intersection of the trajectory of the user's viewpoint position and the boundary of the movement range as the viewpoint position. That is, the renderer 220 may fix the viewpoint position at the position (boundary position) when the viewpoint position is about to move out of the movement range.

Second, the renderer 220 may notify the user that movement of the viewpoint position is restricted when the user's viewpoint position moves outside the movement range. For example, the renderer 220 may display a message such as "You cannot move beyond this point."

(Action and effect)
In the embodiment, the media processing device 200 generates specific content based on viewpoint information, and then transmits the specific content to the user terminal 300. FIG. According to such a configuration, there is no need for the user terminal 300 to generate specific content including the second content having a degree of freedom of viewpoint, and the user terminal 300 can provide the media processing device 200 with viewpoint information. For example, specific content can be presented. Therefore, although there is a delay between the media processing device 200 and the user terminal 300, the processing load on the user terminal 300 can be reduced.

In operation example 1, the media processing device 200 transmits to the user terminal 300 the viewpoint information used in generating the specific content in association with the sequence number added to the specific content. According to such a configuration, the user terminal 300 can grasp the viewpoint information and the sequence number used in generating the specific content. Even if it is assumed that the viewpoint information differs from the viewpoint information used when displaying the specific content on the terminal 300, the specific content can be displayed appropriately.

In operation example 2, the media processing device 200 receives the content configuration and specific viewpoint information (recommended viewport information) from the transmission device 100 . With such a configuration, the media processing device 200 can generate specific content based on specific viewpoint information, and the first user terminal 400 that feeds back viewpoint information and the second user terminal 500 that does not feed back viewpoint information. specific content can be displayed appropriately even when assuming a case where

In operation example 2, even when the user terminal is the first user terminal 400, the media processing device 200 generates the specific content based on the specific viewpoint information in response to the reset signal. According to such a configuration, it is assumed that the viewpoint position and line-of-sight direction in the three-dimensional space constructed by the scene description are unknown in the first user terminal 400 (the case of getting lost in the three-dimensional space). However, it is possible to return to the specific content based on the specific viewpoint information by the reset signal.

In operation example 3, the media processing device 200 sends quality information about each of two or more streams that differ in quality depending on the orientation of the 3D object based on viewpoint information, as stream quality information about the 3D object included in the specific content. Receive from 100. According to such a configuration, the 3D object can be displayed appropriately while suppressing the transmission traffic based on the new knowledge that the quality of each surface constituting the bounding box of the 3D object does not have to be uniform. be able to.

In operation example 4, the media processing device 200 receives from the transmission device 100 importance information regarding each of two or more objects included in the specific content. According to such a configuration, by introducing a mechanism for setting the importance of each object such as 360° video and 3D objects, while suppressing transmission traffic, each object included in specific content can be displayed appropriately. can do.

In operation example 5, the media processing device 200 receives from the transmission device 100 an information element that defines the movement range of the user's viewpoint position in the three-dimensional space configured by the specific content. According to such a configuration, it is possible to appropriately display the specific content including the content having a degree of freedom of viewpoint without causing failure of the specific content displayed on the user terminal 300 .

[Modification 1]
Modification 1 of the embodiment will be described below. In the following, mainly the differences with respect to the embodiments will be described.

Modification 1 describes a method for synchronizing the first content and the second content when the specific content includes both the first content and the second content.

In addition, hereinafter, synchronization means that the presentation times of the first content (eg, MPU) and the second content (file) are appropriately aligned. Therefore, synchronization may include aligning the presentation times of the 2D video and the 3D object, and may include aligning the presentation times of the audio and the 3D object. Similarly, synchronization may include aligning presentation times of 2D video and 360° video, and aligning presentation times of audio and 360° video.

In the first method, a case will be described in which the media processing device 200 synchronizes the first content and the second content based on the first control information (MMT-SI). The media processing device 200 uses MMT-SI as an entry point to check whether there is a scene description (second content). Identify the presentation time of the specific content including the content and the second content.

Specifically, as shown in FIG. 19, 2D video and audio are presented based on the MPU timestamp descriptor (just timestamp in FIG. 19), so 2D video and audio can be synchronized.

On the other hand, the presentation time of the first frame included in the scene description is specified by referring to the MPU timestamp descriptor included in MMT-SI. The presentation times of the second and subsequent frames included in the scene description can be specified by the frame numbers included in the scene description and the frame rate of the second content. For example, considering the case where the frame rate is 30 fps, the presentation time of the nth frame is determined by adding 1/30*n to the time specified by the MPU Timestamp Descriptor. However, the frame number of the first frame included in the scene description is "0".

In the first method, the scene description does not include the presentation time of the first frame included in the scene description, but the scene description may include the presentation time of the first frame included in the scene description.

In the second method, a case will be described in which the media processing device 200 synchronizes the first content and the second content based on the second control information (scene description). Using the scene description as an entry point, the media processing device 200 confirms the presence or absence of MMT-SI (first content), and if MMT-SI exists, based on the presentation time included in the scene description, Identify the presentation time of the specific content including the first content and the second content.

In such cases, the scene description includes absolute time information indicating the presentation time of the second content. The absolute time information may be the presentation time of the first frame included in the scene description.

For example, absolute time information may be generated using UTC as the reference time. As the reference time, TAI may be used, or the time provided by GPS may be used. The reference time may be the time provided by the NTP server or the time provided by the PTP server. Furthermore, absolute time information may be generated based on the same reference time as the MPU timestamp descriptor.

Furthermore, the scene description includes reference information for identifying the first content. The reference information may be information for identifying the MPUs forming the first content. That is, the reference information is information for handling the first content (MPU) as an object included in the scene description.

Specifically, as shown in FIG. 20, the presentation time of the first frame included in the scene description is specified by the absolute time information included in the scene description. The presentation times of the second and subsequent frames included in the scene description can be specified by the frame numbers included in the scene description and the frame rate of the second content. For example, considering the case where the frame rate is 30 fps, the presentation time of the nth frame is determined by adding 1/30*n to the time specified by the MPU Timestamp Descriptor. However, the frame number of the first frame included in the scene description is "0".

On the other hand, 2D video and audio are presented based on the MPU timestamp descriptor (just timestamp in FIG. 20), so 2D video and audio can be synchronized. Here, since the above-described reference information is included in the scene description, the media processing device 200 can confirm the presence or absence of the first content to be presented together with the second content based on the reference information included in the scene description. .

In the second method, 2D video and audio are synchronized based on the MPU timestamp descriptor included in MMT-SI. may be taken based on the information elements (absolute time information and reference information) contained in . In such cases, at least the MPU timestamp descriptor included in MMT-SI may be omitted. Furthermore, MMT-SI itself may be omitted.

If the reference time of the MPU timestamp descriptor included in MMT-SI (hereafter referred to as the first reference time) differs from the reference time of the absolute time information included in the scene description (second reference time), the At least one of the first control information (MMT-SI) and the second control information (scene description) may include conversion information between the first reference time and the second reference time. For example, in MMT-SI, in addition to an MPU timestamp descriptor expressed in a first reference time (e.g. UTC), an MPU timestamp description expressed in a second reference time (e.g. a reference time other than UTC) May contain children. The scene description may include absolute time information expressed in a first reference time (eg, UTC) in addition to absolute time information expressed in a second reference time (eg, a reference time other than UTC).

The MPU timestamp descriptor included in MMT-SI may be referred to as first absolute time information, and the absolute time information included in the scene description may be referred to as second absolute time information.

[Other embodiments]
While the present invention has been described in the foregoing disclosure, the discussion and drawings forming part of this disclosure should not be taken as limiting the invention. Various alternative embodiments, implementations and operational techniques will become apparent to those skilled in the art from this disclosure.

Although the above disclosure exemplified the case where the specific content includes both the first content and the second content, the above disclosure is not limited to this. The specific content should include at least the second content.

Although not specifically mentioned in the disclosure above, the terms related to MMT may be interpreted based on the content stipulated in ISO/IEC 23008-1, ARIB STD-B60, ARIB TR-B39, etc.

In the above disclosure, the MPU timestamp descriptor was exemplified as the first absolute time information included in MMT-SI. However, the above disclosure is not so limited. The first absolute time information included in MMT-SI may be an MPU extended timestamp descriptor.

Although not specifically mentioned in the above disclosure, the media processing device 200 may request part of the second content from the transmission device 100 as necessary. According to such a configuration, it is possible to save the band accompanying the transmission of the second content and suppress an increase in the processing load of the media processing device 200. FIG.

In the above disclosure, MMTP was exemplified as the transmission method for the first content. However, the above disclosure is not so limited. The transmission method of the first content may be a method conforming to ISO/IEC 23009-1 (hereinafter referred to as MPEG-DASH (Dynamic Adaptive Stream over HTTP)). In such a case, the first control information may be MPD (Media Presentation Description). That is, in the above disclosure, MMT-SI may be read as MPD.

Although not specifically mentioned in the above disclosure, "acquisition" may be read as "reception".

Although not particularly limited, operation example 2 may be expressed as follows. The transmission device 100 includes a transmission unit that transmits a configuration of content having a degree of freedom of viewpoint, and the transmission unit transmits specific viewpoint information that is used to generate specific content that includes at least the content. The receiving device includes a receiving unit that receives a configuration of content having a degree of freedom of viewpoint, and the receiving unit receives specific viewpoint information that is used to generate specific content including at least the content. In such cases, the receiving device may be the media processing device 200 or the user terminal 300 .

Although not particularly limited, operation example 3 may be expressed as follows. The transmission device 100 includes a transmission unit that transmits a content configuration having a degree of freedom of viewpoint, the transmission unit transmits stream quality information about a three-dimensional object included in specific content that includes at least the content, and the quality information is , containing quality information about each of two or more streams whose quality varies depending on the orientation of the 3D object. The receiving device includes a receiving unit that receives a content configuration having a degree of freedom of viewpoint, the receiving unit receives stream quality information about a three-dimensional object included in specific content that includes at least the content, and the quality information includes: Contains quality information for each of two or more streams that differ in quality depending on the orientation of the 3D object. In such cases, the receiving device may be the media processing device 200 or the user terminal 300 .

Although not particularly limited, Operation Example 4 may be expressed as follows. The transmission device 100 includes a transmission unit that transmits a content configuration having a degree of freedom of viewpoint, and the transmission unit transmits importance information regarding each of two or more objects included in specific content that includes at least the content. The receiving device includes a receiving unit that receives a configuration of content having a degree of freedom of viewpoint, and the receiving unit receives importance information regarding each of two or more objects included in specific content that includes at least the content. In such cases, the receiving device may be the media processing device 200 or the user terminal 300 .

Although not particularly limited, Operation Example 4 may be expressed as follows. The transmission device 100 includes a transmission unit that transmits a content configuration having a degree of freedom of viewpoint, and the transmission unit defines a movement range of a user's viewpoint position in a three-dimensional space configured by specific content including at least the content. Send an information element. The receiving device includes a receiving unit that receives a content configuration having a degree of freedom of viewpoint, and the receiving unit includes information that defines a movement range of a user's viewpoint position in a three-dimensional space configured by specific content including at least the content. Receive elements. In such cases, the receiving device may be the media processing device 200 or the user terminal 300 .

Although not specifically mentioned in the disclosure above, a program that causes a computer to execute each process performed by the transmission device 100, the media processing device 200, and the user terminal 300 may be provided. Also, the program may be recorded on a computer-readable medium. A computer readable medium allows the installation of the program on the computer. Here, the computer-readable medium on which the program is recorded may be a non-transitory recording medium. The non-transitory recording medium is not particularly limited, but may be, for example, a recording medium such as CD-ROM or DVD-ROM.

Alternatively, a chip configured by a memory storing a program for executing each process performed by the transmitting device 100, the media processing device 200, and the user terminal 300 and a processor executing the program stored in the memory may be provided. .

The above disclosure may have the following problems and effects.

Specifically, a case is conceivable in which specific content including content with a degree of freedom of viewpoint is generated by a media processing device, and then the generated specific content is transmitted from the media processing device to the user terminal.

As a result of intensive studies, the inventors focused on the possibility that the importance of each object, such as 360-degree video and 3D objects, may differ, and discovered that a mechanism for setting the importance of each object is necessary. Ta.

According to the above disclosure, it is possible to provide a media processing device, a transmitting device, and a receiving device that can appropriately display each object included in specific content while suppressing transmission traffic.

This application claims priority from Japanese Patent Application No. 2022-019720 (filed on February 10, 2022), the entire contents of which are incorporated herein.

DESCRIPTION OF SYMBOLS 10... Transmission system 100... Transmission apparatus 200... Media processing apparatus 210... Reception part 220... Renderer 230... Encoding process part 260... Selection part 270... Selection part 300... User terminal 310... Detection Unit 320 Decoding processing unit 330 Renderer 400 First user terminal 500 Second user terminal

Claims

a renderer that generates specific content including at least content having a degree of freedom of viewpoint based on viewpoint information;
an output unit that outputs the specific content generated by the renderer to a user terminal;
an acquisition unit configured to acquire importance information about each of the two or more objects included in the specific content from a transmission device.
The media processing device according to claim 1, wherein said importance information is information indicating relative importance between said two or more objects.
the two or more objects comprise three-dimensional objects;
wherein the obtaining unit obtains, as the stream quality information about the 3D object, quality information about each of two or more streams whose quality varies depending on the orientation of the 3D object based on the viewpoint information, from the transmitting device. 3. A media processing device according to claim 1 or claim 2.
The media processing device according to claim 3, wherein the renderer selects a stream to be transmitted to the user terminal from among the two or more streams based on the viewpoint information, the quality information and the importance information.
The media processing device according to claim 3 or claim 4, wherein the quality information is information indicating the relative quality of each surface forming a bounding box for the three-dimensional object.
The media processing device according to any one of claims 3 to 5, wherein the acquisition unit acquires arrangement information of a three-dimensional object in a three-dimensional space from the transmission device.
7. The renderer according to claim 6, wherein said renderer selects a stream to be transmitted to said user terminal from among said two or more streams based on said viewpoint information, said arrangement information, said quality information and said importance information. media processor.
A transmission unit that transmits a configuration of content having a degree of freedom of viewpoint,
The transmission device, wherein the transmission unit transmits importance information about each of two or more objects included in specific content that includes at least the content.
A receiving unit for receiving a content configuration having a degree of freedom of viewpoint,
The receiving device, wherein the receiving unit receives importance information about each of two or more objects included in specific content that includes at least the content.