WO2018062641A1

WO2018062641A1 - Provision of virtual reality service with consideration of area of interest

Info

Publication number: WO2018062641A1
Application number: PCT/KR2017/001087
Authority: WO
Inventors: 류은석
Original assignee: 가천대학교 산학협력단
Priority date: 2016-09-28
Filing date: 2017-02-01
Publication date: 2018-04-05
Also published as: KR101861929B1; KR20180035089A

Abstract

Disclosed is a method for receiving a video comprising the steps of: receiving a bitstream comprising video data for a virtual reality service, the video data comprising base layer video data for a base layer, and at least one enhancement layer video data for at least one enhancement layer predicted from the base layer; decoding the base layer video data; and decoding the at least one enhancement layer video data on the basis of the base layer video data, wherein the at least one enhancement layer video data is video data for at least one area of interest in a virtual space.

Description

Providing virtual reality service considering area of interest

This specification relates to providing a virtual reality service considering a region of interest.

Recently, with the development of virtual reality (VR) technology and equipment, various services have been realized. Video conferencing services are examples of services implemented on the basis of virtual reality technology. A user may use a device for processing multimedia data including video information of a conference participant for a video conference.

The present specification provides image processing in consideration of ROI information in virtual reality.

In addition, the present specification provides image processing of different quality according to the gaze information of the user.

In addition, the present disclosure provides image processing in response to a change in the gaze of the user.

In addition, the present disclosure provides signaling corresponding to a change in gaze of a user.

An image receiving apparatus according to an embodiment of the present disclosure may include: a communication unit configured to receive a bitstream including video data for a virtual reality service, wherein the video data includes base layer video data for a base layer and at least one enhancement layer video data for at least one enhancement layer predicted from the base layer; a base layer decoder that decodes the base layer video data; and an enhancement layer decoder that decodes the at least one enhancement layer video data based on the base layer video data, wherein the at least one enhancement layer video data is video data for at least one region of interest in a virtual space.

In addition, an image receiving apparatus according to another embodiment disclosed in the present specification may include: a communication unit that receives base layer video data for a base layer and at least one enhancement layer video data for at least one enhancement layer predicted from the base layer; a first processor that decodes the base layer video data; and a second processor, electrically coupled with the first processor, that decodes the at least one enhancement layer video data based on the base layer video data, wherein the at least one enhancement layer video data is video data for at least one region of interest in a virtual space.

In addition, an image transmission apparatus according to another embodiment disclosed in the present specification may include: a base layer encoder that generates base layer video data; an enhancement layer encoder that generates at least one enhancement layer video data based on the base layer video data; and a communication unit configured to transmit a bitstream including video data for a virtual reality service, wherein the video data includes the base layer video data for a base layer and the at least one enhancement layer video data for at least one enhancement layer predicted from the base layer, and wherein the at least one enhancement layer video data is video data for at least one region of interest in a virtual space.

In addition, an image receiving method according to another embodiment of the present disclosure may include: receiving a bitstream including video data for a virtual reality service, wherein the video data includes base layer video data for a base layer and at least one enhancement layer video data for at least one enhancement layer predicted from the base layer; decoding the base layer video data; and decoding the at least one enhancement layer video data based on the base layer video data, wherein the at least one enhancement layer video data is video data for at least one region of interest in a virtual space.

In addition, an image transmission method according to another embodiment disclosed in the present specification may include: generating base layer video data; generating at least one enhancement layer video data based on the base layer video data; and transmitting a bitstream including video data for a virtual reality service, wherein the video data includes the base layer video data for a base layer and the at least one enhancement layer video data for at least one enhancement layer predicted from the base layer, and wherein the at least one enhancement layer video data is video data for at least one region of interest in a virtual space.

According to the technology disclosed herein, the image processing apparatus may apply different image processing methods based on the user's gaze. In addition, an image processing method that takes the user's gaze information into account can minimize the change in image quality perceived by the wearer of a video conferencing device such as an HMD, reduce the bandwidth (BW) required for image transmission, and reduce power consumption through improved image processing performance.

FIG. 1 is a diagram illustrating an exemplary video conferencing system.

FIG. 2 is a diagram illustrating an exemplary video conferencing service.

FIG. 3 is a diagram illustrating an example scalable video coding service.

FIG. 4 is a diagram illustrating an exemplary configuration of a server device.

FIG. 5 is a diagram illustrating an exemplary structure of an encoder.

FIG. 6 illustrates an example video conferencing service using scalable video coding.

FIG. 7 is a diagram illustrating an exemplary image transmission method.

FIG. 8 is a diagram illustrating an example method of signaling a region of interest.

FIG. 9 is a diagram illustrating an exemplary configuration of a client device.

FIG. 10 is a diagram illustrating an exemplary configuration of a controller.

FIG. 11 is a diagram illustrating an exemplary configuration of a decoder.

FIG. 12 is a diagram illustrating an exemplary method of generating and/or transmitting image configuration information.

FIG. 13 is a diagram illustrating an example method for a client device to signal image configuration information.

FIG. 14 is a diagram illustrating an exemplary method of transmitting a high/low level image.

FIG. 15 is a diagram illustrating an exemplary image decoding method.

FIG. 16 is a diagram illustrating an exemplary video encoding method.

FIG. 17 is a diagram illustrating an exemplary syntax of ROI information.

FIG. 18 is a diagram illustrating exemplary ROI information and an exemplary SEI message in XML format.

FIG. 19 illustrates an example protocol stack of a client device.

FIG. 20 is a diagram illustrating an exemplary relationship between SLT and service layer signaling (SLS).

FIG. 21 is a diagram illustrating an example SLT.

FIG. 22 is a diagram illustrating an example code value of a serviceCategory attribute.

FIG. 23 illustrates an example SLS bootstrapping and example service discovery process.

FIG. 24 is a diagram illustrating an exemplary USBD/USD fragment for ROUTE/DASH.

FIG. 25 is a diagram illustrating an example S-TSID fragment for ROUTE/DASH.

FIG. 26 illustrates an exemplary MPD fragment.

FIG. 27 is a diagram illustrating an exemplary process of receiving a virtual reality service through a plurality of ROUTE sessions.

FIG. 28 is a diagram illustrating an exemplary configuration of a client device.

FIG. 29 is a diagram illustrating an exemplary configuration of a server device.

FIG. 30 is a diagram illustrating an exemplary operation of a client device.

FIG. 31 is a diagram illustrating an exemplary operation of a server device.

It is to be noted that the technical terms used herein are used only to describe particular embodiments and are not intended to limit the spirit of the technology disclosed herein. Unless defined otherwise in this specification, the technical terms used herein should be construed as having the meanings generally understood by those skilled in the art to which the technology disclosed herein belongs, and should be interpreted neither in an overly broad sense nor in an overly narrow sense. When a technical term used herein is an incorrect term that does not accurately express the spirit of the technology disclosed herein, it should be replaced by a technical term that those skilled in the art can properly understand. General terms used herein should be interpreted as defined in the dictionary or according to the surrounding context, and should not be interpreted in an overly narrow sense.

As used herein, terms including ordinal numbers such as first and second may be used to describe various components, but the components should not be limited by the terms. The terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of rights, the first component may be referred to as the second component, and similarly, the second component may also be referred to as the first component.

Detailed Description

Hereinafter, exemplary embodiments disclosed herein will be described in detail with reference to the accompanying drawings. The same or similar components are given the same reference numerals regardless of the drawing in which they appear, and redundant description thereof will be omitted.

In addition, in describing the technology disclosed herein, if it is determined that the detailed description of the related known technology may obscure the gist of the technology disclosed herein, the detailed description thereof will be omitted. In addition, it is to be noted that the accompanying drawings are only for easily understanding the spirit of the technology disclosed in this specification, and the spirit of the technology should not be construed as being limited by the accompanying drawings.

FIG. 1 is a diagram illustrating an exemplary video conferencing system.

The video conferencing system may provide video conferencing services to at least one user located at a remote location. Video conferencing service is a service that allows people in different regions to have a meeting while looking at each other's faces on the screen without meeting each other directly.

The video conferencing system can be configured in two forms. First, a video conferencing system can be implemented using direct N:N communication between the client devices (e.g., HMDs) of the users. In this case, because each device performs its own signaling and video transmission, the total bandwidth consumed is large, but the system can provide an optimal video to each user.

Second, the video conferencing system may further include a server device (or relay system) for video conferencing. In this case, the server device may receive at least one video image from each client device, and collect / select at least one video image to serve each client device.

The example techniques described herein can be applied to both forms of video conferencing system, and will be described below with reference to the second form.

Video conferencing system 100 may include at least one client device 120, and / or server device 130 for at least one user 110 in a remote location.

The client device 120 may obtain user data from the user 110 using the client device 120. The user data may include image data, audio data, and additional data of the user.

For example, the client device 120 may include at least one of a 2D/3D camera and an immersive camera that acquire image data of the user 110. The 2D/3D camera may capture an image having a viewing angle of 180 degrees or less. The immersive camera may capture an image having a viewing angle of 360 degrees or less.

For example, the client device 120 may include at least one of a first client device 121 that acquires user data of a first user 111 located in a first place (Place 1), a second client device 123 that acquires user data of a second user 113 located in a second place (Place 2), and a third client device 125 that acquires user data of a third user 115 located in a third place (Place 3).

Then, each client device 120 may transmit the obtained user data to the server device 130 via the network.

The server device 130 may receive at least one user data from the client device 120. The server device 130 may generate the entire image for the video conference in the virtual space based on the received user data. The entire image may represent an immersive image providing an image in a 360 degree direction in the virtual space. The server device 130 may generate the entire image by mapping the image data included in the user data to the virtual space.

Thereafter, the server device 130 may transmit the entire image to each user.

Each client device 120 may receive the entire image and render and/or display only the area viewed by its user in the virtual space.

FIG. 2 is a diagram illustrating an exemplary video conferencing service.

Referring to the drawing, the first user 210, the second user 220, and the third user 230 may exist in the virtual space. The first user 210, the second user 220, and the third user 230 may perform a conference while looking at each other in a virtual space. Hereinafter, the description will be given based on the first user 210.

The video conferencing system may determine the speaker who is speaking in the virtual space and/or the line of sight of the first user 210. For example, the second user 220 may be the speaker, and the first user 210 may look at the second user 220.

In this case, the video conferencing system may transmit an image of the second user 220 viewed by the first user 210 to the first user 210 as a high quality video image. In addition, the video conferencing system may transmit an image of the third user 230, who is invisible or only partially visible in the gaze direction of the first user 210, to the first user 210 as a low quality video image.

As a result, by applying different image processing methods based on the user's gaze, the video conferencing system can save bandwidth (BW) for video transmission and improve image processing performance compared with the conventional method of transmitting all images as high quality video images.

FIG. 3 is a diagram illustrating an example scalable video coding service.

The scalable video coding service is a video compression method for providing various services in a scalable manner in terms of time, space, and picture quality in accordance with various user environments, such as network conditions or terminal resolutions, in various multimedia environments. Scalable video coding services generally provide scalability in the spatial, quality, and temporal dimensions.

Spatial scalability can be serviced by encoding different resolutions for the same image for each layer. It is possible to provide image content adaptively to devices having various resolutions such as digital TVs, laptops, and smart phones by using spatial hierarchies.

Referring to the drawings, the scalable video coding service may simultaneously support one or more TVs having different characteristics from a VSP (Video Service Provider) through a home gateway in a home. For example, the scalable video coding service may simultaneously support high-definition TV (HDTV), standard-definition TV (SDTV), and low-definition TV (LDTV) having different resolutions.

Temporal scalability may adaptively adjust the frame rate of an image in consideration of the network environment over which the content is transmitted or the capabilities of the terminal. For example, the service may be provided at a high frame rate of 60 frames per second (FPS) over a local area network and at a low frame rate of 16 FPS over a wireless broadband network such as a 3G mobile network, so that the user can receive the video without interruption.

Quality scalability likewise allows content of various image qualities to be provided according to the network environment or the performance of the terminal, so that the user can reliably play the video content.

The scalable video coding service may include a base layer and one or more enhancement layers. When the receiver receives only the base layer, it may provide general image quality, and when it receives both the base layer and the enhancement layer(s), it may provide high quality. That is, given a base layer and one or more enhancement layers, the more enhancement layers (for example, enhancement layer 1, enhancement layer 2, …, enhancement layer n) are received in addition to the base layer, the better the quality of the provided image becomes.

Because the video of the scalable video coding service is composed of a plurality of layers, the receiver can quickly receive the small amount of base layer data, process and play back an image of general quality, and, if necessary, additionally receive enhancement layer video data to improve the quality of service.
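The layered reception described above can be sketched as a simple selection rule: take the base layer first, then add enhancement layers while they still fit in the available bandwidth. This is a minimal illustration only; the function name `select_layers` and the bitrates are hypothetical and do not come from the specification.

```python
def select_layers(layer_bitrates_kbps, available_kbps):
    """Return how many layers (base layer first) fit in the available bandwidth.

    layer_bitrates_kbps lists the bitrate of each layer in decoding order:
    index 0 is the base layer, later entries are enhancement layers.
    A result of 0 means that not even the base layer fits.
    """
    total = 0
    chosen = 0
    for rate in layer_bitrates_kbps:
        if total + rate > available_kbps:
            break  # this layer and every layer above it are skipped
        total += rate
        chosen += 1
    return chosen


# Hypothetical bitrates: base layer, enhancement layer 1, enhancement layer 2.
layers = [500, 1000, 2000]
print(select_layers(layers, 1600))  # base layer + enhancement layer 1
```

A receiver on a fast local network would end up with all layers, while a receiver on a constrained mobile link would fall back to the base layer alone.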

FIG. 4 is a diagram illustrating an exemplary configuration of a server device.

The server device 400 may include a controller 410 and/or a communication unit 420.

The controller 410 may generate an entire image for a video conference in the virtual space and encode the generated entire image. In addition, the controller 410 may control all operations of the server device 400. Details are described below.

The communication unit 420 may transmit data to and/or receive data from an external device and/or a client device. For example, the communication unit 420 may receive user data and/or signaling data from at least one client device. In addition, the communication unit 420 may transmit the entire image for the video conference in the virtual space to the client device.

The controller 410 may include at least one of a signaling data extractor 411, an image generator 413, an ROI determiner 415, a signaling data generator 417, and/or an encoder 419.

The signaling data extractor 411 may extract signaling data from data received from the client device. For example, the signaling data may include image configuration information. The image configuration information may include gaze information indicating the user's gaze direction and zoom region information indicating the user's viewing angle in the virtual space.

The image generator 413 may generate the entire image for the video conference in the virtual space based on the image received from the at least one client device.

The ROI determiner 415 may determine an ROI corresponding to the user's gaze direction within the entire area of the virtual space for the video conference service. For example, the ROI determiner 415 may determine the ROI based on the gaze information and/or the zoom region information. For example, the ROI may be the location of a tile in the virtual space that the user will look at (e.g., where a new enemy appears in a game, or a speaker's location in the virtual space) and/or the location at which the user's gaze is directed. The ROI determiner 415 may also determine the virtual space for the video conference service.

ROI information indicating the ROI corresponding to the user's gaze direction within the entire region may also be generated.

The signaling data generator 417 may generate signaling data for processing the entire image. For example, the signaling data may carry the ROI information. The signaling data may be transmitted through at least one of Supplemental Enhancement Information (SEI), Video Usability Information (VUI), a slice header, and a file describing the video data.

The encoder 419 may encode the entire image based on the signaling data. For example, the encoder 419 may encode the entire image in a manner customized for each user based on each user's gaze direction. For example, when the first user looks at the second user in the virtual space, the encoder 419 may, based on the first user's gaze, encode the image corresponding to the second user in high quality and encode the image corresponding to the third user in low quality. According to an embodiment, the encoder 419 may include at least one of the signaling data extractor 411, the image generator 413, the ROI determiner 415, and/or the signaling data generator 417.

FIG. 5 is a diagram illustrating an exemplary structure of an encoder.

The encoder 500 (the image encoding apparatus) may include at least one of a base layer encoder 510, at least one enhancement layer encoder 520, and a multiplexer 530.

The encoder 500 may encode the entire image using a scalable video coding method. The scalable video coding method may include scalable video coding (SVC) and / or scalable high efficiency video coding (SHVC).

The scalable video coding method is a video compression method for providing various services in a scalable manner in terms of time, space, and picture quality according to various user environments such as network conditions or terminal resolution in various multimedia environments. For example, the encoder 500 may generate a bitstream by encoding two or more different quality (or resolution, frame rate) images for the same video image.

For example, the encoder 500 may use inter-layer prediction tools, which are encoding methods that exploit inter-layer redundancy, to increase the compression performance of a video image. An inter-layer prediction tool improves the compression efficiency of the enhancement layer by removing the redundancy that exists between the images of different layers.

The enhancement layer may be encoded by referring to information of a reference layer using an inter-layer prediction tool. The reference layer is the lower layer referenced when encoding the enhancement layer. Because the use of inter-layer tools creates a dependency between layers, decoding the image of the uppermost layer requires the bitstreams of all the lower layers it refers to. For an intermediate layer, only the bitstream of the layer to be decoded and the bitstreams of its lower layers need to be obtained and decoded. The bitstream of the lowest layer is the base layer and may be encoded by an encoder such as H.264/AVC or HEVC.
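The layer dependency described above can be illustrated by walking the chain of reference layers from a target layer down to the base layer. This is a minimal sketch; the function name `layers_required` and the dictionary representation of reference layers are assumptions made for illustration, not part of SVC or SHVC.

```python
def layers_required(target_layer, reference_layer):
    """Return the layer ids needed to decode target_layer, lowest layer first.

    reference_layer maps each enhancement layer id to the lower layer it is
    predicted from. The base layer has no entry because it references nothing.
    """
    chain = [target_layer]
    while chain[-1] in reference_layer:
        chain.append(reference_layer[chain[-1]])
    return list(reversed(chain))


# Hypothetical three-layer stream: layer 2 is predicted from layer 1,
# and layer 1 is predicted from the base layer 0.
reference_layer = {1: 0, 2: 1}
print(layers_required(2, reference_layer))  # uppermost layer needs every layer
print(layers_required(1, reference_layer))  # intermediate layer needs 0 and 1
```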

The base layer encoder 510 may generate base layer video data (or base layer bitstream) for the base layer by encoding the entire image. For example, the base layer video data may include video data for the entire area that the user views within the virtual space. The image of the base layer may be the image of the lowest quality.

The enhancement layer encoder 520 may generate at least one enhancement layer video data (or enhancement layer bitstream) for at least one enhancement layer predicted from the base layer by encoding the entire image based on the signaling data (e.g., region of interest information) and the base layer video data. The enhancement layer video data may include video data for the region of interest within the entire region.

The multiplexer 530 may multiplex base layer video data, at least one enhancement layer video data, and / or signaling data, and generate one bitstream corresponding to the entire image.
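As a rough illustration of what the multiplexer does, the sketch below packs base layer and enhancement layer payloads into a single byte stream and recovers them again. The framing (a 1-byte layer id plus a 4-byte length) is an assumption chosen for simplicity; it is not the NAL unit syntax that SVC or SHVC bitstreams actually use, and the function names are hypothetical.

```python
import struct


def multiplex(units):
    """Pack (layer_id, payload) pairs into one byte stream.

    Each unit is written as a 1-byte layer id and a 4-byte big-endian
    payload length, followed by the payload bytes (illustrative framing).
    """
    out = bytearray()
    for layer_id, payload in units:
        out += struct.pack(">BI", layer_id, len(payload))
        out += payload
    return bytes(out)


def demultiplex(stream):
    """Inverse of multiplex: recover the list of (layer_id, payload) pairs."""
    units = []
    offset = 0
    while offset < len(stream):
        layer_id, length = struct.unpack_from(">BI", stream, offset)
        offset += 5  # size of the ">BI" header
        units.append((layer_id, stream[offset:offset + length]))
        offset += length
    return units


# Placeholder payloads standing in for encoded layer data.
units = [(0, b"base"), (1, b"enh-roi")]
bitstream = multiplex(units)
print(demultiplex(bitstream) == units)
```

A client that only wants the base layer could discard every unit whose layer id is not 0 after demultiplexing.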

The client device receives the entire image as one compressed video bitstream, decodes it, and renders only the portion that the user views in the virtual space. The prior art transmits and/or receives the entire image (e.g., a 360 degree immersive image) as a high resolution (or high quality) image, so the total bandwidth of the bitstream carrying the high resolution image is very large.

The server device may use a scalable video coding method. In the following, exemplary techniques are described in detail.

The virtual space 610 may include a first user 611, a second user 613, and a third user 615. The first user 611, the second user 613, and the third user 615 may have a meeting in the virtual space 610.

The client device (not shown) may determine the speaker and the user's line of sight in the virtual space and generate image configuration information. The client device may transmit the image configuration information to the server device and/or another client device when the image configuration information is generated for the first time or when the user's gaze does not face the speaker.

The server device (not shown) may receive a video image and signaling data from at least one client device, and generate an entire image of the virtual space 610.

The server device may then encode the at least one video image based on the signaling data. Based on the image configuration information (for example, gaze information and zoom region information), the server device may encode the video image corresponding to the gaze direction (or the region of interest) and the video image not corresponding to the gaze direction at different qualities. For example, the server device may encode a video image corresponding to the user's gaze direction with high quality, and encode a video image not corresponding to the user's gaze direction with low quality.

Referring to the drawing, the first video image 630 is a video image of the ROI corresponding to the gaze direction of the first user 611. The first video image 630 needs to be provided to the first user 611 in high quality. Thus, the server device may encode the first video image 630 to generate base layer video data 633, and generate at least one enhancement layer video data 635 using inter-layer prediction.

The second video image 650 is a video image of a non-ROI that does not correspond to the gaze direction of the first user 611. The second video image 650 needs to be provided to the first user 611 in low quality. Thus, the server device may encode the second video image 650 to generate only base layer video data 653.

The server device can then send the encoded at least one bitstream to the client device used by the first user 611.

In conclusion, if the first user 611 looks only at the second user 613, or if the third user 615 occupies only a very small area within the viewing angle of the first user 611, the server device may transmit the image of the second user 613 as base layer video data and at least one enhancement layer video data of scalable video coding. In addition, the server device may transmit only base layer video data for the image of the third user 615.

FIG. 7 is a diagram illustrating an exemplary image transmission method.

The server device may receive a video image and signaling data from at least one client device using a communication unit. In addition, the server device may extract the signaling data using the signaling data extractor. For example, the signaling data may include gaze information and zoom area information.

The gaze information may indicate whether the first user looks at the second user or the third user. When the first user views the direction of the second user in the virtual space, the gaze information may indicate a direction from the first user to the second user.

The zoom area information may indicate an enlargement range and / or a reduction range of the video image corresponding to the user's gaze direction. In addition, the zoom area information may indicate a viewing angle of the user. When the video image is enlarged based on the value of the zoom area information, the first user can see only the second user. When the video image is reduced based on the value of the zoom area information, the first user may view part and / or all of the third user as well as the second user.

The server device may then generate the entire image for the video conference in the virtual space using the image generator.

In operation 710, the server device may, using the ROI determiner, identify image configuration information about the viewpoint and the zoom region viewed by each user in the virtual space based on the signaling data.

In operation 720, the server device may determine the ROI of the user based on the image configuration information using the ROI determiner.

When the first user views the second user, the second user may occupy a large area of the video image corresponding to the first user's gaze direction, while the third user may occupy a small area or may not be included in the video image at all. In this case, the ROI may be an area including the second user. The ROI may change according to the gaze information and the zoom area information.
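One way to picture how the ROI follows from the gaze information and the zoom area information is to treat the ROI as a horizontal angular interval centered on the gaze direction. The sketch below is only an illustration under stated assumptions: the function names, the 90 degree default viewing angle, and the rule that zooming in narrows the viewing angle proportionally are all hypothetical, and the sketch assumes the resulting viewing angle is below 360 degrees.

```python
def determine_roi(gaze_deg, zoom_factor, base_viewing_angle_deg=90.0):
    """Return the ROI as (start, end) angles in degrees, each in [0, 360).

    Assumption: zooming in by zoom_factor divides the viewing angle by the
    same factor, so a larger zoom gives a narrower ROI.
    """
    viewing_angle = base_viewing_angle_deg / zoom_factor
    start = (gaze_deg - viewing_angle / 2.0) % 360.0
    end = (gaze_deg + viewing_angle / 2.0) % 360.0
    return start, end


def in_roi(direction_deg, roi):
    """True if direction_deg lies inside the ROI, handling wrap-around at 360."""
    start, end = roi
    d = direction_deg % 360.0
    if start <= end:
        return start <= d <= end
    return d >= start or d <= end


# Hypothetical layout seen from the first user: the second user is at
# 10 degrees and the third user at 120 degrees.
roi = determine_roi(gaze_deg=0.0, zoom_factor=1.0)
print(in_roi(10.0, roi), in_roi(120.0, roi))
```

With the gaze turned toward 60 degrees and the image reduced (a zoom factor of 0.5 in this model), the interval widens enough to include the third user as well.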

When the signaling data (e.g., at least one of the gaze information and the zoom area information) is changed, the server device may receive new signaling data. In this case, the server device may determine a new region of interest based on the new signaling data.

Then, the server device may determine whether the data currently processed based on the signaling data is data corresponding to the ROI, using the control unit.

When the signaling data is changed, the server device may determine whether the data currently being processed is data corresponding to the ROI based on the new signaling data.

In the case of data corresponding to the region of interest, the server device may encode a video image (eg, the region of interest) corresponding to the viewpoint of the user with high quality by using an encoder (740). For example, the server device may generate base layer video data and enhancement layer video data for the corresponding video image and transmit them.

When the signaling data is changed, the server device may transmit the video image corresponding to the new viewpoint (the new region of interest) as a high quality image. If the server device was transmitting a low quality image and, because the signaling data changed, must now transmit a high quality image, the server device may additionally generate and/or transmit enhancement layer video data.

If the data does not correspond to the ROI, the server device may encode a video image (eg, the non-ROI) that does not correspond to the user's viewpoint with low quality (750). For example, the server device may generate only base layer video data for a video image that does not correspond to a user's viewpoint, and transmit the base layer video data.

When the signaling data is changed, the server device may transmit the video image that does not correspond to the user's new viewpoint (the new non-ROI) as a low quality image. If the server device was previously transmitting a high quality image and, because the signaling data changed, must now transmit a low quality image, the server device may stop generating and/or transmitting the at least one enhancement layer video data and may generate and/or transmit only base layer video data.
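The handling of changed signaling data described above amounts to comparing the old and new regions of interest: enhancement layer data is started for regions that enter the ROI and stopped for regions that leave it, while base layer data continues for every region. The following sketch assumes regions are identified by simple labels; the function names and labels are hypothetical.

```python
def enhancement_layer_changes(old_roi, new_roi):
    """Return (start, stop): regions whose enhancement layer is added or dropped."""
    old_roi, new_roi = set(old_roi), set(new_roi)
    start = new_roi - old_roi  # newly looked-at regions get enhancement data
    stop = old_roi - new_roi   # regions no longer looked at keep base layer only
    return start, stop


def layers_to_send(regions, roi):
    """Map each region to the layers transmitted for it."""
    return {
        region: ["base", "enhancement"] if region in roi else ["base"]
        for region in regions
    }


# The first user turns from the second user toward the third user.
start, stop = enhancement_layer_changes({"second_user"}, {"third_user"})
print(start, stop)
print(layers_to_send(["second_user", "third_user"], {"third_user"}))
```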

That is, the quality of a video image reconstructed from base layer video data alone is lower than that of a video image reconstructed with enhancement layer video data as well. However, the client device may receive enhancement layer video data for the video image (e.g., the region of interest) corresponding to the user's gaze direction, and may thereby provide the user with a high quality video image within a short time.

The exemplary method of the present specification has a great advantage over the simple pre-caching method of receiving only data of some additional area in advance, or a method of receiving only data of an area corresponding to a user's gaze direction.

Exemplary methods herein can lower the overall bandwidth as compared to conventional methods of sending all data in high quality.

In addition, the exemplary method herein may speed up video processing in response to user eye movement in real time.

In the conventional method, when the first user looks at the second user and then turns toward the third user, the client device (for example, a sensor of an HMD) detects this movement, and the video information for representing the third user is then processed and played on the screen. Because it is difficult for the conventional method to process the image of a new area quickly enough, the conventional method resorts to the inefficient approach of receiving all data in advance.

However, since the exemplary technique of the present specification performs adaptive video transmission through the scalable video described above, when the first user turns his or her head toward the third user, the client device can respond quickly to the user by using the base layer data it already has. Exemplary techniques herein can reproduce video images faster than when processing full high definition data. Thus, the example techniques herein can process video images in rapid response to eye movement.

Referring to FIG. (A), it illustrates a method of signaling a region of interest in scalable video.

The server device (or encoder) may divide one video image (or picture) into several tiles having a rectangular shape. For example, the video image may be partitioned on the basis of a Coding Tree Unit (CTU) unit. For example, one CTU may include Y CTB, Cb CTB, and Cr CTB.

The server device may encode the video image of the base layer as a whole, without dividing it into tiles, for fast user response. In addition, the server device may encode a video image of one or more enhancement layers by dividing a part or the whole of it into several tiles as necessary.

That is, the server device may divide the video image of the enhancement layer into at least one tile and encode tiles corresponding to a region of interest (ROI).

Here, the region of interest 810 may be the position of the tiles in which an important object to be seen by the user is located in the virtual space (eg, a position where a new enemy appears in a game, or a speaker's position in the virtual space) and/or the position toward which the user's gaze is directed.

In addition, the server device may generate the ROI information including tile information for identifying at least one tile included in the ROI. For example, the ROI information may be generated by the ROI determiner, the signaling data generator, and / or an encoder.
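As a minimal sketch of how tile information identifying the tiles of a region of interest might be derived, the following maps a pixel rectangle onto a uniform tile grid. The uniform grid, the 0-based raster-order tile numbering, and the function name `roi_tile_ids` are assumptions made for illustration. The specification does not fix these details.

```python
def roi_tile_ids(pic_w, pic_h, cols, rows, roi):
    """Return raster-order tile numbers overlapped by roi = (x, y, w, h) in pixels.

    Assumes a uniform cols x rows tile grid and 0-based numbering, left to right
    and top to bottom.
    """
    x, y, w, h = roi
    if w <= 0 or h <= 0:
        return []
    # Clamp the rectangle to the picture so out-of-range gaze values stay valid.
    x0 = max(0, min(pic_w - 1, x))
    y0 = max(0, min(pic_h - 1, y))
    x1 = max(0, min(pic_w - 1, x + w - 1))
    y1 = max(0, min(pic_h - 1, y + h - 1))
    # Map the first and last covered pixel to tile column and row indices.
    c0, c1 = x0 * cols // pic_w, x1 * cols // pic_w
    r0, r1 = y0 * rows // pic_h, y1 * rows // pic_h
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]
```

The resulting list is one possible input for the tile information carried in the ROI information.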

Since the tiles of the region of interest 810 are contiguous, the tile information of the region of interest 810 may be effectively compressed without listing the number of every tile. For example, the tile information may include not only the numbers of all tiles corresponding to the ROI, but also the start and end numbers of the tiles, coordinate point information, a list of coding unit (CU) numbers, and a tile number expressed by a formula.
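The compression of contiguous tile numbers into a start number and an end number can be illustrated with a short sketch. The tags `"range"` and `"list"` and the function names are assumptions introduced here. They do not correspond to any syntax element of the specification.

```python
def compact_tile_info(tile_ids):
    """Represent tile numbers as ("range", [start, end]) when they are consecutive,
    otherwise as ("list", [all numbers])."""
    ids = sorted(set(tile_ids))
    if len(ids) > 2 and ids[-1] - ids[0] + 1 == len(ids):
        return ("range", [ids[0], ids[-1]])
    return ("list", ids)


def expand_tile_info(kind, values):
    """Recover the full list of tile numbers from either representation."""
    if kind == "range":
        start, end = values
        return list(range(start, end + 1))
    return list(values)
```

With this sketch, seven consecutive tiles are carried as two numbers, while a non-contiguous set falls back to the full list.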

The tile information of the non-interested region may be sent to other client devices, image processing computing equipment, and / or servers after undergoing Entropy coding provided by the encoder.

The ROI information can be transmitted through a high-level syntax protocol that carries session information. In addition, the ROI information may be transmitted in packet units such as Supplemental Enhancement Information (SEI), video usability information (VUI), and a slice header of the video standard. In addition, the ROI information may be delivered as a separate file describing the video file (e.g. DASH MPD).

The video conferencing system can lower overall bandwidth and reduce video processing time by transmitting and / or receiving only necessary tiles of the enhancement layer between client devices and / or between client and server devices through signaling of region of interest information. This is important to ensure fast HMD user response time.

Referring to FIG. (B), it shows a method of signaling a region of interest in a single screen video.

An exemplary technique of the present specification may use a technique of degrading image quality by downscaling (downsampling) a region that is not a region of interest (ROI) in a single screen image that is not scalable video. In the prior art, the filter information 820 used for downscaling is not shared between the terminals using a service. Instead, a single technique is agreed on from the beginning, or only the encoder knows the filter information.

However, the server device may transmit the filter information 820 used at the time of encoding to the client device so that the client device (or the HMD terminal) receiving the encoded image can improve the quality of the downscaled region outside the region of interest. This technology can significantly reduce image processing time and improve picture quality.

As described above, the server device may generate the region of interest information. For example, the ROI information may further include filter information as well as tile information. For example, the filter information may include the number of a pre-agreed filter candidate and the values used in the filter.

FIG. 9 is a diagram illustrating an exemplary configuration of a client device.

The client device 900 may include at least one of an image input unit 910, an audio input unit 920, a sensor unit 930, an image output unit 940, an audio output unit 950, a communication unit 960, and/or a controller 970. For example, the client device 900 may be a head mounted display (HMD). In addition, the controller 970 of the client device 900 may be included in the client device 900 or may exist as a separate device.

The image input unit 910 may capture a video image. The image input unit 910 may include at least one of a 2D / 3D camera and / or an immersive camera that acquires an image of a user. The 2D / 3D camera may capture an image having a viewing angle of 180 degrees or less. Immersive cameras can capture images with a viewing angle of less than 360 degrees.

The audio input unit 920 may record a user's voice. For example, the audio input unit 920 may include a microphone.

The sensor unit 930 may acquire information about the movement of the user's gaze. For example, the sensor unit 930 may include a gyro sensor for detecting a change in azimuth of an object, an acceleration sensor for measuring an acceleration or impact strength of a moving object, and an external sensor for detecting a user's gaze direction. In some embodiments, the sensor unit 930 may include an image input unit 910 and an audio input unit 920.

The image output unit 940 may output image data received from the communication unit 960 or stored in a memory (not shown).

The audio output unit 950 may output audio data received from the communication unit 960 or stored in a memory.

The communication unit 960 may communicate with an external client device and / or server device through a broadcast network and / or broadband. For example, the communication unit 960 may include a transmitter (not shown) for transmitting data and / or a receiver (not shown) for receiving data.

The controller 970 may control all operations of the client device 900. The controller 970 may process video data and signaling data received from the server device. Details of the controller 970 will be described below.

FIG. 10 is a diagram illustrating an exemplary configuration of a controller.

The controller 1000 may process signaling data and/or video data. The controller 1000 may include at least one of a signaling data extractor 1010, a decoder 1020, a speaker determiner 1030, a gaze determiner 1040, and/or a signaling data generator 1050.

The signaling data extractor 1010 may extract signaling data from data received from the server device and / or another client device. For example, the signaling data may include ROI information.

The decoder 1020 may decode video data based on the signaling data. For example, the decoder 1020 may decode the entire image in a customized manner for each user based on the gaze direction of each user. For example, when the first user looks at the second user in the virtual space, the decoder 1020 of the first user may, based on the first user's gaze in the virtual space, decode the image corresponding to the second user with high quality and decode the image corresponding to the third user with low quality. According to an embodiment, the decoder 1020 may include at least one of a signaling data extractor 1010, a speaker determiner 1030, a gaze determiner 1040, and/or a signaling data generator 1050.

The speaker determination unit 1030 may determine who the speaker is in the virtual space based on the voice and / or the given option.

The gaze determiner 1040 may determine the gaze of the user in the virtual space and generate image configuration information. For example, the image configuration information may include gaze information indicating a gaze direction and / or zoom area information indicating a viewing angle of a user.

The signaling data generator 1050 may generate signaling data for transmission to the server device and/or another client device. For example, the signaling data may carry image configuration information. The signaling data may be transmitted through at least one of Supplemental Enhancement Information (SEI), video usability information (VUI), a slice header, and a file describing video data.

FIG. 11 is a diagram illustrating an exemplary configuration of a decoder.

Decoder 1100 may include at least one of extractor 1110, base layer decoder 1120, and / or at least one enhancement layer decoder 1130.

The decoder 1100 may decode a bitstream (video data) using an inverse process of the scalable video coding method.

The extractor 1110 may receive a bitstream (video data) including video data and signaling data and selectively extract a bitstream according to the image quality of an image to be reproduced. For example, the bitstream (video data) may include a base layer bitstream (base layer video data) for the base layer and at least one enhancement layer bitstream (enhancement layer video data) for at least one enhancement layer predicted from the base layer. The base layer bitstream (base layer video data) may include video data for the entire area of the virtual space. The at least one enhancement layer bitstream (enhancement layer video data) may include video data for the region of interest within the entire region.

In addition, the signaling data may include ROI information indicating an ROI corresponding to the gaze direction of the user in the entire area of the virtual space for the video conference service.

The base layer decoder 1120 may decode a bitstream (or base layer video data) of a base layer for a low quality image.

The enhancement layer decoder 1130 may decode at least one bitstream (or enhancement layer video data) of at least one enhancement layer for a high quality image based on the signaling data and/or the bitstream (or base layer video data) of the base layer.
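The selective extraction performed before the base layer decoder and the enhancement layer decoder can be sketched as a filter over bitstream units. The `Unit` record, the convention that `layer_id` 0 denotes the base layer, and the use of `tile_id` -1 for units not divided into tiles are assumptions introduced for illustration. A real bitstream would be parsed according to its video standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    layer_id: int   # assumed convention: 0 = base layer, >0 = enhancement layer
    tile_id: int    # assumed convention: -1 when the unit is not divided into tiles
    payload: bytes


def extract(units, roi_tiles, want_high_quality=True):
    """Keep all base layer units, plus enhancement units for tiles in the ROI."""
    roi = set(roi_tiles)
    selected = []
    for unit in units:
        if unit.layer_id == 0:
            selected.append(unit)  # the base layer covers the entire area
        elif want_high_quality and unit.tile_id in roi:
            selected.append(unit)  # enhancement data only for the region of interest
    return selected
```

The units kept with `layer_id` 0 would go to the base layer decoder, and the remaining units to the enhancement layer decoder.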

Hereinafter, a method of generating image configuration information for responding to the movement of the user's eye in real time will be described.

The image configuration information may include at least one of gaze information indicating a gaze direction of a user and / or zoom area information indicating a viewing angle of the user. The user's gaze refers to the direction that the user looks in the virtual space, not the real space. In addition, the gaze information may include not only information indicating a direction of a gaze of the current user, but also information indicating a gaze direction of the user in the future (for example, information about a gaze point expected to receive attention).

The client device may sense an operation of looking at another user located in a virtual space centered on the user and process the same.

The client device may receive the sensing information from the sensor unit by using the controller and / or the gaze determination unit. The sensing information may be an image photographed by a camera and a voice recorded by a microphone. In addition, the sensing information may be data sensed by a gyro sensor, an acceleration sensor, and an external sensor.

In operation 1210, the client device may identify a movement of the user's gaze based on the sensing information by using the controller and / or the gaze determination unit. For example, the client device may check the movement of the user's gaze based on the change in the value of the sensing information.

In operation 1220, the client device may generate image configuration information in the virtual conference space by using the controller and / or the gaze determiner. For example, when the client device physically moves or the user's gaze moves, the client device may calculate the gaze information and / or the zoom area information of the user in the virtual conference space based on the sensing information.
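One way the gaze information might be calculated from the sensing information is to integrate angular rates reported by a gyro sensor. The following is only a sketch under stated assumptions: gaze is represented as yaw and pitch in degrees, rates are in degrees per second, and the function name `update_gaze` is hypothetical. The specification does not prescribe this representation.

```python
def update_gaze(yaw, pitch, yaw_rate, pitch_rate, dt):
    """Integrate gyro rates over dt seconds and return the new (yaw, pitch).

    Yaw wraps into [-180, 180) so that turning past the back of the virtual
    space stays continuous. Pitch is clamped to [-90, 90].
    """
    new_yaw = (yaw + yaw_rate * dt + 180.0) % 360.0 - 180.0
    new_pitch = max(-90.0, min(90.0, pitch + pitch_rate * dt))
    return new_yaw, new_pitch
```

The resulting angles, together with a viewing angle for the zoom area information, could form the image configuration information that is transmitted in operation 1230.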

In operation 1230, the client device may transmit image configuration information to the server device and / or another client device using the communication unit. In addition, the client device may transfer the image configuration information to its other components.

In the above, the method for generating image configuration information by the client device has been described. However, the present invention is not limited thereto, and the server device may receive sensing information from the client device and generate image configuration information.

In addition, an external computing device connected with the client device may generate the image configuration information, and the computing device may deliver the image configuration information to its client device, another client device, and / or a server device.

The signaling of image configuration information (including gaze information and/or zoom area information) is very important. If the image configuration information is signaled too frequently, it may burden the client device, the server device, and/or the entire network.

Therefore, the client device may signal the image configuration information only when the image configuration information (or the gaze information and / or the zoom area information) of the user is changed. That is, the client device may transmit the gaze information of the user to other client devices and / or server devices only when the gaze information of the user is changed.
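The rule of signaling only on change can be sketched as a small wrapper around a send function. The class name `ChangeOnlySignaler` and the representation of image configuration information as a `(gaze, zoom)` tuple are assumptions made for illustration.

```python
class ChangeOnlySignaler:
    """Forwards image configuration information only when it differs from the
    last value that was sent."""

    def __init__(self, send):
        self._send = send   # callable that transmits to the server or other clients
        self._last = None   # last image configuration information actually sent

    def update(self, gaze, zoom):
        config = (gaze, zoom)
        if config == self._last:
            return False    # unchanged: do not load the network with signaling
        self._send(config)
        self._last = config
        return True
```

Calling `update` repeatedly with the same gaze and zoom values results in a single transmission.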

In one embodiment, taking advantage of the fact that the speaker usually receives attention in a video conference, the gaze information may be signaled to the client device of another user or to the server device only when the user's gaze is directed somewhere other than the speaker.

For a user who is not currently speaking but who should receive attention (for example, a lecturer in an online lecture) or who is writing something on a board, the client device may obtain information on the speaker through an option in the system (eg, an option that designates the speaker and/or the lecturer as the second user).

Referring to the drawing, the client device may determine who is the speaker in the virtual space area for the video conference by using the controller and / or the speaker determination unit (1310). For example, the client device may determine who is the speaker based on the sensing information. In addition, the client device may determine who is the speaker according to the given options.

Thereafter, the client device may determine the gaze of the user by using the controller and / or the gaze determination unit (1320). For example, the client device may generate image configuration information based on the gaze of the user using the controller and / or the gaze determiner.

Then, the client device may determine whether the user's eyes are directed to the speaker by using the controller and / or the gaze determination unit (1330).

When the gaze of the user faces the speaker, the client device may not signal the image configuration information using the communication unit (1340). In this case, the client device may continue to receive the image of the speaker in the user's gaze direction with high quality, and may receive the image that is not in the user's gaze direction with the low quality.

When the gaze of the user does not face the speaker, the client device may signal the image configuration information using the communication unit (1350). For example, if the user's gaze was first directed to the speaker but later moved elsewhere, the client device may signal image configuration information for the user's new gaze direction. That is, the client device may transmit image configuration information for the new gaze direction to other client devices and/or server devices. In this case, the client device may receive the image corresponding to the new gaze of the user with high quality, and may receive the image that does not correspond to the new gaze of the user (for example, the image corresponding to the speaker) with low quality.
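The decision flow of steps 1310 to 1350 can be summarized in a short sketch. Participants are identified by plain strings, and the function names `should_signal` and `quality_per_participant` are assumptions introduced here for illustration.

```python
def should_signal(gaze_target, speaker):
    """Step 1330: signal image configuration information (1350) only when the
    user's gaze is not directed at the speaker. Otherwise stay silent (1340)."""
    return gaze_target != speaker


def quality_per_participant(participants, gaze_target):
    """The image in the user's gaze direction is received with high quality,
    all other images with low quality."""
    return {p: ("high" if p == gaze_target else "low") for p in participants}
```

When the gaze target equals the speaker, no signaling takes place and the speaker's image remains the high quality one, which matches the default behavior described above.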

In the above description, the client device generates and/or transmits the image configuration information. However, the server device may also receive the sensing information from the client device, generate the image configuration information based on the sensing information, and transmit the image configuration information to at least one client device.

As described above, in a situation where all users are looking at a speaker in a video conference in a virtual space using a client device (eg, an HMD), the video conference system may transmit the speaker's video information as scalable video data consisting of base layer data and enhancement layer data. In addition, the video conferencing system may receive signaling from a user looking at a user other than the speaker, and may transmit video information of the other user as scalable video data consisting of base layer data and enhancement layer data. Through this, the video conferencing system can provide fast and high quality video information to the user while greatly reducing the signaling on the entire system.

The above-mentioned signaling may be signaling between a server device, a client device, and / or an external computing device (if present). In addition, the above-mentioned signaling may be signaling between a client device and / or an external computing device (if present).

Methods of transmitting a high/low level image based on the user's gaze information may include a method of switching layers of a scalable codec (1410), a rate control method using a quantization parameter (QP) for a single bitstream with real-time encoding (1420), a method of switching in units of chunks for a single bitstream such as DASH (1430), a down scaling/up scaling method (1440), and/or, in the case of rendering, a high-definition rendering method that uses more resources (1450).

Although the example technique described above refers to a differential transmission technique over scalable video (1410), even when using a general video coding technique with a single layer, adjusting the quantization parameter (1420) or the degree of down/up scaling (1440) may provide advantages such as lowering the overall bandwidth and quickly responding to user eye movement. In addition, when using files that have been transcoded in advance into bitstreams having several bitrates, the exemplary technique of the present specification may switch between high level images and low level images in units of chunks (1430).
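For the single-layer case using a quantization parameter (1420), a per-tile QP assignment might look like the sketch below. In H.264/AVC and HEVC, QP ranges from 0 to 51 and a lower QP gives finer quantization and higher quality. The specific values 22 and 37 and the function name `qp_map` are illustrative assumptions, not values from the specification.

```python
def qp_map(num_tiles, roi_tiles, roi_qp=22, non_roi_qp=37):
    """Return one QP per tile: a lower QP (higher quality) for ROI tiles."""
    for qp in (roi_qp, non_roi_qp):
        if not 0 <= qp <= 51:
            raise ValueError("QP must be in the range 0..51")
    roi = set(roi_tiles)
    return [roi_qp if tile in roi else non_roi_qp for tile in range(num_tiles)]
```

An encoder that accepts per-region QP values could consume such a map. How the map is handed to a particular encoder is outside the scope of this sketch.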

In addition, although the present specification takes a video conferencing system as an example, the present specification may be equally applicable to VR (Virtual Reality) and AR (Augmented Reality) games and the like using the HMD. That is, the techniques of providing a high level image for the area corresponding to the user's gaze, and of signaling only when the user looks at something other than the area or object that the user is expected to look at, may all be applied in the same way as in the video conferencing example.

FIG. 15 is a diagram illustrating an exemplary image decoding method.

The image decoding apparatus (or decoder) may include at least one of an extractor, a base layer decoder, and / or an enhancement layer decoder. The contents of the image decoding apparatus and / or the image decoding method may include all related contents among the above descriptions of the server device and / or the image decoding apparatus (or the decoder).

The image decoding apparatus may use the extractor to receive a bitstream including video data and signaling data (1510). The image decoding apparatus may extract signaling data, base layer video data, and / or at least one enhancement layer video data from the video data.

In addition, the image decoding apparatus may decode base layer video data using a base layer decoder (1520).

In addition, the image decoding apparatus may decode at least one enhancement layer video data based on the signaling data and the base layer video data using the enhancement layer decoder (1530).

For example, video data may include the base layer video data for a base layer and the at least one enhancement layer video data for at least one enhancement layer predicted from the base layer.

In addition, the base layer video data may include video data for the entire region, and the at least one enhancement layer video data may include video data for the region of interest in the entire region.

The at least one enhancement layer may be divided into at least one tile having a rectangular shape for each layer, and the ROI information may include tile information for identifying at least one tile included in the ROI.

In addition, the ROI information is generated based on the image configuration information, and the image configuration information may include gaze information indicating a direction of the user's gaze in a virtual space and zoom area information indicating the user's viewing angle.

Also, the image configuration information may be signaled when the gaze direction of the user does not face the speaker.

In addition, the signaling data may be transmitted through at least one of Supplemental Enhancement Information (SEI), video usability information (VUI), a slice header, and a file describing the video data.

FIG. 16 is a diagram illustrating an exemplary video encoding method.

The image encoding apparatus (or encoder) may include at least one of a base layer encoder, an enhancement layer encoder, and / or a multiplexer. The contents of the image encoding apparatus and / or the image encoding method may include all related contents among the descriptions of the client device and / or the image encoding apparatus (or the encoder) described above.

The image encoding apparatus may generate base layer video data using the base layer encoder (1610).

In addition, the apparatus for encoding an image may generate at least one enhancement layer video data based on the signaling data and the base layer video data using the enhancement layer encoder.

In addition, the apparatus for encoding an image may generate a bitstream including video data and signaling data using a multiplexer.

The image encoding apparatus and / or the image encoding method may perform an inverse process of the image decoding apparatus and / or the image decoding method. In addition, common features may be included for this purpose.

FIG. 17 is a diagram illustrating an exemplary syntax of ROI information.

Referring to FIG. (A), the ROI information (sighted_tile_info) for each video picture is shown. For example, the ROI information may include at least one of info_mode information, tile_id_list_size information, tile_id_list information, cu_id_list_size information, cu_id_list information, user_info_flag information, user_info_size information, and / or user_info_list.

The info_mode information may indicate a mode of information expressing a region of interest for each picture. The info_mode information may be represented by 4 bits of unsigned information. Alternatively, the info_mode information may indicate the mode of the included information. For example, when the value of the info_mode information is '0', the info_mode information may indicate that the previous information mode is used as it is. If the value of the info_mode information is '1', the info_mode information may indicate a list of all tile numbers corresponding to the ROI. If the value of the info_mode information is '2', the info_mode information may indicate the start number and end number of consecutive tiles corresponding to the region of interest. If the value of the info_mode information is '3', the info_mode information may indicate the numbers of the upper left and lower right tiles of the ROI. If the value of the info_mode information is '4', the info_mode information may indicate the numbers of the tiles corresponding to the ROI and the numbers of the coding units included in the tiles.

The tile_id_list_size information may indicate the length of the tile number list. The tile_id_list_size information may be represented by 8 bits of unsigned information.

The tile_id_list information may include a tile number list based on the info_mode information. Each tile number may be represented by unsigned 8 bits of information. Depending on the info_mode information, the tile_id_list information may include one of the following: the numbers of all tiles corresponding to the region of interest (if info_mode information = 1), the start number and end number of consecutive tiles (if info_mode information = 2), or the numbers of the upper left and lower right tiles of the region of interest (if info_mode information = 3).

The cu_id_list_size information may indicate the length of a coding unit list. The cu_id_list_size information may be represented by unsigned 16 bits of information.

The cu_id_list information may include a list of coding unit numbers based on the info_mode information. Each coding unit number may be represented by unsigned 16 bits of information. For example, the cu_id_list information may indicate a list of coding unit numbers corresponding to the ROI (for example, if info_mode information = 4) based on the info_mode information.

The user_info_flag information may be a flag indicating additional user information mode. The user_info_flag information may indicate whether there is tile-related information that the user and / or provider additionally want to transmit. The user_info_flag information may be represented by unsigned 1 bit information. For example, if the value of the user_info_flag information is '0', it may be indicated that there is no additional user information. If the value of the user_info_flag information is '1', it may indicate that there is additional user information.

The user_info_size information may indicate the length of additional user information. The user_info_size information may be represented by unsigned 16 bits of information.

The user_info_list information may include a list of additional user information. Each additional user information may be represented by unsigned information of a variable number of bits.
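The field widths given above can be illustrated by a bit-packing sketch. The widths (4, 8, 8, 16, 16, 1, and 16 bits) come from the description. The field order, the rule that the coding unit list is written only when info_mode is 4, the byte-wise coding of the additional user information, the zero padding to a byte boundary, and the names `BitWriter` and `pack_sighted_tile_info` are all assumptions made for illustration.

```python
class BitWriter:
    """Accumulates unsigned fields, most significant bit first."""

    def __init__(self):
        self._value = 0
        self._nbits = 0

    def write(self, value, width):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        self._value = (self._value << width) | value
        self._nbits += width

    def to_bytes(self):
        pad = (-self._nbits) % 8  # zero-pad up to a byte boundary (assumption)
        return (self._value << pad).to_bytes((self._nbits + pad) // 8, "big")


def pack_sighted_tile_info(info_mode, tile_ids=(), cu_ids=(), user_info=b""):
    """Serialize per-picture ROI information using the stated field widths."""
    w = BitWriter()
    w.write(info_mode, 4)               # info_mode: 4 bits
    w.write(len(tile_ids), 8)           # tile_id_list_size: 8 bits
    for tile_id in tile_ids:
        w.write(tile_id, 8)             # each tile number: 8 bits
    if info_mode == 4:                  # assumption: CU list only in mode 4
        w.write(len(cu_ids), 16)        # cu_id_list_size: 16 bits
        for cu_id in cu_ids:
            w.write(cu_id, 16)          # each coding unit number: 16 bits
    w.write(1 if user_info else 0, 1)   # user_info_flag: 1 bit
    if user_info:
        w.write(len(user_info), 16)     # user_info_size: 16 bits
        for byte in user_info:
            w.write(byte, 8)            # assumption: user info carried as bytes
    return w.to_bytes()
```

For instance, `pack_sighted_tile_info(2, [6, 12])` encodes a region of interest that runs from tile 6 to tile 12 in 29 bits, padded to 4 bytes.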

Referring to FIG. (B), the ROI information for each file, chunk, and video picture group is shown. For example, the ROI information may include at least one of a version information field, an entire data size field, and / or at least one unit information field.

Referring to the figure, the region of interest information (sighted_tile_info) for each file, chunk, and video picture group is shown. For example, the ROI information may include at least one of version_info information, file_size information, and / or unit information.

The version_info information may indicate a version of the ROI information (or signaling standard). The version_info information may be represented by unsigned 8 bits of information.

The file_size information may indicate the size of the unit information. The file_size information may be represented by unsigned 64-bit information. For example, the file_size information may indicate a file size, chunk size, and video picture group size.

The unit information may include region of interest information for each file unit, chunk unit, and / or video picture group unit.

The unit information may include at least one of poc_num information, info_mode information, tile_id_list_size information, tile_id_list information, cu_id_list_size information, cu_id_list information, user_info_flag information, user_info_size information, and / or user_info_list information.

The poc_num information may indicate the number of a video picture. For example, the picture number field may indicate a picture order count (POC) in HEVC and a corresponding picture (frame) number in a general video codec. The poc_num information may be represented by unsigned 32 bits of information.

Since detailed information about the info_mode information, the tile_id_list_size information, the tile_id_list information, the cu_id_list_size information, the cu_id_list information, the user_info_flag information, the user_info_size information, and / or the user_info_list information is the same as the above description, detailed description thereof will be omitted.
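The per-file layout described above (an 8-bit version_info, a 64-bit file_size, and unit information that begins with a 32-bit poc_num) can be sketched with the standard `struct` module. Big-endian byte order, the interpretation of file_size as the total byte length of the unit information, the opaque per-picture payload, and the function names are assumptions introduced for illustration.

```python
import struct


def pack_roi_file(version, units):
    """units: iterable of (poc_num, payload) pairs, where payload holds the
    already serialized per-picture ROI fields (treated as opaque bytes here)."""
    body = b"".join(struct.pack(">I", poc_num) + payload for poc_num, payload in units)
    # version_info: unsigned 8 bits, file_size: unsigned 64 bits
    return struct.pack(">BQ", version, len(body)) + body


def unpack_header(data):
    """Return (version_info, file_size) read from the first 9 bytes."""
    return struct.unpack_from(">BQ", data, 0)
```

A reader can use the returned size to locate the end of the unit information before parsing each poc_num.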

The ROI information may be generated at the server device (or an image transmitting apparatus) and transmitted to at least one client device (or an image receiving apparatus).

In addition, the ROI information may be generated in at least one client device (or image receiving apparatus) and transmitted to at least one client device (or image receiving apparatus) and / or server device (or image transmitting apparatus). In this case, the client device and / or the controller of the client device may further include the above-described signaling data extractor, image generator, ROI determiner, signaling data generator, and / or encoder.

Referring to FIG. (A), the ROI information (sighted_tile_info) may be expressed in an XML form. For example, the ROI information (sighted_tile_info) may include info_mode information ('3'), tile_id_list_size information ('6'), and/or tile_id_list information ('6, 7, 8, 9, 10, 11, 12').
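An XML form of the ROI information might be produced and read back as in the following sketch. The element names are taken from the field names of the description, but the exact XML layout of the figure is not reproduced here, so the structure below (one child element per field, comma-separated tile numbers) is an assumption.

```python
import xml.etree.ElementTree as ET


def roi_to_xml(info_mode, tile_ids):
    """Build a sighted_tile_info element with one child per field."""
    root = ET.Element("sighted_tile_info")
    ET.SubElement(root, "info_mode").text = str(info_mode)
    ET.SubElement(root, "tile_id_list_size").text = str(len(tile_ids))
    ET.SubElement(root, "tile_id_list").text = ", ".join(str(t) for t in tile_ids)
    return ET.tostring(root, encoding="unicode")


def roi_from_xml(text):
    """Parse the XML back into (info_mode, list of tile numbers)."""
    root = ET.fromstring(text)
    raw = root.findtext("tile_id_list", default="")
    tile_ids = [int(part) for part in raw.split(",") if part.strip()]
    return int(root.findtext("info_mode")), tile_ids
```

Such a description could accompany the video as a separate file, in the same way as other file-based signaling mentioned in the specification.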

Referring to FIG. (B), the payload syntax (Syntax) of the Supplemental Enhancement Information (SEI) message in the international video standard is shown. The SEI message indicates additional information that is not essential in the decoding process of the video coding layer (VCL).

The region of interest information (sighted_tile_info, 1810) may be included in an SEI message of High Efficiency Video Coding (HEVC), MPEG-4, and/or Advanced Video Coding (AVC) and transmitted through a broadcast network and/or broadband. For example, the SEI message may be included in the compressed video data.

Hereinafter, a method of transmitting and / or receiving video data and / or signaling data for a virtual reality service through a broadcast network and / or broadband will be described.

FIG. 19 illustrates an example protocol stack of a client device.

In the figure, the broadcast protocol stack may be divided into a portion transmitted through a service list table (SLT) and the MPEG Media Transport Protocol (MMTP), and a portion transmitted through Real-time Object Delivery over Unidirectional Transport (ROUTE).

The SLT 1910 may be encapsulated through a User Datagram Protocol (UDP) and an Internet Protocol (IP) layer. The MPEG Media Transport Protocol (MMTP) may transmit data 1920 formatted in the Media Processing Unit (MPU) format defined in MPEG Media Transport (MMT) and signaling data 1930 according to MMTP. These data can be encapsulated over the UDP and IP layers. ROUTE may transmit data 1960 formatted in the form of a Dynamic Adaptive Streaming over HTTP (DASH) segment, signaling data 1940, and non-timed data 1950 such as non-real time (NRT) data. These data can also be encapsulated over the UDP and IP layers.

The part transmitted through SLT and MMTP and the part transmitted through ROUTE may be encapsulated again in the data link layer after being processed in the UDP and IP layers. The broadcast data processed in the link layer may be multicast as a broadcast signal through a process such as encoding / interleaving in the physical layer.

In the figure, the broadband protocol stack portion may be transmitted through the HyperText Transfer Protocol (HTTP) as described above. Data 1960 formatted in the form of a DASH segment, signaling data 1980, and data 1970 such as an NRT may be transmitted through HTTP. The signaling data shown here may be signaling data regarding a service. This data can be processed via the Transmission Control Protocol (TCP), IP layer, and then encapsulated at the link layer. Subsequently, the processed broadband data may be unicast to broadband through processing for transmission in the physical layer.
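The layering described above can be summarized as a small lookup table. This is only a sketch: the layer order follows the description, while the dictionary keys and the helper name network_for are labels chosen for this example.

```python
# Layers from top to bottom for each delivery path described above.
PROTOCOL_STACKS = {
    "SLT": ["SLT", "UDP", "IP", "link layer", "physical layer"],
    "MMTP": ["MMTP", "UDP", "IP", "link layer", "physical layer"],
    "ROUTE": ["ROUTE", "UDP", "IP", "link layer", "physical layer"],
    "HTTP": ["HTTP", "TCP", "IP", "link layer", "physical layer"],
}


def network_for(path):
    """Return which network a delivery path uses in this sketch.

    Paths that run over TCP are treated as broadband (unicast);
    paths that run over UDP are treated as broadcast (multicast).
    """
    return "broadband" if "TCP" in PROTOCOL_STACKS[path] else "broadcast"
```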

A service can be a collection of media components that are shown to the user as a whole. A component can be of multiple media types, a service can be continuous or intermittent, a service can be real time or non-real time, and a real-time service can be configured as a sequence of TV programs.

The service may include the aforementioned virtual reality service and / or augmented reality service. In addition, the video data and / or audio data may be included in at least one of the data 1920 formatted in MPU format, the non-timed data 1950 such as NRT data, and / or the data 1960 formatted in DASH segment form. In addition, the signaling data (eg, the first signaling data, the second signaling data) may be included in at least one of the SLT 1910, the signaling data 1930, the signaling data 1940, and / or the signaling data 1980.

Service signaling provides service discovery and description information and includes two functional components. These are bootstrap signaling through the SLT 2010 and the SLS 2020 and 2030. For example, SLS in MMTP may be represented by MMT signaling components 2030. These represent the information needed to discover and obtain user services. The SLT 2010 allows the receiver to build a basic list of services and bootstrap the discovery of the SLSs 2020 and 2030 for each service.

The SLT 2010 enables very fast acquisition of basic service information. The SLS 2020 and 2030 allow the receiver to discover and access the service and its content components (such as video data or audio data).

As described above, the SLT 2010 may be transmitted through UDP / IP. At this time, according to an embodiment, data corresponding to the SLT 2010 may be delivered through the most robust method for this transmission.

The SLT 2010 may have access information for accessing the SLS 2020 carried by the ROUTE protocol. That is, the SLT 2010 may bootstrap the SLS 2020 according to the ROUTE protocol. The SLS 2020 is signaling information located in a layer above ROUTE in the above-described protocol stack and may be transmitted through ROUTE / UDP / IP. This SLS 2020 may be delivered via one of the LCT sessions included in the ROUTE session. The SLS 2020 may be used to access a service component 2040 corresponding to a desired service.

The SLT 2010 may also have access information for accessing the SLS (MMT signaling component) 2030 carried by the MMTP. In other words, the SLT 2010 may bootstrap to the SLS (MMT signaling component) 2030 according to the MMTP. This SLS (MMT signaling component) 2030 may be carried by an MMTP signaling message defined in MMT. The SLS (MMT signaling component) 2030 may be used to access a streaming service component (MPU) 2050 corresponding to a desired service. As described above, in the present specification, the NRT service component 2060 is delivered through the ROUTE protocol, and the SLS (MMT signaling component) 2030 according to MMTP may also include information for accessing it. In broadband delivery, SLS is carried over HTTP (S) / TCP / IP.

The service may be included in at least one of the service components 2040, the streaming service components 2050, and / or the NRT service components 2060. In addition, the signaling data (eg, the first signaling data and the second signaling data) may be included in at least one of the SLT 2010, the SLS 2020, and / or the MMT signaling components 2030.

FIG. 21 is a diagram illustrating an example SLT.

SLT supports fast channel scan that allows the receiver to build a list of all the services it can receive by channel name, channel number, and so on. The SLT also provides bootstrap information that allows the receiver to discover the SLS for each service.

The SLT may include at least one of @bsid, @sltCapabilities, sltInetUrl element, and / or Service element.

@bsid may be a unique identifier of the broadcast stream. The value of @bsid can be unique at the local level.

@sltCapabilities means the specifications required for meaningful broadcasting in all services described in the SLT.

The sltInetUrl element refers to a URL (Uniform Resource Locator) value which can download ESG (Electronic Service Guide) data or service signaling information providing guide information of all services described in the corresponding SLT through a broadband network. The sltInetUrl element may include @URLtype.

@URLtype refers to the type of file that can be downloaded through the URL indicated by the sltInetUrl element.

The service element may include service information. The service element may include at least one of @serviceId, @sltSvcSeqNum, @protected, @majorChannelNo, @minorChannelNo, @serviceCategory, @shortServiceName, @hidden, @broadbandAccessRequired, @svcCapabilities, BroadcastSignaling element, and / or svcInetUrl element.

@serviceId is a unique identifier of the service.

@sltSvcSeqNum has a value that indicates information about whether the contents of each service defined in the SLT have changed.

If @protected has a value of “true”, it means that one or more of the components needed to show the service on the screen are protected.

@majorChannelNo means the major channel number of the service.

@minorChannelNo means the minor channel number of the service.

@serviceCategory indicates the type of service.

@shortServiceName indicates the name of the service.

@hidden indicates whether the service should be shown to the user when scanning the service.

@broadbandAccessRequired indicates whether to connect to the broadband network in order to show the service meaningfully to the user.

@svcCapabilities specifies the specifications that must be supported to make the service meaningful to the user.

The BroadcastSignaling element includes a definition of a transport protocol, a location, and identifier values of signaling transmitted to a broadcast network. The BroadcastSignaling element may include at least one of @slsProtocol, @slsMajorProtocolVersion, @slsMinorProtocolVersion, @slsPlpId, @slsDestinationIpAddress, @slsDestinationUdpPort, and / or @slsSourceIpAddress.

@slsProtocol represents the protocol over which the SLS of the service is transmitted.

@slsMajorProtocolVersion represents the major version of the protocol over which the SLS of the service is transmitted.

@slsMinorProtocolVersion represents the minor version of the protocol over which the SLS of the service is transmitted.

@slsPlpId indicates the PLP identifier through which the SLS is transmitted.

@slsDestinationIpAddress represents the destination IP address of SLS data.

@slsDestinationUdpPort represents the destination UDP port of SLS data.

@slsSourceIpAddress represents the source IP address of SLS data.

The svcInetUrl element indicates a URL value for downloading ESG service or signaling data related to the service. The svcInetUrl element may contain @URLtype.

@URLtype refers to the type of file that can be downloaded through the URL indicated by the svcInetUrl element.
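To make the attribute list above more concrete, the sketch below reads a simplified SLT and builds the basic service list a receiver would use for a fast channel scan. The attribute names come from the description above; the XML shape (no namespaces), the sample values, and the helper name list_services are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free SLT with placeholder values (illustrative only).
SLT_XML = """<SLT bsid="1">
  <Service serviceId="1001" serviceCategory="6" shortServiceName="VRConf"
           majorChannelNo="5" minorChannelNo="1" hidden="false">
    <BroadcastSignaling slsProtocol="1" slsPlpId="1"
        slsSourceIpAddress="10.0.0.1"
        slsDestinationIpAddress="239.255.1.1"
        slsDestinationUdpPort="5000"/>
  </Service>
</SLT>"""


def list_services(slt_xml):
    """Return one dict per Service element, including SLS bootstrap info."""
    root = ET.fromstring(slt_xml)
    services = []
    for svc in root.findall("Service"):
        signaling = svc.find("BroadcastSignaling")
        services.append({
            "serviceId": int(svc.get("serviceId")),
            "name": svc.get("shortServiceName"),
            "category": int(svc.get("serviceCategory")),
            "channel": (int(svc.get("majorChannelNo")),
                        int(svc.get("minorChannelNo"))),
            # Bootstrap information used later to locate the SLS.
            "sls": dict(signaling.attrib) if signaling is not None else None,
        })
    return services


services = list_services(SLT_XML)
```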

For example, if the value of the serviceCategory attribute is '0', the service may not be specified. If the value of the serviceCategory attribute is '1', the service may be a linear audio / video service. If the value of the serviceCategory attribute is '2', the service may be a linear audio service. If the value of the serviceCategory attribute is '3', the service may be an app-based service. If the value of the serviceCategory attribute is '4', the service may be an electronic service guide (ESG) service. If the value of the serviceCategory attribute is '5', the service may be an emergency alert service (EAS).

If the value of the serviceCategory attribute is '6', the corresponding service may be a virtual reality and / or augmented reality service.

For the video conferencing service, the value of the serviceCategory attribute may be '6' (2210).
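The serviceCategory values listed above can be captured in an enumeration. This is a minimal sketch; the member names and the helper is_vr_service are chosen for this example, while the numeric values follow the text.

```python
from enum import IntEnum


class ServiceCategory(IntEnum):
    """serviceCategory values as listed in the description above."""
    NOT_SPECIFIED = 0
    LINEAR_AV = 1
    LINEAR_AUDIO = 2
    APP_BASED = 3
    ESG = 4
    EAS = 5
    VR_AR = 6  # virtual reality and / or augmented reality service


def is_vr_service(service_category):
    """Accept the attribute as text or int and test for the VR/AR category."""
    return int(service_category) == ServiceCategory.VR_AR
```

A receiver could apply is_vr_service to the @serviceCategory attribute of each Service element to decide whether ROI-aware processing is relevant for that service.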

The receiver can obtain the SLT. SLT is used to bootstrap SLS acquisition, and then SLS is used to acquire service components carried in a ROUTE session or an MMTP session.

With respect to the service delivered in the ROUTE session, the SLT provides SLS bootstrapping information such as PLPID (# 1), source IP address (sIP1), destination IP address (dIP1), and destination port number (dPort1). With regard to the service delivered in the MMTP session, the SLT provides SLS bootstrapping information such as PLPID (# 2), destination IP address (dIP2), and destination port number (dPort2).

For reference, a broadcast stream is an abstraction of an RF channel defined in terms of a carrier frequency centered within a specific bandwidth. A PLP (physical layer pipe) is a part of the RF channel. Each PLP has specific modulation and coding parameters.

For streaming service delivery using ROUTE, the receiver can obtain the SLS fragments delivered to the PLP and IP / UDP / LCT sessions. These SLS fragments include a User Service Bundle Description / User Service Description (USBD / USD) fragment, a Service-based Transport Session Instance Description (S-TSID) fragment, and a Media Presentation Description (MPD) fragment. They are related to a service.

For streaming service delivery using MMTP, the receiver may obtain SLS fragments that are delivered in PLP and MMTP sessions. These SLS fragments may include USBD / USD fragments, MMT signaling messages. They are related to a service.

The receiver may obtain a video component and / or an audio component based on the SLS fragment.

Unlike the illustrated embodiment, one ROUTE or MMTP session may be delivered through a plurality of PLPs. That is, one service may be delivered through one or more PLPs. As described above, one LCT session may be delivered through one PLP. Unlike what is shown, components constituting one service may be delivered through different ROUTE sessions. In addition, according to an embodiment, components constituting one service may be delivered through different MMTP sessions. According to an embodiment, components constituting one service may be delivered separately through a ROUTE session and an MMTP session. Although not shown, a component constituting one service may be delivered through broadband (hybrid delivery).

In addition, service data (eg, video component and / or audio component) and / or signaling data (eg, SLS fragment) may be transmitted through a broadcast network and / or broadband.

FIG. 24 is a diagram illustrating an exemplary USBD / USD fragment for ROUTE / DASH.

The USBD / USD (User Service Bundle Description / User Service Description) fragment describes the service layer characteristics and provides a Uniform Resource Identifier (URI) reference for the S-TSID fragment and a URI reference for the MPD fragment. That is, the USBD / USD fragment may refer to the S-TSID fragment and the MPD fragment, respectively. The USBD / USD fragment can be expressed as a USBD fragment.

The USBD / USD fragment can have a bundleDescription root element. The bundleDescription root element may have a userServiceDescription element. The userServiceDescription element may be an instance of one service.

The userServiceDescription element may include at least one of @globalServiceId, @serviceId, @serviceStatus, @fullMPDUri, @sTSIDUri, name element, serviceLanguage element, deliveryMethod element, and / or serviceLinkage element.

@globalServiceId can indicate a globally unique URI that identifies the service.

@serviceId is a reference to the corresponding service entry in the SLT.

@serviceStatus can specify the status of the service. The value indicates whether the service is enabled or disabled.

@fullMPDUri may reference an MPD fragment containing a description of the content component of the service delivered over broadcast and / or broadband.

@sTSIDUri may refer to an S-TSID fragment that provides access-related parameters to a transport session that delivers the content of the service.

The name element may indicate a name of a service. The name element may include @lang indicating the language of the service name.

The serviceLanguage element may indicate an available language of the service.

The deliveryMethod element may be a container of transport-related information pertaining to the content of the service over broadcast and (optionally) broadband modes of access. The deliveryMethod element may include a broadcastAppService element and a unicastAppService element. Each subelement may have a basePattern element as a subelement.

The broadcastAppService element may be a DASH representation delivered over broadcast, in multiplexed or non-multiplexed form, containing the corresponding media components belonging to the service across all periods of the media presentation to which it belongs. That is, each of these fields may mean a DASH representation delivered through the broadcast network.

The unicastAppService element may be a DASH representation delivered over broadband, in multiplexed or non-multiplexed form, containing the constituent media content components belonging to the service across all periods of the media presentation to which it belongs. That is, each of these fields may mean a DASH representation delivered through broadband.

The basePattern element may be a character pattern used by the receiver to match against any portion of the segment URL used by the DASH client to request media segments of the parent representation in the containing period.

The serviceLinkage element may include service linkage information.
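The way a USBD / USD fragment points to the other two SLS fragments can be sketched as follows. The element and attribute names follow the description above; the namespace-free XML, the URI values, and the helper name resolve_fragment_uris are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Simplified USBD / USD fragment with placeholder values (illustrative only).
USBD_XML = """<bundleDescription>
  <userServiceDescription globalServiceId="urn:example:vr:1" serviceId="1001"
      fullMPDUri="vr.mpd" sTSIDUri="vr-stsid.xml">
    <name lang="en">VR Conference</name>
  </userServiceDescription>
</bundleDescription>"""


def resolve_fragment_uris(usbd_xml, service_id):
    """Find the MPD and S-TSID references for the service with this id."""
    root = ET.fromstring(usbd_xml)
    for usd in root.findall("userServiceDescription"):
        # @serviceId links this entry back to the service entry in the SLT.
        if usd.get("serviceId") == str(service_id):
            return {"mpd": usd.get("fullMPDUri"),
                    "s_tsid": usd.get("sTSIDUri")}
    return None


fragment_uris = resolve_fragment_uris(USBD_XML, 1001)
```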

The Service-based Transport Session Instance Description (S-TSID) fragment provides a transport session description for one or more ROUTE / LCT sessions to which the media content component of the service is delivered and a description of the delivery object delivered in that LCT session. The receiver may obtain at least one component (eg, video component and / or audio component) included in the service based on the S-TSID fragment.

The S-TSID fragment may include an S-TSID root element. The S-TSID root element may include @serviceId and / or at least one RS element.

@serviceId may be a reference corresponding to a service element in the USD.

The RS element may have information about a ROUTE session for delivering corresponding service data.

The RS element may include at least one of @bsid, @sIpAddr, @dIpAddr, @dport, @PLPID and / or at least one LS element.

@bsid may be an identifier of a broadcast stream to which the content component of broadcastAppService is delivered.

@sIpAddr may indicate the source IP address. Here, the source IP address may be a source IP address of a ROUTE session for delivering a service component included in a corresponding service.

@dIpAddr may indicate a destination IP address. Here, the destination IP address may be a destination IP address of a ROUTE session for delivering a service component included in a corresponding service.

@dport can represent a destination port. Here, the destination port may be a destination port of a ROUTE session for delivering a service component included in a corresponding service.

@PLPID may be an ID of a PLP for a ROUTE session represented by an RS element.

The LS element may have information about an LCT session that carries corresponding service data.

The LS element may include @tsi, @PLPID, @bw, @startTime, @endTime, SrcFlow and / or RprFlow.

@tsi may indicate a TSI value of an LCT session in which a service component of a corresponding service is delivered.

@PLPID may have ID information of a PLP for a corresponding LCT session. This value may override the default ROUTE session value.

@bw may indicate the maximum bandwidth value. @startTime may indicate the start time of the LCT session. @endTime may indicate the end time of the corresponding LCT session. The SrcFlow element may describe the source flow of ROUTE. The RprFlow element may describe the repair flow of ROUTE.

The S-TSID may include ROI information. In more detail, the RS element and / or the LS element may include ROI information.
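The relationship between RS and LS elements, including the rule that an LS-level @PLPID may override the ROUTE session default, can be modeled with two small records. This is a sketch only; the class and field names are chosen for this example and only a subset of the attributes is modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LctSession:
    """LS element: one LCT session inside a ROUTE session."""
    tsi: int
    plp_id: Optional[int] = None  # @PLPID; overrides the RS value when set


@dataclass
class RouteSession:
    """RS element: one ROUTE session."""
    s_ip_addr: str
    d_ip_addr: str
    dport: int
    plp_id: int  # default PLP for every LCT session in this ROUTE session
    lct_sessions: List[LctSession] = field(default_factory=list)


def effective_plp(rs, ls):
    """Return the PLP that actually carries the given LCT session."""
    return ls.plp_id if ls.plp_id is not None else rs.plp_id


# Placeholder addresses and identifiers.
route = RouteSession("10.0.0.1", "239.255.1.1", 5000, plp_id=1,
                     lct_sessions=[LctSession(tsi=10),
                                   LctSession(tsi=20, plp_id=2)])
```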

FIG. 26 illustrates an exemplary MPD fragment.

The media presentation description (MPD) fragment may include a formal description of the DASH media presentation corresponding to the linear service of a given duration determined by the broadcaster. MPD fragments are primarily associated with linear services for the delivery of DASH segments as streaming content. The MPD provides the resource identifiers for the individual media components of the linear / streaming service in the form of segment URLs, and the context of the identified resources within the media presentation. The MPD may be transmitted over broadcast and / or broadband.

The MPD fragment may include a period element, an adaptation set element, and a representation element.

Period elements contain information about periods. The MPD fragment may include information about a plurality of periods. A period represents a continuous time interval of media content presentation.

The adaptation set element includes information about the adaptation set. The MPD fragment may include information about a plurality of adaptation sets. An adaptation set is a collection of media components that includes one or more media content components that can be interchanged. The adaptation set may include one or more representations. Each adaptation set may include audio of different languages or subtitles of different languages.

The representation element contains information about the representation. The MPD may include information about a plurality of representations. A representation is a structured collection of one or more media components, and there may be a plurality of representations encoded differently for the same media content component. When bitstream switching is possible, the electronic device may switch the received representation to another representation based on updated information during media content playback. In particular, the electronic device may switch the received representation to another representation according to the bandwidth environment. The representation is divided into a plurality of segments.
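Switching between representations according to the bandwidth environment can be illustrated with a simple selection rule. The rule below (pick the highest-bandwidth representation that fits, otherwise fall back to the lowest) is one common heuristic assumed for this sketch, not a method stated in this specification; the identifiers and bit rates are placeholders.

```python
def select_representation(representations, available_bps):
    """Pick a representation id for the measured bandwidth.

    representations: list of (representation_id, bandwidth_bps) tuples.
    """
    fitting = [r for r in representations if r[1] <= available_bps]
    if fitting:
        # Highest quality that still fits the available bandwidth.
        return max(fitting, key=lambda r: r[1])[0]
    # Nothing fits: fall back to the cheapest representation.
    return min(representations, key=lambda r: r[1])[0]


REPRESENTATIONS = [("low", 1_000_000), ("mid", 3_000_000), ("high", 8_000_000)]
```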

A segment is a unit of media content data. The representation may be transmitted as a segment or part of a segment according to a request of the electronic device using the HTTP GET or HTTP partial GET method defined in HTTP 1.1 (RFC 2616).
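A request for a whole segment or for part of a segment differs only in the Range header. The sketch below builds such a request with the Python standard library without sending it; the URL and the helper name build_segment_request are placeholders for this example.

```python
import urllib.request


def build_segment_request(url, first_byte=None, last_byte=None):
    """Build an HTTP GET request, or a partial GET when a byte range is given."""
    request = urllib.request.Request(url)
    if first_byte is not None:
        end = "" if last_byte is None else str(last_byte)
        # A Range header turns the plain GET into a partial GET.
        request.add_header("Range", "bytes=%d-%s" % (first_byte, end))
    return request


full_request = build_segment_request("http://example.com/video/seg1.m4s")
partial_request = build_segment_request("http://example.com/video/seg1.m4s", 0, 999)
```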

In addition, the segment may include a plurality of sub-segments. The subsegment may mean the smallest unit that can be indexed at the segment level. The segment may include an Initialization Segment, a Media Segment, an Index Segment, and a BitstreamSwitching Segment.

The MPD fragment may include ROI information. In more detail, the period element, the adaptation set element, and / or the representation element may include ROI information.

The client device (or receiver) may receive the bitstream through the broadcast network. For example, the bitstream may include video data and second signaling data for the service. For example, the second signaling data may include an SLT 2710 and an SLS 2730. The service may include a virtual reality service. The service data may include base layer service data 2740 and enhancement layer service data 2750.

The bitstream may include at least one physical layer frame. The physical layer frame may include at least one PLP. For example, the SLT 2710 may be transmitted through the PLP # 0.

In addition, the PLP # 1 may include a first ROUTE session ROUTE # 1. The first ROUTE session ROUTE # 1 may include a first LCT session tsi-sls, a second LCT session tsi-bv, and a third LCT session tsi-a. The SLS 2730 may be transmitted through the first LCT session tsi-sls, the base layer video data 2740 may be transmitted through the second LCT session tsi-bv, and audio data may be transmitted through the third LCT session tsi-a.

In addition, the PLP # 2 may include a second ROUTE session ROUTE # 2, and the second ROUTE session ROUTE # 2 may include a fourth LCT session tsi-ev. Enhancement layer video data (Video Segment) 2750 may be transmitted through a fourth LCT session tsi-ev.

The client device can then obtain the SLT 2710. For example, the SLT 2710 may include bootstrap information 2720 for obtaining the SLS 2730.

The client device may then obtain the SLS 2730 for the virtual reality service based on the bootstrap information 2720. For example, the SLS may include a USBD / USD fragment, an S-TSID fragment, and / or an MPD fragment. At least one of the USBD / USD fragment, the S-TSID fragment, and / or the MPD fragment may include ROI information. In the following description, it is assumed that the MPD fragment includes ROI information.

The client device may then obtain the S-TSID fragment and / or the MPD fragment based on the USBD / USD fragment. The client device may match the representation of the MPD fragment with the media component transmitted over the LCT session based on the S-TSID fragment and the MPD fragment.

The client device can then obtain the base layer video data 2740 and audio data based on the RS element (ROUTE # 1) of the S-TSID fragment. The client device may also obtain enhancement layer video data 2750 and audio data based on the RS element (ROUTE # 2) of the S-TSID fragment.

The client device can then decode the service data (eg, base layer video data, enhancement layer video data, audio data) based on the MPD fragment.

More specifically, the client device may decode the enhancement layer video data based on the base layer video data and / or region of interest information.

In the above description, the enhancement layer video data is transmitted through the second ROUTE session (ROUTE # 2). However, the enhancement layer video data may be transmitted through the MMTP session.
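The session layout in this example can be written down as a table that a receiver consults when deciding which sessions to join. The PLP, ROUTE session, and LCT session labels come from the description above; the table structure and the helper names locate and acquisition_order are assumptions for this sketch.

```python
# Session layout from the example: where each kind of data is carried.
SESSIONS = [
    {"plp": 1, "route": 1, "tsi": "tsi-sls", "carries": "SLS"},
    {"plp": 1, "route": 1, "tsi": "tsi-bv", "carries": "base_video"},
    {"plp": 1, "route": 1, "tsi": "tsi-a", "carries": "audio"},
    {"plp": 2, "route": 2, "tsi": "tsi-ev", "carries": "enhancement_video"},
]


def locate(component):
    """Return (PLP, ROUTE session, LCT session) carrying the component."""
    for session in SESSIONS:
        if session["carries"] == component:
            return (session["plp"], session["route"], session["tsi"])
    raise KeyError(component)


def acquisition_order(with_enhancement=True):
    """Sessions to join, in order: signaling first, then media."""
    order = ["SLS", "base_video", "audio"]
    if with_enhancement:
        order.append("enhancement_video")
    return [locate(component) for component in order]
```

A receiver that only needs the base layer can call acquisition_order(False) and never tune to PLP # 2.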

FIG. 28 is a diagram illustrating an exemplary configuration of a client device.

Referring to FIG. (A), the client device A2800 may include at least one of an image input unit, an audio input unit, a sensor unit, an image output unit, an audio output unit, a communication unit A2810, and / or a controller A2820. For example, the details of the client device A2800 may include all the contents of the above-described client device.

The controller A2820 may include at least one of a signaling data extractor, a decoder, a speaker determiner, a gaze determiner, and / or a signaling data generator. For example, the details of the controller A2820 may include all of the above-described contents of the controller.

Referring to the drawings, a client device (or a receiver or an image receiving apparatus) may include a communication unit A2810 and / or a controller A2820. The controller A2820 may include a base layer decoder A2821 and / or an enhancement layer decoder A2825.

The communication unit A2810 may receive a bitstream including video data for a virtual reality service. The communication unit A2810 may receive a bitstream through a broadcast network and / or broadband.

The video data may include base layer video data for a base layer and at least one enhancement layer video data for at least one enhancement layer predicted from the base layer.

The base layer decoder A2821 may decode the base layer video data.

The enhancement layer decoder A2825 may decode the at least one enhancement layer video data based on the base layer video data.

The at least one enhancement layer video data may be video data for at least one region of interest in a virtual space.
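The dependency between the two decoders can be illustrated with a toy model: the base layer picture is upsampled to form a prediction, and the enhancement layer supplies corrections only for samples inside the region of interest. This is a deliberately simplified sketch and not an HEVC or scalable-codec implementation; the 2x nearest-neighbour upsampling, the residual dictionary, and the function names are all assumptions.

```python
def upsample2x(picture):
    """Nearest-neighbour 2x upsampling of a 2-D list of samples."""
    out = []
    for row in picture:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def decode_with_enhancement(base_picture, roi_residuals):
    """Predict from the base layer, then refine only the ROI samples.

    roi_residuals maps (row, col) in the upsampled picture to a correction.
    """
    picture = upsample2x(base_picture)  # prediction from the base layer
    for (row, col), delta in roi_residuals.items():
        picture[row][col] += delta  # enhancement applies inside the ROI only
    return picture


decoded = decode_with_enhancement([[10, 20]], {(0, 0): 2, (1, 3): -1})
```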

In addition, the controller A2820 may further include a signaling data generator that generates first signaling data.

The first signaling data may include image configuration information. The image configuration information may include at least one of gaze information indicating a gaze direction of the user and zoom area information indicating the viewing angle of the user in the virtual space.

The controller A2820 may further include a gaze determination unit that determines whether a gaze area corresponding to the gaze information is included in the at least one ROI.

In addition, when the gaze area is included in an area other than the at least one region of interest, the communication unit A2810 may transmit the first signaling data to a server (or a server device, a transmitter, an image transmission device) and / or at least one client device. In this case, the server, the server device, and / or the at least one client device receiving the first signaling data may include the gaze area corresponding to the gaze information in the at least one ROI. That is, the region of interest may include at least one of a region including the speaker in the virtual space, a region that is predetermined to be represented using the at least one enhancement layer video data, and a gaze area corresponding to the gaze information.
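The gaze check and the resulting ROI update can be sketched as below. Representing both the gaze area and the ROI as sets of tile numbers, and the message layout of the first signaling data, are assumptions made for this example.

```python
def gaze_outside_roi(gaze_tiles, roi_tiles):
    """True when any part of the gaze area lies outside the current ROI."""
    return not set(gaze_tiles) <= set(roi_tiles)


def client_step(gaze_tiles, roi_tiles):
    """Return first signaling data to send, or None if no update is needed."""
    if gaze_outside_roi(gaze_tiles, roi_tiles):
        return {"gaze_tiles": sorted(set(gaze_tiles))}
    return None


def server_update_roi(roi_tiles, first_signaling):
    """Receiver of the signaling adds the gaze area to the ROI."""
    return sorted(set(roi_tiles) | set(first_signaling["gaze_tiles"]))


message = client_step(gaze_tiles=[8, 9], roi_tiles=[6, 7, 8])
updated_roi = server_update_roi([6, 7, 8], message)
```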

The bitstream may further include second signaling data.

The communication unit A2810 may independently receive the base layer video data and the at least one enhancement layer video data based on the second signaling data through a plurality of sessions.

For example, the communication unit A2810 may receive base layer video data through a first ROUTE session and receive at least one enhancement layer video data through at least one second ROUTE session. Alternatively, the communication unit A2810 may receive base layer video data through a ROUTE session and receive at least one enhancement layer video data through at least one MMTP session.

The second signaling data may include at least one of service layer signaling data (or SLS) including information for acquiring the video data and a service list table (or SLT) including information for acquiring the service layer signaling data.

In addition, the service list table may include a service category attribute indicating a category of a service. For example, the service category attribute may indicate the virtual reality service.

In addition, the service layer signaling data may include the ROI information. Specifically, the service layer signaling data may include at least one of an S-TSID fragment including information on a session in which at least one media component for the virtual reality service is transmitted, an MPD fragment including information about the at least one media component (video data and / or audio data), and a USBD / USD fragment including a URI value connecting the S-TSID fragment and the MPD fragment.

The MPD fragment may include ROI information indicating a location of the at least one ROI in the entire area of the virtual space.

The bitstream may further include region of interest information indicating a location of the at least one region of interest within the entire region of the virtual space. For example, the ROI information may be transmitted and / or received through at least one of a Supplemental Enhancement Information (SEI) message, a Video Usability Information (VUI) message, a slice header, and a file describing the video data.

The at least one enhancement layer video data may be generated (encoded) and / or decoded based on the base layer video data and the ROI information.

The ROI information may include at least one of an information mode field indicating a mode of information representing the ROI for each picture and a tile number list field including a number of at least one tile corresponding to the ROI. For example, the information mode field may be the above-described info_mode information, and the tile number list field may be the above-described tile_id_list information.

For example, based on the information mode field, the tile number list field may include the number of the at least one tile in one of the following manners: the numbers of all tiles corresponding to the ROI, the starting and ending numbers of consecutive tiles, or the numbers of the upper-left and lower-right tiles of the ROI.
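The three ways of listing tiles can be expanded into an explicit tile set as sketched below. The mode labels "all", "ranges", and "corners" are hypothetical names used instead of numeric info_mode values, which are not given in this passage. The sketch also assumes that tiles are numbered from 0 in raster order and that the corner mode lists the upper-left and lower-right tiles of a rectangular ROI.

```python
def expand_tile_ids(mode, tile_id_list, columns=None):
    """Expand a tile number list into the full list of ROI tile numbers."""
    if mode == "all":
        # Every ROI tile is listed explicitly.
        return list(tile_id_list)
    if mode == "ranges":
        # Pairs of (start, end) numbers of consecutive tiles, inclusive.
        tiles = []
        for start, end in zip(tile_id_list[0::2], tile_id_list[1::2]):
            tiles.extend(range(start, end + 1))
        return tiles
    if mode == "corners":
        # Two corner tiles of a rectangle; needs the tile grid width.
        if columns is None:
            raise ValueError("columns is required for corner mode")
        top_left, bottom_right = tile_id_list
        row0, col0 = divmod(top_left, columns)
        row1, col1 = divmod(bottom_right, columns)
        return [row * columns + col
                for row in range(row0, row1 + 1)
                for col in range(col0, col1 + 1)]
    raise ValueError("unknown mode: " + str(mode))
```

The range and corner forms trade a little computation at the receiver for a shorter list in the signaling data when the ROI covers many tiles.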

The ROI information may further include a coding unit number list field indicating the ROI. For example, the coding unit number list field may be the above-described cu_id_list information.

For example, the coding unit number list field may indicate the number of tiles corresponding to the ROI and the number of coding units included in the tile based on the information mode field.

Referring to FIG. (B), the client device B2800 may include at least one of an image input unit, an audio input unit, a sensor unit, an image output unit, an audio output unit, a communication unit B2810, and / or a controller B2820. For example, the details of the client device B2800 may include all the contents of the client device A2800 described above.

In addition, the controller B2820 may include at least one of the first processor B2821 and / or the second processor B2825.

The first processor B2821 may decode base layer video data. For example, the first processor B2821 may be a video processing unit (VPU) and / or a digital signal processor (DSP).

The second processor B2825 may be electrically connected to the first processor to decode the at least one enhancement layer video data based on the base layer video data. For example, the second processor B2825 may be a central processing unit (CPU) and / or a graphics processing unit (GPU).

FIG. 29 is a diagram illustrating an exemplary configuration of a server device.

When communicating only between client devices, at least one client device (or HMD, image receiving apparatus) may perform all operations of the server device (or image transmitting apparatus). Hereinafter, a description will be given of a case where a server device exists, but the contents of the present specification are not limited thereto.

Referring to FIG. (A), the server device A2900 (or transmitter, image transmission device) may include a controller A2910 and / or a communication unit A2920. The controller A2910 may include at least one of a signaling data extractor, an image generator, an ROI determiner, a signaling data generator, and / or an encoder. Details of the server device A2900 may include all the contents of the server device described above.

Referring to the drawings, the controller A2910 of the server device A2900 may include a base layer encoder A2911 and / or an enhancement layer encoder A2915.

The base layer encoder A2911 may generate base layer video data.

The enhancement layer encoder A2915 may generate at least one enhancement layer video data based on the base layer video data.

The communication unit A2920 may transmit a bitstream including video data for a virtual reality service. The communication unit A2920 may transmit the bitstream through a broadcast network and / or broadband.

The video data may also include the base layer video data for a base layer and the at least one enhancement layer video data for at least one enhancement layer predicted from the base layer.

In addition, the at least one enhancement layer video data may be video data for at least one region of interest in a virtual space.

In addition, the communication unit A2920 may further receive the first signaling data. For example, the first signaling data may include image configuration information.

The ROI determiner of the controller A2910 may include the gaze area corresponding to the gaze information in the at least one ROI.
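
The ROI update performed by the ROI determiner can be sketched as follows. This is a minimal illustration only: it assumes that the virtual space is divided into numbered tiles and that both the ROI and the gaze area are represented as sets of tile numbers. The function name `update_roi` and the set representation are hypothetical and are not defined in this specification.

```python
def update_roi(roi_tiles, gaze_tiles):
    """Add gaze-area tiles that fall outside the current ROI (hypothetical helper).

    Returns the updated ROI and a flag telling whether the ROI changed.
    """
    outside = set(gaze_tiles) - set(roi_tiles)
    if not outside:
        # The gaze area already lies inside the ROI: nothing to update.
        return set(roi_tiles), False
    return set(roi_tiles) | outside, True


roi = {5, 6, 9, 10}
new_roi, changed = update_roi(roi, {10, 11})
print(sorted(new_roi), changed)  # [5, 6, 9, 10, 11] True
```

In this sketch the returned flag indicates whether the enhancement layer would need to cover additional tiles after the update.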

In addition, the signaling data generator of the controller A2910 may generate second signaling data.

In addition, the communication unit A2920 may independently transmit the base layer video data and the at least one enhancement layer video data through a plurality of sessions based on the second signaling data.
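
As a rough sketch of transmission over a plurality of sessions, the following example maps each layer to its own session and routes payloads accordingly. The `SessionInfo` record, the layer labels (`"base"`, `"enh0"`, ...), and the session numbering are assumptions made for illustration; the actual contents of the second signaling data are those described in this specification, not the fields shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionInfo:
    session_id: int  # hypothetical identifier of a transport session
    layer: str       # hypothetical layer label: "base", "enh0", "enh1", ...


def build_session_map(num_enhancement_layers):
    """Assign one session to the base layer and one to each enhancement layer."""
    sessions = [SessionInfo(0, "base")]
    for i in range(num_enhancement_layers):
        sessions.append(SessionInfo(i + 1, f"enh{i}"))
    return sessions


def route(packets, sessions):
    """Group (layer, payload) pairs by the session that carries that layer."""
    by_layer = {s.layer: s.session_id for s in sessions}
    out = {s.session_id: [] for s in sessions}
    for layer, payload in packets:
        out[by_layer[layer]].append(payload)
    return out


sessions = build_session_map(1)
routed = route([("base", b"b0"), ("enh0", b"e0"), ("base", b"b1")], sessions)
print(routed)  # {0: [b'b0', b'b1'], 1: [b'e0']}
```

Keeping the layers in separate sessions means a receiver in this sketch could read only session 0 and still obtain the complete base layer.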

In addition, the second signaling data and / or the ROI information may include all of the above contents.
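
The ROI information may identify ROI tiles through a tile number list, which can list all tile numbers, give the start and end numbers of consecutive tiles, or give the numbers of the upper-left and lower-right tiles. The sketch below reads these as three alternative forms and expands each into an explicit tile list. It assumes tiles are numbered from 0 in raster order on a grid that is `cols` tiles wide; that numbering scheme and the function names are hypothetical.

```python
def tiles_from_list(numbers):
    """Form 1: every tile number of the ROI is listed explicitly."""
    return sorted(set(numbers))


def tiles_from_range(start, end):
    """Form 2: start and end numbers of a run of consecutive tiles (inclusive)."""
    return list(range(start, end + 1))


def tiles_from_corners(upper_left, lower_right, cols):
    """Form 3: upper-left and lower-right tile numbers of a rectangular ROI.

    Assumes raster-order numbering from 0 on a grid with `cols` tiles per row.
    """
    r0, c0 = divmod(upper_left, cols)
    r1, c1 = divmod(lower_right, cols)
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


print(tiles_from_list([10, 5, 6, 9]))    # [5, 6, 9, 10]
print(tiles_from_range(4, 7))            # [4, 5, 6, 7]
print(tiles_from_corners(5, 10, cols=4)) # [5, 6, 9, 10]
```

The corner form is the most compact for rectangular regions, while the explicit list can describe regions of any shape.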

Referring to FIG. 29(B), the server device B2900 (or transmitter, image transmitting apparatus) may include a controller B2910 and/or a communication unit B2920. The controller B2910 may include at least one of a signaling data extractor, an image generator, an ROI determiner, a signaling data generator, and/or an encoder. The description of the server device given above applies in full to the server device B2900.

The controller B2910 of the server device B2900 may include a first processor B2911 and / or a second processor B2915.

The first processor B2911 may include a base layer encoder that generates base layer video data.

The second processor B2915 may be electrically connected to the first processor to generate (or encode) the at least one enhancement layer video data based on the base layer video data.

FIG. 30 is a diagram illustrating an exemplary operation of a client device.

The client device (or receiver, image receiving apparatus) may include a communication unit and/or a control unit. The control unit may include a base layer decoder and/or an enhancement layer decoder. In addition, the control unit may include a first processor and/or a second processor.

The client device may use the communication unit to receive a bitstream including video data for the virtual reality service (3010).

For example, the video data may include base layer video data for a base layer and at least one enhancement layer video data for at least one enhancement layer predicted from the base layer.

The client device may then decode (3020) the base layer video data using a base layer decoder and / or a first processor.

The client device may then decode (3030) the at least one enhancement layer video data based on the base layer video data using an enhancement layer decoder and / or a second processor.

For example, the at least one enhancement layer video data may be video data for at least one region of interest in a virtual space.

The contents related to the operation of the client device may include all the contents of the client device described above.
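
The client-side steps 3020 and 3030 can be illustrated with a toy model of layered decoding. This is a sketch under simplifying assumptions, not an actual scalable video decoder: the base layer is taken to be a half-resolution picture, the prediction from the base layer is modeled as nearest-neighbor 2x upsampling, and the enhancement layer is modeled as residual blocks keyed by the (top, left) position of each ROI block. All function names are hypothetical.

```python
def upsample2x(frame):
    """Nearest-neighbor 2x upsampling of a 2D list of samples."""
    out = []
    for row in frame:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def decode_frame(base, enhancement):
    """Reconstruct a picture from the base layer plus ROI residual blocks.

    `enhancement` maps (top, left) to a 2D residual block (toy representation).
    """
    picture = upsample2x(base)  # analogue of step 3020: base layer picture
    for (top, left), residual in enhancement.items():  # analogue of step 3030
        for dy, res_row in enumerate(residual):
            for dx, r in enumerate(res_row):
                picture[top + dy][left + dx] += r
    return picture


base = [[10, 20], [30, 40]]
enhancement = {(0, 2): [[1, -1], [2, 0]]}
for row in decode_frame(base, enhancement):
    print(row)
```

Only samples inside the ROI block receive a residual, so in this model regions outside the ROI are shown at base layer quality.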

FIG. 31 is a diagram illustrating an exemplary operation of a server device.

The server device may include a control unit and/or a communication unit. The control unit may include a base layer encoder and/or an enhancement layer encoder. In addition, the control unit may include a first processor and/or a second processor.

The server device may generate base layer video data using the base layer encoder and / or the first processor (3110).

The server device may then use the enhancement layer encoder and / or the second processor to generate at least one enhancement layer video data based on the base layer video data (3120).

The server device may then use the communication unit to transmit the bitstream containing the video data for the virtual reality service.

For example, the video data may include the base layer video data for a base layer and the at least one enhancement layer video data for at least one enhancement layer predicted from the base layer.

The contents related to the operation of the server device may include all the contents of the server device described above.
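
The server-side steps 3110 and 3120 can be illustrated with a matching toy model of layered encoding. As a sketch under simplifying assumptions (not an actual scalable video encoder), the base layer is produced by 2x decimation of the full picture, and the enhancement layer holds, for each ROI block only, the residual between the original samples and the base layer prediction. The block description `(top, left, height, width)` and all function names are hypothetical.

```python
def downsample2x(frame):
    """2x decimation: keep every second sample in each direction."""
    return [row[::2] for row in frame[::2]]


def predict_from_base(base, y, x):
    """Base layer prediction for full-resolution position (y, x)."""
    return base[y // 2][x // 2]


def encode_frame(frame, roi_blocks):
    """Return (base, enhancement) for one picture.

    `roi_blocks` is a list of (top, left, height, width) tuples.
    """
    base = downsample2x(frame)  # analogue of step 3110
    enhancement = {}
    for top, left, h, w in roi_blocks:  # analogue of step 3120
        enhancement[(top, left)] = [
            [
                frame[top + dy][left + dx]
                - predict_from_base(base, top + dy, left + dx)
                for dx in range(w)
            ]
            for dy in range(h)
        ]
    return base, enhancement


frame = [
    [10, 11, 21, 19],
    [12, 10, 22, 20],
    [30, 30, 40, 40],
    [30, 31, 40, 42],
]
base, enhancement = encode_frame(frame, [(0, 2, 2, 2)])
print(base)         # [[10, 21], [30, 40]]
print(enhancement)  # {(0, 2): [[0, -2], [1, -1]]}
```

Because residuals are generated only for ROI blocks, the enhancement layer in this model grows with the size of the ROI rather than with the size of the whole picture.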

In addition, according to an embodiment disclosed herein, the above-described method may be implemented as processor-readable code on a medium in which a program is recorded. Examples of the processor-readable medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, optical data storage devices, and the like, and the medium may also be implemented in the form of a downloadable file.

The electronic device described above is not limited to the configurations and methods of the above-described embodiments; all or some of the embodiments may be selectively combined so that various modifications can be made.

Preferred embodiments of the technology of the present disclosure have been described with reference to the accompanying drawings. Here, the terms or words used in the present specification and claims should not be construed as being limited to ordinary or dictionary meanings, but should be interpreted as meanings and concepts corresponding to the technical spirit of the present technology.

The scope of the present technology is not limited to the embodiments disclosed herein, and the present technology may be modified, changed, or improved in various forms within the scope of the spirit and claims of the present technology.

Claims

An image receiving method comprising:

receiving a bitstream including video data for a virtual reality service, wherein the video data includes base layer video data for a base layer and at least one enhancement layer video data for at least one enhancement layer predicted from the base layer;

decoding the base layer video data; and

decoding the at least one enhancement layer video data based on the base layer video data,

wherein the at least one enhancement layer video data is video data for at least one region of interest in a virtual space.
The method of claim 1, further comprising:

generating first signaling data,

wherein the first signaling data includes gaze information indicating a gaze direction of a user in the virtual space.
The method of claim 2, further comprising:

determining whether a gaze area corresponding to the gaze information is included in the at least one ROI; and

transmitting the first signaling data if the gaze area is included in an area other than the at least one ROI,

wherein the gaze area is added to the at least one ROI.
The method of claim 1, wherein

the bitstream includes region of interest (ROI) information indicating a location of the at least one region of interest in the entire region of the virtual space, and

the at least one enhancement layer video data is decoded based on the base layer video data and the ROI information.
The method of claim 4, wherein

the ROI information includes a tile number list field including the number of at least one tile corresponding to the ROI.
The method of claim 5, wherein

the tile number list field may include the numbers of all tiles corresponding to the ROI, a start number and an end number of consecutive tiles, and the numbers of the upper-left and lower-right tiles of the ROI.
The method of claim 4, wherein

the ROI information is received through at least one of a Supplemental Enhancement Information (SEI) message, a Video Usability Information (VUI) message, a slice header, and a file describing the video data.
The method of claim 4, wherein

the bitstream includes second signaling data, and

receiving the bitstream comprises

receiving the base layer video data and the at least one enhancement layer video data independently through a plurality of sessions based on the second signaling data.
The method of claim 8, wherein

the second signaling data includes service layer signaling data including information for obtaining the video data and a service list table including information for obtaining the service layer signaling data.
The method of claim 9, wherein

the service layer signaling data includes the ROI information.
An image transmission method comprising:

generating base layer video data;

generating at least one enhancement layer video data based on the base layer video data; and

transmitting a bitstream including video data for a virtual reality service,

wherein the video data includes the base layer video data for a base layer and the at least one enhancement layer video data for at least one enhancement layer predicted from the base layer, and

wherein the at least one enhancement layer video data is video data for at least one region of interest in a virtual space.
The method of claim 11, further comprising:

receiving first signaling data,

wherein the first signaling data includes gaze information indicating a gaze direction of a user in the virtual space, and

wherein the first signaling data is received when a gaze area corresponding to the gaze information is included in an area other than the at least one ROI.
The method of claim 12, wherein

the gaze area is added to the at least one ROI.
The method of claim 11, wherein

the bitstream includes region of interest (ROI) information indicating a location of the at least one region of interest in the entire region of the virtual space, and

the at least one enhancement layer video data is encoded based on the base layer video data and the ROI information.
The method of claim 14, wherein

the ROI information includes a tile number list field including the number of at least one tile corresponding to the ROI.
The method of claim 15, wherein

the tile number list field may include the numbers of all tiles corresponding to the ROI, a start number and an end number of consecutive tiles, and the numbers of the upper-left and lower-right tiles of the ROI.
The method of claim 14, wherein

the ROI information is transmitted through at least one of a Supplemental Enhancement Information (SEI) message, a Video Usability Information (VUI) message, a slice header, and a file describing the video data.
The method of claim 14, further comprising:

generating second signaling data,

wherein transmitting the bitstream comprises

transmitting the base layer video data and the at least one enhancement layer video data independently through a plurality of sessions based on the second signaling data.
The method of claim 18, wherein

the second signaling data includes service layer signaling data including information for acquiring the video data and a service list table including information for acquiring the service layer signaling data.
The method of claim 19, wherein

the service layer signaling data includes the ROI information.
An image receiving apparatus comprising:

a communication unit configured to receive a bitstream including video data for a virtual reality service, wherein the video data includes base layer video data for a base layer and at least one enhancement layer video data for at least one enhancement layer predicted from the base layer;

a base layer decoder configured to decode the base layer video data; and

an enhancement layer decoder configured to decode the at least one enhancement layer video data based on the base layer video data,

wherein the at least one enhancement layer video data is video data for at least one region of interest in a virtual space.
An image receiving apparatus comprising:

a communication unit configured to receive base layer video data for a base layer and at least one enhancement layer video data for at least one enhancement layer predicted from the base layer;

a first processor configured to decode the base layer video data; and

a second processor electrically coupled with the first processor and configured to decode the at least one enhancement layer video data based on the base layer video data,

wherein the at least one enhancement layer video data is video data for at least one region of interest in a virtual space.
An image transmitting apparatus comprising:

a base layer encoder configured to generate base layer video data;

an enhancement layer encoder configured to generate at least one enhancement layer video data based on the base layer video data; and

a communication unit configured to transmit a bitstream including video data for a virtual reality service,

wherein the video data includes the base layer video data for a base layer and the at least one enhancement layer video data for at least one enhancement layer predicted from the base layer, and

wherein the at least one enhancement layer video data is video data for at least one region of interest in a virtual space.