KR20200062891A

KR20200062891A - System and method for predicting user viewpoint using location information of sound source in 360 VR contents

Info

Publication number: KR20200062891A
Application number: KR1020180148834A
Authority: KR
Inventors: 김동호; 정은영
Original assignee: 서울과학기술대학교 산학협력단
Priority date: 2018-11-27
Filing date: 2018-11-27
Publication date: 2020-06-04
Also published as: KR102136301B1

Abstract

The present invention provides a system and a method for predicting a user viewpoint by using position information of a sound source in 360-degree VR content. According to embodiments of the present invention, the existing video MPD is extended to transmit sound source localization information (SLI), and a weight is assigned differentially to each of the pole, current viewport, sound source localization information (SLI), and region-of-interest (ROI) items according to the bandwidth situation. Each weight is multiplied by a reference bit rate to determine the bit rate of the next segment to be received, and segments are received over the bandwidth allocated at the determined segment bit rate. Using both visual and auditory perception improves the accuracy of user viewpoint position prediction. If the prediction fails, the user's current viewpoint is taken into account so that the viewpoint position can still be predicted. Therefore, the quality of experience (QoE) of users with respect to playback speed and playback quality can be guaranteed.

Description

System and method for predicting a user's viewpoint using sound source location information in 360-degree VR content{SYSTEM AND METHOD FOR PREDICTING USER VIEWPOINT USING LOCATION INFORMATION OF SOUND SOURCE IN 360 VR CONTENTS}

The present invention relates to a system and method for predicting a user's viewpoint using sound source location information in 360-degree VR (Virtual Reality) content, and more specifically to a technology that compression-codes the video and audio of the user's current viewport using DASH (Dynamic Adaptive Streaming over HTTP) segment tiles, thereby improving the accuracy of user viewpoint prediction and providing a realistic virtual reality service.

Recently, with the development of devices such as smartphones, public interest in virtual reality (VR) technology has been growing. VR technology raises the fidelity with which simulated objects are represented so as to narrow the gap between the real and the virtual, and it has recently drawn attention as a way to overcome the limitations of existing technologies.

Such 360 VR content is delivered over a network by means of a DASH MPD. DASH is an adaptive bit rate streaming technique that enables media data to be streamed over the Internet from web servers using HTTP (Hypertext Transfer Protocol).

Here, the MPD assigns an adaptation set to each audio and video stream within a period and a description set to each resolution within the adaptation set; the media of each adaptation set and description set is then divided into segments on the order of seconds and stored on the HTTP server (10).
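The period, adaptation set, description set, and segment hierarchy described above can be sketched as plain data classes. This is only an illustrative model: the class names, the `build_description_set` helper, and the segment file naming are hypothetical and are not taken from the patent or from any DASH library. (The source's "description set" appears to correspond to what the DASH standard calls a Representation.)

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class DescriptionSet:
    """One resolution/bit rate variant of a stream, split into segments."""
    resolution: str
    bitrate_kbps: int
    segment_urls: List[str] = field(default_factory=list)


@dataclass
class AdaptationSet:
    """One audio or video stream within a period."""
    media_type: str  # "video" or "audio"
    description_sets: List[DescriptionSet] = field(default_factory=list)


@dataclass
class Period:
    """A time span of the presentation holding its adaptation sets."""
    duration_s: float
    adaptation_sets: List[AdaptationSet] = field(default_factory=list)


def build_description_set(name: str, resolution: str, bitrate_kbps: int,
                          duration_s: float, segment_s: float) -> DescriptionSet:
    """Split a stream of duration_s seconds into segments of segment_s seconds.

    The URL pattern is a made-up example, not a required naming scheme.
    """
    count = math.ceil(duration_s / segment_s)
    urls = [f"{name}_{resolution}_{i:04d}.m4s" for i in range(1, count + 1)]
    return DescriptionSet(resolution, bitrate_kbps, urls)


# Example: a 10-second period with one video stream at two resolutions.
video = AdaptationSet("video", [
    build_description_set("video", "720p", 1500, 10.0, 4.0),
    build_description_set("video", "1080p", 4000, 10.0, 4.0),
])
period = Period(10.0, [video])
```

Each description set here carries its own segment list, which mirrors the statement that segments are produced per adaptation set and per description set before being stored on the server.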

Meanwhile, tiles are obtained by spatially dividing one video frame; each tile is compression-coded with High Efficiency Video Coding (HEVC) and transmitted at its own resolution.

Accordingly, the DASH client (20) parses the MPD (Media Presentation Description) provided by the HTTP engine and requests the corresponding content. The HTTP server first provides the lowest-resolution segment and thereafter provides segments adaptively according to network conditions and parameters: when network conditions are good, a high-quality segment is requested, and when they are poor, a low-quality segment is requested.
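The adaptive request behaviour described above (start at the lowest quality, then follow network conditions) can be illustrated with a minimal sketch. The function name `choose_bitrate` and the rule "pick the highest bit rate not exceeding the measured throughput" are assumptions made for illustration; the source does not specify the client's selection rule.

```python
from typing import Optional, Sequence


def choose_bitrate(available_kbps: Sequence[int],
                   measured_kbps: Optional[float]) -> int:
    """Pick the bit rate of the next segment to request.

    available_kbps: bit rates advertised in the MPD (any order).
    measured_kbps:  throughput estimate, or None before the first segment.
    """
    ladder = sorted(available_kbps)
    if measured_kbps is None:
        # No measurement yet: begin with the lowest quality.
        return ladder[0]
    affordable = [b for b in ladder if b <= measured_kbps]
    # Fall back to the lowest quality if even that exceeds the estimate.
    return affordable[-1] if affordable else ladder[0]
```

For example, with advertised bit rates of 500, 1500, and 4000 kbps, the first request uses 500 kbps, and a later throughput estimate of 2000 kbps leads to a 1500 kbps request.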

However, 360 VR content produced with this VR technology is limited by its much higher bandwidth consumption compared with conventional 2D content.

Accordingly, as shown in FIG. 1, various methods have been developed that use HEVC tiling to predict the viewpoint containing the user's region of interest (ROI), transmit the tiles of that viewpoint in high quality, and transmit the remaining tiles in low quality in order to reduce bandwidth. In these methods the user's viewpoint is predicted from images of moving objects. For example, when a car is moving, the tiles containing the moving car are transmitted in high quality and the remaining background tiles are transmitted in low quality.
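The conventional tile scheme described above amounts to a per-tile quality map. The following is a minimal sketch under the assumption that tiles are addressed by (row, column) on a regular grid and that only two quality levels exist; the function name and the labels "high" and "low" are illustrative, not from the source.

```python
from typing import List, Set, Tuple


def assign_tile_quality(rows: int, cols: int,
                        viewport_tiles: Set[Tuple[int, int]]) -> List[List[str]]:
    """Return a rows x cols grid marking predicted-viewpoint tiles as "high"
    and every other tile as "low"."""
    return [
        ["high" if (r, c) in viewport_tiles else "low" for c in range(cols)]
        for r in range(rows)
    ]


# Example: a 2 x 3 tiling where the moving car is predicted in tile (0, 1).
quality_map = assign_tile_quality(2, 3, {(0, 1)})
```

Only the tiles in `viewport_tiles` are requested in high quality, which is where the bandwidth saving comes from.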

However, no existing technology reflects the sound source at the user's viewpoint in a virtual reality service so that the user can perceive the direction, distance, and spatial impression of the sound source.

An object of the present invention is to provide an apparatus and method for predicting a user's viewpoint using sound source location information in 360-degree VR content that, in adding realistic three-dimensional audio to a virtual reality service, can reduce bandwidth by allocating a high bit rate, improve the accuracy of viewpoint prediction, and guarantee user satisfaction with playback speed and playback quality.

Another object of the present invention is to provide an apparatus and method for predicting a user's viewpoint using sound source location information in 360-degree VR content that can improve immersion and interest in a virtual reality service through the stereophonic sound provided according to the present invention.

The objects of the present invention are not limited to those mentioned above; other objects and advantages not mentioned can be understood from the following description and will become clearer through the embodiments of the present invention. It will also be readily appreciated that the objects and advantages of the present invention can be realized by the means set out in the claims and combinations thereof.

To achieve the above object, a system for predicting a user's viewpoint using sound source location information in 360-degree VR content according to an embodiment of the present invention comprises:

a content production unit that spatially divides a single panoramic video in enterprise resource planning (ERP) format into a plurality of tiles, compression-codes each tile, and delivers the result as media data to an HTTP server; an HTTP server that divides the received media data into segments of a predetermined time unit, generates a video MPD and an audio MPD, and transmits the MPD containing the generated video MPD and audio MPD, together with the segment tiles, to a client device over a reference bandwidth of the network; and a client device that predicts the user's viewpoint position from the received segment tiles, video MPD, and audio MPD, adaptively determines the bit rate of the next segment to be received according to the bandwidth situation, and delivers the bandwidth allocated at the determined segment bit rate to the HTTP server,

wherein the HTTP server is configured to transmit the next segment to the client device using the allocated bandwidth received from the client device, and

the client device is configured to decode the received segment tiles in a VR engine to obtain video and audio, and to render and play the obtained video and audio in a 360-degree space.

Preferably, the audio MPD is sound localization information (SLI), and the description set of the audio MPD includes a description set of the SLI (SLID).

The SLID may include a sound localization identifier (SLI_id), the position of the sound source in the 360-degree space (x-, y-, and z-axis values), and a panning model.
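As a rough illustration, the SLI description set can be modelled as a small record, and the sound source position can be mapped to the tile that contains it. Everything beyond the three listed fields is an assumption made for this sketch: the axis convention (x forward, y left, z up), the equirectangular tile grid, and the names `SoundLocalizationDescription` and `sli_tile` are hypothetical and do not come from the patent.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SoundLocalizationDescription:
    """Fields listed for the SLI description set (SLID)."""
    sli_id: int
    x: float
    y: float
    z: float
    panning_model: str


def sli_tile(sli: SoundLocalizationDescription,
             rows: int, cols: int) -> Tuple[int, int]:
    """Map a sound source position to a (row, col) tile index.

    Assumed convention: yaw is measured around the z axis from the x axis,
    pitch is the elevation above the x-y plane, and the frame is laid out
    as a rows x cols grid covering 360 x 180 degrees.
    """
    norm = math.sqrt(sli.x ** 2 + sli.y ** 2 + sli.z ** 2)
    if norm == 0.0:
        raise ValueError("sound source position must not be the origin")
    yaw = math.atan2(sli.y, sli.x)      # range [-pi, pi]
    pitch = math.asin(sli.z / norm)     # range [-pi/2, pi/2]
    col = min(cols - 1, int((yaw + math.pi) / (2.0 * math.pi) * cols))
    row = min(rows - 1, int((math.pi / 2.0 - pitch) / math.pi * rows))
    return row, col
```

A client could use the resulting tile index to mark the SLI tiles that receive the SLI weight when segment bit rates are computed.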

Preferably, the client device may include:

an MPD parser that parses the received segment tiles and MPD; a processing unit that predicts the user's viewpoint position from the video MPD and audio MPD of the parsed MPD, adaptively determines the bit rate of the next segment to be received according to the bandwidth situation, and delivers the bandwidth allocated at the determined segment bit rate to the HTTP server; and a VR engine that decodes the segments received through the processing unit to obtain audio and video and plays the obtained audio and video by three-dimensional rendering in a 360-degree space.

Preferably, the processing unit determines the bit rate of the next segment to be received from the tile weight of at least one of the region-of-interest information (ROI), sound source localization information (SLI), current viewport, and pole items, the reference bit rate, the rank of the bandwidth situation, and the tiles located at the ROI, SLI, current viewport, and pole,

and may be configured to allocate bandwidth at the determined segment bit rate and deliver it to the HTTP server.

Preferably, the weights used by the processing unit may be stored in association with each of the region-of-interest information (ROI), sound source localization information (SLI), current viewport, and pole items, as predetermined for each bandwidth situation.

When the rank r of the sum of the per-item weights is 1, the processing unit determines that the bandwidth situation is very good, and then

sets the weight w_S of the tiles corresponding to the SLI, the weight w_R of the tiles corresponding to the ROI, the weight w_V of the tiles corresponding to the current viewport, and the weight w_P of the tiles corresponding to the pole each to the maximum weight value max w_X, and

may be configured to derive the bit rate R_i of the i-th tile as the sum of the products of the maximum weight value max w_X, the i-th tile t_i, and the reference bit rate R_f.

When the rank r of the sum of the per-item weights is 2, the processing unit determines that the bandwidth situation is good, and then

sets the weight w_S of the tiles corresponding to the SLI, the weight w_R of the tiles corresponding to the ROI, and the weight w_V of the tiles corresponding to the current viewport to the maximum weight value max w_X, sets the weight w_P of the tiles corresponding to the pole to max w_X - 1, and

may be configured to derive the segment bit rate R_i as the sum of two products: the sum (S + R + V) of the ROI, SLI, and current-viewport tiles times the reference bit rate R_f times max w_X, and the pole tile P in the frequency domain times R_f times (max w_X - 1). That is, R_i = (S + R + V) · R_f · max w_X + P · R_f · (max w_X - 1).

When the rank r of the sum of the per-item weights is 3, the processing unit determines that the bandwidth situation is bad, and then

sets the weight w_S of the tiles corresponding to the SLI and the weight w_R of the tiles corresponding to the ROI to the maximum weight value max w_X, sets the weight w_V of the tiles corresponding to the current viewport and the weight w_P of the tiles corresponding to the pole to max w_X - 1, and

may be configured to derive the segment bit rate R_i for the image set R as the sum of two products: the sum (S + R) of the ROI and SLI tiles times the reference bit rate R_f times max w_X, and the sum (V + P) of the current-viewport tile V and the pole tile P in the frequency domain times R_f times (max w_X - 1). That is, R_i = (S + R) · R_f · max w_X + (V + P) · R_f · (max w_X - 1).

When the rank r of the sum of the per-item weights is 4, the processing unit determines that the bandwidth situation is very bad, and then

sets the weight w_S of the tiles corresponding to the SLI and the weight w_R of the tiles corresponding to the ROI to the maximum weight value max w_X, sets the weight w_V of the tiles corresponding to the current viewport to max w_X - 1, sets the weight w_P of the tiles corresponding to the pole to max w_X - 2, and

may be configured to derive the bit rate R_i for the image set R as the sum of three products: the sum (S + R) of the ROI and SLI tiles times the reference bit rate R_f times max w_X; the current-viewport tile V times R_f times (max w_X - 1); and the pole tile P in the frequency domain times R_f times (max w_X - 2). That is, R_i = (S + R) · R_f · max w_X + V · R_f · (max w_X - 1) + P · R_f · (max w_X - 2).
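The four bandwidth ranks described above differ only in how far each item's weight is lowered from the maximum weight. The sketch below encodes that pattern as a table of offsets. It rests on one interpretive assumption that the source does not state explicitly: S, R, V, and P are treated as the number of tiles in each category. The names `RANK_OFFSETS`, `weights_for_rank`, and `segment_bitrate` are hypothetical.

```python
from typing import Dict

# Amount subtracted from the maximum weight for each item, per bandwidth rank
# (1 = very good, 2 = good, 3 = bad, 4 = very bad).
RANK_OFFSETS: Dict[int, Dict[str, int]] = {
    1: {"SLI": 0, "ROI": 0, "viewport": 0, "pole": 0},
    2: {"SLI": 0, "ROI": 0, "viewport": 0, "pole": 1},
    3: {"SLI": 0, "ROI": 0, "viewport": 1, "pole": 1},
    4: {"SLI": 0, "ROI": 0, "viewport": 1, "pole": 2},
}


def weights_for_rank(rank: int, max_weight: int) -> Dict[str, int]:
    """Return the per-item tile weights for a bandwidth rank."""
    if rank not in RANK_OFFSETS:
        raise ValueError("rank must be 1, 2, 3, or 4")
    return {item: max_weight - off for item, off in RANK_OFFSETS[rank].items()}


def segment_bitrate(rank: int, tile_counts: Dict[str, int],
                    reference_bitrate: float, max_weight: int) -> float:
    """Sum over items of (tile count x item weight x reference bit rate).

    tile_counts maps "SLI", "ROI", "viewport", and "pole" to the number of
    tiles in that category (assumed interpretation of S, R, V, and P).
    """
    weights = weights_for_rank(rank, max_weight)
    return sum(tile_counts[item] * weights[item] * reference_bitrate
               for item in weights)


# Example: 1 SLI tile, 2 ROI tiles, 3 viewport tiles, 2 pole tiles,
# reference bit rate 100, maximum weight 3.
counts = {"SLI": 1, "ROI": 2, "viewport": 3, "pole": 2}
rates = {r: segment_bitrate(r, counts, 100, 3) for r in (1, 2, 3, 4)}
```

With these example inputs the computed bit rate falls as the rank worsens, because first the pole tiles and then the viewport tiles are given lower weights while the SLI and ROI tiles keep the maximum weight.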

A method for predicting a user's viewpoint using sound source location information in 360-degree VR content according to another aspect of the present invention comprises:

(a) spatially dividing, in a content production unit, a single panoramic video in enterprise resource planning (ERP) format into a plurality of tiles, compression-coding each tile, and delivering the result as media data to an HTTP server;

(b) dividing, in the HTTP server, the received media data into segments of a predetermined time unit, generating a video MPD and an audio MPD, and transmitting the MPD containing the generated video MPD and audio MPD, together with the segment tiles, to a client device over a reference bandwidth of the network;

(c) predicting, in the client device, the user's viewpoint position from the received segment tiles, video MPD, and audio MPD, adaptively determining the bit rate of the next segment to be received according to the bandwidth situation, and delivering the bandwidth allocated at the determined segment bit rate to the HTTP server;

(d) transmitting, in the HTTP server, the next segment to the client device using the allocated bandwidth received from the client device; and

(e) decoding, in a VR engine, the segment tiles received via the client device to obtain video and audio, and rendering and playing the obtained video and audio in a 360-degree space.

Here, the audio MPD is sound localization information (SLI), the description set of the audio MPD includes a description set of the SLI (SLID), and the SLID includes a sound localization identifier (SLI_id), the position of the sound source in the 360-degree space (x-, y-, and z-axis values), and a panning model.

Step (c) may determine the bit rate of the next segment to be received from the tile weight of at least one of the region-of-interest information (ROI), sound source localization information (SLI), current viewport, and pole items, the reference bit rate, the rank of the bandwidth situation, and the tiles located at the ROI, SLI, current viewport, and pole, and may allocate bandwidth at the determined segment bit rate and deliver it to the HTTP server.

The weights of step (c) may be stored in association with each of the region-of-interest information (ROI), sound source localization information (SLI), current viewport, and pole items, as predetermined for each bandwidth situation.

In step (c), when the rank r of the sum of the per-item weights is 1, the bandwidth situation is determined to be very good.

In step (c), when the rank r of the sum of the per-item weights is 2, the bandwidth situation is determined to be good.

In step (c), when the rank r of the sum of the per-item weights is 3, the bandwidth situation is determined to be bad.

In step (c), when the rank r of the sum of the per-item weights is 4, the bandwidth situation is determined to be very bad.

According to the present invention, the video MPD is extended to transmit the sound source localization information (SLI) of the audio MPD; weights are assigned differentially to the region-of-interest information (ROI), sound source localization information (SLI), current viewport, and pole items according to the bandwidth situation; each weight is multiplied by the reference bit rate to determine the bit rate of the next segment to be received; and segments are received at the determined segment bit rate. Using both visual and auditory perception in this way improves the accuracy of predicting the user's viewpoint position, and when that prediction fails, the viewpoint position can still be predicted by taking the user's current viewpoint into account. User satisfaction with playback speed and playback quality (QoE: Quality of Experience) can therefore be guaranteed.

Accordingly, because the audio of the real environment is reflected three-dimensionally in the virtual reality provided to the user, the user can perceive direction, distance, and space in the virtual reality as in the real environment. The virtual reality service can thus be provided realistically, which further improves immersion and interest in the service.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description below, serve to aid understanding of the technical idea of the invention; the invention should therefore not be construed as limited to what is shown in the drawings.
FIG. 1 is an exemplary view showing the resolution of the user's viewpoint area and of the remaining area, based on tiles of conventional 360 VR content.
FIG. 2 is a view showing the configuration of a communication system to which an embodiment of the present invention is applied.
FIG. 3 is a view showing the configuration of an apparatus for predicting a user's viewpoint using sound source location information in 360-degree VR content according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in more detail with reference to the drawings.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of its scope; the present invention is defined only by the scope of the claims.

The terms used in this specification will be briefly explained, and then the present invention will be described in detail.

The terms used in the present invention were chosen, as far as possible, from general terms currently in wide use, taking into account their function in the invention; they may nevertheless vary with the intention of those skilled in the art, with precedent, or with the emergence of new technologies. In certain cases a term has been arbitrarily selected by the applicant, in which case its meaning is described in detail in the relevant part of the description. The terms used in the present invention should therefore be defined by their meaning and by the overall content of the invention, not simply by their names.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. The term "unit" as used in the specification refers to software or to a hardware component such as an FPGA or ASIC, and a "unit" performs certain roles. A "unit" is not, however, limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium or to execute on one or more processors.

Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functionality provided within components and "units" may be combined into a smaller number of components and "units" or further separated into additional components and "units".

Below, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice them. To describe the invention clearly, parts of the drawings unrelated to the description are omitted.

In this embodiment, the network may carry Internet Protocol (IP) packets, frame relay frames, asynchronous transfer mode (ATM) cells, or other information between network addresses. The network may also be a heterogeneous network that includes broadcasting networks such as cable and satellite communication links. The network may include one or more local area networks (LANs); metropolitan area networks (MANs); wide area networks (WANs); all or part of a global network such as the Internet; or any other communication system or systems at one or more locations.

In various embodiments, the heterogeneous network includes a broadcast network and a broadband network. The broadcast network is generally for broadcasting media data to client devices in one direction, for example from one or more servers to the client devices. The broadcast network may include any number of broadcast links and devices, such as satellite, wireless, wired, and fiber-optic network links and devices.

The broadband network is generally for broadband access to media data by client devices in two directions, for example back and forth between one or more servers and the client devices. The broadband network may include any number of broadband links and devices, such as Internet, wireless, wired, and fiber-optic network links and devices.

The network facilitates communication between the servers and the playback processing devices, which are the various client devices. Each server includes any suitable computing or processing device capable of providing computing services to one or more client devices. Each server may include, for example, one or more processing devices, one or more memories storing instructions and data, and one or more network interfaces that facilitate communication over the network. For example, the servers may include servers that broadcast media data over the broadcast network within the network using HTTP techniques. In another example, the servers may include servers that broadcast media data over the broadcast network within the network using DASH.

Each client device represents any suitable computing or processing device that interacts with at least one server or other computing device(s) over the network. In this example, the playback client devices may include a desktop computer, a mobile phone or smartphone, a personal digital assistant (PDA), a laptop computer, a tablet computer, and a set-top box and/or television. However, any other or additional client devices may be used in the communication system.

In this example, some client devices communicate with the network indirectly. For example, client devices communicate through one or more base stations, such as cellular base stations or eNodeBs. Client devices also communicate through one or more wireless access points, such as IEEE 802.11 wireless access points. These are given for illustration only; each client device may communicate with the network directly, or indirectly through any suitable intermediate device(s) or network(s). As described in more detail below, any or all of the client devices may include a structure for receiving and presenting media data using HTTP and DASH.

The communication system to which the embodiments of the present invention are applied may include any number of each component in any suitable configuration. In general, computing and communication systems come in a wide variety of configurations, and FIG. 2 does not limit the scope of the present disclosure to any particular configuration. FIG. 2 illustrates one operating environment in which the various features disclosed in this patent document can be used, but those features may be used in any other suitable system.

Accordingly, in this embodiment, sound localization information (SLI) for a sound source in three-dimensional space is carried by extending the media presentation description (MPD) for the video and is transmitted to the client device via the HTTP server. The client device determines the bit rate of the next segment to be received based on the received SLI, the region-of-interest (ROI) information, the current viewport, and the pole in the frequency domain, and passes the determined bit rate to the HTTP server. As a result, consumption of the available bandwidth can be reduced while high bit rates are allocated, the accuracy of viewpoint prediction can be improved, and the quality of experience (QoE) with respect to playback speed and playback quality can be ensured.

FIG. 2 shows a system for predicting a user's viewpoint using sound source location information in 360-degree VR content according to an embodiment of the present invention. Referring to FIG. 2, the viewpoint prediction system may include a content production unit 100, an HTTP server 200, and a client device 300.

The content production unit 100 divides a single panoramic video in equirectangular projection (ERP) format into six images (two poles and four equator regions) so that bit rates can be allocated to parts of the acquired 360-degree video, and codes the six images at each resolution. That is, each video frame is spatially divided into a plurality of tiles, and each tile is compression-coded with High Efficiency Video Coding (HEVC).
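The six-way split of the ERP panorama can be illustrated with a small lookup from a viewing direction to a tile. This is only a sketch under stated assumptions: the function name `erp_tile`, the tile labels, the 90-degree yaw slices for the equator band, and the convention that pitch runs from 0 (top) to 180 (bottom) are all hypothetical; the 60 and 120 degree pole boundaries follow the pitch range this description gives for the pole set.

```python
def erp_tile(yaw_deg: float, pitch_deg: float) -> str:
    """Map a viewing direction to one of six ERP tiles (assumed layout)."""
    if not 0.0 <= pitch_deg <= 180.0:
        raise ValueError("pitch must be in [0, 180]")
    # Assumption: pitch is measured from the top of the sphere, so the
    # top pole covers [0, 60) and the bottom pole covers [120, 180].
    if pitch_deg < 60.0:
        return "pole_top"
    if pitch_deg >= 120.0:
        return "pole_bottom"
    # Assumption: the equator band is cut into four 90-degree yaw slices.
    index = min(int((yaw_deg % 360.0) // 90.0), 3)
    return f"equator_{index}"
```

For example, `erp_tile(95.0, 90.0)` falls in the second equator slice, while any direction with a pitch below 60 degrees maps to the top pole tile regardless of yaw.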

Meanwhile, for a sound source in the 360-degree space, the azimuth and elevation angles with respect to the x, y, and z axes are each specified relative to the user's orientation. The sound source is then localized for the user's orientation by applying the head-related transfer function (HRTF) measured at the corresponding point of the original sound and by interpolating the sound image at points where no HRTF was measured.

Here, the HRTF is a relation, expressed as a function, obtained by generating the same sound from all directions and measuring the frequency response for each direction. HRTF values differ from person to person according to the characteristics of the head and body. For this reason, individualized HRTFs have recently been developed in laboratories. Generalized HRTF data has conventionally been stored in a database and applied identically to all users at audio output.
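The interpolation of sound images at unmeasured points can be sketched as follows. The description does not say which interpolation scheme is used, so this example assumes plain linear interpolation between the two nearest measured azimuths; the function name `interpolate_hrtf` and the representation of an HRTF as a list of magnitudes per frequency bin are hypothetical.

```python
def interpolate_hrtf(measured: dict[float, list[float]], azimuth: float) -> list[float]:
    """Linearly interpolate a frequency response at an unmeasured azimuth.

    `measured` maps azimuths in [0, 360) to per-bin magnitudes (assumed format).
    """
    azimuth %= 360.0
    angles = sorted(measured)
    below = [a for a in angles if a <= azimuth]
    above = [a for a in angles if a >= azimuth]
    # Pick the neighbouring measurements, wrapping around 360 degrees.
    lo_key = below[-1] if below else angles[-1]
    hi_key = above[0] if above else angles[0]
    lo_ang = lo_key if below else lo_key - 360.0
    hi_ang = hi_key if above else hi_key + 360.0
    if hi_ang == lo_ang:  # exact hit on a measured azimuth
        return list(measured[lo_key])
    w = (azimuth - lo_ang) / (hi_ang - lo_ang)
    return [(1.0 - w) * a + w * b
            for a, b in zip(measured[lo_key], measured[hi_key])]
```

A production renderer would typically interpolate in a perceptually motivated domain and over elevation as well; the one-dimensional case is kept here only to show the idea.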

The content production unit 100 then generates an MPD comprising an audio MPD and a video MPD for the audio and video streams, respectively, of the frames within one period. Each of the video and audio MPDs includes an adaptation set and a description set contained in the adaptation set.

According to the present invention, the description set of the audio MPD includes a description set of the sound localization information (SLI), referred to as the SLID (Sound Location Information Description).

Here, the SLID consists of a sound localization identifier (id), the position of the sound source in the 360-degree space (x, y, and z axis values), and a panning model. The SLID description set is shown in Table 1 below.

[Table 1]
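As an illustration of the three SLID fields (identifier, position, panning model), the sketch below parses them from an XML fragment. The element name `SLID`, its attribute names, the enclosing `AdaptationSet` fragment, and the sample values are assumptions made for this example; the actual schema is the one given in Table 1.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class SoundLocation:
    source_id: str       # sound localization identifier
    x: float             # position of the source in the 360-degree space
    y: float
    z: float
    panning_model: str   # name of the panning model


# Hypothetical audio adaptation set carrying one SLID entry.
SAMPLE_MPD_FRAGMENT = """
<AdaptationSet contentType="audio">
  <SLID id="src1" x="1.0" y="0.0" z="-2.5" panningModel="HRTF"/>
</AdaptationSet>
"""


def parse_slid(xml_text: str) -> list[SoundLocation]:
    """Collect every SLID element found in the fragment."""
    root = ET.fromstring(xml_text)
    return [
        SoundLocation(
            source_id=e.get("id"),
            x=float(e.get("x")),
            y=float(e.get("y")),
            z=float(e.get("z")),
            panning_model=e.get("panningModel"),
        )
        for e in root.iter("SLID")
    ]
```

Calling `parse_slid(SAMPLE_MPD_FRAGMENT)` yields a single `SoundLocation` whose position can then be compared against tile geometry on the client.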

The video encoded at each resolution is then divided into segments on the order of seconds and delivered to the HTTP server 200 together with the generated video and audio MPDs.

The HTTP server 200 stores the segmented video and the MPD comprising the video and audio MPDs, and then transmits them to the client device 300 over the network.

In this embodiment, HTTP defines a new framework for delivering time-continuous multimedia such as audio and video, as well as other fixed content such as widgets and files. DASH is an adaptive bit-rate streaming technique that enables a receiving entity to stream media data provided by HTTP servers over the Internet.

Meanwhile, the client device 300 sends requests (e.g., "GET" requests) with a uniform resource locator (URL) received over the network to the HTTP server 200 and receives media data, such as segments, in response.
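A segment request of this kind can be sketched with Python's standard library. The URL layout (tile, bit rate, and segment number in the path), the `.m4s` suffix, and the function name `build_segment_request` are hypothetical; in practice the segment URLs come from the MPD. The request object is only built here, not sent.

```python
from urllib.request import Request


def build_segment_request(base_url: str, tile_id: int,
                          bitrate_kbps: int, segment_no: int) -> Request:
    """Build an HTTP GET request for one tile segment (assumed URL layout)."""
    url = (f"{base_url.rstrip('/')}/tile{tile_id}/"
           f"{bitrate_kbps}k/seg{segment_no}.m4s")
    return Request(url, method="GET")
```

Passing the returned object to `urllib.request.urlopen` would perform the actual download of the segment.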

The client device 300 receives the audio and video MPDs and the segments, separates the received segments into second-scale segments, and then predicts the user's viewpoint position based on the video MPD and the audio MPD.

The client device 300 then determines the segment bit rate using weights set in consideration of the region-of-interest (ROI) information, the sound localization information (SLI), the current viewport, and the pole in the frequency domain, and requests the tile at the predicted user position at the determined segment bit rate from the HTTP server 200 over the network.

The client device 300 then renders the video and audio received from the HTTP server 200 over the network at the determined bit rate into the 360-degree space and plays them back.

FIG. 3 shows the detailed configuration of the client device 300 shown in FIG. 2. Referring to FIG. 3, the client device 300 may include an MPD parser 310, a processing unit 320, and a VR engine 330.

The MPD parser 310 sends requests (e.g., "GET" requests) with a uniform resource locator (URL) received over the network to the HTTP server 200 and receives the media data of the segments in response. The received media data is passed to the processing unit 320.

The processing unit 320 receives the audio MPD, the video MPD, and the segments, separates the received segments into second-scale segments, and then predicts the user's viewpoint position based on the video MPD and the audio MPD.

That is, the processing unit 320 predicts the user's viewpoint position for the received segment based on the video MPD. The tile at the user's viewpoint position is derived from the yaw of the user's orientation and the geometric values of each tile, according to Equation 1 below.

[Equation 1]

$$c - \frac{l}{2} \;<\; y \pm \frac{f}{2} \;<\; c + \frac{l}{2}$$

That is, if the above condition is satisfied, the tile is determined to belong to the current viewport.

Here, y is the current yaw angle, f is the field of view (FOV), c is the center point of each tile, and l is the phi-length, i.e., the distance between the start and end points of the tile's azimuth. According to Equation 1, the processing unit 320 determines that a tile belongs to the tile set of the current viewport when the values obtained by adding FOV(f)/2 to, and subtracting it from, the current yaw angle (y) are greater than the center point (c) minus half the phi-length (l/2) and less than the center point (c) plus half the phi-length (l/2).
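The viewport membership test can be sketched as below, using the symbols y, f, c, and l from the text. Two points are assumptions of this example rather than statements of the description: a tile is accepted when either viewport edge (y - f/2 or y + f/2) falls inside the tile's yaw span, and angles are wrapped so that tiles straddling 0/360 degrees are handled. The function names are hypothetical.

```python
def angular_offset(angle_deg: float, center_deg: float) -> float:
    """Signed difference angle - center, wrapped into [-180, 180)."""
    return (angle_deg - center_deg + 180.0) % 360.0 - 180.0


def tile_in_viewport(yaw: float, fov: float,
                     center: float, phi_length: float) -> bool:
    """Check whether a viewport edge lies within (c - l/2, c + l/2)."""
    half_span = phi_length / 2.0
    edges = (yaw - fov / 2.0, yaw + fov / 2.0)
    return any(-half_span < angular_offset(edge, center) < half_span
               for edge in edges)
```

With a 90-degree field of view and a tile centered at 45 degrees spanning 90 degrees, a yaw of 80 degrees selects the tile, whereas a yaw of 200 degrees does not.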

Meanwhile, the processing unit 320 derives the bandwidth situation based on the items of region-of-interest (ROI) information, sound localization information (SLI), current viewport, and pole in the frequency domain, and derives the weight of each of these items according to the derived bandwidth situation.

That is, the bandwidth situation r with respect to the items of region-of-interest (ROI) information, sound localization information (SLI), current viewport, and pole in the frequency domain is derived by minimizing the difference between the available bandwidth R_B and the bit rate R_i of the i-th tile.

Accordingly, the bandwidth situation r takes a value from 1 to 4, corresponding to very good, good, bad, and very bad, respectively.
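One way to read this selection is sketched below: given the total bit rate each level would require, pick the level whose demand comes closest to the available bandwidth. The restriction to levels that do not exceed the available bandwidth, the fallback to the worst level, and the names `bandwidth_situation` and `SITUATION_LABELS` are assumptions of this example; the description only states that the difference between the available bandwidth and the tile bit rate is minimized.

```python
SITUATION_LABELS = {1: "very good", 2: "good", 3: "bad", 4: "very bad"}


def bandwidth_situation(available: float, demand_by_level: dict[int, float]) -> int:
    """Return the level r in 1..4 whose demand best fits the available bandwidth."""
    # Gap between available bandwidth and demand, for levels that fit.
    gaps = {r: available - demand
            for r, demand in demand_by_level.items()
            if demand <= available}
    if not gaps:
        # Assumption: when nothing fits, fall back to the worst situation.
        return max(demand_by_level)
    # Smallest gap wins; ties go to the better (lower) level.
    return min(gaps, key=lambda r: (gaps[r], r))
```

For instance, with demands of 16, 12, 9, and 6 units for levels 1 to 4 and 10 units available, level 3 ("bad") leaves the smallest unused bandwidth.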

The weights of the items of region-of-interest (ROI) information, sound localization information (SLI), current viewport, and pole in the frequency domain for each bandwidth situation are as shown in Table 2 below, and these per-item weights are stored in advance in the processing unit 320 as look-up table values.

[Table 2]

FIG. 4 shows the operation of an algorithm that derives the bit rate R_i for each bandwidth situation for the image set R. Referring to FIG. 4, the bit rate R_i for each bandwidth situation for the image set R can be derived with this algorithm.

Here, R is the tile in which an object detected by YOLOv3 is located within the equirectangular projection (ERP) image; S is the image set in which the sound source is located, i.e., the tile containing the localized sound source in three-dimensional space; V is the tile corresponding to the current viewport, derived from the yaw angle (y) of the user orientation and the geometry values of each image; and P is the image set for which the pitch angle (q) of the user orientation is below 60 degrees or at or above 120 degrees.

(1) For the image set R divided into N tiles, if the rank r derived from the sum of the weights of the items of region-of-interest (ROI) information, sound localization information (SLI), current viewport, and pole in the frequency domain is 1, the processing unit 320 judges the bandwidth situation to be very good and sets the weight w_S of tiles corresponding to the SLI, the weight w_R of tiles corresponding to the ROI, the weight w_V of tiles corresponding to the current viewport, and the weight w_P of tiles corresponding to the pole all to the maximum weight value max w_X.

Accordingly, for the image set R, the bit rate R_i of the i-th tile is derived as the sum of the products of the maximum weight value max w_X, the i-th tile t_i, and the reference bit rate R_f, as expressed by Equation 2 below. Here, i is a positive integer from 1 to N.

[Equation 2]

$$R_i = \sum_{i=1}^{N} \max w_X \cdot t_i \cdot R_f$$

(2) If the rank r derived from the sum of the weights of the items of region-of-interest (ROI) information, sound localization information (SLI), current viewport, and pole in the frequency domain is 2, the processing unit 320 judges the bandwidth situation to be good. It sets the weight w_S of tiles corresponding to the SLI, the weight w_R of tiles corresponding to the ROI, and the weight w_V of tiles corresponding to the current viewport to the maximum weight value max w_X, and sets the weight w_P of tiles corresponding to the pole to the maximum weight value minus 1, max w_X - 1.

Accordingly, for the image set R, the bit rate R_i is derived as the sum of two products: the product of the sum of the ROI, SLI, and current-viewport tiles (S+R+V), the reference bit rate R_f, and the maximum weight value max w_X; and the product of the pole tile P, the reference bit rate R_f, and the maximum weight value minus 1, max w_X - 1. The segment bit rate R_i for the image set R satisfies Equation 3 below.

[Equation 3]

$$R_i = (S + R + V)\, R_f \max w_X + P\, R_f\, (\max w_X - 1)$$

(3) If the rank r derived from the sum of the weights of the items of region-of-interest (ROI) information, sound localization information (SLI), current viewport, and pole in the frequency domain is 3, the processing unit 320 judges the bandwidth situation to be bad. It sets the weight w_S of tiles corresponding to the SLI and the weight w_R of tiles corresponding to the ROI to the maximum weight value max w_X, and sets the weight w_V of tiles corresponding to the current viewport and the weight w_P of tiles corresponding to the pole to the maximum weight value minus 1, max w_X - 1.

The bit rate R_i for the image set R is then derived as the sum of two products: the product of the sum of the ROI and SLI tiles (S+R), the reference bit rate R_f, and the maximum weight value max w_X; and the product of the sum of the current-viewport tile V and the pole tile P (V+P), the reference bit rate R_f, and the maximum weight value minus 1, max w_X - 1. The bit rate R_i for the image set R satisfies Equation 4 below.

[Equation 4]

$$R_i = (S + R)\, R_f \max w_X + (V + P)\, R_f\, (\max w_X - 1)$$

(4) If the rank r derived from the sum of the weights of the items of region-of-interest (ROI) information, sound localization information (SLI), current viewport, and pole in the frequency domain is 4, the processing unit 320 judges the bandwidth situation to be very bad. It sets the weight w_S of tiles corresponding to the SLI and the weight w_R of tiles corresponding to the ROI to the maximum weight value max w_X, sets the weight w_V of tiles corresponding to the current viewport to the maximum weight value minus 1, max w_X - 1, and sets the weight w_P of tiles corresponding to the pole to the maximum weight value minus 2, max w_X - 2.

Accordingly, the bit rate R_i for the image set R is derived as the sum of three products: the product of the sum of the ROI and SLI tiles (S+R), the reference bit rate R_f, and the maximum weight value max w_X; the product of the current-viewport tile V, the reference bit rate R_f, and the maximum weight value minus 1, max w_X - 1; and the product of the pole tile P, the reference bit rate R_f, and the maximum weight value minus 2, max w_X - 2. The bit rate R_i for the image set R satisfies Equation 5 below.

[Equation 5]

$$R_i = (S + R)\, R_f \max w_X + V\, R_f\, (\max w_X - 1) + P\, R_f\, (\max w_X - 2)$$

The indices used in Equations 2 to 5 are summarized in Table 3.

[Table 3]
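The four cases described for Equations 2 to 5 can be gathered into one function, which makes the pattern of weight reductions easier to see. This is a sketch, not the algorithm of FIG. 4: it assumes S, R, V, and P are the numbers of tiles in each set, that the r = 1 case applies the maximum weight to all four sets, and the function name `set_bit_rate` and the sample values are hypothetical.

```python
def set_bit_rate(r: int, s: int, roi: int, v: int, p: int,
                 r_f: float, w_max: int) -> float:
    """Total bit rate for bandwidth situation r (1 = very good ... 4 = very bad).

    s, roi, v, p: tile counts of the SLI, ROI, viewport, and pole sets (assumed).
    r_f: reference bit rate; w_max: maximum weight value (max w_X).
    """
    if r == 1:   # every set keeps the maximum weight
        return (s + roi + v + p) * r_f * w_max
    if r == 2:   # pole weight reduced by 1
        return (s + roi + v) * r_f * w_max + p * r_f * (w_max - 1)
    if r == 3:   # viewport and pole weights reduced by 1
        return (s + roi) * r_f * w_max + (v + p) * r_f * (w_max - 1)
    if r == 4:   # viewport reduced by 1, pole reduced by 2
        return (s + roi) * r_f * w_max + v * r_f * (w_max - 1) + p * r_f * (w_max - 2)
    raise ValueError("r must be 1, 2, 3, or 4")
```

With one SLI tile, one ROI tile, two viewport tiles, two pole tiles, a reference bit rate of 1.0, and a maximum weight of 3, the total falls from 18 to 16, 14, and 12 as r goes from 1 to 4, so the SLI and ROI tiles keep their quality while the pole tiles are degraded first.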

The processing unit 320 then determines the segment bit rate using the weights set in consideration of the region-of-interest (ROI) information, the sound localization information (SLI), the current viewport, and the pole in the frequency domain, and requests the tile at the predicted user viewpoint position at the determined segment bit rate from the HTTP server 200 over the network.

Of the items used to determine the segment bit rate (ROI information, SLI, current viewport, and pole in the frequency domain), all are taken into account when the client's bandwidth situation is good, so the highest-quality segments can be requested at the resulting high bit rate. When the bandwidth situation is poor, segments of the tiles in the region can be requested at a bit rate determined by the items with high weights. Because segments can thus be requested adaptively to the bandwidth, segment quality is improved.

According to the bandwidth situation, the processing unit 320 assigns a weight to each of the sound localization information (SLI), the region-of-interest (ROI) information, the current viewport, and the pole, multiplies the reference bit rate R_f by the weight to determine the segment bit rate R_i that can be requested next, and sends the next segment request at the determined segment bit rate R_i to the HTTP server 200.
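Per tile, this step amounts to a table lookup followed by a multiplication. In the sketch below the weight reductions per bandwidth situation follow cases (1) to (4) of this description, while the maximum weight of 3, the rule that a tile carrying several labels takes the highest applicable weight, and the names `WEIGHT_OFFSETS` and `tile_bit_rates` are assumptions; the actual weights are those of Table 2. Each tile is assumed to carry at least one label.

```python
# How far each item's weight drops below the maximum, per situation r.
WEIGHT_OFFSETS = {
    1: {"SLI": 0, "ROI": 0, "VIEWPORT": 0, "POLE": 0},
    2: {"SLI": 0, "ROI": 0, "VIEWPORT": 0, "POLE": 1},
    3: {"SLI": 0, "ROI": 0, "VIEWPORT": 1, "POLE": 1},
    4: {"SLI": 0, "ROI": 0, "VIEWPORT": 1, "POLE": 2},
}


def tile_bit_rates(r: int, tile_items: dict[str, set[str]],
                   r_f: float, w_max: int = 3) -> dict[str, float]:
    """Bit rate to request for each tile: weight times reference bit rate."""
    offsets = WEIGHT_OFFSETS[r]
    rates = {}
    for tile, items in tile_items.items():
        # Assumption: a tile in several sets takes the highest weight.
        weight = max(w_max - offsets[item] for item in items)
        rates[tile] = weight * r_f
    return rates
```

In the very bad situation (r = 4) with a reference bit rate of 1000, a tile holding the sound source is still requested at 3000, a viewport-only tile at 2000, and a pole-only tile at 1000.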

The HTTP server 200 then delivers the requested segment tiles, using the bandwidth allocated according to the determined segment bit rate R_i, to the processing unit 320 via the network. The processing unit 320 processes the received segments and passes them to the VR engine 330, and the VR engine 330 renders the video and audio of the processed segments into the 360-degree space and plays them back.

That is, the VR engine 330 receives the segments, decodes them with the appropriate decoders, and renders the decoded result into media data that can be shown on a display for playback. In non-limiting examples, the VR engine may overlay private advertisement information at a time and position synchronized with the display of the associated media, and/or provide picture-in-picture (PIP) data of streamed broadband media content positioned at a corner of the display and at a time synchronized with the associated portion of the displayed broadcast media data.

Accordingly, this embodiment extends the existing video MPD to transmit the sound localization information (SLI) of the audio MPD, assigns differentiated weights to the items of region-of-interest (ROI) information, SLI, current viewport, and pole according to the bandwidth situation, multiplies the weights by the reference bit rate to determine the bit rate of the next segment to be received, and receives segments at the determined segment bit rate. By using both visual and auditory perception, the accuracy of predicting the user's viewpoint position can be improved, and when the prediction fails, the viewpoint position can still be predicted by taking the user's current viewpoint into account. The quality of experience (QoE) with respect to playback speed and playback quality can therefore be ensured.

According to another embodiment of the present invention, a method for predicting a user's viewpoint using sound source location information in 360-degree VR content comprises: (a) in a content production unit, spatially dividing a single panoramic video in equirectangular projection (ERP) format into a plurality of tiles, compression-coding each tile, and delivering the result to an HTTP server as media data; (b) in the HTTP server, dividing the received media data into segments of a predetermined time unit, generating a video MPD and an audio MPD, and transmitting the MPD comprising the generated video MPD and audio MPD, together with the segment tiles, to a client device over the reference bandwidth of the network; (c) in the client device, predicting the user's viewpoint position based on the received segment tiles, the video MPD, and the audio MPD, adaptively determining the bit rate of the next segment to be received based on the bandwidth situation, and delivering the bandwidth allocated according to the determined segment bit rate to the HTTP server; (d) in the HTTP server, transmitting the next segment to the client device using the allocated bandwidth received from the client device; and (e) in a VR engine, decoding the segment tiles received via the client device to obtain video and audio, and rendering the obtained video and audio into the 360-degree space for playback.

The weights in step (c) are stored in advance, matched to each of the items of region-of-interest (ROI) information, sound localization information (SLI), current viewport, and pole for each bandwidth situation.

The bit rate R_i of the i-th tile is derived as the sum of the products of the maximum weight value max w_X, the i-th tile t_i, and the reference bit rate R_f.

The segment bit rate R_i is derived as the sum of two products: the product of the sum of the ROI, SLI, and current-viewport tiles (S+R+V), the reference bit rate R_f, and the maximum weight value max w_X; and the product of the pole tile P, the reference bit rate R_f, and the maximum weight value minus 1, max w_X - 1.

The segment bit rate $R_i$ for the video set R is derived as the sum of two products: (i) the sum $(S+R)$ of the tiles of the sound source localization information (SLI) and the region-of-interest information (ROI), multiplied by the reference bit rate $R_f$ and the maximum weight value $\max w_X$; and (ii) the sum $(V+P)$ of the tile $V$ of the current viewport and the tile $P$ of the pole in the frequency domain, multiplied by the reference bit rate $R_f$ and the maximum weight value minus 1, $\max w_X - 1$. That is, $R_i = (S+R)\,R_f\,\max w_X + (V+P)\,R_f\,(\max w_X - 1)$.

The bit rate $R_i$ for the video set R is derived as the sum of three products: (i) the sum $(S+R)$ of the tiles of the sound source localization information (SLI) and the region-of-interest information (ROI), multiplied by the reference bit rate $R_f$ and the maximum weight value $\max w_X$; (ii) the tile $V$ of the current viewport, multiplied by the reference bit rate $R_f$ and the maximum weight value minus 1, $\max w_X - 1$; and (iii) the tile $P$ of the pole in the frequency domain, multiplied by the reference bit rate $R_f$ and the maximum weight value minus 2, $\max w_X - 2$. That is, $R_i = (S+R)\,R_f\,\max w_X + V\,R_f\,(\max w_X - 1) + P\,R_f\,(\max w_X - 2)$.
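The four derivations above can be sketched as a single function. This is an illustrative reading, not the patent's implementation: it assumes that $S$, $R$, $V$, and $P$ are tile counts for the SLI, ROI, current viewport, and pole, that the four derivations correspond to bandwidth ranks 1 to 4 in order (as the claims indicate), and that the rank 1 sum runs over the tiles of those four items. The function name, the reference bit rate of 500 kbps, and the maximum weight of 3 are hypothetical.

```python
def segment_bit_rate(rank: int, s: int, r: int, v: int, p: int,
                     ref_bit_rate: int, max_weight: int) -> int:
    """Segment bit rate for a bandwidth rank, with s, r, v, p taken as tile counts."""
    if rank == 1:
        # Assumed reading of rank 1: every tile gets the maximum weight.
        return (s + r + v + p) * ref_bit_rate * max_weight
    if rank == 2:
        # Pole tiles drop to the maximum weight minus 1.
        return (s + r + v) * ref_bit_rate * max_weight + p * ref_bit_rate * (max_weight - 1)
    if rank == 3:
        # Viewport and pole tiles both drop to the maximum weight minus 1.
        return (s + r) * ref_bit_rate * max_weight + (v + p) * ref_bit_rate * (max_weight - 1)
    if rank == 4:
        # Viewport tiles drop by 1 and pole tiles by 2.
        return ((s + r) * ref_bit_rate * max_weight
                + v * ref_bit_rate * (max_weight - 1)
                + p * ref_bit_rate * (max_weight - 2))
    raise ValueError("rank must be 1, 2, 3, or 4")


# Assumed example: 2 SLI tiles, 3 ROI tiles, 4 viewport tiles, 2 pole tiles,
# a reference bit rate of 500 (kbps), and a maximum weight of 3.
rates = {rank: segment_bit_rate(rank, 2, 3, 4, 2, ref_bit_rate=500, max_weight=3)
         for rank in (1, 2, 3, 4)}
# rates == {1: 16500, 2: 15500, 3: 13500, 4: 12500}
```

With these assumed inputs the requested bit rate falls as the rank worsens, while the SLI and ROI tiles keep the maximum weight at every rank.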

Each step of the above method for predicting a user's viewpoint using sound source location information in 360-degree VR content is a function performed by the content production unit 100, the HTTP server 200, and the client device 300 with its MPD parser 310, processing unit 320, and VR engine 330, all described above, so a detailed description is omitted.

Accordingly, the present embodiment extends the existing video MPD to transmit the sound source localization information (SLI) of the audio MPD, assigns differential weights to the items region-of-interest information (ROI), sound source localization information (SLI), current viewport, and pole according to the bandwidth situation, multiplies them by a reference bit rate to determine the bit rate of the next segment to be received, and receives the segment using the bandwidth allocated at the determined segment bit rate. Using both visual and auditory perception in this way improves the accuracy of user viewpoint position prediction, and if the prediction fails, the user's viewpoint position can still be predicted by taking the current viewpoint into account, so the quality of experience (QoE) with respect to playback speed and playback quality can be guaranteed.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the claims below and by their equivalents.

The invention extends the existing video MPD to transmit the sound source localization information (SLI) of the audio MPD, assigns differential weights to the items region-of-interest information (ROI), sound source localization information (SLI), current viewport, and pole according to the bandwidth situation, multiplies them by a reference bit rate to determine the bit rate of the next segment to be received, and receives the segment at the determined segment bit rate. Using both visual and auditory perception improves the accuracy of user viewpoint position prediction, and if the prediction fails, the user's viewpoint position can still be predicted by taking the current viewpoint into account, so the quality of experience (QoE) with respect to playback speed and playback quality can be guaranteed. The apparatus and method for predicting a user's viewpoint using sound source location information in 360-degree VR content can therefore bring significant progress in accuracy and reliability of operation and in performance efficiency. Because systems that provide virtual reality services have sufficient prospects for marketing or sale, and the invention can clearly be carried out in practice, the invention has industrial applicability.

Claims

A system for predicting a user's viewpoint using sound source location information in 360-degree VR content, comprising:
a content production unit that divides a panoramic video in equirectangular projection (ERP) format into a plurality of tiles, compression-codes each tile, and delivers it as media data to a Hypertext Transfer Protocol (HTTP) server;
an HTTP server that divides the received media data into segments of a predetermined time unit, generates a video media presentation description (MPD) and an audio MPD, and transmits the MPD containing the generated video MPD and audio MPD, together with the segment tiles, to a client device over the reference bandwidth of the network; and
a client device that predicts the user's viewpoint based on the received segment tiles, video MPD, and audio MPD, adaptively determines the bit rate of the next segment to be received according to the bandwidth situation, and delivers the bandwidth allocated at the determined segment bit rate to the HTTP server,
wherein the HTTP server is configured to transmit the next segment to the client device using the allocated bandwidth provided by the client device, and
wherein the client device is configured to decode the received segment tiles in a VR engine to obtain video and audio, and to render and play the obtained video and audio in a 360-degree space.

The system of claim 1, wherein the audio MPD
includes sound source localization information (SLI), and the description set of the audio MPD includes an SLI description set (SLID: Sound Localization Information Description), and
the SLI description set (SLID) includes a sound source localization identifier (SLI_id), the position of the sound source in the 360-degree space (x-, y-, and z-axis values), and a panning model.
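For illustration only, the three SLID fields named above could be held in a small record and mapped to the tile of an equirectangular frame that contains the sound source. Everything beyond the three field names is an assumption of this sketch: the axis convention (x forward, y left, z up), the 6 x 4 tile grid, the example panning model string, and the names `SoundLocalizationInfo` and `sound_source_tile`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundLocalizationInfo:
    """Hypothetical record mirroring the SLID fields listed in the claim."""
    sli_id: str
    x: float
    y: float
    z: float
    panning_model: str


def sound_source_tile(sli: SoundLocalizationInfo, cols: int, rows: int) -> tuple[int, int]:
    """Map a source position to (row, col) in an equirectangular tile grid.

    Assumed axes: x forward (frame centre), y left, z up.
    """
    yaw = math.atan2(sli.y, sli.x)                          # -pi..pi, positive to the left
    pitch = math.atan2(sli.z, math.hypot(sli.x, sli.y))     # -pi/2..pi/2, positive upward
    u = 0.5 - yaw / (2 * math.pi)                           # 0 at left edge, 1 at right edge
    v = 0.5 - pitch / math.pi                               # 0 at top edge, 1 at bottom edge
    col = min(int(u * cols), cols - 1)                      # clamp the u == 1 edge case
    row = min(int(v * rows), rows - 1)                      # clamp the v == 1 edge case
    return row, col


# Assumed example: a source straight ahead of the listener.
front = SoundLocalizationInfo(sli_id="sli-1", x=1.0, y=0.0, z=0.0, panning_model="VBAP")
front_tile = sound_source_tile(front, cols=6, rows=4)
```

A client could use such a mapping to decide which tiles count as SLI tiles when weights are assigned.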

The system of claim 2, wherein the client device comprises:
an MPD parser that parses the received segment tiles and MPD;
a processing unit that predicts the user's viewpoint based on the video MPD and audio MPD of the parsed MPD, adaptively determines the bit rate of the next segment to be received, and delivers the bandwidth allocated at the determined segment bit rate to the HTTP server; and
a VR engine that decodes the segments received via the processing unit to obtain audio and video, and plays the obtained audio and video by three-dimensional rendering in a 360-degree space.

The system of claim 3, wherein the processing unit is configured to determine the bit rate of the next segment to be received based on the tile weight of each of at least one of the items region-of-interest information (ROI), sound source localization information (SLI), current viewport, and pole, on the reference bit rate, on the bandwidth-situation rank, and on the tiles located in the ROI, the SLI, the current viewport, and the pole,
and to allocate bandwidth at the determined segment bit rate and deliver it to the HTTP server.

The system, wherein the weights of the processing unit
are stored in advance for each bandwidth situation, matched to each of the items region-of-interest information (ROI), sound source localization information (SLI), current viewport, and pole.

The system of claim 4, wherein the processing unit is configured to:
determine that the bandwidth situation is very good if the rank r of the sum of the per-item weights is 1;
set the weight $w_S$ of the tiles corresponding to the SLI, the weight $w_R$ of the tiles corresponding to the ROI, the weight $w_V$ of the tiles corresponding to the current viewport, and the weight $w_P$ of the tiles corresponding to the pole each to the maximum weight value $\max w_X$; and
derive the bit rate $R_i$ of the i-th tile as the sum of the products of the maximum weight value $w_X$, the i-th tile $t_i$, and the reference bit rate $R_f$.

The system of claim 4, wherein the processing unit is configured to:
determine that the bandwidth situation is good if the rank r of the sum of the per-item weights is 2;
set the weight $w_S$ of the tiles corresponding to the SLI, the weight $w_R$ of the tiles corresponding to the ROI, and the weight $w_V$ of the tiles corresponding to the current viewport to the maximum weight value $\max w_X$, and set the weight $w_P$ of the tiles corresponding to the pole to the maximum weight value minus 1, $\max w_X - 1$; and
derive the segment bit rate $R_i$ as the sum of (i) the product of the sum $(S+R+V)$ of the tiles of the SLI, the ROI, and the current viewport, the reference bit rate $R_f$, and the maximum weight value $\max w_X$, and (ii) the product of the tile $P$ of the pole in the frequency domain, the reference bit rate $R_f$, and $\max w_X - 1$.

The system of claim 4, wherein the processing unit is configured to:
determine that the bandwidth situation is bad if the rank r of the sum of the per-item weights is 3;
set the weight $w_S$ of the tiles corresponding to the SLI and the weight $w_R$ of the tiles corresponding to the ROI to the maximum weight value $\max w_X$, and set the weight $w_V$ of the tiles corresponding to the current viewport and the weight $w_P$ of the tiles corresponding to the pole to the maximum weight value minus 1, $\max w_X - 1$; and
derive the segment bit rate $R_i$ for the video set R as the sum of (i) the product of the sum $(S+R)$ of the tiles of the SLI and the ROI, the reference bit rate $R_f$, and the maximum weight value $\max w_X$, and (ii) the product of the sum $(V+P)$ of the tile $V$ of the current viewport and the tile $P$ of the pole in the frequency domain, the reference bit rate $R_f$, and $\max w_X - 1$.

The system of claim 4, wherein the processing unit is configured to:
determine that the bandwidth situation is very bad if the rank r of the sum of the per-item weights is 4;
set the weight $w_S$ of the tiles corresponding to the SLI and the weight $w_R$ of the tiles corresponding to the ROI to the maximum weight value $\max w_X$, set the weight $w_V$ of the tiles corresponding to the current viewport to the maximum weight value minus 1, $\max w_X - 1$, and set the weight $w_P$ of the tiles corresponding to the pole to the maximum weight value minus 2, $\max w_X - 2$; and
derive the bit rate $R_i$ for the video set R as the sum of (i) the product of the sum $(S+R)$ of the tiles of the SLI and the ROI, the reference bit rate $R_f$, and the maximum weight value $\max w_X$, (ii) the product of the tile $V$ of the current viewport, the reference bit rate $R_f$, and $\max w_X - 1$, and (iii) the product of the tile $P$ of the pole in the frequency domain, the reference bit rate $R_f$, and $\max w_X - 2$.

A method for predicting a user's viewpoint using sound source location information in 360-degree VR content, comprising:
(a) dividing, in a content production unit, a panoramic video in equirectangular projection (ERP) format into a plurality of tiles, compression-coding each tile, and delivering it as media data to an HTTP server;
(b) dividing, in the HTTP server, the received media data into segments of a predetermined time unit, generating a video MPD and an audio MPD, and transmitting the MPD containing the generated video MPD and audio MPD, together with the segment tiles, to a client device over the reference bandwidth of the network;
(c) predicting, in the client device, the user's viewpoint based on the received segment tiles, video MPD, and audio MPD, adaptively determining the bit rate of the next segment to be received according to the bandwidth situation, and delivering the bandwidth allocated at the determined segment bit rate to the HTTP server;
(d) transmitting, from the HTTP server to the client device, the next segment using the allocated bandwidth received from the client device; and
(e) decoding, in a VR engine, the segment tiles received via the client device to obtain video and audio, and rendering and playing the obtained video and audio in a 360-degree space.

11. The method of claim 10, wherein the audio MPD
includes sound source localization information (SLI), and the description set of the audio MPD includes an SLI description set (SLID), and
the SLI description set (SLID) includes a sound source localization identifier (SLI_id), the position of the sound source in the 360-degree space (x-, y-, and z-axis values), and a panning model.

The method of claim 10, wherein step (c)
determines the bit rate of the next segment to be received based on the tile weight of each of at least one of the items region-of-interest information (ROI), sound source localization information (SLI), current viewport, and pole, on the reference bit rate, on the bandwidth-situation rank, and on the tiles located in the ROI, the SLI, the current viewport, and the pole,
and allocates bandwidth at the determined segment bit rate and delivers it to the HTTP server.

The method of claim 12, wherein the weights of step (c)
are stored in advance for each bandwidth situation, matched to each of the items region-of-interest information (ROI), sound source localization information (SLI), current viewport, and pole.

The method of claim 12, wherein step (c) comprises:
determining that the bandwidth situation is very good if the rank r of the sum of the per-item weights is 1;
setting the weight $w_S$ of the tiles corresponding to the SLI, the weight $w_R$ of the tiles corresponding to the ROI, the weight $w_V$ of the tiles corresponding to the current viewport, and the weight $w_P$ of the tiles corresponding to the pole each to the maximum weight value $\max w_X$; and
deriving the bit rate $R_i$ of the i-th tile as the sum of the products of the maximum weight value $w_X$, the i-th tile $t_i$, and the reference bit rate $R_f$.

The method of claim 12, wherein step (c) comprises:
determining that the bandwidth situation is good if the rank r of the sum of the per-item weights is 2;
setting the weight $w_S$ of the tiles corresponding to the SLI, the weight $w_R$ of the tiles corresponding to the ROI, and the weight $w_V$ of the tiles corresponding to the current viewport to the maximum weight value $\max w_X$, and setting the weight $w_P$ of the tiles corresponding to the pole to the maximum weight value minus 1, $\max w_X - 1$; and
deriving the segment bit rate $R_i$ as the sum of (i) the product of the sum $(S+R+V)$ of the tiles of the SLI, the ROI, and the current viewport, the reference bit rate $R_f$, and the maximum weight value $\max w_X$, and (ii) the product of the tile $P$ of the pole in the frequency domain, the reference bit rate $R_f$, and $\max w_X - 1$.

The method of claim 12, wherein step (c) comprises:
determining that the bandwidth situation is bad if the rank r of the sum of the per-item weights is 3;
setting the weight $w_S$ of the tiles corresponding to the SLI and the weight $w_R$ of the tiles corresponding to the ROI to the maximum weight value $\max w_X$, and setting the weight $w_V$ of the tiles corresponding to the current viewport and the weight $w_P$ of the tiles corresponding to the pole to the maximum weight value minus 1, $\max w_X - 1$; and
deriving the segment bit rate $R_i$ for the video set R as the sum of (i) the product of the sum $(S+R)$ of the tiles of the SLI and the ROI, the reference bit rate $R_f$, and the maximum weight value $\max w_X$, and (ii) the product of the sum $(V+P)$ of the tile $V$ of the current viewport and the tile $P$ of the pole in the frequency domain, the reference bit rate $R_f$, and $\max w_X - 1$.

The method of claim 12, wherein step (c) comprises:
determining that the bandwidth situation is very bad if the rank r of the sum of the per-item weights is 4;
setting the weight $w_S$ of the tiles corresponding to the SLI and the weight $w_R$ of the tiles corresponding to the ROI to the maximum weight value $\max w_X$, setting the weight $w_V$ of the tiles corresponding to the current viewport to the maximum weight value minus 1, $\max w_X - 1$, and setting the weight $w_P$ of the tiles corresponding to the pole to the maximum weight value minus 2, $\max w_X - 2$; and
deriving the bit rate $R_i$ for the video set R as the sum of (i) the product of the sum $(S+R)$ of the tiles of the SLI and the ROI, the reference bit rate $R_f$, and the maximum weight value $\max w_X$, (ii) the product of the tile $V$ of the current viewport, the reference bit rate $R_f$, and $\max w_X - 1$, and (iii) the product of the tile $P$ of the pole in the frequency domain, the reference bit rate $R_f$, and $\max w_X - 2$.