KR20210115710A

KR20210115710A - Image processing method, video playback method and apparatuses thereof

Info

Publication number: KR20210115710A
Application number: KR1020200031880A
Authority: KR
Inventors: 김영휘; 이정진; 한성규; 정승화
Original assignee: 주식회사 카이
Priority date: 2020-03-16
Filing date: 2020-03-16
Publication date: 2021-09-27
Also published as: KR102391615B1

Abstract

The present invention relates to an image processing method, an image reproducing method, and apparatuses thereof. According to an embodiment, an image processing method and apparatus receive a first image including a plurality of frames, acquire importance information indicating the importance of at least one region included in the plurality of frames, determine, based on the importance information, at least one mask indicating a scaling degree corresponding to at least one region of the first image, generate a second image by encoding the first image according to the at least one mask, and output the at least one mask and the second image. According to an embodiment, the original streaming image is streamed at a resolution lower by a factor of two or more, and an image reproducing apparatus preserves the original image through artificial-neural-network-based super-resolution restoration, so that a high-quality video streaming service can be provided in a network environment with low transmission capacity.

Description

Image processing method, image playback method, and apparatuses thereof

The following embodiments relate to an image processing method, an image reproducing method, and apparatuses thereof.

To provide streaming, methods based on the user's viewpoint, methods based on content, and methods based on neural networks may be used. A viewpoint-based method encodes in high quality, and streams, only the region the user is looking at, that is, the region corresponding to the user's viewpoint. With a viewpoint-based method, latency in the image-quality change may occur when the user suddenly changes viewpoint. In addition, when one piece of content is multi-encoded differently for each viewpoint, the image size and the computational load may become excessive. Furthermore, a neural-network-based method that uses a single network does not always guarantee high quality. When a content-based neural network is used, modeling takes roughly tens of hours, so the approach is not suitable for live streaming, and a high-specification GPU (Graphics Processing Unit) is required for high-definition streaming.

According to an embodiment, an image processing apparatus streams the original streaming image at a resolution lower by a factor of two or more, and an image reproducing apparatus preserves the original image through artificial-neural-network-based super-resolution restoration, so that a high-quality video streaming service can be provided in a network environment with low transmission capacity.

According to an embodiment, the original resolution is maintained as far as possible for the important region of each frame of the original image, and the remaining regions are down-sampled, which reduces the size of the image for a live streaming service.

According to an embodiment, the streaming server streams at a capacity lower by a factor of four or more, which reduces the cost of using an HLS (HTTP Live Streaming) server.

According to an embodiment, image data is transmitted and received at low capacity through mask-based scaling, which reduces data transmission cost.

According to an embodiment, regions of the original image that are not restored well by super-resolution (SR) technology, which restores a low-resolution image to a high-resolution image using an artificial neural network, keep the original resolution as far as possible, while the remaining regions that are restored well are downscaled to a lower resolution before transmission. This preserves the image quality of important regions of the original image and of regions where super-resolution quality is poor at restoration.

According to an embodiment, an image processing method includes: receiving a first image including a plurality of frames; obtaining importance information indicating the importance of at least one region included in the plurality of frames; determining, based on the importance information, at least one mask indicating a scaling degree corresponding to at least one region of the first image; generating a second image by encoding the first image according to the at least one mask; and outputting the at least one mask and the second image.

Determining the at least one mask may include determining, based on the importance information, the at least one mask and mask information corresponding to the at least one mask such that the resolution of at least one first region of the first image is maintained and the resolution of the remaining regions other than the first region is down-sampled.

Obtaining the importance information may include at least one of: receiving, from a producer terminal of the first image, first importance information set for at least one region of each frame of the first image; and obtaining second importance information determined for at least one region of each frame of the first image, based on an accuracy map according to a result of restoring at least one region of each frame of the second image by a pre-trained neural network.

Obtaining the second importance information may include: generating the accuracy map based on a difference between the result of restoring at least one region of each frame of the second image by the pre-trained neural network and a ground-truth image corresponding to that region of each frame of the second image; and determining, based on the accuracy map, second importance information corresponding to at least one region of each frame of the first image.
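As an illustration only, the accuracy-map step can be sketched in Python. The sketch assumes grayscale frames stored as lists of rows with values in 0..255, a square grid cell, and mean absolute error as the difference measure. The name `accuracy_map` and all of these choices are assumptions made for the example; the description only states that the map is based on the difference between the restored result and the ground-truth image.

```python
def accuracy_map(restored, reference, cell):
    """Return a per-cell accuracy grid in [0, 1] for two equal-sized frames.

    1.0 means the cell was restored exactly; 0.0 means the largest possible
    mean absolute error for 8-bit values (an illustrative measure only).
    """
    height, width = len(reference), len(reference[0])
    grid = []
    for top in range(0, height, cell):
        row = []
        for left in range(0, width, cell):
            # Absolute pixel differences inside this grid cell.
            errors = [
                abs(restored[y][x] - reference[y][x])
                for y in range(top, min(top + cell, height))
                for x in range(left, min(left + cell, width))
            ]
            row.append(1.0 - (sum(errors) / len(errors)) / 255.0)
        grid.append(row)
    return grid
```

Cells with a low value in this grid are the ones the restoration network handles poorly, which is the property the second importance information is meant to capture.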

Determining the at least one mask may include: sorting the second importance information; and determining, based on the sorted second importance information, at least one mask corresponding to at least one region of each frame of the first image.

Determining the at least one mask based on the sorted second importance information may include at least one of: determining, as a first mask, the mask corresponding to a first region of the first image such that, according to the sorted second importance information, a first resolution of the first region is set to the original resolution of the first image; determining, as a second mask, the mask corresponding to a second region other than the first region such that, according to the sorted second importance information, the second region is down-sampled to a second resolution lower than the first resolution; and determining, as a third mask, the mask corresponding to a remaining third region other than the first region and the second region such that, according to the sorted second importance information, the third region is down-sampled to a third resolution lower than the second resolution.
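The sort-then-assign step can be sketched as follows. This is a minimal sketch, not the claimed method: it assumes importance scores keyed by grid cell, and the cut-off parameters `full_fraction` and `half_fraction`, the function name `assign_scales`, and the concrete scale values 1.0, 0.5, and 0.25 are hypothetical choices for illustration.

```python
def assign_scales(importance, full_fraction=0.25, half_fraction=0.25):
    """Map each cell to a scaling degree after sorting by importance.

    `importance` is a dict {cell: score}. The most important cells keep the
    original resolution (1.0), the next group is down-sampled to 0.5, and
    the rest to 0.25. The fractions are illustrative assumptions.
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    n_full = round(len(ranked) * full_fraction)
    n_half = round(len(ranked) * half_fraction)
    scales = {}
    for rank, cell in enumerate(ranked):
        if rank < n_full:
            scales[cell] = 1.0    # first mask: original resolution
        elif rank < n_full + n_half:
            scales[cell] = 0.5    # second mask: lower resolution
        else:
            scales[cell] = 0.25   # third mask: lowest resolution
    return scales
```

For four cells with scores 0.9, 0.5, 0.3, and 0.1 and the default fractions, the highest-scoring cell keeps full resolution, the next is halved, and the remaining two are quartered.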

The neural network may generate a reconstructed image corresponding to the second image based on a combined image in which the second image is concatenated with at least one of a first mask, a second mask, and a third mask corresponding to the second image.
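One plausible reading of the "combined image" is channel-wise concatenation: each mask becomes an extra channel appended to every pixel of the second image. The sketch below shows only that data layout, under the assumption that the image is a nested list of shape height x width x channels and each mask is height x width. The helper name `concat_channels` is hypothetical, and the network itself is not modelled.

```python
def concat_channels(image, masks):
    """Append one channel per mask to every pixel of `image`.

    `image`: list of rows, each row a list of pixels, each pixel a list of
    channel values. `masks`: list of 2D grids with the same height and width.
    Returns a new nested list with len(pixel) + len(masks) channels.
    """
    return [
        [pixel + [mask[y][x] for mask in masks] for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]
```

With an RGB image and three masks, the network input would then have six channels per pixel.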

Determining the at least one mask may include: determining at least one mask for a first region of the first image according to the first importance information, regardless of the second importance information; and determining at least one mask for the remaining regions other than the first region according to the second importance information.

Determining the at least one mask may include determining, based on the importance information, at least one mask for at least one region included in the plurality of frames of the first image such that the capacity of the second image remains equal to the capacity obtained by downscaling the first image by a predetermined ratio.
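The capacity constraint can be made concrete with a small worked example. The sketch assumes, purely for illustration, that capacity is proportional to pixel count, that the scaling degrees 1, 1/2, and 1/4 apply per axis (so a cell costs 1, 1/4, or 1/16 of its original pixels), and that the predetermined ratio is a uniform 1/2 per axis (a budget of 1/4 of the original pixels). Under those assumptions, every cell promoted to full resolution must be paid for by demoting four cells from 1/2 to 1/4, because 1 - 1/4 = 4 x (1/4 - 1/16). The function names are hypothetical.

```python
from fractions import Fraction


def split_for_budget(n_cells, n_full):
    """Return (n_full, n_half, n_quarter) that matches a uniform 1/2 downscale.

    Each full-resolution cell is offset by four quarter-resolution cells;
    all other cells stay at half resolution.
    """
    n_quarter = 4 * n_full
    n_half = n_cells - n_full - n_quarter
    if n_half < 0:
        raise ValueError("too many full-resolution cells for this budget")
    return n_full, n_half, n_quarter


def relative_size(n_full, n_half, n_quarter):
    """Pixel cost in units of one original-resolution cell."""
    return Fraction(n_full) + Fraction(n_half, 4) + Fraction(n_quarter, 16)
```

For example, with 100 cells and 10 kept at full resolution, 50 cells stay at 1/2 and 40 drop to 1/4, and the total cost is 10 + 50/4 + 40/16 = 25 cells, the same as downscaling all 100 cells by 1/2 per axis.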

Determining the at least one mask may include either: determining the at least one mask for each frame of the first image based on the importance information; or determining the at least one mask for each chunk including a plurality of frames of the first image based on the importance information.

Obtaining the importance information may include: dividing each of the plurality of frames into a plurality of grids corresponding to at least one region of each of the plurality of frames; and obtaining importance information of the regions corresponding to the plurality of grids.
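Dividing a frame into grids can be sketched as below. The fixed square cell size and the function name `grid_cells` are assumptions for the example; the description does not specify the grid geometry.

```python
def grid_cells(width, height, cell):
    """List (left, top, right, bottom) boxes that tile a width x height frame.

    Boxes are listed row by row; cells on the right and bottom edges are
    clipped to the frame when the size is not a multiple of `cell`.
    """
    return [
        (left, top, min(left + cell, width), min(top + cell, height))
        for top in range(0, height, cell)
        for left in range(0, width, cell)
    ]
```

Importance information would then be obtained once per box rather than once per pixel.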

The first image may be divided into at least one of a first region, a second region, and a third region based on the at least one mask, and generating the second image may include: encoding the first region based on a first mask corresponding to the first region; encoding the second region based on a second mask corresponding to the second region; and encoding the third region based on a third mask corresponding to the third region.
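A minimal sketch of per-region handling follows. It replaces real video encoding with simple average pooling, which is a stand-in chosen for the example and not something the description specifies. The per-axis factors 1, 2, and 4 mirror the example scaling degrees of 1, 1/2, and 1/4, and the names `downsample_block`, `FACTOR_BY_MASK`, and `encode_region` are hypothetical.

```python
def downsample_block(block, factor):
    """Average-pool a 2D block (list of rows) by an integer factor per axis."""
    if factor == 1:
        return [list(row) for row in block]
    out = []
    for top in range(0, len(block), factor):
        out_row = []
        for left in range(0, len(block[0]), factor):
            window = [
                block[y][x]
                for y in range(top, min(top + factor, len(block)))
                for x in range(left, min(left + factor, len(block[0])))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


# Assumed mapping from mask to per-axis down-sampling factor.
FACTOR_BY_MASK = {"first": 1, "second": 2, "third": 4}


def encode_region(block, mask_name):
    """Scale one region according to the mask that covers it."""
    return downsample_block(block, FACTOR_BY_MASK[mask_name])
```

A region under the first mask passes through unchanged, while regions under the second and third masks shrink to 1/4 and 1/16 of their pixel count.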

The first image may include live streaming content.

According to an embodiment, an image reproducing method includes: obtaining image information including an image having a plurality of regions with a plurality of resolutions and at least one mask corresponding to the plurality of regions of the image; restoring the image using a pre-trained neural network based on the image information; and reproducing the restored image.

Restoring the image may include: extracting the at least one mask from the image information; and restoring the image using the neural network based on the extracted at least one mask.

The neural network may be trained to generate a reconstructed image corresponding to the image based on the image and the at least one mask corresponding to the image.

According to an embodiment, an image processing apparatus includes: a communication interface that receives a first image including a plurality of frames; and a processor that obtains importance information indicating the importance of at least one region included in the plurality of frames, determines, based on the importance information, at least one mask indicating a scaling degree corresponding to at least one region of the first image, and generates a second image by encoding the first image according to the at least one mask, wherein the communication interface outputs the mask and the second image.

According to an embodiment, an image reproducing apparatus includes: a communication interface that obtains image information including an image having a plurality of regions with a plurality of resolutions and at least one mask corresponding to the plurality of regions of the image; a processor that restores the image using a pre-trained neural network based on the image information; and a display that reproduces the restored image.

According to one aspect, an image processing apparatus streams the original streaming image at a resolution lower by a factor of two or more, and an image reproducing apparatus preserves the original image through artificial-neural-network-based super-resolution restoration, so that a high-quality video streaming service can be provided in a network environment with low transmission capacity.

According to one aspect, the original resolution is maintained as far as possible for the important region of each frame of the original image, and the remaining regions are down-sampled, which reduces the size of the image for a live streaming service.

According to one aspect, the streaming server streams at a capacity lower by a factor of four or more, which reduces the cost of using an HLS (HTTP Live Streaming) server.

According to one aspect, image data is transmitted and received at low capacity through mask-based scaling, which reduces data transmission cost.

According to one aspect, regions of the original image that are not restored well by super-resolution (SR) technology, which restores a low-resolution image to a high-resolution image using an artificial neural network, keep the original resolution as far as possible, while the remaining regions that are restored well are downscaled to a lower resolution before transmission. This preserves the image quality of important regions of the original image and of regions where super-resolution quality is poor at restoration.

FIG. 1 is a diagram for explaining the configuration and operation of a video streaming system according to an embodiment.
FIG. 2 is a flowchart illustrating an image processing method according to an embodiment.
FIG. 3 is a diagram for explaining a method of generating a rescaled image based on a second image according to an embodiment.
FIG. 4 is a diagram illustrating an example of a second image according to an embodiment.
FIG. 5 is a diagram for explaining the configuration and operation of an image processing apparatus according to an embodiment.
FIG. 6 is a diagram for explaining the structure and operation of the restoration neural network shown in FIG. 5.
FIG. 7 is a flowchart illustrating an image reproducing method according to an embodiment.
FIG. 8 is a block diagram of an image processing apparatus according to an embodiment.
FIG. 9 is a block diagram of an image reproducing apparatus according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, various changes may be made to the embodiments, so the scope of the patent application is not restricted or limited by these embodiments. All modifications, equivalents, and substitutes for the embodiments should be understood to fall within the scope of the rights.

The terms used in the embodiments are used for description only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and should be understood not to preclude the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

In addition, in the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. In describing the embodiments, a detailed description of related known technology is omitted where it is determined that it would unnecessarily obscure the gist of the embodiments.

FIG. 1 is a diagram for explaining the configuration and operation of a video streaming system according to an embodiment. Referring to FIG. 1, a video streaming system 100 according to an embodiment is shown, which includes an apparatus for processing an image (hereinafter, 'image processing apparatus') 110, a streaming server 130, and an apparatus for reproducing an image (hereinafter, 'image reproducing apparatus') 150.

The image processing apparatus 110 according to an embodiment receives an original image 111 from a broadcast transmitter or a producer terminal. The original image 111 may be, for example, live streaming content or a 360-degree content image transmitted through various streaming protocols. A streaming protocol is a protocol used to stream audio, video, and other data over the Internet, and may include, for example, the Real Time Messaging Protocol (RTMP) or HLS (HTTP Live Streaming). The original image 111 may be, for example, an image of size width (w) x height (h). Here, the width (w) corresponds to the extent occupied by all columns in the width direction, and the height (h) corresponds to the extent occupied by all rows in the height direction. In this specification, for convenience of description, the original image 111 may be referred to as a 'first image'. Hereinafter, 'original image' and 'first image' may be understood to have the same meaning.

The image processing apparatus 110 includes a real-time downscaling module 120. The image processing apparatus 110 may obtain importance information for the original image through a neural network included in the real-time downscaling module 120, and may generate a scaled image, that is, a second image, by encoding the first image according to a mask that is determined based on the importance information and corresponds to at least one region of the original image 111. For convenience of description, the operation is described below for the case of processing a live image 'in real time', but the embodiments described below apply in substantially the same way to processing a VOD image. In that case, the real-time downscaling module 120 may be referred to simply as a downscaling module.

In some cases, the image processing apparatus 110 may further include an important region designation module 115. The important region designation module 115 may receive importance information ('first importance information') corresponding to at least one region of each frame of the original image 111, as set directly by the broadcast transmitter or the producer. Here, 'importance information' may be information indicating the importance of the region(s) included in the plurality of frames of the original image 111. The importance information may include, for example, information indicating the importance of a region that the producer has set as important, that is, an important region, such as a main character, a major event, or an important object in the original image 111. The producer may provide the image processing apparatus 110 with importance information indicating the importance of at least one region by, for example, designating a mask on the important region of the original image 111. The producer may set a mask for at least one region through, for example, an operation such as a mouse click and/or dragging on the original image 111. Hereinafter, for convenience of description, the importance information set by the producer through the important region designation module 115 is referred to as 'first importance information'.

Unlike the important region designation module 115, the real-time downscaling module 120 automatically sets the importance information corresponding to at least one region of each frame of the original image 111 by means of a pre-trained neural network.

The real-time downscaling module 120 may receive the original image 111 through an image collection server such as a Real Time Messaging Protocol (hereinafter, 'RTMP') server 113. The RTMP server 113 may collect, for example, a broadcast (for example, the original image 111 or a source video) transmitted through a live streaming protocol.

The real-time downscaling module 120 may use a pre-trained neural network to automatically set importance information ('second importance information') for the broadcast image (for example, the original image 111) collected by the RTMP server 113. The importance information may be determined, for example, based on an accuracy map according to a result of restoring, by the pre-trained neural network, at least one region of each frame of the image downscaled by the real-time downscaling module 120 ('second image'). As described in detail below, the accuracy map may be understood as a map indicating the restoration accuracy of a region when at least one region of the downscaled image is restored by the neural network. In this specification, for convenience of description, the image downscaled by the real-time downscaling module 120 may be referred to as a 'second image'. Hereinafter, the scaled image (or rescaled image) and the second image may be understood to have the same meaning.

For example, when the restoration accuracy of a region by the neural network is high, the real-time downscaling module 120 may set the importance of that region low. Conversely, when the restoration accuracy of a region by the neural network is low, the real-time downscaling module 120 may set the importance of that region high. In this way, the importance information may be set to different values according to the restoration accuracy of each region. Hereinafter, for convenience of description, the importance information determined by the real-time downscaling module 120 is referred to as 'second importance information'.
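The inverse relationship between restoration accuracy and importance can be sketched in one line. The linear mapping `1 - accuracy` and the name `importance_from_accuracy` are assumptions for illustration; the description only requires that higher accuracy yield lower importance and vice versa.

```python
def importance_from_accuracy(acc_map):
    """Turn a grid of accuracies in [0, 1] into a grid of importance scores.

    Well-restored cells (accuracy near 1) get importance near 0, and
    poorly restored cells get importance near 1.
    """
    return [[1.0 - accuracy for accuracy in row] for row in acc_map]
```

Any monotonically decreasing mapping would satisfy the same description; the linear form is simply the shortest to state.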

The real-time downscaling module 120 may include, for example, a real-time importance estimation module 121, a real-time importance-based downscaling module 123, and an encoding module. The encoding module may include, for example, an HLS (HTTP Live Streaming) encoding module 125.

The real-time importance estimation module 121 may estimate or determine the importance information of the original image 111 received from the RTMP server 113. The real-time importance estimation module 121 may obtain the importance information of the original image 111 in real time. For example, when the original image 111 is a live image, the real-time importance estimation module 121 may obtain, in real time, importance information corresponding to at least one region of the image based on a result of restoring the second image by a pre-trained neural network. The neural network may be, for example, a neural network trained in advance to generate a reconstructed image corresponding to the second image obtained by downscaling the original image 111. The neural network may be, for example, a deep neural network including a convolution layer. The neural network may be, for example, the restoration neural network 610 shown in FIG. 6 below.

The real-time importance-based downscaling module 123 may determine, based on the importance information determined by the real-time importance estimation module 121 (the 'second importance information'), a mask indicating a scaling degree corresponding to at least one region of the original image. Here, the scaling degree may also be referred to as the compression ratio or resolution of each region of the original image. The scaling degree may include, for example, 1x, 1/2x, and 1/4x of the original image, but is not necessarily limited thereto. For example, the first mask may be a mask indicating a scaling degree of 1x the original image, the second mask may be a mask indicating a scaling degree of 1/2x the original image, and the third mask may be a mask indicating a scaling degree of 1/4x the original image. The mask may be, for example, a mask file.

When a region in the original image 111 is determined to be an important region according to the importance information, the real-time importance-based downscaling module 123 may set the mask information of the first mask for that region to, for example, '1'. Alternatively, when a region in the original image does not correspond to an important region according to the importance information, the real-time importance-based downscaling module 123 may set the mask information of the second mask or the third mask for that region to '1' or '0' according to the importance information.

The HLS encoding module 125 may generate a scaled image by encoding the original image 111 according to the mask information of the mask set by the real-time importance-based downscaling module 123. The scaled image may be, for example, an image for a streaming service encoded at different resolutions (for example, 1080p, 720p, 480p, etc.) according to the mask set by the real-time importance-based downscaling module 123. Here, the streaming service may include, for example, a streaming service for live broadcasting and a streaming service for VOD playback, but is not necessarily limited thereto. The second image may be, for example, a low-resolution video image in which the scaling degree or resolution of the regions of an image frame is set differently according to the importance information, as in the second image 400 of FIG. 4 below.

A process in which the real-time downscaling module 120 generates a rescaled image based on the second image downscaled from the first image will be described in detail with reference to FIG. 3 below. In addition, the operation of each module of the real-time downscaling module 120 will be described in more detail with reference to FIG. 5 below.

The image processing apparatus 110 according to an embodiment may provide a high-quality video streaming service even in a low-capacity network environment through downscaling that maintains the original resolution of important regions in the original image 111. More specifically, the image processing apparatus 110 may reduce the data size of the image for a live streaming service by maintaining the original resolution as much as possible for the important region of each frame of the original image 111 and down-sampling the remaining regions. For example, when an image producer transmits a 360-degree image through a live stream protocol, the image processing apparatus 110 may perform, in real time, downscaling that preserves the resolution of important regions in the content as much as possible.

The streaming server 130 may transmit the at least one mask and the second image received from the image processing apparatus 110 to the image reproducing apparatuses 150. The streaming server 130 may include, for example, a plurality of virtual machines for load balancing. The streaming server 130 may run one or more virtual servers (or virtual machines) when necessary, and may provide a multi-channel live streaming service by increasing the number of virtual servers as desired.

The streaming server 130 may, for example, adjust the number of virtual machines according to the number of viewers watching the video, that is, the number of image reproducing apparatuses 150. Each virtual machine may, for example, act as a server that processes HTTP requests. The video distributed to the image reproducing apparatus 150 through the streaming server 130 may be delivered to the image reproducing apparatus 150 through, for example, a content delivery network (CDN), and may thereby be used to provide a live streaming service to the user. The mask may be set per chunk for streaming. Of course, when there is a scene change within a single chunk, one or more masks may be set for that chunk. Alternatively, a smoothing technique may be applied so that the mask does not change abruptly over time.

The image reproducing apparatus 150 may restore the second image using a pre-trained neural network 151, based on image information including the at least one mask and the second image received through the streaming server 130. In this case, the at least one mask may be, for example, a mask file corresponding to a plurality of regions of the second image. The image reproducing apparatus 150 may reproduce the restored image through the display apparatus 153. The image reproducing apparatus 150 according to an embodiment is, for example, an artificial neural network-based video player or a user device including a video player, and there may be one such apparatus or a plurality of them. The operation of the image reproducing apparatus 150 will be described in detail with reference to FIG. 7 below.

The neural network used by the real-time downscaling module 120 of the server and the neural network used by the image reproducing apparatus 150 of the client may be the same. The neural network may receive a downscaled image and a mask as input and perform an operation of restoring (for example, upscaling) the downscaled image. In this case, the server or the client may concatenate the downscaled image and the mask and input the result to the neural network.

FIG. 2 is a flowchart illustrating an image processing method according to an embodiment. Referring to FIG. 2, the image processing apparatus according to an embodiment receives a first image including a plurality of frames (210).

The image processing apparatus acquires importance information indicating the importance of at least one region included in the plurality of frames (220). In step 220, the image processing apparatus may receive, for example, from a terminal of the producer of the first image, first importance information set for at least one region of each frame of the first image. Alternatively, the image processing apparatus may acquire second importance information determined for at least one region of each frame of the first image, based on, for example, an accuracy map obtained from a result of restoring at least one region of each frame of the second image by a pre-trained neural network. Here, the neural network may be a neural network trained to generate a reconstructed image corresponding to the second image, based on a combined image in which the second image is concatenated with at least one of first mask information, second mask information, and third mask information corresponding to the second image. The neural network may be trained, for example, to minimize a loss function between a ground truth image corresponding to the second image and the reconstructed image.

A method by which the image processing apparatus according to an embodiment acquires the second importance information is as follows. The image processing apparatus may generate an accuracy map based on, for example, the difference between a result of restoring at least one region of each frame of the second image by a pre-trained neural network and a ground truth image corresponding to the at least one region of each frame of the second image. The image processing apparatus may determine, based on the accuracy map, second importance information corresponding to at least one region of each frame of the first image.
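The difference-based accuracy map described above can be illustrated with a minimal sketch. The description does not specify the difference metric or the grid size, so the mean absolute difference per grid cell, the function name `difference_map`, and the toy 4x4 single-channel frames below are all assumptions made for illustration.

```python
def difference_map(restored, truth, cell):
    """Mean absolute difference per cell x cell block.

    A larger value means the network restored that block less accurately,
    so the block would receive higher (second) importance.
    The metric is an assumption; the description only says "difference".
    """
    rows, cols = len(truth), len(truth[0])
    grid = []
    for top in range(0, rows, cell):
        grid_row = []
        for left in range(0, cols, cell):
            diffs = [
                abs(restored[y][x] - truth[y][x])
                for y in range(top, min(top + cell, rows))
                for x in range(left, min(left + cell, cols))
            ]
            grid_row.append(sum(diffs) / len(diffs))
        grid.append(grid_row)
    return grid


# Hypothetical ground truth frame and network output (grayscale values).
truth = [[10, 10, 50, 60],
         [10, 10, 70, 80],
         [0, 0, 5, 5],
         [0, 0, 5, 5]]
restored = [[10, 10, 54, 56],
            [10, 10, 66, 84],
            [0, 0, 5, 5],
            [0, 0, 5, 5]]

diff_map = difference_map(restored, truth, cell=2)
print(diff_map)
```

In this toy input only the detailed top-right block is restored imperfectly, so it is the only block with a nonzero entry and would be the first candidate for keeping a higher resolution.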

For example, as shown in FIG. 4 below, the image processing apparatus may divide each of the plurality of frames into grids corresponding to at least one region of that frame, and may acquire importance information for the regions corresponding to the grids.

The image processing apparatus determines, based on the importance information acquired in step 220, at least one mask indicating a scaling degree corresponding to at least one region of the first image (230). Based on the importance information, the image processing apparatus may determine the at least one mask and the mask information of the at least one mask such that the resolution of at least one first region of the first image is maintained and the resolution of the remaining regions other than the first region is down-sampled.

When the first importance information is received in step 220, the image processing apparatus may determine at least one mask for the first region of the first image according to the first importance information, for example regardless of the second importance information. The image processing apparatus may determine the mask information of the first mask for the first region to be, for example, a first value ('1'). Alternatively, when the second importance information is acquired in step 220, the image processing apparatus may determine, according to the second importance information, the masks (for example, the first mask, the second mask, and the third mask) for the remaining regions other than the first region and the mask information of those masks. The image processing apparatus may determine the mask for each remaining region to be any one of the first mask, the second mask, and the third mask according to the second importance information, and may set the mask information of the determined mask.

When the second importance information is acquired in step 220, the image processing apparatus may, for example, sort the second importance information and determine, based on the sorted second importance information, at least one mask corresponding to at least one region of each frame of the first image. More specifically, according to the sorted second importance information, the image processing apparatus may determine the mask corresponding to a first region to be the first mask, so that a first resolution of the first region in the first image is set to the original resolution of the first image. According to the sorted second importance information, the image processing apparatus may determine the mask for a second region, which excludes the first region, to be the second mask, so that the second region is down-sampled to a second resolution lower than the first resolution. According to the sorted second importance information, the image processing apparatus may determine the mask for the remaining third region, which excludes the first region and the second region, to be the third mask, so that the third region is down-sampled to a third resolution lower than the second resolution.

The image processing apparatus may determine, based on the importance information, at least one mask for at least one region included in the plurality of frames of the first image such that, for example, the data size of the second image remains the same as the data size obtained by downscaling the first image at a predetermined ratio (for example, x1/2). For example, by changing one region from x1/2 to x1 while changing four regions from x1/2 to x1/4, the image processing apparatus may keep the data size equal to that obtained when all regions are uniformly downscaled to x1/2.
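The one-for-four exchange above can be checked with simple arithmetic. The sketch below assumes that the scaling degree applies to both width and height, that all grid regions have the same size, and that data size is proportional to pixel count; none of these is stated explicitly in the description, and the names are illustrative.

```python
from fractions import Fraction


def pixel_cost(scale):
    """Relative pixel count of one grid region stored at a linear scale factor."""
    return scale * scale


half, full, quarter = Fraction(1, 2), Fraction(1), Fraction(1, 4)

# Five regions, all downscaled uniformly to x1/2.
uniform_cost = 5 * pixel_cost(half)

# One region promoted to x1, four regions demoted to x1/4.
mixed_cost = pixel_cost(full) + 4 * pixel_cost(quarter)

print(uniform_cost, mixed_cost)
```

Promoting one region costs an extra 3/4 of a region's pixels, and each demoted region saves 3/16, so four demotions pay for one promotion exactly under these assumptions.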

In step 230, the image processing apparatus may determine at least one mask for each frame of the first image, for example based on the importance information. Alternatively, the image processing apparatus may determine, based on the importance information, at least one mask for each chunk including a plurality of frames of the first image. A chunk may include, for example, 512 frames.

The image processing apparatus generates the second image by encoding the first image according to the at least one mask (240). The first image may be divided into, for example, at least one of a first region, a second region, and a third region based on the at least one mask. For example, the image processing apparatus may encode the first region based on the first mask corresponding to the first region. The image processing apparatus may encode the second region based on the second mask corresponding to the second region. In addition, the image processing apparatus may encode the third region based on the third mask corresponding to the third region.

The image processing apparatus outputs the at least one mask determined in step 230 and the second image generated in step 240 (250).

FIG. 3 is a diagram for explaining a method of generating a rescaled image based on a second image according to an embodiment, and FIG. 4 is a diagram illustrating an example of a second image according to an embodiment. The rescaled image may be input to the neural network together with the mask.

The image processing apparatus according to an embodiment may generate downscaled images 320, 340, and 360 by downscaling the original image 310 to, for example, 1x (x1), 1/2x (x1/2), and 1/4x (x1/4) of its size, respectively. To downscale the original image 310, the image processing apparatus may use at least one of various known techniques (for example, a bilinear technique and a bicubic interpolation technique). The bilinear technique may downscale an image by, for example, assigning the average of the four pixel values around an interpolation point as the new pixel value. The bicubic interpolation technique is, for example, an extension of cubic interpolation for interpolating data points on a two-dimensional regular grid, and may be performed using Lagrange polynomials, cubic splines, or a cubic convolution algorithm.
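The four-pixel averaging mentioned for the bilinear technique can be sketched as follows. This is a simplified illustration that halves an image by averaging each non-overlapping 2x2 block, which is what bilinear sampling reduces to at an exact x1/2 ratio with block-centered sample points; the function name and the toy grayscale image are assumptions, and a production system would normally rely on an image library instead.

```python
def downscale_half(image):
    """Halve width and height by averaging each 2x2 block of pixels."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) / 4
            for x in range(0, cols - 1, 2)
        ]
        for y in range(0, rows - 1, 2)
    ]


# Hypothetical 4x4 grayscale image standing in for the original image.
original = [[0, 4, 8, 8],
            [4, 8, 8, 8],
            [1, 1, 2, 2],
            [1, 1, 2, 2]]

half = downscale_half(original)   # x1/2 of the original size
quarter = downscale_half(half)    # x1/4 of the original size
print(half, quarter)
```

Applying the same step twice gives the x1/4 image, which matches the three scaling degrees (x1, x1/2, x1/4) used in the description.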

Thereafter, the image processing apparatus may generate upscaled images 325, 345, and 365 by upscaling the downscaled images 320, 340, and 360 back by factors of 1x (x1), 2x (x2), and 4x (x4), respectively. To upscale the downscaled images 340 and 360, the image processing apparatus may use at least one of various known techniques (for example, a bilinear technique and a bicubic interpolation technique). Although they are shown in FIG. 3 for convenience of description, the 1x downscaled image 320 and the 1x upscaled image 325 are identical to the original image 310, and no separate operation needs to be performed to generate them.

In the process described above, the image processing apparatus may determine at least one mask corresponding to each region based on the importance information corresponding to the original image 310, and may generate a rescaled image 380 by multiplying each of the upscaled images 325, 345, and 365 by the corresponding mask.

More specifically, according to the determination based on the importance information, the image processing apparatus may multiply the upscaled image 325 by the first mask 330 and multiply the upscaled image 345 by the second mask 350. In addition, the image processing apparatus may multiply the upscaled image 365 by the third mask 370. In this case, the mask information of each mask (for example, a pixel value of the mask file) may have a value of, for example, 0 or 1. In addition, for pixels at the same coordinates, only one of the masks may have a value of 1 and the remaining masks may have a value of 0. For example, the scaling degree of a pixel corresponding to at least one region of each frame of the original image 310 may be expressed as Mask 1(x, y) + Mask 2(x, y) + Mask 3(x, y) = 1. Here, (x, y) may be the coordinates of the mask information corresponding to the coordinates of a pixel of the original image 310.
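The mask multiplication can be sketched as a per-pixel selection among the three upscaled images. The sketch assumes that the masked images are summed to form the rescaled image, which the one-hot constraint Mask 1 + Mask 2 + Mask 3 = 1 suggests but the description does not state outright; the function name and the toy 2x2 values are illustrative only.

```python
def compose_rescaled(upscaled, masks):
    """Sum mask_i * upscaled_i per pixel, requiring exactly one active mask."""
    rows, cols = len(upscaled[0]), len(upscaled[0][0])
    result = []
    for y in range(rows):
        row = []
        for x in range(cols):
            if sum(mask[y][x] for mask in masks) != 1:
                raise ValueError(f"masks are not one-hot at ({x}, {y})")
            row.append(sum(mask[y][x] * image[y][x]
                           for image, mask in zip(upscaled, masks)))
        result.append(row)
    return result


# Hypothetical upscaled images, all brought back to the original 2x2 size.
up_x1 = [[1, 2], [3, 4]]            # from the x1 image
up_x2 = [[10, 20], [30, 40]]        # from the x1/2 image, upscaled x2
up_x4 = [[100, 200], [300, 400]]    # from the x1/4 image, upscaled x4

mask_1 = [[1, 0], [0, 0]]
mask_2 = [[0, 1], [0, 0]]
mask_3 = [[0, 0], [1, 1]]

rescaled = compose_rescaled([up_x1, up_x2, up_x4], [mask_1, mask_2, mask_3])
print(rescaled)
```

Each output pixel is taken from exactly one of the three images, so the result mixes full-resolution, half-resolution, and quarter-resolution content within a single frame.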

Accordingly, the rescaled image 380 may, for example, like the second image 400 shown in FIG. 4 below, include a first region 450 that retains the resolution of the original image, a second region 430 having a resolution downscaled to 1/2, and a third region 410 having a resolution downscaled to 1/4. In this case, the mask corresponding to each region may be determined by, for example, importance information (for example, first importance information) of a region set as an important region by the image producer, or may be determined by importance information (for example, second importance information) acquired based on an accuracy map obtained from a result of restoring the region by a pre-trained neural network.

In this way, each region of the second image 400 is scaled differently according to the importance information of that region, and the corresponding scaling information may be indicated by a mask.

FIG. 5 is a diagram for explaining the configuration and operation of an image processing apparatus according to an embodiment. FIG. 5 shows the structure and operation of the real-time downscaling module 500 of the image processing apparatus 110 according to an embodiment.

The real-time downscaling module 500 may include a real-time accuracy estimation module 510, an accuracy-based mask generation module 530, and an HLS encoding module 550.

The real-time accuracy estimation module 510 may include a mask-based reconstruction network 513 that restores the downscaled second image. The reconstruction network 513 may also be referred to as a 'reconstruction neural network'.

The reconstruction network 513 may, for example, restore an image 511, which has been downscaled to x1/2 of the resolution of the original image, into a reconstructed image 514. In this case, the mask corresponding to x1/2 may have a value of 1 everywhere, and the masks corresponding to x1 and x1/4 may have a value of 0 everywhere. In addition, the reconstruction network 513 may, for example, restore an image 517, which has been downscaled to x1/4 of the resolution of the original image, into a reconstructed image 518. In this case, the mask corresponding to x1/4 may have a value of 1 everywhere, and the masks corresponding to x1 and x1/2 may have a value of 0 everywhere. The reconstruction network 513 may be a single neural network.

The real-time accuracy estimation module 510 may generate an accuracy map (x2) 516 based on the difference between the reconstructed image 514 restored through the reconstruction network 513 and a ground truth image 515 corresponding to the downscaled image 511. Here, the ground truth image 515 may be the original image. The accuracy map (x2) 516 may be understood as an accuracy map indicating the accuracy of the image obtained by restoring (or upscaling) the downscaled image 511 by x2 using the reconstruction network 513. Based on the accuracy map (x2) 516, the real-time accuracy estimation module 510 may estimate second importance information corresponding to at least one region of the original image downscaled to a resolution of x1/2.

In addition, the real-time accuracy estimation module 510 may generate an accuracy map (x4) 520 based on the difference between the reconstructed image 518 restored through the reconstruction network 513 and a ground truth image 519 corresponding to the downscaled image 517. Here, the ground truth image 519 may be the original image. The accuracy map (x4) 520 may be understood as an accuracy map generated from the image obtained by restoring (or upscaling) the downscaled image 517 by x4 using the reconstruction network 513. Based on the accuracy map (x4) 520, the real-time accuracy estimation module 510 may estimate second importance information corresponding to at least one region of the original image downscaled to a resolution of x1/4. The structure and operation of the reconstruction network 513 will be described in more detail with reference to the reconstruction neural network 610 of FIG. 6 below.

The accuracy-based mask generation module 530 may sort (531) the second importance information estimated based on each accuracy map. The accuracy-based mask generation module 530 may determine, based on the sorted second importance information, a mask corresponding to at least one region of each frame of the first image.

For example, according to the sorted second importance information, the accuracy-based mask generation module 530 may determine a mask (Mask 1) 533 for a first region so that a first resolution of the first region in the first image is set to the original resolution of the first image. The accuracy-based mask generation module 530 may determine a mask (Mask 2) 535 for a second region, which excludes the first region, so that the second region is down-sampled to a second resolution lower than the first resolution. In addition, according to the sorted second importance information, the accuracy-based mask generation module 530 may determine a mask (Mask 3) 537 for the remaining third region, which excludes the first region and the second region, so that the third region is down-sampled to a third resolution lower than the second resolution.

For example, when the mask information of the first mask for a region A in the image is '1', the mask information of the second mask and the mask information of the third mask for that region may be '0'. In this way, the accuracy-based mask generation module 530 may set the mask information so that only one mask is applied to any given region of the image.

For example, when the difference value of the accuracy map (x2) 516 for one first grid region is greater than the average of the difference values of the accuracy map (x4) 520 for four second grid regions, the accuracy-based mask generation module 530 may set the mask information of the mask (Mask 1) 533 for the first grid region to '1' and the mask information of the mask (Mask 3) 537 for the second grid regions to '1'. The first grid region and the second grid regions are different regions. After repeating this process, the accuracy-based mask generation module 530 may determine the mask (Mask 2) 535 for the remaining grid regions as Mask 2(x, y) = 1 - Mask 1(x, y) - Mask 3(x, y).
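The comparison rule above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `assign_masks`, the greedy pairing order (largest x2 error first, smallest x4 errors first), and the stopping condition are hypothetical choices made for the example. If x2 and x4 denote per-side scaling factors, trading one full-resolution grid for four x4 grids leaves the pixel budget unchanged, since 1 + 4/16 equals 5/4, the area of five grids at x2.

```python
import numpy as np

def assign_masks(diff_x2, diff_x4):
    """Greedy sketch of the Mask 1 / Mask 3 assignment (hypothetical helper).

    diff_x2, diff_x4: 2-D arrays holding the per-grid difference values of
    the x2 and x4 accuracy maps. Returns (mask1, mask2, mask3) per grid.
    """
    shape = diff_x2.shape
    d2, d4 = diff_x2.ravel(), diff_x4.ravel()
    mask1 = np.zeros(d2.size, dtype=np.uint8)
    mask3 = np.zeros(d2.size, dtype=np.uint8)
    assigned = np.zeros(d2.size, dtype=bool)

    keep_order = np.argsort(-d2, kind="stable")  # worst x2 error first
    drop_order = np.argsort(d4, kind="stable")   # best x4 error first

    for g in keep_order:
        if assigned[g]:
            continue
        # Four cheapest grids that are still free and differ from g.
        cheap = [c for c in drop_order if not assigned[c] and c != g][:4]
        if len(cheap) < 4 or d2[g] <= d4[cheap].mean():
            break  # no further trade satisfies the rule (assumed stop)
        mask1[g] = 1
        mask3[cheap] = 1
        assigned[g] = True
        assigned[cheap] = True

    # Remaining grids keep the default x2 scaling.
    mask2 = 1 - mask1 - mask3
    return mask1.reshape(shape), mask2.reshape(shape), mask3.reshape(shape)

diff_x2 = np.array([[9.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
diff_x4 = np.array([[20.0, 2.0, 2.0], [2.0, 2.0, 50.0]])
m1, m2, m3 = assign_masks(diff_x2, diff_x4)
```

Because each grid receives exactly one mask, `m1 + m2 + m3` is 1 everywhere, which matches the exclusivity described above.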

The real-time downscaling module 500 may generate a rescaled image using the original image 539 and the masks 533, 535, and 537 corresponding to the regions set by the accuracy-based mask generation module 530. For example, the real-time downscaling module 500 may perform rescaling on the original image 539 to generate the rescaled image 380 described above with reference to FIG. 3.

The HLS encoding module 550 may perform HLS encoding of the rescaled image sequence. The HLS encoding module 550 may encode the rescaled image according to the mask information of each region set by the accuracy-based mask generation module 530. As a result, an image for a streaming service may be generated in which, for example, each region of a frame is encoded at a different resolution (e.g., 1080p, 720p, 480p) according to the mask information. The operation of the HLS encoding module 550 may correspond to that of the HLS encoding module 125 described above with reference to FIG. 1.

FIG. 6 is a diagram for explaining the structure and operation of the reconstruction neural network shown in FIG. 5. Referring to FIG. 6, a method of training a super-resolution (SR) based reconstruction neural network according to an embodiment is illustrated.

The image processing apparatus receives a training image and at least one mask corresponding to the training image. The at least one mask used in the training process may be set randomly. The image processing apparatus may train the reconstruction neural network 610 through, for example, skip connections (or shortcut connections) using the residual blocks 613 of the reconstruction neural network 610. The reconstruction neural network 610 may be, for example, a Residual Network (ResNet), which is a type of convolutional neural network (CNN). In the reconstruction neural network 610, the residual blocks 613 allow the activation of a layer to be passed quickly to deeper layers of the network, and this simple adjustment makes it possible to train much deeper neural networks. When the reconstruction neural network 610 is trained with skip connections, differentiating the identity mapping (x) during back propagation yields a value of at least 1, so training remains stable even as the layers of the reconstruction neural network 610 become deeper.
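The stabilizing effect of the skip connection can be made explicit with a short derivation. This is an illustrative sketch rather than part of the disclosure: the symbols y, F, and \mathcal{L} are assumed notation for the output of a residual block, its residual branch, and the training loss.

```latex
y = x + F(x)
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \left( I + \frac{\partial F}{\partial x} \right)^{\top}
    \frac{\partial \mathcal{L}}{\partial y}
```

The identity term I carries the upstream gradient to earlier layers unchanged, so the gradient does not vanish merely because the contribution of the residual branch becomes small as blocks are stacked.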

More specifically, suppose that a combined image 609 is applied to the reconstruction neural network 610, the combined image 609 being a concatenation of a (re)scaled image 601 and at least one of a first mask 603, a second mask 605, and a third mask 607 corresponding to the (re)scaled image 601.
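One possible way to form such a combined input is channel-wise concatenation, sketched below. The channel-first layout, the helper name `concat_image_and_masks`, and the assumption that each mask has already been expanded to the pixel resolution of the image are illustrative choices, not details taken from the disclosure.

```python
import numpy as np

def concat_image_and_masks(image, masks):
    """Stack an image and its masks along the channel axis (hypothetical helper).

    image: array of shape (C, H, W).
    masks: list of binary arrays of shape (H, W), one per scaling level.
    Returns an array of shape (C + len(masks), H, W).
    """
    mask_stack = np.stack([m.astype(image.dtype) for m in masks], axis=0)
    return np.concatenate([image, mask_stack], axis=0)

image = np.zeros((3, 4, 6), dtype=np.float32)            # RGB frame
masks = [np.ones((4, 6), dtype=np.uint8),                 # Mask 1
         np.zeros((4, 6), dtype=np.uint8),                # Mask 2
         np.zeros((4, 6), dtype=np.uint8)]                # Mask 3
combined = concat_image_and_masks(image, masks)
```

With three color channels and three masks, the network in this sketch would receive a six-channel input.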

The reconstruction neural network 610 may be trained to generate, based on the combined image 609, a reconstructed image 630 corresponding to the (re)scaled image 601. Here, the reconstruction neural network 610 may be trained based on a loss function between a ground truth image 635, which is the original image of the (re)scaled image 601, and the reconstructed image 630 actually produced by the reconstruction neural network 610. The reconstruction neural network 610 may be trained in the direction that minimizes the loss function.

In an embodiment, to improve speed, the reconstruction neural network 610 may be trained at a lower spatial dimension through, for example, an Inv Pixel Shuffle module 611. The Inv Pixel Shuffle module 611 may, for example, convert an F (feature channel size) * W (width) * H (height) dimension into an F*4 (feature channel size) * W/2 (width) * H/2 (height) dimension for computation. As a result, the computation time may be reduced from approximately W*H to (W/2) * (H/2).
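The dimension change performed by an inverse pixel shuffle (also known as space-to-depth) can be sketched in NumPy as below. The function name `inv_pixel_shuffle`, the channel-first layout, and the channel ordering are assumptions made for illustration; the disclosure only specifies the input and output dimensions.

```python
import numpy as np

def inv_pixel_shuffle(x, r=2):
    """Move each r x r spatial block into the channel axis.

    x: array of shape (F, H, W) with H and W divisible by r.
    Returns an array of shape (F * r * r, H // r, W // r).
    """
    f, h, w = x.shape
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    x = x.reshape(f, h // r, r, w // r, r)   # split rows and columns into blocks
    x = x.transpose(0, 2, 4, 1, 3)           # (F, r, r, H/r, W/r)
    return x.reshape(f * r * r, h // r, w // r)

features = np.arange(16).reshape(1, 4, 4)
shuffled = inv_pixel_shuffle(features)
```

No values are discarded: the number of elements is preserved, while the number of spatial positions that each subsequent convolution must visit drops by a factor of four.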

When the reconstruction neural network 610 is trained, the masks and the mask information corresponding to each mask (e.g., pixel values of a mask file) may be generated randomly.
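Random training masks of this kind could be produced, for example, by drawing one scaling level per grid cell and expanding the result into mutually exclusive binary masks. The sketch below assumes three levels, a uniform distribution, and a hypothetical helper name `random_masks`; the disclosure does not specify how the random masks are drawn.

```python
import numpy as np

def random_masks(grid_h, grid_w, num_levels=3, seed=None):
    """Draw one scaling level per grid cell and return one binary mask per level."""
    rng = np.random.default_rng(seed)
    level = rng.integers(0, num_levels, size=(grid_h, grid_w))
    return [(level == k).astype(np.uint8) for k in range(num_levels)]

train_masks = random_masks(4, 6, seed=0)
```

Because every cell is assigned exactly one level, the masks sum to 1 at each grid position, consistent with the rule that only one mask applies to a given region.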

Based on the reconstruction neural network 610 trained as shown in FIG. 6, the image processing apparatus may obtain, during actual streaming, the second importance information determined for at least one region of each frame of the original image.

FIG. 7 is a flowchart illustrating a method of reconstructing an image according to an embodiment. Referring to FIG. 7, a video playback apparatus according to an embodiment obtains image information including an image having a plurality of regions with a plurality of resolutions and mask information corresponding to the plurality of regions of the image (710).

The video playback apparatus reconstructs the image using a pre-trained neural network based on the image information obtained in operation 710 (720). The video playback apparatus may extract at least one mask from the image information and may reconstruct the image using the neural network based on the extracted at least one mask. Here, the neural network may be, for example, a neural network trained to generate a reconstructed image corresponding to an image based on the image and at least one mask corresponding to the image.

The video playback apparatus plays back the image reconstructed in operation 720 (730). The video playback apparatus may play back the reconstructed image through, for example, a display (see 970 in FIG. 9).

FIG. 8 is a block diagram of an image processing apparatus according to an embodiment. Referring to FIG. 8, an image processing apparatus 800 according to an embodiment includes a communication interface 810 and a processor 830. The image processing apparatus 800 may further include a memory 850. The communication interface 810, the processor 830, and the memory 850 may communicate with one another through a communication bus 805.

The communication interface 810 receives a first image including a plurality of frames. The communication interface 810 outputs the at least one mask determined by the processor 830 and the second image generated by the processor 830.

The processor 830 obtains importance information indicating the importance of at least one region included in the plurality of frames. Based on the importance information, the processor 830 determines at least one mask indicating a degree of scaling corresponding to at least one region of the first image. The processor 830 encodes the first image according to the at least one mask.

The memory 850 may at least temporarily store the first image received through the communication interface 810 so that the first image can be processed. The memory 850 may store the importance information that the processor 830 obtains for at least one region of the first image. The memory 850 may also at least temporarily store the mask information determined by the processor 830 and/or the second image generated by the processor 830.

In addition, the processor 830 may perform at least one of the methods described above with reference to FIGS. 1 to 6, or an algorithm corresponding to the at least one method. The processor 830 may be a hardware-implemented data processing device having circuitry with a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program. Examples of the hardware-implemented data processing device include a microprocessor, a central processing unit, a processor core, a multi-core processor, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA).

The processor 830 may execute a program and control the image processing apparatus 800. Program code executed by the processor 830 may be stored in the memory 850.

The memory 850 may store various kinds of information generated in the processing performed by the processor 830 described above. The memory 850 may also store various data and programs. The memory 850 may include a volatile memory or a non-volatile memory, and may include a mass storage medium such as a hard disk to store various data.

FIG. 9 is a block diagram of an apparatus for reconstructing an image according to an embodiment. Referring to FIG. 9, an apparatus for reconstructing an image (hereinafter, 'video playback apparatus') 900 according to an embodiment includes a communication interface 910, a processor 930, and a display 970. The video playback apparatus 900 may further include a memory 950. The communication interface 910, the processor 930, the memory 950, and the display 970 may communicate with one another through a communication bus 905.

The communication interface 910 obtains image information including an image having a plurality of regions with a plurality of resolutions and mask information corresponding to the plurality of regions of the image.

The processor 930 reconstructs the image using a pre-trained neural network based on the image information.

The memory 950 may store the image information, including the image and the mask information, obtained through the communication interface 910. The memory 950 may store the image reconstructed by the processor 930.

The display 970 plays back the image reconstructed by the processor 930.

In addition, the processor 930 may perform at least one of the methods described above with reference to FIGS. 1 and 7, or an algorithm corresponding to the at least one method. The processor 930 may be a hardware-implemented data processing device having circuitry with a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program. Examples of the hardware-implemented data processing device include a microprocessor, a central processing unit, a processor core, a multi-core processor, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA).

The processor 930 may execute a program and control the video playback apparatus 900. Program code executed by the processor 930 may be stored in the memory 950.

The memory 950 may store various kinds of information generated in the processing performed by the processor 930 described above. The memory 950 may also store various data and programs. The memory 950 may include a volatile memory or a non-volatile memory, and may include a mass storage medium such as a hard disk to store various data.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiment, and vice versa.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, those of ordinary skill in the art may apply various technical modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of a system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents. Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims that follow.

100: video streaming system
110: image processing apparatus
130: streaming server
150: video playback apparatus

Claims

1. An image processing method comprising:
receiving a first image including a plurality of frames;
obtaining importance information indicating the importance of at least one region included in the plurality of frames;
determining, based on the importance information, at least one mask indicating a degree of scaling corresponding to at least one region of the first image;
generating a second image by encoding the first image according to the at least one mask; and
outputting the at least one mask and the second image.

2. The method of claim 1,
wherein determining the at least one mask comprises:
determining, based on the importance information, the at least one mask and mask information corresponding to the at least one mask such that the resolution of at least one first region of the first image is maintained and the resolutions of the remaining regions other than the first region are down-sampled.

3. The method of claim 1,
wherein obtaining the importance information comprises at least one of:
receiving, from a producer terminal of the first image, first importance information set for at least one region of each frame of the first image; and
obtaining second importance information determined for at least one region of each frame of the first image based on an accuracy map according to a result of reconstructing, by a pre-trained neural network, at least one region of each frame of the second image.

4. The method of claim 3,
wherein obtaining the second importance information comprises:
generating the accuracy map based on a difference between a result of reconstructing at least one region of each frame of the second image by the pre-trained neural network and a ground truth image corresponding to the at least one region of each frame of the second image; and
determining, based on the accuracy map, the second importance information corresponding to at least one region of each frame of the first image.

5. The method of claim 3,
wherein determining the at least one mask comprises:
sorting the second importance information; and
determining, based on the sorted second importance information, at least one mask corresponding to at least one region of each frame of the first image.

6. The method of claim 5,
wherein determining the at least one mask based on the sorted second importance information comprises at least one of:
determining a mask corresponding to a first region as a first mask such that, according to the sorted second importance information, a first resolution of the first region in the first image is set to the original resolution of the first image;
determining a mask corresponding to a second region as a second mask such that, according to the sorted second importance information, the second region, which excludes the first region, is down-sampled to a second resolution lower than the first resolution; and
determining a mask corresponding to a third region as a third mask such that, according to the sorted second importance information, the third region, which is the remainder excluding the first region and the second region, is down-sampled to a third resolution lower than the second resolution.

7. The method of claim 3,
wherein the neural network generates a reconstructed image corresponding to the second image based on a combined image in which the second image and at least one of a first mask, a second mask, and a third mask corresponding to the second image are concatenated.

8. The method of claim 3,
wherein determining the at least one mask comprises:
determining at least one mask for a first region of the first image according to the first importance information, regardless of the second importance information; and
determining at least one mask for the regions other than the first region according to the second importance information.

9. The method of claim 1,
wherein determining the at least one mask comprises:
determining, based on the importance information, at least one mask for at least one region included in the plurality of frames of the first image such that the size of the second image is kept equal to the size obtained by downscaling the first image by a predetermined ratio.

10. The method of claim 1,
wherein determining the at least one mask comprises any one of:
determining the at least one mask for each frame of the first image based on the importance information; and
determining the at least one mask for each chunk including a plurality of frames of the first image based on the importance information.

11. The method of claim 1,
wherein obtaining the importance information comprises:
dividing each of the plurality of frames into a plurality of grids corresponding to at least one region of each of the plurality of frames; and
obtaining importance information of regions corresponding to the plurality of grids.

12. The method of claim 1,
wherein the first image is divided into at least one of a first region, a second region, and a third region based on the at least one mask, and
wherein generating the second image comprises:
encoding the first region based on a first mask corresponding to the first region;
encoding the second region based on a second mask corresponding to the second region; and
encoding the third region based on a third mask corresponding to the third region.

13. The method of claim 1,
wherein the first image includes live streaming content.

14. A video playback method comprising:
obtaining image information including an image having a plurality of regions with a plurality of resolutions and at least one mask corresponding to the plurality of regions of the image;
reconstructing the image using a pre-trained neural network based on the image information; and
playing back the reconstructed image.

15. The method of claim 14,
wherein reconstructing the image comprises:
extracting the at least one mask from the image information; and
reconstructing the image using the neural network based on the extracted at least one mask.

16. The method of claim 14,
wherein the neural network is trained to generate the reconstructed image corresponding to the image based on the image and at least one mask corresponding to the image.

17. A computer program stored in a computer-readable recording medium, in combination with hardware, to execute the method of any one of claims 1 to 16.

18. An image processing apparatus comprising:
a communication interface configured to receive a first image including a plurality of frames; and
a processor configured to determine, based on importance information indicating the importance of at least one region included in the plurality of frames, at least one mask indicating a degree of scaling corresponding to at least one region of the first image, and to generate a second image by encoding the first image according to the at least one mask,
wherein the communication interface outputs the at least one mask and the second image.

19. A video playback apparatus comprising:
a communication interface configured to obtain image information including an image having a plurality of regions with a plurality of resolutions and at least one mask corresponding to the plurality of regions of the image;
a processor configured to reconstruct the image using a pre-trained neural network based on the image information; and
a display configured to play back the reconstructed image.