KR102273267B1

KR102273267B1 - Method and apparatus for processing sound

Info

Publication number: KR102273267B1
Application number: KR1020190044636A
Authority: KR
Inventors: 나민수; 이종민; 박경모; 이상민
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2019-04-17
Filing date: 2019-04-17
Publication date: 2021-07-05
Also published as: KR20200121982A

Abstract

A sound processing method and apparatus are disclosed.
According to an embodiment of the present invention, there is provided a method of processing sound included in content, the method comprising: extracting one or more pieces of deep metadata from an image included in the content; selecting object metadata, which is deep metadata related to a target object generating the sound, from among the extracted deep metadata; and processing the sound based on the object metadata.

Description

Sound processing method and apparatus {METHOD AND APPARATUS FOR PROCESSING SOUND}

The present invention relates to a method and an apparatus for processing sound, and more specifically to a sound processing method and apparatus that give sound a three-dimensional effect and a sense of realism by upscaling the sound corresponding to an image, or by applying various effects to that sound, based on deep metadata extracted from the image.

The content described in this section merely provides background information on the present invention and does not constitute prior art.

As communication environments and hardware technologies develop, user demand for high-quality content that includes high-definition images and high-fidelity sound is increasing.

To satisfy this demand, various methods are being developed, such as converting Full HD, 8-bit non-HDR (high dynamic range) images into 4K UHD, 10-bit HDR images through deep-learning-based image upscaling, and converting a 128 kbps MP3 sound source into a 320 kbps sound source through deep-learning-based audio upscaling.

However, there are currently no methods that give the user a three-dimensional effect by converting the sound included in content according to the movement of an object in the image, or that express the sound more realistically by reflecting the characteristics of the environment in which the sound is generated.

A main object of an embodiment of the present invention is to provide a sound processing method and apparatus that can meet the demand for high-quality content by accurately identifying, based on the deep metadata of an image, the spatial change of sound or the characteristics of the environment in the image in which the sound occurs, and by providing three-dimensional and realistic sound on that basis.

According to an embodiment of the present invention, there is provided a method of processing sound included in content, the method comprising: extracting one or more pieces of deep metadata from an image included in the content; selecting object metadata, which is deep metadata related to a target object generating the sound, from among the extracted deep metadata; and processing the sound based on the object metadata.

According to another embodiment of the present invention, there is provided an apparatus for processing sound included in content, the apparatus comprising: an extraction unit that extracts one or more pieces of deep metadata from an image included in the content; a selection unit that selects object metadata, which is deep metadata related to a target object generating the sound, from among the extracted deep metadata; and a processing unit that processes the sound based on the object metadata.

Since the present invention is configured so that the sound reflects the movement of the target object corresponding to the sound source or the characteristics of the environment in which the sound is generated, it can provide more three-dimensional and realistic sound and thereby provide high-quality content.

In addition, since the present invention is configured to perform sound processing through a mobile edge computing server, it can solve at once the problem of delay in content transmission, the problem of overhead on the user terminal, and the problem of rising user terminal prices.

In addition, since the present invention can reflect the user's intention in the selection of deep metadata, the selection of a target object, whether environmental characteristics are reflected, and so on, it can provide content that matches the preferences and needs of each individual user.

FIG. 1 is a diagram illustrating a sound processing apparatus and related components according to an embodiment of the present invention.
FIG. 2 is a block diagram schematically illustrating a sound processing apparatus according to an embodiment of the present invention.
FIG. 3 is a flowchart for explaining an embodiment of the present invention for a sound processing method.
FIG. 4 is a diagram for explaining various examples of deep metadata according to the present invention.
FIG. 5 is a flowchart for explaining an embodiment of the present invention in which sound is processed based on the movement of a target object.
FIG. 6 is a flowchart for explaining an embodiment of the present invention in which sound is processed by reflecting the characteristics of the environment in which the sound is generated.
FIG. 7 is a flowchart for explaining an embodiment of the present invention in which sound upscaling based on the movement of a target object and application of an effect reflecting environmental characteristics are performed together.

Hereinafter, some embodiments of the present invention will be described in detail with reference to exemplary drawings. In adding reference numerals to the components of each drawing, it should be noted that the same components are given the same reference numerals wherever possible, even when they appear in different drawings. In addition, in describing the present invention, if it is determined that a detailed description of a related known configuration or function may obscure the gist of the present invention, that detailed description is omitted.

In addition, in describing the components of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms only distinguish one component from another, and they do not limit the nature, sequence, or order of the component. Throughout the specification, when a part is said to 'include' or 'comprise' a component, this means that it may further include other components rather than excluding them, unless otherwise stated. In addition, terms such as '... unit' and 'module' described in the specification mean a unit that processes at least one function or operation, which may be implemented as hardware, software, or a combination of hardware and software.

FIG. 1 is a diagram illustrating a sound processing apparatus (hereinafter referred to as a 'processing apparatus') 100 and related components according to an embodiment of the present invention. Hereinafter, the processing apparatus 100 and related components according to the present invention will be described with reference to FIG. 1.

The sound processing method of the present invention may be implemented in one or more of a content server 110, a mobile edge computing (MEC) server 120, and a user terminal (UE, user equipment) 130. In other words, the processing apparatus 100 of the present invention may be implemented in the form of one or more of the content server 110, the mobile edge computing server 120, and the user terminal 130.

The content server 110 illustrated in FIG. 1 is a component that manages and provides various types of content, and may be referred to by various names such as a content management system. The content server 110 may correspond to a server owned, managed, or operated by a content provider.

Since the mobile edge computing server 120 and/or the user terminal 130 receive various types of content from the content server 110, the content server 110 corresponds to an external data source from the standpoint of the mobile edge computing server 120 and the user terminal 130.

The mobile edge computing server 120 is located between the content server 110 and the user terminal 130 with respect to the direction in which content is provided. It stores and manages the content transmitted from the content server 110 and provides content corresponding to a user's request to the user terminal 130. The mobile edge computing server 120 may correspond to a server provided by a telecommunications operator.

Mobile edge computing is a concept developed to solve problems such as network performance degradation, overload, and delay caused by the explosive increase in network traffic. In mobile edge computing, the mobile edge computing server 120 is placed close to the network edge, and at least a certain proportion of the data is processed through this server 120. In other words, mobile edge computing can be seen as an extension of cloud computing services to the network edge.

The user terminal 130 is a device owned or managed by a content user. As shown in FIG. 1, the user terminal 130 may include a mobile terminal, a laptop computer, a desktop computer, and the like. In addition, although not shown in FIG. 1, the user terminal 130 may also include a TV, a vehicle-mounted display device, an augmented reality (AR)/virtual reality (VR) device, and the like.

When the processing apparatus 100 of the present invention is implemented in the form of the content server 110, the sound processing method proposed in the present invention may be performed in the content server 110. In this case, the processed sound may be transmitted directly from the content server 110 to the user terminal 130.

When the processing apparatus 100 of the present invention is implemented in the form of the user terminal 130, the sound processing method proposed in the present invention may be performed in the user terminal 130. In this case, the content server 110 does not perform sound processing and may play only the limited role of transmitting the content corresponding to the user's request to the user terminal 130. The user terminal 130 may itself process the sound of the transmitted (unprocessed) content and provide it to the user.

When the processing apparatus 100 of the present invention is implemented in the form of the mobile edge computing server 120, the sound processing method proposed in the present invention may be performed in the mobile edge computing server 120.

When the present invention is implemented in the mobile edge computing server 120, high-quality content can be provided through communication between the user terminal 130 and the mobile edge computing server 120, which sits at a relatively close location (the edge of the network). This can solve the transmission delay problem that occurs in conventional cloud computing methods.

In addition, since data processing (sound processing) is performed in the mobile edge computing server 120 and the user terminal 130 only presents the processed data, the overhead of the user terminal 130 can be reduced. Accordingly, high-quality sound (sound to which the sound processing method of the present invention has been applied) can be delivered even on a user terminal 130 built from relatively low-specification hardware.

Hereinafter, the present invention will be described with a focus on an embodiment in which the processing apparatus 100 is implemented in the form of the mobile edge computing server 120.

The mobile edge computing server 120, that is, the processing apparatus 100, is an apparatus that receives various types of content from the content server 110 through a core network, extracts deep metadata from the images included in the content, and then processes the sound included in the content (the sound corresponding to those images) based on the extracted metadata.

The content provided from the content server 110 to the processing apparatus 100 of the present invention may include composite content in which an image is combined with the sound corresponding to that image, such as media content, movie content, game content, drama content, and educational content.

The sound that makes up content in combination with an image may include the voice of a character appearing in the content, sound generated by the character's movement, sound generated by collisions between objects shown in the image of the content, and the like.

In addition, sound that is generated not by a character or an object but by external environmental factors shown in the image, such as the sound of rain or the sound caused by lightning, may also be included in the sound to be processed.

The sound processing proposed by the present invention may include upscaling, which applies a three-dimensional effect by adjusting the volume of each channel through different or identical weights applied per sound channel based on the deep metadata extracted from the image, and applying various effects to the sound based on the deep metadata.
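The per-channel weighting described above can be illustrated with a minimal sketch. It assumes stereo sound stored as (left, right) sample pairs and a hypothetical linear panning rule that turns a target object's horizontal position into channel weights; the function names and the panning rule are illustrative assumptions, not part of the patent.

```python
from typing import List, Sequence, Tuple


def weights_from_position(x: float) -> Tuple[float, float]:
    """Map a horizontal position x in [0, 1] (0 = left edge, 1 = right edge)
    to (left, right) channel weights. Linear panning is an assumption."""
    x = min(max(x, 0.0), 1.0)
    return (1.0 - x, x)


def apply_channel_weights(
    frames: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[Tuple[float, ...]]:
    """Scale every channel of every sample frame by that channel's weight."""
    out = []
    for frame in frames:
        if len(frame) != len(weights):
            raise ValueError("one weight per channel is required")
        # Keep results inside the normalized sample range [-1.0, 1.0].
        out.append(tuple(min(max(s * w, -1.0), 1.0) for s, w in zip(frame, weights)))
    return out


# An object located a quarter of the way from the left edge.
stereo = [(0.5, 0.5), (-0.25, -0.25)]
panned = apply_channel_weights(stereo, weights_from_position(0.25))
```

Identical weights on all channels change only the overall volume, while different weights shift the perceived position of the sound source.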

The sound processing proposed by the present invention can be understood as channel expansion or dimension upscaling, in that it applies a three-dimensional effect or sense of realism that did not exist in the original (unprocessed) sound.

The processing apparatus 100 of the present invention may provide high-quality content to the user by transmitting the content including the processed sound to the user terminal 130 through a base station or the like.

FIG. 2 is a block diagram schematically illustrating the processing apparatus 100 according to an embodiment of the present invention, and FIG. 3 is a flowchart for explaining an embodiment of the present invention for a sound processing method.

As shown in FIG. 2, the processing apparatus 100 may include an I/O interface unit 210, an extraction unit 220, a selection unit 230, a processing unit 240, and a metadata storage unit 250.

First, 'content including an image and the sound corresponding to it' is obtained through the I/O interface unit 210 (S310). The content may be obtained through the I/O interface unit 210 by transmission from the content server 110, or by storing the content transmitted from the content server 110 in a memory or the like in advance and then accessing that memory.

Metadata means data that describes the properties of data, and deep metadata means metadata in which useful features (metadata) included in content are extracted through AI (artificial intelligence), that is, machine learning or deep learning techniques.

In general, deep metadata extracted from an image may include the face, emotion, and movement of a person shown in the image, as well as the sound source, situation, dialogue, subtitles, surrounding environment, and the like. Since these various forms of deep metadata represent useful properties of the image, they can be used to enhance the content experience and improve user convenience.

Representative technologies that use deep metadata to enhance the content experience and improve user convenience include shot identification, intro/ending auto detection, alternative poster, and metadata composition.

Shot identification is a technology that identifies the boundaries at which the camera shot changes, based on the similarity between the frames included in an image. This technology judges the similarity between frames by considering characteristics of the image such as the genre of the content (movie, drama, entertainment show, etc.) and the situation recognized from the frame (night/day, movement speed, fade in/out, etc.).
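As a rough illustration of similarity-based shot boundary detection, the sketch below compares consecutive frames by histogram intersection and marks a boundary where the similarity drops below a threshold. The histogram measure, the threshold value, and the representation of a frame as a flat list of 8-bit gray values are all assumptions; the genre and situation cues mentioned above are not modeled.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 4) -> List[float]:
    """Normalized intensity histogram of a frame given as 8-bit gray values."""
    counts = [0] * bins
    for p in frame:
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def similarity(a: Sequence[int], b: Sequence[int], bins: int = 4) -> float:
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint ones."""
    return sum(min(x, y) for x, y in zip(histogram(a, bins), histogram(b, bins)))


def shot_boundaries(frames: Sequence[Sequence[int]], threshold: float = 0.5) -> List[int]:
    """Indices i at which frame i starts a new shot, i.e. its similarity
    to frame i - 1 falls below the threshold."""
    return [
        i for i in range(1, len(frames))
        if similarity(frames[i - 1], frames[i]) < threshold
    ]


dark, bright = [0] * 8, [255] * 8
cuts = shot_boundaries([dark, dark, bright, bright])
```

Here the only boundary is at index 2, where the frames switch from dark to bright.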

Intro/ending auto detection is a technology that automatically detects the intro and ending sections of content. It specializes in detecting intro/ending sections by using shot identification, and it can be used to skip the intro and ending parts for the convenience of binge watching, in which several episodes of a drama are watched in succession.

Alternative poster is a technology that extracts, from a piece of content, candidate images that can represent that content. It can be used to generate a summary video of the content from several candidate images.

Metadata composition is a technology that creates a metadata set by combining multiple pieces of metadata having the same or a similar meaning. It can be used to extract and provide frames that contain the same or similar scenes within content. Here, a frame containing the same or a similar scene may be a frame that the user wants to watch, and the user's request may be entered through voice recognition. This technology can therefore be combined with natural language understanding.
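The grouping step of metadata composition might look like the following sketch, which merges tags with the same or a similar meaning into one metadata set. The `canonical` synonym map and the (label, frame index) tag format are hypothetical; how similarity of meaning is actually determined is not specified in the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def compose(
    tags: Iterable[Tuple[str, int]], canonical: Dict[str, str]
) -> Dict[str, List[int]]:
    """Group (label, frame_index) tags into metadata sets keyed by a canonical label.
    Labels missing from the synonym map are used as their own canonical label."""
    sets: Dict[str, List[int]] = defaultdict(list)
    for label, frame_index in tags:
        sets[canonical.get(label, label)].append(frame_index)
    return dict(sets)


tags = [("car", 1), ("automobile", 5), ("rain", 2)]
metadata_sets = compose(tags, {"automobile": "car"})
```

Looking up `metadata_sets["car"]` then yields every frame tagged with either label, which is the kind of lookup a request for similar scenes would need.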

In addition to the representative technologies above, deep metadata can be used for automatic recommendation of cast images for each piece of content, scene-by-scene viewing by person or sound source, extraction and search of deep metadata within content, section search within specific content, and the like.

The extraction unit 220 extracts one or more pieces of deep metadata from the obtained image (S320). An example of how the extraction unit 220 extracts deep metadata is as follows.

Baseline data with which persons, situations, places, and the like can be distinguished and recognized in a given image is built into a database (baseline DB). The extraction unit 220 may be trained in advance using this baseline DB. Meanwhile, the image of each frame of the video (video signal) included in the content is also built into a database (image DB).

The extraction unit 220 may distinguish and recognize persons, situations, places, and the like by comparing the baseline DB with the image DB for each frame and applying a correlation operation. The extraction unit 220 may then complete the deep metadata extraction process by extracting the 'recognized persons, situations, places, etc.' from a specific frame. In this process, a moving window technique may be applied to extract deep metadata from the entire content.
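One way to picture the comparison against the baseline DB is the sketch below, which scores a frame's feature vector against each baseline entry with the Pearson correlation coefficient and returns the best-matching label. The feature vectors, the dictionary layout of the baseline DB, and the threshold are assumptions; the source does not say which correlation operation or features are used, and the moving window step is not modeled.

```python
import math
from typing import Dict, Optional, Sequence


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either vector has no variance."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def recognize(
    features: Sequence[float],
    baseline: Dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the baseline label that correlates best with the frame features,
    or None if no label reaches the threshold."""
    best_label, best_score = None, threshold
    for label, reference in baseline.items():
        score = correlation(features, reference)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


baseline_db = {"person": [1, 2, 3, 4], "place": [4, 3, 2, 1]}
label = recognize([2, 4, 6, 8], baseline_db)
```

A frame whose features do not correlate strongly with any baseline entry yields `None`, so no deep metadata would be emitted for it.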

The extraction unit 220 may be configured to extract deep metadata from the entire video at a preset time period, or to extract deep metadata at one or more preset points in time.

The time period and the points in time at which deep metadata is extracted may be set variably depending on whether sound processing is needed. For example, the extraction unit 220 may be configured to extract deep metadata only from those images, among all the images included in the content, that need sound processing, such as images involving a character's voice, a character's movement, or collisions between objects.
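The two scheduling options, a fixed period and a set of preset time points, can be sketched as a small helper that returns the timestamps at which extraction would run. The function name and the use of seconds as the unit are assumptions made for illustration.

```python
from typing import Iterable, List, Optional


def extraction_times(
    duration: float,
    period: Optional[float] = None,
    points: Optional[Iterable[float]] = None,
) -> List[float]:
    """Timestamps (in seconds) at which deep metadata would be extracted.
    `period` yields 0, period, 2 * period, ... up to the duration;
    `points` adds preset time points that fall inside the video."""
    times = set()
    if period is not None:
        if period <= 0:
            raise ValueError("period must be positive")
        times.update(i * period for i in range(int(duration // period) + 1))
    if points is not None:
        times.update(t for t in points if 0 <= t <= duration)
    return sorted(times)


periodic = extraction_times(10, period=4)
preset = extraction_times(10, points=[12, 3, 7])
```

Restricting extraction to scenes that need sound processing would amount to passing only the time points of those scenes as `points`.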

The selection unit 230 selects, from among the extracted deep metadata, the deep metadata related to the target object (object metadata) (S330). Here, the target object means a sound source that generates the sound included in the content, and it may include various elements capable of generating sound, such as a person, an animation character, a game character, or a thing appearing in the content.

The processing unit 240 processes the sound based on the deep metadata related to the target object (object metadata), or as indicated by this object metadata (S340).

When the sound processing is complete, the I/O interface unit 210 transmits the content including the processed sound to the user terminal 130, thereby providing that (high-quality) content to the user (S350).
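Steps S320 to S350 can be summarized as a short pipeline sketch. The `DeepMetadata` and `Content` types, the `is_sound_source` flag, and the stub extractor and processor are hypothetical stand-ins; the actual extraction and sound processing logic is outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class DeepMetadata:
    label: str             # what was recognized, e.g. "person"
    frame_index: int       # frame the item was extracted from
    is_sound_source: bool  # True if the item relates to an object generating the sound


@dataclass
class Content:
    frames: Sequence       # video frames (opaque in this sketch)
    sound: List[float]     # audio samples


def select_object_metadata(items: Sequence[DeepMetadata]) -> List[DeepMetadata]:
    """S330: keep only deep metadata related to the target object."""
    return [m for m in items if m.is_sound_source]


def process_content(
    content: Content,
    extract: Callable[[Sequence], List[DeepMetadata]],
    process: Callable[[List[float], List[DeepMetadata]], List[float]],
) -> Content:
    deep = extract(content.frames)               # S320: extract deep metadata
    objects = select_object_metadata(deep)       # S330: select object metadata
    new_sound = process(content.sound, objects)  # S340: process the sound
    return Content(content.frames, new_sound)    # ready for transmission (S350)


def fake_extract(frames):
    # Stub: pretends one sound source and one background item were found.
    return [DeepMetadata("person", 0, True), DeepMetadata("background_music", 0, False)]


def fake_process(sound, objects):
    # Stub: boosts the volume by the number of selected objects.
    gain = 1.0 + len(objects)
    return [s * gain for s in sound]


result = process_content(Content(["f0"], [0.1, 0.2]), fake_extract, fake_process)
```

Passing the extractor and processor as callables keeps the flow independent of where it runs, whether on the content server, the mobile edge computing server, or the user terminal.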

Depending on the embodiment, the sound processing method proposed by the present invention may be implemented on the premise that a request for content has been sent by the user, or regardless of whether such a request has been sent.

The former means that, after a content request is received, the sound processing method is applied to the sound of the content corresponding to the request, and the sound-processed content is then transmitted to the user terminal 130. The latter means that the sound processing method is applied to the obtained content in advance, without a content request having been received, and the content is transmitted to the user terminal 130 only when a request for it is later received.
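The difference between processing on request and processing in advance can be sketched with a small cache. The class name, the string-based stand-in for content, and the choice to cache results of on-request processing are illustrative assumptions.

```python
from typing import Callable, Dict


class ProcessingCache:
    """Holds raw content and, where available, its sound-processed version."""

    def __init__(self, process: Callable[[str], str]):
        self._process = process
        self._raw: Dict[str, str] = {}
        self._done: Dict[str, str] = {}

    def ingest(self, content_id: str, content: str, preprocess: bool) -> None:
        """Store content; with preprocess=True it is processed before any request."""
        self._raw[content_id] = content
        if preprocess:
            self._done[content_id] = self._process(content)

    def request(self, content_id: str) -> str:
        """Return processed content, processing it now if that has not happened yet."""
        if content_id not in self._done:
            self._done[content_id] = self._process(self._raw[content_id])
        return self._done[content_id]


calls = []


def fake_process(content: str) -> str:
    calls.append(content)   # record when processing actually runs
    return content.upper()  # stand-in for sound processing


server = ProcessingCache(fake_process)
server.ingest("a", "movie", preprocess=True)   # processed in advance
server.ingest("b", "drama", preprocess=False)  # processed only when requested
```

Processing in advance moves the cost out of the request path, at the price of processing content that may never be requested.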

Depending on the embodiment, the sound processing method proposed by the present invention may be implemented selectively according to whether sound processing is needed.

When the image indicates a need to give a three-dimensional effect to the sound generated by the target object, for example because a character is speaking, a character is moving, or a character is visible in the image, it may be determined that sound processing is needed.

It may also be determined that sound processing is needed when the image indicates a need to give the sound a corresponding effect, for example because objects collide, the place is confined, the weather changes, or the season changes.

The processing apparatus 100 of the present invention may itself determine whether sound processing is needed as described above, and may selectively apply or implement the sound processing method proposed in the present invention according to the result of that determination.
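The necessity check described in the preceding paragraphs can be sketched as a lookup over cues recognized in the image. The cue names are hypothetical labels for the examples given in the text, and the two-flag result is an assumed way of separating the need for a three-dimensional effect from the need for an environmental effect.

```python
from typing import Iterable, Tuple

# Cues that suggest the sound of a target object needs a three-dimensional effect.
SPATIAL_CUES = {"character_voice", "character_movement", "character_visible"}
# Cues that suggest the sound needs an effect reflecting the environment.
EFFECT_CUES = {"collision", "narrow_place", "weather_change", "season_change"}


def processing_needs(cues: Iterable[str]) -> Tuple[bool, bool]:
    """Return (needs_spatial_processing, needs_effect) for the cues found in an image."""
    found = set(cues)
    return (bool(found & SPATIAL_CUES), bool(found & EFFECT_CUES))


needs = processing_needs(["character_movement", "weather_change", "subtitle"])
```

If both flags are false, the apparatus could skip sound processing for that part of the content.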

Depending on the embodiment, the sound processing method proposed by the present invention may also be selectively implemented according to the user's intention regarding sound processing. That is, the processing apparatus 100 may apply or implement the sound processing method on the premise that the user's intention to execute sound processing has been received through the user terminal 130.

FIG. 4 is a diagram illustrating various examples of deep metadata according to the present invention.

As described above, the deep metadata extracted from the image may include the face, emotion, and movement of a person appearing in the image, as well as the sound source, situation, dialogue, subtitles, surrounding environment, and the like.

As shown in FIG. 4, the deep metadata extracted by the extraction unit 220 may include object metadata, which is deep metadata related to the target object generating the sound, and background metadata, which is deep metadata related to the background music of the image.

The object metadata may include motion metadata, which is deep metadata related to the movement of the target object; context metadata, which is deep metadata related to the situation (context) of the image in which the sound is generated; and recognition metadata, which is deep metadata indicating whether the target object is present in the image.

Recognition metadata is deep metadata indicating whether the target object is visually recognized in the image. For example, if the target object is a game character and that game character is not recognized in the image, the recognition metadata may be Off.

The motion metadata may include metadata indicating whether a character appearing in the content (a person, a game character, an animation character, etc.) is moving, the direction of the movement, and the degree (magnitude) of the movement.

As shown in FIG. 4, the motion metadata may be kept separately for each character appearing in the content (target object 1, target object 2, etc.), and the movement direction indicated by the motion metadata may include Up/Down, Left/Right, and Front/Back relative to a specific point (reference point) in the image or on the screen.

That is, the motion metadata may include metadata indicating each of Up/Down, Left/Right, and Front/Back. The degree of movement indicated by the motion metadata may be expressed as a numerical value for each of Up/Down, Left/Right, and Front/Back.

The situation of the image indicated by the context metadata is an environmental factor that can alter the inherent sound generated by the target object itself, and may be understood as the surrounding environment, surrounding conditions, and the like.

As shown in FIG. 4, the context metadata may include deep metadata related to the place where the sound occurs, deep metadata related to the time at which the sound occurs, deep metadata related to the weather or season in which the sound occurs (sunny, cloudy, rainy, etc.), and deep metadata related to whether target objects collide. The deep metadata related to the place may include metadata indicating whether the place is indoors or outdoors and metadata indicating the size of the place.

After being extracted from the image, the deep metadata comprising these various sub-metadata may be linked to the image and stored in the metadata storage unit 250.
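The hierarchy of deep metadata described with reference to FIG. 4 could be modeled along the following lines. This is a minimal sketch under assumptions: every class and field name is hypothetical, each direction pair is assumed to be stored as a tuple of two signed values, and the disclosure does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Pair = Tuple[float, float]


@dataclass
class MotionMetadata:
    # Signed magnitudes for each direction pair, relative to a reference point.
    up_down: Pair = (0.0, 0.0)
    left_right: Pair = (0.0, 0.0)
    front_back: Pair = (0.0, 0.0)

    @property
    def is_moving(self) -> bool:
        pairs = (self.up_down, self.left_right, self.front_back)
        return any(value != 0 for pair in pairs for value in pair)


@dataclass
class ContextMetadata:
    indoor: Optional[bool] = None       # indoor/outdoor flag for the place
    place_size: Optional[float] = None  # size of the place
    time: Optional[str] = None          # time at which the sound occurs
    weather: Optional[str] = None       # e.g. "sunny", "cloudy", "rainy"
    collision: Optional[float] = None   # collision magnitude, if any


@dataclass
class ObjectMetadata:
    recognized: bool = False            # recognition metadata (On/Off)
    motion: MotionMetadata = field(default_factory=MotionMetadata)
    context: ContextMetadata = field(default_factory=ContextMetadata)


@dataclass
class DeepMetadata:
    # One entry per target object (target object 1, target object 2, ...).
    objects: Dict[str, ObjectMetadata] = field(default_factory=dict)
    background: Dict[str, str] = field(default_factory=dict)
```

Keying `objects` by target object mirrors the per-character separation of motion metadata shown in FIG. 4.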

FIG. 5 is a flowchart illustrating an embodiment of the present invention in which sound is processed based on the movement of a target object. An embodiment in which the sound is upscaled based on motion metadata to give it a three-dimensional effect is described below with reference to FIG. 5.

First, as shown in FIG. 5, content including an image and its corresponding sound is transmitted from the content server 110 to the processing apparatus 100 (S510), and one or more deep metadata are extracted from the image (S520).

The selection unit 230 selects motion metadata from among the extracted deep metadata (S540). Depending on the embodiment, the selection unit 230 may be configured to select only the motion metadata for a specific character intended by the user.

To this end, before the motion metadata is selected (S540), a selection signal indicating one of the characters appearing in the content may be received from the user terminal 130 (S530). The selection signal corresponds to a signal or data input by the user through the user terminal 130.

When the present invention is configured in this way to select the motion metadata for a specific character intended by the user, it can perform sound processing (upscaling) for the character that matches the user's intention. The present invention can therefore satisfy the varied needs of individual users and provide a personalized service.
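The selection step (S540), optionally narrowed by a selection signal (S530), might look like the sketch below. The function name, the dictionary layout, and the `"motion"` key are hypothetical choices made for illustration only.

```python
def select_motion_metadata(deep_metadata, selected_character=None):
    """Pick motion metadata out of the extracted deep metadata (S540).

    deep_metadata: dict mapping a character name to its metadata dict.
    selected_character: name carried by the selection signal (S530), or None.
    """
    motion = {
        name: meta["motion"]
        for name, meta in deep_metadata.items()
        if "motion" in meta
    }
    if selected_character is None:
        return motion  # no selection signal: keep every character
    if selected_character not in motion:
        return {}  # the selected character has no motion metadata
    return {selected_character: motion[selected_character]}
```

Returning an empty dictionary for an unknown character leaves the sound untouched rather than raising an error; a real implementation might prefer to report the mismatch to the user terminal.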

Once the motion metadata has been selected, the processing unit 240 upscales the sound generated by the corresponding character based on the selected motion metadata, that is, according to the movement of the character indicated by the motion metadata (S550).

For example, when a character moves by +2 both upward and to the right, the Up/Down, Left/Right, and Front/Back values of the motion metadata may be +2/-2, -2/+2, and 0/0, respectively. The processing unit 240 may then perform sound upscaling by applying a weight proportional to +2 to each of the channels corresponding to the right and upward directions, and a weight proportional to -2 to each of the channels corresponding to the left and downward directions.

As another example, when a character moves by +3 in the direction toward the user viewing the image (Front), the motion metadata may indicate 0/0, 0/0, and +3/-3. The processing unit 240 may then perform upscaling by applying a weight proportional to +3 to the channel corresponding to the Front direction and a weight proportional to -3 to the channel corresponding to the Back direction.
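The weighting in the two examples above can be illustrated with a short sketch. The text only states that the weights are proportional to the motion values; the linear mapping `gain = 1 + k * value`, the constant `k`, the clamp at zero, and the channel names are all assumptions introduced here.

```python
def channel_gains(motion, k=0.1):
    """Map per-direction motion values to linear channel gains.

    motion: dict mapping a direction name to its signed motion value.
    The mapping gain = 1 + k * value (clamped at 0) is an assumed choice.
    """
    return {
        direction: max(0.0, 1.0 + k * value)
        for direction, value in motion.items()
    }


def upscale(channels, motion, k=0.1):
    """Scale every sample of each channel by the gain for its direction."""
    gains = channel_gains(motion, k)
    return {
        name: [sample * gains.get(name, 1.0) for sample in samples]
        for name, samples in channels.items()
    }


# First example from the text: +2 upward and +2 to the right.
up_right = {"up": 2, "down": -2, "left": -2, "right": 2, "front": 0, "back": 0}
gains = channel_gains(up_right)
```

With these assumptions the right and up channels are boosted, the left and down channels are attenuated, and the front and back channels keep unit gain, which matches the direction of the weighting described in the text.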

For convenience of explanation, the above examples of processing based on motion metadata assume that the character's movement starts from the reference point (0, 0). Accordingly, the values +2, -2, +3, -3, and so on denote the relative magnitude of the movement with respect to the character's original (previous) position.

Meanwhile, since an image consists of a plurality of pictures (frames), continuity in movement direction and degree of movement needs to be maintained between the motion metadata extracted from the picture at a given point in time and the motion metadata extracted from the picture at the previous point in time.

Accordingly, the motion metadata extracted at a given point in time may be expressed as a vector value whose origin is the movement direction and movement information of the motion metadata extracted at the previous point in time. That is, the motion metadata at a given point in time may be expressed as a value relative to the motion metadata at the previous point in time.
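One way to read the relative representation is as a per-frame difference of tracked positions. The sketch below takes that reading as an assumption: the positions are hypothetical (x, y, z) tuples with x growing to the right, y growing upward, and z growing toward the viewer, and neither function name comes from the disclosure.

```python
def relative_motion(positions):
    """Return per-frame motion vectors relative to the previous frame."""
    return [
        tuple(cur - prev for prev, cur in zip(previous, current))
        for previous, current in zip(positions, positions[1:])
    ]


def to_direction_pairs(delta):
    """Express a motion vector as Up/Down, Left/Right, and Front/Back pairs."""
    x, y, z = delta
    return {
        "up_down": (y, -y),
        "left_right": (-x, x),
        "front_back": (z, -z),
    }


# A character moves up and to the right by 2, then toward the viewer by 3.
track = [(0, 0, 0), (2, 2, 0), (2, 2, 3)]
deltas = relative_motion(track)
```

Because each vector is measured from the previous frame, the values chain from one frame to the next, which is one way to preserve the continuity the text calls for.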

When sound processing is complete, the I/O interface unit 210 transmits the content containing the processed (upscaled) sound to the user terminal 130, thereby providing the user with high-quality content (S560).

Depending on the embodiment, the extraction of deep metadata from the image (S520) may be performed at a different time from the selection of motion metadata (S540) and the upscaling of the sound (S550).

For example, the processing apparatus 100 may extract deep metadata for the entire image received from the content server 110 and store it in the deep metadata storage unit 250, and then perform the selection of motion metadata (S540) and the upscaling of the sound (S550) on condition that the user terminal 130 requests the image.
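The time-separated flow in this example can be sketched as two entry points sharing a store. `MetadataStore`, `preprocess`, and `on_request` are hypothetical names, the store is a plain in-memory dictionary, and the extraction, selection, and upscaling steps are passed in as callables because their internals are not specified here.

```python
class MetadataStore:
    """In-memory stand-in for the metadata storage unit."""

    def __init__(self):
        self._by_content = {}

    def save(self, content_id, metadata):
        self._by_content[content_id] = metadata

    def load(self, content_id):
        return self._by_content.get(content_id)


def preprocess(content_id, image, store, extract):
    """Run extraction (S520) ahead of any request and store the result."""
    store.save(content_id, extract(image))


def on_request(content_id, store, select, upscale, sound):
    """Run selection (S540) and upscaling (S550) when the content is requested."""
    metadata = store.load(content_id)
    if metadata is None:
        raise KeyError(f"no stored metadata for {content_id!r}")
    return upscale(sound, select(metadata))
```

Separating the two entry points keeps the comparatively expensive image analysis off the request path.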

Depending on the embodiment, the selection signal may be received (S530) before the deep metadata is extracted (S520). For example, the processing apparatus 100 may perform deep metadata extraction (S520), selection of the motion metadata of the character corresponding to the selection signal (S540), and sound upscaling (S550) on condition that a selection signal is received from the user terminal 130. In yet another embodiment, the selection signal may be received (S530) before the content is received from the content server 110 (S510).

Depending on the embodiment, the method described above (sound upscaling based on motion metadata) may be selectively turned On or Off according to the user's intention. When the user turns Off the entire sound processing function, or turns On only the effect application function described later, the upscaling method that provides a three-dimensional effect may not be performed. Conversely, when the user turns On the entire sound processing function, or turns Off only the effect application function, the upscaling method may be performed.

FIG. 6 is a flowchart illustrating an embodiment of the present invention in which sound is processed so as to reflect the environmental characteristics in which the sound occurs. A method of applying various effects to the sound based on context metadata is described below with reference to FIG. 6.

First, content including an image and its corresponding sound is acquired (S610), and one or more deep metadata are extracted from the image (S620), in the same manner as described above.

The selection unit 230 selects, from among the extracted deep metadata, context metadata, which is deep metadata related to the situation of the image (S640). Depending on the embodiment, the selection unit 230 may be configured to use a selection signal received from the user terminal 130 (S630) to select only the context metadata for a specific situation intended by the user, whereby the present invention can provide a personalized service.

Once the context metadata has been selected, the processing unit 240 applies to the sound various effects corresponding to the situation indicated by the selected context metadata (S650).

For example, when the context metadata indicates that the place where the sound occurs is narrow, the processing unit 240 may apply a reverb effect to the sound to give it the reverberation (a sense of realism or space) that results from the narrowness of the place.

The context metadata may indicate the narrowness of the place as On or Off, or may indicate the degree of narrowness or the size of the place numerically. When the context metadata indicates the degree of narrowness numerically, the processing unit 240 may apply the reverb effect in proportion to that value.
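A reverb whose strength scales with the narrowness value can be sketched as follows. The disclosure does not specify a reverb algorithm; a single feedback comb filter is swapped in here as a simple stand-in, and the normalization of `narrowness` to the range 0 to 1, the delay, and the feedback amount are all assumptions.

```python
def comb_reverb(samples, delay, feedback):
    """Feedback comb filter: y[n] = x[n] + feedback * y[n - delay]."""
    out = []
    for n, x in enumerate(samples):
        echo = out[n - delay] if n >= delay else 0.0
        out.append(x + feedback * echo)
    return out


def apply_reverb(samples, narrowness, delay=3, feedback=0.5):
    """Mix dry and reverberated sound in proportion to narrowness (0 to 1)."""
    narrowness = min(1.0, max(0.0, narrowness))
    wet = comb_reverb(samples, delay, feedback)
    return [
        (1.0 - narrowness) * dry + narrowness * rev
        for dry, rev in zip(samples, wet)
    ]


impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
echoes = comb_reverb(impulse, delay=3, feedback=0.5)
# echoes == [1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.25]
```

An On/Off narrowness flag corresponds to calling `apply_reverb` with 1.0 or 0.0; a production reverb would typically combine several comb and all-pass stages.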

As another example, when the context metadata indicates a collision of target objects, the processing unit 240 may apply a high-pass filter to the sound to emphasize its high-frequency band, so that the sound produced by the collision is rendered more realistically. The context metadata may indicate the magnitude of the collision numerically, in which case the processing unit 240 may apply the sound emphasis effect in proportion to that value.
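The collision emphasis can be sketched with a first-order high-pass filter whose output is added back to the original sound. The filter form, the coefficient `alpha`, the scaling constant `k`, and both function names are assumptions made for illustration; the text only states that a high-pass filter is applied and that the effect is proportional to the collision magnitude.

```python
def high_pass(samples, alpha=0.8):
    """First-order high-pass: y[n] = alpha * (y[n-1] + x[n] - x[n-1])."""
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def emphasize_collision(samples, magnitude, k=0.5):
    """Add the high-frequency band back in proportion to the collision magnitude."""
    highs = high_pass(samples)
    return [x + k * magnitude * h for x, h in zip(samples, highs)]
```

A rapidly alternating signal is boosted by this sketch while a steady signal is left almost unchanged once the filter settles, and a magnitude of 0 returns the input as is.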

When sound processing is complete, the I/O interface unit 210 transmits the content containing the processed sound to the user terminal 130, thereby providing the user with high-quality content (S660).

As in the embodiment described with reference to FIG. 5, in the embodiment described with reference to FIG. 6 the extraction of deep metadata from the image (S620) may be performed at a different time from the selection of context metadata (S640) and the application of the effect (S650).

In addition, the selection signal may be received (S630) before the deep metadata is extracted (S620), or before the content is received from the content server 110 (S610).

The method described above (applying an effect to the sound based on context metadata) may also be selectively turned On or Off according to the user's intention. When the user turns Off the entire sound processing function, or turns On only the upscaling function described above, the effect application method may not be performed. Conversely, when the user turns On the entire sound processing function, or turns Off only the upscaling function, the effect application method may be performed.

FIG. 7 is a flowchart illustrating an embodiment of the present invention in which sound upscaling based on the movement of a target object and effect application reflecting environmental characteristics are performed together.

The method of giving a three-dimensional effect to the sound and the method of applying an effect to the sound have each been described separately with reference to FIGS. 5 and 6. The embodiment described below with reference to FIG. 7 corresponds to an embodiment in which both methods are applied to the same sound.

First, content including an image and its corresponding sound is acquired (S710), and one or more deep metadata are extracted from the image (S720), in the same manner as described above.

The selection unit 230 selects object metadata from the extracted deep metadata (S730), and the processing unit 240 performs sound upscaling based on the object metadata (S740).

Meanwhile, the selection unit 230 selects context metadata from the extracted deep metadata (S750), and the processing unit 240 applies to the upscaled sound an effect corresponding to the situation indicated by the context metadata (S760).

When sound processing (upscaling and effect application) is complete, the I/O interface unit 210 transmits the content containing the processed sound to the user terminal 130, thereby providing the user with high-quality content (S770).

As in the embodiments described with reference to FIGS. 5 and 6, in the embodiment described with reference to FIG. 7 the extraction of deep metadata from the image (S720) may be performed at a different time from the selection of metadata (S730, S750) and the processing of the sound (S740, S760). The method described above (performing upscaling and effect application together) may also be selectively turned On or Off according to the user's intention.
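The combined flow of FIG. 7, together with the On/Off switches, could be arranged as in the sketch below. `run_pipeline`, the `"motion"` and `"context"` keys, and the two boolean flags are hypothetical; the upscaling and effect steps are passed in as callables so that the sketch makes no claim about their internals.

```python
def run_pipeline(sound, deep_metadata, upscale, apply_effect,
                 upscaling_on=True, effects_on=True):
    """Upscale first, then apply the effect to the upscaled sound."""
    result = sound
    motion = deep_metadata.get("motion")        # selection for upscaling (S730)
    if upscaling_on and motion is not None:
        result = upscale(result, motion)        # S740
    context = deep_metadata.get("context")      # selection for effects (S750)
    if effects_on and context is not None:
        result = apply_effect(result, context)  # S760, on the upscaled sound
    return result
```

Turning both flags Off, or supplying metadata with neither key, returns the sound unchanged, which corresponds to the case where the entire sound processing function is Off.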

Although FIGS. 3, 5, 6, and 7 describe the processes as being executed sequentially, this merely illustrates the technical idea of an embodiment of the present invention. In other words, a person of ordinary skill in the art to which an embodiment of the present invention belongs could apply various modifications and variations, such as changing the order described in FIGS. 3, 5, 6, and 7 or executing one or more of the processes in parallel, without departing from the essential characteristics of the embodiment. FIGS. 3, 5, 6, and 7 are therefore not limited to a time-series order.

Meanwhile, the processes shown in FIGS. 3, 5, 6, and 7 can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which data readable by a computer system is stored. That is, the computer-readable recording medium includes storage media such as magnetic storage media (for example, ROM, floppy disks, hard disks, etc.) and optically readable media (for example, CD-ROM, DVD, etc.). The computer-readable recording medium may also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

The above description merely illustrates the technical idea of the present embodiment, and a person of ordinary skill in the art to which the present embodiment belongs will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the present embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present embodiment.

Claims

A method of processing sound included in content, the method comprising:
extracting one or more deep metadata from an image included in the content;
selecting object metadata from among the extracted deep metadata; and
processing the sound, based on the object metadata, by upscaling the sound or by applying to the sound an effect corresponding to a situation in the image,
wherein the object metadata includes at least one of motion metadata, which is deep metadata related to the movement within the image of a target object generating the sound, context metadata, which is deep metadata related to the situation (context) in the image, and recognition metadata, which is deep metadata indicating whether the target object is present in the image, and
wherein the processing comprises upscaling the sound by applying a weight to each channel of the sound according to the movement of the target object in the image indicated by the motion metadata.

(deleted)

The method of claim 1,
wherein the target object is a character appearing in the content, and
wherein the character is the character indicated by a selection signal input by a user from among one or more characters appearing in the content.

The method of claim 1,
wherein the processing comprises processing the sound by applying to the sound an effect corresponding to the situation in the image indicated by the context metadata.

A computer-readable recording medium on which is recorded a program for causing a computer to execute the method of any one of claims 1, 3, and 4.

An apparatus for processing sound included in content, the apparatus comprising:
an extraction unit configured to extract one or more deep metadata from an image included in the content;
a selection unit configured to select object metadata from among the extracted deep metadata; and
a processing unit configured to process the sound, based on the object metadata, by upscaling the sound or by applying to the sound an effect corresponding to a situation in the image,
wherein the object metadata includes at least one of motion metadata, which is deep metadata related to the movement within the image of a target object generating the sound, context metadata, which is deep metadata related to the situation (context) in the image, and recognition metadata, which is deep metadata indicating whether the target object is present in the image, and
wherein the processing unit upscales the sound by applying a weight to each channel of the sound according to the movement of the target object in the image indicated by the motion metadata.

(deleted)

The apparatus of claim 6,
wherein the target object is a character appearing in the content, and
wherein the character is the character indicated by a selection signal input by a user from among one or more characters appearing in the content.

The apparatus of claim 6,
wherein the processing unit processes the sound by applying to the sound an effect corresponding to the situation in the image indicated by the context metadata.