KR102620852B1

KR102620852B1 - Method and apparatus for providing foley sound using artificial intelligence

Info

Publication number: KR102620852B1
Application number: KR1020230082246A
Authority: KR
Inventors: 권한슬; 구도형; 설한울
Original assignee: 주식회사 스튜디오프리윌
Priority date: 2023-06-26
Filing date: 2023-06-26
Publication date: 2024-01-03

Abstract

An artificial intelligence-based Foley sound providing apparatus and method are disclosed. An artificial intelligence-based Foley sound providing method according to an embodiment includes: obtaining a target video including a subject performing a target action; determining context information related to the target action by recognizing, from the target video, properties of at least one object including the subject; determining a duration and an intensity of the target action by tracking a posture of the subject in the target video; generating a Foley sound corresponding to the target action of the subject based on the determined context information, duration, and intensity; and providing download data of the generated Foley sound, or download data in which the generated Foley sound and the target video are merged.

Description

Artificial intelligence-based foley sound providing device and method {METHOD AND APPARATUS FOR PROVIDING FOLEY SOUND USING ARTIFICIAL INTELLIGENCE}

The disclosure below relates to an apparatus and method for generating and providing Foley sound using artificial intelligence.

Foley sound is a sound effect mixed into a video to improve its audio quality or to make a scene feel realistic through sound, and it is created using various props and equipment. Mixing Foley sound is an essential part of video production, yet creating Foley sounds and adding them to a video is still done by hand.

Meanwhile, the recent development of various neural network models has made more sophisticated video analysis possible. A neural network model is a computer system that implements human-level intelligence, in which a machine learns and makes decisions on its own. A neural network model consists of machine learning (deep learning) technology, which uses algorithms that classify and learn the features of input data on their own, and element technologies that use machine learning algorithms to mimic functions of the human brain such as cognition and judgment. The element technologies may include, for example, at least one of: linguistic understanding technology that recognizes human language and characters; visual understanding technology that recognizes objects as human vision does; reasoning/prediction technology that evaluates information to make logical inferences and predictions; knowledge representation technology that processes human experience as knowledge data; and motion control technology that controls the autonomous driving of a vehicle or the movement of a robot.

The background art described above was possessed or acquired by the inventors in the course of deriving the present disclosure, and is not necessarily publicly known art disclosed to the general public before this application.

The disclosure below aims to address the problem that every step of Foley sound mixing, a task performed frequently in video production, must be repeated manually.

However, the technical problems are not limited to the one described above, and other technical problems may exist.

An artificial intelligence-based Foley sound providing method according to an embodiment may include: obtaining a target video including a subject performing a target action; determining context information related to the target action by recognizing, from the target video, properties of at least one object including the subject; determining a duration and an intensity of the target action by tracking a posture of the subject in the target video; generating a Foley sound corresponding to the target action of the subject based on the determined context information, the duration, and the intensity; and providing download data of the generated Foley sound, or download data in which the generated Foley sound and the target video are merged.

The generating of the Foley sound may include searching a sound library for a sound corresponding to the determined context information and, when such a sound is found, adjusting the length of the retrieved sound based on the duration.

The generating of the Foley sound may further include, when no sound corresponding to the context information is found, inputting the context information, the duration, and the intensity into a trained neural network model, and generating, as the output of the trained neural network model, a Foley sound corresponding to the input context information, duration, and intensity.
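The library search with a neural network fallback can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `fit_to_duration` and `generate_foley`, the dictionary-based library keyed by context text, the list-of-samples audio representation, and the loop-or-trim rule for length adjustment are all hypothetical and are not specified by the disclosure.

```python
from typing import Callable, Dict, List

# Hypothetical sample rate, kept tiny so the example is easy to inspect.
SAMPLE_RATE = 8


def fit_to_duration(samples: List[float], duration_s: float,
                    sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Loop or trim a clip so that it lasts duration_s seconds (assumed rule)."""
    target_len = int(round(duration_s * sample_rate))
    if not samples or target_len <= 0:
        return []
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]


def generate_foley(context: str, duration_s: float, intensity: float,
                   library: Dict[str, List[float]],
                   model: Callable[[str, float, float], List[float]]) -> List[float]:
    """Use a library clip when the context matches; otherwise call the model."""
    clip = library.get(context)
    if clip is not None:
        # Sound found: only its length is adjusted, based on the duration.
        return fit_to_duration(clip, duration_s)
    # Sound not found: the trained model (a stand-in callable here) receives
    # the context information, the duration, and the intensity.
    return model(context, duration_s, intensity)
```

In this sketch the model is consulted only when the lookup fails, which mirrors the order of the two steps described above.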

The target action may be a person's footsteps; the properties of the at least one object may include the gender of the subject, the material of the ground on which the subject is walking, and the material of the shoes the subject is wearing; and the context information may be determined in text form based on the recognized gender of the subject, the recognized material of the ground, and the recognized material of the shoes.

The target action may be a person's footsteps, and the determining of the duration and the intensity of the target action may include: detecting the posture of the subject from the target video; obtaining the movement of the subject in the target video by tracking the detected posture; and determining the duration and the intensity of the footsteps based on the obtained movement.
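The disclosure does not state how duration and intensity are computed from the tracked movement. One possible reading, given purely as an assumption, is sketched below: the duration is the time span during which a tracked ankle keypoint is moving, and the intensity is its peak speed. The function name `step_duration_and_intensity`, the `ankle_y` input, and the `speed_threshold` parameter are hypothetical.

```python
from typing import List, Tuple


def step_duration_and_intensity(ankle_y: List[float], fps: float,
                                speed_threshold: float = 1.0) -> Tuple[float, float]:
    """Return (duration in seconds, intensity) from per-frame ankle positions.

    ankle_y holds one vertical ankle coordinate per frame, in pixels.
    speed_threshold is an assumed cut-off, in pixels per frame, below which
    the foot is treated as still.
    """
    # Frame-to-frame displacement of the ankle keypoint.
    speeds = [abs(b - a) for a, b in zip(ankle_y, ankle_y[1:])]
    moving = [i for i, s in enumerate(speeds) if s >= speed_threshold]
    if not moving:
        return 0.0, 0.0
    # Duration: span from the first to the last moving frame interval.
    duration_s = (moving[-1] - moving[0] + 1) / fps
    # Intensity: peak ankle speed, converted to pixels per second.
    intensity = max(speeds) * fps
    return duration_s, intensity
```

A real system would likely normalize the speed by the subject's apparent size so that distance from the camera does not distort the intensity; that step is omitted here.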

The obtaining of the movement of the subject may include, in response to detecting a posture of the subject's upper body, obtaining the movement of the upper body by tracking the detected upper-body posture, and obtaining the movement of the subject's lower body based on the movement of the upper body.
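How lower-body movement is derived from upper-body movement is not detailed at this point. As a hedged illustration only, the sketch below assumes that each low point of the shoulder trajectory coincides with a foot strike, so step timing can be read from the vertical bobbing of the upper body. The function name `foot_strike_frames` and the `shoulder_y` input are hypothetical.

```python
from typing import List


def foot_strike_frames(shoulder_y: List[float]) -> List[int]:
    """Return frame indices where the shoulders reach a local low point.

    Image coordinates grow downward, so a low point of the body is a local
    maximum of the y coordinate. Treating such points as foot strikes is an
    assumption made for this sketch.
    """
    strikes = []
    for i in range(1, len(shoulder_y) - 1):
        rises_into = shoulder_y[i] > shoulder_y[i - 1]
        does_not_rise_after = shoulder_y[i] >= shoulder_y[i + 1]
        if rises_into and does_not_rise_after:
            strikes.append(i)
    return strikes
```

Dividing each returned index by the frame rate gives approximate step times, from which a footstep duration could be estimated even though the feet are out of frame.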

The method according to an embodiment may further include receiving an original video from a user device, and inputting the original video into a trained neural network model to classify, among the frames of the original video, at least one set of consecutive frames corresponding to the target action. The obtaining of the target video may then include obtaining the classified consecutive frames as the target video.
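Assuming the trained model yields one yes/no label per frame (an assumption; the disclosure does not specify the output format), the consecutive frames that form a target video could be grouped as in the sketch below. The function name `target_segments` and the `min_length` parameter are hypothetical.

```python
from typing import List, Optional, Tuple


def target_segments(frame_is_target: List[bool],
                    min_length: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end inclusive, for each run of True labels."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, flag in enumerate(frame_is_target):
        if flag and start is None:
            start = i  # a new run of target-action frames begins
        elif not flag and start is not None:
            if i - start >= min_length:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(frame_is_target) - start >= min_length:
        segments.append((start, len(frame_is_target) - 1))
    return segments
```

Each returned pair marks one candidate target video inside the original video; `min_length` lets very short runs, which are likely classification noise, be discarded.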

The properties of each of the at least one object may be recognized by different neural network models, each trained to recognize the properties of a respective object from a video, and the context information may be text information formed by concatenating the outputs of the different neural network models.
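The concatenation of per-property model outputs into context text can be illustrated as follows. The recognizers here are plain callables standing in for the separate trained models, and the name `build_context`, the comma separator, and the example labels are assumptions for illustration.

```python
from typing import Any, Callable, Dict


def build_context(frame: Any, recognizers: Dict[str, Callable[[Any], str]]) -> str:
    """Run one recognizer per property and join the labels into one text string.

    Dictionaries preserve insertion order, so the labels appear in the order
    in which the recognizers were registered.
    """
    return ", ".join(recognize(frame) for recognize in recognizers.values())


# Stand-ins for three separately trained models (hypothetical outputs).
example_recognizers = {
    "gender": lambda frame: "female",
    "ground": lambda frame: "asphalt",
    "shoe": lambda frame: "sneakers",
}
example_context = build_context(None, example_recognizers)
```

A text string of this kind could serve both as a search key for a sound library and as a text input to a generative model.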

The method according to an embodiment may further include transmitting the generated Foley sound to a user device, receiving a sound editing command for the Foley sound from the user device that received it, and editing the Foley sound according to the sound editing command. The providing of the download data may include providing download data of the edited Foley sound, or download data of the target video with which the edited Foley sound is merged.
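The disclosure does not enumerate the sound editing commands at this point. The sketch below assumes, for illustration only, two hypothetical commands, `gain` and `trim`, sent by the user device as small dictionaries and applied to a list of audio samples.

```python
from typing import Any, Dict, List


def apply_edit(samples: List[float], command: Dict[str, Any]) -> List[float]:
    """Apply one editing command (hypothetical format) and return the edited samples."""
    op = command["op"]
    if op == "gain":
        # Scale every sample by the requested factor.
        return [s * command["factor"] for s in samples]
    if op == "trim":
        # Keep samples from index start up to, but not including, index end.
        return samples[command["start"]:command["end"]]
    raise ValueError(f"unsupported edit command: {op}")
```

The edited samples would then replace the generated Foley sound when the download data is prepared.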

A Foley sound providing apparatus according to an embodiment includes a processor and a memory storing instructions to be executed by the processor. When the instructions are executed by the processor, the processor may: obtain a target video including a subject performing a target action; determine context information related to the target action by recognizing, from the target video, properties of at least one object including the subject; determine a duration and an intensity of the target action by tracking a posture of the subject in the target video; generate a Foley sound corresponding to the target action of the subject based on the determined context information, the duration, and the intensity; and provide download data of the generated Foley sound, or download data in which the generated Foley sound and the target video are merged.

The artificial intelligence-based Foley sound providing apparatus and method according to embodiments of the present disclosure can alleviate the problem that every step of Foley sound mixing, a task performed frequently in video production, must be repeated manually.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment.
FIG. 2 is a block diagram illustrating an example of a computer device according to an embodiment.
FIG. 3 is a diagram illustrating an example overview of a Foley sound providing system according to an embodiment.
FIG. 4 is a block diagram illustrating an example of a Foley sound providing apparatus according to an embodiment.
FIG. 5 is a flowchart of a Foley sound providing method according to an embodiment.
FIG. 6 is a flowchart of a method of providing a footstep Foley sound for a video showing a person's full body or lower body, according to an embodiment.
FIG. 7 is a flowchart of a method of providing a footstep Foley sound for a video showing a person's upper body, according to an embodiment.
FIG. 8 is a diagram illustrating an example of a user screen for a Foley sound providing method according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only, and the embodiments may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, and substitutes that fall within the technical idea described through the embodiments.

Terms such as "first" and "second" may be used to describe various components, but these terms should be interpreted only as distinguishing one component from another. For example, a first component may be named a second component, and similarly, a second component may be named a first component.

When a component is said to be "connected" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this document, each of the phrases "A or B", "at least one of A and B", "at least one of A or B", "A, B, or C", "at least one of A, B, and C", and "at least one of A, B, or C" may include any one of the items listed together in that phrase, or all possible combinations of them. In this specification, terms such as "comprise" or "have" are intended to indicate the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and, unless expressly defined in this specification, should not be interpreted in an idealized or overly formal sense.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description with reference to the drawings, identical components are given the same reference numerals regardless of the figure in which they appear, and redundant descriptions of them are omitted.

FIG. 1 is a diagram illustrating an example of a network environment according to an embodiment. The network environment of FIG. 1 is an example that includes a plurality of electronic devices 110, 120, and 130, a plurality of servers 150 and 160, and a network 140. FIG. 1 is merely an example for describing the invention, and the number of electronic devices and the number of servers are not limited to those shown in FIG. 1. In addition, the network environment of FIG. 1 describes only one example of the environments applicable to the present embodiments, and the applicable environments are not limited to the network environment of FIG. 1.

The plurality of electronic devices 110, 120, and 130 may be fixed terminals or mobile terminals implemented as computer devices. Examples of the electronic devices 110, 120, and 130 include smartphones, mobile phones, navigation devices, computers, laptops, digital broadcasting terminals, personal digital assistants (PDAs), portable multimedia players (PMPs), and tablet PCs. Although FIG. 1 shows the shape of a smartphone as an example of the electronic device 110, in embodiments of the present invention the electronic device 110 may be any of a variety of physical computer devices capable of communicating with the servers 150 and 160 through the network 140 using a wireless or wired communication method.

The communication method is not limited and may include not only communication methods that use a communication network the network 140 may include (for example, a mobile communication network, wired Internet, wireless Internet, or a broadcast network) but also short-range wireless communication between devices. For example, the network 140 may include any one or more of networks such as a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), and the Internet. In addition, the network 140 may include, but is not limited to, any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network.

Each of the servers 150 and 160 may be implemented as a computer device, or a plurality of computer devices, that communicates with the plurality of electronic devices 110, 120, and 130 through the network 140 to provide commands, code, files, content, services, and the like. For example, the server 150 may be a system that provides a service (for example, a cloud service, a game service, a content providing service, a group call service (or voice conference service), a messaging service, a mail service, a social network service, a map service, a translation service, a financial service, a payment service, or a search service) to the plurality of electronic devices 110, 120, and 130 connected through the network 140.

FIG. 2 is a block diagram illustrating an example of a computer device according to an embodiment. Each of the electronic devices 110, 120, and 130 and each of the servers 150 and 160 described above may be implemented by the computer device 200 shown in FIG. 2.

As shown in FIG. 2, the computer device 200 may include a memory 210, at least one processor 220, a communication interface 230, and an input/output interface 240. The memory 210 is a computer-readable recording medium and may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive. Here, a permanent mass storage device such as a ROM or a disk drive may instead be included in the computer device 200 as a separate permanent storage device distinct from the memory 210. An operating system and at least one program code may be stored in the memory 210. These software components may be loaded into the memory 210 from a computer-readable recording medium separate from the memory 210. Such a separate computer-readable recording medium may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, or the like. In another embodiment, the software components may be loaded into the memory 210 through the communication interface 230 rather than from a computer-readable recording medium. For example, the software components may be loaded into the memory 210 of the computer device 200 based on a computer program installed from files received through the network 140.

The processor 220 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processor 220 by the memory 210 or the communication interface 230. For example, the processor 220 may be configured to execute received instructions according to program code stored in a recording device such as the memory 210.

The communication interface 230 may provide a function that allows the computer device 200 to communicate with other devices (for example, the storage devices described above) through the network 140. For example, a request, command, data, or file generated by the processor 220 of the computer device 200 according to program code stored in a recording device such as the memory 210 may be transmitted to other devices through the network 140 under the control of the communication interface 230. Conversely, signals, commands, data, files, and the like from other devices may be received by the computer device 200 through the network 140 and the communication interface 230. Signals, commands, and data received through the communication interface 230 may be passed to the processor 220 or the memory 210, and files and the like may be stored in a storage medium (the permanent storage device described above) that the computer device 200 may further include.

The input/output interface 240 may be a means of interfacing with an input/output device 250. For example, input devices may include a microphone, a keyboard, or a mouse, and output devices may include a display or a speaker. As another example, the input/output interface 240 may be a means of interfacing with a device in which input and output functions are integrated, such as a touchscreen. The input/output device 250 may also be configured as a single device together with the computer device 200.

In other embodiments, the computer device 200 may include fewer or more components than those shown in FIG. 2. However, most conventional components need not be explicitly illustrated. For example, the computer device 200 may be implemented to include at least some of the input/output devices 250 described above, or may further include other components such as a transceiver or a database.

FIG. 3 is a diagram illustrating an example overview of a Foley sound providing system according to an embodiment.

With the Foley sound providing apparatus 320 and method according to an embodiment, a Foley sound that fits a video can be generated and provided, which reduces the time and cost of Foley sound recording and mixing.

Referring to FIG. 3, the Foley sound providing system 300 may include a user device 310 (e.g., the electronic devices 110, 120, and 130) and a Foley sound providing apparatus 320 (e.g., the servers 150 and 160). Here, the user device 310 and the Foley sound providing apparatus 320 may each be implemented by at least one computer device 200.

A user may input, through the user device 310, a video to which Foley sound is to be added. The video input through the user device 310 may be uploaded to the Foley sound providing apparatus 320 through a network.

Upon receiving the uploaded video, the Foley sound providing apparatus 320 may generate a Foley sound using artificial intelligence. Specifically, the Foley sound providing apparatus 320 may analyze the video using at least one neural network model and generate a Foley sound corresponding to the video according to the analysis result.

A neural network model may refer to a neural network that includes a plurality of hidden layers in addition to an input layer and an output layer. A neural network model can be used to identify latent structures in data, that is, the latent structure of a video (for example, which objects appear in a picture, the posture of an object, and so on). Neural network models may include a convolutional neural network (CNN), a recurrent neural network (RNN), an autoencoder, a generative adversarial network (GAN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a Q-network, a U-network, a Siamese network, and the like.

The neural network model according to an embodiment may be trained by at least one of supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Training a neural network model may be the process of applying to the neural network the knowledge it needs to perform a specific operation. Hereinafter, the term "model" may refer to a neural network model.

For example, FIG. 4 shows an exemplary block diagram of the Foley sound providing apparatus 320 according to an embodiment. The Foley sound providing apparatus 320 may include a neural network processor 405, an editor 410, and a sound library 415. In an embodiment, the neural network processor 405 and the editor 410 may each be implemented by at least one computer device 200. Alternatively, the Foley sound providing apparatus 320 may consist of a single computer device 200 whose processor 220 performs the functions of the neural network processor 405 and the editor 410.

The Foley sound providing device 320 may generate a Foley sound corresponding to an input image using the neural network processor 405, the editor 410, and the sound library 415. The functions of the neural network processor 405, the editor 410, and the sound library 415 are described in detail below with reference to FIG. 5 and the subsequent drawings.

The Foley sound providing device 320 may provide download data through the user device 310 so that the user can download the Foley sound.

FIG. 5 is a flowchart of a method for providing a Foley sound according to one embodiment.

Referring to FIG. 5, in step 505 the Foley sound providing device 320 may acquire a target image including a subject performing a target action. The target action may be predetermined. For example, the target action may be a person's footsteps. In that case, a footstep sound may need to be added to the target image as the Foley sound corresponding to the footsteps shown in the target image.

The target image may be a video uploaded by the user, or a portion of a user-uploaded video that the Foley sound providing device 320 has classified as corresponding to the target action.

In one embodiment, the target image may be one that the user generates by cutting away the parts of the original video other than the part to which a footstep sound is to be added, and then uploads to the Foley sound providing device 320 through the user device 310. The Foley sound providing device 320 may acquire the target image by receiving the target image uploaded by the user.

In another example, the user may upload the original video to the Foley sound providing device 320. The target image may then be the video corresponding to the frames of the original video that the Foley sound providing device 320 classifies as corresponding to the target action.

For example, the neural network processor 405 of the Foley sound providing device 320 may include a neural network model (e.g., an action classification model (not shown)) trained to classify, from an input image, at least one sequence of consecutive frames corresponding to the target action.

The neural network processor 405 may input the original video into the trained action classification model (not shown) to classify at least one sequence of consecutive frames corresponding to the target action. The Foley sound providing device 320 may acquire, as the target image, the consecutive frames classified as corresponding to the target action.
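As a purely illustrative sketch, and not part of the disclosed embodiment, the grouping of per-frame classification results into sequences of consecutive frames might look as follows in Python. The function name `consecutive_segments`, the boolean per-frame flags, and the `min_len` parameter are assumptions introduced here; the action classification model that would produce the flags is not shown.

```python
from typing import List, Tuple


def consecutive_segments(flags: List[bool], min_len: int = 1) -> List[Tuple[int, int]]:
    """Group per-frame flags into (start, end) index pairs, both inclusive.

    flags[i] is assumed to be True when a (hypothetical) action classifier
    judged frame i to show the target action.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # a new run of target frames begins
        elif not flag and start is not None:
            if i - start >= min_len:       # keep only runs that are long enough
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(flags) - start >= min_len:
        segments.append((start, len(flags) - 1))
    return segments
```

Each returned pair could then be cut from the original video and treated as one target image; the `min_len` filter is one possible way to discard isolated single-frame detections.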

In step 510, the Foley sound providing device 320 may determine context information related to the target action from the target image. The context information may include information about the environment around the subject that is related to the target action. For example, the context information may include properties of at least one object including the subject.

The Foley sound providing device 320 may determine the context information related to the target action by recognizing, from the target image, properties of at least one object including the subject. The context information may be information expressing those properties as text.

The properties of objects included in the target image may be obtained using a neural network model. For example, they may be obtained using the image recognition models 411 to 413 included in the neural network processor 405.

Referring to FIG. 4, the neural network processor 405 may include at least one image recognition model 411 to 413 trained to recognize a property of a specific object from an input image. The image recognition models 411 to 413 may each be trained to recognize a property of a different object.

In step 510, the acquired target image may be input, through the neural network processor 405, to each of the at least one image recognition model 411 to 413. The neural network processor 405 may use the at least one image recognition model 411 to 413 to output properties of at least one different object included in the target image.

The Foley sound providing device 320 may determine the context information related to the target action using the object properties thus output, and may use the determined context information to provide a Foley sound that fits the situation in the target image.

Although not shown in the drawings, in another embodiment of the present invention, the Foley sound providing device 320 may receive, from the user device, context information about properties of at least one object including the subject of the target image.

The context information received from the user device may include information about the subject or about properties of surrounding objects that affect the Foley sound. For example, when the Foley sound is a footstep sound, the information about the subject may include, but is not limited to, the subject's gender, age, weight, and clothing. The information about properties of surrounding objects may include, but is not limited to, the type of shoe, the material of the shoe sole, the material of the ground the shoe contacts, the space in which the subject is located, and ambient noise.

The Foley sound providing device 320 can automatically recognize properties of at least one object including the subject using a neural network model, but the recognized properties may not contain enough information to generate the Foley sound.

Examples include a case in which an object property needed to generate the Foley sound lies outside the frame of the target image, and a case in which it lies inside the frame but the neural network model fails to recognize it, for example because of low resolution. Another example is a case in which the user wishes to supply the object properties used in generating the Foley sound manually, as a supplement, in order to give viewers a stronger sense of immersion.

In one embodiment, the user device may display, on its screen, a user interface through which the user can input object properties. This user interface may be provided by displaying a blank field in which the user can enter a desired word or sentence.

The user device may transmit the word or sentence entered by the user to the Foley sound providing device 320. The Foley sound providing device 320 may generate a Foley sound based on context information corresponding to that word or sentence.

In another embodiment, the user device may present a plurality of frequently entered object properties as candidates, so that the user can select the desired object property from among them.

In yet another embodiment, the user device may present, as candidates for the user to select, object properties that are highly relevant to the properties first recognized by the image recognition model. That is, the Foley sound providing device 320 may determine as candidates the object properties that are highly relevant to those recognized by the image recognition model.

For example, when the objects recognized by the image recognition model include a shoe and the ground, the Foley sound providing device 320 may determine that the type of Foley sound is a footstep sound. The Foley sound providing device 320 may then provide to the user device, as candidate object properties that affect the footstep sound, the material and area of the shoe sole, the material of the shoe worn by the subject, information about the space in which the ground is located, and the like.

The user device may display the candidate object properties to the user through the user interface, and may transmit the object property selected by the user to the Foley sound providing device 320. The Foley sound providing device 320 may generate a Foley sound based on context information corresponding to the selected object property.

In step 515, the Foley sound providing device 320 may track the posture of the subject in the acquired target image to determine the duration of the target action and the intensity of the target action. The Foley sound providing device 320 may use the neural network processor 405 to make this determination.

The posture of the subject in the target image may be detected using a neural network model. For example, the neural network processor 405 may include a posture detection model 420 trained to detect the posture of a subject (for example, a person) from the acquired target image.

For example, the posture detection model 420 may be a skeleton detection model that recognizes the joints of the human body. The Foley sound providing device 320 may track the posture of the subject based on the posture detection result for each frame of the target image, thereby obtaining the motion of the subject in the target image.

The posture of the subject may be tracked using various algorithms or neural network models. For example, it may be tracked by applying a Kalman filter-based posture tracking algorithm to the postures detected by the posture detection model 420. As another example, it may be tracked using a neural network model trained to track successive postures based on the postures detected by the posture detection model 420.
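The disclosure does not specify the form of the Kalman filter. As a minimal sketch under stated assumptions, the following Python function smooths one coordinate of one detected joint across frames with a scalar Kalman filter that uses a random-walk state model. The function name `kalman_smooth` and the noise variances `q` and `r` are hypothetical; a practical posture tracker would more likely use a multi-dimensional state that also carries velocity.

```python
from typing import List, Sequence


def kalman_smooth(measurements: Sequence[float], q: float = 1e-2, r: float = 1.0) -> List[float]:
    """Scalar Kalman filter over per-frame measurements of one joint coordinate.

    q: assumed process-noise variance (how much the joint may drift per frame).
    r: assumed measurement-noise variance of the posture detector.
    """
    if not measurements:
        return []
    x = float(measurements[0])   # initial state estimate: first detection
    p = r                        # initial estimate variance
    estimates = [x]
    for z in measurements[1:]:
        p = p + q                # predict: state unchanged, uncertainty grows
        k = p / (p + r)          # Kalman gain
        x = x + k * (z - x)      # update with the new detection
        p = (1.0 - k) * p        # posterior variance
        estimates.append(x)
    return estimates
```

Applying the function separately to the x and y coordinates of each joint would give a smoothed trajectory per joint, from which motion can be derived.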

In an embodiment in which the target image acquired in step 505 includes the target action of the subject, the neural network processor 405 may determine the duration and intensity of the target action based on the obtained motion of the subject.

For example, when a target image including the whole body or lower body of a walking person is input to the neural network processor 405 and the target action is the person's footsteps, the neural network processor 405 may detect the posture of the whole body or lower body through the posture detection model 420, and may track that posture based on the detection results to obtain the motion of the whole body or lower body.

The neural network processor 405 may determine the duration and intensity of the footsteps based on the obtained motion of the subject. From the motion of the subject, the neural network processor 405 may identify the time at which the target action starts and the time at which it ends in the target image, and may determine the difference between the two as the duration of the target action. The neural network processor 405 may determine the intensity of the target action based on the speed at which the subject moves.
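One possible reading of this determination is sketched below in Python, for illustration only. It assumes that the motion is available as one tracked coordinate per frame (for example, an ankle position), that the action is "in progress" while the per-frame speed exceeds a threshold, and that intensity is the mean speed over that interval. The function name, the threshold, and the use of mean speed are all assumptions, not details given in the disclosure.

```python
from typing import Sequence, Tuple


def duration_and_intensity(
    positions: Sequence[float], fps: float, speed_threshold: float
) -> Tuple[float, float]:
    """Return (duration in seconds, intensity) of the tracked motion.

    positions: one tracked coordinate per frame (hypothetical input format).
    Intensity is taken here to be the mean speed while the subject is moving.
    """
    # Speed between consecutive frames, in position units per second.
    speeds = [abs(b - a) * fps for a, b in zip(positions, positions[1:])]
    moving = [i for i, s in enumerate(speeds) if s > speed_threshold]
    if not moving:
        return 0.0, 0.0
    start, end = moving[0], moving[-1] + 1      # frame indices: start and end of the action
    duration = (end - start) / fps              # difference between the two time points
    intensity = sum(speeds[start:end]) / (end - start)
    return duration, intensity
```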

Depending on the embodiment, the neural network processor 405 may further include a neural network model (e.g., the motion prediction model 435 of FIG. 4) that predicts, from the motion of the subject obtained from the target image, motion of the subject that is not included in the target image. For example, in an embodiment in which the target image acquired in step 505 does not include the target action of the subject, the neural network processor 405 may further include the motion prediction model 435 in order to predict the motion corresponding to the target action.

For example, when a target image including the upper body of a walking person is input to the neural network processor 405 and the target action is the person's footsteps, the neural network processor 405 can detect the posture of the upper body through the posture detection model 420 and track it based on the detection results to obtain the motion of the upper body, but may be unable to obtain the motion of the lower body.

In this case, the neural network processor 405 may predict the motion of the lower body from the obtained motion of the upper body using the motion prediction model 435, which is trained to predict lower-body motion from upper-body motion. The Foley sound providing device 320 may determine the duration and intensity of the footsteps based on the lower-body motion thus obtained.

In step 520, the Foley sound providing device 320 may generate a Foley sound corresponding to the target action of the subject based on the context determined in step 510 and the duration and intensity of the target action determined in step 515.

Referring to FIG. 4, the Foley sound providing device 320 may include a sound library 415 in which various sounds related to the target action are stored, each mapped to text describing that sound.

In one embodiment, the Foley sound providing device 320 may search the sound library 415 for a Foley sound corresponding to the target action of the subject based on at least one of the context, the duration of the target action, and the intensity of the target action.

For example, the Foley sound providing device 320 may retrieve from the sound library 415 a Foley sound that corresponds to at least one of the context, the duration of the target action, and the intensity of the target action.
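Because the sound library maps sounds to descriptive text, a text-based lookup is one straightforward way to perform this search. The Python sketch below is an assumption-laden illustration: the library is modeled as a dictionary from a hyphen-separated label to a file name, and the best match is the label sharing the most tokens with the context text. The function name `search_library`, the label format, and the file names are hypothetical.

```python
from typing import Dict, Optional


def search_library(library: Dict[str, str], context: str, sep: str = "-") -> Optional[str]:
    """Return the sound whose text label shares the most tokens with the context.

    library: hypothetical mapping from a text label to a stored sound.
    Returns None when no label shares any token with the context.
    """
    query = set(context.lower().split(sep))
    best, best_score = None, 0
    for label, sound in library.items():
        score = len(query & set(label.lower().split(sep)))  # number of shared tokens
        if score > best_score:
            best, best_score = sound, score
    return best


# Example with made-up entries:
library = {
    "male-concrete-leather": "footstep_a.wav",
    "female-wood-sneaker": "footstep_b.wav",
}
match = search_library(library, "Male-Concrete-Rubber")
```

Duration and intensity could be added as further keys or used to rank ties; the disclosure leaves the matching criterion open.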

In another example, the Foley sound providing device 320 may search for the Foley sound using a neural network model (e.g., the search model 430 of FIG. 4) trained to output one of the sounds stored in the sound library 415 based on at least one of the context, the duration of the target action, and the intensity of the target action.

Referring to FIG. 4, the neural network processor 405 may include the search model 430 trained to output one of the sounds stored in the sound library 415 based on at least one of the context, the duration of the target action, and the intensity of the target action.

For example, at least one of the context, the duration of the target action, and the intensity of the target action may be input to the search model 430 of the neural network processor 405. The neural network processor 405 may use the search model 430 to output one of the sounds stored in the sound library 415.

The Foley sound providing device 320 may generate the Foley sound corresponding to the target action by adjusting the length of the Foley sound retrieved from the sound library 415, or of the sound output by the search model 430, to take the duration of the target action into account.
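The disclosure does not state how the length is adjusted. One simple possibility, sketched below in Python as an assumption, is to loop the retrieved sound when it is shorter than the target duration and trim it when it is longer. The function name `fit_length` and the representation of audio as a list of samples are hypothetical.

```python
from typing import List


def fit_length(samples: List[float], sample_rate: int, duration_s: float) -> List[float]:
    """Loop or trim a sound so that it lasts duration_s seconds.

    samples: audio samples of the retrieved sound (hypothetical representation).
    """
    target = round(duration_s * sample_rate)   # number of samples needed
    if not samples or target <= 0:
        return []
    repeats = -(-target // len(samples))       # ceiling division
    return (samples * repeats)[:target]        # repeat, then cut to exact length
```

Time-stretching or selecting a different number of individual footstep events would be alternative ways to match the duration.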

Referring to FIG. 4, the Foley sound providing device 320 may include an editor 410 capable of editing sound. The editor 410 may perform editing functions on an audio track.

In an embodiment of the present invention, the editor 410 may perform a sync adjustment function so that the generated Foley sound matches what is shown on screen in the video. Specifically, the editor 410 may adjust the position of the audio track, cut or paste at a specific point, and perform sneak-in and sneak-out.
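As an illustration only, and assuming that sneak-in and sneak-out refer to a gradual fade-in at the start of the sound and a gradual fade-out at its end, the operation could be sketched in Python as a linear gain ramp. The function name `sneak_in_out` and the linear ramp shape are assumptions.

```python
from typing import List, Sequence


def sneak_in_out(samples: Sequence[float], fade_len: int) -> List[float]:
    """Apply a linear fade-in and fade-out of fade_len samples each."""
    n = len(samples)
    fade_len = min(fade_len, n // 2)   # the two ramps must not overlap
    out = list(samples)
    for i in range(fade_len):
        gain = i / fade_len            # rises from 0 toward 1
        out[i] *= gain                 # fade in from the start
        out[n - 1 - i] *= gain         # mirror-image fade out at the end
    return out
```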

In one embodiment, in parallel with generating a Foley sound by searching the sound library 415, the Foley sound providing device 320 may generate a Foley sound corresponding to the target action using a neural network model.

For example, the Foley sound providing device 320 may generate the Foley sound corresponding to the target action using a Foley sound generation model 425 trained to generate a Foley sound based on at least one of the context, the duration of the target action, and the intensity of the target action.

Referring to FIG. 4, the neural network processor 405 may include the Foley sound generation model 425 trained to generate a Foley sound based on at least one of the context, the duration of the target action, and the intensity of the target action. At least one of the context determined in step 510 and the duration and intensity of the target action determined in step 515 may be input to the Foley sound generation model 425. The neural network processor 405 may use the Foley sound generation model 425 to generate the Foley sound corresponding to the target action.

In step 525, the Foley sound providing device 320 may provide to the user device download data of the generated Foley sound, or download data in which the generated Foley sound and the target image are merged. The merging of the generated Foley sound and the target image may be performed by an editor (e.g., the editor 410 of FIG. 4). The editor 410 may match the generated Foley sound to what is shown on screen through at least one of position adjustment on the audio track, cutting or pasting, and sneak-in and sneak-out. The user may download the generated Foley sound through the user device.

Hereinafter, an exemplary method of providing a Foley sound when the target action is a person's footsteps is described with reference to FIGS. 6 and 7.

A video of a walking person may include the person's whole body, upper body, or lower body. When the whole body or lower body is included in the target image, the Foley sound providing device 320 may track the person's motion in the target image and provide a Foley sound corresponding to the footsteps.

FIG. 6 shows a flowchart of a method, according to one embodiment, of providing a Foley sound for footsteps, which are the target action, for a target image in which a person's whole body or lower body is captured.

In step 605, the Foley sound providing device 320 of one embodiment may acquire a target image including the lower body or whole body of a walking subject (i.e., a person).

In one embodiment, the target image may be one that the user generates by cutting out, from the original video, the portion that includes the lower body or whole body of the walking person, and then uploads to the Foley sound providing device 320 through the user device 310. The Foley sound providing device 320 may acquire the target image by receiving the target image uploaded by the user.

In another example, the user may upload the original video to the Foley sound providing device 320. The target image may then be the video corresponding to the frames of the original video that the Foley sound providing device 320 classifies as including the lower body or whole body of a walking person. For example, the neural network processor 405 of the Foley sound providing device 320 may include a neural network model (e.g., an action classification model (not shown)) trained to classify, from an input image, at least one sequence of consecutive frames that includes the lower body or whole body of a walking person.

In the example of FIG. 6, step 605 may be a step included in step 505 of FIG. 5.

In steps 610, 615, and 620, the Foley sound providing device 320 may recognize properties of at least one object including the subject. The at least one object may be an object related to footsteps, which are the target action, and may be predetermined. In the example of FIG. 6, the at least one object may include the person who is the subject, the ground on which the person is walking, and the shoes the person is wearing. However, this is only an example, and the at least one object related to the target action may include various other objects.

The property of each of the at least one object may be obtained using a different neural network model (e.g., the first image recognition model 411 to the N-th image recognition model 413 of FIG. 4). For example, the Foley sound providing device 320 may input the target image, in step 610, into a first image recognition model trained to recognize a person's gender from an input image; in step 615, into a second image recognition model trained to recognize the material of the ground from an input image; and in step 620, into a third image recognition model trained to recognize the material of a shoe.

The Foley sound providing device 320 may obtain the person's gender as the output of the first image recognition model, the material of the ground as the output of the second image recognition model, and the material of the shoe as the output of the third image recognition model.

In step 625, the Foley sound providing device 320 may determine the context related to footsteps, which are the target action, based on the properties of the at least one object obtained in steps 610, 615, and 620.

In one embodiment, the context information may be determined in text form based on the properties of the at least one object. For example, the context information may be information expressing the properties of the at least one object as text. For example, if, through the plurality of image recognition models, male is recognized as the person's gender in step 610, concrete is recognized as the material of the ground in step 615, and leather is recognized as the material of the shoe in step 620, the context information may be the text "male-concrete-leather", formed by concatenating the outputs of the image recognition models. In the example of FIG. 6, steps 610, 615, 620, and 625 may be steps included in step 510 of FIG. 5.
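The concatenation described above can be sketched in a few lines of Python. In this illustration each image recognition model is represented by a plain callable that takes the target image and returns a text label; the stub recognizers, the type alias `Recognizer`, and the function name `build_context` are assumptions standing in for the trained models, which are not shown.

```python
from typing import Callable, Sequence

# A recognizer takes the target image and returns one property as text.
Recognizer = Callable[[object], str]


def build_context(target_image: object, recognizers: Sequence[Recognizer], sep: str = "-") -> str:
    """Run each recognizer on the target image and join the outputs into one text."""
    return sep.join(recognize(target_image) for recognize in recognizers)


# Stub recognizers standing in for the first to third image recognition models.
recognizers = [
    lambda image: "male",       # gender of the person (step 610)
    lambda image: "concrete",   # material of the ground (step 615)
    lambda image: "leather",    # material of the shoe (step 620)
]
context = build_context(None, recognizers)
```

With these stubs, `context` holds the text "male-concrete-leather" used in the example above.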

In step 630, the Foley sound providing device 320 may track the posture of the subject in the acquired target image to determine the duration and intensity of the target action. In one embodiment, the Foley sound providing device 320 may determine the duration and intensity of the footsteps, which are the target action, using a posture detection model (e.g., the posture detection model 420 of FIG. 4) trained to recognize footsteps from an image. For example, the Foley sound providing device 320 may input the acquired target image into the posture detection model to detect and track the posture of the person's lower body or whole body.

In response to the posture of the person's lower body or whole body being detected, the Foley sound providing device 320 may track the subject's action based on the posture detection results and thereby determine the duration and intensity of the footsteps. In the example of FIG. 6, step 630 may be a step included in step 515 of FIG. 5.

In step 635, the Foley sound providing device 320 may search a sound library (e.g., the sound library 415 of FIG. 4) for a Foley sound corresponding to the subject's footsteps, based on at least one of the context determined in step 625 and the duration and intensity of the footsteps determined in step 630. For example, the Foley sound providing device 320 may retrieve from the sound library a Foley sound corresponding to at least one of the context, the duration of the footsteps, and the intensity of the footsteps.
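As a non-limiting illustration, the library search may be sketched as a simple ranking over tagged clips. The `SoundClip` structure, its fields, and the ranking rule (most matching context tags first, then closest duration, then closest intensity) are assumptions for illustration and are not specified by the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundClip:
    name: str
    tags: frozenset      # context labels attached to the clip (hypothetical)
    duration_s: float    # clip length in seconds
    intensity: float     # normalized loudness in [0, 1] (hypothetical scale)


def search_library(library, context, duration_s, intensity):
    """Return the clip that best matches the context, duration, and intensity."""
    wanted = set(context.split("-"))

    def rank(clip):
        overlap = len(wanted & clip.tags)
        # Smaller tuples rank first: more tag overlap, then closer duration,
        # then closer intensity.
        return (-overlap,
                abs(clip.duration_s - duration_s),
                abs(clip.intensity - intensity))

    return min(library, key=rank) if library else None


library = [
    SoundClip("heels_marble", frozenset({"female", "marble", "heel"}), 2.0, 0.6),
    SoundClip("leather_concrete_short", frozenset({"male", "concrete", "leather"}), 1.0, 0.5),
    SoundClip("leather_concrete_long", frozenset({"male", "concrete", "leather"}), 3.2, 0.5),
]
best = search_library(library, "male-concrete-leather", 3.0, 0.5)
print(best.name)
```

A rule-based ranking of this kind is only one option; a trained search model could replace the `rank` function without changing the surrounding flow.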

In another example, the Foley sound providing device 320 may retrieve a Foley sound using a search model (e.g., the search model 430 of FIG. 4) trained to output one of the sounds stored in the sound library based on at least one of the context, the duration of the footsteps, and the intensity of the footsteps.

In step 640, the Foley sound providing device 320 may generate a Foley sound corresponding to the subject's footsteps by adjusting the length of the Foley sound retrieved in step 635 in consideration of the duration determined in step 630.
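The disclosure does not specify how the length is adjusted. One simple possibility, shown here purely as an assumption, is to repeat the retrieved clip and trim it to the target duration; time-stretching would be an alternative. The function name `fit_to_duration` is hypothetical.

```python
def fit_to_duration(samples, sample_rate, target_s):
    """Loop or trim a list of audio samples so it lasts `target_s` seconds."""
    if not samples:
        return []
    target_len = round(target_s * sample_rate)
    # Ceiling division: number of repetitions needed to cover the target length.
    repeats = -(-target_len // len(samples))
    return (samples * repeats)[:target_len]


# A 1-second clip at 3 samples per second, extended to 2 seconds.
print(fit_to_duration([1, 2, 3], 3, 2.0))
```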

In step 645, in parallel with steps 635 to 640, the Foley sound providing device 320 may generate a Foley sound corresponding to the subject's footsteps using a Foley sound generation model (e.g., the Foley sound generation model 425 of FIG. 4) trained to generate a Foley sound based on at least one of the context, the duration of the footsteps, and the intensity of the footsteps. In the example of FIG. 6, steps 635, 640, and 645 may be included in step 520 of FIG. 5.

Although not shown in FIG. 6, the Foley sound providing method according to one embodiment may further include a step in which, before step 650, the Foley sound providing device adjusts the synchronization between the target image and the Foley sound using an editor (e.g., the editor 410 of FIG. 4). In the synchronization step, the Foley sound providing device may also combine the Foley sound generated through steps 635 to 640 with the Foley sound generated through step 645 to produce a single Foley sound.

Although FIG. 6 shows steps 635 to 640 being performed in parallel with step 645, another embodiment may further include a step (not shown) in which step 635 is performed first and whether to perform step 645 is decided according to whether the result of step 635 satisfies a preset condition. For example, in that step (not shown), the Foley sound providing device may determine whether the Foley sound retrieved in step 635 is suitable. For example, the Foley sound providing device may determine that the retrieved Foley sound is suitable when the difference between its length and the duration of the footsteps is less than or equal to a preset value, and unsuitable when the difference exceeds the preset value.

For example, when the retrieved Foley sound lasts 1 second and the subject's footsteps last 3 seconds, stretching the 1-second Foley sound to 3 seconds in step 640 may degrade its quality, so the Foley sound providing device may determine that the retrieved Foley sound is not suitable. Step 645 may be performed when the retrieved Foley sound is determined to be unsuitable. When the retrieved Foley sound is determined to be suitable, step 640 may be performed.
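The suitability check and the resulting branch can be sketched as follows. The threshold of 0.5 seconds and the returned labels are hypothetical; the disclosure only refers to a "preset value".

```python
def is_retrieved_sound_suitable(clip_s, action_s, max_diff_s):
    """True when the clip length is within `max_diff_s` seconds of the action."""
    return abs(clip_s - action_s) <= max_diff_s


def choose_path(clip_s, action_s, max_diff_s=0.5):
    """Decide between adjusting the retrieved clip and using the generation model."""
    if is_retrieved_sound_suitable(clip_s, action_s, max_diff_s):
        return "adjust_retrieved"      # corresponds to step 640
    return "generate_with_model"       # corresponds to step 645


# The example from the description: a 1-second clip for 3 seconds of footsteps.
print(choose_path(1.0, 3.0))
```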

Returning to FIG. 6, in step 650, the Foley sound providing device 320 may provide the user device 310 with download data of the Foley sound generated in at least one of steps 640 and 645. In one embodiment, based on a user input, the Foley sound providing device 320 may provide download data of the generated Foley sound, or download data in which the generated Foley sound is merged with the target image. The merging of the generated Foley sound and the target image may be performed by an editor (e.g., the editor 410 of FIG. 4). In the example of FIG. 6, step 650 may be included in step 525 of FIG. 5.

Referring to FIG. 7, a flowchart is shown of a method of providing a Foley sound for footsteps, the target action, for a target image in which, unlike the case of FIG. 6, the person's upper body is captured.

In step 705, the Foley sound providing device 320 of one embodiment may acquire a target image including the upper body of a walking subject (i.e., a person).

In one embodiment, the target image may be one that the user created by cutting out, from an original video, a portion including the upper body of a walking person, and then uploaded to the Foley sound providing device 320 through the user device 310. The Foley sound providing device 320 may acquire the target image by receiving the target image uploaded by the user.

In another example, the user may upload the original video to the Foley sound providing device 320. The target image may be the video corresponding to the frames of the original video that the Foley sound providing device 320 has classified as containing the upper body of a walking person. For example, the neural network processor 405 of the Foley sound providing device 320 may include a neural network model (e.g., a behavior classification model (not shown)) trained to classify, from an input video, at least one sequence of consecutive frames containing the upper body of a walking person.

In the example of FIG. 7, step 705 may be included in step 505 of FIG. 5.

In step 710, the Foley sound providing device 320 may recognize the properties of at least one object including the subject. The at least one object may be an object related to footsteps, the target action. The at least one object related to footsteps may be determined in advance. In the example of FIG. 7, the at least one object may be the subject. However, this is only an example, and the at least one object related to the target action may include various objects.

The property of each of the at least one object may be obtained using a different neural network model (e.g., the first image recognition model 411 to the N-th image recognition model 413 of FIG. 4). For example, in step 710, the Foley sound providing device 320 may input the target image into a first image recognition model trained to recognize a person's gender from an input image. The Foley sound providing device 320 may obtain the person's gender as the output of the first image recognition model.

In step 715, the Foley sound providing device 320 may determine the context related to footsteps, the target action, based on the property of the object obtained in step 710 (i.e., the person's gender). In one embodiment, the context information may be determined in text form based on the properties of the at least one object. For example, the context information may be information expressing the properties of the at least one object as text. For example, when male is recognized as the person's gender in step 710, the context information may be “male”. However, the person's gender is only one example, and the context information may express the property of any object that affects the Foley sound. In the example of FIG. 7, steps 710 and 715 may be included in step 510 of FIG. 5.

Although not shown in the drawings, the Foley sound providing device 320 may receive, from the user device, context information about the properties of at least one object including the subject of the target image.

In some cases, the context information needed to generate a Foley sound lies outside the frame of the target image, making it difficult to generate a Foley sound that accurately matches the target image. As in the embodiment of FIG. 7, only the subject's upper body may be shown on screen when a footstep sound is to be generated. In such a case, information about the subject has to be predicted from the movement of the upper body, or the properties of objects have to be predicted from the environment surrounding the subject, which may be a limitation.

Accordingly, the Foley sound providing device 320 may generate the Foley sound based on context information about the properties of an object entered directly by the user. To this end, the Foley sound providing device 320 may provide a user interface through which context information can be entered on the user device. The user may enter context information about the properties of the object, and the user device may transmit the entered context information to the Foley sound providing device 320.

In an embodiment of the present invention, the Foley sound providing device 320 may receive context information about the material of the shoes and the material of the ground from the user device. The Foley sound providing device 320 may generate a footstep sound based on the context information about the person's gender determined through the first image recognition model described above, together with the context information about the shoe material and the ground material entered directly by the user.
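Combining model-recognized context with user-entered context can be sketched as a dictionary merge. The key names and the rule that a non-empty user entry takes precedence over a recognized value are assumptions for illustration; the disclosure does not state how conflicts are resolved.

```python
def merge_context(recognized, user_supplied):
    """Combine recognized properties with properties entered by the user."""
    merged = dict(recognized)
    # Assumption: a non-empty user entry overrides the recognized value.
    merged.update({key: value for key, value in user_supplied.items() if value})
    return merged


recognized = {"gender": "male"}                          # from the first image recognition model
user_supplied = {"shoe": "leather", "ground": "concrete"}  # entered on the user device
print(merge_context(recognized, user_supplied))
```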

In another embodiment, the Foley sound providing device 320 may predict, based on information obtained by recognizing objects on the screen of the target image, the properties of objects that affect the footstep sound but are not shown on the screen. The predicted object properties may be provided to the user in the form of context information expressed as text.

From among the context information that is not shown on the screen but affects the Foley sound, the user may select the context information that suits the target image. The Foley sound providing device 320 may receive the context information selected by the user and generate the Foley sound based on it.

In step 720, the Foley sound providing device 320 may predict the movement of the subject's lower body by tracking the posture of the subject's upper body in the target image. For example, the Foley sound providing device 320 may detect the posture of the subject's upper body from the target image through a posture detection model (e.g., the posture detection model 420 of FIG. 4). In response to the posture of the subject's upper body being detected, the Foley sound providing device 320 may track the posture of the upper body based on the posture detection result to obtain the movement of the upper body. The Foley sound providing device 320 may then obtain the movement of the lower body from the obtained upper-body movement using a neural network model (e.g., the motion prediction model 435 of FIG. 4) trained to predict lower-body movement from upper-body movement.

In step 725, the Foley sound providing device 320 may determine the duration and intensity of the footsteps, the target action, based on the lower-body movement predicted in step 720. For example, the Foley sound providing device 320 may identify, from the movement of the lower body, the time at which the footsteps start and the time at which they end in the target image, and may determine the difference between the two times as the duration of the footsteps. The Foley sound providing device 320 may determine the intensity of the footsteps based on the speed at which the subject moves. In the example of FIG. 7, steps 720 and 725 may be included in step 515 of FIG. 5.
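The duration and intensity computation can be sketched as below. The per-frame walking flags, the frame rate parameter, and the linear mapping from speed to a normalized intensity (with a hypothetical maximum speed of 3.0 units per second) are assumptions; the disclosure states only that the duration is the difference between the start and end times and that the intensity is based on the subject's speed.

```python
def footstep_duration(walking_flags, fps):
    """Seconds between the first and last frame in which walking is detected."""
    active = [i for i, walking in enumerate(walking_flags) if walking]
    if not active:
        return 0.0
    return (active[-1] - active[0]) / fps


def footstep_intensity(speed, max_speed=3.0):
    """Map the subject's speed to an intensity clamped to the range [0, 1]."""
    if max_speed <= 0:
        raise ValueError("max_speed must be positive")
    return max(0.0, min(speed / max_speed, 1.0))


flags = [False, True, True, True, True, False]  # one flag per video frame
print(footstep_duration(flags, fps=2), footstep_intensity(1.5))
```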

In step 730, the Foley sound providing device 320 may search a sound library (e.g., the sound library 415 of FIG. 4) for a Foley sound corresponding to the subject's footsteps, based on at least one of the context determined in step 715 and the duration and intensity of the footsteps determined in step 725. For example, the Foley sound providing device 320 may retrieve from the sound library a Foley sound corresponding to at least one of the context, the duration of the footsteps, and the intensity of the footsteps.

In step 735, the Foley sound providing device 320 may generate a Foley sound corresponding to the subject's footsteps by adjusting the length of the Foley sound retrieved in step 730 in consideration of the duration determined in step 725.

In step 740, in parallel with steps 730 to 735, the Foley sound providing device 320 may generate a Foley sound corresponding to the subject's footsteps using a Foley sound generation model (e.g., the Foley sound generation model 425 of FIG. 4) trained to generate a Foley sound based on at least one of the context, the duration of the footsteps, and the intensity of the footsteps. In the example of FIG. 7, steps 730, 735, and 740 may be included in step 520 of FIG. 5.

Although not shown in FIG. 7, the Foley sound providing method according to one embodiment may further include a step in which, before step 745, the Foley sound providing device adjusts the synchronization between the target image and the Foley sound using an editor (e.g., the editor 410 of FIG. 4). In the synchronization step, the Foley sound providing device may also combine the Foley sound generated through steps 730 to 735 with the Foley sound generated through step 740 to produce a single Foley sound.

Although FIG. 7 shows steps 730 to 735 being performed in parallel with step 740, another embodiment may further include a step (not shown) in which step 730 is performed first and whether to perform step 740 is decided according to whether the result of step 730 satisfies a preset condition. For example, in that step (not shown), the Foley sound providing device may determine whether the Foley sound retrieved in step 730 is suitable. For example, the Foley sound providing device may determine that the retrieved Foley sound is suitable when the difference between its length and the duration of the footsteps is less than or equal to a preset value, and unsuitable when the difference exceeds the preset value. For example, when the retrieved Foley sound lasts 1 second and the subject's footsteps last 3 seconds, stretching the 1-second Foley sound to 3 seconds in step 735 may degrade its quality, so the Foley sound providing device may determine that the retrieved Foley sound is not suitable.

Step 740 may be performed when the retrieved Foley sound is determined to be unsuitable. When the retrieved Foley sound is determined to be suitable, step 735 may be performed.

Returning to FIG. 7, in step 745, the Foley sound providing device 320 may provide the user device 310 with download data of the Foley sound generated in at least one of steps 735 and 740. In one embodiment, based on a user input, the Foley sound providing device 320 may provide download data of the generated Foley sound, or download data in which the generated Foley sound is merged with the target image. The merging of the generated Foley sound and the target image may be performed by an editor (e.g., the editor 410 of FIG. 4). In the example of FIG. 7, step 745 may be included in step 525 of FIG. 5.

FIG. 8 is a diagram illustrating an example of a user screen for the Foley sound providing method according to one embodiment.

In one embodiment, the user may use, through the user device 310, the Foley sound providing service provided by the Foley sound providing device 320. Referring to FIG. 8, an example is shown of the user screen 800 displayed on the user device 310 when the user uses the Foley sound providing service through the user device 310.

The user screen 800 may include a project upload area 805 that receives a project file, a video viewer area 810 that displays the uploaded video, a timeline area 815 that displays the video track of the uploaded video and the sound track of the Foley sound generated by the Foley sound providing device 320, a control area 820 for adjusting the sound volume and the equalizer (EQ), a text tag area 825 that displays the context generated by the Foley sound providing device 320, and an other area 830.

The text tag area 825 may display, for example, the context information of the target image recognized by the Foley sound providing device. The other area 830 may include, for example, at least one of a user interface (UI) element for requesting download of the Foley sound and a UI element for requesting download of the video into which the Foley sound has been merged.

The user may upload the video into which a Foley sound is to be mixed through the project upload area 805 or the video viewer area 810. The Foley sound providing device 320 may generate a Foley sound by performing the methods of FIGS. 4 to 7 on the uploaded video and transmit the generated Foley sound to the user device 310. The sound track of the Foley sound transmitted to the user device 310 may be displayed in the timeline area 815 of the user screen 800 together with the video track of the uploaded video.

The Foley sound displayed in the timeline area 815 of the user device 310 may be, for example, the Foley sound shown on the user device 310 before the download data is provided in step 525 of FIG. 5, step 650 of FIG. 6, or step 745 of FIG. 7. The user may edit the Foley sound by manipulating the sound track displayed in the timeline area 815. When the user edits the Foley sound through the user device, the Foley sound providing device 320 may receive from the user device a sound editing command corresponding to the user's manipulation. The Foley sound providing device 320 may edit the Foley sound in accordance with the sound editing command through its editor (e.g., the editor 410 of FIG. 4).

When the user has edited the Foley sound, the Foley sound providing device 320 may provide, in step 525 of FIG. 5, step 650 of FIG. 6, or step 745 of FIG. 7, download data of the edited Foley sound or download data of the video into which the edited Foley sound has been merged.

The embodiments described above may be implemented with hardware components, software components, and/or a combination of hardware components and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose computer or a special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions.

The processing device may run an operating system (OS) and software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but those of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may command the processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on a computer-readable recording medium.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may store program instructions, data files, data structures, and the like, alone or in combination, and the program instructions recorded on the medium may be specially designed and configured for the embodiments or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of drawings, those of ordinary skill in the art may apply various technical modifications and variations based on them. For example, suitable results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims that follow.

110, 120, 130: electronic device
140: network
150, 160: server
210: memory
220: processor
230: communication interface
240: input/output interface
250: input/output device
300: Foley sound providing system
310: user device
320: Foley sound providing device

Claims

A method of providing artificial intelligence-based Foley sound, performed by a computer device, the method comprising:
obtaining a target image including a subject performing a target action;
determining, from the target image, context information related to the target action;
determining, from the target image, a duration of the target action and an intensity of the target action by tracking a posture of the subject;
generating a Foley sound corresponding to the target action of the subject based on the determined context information, the duration, and the intensity; and
providing download data of the generated Foley sound or download data in which the generated Foley sound and the target image are merged,
wherein the determining of the context information comprises:
recognizing, by the artificial intelligence, properties of at least one object including the subject and providing them as candidates for a user to select, or predicting, based on information obtained when the artificial intelligence first recognizes the objects on the screen of the target image, properties of other objects that affect the Foley sound but are not displayed on the screen, and providing them as candidates for the user to select.

The method of claim 1,
wherein the generating of the Foley sound comprises:
searching a sound library for a sound corresponding to the determined context information; and
when a sound corresponding to the context information is found, adjusting the length of the found sound based on the duration.

The method of claim 2,
wherein the generating of the Foley sound further comprises:
when no sound corresponding to the context information is found, inputting the context information, the duration, and the intensity into a trained neural network model; and
generating, as an output of the trained neural network model, a Foley sound corresponding to the input context information, duration, and intensity.
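
The search-then-fallback flow recited in the two preceding claims can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the names `make_foley`, `fit_to_duration`, `library`, and `generate` are hypothetical, the sound library is assumed to be a dictionary keyed by the context text, audio is assumed to be a plain list of samples, and "adjusting the length" is assumed here to mean looping or trimming the clip (the claims do not specify the adjustment technique).

```python
from typing import Callable, Dict, List

Samples = List[float]


def fit_to_duration(samples: Samples, sample_rate: int, duration_s: float) -> Samples:
    """Loop or trim a clip so that it lasts duration_s seconds (assumed strategy)."""
    target_len = int(round(duration_s * sample_rate))
    if not samples or target_len <= 0:
        return []
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]


def make_foley(
    context: str,
    duration_s: float,
    intensity: float,
    library: Dict[str, Samples],
    sample_rate: int,
    generate: Callable[[str, float, float], Samples],
) -> Samples:
    """Use a library sound when one matches the context; otherwise call the model."""
    clip = library.get(context)
    if clip is not None:
        # A matching sound was found: adjust its length to the action's duration.
        return fit_to_duration(clip, sample_rate, duration_s)
    # No match: hand context, duration, and intensity to the (hypothetical) trained model.
    return generate(context, duration_s, intensity)
```

In this sketch `generate` stands in for the trained neural network model; any callable that accepts the context text, duration, and intensity and returns samples can be substituted.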

The method of claim 1,
wherein the target action is a person's footsteps, and
the properties of the at least one object include the gender of the subject, the material of the ground on which the subject is walking, and the material of the shoes the subject is wearing.

The method of claim 4,
wherein the context information is determined in text form based on the recognized gender of the subject, the recognized material of the ground, and the recognized material of the shoes.

The method of claim 1,
wherein the target action is a person's footsteps, and
the determining of the duration of the target action and the intensity of the target action comprises:
detecting the posture of the subject from the target image;
obtaining movement of the subject in the target image by tracking the detected posture; and
determining the duration of the footsteps and the intensity of the footsteps based on the obtained movement.
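
One possible way to turn tracked posture into a duration and an intensity is sketched below. Everything in it is an assumption made for illustration: the claims do not say which keypoint is tracked or how intensity is measured. Here a single keypoint coordinate per frame (the hypothetical `ankle_y`) is assumed to be available from a pose tracker, the duration is taken as the time span over which the keypoint moves faster than a hypothetical `speed_threshold`, and the intensity is taken as the peak speed.

```python
from typing import List, Tuple


def duration_and_intensity(
    ankle_y: List[float], fps: float, speed_threshold: float = 0.05
) -> Tuple[float, float]:
    """Estimate (duration in seconds, intensity as peak speed) from one tracked keypoint."""
    # Speed between consecutive frames, in position units per second.
    speeds = [abs(b - a) * fps for a, b in zip(ankle_y, ankle_y[1:])]
    moving = [i for i, s in enumerate(speeds) if s > speed_threshold]
    if not moving:
        return 0.0, 0.0
    # speeds[i] covers the interval between frames i and i + 1, so the span
    # from the first to the last moving interval contains this many intervals.
    duration = (moving[-1] - moving[0] + 1) / fps
    intensity = max(speeds)
    return duration, intensity
```

For example, with positions `[0.0, 0.0, 0.25, 0.75, 0.75]` sampled at 4 frames per second, the keypoint moves during two frame intervals, giving a duration of 0.5 seconds and a peak speed of 2.0 units per second.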

The method of claim 6,
wherein the obtaining of the movement of the subject comprises:
in response to detecting the posture of the subject's upper body, tracking the detected posture of the upper body to obtain movement of the upper body; and
obtaining movement of the subject's lower body based on the movement of the upper body.

The method of claim 1, further comprising:
receiving an original image from a user device; and
inputting the original image into a trained neural network model to classify, among the frames of the original image, at least one consecutive frame corresponding to the target action,
wherein the obtaining of the target image comprises obtaining the classified at least one consecutive frame as the target image.
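
The step of picking out consecutive frames that correspond to the target action can be illustrated with a short sketch. It assumes, hypothetically, that the trained model has already produced one action label per frame (`frame_labels`); the function name `action_segments` and the label strings are illustrative and do not come from the patent.

```python
from itertools import groupby
from typing import List, Tuple


def action_segments(frame_labels: List[str], target: str) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) index pairs for each run of the target label."""
    segments = []
    index = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label == target:
            # Inclusive frame range of this run of consecutive target frames.
            segments.append((index, index + length - 1))
        index += length
    return segments
```

Each returned pair marks one run of consecutive frames that could serve as a target image; for labels `["idle", "walk", "walk", "walk", "idle", "walk"]` and target `"walk"`, the runs are frames 1 to 3 and frame 5.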

The method of claim 1,
wherein the properties of each of the at least one object are recognized through different neural network models, each trained to recognize the properties of a respective object from images, and
the context information is text information connecting the outputs of the different neural network models.
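
Building text-form context information from several recognizers can be sketched as below. This is a minimal sketch under stated assumptions: each neural network model is represented by a hypothetical callable that maps an image to a short text label, and the output format (comma-separated `name: value` pairs) is an illustrative choice, since the claims only state that the outputs are connected as text.

```python
from typing import Any, Callable, Dict


def build_context(image: Any, recognizers: Dict[str, Callable[[Any], str]]) -> str:
    """Run each per-property recognizer and join the outputs into one text string."""
    parts = [f"{name}: {recognize(image)}" for name, recognize in recognizers.items()]
    return ", ".join(parts)
```

With stub recognizers for gender, ground material, and shoe material that return `"female"`, `"wood"`, and `"leather"`, the result is `"gender: female, ground: wood, shoes: leather"`; this string could then serve as a lookup key for a sound library or as input to a generative model.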

The method of claim 1, further comprising:
transmitting the generated Foley sound to a user device;
receiving a sound editing command for the Foley sound from the user device that received the Foley sound; and
editing the Foley sound according to the sound editing command,
wherein the providing of the download data comprises providing download data of the edited Foley sound or download data of the target image into which the edited Foley sound is merged.

A computer program combined with hardware and stored in a computer-readable recording medium to execute the method of any one of claims 1 to 10.

A computer device comprising:
a processor; and
a memory storing instructions to be executed by the processor,
wherein, when the instructions are executed by the processor, the processor:
obtains a target image including a subject performing a target action,
determines, from the target image, context information related to the target action,
determines, from the target image, a duration of the target action and an intensity of the target action by tracking a posture of the subject,
generates a Foley sound corresponding to the target action of the subject based on the determined context information, the duration, and the intensity, and
provides download data of the generated Foley sound or download data in which the generated Foley sound and the target image are merged,
wherein the context information is determined by recognizing, using artificial intelligence, properties of at least one object including the subject and having a user select among them, or by predicting, based on information obtained when the artificial intelligence first recognizes the objects on the screen of the target image, properties of other objects that affect the Foley sound but are not displayed on the screen, and having the user select among them.