KR102281298B1

KR102281298B1 - System and method for video synthesis based on artificial intelligence

Info

Publication number: KR102281298B1
Application number: KR1020200048850A
Authority: KR
Inventors: 이정현; 장평옥
Original assignee: 이정현; 장평옥
Priority date: 2020-04-22
Filing date: 2020-04-22
Publication date: 2021-07-23

Abstract

A method for synthesizing a video according to an exemplary embodiment of the present invention may include: obtaining input data that at least partially defines the video; extracting, based on predefined rules, a series of instructions and content data to be included as part of the video from the input data; generating structured intermediate data from the content data and the series of instructions, based on at least one machine learning model trained on a plurality of instruction samples; and generating the video based on the intermediate data. The rules may include rules for identifying the content data and the series of instructions. A user can thereby synthesize the video easily.

Description

SYSTEM AND METHOD FOR VIDEO SYNTHESIS BASED ON ARTIFICIAL INTELLIGENCE

본 발명의 기술적 사상은 동영상 합성에 관한 것으로서, 자세하게는 인공지능 기반 동영상 합성을 위한 시스템 및 방법에 관한 것이다.The technical idea of the present invention relates to video synthesis, and more particularly, to a system and method for video synthesis based on artificial intelligence.

Owing to advances in wired and wireless communication technology, high-capacity data transmission has become possible, and owing to advances in storage technology, data storage capacity has increased. As a result, the transfer of information through video has increased markedly, replacing the transfer of information through text or images. Obtaining information through video is more convenient than obtaining it from text or images, but the producer of the information, that is, the creator of the video, may need more effort to produce a video than to produce text or images.

The technical idea of the present invention provides a system and method for video synthesis based on artificial intelligence that allow a user to synthesize a video easily.

To achieve the above object, a method for synthesizing a video according to one aspect of the technical idea of the present invention may include: obtaining input data that at least partially defines the video; extracting, based on predefined rules, a series of instructions and content data to be included as part of the video from the input data; generating structured intermediate data from the content data and the series of instructions, based on at least one machine learning model trained on a plurality of instruction samples; and generating the video based on the intermediate data. The rules may include rules for identifying the content data and the series of instructions.

According to an exemplary embodiment of the present invention, extracting the series of instructions and the content data may include extracting a text-based instruction based on the rules, and extracting the text-based instruction may include at least one of: extracting a first instruction that directs insertion of content; extracting a second instruction that controls the flow of the video; and extracting a third instruction written in natural language.
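The three kinds of text-based instruction can be illustrated with a minimal Python sketch. The bracket syntax and the keys `subtitle`, `image`, `sound`, `pause`, and `speed` are assumptions made only for this illustration; the specification does not fix a concrete notation here.

```python
import re
from dataclasses import dataclass

# Hypothetical markers, assumed only for this sketch:
#   [subtitle: ...], [image: ...], [sound: ...] -> first instruction (content insertion)
#   [pause: ...], [speed: ...]                  -> second instruction (flow control)
#   any other bracketed text                    -> third instruction (natural language)
INSERT_KEYS = {"subtitle", "image", "sound"}
FLOW_KEYS = {"pause", "speed"}
BRACKET = re.compile(r"\[([^\[\]]+)\]")


@dataclass
class Instruction:
    kind: str   # "insert", "flow", or "natural"
    key: str    # e.g. "subtitle"; empty for natural-language instructions
    value: str  # the argument text


def extract_instructions(text: str) -> list[Instruction]:
    """Apply the predefined rules to every bracketed span in the input text."""
    found = []
    for body in BRACKET.findall(text):
        key, sep, value = body.partition(":")
        key = key.strip().lower()
        if sep and key in INSERT_KEYS:
            found.append(Instruction("insert", key, value.strip()))
        elif sep and key in FLOW_KEYS:
            found.append(Instruction("flow", key, value.strip()))
        else:
            # Anything the rules do not recognize is kept as natural language.
            found.append(Instruction("natural", "", body.strip()))
    return found
```

In this sketch, unrecognized bracketed text falls through to the natural-language case, so it can be handed to a later language-processing step instead of being rejected.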

According to an exemplary embodiment of the present invention, generating the intermediate data may include at least one of: when the first instruction corresponds to insertion of a subtitle, dividing the subtitle into at least one part and generating intermediate data including the at least one part; when the first instruction corresponds to insertion of an image, obtaining information about the position and/or size of the image and generating intermediate data including the obtained information; and when the first instruction corresponds to insertion of a sound, determining at least one of a voice, a background sound, and a sound effect, and generating intermediate data including information about the determined sound.
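As a rough illustration of these three cases, the following Python sketch builds one intermediate-data entry per first instruction. The dictionary layout, the character limit, and the role labels are assumptions for this example, not a format defined by the specification.

```python
import textwrap


def subtitle_parts(text: str, max_chars: int = 20) -> dict:
    """Divide a subtitle into parts no longer than max_chars (an assumed limit)."""
    parts = textwrap.wrap(text, width=max_chars)
    return {"type": "subtitle", "parts": parts}


def image_entry(path: str, x: float, y: float, scale: float) -> dict:
    """Record image position and size; x and y are fractions of the frame."""
    return {"type": "image", "path": path, "position": [x, y], "scale": scale}


def sound_entry(path: str, role: str) -> dict:
    """Record a sound together with its determined role."""
    if role not in {"voice", "background", "effect"}:
        raise ValueError(f"unknown sound role: {role}")
    return {"type": "sound", "path": path, "role": role}
```

Splitting on word boundaries with `textwrap.wrap` is only one possible policy; a trained model could equally choose the split points.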

According to an exemplary embodiment of the present invention, generating the intermediate data may include, in response to the second instruction, generating timing information for a series of frames and generating intermediate data including the timing information.
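One way to picture timing information for a series of frames is as consecutive frame ranges derived from per-scene durations. The sketch below assumes a fixed frame rate and a hypothetical `SceneTiming` record; neither is prescribed by the specification.

```python
from dataclasses import dataclass


@dataclass
class SceneTiming:
    name: str
    start_frame: int  # inclusive
    end_frame: int    # exclusive


def build_timing(durations: list[tuple[str, float]], fps: int = 30) -> list[SceneTiming]:
    """Turn per-scene durations in seconds into consecutive frame ranges."""
    timings = []
    cursor = 0
    for name, seconds in durations:
        frames = round(seconds * fps)
        timings.append(SceneTiming(name, cursor, cursor + frames))
        cursor += frames
    return timings
```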

According to an exemplary embodiment of the present invention, generating the intermediate data may include, in response to the third instruction, generating at least one of the first instruction and the second instruction by processing the natural language.

According to an exemplary embodiment of the present invention, extracting the series of instructions and the content data may include extracting an action-based instruction based on the rules, and extracting the action-based instruction may include: extracting a shape included in an action layer; recognizing a symbol and/or language from the shape, based on the at least one machine learning model; and generating at least one text-based instruction based on the symbol and/or language.

According to an exemplary embodiment of the present disclosure, extracting the series of instructions and the content data may include extracting an action-based instruction based on the rules, and extracting the action-based instruction may include: identifying the action-based instruction based on the at least one machine learning model and extracting a shape from the identified action-based instruction; recognizing a symbol and/or language from the shape, based on the at least one machine learning model; and generating at least one text-based instruction based on the symbol and/or language.

According to an exemplary embodiment of the present invention, obtaining the input data may include providing an input interface to a user and receiving the input data through the input interface. Providing the input interface may include: displaying a scene area that corresponds to a scene, which is a part of the playback time of the video, and in which the content data and the series of instructions corresponding to the scene are displayed; and displaying a flow area that includes at least some of a series of scenes.

According to an exemplary embodiment of the present invention, extracting the series of instructions and the content data may include extracting, through the flow area, a fourth instruction that directs a timing-dependent effect.

A method for synthesizing a video according to one aspect of the technical idea of the present invention may include: providing a first video to multiple users; receiving first feedback on the first video from the multiple users; receiving, from at least some of the multiple users, first input data that at least partially modifies the first video; synthesizing a second video by modifying the first video based on the first input data; and providing the second video to the multiple users.

According to an exemplary embodiment of the present invention, the method for synthesizing a video may further include: receiving second feedback on the second video from the multiple users; receiving, from at least some of the multiple users, second input data that at least partially modifies the second video; synthesizing a third video by modifying the second video based on the second input data; and providing the third video to the multiple users.
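The repeated cycle of feedback, modification, and re-synthesis can be sketched as a function that derives the next version of a video from the previous one. The dictionary shapes (`version`, `scenes`, `scene`, `content`) are hypothetical and stand in for whatever representation an implementation actually uses.

```python
from collections import Counter


def next_version(video: dict, feedback: list[str], edits: list[dict]) -> dict:
    """Synthesize the next video by applying edits received from multiple users."""
    scenes = dict(video["scenes"])  # copy so the previous version stays intact
    for edit in edits:
        scenes[edit["scene"]] = edit["content"]
    return {
        "version": video["version"] + 1,
        "scenes": scenes,
        # Tally of the feedback collected on the version being replaced.
        "feedback_on_previous": dict(Counter(feedback)),
    }
```

Calling `next_version` on its own result corresponds to going from the second video to the third.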

According to the system and method of the technical idea of the present invention, a user can intuitively input the information required to generate a video. Accordingly, video generation need not be limited to filming, and the convenience of video generation can be improved.

In addition, according to the system and method of the technical idea of the present invention, a user can use natural language input as well as rule-based structured input, so that anyone can easily produce a video.

In addition, according to the system and method of the technical idea of the present invention, videos can be modified by multiple users, and collective generation of videos by multiple users can be utilized in various business models.

The effects obtainable from the embodiments of the present invention are not limited to those mentioned above, and other effects not mentioned can be clearly derived and understood by those of ordinary skill in the art to which the present invention belongs from the following description of the embodiments. That is, unintended effects of practicing the present invention may also be derived from the embodiments by those of ordinary skill in the art.

FIG. 1 is a block diagram showing a video synthesis system according to an exemplary embodiment of the present invention.
FIG. 2 is a flowchart showing an example of a method for synthesizing a video according to an exemplary embodiment of the present invention.
FIG. 3 is a flowchart showing an example of a method for synthesizing a video according to an exemplary embodiment of the present invention.
FIG. 4A shows an example of a display provided to a user for video synthesis according to an exemplary embodiment of the present invention, and FIG. 4B shows an exemplary frame of a video synthesized according to an exemplary embodiment of the present invention.
FIG. 5A shows an example of a display provided to a user for video synthesis according to an exemplary embodiment of the present invention, and FIG. 5B shows exemplary frames of a video synthesized according to an exemplary embodiment of the present invention.
FIG. 6 shows an example of a display provided to a user for video synthesis according to an exemplary embodiment of the present invention.
FIG. 7 is a flowchart showing an example of a method for synthesizing a video according to an exemplary embodiment of the present disclosure.
FIG. 8 shows an example of a display provided to a user for video synthesis according to an exemplary embodiment of the present invention.
FIGS. 9A and 9B are flowcharts showing examples of a method for synthesizing a video according to exemplary embodiments of the present invention.
FIG. 10 shows an example of a display provided to a user for video synthesis according to an exemplary embodiment of the present invention.
FIG. 11 is a block diagram showing an example in which a video synthesis system according to an exemplary embodiment of the present disclosure is used.
FIG. 12 is a flowchart showing an example of the operation of the video synthesis system of FIG. 11 according to an exemplary embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. Because the present invention may be modified in various ways and may take various forms, specific embodiments are illustrated in the drawings and described in detail. This is not intended, however, to limit the present invention to the specific forms disclosed; the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. In describing the drawings, like reference numerals are used for like elements. In the accompanying drawings, the dimensions of structures are shown enlarged or reduced relative to their actual size for clarity.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.

In the drawings and description below, a component shown or described as a single block may be a hardware block or a software block. For example, each component may be an independent hardware block that exchanges signals with the others, or a software block executed by at least one processor.

FIG. 1 is a block diagram showing a video synthesis system 10 according to an exemplary embodiment of the present invention. As shown in FIG. 1, the video synthesis system 10 may communicate with a user 20, and may also access a network 32 or communicate with a data store 34. In some embodiments, the video synthesis system 10 may be a stationary computing system such as a server or a desktop computer, or a mobile computing system such as a laptop computer or a mobile phone.

The video synthesis system 10 may synthesize and provide a video according to input data IN provided by the user 20. As described below with reference to the drawings, the user 20 may provide the video synthesis system 10 with input data IN with which a video can be easily defined, and the video synthesis system 10 may synthesize the video by processing the input data IN based on artificial intelligence (AI). Accordingly, the user 20 can intuitively input the information required for video generation into the video synthesis system 10, and the convenience of video generation can be improved. In addition, the user 20 can use natural language input as well as rule-based structured input as the input data IN, so that anyone can easily produce a video regardless of the group to which the user 20 belongs. In addition, as described below with reference to FIGS. 11 and 12, the video synthesis system 10 may enable video modification by multiple users, and may enable collective generation of videos by multiple users to be utilized in various business models. As shown in FIG. 1, the video synthesis system 10 may include a user interface 11, an input processing unit 13, an intermediate data generation unit 15, a video generation unit 17, and at least one machine learning model 19.
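The data flow among these components (IN, then CNT and DIR, then INT, then VDO) can be summarized in a short Python sketch. Every function body below is a placeholder, and the rule that a leading `!` marks an instruction is an assumption made only for this illustration.

```python
def process_input(input_data: str) -> tuple[list[str], list[str]]:
    """Stand-in for input processing unit 13: split IN into CNT and DIR.

    Assumed rule for this sketch: lines starting with '!' are instructions.
    """
    lines = [ln.strip() for ln in input_data.splitlines() if ln.strip()]
    cnt = [ln for ln in lines if not ln.startswith("!")]
    dir_ = [ln[1:].strip() for ln in lines if ln.startswith("!")]
    return cnt, dir_


def generate_intermediate(cnt: list[str], dir_: list[str]) -> dict:
    """Stand-in for intermediate data generation unit 15: combine CNT and DIR into INT."""
    return {"segments": [{"content": c, "instructions": list(dir_)} for c in cnt]}


def generate_video(intermediate: dict) -> list[str]:
    """Stand-in for video generation unit 17: return one label per rendered segment."""
    return [f"segment[{i}]: {seg['content']}"
            for i, seg in enumerate(intermediate["segments"])]


def synthesize(input_data: str) -> list[str]:
    """Run the three stages in order, as the block diagram suggests."""
    cnt, dir_ = process_input(input_data)
    return generate_video(generate_intermediate(cnt, dir_))
```

Keeping the intermediate data as a separate, explicit value is what lets later stages be swapped or re-run without re-parsing the user input.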

The user interface 11 may provide communication with the user 20. For example, as shown in FIG. 1, the user interface 11 may receive the input data IN from the user 20 and provide the received input data IN to the input processing unit 13. The user interface 11 may also provide the user 20, as output data OUT, with a response RES provided from the input processing unit 13 and/or video data VDO provided from the video generation unit 17. In some embodiments, the user 20 may provide the input data IN to the video synthesis system 10 through an input device connected to the video synthesis system 10, such as a keyboard, a mouse, or a touch pad, and may receive the output data OUT through an output device connected to the video synthesis system 10, such as a monitor, a printer, or a speaker. In some embodiments, the user 20 may also access the video synthesis system 10 through the user's own terminal, such as a smartphone or a tablet PC, and may provide the input data IN or receive the output data OUT through the terminal.

The input processing unit 13 may receive the input data IN from the user interface 11 and may provide the response RES to the user interface 11 by processing the input data IN. When the input data IN is received, the input processing unit 13 may generate the response RES in order to provide a display reflecting the input data IN or to request additional input data IN, and may provide the response RES to the user 20 through the user interface 11. The input data IN may refer to data, provided by the user 20, that at least partially defines a video. For example, the input data IN may include information about content to be included as part of the video, such as video, subtitles, photos, and sound, and may also include information about guides used in video synthesis, such as effects, playback speed, and screen composition.

The input processing unit 13 may extract, from the input data IN, content data CNT relating to content and an instruction DIR relating to a guide, and may provide the content data CNT and the instruction DIR to the intermediate data generation unit 15. The input processing unit 13 may also extract, from the input data IN, content data CNT corresponding to a plurality of contents and an instruction DIR corresponding to a series of instructions. In some embodiments, as indicated by the dotted line in FIG. 1, the input processing unit 13 may extract the content data CNT and the instruction DIR from the input data IN based on the at least one machine learning model 19, and the at least one machine learning model 19 may have been trained on a plurality of input data samples.

The intermediate data generation unit 15 may receive the content data CNT and the instruction DIR from the input processing unit 13 and may generate intermediate data INT from the content data CNT and the instruction DIR. The intermediate data INT is structured data that may be generated by combining the content data CNT and the instruction DIR, and may have a structure that the video generation unit 17 can recognize. In some embodiments, as indicated by the dotted line in FIG. 1, the intermediate data generation unit 15 may generate the intermediate data INT from the content data CNT and the instruction DIR based on the at least one machine learning model 19, and the at least one machine learning model 19 may have been trained on a plurality of instruction samples.

The intermediate data INT may define a segment as at least a part of a video. For example, to define a segment, the intermediate data INT may include paths to images, sounds, other videos, and the like, and may also include pointers to data stored in memory. The intermediate data INT may also define, as sound, a voice synthesized by text-to-speech (TTS), the voice of the user 20, and the like. In some embodiments, the intermediate data INT may include not only the structured data described above but also unstructured data, such as an embedding and/or a latent vector. An embedding or latent vector may refer to an intermediate result produced when an artificial neural network, which is a kind of machine learning model, processes an input. For example, the intermediate data generation unit 15 may generate an embedding and/or a latent vector from the content data CNT and the instruction DIR, and may generate intermediate data INT including the embedding and/or latent vector. The embedding and/or latent vector included in the intermediate data INT may be provided to an artificial neural network in the video generation unit 17, and the output data of the artificial neural network may be used in generating the video.
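A possible shape for such intermediate data is sketched below in Python. The class names, field names, and source labels are hypothetical; the sketch only shows how structured fields (paths, a sound definition) and an unstructured part (an embedding) could sit side by side in one serializable record.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class SoundSpec:
    source: str                 # "tts", "user_voice", or "file" (assumed labels)
    text: Optional[str] = None  # text to synthesize when source == "tts"
    path: Optional[str] = None  # path to a recording or sound file


@dataclass
class Segment:
    image_paths: list[str] = field(default_factory=list)
    video_paths: list[str] = field(default_factory=list)
    sound: Optional[SoundSpec] = None
    embedding: list[float] = field(default_factory=list)  # unstructured part, if any


def to_json(segments: list[Segment]) -> str:
    """Serialize the intermediate data so a separate generator can read it."""
    return json.dumps({"segments": [asdict(s) for s in segments]})
```

For example, `to_json([Segment(image_paths=["a.png"], sound=SoundSpec("tts", text="hi"))])` yields a JSON document with one segment whose sound is to be synthesized from text.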

In some embodiments, the intermediate data generation unit 15 may generate new intermediate data INT based on previously generated intermediate data INT. That is, because the intermediate data generation unit 15 may use many resources to generate the intermediate data INT from the content data CNT and the instruction DIR, the intermediate data generation unit 15 may determine, based on the instruction DIR, whether to modify already generated intermediate data INT or to add data to it. For example, when a modified instruction DIR is received according to the input data IN, the intermediate data generation unit 15 may determine whether processing must be repeated starting from speech synthesis, whether another video must be merged, or whether a new segment must be generated and appended.
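The decision described above can be approximated by comparing the previous and the revised description of each segment. In the following sketch the keys `narration` and `clips` and the action names are hypothetical, and a plain comparison stands in for whatever criterion an implementation would actually apply.

```python
def plan_update(old_segments: list[dict], new_segments: list[dict]) -> list[tuple[int, str]]:
    """List, per segment index, the cheapest action that brings it up to date."""
    plan = []
    for i, new in enumerate(new_segments):
        if i >= len(old_segments):
            # No earlier result exists: a new segment must be generated and appended.
            plan.append((i, "create_segment"))
            continue
        old = old_segments[i]
        if old.get("narration") != new.get("narration"):
            # The spoken text changed, so work restarts from speech synthesis.
            plan.append((i, "resynthesize_speech"))
        elif old.get("clips") != new.get("clips"):
            # Only the source clips changed: merge them without redoing speech.
            plan.append((i, "recombine_clips"))
    return plan
```

Segments that are unchanged do not appear in the plan, which is where the resource saving comes from.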

The video generation unit 17 may receive the intermediate data INT from the intermediate data generation unit 15 and may generate the video data VDO by synthesizing a video based on the intermediate data INT. For example, the video generation unit 17 may generate a segment by synthesizing the images or sounds defined by the intermediate data INT, and may generate the video by combining segments. In some embodiments, the video generation unit 17 may detect, from the intermediate data INT, an effect that needs to be reflected in a series of segments, and may apply the detected effect to the series of segments in common. In some embodiments, as indicated by the dotted line in FIG. 1, the video generation unit 17 may generate the video data VDO from the intermediate data INT based on the at least one machine learning model 19, and the at least one machine learning model 19 may have been trained on a plurality of intermediate data samples. In some embodiments, the video generation unit 17 may also access a network 32 such as the Internet, perform a search based on a guide included in the intermediate data INT, and use data received through the network 32 to generate the video data VDO. The video generation unit 17 may also access the data store 34, obtain necessary data from the data store 34 based on a guide included in the intermediate data INT, and use the obtained data to generate the video data VDO. In some embodiments, unlike what is shown in FIG. 1, the data store 34 may be included in the video synthesis system 10.
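Applying one effect to a run of segments and then combining the segments can be sketched as follows. The segment and effect dictionaries, including the `first` and `last` index fields, are assumptions for this illustration; actual rendering is omitted.

```python
def apply_common_effects(segments: list[dict], effects: list[dict]) -> list[dict]:
    """Attach each effect to every segment in its inclusive [first, last] index range."""
    # Copy each segment so the input list is left unmodified.
    out = [dict(seg, effects=list(seg.get("effects", []))) for seg in segments]
    for eff in effects:
        last = min(eff["last"], len(out) - 1)
        for i in range(eff["first"], last + 1):
            out[i]["effects"].append(eff["name"])
    return out


def concatenate(segments: list[dict]) -> dict:
    """Combine segments into one video description; duration is the sum of the parts."""
    return {"duration": sum(s["duration"] for s in segments), "segments": segments}
```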

In some embodiments, the video generation unit 17 may provide a preview of the video to the user 20 as output data OUT through the user interface. For example, suppose the input processing unit 13 generates the instruction DIR from a natural-language instruction included in the input data IN, such as "Render the following scenes in black and white, like the visual effect of the movie Sin City, but preserve only the red color", and the intermediate data generation unit 15 generates the intermediate data INT based on the instruction DIR. In that case, the video generation unit 17 may determine that the effect applied in common to a series of segments requires substantial computing resources, and may show the user 20 some segments with the effect applied in advance in order to receive confirmation (for example, yes or no) from the user 20 as to whether to apply it.
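The effect named in this example, black and white with only red preserved, is commonly implemented as a per-pixel selective-color filter. The Python sketch below is one such filter using only the standard library; the hue tolerance and saturation threshold are assumed values, and the specification does not state how the effect is actually computed.

```python
import colorsys


def keep_red_only(pixel: tuple[int, int, int],
                  hue_tolerance: float = 0.05,
                  min_saturation: float = 0.5) -> tuple[int, int, int]:
    """Return the pixel unchanged if it is strongly red, otherwise its gray value."""
    r, g, b = pixel
    h, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    # Hue is circular in [0, 1); red sits at both ends of the range.
    is_red = (h <= hue_tolerance or h >= 1 - hue_tolerance) and s >= min_saturation
    if is_red:
        return pixel
    gray = round(0.299 * r + 0.587 * g + 0.114 * b)  # Rec. 601 luma weights
    return (gray, gray, gray)
```

Running such a filter over every pixel of every frame is the kind of cost that makes a short preview on a few segments worthwhile before committing to the full render.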

The at least one machine learning model 19 may have been trained on a plurality of samples and may be any artificial intelligence model. In addition, in some embodiments, the at least one machine learning model 19 may be continuously trained while the video synthesis system 10 performs video synthesis according to the input data IN of the user 20. For example, the at least one machine learning model 19 may be a model based on an artificial neural network, a decision tree, a support vector machine, regression analysis, a Bayesian network, a genetic algorithm, or the like. In this specification, the at least one machine learning model 19 is described mainly with reference to an artificial neural network, as shown in FIG. 1, but it is noted that exemplary embodiments of the present disclosure are not limited thereto. As non-limiting examples, the artificial neural network may include a Convolution Neural Network (CNN), a Region with Convolution Neural Network (R-CNN), a Region Proposal Network (RPN), a Recurrent Neural Network (RNN), a Stacking-based Deep Neural Network (S-DNN), a State-Space Dynamic Neural Network (S-SDNN), a Deconvolution Network, a Deep Belief Network (DBN), a Restricted Boltzmann Machine (RBM), a Fully Convolutional Network, a Long Short-Term Memory (LSTM) Network, a Classification Network, and the like.

In some embodiments, the at least one machine learning model 19 may include at least one of a machine learning model used by the input processing unit 13, a machine learning model used by the intermediate data generating unit 15, and a machine learning model used by the video generating unit 17. In some embodiments, the video synthesis system 10 may include dedicated hardware designed to implement the at least one machine learning model 19, such as a neural processing unit (NPU) or a graphics processing unit (GPU), and in some embodiments it may include a software block that implements the at least one machine learning model 19 by being executed on a general-purpose processor. Also, in some embodiments, unlike what is shown in FIG. 1, the at least one machine learning model 19 may be implemented outside the video synthesis system 10, and the video synthesis system 10 may access the at least one machine learning model 19.

FIG. 2 is a flowchart illustrating an example of a method for synthesizing a video according to an exemplary embodiment of the present invention. As shown in FIG. 2, the method for synthesizing a video may include a plurality of steps S200, S400, S600, and S800. In some embodiments, the method of FIG. 2 may be performed by the video synthesis system 10 of FIG. 1, and FIG. 2 is described below with reference to FIG. 1.

Referring to FIG. 2, an operation of obtaining the input data IN may be performed in step S200. For example, the user interface 11 may receive the input data IN from the user 20 and may provide the input data IN to the input processing unit 13. The video synthesis system 10 may also obtain the input data IN by accessing a storage in which the input data IN is stored. As described above with reference to FIG. 1, the input data IN may refer to data that at least partially defines a video.

In step S400, an operation of extracting the content data CNT and the instruction DIR may be performed. In some embodiments, the input processing unit 13 may extract the content data CNT and the instruction DIR from the input data IN based on predefined rules. For example, rules for identifying the content data CNT and the instruction DIR (e.g., grammars that define instructions) may be predefined, and the user 20 may provide input data IN that includes guides written according to the predefined rules. Accordingly, the input processing unit 13 may extract the instruction DIR corresponding to each guide based on the predefined rules, and may extract the content data CNT from the remaining portion of the input data IN. In addition, in some embodiments, the input processing unit 13 may extract the content data CNT and the instruction DIR based on the at least one machine learning model 19. Examples of step S400 are described later with reference to FIGS. 3, 9A, and 9B, among others.
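The rule-based extraction of step S400 can be illustrated with a minimal Python sketch. The rule set used here is an assumption made only for illustration and is not the grammar defined by this specification: a leading "#" marks a user comment, a leading "//" marks a natural language instruction, and a line that begins with one of the hypothetical keywords "img", "cap", or "voc" and contains "=" is treated as a rule-defined instruction. Everything else is treated as content.

```python
# Hypothetical keywords; the actual grammar is not specified in the text.
KEYWORDS = {"img", "cap", "voc"}

def extract(input_text):
    """Split input data into (directives, content) using simple line rules."""
    directives, content = [], []
    for raw in input_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            # Blank lines and the user's own comments are not used for synthesis.
            continue
        if line.startswith("//"):
            # Free-form instruction, to be handled by natural language processing.
            directives.append(("natural", line[2:].strip()))
        elif line.split(" ", 1)[0] in KEYWORDS and "=" in line:
            # Instruction that follows the (assumed) predefined grammar.
            directives.append(("rule", line))
        else:
            # The remaining portion of the input is content data.
            content.append(line)
    return directives, content
```

A real implementation might replace the keyword test with a full grammar or with a trained model, as the embodiments above allow.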

In step S600, an operation of generating the intermediate data INT may be performed. For example, as described above with reference to FIG. 1, the intermediate data generating unit 15 may generate structured intermediate data INT by combining the content data CNT and the instruction DIR extracted in step S400. In addition, in some embodiments, the intermediate data generating unit 15 may generate the intermediate data INT based on at least one machine learning model 19 trained on a plurality of instruction samples.

In step S800, an operation of generating the video may be performed. For example, the video generating unit 17 may generate the video data VDO based on the intermediate data INT generated in step S600. In some embodiments, to generate the video data VDO, the video generating unit 17 may rely on the at least one machine learning model 19, may connect to the network 32, and may access the data storage 34.
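The four steps S200, S400, S600, and S800 can be pictured as a simple pipeline. The following Python sketch is illustrative only: all function names are hypothetical, the "//" prefix for instructions is an assumed convention, and the "video" is represented as a list of strings so that the data flow stays visible.

```python
def obtain_input(source):
    """S200: obtain the input data (here, simply a text string)."""
    return source

def extract(input_data):
    """S400: separate instructions (assumed to start with '//') from content."""
    content, directives = [], []
    for line in input_data.splitlines():
        line = line.strip()
        if line.startswith("//"):
            directives.append(line[2:].strip())
        elif line:
            content.append(line)
    return content, directives

def build_intermediate(content, directives):
    """S600: combine content and instructions into structured intermediate data."""
    return [{"text": text, "effects": list(directives)} for text in content]

def generate_video(intermediate):
    """S800: produce one rendered segment per entry (a string stands in for video)."""
    return [f"{seg['text']} [{', '.join(seg['effects'])}]" for seg in intermediate]

def synthesize(source):
    """Run S200 through S800 in order."""
    content, directives = extract(obtain_input(source))
    return generate_video(build_intermediate(content, directives))
```

In this sketch every instruction is applied to every segment, which loosely mirrors an effect applied in common to a series of segments; per-segment scoping is left out for brevity.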

FIG. 3 is a flowchart illustrating an example of a method for synthesizing a video according to an exemplary embodiment of the present invention. Specifically, the flowchart of FIG. 3 shows step S420, which is included in step S400 of FIG. 2. As described above with reference to FIG. 2, an operation of extracting the content data CNT and the instruction DIR may be performed in step S400 of FIG. 2. In some embodiments, step S400 of FIG. 2 may include step S420 of FIG. 3, in which text-based instructions are extracted. As shown in FIG. 3, step S420 may include a plurality of steps S422, S424, and S426. In some embodiments, unlike what is shown in FIG. 3, step S420 may include only some of the steps S422, S424, and S426. In some embodiments, step S420 may be performed by the input processing unit 13 of FIG. 1, and FIG. 3 is described below with reference to FIGS. 1 and 2.

Referring to FIG. 3, an operation of extracting an instruction to insert content may be performed in step S422. For example, the input processing unit 13 may extract an instruction corresponding to the insertion of content, such as at least one of a video, a subtitle, an image, and a sound. In this specification, an instruction to insert content may be referred to as a first instruction. As described later, the intermediate data generating unit 15 may generate the intermediate data INT according to the content inserted by the first instruction.

In some embodiments, upon receiving a first instruction corresponding to the insertion of a subtitle, the intermediate data generating unit 15 may divide the subtitle into at least one part and may generate intermediate data INT including the at least one part. For example, the intermediate data generating unit 15 may divide the subtitle based on the character size and/or the screen size, or may provide the subtitle to the at least one machine learning model 19 and divide it at the points where a person would pause when reading. The intermediate data generating unit 15 may generate intermediate data INT including information that distinguishes the divided parts of the subtitle, and accordingly, as described later with reference to FIG. 4B and other figures, the subtitle may be divided and included in the video.
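Dividing a subtitle by character size and screen size can be sketched as a greedy word wrap. The Python example below is a minimal sketch under stated assumptions: the parameter names are hypothetical, every character is assumed to have the same width, and words are separated by spaces. The pause-based division by a machine learning model is not covered.

```python
def split_subtitle(text, screen_width_px, char_width_px):
    """Split a subtitle into parts that each fit within the screen width."""
    # Number of fixed-width characters that fit on one line (at least one).
    max_chars = max(1, screen_width_px // char_width_px)
    parts, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            # The word fits, or it is the first word of a new part
            # (an over-long word is kept whole rather than broken).
            current = candidate
        else:
            parts.append(current)
            current = word
    if current:
        parts.append(current)
    return parts

print(split_subtitle("the quick brown fox jumps", 100, 10))
# ['the quick', 'brown fox', 'jumps']
```

Each returned part could then be stored in the intermediate data together with its order, so that the parts are displayed one after another.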

In some embodiments, upon receiving a first instruction corresponding to the insertion of an image, the intermediate data generating unit 15 may obtain information about the position and/or size of the image and may generate intermediate data INT including the obtained information. For example, the intermediate data generating unit 15 may obtain the position and/or size information that the user 20 entered directly for the image, or may derive the position and/or size information from the image as placed by the user 20. The intermediate data generating unit 15 may generate intermediate data INT including the information about the position and/or size of the image, and accordingly, as described later with reference to FIG. 4B and other figures, the image may be included in the video.

In some embodiments, upon receiving a first instruction corresponding to the insertion of a sound, the intermediate data generating unit 15 may determine which of a voice, a background sound, and a sound effect the sound is, and may generate intermediate data INT including information about the determined sound. A video may include various sounds, and a sound may be classified as a voice, a background sound, or a sound effect. The intermediate data generating unit 15 may determine the sound according to an instruction written by the user 20, or may determine the sound based on the at least one machine learning model 19. When the sound is determined to be a voice, the intermediate data generating unit 15 may include in the intermediate data INT information for outputting the corresponding part as speech (e.g., information for reading aloud the text included in the input data). When the sound is determined to be a background sound, it may include in the intermediate data INT information about a background sound obtained from the network 32 and/or the data storage 34. When the sound is determined to be a sound effect, it may include in the intermediate data INT information about a sound effect obtained from the network 32 and/or the data storage 34.

Referring again to FIG. 3, an operation of extracting an instruction that controls the flow of the video may be performed in step S424. The flow of the video may refer to the timing of the frames that constitute the video. In this specification, an instruction that controls the flow of the video may be referred to as a second instruction. In response to the second instruction, the intermediate data generating unit 15 may generate timing information for a series of frames and may generate intermediate data INT including the timing information.
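One way to picture the timing information is as a table of scene start times and frame ranges. The Python sketch below assumes, purely for illustration, that the second instruction supplies a duration in seconds for each scene and that the video has a fixed frame rate; the field names are hypothetical.

```python
def build_timing(durations_sec, fps=30):
    """Derive start time and frame range for each scene from its duration."""
    timing, start = [], 0.0
    for index, duration in enumerate(durations_sec, start=1):
        timing.append({
            "scene": index,
            "start": start,                       # start time in seconds
            "first_frame": round(start * fps),    # index of the first frame
            "frame_count": round(duration * fps)  # number of frames in the scene
        })
        start += duration
    return timing
```

For instance, with durations of 25.5 and 4.5 seconds, the second scene starts at 25.5 seconds, and at 10 frames per second its first frame index is 255.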

In step S426, an operation of extracting a natural language instruction may be performed. The user 20 may include in the input data IN not only instructions defined by the predefined rules but also instructions written in natural language, and the input processing unit 13 may extract natural language instructions that are not defined by the predefined rules. In this specification, an instruction written in natural language may be referred to as a third instruction, and an example of an operation for processing a natural language instruction is described later with reference to FIG. 7 and other figures.

FIG. 4A shows an example of a display provided to a user for video synthesis according to an exemplary embodiment of the present invention, and FIG. 4B shows an exemplary frame of a video synthesized according to an exemplary embodiment of the present invention. In some embodiments, the user interface 11 may provide the user 20 with an input interface as shown in FIG. 4A and may receive the input data IN through the input interface. That is, step S200 of FIG. 2, in which the input data is obtained, may include providing an input interface to the user 20 and receiving the input data IN through the input interface. For example, the user interface 11 may cause the display of FIG. 4A to be shown on a display device (e.g., a monitor) connected to the video synthesis system 10, or on a display device of a terminal of the user 20. Also, in some embodiments, the display of FIG. 4A may be shown by an application running on the terminal of the user 20. FIGS. 4A and 4B are described below with reference to FIG. 1.

Referring to FIG. 4A, the display may include first to fourth areas R1 to R4. The first area R1 may provide tools that the user 20 can use to create the input data IN. For example, as shown in FIG. 4A, the first area R1 may include a save tool C11 for saving the input data IN created so far and a new file tool C12 for creating a new video. The first area R1 may also include a subtitle tool C21 for inserting subtitles and/or text-based instructions, an image tool C22 for inserting an image, and a sound tool C23 for inserting a sound. In addition, the first area R1 may include a preview tool C30 for viewing the video generated from the input data IN created so far. The first area R1 may further include a speech balloon tool C41 and a handwriting tool C42 for adding a guide for an object in the video. In some embodiments, unlike what is shown in FIG. 4A, the user interface 11 may provide the user 20 with only some of the tools shown in FIG. 4A, or may further provide the user 20 with additional tools not shown in FIG. 4A.

The second area R2 may display a series of scenes to the user and may be referred to as a flow area in this specification. For example, as shown in FIG. 4A, the second area R2 may display first to fifth scenes S01 to S05 to the user 20. In this specification, a scene may refer to a unit corresponding to a portion of the playback time of the video and may consist of at least one frame. For example, as shown in FIG. 4A, the second area R2 may display, together with the first to fifth scenes S01 to S05, the start time of each scene (e.g., 25.50 for the second scene S02). In addition, the user 20 may enter guides for timing-dependent effects in the second area R2, and an example of a guide entered by the user 20 in the second area R2 is described later with reference to FIG. 6.

The third area R3 may display a scene selected from the series of scenes displayed in the second area R2 and may be referred to as a scene area. The user 20 may enter content and guides for defining the scene in the third area R3. For example, as shown in FIG. 4A, the user 20 may insert an image into the fourth scene S04 and may enter text below the image. In FIG. 4A, the line "#Description of the government bond photo," which begins with "#", may represent a comment for the user 20's own reference and may not be used for video synthesis. In FIG. 4A, "The Qing dynasty (omitted)." is a subtitle; in some embodiments, the input processing unit 13 may extract the text input as an instruction to insert a subtitle and may generate the text as content data CNT containing the subtitle. Referring to FIG. 4B, the image and the subtitle inserted in the third area R3 of FIG. 4A may be displayed in a frame corresponding to the fourth scene S04 of FIG. 4A. In some embodiments, as described later with reference to FIG. 10, the user 20 may enter in the third area R3 not only subtitles but also the lines spoken by characters appearing in the video.

Referring again to FIG. 4A, the user 20 may enter in the fourth area R4 additional instructions for the content or guides corresponding to the scene displayed in the third area R3. For example, the user 20 may enter in the fourth area R4 instructions about the position and/or size of an image, such as "img location = centre, width = 70%" or "img location = left 220, ratio=30%". In some embodiments, the fourth area R4 may display information and guides about the portion currently selected by the user 20 in the third area R3, such as an image, a subtitle, an instruction, or a comment.
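The two example instructions above follow a visible pattern: a target name followed by comma-separated "key = value" pairs. The Python sketch below parses that pattern. The parsing rules are inferred from the two examples only and are an assumption; the specification does not define the full grammar, and the function name is hypothetical.

```python
def parse_directive(line):
    """Parse 'target key = value, key = value' into (target, params)."""
    # The first word names the target (e.g. 'img'); the rest holds parameters.
    target, _, rest = line.strip().partition(" ")
    params = {}
    for item in rest.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip fragments that contain no '='
            params[key.strip()] = value.strip()
    return target, params

print(parse_directive("img location = centre, width = 70%"))
# ('img', {'location': 'centre', 'width': '70%'})
```

Values are kept as strings here; converting "70%" or "left 220" into layout coordinates would be a separate step when the intermediate data is built.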

FIG. 5A shows an example of a display provided to a user for video synthesis according to an exemplary embodiment of the present invention, and FIG. 5B shows exemplary frames of a video synthesized according to an exemplary embodiment of the present invention. In some embodiments, the video synthesis system 10 of FIG. 1 may generate the figure of a person reading aloud the text entered by the user 20 and include it in the video. In the description of FIG. 5A, content that overlaps with the description of FIG. 4A is omitted.

Referring to FIG. 5A, the display may include first to fourth areas R1 to R4. In the third area R3, the tag "voc/cap:" may cause a person reading aloud the text from "The Qing dynasty (omitted)" through "Thereafter, government bonds" to be included in the video, and the same text may be displayed as a subtitle. For example, as shown at the top of FIG. 5B, a person reading aloud "The economic collapse of the Qing dynasty (omitted) was triggered," may be included in the video. In addition, the pair "img: (omitted) :img" in FIG. 5A may determine the timing at which the image inserted in the third area R3 is included in the video. For example, as shown in the middle of FIG. 5B, the image inserted in the third area R3 may be displayed while "The Xinhai Revolution (omitted)." is read aloud and displayed as a subtitle. Then, as shown at the bottom of FIG. 5B, "Thereafter, government bonds ..." may again be read aloud by the person and displayed as a subtitle.
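The way an "img: ... :img" pair decides when the image is shown can be sketched by splitting the narrated text into segments. The Python example below is a hedged illustration: the function name and the output fields are hypothetical, and it assumes that an optional leading "voc/cap:" tag simply marks the whole text as both narration and subtitle.

```python
import re

# Text enclosed by 'img:' and ':img' is narrated while the image is displayed.
IMG_SPAN = re.compile(r"img:(.*?):img", re.S)

def split_by_image_span(script):
    """Split narration into segments flagged by whether the image is shown."""
    script = script.strip()
    prefix = "voc/cap:"
    if script.startswith(prefix):
        script = script[len(prefix):]
    segments, pos = [], 0
    for match in IMG_SPAN.finditer(script):
        before = script[pos:match.start()].strip()
        if before:
            segments.append({"text": before, "show_image": False})
        segments.append({"text": match.group(1).strip(), "show_image": True})
        pos = match.end()
    tail = script[pos:].strip()
    if tail:
        segments.append({"text": tail, "show_image": False})
    return segments
```

Applied to a script with one tagged span, this yields three segments in order (narrator, image, narrator), which matches the top, middle, and bottom frames described for FIG. 5B.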

FIG. 6 shows an example of a display provided to a user for video synthesis according to an exemplary embodiment of the present invention. As shown in FIG. 6, the display may include first to fourth areas R1 to R4. In some embodiments, the user 20 may enter guides for timing-dependent effects in the second area R2. In the description of FIG. 6, content that overlaps with the descriptions of FIGS. 4A and 5A is omitted.

Referring to FIG. 6, the second area R2 may display a series of scenes S01 to S05. Among the scenes S01 to S05, the second scene S02 may consist of an image and the third scene S03 may consist of a video. The user 20 may enter guides for timing-dependent effects in the second area R2. That is, step S400 of FIG. 2, in which the content data and the instruction are extracted, may include extracting, through the second area R2, an instruction representing a timing-dependent effect (which may be referred to as a fourth instruction in this specification). For example, by right-clicking in the second area R2, the user 20 may enter guides for the start time of a scene, the order of the scenes, the entrance effects of the scenes, and the like. The user 20 may also enter guides for the background sound, duration, scene transition effect, and the like of the second scene S02, which consists of an image. In addition, the user 20 may enter guides for the playback speed, muting, scene transition effect, and the like of the third scene S03, which consists of a video.

FIG. 7 is a flowchart illustrating an example of a method for synthesizing a video according to an exemplary embodiment of the present disclosure. Specifically, FIG. 7 shows an example of step S600 of FIG. 2, and an operation of generating the intermediate data INT may be performed in step S600' of FIG. 7. As shown in FIG. 7, step S600' may include a plurality of steps S620, S640, S660, and S680. FIG. 7 is described below with reference to FIGS. 1 and 2.

Referring to FIG. 7, an operation of determining whether an instruction is a natural language instruction may be performed in step S620. For example, as described above with reference to FIG. 3, a natural language instruction, that is, a third instruction, may be extracted in step S400 of FIG. 2. For example, the user 20 may write the guides described above with reference to FIGS. 4A, 5A, and 6 in natural language, and the input processing unit 13 may extract them as natural language instructions. As shown in FIG. 7, when a non-natural language instruction has been extracted, step S640 may be performed next, and an operation of generating the intermediate data INT according to the instruction may be performed in step S640. On the other hand, when a natural language instruction has been extracted, step S660 may be performed next.

In step S660, natural language processing may be performed. Natural language processing extracts the meaning contained in natural language through morphological analysis, lexical analysis, and the like, and is used in various fields such as artificial intelligence speakers and question answering systems. The intermediate data generating unit 15 may process the natural language of the natural language instruction based on the at least one machine learning model 19 and may thereby extract the intention of the user 20 contained in the natural language instruction.

In step S680, an operation of generating a non-natural language instruction may be performed. For example, the intermediate data generating unit 15 may generate at least one non-natural language instruction according to the intention of the natural language instruction extracted in step S660. An example in which at least one non-natural language instruction is generated is described later with reference to FIG. 8. Step S640 may be performed after step S680, so that the intermediate data INT is generated according to the non-natural language instruction generated in step S680.

FIG. 8 shows an example of a display provided to a user for video synthesis according to an exemplary embodiment of the present invention. As shown in FIG. 8, the display may include first to fourth areas R1 to R4. As described above with reference to FIG. 7, the user 20 of FIG. 1 may enter a natural language instruction, and the input processing unit 13 and the intermediate data generating unit 15 may process the natural language instruction. FIG. 8 is described below with reference to FIG. 1.

Referring to FIG. 8, in the third region R3 the user 20 may input the natural language instruction "// 거~액으로 강조해서 읽기" (roughly, "// read with emphasis, as 'geo~aek'", where 거액 means "a large sum"). The input processing unit 13 may extract "// 거~액으로 강조해서 읽기" from the input data IN as a natural language instruction and provide it to the intermediate data generation unit 15. The intermediate data generation unit 15 may extract the intention of the natural language instruction by applying natural language processing to it and, based on the extracted intention, may generate at least one non-natural language instruction "accent", for example as in "청나라 (중략) 이 accent: 거 :accent 액의 (중략)", where 청나라 means "Qing dynasty" and (중략) marks omitted text.
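The extraction and rewriting described for FIG. 8 can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes, as the examples suggest, that a line beginning with `//` marks a natural language instruction, and the function names `split_input` and `wrap_accent` are hypothetical. The word to stress and the stressed syllable are passed in as arguments, standing in for the intention that the machine learning model 19 would extract.

```python
from typing import List, Tuple

# Assumption: "//" marks a natural language instruction, as in the FIG. 8 examples.
NL_PREFIX = "//"


def split_input(lines: List[str]) -> Tuple[List[str], List[str]]:
    """Separate content lines from natural language instructions."""
    content: List[str] = []
    instructions: List[str] = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        if stripped.startswith(NL_PREFIX):
            instructions.append(stripped[len(NL_PREFIX):].strip())
        else:
            content.append(stripped)
    return content, instructions


def wrap_accent(text: str, target: str, stressed: str) -> str:
    """Wrap the stressed part of the first occurrence of `target` in accent markers."""
    start = text.find(target)
    if start < 0 or stressed not in target:
        return text  # nothing to emphasize; leave the content unchanged
    offset = start + target.index(stressed)
    end = offset + len(stressed)
    return f"{text[:offset]}accent: {stressed} :accent {text[end:]}"


content, instructions = split_input(["이 거액의", "// 거~액으로 강조해서 읽기"])
print(instructions[0])
print(wrap_accent(content[0], target="거액", stressed="거"))
```

Only the fragment "이 거액의" from the example is used as content; the text that the description omits is left out rather than guessed.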

Also referring to FIG. 8, in the third region R3 the user 20 may input the natural language instruction "// 이미지 축소하면서 사라지기" ("// make the image shrink and disappear"). The input processing unit 13 may extract "// 이미지 축소하면서 사라지기" from the input data IN as a natural language instruction and provide it to the intermediate data generation unit 15. The intermediate data generation unit 15 may extract the intention of the natural language instruction by applying natural language processing to it and, based on the extracted intention, may generate at least one non-natural language instruction "img:", "shrink", "disappear", ":img", for example as in "img: shrink & disappear :img".
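The conversion from a natural language instruction to non-natural language instructions (steps S660 and S680) might look like the sketch below. The keyword tables are a hypothetical stand-in for the intention extraction that the description attributes to the machine learning model 19; the function name `to_directive` is also an assumption. Only the output syntax "img: shrink & disappear :img" comes from the example above.

```python
from typing import Dict, List

# Hypothetical keyword tables standing in for the trained model's intent extraction.
TARGET_KEYWORDS: Dict[str, str] = {"이미지": "img"}
EFFECT_KEYWORDS: Dict[str, str] = {"축소": "shrink", "사라지": "disappear"}


def to_directive(instruction: str) -> str:
    """Turn a natural language instruction into a non-natural language directive."""
    target = next(
        (tag for kw, tag in TARGET_KEYWORDS.items() if kw in instruction), None
    )
    if target is None:
        raise ValueError(f"no known target in instruction: {instruction!r}")
    # Keep the effects in the order in which they are mentioned.
    found = sorted(
        (instruction.index(kw), name)
        for kw, name in EFFECT_KEYWORDS.items()
        if kw in instruction
    )
    effects: List[str] = [name for _, name in found]
    if not effects:
        raise ValueError(f"no known effect in instruction: {instruction!r}")
    return f"{target}: {' & '.join(effects)} :{target}"


print(to_directive("이미지 축소하면서 사라지기"))
```

A keyword lookup is used here only because it is easy to follow; it does not generalize the way a trained model would.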

FIGS. 9A and 9B are flowcharts showing examples of a method for synthesizing a video according to exemplary embodiments of the present invention, and FIG. 10 shows an example of a display provided to a user for video synthesis according to an exemplary embodiment of the present invention. Specifically, the flowcharts of FIGS. 9A and 9B show steps S440a and S440b, respectively, each of which is included in step S400 of FIG. 2. As described above with reference to FIG. 2, the operation of extracting the content data CNT and the instructions DIR may be performed in step S400 of FIG. 2. In some embodiments, step S400 of FIG. 2 may include step S440a of FIG. 9A and/or step S440b of FIG. 9B, each of which extracts an action-based instruction. In some embodiments, steps S440a and S440b may be performed by the input processing unit 13 of FIG. 1. Hereinafter, FIGS. 9A and 9B are described with reference to FIGS. 1 and 2, and descriptions that would be repeated between FIGS. 9A and 9B are omitted.

Referring to FIG. 9A, step S440a may include a plurality of steps S442a, S444a, and S446a. In step S442a, an operation of extracting a shape included in the action layer may be performed. The action layer may refer to a layer in the third region R3 that contains action-based instructions, and in some embodiments instructions input through the speech balloon tool C41 or the handwriting tool C42 may be included in the action layer. For example, as shown in FIG. 10, the user 20 may use the speech balloon tool C41 to enter "male in his 20s" and "male in his 50s" on the image in the third region R3. The user 20 may also use the handwriting tool C42 to write "A" and "B" on the image. The input processing unit 13 may extract shapes from the action layer containing the action-based instructions input with the speech balloon tool C41 and the handwriting tool C42, so that two speech balloons and two handwritten characters are extracted.

In step S444a, an operation of recognizing a symbol and/or language from the shape may be performed. For example, the input processing unit 13 may extract "male in his 20s" and "male in his 50s" from the speech balloons extracted in step S442a, and may extract the "A" and "B" written on the persons. In addition, although not shown in FIG. 10, the user 20 may input a symbol such as an arrow or a pigtail mark in the action layer, and the input processing unit 13 may recognize such symbols.

In step S446a, an operation of generating a text-based instruction may be performed. For example, the input processing unit 13 may generate a text-based instruction from the symbol and/or language recognized in step S444a. For example, the input processing unit 13 may recognize the objects included in the image, that is, the persons, based on at least one machine learning model 19, and may identify each of the two persons from the "A" and "B" written on the recognized persons. Accordingly, the input processing unit 13 may generate text-based instructions that assign the identifier "A" to the person on the left and the identifier "B" to the person on the right, and may pass them to the intermediate data generation unit 15. The input processing unit 13 may also generate, from the "male in his 20s" and "male in his 50s" written in the speech balloons pointing at the recognized persons, text-based instructions that specify the attributes of the two persons, and may pass them to the intermediate data generation unit 15. The intermediate data generation unit 15 may then generate intermediate data INT that causes the lines of dialogue below the image to be played in the voices of a "male in his 50s" and a "male in his 20s", respectively.
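Steps S442a to S446a can be illustrated with a short sketch that turns recognized annotations into text-based instructions. Everything concrete here is an assumption made for illustration: the `Person` and `Annotation` types, the normalized horizontal coordinates, the nearest-person matching, and the output syntax (`person[0].id = A`, `A.voice = ...`). Person detection and handwriting recognition, which the description assigns to the machine learning model 19, are represented only by their already-recognized results, and which person receives which attribute is chosen arbitrarily.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Person:
    x: float  # horizontal center of a person detected in the image (0.0 to 1.0)


@dataclass
class Annotation:
    kind: str   # "handwriting" (tool C42) or "balloon" (tool C41)
    text: str   # language recognized from the shape in step S444a
    x: float    # horizontal position the annotation sits on or points at


def nearest_person(persons: List[Person], x: float) -> int:
    """Index of the person whose center is closest to position x."""
    return min(range(len(persons)), key=lambda i: abs(persons[i].x - x))


def build_instructions(persons: List[Person], annotations: List[Annotation]) -> List[str]:
    """Step S446a stand-in: emit text-based instructions in a made-up syntax."""
    ids: Dict[int, str] = {}
    for a in annotations:
        if a.kind == "handwriting":
            ids[nearest_person(persons, a.x)] = a.text
    out = [f"person[{i}].id = {ids[i]}" for i in sorted(ids)]
    for a in annotations:
        if a.kind == "balloon":
            i = nearest_person(persons, a.x)
            name = ids.get(i, f"person[{i}]")
            out.append(f"{name}.voice = {a.text}")
    return out


persons = [Person(0.25), Person(0.75)]  # left and right person
annotations = [
    Annotation("handwriting", "A", 0.2),
    Annotation("handwriting", "B", 0.8),
    Annotation("balloon", "male in his 20s", 0.3),
    Annotation("balloon", "male in his 50s", 0.7),
]
for line in build_instructions(persons, annotations):
    print(line)
```

Identifiers are resolved before attributes so that a balloon can refer to a person by the identifier written on that person, whatever the order of the annotations.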

Action-based instructions are not limited to the examples described above. For example, action-based instructions may include instructions regarding camera work and instructions regarding the actions of a person appearing in the video, such as shaking the head or closing the eyes, and text-based instructions that use a library for three-dimensional modeling may be generated from such action-based instructions. Accordingly, three-dimensional data for rendering may be generated as the intermediate data, and the video may be synthesized.

Referring to FIG. 9B, step S440b may include a plurality of steps S442b, S444b, and S446b. In step S442b, an operation of identifying an action-based instruction and extracting a shape may be performed. For example, the input processing unit 13 may identify an action-based instruction included in the input data IN based on at least one machine learning model 19. Just as a person can generally distinguish content from the storyboard notes (continuity) written alongside it, such as symbols and text, the at least one machine learning model 19 may have been trained on a plurality of storyboard samples. Accordingly, the input processing unit 13 may use the at least one machine learning model 19 to identify an action-based instruction that the user 20 has included in the input data IN without any indicator (for example, an action layer) marking it as an action-based instruction. The input processing unit 13 may extract a shape from the identified action-based instruction as described above with reference to FIG. 9A. Then, an operation of recognizing a symbol and/or language from the shape may be performed in step S444b, and an operation of generating a text-based instruction may be performed in step S446b.
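The difference between FIG. 9A and FIG. 9B lies only in how action-based instructions are found, which the following sketch makes explicit. The `Element` type and the `looks_like_instruction` callback are hypothetical; the callback stands in for the machine learning model 19 trained on storyboard samples, and the lambda used at the end is a trivial placeholder rather than a real classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Element:
    payload: str                 # simplified description of a drawn element
    layer: Optional[str] = None  # "action" when the user drew it on the action layer


def select_action_elements(
    elements: List[Element],
    looks_like_instruction: Callable[[Element], bool],
) -> List[Element]:
    """Pick the elements to be treated as action-based instructions."""
    selected: List[Element] = []
    for e in elements:
        if e.layer == "action":
            # FIG. 9A path: the action layer itself is the indicator.
            selected.append(e)
        elif e.layer is None and looks_like_instruction(e):
            # FIG. 9B path: no indicator, so a model decides.
            selected.append(e)
    return selected


elements = [
    Element("balloon: male in his 20s", layer="action"),
    Element("arrow pointing at left person"),
    Element("photograph of two people"),
]
# Placeholder predicate; a trained classifier would take its place.
picked = select_action_elements(elements, lambda e: e.payload.startswith("arrow"))
print([e.payload for e in picked])
```

Once the elements are selected, shape extraction, recognition, and generation of text-based instructions proceed the same way on both paths.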

FIG. 11 is a block diagram showing an example in which a video synthesis system according to an exemplary embodiment of the present disclosure is used, and FIG. 12 is a flowchart showing an example of the operation of the video synthesis system 40 of FIG. 11 according to an exemplary embodiment of the present invention. In some embodiments, as shown in FIG. 11, the video synthesis system 40 may be used by multiple users 70 over a network 60.

Referring to FIG. 11, the video synthesis system 40 may communicate with a video storage server 50. For example, the video synthesis system 40 may store a generated video in the video storage server 50 and may read a video stored in the video storage server 50. The video synthesis system 40 may also provide the multiple users 70 with an input interface for video synthesis over a network 60 such as the Internet, and may receive input data from the multiple users 70. In addition, the video synthesis system 40 may provide a synthesized video to the multiple users 70 over the network 60, and may receive feedback on the provided video from the multiple users 70. As described above with reference to FIG. 1, each of the multiple users 70 may access the video synthesis system 40 over the network 60 using his or her own terminal, such as a personal computer or a mobile phone.

Referring to FIG. 12, the method of operating the video synthesis system 40 of FIG. 11 may include a plurality of steps S20, S40, S60, and S80. First, in step S20, an operation of providing a video to the multiple users 70 may be performed. For example, in response to a request from the multiple users 70, the video synthesis system 40 may read a video stored in the video storage server 50 and provide it to the multiple users 70 over the network 60.

In step S40, an operation of receiving and providing feedback may be performed. For example, the video synthesis system 40 may receive feedback on the video provided in step S20 from the multiple users 70, and may provide the feedback to the multiple users 70 by posting the received feedback. Accordingly, each of the multiple users 70 can see not only the feedback that he or she provided but also the feedback that other users provided on the video. In some embodiments, the feedback may include comments from the multiple users 70, and may include ratings (or rewards) given by the multiple users 70 to the video provided in step S20.

In step S60, an operation of receiving input data may be performed, and in step S80, an operation of synthesizing a video may be performed. Step S20 may be performed again following step S80, so that the plurality of steps S20, S40, S60, and S80 may be repeated.

For example, the video synthesis system 40 may provide the input interface to the multiple users 70 over the network 60 as described above with reference to the drawings, and may receive, over the network 60, the input data that the multiple users 70 enter through the input interface. As described above with reference to FIG. 1 and other drawings, the input data is data that defines at least part of a video, and the input data received in step S60 may include content that modifies the video provided to the multiple users 70 in step S20. The multiple users 70 may feel interest or a sense of achievement from seeing, in step S40, not only their own feedback but also the feedback of other users, and may provide the video synthesis system 40, over the network 60, with input data for modifying the video provided in step S20, whether for greater enjoyment or satisfaction or to win a response from other users.
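The repetition of steps S20, S40, S60, and S80 can be sketched as a simple loop over video versions. The `Video` type and the functions `synthesize` and `run_round` are hypothetical, and synthesis is reduced to recording the requested edit; the actual synthesis path (extraction, intermediate data generation, rendering) is the one described in the earlier drawings and is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Video:
    version: int
    edits: List[str] = field(default_factory=list)     # input data applied so far
    feedback: List[str] = field(default_factory=list)  # feedback posted for this version


def synthesize(previous: Video, input_data: str) -> Video:
    """Step S80 stand-in: derive the next version by recording the edit."""
    return Video(previous.version + 1, previous.edits + [input_data])


def run_round(video: Video, comments: List[str], input_data: str) -> Video:
    """One pass through S20 (provide), S40 (feedback), S60 (input), S80 (synthesize)."""
    # S20: the video would be served to the users here (omitted in this sketch).
    # S40: receive feedback and post it so that all users can see it.
    video.feedback.extend(comments)
    # S60 and S80: receive input data that modifies the video, then synthesize.
    return synthesize(video, input_data)


first = Video(version=1)
second = run_round(first, ["nice pacing"], "shorten the intro")
third = run_round(second, ["subtitle is hard to read"], "enlarge the subtitle")
print(third.version, third.edits)
```

Each round returns a new version rather than mutating the old one, so the feedback collected for an earlier version stays attached to that version.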

As described above with reference to the drawings, the video synthesis system 40 can improve the convenience of video creation by allowing the multiple users 70 to intuitively enter the information needed to create a video, and can let anyone produce a video easily by supporting natural language input. Accordingly, the multiple users 70 can readily take part in video synthesis, and because the video is incrementally reworked by the multiple users 70, many users may access the video synthesis system 40: not only the users involved in modifying the video, but also users who provide feedback on the video as it is being reworked and users who take an interest in reading that feedback. The video synthesis system 40 can thereby be used in various business models, such as advertising.

Exemplary embodiments have been disclosed in the drawings and the specification as described above. Although the embodiments have been described herein using specific terms, those terms are used only to explain the technical idea of the present invention and are not used to limit their meaning or to limit the scope of the present invention set forth in the claims. Those of ordinary skill in the art will therefore understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical idea of the appended claims.

Claims

1. A method for synthesizing a video, the method comprising:
obtaining input data that at least partially defines the video;
extracting, from the input data and based on predefined rules, content data regarding content to be included as part of the video and a series of instructions regarding guidance used for video synthesis;
generating structured intermediate data from the content data and the series of instructions, based on at least one machine learning model trained on a plurality of instruction samples; and
generating the video based on the intermediate data,
wherein the rules include rules for identifying the content data and the series of instructions,
wherein the extracting of the content data and the series of instructions includes extracting text-based instructions based on the rules, and
wherein the extracting of the text-based instructions includes at least one of:
extracting a first instruction indicating insertion of content;
extracting a second instruction for controlling the flow of the video; and
extracting a third instruction written in natural language.

2. (Deleted)

3. The method of claim 1,
wherein the generating of the intermediate data includes at least one of:
when the first instruction corresponds to insertion of a subtitle, dividing the subtitle into at least one part and generating the intermediate data including the at least one part;
when the first instruction corresponds to insertion of an image, obtaining information on the position and/or size of the image and generating the intermediate data including the obtained information; and
when the first instruction corresponds to insertion of a sound, determining at least one of a voice, a background sound, and a sound effect and generating the intermediate data including information on the determined sound.

4. The method of claim 1,
wherein the generating of the intermediate data includes
generating timing information of a series of frames in response to the second instruction, and generating the intermediate data including the timing information.

5. The method of claim 1,
wherein the generating of the intermediate data includes generating at least one of the first instruction and the second instruction by processing the natural language in response to the third instruction.

6. The method of claim 1,
wherein the extracting of the content data and the series of instructions includes extracting, based on the rules, an action-based instruction as at least one of the series of instructions, and
wherein the extracting of the action-based instruction includes:
extracting a shape included in an action layer;
recognizing a symbol and/or language from the shape based on the at least one machine learning model; and
generating at least one text-based instruction based on the symbol and/or the language.

7. The method of claim 1,
wherein the extracting of the content data and the series of instructions includes extracting, based on the rules, an action-based instruction as at least one of the series of instructions, and
wherein the extracting of the action-based instruction includes:
identifying the action-based instruction based on the at least one machine learning model, and extracting a shape from the identified action-based instruction;
recognizing a symbol and/or language from the shape based on the at least one machine learning model; and
generating at least one text-based instruction based on the symbol and/or the language.

8. The method of claim 1,
wherein the obtaining of the input data includes:
providing an input interface to a user; and
receiving the input data via the input interface, and
wherein the providing of the input interface includes:
displaying a scene area that corresponds to a scene forming part of the playback time of the video and in which the content data and the series of instructions corresponding to the scene are displayed; and
displaying a flow area that includes at least part of a series of scenes.

9. The method of claim 8,
wherein the extracting of the content data and the series of instructions includes extracting, through the flow area, a fourth instruction indicating a timing-dependent effect.

10. A method of synthesizing a video, performed by a video synthesis system accessed by multiple users, the method comprising:
providing a first video to the multiple users;
receiving first feedback on the first video from the multiple users;
receiving, from at least some of the multiple users, first input data for at least partially modifying the first video;
synthesizing a second video by modifying the first video based on the first input data;
providing the second video to the multiple users;
receiving second feedback on the second video from the multiple users;
receiving, from at least some of the multiple users, second input data for at least partially modifying the second video;
synthesizing a third video by modifying the second video based on the second input data; and
providing the third video to the multiple users.

11. (Deleted)