KR102661488B1

KR102661488B1 - Servers, systems, methods and programs that provide special effect creation and 3d model creation services using generative ai models

Info

Publication number: KR102661488B1
Application number: KR1020240008780A
Authority: KR
Inventors: 이태희; 문기환; 엄대호; 윤성민; 트란딘썬
Original assignee: 주식회사 더매트릭스; 주식회사 오핌디지털
Priority date: 2024-01-19
Filing date: 2024-01-19
Publication date: 2024-04-26

Abstract

According to an embodiment of the present invention, a server for providing special effect generation and 3D model generation services using a generative AI model is disclosed. The server includes: an object segmentation unit that determines objects included in an input image using an image segmentation model; a video generation unit that generates a special effects video using a video generation model; and a synthesis unit that designates a special effect input area in an original video using a user-specified area received from a user terminal and composites the special effects video into the special effect input area of the original video.

Description

Servers, systems, methods and programs that provide special effect creation and 3D model creation services using generative AI models {SERVERS, SYSTEMS, METHODS AND PROGRAMS THAT PROVIDE SPECIAL EFFECT CREATION AND 3D MODEL CREATION SERVICES USING GENERATIVE AI MODELS}

The present invention relates to a server, system, method, and program for providing special effect generation and 3D model generation services using a generative AI model.

Unless otherwise indicated herein, the material described in this section is not prior art to the claims of this application, and is not admitted to be prior art by inclusion in this section.

In the film industry, object removal and replacement is one of the most time-consuming and expensive tasks. Because objects have been removed and replaced by hand in each of a very large number of frames, the work has required a great deal of time and labor, and the quality of the removal and replacement has varied with the skill of the operator.

When object removal and replacement is performed using deep-learning-based inpainting and outpainting models, automation dramatically increases working speed and keeps quality consistent. In particular, because costs are greatly reduced compared with existing methods, the adoption of inpainting and outpainting models in the film industry is accelerating.

Korean Registered Patent Publication No. 10-2486300 (2023.01.04)
Korean Laid-open Patent Publication No. 10-2023-0133059 (2023.09.19)
Korean Laid-open Patent Publication No. 10-2021-0144294 (2021.11.30)
Korean Laid-open Patent Publication No. 10-2338913 (2021.12.08)

One object of the present invention is to provide a server, system, method, and program for providing special effect generation and 3D model generation services using a generative AI model, which provide a service for generating images or videos to which special effects are applied using generative AI.

Another object of the present invention is to provide a server, system, method, and program for providing special effect generation and 3D model generation services using a generative AI model, which provide a 3D modeling service using generative AI.

To achieve the above objects, one aspect of the present invention provides a server for providing special effect generation and 3D model generation services using a generative AI model.

The server includes: an object segmentation unit that determines objects included in an input image using an image segmentation model; a video generation unit that generates a special effects video using a video generation model; and a synthesis unit that designates a special effect input area in an original video using a user-specified area received from a user terminal and composites the special effects video into the special effect input area of the original video.

In addition, the object segmentation unit separates each of the plurality of frames included in the special effects video into a special effect object area and a background area using the image segmentation model and a video object segmentation model, and removes the background area from each of the plurality of frames included in the special effects video.
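As an illustrative sketch only, and not the implementation of the invention, the background removal described above can be pictured as applying a per-frame binary mask to each frame of the special effects video. The function names `remove_background` and `remove_background_from_video`, the nested-list frame representation, and the RGBA output convention are assumptions made for this example; in practice the masks would be produced by the image segmentation and video object segmentation models.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]          # (R, G, B)
PixelA = Tuple[int, int, int, int]    # (R, G, B, A)
Frame = List[List[Pixel]]
Mask = List[List[int]]                # 1 = special effect object, 0 = background


def remove_background(frame: Frame, mask: Mask) -> List[List[PixelA]]:
    """Return an RGBA frame in which background pixels are fully transparent."""
    out = []
    for row, mask_row in zip(frame, mask):
        out.append([
            (r, g, b, 255) if m else (0, 0, 0, 0)
            for (r, g, b), m in zip(row, mask_row)
        ])
    return out


def remove_background_from_video(
    frames: List[Frame], masks: List[Mask]
) -> List[List[List[PixelA]]]:
    """Apply the matching mask to every frame of the special effects video."""
    if len(frames) != len(masks):
        raise ValueError("one mask is required per frame")
    return [remove_background(f, m) for f, m in zip(frames, masks)]
```

Keeping the result as RGBA is one possible design choice: the transparent background lets a later compositing step place only the special effect object over the original video.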

In addition, the synthesis unit designates the user-specified area as the special effect input area of the first frame of the original video. The object segmentation unit identifies, using the image segmentation model, at least one object included in the special effect input area of the first frame of the original video, determines one of the at least one object as a target object, generates a mask matching the target object, and separates each of the remaining frames of the original video into a target object area and a background area using the mask matching the target object and the video object segmentation model. The synthesis unit then generates the special effect input area of each remaining frame of the original video using the target object area of that frame.
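The flow above can be sketched, under stated assumptions, as a small orchestration function. The two models are passed in as callables (`segment_image` and `segment_video`), and these names, their signatures, and the default rule of picking the first candidate as the target object are hypothetical choices made for illustration; the specification does not define such an interface.

```python
from typing import Any, Callable, List, Sequence


def track_target_areas(
    frames: Sequence[Any],
    user_area: Any,
    segment_image: Callable[[Any, Any], List[Any]],
    segment_video: Callable[[Any, Sequence[Any]], List[Any]],
    pick_target: Callable[[List[Any]], Any] = lambda masks: masks[0],
) -> List[Any]:
    """Return one target object area per frame of the original video.

    segment_image(first_frame, user_area) -> candidate object masks (assumed API)
    segment_video(target_mask, remaining_frames) -> one mask per remaining frame
    """
    if not frames:
        raise ValueError("the original video has no frames")
    # Identify objects inside the user-specified area of the first frame.
    candidates = segment_image(frames[0], user_area)
    if not candidates:
        raise ValueError("no object found in the user-specified area")
    # Choose one object as the target and use its mask as the reference.
    target_mask = pick_target(candidates)
    # Propagate the first-frame mask through the remaining frames.
    rest_masks = segment_video(target_mask, frames[1:])
    return [target_mask, *rest_masks]
```

The synthesis step would then derive the special effect input area of each remaining frame from the returned target object areas.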

In addition, the synthesis unit matches each of the plurality of frames of the original video one-to-one with each of the plurality of frames of the special effects video, and composites each frame of the special effects video into the special effect input area of the corresponding frame of the original video.
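A minimal sketch of the one-to-one matching and compositing described above is given below. It assumes that the original frame, the special effects frame, and the input-area mask all have the same dimensions and that pixels are simply replaced inside the input area; the function names and the nested-list representation are illustrative assumptions, and a real system might blend, scale, or reposition the effect instead.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]
Mask = List[List[int]]  # 1 = pixel belongs to the special effect input area


def composite_frame(original: Frame, effect: Frame, area: Mask) -> Frame:
    """Use effect pixels inside the input area and original pixels elsewhere."""
    return [
        [e if a else o for o, e, a in zip(o_row, e_row, a_row)]
        for o_row, e_row, a_row in zip(original, effect, area)
    ]


def composite_video(
    originals: List[Frame], effects: List[Frame], areas: List[Mask]
) -> List[Frame]:
    """Pair frame i of the original video with frame i of the effect video."""
    if not (len(originals) == len(effects) == len(areas)):
        raise ValueError("1:1 matching requires equal frame counts")
    return [
        composite_frame(o, e, a) for o, e, a in zip(originals, effects, areas)
    ]
```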

In addition, the server further includes: an image generation unit that, using each of the plurality of frames included in the special effects video and an image generation model, generates a plurality of modeling images that match each of those frames and depict the special effect object included in each frame from multiple viewing angles; and a 3D generation unit that generates a 3D model matching the plurality of modeling images using a 3D generation model.

In addition, the video generation unit designates a marker generation area in any one of the plurality of modeling images, the object segmentation unit separates each of the remaining modeling images into a marker generation area and a background area using the marker generation area and the video object segmentation model, and the server includes an inpainting unit that generates a marker in the marker generation area of each of the plurality of modeling images using an inpainting model.

According to an embodiment of the present invention, a composite video can be generated in which a special effects video sequence created using generative AI is composited into a continuous original video sequence.

According to an embodiment of the present invention, a plurality of modeling images can be generated using generative AI, and a 3D model can be generated using the generated modeling images.

FIG. 1 is a schematic diagram of a system for providing special effect synthesis and 3D model generation services using a generative AI model according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating functional modules of the service providing server according to FIG. 1.
FIG. 3 is a flowchart illustrating a process in which the service providing server according to FIG. 1 detects a dynamic object in frames of a video.
FIG. 4 is a diagram illustrating a process in which an image segmentation model generates a mask in the area of a frame that matches a target object.
FIG. 5 is a flowchart illustrating a process in which a video object segmentation model receives the mask of the first frame of a video and the remaining frames of the video and detects the dynamic object included in the remaining frames.
FIG. 6 is a flowchart illustrating an embodiment of the detailed process of step S200 of FIG. 3.
FIG. 7 is a diagram illustrating a process in which the image segmentation model intervenes during operation of the video object segmentation model.
FIG. 8 is a diagram illustrating a process in which an inpainting model removes a dynamic object detected in a video and restores the removed portion.
FIG. 9 is a diagram illustrating a process in which an inpainting model removes a dynamic object detected in a video and restores the removed portion.
FIG. 10 is a flowchart illustrating a process in which the service providing server according to FIG. 1 composites a special effects video into an original video.
FIG. 11 is a flowchart illustrating the detailed process of step S310 of FIG. 10.
FIG. 12 is a flowchart illustrating the detailed process of step S320 of FIG. 10.
FIG. 13 is a diagram conceptually illustrating a process of generating a special effects video.
FIG. 14 is a flowchart illustrating a process in which the service providing server according to FIG. 1 generates a 3D model.
FIG. 15 is a flowchart illustrating an embodiment of step S410 of FIG. 14.
FIG. 16 is a flowchart illustrating an embodiment of step S410 of FIG. 14.
FIG. 17 is a flowchart illustrating the detailed process of step S420 of FIG. 14.
FIG. 18 is a diagram illustrating a plurality of modeling images in which markers have been generated in marker generation areas using an inpainting model.
FIG. 19 is a diagram illustrating an example of a UV map.
FIG. 20 is a diagram illustrating the hardware configuration of the service providing server according to FIG. 1.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. In describing the drawings, similar reference numerals are used for similar components.

Terms such as first, second, A, and B may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to the other component, but it should be understood that intervening components may be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of a system for providing special effect synthesis and 3D model generation services using a generative AI model according to an embodiment of the present invention.

Referring to FIG. 1, a system for providing dynamic object tracking, inpainting, and outpainting services includes a service providing server 100, a user terminal 200, and an artificial neural network model server 300.

The service providing server 100, the user terminal 200, and the artificial neural network model server 300 are connected to one another through a network.

The network may include, for example, a plurality of access networks (not shown) and a core network (not shown), and may further include an external network such as the Internet (not shown). Here, the access network (not shown) performs wired and wireless communication with the service providing server 100, the user terminal 200, and the artificial neural network model server 300, and may be implemented with, for example, a plurality of base stations such as a BS (Base Station), BTS (Base Transceiver Station), NodeB, or eNodeB, and base station controllers such as a BSC (Base Station Controller) or RNC (Radio Network Controller). In addition, as described above, the digital signal processing unit and the radio signal processing unit that were integrated in the base station may be separated into a digital unit (hereinafter, DU) and a radio unit (hereinafter, RU), respectively, with a plurality of RUs (not shown) installed in a plurality of areas and connected to a centralized DU (not shown).

The core network (not shown), which together with the access network (not shown) constitutes the mobile network, connects the access network (not shown) to an external network such as the Internet (not shown).

As described above, the core network (not shown) is a network system that performs the main functions of a mobile communication service, such as mobility control and switching between access networks (not shown). It performs circuit switching or packet switching, and manages and controls packet flow within the mobile network. The core network (not shown) may also manage inter-frequency mobility, and may handle traffic within the access network (not shown) and the core network (not shown) as well as interworking with other networks such as the Internet (not shown). The core network (not shown) may further include an SGW (Serving GateWay), PGW (PDN GateWay), MSC (Mobile Switching Center), HLR (Home Location Register), MME (Mobility Management Entity), HSS (Home Subscriber Server), and the like.

The Internet (not shown) refers to an ordinary open communication network, that is, a public network, in which information is exchanged according to the TCP/IP protocol. It can deliver information provided by the service providing server 100, the user terminal 200, and the artificial neural network model server 300 to the network via the core network (not shown) and the access network (not shown), and conversely can deliver information provided by the network to the service providing server 100, the user terminal 200, and the artificial neural network model server 300 via the core network (not shown) and the access network (not shown). However, the configuration is not limited thereto, and the service providing server 100 may be implemented integrally with the core network (not shown).

In addition to the communication methods described above, the network may include any other communication method that is widely known or will be developed in the future.

The service providing server 100 may be a server that provides a web page, app page, program, or application offering dynamic object tracking, inpainting, and outpainting services.

The user terminal 200 refers to the terminal of a user who intends to use the dynamic object tracking, inpainting, and outpainting services.

The user terminal 200 may be implemented as a computer capable of accessing a remote server or terminal through a network. Here, the computer may include, for example, a navigation device, or a notebook, desktop, or laptop equipped with a web browser. The user terminal 200 may also be implemented as a terminal capable of accessing a remote server or terminal through a network. For example, the user terminal 200 may be a wireless communication device that guarantees portability and mobility, and may include any type of handheld wireless communication device such as a navigation device, PCS (Personal Communication System), GSM (Global System for Mobile communications), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (Wideband Code Division Multiple Access), or Wibro (Wireless Broadband Internet) terminal, smartphone, smartpad, or tablet PC.

The artificial neural network model server 300 may include an image segmentation model, a video object segmentation model, a resolution conversion model, an inpainting model, and an outpainting model.

The image segmentation model is trained to classify each pixel of an image into a specific semantic category so that various objects and areas can be segmented. The image segmentation model may be generated through training on a training dataset in which segmentation masks have been created by labeling each pixel of an image with its ground-truth class. Various known algorithms, such as U-Net, FCN (Fully Convolutional Network), SegNet, DeepLab, and Mask R-CNN, may be used to train the image segmentation model, and the model may be trained by adjusting its weights through gradient descent on a loss function. In one embodiment, the Segment Anything method (Kirillov, Alexander, et al. arXiv preprint arXiv:2304.02643, 2023) may be used to generate the image segmentation model.

The video object segmentation model may be generated through training on a training dataset in which a mask, that is, the area of the target object, is labeled in each of a series of consecutive frames. Various known algorithms, such as 3D Convolutional Neural Networks (3D CNNs), 2D CNNs with temporal information modeling, or models combining an RNN and an LSTM, may be used to train the video object segmentation model. For example, the Mask R-CNN for Video method or the XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model method (Cheng, Ho Kei, and Alexander G. Schwing, European Conference on Computer Vision, 2022) may be used. The video object segmentation model may be trained by adjusting its weights through gradient descent on a chosen loss function.

The inpainting model is trained to restore missing or damaged portions of an image. The inpainting model is generated through training on training data in which missing or damaged images are paired with the corresponding intact ground-truth images. An autoencoder, a convolutional neural network (CNN)-based model, or a GAN (Generative Adversarial Network)-based algorithm may be used to train the inpainting model. For example, the Stable Diffusion XL method (Stability AI, GitHub Repository: Stable Diffusion XL) may be used to generate the inpainting model. The inpainting model may be trained by adjusting its weights through gradient descent on a chosen loss function.
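By way of a hedged illustration, one training pair of the kind described above (a damaged image together with its intact ground truth) could be produced by blanking out a rectangular area of an intact image. The function name `make_inpainting_pair`, the box convention, and the fill value are assumptions made for this sketch; the specification does not state how the damaged images are obtained.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel values (assumed)


def make_inpainting_pair(
    image: Image, box: Tuple[int, int, int, int], fill: int = 0
) -> Tuple[Image, Image]:
    """Return (damaged, ground_truth) where `box` is blanked in the damaged copy.

    box = (top, left, bottom, right), with bottom and right exclusive.
    """
    top, left, bottom, right = box
    damaged = [
        [
            fill if top <= y < bottom and left <= x < right else px
            for x, px in enumerate(row)
        ]
        for y, row in enumerate(image)
    ]
    # The untouched input serves as the label the model should reconstruct.
    return damaged, image
```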

The outpainting model is trained to extend an image beyond the boundaries of the existing image. Using the outpainting model, an image can be extended to a wider field of view or environment. The outpainting model may be trained using, as training data, pairs of an original image and its extended image, or pairs of a reduced image and its original image. In one embodiment, algorithms such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), Attention Mechanisms and Transformers, and Recurrent Neural Networks (RNNs) may be used to train the outpainting model. The outpainting model may be trained by adjusting its weights so as to minimize a chosen loss function.
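As a sketch under stated assumptions, a (reduced image, original image) training pair could be obtained by cropping a uniform margin from the original, so that the model learns to regenerate the removed border. Reading "reduced image" as a center crop is an interpretation made for this example, and the function name `make_outpainting_pair` is hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel values (assumed)


def make_outpainting_pair(image: Image, margin: int) -> Tuple[Image, Image]:
    """Return (cropped_input, full_target) by removing `margin` pixels per side."""
    height, width = len(image), len(image[0])
    if margin <= 0 or 2 * margin >= min(height, width):
        raise ValueError("margin must leave a non-empty center crop")
    cropped = [row[margin:width - margin] for row in image[margin:height - margin]]
    # The model input is the crop; the label is the full original image.
    return cropped, image
```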

The resolution conversion model is trained to convert a low-resolution image into a high-resolution image. The resolution conversion model may be generated by creating training data in which a high-resolution image is paired with a low-resolution image obtained by downsampling that high-resolution image, and then training on the created training data. In one embodiment, algorithms such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), Deep Residual Networks (ResNet), Attention Mechanisms, and Transformer Models may be used to train the resolution conversion model. The resolution conversion model may be trained by adjusting its weights so as to minimize a chosen loss function.
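The pairing of a high-resolution image with its downsampled counterpart can be sketched as follows. Block averaging is used here as one simple downsampling method; the specification does not name a particular method, and the function names `downsample` and `make_resolution_pair` as well as the grayscale nested-list representation are assumptions made for this example.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel values (assumed)


def downsample(image: Image, factor: int) -> Image:
    """Average each factor x factor block into a single low-resolution pixel."""
    height, width = len(image), len(image[0])
    if factor < 1 or height % factor or width % factor:
        raise ValueError("image size must be divisible by the factor")
    out = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            block = [
                image[y + dy][x + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def make_resolution_pair(high_res: Image, factor: int = 2) -> Tuple[Image, Image]:
    """Return (low_res_input, high_res_label) for training."""
    return downsample(high_res, factor), high_res
```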

The image generation model is generated by preprocessing collected training image data and training on the preprocessed image data. In one embodiment, the image generation model may be generated using algorithms such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Stable Diffusion, and Auto-Regressive Models (e.g., PixelRNN, PixelCNN). However, the model is not limited thereto, and various known algorithms may be used. During training, the parameters are set in the direction that minimizes the loss function defined for the chosen algorithm, which can improve the performance of the image generation model.

The video generation model may be generated by preparing images and video sequences related to those images as training data, and training on that data. In one embodiment, the video generation model may be generated by preparing text and video sequences related to that text as training data, and training on that data. In one embodiment, the video generation model may be generated by preparing text and images, together with video sequences related to that text and those images, as training data, and training on that data. In one embodiment, the video generation model may be generated using algorithms such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Recurrent Neural Networks, and 3D Convolutional Neural Networks. However, the model is not limited thereto, and various known algorithms may be used. During training, the parameters are set in the direction that minimizes the loss function defined for the chosen algorithm, which can improve the performance of the video generation model.

3D 생성 모델은, 특징점 추출 및 매칭, 3D 재구성, 텍스쳐 매핑, 정제 및 최적화를 수행하는 모델로서, 다각도 모델링 이미지를 이용해 3D 모델을 생성하는 모델을 의미하며, Structure from Motion (SfM), Multi-view Stereo (MVS), Photogrammetry 등의 알고리즘이 3D 생성 모델 구축에 사용될 수 있다. 다만, 이에 한정되는 것은 아니며 다양한 공지된 알고리즘이 사용될 수 있다. 예를 들어, 3D 생성 모델로, Agisoft Metashape, RealityCapture와 같은 소프트웨어가 사용될 수 있다. The 3D generation model performs feature point extraction and matching, 3D reconstruction, texture mapping, and refinement and optimization; it refers to a model that creates a 3D model from multi-angle modeling images. Algorithms such as Structure from Motion (SfM), Multi-view Stereo (MVS), and photogrammetry may be used to build the 3D generation model. However, the model is not limited to these, and various known algorithms may be used. For example, software such as Agisoft Metashape or RealityCapture may be used as the 3D generation model.

도시된 실시 예에서, 인공신경망 모델 서버(300)가 서비스 제공 서버(100)와 별도의 서버로 도시되나, 이에 한정되는 것은 아니며 서비스 제공 서버(100)와 인공신경망 모델 서버(300)가 일체로 형성될 수 있다. In the illustrated embodiment, the artificial neural network model server 300 is shown as a server separate from the service providing server 100; however, the configuration is not limited thereto, and the service providing server 100 and the artificial neural network model server 300 may be formed as a single unit.

도 2는 도 1에 따른 서비스 제공 서버(100)의 기능적 모듈을 예시적으로 나타낸 블록도이다.FIG. 2 is a block diagram illustrating functional modules of the service providing server 100 according to FIG. 1.

서비스 제공 서버(100)는, 객체 분할부(101), 인페인팅부(102), 이미지 생성부(103), 비디오 생성부(104), 합성부(105) 및 3D 생성부(106)를 포함한다. The service providing server 100 includes an object segmentation unit 101, an inpainting unit 102, an image generator 103, a video generator 104, a synthesis unit 105, and a 3D generator 106.

서비스 제공 서버(100)가 사용자 단말(200)로 동적 객체 추적 및 인페인팅 서비스를 제공하는 애플리케이션, 프로그램, 앱 페이지, 웹 페이지 등을 전송하는 경우, 사용자 단말(200)은, 동적 객체 추적 및 인페인팅 서비스를 제공하는 애플리케이션, 프로그램, 앱 페이지, 웹 페이지 등을 설치하거나 열 수 있다. 또한, 웹 브라우저에서 실행되는 스크립트를 이용하여 서비스 프로그램이 사용자 단말(200)에서 구동될 수도 있다. 여기서, 웹 브라우저는 웹(WWW: World Wide Web) 서비스를 이용할 수 있게 하는 프로그램으로 HTML(Hyper Text Mark-up Language)로 서술된 하이퍼 텍스트를 받아서 보여주는 프로그램을 의미하며, 예를 들어 넷스케이프(Netscape), 익스플로러(Explorer), 크롬(Chrome) 등을 포함한다. 또한, 애플리케이션은 단말 상의 응용 프로그램(Application)을 의미하며, 예를 들어, 모바일 단말(스마트폰)에서 실행되는 앱(App)을 포함한다. When the service providing server 100 transmits to the user terminal 200 an application, program, app page, web page, or the like that provides the dynamic object tracking and inpainting service, the user terminal 200 may install or open that application, program, app page, or web page. The service program may also be run on the user terminal 200 by a script executed in a web browser. Here, a web browser is a program that enables use of World Wide Web (WWW) services by receiving and displaying hypertext written in HTML (Hyper Text Mark-up Language); examples include Netscape, Explorer, and Chrome. An application refers to an application program on a terminal and includes, for example, an app running on a mobile terminal (smartphone).

도 3은 도 1에 따른 서비스 제공 서버가 비디오의 프레임에서 동적 객체를 탐지하는 과정을 도시하는 흐름도이다. FIG. 3 is a flowchart illustrating a process in which the service providing server according to FIG. 1 detects a dynamic object in a video frame.

도 3을 참조하면, 객체 분할부(101)가, 이미지 세그멘테이션 모델을 이용해 비디오의 첫 프레임에 포함된 타겟객체의 영역을 나타내는 마스크를 생성한다(S100).Referring to FIG. 3, the object segmentation unit 101 generates a mask representing the area of the target object included in the first frame of the video using an image segmentation model (S100).

도 4는 이미지 세그멘테이션 모델이 프레임에서 타겟객체과 매칭되는 영역에 마스크를 생성하는 과정을 도시하는 도면이다. Figure 4 is a diagram illustrating a process in which an image segmentation model creates a mask in an area that matches a target object in a frame.

객체 분할부(101)는, 사용자 단말(200)에, 프레임 내에 관심영역(ROI)을 지정할 수 있는 사용자 인터페이스를 제공할 수 있다. The object segmentation unit 101 may provide the user terminal 200 with a user interface for designating a region of interest (ROI) within a frame.

사용자 인터페이스를 통해 관심영역이 입력되면, 객체 분할부(101)는, 프레임에서 관심영역과 매칭되는 부분을 이미지 세그멘테이션 모델에 입력하고, 이미지 세그멘테이션 모델은 관심영역에 포함된 타겟객체를 식별하고 타겟객체의 영역과 나머지 배경 영역을 분리한다. When a region of interest is input through the user interface, the object segmentation unit 101 inputs the portion of the frame that matches the region of interest into the image segmentation model, and the image segmentation model identifies the target object included in the region of interest and separates the area of the target object from the remaining background area.

객체 분할부(101)는, 사용자 단말(200)에 비디오의 첫 프레임을 제공하고, 사용자 단말(200)로부터 첫 프레임의 관심영역을 수신하며, 첫 프레임의 관심영역과 매칭되는 부분을 이미지 세그멘테이션 모델에 입력하고, 이미지 세그멘테이션 모델이 생성한 첫 프레임에 포함된 타겟객체와 매칭되는 마스크를 획득한다. The object segmentation unit 101 provides the first frame of the video to the user terminal 200, receives the region of interest of the first frame from the user terminal 200, inputs the portion of the first frame that matches the region of interest into the image segmentation model, and obtains the mask, generated by the image segmentation model, that matches the target object included in the first frame.

다시 도 3을 참조하면, 객체 분할부(101)는, 첫 프레임에서 생성된 마스크와 비디오 오브젝트 세그멘테이션 모델을 이용해 비디오의 나머지 프레임에 포함된 타겟객체를 탐지한다(S200). Referring again to FIG. 3, the object segmentation unit 101 detects target objects included in the remaining frames of the video using the mask generated in the first frame and the video object segmentation model (S200).

도 5는 비디오 오브젝트 세그멘테이션 모델이 비디오의 첫 프레임의 마스크와 비디오의 나머지 프레임을 입력받아 나머지 프레임에 포함된 동적 객체를 탐지하는 과정을 도시하는 흐름도이다. Figure 5 is a flowchart showing a process in which a video object segmentation model receives the mask of the first frame of the video and the remaining frames of the video and detects dynamic objects included in the remaining frames.

객체 분할부(101)는, 첫 프레임에서 생성된 마스크와 비디오의 나머지 프레임을 비디오 오브젝트 세그멘테이션 모델에 입력하고, 비디오 오브젝트 세그멘테이션 모델이 첫 프레임에서 생성된 마스크를 이용해 나머지 프레임을 타겟객체의 영역과 나머지 배경 영역으로 분할한다. 객체 분할부(101)는, 나머지 프레임 각각에 대해 타겟객체의 영역인 마스크를 획득한다. The object segmentation unit 101 inputs the mask generated for the first frame and the remaining frames of the video into the video object segmentation model, and the video object segmentation model uses that mask to divide each remaining frame into the area of the target object and the remaining background area. The object segmentation unit 101 obtains, for each of the remaining frames, a mask that is the area of the target object.

도 6은 도 3의 S200단계의 구체적인 과정의 일 실시 예를 도시하는 흐름도이다. FIG. 6 is a flowchart showing an embodiment of the specific process of step S200 of FIG. 3.

비디오 오브젝트 세그멘테이션 모델을 이용해 나머지 프레임 모두를 타겟객체의 영역과 나머지 배경 영역으로 분리하는 경우, 비디오의 재생시간이 길어져 프레임의 수가 상대적으로 커질수록 타겟객체의 영역 탐지의 정확도가 저감되는 문제가 발생될 수 있다. When the video object segmentation model is used to separate all of the remaining frames into the area of the target object and the remaining background area, the accuracy of detecting the target object's area may decrease as the playback time of the video grows and the number of frames becomes relatively large.

비디오 오브젝트 세그멘테이션 모델이 나머지 프레임에서 타겟객체를 탐지하는 과정의 중간에 이미지 세그멘테이션 모델이 개입하는 경우, 프레임의 수가 증가하는 경우에도 타겟객체의 영역 탐지의 정확도 저감을 방지할 수 있다. If the image segmentation model intervenes partway through the process in which the video object segmentation model detects the target object in the remaining frames, a decrease in the accuracy of detecting the target object's area can be prevented even when the number of frames increases.

도 6을 참조하면, 객체 분할부(101)는, 첫 프레임에서 생성된 마스크 및 비디오 오브젝트 세그멘테이션 모델을 이용해 미리 설정된 기준 개수의 프레임에 포함된 타겟객체를 탐지한다(S210). Referring to FIG. 6, the object segmentation unit 101 detects the target object included in a preset reference number of frames using the mask generated for the first frame and the video object segmentation model (S210).

객체 분할부(101)는, 첫 프레임에서 생성된 마스크를 비디오 오브젝트 세그멘테이션 모델에 제공하고, 비디오 오브젝트 세그멘테이션 모델을 이용해 첫 프레임 이후 미리 설정된 기준 개수만큼의 프레임을 타겟객체의 영역과 나머지 배경 영역으로 분리한다. 객체 분할부(101)는, 나머지 프레임 각각에 대해 타겟객체의 영역인 마스크를 획득한다. The object segmentation unit 101 provides the mask generated for the first frame to the video object segmentation model and uses the video object segmentation model to separate the preset reference number of frames following the first frame into the area of the target object and the remaining background area. The object segmentation unit 101 obtains, for each of the remaining frames, a mask that is the area of the target object.

미리 설정된 기준 개수만큼의 프레임에 대한 마스크 생성이 완료되면, 객체 분할부(101)는, 다음 프레임에 포함된 타겟객체와 매칭되는 마스크를 생성한다(S220). When mask generation for the preset reference number of frames is complete, the object segmentation unit 101 generates a mask matching the target object included in the next frame (S220).

일 실시 예에서, 객체 분할부(101)는, 사용자 단말(200)에, 프레임 내에 관심영역(ROI)을 지정할 수 있는 사용자 인터페이스를 제공하고, 이미지 세그멘테이션 모델을 이용해 사용자 단말(200)로부터 수신한 관심영역에 포함된 타겟객체와 매칭되는 마스크를 생성할 수 있다. In one embodiment, the object segmentation unit 101 may provide the user terminal 200 with a user interface for designating a region of interest (ROI) within a frame, and may use the image segmentation model to generate a mask matching the target object included in the region of interest received from the user terminal 200.

일 실시 예에서, 객체 분할부(101)는, 비디오 오브젝트 세그멘테이션 모델이 마지막으로 처리한 프레임의 타겟객체를 포함하는 미리 설정된 크기의 관심영역을 생성하고, 다음 프레임에서 관심영역과 매칭되는 부분을 이미지 세그멘테이션 모델에 입력하며, 이미지 세그멘테이션 모델은 관심영역에 포함된 타겟객체를 식별하고 타겟객체의 영역과 나머지 배경 영역을 분리한다. In one embodiment, the object segmentation unit 101 creates a region of interest of a preset size that includes the target object of the frame last processed by the video object segmentation model, and inputs the portion of the next frame that matches the region of interest into the image segmentation model; the image segmentation model identifies the target object included in the region of interest and separates the area of the target object from the remaining background area.
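
The region-of-interest step of this embodiment can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: masks are assumed to be nested lists of 0/1 values, the preset size is passed as `roi_w` and `roi_h`, and the window is centred on the bounding box of the last mask and clamped to the frame. The names `mask_bounding_box` and `roi_from_mask` are hypothetical.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]            # 1 = target object, 0 = background
Box = Tuple[int, int, int, int]   # (left, top, right, bottom); right/bottom exclusive


def mask_bounding_box(mask: Mask) -> Optional[Box]:
    """Return the tight box around all non-zero mask pixels, or None if empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return min(cols), rows[0], max(cols) + 1, rows[-1] + 1


def roi_from_mask(mask: Mask, roi_w: int, roi_h: int) -> Optional[Box]:
    """Centre a roi_w x roi_h window on the object and clamp it to the frame."""
    box = mask_bounding_box(mask)
    if box is None:
        return None
    frame_h, frame_w = len(mask), len(mask[0])
    # Assumption: a window larger than the frame is shrunk to the frame size.
    roi_w, roi_h = min(roi_w, frame_w), min(roi_h, frame_h)
    center_x = (box[0] + box[2]) // 2
    center_y = (box[1] + box[3]) // 2
    left = max(0, min(center_x - roi_w // 2, frame_w - roi_w))
    top = max(0, min(center_y - roi_h // 2, frame_h - roi_h))
    return left, top, left + roi_w, top + roi_h
```

The returned box would then be used to crop the next frame before it is passed to the image segmentation model. If the object is larger than the preset window, the window does not cover it entirely, so a practical system would likely enlarge the window in that case.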

객체 분할부(101)는, 이미지 세그멘테이션 모델을 이용해, 비디오 오브젝트 세그멘테이션 모델이 마지막으로 처리한 프레임의 다음 프레임에 포함된 타겟객체의 영역인 마스크를 생성한다. The object segmentation unit 101 uses an image segmentation model to generate a mask that is an area of the target object included in the next frame of the frame last processed by the video object segmentation model.

이미지 세그멘테이션 모델을 이용한 마스크 생성이 완료되면, 객체 분할부(101)는, 나머지 프레임 중 미리 설정된 기준 개수의 프레임에 포함된 타겟객체를 탐지한다(S230). 타겟객체 탐지에는 비디오 오브젝트 세그멘테이션 모델이 처리한 마지막 프레임의 다음 프레임에서 생성된 마스크 및 비디오 오브젝트 세그멘테이션 모델이 사용된다. When mask generation using the image segmentation model is complete, the object segmentation unit 101 detects the target object included in a preset reference number of the remaining frames (S230). This detection uses the video object segmentation model together with the mask generated for the frame that follows the last frame the video object segmentation model processed.

미리 설정된 기준 개수의 프레임에 포함된 타겟객체의 탐지가 완료되면, 객체 분할부(101)는, 처리되지 않은 프레임이 존재하는지 판단한다(S240). When detection of the target object included in the preset reference number of frames is complete, the object segmentation unit 101 determines whether any unprocessed frames remain (S240).

처리되지 않은 프레임이 존재하는 경우, 객체 분할부(101)는, S220단계 및 S230단계를 다시 수행한다. If unprocessed frames remain, the object segmentation unit 101 performs steps S220 and S230 again.

처리되지 않은 프레임이 존재하지 않는 경우, 객체 분할부(101)는, 타겟객체 탐지를 종료한다. If there are no unprocessed frames, the object segmentation unit 101 ends target object detection.
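
The loop formed by steps S210 to S240 can be sketched as follows. This is an illustrative sketch only: the two models are represented by hypothetical callables, `segment_image(frame, previous_mask)` for the image segmentation model (with `previous_mask` set to `None` for the first frame, where the region of interest comes from the user) and `propagate(anchor_mask, frames)` for the video object segmentation model, which is assumed to return one mask per frame it receives. `chunk_size` stands for the preset reference number of frames.

```python
from typing import Any, Callable, List, Optional, Sequence


def track_in_chunks(
    frames: Sequence[Any],
    segment_image: Callable[[Any, Optional[Any]], Any],
    propagate: Callable[[Any, Sequence[Any]], List[Any]],
    chunk_size: int,
) -> List[Any]:
    """Return one mask per frame, re-anchoring after every `chunk_size` frames."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    masks: List[Any] = []
    previous_mask: Optional[Any] = None
    index = 0
    while index < len(frames):  # S240: stop once no unprocessed frame remains
        # S100 / S220: the image segmentation model produces an anchor mask.
        anchor = segment_image(frames[index], previous_mask)
        masks.append(anchor)
        # S210 / S230: the video object segmentation model handles the next chunk.
        chunk = frames[index + 1 : index + 1 + chunk_size]
        if chunk:
            masks.extend(propagate(anchor, chunk))
        previous_mask = masks[-1]
        index += 1 + len(chunk)
    return masks
```

With seven frames and `chunk_size=2`, frames 0, 3, and 6 would be handled by the image segmentation model and the rest by the video object segmentation model.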

도 7은 비디오 오브젝트 세그멘테이션 모델의 동작 중 이미지 세그멘테이션 모델이 개입되는 과정을 도시하는 도면이다. FIG. 7 is a diagram showing the process in which the image segmentation model intervenes during the operation of the video object segmentation model.

도시된 실시 예에서, 비디오 오브젝트 세그멘테이션 모델이 나머지 프레임에서 타겟객체를 탐지하는 과정의 중간에 이미지 세그멘테이션 모델이 개입하여 타겟객체의 마스크를 생성하고, 생성한 마스크를 비디오 오브젝트 세그멘테이션 모델에 제공한다. 즉, 프레임이 분할되어 처리되므로, 프레임의 수가 증가하는 경우에도 타겟객체 탐지의 정확도 저감을 방지할 수 있다.In the illustrated embodiment, the image segmentation model intervenes in the middle of the process of the video object segmentation model detecting the target object in the remaining frames, generates a mask of the target object, and provides the generated mask to the video object segmentation model. In other words, since the frames are divided and processed, a decrease in the accuracy of target object detection can be prevented even when the number of frames increases.

다만, 프레임을 분할하여 처리하는 경우, 비디오 오브젝트 세그멘테이션 모델을 이용해 첫 프레임을 제외한 모든 프레임을 연속적으로 처리하는 것에 비해, 연산에 소요되는 시간 및 자원소모가 증가될 수 있다. However, when processing frames by dividing them, the time required for calculation and resource consumption may increase compared to continuously processing all frames except the first frame using the video object segmentation model.

따라서, 비디오 오브젝트 세그멘테이션 모델이 타겟객체 탐지 정확도를 유지하면서 처리할 수 있는 최대의 프레임의 수를 적절하게 설정하는 것이 중요하다. 탐지 정확도가 유지되면서 연속적으로 처리할 수 있는 최적의 프레임의 수를 설정하는 경우, 연산에 소요되는 시간 및 자원소모의 증가를 최소화시킴과 동시에 재생시간이 긴 비디오의 탐지 정확도를 향상시킬 수 있다. Therefore, it is important to appropriately set the maximum number of frames that the video object segmentation model can process while maintaining target object detection accuracy. If you set the optimal number of frames that can be processed continuously while maintaining detection accuracy, you can minimize the increase in computation time and resource consumption and improve the detection accuracy of videos with long playback times.

일 실시 예에서, 객체 분할부(101)는, 프레임 수 결정 모델을 이용해 비디오 오브젝트 세그멘테이션 모델이 한 번에 연속적으로 처리하는 프레임의 수를 결정할 수 있다. 프레임 수 결정 모델은, 추적되는 객체, 처리되는 비디오 프레임의 수, 탐지정확도에 프레임의 수를 레이블링하여 생성된 훈련데이터를 이용한 학습을 통해 생성될 수 있다. 일 실시 예에서, 정확도로는, IoU(Intersection over Union), Pixel Accuracy, Boundary F1 Score, Temporal Consistency 등이 사용될 수 있다. 일 실시 예에서, ANN(Artificial Neural Networks), RNN(Recurrent Neural Networks), LSTM(Long Short-Term Memory), Random Forest, XGBoost 등 공지된 다양한 알고리즘이 학습에 사용될 수 있다. 일 실시 예에서, 서비스 제공 서버(100)의 데이터베이스에는 탐지 객체, 프레임 수, 정확도가 매칭되어 저장될 수 있다. 객체 분할부(101)는, 탐지 객체, 프레임 수와 매칭되는 정확도를 검색한 후, 프레임 수 결정 모델에 탐지 객체, 프레임 수, 정확도를 입력 값으로 입력하고 대응하는 프레임 수를 결과 값으로 획득할 수 있다. 객체 분할부(101)는, 프레임 수 결정 모델로부터 획득한 프레임 수를 비디오 오브젝트 세그멘테이션 모델이 연속적으로 처리하는 프레임의 기준 개수로 설정할 수 있다. In one embodiment, the object segmentation unit 101 may use a frame number determination model to determine how many frames the video object segmentation model processes consecutively at a time. The frame number determination model may be created through learning with training data generated by labeling the tracked object, the number of video frames processed, and the detection accuracy with a number of frames. In one embodiment, the accuracy measure may be Intersection over Union (IoU), Pixel Accuracy, Boundary F1 Score, Temporal Consistency, or the like. In one embodiment, various known algorithms such as Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Random Forest, and XGBoost may be used for learning. In one embodiment, the detection object, the number of frames, and the accuracy may be stored in the database of the service providing server 100 in association with one another. The object segmentation unit 101 may retrieve the accuracy matching the detection object and the number of frames, input the detection object, the number of frames, and the accuracy into the frame number determination model as input values, and obtain the corresponding number of frames as the result value. The object segmentation unit 101 may set the number of frames obtained from the frame number determination model as the reference number of frames that the video object segmentation model processes consecutively.
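
Two of the accuracy measures named above, IoU and Pixel Accuracy, can be computed for a pair of binary masks as sketched below. The sketch assumes masks are equally sized nested lists of 0/1 values; the function names and the convention that two empty masks have an IoU of 1.0 are assumptions made for illustration, not part of the disclosure.

```python
from typing import List

Mask = List[List[int]]  # 1 = target object, 0 = background


def iou(predicted: Mask, reference: Mask) -> float:
    """Intersection over Union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for predicted_row, reference_row in zip(predicted, reference):
        for p, r in zip(predicted_row, reference_row):
            intersection += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


def pixel_accuracy(predicted: Mask, reference: Mask) -> float:
    """Fraction of pixels on which the two masks agree."""
    total = 0
    correct = 0
    for predicted_row, reference_row in zip(predicted, reference):
        for p, r in zip(predicted_row, reference_row):
            total += 1
            correct += 1 if bool(p) == bool(r) else 0
    return correct / total if total else 1.0
```

Values such as these, measured per object and per chunk length, are the kind of accuracy figures that could be stored alongside the detection object and the number of frames.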

도 8 및 도 9는 인페인팅 모델이 비디오에서 탐지된 동적 객체를 제거하고 제거된 부분을 복원하는 과정을 도시하는 도면이다. Figures 8 and 9 are diagrams illustrating a process in which an inpainting model removes a dynamic object detected in a video and restores the removed part.

인페인팅부(102)는, 인페인팅 모델에 프레임 및 프레임의 마스크를 입력 값으로 제공하고, 인페인팅 모델로부터 마스크 부분이 제거 및 복원된 프레임을 획득할 수 있다. The inpainting unit 102 may provide a frame and a frame mask as input values to the inpainting model, and obtain a frame from which the mask portion has been removed and restored from the inpainting model.

인페인팅부(102)는, 인페인팅 모델에 프레임 및 프레임의 마스크를 입력 값으로 제공하고, 인페인팅 모델로부터 마스크 부분을 제외한 배경의 제거 및 배경이 복원된 프레임을 획득할 수 있다. The inpainting unit 102 may provide a frame and the frame's mask as input values to the inpainting model, and obtain from the inpainting model a frame in which the background other than the mask portion has been removed and restored.

인페인팅부(102)는, 사용자 단말(200)에 마스크 또는 배경에 대한 제거 및 복원 요청에 대한 사용자 프롬프트를 입력할 수 있는 사용자 인터페이스를 제공할 수 있다. The inpainting unit 102 may provide the user terminal 200 with a user interface through which a user can input a prompt for requesting removal and restoration of the mask or background.

인페인팅부(102)는, 사용자 단말(200)로부터 사용자가 제공한 사용자 프롬프트를 수신한다. 인페인팅부(102)는, 사용자 프롬프트를 이용해 인페인팅 모델에 입력될 입력 프롬프트를 생성하고, 입력 프롬프트 및 네거티브 프롬프트를 프레임 및 마스크와 함께 인페인팅 모델에 제공한다. 인페인팅 모델은 입력 프롬프트 및 네거티브 프롬프트를 프레임 및 마스크와 함께 입력받아 출력 프레임을 생성하고 이를 인페인팅부(102)에 제공한다. 일 실시 예에서, 인페인팅부(102)는, 사용자 프롬프트와 동일한 입력 프롬프트를 생성할 수 있다. The inpainting unit 102 receives a user prompt provided by the user from the user terminal 200. The inpainting unit 102 uses the user prompt to generate an input prompt for the inpainting model, and provides the input prompt and a negative prompt to the inpainting model together with the frame and the mask. The inpainting model receives the input prompt and the negative prompt together with the frame and the mask, generates an output frame, and provides it to the inpainting unit 102. In one embodiment, the inpainting unit 102 may generate an input prompt identical to the user prompt.

인페인팅부(102)는, 출력 프레임을 사용자 단말(200)에 제공한다. The inpainting unit 102 provides an output frame to the user terminal 200.

예를 들어, 입력 프레임 및 마스크와 함께 입력 프롬프트로 "remove it"이 인페인팅 모델에 입력된 경우, 인페인팅 모델은 마스크 부분을 제거하고 제거된 부분을 복원한 출력 프레임을 생성할 수 있다.For example, if “remove it” is input to the inpainting model as an input prompt along with the input frame and mask, the inpainting model may remove the mask portion and generate an output frame in which the removed portion is restored.

예를 들어, 입력 프레임 및 마스크와 함께 입력 프롬프트로 "remove it with tree"가 인페인팅 모델에 입력된 경우, 인페인팅 모델은 마스크 부분을 제거하고 제거된 부분에 나무를 생성한 출력 프레임을 생성할 수 있다. For example, if "remove it with tree" is input to an inpainting model as an input prompt along with an input frame and a mask, the inpainting model will generate an output frame with the masked portion removed and a tree created in the removed portion. You can.

일 실시 예에서, 인페인팅부(102)는, 이미지 생성 성능을 향상시키기 위해 입력 프롬프트와 함께 네거티브 프롬프트를 입력할 수 있다. In one embodiment, the inpainting unit 102 may input a negative prompt along with the input prompt to improve image creation performance.

일 실시 예에서, 인페인팅 모델은 부정적인 출력이미지에 사용된 프롬프트를 네거티브 프롬프트로 레이블링한 훈련데이터를 이용해 네거티브 프롬프트 입력 시 네거티브 프롬프트에 대해 레이블링된 출력이미지를 피하도록 학습될 수 있다. In one embodiment, the inpainting model can be trained to avoid output images labeled for negative prompts when negative prompts are input using training data that labels the prompts used for negative output images as negative prompts.

일 실시 예에서, 인페인팅부(102)는, 네거티브 워드 생성 모델을 이용해 출력이미지의 퀄리티 향상을 위한 네거티브 워드를 획득하고, 네거티브 워드를 이용해 네거티브 프롬프트를 생성할 수 있다. 네거티브 워드 생성 모델은, 프레임에 포함된 객체들, 프레임에서 제거된 객체와 네거티브 워딩을 레이블링해서 생성된 훈련데이터를 이용한 기계학습을 통해 훈련될 수 있다. 예를 들어, 프레임에 포함된 객체가 "car", "window", "human", "balcony", "pillar", "entrance door"이고, 제거되는 객체가 "human"인 경우, "car", "window", "human", "balcony", "pillar", "entrance door", "human"을 전처리하고, 네거티브 워딩 "cartoon", "cgi", "render", "illustration", "painting", "drawing", "front door"을 레이블링하여 훈련데이터를 생성하고, 훈련데이터를 이용한 학습을 통해 네거티브 워드 생성 모델이 생성될 수 있다. 예를 들어, ANN(Artificial Neural Networks), RNN(Recurrent Neural Networks), LSTM(Long Short-Term Memory) 등 공지된 다양한 딥러닝 분야 알고리즘이 학습에 사용될 수 있다. In one embodiment, the inpainting unit 102 may use a negative word generation model to obtain negative words for improving the quality of the output image, and generate a negative prompt from the negative words. The negative word generation model may be trained through machine learning with training data generated by labeling the objects included in a frame and the object removed from the frame with negative wording. For example, if the objects included in the frame are "car", "window", "human", "balcony", "pillar", and "entrance door", and the object to be removed is "human", training data may be generated by preprocessing "car", "window", "human", "balcony", "pillar", "entrance door", and "human" and labeling them with the negative wording "cartoon", "cgi", "render", "illustration", "painting", "drawing", and "front door"; the negative word generation model may then be created through learning with the training data. For example, various known deep learning algorithms such as ANN (Artificial Neural Networks), RNN (Recurrent Neural Networks), and LSTM (Long Short-Term Memory) may be used for learning.

일 실시 예에서, 인페인팅부(102)는, 비디오 첫 프레임을 이미지 세그멘테이션 모델에 입력하여 첫 프레임에 포함된 모든 객체를 획득하고, 첫 프레임에 포함된 모든 객체 및 제거된 객체를 네거티브 워드 생성 모델에 입력 값으로 입력하며, 네거티브 워드 생성 모델로부터 네거티브 워드를 결과 값으로 획득할 수 있다. 인페인팅부(102)는, 획득한 네거티브 워드를 나열하여 네거티브 프롬프트를 생성할 수 있다. 예를 들어, 네거티브 워드가 "cartoon", "cgi", "render", "illustration", "painting", "drawing"인 경우, "cartoon, cgi, render, illustration, painting, drawing"을 네거티브 프롬프트로 생성한다. 인페인팅부(102)는, 비디오 프레임, 비디오 프레임 각각의 마스크, 입력 프롬프트, 네거티브 프롬프트를 인페인팅 모델에 입력하고, 인페인팅 모델로부터 비디오 프레임 각각과 대응되는 출력프레임을 획득할 수 있다. 인페인팅부(102)는, 비디오 프레임 각각과 대응하여 생성된 출력프레임을 사용자 단말(200)에 제공한다. In one embodiment, the inpainting unit 102 may input the first frame of the video into the image segmentation model to obtain all objects included in the first frame, input all objects included in the first frame and the removed object into the negative word generation model as input values, and obtain negative words from the negative word generation model as the result value. The inpainting unit 102 may generate a negative prompt by listing the obtained negative words. For example, if the negative words are "cartoon", "cgi", "render", "illustration", "painting", and "drawing", the negative prompt "cartoon, cgi, render, illustration, painting, drawing" is generated. The inpainting unit 102 may input the video frames, the mask of each video frame, the input prompt, and the negative prompt into the inpainting model, and obtain from the inpainting model an output frame corresponding to each video frame. The inpainting unit 102 provides the output frames generated for the respective video frames to the user terminal 200.
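
Listing the negative words into a negative prompt, as in the example above, amounts to joining them with a separator. The following sketch shows one way this could be done; the function name `build_negative_prompt` is hypothetical, and the trimming of whitespace and the case-insensitive removal of duplicates are assumptions added for robustness rather than steps stated in the description.

```python
from typing import Iterable, List, Set


def build_negative_prompt(negative_words: Iterable[str]) -> str:
    """Join negative words into a comma-separated negative prompt."""
    seen: Set[str] = set()
    ordered: List[str] = []
    for word in negative_words:
        cleaned = word.strip()
        key = cleaned.lower()
        # Assumption: skip empty entries and case-insensitive duplicates.
        if cleaned and key not in seen:
            seen.add(key)
            ordered.append(cleaned)
    return ", ".join(ordered)


words = ["cartoon", "cgi", "render", "illustration", "painting", "drawing"]
negative_prompt = build_negative_prompt(words)
```

The resulting string would be passed to the inpainting model together with the frames, their masks, and the input prompt.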

도 10은 도 1에 따른 서비스 제공 서버가 원본비디오에 특수효과 비디오를 합성하는 과정을 도시하는 흐름도이다.FIG. 10 is a flowchart showing a process in which the service providing server according to FIG. 1 synthesizes a special effect video with the original video.

도 10을 참조하면, 합성부(105)가, 사용자 단말(200)로부터 수신한 영역을 이용해 원본비디오에서의 특수효과 입력영역을 지정한다(S310). Referring to FIG. 10, the synthesis unit 105 designates a special effect input area in the original video using the area received from the user terminal 200 (S310).

합성부(105)는, 사용자 단말(200)에 특수효과 합성에 사용될 원본비디오를 입력할 수 있는 사용자 인터페이스를 제공한다. 합성부(105)는, 사용자 단말(200)에, 원본비디오에서 특수효과를 합성할 영역을 지정할 수 있는 사용자 인터페이스를 제공한다. 합성부(105)는, 사용자 단말(200)로부터 원본비디오를 제공받을 수 있으며, 원본비디오에 포함된 복수의 프레임 중 적어도 하나에 대하여 특수효과를 합성할 영역을 수신할 수 있다. The synthesis unit 105 provides the user terminal 200 with a user interface through which original video to be used for combining special effects can be input. The synthesis unit 105 provides the user terminal 200 with a user interface that allows designating an area to synthesize special effects in the original video. The synthesis unit 105 may receive an original video from the user terminal 200, and may receive an area in which a special effect is to be synthesized for at least one of a plurality of frames included in the original video.

합성부(105)는, 사용자 단말(200)로부터 수신한 특수효과를 합성할 영역을 이용해 원본비디오에서의 특수효과 입력영역을 지정할 수 있다. The synthesis unit 105 can designate a special effect input area in the original video using the area where the special effect received from the user terminal 200 is to be synthesized.

일 실시 예에서, 합성부(105)는, 사용자 단말(200)로부터 수신한 특수효과를 합성할 영역을 원본비디오에서의 특수효과 입력영역으로 지정할 수 있다. In one embodiment, the synthesis unit 105 may designate an area for combining special effects received from the user terminal 200 as a special effect input area in the original video.

도 11은 도 10의 S310단계의 구체적인 과정을 도시하는 흐름도이다. FIG. 11 is a flowchart showing the specific process of step S310 of FIG. 10.

일 실시 예에서, 사용자 단말(200)로부터 첫 프레임에 대해 특수효과를 입력할 영역을 수신하여 첫 프레임에 대해 특수효과 입력영역을 지정한 경우, 합성부(105)는, 객체 분할부(101)에 나머지 프레임에 대한 동적 객체 추적을 요청할 수 있다. In one embodiment, when an area in which to input a special effect is received from the user terminal 200 for the first frame and the special effect input area is designated for the first frame, the synthesis unit 105 may request the object segmentation unit 101 to perform dynamic object tracking on the remaining frames.

객체 분할부(101)는, 이미지 세그멘테이션 모델을 이용해 첫 프레임의 특수효과 입력영역에 포함된 객체와 매칭되는 마스크를 생성한다(S311). The object segmentation unit 101 uses an image segmentation model to create a mask that matches the object included in the special effect input area of the first frame (S311).

객체 분할부(101)는, 첫 프레임에서 특수효과 입력영역을 이미지 세그멘테이션 모델에 입력하고, 이미지 세그멘테이션 모델은 특수효과 입력영역에 포함된 객체를 식별하고, 식별된 객체를 사용자 단말(200)에 제공한다. The object segmentation unit 101 inputs the special effect input area of the first frame into the image segmentation model; the image segmentation model identifies the objects included in the special effect input area, and the identified objects are provided to the user terminal 200.

객체 분할부(101)는, 사용자 단말(200)로부터 제공된 객체 중 동적 객체 추적의 기준이 될 타겟객체를 수신할 수 있다. 객체 분할부(101)는, 사용자 단말(200)에 제공된 객체 중 어느 하나를 선택할 수 있는 사용자 인터페이스를 제공할 수 있다. 타겟객체가 결정되면, 객체 분할부(101)는, 타겟객체영역과 배경영역을 분리하고, 타겟객체영역과 매칭되는 마스크를 생성한다. The object segmentation unit 101 may receive from the user terminal 200 the target object, among the provided objects, that will serve as the reference for dynamic object tracking. The object segmentation unit 101 may provide the user terminal 200 with a user interface for selecting one of the provided objects. Once the target object is determined, the object segmentation unit 101 separates the target object area from the background area and generates a mask matching the target object area.

객체 분할부(101)는, 첫 프레임에서 생성된 마스크와 비디오 오브젝트 세그멘테이션 모델을 이용해 원본비디오의 나머지 프레임에 포함된 객체를 탐지하고, 나머지 프레임을 타겟객체영역과 배경영역으로 분리한다(S312). The object segmentation unit 101 detects objects included in the remaining frames of the original video using the mask generated in the first frame and the video object segmentation model, and separates the remaining frames into a target object area and a background area (S312).

연속적인 프레임에서 타겟객체영역과 배경영역을 분리하는 구체적인 과정에 대해서는 도 5 내지 도 7을 참조하여 구체적으로 상술한 바, 이에 갈음한다. The specific process of separating the target object area and the background area in consecutive frames has been described in detail above with reference to FIGS. 5 to 7, and the description is not repeated here.

나머지 프레임 각각에 대해 타겟객체영역과 배경영역이 분리되면, 합성부(105)는, 나머지 프레임 각각의 타겟객체영역을 이용하여 나머지 프레임 각각의 특수효과 입력영역을 생성한다(S313).When the target object area and the background area are separated for each of the remaining frames, the synthesis unit 105 creates a special effect input area for each of the remaining frames using the target object area of each of the remaining frames (S313).

일 실시 예에서, 합성부(105)는, 첫 프레임의 타겟객체영역과 특수효과 입력영역의 위치관계를 이용해 나머지 프레임 각각의 특수효과 입력영역을 생성할 수 있다. 예를 들어, 특수효과 입력영역이 사각형으로 형성되는 경우, 합성부(105)는, 첫 프레임의 타겟객체영역의 둘레와 특수효과 입력영역의 둘레 사이의 좌측, 우측, 상측, 하측 최단거리를 이용해 나머지 프레임 각각의 특수효과 입력영역을 생성할 수 있다. In one embodiment, the synthesis unit 105 may generate the special effect input area for each of the remaining frames using the positional relationship between the target object area and the special effect input area in the first frame. For example, when the special effect input area is rectangular, the synthesis unit 105 may generate the special effect input area for each of the remaining frames using the shortest left, right, upper, and lower distances between the perimeter of the target object area and the perimeter of the special effect input area in the first frame.
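
The rectangular example above can be sketched as follows. As a simplifying assumption, the target object area in each frame is represented by its bounding box, so that the four shortest distances reduce to the gaps between the corresponding box edges. The names `edge_margins` and `effect_region_for_frame` are hypothetical, and clamping the result to the frame boundary is left out.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]      # (left, top, right, bottom)
Margins = Tuple[int, int, int, int]  # gaps on the left, top, right, bottom sides


def edge_margins(first_object: Box, first_region: Box) -> Margins:
    """Gaps between the object box and the effect region in the first frame."""
    return (
        first_object[0] - first_region[0],  # left
        first_object[1] - first_region[1],  # top
        first_region[2] - first_object[2],  # right
        first_region[3] - first_object[3],  # bottom
    )


def effect_region_for_frame(object_box: Box, margins: Margins) -> Box:
    """Re-apply the first-frame gaps around the object box of a later frame."""
    return (
        object_box[0] - margins[0],
        object_box[1] - margins[1],
        object_box[2] + margins[2],
        object_box[3] + margins[3],
    )


margins = edge_margins((10, 20, 30, 40), (5, 10, 40, 60))
later_region = effect_region_for_frame((50, 25, 75, 45), margins)
```

Because the gaps are kept fixed, the region follows the object and grows or shrinks with it from frame to frame.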

다시 도 10을 참조하면, 비디오 생성부(104)가 비디오 생성 모델을 이용해 특수효과 비디오를 생성한다(S320). Referring again to FIG. 10, the video generator 104 generates a special effect video using the video generation model (S320).

도 12는 도 10의 S320단계의 구체적인 과정을 도시하는 흐름도이다. 도 13은 특수효과 비디오를 생성하는 과정을 개념적으로 도시하는 도면이다. FIG. 12 is a flowchart showing the specific process of step S320 of FIG. 10. Figure 13 is a diagram conceptually showing the process of creating a special effects video.

도 12를 참조하면, 이미지 생성부(103)가, 이미지 생성 모델을 이용해 특수효과 이미지를 생성한다(S321). 예를 들어, 이미지 생성부(103)가, 이미지 생성 모델에 "flame on a chromakey green background"를 입력 프롬프트로 입력하고, 이미지 생성 모델로부터 녹색 크로마키 배경에 화염이 타오르고 있는 이미지를 획득할 수 있다. Referring to FIG. 12, the image generator 103 generates a special effect image using the image generation model (S321). For example, the image generator 103 may input "flame on a chromakey green background" into the image generation model as an input prompt and obtain from the image generation model an image of a flame burning on a green chroma key background.

비디오 생성부(104)는, 생성된 특수효과 이미지와 비디오 생성 모델을 이용해 특수효과 비디오를 생성한다(S322). The video generator 104 generates a special effect video using the generated special effect image and the video generation model (S322).

일 실시 예에서, 비디오 생성 모델이 이미지와 이미지와 관련된 비디오 시퀸스를 훈련용 데이터로 학습된 경우, 비디오 생성부(104)는, 비디오 생성모델에 생성된 특수효과 이미지를 입력하고, 비디오 생성 모델로부터 특수효과 이미지와 연결된 비디오 시퀸스를 획득한다. In one embodiment, when the video generation model has been trained with images and video sequences related to the images as training data, the video generator 104 inputs the generated special effect image into the video generation model and obtains from the video generation model a video sequence linked to the special effect image.

일 실시 예에서, 비디오 생성 모델이 텍스트 및 이미지와 텍스트 및 이미지와 관련된 비디오 시퀸스를 훈련용 데이터로 학습된 경우, 비디오 생성부(104)는 비디오 생성모델에 생성된 특수효과 이미지와 입력 프롬프트를 입력하고, 비디오 생성 모델로부터 특수효과 이미지와 입력 프롬프트에 대응하는 비디오 시퀸스를 획득한다. 예를 들어, 녹색 크로마키 배경에 화염이 타오르고 있는 특수효과 이미지와 "flame is burning"과 같은 입력 프롬프트가 비디오 생성 모델에 입력될 수 있다. In one embodiment, when the video generation model has been trained with text and images, together with video sequences related to the text and images, as training data, the video generator 104 inputs the generated special effect image and an input prompt into the video generation model and obtains from the video generation model a video sequence corresponding to the special effect image and the input prompt. For example, a special effect image of a flame burning on a green chroma key background and an input prompt such as "flame is burning" may be input into the video generation model.

도시되지 않은 실시 예에서, 비디오 생성부(104)는, 이미지 생성 모델을 이용해 생성한 특수효과 이미지가 아닌, 데이터베이스에 미리 저장된 특수효과 이미지를 사용해 특수효과 비디오를 생성할 수 있다. In an embodiment not shown, the video generator 104 may generate a special effect video using a special effect image pre-stored in a database, rather than a special effect image generated using an image generation model.

도시되지 않은 실시 예에서, 비디오 생성 모델이, 텍스트와 텍스트와 관련된 비디오 시퀸스를 훈련용 데이터로 학습된 경우, 비디오 생성부(104)는, 이미지 생성 모델을 이용해 특수효과 이미지를 생성하지 않고, 입력 프롬프트만을 비디오 생성 모델에 입력하고, 비디오 생성 모델로부터 입력 프롬프트와 대응되는 비디오 시퀸스를 획득할 수 있다. In an embodiment not shown, when the video generation model has been trained with text and video sequences related to the text as training data, the video generator 104 may, without generating a special effect image using the image generation model, input only the input prompt into the video generation model and obtain from the video generation model a video sequence corresponding to the input prompt.

Once the special effect video (video sequence) has been generated, the object segmentation unit 101 uses the image segmentation model to separate the special effect video into a special effect object area and a background area (S323).

In one embodiment, the object segmentation unit 101 may identify the special effect object included in the special effect video using the image segmentation model. For example, the object segmentation unit 101 may input each of the plurality of frames of the special effect video into the image segmentation model and obtain from the model the special effect object area in each frame.

In one embodiment, the object segmentation unit 101 inputs the first frame of the special effect video into the image segmentation model, and the image segmentation model generates a mask corresponding to the special effect object included in the special effect input area.

The object segmentation unit 101 uses the mask generated for the first frame and the video object segmentation model to detect the object in the remaining frames of the special effect video, and separates each remaining frame into a special effect object area and a background area.
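The first-frame-then-propagate flow described above can be sketched as follows. This is a minimal illustration only: `segment_first` and `propagate` are hypothetical callables standing in for the image segmentation model and the video object segmentation model, whose actual interfaces are not given in this description, and passing only the previous mask forward is a simplifying assumption.

```python
def propagate_masks(frames, segment_first, propagate):
    """Build one mask per frame from a first-frame mask.

    frames:        sequence of frames (any type the callables accept)
    segment_first: hypothetical stand-in for the image segmentation model;
                   takes the first frame, returns its object mask
    propagate:     hypothetical stand-in for the video object segmentation
                   model; takes (frame, previous_mask), returns a mask
    """
    if not frames:
        return []
    masks = [segment_first(frames[0])]      # mask for the first frame
    for frame in frames[1:]:                # remaining frames
        masks.append(propagate(frame, masks[-1]))
    return masks
```

Each returned mask splits its frame into the special effect object area (inside the mask) and the background area (everything else).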

The specific process of separating the object area from the background area across consecutive frames has been described in detail above with reference to FIGS. 5 to 7, so the description is not repeated here.

The video generation unit 104 removes the background area from each of the plurality of frames of the special effect video (S324).
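As a rough sketch of step S324, background removal can be expressed as attaching an alpha channel that is transparent wherever the per-frame mask marks background. The function name `remove_background`, the RGBA representation, and the boolean mask layout are assumptions made for illustration; the description does not specify how the removed background is encoded.

```python
import numpy as np

def remove_background(frames, masks):
    """Return RGBA frames whose alpha is 0 on the background area.

    frames: list of (H, W, 3) uint8 arrays (special effect video frames)
    masks:  list of (H, W) bool arrays, True on the special effect object area
    """
    out = []
    for frame, mask in zip(frames, masks):
        # Opaque on the object area, fully transparent on the background.
        alpha = np.where(mask, 255, 0).astype(np.uint8)
        out.append(np.dstack([frame, alpha]))
    return out
```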

Referring again to FIG. 10, the synthesis unit 105 composites the special effect video into the special effect input area of the original video (S330).

The synthesis unit 105 matches the frames of the original video to the frames of the special effect video one-to-one and composites each frame of the special effect video into the special effect input area of the corresponding frame of the original video.
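The one-to-one matching and per-frame compositing can be illustrated with a simple alpha blend. This sketch assumes the special effect frames are RGBA with the background already made transparent, that the special effect input area is approximated by a rectangle whose top-left corner is `top_left`, and that the effect frame fits inside the original frame. None of these details are fixed by the description, and `composite` is a hypothetical name.

```python
import numpy as np

def composite(original_frames, effect_rgba_frames, top_left):
    """Alpha-blend effect frame i onto original frame i (1:1 matching).

    original_frames:    list of (H, W, 3) uint8 arrays
    effect_rgba_frames: list of (h, w, 4) uint8 arrays, alpha 0 on background
    top_left:           (row, col) of the assumed rectangular input area
    """
    if len(original_frames) != len(effect_rgba_frames):
        raise ValueError("1:1 matching needs equal frame counts")
    r, c = top_left
    result = []
    for base, fx in zip(original_frames, effect_rgba_frames):
        out = base.astype(np.float32)          # astype returns a copy
        h, w = fx.shape[:2]
        alpha = fx[..., 3:4].astype(np.float32) / 255.0
        region = out[r:r + h, c:c + w]
        # Effect pixels where alpha is 1, original pixels where alpha is 0.
        out[r:r + h, c:c + w] = alpha * fx[..., :3] + (1.0 - alpha) * region
        result.append(out.round().astype(np.uint8))
    return result
```

Raising an error on unequal frame counts is one possible policy; trimming or looping the effect video would be alternatives that the description does not address.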

FIG. 14 is a flowchart showing a process in which the service providing server 100 of FIG. 1 generates a 3D model.

Referring to FIG. 14, the video generation unit 104 uses the video generation model to generate a plurality of modeling images captured from multiple angles (S410).

In one embodiment, the video generation unit 104 may input an input prompt into the video generation model and obtain a plurality of modeling images from the model. For example, the video generation unit 104 may input "A video sequence showcasing a chair from various angles, suitable for generation with a video creation model. The video should feature the chair in different positions: frontal, side, rear, and diagonal views" and obtain from the video generation model a plurality of modeling images showing the chair from various angles.

In one embodiment, the image generation unit 103 may input an input prompt into the image generation model and obtain a plurality of modeling images from the model. For example, the image generation unit 103 may input "A collection of images showcasing a chair from various angles, suitable for generation with Stable Diffusion. The images should feature the chair in different positions: frontal, side, rear, and diagonal views" and obtain from the image generation model a plurality of modeling images showing the chair from various angles.

FIG. 15 is a flowchart illustrating an embodiment of step S410 of FIG. 14.

Referring to FIG. 15, the video generation unit 104 generates a plurality of modeling images using the video generation model (S411).

In one embodiment, the image generation unit 103 may generate the plurality of modeling images using the image generation model.

The video generation unit 104 designates a marker creation area in one of the plurality of modeling images (S412).

In one embodiment, the video generation unit 104 may designate a preset number of marker creation areas in the object area corresponding to the object included in the modeling image.

In one embodiment, the video generation unit 104 may designate a preset number of internal marker creation areas in the object area corresponding to the object included in the modeling image, and a preset number of external marker creation areas in the background area outside the object area.
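One way to place a preset number of internal and external marker creation areas is to sample pixel positions from the object mask and from its complement. The description does not say how positions are chosen, so the uniform random sampling below, the function name `pick_marker_centers`, and the use of single pixel centers to represent each area are all assumptions for illustration.

```python
import numpy as np

def pick_marker_centers(object_mask, n_inner, n_outer, seed=0):
    """Pick marker centers inside and outside the object area.

    object_mask: (H, W) bool array, True on the object area
    n_inner:     preset number of internal marker creation areas
    n_outer:     preset number of external marker creation areas
    Returns two integer arrays of (row, col) positions.
    """
    rng = np.random.default_rng(seed)
    inner_pool = np.argwhere(object_mask)     # pixels on the object
    outer_pool = np.argwhere(~object_mask)    # pixels on the background
    # Sampling without replacement avoids duplicate centers; it raises
    # ValueError if a pool has fewer pixels than requested.
    inner = inner_pool[rng.choice(len(inner_pool), size=n_inner, replace=False)]
    outer = outer_pool[rng.choice(len(outer_pool), size=n_outer, replace=False)]
    return inner, outer
```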

In one embodiment, the video generation unit 104 may provide the user terminal 200 with a user interface for designating a marker creation area in the modeling image. The video generation unit 104 may designate the marker creation area received from the user terminal 200 as the marker creation area of the modeling image.

The object segmentation unit 101 uses the marker creation area and the video object segmentation model to separate the remaining modeling images among the plurality of modeling images into a marker creation area and a background area (S413).

The object segmentation unit 101 uses a mask corresponding to the marker creation area designated in one of the modeling images, together with the video object segmentation model, to detect the marker creation area in the remaining modeling images, and separates each remaining modeling image into a marker creation area and a background area.

The specific process of separating a particular object area from the background area across consecutive frames has been described in detail above with reference to FIGS. 5 to 7, so the description is not repeated here.

The inpainting unit 102 uses the inpainting model to generate a marker in the marker creation area of each of the plurality of modeling images (S414).

FIG. 18 is a diagram illustrating an example of a plurality of modeling images in which markers have been generated in the marker creation areas using the inpainting model. Referring to FIG. 18, a marker (M) is generated in the marker creation area of each insect modeling image generated from various angles.

In one embodiment, different markers may be inpainted into the marker creation areas formed inside the object and the marker creation areas formed outside the object.

Generating markers in the modeling images can improve 3D modeling performance.

FIG. 16 is a diagram illustrating an embodiment of step S410 of FIG. 14.

Referring to FIG. 16, the image generation unit 103 generates a modeling image using the image generation model (S415).

The image generation unit 103 designates at least one marker creation area in the modeling image (S416).

In one embodiment, the image generation unit 103 may designate a preset number of marker creation areas in the object area corresponding to the object included in the modeling image.

In one embodiment, the image generation unit 103 may designate a preset number of internal marker creation areas in the object area corresponding to the object included in the modeling image, and a preset number of external marker creation areas in the background area outside the object area.

The inpainting unit 102 uses the inpainting model to generate a marker in the marker creation area of the modeling image (S417).

The video generation unit 104 uses the modeling image in which the marker has been generated and the video generation model to generate a plurality of modeling images captured from multiple angles (S418).

In one embodiment, the video generation unit 104 may input the modeling image in which the marker has been generated and an input prompt into the video generation model and obtain a plurality of modeling images from the model. For example, the video generation unit 104 may input "A video sequence showcasing the input image from various angles, suitable for generation with a video creation model. The video should feature the input image in different positions: frontal, side, rear, and diagonal views." as the input prompt, together with the modeling image in which the marker has been generated, into the video generation model.

In one embodiment, the image generation unit 103 may generate the plurality of modeling images using the image generation model. For example, the image generation unit 103 may input "A collection of images showcasing the input image from various angles, suitable for generation with Stable Diffusion. The images should feature the input image in different positions: frontal, side, rear, and diagonal views" as the input prompt, together with the modeling image in which the marker has been generated, into the image generation model.

Referring again to FIG. 14, the 3D generation unit 106 uses the 3D generation model to generate a 3D model matching the plurality of modeling images captured from multiple angles (S420).

FIG. 17 is a flowchart showing the specific process of step S420 of FIG. 14.

The 3D generation unit 106 inputs the plurality of modeling images into the 3D generation model. The 3D generation model determines feature points in each of the modeling images (S421) and uses the feature points to generate a 3D model from the plurality of modeling images (S422).

In one embodiment, the 3D generation model determines feature points in each of the plurality of modeling images, calculates the arrangement of the feature points in three-dimensional space based on the relationships between the determined feature points, and generates the 3D model based on the calculated arrangement.
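The description does not disclose how the 3D generation model computes the arrangement of feature points in space. As a loose illustration of the general idea only, the sketch below uses classical linear (DLT) triangulation, a substituted textbook technique, to recover the 3D position of one feature point matched across two views. It assumes the 3x4 camera projection matrices `P1` and `P2` are known, which the description does not state.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one matched feature point.

    P1, P2: (3, 4) camera projection matrices (assumed known)
    x1, x2: (x, y) image coordinates of the same feature in each view
    Returns the estimated 3D point as a length-3 array.
    """
    # Each view contributes two linear constraints on the homogeneous point X.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector for the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

Distinctive markers generated in the modeling images would, under this reading, serve as feature points that are easy to match between views.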

The 3D generation model maps a UV map onto the 3D model (S423).

FIG. 19 is a diagram illustrating an example of a UV map.

In 3D modeling, UV mapping refers to the process of applying a 2D texture to the surface of a 3D model. The name UV denotes the axes of 2D texture space, chosen to avoid confusion with the X, Y, and Z axes of the 3D modeling space; U and V are the horizontal and vertical axes of the 2D texture, respectively. A UV map defines where each point of the 3D model is located on the 2D texture, which allows textures to be mapped accurately onto complex 3D objects.
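To make the UV lookup concrete, the sketch below fetches the texel for each vertex from its (u, v) coordinates using nearest-texel sampling. The normalization of u and v to [0, 1], the placement of v = 0 at the bottom row of the texture (a common convention, assumed here), and the name `sample_texture` are illustrative assumptions rather than details from the description.

```python
import numpy as np

def sample_texture(texture, uv):
    """Nearest-texel lookup of per-vertex colors.

    texture: (H, W, C) array, row 0 at the top of the image
    uv:      (N, 2) array of (u, v) in [0, 1]; u is horizontal, v is vertical
    Returns an (N, C) array of texel values.
    """
    h, w = texture.shape[:2]
    u = np.clip(uv[:, 0], 0.0, 1.0)
    v = np.clip(uv[:, 1], 0.0, 1.0)
    cols = np.minimum((u * w).astype(int), w - 1)
    # Flip v so that v = 0 maps to the bottom row (assumed convention).
    rows = np.minimum(((1.0 - v) * h).astype(int), h - 1)
    return texture[rows, cols]
```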

In one embodiment, the object segmentation unit 101 separates the UV map into a designated marker removal area and a background area. The object segmentation unit 101 may provide the UV map to the user terminal 200 along with a user interface for designating a marker removal area in the provided UV map. The object segmentation unit 101 separates the marker removal area input from the user terminal from the background area. The inpainting unit 102 may remove the marker removal area from the UV map and then regenerate the removed area.

Generating the 3D model with markers improves 3D model generation performance, but the markers end up in the UV map and therefore appear on the surface of the 3D model. Deleting the marked areas from the UV map, inpainting them, and then mapping the result onto the 3D model removes the markers from the model's surface.
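The delete-then-inpaint step can be illustrated with a very simple fill that replaces each removed pixel with the mean of its already-known 4-neighbors, working inward from the boundary. This is a deliberately crude stand-in for the inpainting model, which the description does not detail; the function name `fill_marker_area` and the (H, W, 3) UV map layout are assumptions.

```python
import numpy as np

def fill_marker_area(uv_map, marker_mask, max_iters=100):
    """Fill the marker removal area from surrounding texture values.

    uv_map:      (H, W, 3) array, the UV texture containing markers
    marker_mask: (H, W) bool array, True on the marker removal area
    """
    img = uv_map.astype(np.float32)
    known = ~marker_mask
    img[~known] = 0.0                          # delete the marked area
    h, w = known.shape
    for _ in range(max_iters):
        if known.all():
            break
        pad_img = np.pad(img, ((1, 1), (1, 1), (0, 0)))
        pad_known = np.pad(known, 1).astype(np.float32)
        total = np.zeros_like(img)
        count = np.zeros((h, w), dtype=np.float32)
        # Accumulate the up, down, left, and right neighbors that are known.
        for dr, dc in ((0, 1), (2, 1), (1, 0), (1, 2)):
            nb_known = pad_known[dr:dr + h, dc:dc + w]
            total += pad_img[dr:dr + h, dc:dc + w] * nb_known[..., None]
            count += nb_known
        fillable = (~known) & (count > 0)
        img[fillable] = total[fillable] / count[fillable][:, None]
        known = known | fillable
    return img.round().astype(uv_map.dtype)
```

A learned inpainting model would reconstruct texture detail far better than this averaging fill; the sketch only shows where the removal mask enters the process.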

FIG. 20 is a diagram illustrating an example hardware configuration of the service providing server 100 of FIG. 1.

Referring to FIG. 20, the service providing server 100 may include at least one processor 110 and a memory that stores instructions directing the at least one processor 110 to perform at least one operation.

The at least one operation may include the operations of the components 101 to 106 of the service providing server 100 described above, or other functions or operating methods.

Here, the at least one processor 110 may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which the methods according to embodiments of the present invention are performed. Each of the memory 120 and the storage device 160 may be composed of at least one of a volatile storage medium and a non-volatile storage medium.

For example, the memory 120 may be one of read-only memory (ROM) and random access memory (RAM), and the storage device 160 may be flash memory, a hard disk drive (HDD), a solid state drive (SSD), or any of various memory cards (for example, a micro SD card).

The server 100 may also include a transceiver 130 that communicates over a wireless network. The server 100 may further include an input interface device 140, an output interface device 150, a storage device 160, and the like. The components included in the server 100 are connected by a bus 170 and can communicate with one another.

Examples of the server 100 include a communication-capable desktop computer, laptop computer, notebook, smartphone, tablet PC, mobile phone, smart watch, smart glasses, e-book reader, portable multimedia player (PMP), portable game console, navigation device, digital camera, digital multimedia broadcasting (DMB) player, digital audio recorder, digital audio player, digital video recorder, digital video player, and personal digital assistant (PDA).

The methods according to the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software.

Examples of computer-readable media include hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as at least one software module in order to perform the operations of the present invention, and vice versa.

In addition, the methods or devices described above may be implemented with all or part of their configurations or functions combined, or implemented separately.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the present invention as set forth in the claims below.

Claims

A server that provides special effect generation and 3D model generation services using a generative AI model, the server comprising:
an object segmentation unit that determines objects included in an input image using an image segmentation model;
a video generation unit that generates a special effect video using a video generation model; and
a synthesis unit that designates a special effect input area in an original video using a user-specified area received from a user terminal and composites the special effect video into the special effect input area of the original video,
wherein the object segmentation unit
separates each of a plurality of frames included in the special effect video into a special effect object area and a background area using the image segmentation model and a video object segmentation model, and
removes the background area from each of the plurality of frames included in the special effect video,
wherein the synthesis unit
designates the user-specified area as the special effect input area of a first frame of the original video,
wherein the object segmentation unit
identifies at least one object included in the special effect input area of the first frame of the original video using the image segmentation model,
determines one of the at least one object as a target object and generates a mask matching the target object, and
separates each remaining frame of the original video into a target object area and a background area using the mask matching the target object and the video object segmentation model,
wherein the synthesis unit
designates the target object area of each remaining frame of the original video as the special effect input area of that frame,
wherein the server further comprises:
an image generation unit that, using each of the plurality of frames included in the special effect video and an image generation model, generates a plurality of modeling images that match each of the plurality of frames and show the special effect object included in each frame from various angles; and
a 3D generation unit that generates a 3D model matching the plurality of modeling images using a 3D generation model,
wherein the video generation unit
designates a marker creation area in one of the plurality of modeling images,
wherein the object segmentation unit
separates the remaining modeling images among the plurality of modeling images into a marker creation area and a background area using the marker creation area and the video object segmentation model, and
wherein the server further comprises
an inpainting unit that generates a marker in the marker creation area of each of the plurality of modeling images using an inpainting model.

delete

The server of claim 1,
wherein the synthesis unit
matches each of the plurality of frames of the original video one-to-one with each of the plurality of frames of the special effect video, and
composites each of the plurality of frames of the special effect video into the special effect input area of the corresponding frame of the original video.

delete