KR20210115401A

KR20210115401A - System and method for intelligent scene split based on deeplearning

Info

Publication number: KR20210115401A
Application number: KR1020200031129A
Authority: KR
Inventors: 심충섭; 하영광
Original assignee: 주식회사 딥하이
Priority date: 2020-03-13
Filing date: 2020-03-13
Publication date: 2021-09-27

Abstract

A deep learning-based intelligent scene segmentation method and a system thereof are disclosed. According to one aspect of the present invention, the deep learning-based intelligent scene segmentation method includes the steps of: generating, by a system, corresponding frame texts for each of a plurality of frame images included in a video to be edited; detecting, by the system, adjacent frame texts having significant semantic changes by analyzing the frame texts; and determining, by the system, a scene transition boundary in the video to be edited based on the adjacent frame texts detected by the system.

Description

Deep learning-based intelligent scene split method and its system {System and method for intelligent scene split based on deeplearning}

The present invention relates to a video editing method and system using deep learning-based image captioning. More particularly, it relates to a system and method that perform image captioning on image frames included in a video and enable the video to be edited effectively based on the text generated by the image captioning.

It also relates to a system and method capable of automatically segmenting the scenes of a video based on text.

The use of video has grown considerably, and many attempts are being made to produce video content for one-person media, media commerce, social networking services (SNS), and the like.

However, completing video content requires editing the recorded video, and such editing takes a relatively large amount of time, know-how, and cost.

Conventional video editing generally requires deleting unnecessary frames or cuts, or picking out only the necessary cuts, which in turn requires editing while watching the frames one by one.

A technical idea that allows such editing to be performed effectively based on text corresponding to the frames included in the video may therefore be needed.

In addition, detecting meaningful scene transition boundaries based on text makes it possible to detect scene transitions even when the change in the image characteristics used by conventional methods is small.

Laid-Open Patent Publication No. 10-2011-0062567 (Method and apparatus for summarizing video contents using video clipping)

An object of the present invention is to provide a method and system that allow a video to be edited based on text corresponding to each of a plurality of frames included in the video.

Another object of the present invention is to provide a method and system that, by determining scene boundaries through text, can detect meaningful scene boundaries that are not detected from image features.

According to one aspect of the present invention, a deep learning-based intelligent scene segmentation method includes the steps of: generating, by a system, corresponding frame texts for each of a plurality of frame images included in a video to be edited; detecting, by the system, adjacent frame texts having a significant semantic change by analyzing the frame texts; and determining, by the system, a scene transition boundary in the video to be edited based on the detected adjacent frame texts.

..

According to an embodiment of the present invention, video editing through text becomes possible, so frames can be selected and scenes searched quickly and effectively, which greatly saves the resources spent on editing.

In addition, determining scene boundaries through text makes it possible to detect meaningful scene boundaries that are not detected from image features.

A brief description of each drawing is provided so that the drawings cited in the detailed description of the present invention may be more fully understood.
FIG. 1 is a diagram illustrating a schematic flow of a video editing method using deep learning-based image captioning according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a schematic flow of a deep learning-based intelligent scene segmentation method according to an embodiment of the present invention.

Since the present invention may be modified in various ways and may have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. In describing the present invention, a detailed description of a related known technology is omitted where it is judged that it may obscure the gist of the invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural expression unless the context clearly indicates otherwise.

In the present specification, terms such as "comprise" or "have" are intended to designate that the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification exist, and should be understood as not precluding in advance the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, in the present specification, when one component 'transmits' data to another component, this means that the component may transmit the data to the other component directly, or may transmit the data to the other component through at least one further component. Conversely, when one component 'directly transmits' data to another component, this means that the data is transmitted from the component to the other component without passing through any further component.

FIG. 1 is a diagram illustrating a schematic flow of a video editing method using deep learning-based image captioning according to an embodiment of the present invention.

Referring to FIG. 1, a video system using deep learning-based image captioning according to an embodiment of the present invention (hereinafter, the system) may receive a video to be edited as input.

The system may then select a plurality of frame images included in the video to be edited.

The selected frame images may be all of the frame images included in the video to be edited, or only some of them. For example, only key frames may be selected, or frames may be selected at a predetermined interval (e.g., one out of every 10 frames).
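The interval-based selection described above can be sketched as follows. This is a minimal illustration only: the function name `select_frame_indices` and its parameters are hypothetical and do not appear in the specification, and key-frame selection is not covered.

```python
def select_frame_indices(total_frames: int, interval: int = 10) -> list[int]:
    """Return the indices of frames sampled once every `interval` frames.

    Hypothetical helper: with interval=1 every frame is selected,
    with interval=10 one frame out of every 10 is selected.
    """
    if total_frames < 0:
        raise ValueError("total_frames must be non-negative")
    if interval < 1:
        raise ValueError("interval must be at least 1")
    # Start at frame 0 and step by the chosen interval.
    return list(range(0, total_frames, interval))


if __name__ == "__main__":
    print(select_frame_indices(35, 10))
```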

The system may then generate a frame text for each selected frame image.

For example, if a total of 100 frame images are selected, the system may generate a frame text corresponding to each of the 100 frame images.

Each generated frame text may be text that describes the corresponding frame image.

To this end, the system may include a deep learning-based image captioning module. The image captioning module may be a component trained to generate, for a given image, text corresponding to that image.

Such an image captioning module may be implemented in various ways and is already known, so a detailed description thereof is omitted.

Once the frame texts are generated, the system may perform video editing using the generated frame texts.

Video editing may require selecting the frames to be included in, or deleted from, the edited video, and this selection may be performed on a text basis using the generated frame texts.

For example, the system may receive a selection condition. The selection condition may be text for selecting frames.

For example, when the system receives the selection condition "two people", the system may search for frame texts that contain the text "two people" verbatim, or that are found through semantic search to mean that two people are present. The selection condition need not be a single condition; multiple conditions may be combined through logical operations.

For example, when the selection condition "two people" AND "mountain" is input, the frame texts (and thus the frame images) in which two people and a mountain are present may be selected.

For such a search, semantic search rather than keyword search may be used, by means of a language model trained for natural language processing (NLP).
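The AND-combined selection condition can be sketched as below. Note the loud assumption: this sketch uses literal, case-insensitive substring matching as a stand-in, whereas the specification prefers semantic search with a trained language model. The function name `select_target_frames`, the dictionary layout, and the sample captions are all hypothetical.

```python
def select_target_frames(frame_texts: dict[int, str], conditions: list[str]) -> list[int]:
    """Return frame indices whose frame text satisfies ALL conditions.

    Assumption: a condition is satisfied when it occurs literally in the
    frame text (case-insensitive). A real system would replace this test
    with a semantic match from a language model.
    """
    wanted = [c.lower() for c in conditions]
    return sorted(
        idx
        for idx, text in frame_texts.items()
        if all(c in text.lower() for c in wanted)
    )


if __name__ == "__main__":
    # Hypothetical captions keyed by frame index.
    captions = {
        0: "Two people standing near a mountain",
        10: "Two people sitting in a room",
        20: "A mountain at sunset",
    }
    print(select_target_frames(captions, ["two people", "mountain"]))
```

The returned indices identify the target frame texts, and therefore the target frame images, on which a later editing command would operate.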

When the frame texts retrieved by the system, that is, the target frame texts, are determined, the frame images corresponding to the target frame texts, that is, the target frame images, may be the frame images the user wants to select.

That is, the frame images in which two people are present may be selected.

Then, at the user's request, an editing command, such as deleting the selected target frame images or keeping only the selected frame images, may be received through a predetermined method (a text command, a predetermined UI, or the like).

In this way, the frames a user wants can be selected very effectively on a text basis by using the frame texts in various ways, and video editing can be performed effectively using that selection.

For example, to keep only the frame images in which only a dog is present in a video of dogs and cats, "dog" is input as the selection condition, and an editing command is then input that keeps only the identified target frame images and deletes the remaining frames. An edited video consisting only of the frames in which a dog appears can thereby be obtained from the video to be edited.
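The keep and delete editing commands can be sketched as follows. This is an illustrative assumption about how such a command might be applied to frame indices: the function name `apply_edit_command` and the command strings "keep" and "delete" are hypothetical and are not defined in the specification.

```python
def apply_edit_command(all_frames: list[int], target_frames: list[int], command: str) -> list[int]:
    """Apply a hypothetical editing command to a list of frame indices.

    "keep"   -> retain only the target frames, in their original order
    "delete" -> remove the target frames and retain the rest
    """
    targets = set(target_frames)
    if command == "keep":
        return [f for f in all_frames if f in targets]
    if command == "delete":
        return [f for f in all_frames if f not in targets]
    raise ValueError(f"unsupported editing command: {command!r}")


if __name__ == "__main__":
    frames = [0, 1, 2, 3, 4]
    dog_frames = [1, 3]  # e.g., frames whose text matched "dog"
    print(apply_edit_command(frames, dog_frames, "keep"))
    print(apply_edit_command(frames, dog_frames, "delete"))
```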

Various types of editing commands (such as selection, deletion, or reordering) may be possible.

Meanwhile, the system may automatically search for scene transition boundaries using the frame texts.

Conventionally, scene boundaries are generally found by using changes in image features (e.g., color information, shading).

However, even consecutive shots whose difference is not readily detected from changes in such image features may differ significantly in their image captions.

That is, a change in image content detected through image captioning may differ from a scene boundary detected through changes in conventional image features.

Therefore, according to the technical idea of the present invention, a scene boundary can be detected by detecting cases where there is a substantial, meaningful change between the frame texts generated by image captioning.

FIG. 2 is a diagram illustrating a schematic flow of a video editing method using deep learning-based image captioning according to an embodiment of the present invention.

Referring to FIG. 2, a text-based scene segmentation method requires carefully determining which objects or actions the deep learning model for image captioning is to detect.

For example, to detect as a scene boundary a case where the color or shading of the image does not change much but a person's action changes (e.g., sitting and then picking up an object), the image captioning module needs to be trained to detect the person's action and express it as text.

The system may then generate, from the input video to be edited, frame texts corresponding to each of the plurality of frame images.

The system may then analyze the frame texts to detect adjacent frame texts between which a significant semantic change exists. What counts as a significant semantic change may be defined in advance, for example as the appearance of a new object or the performance of a new action, or it may be detected by calculating the change in similarity between frame texts and finding cases where the change is equal to or greater than a preset value.
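The similarity-based variant can be sketched as below. The specification does not name a similarity measure, so this sketch assumes word-set Jaccard similarity purely for illustration; a trained language model would more likely be used in practice. The function names, the 0.5 threshold, and the sample captions are hypothetical.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Word-set Jaccard similarity in [0, 1] (assumed stand-in measure)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def detect_semantic_changes(frame_texts: list[str], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return position pairs (i, i + 1) of adjacent frame texts whose
    dissimilarity (1 - similarity) is at least `threshold`."""
    changes = []
    for i in range(len(frame_texts) - 1):
        change = 1.0 - jaccard_similarity(frame_texts[i], frame_texts[i + 1])
        if change >= threshold:
            changes.append((i, i + 1))
    return changes


if __name__ == "__main__":
    texts = [
        "a man sitting on a chair",
        "a man sitting on a chair",
        "a man picking up a box",
    ]
    print(detect_semantic_changes(texts))
```

In this hypothetical example the first two captions are identical, while the second and third share only two of eight distinct words, so only that last pair exceeds the threshold.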

The system may then determine a scene transition boundary in the video to be edited based on the detected adjacent frame texts. For example, the midpoint between the detected adjacent frame texts may be determined as the scene boundary.
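One possible reading of the midpoint rule is sketched below, assuming that each detected frame text is associated with the index of its frame image in the video and that the boundary is the integer midpoint of the two indices. Both function names and the half-open scene ranges are assumptions made for illustration.

```python
def boundary_between(frame_a: int, frame_b: int) -> int:
    """Return the integer midpoint of two frame indices as the boundary."""
    return (frame_a + frame_b) // 2


def split_into_scenes(total_frames: int, boundaries: list[int]) -> list[tuple[int, int]]:
    """Split frames [0, total_frames) into half-open (start, end) scenes
    at the given boundary indices."""
    cuts = sorted(b for b in set(boundaries) if 0 < b < total_frames)
    starts = [0] + cuts
    ends = cuts + [total_frames]
    return list(zip(starts, ends))


if __name__ == "__main__":
    # Hypothetical: a semantic change was detected between sampled frames 20 and 30.
    cut = boundary_between(20, 30)
    print(cut)
    print(split_into_scenes(100, [cut, 60]))
```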

Meanwhile, the method according to an embodiment of the present invention may be implemented in the form of computer-readable program instructions and stored in a computer-readable recording medium, and the control program and the target program according to an embodiment of the present invention may also be stored in a computer-readable recording medium. The computer-readable recording medium includes all types of recording devices in which data readable by a computer system is stored.

The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the software field.

Examples of the computer-readable recording medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. In addition, the computer-readable recording medium may be distributed over computer systems connected through a network, so that computer-readable code is stored and executed in a distributed manner.

Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed, using an interpreter or the like, by a device that processes information electronically, for example a computer.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

The foregoing description of the present invention is for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is indicated by the claims that follow rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Claims

A deep learning-based intelligent scene segmentation method comprising:
generating, by a system, corresponding frame texts for each of a plurality of frame images included in a video to be edited;
detecting, by the system, adjacent frame texts having a significant semantic change by analyzing the frame texts; and
determining, by the system, a scene transition boundary in the video to be edited based on the detected adjacent frame texts.

A computer program installed in a data processing apparatus and stored in a computer-readable recording medium to perform the method according to claim 1.

A deep learning-based intelligent scene segmentation system comprising:
a processor; and
a memory storing a computer program executed by the processor,
wherein the computer program, when executed by the processor, causes the deep learning-based intelligent scene segmentation system to perform the method according to claim 1.