KR102350457B1 - System and method for image searching using image captioning based on deep learning

Info

Publication number: KR102350457B1
Application number: KR1020210035048A
Authority: KR
Inventors: 심충섭; 하영광
Original assignee: 주식회사 딥하이
Priority date: 2020-03-18
Filing date: 2021-03-18
Publication date: 2022-01-13
Also published as: KR102350457B9; KR20210117207A

Abstract

A video stream processing method and system using deep learning-based image captioning are disclosed. According to one aspect of the present invention, the method includes: dividing, by a system, a video to be edited into a plurality of shots; generating, by the system, at least one shot text for each of the divided shots; receiving, by the system, a search condition text for image search; and selecting, by the system, a matching shot text corresponding to the search condition text and extracting a matching image corresponding to the selected matching shot text. The generating of at least one shot text for each of the divided shots includes determining at least one selection frame from among a plurality of frames included in each of the divided shots, and generating the shot text through image captioning corresponding to the determined selection frame.

Description

Video stream processing method and system through image captioning based on deep learning {System and method for image searching using image captioning based on deep learning}

The present invention relates to a video editing method and system using deep learning-based image captioning. More specifically, it relates to a system and method that perform image captioning on at least one frame included in a video and effectively process the video (e.g., editing, searching) based on the text generated by the image captioning.

Video is used more widely than ever, and many attempts are being made to produce video content for one-person media, media commerce, social networking services (SNS), and the like.

However, completing video content requires editing the recorded video, and such editing takes a relatively large amount of time, know-how, and cost.

Conventional video editing generally requires deleting unnecessary frames or cuts, or picking out only the necessary cuts, which in turn requires performing the edit while watching the frames one by one.

A technical idea may therefore be needed that performs such editing effectively on the basis of text corresponding to the frames included in the video. A technical idea may also be needed that uses this text to search within a specific video and provide only the footage that users want.

Laid-Open Patent Publication No. 10-2011-0062567 (Method and apparatus for summarizing video content using video clipping)

An object of the present invention is to provide a method and system by which a user can search for or edit only a desired portion of a video based on text corresponding to the visual information of each of at least one frame included in the video.

Another object is to provide a method and system that can be used in a platform for sharing and selling footage, since image captioning lets users easily search a video for the footage they want.

According to one aspect of the present invention, a video stream processing method using deep learning-based image captioning includes: dividing, by a system, a video to be edited into a plurality of shots; generating, by the system, at least one shot text for each of the divided shots; receiving, by the system, a search condition text for image search; and selecting, by the system, a matching shot text corresponding to the search condition text and extracting a matching image corresponding to the selected matching shot text. The generating of at least one shot text for each of the divided shots includes determining at least one selection frame from among a plurality of frames included in each of the divided shots, and generating the shot text through image captioning corresponding to the determined selection frame.

The determining of at least one selection frame from among the plurality of frames included in each of the divided shots may include determining, as the at least one selection frame, a moving image that is a plurality of consecutive frames included in the divided shot. The generating of the shot text through image captioning corresponding to the determined selection frame may then include generating the shot text through video captioning of the moving image.

The video stream processing method using deep learning-based image captioning may include generating, by the system, a first edited video in which the matching image is deleted from the video to be edited, or generating, by the system, a second edited video in which the frames other than the matching image are deleted.

The determining of at least one selection frame from among the plurality of frames included in each of the divided shots may include determining, as the selection frame, a frame that belongs to the divided shots and contains a predetermined object.

The determining, as the selection frame, of a frame containing the predetermined object may include: when the object is contained in a central frame corresponding to the center of the divided shot, including the central frame in the selection frame; and when the object is not contained in the central frame, searching for a frame containing the object in order of proximity to the central frame and including the first frame found in the selection frame.

The generating of the shot text through image captioning corresponding to the determined selection frame may include generating the shot text as a result of video captioning of a moving image consisting of a predetermined number of consecutive frames that include the selection frame.
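
The clip of consecutive frames around a selection frame can be illustrated with a minimal sketch. The function name `clip_window`, the choice to center the clip on the selection frame, and the clamping to the shot boundaries are all assumptions made for illustration; the text only requires that the clip contain the selection frame and have a predetermined length.

```python
from typing import Tuple


def clip_window(selected: int, start: int, end: int, clip_len: int) -> Tuple[int, int]:
    """Return (first, last) frame indices of a clip that contains `selected`.

    `start` and `end` are the inclusive bounds of the shot. The clip is
    roughly centered on the selection frame (an assumption) and is clamped
    so that it never leaves the shot.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    if not start <= selected <= end:
        raise ValueError("selected frame must lie inside the shot")
    shot_len = end - start + 1
    n = min(clip_len, shot_len)           # a clip cannot be longer than its shot
    first = selected - n // 2             # try to center on the selection frame
    first = max(start, min(first, end - n + 1))  # clamp to the shot bounds
    return first, first + n - 1
```

The resulting index range would then be handed to whatever video captioning model the system uses.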

The above method may be implemented by a computer program installed in a data processing apparatus.

According to another aspect, a system for the video stream processing method using deep learning-based image captioning, which implements the technical idea of the present invention, includes a processor and a memory storing a computer program executed by the processor. By running the program, the processor divides a video to be edited into a plurality of shots, generates at least one shot text for each of the divided shots, receives a search condition text, selects a matching shot text corresponding to the search condition text, and extracts a matching shot or matching frame corresponding to the selected matching shot text. In doing so, the processor determines at least one selection frame from among a plurality of frames included in each of the divided shots and generates the shot text through image captioning corresponding to the determined selection frame.

According to an embodiment of the present invention, text representing the visual information of each of the divided segments of a video is generated, and the video can be searched or edited using the generated text, which enables highly efficient video processing (e.g., searching or editing footage).

A brief description of each drawing is provided so that the drawings cited in the detailed description of the present invention can be more fully understood.
FIG. 1 is a diagram for explaining the outline of a video stream processing method using deep learning-based image captioning according to an embodiment of the present invention.
FIG. 2 is a diagram showing a schematic configuration of a system for performing the video stream processing method using deep learning-based image captioning according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the concept of the video stream processing method using deep learning-based image captioning according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining the concept of determining a selection frame according to an embodiment of the present invention.
FIGS. 5 and 6 are diagrams for explaining video processing according to an embodiment of the present invention.

Since the present invention may be modified in various ways and may have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. In describing the present invention, detailed descriptions of related known technology are omitted where they would obscure the gist of the invention.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise.

In this specification, terms such as "include" or "have" specify that the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification are present. They should not be understood to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, in this specification, when one component "transmits" data to another component, the component may transmit the data to the other component directly or through at least one further component. Conversely, when one component "directly transmits" data to another component, the data is transmitted from the component to the other component without passing through any further component.

FIG. 1 is a diagram for explaining the outline of a video stream processing method using deep learning-based image captioning according to an embodiment of the present invention.

Referring to FIG. 1, a system 100 may be provided for the video stream processing method using deep learning-based image captioning according to an embodiment of the present invention.

The system 100 may receive a video to be edited.

Then, according to the technical idea of the present invention, the system 100 may generate text related to the video to be edited.

The text may be text that describes or expresses the visual information of the video to be edited.

Such text may be generated by a deep learning model that, for every frame (that is, every still image) constituting the video to be edited, produces text representing the visual information shown in that still image.

Depending on the embodiment, however, generating text representing the visual information of every frame of the video to be edited may be inefficient, because the visual information may barely differ across a large number of consecutive frames.

Therefore, according to the technical idea of the present invention, the system 100 may divide the video to be edited according to a predetermined criterion and, for each of the divided segments (that is, each shot), generate one or more texts representing the visual information of that shot. In this specification, such text is defined as shot text.

The system 100 may generate one or more shot texts for each shot.

Based on the generated shot texts, the system 100 may provide a function and/or UI through which the user can easily select a desired scene (at least one frame) and/or a desired shot. Once the desired scene or shot can be selected easily, the user can very efficiently search the video to be edited for that scene or shot, or edit the video using the search results.

The system 100 may perform image captioning to generate one or more shot texts for each shot. The image captioning may include still image captioning and/or video captioning.

The system 100 may select one or more frames (still images) for each shot.

For each selected frame, the system may then generate text from the visual information of the still image, that is, caption it.

Techniques for generating text from the visual information in a still image are known. At least one deep learning model may be used for such still image captioning. For example, a first deep learning model (e.g., InceptionV3) for detecting at least one object in a still image and a second deep learning model (e.g., an RNN or LSTM) for generating text from the features extracted by the first model may be used to caption each still image. As a result, text describing each still image, that is, each frame, may be generated.

The system 100 may determine one or more frames of each shot as selection frames and caption each selection frame to generate text describing that shot, that is, shot text. When a plurality of frames are selected, a plurality of texts may be generated to describe a single shot.
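
The relationship between shots, selection frames, and shot texts can be sketched as a small data structure. This is only an illustrative sketch: the `Shot` class, the `generate_shot_texts` function, and the two injected callables (`select_frames`, `caption`) are hypothetical names, and the callables stand in for whatever frame selection rule and captioning model an implementation actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Shot:
    """One shot of the video, given as an inclusive frame index range."""
    start: int
    end: int
    texts: List[str] = field(default_factory=list)  # the shot texts


def generate_shot_texts(
    shots: List[Shot],
    select_frames: Callable[[int, int], List[int]],  # hypothetical selection rule
    caption: Callable[[int], str],                   # hypothetical captioning model
) -> List[Shot]:
    """Caption every selection frame and store the result on its shot."""
    for shot in shots:
        for frame_index in select_frames(shot.start, shot.end):
            # One shot text per selection frame, so a shot may have several.
            shot.texts.append(caption(frame_index))
    return shots
```

Keeping the selection rule and the captioner as parameters mirrors the text, which leaves both open to different embodiments.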

According to the technical idea of the present invention, the system 100 may describe a shot using text that describes one or more frames, that is, still images, included in the shot. In some cases, however, still images alone are not enough to describe the shot adequately.

For example, the result may differ greatly between two cases: selecting a single frame from a shot and describing its visual information as text through a captioning model, and generating text from a relatively short moving image, that is, a plurality of consecutive frames, that includes the selected frame.

For example, a single still image might yield the text "a person is simply opening their mouth," whereas a description based on a moving image of some duration that includes the still image might read "a person is yawning."

In other words, compared with captioning the visual information of a still image, captioning a moving image that includes the still image may produce more meaningful information, that is, text that describes the shot more accurately.

Accordingly, when generating text to describe each shot, the system 100 according to the technical idea of the present invention may also generate the text by captioning the moving image itself that is included in the shot.

Techniques for generating text that describes such a relatively short moving image, that is, for captioning it, are also known. For example, the SlowFast deep learning model released by Facebook AI Research can generate text describing a relatively short moving image. A person of ordinary skill in the art will readily appreciate that, according to the technical idea of the present invention, various deep learning models other than SlowFast may be designed, trained, and used to generate text describing a moving image.

The task of dividing the video to be edited into a plurality of shots is also known. In general, shot boundary detection may be performed by implementing an engine that detects points at which the visual information changes to a significant degree.
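
As a rough illustration of detecting points where visual information changes significantly, the sketch below splits a video wherever the difference between consecutive frames exceeds a threshold. It is a classical stand-in, not the trained deep learning model discussed in this description. It assumes that each frame has already been summarized as a normalized color histogram, and the function names and the default threshold of 0.5 are hypothetical.

```python
from typing import List, Sequence, Tuple


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two histograms of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def split_into_shots(
    histograms: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return shots as (start, end) frame index pairs, with `end` inclusive.

    A new shot begins at frame i whenever frame i differs from frame i - 1
    by more than `threshold`.
    """
    if not histograms:
        return []
    shots: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(histograms)):
        if histogram_distance(histograms[i - 1], histograms[i]) > threshold:
            shots.append((start, i - 1))  # close the current shot
            start = i                     # frame i opens the next shot
    shots.append((start, len(histograms) - 1))
    return shots
```

A learned detector would replace the threshold test while keeping the same output shape.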

Such shot boundary detection is usually implemented to detect points at which the camera take or a camera property (change of direction, zoom in/out, etc.) changes significantly. According to the technical idea of the present invention, however, shot division may also be performed using a deep learning model trained to detect points at which the visual information changes significantly, that is, shot boundaries, even when no camera property changes.

For example, a large set of training data may be prepared in which the points of a video where the visual information is judged to change greatly are labeled, and a CNN (Convolutional Neural Network) trained on this data may be implemented to detect such points. A person of ordinary skill in the art will readily appreciate that which points such a model detects as shot boundaries depends on where the training data is labeled.

Depending on how the component for shot division (e.g., a deep learning model) is implemented, the length of each shot into which the video to be edited is divided may vary, and the number of selection frames may be determined differently for each shot. For example, a plurality of frames may be selected to describe a relatively long shot, while only one frame may be selected to describe a relatively short shot.
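
One possible way to make the number of selection frames depend on shot length is sketched below. The text does not specify a rule, so everything here is an assumption for illustration: the function name `selection_positions`, the budget of one selection frame per 150 frames, and the even spacing of the chosen positions.

```python
from typing import List


def selection_positions(start: int, end: int, frames_per_selection: int = 150) -> List[int]:
    """Return evenly spaced selection frame indices for the shot [start, end].

    Short shots get exactly one position (near the middle); longer shots get
    one position per `frames_per_selection` frames (a hypothetical budget).
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    if frames_per_selection <= 0:
        raise ValueError("frames_per_selection must be positive")
    length = end - start + 1
    count = max(1, length // frames_per_selection)
    step = length / count
    # Place each position at the middle of its equal-sized segment.
    return [start + int(step * (k + 0.5)) for k in range(count)]
```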

A plurality of shot texts may also be generated for each shot. These may be the captioning texts of different still images, but depending on the implementation, some of them may be still image captioning texts and the rest video captioning texts.

Depending on the embodiment, the shot texts may of course consist only of still image captioning texts or only of video captioning texts.

In any case, the system 100 may divide the video to be edited and generate text that describes each of the resulting segments well.

The user may then use text to search the video to be edited for only the desired sections and, if necessary, use the search results to edit the video.

For example, when the system 100 receives the search condition "two people," it may search for matching shot texts, that is, shot texts that either contain the text "two people" verbatim or are found through semantic search to mean that there are two people. A search need not consist of a single condition; conditions may also be combined through logical operations.

For example, when the selection condition "two people" AND "mountain" is input, shot texts in which there are two people and a mountain may be selected as matching shot texts.
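
The AND combination of search conditions can be sketched with plain substring matching. This is a deliberately simplified stand-in: it covers only the verbatim case, and a semantic search would replace the `term in haystack` test. The function name and the mapping from shot id to shot texts are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def find_matching_shot_texts(
    shot_texts: Dict[int, List[str]], terms: Sequence[str]
) -> List[Tuple[int, str]]:
    """Return (shot_id, shot_text) pairs whose text contains every term.

    Matching is a case-insensitive substring test, which is an assumption
    made for illustration only.
    """
    if not terms:
        return []  # an empty condition matches nothing rather than everything
    lowered = [term.lower() for term in terms]
    matches: List[Tuple[int, str]] = []
    for shot_id, texts in shot_texts.items():
        for text in texts:
            haystack = text.lower()
            if all(term in haystack for term in lowered):
                matches.append((shot_id, text))
    return matches
```

The returned shot ids identify the shots from which matching images would then be extracted.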

For such searches, semantic search rather than keyword search may be used, by means of a language model trained for NLP (Natural Language Processing).

The system 100 may then extract a matching image corresponding to the matching shot text. The matching image may be the shot corresponding to the matching shot text, or a still image (frame) or moving image corresponding to the matching shot text.

In the example above, the matching image extracted may be a matching shot in which two people are present, a specific captioned frame image included in the matching shot, or a captioned moving image included in the matching shot.

Then, at the user's request, an editing command may be received in a predetermined manner (a text command, a UI, etc.), for example a command to edit the video by deleting the selected matching image or by keeping only the matching image.

Using shot text in these various ways, the matching image that the user wants can be selected very effectively on a text basis, and video editing based on that selection can be performed effectively.

For example, suppose the user wants to keep only the matching images in which a dog appears from a video of a dog and a cat. If the user enters "dog" as the selection condition and then an editing command that keeps only the identified matching images and deletes the remaining frames, an edited video consisting only of the frames in which the dog appears can be obtained from the video to be edited.
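
The two editing outcomes described in this document (deleting the matching images, or keeping only the matching images) can be sketched as operations on frame indices. The function name `apply_edit`, the mode strings "delete" and "keep", and the representation of matching images as inclusive frame ranges are assumptions for illustration.

```python
from typing import List, Tuple


def apply_edit(total_frames: int, matching: List[Tuple[int, int]], mode: str) -> List[int]:
    """Return the frame indices that remain in the edited video.

    `matching` lists the matching images as inclusive (start, end) frame
    ranges. Mode "delete" removes them; mode "keep" removes everything else.
    """
    matched = set()
    for start, end in matching:
        matched.update(range(start, end + 1))
    if mode == "delete":
        return [i for i in range(total_frames) if i not in matched]
    if mode == "keep":
        return [i for i in range(total_frames) if i in matched]
    raise ValueError(f"unknown mode: {mode!r}")
```

In the dog example, the "keep" mode applied to the ranges matched by "dog" yields a video made up only of those frames.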

Various kinds of editing commands (selection, deletion, reordering, etc.) may be possible.

Through this process, the user can very efficiently process the video using text that expresses the visual information of the video to be edited.

When generating shot text to describe each shot, the system 100 preferably generates shot text that describes the shot representatively.

If shot division is performed by detecting points at which the visual information changes significantly, a shot is split off each time such a change occurs. In that case, the description representing each shot is preferably based on the frame at the center of the shot (e.g., the middle frame by frame count, or the frame closest to the middle).

The selection frame for describing a shot may thus be determined simply by frame position, but depending on the embodiment it may need to be determined by whether the frame contains a predetermined object.

For example, when video processing (searching and/or editing) is to be based on the actions or number of people, it may be preferable to choose, for each shot, a frame that contains a person as the selection frame and to generate text describing that frame. Only then can text-based video search and/or editing be performed efficiently with respect to the object the user is interested in.

Accordingly, the system 100 may restrict the selection frames to frames of the divided shots that contain a predetermined object, and may caption the determined selection frames to generate the shot text describing each shot.

To determine a frame containing the predetermined object as the selection frame, it must be possible to determine whether a still image contains the object; for this purpose, the system 100 may of course be provided with a deep learning model for object detection. Deep learning models for object detection are widely known. A person of ordinary skill in the art will readily appreciate that CNN models are typically used, and that deep learning models of the R-CNN, YOLO, InceptionV3, and ResNet families may be utilized.

Meanwhile, not only the presence of the predetermined object but also the position of the frame may be important in generating a description representative of the shot. The selection frame may therefore need to be determined in consideration of both factors.

To this end, when the object is contained in the center frame corresponding to the center of a shot (for example, the exact middle frame by frame position or, if there is no exact middle frame, the nearest frame), the system 100 may include the center frame in the selection frames. That is, the center frame is at least included among the frames used to describe the shot, and several additional frames may also be selected as selection frames.

If the center frame does not contain the object, the system 100 may search for a frame containing the object in order of proximity to the center frame, and include the first frame found in the selection frames.

In this way, shot text can be generated that representatively describes the shot while satisfying both criteria: frame position and presence of the object of interest.

In addition, when video captioning is performed to generate the shot text, it may likewise be preferable for the video captioning to caption the behavior (movement, change of state, etc.) of the object the user is interested in.

For example, if the user wants to search or edit video with respect to a specific object (for example, a person or an animal), it may be preferable that the text generated by video captioning, as well as by still image captioning, includes the behavior of that specific object.

To this end, the system 100 may perform video captioning on a video clip consisting of a predetermined number (or predetermined length) of consecutive frames that include the selection frame determined to contain the object, and generate the shot text as the result. The frames may be chosen so that the selection frame is located in the middle of the clip or, depending on the embodiment, at a predetermined position toward the beginning or end of the clip.
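As a rough sketch of how such a clip could be delimited, the following Python function computes the frame indices of a fixed-length window that contains the selection frame. The name `clip_window`, the `offset` parameter (the position of the selection frame inside the clip), and the clamping of the window to the shot boundaries are illustrative assumptions; the description above does not specify how the edges of a shot are handled.

```python
from typing import Optional


def clip_window(
    selected: int,
    shot_len: int,
    clip_len: int,
    offset: Optional[int] = None,
) -> range:
    """Return the frame indices of a clip that contains the selection frame.

    selected: index of the selection frame within the shot.
    clip_len: predetermined number of consecutive frames in the clip.
    offset:   desired position of the selection frame inside the clip;
              defaults to the middle of the clip.
    """
    if not 0 <= selected < shot_len:
        raise ValueError("selected frame lies outside the shot")
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    clip_len = min(clip_len, shot_len)  # assumption: a clip never exceeds its shot
    if offset is None:
        offset = clip_len // 2
    if not 0 <= offset < clip_len:
        raise ValueError("offset must lie inside the clip")
    start = selected - offset
    # Assumption: shift the window so that it stays inside the shot.
    start = max(0, min(start, shot_len - clip_len))
    return range(start, start + clip_len)


# Usage: a 5-frame clip centered on frame 4 of a 10-frame shot.
print(list(clip_window(selected=4, shot_len=10, clip_len=5)))
```

Passing `offset=0` places the selection frame at the start of the clip, and `offset=clip_len - 1` places it at the end, which corresponds to the alternative positions mentioned above.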

In short, according to the technical idea of the present invention, the system 100 generates text that well represents the visual information of the video to be edited, and such text can be generated for each divided shot. Using this text, a matching image, which is the part of the video to be edited that the user wants, can be extracted, and various image processing such as editing can be performed with it.

The system 100 for implementing this technical idea may be configured as shown in FIG. 2.

FIG. 2 is a diagram showing a schematic configuration of a system for performing a video stream processing method through deep learning-based image captioning according to an embodiment of the present invention.

Referring to FIG. 2, a system 100 according to the technical idea of the present invention is provided to implement the video stream processing method through deep learning-based image captioning.

The system 100 may include a memory 120 that stores a program for implementing the technical idea of the present invention, and a processor 110 that executes the program stored in the memory 120.

A person of ordinary skill in the art will readily appreciate that the functions and/or operations performed by the system 100 in this specification may be carried out by the processor 110 running the program stored in the memory 120.

A person of ordinary skill in the art will readily appreciate that the processor 110 may go by various names, such as a CPU or a mobile processor, depending on the implementation of the system 100. The system 100 may also be implemented as a plurality of physical devices organically combined, in which case at least one processor 110 may be provided in each physical device to implement the system 100 of the present invention.

The memory 120 stores the program and may be implemented as any type of storage device that the processor can access to run the program. Depending on the hardware implementation, the memory 120 may be implemented as a plurality of storage devices rather than a single one. The memory 120 may include temporary storage as well as main memory. It may also be implemented as volatile or non-volatile memory, and is defined to include any form of information storage means in which the program is stored and from which it can be run by the processor.

Depending on the embodiment, the system 100 may be implemented as a web server connected to the web that the user's terminal can access, may be installed on the user's terminal, or may be implemented in various other ways. It is defined to include any form of data processing device capable of performing the functions defined in this specification.

Depending on the embodiment of the system 100, various peripheral devices (peripheral device 1 (130) to peripheral device N (130-1)) may further be provided. For example, a person of ordinary skill in the art will readily appreciate that a keyboard, a monitor, a graphics card, a communication device, and the like may further be included in the system 100 as peripheral devices.

FIG. 3 is a diagram for explaining the concept of a video stream processing method through deep learning-based image captioning according to an embodiment of the present invention.

Referring to FIG. 3, the system 100 may receive a video stream, that is, a video to be edited 10.

The system 100 may then divide the video to be edited 10 into a plurality of shots (S1, S2, S3, ..., Sn-1, Sn).

As described above, a deep learning model for shot boundary detection may be provided for this shot division.

The system 100 may then generate, for each of the shots (S1, S2, S3, ..., Sn-1, Sn), shot texts (st1, st2, st3, ..., stn-1, stn) that representatively describe the corresponding shot.

The system 100 may then receive a search condition as text and detect the matching shot texts corresponding to the received search condition. For example, if the matching shot texts are st2, st3, and stn, the system 100 may extract, as the matching image, the matching shots (S2, S3, Sn) corresponding to the matching shot texts (st2, st3, stn), or the selection frame (still image) or selection frames (video clip) contained in those matching shots.
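The description does not state how a search condition is compared with the shot texts. Purely as an illustration, the sketch below assumes the simplest possible rule: a shot text matches when it contains every word of the search condition. The function name `find_matching_shots`, the whitespace tokenization, and the example captions are all hypothetical, and a real system might instead use stemming or embedding similarity.

```python
from typing import List


def find_matching_shots(shot_texts: List[str], query: str) -> List[int]:
    """Return the indices of shots whose shot text contains every query word.

    Assumed matching rule (not taken from the source): case-insensitive,
    whitespace-tokenized, all query words must appear in the shot text.
    Punctuation is not stripped in this sketch.
    """
    query_words = set(query.lower().split())
    if not query_words:
        return []
    return [
        index
        for index, text in enumerate(shot_texts)
        if query_words <= set(text.lower().split())
    ]


# Usage: hypothetical shot texts st1..st5 for shots S1..S5.
shot_texts = [
    "an empty street at night",
    "a man riding a bicycle",
    "a man walking a dog",
    "a city skyline",
    "a man waving",
]
print(find_matching_shots(shot_texts, "a man"))
```

The returned indices identify the matching shots, from which the shot itself, its selection frame, or the clip around the selection frame can be extracted as the matching image.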

FIG. 4 is a diagram for explaining the concept of determining a selection frame according to an embodiment of the present invention.

Referring to FIG. 4, a selection frame may need to be determined for a given shot (for example, Si) in order to generate the shot text describing that shot.

To this end, the system 100 may simply determine the center frame (for example, f4) of the shot (for example, Si) as the selection frame.

Depending on the embodiment, however, the system 100 may or may not determine the center frame (for example, f4) as the selection frame, depending on whether a predetermined object is detected in the center frame.

For example, when the center frame (for example, f4) contains the predetermined object (that is, when the object is detected), the system 100 may determine the center frame as the selection frame. If the center frame does not contain the object, the system 100 may check the frames in order, starting from those nearest the center frame (for example, f3 and f5; the order between f3 and f5 may be predetermined), and include in the selection frames the first frame determined to contain the object.

In this way, a frame that describes the shot with respect to the object the user is interested in, while also lying close to the representative center position, can be determined as the selection frame.

Meanwhile, even when the shot text is captioned from a video clip, a predetermined number (or duration) of consecutive frames before and after the center frame (for example, f4) may simply be selected as the clip, and this clip may be captioned to generate the shot text.

However, once a selection frame containing the object has been determined as described above, a plurality of frames including that selection frame may be specified as the video clip to be captioned, and the specified clip may be captioned to generate the shot text describing the shot.

FIGS. 5 and 6 are diagrams for explaining image processing according to an embodiment of the present invention.

Once the shot text and the matching image have been determined in this way, the system 100 can easily edit the video to be edited 10 based on the matching image (FIGS. 5 and 6 illustrate the case where the matching shot is the matching image).

For example, when the user wants to delete the matching image (for example, S2, S3, Sn) and enters the corresponding command, the system 100 may generate, as the edited video, a video in which the matching image has been deleted from the video to be edited 10, as shown in FIG. 5.

As another example, when the user wants to keep only the matching image (for example, S2, S3, Sn) and generate an edited video from it, the system 100 may generate an edited video containing only the matching image, as shown in FIG. 6.
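The two editing operations of FIGS. 5 and 6 amount to filtering the ordered list of shots by whether each shot is a matching shot. The following Python sketch shows this under the assumption that the matching image is the whole matching shot; the function name `edit_shots` and the `keep_matching` flag are illustrative, and the shot labels stand in for actual frame data.

```python
from typing import List, Set, TypeVar

Shot = TypeVar("Shot")


def edit_shots(shots: List[Shot], matching: Set[int], keep_matching: bool) -> List[Shot]:
    """Build an edited shot sequence from the video to be edited.

    keep_matching=False removes the matching shots (as in FIG. 5);
    keep_matching=True keeps only the matching shots (as in FIG. 6).
    The original shot order is preserved in both cases.
    """
    return [
        shot
        for index, shot in enumerate(shots)
        if (index in matching) == keep_matching
    ]


# Usage: five shots, of which S2, S3 and S5 (indices 1, 2, 4) match.
shots = ["S1", "S2", "S3", "S4", "S5"]
matching = {1, 2, 4}
print(edit_shots(shots, matching, keep_matching=False))
print(edit_shots(shots, matching, keep_matching=True))
```

Concatenating the frames of the returned shots would then yield the edited video.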

Of course, the user may perform further editing on such an edited video, such as inserting additional footage or changing the order.

In short, according to the technical idea of the present invention, text that well describes the visual information of a video is generated for that video, and image processing can be performed very effectively using this text.

Meanwhile, the method according to an embodiment of the present invention may be implemented in the form of computer-readable program instructions and stored in a computer-readable recording medium, and the control program and target program according to an embodiment of the present invention may likewise be stored in a computer-readable recording medium. The computer-readable recording medium includes all types of recording devices in which data readable by a computer system is stored.

The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the software field.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The computer-readable recording medium may also be distributed over network-connected computer systems, so that computer-readable code is stored and executed in a distributed manner.

Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a device that processes information electronically, for example a computer, using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The foregoing description of the present invention is for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is indicated by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Claims

A video stream processing method through deep learning-based image captioning, comprising:
dividing, by a system, a video to be edited into a plurality of shots;
generating, by the system, at least one shot text for each of the divided shots;
receiving, by the system, a search condition text for image search; and
selecting, by the system, matching shot text corresponding to the search condition text, and extracting a matching image corresponding to the selected matching shot text,
wherein generating, by the system, at least one shot text for each of the divided shots comprises:
determining at least one selection frame from among a plurality of frames included in each of the divided shots; and
generating the shot text through image captioning corresponding to the determined selection frame,
wherein determining the at least one selection frame comprises determining, as the at least one selection frame, a video clip including a plurality of consecutive frames included in the divided shots, and
wherein generating the shot text through image captioning corresponding to the determined selection frame comprises generating the shot text through video captioning of the video clip.

delete

A video stream processing method through deep learning-based image captioning, comprising:
dividing, by a system, a video to be edited into a plurality of shots;
generating, by the system, at least one shot text for each of the divided shots;
receiving, by the system, a search condition text for image search; and
selecting, by the system, matching shot text corresponding to the search condition text, and extracting a matching image corresponding to the selected matching shot text,
wherein generating, by the system, at least one shot text for each of the divided shots comprises:
determining at least one selection frame from among a plurality of frames included in each of the divided shots; and
generating the shot text through image captioning corresponding to the determined selection frame, and
wherein the method further comprises:
generating, by the system, a first edited video in which the matching image is deleted from the video to be edited; or
generating, by the system, a second edited video in which frames other than the matching image are deleted.

delete

A video stream processing method through deep learning-based image captioning, comprising:
dividing, by a system, a video to be edited into a plurality of shots;
generating, by the system, at least one shot text for each of the divided shots;
receiving, by the system, a search condition text for image search; and
selecting, by the system, matching shot text corresponding to the search condition text, and extracting a matching image corresponding to the selected matching shot text,
wherein generating, by the system, at least one shot text for each of the divided shots comprises:
determining at least one selection frame from among a plurality of frames included in each of the divided shots; and
generating the shot text through image captioning corresponding to the determined selection frame,
wherein determining the at least one selection frame comprises determining, as the selection frame, a frame of the divided shots that includes a predetermined object, and
wherein determining the frame including the predetermined object as the selection frame comprises:
when the object is included in a center frame corresponding to the center of the divided shot, including the center frame in the selection frame; and
when the object is not included in the center frame, searching for a frame including the object in order of adjacency to the center frame, and including the first frame found in the selection frame.

A video stream processing method through deep learning-based image captioning, comprising:
dividing, by a system, a video to be edited into a plurality of shots;
generating, by the system, at least one shot text for each of the divided shots;
receiving, by the system, a search condition text for image search; and
selecting, by the system, matching shot text corresponding to the search condition text, and extracting a matching image corresponding to the selected matching shot text,
wherein generating, by the system, at least one shot text for each of the divided shots comprises:
determining at least one selection frame from among a plurality of frames included in each of the divided shots; and
generating the shot text through image captioning corresponding to the determined selection frame, and
wherein generating the shot text through image captioning corresponding to the determined selection frame comprises generating the shot text as a result of video captioning of a video clip consisting of a predetermined number of consecutive frames including the selection frame.

A computer program installed in a data processing device and stored in a computer-readable recording medium to perform the method according to any one of claims 1, 3, 5 or 6.

A system for a video stream processing method through deep learning-based image captioning, comprising:
a processor; and
a memory storing a computer program executed by the processor,
wherein the processor, by running the program:
divides a video to be edited into a plurality of shots, generates at least one shot text for each of the divided shots, receives a search condition text, selects matching shot text corresponding to the search condition text, and extracts a matching shot or matching frame corresponding to the selected matching shot text; and
determines, as at least one selection frame, a video clip including a plurality of consecutive frames included in the divided shots, and generates the shot text through video captioning of the video clip.