KR102293073B1

KR102293073B1 - Core video generation device and method considering the context of video

Info

Publication number: KR102293073B1
Application number: KR1020200116858A
Authority: KR
Inventors: 이계민; 이한솔
Original assignee: 서울과학기술대학교 산학협력단
Priority date: 2020-09-11
Filing date: 2020-09-11
Publication date: 2021-08-25

Abstract

Provided is a core image generating device that separates audio information and image information from a video image, generates audio feature information and image feature information, fuses the continuously generated audio feature information and image feature information and analyzes the fused information to generate continuous feature information, analyzes the fused information at mid- to long-term time intervals to generate mid- to long-term feature information, and generates context information indicating the correlation between the continuous feature information and the mid- to long-term feature information, from which it generates a core image.

Description

CORE VIDEO GENERATION DEVICE AND METHOD CONSIDERING THE CONTEXT OF VIDEO

The present invention relates to an apparatus and method for generating a core image in consideration of the context of a video image and, more particularly, to a core image generating apparatus and method that generate a core image, summarized into a short time span, from a video image that runs over a long time span.

With the development of IT technologies such as smartphones and the Internet, access to streaming platform services has become more convenient, and match video content for soccer, baseball, e-sports, and the like is being produced and uploaded in large quantities. Demand for such match video content is accordingly increasing day by day.

Accordingly, for the convenience of viewers and the efficiency of the network, broadcasters extract scenes that interest viewers from match videos that run over a long time span and provide core images that run over a short time span.

However, an existing core image is produced by an editor who reviews the match video in person, extracts some scenes from it, and edits them manually. This approach requires professional editing skills and a long editing time.

Accordingly, there is a need for a method of efficiently generating a core image from a video image, such as a match video, that runs over a long time span.

The technical problem to be solved by the present invention is to provide a core image generating apparatus and method that generate a core image in consideration of the context of the story presented in a video image.

One aspect of the present invention may include: a video separation unit that separates audio information and image information from a video image so that they are continuous in time order; an audio analysis unit that analyzes the audio information to generate audio feature information; an image analysis unit that generates image feature information from the image information based on a learning model prepared in advance; a continuous information analysis unit that fuses the continuously generated audio feature information and image feature information and analyzes the fused information to generate continuous feature information; a mid- to long-term information analysis unit that analyzes the fused information according to a preset mid- to long-term time interval to generate mid- to long-term feature information; a context analysis unit that compares the continuous feature information with the mid- to long-term feature information to generate context information indicating the correlation between the two; and an image generation unit that generates a core image based on the context information.

In addition, the audio analysis unit may divide the audio information into a plurality of segments according to a preset number of segments and generate, from the segments, audio feature information in each frequency band according to a preset number of frequency dimensions.

In addition, the image analysis unit may generate the learning model by learning image information, separated frame by frame from an arbitrary video, so that image feature information is extracted.

In addition, the image generation unit may extract one or more pieces of context information according to the ratio of the length of the core image to the length of the video image and generate the core image using the extracted context information.

Another aspect of the present invention is a method of generating a core image using a core image generating apparatus that considers the context of a video image, and the method may include: separating audio information and image information from a video image so that they are continuous in time order; analyzing the audio information to generate audio feature information; generating image feature information from the image information based on a learning model prepared in advance; fusing the continuously generated audio feature information and image feature information and analyzing the fused information to generate continuous feature information; analyzing the fused information according to a preset mid- to long-term time interval to generate mid- to long-term feature information; comparing the continuous feature information with the mid- to long-term feature information to generate context information indicating the correlation between the two; and generating a core image based on the context information.

According to the above-described aspect of the present invention, providing a core image generating apparatus and method that consider the context of a video image makes it possible to generate a core image in consideration of the context of the story presented in the video image.

FIG. 1 is a schematic diagram of a core image generating apparatus according to an embodiment of the present invention.
FIG. 2 is a control block diagram of a core image generating apparatus according to an embodiment of the present invention.
FIG. 3 is a block diagram illustrating the process of analyzing the fused information in the continuous information analysis unit or the mid- to long-term information analysis unit of FIG. 2.
FIG. 4 is a block diagram illustrating the process of generating a core image in the image generation unit of FIG. 2.
FIG. 5 is a schematic diagram illustrating an example of a core image generating apparatus according to an embodiment of the present invention.
FIG. 6 is a flowchart of a core image generating method according to an embodiment of the present invention.

The detailed description of the present invention that follows refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the present invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the present invention. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the present invention. It should also be understood that the location or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the present invention. Accordingly, the following detailed description is not to be taken in a limiting sense, and the scope of the present invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the drawings.

FIG. 1 is a schematic diagram of a core image generating apparatus according to an embodiment of the present invention.

The core image generating apparatus 200 may generate the core image 300 by extracting some of the scenes appearing in the video image 100.

Here, the video image 100 may be understood as video that is longer in duration than the core image 300 generated by the core image generating apparatus 200. For example, the video image 100 may include match video of soccer, baseball, e-sports, and the like.

Accordingly, the video image 100 is provided in units of frames and may include image information that stimulates the viewer visually and audio information that stimulates the viewer aurally.

Such a video image 100 may be provided from a server device of a platform that provides video images 100, such as Twitch, Kakao TV, Afreeca TV, YouTube, and Naver TV.

Meanwhile, the core image 300 may be understood as some scenes extracted from the video image 100 and joined in the time order in which they appear in the video image 100. Here, a scene of the video image 100 may mean a scene that spans an arbitrary run of consecutive frames.

Here, a scene may be understood as the audio information and image information of an arbitrary time interval extracted from the video image 100.

Accordingly, the core image generating apparatus 200 may extract some scenes from one or more time points in the video image 100 and join the extracted scenes to generate the core image 300.

Here, the core image 300 may be formed by extracting and joining the main scenes appearing in the video image 100. For example, when the video image 100 is a sports match such as soccer, baseball, or e-sports, the main scenes may include scoring scenes, conceding scenes, failed scoring attempts, fouls, and the like. The main scenes may also include scenes in which the commentator's pitch is measured as high, scenes in which the crowd roars, scenes in which the crowd applauds, and the like.

Here, scoring scenes, conceding scenes, failed scoring attempts, fouls, and the like may be determined using the image information of the video image 100, which stimulates the viewer visually. Scenes in which the commentator's pitch is measured as high, scenes in which the crowd roars, scenes in which the crowd applauds, and the like may be determined using the audio information of the video image 100, which stimulates the viewer aurally.

Hereinafter, the core image generating apparatus 200 that generates the core image 300 from the video image 100 will be described in detail.

The core image generating apparatus 200 may separate audio information and image information from the video image 100 so that they are continuous in time order.

In this case, the core image generating apparatus 200 may separate the video image 100 so that the separated audio information or image information covers a preset time interval. For example, the core image generating apparatus 200 may be set so that the audio information or image information separated from the video image 100 covers a time interval of 1 second.

Meanwhile, the core image generating apparatus 200 may separate the image information and the audio information so that the time point at which the image information is extracted from the video image 100 is the same as the time point at which the audio information is extracted.

Alternatively, the core image generating apparatus 200 may extract the audio information at a time point at which a preset delay time has elapsed from the time point at which the image information was extracted from the video image 100.

Here, the delay time may be set to the time difference between the time point at which an arbitrary situation appears in the image information and the time point at which a reaction to that situation appears in the audio information.

For example, when the video image 100 is a match video of soccer, baseball, e-sports, or the like, the delay time may be set according to the actions of the players in the match and the crowd's reaction to those actions. More specifically, the delay time may be set to the difference between the time point at which a player's action appears in the image information of the video image 100 and the time point at which the crowd's reaction to that action appears in the audio information of the video image 100.

Here, a player's action may appear in a scene in which the player scores, a scene in which the player concedes, a scene in which the player fails to score, a scene in which the player commits a foul, and the like. The crowd's reaction may include the commentator's pitch rising and the crowd cheering or applauding on a score, and the commentator's pitch falling and the crowd booing or chanting on a conceded point.
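The pairing of image windows with delayed audio windows can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the function name `paired_windows` and its parameters are hypothetical, and the 1-second window is the example value given in the text.

```python
def paired_windows(duration_s, window_s=1.0, delay_s=0.0):
    """Return (image_start, audio_start) pairs, in seconds.

    Each image window starts at a multiple of window_s; the matching
    audio window starts delay_s later. Windows whose audio part would
    run past the end of the video are dropped.
    Names and defaults are illustrative assumptions.
    """
    if window_s <= 0 or delay_s < 0:
        raise ValueError("window_s must be positive and delay_s non-negative")
    pairs = []
    k = 0
    while k * window_s + window_s + delay_s <= duration_s:
        image_start = k * window_s
        pairs.append((image_start, image_start + delay_s))
        k += 1
    return pairs


# A 5-second clip, 1-second windows, audio read 2 seconds after the image.
windows = paired_windows(5.0, window_s=1.0, delay_s=2.0)
```

With `delay_s=0.0` the same function covers the case in which image and audio are extracted at the same time point.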

The core image generating apparatus 200 may analyze the extracted audio information to generate audio feature information.

In this case, the core image generating apparatus 200 may divide the audio information into a plurality of segments according to a preset number of segments and extract audio feature information from the divided segments.

For example, the core image generating apparatus 200 may divide audio information extracted over a time interval of 1 second into 25 segments, in which case each segment contains 0.04 seconds of audio information.
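The segmentation in the example above can be sketched as follows. The sample rate of 16,000 Hz and the function name `split_into_segments` are assumptions made for illustration; the text specifies only the 1-second window and the 25 segments.

```python
def split_into_segments(samples, num_segments=25):
    """Split one window of audio samples into equal-length segments.

    Any samples left over after an even split are discarded.
    """
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    seg_len = len(samples) // num_segments
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]


SAMPLE_RATE = 16000                      # assumed sample rate (Hz)
one_second = [0.0] * SAMPLE_RATE         # placeholder for 1 second of audio
segments = split_into_segments(one_second, num_segments=25)
segment_duration = len(segments[0]) / SAMPLE_RATE   # seconds per segment
```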

In this regard, the core image generating apparatus 200 may extract information for different frequency bands from one segment using MFCCs (Mel-Frequency Cepstral Coefficients). In doing so, the core image generating apparatus 200 may extract the information for the different frequency bands from the audio information of one segment according to a preset number of frequency dimensions.

Here, the number of frequency dimensions may mean the number of different frequency bands extracted from one segment.

For example, the core image generating apparatus 200 may extract information for 20 frequency bands from the audio information of one segment using a 20-dimensional MFCC. In this case, the core image generating apparatus 200 extracts 20-dimensional frequency band information from the audio information that was extracted over a time interval of 1 second and divided into 25 segments.

Here, MFCC may be understood as a technique that extracts the signal of each frequency band from an audio signal or the like spanning a plurality of frequency bands, using the mel scale, which is designed to emphasize the energy of the frequency bands to which human hearing is sensitive.

In this way, the core image generating apparatus 200 may extract information for a plurality of frequency bands from the audio information of one segment according to the preset number of frequency dimensions.

Accordingly, the core image generating apparatus 200 may generate audio feature information from the extracted information of each frequency band using a BiLSTM (Bidirectional Long Short-Term Memory).

Here, a BiLSTM is a technique that runs LSTMs in both directions, using one LSTM (Long Short-Term Memory) that receives a time-series signal in the forward direction and another LSTM that receives the same signal in the reverse direction. An LSTM is a technique that receives and processes a time-series signal, and may be understood as a technique that processes temporally continuous information by reflecting the signal input at a previous time point in the signal input at the current time point.

For example, when the core image generating apparatus 200 is arranged to extract 20-dimensional frequency band information from audio information that was extracted over a time interval of 1 second and divided into 25 segments, it may generate audio feature information in each frequency band. The core image generating apparatus 200 may thus generate 500 pieces (or 500 dimensions) of audio feature information per second.

Of course, the core image generating apparatus 200 may also generate the audio feature information at a different time interval, and the number of pieces of audio feature information generated per set time interval may also differ.
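The count of 500 follows from 25 segments times 20 frequency dimensions. The short sketch below makes that arithmetic explicit and shows one way a per-window feature matrix could be flattened into a single vector. The helper names are hypothetical, and the flattening step is an assumption about data layout rather than something the text specifies.

```python
def features_per_window(num_segments, num_freq_dims):
    """Number of audio feature values produced for one time window."""
    return num_segments * num_freq_dims


def flatten(segment_features):
    """Flatten a [segments][frequency dims] matrix into one vector."""
    return [value for row in segment_features for value in row]


# Example settings from the text: 25 segments, 20 frequency dimensions.
count = features_per_window(25, 20)

# Placeholder 25 x 20 matrix standing in for real per-band features.
matrix = [[0.0] * 20 for _ in range(25)]
vector = flatten(matrix)
```

Changing either setting changes the count accordingly, for example 50 segments with 13 dimensions would give 650 values per window.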

The core image generating apparatus 200 may extract image feature information from the image information based on a learning model prepared in advance.

To this end, the core image generating apparatus 200 may generate the learning model by learning to extract image feature information from image information separated frame by frame from an arbitrary video image 100.

In this case, the core image generating apparatus 200 may learn the image information separated frame by frame from an arbitrary video image 100 using a CNN (Convolutional Neural Network).

Here, a CNN may be composed of one or more convolution layers, pooling layers, and fully connected layers. A CNN may accordingly be understood as a technique that receives information such as a video or an image and extracts feature values of the input information through filters learned in advance.

Accordingly, the core image generating apparatus 200 may extract image feature information from the image information using a BiLSTM to which the CNN is applied.

In this case, the core image generating apparatus 200 may extract image feature information from each of the frames in the image information separated from the video image 100.

The core image generating apparatus 200 may fuse the continuously generated audio feature information and image feature information and analyze the fused information to generate continuous feature information.

Here, the core image generating apparatus 200 may generate the continuous feature information from the fused information using a BiLSTM. Fusing the audio feature information and the image feature information may mean superimposing the two, or may mean generating multi-dimensional information from the two.
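One simple reading of the fusion step is per-time-step concatenation of the audio and image feature vectors. The sketch below assumes that reading; the text leaves the exact fusion operation open, and the function name `fuse` is hypothetical.

```python
def fuse(audio_features, image_features):
    """Concatenate audio and image feature vectors time step by time step.

    Both inputs are lists of per-time-step vectors and must have the
    same number of time steps.
    """
    if len(audio_features) != len(image_features):
        raise ValueError("audio and image sequences must be the same length")
    return [list(a) + list(v) for a, v in zip(audio_features, image_features)]


# Two time steps, 2-dimensional audio features, 1-dimensional image features.
fused = fuse([[1, 2], [3, 4]], [[9], [8]])
```

The fused sequence keeps one vector per time step, so it can be fed to a sequence model in the same order as the original video.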

In this case, the continuous feature information may be generated so as to match the audio feature information and image feature information that the core image generating apparatus 200 separates continuously from the video image 100. In other words, the core image generating apparatus 200 may be understood as generating the continuous feature information by analyzing the fused information continuously.

The core image generating apparatus 200 may analyze the fused information according to a preset mid- to long-term time interval to generate mid- to long-term feature information.

Here, the core image generating apparatus 200 may use a BiLSTM to analyze the fused information according to the mid- to long-term time interval and generate the mid- to long-term feature information. As before, fusing the audio feature information and the image feature information may mean superimposing the two, or may mean generating multi-dimensional information from the two.

In this case, the preset mid- to long-term time interval may mean the times at which the fused information is input to the BiLSTM. Specifically, the core image generating apparatus 200 may be understood as inputting to the BiLSTM, at each time point given by the mid- to long-term time interval, the same fused information as that from which the continuous feature information is generated.

Accordingly, the core image generating apparatus 200 may analyze the fused information at each time point given by the mid- to long-term time interval to generate the mid- to long-term feature information.

In this case, the mid- to long-term time interval may be set so that scene changes in the video image 100 are detected at that interval. The mid- to long-term feature information may thus be understood as being generated to represent features of scene changes in the video image 100 that vary over a longer time interval than those captured by the continuous feature information.
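The relationship between the two analysis paths can be sketched as a subsampling of the fused sequence: the continuous path sees every time step, while the mid- to long-term path sees only the steps that fall on the preset interval. This is an illustrative reading, and the function name and the interval value are assumptions.

```python
def mid_long_term_inputs(fused_sequence, interval):
    """Select the fused vectors that fall on the mid- to long-term interval.

    interval is expressed in time steps (for example, 4 means every
    fourth fused vector is passed to the mid- to long-term analysis).
    """
    if interval <= 0:
        raise ValueError("interval must be a positive number of time steps")
    return fused_sequence[::interval]


fused_sequence = list(range(10))   # stand-in for 10 fused feature vectors
continuous_inputs = fused_sequence                  # every time step
sparse_inputs = mid_long_term_inputs(fused_sequence, interval=4)
```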

The core image generating apparatus 200 may compare the continuous feature information with the mid- to long-term feature information to generate context information indicating the correlation between the two.

In this case, the core image generating apparatus 200 may compare the continuous feature information generated from the information fused at an arbitrary time point. The core image generating apparatus 200 may also compare the mid- to long-term feature information generated from the information fused at an arbitrary time point with each piece of continuous feature information generated from the information fused between that time point and the next time point at which information is fused to generate mid- to long-term feature information.

The core image generating apparatus 200 may also compare one piece of continuous feature information, generated from the information fused at an arbitrary time point, with each of a plurality of pieces of mid- to long-term feature information generated from the video image 100.

Here, the difference between the time points at which information is fused to generate temporally adjacent pieces of mid- to long-term feature information may be the mid- to long-term time interval. In this case, the continuous feature information compared with a given piece of mid- to long-term feature information may have been generated within that mid- to long-term time interval, or may have been generated within the time span covered by the video image 100.

Accordingly, the core image generating apparatus 200 may compare the mid- to long-term feature information with the continuous feature information to generate context information, apply the generated context information to the continuous feature information, and generate an importance score from the continuous feature information to which the context information has been applied, using a fully connected layer.

In other words, the core image generating apparatus 200 may input the context-applied continuous feature information to the fully connected layer and take the resulting output as the importance score.

Here, the fully connected layer may mean a technique arranged to output a result value that matches an input value.

이와 관련하여, 핵심 영상 생성 장치(200)는 아래의 수학식 1 내지 수학식 3의 수식을 이용하여 맥락 정보를 생성할 수 있다.In this regard, the core image generating apparatus 200 may generate context information by using Equations 1 to 3 below.
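The equations themselves do not appear in this text. Based on the symbol definitions given for them, Equations 1 to 3 can plausibly be reconstructed as follows. The exponential form of the softmax in Equation 2 is the standard definition and is an assumption about what the original equation shows; \alpha_{ti} and \alpha_t correspond to Alpha_ti and Alpha_t.

```latex
\begin{align}
e_{ti} &= z_t^{T} h_i \tag{1} \\
\alpha_{ti} &= \operatorname{softmax}(e_{ti}) = \frac{\exp(e_{ti})}{\sum_{k=1}^{N} \exp(e_{tk})} \tag{2} \\
\alpha_t &= \sum_{i=1}^{N} \alpha_{ti}\, h_i \tag{3}
\end{align}
```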

Here, e_ti may denote the first weighting coefficient calculated by comparing the mid- to long-term feature information with the continuous feature information, z_t may denote the continuous feature information, and h_i may denote the mid- to long-term feature information. T may denote the transpose operation, which exchanges the rows and columns of a matrix.

Accordingly, the core image generating apparatus 200 may calculate the first weighting coefficient by multiplying the transpose of the continuous feature information by the mid- to long-term feature information.

Here, Alpha_ti may denote the second weighting coefficient, and e_ti may denote the first weighting coefficient.

Accordingly, the core image generating apparatus 200 may calculate the second weighting coefficient according to the ratio of any one first weighting coefficient to the sum of all the first weighting coefficients.

In this regard, softmax() may denote the softmax function, which can be understood as a technique for calculating the ratio of any one piece of information in a set to the sum of all the information in that set (in its standard form, the ratio is taken after exponentiating each value).

Here, Alpha_t may denote the context information, Alpha_ti may denote the second weighting coefficient, and h_i may denote the mid- to long-term feature information. N may denote the number of pieces of mid- to long-term feature information generated from a given video image 100.

Accordingly, the core image generating apparatus 200 may calculate the context information as the sum, over all the mid- to long-term feature information generated from a given video image 100, of the product of each piece of mid- to long-term feature information and its second weighting coefficient.

The core image generating apparatus 200 may generate the continuous feature information to which the context information is applied by multiplying the continuous feature information by the calculated context information or by adding the calculated context information to it.
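The computation of the context information and its application to the continuous feature information can be sketched in plain Python as follows. This is a minimal illustration rather than the patented implementation: the function names `context_vector` and `apply_context`, the `mode` parameter, the element-wise form of the multiplication, and the toy vectors are assumptions made for the example, and the softmax is written in its standard exponential form.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors (z_t^T h_i)."""
    return sum(x * y for x, y in zip(a, b))


def context_vector(z_t, h):
    """Return (context, alpha) for one continuous feature vector z_t.

    h is a list of N mid- to long-term feature vectors h_i.
    """
    # First weighting coefficients: e_ti = z_t^T h_i.
    e = [dot(z_t, h_i) for h_i in h]
    # Second weighting coefficients: softmax over i
    # (the maximum is subtracted only for numerical stability).
    peak = max(e)
    exps = [math.exp(v - peak) for v in e]
    total = sum(exps)
    alpha = [v / total for v in exps]
    # Context information: sum over i of alpha_ti * h_i.
    dim = len(h[0])
    context = [sum(a * h_i[d] for a, h_i in zip(alpha, h)) for d in range(dim)]
    return context, alpha


def apply_context(z_t, context, mode="add"):
    """Apply the context by addition or element-wise multiplication (assumed forms)."""
    if mode == "add":
        return [z + c for z, c in zip(z_t, context)]
    if mode == "multiply":
        return [z * c for z, c in zip(z_t, context)]
    raise ValueError("mode must be 'add' or 'multiply'")


# Toy data (hypothetical): one continuous feature, two mid- to long-term features.
z_t = [1.0, 0.0]
h = [[1.0, 0.0], [0.0, 1.0]]
context, alpha = context_vector(z_t, h)
contextual_z = apply_context(z_t, context, mode="add")
```

In this toy case the first mid- to long-term feature is more similar to z_t, so it receives the larger second weighting coefficient and dominates the context vector.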

In this regard, the core image generating apparatus 200 may generate the context information from the continuous feature information and the mid- to long-term feature information using an attention technique.

Here, the attention technique may be a technique that represents the relationship between pieces of information generated in time series by assigning weights to those judged to be important and computing a weighted sum.

In this case, the core image generating apparatus 200 may use the attention technique to generate the context information from the mid- to long-term feature information and the continuous feature information, and apply the generated context information to the continuous feature information. The core image generating apparatus 200 may then input the continuous feature information to which the context information has been applied into the fully-connected layer to generate the importance score.
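A fully-connected layer with a single output can be sketched as a weighted sum of its inputs plus a bias. In the sketch below, the sigmoid that squashes the result into the range (0, 1), the function name `importance_score`, and the weight and bias values are all assumptions for illustration; in practice the weights would be learned during training.

```python
import math


def importance_score(features, weights, bias):
    """Single-output fully-connected layer followed by a sigmoid (assumed activation)."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical context-applied continuous feature and hypothetical parameters.
score = importance_score([1.73, 0.27], weights=[0.5, -0.25], bias=0.0)
```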

The core image generating apparatus 200 may generate the core image 300 based on the context information or the importance score. In this case, the core image generating apparatus 200 may extract one or more importance scores according to the importance scores generated from the context information, which can be understood as extracting those scenes of the video image 100 from which the continuous feature information matching the extracted importance scores is generated.

Here, extracting scenes of the video image 100 can be understood as extracting, for the time interval or frame interval of the video image 100 in which each extracted piece of continuous feature information appears, the image information and audio information of that interval and joining them. In this way, the core image generating apparatus 200 may generate the core image 300 by joining, in chronological order, the image information and audio information extracted from the video image 100 according to the one or more pieces of continuous feature information.
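Joining the extracted intervals in chronological order can be illustrated with a small helper that converts selected segment indices into time intervals. The function name `merge_selected`, the fixed segment length, and the merging of adjacent segments into one clip are assumptions made for this sketch.

```python
def merge_selected(indices, segment_seconds):
    """Convert selected segment indices into chronologically ordered
    (start, end) intervals in seconds, merging back-to-back segments."""
    intervals = []
    for idx in sorted(set(indices)):
        start = idx * segment_seconds
        end = start + segment_seconds
        if intervals and intervals[-1][1] == start:
            # This segment starts where the previous clip ends: extend the clip.
            intervals[-1] = (intervals[-1][0], end)
        else:
            intervals.append((start, end))
    return intervals


# Hypothetical selection: segments 2, 3 and 7, 8, 9 of a video cut into 2-second segments.
clips = merge_selected([7, 2, 3, 8, 9], segment_seconds=2)
```

The image and audio information for each resulting interval would then be cut from the video image 100 and concatenated in the listed order.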

In this case, the core image generating apparatus 200 may extract one or more importance scores according to the ratio of the length of the core image 300 to the length of the video image 100, and may generate the core image 300 using the extracted importance scores or the corresponding continuous feature information.

For example, when the video image 100 is 30 minutes long and the core image 300 is 3 minutes long, the ratio of the length of the core image 300 to the length of the video image 100 is 10%, so the core image generating apparatus 200 may generate the core image 300 from the importance scores that fall within the top 10% of all importance scores.
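The ratio-based selection in this example can be sketched as follows. The function name `select_top_ratio`, the rounding rule, and the sample scores are assumptions for illustration only.

```python
def select_top_ratio(scores, video_length, core_length):
    """Return the indices (in chronological order) of the top-scoring segments,
    keeping a fraction equal to core_length / video_length."""
    ratio = core_length / video_length
    keep = max(1, round(len(scores) * ratio))  # keep at least one segment
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical importance scores for ten equal-length segments.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.5, 0.05]
top_10_percent = select_top_ratio(scores, video_length=30, core_length=3)
top_20_percent = select_top_ratio(scores, video_length=30, core_length=6)
```

With a 30-minute video and a 3-minute core image, one of the ten segments is kept; doubling the core image length to 6 minutes keeps two.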

FIG. 2 is a control block diagram of a core image generating apparatus according to an embodiment of the present invention.

The core image generating apparatus 200 may include a video separation unit 210, an audio analysis unit 220, an image analysis unit 230, a continuous information analysis unit 240, a mid- to long-term information analysis unit 250, a context analysis unit 260, and an image generation unit 270.

The video separation unit 210 may separate audio information and image information from the video image 100 so that each is continuous in chronological order.

In this case, the video separation unit 210 may separate the video image 100 so that the separated audio information or image information appears at a preset time interval.

Meanwhile, the video separation unit 210 may separate the image information and the audio information so that the time point at which the image information is extracted from the video image 100 is the same as the time point at which the audio information is extracted.

Alternatively, the video separation unit 210 may extract the audio information at the time point at which a preset delay time has elapsed from the time point at which the image information was extracted from the video image 100.
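The two timing options described above (same time point, or audio extracted after a preset delay) can be sketched as follows. The function name `sampling_points` and its parameters are hypothetical, and times are expressed in seconds for the example.

```python
def sampling_points(duration, interval, audio_delay=0):
    """Return (image_time, audio_time) pairs, one per preset interval.

    audio_delay=0 extracts both streams at the same time point; a positive
    value extracts the audio that much later than the image.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    count = int(duration // interval)
    return [(k * interval, k * interval + audio_delay) for k in range(count)]


same_time = sampling_points(duration=10, interval=2)
delayed = sampling_points(duration=10, interval=2, audio_delay=0.5)
```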

The audio analysis unit 220 may analyze the extracted audio information to generate audio feature information.

In this case, the audio analysis unit 220 may divide the audio information into a plurality of segments according to a preset number of segments, and may extract audio feature information from the divided segments.
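Dividing the audio information into a preset number of segments can be sketched as below. The function name `split_into_segments` and the choice to make segments contiguous and nearly equal in length are assumptions; the subsequent feature extraction from each segment is not shown.

```python
def split_into_segments(samples, num_segments):
    """Split a sequence of audio samples into num_segments contiguous,
    nearly equal parts (earlier parts absorb any remainder)."""
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    base, extra = divmod(len(samples), num_segments)
    segments, start = [], 0
    for k in range(num_segments):
        size = base + (1 if k < extra else 0)
        segments.append(samples[start:start + size])
        start += size
    return segments


# Ten dummy samples split into three segments.
segments = split_into_segments(list(range(10)), num_segments=3)
```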

In this regard, the audio analysis unit 220 may extract information for different frequency bands from one segment using Mel-frequency cepstral coefficients (MFCC). In doing so, the audio analysis unit 220 may extract the information for the different frequency bands from the audio information of one segment according to a preset number of frequency dimensions.

In this way, the audio analysis unit 220 may extract information for a plurality of frequency bands from the audio information of one segment, according to the preset number of frequency dimensions.

Accordingly, the audio analysis unit 220 may generate the audio feature information from the extracted information of each frequency band using a bidirectional long short-term memory (BiLSTM) network.

The image analysis unit 230 may extract image feature information from the image information based on a learning model prepared in advance.

To this end, the image analysis unit 230 may generate the learning model by training it to extract image feature information from image information separated frame by frame from an arbitrary video image 100.

In this case, the image analysis unit 230 may use a convolutional neural network (CNN) to learn from the image information separated frame by frame from an arbitrary video image 100.

Accordingly, the image analysis unit 230 may extract the image feature information from the image information using a BiLSTM to which the CNN is applied.

In this case, the image analysis unit 230 may extract image feature information from each frame of the image information separated from the video image 100.

The continuous information analysis unit 240 may fuse the audio feature information and the image feature information, which are generated so as to be continuous, and may analyze the fused information to generate continuous feature information.

Here, the continuous information analysis unit 240 may use a BiLSTM to generate the continuous feature information from the fused information. Fusing the audio feature information and the image feature information may mean superimposing the two, or it may mean generating multidimensional information from the audio feature information and the image feature information.
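One common way to fuse two per-time-step feature sequences is to concatenate the vectors at each time step. The sketch below takes that reading, which is an assumption; the text only states that the features are superimposed or combined into multidimensional information. The function name `fuse` and the sample values are hypothetical.

```python
def fuse(audio_features, image_features):
    """Concatenate the audio and image feature vectors at each time step."""
    if len(audio_features) != len(image_features):
        raise ValueError("both sequences must cover the same time steps")
    return [list(a) + list(v) for a, v in zip(audio_features, image_features)]


# Two time steps: 2-dimensional audio features and 3-dimensional image features.
audio = [[0.1, 0.2], [0.3, 0.4]]
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fused = fuse(audio, image)
```

Each fused vector has five dimensions here, and the fused sequence keeps one vector per time step so that a sequence model can consume it directly.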

The mid- to long-term information analysis unit 250 may analyze the fused information according to a preset mid- to long-term time interval to generate mid- to long-term feature information.

Here, the mid- to long-term information analysis unit 250 may use a BiLSTM to analyze the fused information according to the mid- to long-term time interval and generate the mid- to long-term feature information.
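The idea of producing one feature per mid- to long-term interval can be illustrated with a deliberately simplified stand-in. The sketch below groups the fused sequence into windows of a preset length and averages each window; the averaging replaces the BiLSTM named in the text purely to keep the example short, and the function name `mid_long_term_features` and the window length are assumptions.

```python
def mid_long_term_features(fused, interval):
    """Summarize each window of `interval` fused vectors by its mean.

    Mean pooling is a stand-in for the BiLSTM described in the text.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    features = []
    for start in range(0, len(fused), interval):
        window = fused[start:start + interval]
        dim = len(window[0])
        features.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return features


# Five fused vectors grouped into windows of two time steps.
fused_sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
h = mid_long_term_features(fused_sequence, interval=2)
```

The number of vectors returned corresponds to N, the number of pieces of mid- to long-term feature information for the video.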

The context analysis unit 260 may compare the continuous feature information with the mid- to long-term feature information to generate context information indicating the correlation between them.

In this case, the context analysis unit 260 may compare continuous feature information generated from the information fused at an arbitrary time point. The context analysis unit 260 may also compare mid- to long-term feature information generated from the information fused at an arbitrary time point with each piece of continuous feature information generated from the information fused between that time point and the next time point at which information is fused to generate mid- to long-term feature information.

The context analysis unit 260 may also compare one piece of continuous feature information, generated from the information fused at an arbitrary time point, with each of a plurality of pieces of mid- to long-term feature information generated from the video image 100.

Accordingly, the context analysis unit 260 may generate the context information by comparing the mid- to long-term feature information with the continuous feature information, and may apply the generated context information to the continuous feature information.

In this case, the context analysis unit 260 may input the continuous feature information to which the context information has been applied into a fully-connected layer and obtain an importance score as its output.

The image generation unit 270 may generate the core image 300 based on the context information or the importance score. In this case, the image generation unit 270 may extract one or more importance scores according to the magnitude of the importance scores generated from the context information.

In this case, the image generation unit 270 may extract one or more importance scores according to the ratio of the length of the core image 300 to the length of the video image 100, and may generate the core image 300 using the extracted importance scores or the corresponding continuous feature information.

FIG. 3 is a block diagram illustrating the process by which the continuous information analysis unit or the mid- to long-term information analysis unit of FIG. 2 analyzes the fused information.

Referring to FIG. 3, the video separation unit 210 may separate audio information and image information from the video image 100 so that each is continuous in chronological order.

Accordingly, the audio analysis unit 220 may analyze the extracted audio information to generate audio feature information.

In addition, the image analysis unit 230 may extract image feature information from the image information based on a learning model prepared in advance.

In this case, the image analysis unit 230 may use a convolutional neural network (CNN) to learn from the image information separated frame by frame from an arbitrary video image 100.

Accordingly, the continuous information analysis unit 240 may fuse the audio feature information and the image feature information, which are generated so as to be continuous, and may analyze the fused information to generate continuous feature information.

In addition, the mid- to long-term information analysis unit 250 may analyze the fused information according to a preset mid- to long-term time interval to generate mid- to long-term feature information.

FIG. 4 is a block diagram illustrating the process by which the image generation unit of FIG. 2 generates the core image.

Referring to FIG. 4, the context analysis unit 260 may compare the continuous feature information with the mid- to long-term feature information to generate context information indicating the correlation between them.

FIG. 5 is a schematic diagram illustrating an example of a core image generating apparatus according to an embodiment of the present invention.

Referring to FIG. 5, it can be seen that audio information and image information are separated from the video image 100 and input to different BiLSTMs.

In FIG. 5, Video may denote the video image 100, audio may denote the audio information, and image may denote the image information. {x_t}audio may denote the audio information separated at the preset time interval, and {x_t}image may denote the image information separated at the preset time interval. {z_t} may denote the continuous feature information, {h_i} may denote the mid- to long-term feature information, and s_t may denote the importance score output when the continuous feature information to which the context information has been applied is input to the fully-connected layer. Accordingly, the core image generating apparatus 200 may generate the core image from the video image 100 based on the context information or the importance score.

FIG. 6 is a flowchart of a core image generating method according to an embodiment of the present invention.

The core image generating method according to an embodiment of the present invention is carried out on substantially the same configuration as the core image generating apparatus 200 shown in FIG. 1. Accordingly, the same reference numerals are given to the same components as those of the core image generating apparatus 200 of FIG. 1, and repeated descriptions are omitted.

The core image generating method may include a step 600 of separating and extracting audio information and image information, a step 610 of generating audio feature information, a step 620 of generating image feature information, a step 630 of generating continuous feature information, a step 640 of generating mid- to long-term feature information, a step 650 of generating context information, and a step 660 of generating a core image.

In the step 600 of separating and extracting audio information and image information, the video separation unit 210 may separate the audio information and the image information from the video image so that each is continuous in chronological order.

In the step 610 of generating audio feature information, the audio analysis unit 220 may analyze the audio information to generate the audio feature information.

In the step 620 of generating image feature information, the image analysis unit 230 may generate the image feature information from the image information based on a learning model prepared in advance.

In the step 630 of generating continuous feature information, the continuous information analysis unit 240 may fuse the audio feature information and the image feature information, which are generated so as to be continuous, and analyze the fused information to generate the continuous feature information.

In the step 640 of generating mid- to long-term feature information, the mid- to long-term information analysis unit 250 may analyze the fused information according to a preset mid- to long-term time interval to generate the mid- to long-term feature information.

In the step 650 of generating context information, the context analysis unit 260 may compare the continuous feature information with the mid- to long-term feature information to generate context information indicating the correlation between them.

In the step 660 of generating a core image, the image generation unit 270 may generate the core image based on the context information.

Although the invention has been described above with reference to embodiments, those skilled in the art will understand that it may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the following claims.

100: video image
200: core image generating apparatus
300: core image

Claims

A core image generating apparatus comprising:
a video separation unit that separates audio information and image information from a video image so that each is continuous in chronological order;
an audio analysis unit that analyzes the audio information to generate audio feature information;
an image analysis unit that generates image feature information from the image information based on a learning model prepared in advance;
a continuous information analysis unit that fuses the audio feature information and the image feature information generated so as to be continuous, and analyzes the fused information to generate continuous feature information;
a mid- to long-term information analysis unit that analyzes the fused information according to a preset mid- to long-term time interval to generate mid- to long-term feature information;
a context analysis unit that compares the continuous feature information with the mid- to long-term feature information to generate context information indicating a correlation between the continuous feature information and the mid- to long-term feature information; and
an image generation unit that generates a core image based on the context information.

The core image generating apparatus of claim 1, wherein the audio analysis unit divides the audio information into a plurality of segments according to a preset number of segments, and generates, from the segments, audio feature information for each frequency band according to a preset number of frequency dimensions.

The core image generating apparatus of claim 1, wherein the image analysis unit generates the learning model by training it so that image feature information is extracted from image information separated frame by frame from an arbitrary video image.

The core image generating apparatus of claim 1, wherein the image generation unit extracts one or more pieces of context information according to a ratio of the length of the core image to the length of the video image, and generates the core image using the extracted context information.

A method of generating a core image using a core image generating apparatus that considers the context of a video image, the method comprising:
separating audio information and image information from the video image so that each is continuous in chronological order;
generating audio feature information by analyzing the audio information;
generating image feature information from the image information based on a learning model prepared in advance;
fusing the audio feature information and the image feature information generated so as to be continuous, and analyzing the fused information to generate continuous feature information;
generating mid- to long-term feature information by analyzing the fused information according to a preset mid- to long-term time interval;
comparing the continuous feature information with the mid- to long-term feature information to generate context information indicating a correlation between the continuous feature information and the mid- to long-term feature information; and
generating a core image based on the context information.