KR102512396B1

KR102512396B1 - Automatic extraction method of e-sports highlights based on multimodal learning

Info

Publication number: KR102512396B1
Application number: KR1020210093522A
Authority: KR
Inventors: 권혁윤; 박강민; 현혜인
Original assignee: 서울과학기술대학교 산학협력단
Priority date: 2021-07-16
Filing date: 2021-07-16
Publication date: 2023-03-20
Also published as: KR20230012800A

Abstract

The method for automatically generating a highlight video, performed by an apparatus for automatically generating a highlight video according to the present invention, includes: receiving an input video; extracting video features, audio features, and chat features from the input video as raw features; preprocessing the extracted raw features to a predetermined size and then concatenating them; receiving the concatenated features and selecting frames classified as highlight frames as video clips to be included in the highlight video; and generating the highlight video by joining the selected video clips. Automatically generating highlights from E-sports broadcast video in this way improves the efficiency of video content production.

Description

Method for automatically generating multimodal learning-based E-sports highlight video and device for performing the same

The present invention relates to a multimodal learning-based method for automatically generating an E-sports highlight video and an apparatus for performing the same, and more particularly to such a method and apparatus capable of automatically generating a highlight video by extracting only the important highlight scenes from an original video.

E-sports, whose popularity as a sport continues to grow, was selected as an official sport at the 2018 Asian Games, and as of 2019 its audience had reached 443 million viewers.

In addition, Twitch, a broadcast platform that holds 65% of E-sports viewers, provides live E-sports match broadcasts and highlight videos. Because E-sports is less constrained by time and place, numerous matches are held every day, and highlight videos are produced for them.

In particular, as digital video content has grown remarkably, the need has emerged to extract highlights, that is, events that are more important than the rest of the content, and many studies on highlight extraction have been conducted. According to a recent YouTube announcement, the viewing time of sports highlight videos increased by about 80% in 2017 compared with 2016.

As game-related video content such as E-sports and real-time game streaming spreads, the volume of video in the gaming field is growing as in other fields, making it necessary to extract highlights from video. As the E-sports market grows and the volume of E-sports match video increases, the need for an efficient way to index and share that video across various channels is emerging.

General gamers also want to watch the highlights of E-sports videos and real-time game streams, but long broadcast times are an obstacle for viewers who sample a broadcast before deciding whether to watch it, discouraging them from joining new broadcasts. Extracting video highlights is one way to solve this problem.

However, having producers create highlight videos themselves or through professional editors takes too much time and money, so the need for automatic highlight extraction, which automatically extracts only the important scenes from the original video, is increasing.

Korean Patent Laid-Open Publication No. 2010-0114131

The present invention has been devised to solve the above problems, and an object of the present invention is to provide a multimodal learning-based method for automatically generating an E-sports highlight video, and an apparatus for performing the same, that can automatically extract highlights from E-sports broadcast video using multimodal deep learning.

To achieve the above object, a method for automatically generating a highlight video, performed by an apparatus for automatically generating a highlight video according to an embodiment of the present invention, includes: receiving an input video; extracting raw features from the input video, the raw features including video features, audio features, and chat features; preprocessing the extracted raw features to a predetermined size and concatenating the preprocessed video features, audio features, and chat features; classifying highlight frames from the concatenated video features, audio features, and chat features; selecting the classified highlight frames as video clips to be included in the highlight video; and generating the highlight video by joining the selected video clips.

Here, the step of concatenating the preprocessed video features, audio features, and chat features may resize the extracted video features, audio features, and chat features included in the raw features to the same size through a Long Short-Term Memory (LSTM) model and then concatenate them.

The step of classifying highlight frames from the concatenated video features, audio features, and chat features may add a single fully connected (FC) layer to the LSTM model and, through the LSTM model, classify each frame of the concatenated features as a highlight frame or a non-highlight frame.

The method may further include, before the concatenating step, applying a smoothing technique based on a moving average to at least one of the extracted raw features.

Meanwhile, an apparatus for automatically generating a highlight video according to an embodiment of the present invention for achieving the above object may include: an input unit that receives a video; an extraction unit that extracts raw features from the input video, the raw features including video features, audio features, and chat features; a connection unit that preprocesses the extracted raw features to a predetermined size and concatenates the preprocessed video features, audio features, and chat features; a clip generation unit that classifies highlight frames from the concatenated video features, audio features, and chat features and selects the classified highlight frames as video clips to be included in the highlight video; and a highlight generation unit that generates the highlight video by joining the selected video clips.

According to the aspect of the present invention described above, providing a multimodal learning-based method for automatically generating an E-sports highlight video makes it possible to generate highlights automatically from E-sports broadcast video, thereby improving the efficiency of video content production.

In addition, the present invention can improve the overall performance of automatic highlight extraction by using multimodal deep learning to integrate data collected from multiple sources and by selecting effective features for each type of data.

FIG. 1 is a block diagram illustrating the configuration of an apparatus for automatically generating a highlight video according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for automatically generating a highlight video according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating how extracted raw features are concatenated according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating how extracted raw features are preprocessed and then concatenated according to an embodiment of the present invention; and
FIG. 5 is a diagram illustrating the multimodal learning-based model of the present invention.

The detailed description of the present invention that follows refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention, although different from one another, are not necessarily mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention. It should also be understood that the location or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the present invention is limited only by the appended claims, properly interpreted, together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the drawings.

FIG. 1 is a block diagram illustrating the configuration of an apparatus 10 for automatically generating a highlight video according to an embodiment of the present invention. The apparatus 10 according to this embodiment includes a communication unit 13, a memory 15, an output unit 17, and a processor 19.

The input unit 11 is an input means for receiving user commands and may receive a video for highlight extraction, such as a match video or a broadcast video. The output unit 17 displays the process and results of automatic highlight extraction and may include a display.

The communication unit 13 is provided to exchange necessary information with an external device or an external network, and training data or a video for highlight extraction may be received through it.

The memory 15 stores a program for performing the method for automatically generating a highlight video and provides the storage space the processor 19 needs to operate, temporarily or permanently storing data processed by the processor 19. It may include a volatile or non-volatile storage medium, but the scope of the present invention is not limited thereto. The memory 15 may also store data accumulated while the method is performed.

Meanwhile, the processor 19 comprises CPUs and GPUs that train the multimodal learning-based model, a deep learning model for automatic highlight extraction, use the trained model to select video clips to be included in the highlight video, and join the selected clips to generate the highlight video. The processor 19 may control the entire process of the method for automatically generating a highlight video, and software (an application) for this purpose may be installed and executed on it.

To perform the method described above, the processor 19 may include an extraction unit 191, a connection unit 193, a clip generation unit 195, and a highlight generation unit 197.

The extraction unit 191 may extract video features, audio features, and chat features as raw features from the video data, audio data, and chat data, respectively, contained in the video received through the input unit 11 or the communication unit 13.

Taking the game League of Legends (LoL) as an example, the extraction unit 191 may integrate all of the source data that affects the probability of winning or losing, such as gold earned, the number of towers destroyed, the number of opposing characters killed, and the number of Barons and Dragons taken by each team, and use the integrated data as video data from which to extract video features.

The extraction unit 191 also extracts audio features and chat features from audio data that includes the broadcaster's voice and the cheers of the on-site audience.

More specifically, the extraction unit 191 may extract feature vectors of spectral information from the audio data using a speech recognition technique such as Mel Frequency Cepstral Coefficients (MFCC). MFCC is a technique for extracting the features of sound: rather than analyzing the entire input at once, it divides the sound into fixed-length segments and analyzes the spectrum of each segment to extract features.

The feature vector for the audio data may include information about the loudness and tone of the commentator's voice, speaking speed, and the like. To extract audio features, the extraction unit 191 downsamples the original audio data from a sampling rate of 44,100 to 4,410 samples per second, obtains 100 MFCC results per second with 13 coefficients each, and then averages each coefficient over the second. The resulting 13 coefficient values per second are used as the audio features.
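The per-second averaging step described above can be sketched as follows. This is a minimal illustration, assuming the MFCC frames have already been computed by some audio library; the function name `average_mfcc_per_second` and the synthetic input are hypothetical and do not come from the patent.

```python
from statistics import fmean


def average_mfcc_per_second(frames, frames_per_second=100):
    """Average each MFCC coefficient over every full second of frames.

    frames: list of MFCC frames, each a list of 13 coefficients
            (assumed to come from an external MFCC routine).
    Returns one 13-value vector per second.
    """
    n_seconds = len(frames) // frames_per_second
    per_second = []
    for s in range(n_seconds):
        chunk = frames[s * frames_per_second:(s + 1) * frames_per_second]
        # zip(*chunk) regroups the chunk by coefficient index.
        per_second.append([fmean(coef) for coef in zip(*chunk)])
    return per_second


# Synthetic stand-in for real MFCC output: 2 seconds, 100 frames/s, 13 coefficients.
frames = [[float(k + t // 100) for k in range(13)] for t in range(200)]
audio_features = average_mfcc_per_second(frames)
```

Each entry of `audio_features` is the 13-value audio feature for one second of the broadcast.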

The extraction unit 191 may extract chat features from the chat data, which is the record of chat messages typed and sent by viewers of the video. Specifically, the extraction unit 191 according to this embodiment uses the entire chat data and applies character-level label encoding, which reduces the vector space while extracting features with performance comparable to one-hot encoding. Because label encoding needs a vector space of only size 1 to represent the labeled value of each character, the extraction unit 191 uses up to 1,000 characters per second, which covers 99% of the per-second chat characters, and thus extracts 1,000 values per second as the chat features.

The extraction unit 191 may also extract feature vectors from the chat data using natural language processing (NLP) tools. So that the feature vector also carries information about how many chat messages appeared within the same period, a specific character may be appended to the end of every chat message to mark where it ends.
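A minimal sketch of character-level label encoding for one second of chat follows. The patent does not specify the label mapping, the end-of-message character, or how short seconds are padded, so the choices below (code point plus one as the label, a newline as the end marker, zero padding) are assumptions made only for illustration.

```python
MAX_CHARS_PER_SECOND = 1000  # from the text: up to 1,000 characters per second
END_OF_CHAT = "\n"           # assumed end-of-message marker
PAD = 0                      # assumed padding label


def encode_chat_second(messages, max_len=MAX_CHARS_PER_SECOND):
    """Encode all chat messages of one second as a fixed-length label vector."""
    # Append the end marker to every message so the message count is recoverable.
    text = "".join(m + END_OF_CHAT for m in messages)
    # Label encoding: one integer per character; +1 keeps 0 free for padding.
    labels = [ord(ch) + 1 for ch in text[:max_len]]
    return labels + [PAD] * (max_len - len(labels))


vec = encode_chat_second(["gg", "wow!"])
```

Each character occupies a single slot, whereas a one-hot vector would need one slot per possible character, which is the space saving the text describes.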

Such a chat feature vector may include information about the number of chat messages, specific keywords, and the like, and all chat messages appearing within a given time interval may be represented as a single feature vector. Using feature vectors extracted from the video, audio, and chat data, rather than the data themselves, has the advantage of improving computation time and memory efficiency when training an artificial neural network on them.

Meanwhile, the connection unit 193 may preprocess the extracted video features, audio features, and chat features to a predetermined size through an LSTM model and concatenate them.

Before concatenating the raw features, the connection unit 193 may apply a smoothing technique to at least one of the features extracted by the extraction unit 191. The smoothing technique may be based on a moving average, in which the average of the preceding N data points is computed at each time point and used as the smoothed value.

In particular, the connection unit 193 according to this embodiment may apply the smoothing technique to the video features among the extracted raw features (video, audio, and chat features). The moving average may be computed as a convolution of the video features with a window array that stores the reciprocal of the window size, and the window size of the moving average is preferably set to the sequence length used in the multimodal learning-based model.
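The moving-average smoothing can be sketched as a convolution with a window array whose entries are all 1/N. This is only one possible reading: here the window covers the current sample and the N-1 samples before it, and samples before the start of the series are treated as zero. The function name and the sample data are hypothetical.

```python
def moving_average(values, window):
    """Trailing moving average computed as a convolution with a 1/window kernel."""
    kernel = [1.0 / window] * window  # window array storing the reciprocal of the window size
    smoothed = []
    for t in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:  # samples before the start are treated as zero
                acc += w * values[t - k]
        smoothed.append(acc)
    return smoothed


# A single score change at t = 3, smoothed with a window of 7 frames.
score_change = [0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0]
smoothed = moving_average(score_change, window=7)
```

After smoothing, the isolated event is spread over seven consecutive frames, which is how a single event can mark a continuous segment rather than one frame.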

In highlight detection, as in the present invention, events involve causal relationships across consecutive time frames, so a continuous video segment must be extracted around the time at which a specific event occurs. In particular, when individual features such as score changes are used, a mechanism is needed to effectively select consecutive frames based on a specific event; for this purpose, this embodiment applies a smoothing technique based on a moving average.

When concatenating the smoothed video features with the audio features and chat features, the connection unit 193 must integrate features of several data types that differ in size. The connection unit 193 according to this embodiment therefore does not use the raw features extracted by the extraction unit 191 directly; instead, it preprocesses them by resizing each raw feature to a fixed size through an LSTM model and then concatenates the features. Specifically, as shown in FIG. 4, each raw feature is resized to a size of 128 through the LSTM model. A sequence length of 7 frames may be used.

The connection unit 193 may also use the LSTM model to preserve sequential information between adjacent frames for all data types (video, audio, and chat). The features, preprocessed through the LSTM model in the connection unit 193 so that they have the same size and then concatenated, are passed to the multimodal learning-based model of the clip generation unit 195.

Meanwhile, the clip generation unit 195 may pass the features concatenated by the connection unit 193 through the multimodal learning-based model and select the frames classified as highlight frames as video clips to be included in the highlight video.

To this end, as shown in FIG. 5, the multimodal learning-based model of the present invention is based on a three-layer LSTM model and may consist of three hidden layers, each of size 128. The multimodal learning-based model adds a single fully connected (FC) layer to the LSTM model of the connection unit 193 so that the output of the LSTM model can be classified as a highlight frame or a non-highlight frame.

Here, frames are classified as highlight or non-highlight according to what the model has learned from the ground-truth videos used during multimodal deep learning training. That is, classification is performed based on the results of training with E-sports highlight videos uploaded online, or E-sports highlight videos entered separately in advance by an administrator, as the ground-truth videos. More specifically, for example, the vector values of the features extracted through the process described above are compared with the vector values of the footage classified as highlight frames in the ground-truth videos input for training; if a predetermined similarity is satisfied, the frame is classified as a highlight frame, and otherwise as a non-highlight frame.
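The similarity-based example in the preceding paragraph can be illustrated with a small sketch. The patent names neither the similarity measure nor the threshold, so cosine similarity and the value 0.9 are assumptions chosen only for illustration, as are the function names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_highlight(frame_vec, reference_vecs, threshold=0.9):
    """Label a frame as a highlight if it is close enough to any ground-truth highlight vector."""
    return any(cosine_similarity(frame_vec, ref) >= threshold for ref in reference_vecs)


# Hypothetical ground-truth highlight vector and two candidate frames.
references = [[1.0, 0.1]]
near = is_highlight([1.0, 0.0], references)
far = is_highlight([0.0, 1.0], references)
```

In the model described in the text, this decision is produced by the FC layer on top of the LSTM; the threshold comparison here only mirrors the worded example.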

As shown in FIG. 5, the clip generation unit 195, through the multimodal learning-based model, may use a sequence length of 23 frames, obtained as the average of the highlight frames for the data set, whereas 7 frames may be used when individual features are used for comparison. Through this, the clip generation unit 195 may select the frames classified as highlight frames as video clips to be included in the highlight video.

The highlight generation unit 197 may then join the video clips selected by the clip generation unit 195 to finally generate the highlight video.
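Clip selection and joining can be sketched as grouping consecutive highlight frames into clips and then concatenating the clips in order. The per-frame labels would come from the classifier; here they are hand-written, and the function names and the half-open (start, end) clip representation are assumptions for illustration.

```python
def frames_to_clips(labels):
    """Group consecutive highlight frames (label 1) into (start, end) clips, end exclusive."""
    clips, start = [], None
    for i, is_highlight in enumerate(labels):
        if is_highlight and start is None:
            start = i                      # a new clip begins
        elif not is_highlight and start is not None:
            clips.append((start, i))       # the current clip ends
            start = None
    if start is not None:
        clips.append((start, len(labels)))  # clip runs to the end of the video
    return clips


def build_highlight(labels):
    """Return the frame indices of the highlight video, clips joined in order."""
    return [i for s, e in frames_to_clips(labels) for i in range(s, e)]


labels = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
clips = frames_to_clips(labels)
highlight_frames = build_highlight(labels)
```

In a real pipeline the index ranges in `clips` would be handed to a video tool that cuts and joins the corresponding segments.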

FIG. 2 is a flowchart illustrating a method for automatically generating a highlight video according to an embodiment of the present invention, FIG. 3 is a diagram illustrating how extracted raw features are concatenated according to an embodiment of the present invention, FIG. 4 is a diagram illustrating how extracted raw features are preprocessed to a predetermined size and then concatenated according to an embodiment of the present invention, and FIG. 5 is a diagram illustrating the multimodal learning-based model of the present invention.

The method for automatically generating a highlight video according to an embodiment of the present invention is provided to automatically extract highlights from a video such as an E-sports game video or broadcast video, and may be performed by the automatic highlight extraction apparatus 10.

In the method according to this embodiment, the automatic highlight extraction apparatus 10 first receives a video such as a match video or a broadcast video (S110).

Video features are then extracted from the input video (S120).

In extracting video features, the automatic highlight extraction apparatus 10 according to this embodiment uses data that affects changes in the win rate within a match. Taking the game League of Legends (LoL) as an example, it may integrate all of the source data that affects the probability of winning or losing, such as gold earned, the number of towers destroyed, the number of opposing characters killed, and the number of Barons and Dragons taken by each team, and use the integrated data as video data from which to extract video features.

The automatic highlight extraction apparatus 10 then extracts audio features from the input video (S130).

To extract audio features, this embodiment may extract feature vectors of spectral information using a speech recognition technique such as Mel Frequency Cepstral Coefficients (MFCC), which does not analyze the entire input sound at once but divides it into fixed-length segments and analyzes the spectrum of each segment. More specifically, in this embodiment the original audio data is downsampled from a sampling rate of 44,100 to 4,410 samples per second, 100 MFCC results per second with 13 coefficients each are obtained, and each coefficient is then averaged over the second. The resulting 13 coefficient values per second are used as the audio features.

Chat features may then be extracted from the input video (S140). Chat feature extraction according to this embodiment uses the entire chat data and applies character-level label encoding instead of encoding into conventional one-hot vectors. Using label encoding greatly reduces the vector space while extracting features with performance comparable to one-hot encoding. In particular, conventional one-hot encoding converts every character of the chat data into an ASCII code to represent each character and requires a vector space for the binary representation of each ASCII code. In contrast, label encoding according to this embodiment requires a vector space of only size 1 to represent the labeled value of each character. Therefore, to maximize coverage of the chat characters, up to 1,000 characters per second were used, which covers 99% of the per-second chat characters, so 1,000 values per second are extracted as the chat features.

The automatic highlight extraction method according to this embodiment may further include applying a smoothing technique to at least one of the raw features extracted in the feature extraction steps described above (S150).

Specifically, in highlight detection, events carry causal relationships across consecutive time frames, so a continuous video segment must be extracted around the time at which a specific event occurs. In particular, when a discrete feature such as a score change is used, a mechanism is needed to effectively select consecutive frames based on the specific event.

Accordingly, this embodiment applies a smoothing technique based on a moving average, which at each time point computes the mean of the immediately preceding N data points and uses it as the smoothed value. The moving average can be computed as a convolution of the video feature with a window array that stores the reciprocal of the window size, and the window size is preferably set to the length of the sequence used in the multimodal learning-based model described later.
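
The moving average can be sketched as a convolution with a window array whose entries are all 1/N. The sketch below assumes that the window covers the current frame and the N-1 frames before it, and that frames before the start of the series count as 0; both choices are assumptions, as is the function name.

```python
def moving_average(values, window_size):
    """Trailing moving average computed as a convolution with a 1/N kernel."""
    kernel = [1.0 / window_size] * window_size
    smoothed = []
    for t in range(len(values)):
        total = 0.0
        for k, weight in enumerate(kernel):
            # Samples before the start of the series are treated as 0.
            if t - k >= 0:
                total += weight * values[t - k]
        smoothed.append(total)
    return smoothed


# A single score change at frame 7 is spread over the following frames.
score_feature = [0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0]
smoothed = moving_average(score_feature, window_size=7)
```

With a window of 7, the isolated event at frame 7 becomes a run of seven non-zero frames (7 to 13), which is what lets a single event select a continuous segment.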

In this embodiment, the smoothing technique was applied to the video feature among the extracted raw features. Comparing results before and after smoothing confirmed that the precision of the video feature increased from 0.707 to 0.726.

Because this shows a clear benefit from smoothing the video feature, the smoothing technique is assumed here to be applied to the video feature. However, the invention is not limited to this; the smoothing technique may also be applied to the audio feature or the chat feature as needed.

The extracted raw features may then be preprocessed to a predetermined size through an LSTM model and concatenated (S160).

The extraction unit 191 according to this embodiment captures one frame per second. Each feature is extracted for every frame from 1 to t, and each frame may be labeled 0 (non-highlight) or 1 (highlight). In the multimodal learning model of this embodiment, concatenation of features is written simply with the "+" symbol; for example, concatenating the video feature V, the audio feature A, and the chat feature C is written V+A+C.

Here, frames are classified as highlight or non-highlight based on the result of training with the ground-truth videos used in the multimodal deep-learning process. That is, classification is performed based on what was learned using, as ground-truth videos, E-sports highlight videos uploaded online or E-sports highlight videos entered separately by an administrator in advance. More specifically, for example, the vector values of the features extracted through the process described above are compared with the vector values of the frames classified as highlight frames in the ground-truth videos supplied for training; if a predetermined similarity is satisfied, the frame is classified as a highlight frame, and otherwise as a non-highlight frame.

As one way of concatenating the extracted features, the extracted video, audio, and chat features may be treated as raw features and concatenated directly, without separate preprocessing, as shown in FIG. 3. For each frame, this simply concatenates the raw features of the different data types (video, audio, and chat) extracted by the methods described above. FIG. 3 depicts this raw-feature concatenation process. As shown, V_SCORE, the feature for video data, consists of one element per frame; A_MFCC, the feature for audio data, consists of 13 MFCC elements per frame; and C_LABEL, the feature for chat data, consists of 1,000 elements per frame.
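
Raw-feature concatenation amounts to joining the three per-frame vectors end to end, giving 1 + 13 + 1,000 = 1,014 elements per frame. The following sketch uses placeholder values and a hypothetical function name.

```python
def concatenate_raw_features(v_score, a_mfcc, c_label):
    """Concatenate per-frame raw features in the order V_SCORE + A_MFCC + C_LABEL."""
    if not (len(v_score) == len(a_mfcc) == len(c_label)):
        raise ValueError("all modalities must cover the same number of frames")
    return [[v] + list(a) + list(c) for v, a, c in zip(v_score, a_mfcc, c_label)]


num_frames = 3
v_score = [0.0, 1.0, 0.5]                          # 1 element per frame
a_mfcc = [[0.1] * 13 for _ in range(num_frames)]   # 13 elements per frame
c_label = [[0] * 1000 for _ in range(num_frames)]  # 1,000 elements per frame
raw_frames = concatenate_raw_features(v_score, a_mfcc, c_label)
```

The chat feature accounts for 1,000 of the 1,014 elements, which illustrates how unevenly the modalities are represented when raw features are joined directly.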

Raw-feature concatenation has the advantage of using the raw features directly in the learning process, but it is limited for multimodal learning, which must integrate the features of several kinds of data with different characteristics. To overcome this limitation, the present invention preprocesses the extracted raw features by resizing them to a predetermined size through a Long Short-Term Memory (LSTM) model and then concatenates them.

FIG. 4 illustrates feature concatenation through the LSTM model of the present invention. As shown, each feature, originally of a different size, goes through preprocessing that resizes it to the same size of 128, so that each feature can contribute evenly to multimodal learning. Seven frames may be used as the sequence length, and the video feature fed to the LSTM model is the smoothed video feature described above.

In this embodiment, the LSTM model is used to preserve sequential information between adjacent frames for all data types (video, audio, and chat). Equation 1 below defines the feature extracted by the LSTM model for each raw feature, where D denotes the raw feature.

[Equation 1]

The features, preprocessed to the same size through the LSTM model and concatenated in step S160, are then processed by the multimodal learning-based model.

The automatic highlight extraction device 10 then selects the frames classified as highlight frames by the multimodal learning-based model as the video clips to be included in the highlight video (S170). Specifically, FIG. 5 shows the multimodal learning-based model of this embodiment. As shown, it is based on a three-layer LSTM model consisting of three hidden layers, each of size 128. The model adds a single fully connected (FC) layer to the LSTM model used to concatenate the extracted features, and classifies the LSTM output as a highlight or non-highlight frame. The multimodal learning-based model that uses all three features may use a sequence length of 23 frames, obtained as the average of the highlight frames over the data set, while 7 frames may be used when individual features are used for comparison.
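
An LSTM consumes sequences of consecutive frames, so the per-frame features have to be grouped into windows of the chosen sequence length. The sketch below shows one way to do this. The overlapping (stride-1) windows and the function name `make_sequences` are assumptions of this illustration, not details given in the text.

```python
def make_sequences(frame_features, seq_len):
    """Build overlapping sequences of seq_len consecutive frames.

    Sequence i covers frames i .. i + seq_len - 1. Pairing each sequence
    with the label of its last frame would be one option (an assumption).
    """
    return [frame_features[i:i + seq_len]
            for i in range(len(frame_features) - seq_len + 1)]


frame_features = [[float(t)] for t in range(30)]   # 30 frames, 1 value each
multimodal_sequences = make_sequences(frame_features, seq_len=23)
single_feature_sequences = make_sequences(frame_features, seq_len=7)
```

A longer sequence length yields fewer windows from the same video: 8 windows of 23 frames versus 24 windows of 7 frames for the 30 placeholder frames used here.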

Among the frames classified in this way, those classified as highlight frames are selected as the video clips to be included in the highlight video (S170).

The automatic highlight extraction device 10 may then generate the highlight video by joining the selected video clips (S180).
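
Selecting clips from per-frame labels and joining them can be sketched as grouping runs of consecutive highlight frames into time ranges. The function name and the inclusive (start, end) representation are hypothetical; cutting and joining the actual video is left to a video tool and is not shown.

```python
def frames_to_clips(labels):
    """Group consecutive highlight frames (label 1) into (start, end) clips.

    With one frame captured per second, indices double as seconds;
    end is inclusive.
    """
    clips = []
    start = None
    for t, label in enumerate(labels):
        if label == 1 and start is None:
            start = t                      # a new clip begins
        elif label != 1 and start is not None:
            clips.append((start, t - 1))   # the current clip ends
            start = None
    if start is not None:
        clips.append((start, len(labels) - 1))
    return clips


predicted = [0, 1, 1, 1, 0, 0, 1, 1, 0, 1]
clips = frames_to_clips(predicted)
highlight_seconds = sum(end - start + 1 for start, end in clips)
```

Joining the resulting ranges in order gives the highlight video; here three clips cover 6 of the 10 seconds.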

In this way, the method for automatically generating an E-sports highlight video according to this embodiment can extract highlights from an E-sports broadcast by using all three kinds of data: video, audio, and chat.

The method for automatically generating a highlight video of the present invention may be implemented in the form of program instructions executable by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination.

The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Examples of program instructions include not only machine code produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.

[Experimental Example]

For an objective performance evaluation of the automatic highlight extraction of this embodiment, 99 LoL World Cup matches held in 2019 were used. The videos of these matches total 200,759 seconds (about 55 hours), an average of 33 minutes per match.

Of these, 70 randomly selected match videos were used as the training data set, 14 as the validation data set, and 15 as the test data set.
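
A random 70/14/15 split of the 99 matches can be sketched as follows. The integer match identifiers, the fixed seed, and the function name are assumptions made so that the example is reproducible.

```python
import random


def split_matches(match_ids, n_train=70, n_val=14, seed=0):
    """Shuffle match identifiers and split them into train/validation/test."""
    shuffled = list(match_ids)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # the remaining matches
    return train, val, test


train, val, test = split_matches(range(99))
```

Splitting by match rather than by frame keeps all frames of one match in the same subset.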

To train the win-loss probability model used for comparison, this experiment additionally collected 10,810 game videos from 2016 to 2019 using the open API provided by Riot Games, the developer of LoL.

The LoL highlight videos provided by the YouTube channel Onivia were used as the ground-truth videos for assigning highlight labels. To extract timing information from the videos, the on-screen region showing the time was preprocessed and the Tesseract API, an open-source OCR engine, was applied. The highlight frames total 36,416 seconds, and each match contains 6 minutes of highlights on average. To address the data imbalance, 22,000 highlight frames and 22,000 non-highlight frames were randomly selected for training. The model was trained with a batch size of 32, 100 epochs, the stochastic gradient descent (SGD) optimizer, and a learning rate of 10^-2, with cross-entropy as the loss function. The F1 score, precision, and recall defined in Equations 2 to 4 below were used as evaluation metrics; the metrics were computed for each video and then averaged over all videos.
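
The balanced selection of training frames can be sketched as drawing the same number of frame indices from each class. The function name, the seed, and the small placeholder label list are assumptions; the experiment above drew 22,000 frames per class.

```python
import random


def balanced_sample(labels, per_class, seed=0):
    """Randomly pick per_class highlight and per_class non-highlight frame indices."""
    rng = random.Random(seed)
    highlight = [i for i, y in enumerate(labels) if y == 1]
    non_highlight = [i for i, y in enumerate(labels) if y == 0]
    # Sampling without replacement from each class separately.
    picked = rng.sample(highlight, per_class) + rng.sample(non_highlight, per_class)
    rng.shuffle(picked)
    return picked


# Placeholder labels with far more non-highlight than highlight frames.
labels = [1] * 300 + [0] * 1700
indices = balanced_sample(labels, per_class=220)
```

Sampling each class separately guarantees an exact 1:1 ratio regardless of how rare highlight frames are in the source videos.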

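[Equation 2]

$$\text{Precision} = \frac{|H_{gt} \cap H_{pred}|}{|H_{pred}|}$$

[Equation 3]

$$\text{Recall} = \frac{|H_{gt} \cap H_{pred}|}{|H_{gt}|}$$

[Equation 4]

$$F_1\text{-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$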

Here, H_gt is the ground truth, and H_pred is the result predicted by each model.
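
The evaluation metrics can be sketched by treating H_gt and H_pred as sets of highlight frame indices and using the usual set-based definitions of precision and recall. That reading, the function name, and the handling of empty sets are assumptions of this illustration.

```python
def highlight_metrics(h_gt, h_pred):
    """Precision, recall, and F1 over sets of highlight frame indices."""
    gt, pred = set(h_gt), set(h_pred)
    overlap = len(gt & pred)
    # Empty sets yield 0.0 instead of a division by zero (an assumption).
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gt) if gt else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Ground truth covers seconds 10-19, the prediction covers seconds 15-24.
precision, recall, f1 = highlight_metrics(h_gt=range(10, 20), h_pred=range(15, 25))
```

In the experiment, such values would be computed for each video and then averaged over all videos.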

Table 1 below shows the performance of the models proposed in prior studies. Because the data sets differ, the results cannot be compared fairly with those reported in the earlier studies, but the overall trend is confirmed to be similar. The purpose of this experiment is to show the effect of multimodal learning based on features extracted from the individual baseline models.

[Table 1]

| Features | Precision | Recall | F1-score |
|---|---|---|---|
| V_WINLOSS | 0.610 | 0.768 | 0.673 |
| A_EMD | 0.210 | 0.436 | 0.278 |
| C_1HOT | 0.318 | 0.325 | 0.321 |

V_WINLOSS is a conventional approach that detects points where the win probability computed by a conventional win-probability prediction model changes sharply and extracts them as video features to extract highlights. A_EMD is a conventional approach that applies Empirical Mode Decomposition (EMD) to decompose the audio data in the match video and remove unimportant data, extracting audio features from which highlights are extracted. C_1HOT is a conventional approach that labels each frame of the chat data as highlight (1) or non-highlight (0) and obtains a one-hot encoding of the input characters, extracting chat features from which highlights are extracted.

Table 2 below shows the results of models that use the individual features of the highlight extraction method of the present invention. For the performance evaluation, the LSTM-based multimodal learning model, with features concatenated through the LSTM model described above, was used consistently for every feature so as to focus on the performance of the proposed feature extraction. The models using A_MFCC and C_LABEL improve the F1 score by 90.68% and 7.79%, respectively, over the models using the existing features in Table 1. For V_SCORE, the F1 score is 17.04% lower than for V_WINLOSS, but Table 3, described later, shows that the multimodal model using V_SCORE outperforms all of the previous studies. Compared with the other features, and V_WINLOSS in particular, the accuracy of V_SCORE is very high, so performance improves when it is combined with other features in multimodal learning.

[Table 2]

| Features | Precision | Recall | F1-score |
|---|---|---|---|
| V_SCORE | 0.726 | 0.481 | 0.575 |
| A_MFCC | 0.536 | 0.554 | 0.530 |
| C_LABEL | 0.237 | 0.630 | 0.346 |

Table 3 below shows the results of the multimodal learning model of the present invention. The model of the present invention, which concatenates the features extracted for each data type (video, audio, and chat) through LSTM, was compared with the prior studies and with multimodal learning based on them.

First, raw-feature concatenation achieves an F1 score of 0.546, which is 5.04% lower than the model using V_SCORE, the best-performing individual model among those using the features of the present invention. This means that simple concatenation of raw features is not effective for multimodal learning because of the non-uniform characteristics of the different features.

Second, feature concatenation through LSTM greatly improves on raw-feature concatenation: it raises the F1 score by 32.23% over raw-feature concatenation and by 7.28% over the conventional win-probability prediction model V_WINLOSS. The multimodal learning-based model of the present invention thus improves both the precision and the recall of V_WINLOSS for E-sports highlights.

In addition, to verify the effect of the features extracted by the present invention, a multimodal learning model was built by combining the features used in the conventional V_WINLOSS, A_EMD, and C_1HOT. For this, V_WINLOSS, A_EMD, and C_1HOT were concatenated into a single feature and trained with multimodal learning using feature concatenation through LSTM as described above. The results show that the model of the present invention improves on the precision and recall of the multimodal learning model that uses the existing features, with a meaningful performance difference.

[Table 3]

| Features | Precision | Recall | F1-score |
|---|---|---|---|
| V_SCORE + A_MFCC + C_LABEL | 0.525 | 0.611 | 0.546 |
| LSTM(V_SCORE) + LSTM(A_MFCC) + LSTM(C_LABEL) | 0.637 | 0.838 | 0.722 |
| V_WINLOSS | 0.610 | 0.768 | 0.673 |
| V_WINLOSS + A_EMD + C_1HOT | 0.608 | 0.818 | 0.691 |
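
The percentages quoted for Table 3 are relative differences between F1 scores and can be checked with a short calculation. The helper name `relative_change` is hypothetical; the F1 values are taken from Tables 2 and 3.

```python
def relative_change(new, old):
    """Relative change of new with respect to old, in percent."""
    return (new - old) / old * 100.0


f1_raw_concat = 0.546    # V_SCORE + A_MFCC + C_LABEL
f1_lstm_concat = 0.722   # LSTM(V_SCORE) + LSTM(A_MFCC) + LSTM(C_LABEL)
f1_winloss = 0.673       # V_WINLOSS
f1_v_score = 0.575       # V_SCORE alone (Table 2)

gain_over_raw = relative_change(f1_lstm_concat, f1_raw_concat)
gain_over_winloss = relative_change(f1_lstm_concat, f1_winloss)
raw_vs_v_score = relative_change(f1_raw_concat, f1_v_score)
```

Rounded to two decimals, these give 32.23, 7.28, and -5.04, matching the figures stated for LSTM-based concatenation and for raw-feature concatenation.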

A demo video (https://url.kr/ey8bvz) was also produced to show the entire highlight detection process for a LoL Twitch broadcast using the highlight extraction method according to an embodiment of the present invention. For this purpose, a website was built that receives information about the input video from the user and generates the video's highlights. The user only needs to enter a Twitch video link and simple metadata (for example, game name and video start time) for highlight detection to run automatically. For a video about 30 minutes long, it took less than 4 minutes on a machine with a 3.50 GHz CPU, 64 GB of memory, and an RTX 2080 Ti GPU.

Although various embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications may of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or prospect of the present invention.

10: automatic highlight video generation device 11: input unit
13: communication unit 15: memory unit
17: output unit 19: processor
191: extraction unit 193: connection unit
195: clip generation unit 197: highlight generation unit

Claims

A method for automatically generating a highlight video, performed by an apparatus for automatically generating a highlight video, the method comprising:
receiving an E-sports broadcast video as input;
extracting raw features from the input video, wherein the raw features include a video feature, an audio feature, and a chat feature;
preprocessing the extracted raw features to a predetermined size and concatenating the preprocessed video feature, audio feature, and chat feature;
classifying highlight frames from the concatenated video feature, audio feature, and chat feature;
selecting the classified highlight frames as video clips to be included in a highlight video; and
generating the highlight video by joining the selected video clips,
wherein the extracting of the raw features
extracts the video feature from video data included in the input video, the video data being a combination of source data that is information affecting the win-loss probability,
wherein the method further comprises applying a smoothing technique to at least one of the extracted raw features before the concatenating, and
wherein the applying of the smoothing technique
applies a moving-average-based smoothing technique to the video feature so as to extract a continuous video segment based on the time at which a specific event, including a score change, occurs.

The method according to claim 1,
wherein the concatenating of the preprocessed video feature, audio feature, and chat feature
resizes the video feature, audio feature, and chat feature included in the extracted raw features to the same size through a Long Short-Term Memory (LSTM) model and concatenates them.

The method according to claim 2,
wherein the classifying of highlight frames from the concatenated video feature, audio feature, and chat feature
adds a single fully connected (FC) layer to the LSTM model and classifies the concatenated video feature, audio feature, and chat feature as highlight frames or non-highlight frames through the LSTM model.

(Deleted)

An apparatus for automatically generating a highlight video, comprising:
an input unit that receives an E-sports broadcast video;
an extraction unit that extracts raw features from the input video, wherein the raw features include a video feature, an audio feature, and a chat feature;
a connection unit that preprocesses the extracted raw features to a predetermined size and concatenates the preprocessed video feature, audio feature, and chat feature;
a clip generation unit that classifies highlight frames from the concatenated video feature, audio feature, and chat feature and selects the classified highlight frames as video clips to be included in a highlight video; and
a highlight generation unit that generates the highlight video by joining the selected video clips,
wherein the extraction unit
extracts the video feature from video data included in the input video, the video data being a combination of source data that is information affecting the win-loss probability,
wherein the connection unit
applies a smoothing technique to at least one of the extracted raw features before performing the concatenation, and
wherein the connection unit
applies a moving-average-based smoothing technique to the video feature so as to extract a continuous video segment based on the time at which a specific event, including a score change, occurs.