KR20200092502A

KR20200092502A - Apparatus and method for generating a highlight video using chat data and audio data

Info

Publication number: KR20200092502A
Application number: KR1020190004112A
Authority: KR
Inventors: 이계민; 김은율
Original assignee: 서울과학기술대학교 산학협력단
Priority date: 2019-01-11
Filing date: 2019-01-11
Publication date: 2020-08-04
Also published as: KR102153211B1

Abstract

An objective of the present invention is to provide an apparatus that automatically generates a highlight video. The apparatus, which uses chat data and audio data created while a video is provided, comprises: a separation unit that separates data extracted from the video into a first time interval and a second time interval longer than the first to generate first segment data and second segment data; an extraction unit that extracts a first feature vector from the first segment data and a second feature vector from the second segment data; a learning unit that learns the first and second feature vectors through a recurrent neural network (RNN) and uses the learned results to generate a first result vector and a second result vector; a prediction unit that generates, from the first and second result vectors, a score vector of the probability of being predicted as a highlight; and a determination unit that uses the score vector generated by the prediction unit to generate a final highlight.

Description

Apparatus and method for generating a highlight video using chat data and audio data {APPARATUS AND METHOD FOR GENERATING A HIGHLIGHT VIDEO USING CHAT DATA AND AUDIO DATA}

The present invention relates to an apparatus and method for generating a highlight video using chat data and audio data, and more specifically to an apparatus and method that predict the probability that the video in a given time interval is a highlight and then identify and generate the final highlight video.

Machine learning is a field of artificial intelligence concerned with developing algorithms and techniques that enable computers to learn. Implementing machine learning requires layers, and various kinds are in use, such as the perceptron, the convolutional neural network (CNN), and the recurrent neural network (RNN).

The perceptron is a type of artificial neural network: a classification algorithm that makes predictions based on a linear predictor function combining a set of weights with a feature vector. It is an algorithm for supervised learning of a binary classifier, a function that decides whether an input represented as a vector of numbers belongs to a specific class.

A multi-layer perceptron (MLP) is a feed-forward artificial neural network made up of at least three layers: an input layer, an output layer, and a hidden layer. It is trained with a supervised learning technique called back-propagation. Unlike the perceptron, it can distinguish data that are not linearly separable.

A recurrent neural network (RNN) is a type of artificial neural network in which the connections between units form a cyclic structure. This structure lets the network store state internally so that it can model time-varying dynamic behavior. Unlike a feed-forward neural network, an RNN can use its internal memory to process sequential input. An RNN can therefore process data with time-varying characteristics, as in handwriting recognition or speech recognition. Long short-term memory (LSTM) is one type of RNN.

A long short-term memory (LSTM) unit consists of a cell, an input gate, an output gate, and a forget gate. An RNN composed of LSTM units is called an LSTM network. Because important events in a time series may be separated by lags of unknown duration, LSTM networks are well suited to classifying, processing, and making predictions from time-series data.

Chat data and audio data generated while a video is streamed are time-series data with time-varying characteristics, so they can be processed by an RNN. For an RNN to process the data properly, however, the features of the input data must be clear.

Mel-frequency cepstral coefficients (MFCC) are commonly used to extract feature values from audio data, and natural language processing (NLP) algorithms are commonly used to extract feature values from text data such as chat logs.

MFCC is a technique for extracting the features of a sound. Rather than operating on the entire input sound, it divides the sound into fixed intervals and analyzes the spectrum of each interval to extract features.

Natural language processing (NLP) algorithms are computational techniques for automating the analysis and representation of human language. fastText, one such NLP tool, is a library for learning word embeddings and text classification.

Meanwhile, the transmission and relaying of videos such as sports matches over broadcasting platforms is increasing. Those who transmit or relay such a video gather the parts that viewers are likely to find interesting and produce a highlight video, and viewers commonly browse highlight videos to find the video they want.

Providing highlight videos has thus become a general trend and demand for producing them is growing, but small-scale broadcasts such as ordinary personal broadcasts have difficulty producing highlight videos because of time, cost, and technical constraints.

To overcome this, methods have been disclosed and used that extract highlights from the intensity of the audio signal contained in a video (Japanese Patent Publication JP.H08079674.A) or from the signal energy level of the audio track (U.S. Patent US 6,973,256 B1). When a sports or e-sports match is broadcast, the parts where the broadcaster's voice or the crowd's cheering grows louder are likely to be highlights. Depending on how the match is relayed, however, the sound may be absent or faint, and at tense moments the audience may watch in silence. Judging whether a moment is a highlight from the intensity or energy density of the audio signal alone is therefore inaccurate, and there is a growing need to use more information to predict highlights.

FIG. 1 shows an example screen of an Internet personal broadcast according to the present invention.

Referring to FIG. 1, a personal broadcast is provided through the display of a computer device. The screen is divided into an area 110 in which the relayed video is shown, a chat area 120 in which viewers' comments are posted, and an area 130 showing video of the broadcaster, such as a creator, streamer, or broadcasting jockey, who runs the personal broadcast. Viewers watching the relayed video can communicate with the broadcaster through the chat area 120.

Accordingly, audio data such as the sound within the relayed video and the broadcaster's voice are generated as the video progresses, and chat data are likewise generated as the video progresses.

Korean Patent Publication No. 10-2005-0003071 (2005.01.10)
Korean Patent Publication No. 10-2014-0056618 (2014.05.12)
Korean Registered Patent No. 10-1867082 (2018.6.5)
Japanese Patent Publication JP.H08079674.A (1996.03.22)
U.S. Patent US 6,973,256 B1 (2005.12.06)

The technical problem addressed by the present invention arises from this background. An object of the invention is to provide an apparatus that splits the chat data and audio data generated during a personal broadcast into fixed time units, trains a recurrent neural network (RNN) on them, and automatically generates a highlight video that takes the preceding and following context into account.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above object, an apparatus for generating a highlight video using chat data and audio data generated while a video is provided, according to an embodiment of the present invention, may comprise: a separation unit that separates data generated while the video is provided into a first time interval and a second time interval longer than the first time interval to generate first segment data and second segment data; an extraction unit that extracts a first feature vector from the first segment data of the separation unit and a second feature vector from the second segment data of the separation unit; a learning unit that learns the first feature vector and the second feature vector of the extraction unit through a recurrent neural network (RNN) and uses the learned result to generate a first result vector and a second result vector; a prediction unit that generates, from the first result vector and the second result vector of the learning unit, a score vector of the probability of being predicted as a highlight; and a determination unit that generates a final highlight using the score vector generated by the prediction unit.

In one embodiment of the present invention, the data generated while the video is provided may be either chat data or audio data alone, or may include both.

In one embodiment of the present invention, the extraction unit may use a natural language processing (NLP) tool to extract the feature vector of the chat data.

In one embodiment of the present invention, the extraction unit may use mel-frequency cepstral coefficients (MFCC) to extract the feature vector of the audio data.

In one embodiment of the present invention, the learning unit may include: a forward learning unit that generates a result vector for the first feature vector and the second feature vector of the extraction unit using information from previous time intervals; and a short-term backward learning unit that generates a result vector for the first feature vector and the second feature vector of the extraction unit using information from subsequent time intervals.

In one embodiment of the present invention, the recurrent neural network (RNN) may use long short-term memory (LSTM).

In one embodiment of the present invention, the prediction unit may use a multi-layer perceptron (MLP) to generate the score vector.

In one embodiment of the present invention, the determination unit may select highlight sections from the score vector generated by the prediction unit in descending order of probability, up to a preset range, set them to a preset length as the final highlights, and generate the highlight video.
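The selection performed by the determination unit can be illustrated with a minimal Python sketch. The text does not spell out how the "preset range" and "preset length" are applied, so the reading below is an assumption: the `top_k` highest-scoring time steps are chosen, each is expanded to a clip of `clip_len` seconds centred on it, and overlapping clips are merged. The function name, parameters, and score values are all hypothetical.

```python
def select_highlights(scores, top_k, clip_len, step_s=1):
    """Pick the top_k highest-scoring time steps and build clips around them.

    scores[i] is the predicted highlight probability of step i, and each
    step spans step_s seconds. Every selected step is expanded to a clip of
    clip_len seconds centred on it; overlapping clips are merged (assumed
    behaviour, not stated in the source).
    Returns a time-ordered list of (start_s, end_s) tuples.
    """
    total = float(len(scores) * step_s)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    clips = []
    for i in sorted(ranked[:top_k]):
        centre = i * step_s + step_s / 2
        start = max(0.0, centre - clip_len / 2)
        end = min(total, start + clip_len)
        if clips and start <= clips[-1][1]:
            # Merge with the previous clip when the two overlap or touch.
            clips[-1] = (clips[-1][0], max(clips[-1][1], end))
        else:
            clips.append((start, end))
    return clips


# Hypothetical per-second scores from the prediction unit.
scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1]
clips = select_highlights(scores, top_k=3, clip_len=2)
```

Merging is one possible design choice; it avoids repeating the same footage when two high-scoring steps sit next to each other, at the cost of producing some clips longer than `clip_len`.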

To achieve the above object, a method for generating a highlight video using chat data and audio data generated while a video is provided, according to an embodiment of the present invention, may comprise: separating data extracted from the video into a first time interval and a second time interval longer than the first time interval to generate first segment data and second segment data; extracting a first feature vector from the first segment data and a second feature vector from the second segment data; learning the first feature vector and the second feature vector through a recurrent neural network (RNN) and using the learned result to generate a first result vector and a second result vector; generating, from the first result vector and the second result vector, a score vector of the probability of being predicted as a highlight; and generating a final highlight using the score vector.

According to the aspect of the present invention described above, the apparatus and method for generating a highlight video using chat data and audio data allow a broadcaster running a personal broadcast to have highlight videos generated automatically by a device that has learned the features of chat data and audio data through an artificial neural network. The broadcaster can therefore save time and cost, and viewers benefit from being offered a variety of highlight videos.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention belongs.

FIG. 1 shows an example screen of an Internet personal broadcast according to the present invention.
FIG. 2 is a schematic block diagram of an apparatus for generating a highlight video using chat data and audio data according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the data flow inside a cell of a general long short-term memory (LSTM).
FIG. 4 is a detailed block diagram showing the data flow in the single-data highlight prediction model (S-BiLSTM).
FIG. 5 is a simplified block diagram showing the data flow in the single-data highlight prediction model (S-BiLSTM).
FIG. 6 is a block diagram showing the data flow in the multi-data highlight prediction model (M-BiLSTM).
FIG. 7 is a block diagram showing the data flow in the single-data highlight prediction model for multiple time intervals.
FIG. 8 is a block diagram showing the data flow in the multi-data highlight prediction model for multiple time intervals.
FIG. 9 is a block diagram showing a method of generating highlights in the multi-data highlight prediction model for multiple time intervals.

The following detailed description of the present invention refers to the accompanying drawings, which illustrate specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention differ from one another but need not be mutually exclusive. For example, a specific shape, structure, or characteristic described here in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, properly construed, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout.

Hereinafter, a highlight video generation apparatus according to a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 2 is a schematic block diagram of an apparatus for generating a highlight video using chat data and audio data according to an embodiment of the present invention.

Referring to FIG. 2, the highlight video generation apparatus 200 using chat data and audio data according to an embodiment of the present invention may include a separation unit 210, an extraction unit 220, a learning unit 230, a prediction unit 240, and a determination unit 250.

The highlight video generation apparatus 200 may have software (an application) for performing highlight video generation installed and running on it, and the separation unit 210, extraction unit 220, learning unit 230, prediction unit 240, and determination unit 250 may be controlled by the software running on the apparatus 200.

In the present invention, chat data are the conversation messages typed and sent by viewers of the video, and audio data are sounds such as the broadcaster's voice and the cheering of the on-site audience.

The separation unit 210 separates the data generated while the video is provided into a first time interval and a second time interval, generating first segment data and second segment data for use in highlight generation.

Here, the data generated while the video is provided are chat data and audio data. Segment data may be generated for either one of them or for both.

When only one of chat data and audio data is used, two kinds of segment data are generated in total (first segment data and second segment data). When both are used, two kinds are generated for each, for a total of four: first and second segment data for the chat data, and first and second segment data for the audio data.

The first time interval is one second, or a duration on the order of seconds, and the second time interval is longer than the first: one minute, or a duration on the order of minutes. Both may be made longer or shorter depending on the implementation, but the second time interval always remains longer than the first.
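As a rough illustration of the separation step, the sketch below groups timestamped events into fixed-length windows twice, once per interval. The `segment` helper, the `(timestamp, payload)` event layout, and the sample chat log are assumptions made for the example; the 1-second and 60-second values follow the intervals suggested in the text.

```python
from collections import defaultdict


def segment(events, interval_s):
    """Group (timestamp_seconds, payload) events into fixed-length windows.

    Returns a dict mapping window index k to the list of payloads whose
    timestamp falls in the half-open span [k * interval_s, (k + 1) * interval_s).
    """
    windows = defaultdict(list)
    for timestamp, payload in events:
        windows[int(timestamp // interval_s)].append(payload)
    return dict(windows)


# Hypothetical chat log: (seconds since the start of the stream, message).
chat = [(0.4, "gg"), (0.9, "wow"), (1.2, "nice"), (61.0, "lol")]

first_segments = segment(chat, interval_s=1)    # first time interval (about 1 s)
second_segments = segment(chat, interval_s=60)  # second time interval (about 1 min)
```

The same helper would apply to audio if each event carried a block of samples instead of a message.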

The extraction unit 220 may extract feature vectors from the segment data generated by the separation unit 210.

For the first segment data and second segment data of the chat data, the extraction unit 220 may use a natural language processing (NLP) tool such as fastText to extract a first feature vector (x_t) and a second feature vector (x_t). This is expressed as Equation 1.

[Equation 1]

To also capture how many chat messages appear within the same time span, a specific character may be appended to the end of every chat message to mark where it ends. The chat feature vector may contain information such as the number of messages and specific keywords. All the chat messages that appear within a given time interval can be represented as a single feature vector.
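The end-of-chat marking can be sketched as follows. The marker string `<eoc>` and both helper names are hypothetical, since the text does not name the character used; the point is only that joining the messages of one window into a single text keeps the message count recoverable.

```python
END_OF_CHAT = "<eoc>"  # hypothetical marker; the source does not name the character


def window_text(messages, marker=END_OF_CHAT):
    """Join all messages of one time window, closing each with the marker."""
    return " ".join(f"{message} {marker}" for message in messages)


def chat_count(text, marker=END_OF_CHAT):
    """Recover the number of messages in a window from its marker count."""
    return text.split().count(marker)


text = window_text(["gg", "what a play", "wow"])
count = chat_count(text)
```

The resulting text for each window could then be passed to an embedding tool to obtain the single feature vector for that window.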

For the first segment data and second segment data of the audio data, the extraction unit 220 may use a speech recognition technique such as mel-frequency cepstral coefficients (MFCC) to extract a first feature vector (x_t) and a second feature vector (x_t) of the spectral information. This is expressed as Equation 2.

[Equation 2]

The feature vector of the audio data may contain information such as the loudness and tone of the commentator's voice and the speaking rate.

When training with a recurrent neural network (RNN), using the feature vector (x_t) extracted from the chat data (chat_t) and audio data (audio_t), rather than the raw data themselves, has the advantage of improving computation time and memory efficiency.

The learning unit 230 learns the first feature vector and second feature vector of the chat data (chat_t) through a recurrent neural network (RNN) and uses the learned result to generate a first result vector and a second result vector. The same process generates a first result vector and a second result vector for the audio data (audio_t).

To exploit the ability of an RNN to extract information from sequential data, the learning unit 230 may include a forward learning unit 231, which learns in the order past, present, future, and a backward learning unit 233, which learns in the order future, present, past.

The forward learning unit 231 learns information from earlier time steps and uses it to generate the current result vector, and the backward learning unit 233 learns information from later time steps and uses it to generate the current short-term result vector.
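The forward and backward passes can be illustrated with a toy example. This is a minimal sketch that uses a scalar tanh recurrent cell in place of an LSTM and shares one pair of weights between both directions; the weights, function names, and feature sequence are all assumptions chosen to keep the example short, and a practical model would normally train separate parameters for each direction.

```python
import math


def rnn_pass(inputs, w_x, w_h, reverse=False):
    """Run a scalar tanh recurrent cell over the sequence.

    With reverse=False the state at step t summarizes inputs[0..t];
    with reverse=True it summarizes inputs[t..end]. Outputs are always
    returned in the original time order.
    """
    order = reversed(range(len(inputs))) if reverse else range(len(inputs))
    outputs = [0.0] * len(inputs)
    state = 0.0
    for t in order:
        state = math.tanh(w_x * inputs[t] + w_h * state)
        outputs[t] = state
    return outputs


def bidirectional(inputs, w_x=0.5, w_h=0.8):
    """Pair the forward and backward state for every time step."""
    forward = rnn_pass(inputs, w_x, w_h)
    backward = rnn_pass(inputs, w_x, w_h, reverse=True)
    return list(zip(forward, backward))


# Hypothetical one-dimensional feature sequence with a single spike at t = 2.
features = [0.0, 0.0, 1.0, 0.0]
result = bidirectional(features)
```

At t = 0 the forward state is still zero because nothing has happened yet, while the backward state is already positive because it has seen the later spike. This is the sense in which the pair of passes captures both preceding and following context.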

The forward learning unit 231 and the backward learning unit 233 are each made up of a recurrent neural network (RNN); each term may refer to a single unit or to all of the units collectively.

To overcome the vanishing gradient problem, in which learning ability degrades sharply as the distance between individual units grows, the RNN may use long short-term memory (LSTM).

The LSTM may consist of a forget gate, an input gate, and an output gate. Each gate performs its computation in a defined way and produces intermediate vectors, which may include the forget vector 332 of the forget gate, the input vector 334 of the input gate, and the output vector 338 of the output gate. The LSTM may also be designed with variously modified structures.

FIG. 3 is a block diagram showing the data flow inside a cell of a long short-term memory (LSTM).

Referring to FIG. 3, the computation uses the feature vector 310, the result vector 330 of the previous step, and the cell state vector 350 of the previous step.

Each gate selects the flow of information and is composed of sigmoid layers (σ; 331, 333, 337), multiplication operations (351, 353, 359), an addition operation (355), and hyperbolic tangent operations (tanh; 335, 357).

The sigmoid layers (σ; 331, 333, 337) output values between 0 and 1, which determine how much influence each component has.

The cell state vector 360 is the state of the cell at the current time point. It runs through the entire chain and is the refined result of adding information to, or removing information from, the cell through the gates.

The LSTM 300 may protect and regulate the cell state vector 360 using the forget gate 331, the input gate 333, and the output gate 337.

The forget gate 331 refers to the feature vector 310 and the result vector 330 of the previous time interval to decide whether information is kept or removed.

The forget vector 332 is obtained by applying the weight (w_f) of the corresponding layer to the feature vector 310 and the result vector 330 of the previous time interval, adding the bias vector (b_f) as an offset, and passing the sum through the sigmoid layer 331. This is expressed as Equation 3.

[Equation 3]

The input gate 333 examines the feature vector 310 and the result vector 330 of the previous time interval and decides which new information to store in the cell state vector 360.

The input vector 334 is obtained by applying the weight (w_i) of the input layer to the feature vector 310 and the result vector 330 of the previous time interval, adding the bias vector (b_i) as an offset, and passing the sum through the sigmoid layer 333. This is expressed as Equation 4.

[Equation 4]

The candidate vector 336 may contain new candidate values to be added to the cell state vector 360. It is obtained by applying the weight (w_c) of the input layer to the result vector 330 of the previous time interval, adding the bias vector (b_c) as an offset, and passing the sum through the hyperbolic tangent layer (tanh layer, 335). This is expressed as Equation 5.

[Equation 5]

The cell state vector 360 is obtained by summing (355) two products: the vector product (351) of the previous cell state vector 350 and the forget vector 332 of the forget gate 331, and the vector product (353) of the input vector 334 of the input gate and the candidate vector 336. This is expressed as Equation 6.

[Equation 6]

The output gate 337 examines the feature vector 310 and the result vector 330 of the previous time interval and decides which part of the cell state to output.

The output vector 338 is obtained by applying the weight (w_o) of the output layer to the feature vector 310 and the result vector 330 of the previous time interval, adding the bias vector (b_o) as an offset, and passing the sum through the sigmoid layer 337. This is expressed as Equation 7.

[Equation 7]

Finally, the result vector (340, 380) is the output of the LSTM 300 and determines what is produced as the final result. It is obtained by passing the cell state vector 360 through the hyperbolic tangent layer 357 and taking the vector product (359) of that result and the output vector 338. This is expressed as Equation 8.

[Equation 8]
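The equation images for Equations 3 to 8 are not reproduced in this text. Read against the descriptions above, they appear to correspond to the standard LSTM update, sketched below. Only the weights and biases (w_f, b_f, w_i, b_i, w_c, b_c, w_o, b_o) are named in the text; the symbols x_t (feature vector 310), h_{t-1} (previous result vector 330), c_{t-1} (previous cell state vector 350), f_t, i_t, \tilde{c}_t, o_t, and \odot (element-wise product) are notational assumptions. The description of the candidate vector mentions only the previous result vector as input, whereas the standard form shown here also takes x_t.

```latex
\begin{aligned}
f_t &= \sigma\!\left(w_f \cdot [h_{t-1}, x_t] + b_f\right) && \text{(Equation 3, forget vector 332)} \\
i_t &= \sigma\!\left(w_i \cdot [h_{t-1}, x_t] + b_i\right) && \text{(Equation 4, input vector 334)} \\
\tilde{c}_t &= \tanh\!\left(w_c \cdot [h_{t-1}, x_t] + b_c\right) && \text{(Equation 5, candidate vector 336)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(Equation 6, cell state vector 360)} \\
o_t &= \sigma\!\left(w_o \cdot [h_{t-1}, x_t] + b_o\right) && \text{(Equation 7, output vector 338)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(Equation 8, result vector 340, 380)}
\end{aligned}
```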

The process described above is for the forward direction, which uses the result vector 330 and the cell state vector 350 of the previous time interval to generate the result vector (340, 380) at the current time point. The backward direction differs only in that it uses the result vector (h_t+1) and the cell state vector (c_t+1) of the subsequent time interval; otherwise the same process is performed. This process may be performed in the same way in the forward learning unit 231 and the backward learning unit 233 of the learning unit 230.
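As an illustration of the forward and backward passes described above, the following is a minimal pure-Python sketch of a bidirectional LSTM. It assumes the standard LSTM cell update; the function names (`lstm_step`, `run_direction`, `bilstm`), the parameter dictionary layout, the random initialisation, and the sizes used are all hypothetical and are not taken from the patent.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(w, b, v):
    """Return w @ v + b for a weight matrix w (list of rows) and bias b."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One cell update: forget, input, candidate, and output stages."""
    z = h_prev + x  # list concatenation [h_prev, x]
    f = [sigmoid(a) for a in affine(p["w_f"], p["b_f"], z)]    # forget vector
    i = [sigmoid(a) for a in affine(p["w_i"], p["b_i"], z)]    # input vector
    g = [math.tanh(a) for a in affine(p["w_c"], p["b_c"], z)]  # candidate vector
    o = [sigmoid(a) for a in affine(p["w_o"], p["b_o"], z)]    # output vector
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def init_params(input_size, hidden_size, rng):
    """Random weights and zero biases for the gates f, i, c, o (illustrative only)."""
    p = {}
    for gate in "fico":
        p["w_" + gate] = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
                          for _ in range(hidden_size)]
        p["b_" + gate] = [0.0] * hidden_size
    return p

def run_direction(xs, p, hidden_size, reverse=False):
    """Run the cell over the sequence, from the start or (reverse=True) from the end."""
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    out = [None] * len(xs)
    for t in order:
        h, c = lstm_step(xs[t], h, c, p)
        out[t] = h
    return out

def bilstm(xs, p_fwd, p_bwd, hidden_size):
    """Concatenate forward and backward result vectors for each time interval."""
    fwd = run_direction(xs, p_fwd, hidden_size)
    bwd = run_direction(xs, p_bwd, hidden_size, reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]

rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]  # 5 intervals, 3 features
p_fwd = init_params(3, 4, rng)
p_bwd = init_params(3, 4, rng)
hs = bilstm(xs, p_fwd, p_bwd, hidden_size=4)
```

In this sketch the backward pass reuses the same cell update and only reverses the iteration order, which mirrors the statement that the two directions differ only in which neighbouring interval supplies the result and cell state vectors.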

The prediction unit 240 uses the information in the result vector obtained from the learning unit 230 to judge whether the time interval from which that result vector was obtained can be predicted as a highlight, and obtains a score vector (S_t), which is a probability vector. The procedure differs across four cases: for a single time interval, using only one of chat data or audio data (FIGS. 4 and 5) or using both (FIG. 6); and for multiple time intervals, using only one of chat data or audio data (FIG. 7) or using both (FIG. 8). Each case is described separately below.

FIG. 4 is a detailed block diagram showing the flow of data in the single-data highlight prediction model (S-BiLSTM).

FIG. 4 shows the process of generating the score vector (S_t) of predicted highlight probabilities for the feature vectors (X110, X130, X150) of the respective time intervals. The probability that the video in the first time interval is predicted as a highlight is obtained from the result vector (h₁), which is computed by passing the feature vector of the first time interval (X110, X₁) through the first forward LSTM (F110) and the backward LSTM (B110). Applying the same process to all input intervals (X110, X₁; X130, X₂; X150, X_n) through the forward LSTMs (F110, F130, F150) and the backward LSTMs (B110, B130, B150) yields the result vectors (h₁, h₂, h_n), from which the score vector (S_t, S110) is obtained.

FIG. 5 is a simplified block diagram showing the flow of data in the single-data highlight prediction model (S-BiLSTM) for a single time interval.

FIG. 5 is a simplified view of the process described for FIG. 4. The feature vector (X_t, X210) passes through the forward LSTM (F210, LSTM_forward) and the backward LSTM (B210, LSTM_backward), and the forward and backward result vectors are combined to finally generate the score vector (S210, s_t). The score vector (s_t) is obtained by applying the weight (w_s) to the forward result vector and the backward result vector and passing the result through a sigmoid layer (σ). The same process may be performed for every time interval (that is, t = 1, 2, ..., n). This is expressed as Equation 9.

[Equation 9]
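One plausible reading of this scoring step is sketched below: the forward and backward result vectors for an interval are concatenated, weighted by w_s, and squashed by a sigmoid to give a probability. The function names and the optional bias term `b_s` are assumptions; the text mentions only the weight w_s.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(h_fwd, h_bwd, w_s, b_s=0.0):
    """Highlight probability for one interval from its two result vectors.

    b_s is a hypothetical bias; the description names only the weight w_s.
    """
    h = h_fwd + h_bwd  # concatenate forward and backward result vectors
    return sigmoid(sum(w * v for w, v in zip(w_s, h)) + b_s)

def score_vector(h_fwds, h_bwds, w_s, b_s=0.0):
    """Apply the same scoring to every interval t = 1, ..., n."""
    return [score(f, b, w_s, b_s) for f, b in zip(h_fwds, h_bwds)]

# Illustrative values only.
s = score_vector([[0.0, 0.0], [0.5, 0.5]], [[0.0, 0.0], [0.5, 0.5]], w_s=[1.0, 1.0, 1.0, 1.0])
```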

FIG. 6 is a block diagram showing the flow of data in the multi-data highlight prediction model (M-BiLSTM) for a single time interval.

As FIG. 6 shows, unlike the single-data highlight prediction model (S-BiLSTM), which uses either chat data or audio data, the multi-data highlight prediction model (M-BiLSTM) uses both kinds of data. Because of this difference, the multi-data highlight prediction model (M-BiLSTM) may further include a multi-layer perceptron (M310, MLP).

For the chat-data feature vector (X310), the forward LSTM (F310, LSTM_forward) and the backward LSTM (B310, LSTM_backward) each produce a result vector; for the audio-data feature vector (X330), the forward LSTM (F330, LSTM_forward) and the backward LSTM (B330, LSTM_backward) likewise each produce a result vector. All four result vectors may be concatenated to generate a multi-data feature vector. The multi-data feature vector may then be input to the multi-layer perceptron (M310, MLP) to generate the score vector (S310). This is expressed as Equation 10.

[Equation 10]
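The concatenate-then-MLP step can be sketched as follows. The single hidden layer, the ReLU activation, the final sigmoid, and all names and weights here are assumptions chosen for illustration; the text specifies only that the concatenated result vectors are fed to a multi-layer perceptron that outputs a score.

```python
import math

def dense(w, b, v):
    """Fully connected layer: w @ v + b, with w given as a list of rows."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]

def mlp_score(result_vectors, w1, b1, w2, b2):
    """Concatenate the result vectors and map them to a highlight probability."""
    z = [v for vec in result_vectors for v in vec]       # concatenation
    hidden = [max(0.0, a) for a in dense(w1, b1, z)]     # hidden layer with ReLU (assumption)
    logit = dense([w2], [b2], hidden)[0]                 # single output unit
    return 1.0 / (1.0 + math.exp(-logit))                # sigmoid output (assumption)

# Chat forward/backward and audio forward/backward result vectors (illustrative values).
vectors = [[1.0], [2.0], [3.0], [4.0]]
p = mlp_score(vectors, w1=[[1.0, 1.0, 1.0, 1.0]], b1=[0.0], w2=[1.0], b2=-10.0)
```

The same function could take the first and second result vectors of the multi-interval models, since those are also described as being concatenated before the MLP.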

FIG. 7 is a block diagram showing the flow of data in the single-data highlight prediction model for multiple time intervals.

FIG. 7 shows the flow of data when the single-data highlight prediction model (S-BiLSTM) is applied to the first time interval and the second time interval.

A feature vector (X410) for either chat data or audio data may be generated for the first time interval (M410), and a feature vector (X430) for either chat data or audio data may be generated for the second time interval (M430).

For the feature vector (X410) of the first time interval (M410), a first result vector may be generated from the information learned through the forward LSTM (F410) and the backward LSTM (B410).

For the feature vector (X430) of the second time interval (M430), a second result vector may be generated from the information learned through the forward LSTM (F430) and the backward LSTM (B430).

The first result vector and the second result vector may be concatenated and input to the multi-layer perceptron (MLP, M450) to generate the score vector (S410, s_t).

FIG. 8 is a block diagram showing the flow of data in the multi-data highlight prediction model for multiple time intervals.

FIG. 8 shows the flow of data when the multi-data highlight prediction model (M-BiLSTM) is applied to each of the first time interval and the second time interval.

For the first time interval (M510), a feature vector (X510) may be generated for the chat data and a feature vector (X530) for the audio data. For the chat-data feature vector (X510), first result vectors for the chat data (forward and backward) may be generated from the information learned through the forward LSTM (F510, LSTM_forward) and the backward LSTM (B510, LSTM_backward). The same process applied to the audio-data feature vector (X530) may generate first result vectors for the audio data (forward and backward).

In the same way, for the second time interval (M530), second result vectors for the chat data (forward and backward) and second result vectors for the audio data (forward and backward) may be generated from the chat-data feature vector (X550) and the audio-data feature vector (X570).

The first result vectors for the first time interval and the second result vectors for the second time interval may be concatenated and input to the multi-layer perceptron (MLP, M550) to generate the score vector (S510, s_t).

The determination unit 250 may select highlight intervals from the score vector (S110, S210, S310, S410, S510) of the prediction unit 240 in descending order of predicted highlight probability.

The proportion selected as highlights may be chosen by presetting a ratio relative to the whole video, or intervals whose score exceeds a specific value may be selected as highlights. The proportion may be set as a specific ratio of the whole video, such as 10% or 15%, or as a specific duration, such as 10 minutes or 15 minutes. It may also be configured so that an interval is always selected as a highlight when its probability score exceeds a specific value.
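The two selection rules described above, a preset ratio of the whole video and a score threshold, can be sketched as follows. The function names and the sample scores are hypothetical, and truncating the count with `int` when the ratio does not divide evenly is an assumption.

```python
def select_by_ratio(scores, ratio):
    """Return indices of the top `ratio` fraction of intervals, in time order."""
    k = int(len(scores) * ratio)  # number of intervals to keep (truncation is an assumption)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def select_by_threshold(scores, threshold):
    """Return indices of every interval whose score exceeds `threshold`."""
    return [i for i, s in enumerate(scores) if s > threshold]

# Illustrative score vector for ten intervals.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.5, 0.05]
top = select_by_ratio(scores, 0.2)
above = select_by_threshold(scores, 0.65)
```

A duration-based budget such as 10 minutes could be handled by the ratio rule after dividing the budget by the interval length.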

When an interval is selected as a highlight, a predetermined length of video is determined as the final highlight. The predetermined length may add a fixed time, such as 2 seconds or 10 seconds, before and after the interval identified as a highlight, and the times added before and after may be specified separately.
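The padding rule can be sketched as below, with separate `before` and `after` margins in seconds. The function name, the fixed interval length, the clamping to the video bounds, and the merging of clips that overlap after padding are assumptions that the text does not specify.

```python
def pad_segments(indices, segment_len, before, after, video_len):
    """Turn selected interval indices into (start, end) clips in seconds."""
    clips = []
    for i in sorted(indices):
        start = max(0.0, i * segment_len - before)           # clamp at the video start
        end = min(video_len, (i + 1) * segment_len + after)  # clamp at the video end
        if clips and start <= clips[-1][1]:
            # Overlapping clips are merged into one (assumption, not in the source).
            clips[-1] = (clips[-1][0], max(clips[-1][1], end))
        else:
            clips.append((start, end))
    return clips

# Intervals 1 and 3 of a 100-second video cut into 10-second intervals,
# padded by 2 seconds before and 5 seconds after.
clips = pad_segments([1, 3], segment_len=10, before=2, after=5, video_len=100)
```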

FIG. 9 is a block diagram showing a method of generating highlights with the multi-data highlight prediction model for multiple time intervals.

The highlight video generation method according to the present invention may be performed by the highlight video generation apparatus 200 described above. To this end, an application (software) for performing each step of the method described below may be installed in advance on the highlight video generation apparatus 200. For example, a platform for the highlight video generation method may be installed in advance on a user's computer in the form of software, and the user may run that software to receive the various services the method provides.

The highlight video generation apparatus 200 may separate the chat data and the audio data into a first time interval and a second time interval, respectively (1110, 1130), and extract a feature vector for each time interval (1210, 1230). From the extracted feature vectors, the apparatus 200 may generate a short-term result vector and a long-term result vector using a recurrent neural network (1310, 1330). The apparatus 200 may input the short-term result vector and the long-term result vector to a multi-layer perceptron (MLP) to generate a score vector of the probabilities of being predicted as a highlight (1510). The apparatus 200 may then determine highlight intervals in descending order of probability in the score vector and classify them as final highlights of a predetermined length (1710).
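The separation step (1110, 1130) can be illustrated with a minimal sketch that groups timestamped chat messages into consecutive windows of a shorter and a longer length. The function name, the non-overlapping windows, and the 2-second and 6-second lengths are hypothetical; the text states only that the second time interval is longer than the first.

```python
def split_into_windows(timestamps, window, duration):
    """Group event timestamps (seconds) into consecutive windows of `window` seconds."""
    n = int(duration // window) + (1 if duration % window else 0)
    buckets = [[] for _ in range(n)]
    for t in timestamps:
        if 0 <= t < duration:  # ignore events outside the video
            buckets[int(t // window)].append(t)
    return buckets

# Chat message arrival times in a 12-second clip (illustrative values).
chat_times = [0.5, 1.2, 3.9, 7.0, 7.1, 11.5]
first_segments = split_into_windows(chat_times, window=2, duration=12)   # shorter interval
second_segments = split_into_windows(chat_times, window=6, duration=12)  # longer interval
```

Each bucket would then be passed to feature extraction, so the same stretch of video is described at two time scales.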

Each of the above steps has already been described in detail with reference to FIGS. 4 to 8, so the description is not repeated here.

The technique for providing the highlight video generation method may be implemented as an application, or in the form of program instructions executable by various computer components, and recorded on a computer-readable recording medium. The computer-readable recording medium may contain program instructions, data files, data structures, and the like, alone or in combination.

The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the field of computer software.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware device may be configured to operate as one or more software modules in order to perform the processing according to the present invention, and vice versa.

Although the invention has been described above with reference to embodiments, those skilled in the art will understand that it may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

200: Highlight video generation apparatus using chat data and audio data
210: Separation unit
220: Extraction unit
230: Learning unit
240: Prediction unit
250: Determination unit

Claims

An apparatus for generating a highlight video using chat data and audio data generated while a video is being provided, the apparatus comprising:
a separation unit that separates data generated while the video is being provided into a first time interval and a second time interval longer than the first time interval to generate first segment data and second segment data;
an extraction unit that extracts a first feature vector from the first segment data of the separation unit and extracts a second feature vector from the second segment data of the separation unit;
a learning unit that learns the first feature vector and the second feature vector of the extraction unit through a recurrent neural network (RNN) and generates a first result vector and a second result vector using the learned result;
a prediction unit that generates, from the first result vector and the second result vector of the learning unit, a score vector of probabilities of being predicted as a highlight; and
a determination unit that generates a final highlight using the score vector generated by the prediction unit.

The apparatus of claim 1, wherein the data generated while the video is being provided
is either one of chat data and audio data, or includes both chat data and audio data.

The apparatus of claim 1, wherein the extraction unit
uses natural language processing (NLP) to extract the feature vector of the chat data.

The apparatus of claim 1, wherein the extraction unit
uses Mel-frequency cepstral coefficients (MFCC) to extract the feature vector of the audio data.

The apparatus of claim 1, wherein the learning unit comprises:
a forward learning unit that generates a result vector for the first feature vector and the second feature vector of the extraction unit using information from a previous time interval; and
a short-term backward learning unit that generates a result vector for the first feature vector and the second feature vector of the extraction unit using information from a subsequent time interval.

The apparatus of claim 1, wherein the recurrent neural network (RNN)
uses long short-term memory (LSTM).

The apparatus of claim 1, wherein the prediction unit
uses a multi-layer perceptron (MLP) to generate the score vector.

The apparatus of claim 1, wherein the determination unit
generates the highlight video by determining highlight intervals from the score vector generated by the prediction unit in descending order of probability and determining final highlights of a predetermined length.

A method for generating a highlight video using chat data and audio data generated while a video is being provided, the method comprising:
separating data extracted from the video into a first time interval and a second time interval longer than the first time interval to generate first segment data and second segment data;
extracting a first feature vector from the first segment data and extracting a second feature vector from the second segment data;
learning the first feature vector and the second feature vector through a recurrent neural network (RNN) and generating a first result vector and a second result vector using the learned result;
generating, from the first result vector and the second result vector, a score vector of probabilities of being predicted as a highlight; and
generating a final highlight using the score vector.