KR102276728B1

KR102276728B1 - Multimodal content analysis system and method

Info

Publication number: KR102276728B1
Application number: KR1020190072484A
Authority: KR
Inventors: 강미나; 김만준
Original assignee: 빅펄 주식회사
Priority date: 2019-06-18
Filing date: 2019-06-18
Publication date: 2021-07-13
Also published as: KR20200144417A

Abstract

The present invention relates to a multimodal content analysis system and a method therefor. A multimodal content analysis method, performed by a multimodal content analysis system that analyzes content uploaded through a media channel, includes: a) collecting content including audio, video, subtitles, and metadata from at least one content sharing platform that provides a content sharing service through the media channel; b) generating, through natural-language-processing-based preprocessing of the collected content, video document data for each content item, in which content text information including the context of the content is combined with user reaction text information including viewers' posted reactions; c) extracting a feature vector for each frame of the video document data using a multimodal analysis model based on text analysis, and training a classifier for content classification using the extracted feature vectors; and d) when new content is found by searching the media channel, providing an analysis result for the new content using the trained multimodal analysis model. The multimodal analysis model trains a classifier generated through a clustering algorithm that divides the entire video document data into K groups based on the similarity between feature vectors, and the trained classifier automatically classifies new content into a predefined category using the feature vector extracted from that content.

Description

Multimodal content analysis system and method {MULTIMODAL CONTENT ANALYSIS SYSTEM AND METHOD}

The present invention relates to a multimodal content analysis system and a method therefor.

The device environments through which users access media are diversifying, and the amount of content available in them is growing. In the rapidly developing mobile environment in particular, users consume content on personalized devices and share their experiences with the users around them. Content providing services seek to stimulate content consumption by analyzing and using individual content consumption histories and the data generated from SNS relationships.

Among content providing services, content recommendation algorithms that select content suited to each user are becoming an essential element of every content sharing platform. A content recommendation algorithm applies a content analysis algorithm to data such as the viewer's consumption history and the content's metadata in order to derive and provide the content the user is inferred to need most.

Content analysis algorithms treat as important information not only the audio, video, and subtitles of the content, and figures such as views, impressions, and viewing time, but also the analysis of viewer reactions such as comments, likes/dislikes, and shares.

FIG. 1 is a diagram illustrating the change in the data used by general content analysis algorithms.

As shown in FIG. 1, as new types of content sharing platforms such as social network services (Facebook, Twitter) and content sharing sites (YouTube, Flickr) have become active, the video content shared through these platforms is evolving from single-modal data into multimodal data, that is, multivariate data comprising audio, video, subtitles, metadata, and the like.

Because the amount of new content and the number of users appearing on content sharing platforms are increasing rapidly, general data analysis algorithms cannot handle the sharply growing computation needed to analyze large volumes of multimodal content, nor can they keep up with the analysis speed that is demanded. Content analysis algorithms therefore require a technique for analyzing large amounts of multimodal data quickly and accurately.

To solve the problems described above, an object of the present invention, according to an embodiment thereof, is to analyze content in the form of multimodal data quickly and accurately, and to provide analysis results in a form that combines the substance of the content with viewer reactions.

However, the technical problem to be solved by the present embodiment is not limited to the one described above, and other technical problems may exist.

As a technical means for achieving the above technical object, a multimodal content analysis method according to an embodiment of the present invention, performed by a multimodal content analysis system that analyzes content uploaded through a media channel, includes: a) collecting content including audio, video, subtitles, and metadata from at least one content sharing platform that provides a content sharing service through the media channel; b) generating, through natural-language-processing-based preprocessing of the collected content, video document data for each content item, in which content text information including the context of the content is combined with user reaction text information including viewers' posted reactions; c) extracting a feature vector for each frame of the video document data using a multimodal analysis model based on text analysis, and training a classifier for content classification using the extracted feature vectors; and d) when new content is found by searching the media channel, providing an analysis result for the new content using the trained multimodal analysis model, wherein the multimodal analysis model trains a classifier generated through a clustering algorithm that divides the entire video document data into K groups based on the similarity between feature vectors, and the trained classifier automatically classifies new content into a predefined category using the feature vector extracted from that content.

In addition, a multimodal content analysis system according to another embodiment of the present invention, which analyzes content uploaded through a media channel, includes: a memory in which a program for performing a multimodal content analysis method is recorded; and a processor for executing the program. By executing the program, the processor collects content including audio, video, subtitles, and metadata from at least one content sharing platform that provides a content sharing service through the media channel; generates, through natural-language-processing-based preprocessing of the collected content, video document data for each content item, in which content text information including the context of the content is combined with user reaction text information including viewers' posted reactions; extracts a feature vector for each frame of the video document data using a multimodal analysis model based on text analysis; trains a classifier for content classification using the extracted feature vectors; and, when new content is found by searching the media channel, provides an analysis result for the new content using the trained multimodal analysis model. The multimodal analysis model trains a classifier generated through a clustering algorithm that divides the entire video document data into K groups based on the similarity between feature vectors, and the trained classifier automatically classifies new content into a predefined category using the feature vector extracted from that content.

According to the means for solving the problem described above, by labeling content by specific category with the multimodal analysis model, users can analyze large volumes of multimodal content quickly and accurately without examining each item one by one, and can review the substance of the content in a category together with the viewer reactions to it at a glance.

FIG. 1 is a diagram illustrating the change in the data used by general content analysis algorithms.
FIG. 2 is a diagram illustrating the overall configuration of a multimodal content analysis system according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the detailed configuration of a multimodal content analysis system according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a multimodal content analysis method according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram illustrating the screen and metadata form of content.
FIG. 6 is an exemplary diagram illustrating video document data according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the structure of a data storage module according to an embodiment of the present invention.
FIG. 8 is an exemplary diagram illustrating the training process of a multimodal analysis model according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating the content classification process of a multimodal analysis model according to an embodiment of the present invention.
FIG. 10 is an exemplary diagram illustrating the classification result of FIG. 9.
FIG. 11 is an exemplary diagram illustrating the attention algorithm of a multimodal analysis model according to an embodiment of the present invention.
FIG. 12 is an exemplary diagram illustrating the attention algorithm that uses the attention weights of FIG. 11.
FIG. 13 is an exemplary diagram illustrating the analysis result of a multimodal analysis model according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. To describe the present invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them. Likewise, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it should be understood that the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof is not precluded in advance.

In this specification, a 'terminal' may be a wireless communication device with guaranteed portability and mobility, for example any kind of handheld wireless communication device such as a smartphone, a tablet PC, or a notebook computer. A 'terminal' may also be a wired communication device, such as a PC, that can connect to other terminals or servers through a network. A network refers to a connection structure that allows information to be exchanged between nodes such as terminals and servers, and includes a local area network (LAN), a wide area network (WAN), the Internet (WWW: World Wide Web), wired and wireless data communication networks, telephone networks, and wired and wireless television networks.

Examples of wireless data communication networks include, but are not limited to, 3G, 4G, 5G, 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), WiMAX (Worldwide Interoperability for Microwave Access), Wi-Fi, Bluetooth, infrared communication, ultrasonic communication, visible light communication (VLC), and LiFi.

The following embodiments are detailed descriptions intended to aid understanding of the present invention and do not limit its scope of rights. Accordingly, an invention of equivalent scope that performs the same function as the present invention also falls within the scope of rights of the present invention.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 2 is a diagram illustrating the overall configuration of a multimodal content analysis system according to an embodiment of the present invention, and FIG. 3 is a diagram illustrating the detailed configuration of a multimodal content analysis system according to an embodiment of the present invention.

Referring to FIGS. 2 and 3, the multimodal content analysis system 100 includes a communication module 110, a memory 120, a processor 130, a data storage module 140, and a display module 150. The content sharing platform 200 may be YouTube, AfreecaTV, Twitch, Instagram, or the like, provided by a system or operator that offers a multi-channel network (MCN) service. The user terminal 300 may be a terminal held by an independent video producer or creator, or a terminal held by a viewer who watches content uploaded to a media channel.

The communication module 110 works with the communication network to provide the communication interface that the multimodal content analysis system 100 needs in order to exchange signals, in the form of packet data, with the user terminal 300 and the content sharing platform 200. The communication module 110 may also receive a data request from the user terminal 300 and transmit data in response. In addition, the communication module 110 may transmit a data request to the content sharing platform 200 and receive data in response.

Here, the communication module 110 may be a device that includes the hardware and software needed to transmit and receive signals, such as control signals or data signals, over wired or wireless connections with other network devices.

A program for performing the multimodal content analysis method is recorded in the memory 120. The memory 120 also temporarily or permanently stores data processed by the processor 130. The memory 120 may include volatile storage media or non-volatile storage media, but the scope of the present invention is not limited thereto.

By executing the program for performing the multimodal content analysis method, the processor 130 controls the entire process of providing the analysis result for content, obtained with the multimodal analysis model, to the user terminal 300 or the content sharing platform 200. Each operation performed by the processor 130 is described in more detail later.

Here, the processor 130 may include any type of device capable of processing data. A 'processor' may refer, for example, to a data processing device embedded in hardware that has physically structured circuitry for performing functions expressed as code or instructions contained in a program. Examples of such hardware-embedded data processing devices include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA), but the scope of the present invention is not limited thereto.

The data storage module 140, which includes a relational database 141 and storage 145, stores the data accumulated while the multimodal content analysis method is performed. For example, the relational database 141 may store data with high-level characteristics, such as text information, time information, numerical information, and small binary data, while the storage 145 may store data with low-level characteristics, namely source data such as images, video, and audio that are difficult to store in the relational database 141. The storage schemes of the relational database 141 and the storage 145 include variations that depend on data loading, data access, and the management cost of data storage.

Under the control of the processor 130, the display module 150 outputs the analysis result for content on screen as a report in the form of text, tables, or graphs.

FIG. 4 is a flowchart illustrating a multimodal content analysis method according to an embodiment of the present invention, FIG. 5 is an exemplary diagram illustrating the screen and metadata form of content, and FIG. 6 is an exemplary diagram illustrating video document data according to an embodiment of the present invention.

A plurality of user terminals 300 may use various content sharing platforms 200 to upload content they have produced to a media channel, and may carry out media activity by browsing or watching content uploaded to the media channel while posting reactions such as comments and like/dislike clicks (S1).

The multimodal content analysis system 100 uses an API (Application Programming Interface) to collect content channel by channel through the content sharing platform 200 (S2, S3). Here, content means the various kinds of digital information provided through a communication network such as the Internet, and may include information items such as video, subtitles, images, and audio, as well as programs, movies, music, game software, and the like.

When a temporary failure occurs or the response to a data request is delayed while data is being requested through the API or by crawling, the multimodal content analysis system 100 may issue the data request again after a preset waiting time. If data access permission has been lost, however, the system issues the data request to each content sharing platform individually and then stops or skips execution of the data-request logic.
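The retry behavior described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the exception classes, the `fetch_with_retry` name, and the cap on attempts (`max_attempts`) are assumptions introduced here, since the text specifies only a preset waiting time and a stop-or-skip rule for lost access permission.

```python
import time


class AccessDenied(Exception):
    """Hypothetical error: the platform reports that access permission was lost."""


class TemporaryFailure(Exception):
    """Hypothetical error: a transient failure or a delayed response."""


def fetch_with_retry(request, wait_seconds=1.0, max_attempts=3, sleep=time.sleep):
    """Call request(); on a transient failure, wait and request again.

    Returns the response, or None when the request is skipped because
    access was denied or every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TemporaryFailure:
            if attempt < max_attempts:
                sleep(wait_seconds)  # preset waiting time before re-requesting
        except AccessDenied:
            return None  # permission lost: skip this request
    return None
```

Passing `sleep` as a parameter keeps the function testable without real delays; a caller would normally leave it at its default.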

Because each content sharing platform provides content in a different way, the multimodal content analysis system 100 explores the objects related to the collected content and identifies the relationships implied by the discovered object structure. That is, the multimodal content analysis system 100 explores which content exists within the content sharing platform 200, explores the objects that make up the content, such as audio, video, subtitles, and metadata, and identifies the objects within the metadata that relate to viewers' posted reactions. For example, when the content sharing platform 200 is YouTube, the multimodal content analysis system 100 may use its search function to identify the channels that exist on YouTube and, at regular intervals, explore and discover whether new channels or video content have been uploaded.

The multimodal content analysis system 100 stores log information about the collected video content in the data storage module 140. When searching for new content, it can use this log information to identify content that has already been explored, and can therefore search newly uploaded content quickly and efficiently.
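One way to picture the log-based search is a set of already-explored content IDs that each new listing is filtered against. The sketch below assumes the log can be represented as an in-memory set of IDs; `find_new_content` and the ID format are hypothetical names, and a real system would keep the log in the data storage module.

```python
def find_new_content(listed_ids, crawl_log):
    """Return IDs not yet in the crawl log, in listing order, and record them."""
    new_ids = []
    for content_id in listed_ids:
        if content_id not in crawl_log:
            crawl_log.add(content_id)  # mark as explored so it is not revisited
            new_ids.append(content_id)
    return new_ids
```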

As shown in FIG. 5, the multimodal content analysis system 100 may collect content A through a YouTube API request, and the collected content A includes video, audio, subtitles, and metadata along its timeline. The metadata includes the title, description, tags, category, view count, like count, dislike count, upload time, thumbnail, and comments of content A.
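The shape of a collected item can be sketched as a pair of data classes. The field names below mirror the metadata listed above, but the class names, types, and defaults are assumptions made for illustration and do not reflect any platform's actual API response format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentMetadata:
    title: str
    description: str = ""
    tags: List[str] = field(default_factory=list)
    category: str = ""
    view_count: int = 0
    like_count: int = 0
    dislike_count: int = 0
    upload_time: str = ""      # for example, an ISO 8601 timestamp
    thumbnail_url: str = ""
    comments: List[str] = field(default_factory=list)


@dataclass
class CollectedContent:
    content_id: str
    video_path: str            # location of the video source
    audio_path: str            # location of the audio source
    subtitles: List[str]
    metadata: ContentMetadata
```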

As shown in FIG. 6, the multimodal content analysis system 100 applies natural-language-processing-based preprocessing to the content and generates video document data by combining content text information, which includes the context of each content item, with user reaction text information, which includes viewers' posted reactions (S4).

Preprocessing of the content includes stopword handling, jamo (Korean letter) decomposition, stemming, and word tokenization, and produces video document data in which the content text information and the user reaction text information of each content item are combined. The same preprocessing is applied for the multimodal analysis model.
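A simplified sketch of this preprocessing is shown below. It covers only tokenization and stopword removal; jamo decomposition and stemming are omitted and would need a language-specific library. The stopword list, the function names, and the dictionary layout of the resulting document are illustrative assumptions.

```python
import re

# Illustrative stopword list only; a real system would use a curated list.
STOPWORDS = {"the", "a", "an", "is", "this", "of", "and", "to"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_video_document(subtitles, title, description, comments):
    """Combine content text and user reaction text into one document."""
    content_text = " ".join([title, description] + list(subtitles))
    reaction_text = " ".join(comments)
    return {
        "content_tokens": tokenize(content_text),
        "reaction_tokens": tokenize(reaction_text),
    }
```

Keeping the two token lists separate preserves the distinction between what the content says and how viewers responded, which the later analysis steps rely on.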

The video document data generated in this way is stored in the data storage module 140 for training the multimodal analysis model. The data storage module 140 may be divided, according to the associated data sets, into a training DB, a word dictionary DB, an analysis DB, and so on, or the DBs may be integrated and managed together.

FIG. 7 is a diagram illustrating the structure of a data storage module according to an embodiment of the present invention.

As shown in FIG. 7, the relational database (RDBMS, 141) stores data with high-level characteristics, such as text information, time information, numerical information, and small binary data, and the storage 145 stores data with low-level characteristics, such as image sources, video sources, and audio sources, that are difficult to store in the relational database 141.

The data storage module 140 may adopt a combination of several structures, depending on the data structure of each content sharing platform and the need to manage DB expansion. For example, the data storage module 140 may use the relational database 141 by default and use a non-relational database (NoSQL, 143) to manage unstructured data. The data storage module 140 also includes DB expansion nodes and a replication node 147 for backup, and read/write operations may be implemented so that consistent responses are exchanged for each data object.

When unstructured data from the non-relational database 143 is stored in the relational database 141, the processor 130 uses a conditional branch to decide whether to store each field in the relational database 141, depending on whether that field is actually present in the collected data, even when a data schema exists. If the unstructured data cannot be stored using a database transaction, storing it may be skipped. Performing this asynchronous processing logic allows computing resources to be used efficiently for data collection and processing under many simultaneous data requests.
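The per-field conditional branch and the skip-on-failure rule can be illustrated with Python's built-in `sqlite3` module standing in for the relational database. The table layout, the column tuple, and the `store_record` helper are assumptions made for this sketch; asynchronous execution is not shown.

```python
import sqlite3

# Assumed schema for illustration; column names are fixed, never user input.
COLUMNS = ("content_id", "title", "view_count", "like_count")


def store_record(conn, record):
    """Insert only the schema fields actually present; skip on failure.

    Returns True if the row was stored and False if the write was omitted.
    """
    present = [c for c in COLUMNS if c in record]  # conditional branch per field
    if "content_id" not in present:
        return False
    placeholders = ", ".join("?" for _ in present)
    sql = f"INSERT INTO content ({', '.join(present)}) VALUES ({placeholders})"
    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute(sql, [record[c] for c in present])
        return True
    except sqlite3.Error:
        return False  # the transaction failed, so storing is skipped


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content (content_id TEXT PRIMARY KEY, title TEXT, "
    "view_count INTEGER, like_count INTEGER)"
)
```

Fields missing from a record are simply left as NULL, and fields outside the schema are ignored rather than causing the insert to fail.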

If the processor 130 issues many simultaneous write requests to the data storage module 140, the data storage module 140 may temporarily reach its limit of concurrently processable operations. In this case, the processor 130 may hold the collected data in a work queue and, while data remains in the work queue, retrieve it in FIFO (first-in, first-out) order and store it in the data storage module 140.
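The work-queue behavior can be sketched as follows. This is only an illustration under stated assumptions: the `WriteBuffer` class, its `max_concurrent` limit, and the in-memory `stored` list standing in for the data storage module 140 are all hypothetical.

```python
from collections import deque


class WriteBuffer:
    """Buffers write requests and drains them in FIFO order."""

    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent  # hypothetical store limit
        self.pending = deque()  # the work queue
        self.stored = []  # stand-in for the data storage module

    def submit(self, item):
        """Queue a write request instead of writing immediately."""
        self.pending.append(item)

    def drain(self) -> int:
        """Store at most `max_concurrent` items, oldest first."""
        written = 0
        while self.pending and written < self.max_concurrent:
            self.stored.append(self.pending.popleft())
            written += 1
        return written


buf = WriteBuffer(max_concurrent=2)
for item in ["a", "b", "c"]:
    buf.submit(item)
first = buf.drain()   # writes "a" and "b"
second = buf.drain()  # writes "c"
```

Because `popleft` always removes the oldest entry, data reaches the store in the order in which it was collected.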

Meanwhile, depending on the amount of data to be processed, the processor 130 may distribute data processing across the data storage module 140 by scaling out or scaling up, and may operate multiple computing resources.

Returning to FIG. 4, the multimodal content analysis system 100 extracts a feature vector for each frame of the video document data using a multimodal analysis model based on text analysis, and trains a classifier using the extracted feature vectors (S5, S6).

When new content is found by searching the media channel for content (S7), an analysis result for the new content is provided using the trained multimodal analysis model (S8, S9).

That is, the multimodal content analysis system 100 classifies content into categories using the content text information of the video document data, sets the most frequent words in the viewer response text information for the content in each category as the viewer reaction, and then provides, for new content, an analysis result that combines the category with the viewer reaction. Users therefore do not need to inspect a large number of content items one by one; by labeling a specific category, they can view the content in that category at once.
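Selecting the most frequent reaction words per category can be illustrated with a short sketch. The `top_reaction_words` function, the `top_n` cutoff, and the sample data are assumptions made for illustration; the patent does not specify how many words are kept.

```python
from collections import Counter, defaultdict


def top_reaction_words(reactions, top_n=2):
    """Return the `top_n` most frequent words for each category.

    `reactions` is an iterable of (category, list_of_words) pairs.
    """
    counts = defaultdict(Counter)
    for category, words in reactions:
        counts[category].update(words)
    return {
        category: [word for word, _ in counter.most_common(top_n)]
        for category, counter in counts.items()
    }


sample = [
    ("touching story", ["moved", "compassionate", "moved"]),
    ("touching story", ["moved", "compassionate", "sad"]),
    ("mukbang", ["hungry", "hungry", "tasty"]),
]
result = top_reaction_words(sample)
```

The resulting word lists could then be attached to any new content item that the classifier assigns to the same category.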

FIG. 8 is an exemplary diagram illustrating the training process of a multimodal analysis model according to an embodiment of the present invention.

The multimodal analysis model may be implemented using a neural-network-based memory network, a recurrent neural network (RNN), a dilated convolutional neural network (dilated CNN), or the like. The neural network structure used to process sequence data may vary with the length of the time sequence or with the total sequence data length, which is inversely proportional to the length of the time-unit interval.

The multimodal analysis model retrieves video document data from the data storage module 140, converts it into training data through preprocessing, stores the training data in a training DB, and trains the classifier using the training data in the training DB.

For Korean data, preprocessing removes stopwords such as loanwords, certain special characters, and e-mail addresses, or replaces them with other word tokens; for English data, it converts the text to lowercase.
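A simplified version of this preprocessing step is sketched below. The regular expressions and the `<email>` placeholder token are assumptions chosen for illustration, and loanword removal is omitted because it would require a dictionary that the source does not describe.

```python
import re

# Assumed patterns; the patent does not define the exact rules.
EMAIL = re.compile(r"\S+@\S+\.\S+")
# Drop everything except word characters, whitespace, and the angle
# brackets used by the placeholder token below.
SPECIAL = re.compile(r"[^\w\s<>]")


def preprocess(text: str) -> str:
    text = EMAIL.sub("<email>", text)  # replace e-mail with a token
    text = SPECIAL.sub(" ", text)      # remove special characters
    text = text.lower()                # affects only cased letters
    return " ".join(text.split())      # collapse repeated whitespace


cleaned = preprocess("Contact ME at a.b@example.com!!! 정말 좋아요 ^^")
```

Lowercasing leaves Hangul unchanged, so the same function can be applied to mixed Korean and English text.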

The multimodal analysis model builds a word dictionary DB from all words extracted from the training data through morphological analysis, stemming, normalization, and similar steps, and uses for analysis the topic words that appear in the word dictionary DB at or above a certain frequency. Words that appear in the word dictionary DB below that frequency are likely to be proper nouns and are therefore treated as a special token (unknown token). If the multimodal analysis model uses character-level training data, preprocessing separates Hangul into initial, medial, and final sounds (choseong, jungseong, jongseong), so proper nouns no longer need to be treated as special tokens.
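The frequency threshold, the unknown token, and the character-level alternative can be sketched as follows. The `<unk>` spelling, the `min_count` parameter, and the function names are hypothetical; Unicode NFD normalization is used here as one standard way to split a Hangul syllable into its jamo, and is not necessarily the method used in the patent.

```python
import unicodedata
from collections import Counter

UNK = "<unk>"  # hypothetical spelling of the unknown token


def build_vocab(docs, min_count):
    """Map words seen at least `min_count` times to integer ids."""
    counts = Counter(word for doc in docs for word in doc)
    vocab = {UNK: 0}
    for word, n in counts.items():
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(doc, vocab):
    """Replace words with ids; rare words fall back to the UNK id."""
    return [vocab.get(word, vocab[UNK]) for word in doc]


def to_jamo(text):
    """Split Hangul syllables into initial, medial, and final jamo."""
    return list(unicodedata.normalize("NFD", text))


docs = [["mukbang", "spicy", "noodle"], ["mukbang", "noodle", "hongdae"]]
vocab = build_vocab(docs, min_count=2)
ids = encode(docs[1], vocab)  # "hongdae" is rare, so it maps to UNK
jamo = to_jamo("한")          # three conjoining jamo code points
```

With character-level input every syllable decomposes into a small fixed alphabet, which is why rare proper nouns no longer need a special token.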

The multimodal analysis model uses the word dictionary DB to replace topic words with tokens and then performs training on each item of video document data. Training uses deep-learning-based natural language processing algorithms such as doc2vec and BERT (Bidirectional Encoder Representations from Transformers).

doc2vec's PV-DBOW (Distributed Bag of Words version of Paragraph Vector) and PV-DM (Distributed Memory model of Paragraph Vector) algorithms, which build on the word2vec algorithm, learn what a document is about by training to predict which words appear in that document.

In BERT's training process, MLM (Masked Language Model) is one of the unsupervised learning methods that use a large natural-language corpus. Document (paragraph or document) vectors can be attached by applying the PV-DM algorithm to a pre-trained Transformer encoder. The natural language processing model is fine-tuned on the collected content data using an unsupervised learning method such as MLM.

FIG. 9 is a diagram illustrating the content classification process of a multimodal analysis model according to an embodiment of the present invention, and FIG. 10 is an exemplary diagram illustrating the classification result of FIG. 9.

The multimodal analysis model defines a similarity, based on distance or probability, between the feature vectors extracted for each content item during training and, based on this similarity, clusters the entire set of video document data to group similar content together (S110).

Here, clustering is a data mining technique that extracts the features of content in order to search for similar content and to find the user reactions associated with the retrieved content.

The multimodal analysis model also performs a feedback process to check the clustering result and reorganizes the clusters according to the feedback (S120). That is, to recover from an initially poor clustering, it performs distance-based cluster reassignment again using the redefined centroids, and it terminates the clustering algorithm when the boundaries between clusters no longer change.
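The reassign-until-stable loop described above resembles k-means clustering, which the following sketch uses as an assumed stand-in; the patent does not name a specific algorithm. Euclidean distance, initialization from the first K points, and the `assign` helper for new content are illustrative choices.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign(vector, centroids):
    """Index of the nearest centroid for one feature vector."""
    return min(
        range(len(centroids)), key=lambda c: euclidean(vector, centroids[c])
    )


def kmeans(points, k, max_iter=100):
    # Naive initialization (an assumption): the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        new_labels = [assign(p, centroids) for p in points]
        if new_labels == labels:
            break  # cluster boundaries unchanged, so stop
        labels = new_labels
        # Redefine each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centroids


points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels, centroids = kmeans(points, k=2)
# Nearest-centroid assignment is one simple way to place new content.
new_label = assign((4.8, 5.2), centroids)
```

In this example the first pass groups three points together because of the poor initial centroids, and the second pass corrects it, which mirrors the recovery from an initially poor clustering.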

As shown in FIG. 10, once reorganization is complete, the final content classification is produced through structural labeling, in which the cluster labels are examined and reorganized level by level, and a classifier is generated from this classification result (S130). The classifier generated in this way uses the feature vector extracted from new content through natural language processing to determine which category the new content belongs to. That is, the classifier assigns new content to the corresponding category using its feature vector.

For example, if a cluster contains labels such as teen mukbang, twenties mukbang, thirties mukbang, Chinese food mukbang, and Korean food mukbang, the multimodal analysis model can build a two-level label structure through structural labeling. That is, mukbang is set as the top-level label, and each detailed mukbang label is placed under it as a sub-label.
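The two-level structure in this example can be represented as a simple mapping from a top-level label to its sub-labels. The suffix-matching rule in `group_labels` is a hypothetical heuristic used only to make the example concrete; the patent does not say how parent labels are identified.

```python
def group_labels(labels, parent):
    """Nest every label that ends with `parent` under that parent."""
    tree = {parent: []}
    for label in labels:
        if label != parent and label.endswith(parent):
            tree[parent].append(label)  # detailed label becomes a sub-label
        else:
            tree.setdefault(label, [])  # unrelated label stays top-level
    return tree


cluster_labels = [
    "teen mukbang",
    "twenties mukbang",
    "Chinese food mukbang",
    "Korean food mukbang",
    "travel vlog",
]
hierarchy = group_labels(cluster_labels, parent="mukbang")
```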

Because the video document data to be classified is unstructured data written in natural language, it must be represented as structured data before it can be processed. The video document data is therefore represented as a feature vector built over all words in the word dictionary DB, taking into account stopwords and frequency-based importance; each feature in the feature vector consists of a word and its value. A feature value is a frequency, a presence/absence indicator, or a weight.
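The three kinds of feature value can be computed as in the sketch below. Using TF-IDF for the weight is an assumption, since the source only says that importance depends on frequency; the function name and the sample vocabulary are likewise illustrative.

```python
import math
from collections import Counter


def feature_vectors(docs, vocab):
    """Build frequency, presence, and weight vectors over `vocab`."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append({
            "frequency": [tf[w] for w in vocab],
            "presence": [1 if tf[w] else 0 for w in vocab],
            # Assumed weighting: term frequency times log inverse
            # document frequency (TF-IDF).
            "weight": [
                tf[w] * math.log(n_docs / df[w]) if df[w] else 0.0
                for w in vocab
            ],
        })
    return rows


docs = [["spicy", "noodle", "noodle"], ["noodle", "soup"]]
vocab = ["noodle", "spicy", "soup"]
rows = feature_vectors(docs, vocab)
```

A word that occurs in every document, such as "noodle" here, receives a weight of zero, which is how frequency-based importance can suppress uninformative words.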

Content classification through clustering can thus find meaningful structure without prior information about the given video document data, requires little computation time, and can be applied to large volumes of content.

FIG. 11 is an exemplary diagram illustrating the attention algorithm of a multimodal analysis model according to an embodiment of the present invention, and FIG. 12 is an exemplary diagram illustrating an attention algorithm that uses the attention weights of FIG. 11.

The multimodal analysis model applies the attention algorithm to each type of mode data (audio, motion, and frame) along the content timeline, so that feature vectors can be extracted while taking into account which frames' data are strongly related to other frames. Running the clustering algorithm on the feature vectors extracted in this way classifies the content into categories.

For video mode data, preprocessing reduces the frame rate to about 1 FPS and downsamples the image quality of the frames. A feature vector is then extracted for each frame of the preprocessed video mode data using the natural-language-processing-based multimodal analysis model.
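The two reductions can be sketched without any video library by working on frame indices and on a frame stored as a nested list of pixels. The function names, the `target_fps` parameter, and the stride-based spatial downsampling are assumptions made for illustration.

```python
import math


def sample_frame_indices(total_frames, source_fps, target_fps=1.0):
    """Indices of the frames to keep when reducing to `target_fps`."""
    step = source_fps / target_fps  # source frames per kept frame
    count = math.ceil(total_frames / step)
    return [int(i * step) for i in range(count)]


def downsample(frame, factor):
    """Keep every `factor`-th pixel in both dimensions."""
    return [row[::factor] for row in frame[::factor]]


# A 3-second clip at 30 FPS keeps one frame per second.
kept = sample_frame_indices(total_frames=90, source_fps=30.0)
small = downsample(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    factor=2,
)
```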

For audio mode data, the pre-trained multimodal analysis model extracts a feature vector for each frame and then compresses these vectors along the timeline to extract the final feature vector.
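The source does not say how the per-frame vectors are compressed along the timeline. One common choice, shown here purely as an assumption, is mean pooling, which averages each dimension over all frames.

```python
def mean_pool(frame_vectors):
    """Average per-frame feature vectors into one fixed-size vector."""
    n = len(frame_vectors)
    return [sum(column) / n for column in zip(*frame_vectors)]


# Two audio frames with two features each (illustrative values).
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```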

For subtitle mode data, a feature vector is extracted for each frame using the natural-language-processing-based multimodal analysis model.

The attention algorithm learns, over the feature vectors of each type of mode data, the parameters that determine how strongly one feature vector is related to the other feature vectors.

As shown in FIG. 12, when the pronoun 'it' is analyzed using the attention weights produced by the attention algorithm, the analysis focuses on the word 'animal' that 'it' refers to. That is, although 'it' refers to different words in the upper and lower sentences, the self-attention algorithm appropriately gives more weight to 'animal' and 'street', which are shown in dark blue.
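How attention weights relate one feature vector to the others can be illustrated with scaled dot-product self-attention. This sketch deliberately omits the learned query, key, and value projections that a trained model would have, so it shows only the weighting mechanism and is not the patent's model; the sample vectors are made up.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections."""
    d = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Similarity of this vector to every vector, itself included.
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        w = softmax(scores)
        weights.append(w)
        # Each output is the attention-weighted sum of all vectors.
        outputs.append(
            [sum(wi * v[j] for wi, v in zip(w, vectors)) for j in range(d)]
        )
    return outputs, weights


# Three per-frame feature vectors; the first two are similar.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(frames)
```

Each row of `weights` sums to 1, and the first frame places more weight on the similar second frame than on the dissimilar third one, in the same way that 'it' attends more strongly to the word it refers to.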

In this way, the multimodal content analysis model can use the attention algorithm to take into account the relationships among the mode data within a single time stamp, or the relationships between data at different time stamps across the entire timeline.

FIG. 13 is an exemplary diagram illustrating an analysis result of a multimodal analysis model according to an embodiment of the present invention.

As shown in FIG. 13, the multimodal analysis model analyzes new content and provides, as the analysis result, information that combines the content's subject matter with the viewer reaction. By designating the content metadata (title, description, category, and so on) as label data, the multimodal analysis model is trained to predict the metadata of each content item and can thereby analyze the content.

For example, if the category of the new content (video 1) is a moving story told in a storytelling format, the viewer reaction for that category may be 'moved' and 'compassionate'.

Accordingly, users or content sharing platform operators do not need to inspect a large number of content items one by one; by labeling a specific category, they can check similar content and the viewer reaction at once.

The multimodal content analysis method according to the embodiments of the present invention described above may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. Such a recording medium includes computer-readable media, which may be any available media accessible by a computer and include both volatile and non-volatile media and both removable and non-removable media. Computer-readable media also include computer storage media, which include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is provided for illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing the technical idea or essential features of the invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: multimodal content analysis system
110: communication module 120: memory
130: processor 140: data storage module
150: display module

Claims

A multimodal content analysis method performed by a multimodal content analysis system that analyzes content uploaded through a media channel, the method comprising:
a) collecting content including audio, video, subtitles, and metadata from at least one content sharing platform that provides a content sharing service through the media channel;
b) generating, for each content item, through natural-language-processing-based preprocessing of the collected content, video document data that is written in natural language and includes content text information containing the context of the content and user response text information containing viewers' posted reactions;
c) extracting a feature vector for each frame of the video document data using a multimodal analysis model based on text analysis, and training a classifier for content classification using the extracted feature vectors; and
d) when new content is found by searching the media channel for content, providing an analysis result for the new content using the trained multimodal analysis model,
wherein the multimodal analysis model trains a classifier generated through a clustering algorithm that classifies the entire set of video document data into K groups based on the similarity between the feature vectors, and the trained classifier automatically classifies new content into a predefined category using a feature vector extracted from the new content,
wherein the video document data is divided into mode data including audio, motion, and frame, and an attention algorithm is applied to each type of mode data to extract feature vectors that take into account the relationships between frames, and
wherein the attention algorithm applies an attention weight to another feature vector related to one feature vector so that the analysis focuses on that feature vector.

The method of claim 1,
wherein step a) stores log information about the collected content, and
step d) searches for new content after excluding already-searched content using the stored log information.

The method of claim 1,
wherein the content text information includes at least one of video data, audio data, and subtitle data of the corresponding content, and
the user response text information includes at least one of the content's title, description, tags, views, impressions, clicks relative to impressions, positive/negative ratings, upload time, thumbnails, comments, metadata including viewer retention, watch time, or viewer demographic information.

The method of claim 1,
wherein step a) comprises:
storing, in a relational database, data having high-level characteristics, including text information, time information, and numerical information, among the collected content; and
storing, in a storage, data having low-level characteristics, including image data, video data, and audio data, among the collected content.

5. The method of claim 4,
wherein, when at least one data write request is made to the relational database or the storage at the same time, the data of the write request is held in a work queue and is then retrieved from the work queue and stored in a first-in, first-out (FIFO) manner.

The method of claim 1,
wherein step c) comprises extracting at least one word from the video document data, extracting topic words from the extracted words through natural language processing analysis including stopword processing, stemming, and normalization, and then providing a word dictionary database built from the topic words.

7. The method of claim 6,
wherein the multimodal analysis model performs training on each item of video document data by replacing with tokens those topic words stored in the word dictionary database whose frequency is at or above a preset frequency, and
treats topic words whose frequency is below the preset frequency as a special token (unknown token).

delete

The method of claim 1,
wherein the attention algorithm selectively uses either a method of extracting feature vectors that takes into account the relationships among mode data within a time stamp of the content, or a method of extracting feature vectors that takes into account the relationships between mode data at different time stamps across the entire timeline.

The method of claim 1,
wherein the clustering algorithm comprises:
defining a similarity between the feature vectors extracted for each content item based on distance or probability;
forming K clusters by grouping similar content together through clustering of the entire set of video document data based on the defined similarity;
performing reorganization by reclassifying the K clusters based on the distance or probability; and
completing the final content classification, when the reorganization no longer changes the boundaries between clusters, through structural labeling that reorganizes the cluster labels level by level.

The method of claim 1,
wherein the multimodal analysis model:
classifies content into categories using the content text information;
sets, as the viewer reaction, the most frequent words in the viewer response text information for the content in each category; and
provides, for new content, an analysis result in which the category and the viewer reaction are combined.

The method of claim 1,
wherein step c) learns the feature vectors using multiple learning algorithms based on natural language processing models including ELMo (Embeddings from Language Models), BERT (Bidirectional Encoder Representations from Transformers), Big Bird, GPT (Generative Pre-Training), and MLM (Masked Language Model).

A multimodal content analysis system for analyzing content uploaded through a media channel, the system comprising:
a memory in which a program for performing a multimodal content analysis method is recorded; and
a processor for executing the program,
wherein the processor, by executing the program:
collects content including audio, video, subtitles, and metadata from at least one content sharing platform that provides a content sharing service through the media channel;
generates, for each content item, through natural-language-processing-based preprocessing of the collected content, video document data that is written in natural language and includes content text information containing the context of the content and user response text information containing viewers' posted reactions;
extracts a feature vector for each frame of the video document data using a multimodal analysis model based on text analysis, and trains a classifier for content classification using the extracted feature vectors; and
when new content is found by searching the media channel for content, provides an analysis result for the new content using the trained multimodal analysis model,
wherein the multimodal analysis model trains a classifier generated through a clustering algorithm that classifies the entire set of video document data into K groups based on the similarity between the feature vectors, and the trained classifier automatically classifies new content into a predefined category using a feature vector extracted from the new content,
wherein the video document data is divided into mode data including audio, motion, and frame, and an attention algorithm is applied to each type of mode data to extract feature vectors that take into account the relationships between frames, and
wherein the attention algorithm applies an attention weight to another feature vector related to one feature vector so that the analysis focuses on that feature vector.