KR102009029B1

KR102009029B1 - A contents filtering system for comparative analysis of feature information

Info

Publication number: KR102009029B1
Application number: KR1020190074100A
Authority: KR
Inventors: 김찬우
Original assignee: 주식회사 코드라인
Priority date: 2019-06-21
Filing date: 2019-06-21
Publication date: 2019-08-09

Abstract

A content filtering system for the comparative analysis of feature information according to the present invention includes: a content collection module collecting and storing content consisting of any one of a video, an image, and a sound; a feature information extraction module extracting hash value feature information of the content and storing the same as feature information; a filtering module filtering illegal content by comparing and analyzing the feature information in the content; and a result providing module transmitting the filtering result of the filtering module to a content provider. According to the present invention, an illegal copy of the content shared through an online storage service and the like can be filtered by extracting and comparing the feature information of the content, and thus the illegal copy of the content can be effectively blocked and managed.

Description

Content filtering system through comparative analysis of feature information {A CONTENTS FILTERING SYSTEM FOR COMPARATIVE ANALYSIS OF FEATURE INFORMATION}

The present invention relates to a content filtering system based on comparative analysis of feature information. More specifically, it relates to a content filtering system that extracts and compares the feature information of content in order to filter illegal copies of content shared through web hard (online storage) services and the like, so that illegal copies can be efficiently blocked and managed under affiliation agreements.

A web hard is an Internet content management service that provides a fixed amount of storage space in which users can store, view, and edit documents or content and share that content with many other people. A vast range of content, including movies, music, dramas, and entertainment shows, is distributed through web hards, and users can easily exchange it. Such content is very often exchanged without the consent of the copyright holder and is generally shared with an unspecified number of people, so its distribution not only causes serious harm to the creators and other stakeholders of the content but is also regarded as a violation of copyright law. Various methods for preventing the unauthorized distribution of illegally copied content have therefore been studied and applied.

As prior art, Korean Registered Patent No. 10-1035867 discloses a "Method and system for filtering content in a user terminal using a watermark".

The prior art is a method for filtering content in a user terminal using a watermark, comprising: extracting the watermark of the content through a program equipped with a watermark extraction module; transmitting the extracted watermark to a watermark management server; receiving, from the watermark management server, watermark information and control information based on the extracted watermark; and filtering the content based on the control information, wherein the watermark information includes at least one of user information, copyright information, copy information of the content, distribution information, information on illegality, and information on billing, and the control information includes information on whether to filter the content.

The prior art aims to solve the problem of illegal content proliferation. In a content service environment using user terminals, it serves watermarked content and, when content is retransmitted directly from one user terminal to another or downloaded from another user terminal, extracts the watermark in real time, determines whether the content is illegal, and filters it, thereby providing a real-time copyright management system. However, because it relies on watermarks, the watermark may be displayed on screen unnecessarily, and the approach falls short of performing more precise filtering.

Accordingly, there is a need for a new and improved content filtering system that extracts the feature information (DNA) of content and can thereby also filter illegally copied content that has been converted to black and white, flipped vertically or horizontally, or resized.

The present invention has been devised to overcome the problems of the above technology. Its main object is to provide a system that extracts and compares the feature information of content in order to filter illegal copies of content shared through web hards and the like, so that illegal copies can be efficiently blocked and managed under affiliation agreements.

Another object of the present invention is to enable the extraction of illegal content using hash values.

A further object of the present invention is to enable the extraction of illegal content through keyword comparison between the protected content and the analysis target content.

A further object of the present invention is, when extracting illegal content using keywords, to extract target keywords not only from the keywords contained in the analysis target content but also from text related to the analysis target content, and to use them for similarity analysis.

To achieve the above objects, the content filtering system based on comparative analysis of feature information according to the present invention comprises: a content collection module that collects and stores content consisting of any one of a video, an image, and a sound; a feature information extraction module that extracts hash value feature information of the content and stores it as feature information; a filtering module that filters illegal content by comparing and analyzing the feature information between items of content; and a result providing module that transmits the filtering result of the filtering module to a content provider.

In addition, the content collection module consists of a protected content collection unit that receives protected content from the content provider and stores it, and an analysis target collection unit that collects and stores analysis target content uploaded to a plurality of content service sites. The feature information extraction module extracts and stores the hash value feature information of the protected content, and the filtering module compares and analyzes the protected content and the analysis target content on the basis of the feature information.

Furthermore, the protected content collection unit includes a feature keyword collection part that receives feature keywords of the protected content from the content provider and stores them, and the analysis target collection unit includes a related keyword collection part that collects and stores related keywords associated with the analysis target content. The system further comprises a target keyword extraction module that extracts and stores target keywords containing characteristic words from the plurality of related keywords, and the filtering module includes a keyword comparison analysis unit that compares and analyzes the feature keywords of the protected content and the target keywords of the analysis target content.

The content filtering system based on comparative analysis of feature information according to the present invention has the following effects:

1) illegal copies of content can be efficiently blocked and managed under affiliation agreements by extracting, comparing, and filtering on the feature information of the content;

2) illegal content can be extracted using hash values;

3) illegal content can be extracted through keyword comparison; and

4) a more accurate analysis tool can be provided by extracting target keywords from text on the basis of a more precise and objective algorithm.

FIG. 1 is a block diagram showing the detailed configuration of the system of the present invention.
FIG. 2 is a process diagram showing the sequence of the filtering system of the present invention.
FIG. 3 is a conceptual diagram showing a method of extracting the feature information of content according to the present invention.
FIG. 4 is a conceptual diagram showing an embodiment of the feature information (DNA) of content according to the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The drawings are not drawn to scale, and like reference numerals in each drawing refer to like elements.

First, each component of the content filtering system 10 based on comparative analysis of feature information is described.

The content filtering system 10 may basically include a main server.

The main server may be regarded as a group of server PCs that implement the system 10 of the present invention. As described later, the system includes various databases, so the main server comprises not only that group of databases but also the system servers that implement the system. The main server implements a content monitoring system that identifies affiliated and non-affiliated content among the content uploaded to service sites and, further, filters illegal content, that is, content being sold illegally without the content provider's permission.

Here, affiliated content is content that is sold legitimately under a contract with the content provider, that is, the party holding the copyright or distribution rights to the content. In other words, it is content shared with formal permission.

Non-affiliated content is illegal content in the broad sense: content sold without a contract with the content provider, that is, content shared and sold without formal permission for sale or sharing.

Accordingly, the main server, by implementing the system 10, can distinguish affiliated from non-affiliated content among the content uploaded to a service site and extract the non-affiliated content, and can further extract illegal content in the strict sense from among the non-affiliated content. The extraction target here is content for which the content provider has requested protection, that is, protected content, whose lawful sharing is prohibited without an affiliation agreement. If protected content is found among non-affiliated content, it is plainly illegal; illegal content in the narrow sense therefore means non-affiliated content that is highly similar to protected content. Alternatively, the system may dispense with the affiliated/non-affiliated and non-affiliated/illegal distinctions and concentrate on classifying content into just two categories, normal content and illegal content; the components and functions for doing so are described in detail below.

That is, the system of the present invention monitors in real time the distribution of illegal content and the legitimate sale of affiliated content on service sites, that is, web hards, and can thereby issue blocking requests for illegal content and provide digitized distribution statistics for normal content (or protected content), illegal content, and the like.

The content filtering system 10 based on comparative analysis of feature information is described in more detail below.

FIG. 1 is a block diagram showing the detailed configuration of the system of the present invention, and FIG. 2 is a process diagram showing the sequence of the filtering system of the present invention.

Referring to FIGS. 1 and 2, the content filtering system 10 basically comprises a content collection module 100, a feature information extraction module 200, a filtering module 300, and a result providing module 400.

FIG. 3 is a conceptual diagram showing a method of extracting the feature information of content according to the present invention, and FIG. 4 is a conceptual diagram showing an embodiment of the feature information (DNA) of content according to the present invention.

The content collection module 100 collects and stores content consisting of any one of a video, an image, and a sound. The main server of the system may perform this role of collecting and storing content.

The content collection module may consist of a protected content collection unit 110 and an analysis target collection unit 120.

The protected content collection unit 110 receives protected content from the content provider and stores it. Protected content is content for which the content provider has requested protection, that is, content supplied through a legitimate distribution channel.

The analysis target collection unit 120 collects and stores analysis target content uploaded to a plurality of content service sites. Analysis target content is content that the system has been asked to check for illegality. Because it has not yet been analyzed, it may include both protected content and illegal content.

That is, the analysis target content collected by the analysis target collection unit 120 is filtered by the components and functions described below, namely a series of functions that compare and analyze the protected content and the analysis target content on the basis of the feature information, so that it can be determined whether the content is an illegal copy.

The feature information extraction module 200 extracts the hash value feature information of content and stores it as feature information. It can extract and store the feature information of the protected content that the content collection module has stored in the database. Here, hash value feature information means the hash value of the content. A hash value is a cipher-like number that condenses the characteristics of a file in order to prove that a copy of digital evidence is identical to the original; in criminal investigations it is commonly called the "fingerprint of digital evidence", so it can also be regarded as a characteristic of the content. Taking video as an example, the feature points (DNA) of each individual frame of the video can serve as the hash value feature information (up to about 300 to 800 frames may be extracted from one hour of video).
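The per-frame extraction described above can be illustrated with a minimal sketch. The patent does not specify the hashing algorithm, so this example assumes a simple average-luminance hash over a coarse grid; the names `average_hash` and `extract_feature_info`, the grid size, and the frame representation (rows of grayscale pixel values) are all hypothetical.

```python
from typing import List, Sequence

# A frame is assumed to be rows of grayscale pixel values in the range 0-255.
Frame = Sequence[Sequence[int]]


def average_hash(frame: Frame, size: int = 4) -> int:
    """Reduce one frame to a (size * size)-bit fingerprint.

    Assumes the frame is at least `size` pixels in each dimension.
    """
    height, width = len(frame), len(frame[0])
    cells = []
    # Downsample: average the pixels that fall into each grid cell.
    for gy in range(size):
        for gx in range(size):
            y0, y1 = gy * height // size, (gy + 1) * height // size
            x0, x1 = gx * width // size, (gx + 1) * width // size
            block = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def extract_feature_info(frames: List[Frame]) -> List[int]:
    """Hypothetical stand-in for module 200: one hash per sampled frame."""
    return [average_hash(frame) for frame in frames]
```

Because each cell is an average over a region, a frame and a uniformly resized copy of it tend to map to the same bits, which is one plausible way to tolerate the size manipulation the description mentions.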

The filtering module 300 filters illegal content by comparing and analyzing the feature information between items of content. It compares the feature information of the protected content (the original DNA file), extracted by the feature information extraction module, with the feature information of the illegal content (the manipulated DNA file). As described above, the feature information may be the DNA of the content, that is, its hash value (hash value feature information).

Taking the feature information of a video as an example, suppose an illegal copy was made by converting the protected video to black and white and then flipping it horizontally. By comparing the extracted feature information of the two videos, the system can determine that the content was illegally copied and filter it. That is, as shown in FIG. 4, once the feature information has been extracted and stored, copies can be detected and filtered even after black-and-white conversion, vertical or horizontal flipping, size manipulation, and the like.
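One way to picture such a comparison is sketched below. It is an illustration under stated assumptions, not the patented method: the fingerprint is assumed to be a small grid of luminance bits (so black-and-white conversion leaves it largely unchanged), flips are handled by also testing mirrored versions of the protected fingerprint, and the `max_distance` threshold is hypothetical.

```python
from typing import Iterator, Tuple

# Rows of 0/1 bits from an assumed luminance-based frame fingerprint.
BitGrid = Tuple[Tuple[int, ...], ...]


def hamming(a: BitGrid, b: BitGrid) -> int:
    """Count the bit positions in which two equally sized grids differ."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def mirrored_variants(grid: BitGrid) -> Iterator[BitGrid]:
    """Yield the grid itself plus its horizontal, vertical, and double flips."""
    horizontal = tuple(tuple(reversed(row)) for row in grid)
    yield grid
    yield horizontal
    yield tuple(reversed(grid))
    yield tuple(reversed(horizontal))


def is_probable_copy(protected: BitGrid, candidate: BitGrid, max_distance: int = 1) -> bool:
    """Flag the candidate if it is close to any mirrored variant of the original."""
    return any(hamming(v, candidate) <= max_distance for v in mirrored_variants(protected))
```

For a whole video, the same check would presumably be applied frame by frame and the per-frame results aggregated; how the aggregation is done is not stated in the source.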

The result providing module 400 transmits the filtering result of the filtering module to the content provider. It stores the filtering results for content that the filtering module has judged to be illegally copied, that is, a results database, and provides them to the user. The result produced by the filtering module is thus whether a given item is an illegal copy or legitimately distributed content. Depending on the user's requirements, the filtering results may be provided in server form or, by connecting to the Internet, in cloud form.

The protected content collection unit 110 of the content collection module 100 may include a feature keyword collection part 111, and the analysis target collection unit 120 may include a related keyword collection part 121. Accordingly, the system may additionally include a target keyword extraction module 500, and the filtering module 300 may include a keyword comparison analysis unit 310.

The feature keyword collection part 111 receives the feature keywords of the protected content from the content provider and stores them; that is, keywords that characterize the protected content are set and stored in advance. For example, when storing feature keywords for the movie "Frozen" (2013), the director's name "Chris Buck", the production company "Disney", and the character names "Elsa", "Anna", and "Olaf" may be stored as feature keywords.

In addition, the feature keyword collection part 111 may include a search keyword collection means 112, a frequency determination means 113, and a feature keyword parallel setting means 114.

The search keyword collection means 112 collects the search keywords returned as search results when the feature keywords are searched for on a plurality of portal sites. That is, it collects the result data, related search terms, and the like that are returned when the content is searched for on a portal site.

The frequency determination means 113 determines the frequency of each search keyword. Among the search keywords obtained by the search keyword collection means, those with a high frequency can be judged to be highly relevant to the content.

The feature keyword parallel setting means 114 sets at least one of the search keywords as an additional feature keyword on the basis of its frequency. That is, a search keyword that the frequency determination means has judged to be frequent is regarded as characteristic of the content and is added alongside the existing feature keywords.
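The three means above (112 to 114) can be sketched together as follows. This is a minimal illustration; the function name `augment_feature_keywords`, the input shape (one list of keywords per portal search result), and the `min_count` cutoff are assumptions, since the source does not state how "high frequency" is decided.

```python
from collections import Counter
from typing import Iterable, List, Set


def augment_feature_keywords(
    feature_keywords: Iterable[str],
    search_results: List[List[str]],
    min_count: int = 2,  # hypothetical cutoff for "frequent"
) -> Set[str]:
    """Add search keywords that recur across results to the feature keywords."""
    # Count each keyword at most once per search result.
    counts = Counter(kw for result in search_results for kw in set(result))
    frequent = {kw for kw, n in counts.items() if n >= min_count}
    return set(feature_keywords) | frequent
```

For instance, if "Elsa" and "Disney" each appear in at least two portal results for "Frozen" while "trailer" appears in only one, only the first two would be added.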

The related keyword collection part 121 collects and stores related keywords associated with the analysis target content. Related keywords may be collected by gathering the related search terms returned when the analysis target content is searched for on a portal site, by analyzing the words in the introduction or description field of the content, or by direct manual input.

The target keyword extraction module 500, which may additionally be included in the system 10, extracts and stores target keywords containing characteristic words from the plurality of related keywords. Among the related keywords collected by the related keyword collection part 121, those judged unable to characterize the analysis target content are filtered out, so that only keywords that can characterize it are extracted and stored.

The keyword comparison analysis unit 310 of the filtering module 300 compares and analyzes the feature keywords of the protected content and the target keywords of the analysis target content. By comparing how closely the feature keywords and the target keywords match, it determines whether the two items are the same content. When illegal content is uploaded to a site, the title of the copied work is generally not used as-is but is expressed indirectly, and this keyword comparison makes it possible to confirm whether the content is the same. For example, a copy of the movie "Frozen" (2013) might be uploaded under the title "Time magazine's best film of the year, director Chris Buck's biggest Disney box-office hit". In this case, the target keywords "Chris Buck" and "Disney" among the keywords of the copy, that is, the analysis target content, make it possible to confirm that it is the same content as "Frozen" (2013).
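The comparison performed by unit 310 can be sketched as a set overlap. The source does not define how the degree of matching is measured, so this example simply counts case-insensitive exact matches; the function names and the `min_matches` threshold are hypothetical.

```python
from typing import Iterable, Set


def matched_keywords(feature_keywords: Iterable[str], target_keywords: Iterable[str]) -> Set[str]:
    """Return the keywords present in both sets, compared case-insensitively."""
    features = {k.casefold() for k in feature_keywords}
    targets = {k.casefold() for k in target_keywords}
    return features & targets


def looks_like_same_content(
    feature_keywords: Iterable[str],
    target_keywords: Iterable[str],
    min_matches: int = 2,  # hypothetical threshold
) -> bool:
    """Flag the analysis target if enough feature keywords reappear in it."""
    return len(matched_keywords(feature_keywords, target_keywords)) >= min_matches
```

With the "Frozen" example, the feature keywords "Chris Buck" and "Disney" both reappear among the target keywords of the disguised title, so the upload would be flagged under this threshold.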

The basic function of the target keyword extraction module 500 is keyword analysis of the plurality of related keywords and, further, extraction of target keywords from those keywords. Keyword analysis has generally relied on AHP-style hierarchical analysis. More preferably, however, the module goes beyond simple hierarchical analysis and performs a finer-grained analysis through latent class analysis to generate the target keywords, which makes the generation of target keyword feature information more effective.

To this end, the target keyword extraction module 500 may include a word identification unit 510, an analysis keyword generation unit 520, a class classification unit 530, a keyword group generation unit 540, and a target keyword designation unit 550. Each component is described in more detail below.

The word identification unit 510 identifies the words contained in a related keyword. Words are delimited primarily by spaces, so the number of words depends on the spacing. If the related keyword contains no spaces, it is divided into words according to roots and proper nouns. For example, '어메이징 스파이더맨' ("Amazing Spider-Man") is identified as two words, and if the related text is entered without a space as '어메이징스파이더맨', it is divided by root and proper noun into the two words 어메이징 (Amazing, root) and 스파이더맨 (Spider-Man, proper noun). In other words, unspaced text is basically divided by root, but when two or more roots together form a single proper noun, that proper noun is itself treated as one word.
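The segmentation rule can be approximated with a short sketch. It assumes a hypothetical vocabulary of known roots and proper nouns and uses greedy longest-match, so that a proper noun made of several roots wins over its parts; the source does not say how roots and proper nouns are actually recognized.

```python
from typing import List, Set


def _segment(token: str, vocabulary: Set[str]) -> List[str]:
    """Split an unspaced token by greedy longest match against the vocabulary."""
    pieces, start = [], 0
    while start < len(token):
        # Try the longest candidate first so proper nouns beat their roots.
        for end in range(len(token), start, -1):
            if token[start:end] in vocabulary:
                break
        else:
            return [token]  # cannot be segmented: keep the token whole
        pieces.append(token[start:end])
        start = end
    return pieces


def split_words(text: str, vocabulary: Set[str]) -> List[str]:
    """Split on whitespace first, then segment each unspaced token."""
    words: List[str] = []
    for token in text.split():
        words.extend(_segment(token, vocabulary))
    return words
```

With a vocabulary containing 어메이징, 스파이더, 맨, and 스파이더맨, both the spaced and the unspaced form of the title yield the same two words, because the longest match 스파이더맨 is preferred over 스파이더 followed by 맨.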

The analysis keyword generation unit 520 generates analysis keywords by filtering the plurality of words extracted by the word identification unit 510. Punctuation marks and symbols such as periods, quotation marks, and question marks have little effect on text analysis and are deleted by default. The filtering is preferably based on nominalization of the extracted words. Pronouns other than personal pronouns are already in noun form and therefore need no conversion. Words that cannot be nominalized, such as prepositions, are deleted by default.
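The deletion rules above can be sketched as a simple filter. Nominalization itself requires a morphological analyzer and is left out here; the sketch only strips punctuation and drops words from a hypothetical list of non-nominalizable words (`NON_NOMINAL`, here a few English prepositions chosen for illustration).

```python
import string
from typing import Iterable, List

# Hypothetical list of words that cannot be nominalized (prepositions).
NON_NOMINAL = {"of", "in", "on", "at", "by", "for", "with", "to", "from"}

# ASCII punctuation plus common curly quotation marks.
_PUNCTUATION = string.punctuation + "“”‘’"


def make_analysis_keywords(words: Iterable[str]) -> List[str]:
    """Strip surrounding punctuation and drop empty or non-nominalizable words."""
    keywords = []
    for word in words:
        cleaned = word.strip(_PUNCTUATION)
        if not cleaned or cleaned.casefold() in NON_NOMINAL:
            continue
        keywords.append(cleaned)
    return keywords
```

Only leading and trailing punctuation is removed, so hyphens inside a word such as "Spider-Man" are preserved.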

The layer classification unit 530 performs latent class analysis (LCA) on the plurality of analysis keywords generated by the analysis keyword generation unit 520 and classifies them into a plurality of layers (classes). Latent class analysis is one part of a structural equation model, namely the part that represents causal relationships and correlations among latent variables. Here the latent variables are the analysis keywords, so the keywords can be stratified and classified according to the causal relationships and correlations among them.

Because the analysis keywords stratified by this latent class analysis form groups estimated on the basis of similarity, the target keyword feature information generated by this method can likewise be organized by type.

AHP-style hierarchical analysis, which has been widely used in text analysis, is based on clustering. Cluster analysis is a simple method that attempts classification from the data values themselves; unlike a mixture model, it cannot classify on the basis of coefficients estimated by a particular statistical method (for example, typing based on the rates of change estimated in a latent growth model). Latent class analysis, by contrast, offers various statistical indices for deciding the number of groups and can incorporate longitudinal analysis, influencing variables, and outcome variables. It can therefore be combined with many kinds of analysis, which makes it a very powerful and flexible method that can be regarded as top-tier for classification.

As stated above, the layer classification unit 530 stratifies a plurality of keywords and classifies them into a plurality of layers. For this purpose, the layer classification unit 530 basically includes a factor extraction part 531 and a model application part 532.

The factor extraction part 531 extracts category factors from the plurality of analysis keywords. A category factor is a keyword that can indicate a characteristic of the content to be provided. For example, keywords related to the genre of that content, or to its title or file name, may be extracted as category factors.

The model application part 532 classifies the category factors extracted by the factor extraction part 531 into layers through a latent variable model, based on their mutual similarity. A latent variable model requires the number of layers and the parameters to be determined, so additional components for determining them are needed. To apply the model application part 532 smoothly, the layer classification unit 530 may therefore further include a parameter range determination part 533 and a layer count determination part 534.

The parameter range determination part 533 determines a parameter range according to the number of keywords extracted from the analysis keywords. One value within this parameter range becomes the number of layers into which the analysis keywords are actually classified. Only the range is fixed, rather than the number of layers itself, in order to permit a more exploratory and descriptive analysis in which the data, that is, the analysis keywords, are judged inductively. In addition, because the model is thoroughly validated when the number of layers is determined, accurate analysis is possible without presetting the number of layers. The size of the parameter range is not limited, but the range must include the number of extracted keywords. For example, if 100 keywords are extracted, the parameter range must include 100, such as 1 to 100, 1 to 1000, or 50 to 150.
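The constraint on the parameter range can be expressed as a small check. The sketch below assumes a hypothetical helper `candidate_layer_counts` that returns every integer in the range as a candidate number of layers and rejects a range that omits the keyword count.

```python
def candidate_layer_counts(num_keywords, lower, upper):
    """Return the integers in [lower, upper] to try as preliminary parameters.

    The range may be any size, but it must contain the number of
    extracted keywords, as the description requires.
    """
    if lower < 1 or lower > upper:
        raise ValueError("range must satisfy 1 <= lower <= upper")
    if not lower <= num_keywords <= upper:
        raise ValueError("parameter range must include the number of extracted keywords")
    return list(range(lower, upper + 1))


# With 100 extracted keywords, 50..150 is valid and yields 101 candidates.
candidates = candidate_layer_counts(100, 50, 150)
```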

The layer count determination part 534 calculates the Akaike information index (AIC), the Bayesian information index (BIC), and the adjusted Bayesian information index for each integer in the parameter range, compares the values, and designates one value in the parameter range as the number of layers. As described above, the number of layers is not fixed in advance; a range is set, and the most suitable value is then chosen by actually fitting the model. More preferably, each integer in the parameter range is designated as a preliminary parameter, the three information indices are calculated for each preliminary parameter, and the calculated values are analyzed so that the integer whose value, relative to the substituted integer, is closest to the maximum likelihood estimate is designated as the number of layers.

To do this, the maximum likelihood estimate must first be calculated. It can be calculated through Equations 1 and 2 below.

Equation 1:

[Equation 1 appears as an image in the original publication and is not reproduced in this text.]

(Here, the symbols of Equation 1 denote, in order: the index set for an analysis keyword; the likelihood function for that index and the preliminary parameter; the preliminary parameter, which is any one of the numbers included in the parameter range; and the total number of analysis keywords.)

Equation 2:

[Equation 2 appears as an image in the original publication and is not reproduced in this text.]

(Here, the symbols of Equation 2 denote, in order: the maximum likelihood estimate; the index set for an analysis keyword; and the total number of analysis keywords.)

In more detail, the index set for an analysis keyword is the frequency with which that analysis keyword appears in the related text, and the total number of analysis keywords is the total number of analysis keywords extracted from the words included in the related keywords.

Obtaining the likelihood function can be difficult by hand calculation, so a statistical program is recommended; a statistics package such as MPlus may preferably be used. The usual goal with a likelihood function is to find its maximum. Here, the preferred maximum is one where the value of the preliminary parameter is small while the value of the likelihood function is large.

Furthermore, as noted above, the layer count determination part 534 designates each integer in the parameter range as a preliminary parameter, calculates the Akaike information index, the Bayesian information index, and the adjusted Bayesian information index for each, analyzes the calculated values, and designates as the number of layers the integer whose value is closest to the maximum likelihood estimate relative to the substituted integer. The respective formulas are as follows.

Calculation of the Akaike information index

Equation 3: [Equation 3 appears as an image in the original publication and is not reproduced in this text.]

Calculation of the Bayesian information index

Equation 4: [Equation 4 appears as an image in the original publication and is not reproduced in this text.]

Calculation of the adjusted Bayesian information index

Equation 5: [Equation 5 appears as an image in the original publication and is not reproduced in this text.]

(Here, the symbols of Equations 3 to 5 denote, in order: the Akaike information index; the Bayesian information index; the adjusted Bayesian information index; the index set for an analysis keyword; the likelihood function for that index and the preliminary parameter; the preliminary parameter, which is any one of the numbers included in the parameter range; and the total number of analysis keywords.)

The three information indices are calculated and compared, first of all, to judge which model fits best. Moreover, each index imposes a different penalty depending on the number of parameters and the number of samples. By calculating all three and comparing them, the information index model whose likelihood value is closest to the maximum likelihood estimate relative to the substituted integer is judged the most suitable, and that integer is designated as the number of layers.
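For orientation, the three indices have widely used textbook forms: AIC = -2 ln L + 2p, BIC = -2 ln L + p ln n, and the sample-size adjusted BIC = -2 ln L + p ln((n + 2) / 24), where ln L is the maximized log-likelihood, p the number of free parameters, and n the sample size. The sketch below uses those forms as an assumption; they may differ from the patent's own Equations 3 to 5. Picking the candidate with the smallest index value is the conventional rule and is likewise an assumption here, and the function names and the `fits` input are hypothetical.

```python
import math

def information_indices(log_likelihood, num_params, n):
    """Textbook AIC, BIC and sample-size adjusted BIC (assumed forms)."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2.0 * num_params,
        "BIC": deviance + num_params * math.log(n),
        "SABIC": deviance + num_params * math.log((n + 2) / 24.0),
    }

def select_layer_count(fits, n):
    """Pick, per index, the candidate layer count with the smallest value.

    fits: maps a candidate layer count to (log_likelihood, num_params),
          as reported by whatever tool fitted each model (assumed input).
    """
    scores = {k: information_indices(ll, p, n) for k, (ll, p) in fits.items()}
    return {
        name: min(scores, key=lambda k: scores[k][name])
        for name in ("AIC", "BIC", "SABIC")
    }


# Made-up fit results for 1, 2 and 3 layers on 100 analysis keywords.
fits = {1: (-520.0, 3), 2: (-480.0, 7), 3: (-478.0, 11)}
best = select_layer_count(fits, n=100)
```

In this made-up example the third layer barely improves the log-likelihood, so all three indices prefer two layers; when the indices disagree, the differing penalties mentioned above are the reason.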

Calculating each information index can also be difficult by hand, so a statistical program such as MPlus may preferably be used.

Once the number of layers has been designated through the above configuration, layer classification by the model application part 532 becomes possible. The classification can be performed through Equation 6 below.

Equation 6:

[Equation 6 appears as an image in the original publication and is not reproduced in this text.]

(Here, the symbols of Equation 6 denote, in order: the value of the latent variable model according to the vector value of the dependent variables; the vector of dependent variables for an analysis keyword; the class value of the extracted category factor; each layer; and the number of layers determined by the layer count determination part.)

Here, the class value is the frequency with which the analysis keyword designated as a category factor appears in the related text, and the number of layers is the value calculated through Equations 1 to 5 above. A person skilled in the art can readily apply this latent variable model as well by using a statistical program such as MPlus.
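As a generic illustration of how a fitted latent class model assigns an item to a layer, the sketch below computes posterior class probabilities for binary indicators and takes the most probable class. It is a standard textbook formulation and is not claimed to be the patent's Equation 6; the function names, the binary indicators, and the example parameters are all assumptions.

```python
import math

def posterior_class_probabilities(indicators, class_weights, item_probs):
    """P(class | indicators) for a binary latent class model.

    indicators:    0/1 values observed for one keyword
    class_weights: prior probability of each class (sums to 1)
    item_probs:    item_probs[c][j] = P(indicator j == 1 | class c), in (0, 1)
    """
    log_joint = []
    for weight, probs in zip(class_weights, item_probs):
        total = math.log(weight)
        for y, p in zip(indicators, probs):
            total += math.log(p if y else 1.0 - p)
        log_joint.append(total)
    peak = max(log_joint)  # subtract the max for numerical stability
    unnormalised = [math.exp(v - peak) for v in log_joint]
    norm = sum(unnormalised)
    return [u / norm for u in unnormalised]

def assign_layer(indicators, class_weights, item_probs):
    """Modal assignment: index of the class with the highest posterior."""
    post = posterior_class_probabilities(indicators, class_weights, item_probs)
    return max(range(len(post)), key=post.__getitem__)


# Two layers, two indicators; the parameters are made up for illustration.
weights = [0.5, 0.5]
probs = [[0.9, 0.1], [0.2, 0.8]]
post = posterior_class_probabilities([1, 0], weights, probs)
layer = assign_layer([1, 0], weights, probs)
```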

Accordingly, comparing multiple models and determining the number of layers from that comparison enables stratification with high accuracy and yields unbiased values. Furthermore, as mentioned above, the keywords stratified through latent class analysis form groups estimated on the basis of similarity, so analysis keywords organized by type can be obtained with this method.

The keyword group generation unit 540 groups the classified analysis keywords to generate keyword groups. That is, the keywords classified into each layer are bundled into one keyword group, producing keyword groups by type.

The target keyword designation unit 550 designates at least one of the analysis keywords included in each keyword group as a target keyword. All the analysis keywords in a keyword group may be designated as target keywords, or only one from each keyword group may be designated. The system administrator may vary the number of designated target keywords depending on whether speed or accuracy is to be given more weight in monitoring and similarity analysis.
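The grouping and designation steps can be sketched together as follows. The source does not say how the designated keywords are chosen within a group, so ranking by frequency is an assumption made for this example; `group_keywords`, `designate_targets`, and `per_group` are hypothetical names.

```python
from collections import defaultdict

def group_keywords(layer_of):
    """Bundle keywords by the layer each was classified into."""
    groups = defaultdict(list)
    for keyword, layer in layer_of.items():
        groups[layer].append(keyword)
    return dict(groups)

def designate_targets(groups, frequency, per_group=None):
    """Designate target keywords in each group.

    per_group=None keeps every keyword (favouring accuracy); a small
    per_group keeps only the most frequent ones (favouring speed).
    Ranking by frequency is an assumption of this sketch.
    """
    targets = {}
    for layer, keywords in groups.items():
        ranked = sorted(keywords, key=lambda k: (-frequency.get(k, 0), k))
        targets[layer] = ranked if per_group is None else ranked[:per_group]
    return targets


layer_of = {"spider-man": 0, "amazing": 0, "action": 1, "hero": 1}
frequency = {"spider-man": 9, "amazing": 4, "action": 6, "hero": 2}
groups = group_keywords(layer_of)
fast = designate_targets(groups, frequency, per_group=1)
full = designate_targets(groups, frequency)
```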

The content filtering system through comparative analysis of feature information according to the present invention has been described above and shown in the drawings. These are merely examples; the spirit of the present invention is not limited to the above description and drawings, and various changes and modifications are possible without departing from the technical spirit of the invention.

10: system 100: content collection module
110: protection target collection unit 111: feature keyword collection part
112: search keyword collection means 113: frequency determination means
114: feature keyword parallel setting means 120: analysis target collection unit
121: related keyword collection part 200: feature information extraction module
300: filtering module 310: keyword comparison analysis unit
400: result providing module 500: target keyword extraction module
510: word identification unit 520: analysis keyword generation unit
530: layer classification unit 531: factor extraction part
532: model application part 533: parameter range determination part
534: layer count determination part 540: keyword group generation unit
550: target keyword designation unit

Claims

A content filtering system through comparative analysis of feature information, comprising:
a content collection module having a protection target collection unit that receives protected content and feature keywords of the protected content from a content provider and stores them, and an analysis target collection unit that collects and stores analysis content uploaded to a plurality of content service sites and related keywords associated with the analysis content;
a feature information extraction module that extracts hash-value feature information of the protected content and stores it as feature information;
a target keyword extraction module that extracts and stores, from the plurality of related keywords, target keywords including characterizing words;
a filtering module that compares and analyzes the protected content and the analysis content based on the feature information, and compares and analyzes the feature keywords of the protected content and the target keywords of the analysis content; and
a result providing module that transmits the filtering result of the filtering module to the content provider,
wherein the target keyword extraction module comprises:
a word identification unit that identifies the words included in the related keywords; an analysis keyword generation unit that generates analysis keywords by filtering the plurality of words; a layer classification unit that classifies the plurality of analysis keywords into a plurality of layers through latent class analysis (LCA); a keyword group generation unit that groups the classified analysis keywords to generate keyword groups; and a target keyword designation unit that designates at least one of the analysis keywords included in each keyword group as a target keyword.

The content filtering system of claim 1,
wherein the feature keyword collection part comprises:
search keyword collection means for collecting search keywords returned as search results when the feature keyword is searched on a plurality of portal sites;
frequency determination means for determining the frequency of each search keyword; and
feature keyword parallel setting means for setting at least one of the search keywords as a feature keyword based on the frequency.

The content filtering system of claim 1,
wherein the layer classification unit comprises:
a factor extraction part that extracts category factors from the plurality of analysis keywords; and
a model application part that classifies the plurality of category factors into a plurality of layers through a latent variable model based on mutual similarity.

The content filtering system of claim 3,
wherein the layer classification unit further comprises:
a parameter range determination part that determines a parameter range based on the number of extracted analysis keywords; and
a layer count determination part that designates each integer included in the parameter range as a preliminary parameter, calculates an Akaike information index, a Bayesian information index, and an adjusted Bayesian information index for each preliminary parameter, analyzes the calculated values, and designates as the number of layers the integer whose value is closest to the maximum likelihood estimate relative to the substituted integer.

The content filtering system of claim 4,
wherein the maximum likelihood estimate is calculated through Equations 1 and 2 below.
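Equation 1:

[Equation 1 appears as an image in the original publication and is not reproduced in this text.]

(Here, the symbols of Equation 1 denote, in order: the index set for an analysis keyword; the likelihood function for that index and the preliminary parameter; and the total number of analysis keywords.)

Equation 2:

[Equation 2 appears as an image in the original publication and is not reproduced in this text.]

(Here, the symbols of Equation 2 denote, in order: the maximum likelihood estimate; the index set for an analysis keyword; and the total number of analysis keywords.)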

delete