KR102456410B1

KR102456410B1 - Method and apparatus for classifying events in crowd-sourcing environment

Info

Publication number: KR102456410B1
Application number: KR1020200168857A
Authority: KR
Inventors: 윤찬현; 김지환; 이경채
Original assignee: 한국과학기술원
Priority date: 2019-12-09
Filing date: 2020-12-04
Publication date: 2022-10-20
Also published as: KR20210072714A

Abstract

An event classification method according to an embodiment of the present invention may include: receiving event information from a user; performing pre-processing on the event information and extracting an event keyword; generating a first event cluster and a second event cluster by performing topic modeling based on the event keyword and pre-stored event storage information; and classifying the event information by comparing the first event cluster with the second event cluster.

Description

METHOD AND APPARATUS FOR CLASSIFYING EVENTS IN CROWD-SOURCING ENVIRONMENT

The present invention relates to an apparatus and method for classifying events in a crowd-sourcing environment. More specifically, the present invention relates to an apparatus and method that determine whether event objects duplicate one another and classify events, based on text and image keywords extracted from event data collected in a crowd-sourcing environment.

Crowd-sourcing is a compound word meaning the use of the resources of the general public (crowd) and of external experts (outsourcing). When crowd-sourcing is used, involving the public in activities such as production and services in various fields has the effect of reducing the time and cost of problem solving. To make use of crowd-sourcing, it is important to efficiently manage and verify the large amount of unstructured information that participants produce.

Conventionally, crowd-sourced data has been verified by predicting data quality from the reliability of the crowd-sourcing participants, or by performing a series of tests using a test data set. However, these methods have problems, such as an increase in the labor cost of the workers who perform the verification.

Meanwhile, data collected in a crowd-sourcing environment has the advantage that a large amount of information can easily be collected as the number of participants grows. On the other hand, participants often distort information intentionally or omit information by mistake, so the data must be verified for reliability. In addition, when several participants enter information about the same target event object, management is needed, such as integrating the data depending on whether it is duplicated.

Therefore, when event data is uploaded by crowd-sourcing participants, the data needs to be classified as an event object containing false information, a new event object containing accurate information, or an event object that contains accurate information but duplicates an existing event object.

Meanwhile, the LDA (Latent Dirichlet Allocation) topic modeling algorithm is one of the statistical topic modeling techniques and is widely used in the field of text mining. LDA topic modeling is a probabilistic model that analyzes, for a given large collection of documents, which topics are present in each document based on major keywords. Because it can estimate the distribution of words per topic and the distribution of topics per document, it can be used to cluster large numbers of documents.

In addition, much research is being conducted on extracting core topics, major issues, or specific events from unstructured text data such as news and SNS (Social Network Service) posts on Twitter, Facebook, and the like, and there are cases in which trend analysis systems and similar systems have been implemented using this research.

However, LDA topic modeling is a hybrid of statistics, which deals with probabilistic models, and NLP (Natural Language Processing), in which morphemes are analyzed from text for semantic analysis and a computer interprets and processes the meaning. It therefore has the disadvantage of being difficult for general users to model and use.

Korean Registered Patent Publication No. 10-1713831 (registered March 2, 2017)

An object of the present invention is to provide an apparatus and method for classifying events in a crowd-sourcing environment.

Another object of the present invention is to provide an event classification apparatus and method in which an LDA topic modeling analysis technique is combined with a K-means clustering-based proximity event object classification technique.

Another object of the present invention is to provide an event classification apparatus and method capable of clustering and classifying identical event objects by automatically analyzing, within the system, the image, text, and spatiotemporal information of event data collected by crowd-sourcing participants.

Another object of the present invention is to provide an event classification apparatus and method capable of compensating for the low accuracy of topic modeling on small amounts of text by performing LDA topic modeling with NLP-based text keyword extraction and deep learning-based image keyword extraction.

Another object of the present invention is to provide an event classification apparatus and method capable of classifying whether event data uploaded by a crowd-sourcing participant contains false information and whether it is new event data.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

An event classification method according to an embodiment of the present invention may include: receiving event information from a user; performing pre-processing on the event information and extracting an event keyword; generating an entire event cluster and a proximity event cluster by performing topic modeling based on the extracted event keyword and pre-stored event information; and classifying the event information by comparing the entire event cluster with the proximity event cluster.

In the method, the topic modeling may be LDA (Latent Dirichlet Allocation) topic modeling.

In the method, the event information may include text information, image information, event time information, and event location information.

In the method, extracting the event keyword may include: extracting, as image keywords, the top n labels that match the image information in the pre-processed event information, using a pre-trained neural network; and extracting a text keyword of the event information through morphological analysis of the text information. The neural network may be trained so that, when reference image information and a reference image keyword are input, an image keyword for the reference image information is output.
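
The top-n label selection described above can be illustrated with a minimal sketch. It assumes the neural network has already produced a score per label; the function names `top_n_labels` and `merge_keywords`, the example labels, and the idea of merging text and image keywords into one ordered list are all illustrative assumptions, not details given in this specification.

```python
from typing import Dict, Iterable, List


def top_n_labels(label_scores: Dict[str, float], n: int) -> List[str]:
    """Return the n highest-scoring labels, best first (ties broken by name)."""
    ranked = sorted(label_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ranked[:n]]


def merge_keywords(text_keywords: Iterable[str],
                   image_keywords: Iterable[str]) -> List[str]:
    """Combine text and image keywords, dropping repeats but keeping order."""
    seen = set()
    merged = []
    for keyword in list(text_keywords) + list(image_keywords):
        if keyword not in seen:
            seen.add(keyword)
            merged.append(keyword)
    return merged


# Hypothetical classifier output for one uploaded image.
scores = {"concert": 0.81, "crowd": 0.64, "stage": 0.59, "street": 0.12}
image_keywords = top_n_labels(scores, n=3)

# Hypothetical text keywords, as if produced by morphological analysis.
event_keywords = merge_keywords(["festival", "concert"], image_keywords)
```

Merging the two keyword lists is one simple way to give a short post more terms for topic modeling; an actual implementation could weight or store them separately.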

In the method, generating the entire event cluster and the proximity event cluster may include generating the entire event cluster by performing the topic modeling and clustering those keywords, among the event storage keywords derived from the pre-stored event information, that have a first similarity to the extracted event keyword.

In the method, generating the entire event cluster and the proximity event cluster may include: performing clustering to generate, from the pre-stored event information, a candidate cluster having a preset similarity to the event time information and the event location information; and generating the proximity event cluster by performing the topic modeling and clustering those keywords, among the event storage keywords derived from the candidate cluster, that have a second similarity to the extracted event keyword.

In the method, the clustering may be K-means clustering.

In the method, classifying the event information may include classifying the event information as duplicate event information when there is event storage information that is included in both the entire event cluster and the proximity event cluster.

In the method, classifying the event information may include classifying the event information as new event information when no event information exists in either the entire event cluster or the proximity event cluster.

In the method, classifying the event information may include classifying the event information as false event information when event storage information exists in each of the entire event cluster and the proximity event cluster without overlap.

In the method, classifying the event information may include classifying the event information as false event information when event storage information exists in the entire event cluster but does not exist in the proximity event cluster.
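
The four classification rules above can be read as a small decision function over the two clusters. The sketch below is illustrative only: it assumes each cluster is represented as a set of stored-event identifiers, and the function name `classify_event` and the label strings are hypothetical. The case in which only the proximity event cluster is non-empty is not covered by the rules stated here, so the sketch returns "unspecified" for it as an explicit assumption.

```python
from typing import Set


def classify_event(entire_cluster: Set[str], proximity_cluster: Set[str]) -> str:
    """Classify a new event from the stored events found in each cluster."""
    if not entire_cluster and not proximity_cluster:
        # Nothing similar was found anywhere: treat as a new event.
        return "new"
    if entire_cluster & proximity_cluster:
        # At least one stored event appears in both clusters: duplicate.
        return "duplicate"
    if entire_cluster:
        # Either both clusters are non-empty without overlap, or only the
        # entire event cluster is non-empty: both cases are classified as false.
        return "false"
    # Only the proximity cluster is non-empty; not covered by the stated rules.
    return "unspecified"


result_duplicate = classify_event({"e1", "e2"}, {"e2", "e7"})
result_new = classify_event(set(), set())
result_false_disjoint = classify_event({"e1"}, {"e9"})
result_false_far = classify_event({"e1"}, set())
```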

An event classification apparatus according to an embodiment of the present invention may include a communication unit that receives event information from a user, and a processor. The processor may perform pre-processing on the event information, extract an event keyword, generate an entire event cluster and a proximity event cluster by performing topic modeling based on the extracted event keyword and pre-stored event storage information, and classify the event information by comparing the entire event cluster with the proximity event cluster.

In the apparatus, the topic modeling may be LDA (Latent Dirichlet Allocation) topic modeling.

In the apparatus, the event information may include text information, image information, event time information, and event location information.

In the apparatus, the processor may extract, as image keywords, the top n labels that match the image information in the pre-processed event information, using a pre-trained neural network, and may extract a text keyword of the event information as the event keyword through morphological analysis of the text information. The neural network may be trained so that, when reference image information and a reference image keyword are input, an image keyword for the reference image information is output.

In the apparatus, the processor may generate the entire event cluster by performing the topic modeling and clustering those keywords, among the event storage keywords derived from the pre-stored event information, that have a second similarity to the extracted event keyword.

In the apparatus, the processor may perform K-means clustering to generate, from the event storage information, a candidate cluster having a preset similarity to the event time information and the event location information, and may generate the proximity event cluster by performing the topic modeling and clustering those keywords, among the event storage keywords derived from the candidate cluster, that have a second similarity to the extracted event keyword.

In the apparatus, the processor may classify the event information as duplicate event information when there is event storage information that is included in both the entire event cluster and the proximity event cluster.

In the apparatus, the processor may classify the event information as new event information when no event storage information exists in either the entire event cluster or the proximity event cluster.

In the apparatus, the processor may classify the event information as false event information when event storage information exists in each of the entire event cluster and the proximity event cluster without overlap, or when event storage information exists in the entire event cluster but does not exist in the proximity event cluster.

In a computer-readable recording medium storing a computer program according to an embodiment of the present invention, the computer program may include instructions for causing a processor to perform an event classification method including: receiving event information from a user; performing pre-processing on the event information and extracting an event keyword; generating an entire event cluster and a proximity event cluster by performing topic modeling based on the extracted event keyword and pre-stored event storage information; and classifying the event information by comparing the entire event cluster with the proximity event cluster.

A computer program stored in a computer-readable recording medium according to an embodiment of the present invention may include instructions for causing a processor to perform an event classification method including: performing pre-processing on event information received from a user and extracting an event keyword; generating an entire event cluster and a proximity event cluster by performing topic modeling based on the extracted event keyword and pre-stored event storage information; and classifying the event information by comparing the entire event cluster with the proximity event cluster.

According to an embodiment of the present invention, an event classification apparatus and method may be provided that can cluster and classify identical event objects by automatically analyzing, within the system, the image, text, and spatiotemporal information of event data collected by crowd-sourcing participants.

In addition, according to an embodiment of the present invention, LDA topic modeling is performed using not only NLP-based keyword extraction from text data but also a deep learning-based image classification technique, so topic modeling with relatively high accuracy may be possible even for small amounts of text.

In addition, according to an embodiment of the present invention, a crowd-sourcing environment can be used to collect and manage event information, so the amount of information that can be shared may increase and the cost of employing administrators may decrease, compared with the conventional approach in which an administrator uploads and manages each event individually.

The effects obtainable from embodiments of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a diagram for explaining a process in which a crowd-sourced event object is analyzed according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the configuration of a crowd-sourced event data management module according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the configuration of a crowd-sourced event object classification module according to an embodiment of the present invention.
FIG. 4 is a flowchart for explaining a crowd-sourced event classification method according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining the configuration of an event classification apparatus according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining a process in which image keywords are extracted from crowd-sourced image data according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining a process in which LDA topic modeling is performed on the entire event group according to an embodiment of the present invention.
FIG. 8 is a diagram for explaining a process in which LDA topic modeling is performed on a proximity event group according to an embodiment of the present invention.
FIG. 9 is a diagram for explaining a process in which event objects are classified according to an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention; the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations will be omitted when they are judged to unnecessarily obscure the gist of the present invention. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intentions or customs of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

Terms such as '… unit' and '… device' used below refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

The terms used in this specification are described below.

The LDA (Latent Dirichlet Allocation) topic modeling algorithm refers to a probabilistic model that analyzes, for a given large collection of documents, which topics are present in each document based on major keywords. In other words, LDA topic modeling refers to an algorithm that derives topics for documents containing a large amount of text, such as news articles.

Meanwhile, because LDA topic modeling is based on Bayesian statistical methodology, the values of its priors have an important influence on its results: the Dirichlet distribution parameter α, used to generate the topic distribution of each document, and the Dirichlet distribution parameter β, used to generate the word distribution of each topic. The larger the prior parameter, the more similar the distributions of the topics become; the smaller it is, the more strongly a particular topic stands out. The initial value is usually a default value or is chosen arbitrarily by a domain expert.
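
The effect of the prior parameter can be seen numerically with a short sketch. It is illustrative only and is not LDA itself: it draws topic proportions from a symmetric Dirichlet distribution (sampled by normalizing gamma draws) and reports the average share of the largest topic. The function names, the number of topics `k`, and the two α values are assumptions chosen for the demonstration.

```python
import random
from typing import List


def sample_dirichlet(alpha: float, k: int, rng: random.Random) -> List[float]:
    """Draw one sample from a symmetric Dirichlet(alpha, ..., alpha) over k topics."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0:  # guard against an all-zero draw caused by underflow
            return [d / total for d in draws]


def mean_largest_share(alpha: float, k: int = 5,
                       trials: int = 2000, seed: int = 0) -> float:
    """Average share of the dominant topic over many sampled distributions."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(trials)) / trials


# Small alpha: one topic tends to dominate each sampled distribution.
sparse = mean_largest_share(alpha=0.1)
# Large alpha: topic proportions tend to be close to uniform.
smooth = mean_largest_share(alpha=10.0)
```

With a small α the dominant topic takes most of the probability mass, while with a large α the proportions are close to equal, which matches the behavior described above.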

Because LDA topic modeling is an algorithm that derives topics for documents containing a large amount of text, its accuracy tends to be relatively low when analyzing topics for documents containing a small amount of text, such as posts on Twitter or Facebook. Topics also tend to be derived in a way that depends too heavily on the keywords of the document being analyzed. According to an embodiment of the present invention, keywords extracted by deep learning from the image data collected together with the text data are additionally used, so relatively accurate topic analysis may be possible even when a document contains only a small amount of text.

Clustering is a technique for grouping data with similar characteristics, and may refer to a learning method widely used for pattern recognition, data compression, and the like, for example grouping similar news items or users with similar usage patterns. Because clustering generally groups data that has not been labeled, an unsupervised learning method may be used.

Among clustering methods, the K-means clustering algorithm belongs to the partitioning methods, which divide the given data into several partitions. It may refer to a clustering algorithm that places n center points, finds the positions of the n center points that minimize the sum of the distances between each center point and its points, and then groups the points close to each center point around that center point.

Meanwhile, the K-means clustering algorithm divides data into k groups, each consisting of one or more data objects. The groups may be formed by minimizing a cost function such as a distance-based dissimilarity between groups. In this process, the similarity between data objects in the same group increases, and the similarity to data objects in other groups decreases. The K-means clustering algorithm is easy to implement and has the advantage of O(n) time complexity when the number of patterns is n, so it is widely used on large data sets.
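
The assign-and-update loop behind K-means can be sketched in a few lines. This is a minimal, illustrative implementation on 2-D points using Euclidean distance; the function name `kmeans`, the fixed iteration count, the random initialization, and the sample coordinates are assumptions made for the example, not details of the embodiment.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]


def kmeans(points: List[Point], k: int, iterations: int = 20,
           seed: int = 0) -> Tuple[List[int], List[Point]]:
    """Group 2-D points into k clusters; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels, centers


# Two well-separated groups of hypothetical event coordinates.
events = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(events, k=2)
```

Each pass over the data costs time proportional to the number of points for a fixed k, which is the linear behavior referred to above. For event data, the time dimension could be added as a third coordinate after scaling it to be comparable with location.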

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram for explaining a process in which a crowd-sourced event object is analyzed according to an embodiment of the present invention.

Referring to FIG. 1, a smart city event sharing map scenario in a crowd-sourcing environment to which an embodiment of the present invention is applied is illustrated.

Each crowd-sourcing participant who uses the shared map service can upload data about a specific event occurring nearby to the shared map service as crowd-sourced data, through an app on their mobile device (e.g., a smartphone). In this specification, a crowd-sourcing participant may be referred to as a user, and crowd-sourced data may be referred to as event data.

Each item of event data may include text data, image data, location data, and time data, and the event data may be, for example, cultural information or information about an organized event. For the event data uploaded by each user, the event classification apparatus 1000 according to an embodiment of the present invention may determine whether items of event data duplicate one another and may classify them. The event data reflecting these results may be displayed on a map and provided to users of the event classification apparatus 1000 in the form of a shared map service. The event classification apparatus 1000 according to an embodiment may include a crowd-sourced event object classification module and a crowd-sourced event data management module, but is not necessarily limited thereto.
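
One way to picture a single item of event data is as a small record holding the four kinds of data listed above. The sketch below is an assumption for illustration: the class name `EventData`, its field names, the latitude/longitude representation of location, and the sample values are hypothetical and are not defined in this specification.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class EventData:
    """One crowd-sourced upload: text, image, location, and time."""
    text: str
    image_path: str
    location: Tuple[float, float]  # assumed to be (latitude, longitude)
    time: datetime
    keywords: List[str] = field(default_factory=list)  # filled in after extraction


# A hypothetical upload from a user's mobile app.
event = EventData(
    text="Jazz festival at the riverside park tonight",
    image_path="uploads/festival.jpg",
    location=(36.3721, 127.3604),
    time=datetime(2020, 12, 4, 19, 0),
)
```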

An event object may be extracted from the event data that the event classification apparatus 1000 according to an embodiment of the present invention is to analyze. In this specification, event data may also refer to the event object itself.

Meanwhile, an event object may refer to a set of interest information in which, among the hundreds of kinds of event data generated in smart-city life, the interests of individual citizens are gathered and become a matter of local interest.

That is, according to an embodiment of the present invention, in order to support a service in which many users use crowd-sourcing to upload and share data on events occurring throughout a city, the event data collected by users may be analyzed and classified automatically inside the apparatus, without expert judgment. Hereinafter, the crowd-sourced event data management module 200 and the crowd-sourced event object classification module 300, which may be included in the event classification apparatus 1000 according to an embodiment, are described with reference to FIGS. 2 and 3.

FIG. 2 is a diagram for explaining the configuration of the crowd-sourced event data management module 200 according to an embodiment of the present invention.

The crowd-sourced event data management module 200 shown in FIG. 2 may refer to a module that preprocesses the crowd-sourced event data received and collected from users, extracts keywords from the image data and text data, and converts the event data into objects and manages them.

Referring to FIG. 2, the crowd-sourced event data management module 200 according to an embodiment may include a crowd-sourced event data collector 210, a crowd-sourced event data preprocessor 220, an image keyword extractor 230, a text keyword extractor 240, and a crowd-sourced event data store 250, but is not necessarily limited thereto.

The crowd-sourced event data collector 210 may collect crowd-sourced event data from users of the crowd-sourcing service. A user may upload event data through a mobile terminal such as a smartphone, and the communication method used for the upload may include at least one of short-range communication and long-range communication. The crowd-sourced event data collector 210 may also identify the type of the collected event data.

The crowd-sourced event data preprocessor 220 may perform preprocessing according to the type of event data, so that the event data collected by the crowd-sourced event data collector 210 can be analyzed and classified accurately.

Specifically, for image data among the event data, the crowd-sourced event data preprocessor 220 may perform resizing so that the image keyword extractor 230 can later extract image keywords. For example, the crowd-sourced event data preprocessor 220 may resize the image data to a size suitable as input to an image classification CNN model.
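As an illustration only, the resizing step could look like the sketch below. It assumes nearest-neighbour sampling and a 224×224 target, which is a common CNN input size but is not a value stated in this description; the function name `resize_nearest` and the nested-list image representation are hypothetical.

```python
def resize_nearest(pixels, out_h, out_w):
    """Resize a 2-D grid of pixel values with nearest-neighbour sampling."""
    in_h, in_w = len(pixels), len(pixels[0])
    return [
        # Map each output coordinate back to the source pixel it falls in.
        [pixels[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

small = [[1, 2],
         [3, 4]]
upscaled = resize_nearest(small, 4, 4)
# Assumed CNN input size; the description only says "a suitable size".
cnn_input = resize_nearest(small, 224, 224)
```

A production system would more likely rely on an image library with interpolation; the point here is only that every image ends up with the same fixed dimensions before keyword extraction.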

In addition, for text data among the event data, the crowd-sourced event data preprocessor 220 may clean the text data for text keyword extraction. For example, the crowd-sourced event data preprocessor 220 may perform normalization, tokenization, and special-character handling on the text data.
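The text cleaning described above can be sketched as follows. This is a minimal example under stated assumptions: Unicode NFKC normalization, lower-casing, replacement of punctuation with spaces, and whitespace tokenization are choices made for illustration, and `preprocess_text` is a hypothetical name.

```python
import re
import unicodedata

def preprocess_text(raw: str) -> list[str]:
    """Normalize, strip special characters, and tokenize on whitespace."""
    # Normalization: fold compatibility characters and lower-case Latin text.
    text = unicodedata.normalize("NFKC", raw).lower()
    # Special-character handling: anything that is not a word character
    # or whitespace becomes a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Tokenization: split() with no arguments collapses runs of whitespace.
    return text.split()

tokens = preprocess_text("제주도 감귤 축제!!  11월 30일 #Festival")
```

The resulting tokens are still whole word units; separating particles from nouns is left to the morphological analysis performed by the text keyword extractor.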

In addition, for time data among the event data, the crowd-sourced event data preprocessor 220 may perform type conversion. For example, the crowd-sourced event data preprocessor 220 may convert time data collected as a character string into a Time type.
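A minimal sketch of this type conversion, assuming Python's `datetime` as the target type. The accepted string formats in `_FORMATS` are assumptions, since the description does not say in which formats time strings arrive, and `parse_event_time` is a hypothetical name.

```python
from datetime import datetime

# Assumed input formats; not specified in the description.
_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%Y/%m/%d")

def parse_event_time(value: str) -> datetime:
    """Convert a time string into a datetime, trying each known format."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized time string: {value!r}")

festival_day = parse_event_time("2020-11-30")
```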

Meanwhile, the preprocessing performed by the crowd-sourced event data preprocessor 220 as described above is merely an example, and various other kinds of preprocessing may be performed in order to carry out the event classification method according to an embodiment.

The image keyword extractor 230 may, based on deep learning, extract as image keywords the top n labels that best match the image information among the preprocessed event information. That is, the image keyword extractor 230 may extract image keywords from the event information based on a pre-trained artificial neural network, which may be, for example, a convolutional neural network (CNN).

Meanwhile, the CNN may be a neural network trained in advance, using reference images and reference keywords as input, to extract the keyword for a reference image. The keyword for the reference image may be referred to as a label.

Using a CNN deep learning model that classifies images, the image keyword extractor 230 may extract, from the labels pre-stored in the model (e.g., 1000 labels), the n labels to which the crowd-sourced image information most probably corresponds, and may define the extracted n labels as image keywords. The process by which the image keyword extractor 230 extracts image keywords from image information is described later with reference to FIG. 6.

The text keyword extractor 240 may, based on natural language processing (NLP), extract as text keywords the main keywords obtained through morphological analysis of the text information among the preprocessed event information.

For Korean-language analysis, the text keyword extractor 240 may use an unsupervised-learning-based Korean word extraction algorithm that finds the words present in a sentence without using training data, or that splits a sentence into a word sequence or determines parts of speech. With such an algorithm, contiguous strings that appear frequently in documents are defined as words, and words are learned statistically. As a result, words can be extracted effectively from sentences even when new event types and the words corresponding to them emerge as time passes and trends change. That is, the text keyword extractor 240 may extract the main text keywords, including nouns, from the text information through morphological analysis of Korean word units (eojeol).
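To make the idea of statistics-based word extraction concrete, the sketch below scores the prefixes of each word unit by a cohesion-style statistic. This is one well-known unsupervised statistic and is used here as an assumption; the description does not name the specific algorithm. The helper names (`prefix_counts`, `cohesion`, `best_prefix`) and the tiny corpus are hypothetical.

```python
from collections import Counter

def prefix_counts(eojeols, max_len=6):
    """Count every prefix (up to max_len characters) of every word unit."""
    counts = Counter()
    for e in eojeols:
        for i in range(1, min(len(e), max_len) + 1):
            counts[e[:i]] += 1
    return counts

def cohesion(word, counts):
    """Geometric-mean style score: how reliably the first character extends to `word`."""
    if len(word) < 2 or counts[word[0]] == 0:
        return 0.0
    return (counts[word] / counts[word[0]]) ** (1 / (len(word) - 1))

def best_prefix(eojeol, counts):
    """Pick the prefix with the highest cohesion as the word candidate."""
    candidates = [eojeol[:i] for i in range(2, len(eojeol) + 1)]
    if not candidates:
        return eojeol
    return max(candidates, key=lambda w: cohesion(w, counts))

corpus = ["축제가", "축제는", "축제에", "축제를", "축구가"]
counts = prefix_counts(corpus)
word = best_prefix("축제가", counts)
```

Because the noun recurs with many different particles attached, its prefix score is higher than that of any longer string, so the noun is separated from the particle without a dictionary or labeled data.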

Meanwhile, the image keywords and text keywords extracted by the image keyword extractor 230 and the text keyword extractor 240, respectively, may be combined in a combiner (not shown) provided separately within the crowd-sourced event data management module 200. The combined event keywords and the event data may be stored in the crowd-sourced event data store 250 described below, and the event keywords may be used in the crowd-sourced event object classification module 300 described below.

The crowd-sourced event data store 250 may refer to a space that stores the event data received from the user together with event data previously stored from other users. The event information stored in the crowd-sourced event data store 250 may be referred to as stored event information.

FIG. 3 is a diagram for explaining the configuration of the crowd-sourced event object classification module 300 according to an embodiment of the present invention.

The crowd-sourced event object classification module 300 shown in FIG. 3 may refer to a module that determines whether events are duplicates and classifies them by applying LDA topic modeling and K-means clustering to event objects.

Referring to FIG. 3, the crowd-sourced event object classification module 300 according to an embodiment may include a proximity event object clusterer 310, an entire event group LDA topic modeler 320, a proximity event group LDA topic modeler 330, and a crowd-sourced event object classifier 340, but is not necessarily limited thereto.

The proximity event object clusterer 310 may cluster the time information and location information of the entire event information to derive a group of event objects close to the event object derived from the event information received from the user. The entire event information may refer to event information that includes both the stored event information and the event information received from the user.

For example, the proximity event object clusterer 310 may perform K-means clustering to cluster, from the stored event information, the event information having a preset similarity to the time information and location information of the event information received from the user. The event information clustered in this way may be referred to as a candidate cluster.

Specifically, the proximity event object clusterer 310 may perform K-means clustering on the spatial information of the entire event information and, as a first clustering, collect the event information belonging to the same group as the analysis target, that is, the event information received from the user. The proximity event object clusterer 310 may also perform K-means clustering on the time information of the entire event information and, as a second clustering, collect the event information belonging to the same group as the analysis target. The proximity event object clusterer 310 may then add the event information that belongs to both the first and second clustering results to a proximity event group. The proximity event group may be referred to as a candidate cluster.
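The two-stage clustering and intersection can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes plain Lloyd's K-means with a deterministic farthest-point initialization, Euclidean distance on latitude/longitude pairs, time expressed as day of year, and cluster counts chosen by hand. The names `kmeans` and `proximity_group` and all sample values are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Farthest-point initialization (an assumption made for determinism).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def proximity_group(locations, times, k_space, k_time):
    """The last entry is the incoming event. Return indices of stored events
    that share both its spatial cluster and its temporal cluster."""
    space_labels = kmeans(locations, k_space)          # first clustering
    time_labels = kmeans([(t,) for t in times], k_time)  # second clustering
    target = len(locations) - 1
    return [i for i in range(target)
            if space_labels[i] == space_labels[target]
            and time_labels[i] == time_labels[target]]

locations = [(33.50, 126.53), (33.51, 126.52), (37.57, 126.98),
             (33.49, 126.54), (33.50, 126.53)]   # last = incoming event
times = [334.0, 120.0, 334.0, 335.0, 334.0]      # day of year
candidates = proximity_group(locations, times, k_space=2, k_time=2)
```

In this toy data, stored events 0 and 3 are near the incoming event in both space and time, event 1 is nearby but months earlier, and event 2 is on the same day but far away, so only events 0 and 3 form the candidate cluster.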

The entire event group LDA topic modeler 320 may perform the same function as the proximity event group LDA topic modeler 330 described below, differing only in the target of topic modeling. The entire event group LDA topic modeler 320 may perform LDA topic modeling on the entire event information. As a result, it may cluster the stored event information that, with respect to the plurality of keywords extracted from the entire event information, has a preset similarity to the keyword list of the event information received from the user. The stored event information clustered in this way may be referred to as an entire event cluster.

The proximity event group LDA topic modeler 330 may perform LDA topic modeling on the candidate cluster generated by the proximity event object clusterer 310 described above. As a result, it may cluster the stored event information that, with respect to the plurality of keywords extracted from the candidate cluster, has a preset similarity to the keyword list of the event information received from the user. The stored event information clustered in this way may be referred to as a proximity event cluster. The LDA topic modeling is described in detail with reference to FIGS. 7 and 8.
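The description does not specify how the "preset similarity" is measured. As one possible reading, the sketch below assumes that each event has already been reduced to a topic distribution by LDA and that similarity is the cosine between two such distributions, compared against a threshold. The function names, the threshold value, and the sample distributions are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similar_events(query_topics, stored_topics, threshold):
    """Return the ids of stored events whose topic mix is close to the query's."""
    return {event_id for event_id, topics in stored_topics.items()
            if cosine(query_topics, topics) >= threshold}

incoming = [0.9, 0.1]                                # topic mix of the new report
stored = {"a": [0.8, 0.2], "b": [0.1, 0.9]}          # topic mixes of stored events
cluster = similar_events(incoming, stored, threshold=0.9)
```

Running the same selection once over all stored events and once over the candidate cluster only would yield the entire event cluster and the proximity event cluster, respectively.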

The crowd-sourced event object classifier 340 may perform a comparative analysis of the event objects belonging to the entire event cluster and the proximity event cluster derived by the entire event group LDA topic modeler 320 and the proximity event group LDA topic modeler 330, respectively.

Specifically, when an event object is included in both the entire event cluster and the proximity event cluster, the crowd-sourced event object classifier 340 may classify the event information received from the user as duplicate event information, that is, information that duplicates at least one item of the stored event information.

In addition, when no event object exists in either the entire event cluster or the proximity event cluster, the crowd-sourced event object classifier 340 may classify the event information received from the user as new event information that neither matches nor resembles any of the stored event information.

In addition, when event objects exist in each of the entire event cluster and the proximity event cluster but no event object is shared between the two clusters, the crowd-sourced event object classifier 340 may classify the event information received from the user as false event information, indicating that the information is false.

In addition, when event objects exist only in the entire event cluster and the proximity event cluster contains none, the crowd-sourced event object classifier 340 may likewise classify the event information received from the user as false event information.
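The four rules above can be summarized in a short decision function. This is a sketch in which each cluster is assumed to be a set of stored event ids; the function name and the label strings are hypothetical. The case in which only the proximity event cluster is non-empty is not covered by the description, so it is returned as "unspecified" rather than guessed.

```python
def classify_event(entire_cluster: set, proximity_cluster: set) -> str:
    """Classify an incoming event from the two LDA-derived clusters."""
    if entire_cluster & proximity_cluster:
        # At least one stored event appears in both clusters.
        return "duplicate"
    if not entire_cluster and not proximity_cluster:
        # Nothing similar anywhere in the store.
        return "new"
    if entire_cluster:
        # Covers two rules: both clusters non-empty without overlap,
        # and only the entire event cluster non-empty.
        return "false"
    # Only the proximity cluster is non-empty: not addressed in the description.
    return "unspecified"
```

For example, `classify_event({"e1", "e2"}, {"e2"})` yields `"duplicate"`, while `classify_event({"e1"}, set())` yields `"false"`.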

FIG. 4 is a flowchart for explaining a crowd-sourced event classification method according to an embodiment of the present invention.

The event classification method shown in FIG. 4 may be performed by the event classification apparatus 1000 described with reference to FIG. 1. Accordingly, the description given above with reference to FIGS. 1 to 3 also applies to the description of FIG. 4.

Referring to FIG. 4, the event classification apparatus 1000 according to an embodiment may receive event information from a user (S401). The event information may be, for example, cultural information or event information, and may include text information, image information, event time information, and event location information.

For example, when the event information is "Jeju Island tangerine festival held on November 30", the text information may be "Jeju Island", "tangerine", "festival", and so on, and the image information may be a photograph of the festival poster. The event time information may be November 30, and the event location information may be Jeju Island.
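The example above can be written down as a simple record. The class and field names are illustrative assumptions, and `poster.jpg` is a hypothetical file name; the description only lists the four kinds of information an event may carry.

```python
from dataclasses import dataclass

@dataclass
class EventInfo:
    """One crowd-sourced event report (field names are illustrative)."""
    text_keywords: list[str]   # text information
    image_path: str            # image information (hypothetical file name)
    event_time: str            # event time, still a string before type conversion
    event_location: str        # event location

jeju_festival = EventInfo(
    text_keywords=["제주도", "감귤", "축제"],
    image_path="poster.jpg",
    event_time="11월 30일",
    event_location="제주도",
)
```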

In addition, the event classification apparatus 1000 according to an embodiment may preprocess the event information and extract event keywords (S402).

For image information among the event information, the event classification apparatus 1000 may use a pre-trained neural network to extract a plurality of image keywords to which the image information most probably corresponds. The neural network may be, for example, a convolutional neural network (CNN) model.

For text information among the event information, the event classification apparatus 1000 may extract text keywords through morphological analysis of the text information.

In addition, the event classification apparatus 1000 according to an embodiment may perform topic modeling based on the event keywords and the stored event information to generate an entire event cluster and a proximity event cluster (S403). The topic modeling may be Latent Dirichlet Allocation (LDA) topic modeling.

The event classification apparatus 1000 may generate the entire event cluster by performing the topic modeling to produce a cluster of those keywords, among the stored-event keywords derived from the stored event information, that have a first similarity to the extracted event keywords.

The event classification apparatus 1000 may perform K-means clustering to generate, from the stored event information, a candidate cluster having a preset similarity to the event time information and the event location information. It may then generate the proximity event cluster by performing the topic modeling to produce a cluster of those keywords, among the stored-event keywords derived from the candidate cluster, that have a second similarity to the event keywords.

The event classification apparatus 1000 according to an embodiment may then classify the event information by comparing the entire event cluster with the proximity event cluster (S404).

Specifically, when stored event information is included in both the entire event cluster and the proximity event cluster, the event classification apparatus 1000 may classify the event information as duplicate event information.

In addition, when no stored event information exists in either the entire event cluster or the proximity event cluster, the event classification apparatus 1000 may classify the event information as new event information.

In addition, when stored event information exists in each of the entire event cluster and the proximity event cluster but none of it is shared between the two clusters, the event classification apparatus 1000 may classify the event information as false event information.

In addition, when stored event information exists only in the entire event cluster and not in the proximity event cluster, the event classification apparatus 1000 may likewise classify the event information as false event information.

FIG. 5 is a diagram for explaining the configuration of an event classification apparatus according to an embodiment of the present invention. The event classification apparatus 500 shown in FIG. 5 may be the same apparatus as the event classification apparatus 1000 described with reference to FIG. 1, so the description given with reference to FIGS. 1 to 3 also applies to the description of FIG. 5.

Referring to FIG. 5, the event classification apparatus 500 according to an embodiment may include a communication unit 510 and a processor 520, but is not necessarily limited thereto.

The communication unit 510 may receive event information from a user.

The processor 520 may preprocess the event information, extract event keywords, perform topic modeling based on the event keywords and the stored event information to generate an entire event cluster and a proximity event cluster, and classify the event information by comparing the entire event cluster with the proximity event cluster. The topic modeling may be Latent Dirichlet Allocation (LDA) topic modeling.

The event information may include at least one of text information, image information, event time information, and event location information.

Meanwhile, the processor 520 may use a convolutional neural network (CNN) model to extract, as event keywords, a plurality of image keywords to which the image information most probably corresponds, and may extract text keywords as event keywords through morphological analysis of the text information.

In addition, the processor 520 may generate the entire event cluster by performing the topic modeling to produce a cluster of those keywords, among the stored-event keywords derived from the stored event information, that have a first similarity to the event keywords.

In addition, the processor 520 may perform K-means clustering to generate, from the stored event information, a candidate cluster having a preset similarity to the event time information and the event location information. It may then generate the proximity event cluster by performing the topic modeling to produce a cluster of those keywords, among the stored-event keywords derived from the candidate cluster, that have a second similarity to the event keywords.

In addition, when stored event information is included in both the entire event cluster and the proximity event cluster, the processor 520 may classify the event information as duplicate event information.

In addition, when no stored event information exists in either the entire event cluster or the proximity event cluster, the processor 520 may classify the event information as new event information.

In addition, when stored event information exists in each of the entire event cluster and the proximity event cluster but none of it is shared between the two clusters, the processor 520 may classify the event information as false event information.

In addition, when stored event information exists only in the entire event cluster and not in the proximity event cluster, the processor 520 may likewise classify the event information as false event information.

FIG. 6 is a diagram for explaining a process of extracting image keywords from crowd-sourced image data according to an embodiment of the present invention.

Referring to FIG. 6, an image classification CNN model according to an embodiment may take crowd-sourced image data among the event data as input and extract the n labels to which the image data most probably corresponds.

The CNN model may be an artificial neural network trained in advance, using reference images and reference keywords as input, to output the keyword for a reference image.

Specifically, when image data (img_1, img_2, …, img_m) are input to the image classification CNN model, the model passes the input image data through a convolution layer, a max pooling layer, a fully connected layer, and a softmax layer, and may extract n labels for each item of image data. The n labels may be the top n labels with the highest probability of corresponding to the input image data, and they may serve as the image keywords (imgK_1, imgK_2, …, imgK_m) of the respective images.
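The final step, turning the network's output into the top n labels, can be sketched as below. The convolutional layers themselves are omitted; the sketch assumes the fully connected layer has already produced one raw score (logit) per label. The label names and scores are made-up illustrative values, and `top_n_labels` is a hypothetical name.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_n_labels(logits, labels, n):
    """Return the n most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(labels)), key=lambda i: probs[i], reverse=True)
    return [(labels[i], probs[i]) for i in ranked[:n]]

labels = ["orange", "lemon", "poster", "dog"]
logits = [2.0, 1.0, 3.0, -1.0]            # illustrative scores, not model output
image_keywords = [label for label, _ in top_n_labels(logits, labels, n=2)]
```

With a real ImageNet-trained classifier, `labels` would hold about 1000 entries and `logits` would come from the model's last layer for each uploaded image.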

Meanwhile, the deep-learning-based image classification CNN model shown in FIG. 6 is a model trained on the ImageNet dataset and can classify an input image into about 1000 labels. The number of labels n (< 1000) may be determined according to the number of keywords extracted from the text data. Most classification labels of a deep learning model trained on the ImageNet dataset denote generic objects, so if the number of image keywords grows too large, the distinguishing features of the event object under analysis may be diluted among the keywords as a whole. Conversely, if there are few image keywords and the text keywords are also few, the analysis accuracy of LDA topic modeling may decrease, so an appropriate number of labels n may be required. According to an embodiment of the present invention, n may be set so that the ratio of the number of text keywords to the number of image keywords is 3:1.
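The 3:1 rule can be illustrated with a small helper. Rounding down, keeping at least one image keyword, and capping n below 1000 are assumptions added for the sketch; the description states only the ratio and that n is less than 1000. The function names are hypothetical.

```python
def image_keyword_count(num_text_keywords: int, ratio: int = 3,
                        max_labels: int = 999) -> int:
    """Number of image keywords n for a given number of text keywords (3:1)."""
    n = max(1, num_text_keywords // ratio)   # assumed: floor, at least one
    return min(n, max_labels)                # n must stay below 1000

def combine_keywords(text_keywords, ranked_labels):
    """Append the top-n image labels to the text keywords."""
    n = image_keyword_count(len(text_keywords))
    return list(text_keywords) + list(ranked_labels[:n])

event_keywords = combine_keywords(
    ["제주도", "감귤", "축제"],              # three text keywords
    ["orange", "lemon", "poster"],           # image labels, most probable first
)
```

With three text keywords, only the single most probable image label is kept, so the generic ImageNet labels cannot outweigh the event-specific text.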

FIG. 7 is a diagram for explaining a process in which LDA topic modeling is performed on the entire event group according to an embodiment of the present invention.

FIG. 8 is a diagram for explaining a process in which LDA topic modeling is performed on the proximity event group according to an embodiment of the present invention.

도 7 및 도 8에서 설명될 LDA 토픽 모델링은, 입력 값 및 출력 값이 다를 뿐, 그 원리는 동일하므로 함께 설명한다.The LDA topic modeling to be described with reference to FIGS. 7 and 8 has only different input values and different output values, but the principles are the same, and thus will be described together.

LDA topic modeling is a probabilistic model of which topics are present in each document of a given collection, and can be used to estimate the distribution of words per topic and the distribution of topics per document. The LDA-based document generation process broadly proceeds by 1) analyzing the text data (documents) to form topic groups, 2) calculating the topic distribution point of each piece of text data based on the obtained topic groups, and 3) identifying the overall topic of a document through the topic distribution points of its text data.

In the configuration of the LDA topic modeler shown in FIGS. 7 and 8, D may denote the total number of documents, K the total number of topics, and N the number of words in the d-th document. A rectangle indicates repetition the indicated number of times, and a circle indicates a variable. A variable at the tail of an arrow may represent a condition, and the variable at the head of the arrow the result. Meanwhile, α and β may each denote a Dirichlet distribution parameter.

In addition, n_d,k may denote the frequency of words in the d-th document that are assigned to the k-th topic, v_k,w_d,n the frequency with which the word w_d,n is assigned to the k-th topic across all documents, and w_d,n the n-th word appearing in the d-th document. V may denote the total number of words appearing in the documents, A the degree of association between the d-th document and the k-th topic, B the degree of association between the n-th word of the d-th document and the k-th topic, and p the posterior probability. Other details of how LDA topic modeling is performed are known in the art, so a detailed description is omitted.
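For reference, these symbols match the widely used collapsed Gibbs sampling update for LDA. The specification does not spell the formula out, so the form below is an assumption based on that standard formulation. The symbol z_{d,n} (the topic assigned to the word w_{d,n}) and the index variables i and j are notation introduced here, V is read as the number of distinct words, and the counts are taken with the current word's own assignment excluded.

```latex
p\left(z_{d,n} = k \mid z_{-(d,n)}, w\right) \;\propto\; A \times B,
\qquad
A = \frac{n_{d,k} + \alpha_k}{\sum_{i=1}^{K}\left(n_{d,i} + \alpha_i\right)},
\qquad
B = \frac{v_{k,w_{d,n}} + \beta_{w_{d,n}}}{\sum_{j=1}^{V}\left(v_{k,j} + \beta_j\right)}
```

Under this reading, A grows when the d-th document already uses topic k heavily, and B grows when the word w_{d,n} is already strongly associated with topic k.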

Referring to FIG. 7, the entire event group, which includes the event information received from the user and the pre-stored event storage information, may be used as the input to LDA topic modeling. E_A denotes the entire event group list, K_A the event object keyword list of the entire event group, e an event object of the pre-stored event storage information, imgK a keyword derived from an image, textK a keyword derived from text, and e_t the event object under analysis, that is, the event object of the event information received from the user.

When the keywords of the entire event group are input to the LDA topic modeler, a plurality of clusters may be derived. Among the resulting clusters, the cluster of keywords having a preset similarity to the event object under analysis is denoted event cluster A, which may be the entire event cluster described above. In FIG. 7, the keywords belonging to event cluster A are shown in red.

Meanwhile, referring to FIG. 8, a proximity event group may be used as the input to LDA topic modeling. Here, the proximity event group may refer to those event objects, among the event objects derived from the pre-stored event storage information, whose time and location information are similar to those of the event object under analysis, as grouped by K-means clustering.
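As a rough illustration of how a proximity event group might be formed, the sketch below runs a plain K-means over hypothetical (day offset, latitude, longitude) features and keeps the stored events that fall into the same cluster as the event under analysis. The feature encoding, the lack of feature scaling, the sample events, k = 2, and the deterministic initialization from the first k points are all assumptions of this example and are not taken from the specification.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (day offset, latitude, longitude), assumed features


def kmeans(points: Sequence[Point], k: int, iterations: int = 20) -> List[int]:
    """Plain K-means; returns the cluster index assigned to each point.

    Centroids start at the first k points, a simplification chosen so that
    the sketch is deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return assignment


# Hypothetical stored events and target event (illustrative values only).
stored = {
    "e1": (0.0, 33.5, 126.6),   # same day, Jeju
    "e2": (40.0, 37.5, 127.0),  # weeks later, Seoul
    "e3": (1.0, 33.4, 126.6),   # next day, Jeju
    "e4": (42.0, 37.6, 126.9),  # weeks later, Seoul
}
target = (0.0, 33.5, 126.5)

names = list(stored)
points = [stored[name] for name in names] + [target]
labels = kmeans(points, k=2)
# Keep the stored events that share a cluster with the target (the last point).
proximity_group = [name for name, label in zip(names, labels) if label == labels[-1]]
print(proximity_group)
```

In practice the time and location axes would need comparable scales before distances are meaningful; the sketch leaves them unscaled for brevity.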

E_B denotes the proximity event group list, K_B the event object keyword list of the proximity event group, e an event object of the pre-stored event storage information, imgK a keyword derived from an image, textK a keyword derived from text, and e_t the event object under analysis, that is, the event object of the event information received from the user.

When the keywords of the proximity event group are input to the LDA topic modeler, a plurality of clusters may be derived. Among the resulting clusters, the cluster of keywords having a preset similarity to the event object under analysis is denoted event cluster B, which may be the proximity event cluster described above. In FIG. 8, the keywords belonging to event cluster B are shown in blue.

For example, when the event information received from the user is 'Jeju Island Citrus Festival held on November 30', event cluster A may be a cluster with high similarity to 'Jeju Island Citrus Festival'. Event cluster B may be the cluster with high similarity to 'Jeju Island Citrus Festival' among the clusters whose time and location information are similar to 'November 30, Jeju Island'.

FIG. 9 is a diagram for explaining a process in which an event object is classified according to an embodiment of the present invention.

FIG. 9(a) shows the case where the analysis target object is a duplicate event object, FIG. 9(b) the case where it is a new event object, and FIGS. 9(c) and 9(d) the cases where it is a false event object.

Clusters A and B shown in FIG. 9 may refer to the entire event cluster and the proximity event cluster described above, respectively.

Meanwhile, as shown in FIG. 9(a), when an event object common to clusters A and B exists, the event information received from the user may be classified as a duplicate event object similar to at least one item of the pre-stored event storage information. That is, the event information may be classified as a duplicate of the event object in the intersection.

As shown in FIG. 9(b), when no event object exists in either cluster A or cluster B, the event information received from the user may be classified as a new event object that is not similar to any of the pre-stored event storage information.

In addition, as shown in FIGS. 9(c) and 9(d), when event objects exist in each of clusters A and B but none overlap, or when an event object exists only in cluster A and none exists in cluster B, the event information received from the user may be classified as a false event object.
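The four cases of FIG. 9 reduce to set comparisons between the stored event objects found in clusters A and B. The sketch below is one possible reading, with hypothetical event identifiers and a hypothetical function name. The case where only cluster B contains event objects is not addressed in the description, so the code flags it as unspecified rather than guessing.

```python
from typing import Set


def classify_event(cluster_a: Set[str], cluster_b: Set[str]) -> str:
    """Classify the target event from the stored-event IDs in clusters A and B."""
    if cluster_a & cluster_b:
        return "duplicate"    # FIG. 9(a): a stored event appears in both clusters
    if not cluster_a and not cluster_b:
        return "new"          # FIG. 9(b): neither cluster holds a stored event
    if cluster_a:
        return "false"        # FIG. 9(c)/(d): no overlap, or only A is populated
    return "unspecified"      # only B populated: not covered by the description


# Hypothetical stored-event identifiers, one call per case of FIG. 9.
print(classify_event({"e1", "e2"}, {"e2", "e3"}))  # FIG. 9(a)
print(classify_event(set(), set()))                # FIG. 9(b)
print(classify_event({"e1"}, {"e3"}))              # FIG. 9(c)
print(classify_event({"e1"}, set()))               # FIG. 9(d)
```

Checking the intersection first means a single shared event object is enough to mark the report as a duplicate, regardless of how many non-overlapping objects the clusters also contain.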

According to an embodiment of the present invention, the event classification apparatus 1000 may reflect the event object classification result in a smart city event sharing map service.

Meanwhile, the event classification method according to the various embodiments described above may be implemented as a computer program stored on a computer-readable recording medium and programmed to perform each step of the method, or as a computer-readable recording medium storing a computer program programmed to perform each step of the method.

Combinations of each block of the block diagrams and each step of the flowcharts attached to this specification may be performed by computer program instructions. Since these computer program instructions may be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed by that processor create means for performing the functions described in each block of the block diagram or each step of the flowchart. Since these computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, the instructions stored in that memory can also produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagram or each step of the flowchart. Since the computer program instructions may also be loaded onto a computer or other programmable data processing equipment, a series of operational steps may be performed on that equipment to create a computer-executed process, so that the instructions executed on the computer or other programmable data processing equipment provide steps for carrying out the functions described in each block of the block diagram and each step of the flowchart.

In addition, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for performing the specified logical function(s). It should also be noted that in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially concurrently, or may sometimes be performed in reverse order depending on the functions involved.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed in this specification are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of the rights of the present invention.

Claims

An event classification method comprising:
receiving event information from a user;
performing pre-processing on the event information and extracting an event keyword;
generating an entire event cluster and a proximity event cluster by performing topic modeling based on the event keyword and pre-stored event information; and
classifying the event information by comparing the entire event cluster and the proximity event cluster,
wherein the event information includes text information, image information, event time information, and event location information, and
wherein generating the entire event cluster and the proximity event cluster comprises:
generating, by performing clustering, a candidate cluster having a preset similarity to the event time information and the event location information from among the pre-stored event information; and
generating the proximity event cluster by performing the topic modeling to cluster keywords, among event storage keywords derived from the candidate cluster, that have a first similarity to the extracted event keyword.

The event classification method of claim 1,
wherein the topic modeling is LDA (Latent Dirichlet Allocation) topic modeling.

delete

The event classification method of claim 1,
wherein extracting the event keyword comprises:
extracting, as image keywords, a plurality of top-ranked labels matching the image information of the pre-processed event information, using a pre-trained neural network; and
extracting a text keyword of the event information through morphological analysis of the text information,
wherein the neural network is trained to output the image keyword for reference image information when the reference image information and a reference image keyword are input.

The event classification method of claim 1,
wherein generating the entire event cluster and the proximity event cluster comprises:
generating the entire event cluster by performing the topic modeling to cluster keywords, among event storage keywords derived from the pre-stored event information, that have a second similarity to the extracted event keyword.

delete

The event classification method of claim 1,
wherein the clustering is K-means clustering.

The event classification method of claim 1,
wherein classifying the event information comprises:
classifying the event information as duplicate event information when event information included in both the entire event cluster and the proximity event cluster exists.

The event classification method of claim 1,
wherein classifying the event information comprises:
classifying the event information as new event information when no event storage information exists in either the entire event cluster or the proximity event cluster.

The event classification method of claim 1,
wherein classifying the event information comprises:
classifying the event information as false event information when event information exists in each of the entire event cluster and the proximity event cluster without overlap, or when event information exists in the entire event cluster and no event information exists in the proximity event cluster.

An event classification apparatus comprising:
a communication unit configured to receive event information from a user; and
a processor,
wherein the processor is configured to:
perform pre-processing on the event information and extract an event keyword;
perform topic modeling based on the event keyword and pre-stored event storage information to generate an entire event cluster and a proximity event cluster; and
classify the event information by comparing the entire event cluster and the proximity event cluster,
wherein the event information includes text information, image information, event time information, and event location information, and
wherein the processor is further configured to:
perform K-means clustering to generate a candidate cluster having a preset similarity to the event time information and the event location information from among the pre-stored event information; and
generate the proximity event cluster by performing the topic modeling to cluster keywords, among event storage keywords derived from the candidate cluster, that have a first similarity to the extracted event keyword.

The event classification apparatus of claim 11,
wherein the topic modeling is LDA (Latent Dirichlet Allocation) topic modeling.

delete

The event classification apparatus of claim 11,
wherein the processor is configured to:
extract, as image keywords, a plurality of top-ranked labels matching the image information of the pre-processed event information, using a pre-trained neural network; and
extract a text keyword of the event information as the event keyword through morphological analysis of the text information,
wherein the neural network is trained to output the image keyword for reference image information when the reference image information and a reference image keyword are input.

The event classification apparatus of claim 11,
wherein the processor is configured to:
generate the entire event cluster by performing the topic modeling to cluster keywords, among event storage keywords derived from the pre-stored event information, that have a second similarity to the extracted event keyword.

delete

The event classification apparatus of claim 11,
wherein the processor is configured to:
classify the event information as duplicate event information when event information included in both the entire event cluster and the proximity event cluster exists.

The event classification apparatus of claim 11,
wherein the processor is configured to:
classify the event information as new event information when no event information exists in either the entire event cluster or the proximity event cluster.

The event classification apparatus of claim 11,
wherein the processor is configured to:
classify the event information as false event information when event information exists in each of the entire event cluster and the proximity event cluster without overlap, or when event information exists in the entire event cluster and no event information exists in the proximity event cluster.

A computer-readable recording medium storing a computer program,
wherein the computer program includes instructions for performing a method comprising:
receiving event information from a user;
performing pre-processing on the event information and extracting an event keyword;
generating an entire event cluster and a proximity event cluster by performing topic modeling based on the event keyword and pre-stored event information; and
classifying the event information by comparing the entire event cluster and the proximity event cluster,
wherein the event information includes text information, image information, event time information, and event location information, and
wherein generating the entire event cluster and the proximity event cluster comprises:
generating, by performing clustering, a candidate cluster having a preset similarity to the event time information and the event location information from among the pre-stored event information; and
generating the proximity event cluster by performing the topic modeling to cluster keywords, among event storage keywords derived from the candidate cluster, that have a first similarity to the extracted event keyword.

A computer program stored in a computer-readable recording medium,
wherein the computer program includes instructions for performing a method comprising:
receiving event information from a user;
performing pre-processing on the event information and extracting an event keyword;
generating an entire event cluster and a proximity event cluster by performing topic modeling based on the event keyword and pre-stored event information; and
classifying the event information by comparing the entire event cluster and the proximity event cluster,
wherein the event information includes text information, image information, event time information, and event location information, and
wherein generating the entire event cluster and the proximity event cluster comprises:
generating, by performing clustering, a candidate cluster having a preset similarity to the event time information and the event location information from among the pre-stored event information; and
generating the proximity event cluster by performing the topic modeling to cluster keywords, among event storage keywords derived from the candidate cluster, that have a first similarity to the extracted event keyword.