KR20120059935A

KR20120059935A - Text classification device and classification method thereof

Info

Publication number: KR20120059935A
Application number: KR1020100121446A
Authority: KR
Inventors: 박성배; 손정우
Original assignee: 경북대학교 산학협력단
Priority date: 2010-12-01
Filing date: 2010-12-01
Publication date: 2012-06-11
Also published as: KR101158750B1

Abstract

PURPOSE: A document classification apparatus and a document classification method thereof are provided that calculate weights for the documents of learning data based on category-related information of the learning data. CONSTITUTION: Weights for the learning data are calculated according to the similarity between execution data and the learning data (S120). A document classification learning operation is performed on the learning data with the weights applied (S130). The execution data is classified based on the result of the document classification learning operation (S150). The learning data includes a plurality of documents classified by category, and the documents are given different weights according to that similarity.

Description

TEXT CLASSIFICATION DEVICE AND CLASSIFICATION METHOD THEREOF

The present invention relates to a document classification apparatus and a document classification method thereof, and more particularly, to a machine-learning-based document classification apparatus and a document classification method thereof.

There are tens of thousands to hundreds of millions of documents on the Internet, and the volume of documents is growing exponentially with the spread of blogs, personal mini homepages, and the like. These documents carry a great deal of information, and a variety of search and analysis systems are used to access it.

Most search and analysis systems for accessing document information improve accessibility by dividing documents into categories. For example, a portal search system that provides Internet news classifies documents into categories such as politics, society, economy, and entertainment to make them easier to reach. Initially, this classification was performed manually by people. However, as the amount of information has surged, the need for a document classification apparatus that can classify large numbers of documents automatically has grown.

An object of the present invention is to provide a document classification apparatus that classifies documents automatically while accurately assigning each document to a category according to the information it contains.

A document classification method for classifying documents by category according to an embodiment of the present invention includes: calculating weights for learning data according to the degree of similarity between execution data and the learning data; performing a document classification learning operation on the learning data with the calculated weights applied; and classifying the execution data based on the result of the document classification learning operation.

In an embodiment, the learning data includes a plurality of documents classified by category, and each of those documents has a different weight according to its degree of similarity to the execution data.

In an embodiment, the degree of similarity between the documents of the learning data and the execution data is determined by the occurrence probabilities of predetermined words in the documents of the learning data and the occurrence probabilities of predetermined words in the execution data.

In an embodiment, performing the document classification learning operation further includes: calculating the occurrence probabilities of predetermined words in the learning data (hereinafter, the learning data distribution) with the calculated weights applied; and determining whether the learning data distribution is similar to the occurrence probabilities of predetermined words in the execution data (hereinafter, the execution data distribution).

In an embodiment, the degree of similarity between the learning data distribution and the execution data distribution is determined by whether the likelihood for the execution data distribution converges.

In an embodiment, when the likelihood for the execution data distribution does not converge, the execution data is classified, and the weights for the learning data, now including the classified execution data, are then determined again.

A document classification apparatus for classifying documents by category according to an embodiment of the present invention includes: a weight calculation module that calculates weights for the documents of learning data; a document classification learning module that calculates the occurrence probabilities of the words in the learning data (hereinafter, the learning data distribution) with the weights applied; and a document classifier that classifies execution data based on the weighted learning data distribution calculated by the document classification learning module.

In an embodiment, the weight calculation module calculates the weights for the documents of the learning data according to how similar the occurrence probabilities of predetermined words are between the learning data and the execution data.

In an embodiment, the weights for the documents of the learning data differ from one another.

In an embodiment, when the weighted learning data distribution is not similar to the occurrence probabilities of the words in the execution data (hereinafter, the execution data distribution), the document classification learning module temporarily classifies the execution data into the categories of the learning data, after which the weights for the documents of the learning data are determined again.

In an embodiment, the degree of similarity between the learning data distribution and the execution data distribution is determined by whether the likelihood for the execution data distribution converges.

In an embodiment, when the likelihood for the execution data distribution does not converge, the document classification learning module temporarily classifies the execution data and transmits, to the weight calculation module, information about the categories of the learning data that now includes the temporarily classified execution data.

In an embodiment, the weight calculation module recalculates the weights for the documents of the learning data based on the information about the categories of the learning data that includes the temporarily classified execution data.

According to an embodiment of the present invention, documents can be accurately classified into their categories.

FIGS. 1 to 3 are diagrams for explaining the operation of a document classifier according to an embodiment of the present invention.
FIG. 4 is a block diagram showing a document classification apparatus according to another embodiment of the present invention.
FIG. 5 is a diagram illustrating an exemplary algorithm for performing the operation of the document classification apparatus of FIG. 4.
FIG. 6 is a flowchart for explaining the operation of the document classification apparatus of FIG. 4.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings in enough detail that a person of ordinary skill in the art to which the invention pertains can readily implement its technical idea.

FIGS. 1 to 3 are diagrams for explaining the operation of the document classifier 10 according to an embodiment of the present invention.

Referring to FIG. 1, the document classifier 10 receives a document to be classified and assigns it to the corresponding category. For example, the document classifier 10 may classify a received 'document x' into 'category C' based on the content of 'document x'.

In this case, the document classifier 10 may classify the received document using a machine learning method. Here, a machine learning method means finding the common patterns of documents that have already been classified. A document classification method based on machine learning applies those common patterns to the document to be classified, and can thereby classify the received document automatically.

In more detail, referring to FIGS. 2 and 3, data is collected (step S11), and learning data is generated by classifying the collected data into categories (step S12). Here, the learning data (that is, the training data) is a set of documents classified by category. For example, the initial learning data may be a set of documents classified into categories by a person. A common pattern is then extracted from the documents of each category (step S13).

When execution data is received by the document classifier 10 (step S21), the document classifier 10 compares the common pattern of the documents of each category with the execution data (step S22). Here, the execution data is the set of documents to be classified. By comparing the extracted common patterns with the execution data, the document classifier 10 can classify the execution data mechanically (step S23).

In this case, the common pattern extracted for each category may be, for example, the frequency of occurrence of specific words. For simplicity, assume that words such as 'baseball' and 'soccer' appear at a certain frequency in the 'sports' category. Based on the frequency of such words, the document classifier 10 may classify into the 'sports' category any document of the execution data in which these words appear at a similar frequency.
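The frequency-based matching described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the classifier of the patent: the helper names `word_frequencies` and `classify_by_frequency`, the use of an L1 distance between frequency profiles, and the toy token lists are all assumptions made for the example.

```python
from collections import Counter


def word_frequencies(tokens):
    """Return the relative frequency of each token in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def classify_by_frequency(doc_tokens, category_profiles):
    """Pick the category whose frequency profile is closest to the document.

    Closeness is measured with an L1 distance (an assumption for this sketch).
    """
    doc_freq = word_frequencies(doc_tokens)

    def distance(profile):
        vocab = set(profile) | set(doc_freq)
        return sum(abs(profile.get(w, 0.0) - doc_freq.get(w, 0.0)) for w in vocab)

    return min(category_profiles, key=lambda c: distance(category_profiles[c]))


# Hypothetical category profiles built from already classified documents.
profiles = {
    "sports": word_frequencies(["baseball", "soccer", "baseball", "team"]),
    "economy": word_frequencies(["market", "stock", "bank", "market"]),
}
label = classify_by_frequency(["soccer", "baseball", "team", "win"], profiles)
print(label)
```

The example document shares 'baseball', 'soccer', and 'team' with the hypothetical 'sports' profile and nothing with 'economy', so it lands in 'sports'.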

In this case, however, the distribution of the execution data that the document classifier 10 must classify may not match the distribution of the learning data. This is because the set of words that can appear in documents is effectively unbounded, owing to newly coined words, loanwords, and the like. Here, the distribution of the execution data is the probability distribution over how often specific words appear in the documents of the execution data, and the distribution of the learning data is the probability distribution over how often specific words appear in the learning data.

To solve this problem, a document classification apparatus according to another embodiment of the present invention assigns a weight to each document in the learning data according to how similar it is to the documents of the execution data. For example, the document classification apparatus assigns a high weight to documents in the learning data that are similar to the documents of the execution data, and a low weight to documents that are not.

In this case, the similarity between the documents in the learning data and the documents of the execution data may be determined, for example, by the occurrence probabilities of predetermined words. By assigning these weights, the document classification apparatus according to an embodiment of the present invention can classify the execution data into categories more accurately. This is described in more detail below.

FIG. 4 is a block diagram showing a document classification apparatus 100 according to another embodiment of the present invention. Referring to FIG. 4, the document classification apparatus 100 includes a document classifier 110 and a document classification learner 120. The document classification learner 120 includes a weight calculation module 121 and a document classification learning module 122.

The weight calculation module 121 receives the execution data and the learning data. It measures the similarity between the execution data and the learning data, and assigns a high weight to those documents in the learning data that are similar to the documents of the execution data. In this case, the weight of a document in the learning data may be set higher the closer the probability of a specific word in that document under the learning data distribution is to its probability under the execution data distribution. The weights may be calculated using Equations 1 to 5 below.

Equation 1 shows the function 'Φ' that determines the weight of a document 'd' of the learning data, using a similarity function K() between the document 'd' of the learning data and a document 'du' of the execution data (du∈Du(cj)). Here, Du(cj) denotes the documents of the execution data that belong to category cj, and adu is the weight of the document 'du' of the execution data, that is, the importance of 'du' within the execution data distribution. adu may be calculated using Equation 2.

In Equation 2, P1 denotes the distribution of the learning data, and P2 denotes the distribution of the execution data. π is the calculated weight of a document of the learning data. In Equation 2, a' is determined so as to minimize the KL between the distribution of the execution data and the distribution of the learning data, where KL stands for the Kullback-Leibler divergence. The KL of Equation 2 may be expressed as in Equation 3.
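For readers unfamiliar with the Kullback-Leibler divergence, the following Python sketch computes it for two word distributions. It is a generic illustration rather than a reproduction of Equation 3: the function name `kl_divergence`, the dictionary representation, the direction KL(p || q), and the small floor `eps` used to avoid division by zero are assumptions of this example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Compute KL(p || q) for word distributions given as {word: probability}.

    Words missing from q are floored at eps so the logarithm stays finite.
    """
    total = 0.0
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = max(q.get(word, 0.0), eps)
        if pw > 0.0:  # terms with p(w) = 0 contribute nothing
            total += pw * math.log(pw / qw)
    return total


# Hypothetical execution-data and learning-data word distributions.
p_exec = {"baseball": 0.5, "soccer": 0.5}
p_learn = {"baseball": 0.9, "soccer": 0.1}
print(kl_divergence(p_exec, p_learn))
```

The divergence is zero when the two distributions are identical and grows as they drift apart, which is why minimizing it pulls the weighted learning data distribution toward the execution data distribution.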

Using Equation 3, the KL between the distribution of the learning data and the distribution of the execution data can be minimized. This may be expressed as in Equation 4.

In Equation 4, Ncj denotes the number of documents of the execution data that belong to category cj. For convenience of calculation, the sum of the weights is assumed to equal the number of documents. Under this assumption, Equation 4 can be rewritten as Equation 5, and the optimized weights according to an embodiment of the present invention can be calculated using Equation 5.
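The idea of weighting each learning document by its similarity to the execution documents, with the weights normalized to sum to the number of documents, can be sketched as follows. This is a simplified stand-in and not Equations 1 to 5: cosine similarity over raw word counts plays the role of the similarity function K(), the execution-document weights default to uniform values instead of being optimized against the KL divergence, and the names `cosine_similarity` and `document_weights` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def document_weights(train_docs, exec_docs, alphas=None):
    """Weight each learning document by its similarity to the execution documents.

    alphas are per-execution-document importances (uniform if omitted, which
    is an assumption of this sketch). The returned weights are scaled so that
    they sum to the number of learning documents.
    """
    n = len(train_docs)
    if not exec_docs:
        return [1.0] * n
    train_vecs = [Counter(doc) for doc in train_docs]
    exec_vecs = [Counter(doc) for doc in exec_docs]
    if alphas is None:
        alphas = [1.0 / len(exec_vecs)] * len(exec_vecs)
    raw = [
        sum(a * cosine_similarity(t, e) for a, e in zip(alphas, exec_vecs))
        for t in train_vecs
    ]
    total = sum(raw)
    if total == 0.0:  # nothing is similar: fall back to equal weights
        return [1.0] * n
    return [n * r / total for r in raw]


weights = document_weights(
    [["baseball", "soccer"], ["stock", "market"]],
    [["baseball", "game"]],
)
print(weights)
```

In this toy run the first learning document shares a word with the execution document and receives all of the weight, while the unrelated second document receives none.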

Referring again to FIG. 4, the document classification learning module 122 receives the weights of the documents in the learning data from the weight calculation module 121. The document classification learning module 122 also receives the learning data and the execution data. Using the received weights, the document classification learning module 122 performs the document classification learning operation.

Here, the document classification learning operation includes calculating probability values for the frequency of occurrence of the words that appear in each document of the learning data. By applying the weights to the documents in the learning data when calculating these probability values, the document classification learning module 122 lets the words that appear in highly weighted documents count for more. In other words, the document classification learning module 122 obtains a weighted distribution of the learning data. In this case, the document classification learning module 122 may perform the document classification learning operation using the well-known Naive Bayes-based document classifier.
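A weighted learning data distribution of this kind can be sketched as a per-category word probability estimate in which every word occurrence counts as much as the weight of its document. The sketch below is an assumption-laden illustration, not the estimator of the patent: the function name `weighted_word_probabilities`, the additive (Laplace) smoothing, and the toy inputs are introduced only for this example.

```python
from collections import defaultdict


def weighted_word_probabilities(docs, labels, weights, smoothing=1.0):
    """Estimate P(word | category) with per-document weights.

    Each occurrence of a word adds its document's weight to the count, so
    words from highly weighted documents influence the estimate more.
    Additive smoothing is an assumption of this sketch.
    """
    vocab = sorted({word for doc in docs for word in doc})
    counts = defaultdict(lambda: defaultdict(float))
    for tokens, label, weight in zip(docs, labels, weights):
        for word in tokens:
            counts[label][word] += weight
    probs = {}
    for label, word_counts in counts.items():
        denom = sum(word_counts.values()) + smoothing * len(vocab)
        probs[label] = {
            word: (word_counts.get(word, 0.0) + smoothing) / denom
            for word in vocab
        }
    return probs


docs = [["baseball", "soccer"], ["baseball", "baseball"]]
labels = ["sports", "sports"]
# Only the second document carries weight in this hypothetical run.
probs = weighted_word_probabilities(docs, labels, [0.0, 2.0])
print(probs["sports"])
```

With all of the weight on the second document, 'baseball' dominates the estimated distribution for 'sports'; swapping the weights shifts probability mass back toward 'soccer'.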

When the document classification learning operation is performed using a Naive Bayes-based document classifier, the document classification learning module 122 performs the operation so that the likelihood is maximized. When the likelihood converges to a predetermined maximum value, the document classification learning module 122 passes the calculated probability values (that is, the learning data distribution) to the document classifier 110. The document classifier 110 can then classify the documents in the execution data into categories by comparing the received probability values (the learning data distribution) with the execution data (the execution data distribution).

On the other hand, when the likelihood does not converge to the predetermined maximum value, the document classification learning module 122 temporarily classifies the execution data and passes the category information of the temporarily classified execution data to the weight calculation module 121. The weight calculation module 121 reflects this provisional category information in the learning data and determines the weights of the documents in the learning data again. The likelihood in the document classification learning module 122 may be expressed as in Equation 6.

In Equation 6, Dl denotes the learning data, Du denotes the execution data, and θ denotes the parameters to be learned. For convenience of explanation, P2(θ) is assumed to be uniform. In this case, the likelihood that should converge to the maximum value may be expressed as in Equation 7.

In Equation 7, Z indicates with 0 or 1 whether document d belongs to the j-th category. y is the set of categories (the category labels) of document d and is given by the learning data Dl. P2 denotes the distribution of the execution data, and P1 denotes the distribution of the learning data. By way of example, P1 and P2 are assumed to differ. In this case, the classifier (for example, a Naive Bayes-based document classifier or the document classifier of FIG. 4) aims to maximize the likelihood under P2. Equation 7 nevertheless computes and uses the likelihood under P1, because the likelihood under P1 computed with the weight π applied serves as the likelihood under P2: the weight π makes P1 and P2 as close as possible.

To maximize the likelihood, a semi-supervised learning method based on the EM algorithm is applied by way of example. The EM algorithm is divided into two steps, given by Equation 8 and Equation 9.

Using the values obtained in Equation 8, the value of the parameter θ can be obtained as in Equation 9.

In Equation 9, n(w,d) denotes the number of times the word w appears in document d. D and W may be calculated as in Equation 10.
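The expectation step of EM for a Naive Bayes text model can be sketched as a posterior computation over categories for a single document, using the word counts n(w,d). This is a textbook-style illustration rather than a reproduction of Equation 8: the function name `category_posterior`, the uniform priors, and the word probabilities in the example are assumptions.

```python
import math
from collections import Counter


def category_posterior(tokens, priors, word_probs):
    """Return P(category | document) under a multinomial Naive Bayes model.

    Scores are accumulated in log space as log P(c) + sum_w n(w, d) log P(w | c)
    and then normalized. Words unknown to a category are skipped, which is a
    simplification made for this sketch.
    """
    counts = Counter(tokens)  # n(w, d): occurrences of each word in the document
    log_scores = {}
    for category, prior in priors.items():
        score = math.log(prior)
        for word, n in counts.items():
            if word in word_probs[category]:
                score += n * math.log(word_probs[category][word])
        log_scores[category] = score
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}


priors = {"sports": 0.5, "economy": 0.5}
word_probs = {
    "sports": {"baseball": 0.8, "stock": 0.2},
    "economy": {"baseball": 0.2, "stock": 0.8},
}
post = category_posterior(["baseball", "baseball"], priors, word_probs)
print(post)
```

In a semi-supervised setting, posteriors like these would be computed for the unlabeled execution documents and then fed into the next parameter update, while labeled learning documents keep their known categories.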

As described above, the document classification learner 120 according to an embodiment of the present invention reflects the weights of the documents in the learning data when calculating the learning data distribution. By using these weights, the document classification apparatus 100 according to an embodiment of the present invention can classify the execution data stably even when the distribution of the execution data differs from the distribution of the learning data. The document classification learning operation using the weight π described in Equations 6 to 10 may be implemented through an algorithm such as that shown in FIG. 5.

Meanwhile, the document classifier 110 receives the execution data and information about the learning data distribution. The document classifier 110 classifies the execution data using the information about the learning data distribution. Here, the document classifier 110 may be implemented using the document classifier 10 described with reference to FIGS. 1 to 3.

FIG. 6 is a flowchart for explaining the operation of the document classification apparatus 100 of FIG. 4.

In step S110, execution data is received by the document classification apparatus 100. The execution data includes at least one document to be classified.

In step S120, the weights of the documents in the learning data are determined. That is, the weight calculation module 121 determines the weights of the documents in the learning data based on the similarity between the execution data and the learning data. For example, a document of the learning data that is similar to a document to be classified is given a higher weight than a document of the learning data that is not.

In step S130, the document classification learning operation is performed. In this case, the document classification learning module 122 calculates the learning data distribution, reflecting the weights of the documents of the learning data calculated by the weight calculation module 121.

In step S140, it is determined whether the likelihood converges to a predetermined maximum value. That is, the document classification learning module 122 calculates the likelihood and determines whether the calculated value converges to the predetermined value.

When the likelihood does not converge to the predetermined maximum value, the document classification learning module 122 temporarily classifies the execution data (step S150). The category information of the temporarily classified execution data is passed to the weight calculation module 121, which determines the weights again by reflecting this provisional category information (step S160).

When the likelihood converges to the predetermined maximum value, the document classification learning module 122 passes information about the calculated learning data distribution to the document classifier 110. The document classifier 110 uses the received information about the learning data distribution to classify the execution data finally (step S170).
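The control flow of steps S110 to S170 can be sketched as a loop that alternates between weighting, learning, and provisional classification until the likelihood stops changing. This is a structural sketch only: the function `run_classification`, its callable parameters, the tolerance-based convergence test, and the iteration cap are assumptions, and the callables stand in for the weight calculation module 121, the document classification learning module 122, and the document classifier 110.

```python
def run_classification(exec_docs, train_docs, train_labels,
                       compute_weights, fit, log_likelihood, predict,
                       tol=1e-4, max_iter=50):
    """Iterate weighting and learning until the likelihood converges.

    max_iter is assumed to be at least 1. Convergence is approximated by the
    change in log-likelihood falling below tol, which is an assumption of
    this sketch.
    """
    # S120: initial weights from the similarity of learning and execution data.
    weights = compute_weights(train_docs, train_labels, exec_docs, None)
    previous = None
    for _ in range(max_iter):
        model = fit(train_docs, train_labels, weights)      # S130: weighted learning
        current = log_likelihood(model, exec_docs)          # S140: convergence check
        if previous is not None and abs(current - previous) < tol:
            break
        previous = current
        provisional = predict(model, exec_docs)             # S150: temporary classification
        # S160: recompute weights using the provisional categories.
        weights = compute_weights(train_docs, train_labels, exec_docs, provisional)
    return predict(model, exec_docs)                        # S170: final classification


# Stub callables used only to exercise the control flow.
calls = {"fit": 0}

def stub_weights(train_docs, train_labels, exec_docs, provisional):
    return [1.0] * len(train_docs)

def stub_fit(train_docs, train_labels, weights):
    calls["fit"] += 1
    return sum(weights)

def stub_log_likelihood(model, exec_docs):
    return -1.0

def stub_predict(model, exec_docs):
    return ["sports"] * len(exec_docs)

result = run_classification([["a"], ["b"]], [["a"]], ["sports"],
                            stub_weights, stub_fit, stub_log_likelihood, stub_predict)
print(result, calls["fit"])
```

With the stubs, the log-likelihood never changes, so the loop stops on its second pass after fitting twice.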

As described above, the document classification apparatus 100 according to an embodiment of the present invention assigns a weight to each document of the learning data, and can therefore classify the execution data accurately even when the distribution of the learning data and the distribution of the execution data are not the same.

Meanwhile, it will be apparent to those skilled in the art that the structure of the present invention can be modified or changed in various ways without departing from its scope or technical spirit. In view of the foregoing, the present invention is considered to cover such modifications and changes provided they fall within the scope of the following claims and their equivalents.

10, 110: document classifier 100: document classification apparatus
120: document classification learner 121: weight calculation module
122: document classification learning module

Claims

A document classification method for classifying documents by category, the method comprising:
calculating weights for learning data according to a degree of similarity between execution data and the learning data;
performing a document classification learning operation on the learning data by reflecting the weights for the learning data; and
classifying the execution data based on a result of performing the document classification learning operation.

The method of claim 1,
wherein the learning data includes a plurality of documents classified by category, and the plurality of documents of the learning data have different weights according to their degree of similarity to the execution data.

The method of claim 2,
wherein the degree of similarity between the plurality of documents of the learning data and the execution data is determined according to the occurrence probability of predetermined words included in the plurality of documents of the learning data and the occurrence probability of predetermined words included in the execution data.

The method of claim 1,
wherein performing the document classification learning operation comprises:
calculating a learning data distribution by reflecting the weights for the learning data; and
determining whether the learning data distribution is similar to an execution data distribution.

The method of claim 4,
wherein the degree of similarity between the learning data distribution and the execution data distribution is determined according to whether the value of the likelihood for the execution data distribution converges.

The method of claim 5,
wherein, when the value of the likelihood for the execution data distribution does not converge, the weights for the learning data including the classified execution data are determined again after the execution data is classified.

A document classification apparatus for classifying documents by category, the apparatus comprising:
a weight calculation module for calculating weights for documents of learning data;
a document classification learning module for calculating a learning data distribution by reflecting the weights; and
a document classifier for classifying execution data based on the learning data distribution calculated by the document classification learning module.

The apparatus of claim 7,
wherein the weight calculation module calculates the weights for the documents of the learning data according to the degree of similarity between the occurrence probabilities of predetermined words of the learning data and of the execution data.

The apparatus of claim 8,
wherein the weights for the documents of the learning data are different from each other.

The apparatus of claim 7,
wherein, when the learning data distribution is not similar to the execution data distribution, the document classification learning module temporarily classifies the execution data into a category of the learning data, after which the weights for the documents of the learning data are determined again.

The apparatus of claim 10,
wherein the degree of similarity between the learning data distribution and the execution data distribution is determined according to whether the value of the likelihood for the execution data distribution converges.

The apparatus of claim 11,
wherein, when the value of the likelihood for the execution data distribution does not converge, the document classification learning module temporarily classifies the execution data and transmits, to the weight calculation module, information on the category of the learning data including the temporarily classified execution data.

The apparatus of claim 12,
wherein the weight calculation module recalculates the weights for the documents of the learning data based on the information on the category of the learning data including the temporarily classified execution data.