KR101071495B1

KR101071495B1 - Device and method for categorizing electronic document automatically

Info

Publication number: KR101071495B1
Application number: KR1020090041272A
Authority: KR
Inventors: 조태호
Original assignee: 인하대학교 산학협력단
Priority date: 2009-05-12
Filing date: 2009-05-12
Publication date: 2011-10-10
Also published as: KR20100122298A

Abstract

A method and apparatus for automatically classifying electronic documents are disclosed. The method includes generating a category profile table from one or more labeled documents and classifying an unclassified document using the category profile table.

Electronic document, automatic classification, category profile, table, encoding

Description

Electronic document automatic classification method and apparatus {DEVICE AND METHOD FOR CATEGORIZING ELECTRONIC DOCUMENT AUTOMATICALLY}

The present invention relates to a method and apparatus for automatically classifying electronic documents, and more particularly to a method and apparatus that encode a document into a table and automatically classify the electronic document using a category profile.

Recently, automatic document classification has come to be recognized as an important technology for the development of document management systems, text information systems, and the like. Document classification may be performed by assigning an unclassified document to one of a set of predefined categories.

Conventionally, document classification has required the text in a document to be represented as a numerical vector. Such vectors become high-dimensional, and most of their elements may be zero.

Accordingly, there is a need for an automatic document classification method and apparatus that can improve the efficiency of automatic document classification without using numerical vectors.

The present invention provides an automatic document classification method and apparatus that can reduce the amount of computation required for automatic document classification by encoding a document into text and automatically classifying the document using pre-classified category profiles.

The present invention also provides an automatic document classification method and apparatus that improve the performance and accuracy of automatic document classification by encoding a document into text and automatically classifying the document using pre-classified category profiles.

An automatic electronic document classification method according to an embodiment of the present invention includes generating a category profile table using one or more labeled documents, and classifying an unclassified document using the category profile table.

According to one aspect of the present invention, generating the category profile table may include: merging the text included in the one or more labeled documents into a single text; and encoding the single text into a category profile table that includes one or more words and weights for the one or more words.

According to one aspect of the present invention, the classifying may include: encoding the unclassified document into an unclassified document table that includes one or more words and weights for the one or more words; determining the similarity between the unclassified document table and each of one or more category profile tables; and classifying the unclassified document based on the respective similarities.

According to one aspect of the present invention, the encoding may include: generating tokens by segmenting the text in a document into substrings; generating one or more root-form words by converting the tokens into their root forms; removing stop words, which include at least one of conjunctions, articles, prepositions, and pronouns, from the document; and assigning a weight to each root-form word.

An automatic electronic document classification apparatus according to an embodiment of the present invention includes a category profile generation unit that generates a category profile table using one or more labeled documents, and a document classification unit that classifies an unclassified document using the category profile table.

According to one aspect of the present invention, the document classification unit may include: an encoding unit that encodes the unclassified document into an unclassified document table including one or more words and weights for the one or more words; a similarity determination unit that determines the similarity between the unclassified document table and each of one or more category profile tables; and a classification unit that classifies the unclassified document based on the respective similarities.

According to one aspect of the present invention, the similarity determination unit may include: a category profile selection unit that selects one category profile table; a common word extraction unit that extracts the words common to the unclassified document table and the selected category profile table; and a score calculation unit that calculates a similarity score by summing the weights that the common words have in the unclassified document table.

According to one aspect of the present invention, the similarity determination unit may include: a category profile selection unit that selects one category profile table; a common word extraction unit that extracts the words common to the unclassified document table and the selected category profile table; and a score calculation unit that calculates a similarity score by summing the weights that the common words have in the category profile table.

According to one aspect of the present invention, the similarity determination unit may include: a category profile selection unit that selects one category profile table; a common word extraction unit that extracts the words common to the unclassified document table and the selected category profile table; a multiplication score calculation unit that, for each common word, multiplies its weight in the unclassified document table by its weight in the selected category profile table to produce a multiplication score; and a similarity score calculation unit that calculates a similarity score by summing the multiplication scores of the common words.

According to one aspect of the present invention, the classification unit may assign the unclassified document to the category corresponding to the category profile table with the highest similarity score.

According to an embodiment of the present invention, there is provided an automatic document classification method and apparatus that can reduce the amount of computation required for automatic document classification by encoding a document into text and automatically classifying the document using pre-classified category profiles.

According to an embodiment of the present invention, there is also provided an automatic document classification method and apparatus that improve the performance and accuracy of automatic document classification by encoding a document into text and automatically classifying the document using pre-classified category profiles.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. However, the present invention is not limited to these embodiments. Like reference numerals in the drawings denote like elements. The automatic electronic document classification method may be performed by the components of the automatic electronic document classification apparatus.

FIG. 1 is a flowchart illustrating a method for automatically classifying electronic documents according to an embodiment of the present invention.

Referring to FIG. 1, in step 110 a category profile table may be generated using one or more labeled documents. That is, a table is generated for each category from the labeled documents, without generating numerical vectors. Step 110 is described in more detail below with reference to FIGS. 2 and 3.

FIG. 2 is a flowchart illustrating the process of generating the category profile table shown in FIG. 1.

Referring to FIG. 2, to generate a category profile, in step 210 the text included in the one or more labeled documents may be merged into a single text. For example, when document 1, document 2, and document 3 belong to the same category, the text included in documents 1, 2, and 3 may be merged into a single text.
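The merging in step 210 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `merge_texts_by_category`, the `(category, text)` input format, and the use of a single space as separator are all assumptions.

```python
from collections import defaultdict


def merge_texts_by_category(labeled_docs):
    """Concatenate the text of all documents that share a category label.

    labeled_docs is an iterable of (category, text) pairs (assumed format).
    Returns {category: merged_text}.
    """
    grouped = defaultdict(list)
    for category, text in labeled_docs:
        grouped[category].append(text)
    # Join with a space so the last word of one document does not fuse
    # with the first word of the next.
    return {category: " ".join(texts) for category, texts in grouped.items()}


labeled_docs = [
    ("sports", "The team won the match."),
    ("sports", "A late goal decided the game."),
    ("finance", "Stocks fell sharply today."),
]
merged = merge_texts_by_category(labeled_docs)
```

Each category ends up with exactly one text, which is what the subsequent encoding step takes as input.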

In step 220, the single text may be encoded into a category profile table that includes one or more words and weights for those words. That is, words are extracted from the merged text and a weight is assigned to each extracted word, yielding the category profile table. To encode the category profile table, at least one of the following may be performed: tokenization, which generates tokens by segmenting the text into substrings; stemming, which generates one or more root-form words by converting the tokens into their root forms; stop word removal, which removes stop words including at least one of conjunctions, articles, prepositions, and pronouns from the document; and term weighting, which assigns a weight to each root-form word.

As an example of category profile table encoding, as shown in FIG. 3, the text included in one or more documents 310 belonging to category 1 may be merged, and a category profile table may be generated from the merged text. In this way, for each category, the text of the documents belonging to that category is merged and a separate category profile table is generated from the merged text.
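The per-category encoding described above might look like the sketch below. The table is modeled as a plain `{word: weight}` dictionary, which is an assumption about the data structure. The weight used here is relative frequency within the merged text, a placeholder chosen only to keep the example short; it is not the weighting the patent specifies. The names `encode_to_table` and `build_category_profiles` are hypothetical.

```python
import re
from collections import Counter


def encode_to_table(text):
    """Encode text as {word: weight}.

    Placeholder weighting (assumption): relative frequency of the word
    in the text, not the patent's Equation 1.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def build_category_profiles(merged_texts):
    """Build one category profile table per category from its merged text."""
    return {category: encode_to_table(text) for category, text in merged_texts.items()}


merged_texts = {
    "computing": "computer data computer network",
    "finance": "market stock market",
}
profiles = build_category_profiles(merged_texts)
```

Because every category is encoded independently, adding a new category only requires encoding its merged text; existing tables are untouched.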

Referring again to FIG. 1, in step 120 an unclassified document may be classified using the category profile table. The unclassified document can thus be classified using tables rather than numerical vectors. Step 120 is described in more detail below with reference to FIGS. 4 and 5.

FIG. 4 is a flowchart illustrating the process of classifying an unclassified document shown in FIG. 1.

Referring to FIG. 4, in step 410 the unclassified document may be encoded into an unclassified document table that includes one or more words and weights for those words. The encoding process is described in more detail with reference to FIG. 5.

Referring to FIG. 5, in step 510 tokenization may be performed, generating tokens by segmenting the text in the document into substrings. Tokenization is the process of segmenting text into substrings called tokens; the text may be split at white space, punctuation marks, carriage returns, and the like. Special characters, numerical values, and the like may be removed at this stage.
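A minimal sketch of the tokenization in step 510 follows. The regular expression and the lowercasing are assumptions made for the example (the source does not mention case folding), and the function name `tokenize` is hypothetical.

```python
import re

# Split at any run of characters that are not letters or digits. This covers
# white space, punctuation marks, and carriage returns / line feeds.
_SEPARATORS = re.compile(r"[\W_]+")


def tokenize(text):
    """Segment text into tokens, dropping special characters and numeric values."""
    raw_tokens = _SEPARATORS.split(text)
    # str.isalpha() rejects empty strings and any token containing a digit.
    # Lowercasing is an added assumption so that "The" and "the" match later.
    return [token.lower() for token in raw_tokens if token.isalpha()]


tokens = tokenize("The 3 servers, rebooted\ntwice (at 10:30)!")
```

Here the numbers `3`, `10`, and `30` and all punctuation are discarded, leaving only alphabetic tokens.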

In step 520, the tokens may be converted into their root forms to generate one or more root-form words. For example, verbs in the past, present, or future tense and nouns in plural form may be converted into their root forms.
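The conversion to root forms in step 520 can be illustrated with a deliberately simplistic suffix stripper. The source does not name a stemming algorithm, so the rules, the `IRREGULAR_FORMS` lookup, and the function name `to_root_form` below are all illustrative assumptions; a production system would more likely use an established stemmer or lemmatizer.

```python
# Hypothetical lookup for irregular forms that suffix rules cannot handle.
IRREGULAR_FORMS = {"went": "go", "ran": "run", "children": "child"}


def to_root_form(token):
    """Reduce a lowercase English token to an approximate root form (toy rules)."""
    if token in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[token]
    # Plural nouns such as "categories" -> "category".
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    # Words such as "class" end in "s" but are not plurals.
    if token.endswith("ss"):
        return token
    # Strip common verb and plural suffixes, keeping at least 3 characters.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


roots = [to_root_form(t) for t in ["documents", "categories", "walked", "walking", "went"]]
```

Rules this crude will mis-stem many words; the point is only that different surface forms of one word collapse to a single table entry.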

In step 530, stop word removal may be performed to remove stop words, which include at least one of conjunctions, articles, prepositions, and pronouns, from the document. Stop word removal is the process of removing words that serve only a grammatical function and are unrelated to the content of the document. Removing words that are unnecessary for categorization makes the categorization easier to perform.
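Stop word removal as described in step 530 reduces to a set-membership filter. The word list below is a small hand-picked sample of conjunctions, articles, prepositions, and pronouns chosen for illustration; the source does not give a list, and `STOP_WORDS` and `remove_stop_words` are hypothetical names.

```python
# Illustrative sample only; a real list would be far longer.
STOP_WORDS = frozenset({
    "and", "or", "but",          # conjunctions
    "a", "an", "the",            # articles
    "in", "on", "of", "to",      # prepositions
    "it", "they", "he", "she",   # pronouns
})


def remove_stop_words(words):
    """Return the words that are not in the stop word list, preserving order."""
    return [word for word in words if word not in STOP_WORDS]


content_words = remove_stop_words(
    ["the", "server", "and", "it", "store", "data", "in", "memory"]
)
```

Using a `frozenset` keeps each lookup constant-time, so the filter is linear in the number of words.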

In step 540, a weight may be assigned to each root-form word. The weight indicates how important each word is with respect to the content of the document. The weights may be assigned according to Equation 1 below.

Here, Wk denotes a particular word, tf(Wk) the total frequency of Wk in a given corpus, D the size of the corpus (the total number of documents in the corpus), and df(Wk) the number of documents that contain Wk.
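Equation 1 itself does not appear in this text, so the sketch below cannot reproduce it. Instead it assumes one common TF-IDF-style form that uses exactly the three quantities defined above, weight(Wk) = tf(Wk) × (log2 D − log2 df(Wk) + 1). This form, the base-2 logarithm, and the function name `term_weight` are assumptions, and the patent's actual formula may differ.

```python
import math


def term_weight(tf, df, corpus_size):
    """Assumed TF-IDF-style weight: tf * (log2(D) - log2(df) + 1).

    tf          -- total frequency of the word in the corpus, tf(Wk)
    df          -- number of documents containing the word, df(Wk)
    corpus_size -- total number of documents in the corpus, D
    """
    if tf < 0 or df <= 0 or df > corpus_size:
        raise ValueError("require tf >= 0 and 0 < df <= corpus_size")
    return tf * (math.log2(corpus_size) - math.log2(df) + 1.0)


# A word occurring 3 times overall, in 2 of 8 documents.
rare_word_weight = term_weight(tf=3, df=2, corpus_size=8)
# A word occurring 2 times overall but present in all 8 documents.
common_word_weight = term_weight(tf=2, df=8, corpus_size=8)
```

Under this assumed form, a word that appears in every document keeps only its raw frequency as weight, while words concentrated in few documents are boosted.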

In addition, when assigning weights, duplicate words may be removed first, and a weight then assigned to each remaining word.

Referring again to FIG. 4, in step 420 the similarity between the unclassified document table and each of one or more category profile tables may be determined. That is, the unclassified document table is compared with the category profile tables one after another, and the degree of similarity is expressed as a number. The similarity determination is described in more detail below with reference to FIGS. 6 and 7.

FIG. 6 is a diagram for describing the process of classifying an unclassified document according to an embodiment of the present invention.

Referring to FIG. 6, the unclassified document 610 may be converted through the encoding process into a table 620 that includes one or more words and the weights of those words. The similarity of the table 620 to the first category profile table 630, the second category profile table 640, and so on may then be determined in sequence. The similarity to each category profile table may be expressed numerically as a similarity score. For example, the similarity score between the first category profile table 630 and the unclassified document 610 may be 1.5, and the similarity score between the second category profile table 640 and the unclassified document 610 may be 1.2. Embodiments of the method for calculating the similarity score are described in more detail below with reference to FIG. 7.

FIG. 7 is a diagram illustrating the process of determining similarity in order to classify an unclassified document according to an embodiment of the present invention.

Referring to FIG. 7, to determine similarity, a category profile table 720 for which a similarity score is to be calculated may be selected from among a plurality of category profile tables. Both the unclassified document table 710 to be categorized and the category profile table 720 may include words and weights for those words.

Next, the words that appear in both the unclassified document table 710 and the category profile table 720 are extracted, and the similarity score is calculated by summing the weights that these common words have in the category profile table 720. For example, suppose the unclassified document table 710 contains the (word, weight) pairs (computer, 0.4), (information, 0.7), (data, 0.3), and (internet, 0.6), and the category profile table 720 contains (computer, 0.2), (information, 0.1), (hardware, 0.3), and (communication, 0.9). The common words are then 'computer' and 'information'. The common words and their weights in each table may be collected in a separate table 730, which here contains (computer, 0.4, 0.2) and (information, 0.7, 0.1). The weights of 'computer' and 'information' in the category profile table 720 are 0.2 and 0.1, respectively, so their sum, 0.3, may be determined as the similarity score for the category profile table 720.
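The worked example above can be reproduced with a short sketch. Tables are modeled as `{word: weight}` dictionaries, which is an assumption about the representation; the function names `build_common_table` and `score_by_profile_weights` are hypothetical. The numbers are the ones given in the text.

```python
def build_common_table(doc_table, profile_table):
    """Table 730: {common word: (weight in document table, weight in profile table)}."""
    shared = sorted(doc_table.keys() & profile_table.keys())
    return {word: (doc_table[word], profile_table[word]) for word in shared}


def score_by_profile_weights(doc_table, profile_table):
    """Similarity score: sum of the common words' weights in the profile table."""
    common = build_common_table(doc_table, profile_table)
    return sum(profile_weight for _, profile_weight in common.values())


doc_table_710 = {"computer": 0.4, "information": 0.7, "data": 0.3, "internet": 0.6}
profile_table_720 = {"computer": 0.2, "information": 0.1, "hardware": 0.3, "communication": 0.9}

common_table_730 = build_common_table(doc_table_710, profile_table_720)
score = score_by_profile_weights(doc_table_710, profile_table_720)  # 0.2 + 0.1
```

Because the weights are binary floating-point numbers, the computed sum should be compared with 0.3 using a tolerance rather than exact equality.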

As another example of determining similarity, the words that appear in both the unclassified document table 710 and the category profile table 720 are extracted, and the similarity score is calculated by summing the weights that these common words have in the unclassified document table 710. For example, suppose the unclassified document table 710 contains the (word, weight) pairs (computer, 0.4), (information, 0.7), (data, 0.3), and (internet, 0.6), and the category profile table 720 contains (computer, 0.2), (information, 0.1), (hardware, 0.3), and (communication, 0.9). The common words are then 'computer' and 'information', and, as above, a separate table 730 containing (computer, 0.4, 0.2) and (information, 0.7, 0.1) may be generated. The weight of 'computer' in the unclassified document table 710 is 0.4 and the weight of 'information' in the unclassified document table 710 is 0.7, so their sum, 1.1, may be determined as the similarity score for the category profile table 720.
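This second variant differs from the first only in which table the weights are read from. A minimal sketch under the same assumptions (dictionary tables, hypothetical function name `score_by_document_weights`), using the numbers from the text:

```python
def score_by_document_weights(doc_table, profile_table):
    """Similarity score: sum of the common words' weights in the document table."""
    shared = doc_table.keys() & profile_table.keys()
    return sum(doc_table[word] for word in shared)


doc_table_710 = {"computer": 0.4, "information": 0.7, "data": 0.3, "internet": 0.6}
profile_table_720 = {"computer": 0.2, "information": 0.1, "hardware": 0.3, "communication": 0.9}

score = score_by_document_weights(doc_table_710, profile_table_720)  # 0.4 + 0.7
```

Note that with this variant the profile table contributes only its vocabulary: two profiles containing the same common words receive the same score regardless of their own weights.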

As yet another example, to determine similarity, the words that appear in both the unclassified document table 710 and the category profile table 720 are extracted, and for each common word a multiplication score is produced by multiplying its weight in the unclassified document table 710 by its weight in the category profile table 720. The similarity score is then calculated by summing the multiplication scores of the common words. For example, suppose the unclassified document table 710 contains the (word, weight) pairs (computer, 0.4), (information, 0.7), (data, 0.3), and (internet, 0.6), and the category profile table 720 contains (computer, 0.2), (information, 0.1), (hardware, 0.3), and (communication, 0.9). The common words are then 'computer' and 'information', and, as above, a separate table 730 containing (computer, 0.4, 0.2) and (information, 0.7, 0.1) may be generated. For 'computer', the weight from the unclassified document table 710 (0.4) is multiplied by the weight from the category profile table 720 (0.2), giving a multiplication score of 0.08. For 'information', the weight from the unclassified document table 710 (0.7) is multiplied by the weight from the category profile table 720 (0.1), giving a multiplication score of 0.07. The sum of the per-word multiplication scores, 0.15, may then be determined as the similarity score for the category profile table 720.
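The multiplication-based variant amounts to a dot product restricted to the common words. The sketch below reproduces the figures from the text; the dictionary representation and the names `multiplication_scores` and `score_by_products` are assumptions.

```python
def multiplication_scores(doc_table, profile_table):
    """Per-word product of document weight and profile weight, for common words only."""
    shared = doc_table.keys() & profile_table.keys()
    return {word: doc_table[word] * profile_table[word] for word in shared}


def score_by_products(doc_table, profile_table):
    """Similarity score: sum of the per-word multiplication scores."""
    return sum(multiplication_scores(doc_table, profile_table).values())


doc_table_710 = {"computer": 0.4, "information": 0.7, "data": 0.3, "internet": 0.6}
profile_table_720 = {"computer": 0.2, "information": 0.1, "hardware": 0.3, "communication": 0.9}

products = multiplication_scores(doc_table_710, profile_table_720)
score = score_by_products(doc_table_710, profile_table_720)  # 0.08 + 0.07
```

Unlike the two summation variants, this score depends on the weights in both tables, so a word counts heavily only when it is important in the document and in the category.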

Referring again to FIG. 4, in step 430 the unclassified document may be classified based on the respective similarities. That is, the highest of the similarity scores for the category profile tables is determined, and the unclassified document is assigned to the category corresponding to the category profile table with that highest score. As shown in FIG. 6, the category profile table with the maximum similarity score may be determined to be the table most similar to the unclassified document 610. In the example above, the first category profile table 630, with a similarity score of 1.5, is determined to be the table most similar to the unclassified document 610. Accordingly, the unclassified document 610 may be classified into the same category as the first category profile table 630.
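Selecting the category in step 430 is a maximum over the per-category scores. The sketch below assumes the scores are held in a dictionary keyed by category name; the name `pick_category` and the category labels are hypothetical, while the scores 1.5 and 1.2 come from the FIG. 6 example.

```python
def pick_category(similarity_scores):
    """Return the category whose profile table has the highest similarity score.

    similarity_scores maps category name -> similarity score (assumed format).
    On ties, the first category in insertion order wins; the source does not
    say how ties should be resolved, so this is an arbitrary choice.
    """
    if not similarity_scores:
        raise ValueError("no category profile tables to compare against")
    return max(similarity_scores, key=similarity_scores.get)


scores = {"category 1": 1.5, "category 2": 1.2}
assigned = pick_category(scores)
```

Any of the three similarity variants can feed this step, since it only needs one number per category.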

FIG. 8 is a block diagram illustrating an automatic electronic document classification apparatus according to an embodiment of the present invention.

Referring to FIG. 8, the automatic electronic document classification apparatus 800 may include a profile generation unit 810 and a document classification unit 820.

The profile generation unit 810 may generate a category profile table using one or more labeled documents. The profile generation unit 810 may include a text merging unit 811 that merges the text included in the one or more labeled documents into a single text, and an encoding unit 812 that encodes the single text into a category profile table including one or more words and weights for those words.

The document classification unit 820 may classify an unclassified document using the category profile table. The document classification unit 820 may include an encoding unit 821, a similarity determination unit 822, and a classification unit 823.

The encoding unit 821 may encode the unclassified document into an unclassified document table that includes one or more words and weights for those words. The weight may be a numerical value indicating how important a word in the unclassified document is with respect to the content of the document. The encoding unit 821 may include a token generation unit (not shown), a root form generation unit (not shown), a stop word removal unit (not shown), and a weight assignment unit (not shown). The token generation unit generates tokens by segmenting the text into substrings, and the root form generation unit generates root-form words by converting the tokens into their root forms. The stop word removal unit removes stop words, which include at least one of conjunctions, articles, prepositions, and pronouns, from the document, and the weight assignment unit assigns a weight to each root-form word.

The similarity determination unit 822 may determine the similarity between the unclassified document table and each of one or more category profile tables. As one example, the similarity determination unit 822 may include a category profile selection unit (not shown), a common word extraction unit (not shown), and a score calculation unit (not shown). The category profile selection unit selects one category profile table, and the common word extraction unit extracts the words common to the unclassified document table and the selected category profile table. The score calculation unit may calculate a similarity score by summing the weights that the common words have in the unclassified document table. Alternatively, the score calculation unit may calculate a similarity score by summing the weights that the common words have in the category profile table.

As another example, the similarity determination unit 822 may include a category profile selection unit (not shown), a common word extraction unit (not shown), a multiplication score calculation unit (not shown), and a similarity score calculation unit (not shown).

In this case, the category profile selection unit may select one category profile table, and the common word extraction unit may extract the words common to the unclassified document table and the selected category profile table. The multiplication score calculation unit may generate a multiplication score for each common word by multiplying the weight of that word in the unclassified document table by its weight in the selected category profile table, and the similarity score calculation unit may calculate a similarity score by summing the multiplication scores of the common words.
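The multiplication-based variant amounts to a dot product restricted to the common words. A minimal sketch, assuming tables are dictionaries from word to weight and using the hypothetical name `product_score`:

```python
def product_score(doc_table, profile_table):
    """Sum, over the common words, of document weight times profile weight."""
    shared = doc_table.keys() & profile_table.keys()
    return sum(doc_table[w] * profile_table[w] for w in shared)
```

For a document table `{"cat": 0.5, "dog": 0.5}` and a profile table `{"cat": 0.2, "dog": 0.4, "fish": 0.4}`, the multiplication scores are 0.1 for `cat` and 0.2 for `dog`, giving a similarity score of about 0.3. Words present in only one table contribute nothing.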

The classification unit 823 may classify the unclassified document based on the respective similarities. That is, the category corresponding to the category profile table with the largest similarity score may be assigned as the category of the unclassified document.
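The selection of the highest-scoring category can be sketched as below. The sketch assumes the multiplication-based similarity score and models the set of category profile tables as a dictionary from category name to table; the function name and the sample categories are hypothetical. Tie-breaking is not specified in the description, so here the first category encountered wins a tie.

```python
def classify(doc_table, profiles):
    """Return the category whose profile table scores highest.

    profiles maps a category name to its profile table (word -> weight).
    Raises ValueError if profiles is empty.
    """
    def score(profile_table):
        shared = doc_table.keys() & profile_table.keys()
        return sum(doc_table[w] * profile_table[w] for w in shared)

    return max(profiles, key=lambda name: score(profiles[name]))


# Usage with made-up tables:
profiles = {
    "sports": {"ball": 0.6, "team": 0.4},
    "finance": {"bank": 0.7, "rate": 0.3},
}
doc = {"team": 0.5, "ball": 0.3, "rate": 0.2}
category = classify(doc, profiles)  # "sports"
```

Here the sports profile scores 0.5 × 0.4 + 0.3 × 0.6 = 0.38 and the finance profile scores 0.2 × 0.3 = 0.06, so the document is assigned to the sports category.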

For matters not described with reference to FIG. 8, the description of FIGS. 1 to 7 may be consulted.

As described above, by encoding a document into text and automatically classifying the document using pre-classified category profiles, the amount of computation required for automatic document classification can be reduced and the accuracy of classification can be improved.

Although an embodiment of the present invention has been described above with reference to limited embodiments and drawings, the present invention is not limited to the described embodiment, and those of ordinary skill in the art to which the invention pertains can make various modifications and variations from this description. Accordingly, an embodiment of the present invention should be understood only by the claims set forth below, and all equal or equivalent modifications thereof fall within the scope of the inventive idea.

FIG. 3 is a diagram illustrating a process of generating a category profile table using labeled documents according to an embodiment of the present invention.

FIG. 5 is a flowchart illustrating the encoding process shown in FIG. 4.

FIG. 6 is a diagram illustrating a process of classifying an unclassified document according to an embodiment of the present invention.

Claims

A method for automatically classifying electronic documents, the method comprising:

generating a category profile table using one or more labeled documents; and

classifying an unclassified document using the category profile table,

wherein the classifying comprises:

encoding the unclassified document into an unclassified document table comprising one or more words and weights for the one or more words;

determining a similarity between the unclassified document table and each of one or more category profile tables; and

classifying the unclassified document based on each similarity,

and wherein the determining of the similarity comprises:

selecting one category profile table;

extracting common words between the unclassified document table and the selected category profile table;

generating a multiplication score for each common word by multiplying the weight values of that word in the unclassified document table and the selected category profile table; and

calculating a similarity score by summing the multiplication scores of the common words.

The method of claim 1,

wherein the generating of the category profile table comprises:

integrating the text contained in the one or more labeled documents into a single text; and

encoding the single text into a category profile table comprising one or more words and weights for the one or more words.

(Deleted)

The method of claim 1,

wherein the encoding comprises:

generating tokens by splitting the text in the document into substrings;

generating one or more root-form words by converting the tokens into their root forms;

removing stopwords from the document, the stopwords including at least one of conjunctions, articles, prepositions, and pronouns; and

assigning a weight to each root-form word.

(Deleted)

The method of claim 1,

wherein the classifying of the unclassified document based on each similarity comprises

assigning the category corresponding to the category profile table having the largest similarity score as the category of the unclassified document.

An apparatus for automatically classifying electronic documents, the apparatus comprising:

a category profile generation unit that generates a category profile table using one or more labeled documents; and

a document classification unit that classifies an unclassified document using the category profile table,

wherein the document classification unit comprises:

an encoding unit that encodes the unclassified document into an unclassified document table comprising one or more words and weights for the one or more words;

a similarity determination unit that determines a similarity between the unclassified document table and each of one or more category profile tables; and

a classification unit that classifies the unclassified document based on each similarity,

wherein the similarity determination unit

selects one category profile table, extracts common words between the unclassified document table and the selected category profile table, generates a multiplication score for each common word by multiplying the weight values of that word in the unclassified document table and the selected category profile table, and calculates a similarity score by summing the multiplication scores of the common words,

and wherein the classification unit

assigns the category corresponding to the category profile table having the largest similarity score as the category of the unclassified document.

10. The apparatus of claim 9,

wherein the category profile generation unit comprises:

a text integration unit that integrates the text contained in the one or more labeled documents into a single text; and

an encoding unit that encodes the single text into a category profile table comprising one or more words and weights for the one or more words.

(Deleted)

The apparatus of claim 9,

wherein the encoding unit comprises:

a token generation unit that generates tokens by splitting text into substrings;

a root-form generation unit that converts the tokens into their root forms to generate root-form words;

a stopword removal unit that removes stopwords from the document, the stopwords including at least one of conjunctions, articles, prepositions, and pronouns; and

a weight assignment unit that assigns a weight to each root-form word.