KR20140049659A

KR20140049659A - An automatic device for training and classifying documents based on n-gram statistics and an automatic method for training and classifying documents based on n-gram statistics therefor

Info

Publication number: KR20140049659A
Application number: KR1020120115730A
Authority: KR
Inventors: 김판구; 최동진; 김정인; 고미아
Original assignee: 조선대학교산학협력단
Priority date: 2012-10-18
Filing date: 2012-10-18
Publication date: 2014-04-28
Also published as: KR101400548B1

Abstract

The present invention relates to an apparatus for automatically learning documents and a method for automatically learning documents using the same, and an apparatus for automatically classifying documents and a method for automatically classifying documents using the same, which are capable of automatically learning and classifying mass documents on the web through a process of automatically learning and classifying documents based on n-gram. The apparatus for automatically classifying documents according to the present invention includes: a learning document pool including a plurality of learning document groups which are classified according to categories; a preprocessing unit configured to preprocess each of the learning document groups of the learning document pool; and an n-gram data set pool configured to store a set of n-gram data of the learning document pool, which is formed by being learned through the preprocessing of the preprocessing unit. Additionally, the apparatus for automatically classifying documents includes: an automatic document learning unit configured to allow the preprocessing unit to preprocess a corresponding new document to form a bigram set, when the new document occurs, which is not identified through the learning document pool; and an automatic document classifying unit configured to compare the bigram set of the new document, formed through the preprocessing unit, with a bigram set of the n-gram data set pool and to allocate and store the bigram set of the new document to one of n-gram data sets of the n-gram data set pool. [Reference numerals] (220) Automatic document classifying unit; (230) Learned n-gram data set(bigram example); (AA) Non-identified document; (BB) Appearance of a new document; (CC) Preprocessing

Description

Technical Field

The present invention relates to an automatic document learning apparatus and an automatic document learning method using the same, and to an automatic document classification apparatus and an automatic document classification method using the same. More particularly, it relates to apparatuses and methods that allow large volumes of documents on the web to be learned and classified automatically through an n-gram-based automatic document learning and classification process.

Document classification is a field that began with library management systems and has been researched and developed continuously ever since. To manage and index large volumes of documents efficiently, it has progressed from early manual work to today's computer-based statistical and semantic techniques.

In particular, the rapid development of the Internet and smartphones has greatly increased the amount of information on the web, so that efficient management and retrieval of this information by its users has become a very important problem.

For example, information retrieval systems that process countless web documents on websites, extract only the information matching a user's request, and provide it to the user have recently come into wide use, as have Internet information retrieval systems that accept everyday natural language rather than keywords as the search query. Developing document classification technology of the kind described above is essential to improving the capability of such information retrieval systems.

Briefly, the document classification techniques in wide use today include methods that map documents into a vector space based on the frequency of the words appearing in them and classify them after measuring the distance between documents, and probabilistic models that apply Naive Bayes theory.

However, when these conventional document classification techniques are applied to natural language, an approach limited to word frequency counts shows limited reliability. This can also degrade search performance when the Internet information retrieval systems described above are queried in natural language.

For this reason, methods that use a knowledge database such as WordNet have also been proposed, but the accuracy of document classification with such methods depends heavily on the reliability of the knowledge database used.

The present applicant therefore proposes the present invention, noting that the n-gram, a technique that combines in-document co-occurrence information with occurrence frequency counts and is actively used in speech recognition and natural language processing, can be used effectively to capture the association between co-occurring words and to select central words, keywords, and representative words.

Korean Patent No. 10-0842080, "Method for classifying documents by group"
Korean Patent No. 10-1092352, "Method and apparatus for automatic domain classification of a sentence corpus"

An object of the present invention is to provide an automatic document learning apparatus and an automatic document learning method using the same, and an automatic document classification apparatus and an automatic document classification method using the same, that allow large volumes of documents on the web to be learned and classified automatically through an n-gram-based automatic document learning and classification process.

Another object of the present invention is to provide such apparatuses and methods that, after web documents have gone through a learning stage for each predefined field, automatically classify an unidentified new document when it appears, and can thereby provide semantic analysis of web documents together with search results closely related to the user's query.

A further object of the present invention is to provide such apparatuses and methods that build n-grams capable of representing a document, based on n-grams (sets of co-occurring words) and their occurrence frequencies, so that when an unidentified new document is later found it can be classified automatically on the basis of the previously built n-gram information.

To achieve the above objects, an automatic document learning apparatus according to the present invention comprises: a learning document pool containing a plurality of learning document groups classified by category; a preprocessing unit that preprocesses each learning document group of the learning document pool; and an n-gram data set pool that stores the n-gram data sets of the learning document pool formed by learning through the preprocessing of the preprocessing unit.

The preprocessing unit comprises: a special character removal unit that removes special characters from each learning document of a learning document group; a stopword removal unit that removes stopwords from each learning document of the learning document group; a part-of-speech tagger unit that performs morphological analysis on each learning document of the learning document group to screen out parts of speech other than nouns and verbs; and a part-of-speech extraction unit that extracts nouns and verbs from the learning documents that have passed through the part-of-speech tagger unit.

The n-gram data set is characterized in that a bigram set is generated and stored for each category.

The bigram set is characterized in that it is generated by extracting and building word-unit (eojeol) bigrams.

An automatic document learning method according to the present invention comprises the steps of: classifying a plurality of learning documents by category to form a plurality of learning document groups; preprocessing the plurality of learning document groups individually; and generating, through the preprocessing, an n-gram data set individually from each of the plurality of learning document groups.

The step of individually preprocessing the plurality of learning document groups comprises: removing special characters from each learning document of the learning document group; removing stopwords from each learning document of the learning document group; performing morphological analysis on each learning document of the learning document group to screen out parts of speech other than nouns and verbs; extracting nouns and verbs from the learning documents so processed and extracting the co-occurrence frequency counts of those nouns and verbs; and building an n-gram data set based on the co-occurrence frequency counts.

The step of building an n-gram data set based on the co-occurrence frequency counts is characterized in that a bigram data set is built for each learning document group through the preprocessing.

An automatic document classification apparatus according to the present invention comprises: an automatic document learning unit that includes a learning document pool containing a plurality of learning document groups classified by category, a preprocessing unit that preprocesses each learning document group of the learning document pool, and an n-gram data set pool that stores the n-gram data sets of the learning document pool formed by learning through the preprocessing of the preprocessing unit, and in which, when a new document that is not identified through the learning document pool appears, the preprocessing unit preprocesses the new document to form a bigram set; and an automatic document classifying unit that compares the bigram set of the new document formed by the preprocessing unit with the bigram sets of the n-gram data set pool, and assigns and stores the bigram set of the new document in one of the n-gram data sets of the n-gram data set pool.

The automatic document classifying unit is characterized in that it measures the similarity between the bigram sets of the n-gram data set pool and the bigram set of the new document by means of the following formula:

[Formula 1]

Bigram Weight: the similarity between the bigram set of the new document and the bigram set of each n-gram data set in the n-gram data set pool. When a bigram extracted from the new document is also present in the bigrams of an n-gram data set, it is the value obtained by multiplying together the occurrence frequencies of that bigram in the new document and in the data set.

CW: the value obtained by multiplying the previously computed Bigram Weight by the number of bigrams that appear in both the new document and the n-gram data set.

The CW is characterized in that it is assigned to the maximum among the values obtained, for each n-gram data set, by multiplying the previously computed Bigram Weight by the number of bigrams that appear in both the new document and that n-gram data set.

The automatic document classifying unit is characterized in that it learns the new document on the basis of word co-occurrence information, word occurrence frequency counts, and n-gram occurrence frequency counts, and through that learning determines which n-gram data set of the n-gram data set pool the new document is classified into.

An automatic document classification method according to the present invention comprises the steps of: classifying a plurality of learning documents by category to form a plurality of learning document groups; preprocessing each learning document group by removing special characters and stopwords and extracting only nouns and verbs through morphological analysis; generating, through the preprocessing, an n-gram data set individually from each of the plurality of learning document groups; a new document appearing that is not identified through the n-gram data sets; preprocessing the new document by the same process as the above preprocessing to form the bigrams of the new document; and comparing the bigrams of the new document with the bigrams of the n-gram data sets to assign the new document to one of the learning document groups and thereby classify it.

According to the present invention, large volumes of documents on the web can be learned and classified automatically through an n-gram-based automatic document learning and classification process.

In addition, after web documents have gone through a learning stage for each predefined field, an unidentified new document is classified automatically when it appears, which makes it possible to provide semantic analysis of web documents together with search results closely related to the user's query.

In addition, n-grams capable of representing a document are built from n-grams (sets of co-occurring words) and their occurrence frequencies, so that when an unidentified new document is later found it can be classified automatically on the basis of the previously built n-gram information.

FIG. 1 is a block diagram showing an automatic document learning apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram conceptually showing an automatic document classification apparatus according to an embodiment of the present invention.
FIG. 3 is a flowchart showing an automatic document learning method according to an embodiment of the present invention.
FIG. 4 is a flowchart showing an automatic document classification method according to an embodiment of the present invention.

Hereinafter, an automatic document learning apparatus and an automatic document learning method using the same, and an automatic document classification apparatus and an automatic document classification method using the same, according to an embodiment of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing an automatic document learning apparatus according to an embodiment of the present invention.

As shown, the automatic document learning apparatus (100) according to an embodiment of the present invention comprises a learning document pool (110), a preprocessing unit (120), and an n-gram data set pool (130).

The learning document pool (110) contains a plurality of learning document groups (111) classified by category.

The preprocessing unit (120) preprocesses each learning document group of the learning document pool (110). It comprises a special character removal unit (121), a stopword removal unit (122), a part-of-speech tagger unit (123), and a part-of-speech extraction unit (124).

The special character removal unit (121) removes special characters from each learning document of a learning document group (111), and the stopword removal unit (122) removes stopwords from each learning document of the learning document group (111). The part-of-speech tagger unit (123) performs morphological analysis on each learning document of the learning document group (111) to screen out parts of speech other than nouns and verbs, and the part-of-speech extraction unit (124) extracts nouns and verbs from the learning documents that have passed through the part-of-speech tagger unit (123).
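The four preprocessing units (121) to (124) can be sketched as a small pipeline. This is only an illustrative sketch, not the apparatus itself: the English stopword list `STOPWORDS` and the toy lexicon `POS_LEXICON` are hypothetical stand-ins, and a real system for Korean text would presumably use a morphological analyzer in place of the dictionary lookup.

```python
import re

# Hypothetical stopword list and part-of-speech lexicon, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "at", "is", "and"}
POS_LEXICON = {
    "students": "NOUN", "university": "NOUN", "science": "NOUN",
    "computer": "NOUN", "chosun": "NOUN", "study": "VERB",
    "quickly": "ADV", "new": "ADJ",
}

def remove_special_characters(text):
    """Unit 121: replace every character that is not a letter, digit, or whitespace."""
    return re.sub(r"[^0-9A-Za-z\s]", " ", text)

def remove_stopwords(tokens):
    """Unit 122: drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def tag_parts_of_speech(tokens):
    """Unit 123: attach a tag to each token (unknown words get the tag 'X')."""
    return [(t, POS_LEXICON.get(t, "X")) for t in tokens]

def extract_nouns_and_verbs(tagged):
    """Unit 124: keep only the tokens tagged as noun or verb."""
    return [t for t, tag in tagged if tag in {"NOUN", "VERB"}]

def preprocess(text):
    """Run units 121 to 124 in order on one document."""
    tokens = remove_special_characters(text).lower().split()
    tokens = remove_stopwords(tokens)
    return extract_nouns_and_verbs(tag_parts_of_speech(tokens))

print(preprocess("New students study computer science at Chosun University!"))
# ['students', 'study', 'computer', 'science', 'chosun', 'university']
```

In this sketch "at" is dropped as a stopword and "new" is dropped because it is tagged as an adjective, so only nouns and verbs reach the n-gram stage.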

The n-gram data set pool (130) stores the n-gram data sets (131) of the learning document pool (110), formed by learning through the preprocessing of the preprocessing unit (120). Each n-gram data set (131) generates and stores a bigram set for its category, and the bigram set is generated by extracting and building word-unit (eojeol) bigrams.
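A per-category bigram set with occurrence counts might be built as follows. This is a minimal sketch under assumptions: documents are taken to be already preprocessed token lists, a bigram is a pair of adjacent word-unit tokens, and the names `word_bigrams`, `build_bigram_set`, and `build_ngram_data_set_pool` as well as the sample data are hypothetical.

```python
from collections import Counter

def word_bigrams(tokens):
    """Return the adjacent word pairs of a token list as 'w1 w2' strings."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_bigram_set(documents):
    """Count every word-unit bigram over all documents of one category."""
    counts = Counter()
    for tokens in documents:
        counts.update(word_bigrams(tokens))
    return counts

def build_ngram_data_set_pool(learning_document_pool):
    """Map each category name to its bigram set (one n-gram data set per category)."""
    return {
        category: build_bigram_set(documents)
        for category, documents in learning_document_pool.items()
    }

# Hypothetical learning document pool: category -> preprocessed documents.
learning_document_pool = {
    "education": [
        ["chosun", "university", "computer", "science"],
        ["chosun", "university", "students"],
    ],
    "politics": [["president", "korea"]],
}

pool = build_ngram_data_set_pool(learning_document_pool)
print(pool["education"]["chosun university"])  # 2
print(pool["politics"]["president korea"])     # 1
```

Storing the counts in a `Counter` keeps both pieces of information the description relies on: which bigrams occur in a category and how often.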

Next, an automatic document classification apparatus according to an embodiment of the present invention is described with reference to FIG. 2.

As shown, the automatic document classification apparatus (200) according to an embodiment of the present invention comprises an automatic document learning unit (210) and an automatic document classifying unit (220).

The automatic document learning unit (210) corresponds to the automatic document learning apparatus (100) described above with reference to FIG. 1. That is, it comprises a learning document pool containing a plurality of learning document groups classified by category, a preprocessing unit that preprocesses each learning document group of the learning document pool, and an n-gram data set pool that stores the n-gram data sets of the learning document pool formed by learning through the preprocessing of the preprocessing unit. With this configuration, when a new document that is not identified through the learning document pool appears, the preprocessing unit preprocesses the new document to form a bigram set.

The components of the automatic document learning unit (210), including its preprocessing unit, and the configuration of that preprocessing unit are the same as those of the automatic document learning apparatus of FIG. 1, so their detailed description and illustration are omitted in this embodiment.

The automatic document classifying unit (220) compares the bigram set of the new document, formed by the preprocessing unit of the automatic document learning unit (210), with the bigram sets of the n-gram data set pool, and assigns and stores the bigram set of the new document in one of the n-gram data sets of the n-gram data set pool (230; this corresponds to the n-gram data set pool of the automatic document learning unit (210) and is drawn separately only for ease of understanding).

Here, the automatic document classifying unit (220) measures the similarity between the bigram sets of the n-gram data set pool (230) and the bigram set of the new document by means of the following formula.

[Formula 1]

Bigram Weight: the similarity between the bigram set of the new document and the bigram set of each n-gram data set in the n-gram data set pool. When a bigram extracted from the new document is also present in the bigrams of an n-gram data set, it is the value obtained by multiplying together the occurrence frequencies of that bigram in the new document and in the data set.

CW: the value obtained by multiplying the previously computed Bigram Weight by the number of bigrams that appear in both the new document and the n-gram data set.
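The body of Formula 1 is not present in this text. The following is a hedged reconstruction inferred only from the two definitions above and from the worked values in Table 1; it should not be read as the formula as originally filed. The notation is assumed: $B_{ND}$ and $B_{C_i}$ denote the bigram sets of the new document and of category $i$, and $f_{ND}(b)$ and $f_{C_i}(b)$ denote the occurrence counts of bigram $b$ in each. The leading 1 in CW appears in the CW column of Table 1 but is not mentioned in the definitions.

```latex
\[
\mathrm{BigramWeight}(ND, C_i) \;=\; \sum_{b \,\in\, B_{ND} \cap B_{C_i}} f_{ND}(b)\, f_{C_i}(b)
\]
\[
CW(ND, C_i) \;=\; 1 + \mathrm{BigramWeight}(ND, C_i) \cdot \bigl| B_{ND} \cap B_{C_i} \bigr|
\]
```

For Category 1 in Table 1 this gives $5 \cdot 15 + 4 \cdot 40 = 235$ and $1 + 235 \cdot 2 = 471$, matching the tabulated values.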

In other words, the automatic document classifying unit (220) learns the new document on the basis of word co-occurrence information, word occurrence frequency counts, and n-gram occurrence frequency counts, and through that learning determines which n-gram data set of the n-gram data set pool (230) the new document is classified into. Table 1 below shows an example.

Table 1

| | Bigram data | Bigram Weight | CW |
|---|---|---|---|
| New document (ND) | chosun university 5, computer science 4, master students 5, ... | | |
| Category 1 (C1) | chosun university 15, computer science 40, gwangju korea 37, ... | ND*C1 = 5*15 + 4*40 = 235 | 1 + 235*2 = 471 |
| Category 2 (C2) | republic korea 64, chosun university 5, president korea 41, ... | ND*C2 = 5*5 = 25 | 1 + 25*1 = 26 |
| Category 3 (C3) | chosun university 71, artificial intelligence 41, international conference 13, ... | ND*C3 = 5*71 = 355 | 1 + 355*1 = 356 |

The CW may be assigned to the maximum among the values obtained, for each n-gram data set, by multiplying the previously computed Bigram Weight by the number of bigrams that appear in both the new document and that n-gram data set.
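The worked example of Table 1 can be reproduced with a few lines of code. This sketch assumes, from the table's arithmetic rather than from the formula itself, that Bigram Weight is the sum of frequency products over shared bigrams and that CW is 1 plus Bigram Weight times the number of shared bigrams. Only the bigrams listed in the table are used; the entries elided with "..." are left out, and the function names are hypothetical.

```python
def bigram_weight(new_doc, category):
    """Sum, over bigrams present in both sets, of the product of their counts."""
    shared = new_doc.keys() & category.keys()
    return sum(new_doc[b] * category[b] for b in shared)

def category_weight(new_doc, category):
    """CW as it appears in Table 1: 1 + Bigram Weight * number of shared bigrams."""
    shared = new_doc.keys() & category.keys()
    return 1 + bigram_weight(new_doc, category) * len(shared)

# Bigram counts copied from Table 1 (listed entries only).
new_document = {"chosun university": 5, "computer science": 4, "master students": 5}
categories = {
    "Category 1": {"chosun university": 15, "computer science": 40, "gwangju korea": 37},
    "Category 2": {"republic korea": 64, "chosun university": 5, "president korea": 41},
    "Category 3": {"chosun university": 71, "artificial intelligence": 41,
                   "international conference": 13},
}

scores = {name: category_weight(new_document, c) for name, c in categories.items()}
best = max(scores, key=scores.get)

print(scores)  # {'Category 1': 471, 'Category 2': 26, 'Category 3': 356}
print(best)    # Category 1
```

Category 3 has the largest single frequency product (355), but Category 1 wins because it shares two bigrams with the new document, which is the effect of multiplying by the number of shared bigrams.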

Next, an automatic document learning method according to an embodiment of the present invention is described with reference to FIG. 3. The method uses the automatic document learning apparatus of the FIG. 1 embodiment, so the description below follows that configuration and uses the same reference numerals.

First, in step S110, a plurality of learning documents are classified by category to form a plurality of learning document groups (111).

Next, in step S120, the plurality of learning document groups (111) are preprocessed individually.

Here, the preprocessing comprises: step S121, in which special characters are removed from each learning document of the learning document group (111); step S122, in which stopwords are removed from each learning document of the learning document group (111); step S123, in which morphological analysis is performed on each learning document of the learning document group (111) to screen out parts of speech other than nouns and verbs; step S124, in which nouns and verbs are extracted from the learning documents so processed and the co-occurrence frequency counts of those nouns and verbs are extracted; and step S125, in which an n-gram data set is built based on the co-occurrence frequency counts.

After this series of preprocessing steps, in step S130 an n-gram data set (131) is generated individually from each of the plurality of learning document groups (111).

Step S130 is the step in which a bigram data set is built for each learning document group (111) through the preprocessing.

Next, an automatic document classification method according to an embodiment of the present invention is described with reference to FIG. 4. The method uses the automatic document classification apparatus of the FIG. 2 embodiment.

First, in step S210, a plurality of learning documents are classified by category to form a plurality of learning document groups.

Next, in step S220, each learning document group is preprocessed by removing special characters and stopwords and extracting only nouns and verbs through morphological analysis.

Next, in step S230, an n-gram data set is generated individually from each of the plurality of learning document groups through the preprocessing of step S220.

Next, in step S240, a new document appears that is not identified through the n-gram data sets.

Next, in step S250, the new document of step S240 is preprocessed by the same process as the preprocessing of step S220 to form the bigrams of the new document.

Next, in step S260, the bigrams of the new document are compared with the bigrams of the n-gram data sets, and the new document is assigned to one of the learning document groups and thereby classified.
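Steps S210 to S260 can be strung together in a short end-to-end sketch. Several simplifications are assumptions of this sketch rather than part of the method: preprocessing is reduced to special-character and stopword removal (no morphological analysis), the stopword list is hypothetical, the sample documents are invented, and the score follows the CW arithmetic shown in Table 1.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "at", "is", "and", "to"}  # hypothetical

def preprocess(text):
    """S220 / S250, simplified: strip special characters and stopwords."""
    words = re.sub(r"[^0-9A-Za-z\s]", " ", text).lower().split()
    return [w for w in words if w not in STOPWORDS]

def bigrams(tokens):
    """Count adjacent word pairs in one preprocessed document."""
    return Counter(zip(tokens, tokens[1:]))

def learn(learning_document_groups):
    """S210 to S230: build one bigram data set per category."""
    pool = {}
    for category, documents in learning_document_groups.items():
        counts = Counter()
        for text in documents:
            counts.update(bigrams(preprocess(text)))
        pool[category] = counts
    return pool

def classify(pool, new_text):
    """S250 to S260: score the new document against every category, pick the maximum."""
    new_bigrams = bigrams(preprocess(new_text))
    scores = {}
    for category, counts in pool.items():
        shared = new_bigrams.keys() & counts.keys()
        weight = sum(new_bigrams[b] * counts[b] for b in shared)
        scores[category] = 1 + weight * len(shared)
    return max(scores, key=scores.get), scores

groups = {
    "education": ["Chosun University is in Gwangju.",
                  "Students of computer science at Chosun University."],
    "politics": ["The president of Korea.",
                 "Republic of Korea and the president."],
}
pool = learn(groups)
label, scores = classify(pool, "Computer science students at Chosun University!")
print(label, scores)  # education {'education': 7, 'politics': 1}
```

When two categories tie, `max` returns the first one encountered; the description does not say how ties should be resolved, so that behavior is an artifact of this sketch.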

As can be seen from the embodiments of FIGS. 1 to 4 described above, the automatic document learning apparatus and automatic document learning method using the same, and the automatic document classification apparatus and automatic document classification method using the same, according to the present invention allow large volumes of documents on the web to be learned and classified automatically through an n-gram-based automatic document learning and classification process.

In addition, after web documents have gone through a learning stage for each predefined field, a new document is classified automatically when it appears, which makes it possible to provide semantic analysis of web documents together with search results closely related to the user's query.

In addition, n-grams that can represent a document are built from n-grams, which are sets of words that appear together, and from their occurrence frequencies, so that when an unidentified new document is later found, it can be automatically classified on the basis of the previously built n-gram information.

In addition, a set of word-unit n-gram statistical distribution values is built from learning documents that fit the predefined classification items. When a new document then appears, it is automatically classified by measuring the similarity between the n-gram distribution found in the new document and the previously learned n-gram distributions. The n-gram distribution data generated from the new document is accumulated into the previously learned n-gram distribution, which therefore expands incrementally and can adapt to changes in information, improving the reliability of the data.
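The incremental accumulation described above can be sketched as a count merge. This is a minimal illustration that assumes each learned distribution is stored as a bigram frequency table; the function name `absorb` and the sample data are hypothetical.

```python
from collections import Counter

def absorb(data_sets, category, new_bigrams):
    """Accumulate a classified document's bigram counts into its category."""
    # Counter.update adds counts rather than replacing them, so existing
    # bigrams grow in frequency and unseen bigrams are added.
    data_sets[category].update(new_bigrams)
    return data_sets[category]

# Hypothetical learned distribution and newly classified document.
learned = {"economy": Counter({("market", "rose"): 2})}
new_doc = Counter({("market", "rose"): 1, ("investors", "sold"): 1})
absorb(learned, "economy", new_doc)
```

After the merge, bigrams that were unknown to the category before, such as `("investors", "sold")` here, can contribute to the classification of later documents.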

What has been described above is merely one embodiment for carrying out the automatic document learning apparatus and method and the automatic document classification apparatus and method according to the present invention. The present invention is not limited to the embodiment described above; as claimed in the following claims, the technical spirit of the present invention extends to the full range within which anyone with ordinary skill in the art to which the invention pertains can make various modifications without departing from the gist of the invention.

100: automatic document learning apparatus 110: learning document pool
111: learning document group 120: preprocessing unit
121: special character removing unit 122: stopword removing unit
123: part-of-speech tagger unit 124: part-of-speech extracting unit
130: n-gram data set pool 131: n-gram data set
200: automatic document classification apparatus 210: automatic document learning unit
220: automatic document classifying unit

Claims

A learning document pool including a plurality of learning document groups classified by category;
A preprocessing unit for preprocessing each learning document group of the learning document pool;
And an n-gram data set pool in which an n-gram data set of the learning document pool, formed through learning via the preprocessing of the preprocessing unit, is stored.

The apparatus of claim 1, wherein the preprocessing unit comprises:
A special character removing unit for removing special characters from each learning document of the learning document group;
A stopword removing unit for removing stopwords from each learning document of the learning document group;
A part-of-speech tagger unit for performing morphological analysis on each learning document of the learning document group and screening out parts of speech other than nouns and verbs;
And a part-of-speech extracting unit for extracting nouns and verbs from the learning document that has passed through the part-of-speech tagger unit.

The apparatus of claim 1,
wherein the n-gram data set is generated and stored as a bigram set for each category.

The apparatus of claim 3, wherein
the bigram set is built by extracting word-unit bigrams.

Forming a plurality of learning document groups by classifying the plurality of learning documents by category;
Separately preprocessing the plurality of learning document groups;
And automatically generating an n-gram data set from the plurality of learning document groups through the preprocessing.

The method of claim 5,
wherein the separate preprocessing of the plurality of learning document groups comprises:
Removing special characters from each learning document of the learning document group;
Removing stopwords from each learning document of the learning document group;
Performing morphological analysis on each learning document of the learning document group to screen out parts of speech other than nouns and verbs;
Extracting nouns and verbs from the learning document that has undergone the morphological analysis and part-of-speech screening, and extracting the co-occurrence frequencies of the nouns and verbs;
And building an n-gram data set based on the co-occurrence frequencies.

The method of claim 6,
wherein the building of the n-gram data set based on the co-occurrence frequencies
comprises building a bigram data set through the preprocessing for each learning document group.

An automatic document classification apparatus comprising: an automatic document learning unit including a learning document pool that includes a plurality of learning document groups classified by category, and a preprocessing unit for preprocessing each learning document group of the learning document pool, wherein, when a new document that is not identified through the learning document pool appears, the preprocessing unit preprocesses the new document to form a bigram set;
And an automatic document classifying unit that compares the bigram set of the new document formed by the preprocessing unit with the bigram sets of the n-gram data set pool, and allocates and stores the bigram set of the new document in one of the n-gram data sets of the n-gram data set pool.

The apparatus of claim 8,
wherein the automatic document classifying unit, according to the following,
(equation not reproduced in this text)

Bigram Weight: the similarity between the bigram set of the new document and the bigram set of each n-gram data set of the n-gram data set pool; when a bigram extracted from the new document exists among the bigrams of an n-gram data set, the number of occurrences of that word bigram is multiplied in.
CW: the value obtained by multiplying the calculated Bigram Weight by the number of identical bigrams that appear in both the new document and the n-gram data set.
measures the similarity between the bigram sets of the n-gram data set pool and the bigram set of the new document.

The apparatus of claim 9,
wherein the CW is assigned the maximum of the values obtained by multiplying the previously calculated Bigram Weight by the number of bigrams that appear in both the new document and the n-gram data set.

The apparatus of claim 8, wherein the preprocessing unit comprises:
A special character removing unit for removing special characters from each learning document of the learning document group;
A stopword removing unit for removing stopwords from each learning document of the learning document group;
A part-of-speech tagger unit for performing morphological analysis on each learning document of the learning document group and screening out parts of speech other than nouns and verbs;
And a part-of-speech extracting unit for extracting nouns and verbs from the learning document that has passed through the part-of-speech tagger unit.

The apparatus of claim 8,
wherein the automatic document classifying unit determines, through learning of the new document based on word co-occurrence information, occurrence frequencies, and n-gram occurrence frequencies, into which one of the n-gram data sets of the n-gram data set pool the new document is classified.

Forming a plurality of learning document groups by classifying the plurality of learning documents by category;
A preprocessing step of extracting only nouns and verbs for each learning document group through removal of special characters and stopwords and through morphological analysis;
Separately generating an n-gram data set from each of the plurality of learning document groups through the preprocessing;
A new document appearing that is not identified through the n-gram data sets;
Preprocessing the new document in the same way as in the preprocessing step to form the bigrams of the new document;
And comparing the bigrams of the new document with the bigrams of the n-gram data sets, and classifying the new document by assigning it to one of the learning document groups.