KR20020072140A

KR20020072140A - Automatic Text Categorization Method Based on Unsupervised Learning, Using Keywords of Each Category and Measurement of the Similarity between Sentences

Info

Publication number: KR20020072140A
Application number: KR1020010012318A
Authority: KR
Inventors: 서정연; 이근배; 고영중
Original assignee: 서정연; 이근배; 고영중
Priority date: 2001-03-09
Filing date: 2001-03-09
Publication date: 2002-09-14
Also published as: KR100420096B1

Abstract

PURPOSE: A method for automatically categorizing documents is provided that categorizes documents collected from the Internet by automatically generating training data, and learning from it, using only the keywords entered for each category. CONSTITUTION: A preprocessing step (10) normalizes the format of the collected documents, divides them into sentences, and extracts the content words of each sentence through linguistic analysis. A training sentence set generation step (20) automatically generates the training sentence set from the processed sentences through representative sentence extraction and inter-sentence similarity measurement. A feature extraction and categorization step (30) classifies input documents by extracting features from the generated training sentence set and learning from them. The preprocessing step includes a document normalization procedure (110), a sentence division procedure (120), a morphological analysis and tagging procedure (130), and a content word extraction procedure (140).

Description

Automatic Text Categorization Method Based on Unsupervised Learning, Using Keywords of Each Category and Measurement of the Similarity between Sentences

The present invention relates to the automatic categorization of online documents and, more particularly, to a method that performs document categorization at low cost without the manual creation of large numbers of training documents.

With the recent spread of the Internet, the amount of text information available online has grown rapidly. Collecting text documents has become easy, but the collected text information now needs to be managed efficiently.

Conventional automatic document categorization methods usually learn from a large number of training documents to which categories have been assigned by hand, and then perform categorization. Producing a large quantity of high-quality training documents, however, is costly and difficult. In particular, automatic document categorization is now applied not only to newspaper articles and digital libraries but also to an increasingly wide and varied range of domains such as e-mail and newsgroups, so producing a large set of training documents for each domain requires many workers and much time.

To solve these problems, an object of the present invention is to provide a method that performs document categorization without any work to prepare training documents: given only the keywords of each category as input, it automatically generates training data from documents collected from the Internet and learns from that data.

The present invention builds on techniques for dividing a text document into sentences and for morphological analysis and tagging. It uses a statistical information retrieval technique to automatically extract and rank, from the input keywords, the sentences that best capture the characteristics of each target category. It also uses a statistical language analysis technique that builds the training sentence data automatically by measuring inter-sentence similarity, and it achieves document categorization by extracting features from the constructed training sentence data and classifying with them.

FIG. 1 is an overall flowchart of the automatic document categorization method according to the present invention.

FIG. 2 is a flowchart showing an example of the content word extraction process in the preprocessing step of FIG. 1.

FIG. 3 is a flowchart showing an example of the training sentence set generation step of FIG. 1.

FIG. 4 illustrates the iterative calculation of word and sentence similarity used in FIG. 3.

To achieve the above object, the present invention comprises the steps of: (i) dividing the collected documents into sentences and extracting content words through morphological analysis; (ii) extracting representative sentences of each category using the input keywords; (iii) verifying whether the extracted representative sentences represent the characteristics of each category well, and ranking them; (iv) generating the training sentence set to be used for learning by measuring the inter-sentence similarity between the extracted representative sentences and the unclassified sentences that were not extracted as representative sentences; and (v) performing document categorization by extracting features from the generated training sentence set and learning from them.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is an overall flowchart of an automatic document categorization method based on unsupervised learning according to the present invention, which automatically classifies collected documents from the keywords of each category alone, without any work to prepare training documents.

As shown, the method as a whole consists of: a preprocessing step (10) that normalizes the format of the collected documents, divides them into sentences, and extracts the content words of each sentence through linguistic analysis; a training sentence set generation step (20) that automatically generates a training sentence set from the processed sentences through representative sentence extraction and inter-sentence similarity measurement; and a feature extraction and categorization step (30) that extracts features from the generated training sentence set, learns from them, and classifies input documents.

The preprocessing step (10) includes a document normalization process (110) that converts the collected documents into a form the system can process mechanically; a sentence division process (120); a morphological analysis and tagging process (130); and a content word extraction process (140) that extracts the content words reflecting the content or characteristics of each sentence.

The document normalization process (110) removes tags and special characters, such as those found in HTML documents, and converts Chinese characters (Hanja) into the corresponding Hangul.

The sentence division process (120) follows the characteristics of Korean: it treats a period (.), question mark (?), or exclamation mark (!) that follows a sentence-final ending (~다, ~까, ~요, ~죠, etc.) as the end of a sentence, and splits the document into sentences accordingly.
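The sentence division rule can be sketched with a regular expression. This is a minimal illustration, not the patent's implementation: the function name `split_sentences` is hypothetical, and only the four endings listed in the text are covered, whereas a real system would need a fuller list of endings.

```python
import re

# The four sentence-final endings named in the text (an incomplete list).
FINAL_ENDINGS = "다까요죠"

# A boundary is the position right after "<ending><. or ? or !>".
# The lookbehind keeps the punctuation attached to the sentence it closes.
_BOUNDARY = re.compile(r"(?<=[" + FINAL_ENDINGS + r"][.?!])\s*")


def split_sentences(text):
    """Split text into sentences at ending + punctuation boundaries."""
    parts = _BOUNDARY.split(text)
    return [p.strip() for p in parts if p.strip()]
```

Because a boundary requires an ending before the punctuation mark, a period inside a number such as "3.5" does not split the sentence.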

The morphological analysis and tagging process (130) splits each sentence into morphemes through linguistic and statistical analysis and determines the part of speech of each morpheme.

The content word extraction process (140) extracts the content words of a sentence from its nouns and verbs, the parts of speech that best represent a sentence. Some nouns and verbs are stopwords that provide little information for distinguishing the content of a sentence. To handle them, a stopword dictionary is used, and any word registered in it is excluded from content word extraction. An example of the content word extraction process is described later with reference to FIG. 2.

The training sentence set generation step (20) includes a representative sentence extraction process (210) that extracts the representative sentences of each category using the input keywords; a representative sentence verification process (220) that verifies whether the extracted representative sentences represent the characteristics of each category well and ranks them; and an inter-sentence similarity measurement process (230) that generates the final training sentence set by measuring inter-sentence similarity. An example of the training sentence set generation step is described later with reference to FIG. 3.

The representative sentence extraction process (210) extracts the sentences that directly contain a category's input keyword as one of their content words and regards them as the sentences that best represent the characteristics of that category.

Some sentences that contain a keyword do not belong to the category or do not represent its characteristics well. To remove them, the representative sentence verification process (220) computes a weight for each extracted sentence and ranks the sentences by how well they represent the characteristics of their category. Each content word of an extracted representative sentence is weighted using term frequency (TF) and inverse category frequency (ICF), both widely used in information retrieval, and the weight of a sentence is the average of the weights of its content words.
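The text names TF and ICF but does not give their formulas. The sketch below assumes one common form, weight(w, c) = tf(w, c) × log(N / cf(w)), where tf(w, c) is the count of word w in the representative sentences of category c, N is the number of categories, and cf(w) is the number of categories whose representative sentences contain w. The function name `rank_representatives` and this exact weighting are assumptions made for illustration.

```python
import math
from collections import Counter


def rank_representatives(rep):
    """rep: dict mapping category -> list of sentences (lists of content words).

    Returns dict mapping category -> list of (weight, sentence), best first.
    """
    n_cat = len(rep)
    # Term frequency of each word within each category's representatives.
    tf = {c: Counter(w for s in sents for w in s) for c, sents in rep.items()}
    # Category frequency: in how many categories each word occurs.
    cf = Counter(w for c in tf for w in tf[c])
    ranked = {}
    for c, sents in rep.items():
        scored = []
        for s in sents:
            weights = [tf[c][w] * math.log(n_cat / cf[w]) for w in s]
            # Sentence weight = average of its content word weights.
            avg = sum(weights) / len(weights) if weights else 0.0
            scored.append((avg, s))
        ranked[c] = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return ranked
```

Under this assumed form, a word that occurs in every category gets weight zero, so a sentence made up mostly of such words falls to the bottom of the ranking and can be dropped.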

In the inter-sentence similarity measurement process (230), the set of representative sentences extracted so far is still too small to serve as training data for document categorization. The sentences that were not extracted as representative sentences are therefore compared with the representative sentences of each category, and each is assigned to the category with the highest measured similarity, which yields the training sentence set. The invention computes inter-sentence similarity by iterative calculation over a word similarity matrix and a sentence similarity matrix, as illustrated in FIG. 4.

The feature extraction and categorization step (30) includes a feature extraction process (310) that extracts the features to be used for learning from the generated training sentence set, and a document categorization process (320) that learns with the extracted features and assigns a category to each input document. The feature extraction process (310) uses the chi-square (χ²) statistic, and the document categorization process (320) uses a Bayesian probability model as the document classifier.
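The following sketch shows one way the two processes could fit together. The chi-square score uses the standard 2×2 contingency form, χ²(t, c) = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)). The classifier is a multinomial naive Bayes model with add-one smoothing, which is an assumption: the text says only "Bayesian probability model". All function names are hypothetical.

```python
import math
from collections import Counter


def chi_square(train, term, category):
    """train: list of (category, content_words) pairs."""
    a = sum(1 for cat, ws in train if cat == category and term in ws)
    b = sum(1 for cat, ws in train if cat != category and term in ws)
    c = sum(1 for cat, ws in train if cat == category and term not in ws)
    d = sum(1 for cat, ws in train if cat != category and term not in ws)
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(train, k):
    """Keep the k words with the highest chi-square score over any category."""
    vocab = {w for _, ws in train for w in ws}
    cats = {cat for cat, _ in train}
    score = {w: max(chi_square(train, w, c) for c in cats) for w in vocab}
    return set(sorted(vocab, key=lambda w: (-score[w], w))[:k])


def train_nb(train, features):
    """Count training sentences per category and feature words per category."""
    cats = Counter(cat for cat, _ in train)
    counts = {c: Counter() for c in cats}
    for cat, ws in train:
        counts[cat].update(w for w in ws if w in features)
    return cats, counts, features


def classify(model, words):
    """Return the category with the highest smoothed log-probability."""
    cats, counts, features = model
    total = sum(cats.values())
    best, best_lp = None, -math.inf
    for c in sorted(cats):
        lp = math.log(cats[c] / total)
        denom = sum(counts[c].values()) + len(features)
        for w in words:
            if w in features:
                lp += math.log((counts[c][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

Here the training items are the sentences of the training sentence set, while `classify` takes the content words of a whole input document.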

FIG. 2 illustrates how, after the collected documents have passed through document normalization and sentence division, morphological analysis and tagging are used to extract the content words that reflect the content or characteristics of each sentence.

First, the document normalization process removes Chinese characters, tags, and the like from the collected documents, and the documents are divided into sentences (S11).

As illustrated, morphological analysis and tagging then determine the part of speech of each morpheme in the divided sentences through linguistic and statistical analysis (S12).

From these, only nouns (including loanwords) and verbs, the parts of speech that best represent a sentence, are extracted as content words (S13). Some of the extracted content words are stopwords: they appear frequently across many sentences and so provide little information for distinguishing the content of a sentence. In the example of S13, '기본' ('basic') [noun] is a stopword. To remove such words, a stopword dictionary is built in advance, and any word registered in it is removed, leaving the final content words of the sentence (S14).
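Steps S13 and S14 amount to a filter over the tagger output. The sketch below assumes the tagger returns (morpheme, tag) pairs and uses the hypothetical tag names "noun" and "verb"; a real Korean tagger has its own tag set, and the function name is illustrative.

```python
# Parts of speech kept as content words (tag names are assumptions).
CONTENT_TAGS = {"noun", "verb"}


def extract_content_words(tagged, stopwords):
    """tagged: list of (morpheme, tag) pairs from a morphological analyzer.

    Keeps nouns and verbs (S13), then drops stopword-dictionary entries (S14).
    """
    candidates = [w for w, tag in tagged if tag in CONTENT_TAGS]
    return [w for w in candidates if w not in stopwords]
```

With '기본' registered in the stopword dictionary, as in the S13 example, it is dropped even though it is tagged as a noun.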

FIG. 3 illustrates the process of automatically generating a training sentence set for each category from a sentence set. Suppose the sentence set of the collected documents is as in S21, the category keywords are as in S22, and there are two categories, 'music' and 'Internet'. Sentence 1, which has 'music', the keyword of the 'music' category, as a content word, is extracted as a representative sentence of the 'music' category (S23). Sentence 2, which directly has 'Internet', the keyword of the 'Internet' category, as a content word, is extracted as a representative sentence of the 'Internet' category (S24). Sentences that do not directly have any category keyword as a content word are placed in the unclassified sentence set (S25).
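Steps S23 to S25 can be sketched as a keyword lookup over the content words of each sentence. The function name is hypothetical, and the handling of a sentence that matches keywords of several categories (here it is added to every matching category) is an assumption, since the text does not cover that case.

```python
def extract_representatives(sentences, keywords):
    """sentences: list of content-word lists.
    keywords: dict mapping category -> set of keywords.

    Returns (representatives, unclassified), both holding sentence indices.
    """
    representatives = {c: [] for c in keywords}
    unclassified = []
    for idx, words in enumerate(sentences):
        word_set = set(words)
        matched = [c for c, kws in keywords.items() if kws & word_set]
        for c in matched:
            representatives[c].append(idx)   # S23 / S24
        if not matched:
            unclassified.append(idx)         # S25
    return representatives, unclassified
```

For example, with keywords {"music": {"music"}, "internet": {"internet"}}, a sentence whose content words include "music" becomes a representative of the 'music' category, and a sentence containing neither keyword lands in the unclassified list.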

The extracted representative sentences alone are too few to form a training sentence set for each category. The unclassified sentences that were not extracted as representative sentences are therefore compared with the representative sentences of each category, and each is assigned to the category that yields the highest similarity value (S26). Sentences 3 and 4 contain no keyword and were initially unclassified. After similarity measurement (S26), sentence 3 is assigned to the 'music' category (S27). The similarity value of sentence 4 does not reach a certain threshold, so it is regarded as belonging to no category; it remains in the unclassified sentence set and takes no part in learning (S28).
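The assignment in S26 to S28 can be sketched as follows, given precomputed similarity values. Scoring a category by its single most similar representative sentence is an assumption (the text says only that the category with the highest similarity wins), and the function name and threshold value are illustrative.

```python
def assign_unclassified(sim_to_reps, rep_category, threshold):
    """sim_to_reps[i][j]: similarity of unclassified sentence i to representative j.
    rep_category[j]: category of representative sentence j.

    Returns one category per unclassified sentence, or None if it stays unclassified.
    """
    assigned = []
    for row in sim_to_reps:
        best = {}
        for j, score in enumerate(row):
            cat = rep_category[j]
            # Category score = its most similar representative (assumption).
            best[cat] = max(best.get(cat, 0.0), score)
        top = max(best, key=best.get) if best else None
        if top is not None and best[top] >= threshold:
            assigned.append(top)       # S27: joins that category's training set
        else:
            assigned.append(None)      # S28: stays unclassified
    return assigned
```

With made-up similarity rows [0.8, 0.1] and [0.2, 0.15] against one 'music' and one 'internet' representative and a threshold of 0.5, the first sentence is assigned to 'music' and the second stays unclassified, mirroring sentences 3 and 4 in the example.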

The method of measuring inter-sentence similarity is central to the invention. Rather than the simple methods conventionally used in information retrieval, it computes inter-sentence similarity iteratively using a word similarity matrix (S41) and a sentence similarity matrix (S42), as shown in FIG. 4. Similar words tend to occur in similar contexts, and the method uses this tendency to bring contextual information into the similarity measurement. Words and sentences play complementary roles: a sentence is represented by the words it contains, and a word is represented by the sentences that contain it. That is, two sentences are more similar the more similar words they contain, and two words are more similar the more often they are used in similar sentences. To capture this, the two similarity matrices (S41) and (S42) are constructed, and the similarity values computed in each iteration are fed back into one another.

The rows and columns of the word similarity matrix (S41) consist of all the content words contained in the category representative sentences and unclassified sentences whose similarity is to be measured, and its entries are the similarity values between content words. The sentence similarity matrix (S42) holds the similarity values between the representative sentences and the unclassified sentences.
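The text does not give the update formulas for the two matrices, so the sketch below is only one simple scheme consistent with the description, not the patent's own computation. It assumes that word similarity starts as the identity, that sentence similarity is the average best word match, and that word similarity is the average best sentence match. For simplicity it keeps a square sentence matrix over all sentences rather than only representative against unclassified pairs.

```python
def iterate_similarity(sentences, n_iter=3):
    """sentences: list of content-word lists. Returns (word_sim, sent_sim)."""
    vocab = sorted({w for s in sentences for w in s})
    containing = {w: [i for i, s in enumerate(sentences) if w in s] for w in vocab}
    # Initially a word is similar only to itself (assumption).
    word_sim = {a: {b: 1.0 if a == b else 0.0 for b in vocab} for a in vocab}
    n = len(sentences)
    sent_sim = [[0.0] * n for _ in range(n)]
    for _ in range(n_iter):
        # Sentences are similar when they contain similar words.
        for i, si in enumerate(sentences):
            for j, sj in enumerate(sentences):
                if i == j:
                    sent_sim[i][j] = 1.0
                elif si and sj:
                    sent_sim[i][j] = sum(
                        max(word_sim[a][b] for b in sj) for a in si
                    ) / len(si)
        # Words are similar when they occur in similar sentences.
        for a in vocab:
            for b in vocab:
                if a != b:
                    word_sim[a][b] = sum(
                        max(sent_sim[i][j] for j in containing[b])
                        for i in containing[a]
                    ) / len(containing[a])
    return word_sim, sent_sim
```

The point of iterating is that two sentences sharing no word can still become similar through context. If sentence A shares a word with sentence B, and B shares a different word with sentence C, then A and C have similarity zero after the first pass but a positive value after the second, while a sentence on an unrelated topic stays at zero.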

By performing document categorization without the manual creation of a large number of category-labeled training documents, the present invention is useful in online document categorization applications where categorization must be done at low cost. In addition, when a large set of training documents does have to be produced, applying the proposed technique minimizes the time and labor required.

Claims

An automatic document categorization method for documents collected from the Internet, comprising generating a training sentence set by using the sentences of the collected documents as the basic semantic units of each category, and classifying into each category by using an inter-sentence similarity measure.

(a) dividing the collected document into sentence units and extracting a content word by morphological analysis;

(b) extracting a representative sentence of each category using the input keywords;

(c) verifying and ranking whether the extracted representative sentence well represents the characteristics of each category;

(d) generating a training sentence set to be used for learning by measuring the similarity between the extracted representative sentences and the unclassified sentences that were not extracted as representative sentences; and

(e) extracting features from the generated training sentence set, learning from them, and assigning a category to a document.

The automatic document categorization method of claim 2, wherein extracting the content words of the collected documents comprises: a document normalization step that makes the collected documents mechanically processable; a sentence division step that divides the normalized documents into sentences; and a morphological analysis and tagging step applied to the divided sentences, and wherein the content word extraction uses a stopword dictionary.

The automatic document categorization method of claim 2, wherein extracting the representative sentences comprises extracting the sentences that directly include a category's keyword as a content word and regarding them as the sentences that best represent the characteristics of that category.

The automatic document categorization method of claim 2, wherein the representative sentence verification and ranking step weights each content word of the extracted representative sentences using term frequency (TF) and inverse category frequency (ICF).

The automatic document categorization method of claim 2, wherein the inter-sentence similarity in the training sentence set generation step is obtained through iterative calculation using a word similarity matrix and a sentence similarity matrix.

The automatic document categorization method of claim 6, wherein the rows and columns of the word similarity matrix consist of all the content words contained in the category representative sentences and unclassified sentences whose similarity is to be measured and hold the similarity values between content words, and wherein the sentence similarity matrix holds the similarity values between the representative sentences and the unclassified sentences.