JPH08287097A

JPH08287097A - Method and device for sorting document

Info

Publication number: JPH08287097A
Application number: JP7093985A
Authority: JP
Inventors: Seiji Washisaki; 誠司鷲▲崎▼; Masahiro Oku; 雅博奥
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1995-04-19
Filing date: 1995-04-19
Publication date: 1996-11-01

Abstract

PURPOSE: To classify documents according to the content of their sentences by taking into account, when the document classification probability is obtained, not only the probability of each word but also the relations between adjacent words, and by obtaining for each of them the probability that the document is classified into a given document group. CONSTITUTION: After the data for calculating classification probabilities have been accumulated by the classification probability accumulation process, a new document is input to a new document input part 6. A morphological analysis part 7 divides the input document into sentences and then morphologically analyzes them to extract words. A classification item extracting part 8 extracts single words, pairs of adjacent words, and triples of adjacent words from the obtained words as classification items. A classification probability extracting part 9 selects one class as the classification candidate from among all the classes. A classification probability calculation part 10 then retrieves from a probability accumulation part 14, for each extracted classification item, the probability that the item places a document in that class. The probability that the document belongs to each class is then calculated individually from the classification probabilities of the classification items.

Description

Detailed Description of the Invention

【０００１】[0001]

[Field of Industrial Application] The present invention relates to a document classification method and apparatus, and more specifically to a document classification apparatus that determines, for a document input by a user, the most appropriate class from among a plurality of classification candidates prepared in advance. In particular, it relates to a document classification method and apparatus that calculate in advance the probability with which each word or phrase classifies a document and use those probabilities to classify documents probabilistically.

【０００２】[0002]

[Prior Art] In recent years, research has advanced on document classification methods that select, for a given document, an appropriate class from among preset classification candidates. Among the various methods, those known as probability-based document classification methods have attracted particular attention. In a probability-based method, documents whose classes (possibly more than one per document) have already been determined are input as training data, and the degree to which each word in them classifies documents is stored as a probability. When a new document to be classified is input, the stored per-word classification probabilities are combined to determine which class the document belongs to.

【０００３】The conventional probability-based document classification methods are represented by the following four.
1. First method: Probabilistic Relevance Weighting (PRW)
An example of this method is "Relevance weighting of search terms; Journal of the American Society for Information Science, 27: pp. 129-146, 1976".

【０００４】In this method, let P(c|d) denote the probability that a document d is classified into a class c. The basic idea is to express P(c|d) approximately by using the probability that document d is classified into a class other than c. Let c* denote classification into a class other than c, and let P(c*|d) denote the probability that document d is classified into a class other than c. The probability P(c|d) can then be calculated by finding g(c|d), which is expressed as follows.

【０００５】[0005]

【数１】 [Equation 1]

【０００６】Using Bayes' theorem, equation (1) above can be transformed as follows.

【０００７】[0007]

【数２】 [Equation 2]

【０００８】In equation (2), P(c) is obtained from the training data and represents the probability that a randomly selected document d is classified into class c. Therefore, the probability g(c|d) is obtained once P(d|c) is found. P(d|c) can be obtained from information on the occurrence of words in the document. That is, assuming that each word appears in a document independently, the equation can be transformed as follows.

【０００９】[0009]

【数３】 (Equation 3)

【００１０】Here, c−d is the set of words that do not appear in the document d whose class is to be determined but that exist in documents other than d classified into class c in the training data. T_i denotes a word, and T = 1 or 0 indicates whether or not t_i is present in document d. Finally, g(c|d) is expressed as follows.

【００１１】[0011]

【数４】 [Equation 4]

【００１２】These quantities can be calculated from the training data, which yields the classification probability. Details are given on pp. 130-135 of the reference above.
2. Second method: Component Theory (CT)
An example of this method is "Experiments with a component theory of probabilistic information retrieval based on single terms as document components; ACM Transactions on Information Systems, 8 (4): pp. 363-386, 1990".

【００１３】This method was devised to remedy the problems of the first method. It assumes that a document is composed of words. Whereas the first method uses, when calculating the probability, the probability with which each individual word classifies a document, the second method uses the probability with which any one or more words jointly determine the class. g(c|d) in equation (1) of the first method is transformed in the manner of a geometric series as follows.

【００１４】[0014]

【数５】 (Equation 5)

【００１５】Here, P(T=t|d) and P(T=t|c) denote, respectively, the probability of word t in document d and the probability of word t in the documents classified into class c. These can be obtained from the training data. Details are given on pp. 368-375 of the reference for the second prior-art method.

【００１６】3. Third method: Retrieval with Probabilistic Indexing (RPI)
An example of this method is "Models for retrieval with probabilistic indexing; Information Processing & Retrieval, 25(1); pp. 55-72, 1989". This method adds to the second method the ability to expand the words into a vector when the classification probability is obtained. That is, the word vector X representing the features of document d is written X = (T_1, T_2, ..., T_N), where T_i is an element that is 1 when document d contains word t_i and 0 otherwise. P(c|d) can then be expressed by the following equation.

【００１７】[0017]

【数６】 (Equation 6)

【００１８】If Bayes' theorem is further applied to this equation under the assumption that the words are independent of one another, equation (9) above can be transformed as follows.

【００１９】[0019]

【数７】 (Equation 7)

【００２０】This can be calculated from the training data. Details are given on pp. 56-63 of the reference for the third prior-art method.
4. Fourth method: Single Random Variable with Multiple Values (SVMV)
An example of this method is "A Probabilistic Model for Text Categorization Based on a Single Random Variable with Multiple Values; ANLP '94, pp. 162-167, 1994".

【００２１】The fourth method is an extension of the third. What is new is that, instead of the word vectorization of the third method, it takes into account the frequency information of the words themselves in the document. In this method, the classification probability P(c|d) is obtained by transforming equation (9) and can be expressed as follows.

【００２２】[0022]

【数８】 (Equation 8)

【００２３】If, for each given word t_i, the probability that document d is classified into class c is independent, equation (12) above can be transformed in the same way as in the third method. That is,

【００２４】[0024]

【数９】 [Equation 9]

【００２５】P(T=t_i|c), P(T=t_i|d), P(T=t_i), P(c), and so on can be calculated from the training data. Details are given on pp. 164-165 of the reference for the fourth prior-art method.

【００２６】[0026]

[Problems to Be Solved by the Invention] The problems of the conventional probability-based document classification methods are described below. They fall into four broad categories.
1. The first method, Probabilistic Relevance Weighting (PRW), has the following problems.

【００２７】Problem 1: The frequency of words in a document is not taken into account. Because the probability is treated as 1 when a word is present in a document and 0 otherwise, word frequency information, which is considered useful for document classification, cannot be incorporated.
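The difference between presence-only and frequency-based features can be illustrated with a minimal sketch. The function names and the sample word list are hypothetical and serve only to show what information the binary representation discards.

```python
from collections import Counter


def binary_features(words):
    """Presence/absence only: every word that occurs maps to 1."""
    return {w: 1 for w in set(words)}


def frequency_features(words):
    """Occurrence counts: repeated words keep their multiplicity."""
    return dict(Counter(words))


# Hypothetical document in which one word is repeated three times.
doc = ["hijack", "airplane", "hijack", "hijack"]
print(binary_features(doc))
print(frequency_features(doc))
```

In the binary view the repeated word is indistinguishable from a word that occurs once, which is the limitation described as Problem 1.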

【００２８】Problem 2: The weight of a word with respect to the classes is not taken into account. No information is included about how many of all the classes contain a given word, so the weight that a word carries when it is contained in a class is ignored. For example, if word w1 occurs in class c1 and word w2 occurs in classes c1, c2, c3, c4, and c5, then w1 characterizes class c1 more readily than w2 does. This property must be taken into account for accurate classification.
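The w1/w2 example can be made concrete with a short sketch. The reciprocal weight used here is an assumption chosen for illustration only; the text does not state how the weight should be computed, and the names `class_spread` and `specificity` are hypothetical.

```python
def class_spread(word, class_vocab):
    """Number of classes whose training documents contain `word`."""
    return sum(1 for vocab in class_vocab.values() if word in vocab)


def specificity(word, class_vocab):
    """ASSUMED illustrative weight: reciprocal of the class spread."""
    spread = class_spread(word, class_vocab)
    return 1.0 / spread if spread else 0.0


# w1 occurs only in c1; w2 occurs in c1 through c5, as in the example above.
class_vocab = {
    "c1": {"w1", "w2"},
    "c2": {"w2"},
    "c3": {"w2"},
    "c4": {"w2"},
    "c5": {"w2"},
}
print(class_spread("w1", class_vocab), specificity("w1", class_vocab))
print(class_spread("w2", class_vocab), specificity("w2", class_vocab))
```

A word confined to one class receives the largest weight under this assumption, while a word spread over all five classes receives a smaller one.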

【００２９】Problem 3: Small amounts of training data cannot be handled. When the training data are scarce, the denominator of the probability formula may become 0, so that the probability cannot be calculated. A well-known remedy for this problem is to add a very small number to the numerator and the denominator of the probability formula. With this remedy, however, an accurate probability cannot be calculated.
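The zero-denominator case and the additive remedy can be sketched as follows. The exact form of the smoothing (a constant `eps` added once per class to the denominator) is an assumption; the text only says that a small number is added to the numerator and the denominator.

```python
def relative_frequency(count_in_class, count_total):
    """Unsmoothed estimate; undefined (None) when the item was never seen."""
    if count_total == 0:
        return None
    return count_in_class / count_total


def smoothed_frequency(count_in_class, count_total, num_classes, eps=1e-3):
    """ASSUMED additive smoothing: the denominator can no longer be 0."""
    return (count_in_class + eps) / (count_total + eps * num_classes)


print(relative_frequency(0, 0))      # no estimate is possible
print(smoothed_frequency(0, 0, 4))   # falls back to a uniform value
print(relative_frequency(3, 4), smoothed_frequency(3, 4, 4))
```

For an item that never occurs in the training data, the smoothed value is simply uniform over the classes, which illustrates why the remedy avoids the division by zero without yielding an accurate probability.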

【００３０】Problem 4: Relations between words cannot be taken into account. Because the words for which probabilities are calculated are simply the individual words of a sentence, semantic relations between words cannot be considered.
2. The second method, Component Theory (CT), has the following problems.

【００３１】Problem 3: Small amounts of training data cannot be handled. When the training data are scarce, the denominator of the probability formula may become 0, so that the probability cannot be calculated. A well-known remedy for this problem is to add a very small number to the numerator and the denominator of the probability formula. With this remedy, however, an accurate probability cannot be calculated.

【００３２】Problem 4: Relations between words cannot be taken into account. Because the words for which probabilities are calculated are simply the individual words of a sentence, semantic relations between words cannot be considered.
3. The third method, Retrieval with Probabilistic Indexing (RPI), has the following problems.

【００３３】Problem 1: The frequency of words in a document is not taken into account. Because the probability is treated as 1 when a word is present in a document and 0 otherwise, word frequency information, which is considered useful for document classification, cannot be incorporated.

【００３４】Problem 3: Small amounts of training data cannot be handled. When the training data are scarce, the denominator of the probability formula may become 0, so that the probability cannot be calculated. A well-known remedy for this problem is to add a very small number to the numerator and the denominator of the probability formula. With this remedy, however, an accurate probability cannot be calculated.

【００３５】Problem 4: Relations between words cannot be taken into account. Because the words for which probabilities are calculated are simply the individual words of a sentence, semantic relations between words cannot be considered.
4. The fourth method, Single Random Variable with Multiple Values (SVMV), has the following problems.

【００３６】Problem 4: Relations between words cannot be taken into account. Because the words for which probabilities are calculated are simply the individual words of a sentence, semantic relations between words cannot be considered.
Thus, the conventional techniques described above do not use word frequency information and cannot cope with scarce training data. Moreover, because they decide which document group a document belongs to on the basis of the probabilities of the individual words contained in it, the relations between words in the document are not reflected in the classification result, and accurate classification is therefore not possible.

【００３７】The present invention has been made in view of the above points. Its object is to solve the conventional problems described above and to provide a document classification method and apparatus capable of classifying newly input documents more accurately.

【００３８】[0038]

[Means for Solving the Problems] The present invention is a document classification method that selects, for a document input by a user, the most appropriate class from among document classification candidates stored in advance. In this method, the input document is classified into document groups classified in advance, and when the document classification probability is obtained from the words contained in the document at classification time, not only the probabilities of those words but also the relations between adjacent words are acquired and used to calculate the document classification probability.

【００３９】FIG. 1 is a diagram for explaining the principle of the present invention. In the present invention, when documents are to be classified, documents classified in advance are read for calculating document classification probabilities (step 1); the sentences in the read documents are morphologically analyzed (step 2); from the words divided by the morphological analysis, sequences of n consecutive words (n is a natural number) to be used in calculating classification probabilities are extracted as classification items of the documents (step 3); for each extracted classification item, the frequency with which documents are classified into each class among all the classes is calculated (step 4); from the calculated frequencies, the probability with which each classification item classifies a document is calculated (step 5); and the calculated probabilities are stored for document classification (step 6). When a new document to be classified is input (step 7), the sentences in the input document are morphologically analyzed (step 8); sequences of n consecutive words are extracted as classification items from the analyzed words (step 9); for the extracted word sequences, the classification probabilities with which a document belongs to each class are retrieved from the matching sequences of n consecutive words stored in advance (step 10); the retrieved classification probabilities of the classification items are used to calculate individually the probability that the document is classified into each class (step 11); the classification result of the document is determined in descending order of the calculated classification probabilities (step 12); and the determined result is displayed as the classification result (step 13).
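Steps 1 to 12 can be sketched end to end as follows. This is a minimal illustration under several assumptions: documents arrive already segmented into words (the morphological analysis of steps 2 and 8 is not implemented), single words, pairs, and triples are all used as items, and the per-item probabilities are combined by a plain average, which is a placeholder because this section does not give the combination formula. All names are hypothetical.

```python
from collections import Counter, defaultdict


def ngrams(words, max_n=3):
    """Steps 3 and 9: every run of 1..max_n consecutive words."""
    return [tuple(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def train(labeled_docs, max_n=3):
    """Steps 1-6: count item/class co-occurrences, store P(class | item)."""
    counts = defaultdict(Counter)            # item -> class frequencies (step 4)
    for words, classes in labeled_docs:
        for item in ngrams(words, max_n):
            for c in classes:
                counts[item][c] += 1
    # Step 5: turn frequencies into probabilities; the dict is the store (step 6).
    return {item: {c: k / sum(by_class.values()) for c, k in by_class.items()}
            for item, by_class in counts.items()}


def classify(words, table, max_n=3):
    """Steps 7-12: look up stored items, combine them, rank the classes."""
    matched = [table[item] for item in ngrams(words, max_n) if item in table]
    scores = Counter()
    for probs in matched:                    # step 11
        for c, p in probs.items():
            scores[c] += p / len(matched)    # ASSUMED combination rule: mean
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)  # step 12


# Hypothetical pre-segmented training documents with one or more classes each.
training = [
    (["airplane", "hijack", "suspect"], ["airplane", "crime"]),
    (["airplane", "takeoff", "delay"], ["airplane"]),
    (["bank", "robbery", "suspect"], ["crime"]),
]
table = train(training)
ranking = classify(["airplane", "hijack"], table)
print(ranking)
```

The ranked list corresponds to step 12; a document whose items were never seen in training yields an empty ranking in this sketch.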

【００４０】Further, when the classification result of the document is determined in step 12, those whose calculated classification probability exceeds a preset threshold are determined as the classification result.
The present invention is also a document classification apparatus that selects, for a document input by a user, the most appropriate class from among document classification candidates stored in advance. The apparatus has means for classifying the input document into document groups classified in advance, and means for calculating the document classification probability by acquiring, when that probability is obtained from the words contained in the document at classification time, not only the probabilities of those words but also the relations between adjacent words.

【００４１】FIG. 2 is a block diagram showing the principle of the present invention. The present invention has classification probability calculation/accumulation means 100 and classification probability reference/calculation means 200.
The classification probability calculation/accumulation means 100 uses: first document input means 101 for reading documents into memory in advance for calculating document classification probabilities when input documents are to be classified; first morphological analysis means 102 for dividing the sentences in the documents read by the first document input means 101 into words; first classification item extraction means 103 for extracting, from the words divided by the first morphological analysis means 102, sequences of n consecutive words (n is a natural number) to be used in calculating classification probabilities as classification items of the documents; classification frequency calculation means 104 for calculating, for each classification item extracted by the first classification item extraction means 103, the frequency with which documents are classified into each class among all the classes; document classification probability calculation means 105 for calculating, from the frequencies calculated by the classification frequency calculation means 104, the probability with which each classification item classifies a document; and classification probability accumulation means 106 for storing the probabilities calculated by the document classification probability calculation means 105 in a probability accumulation device 107 for use in document classification.
The classification probability reference/calculation means 200 uses: second document input means 201 for reading a new document to be classified when such a document is input; second morphological analysis means 202 for dividing into words the sentences in the document input from the first document input means; second classification item extraction means 203 for extracting sequences of n consecutive words from the words divided by the second morphological analysis means 202; classification probability extraction means 204 for referring to the classification probability accumulation means 107 and extracting, from the matching sequences of n consecutive words, the classification probabilities with which a document belongs to each class; classification probability calculation means 205 for individually calculating, using the per-item classification probabilities extracted by the classification probability extraction means 204, the probability that the document is classified into each class; classification judgment means 206 for determining the classification result of the document in descending order of the classification probabilities calculated by the classification probability calculation means 205; and classification result output means 207 for displaying the result determined by the classification judgment means 206 as the classification result.

【００４２】When determining the classification result of a document, the classification judgment means 206 determines as the classification result those whose calculated classification probability exceeds a preset threshold.
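The two decision rules, ranking by probability and selection by threshold, can be contrasted in a minimal sketch. The class names, score values, and threshold are hypothetical, and the functions are illustrative rather than a description of the classification judgment means itself.

```python
def rank_classes(scores):
    """Classes ordered from the highest classification probability down."""
    return sorted(scores, key=scores.get, reverse=True)


def classes_above(scores, threshold):
    """Only the classes whose probability exceeds the preset threshold."""
    return [c for c in rank_classes(scores) if scores[c] > threshold]


# Hypothetical per-class probabilities for one input document.
scores = {"crime": 0.44, "airplane": 0.56, "sports": 0.05}
print(rank_classes(scores))
print(classes_above(scores, 0.3))
```

With a threshold, the number of classes assigned to a document is not fixed in advance and may be zero when no class is probable enough.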

【００４３】[0043]

[Operation] In the present invention, input documents are classified into document groups classified in advance. Problems 1 and 2 described above are addressed by considering not only whether a word is present but also, when it is present, its frequency.

【００４４】Problem 3 is addressed as follows. When the training data are scarce and a probability cannot be calculated, the frequency information of adjacent words is considered rather than single words alone, so that a more accurate probability value can be obtained. Problem 4 is addressed as follows. When the document classification probability is obtained, not only the probabilities of words are used; sequences of n adjacent words are also extracted together, and the probability with which each sequence classifies a document into a given class is obtained. Given that co-occurrence relations exist among those words, documents can be classified more closely in line with the content of their sentences than with the conventional techniques.
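Why adjacent words carry information that single words do not can be seen in a small sketch. The two word sequences are hypothetical examples; they contain exactly the same words in a different order.

```python
from collections import Counter


def unigram_counts(words):
    """Counts of single words; word order is lost."""
    return Counter(words)


def bigram_counts(words):
    """Counts of adjacent word pairs; local word order is kept."""
    return Counter(zip(words, words[1:]))


a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]
print(unigram_counts(a) == unigram_counts(b))   # same bag of words
print(bigram_counts(a) == bigram_counts(b))     # different adjacent pairs
```

A classifier that sees only single words cannot tell the two sequences apart, whereas one that also uses adjacent pairs can.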

【００４５】[0045]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 3 shows the configuration of a document classification apparatus according to one embodiment of the present invention, and FIG. 4 is a flowchart outlining the operation of that apparatus.

【００４６】In FIG. 3, when classification probabilities are accumulated, the document classification apparatus uses a system composed of: a document input unit 1 for inputting training documents used to calculate the probabilities applied at classification time; a morphological analysis unit 2 for segmenting the input documents into part-of-speech units; a classification item extraction unit 3 for extracting the items needed to calculate classification probabilities; an item-wise probability calculation unit 4 for calculating the classification probability of a document for each item; a classification probability accumulation unit 5 for calculating the classification probability of a document as a whole; a Japanese dictionary 13 used for morphological analysis; and a probability accumulation unit 14 for storing the classification probabilities calculated for each item. The item-wise probability calculation unit 4 has a classification frequency calculation unit 41, which obtains for each classification item the frequency with which documents are classified into each class among all the classes, and a classification frequency computation unit 42, which calculates from the obtained classification frequencies the probability with which each classification item classifies a document.

【００４７】When classification probabilities are calculated, the apparatus uses a system composed of: a document input unit 6 for inputting a new document to be classified; a morphological analysis unit 7 for segmenting the input document into part-of-speech units; a classification item extraction unit 8 for extracting the relevant items for calculating classification probabilities; a classification probability extraction unit 9 for retrieving the classification probabilities stored in advance for each classification item; a classification probability calculation unit 10 for calculating the probability that the input document belongs to each class; a classification judgment unit 11 for judging which class the document falls into; a classification result output unit 12 for outputting the classification result; the Japanese dictionary 13 used for morphological analysis; and the probability accumulation device 14 for storing the classification probabilities calculated for each item. In this embodiment, the classification result output unit 12 outputs the classification result to a display device.

【００４８】The operation of one embodiment of the present invention is described below, following the configuration diagram of the document classification apparatus in FIG. 3 and the processing flow of the apparatus in FIG. 4. This embodiment shows an example in which documents are classified by focusing on three adjacent words. The processing flow is the same when the number of adjacent words is larger.
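That the flow does not depend on the number of adjacent words can be shown with a sketch in which that number is simply a parameter. The function name and the placeholder words w1 to w4 are hypothetical.

```python
def adjacent_words(words, n):
    """All runs of n consecutive words, in order of appearance."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = ["w1", "w2", "w3", "w4"]    # placeholder words from one sentence
print(adjacent_words(words, 2))     # adjacent pairs
print(adjacent_words(words, 3))     # adjacent triples, as in this embodiment
print(adjacent_words(words, 5))     # sentence shorter than n: nothing extracted
```

Changing `n` changes only the length of the extracted runs; the surrounding counting and lookup steps stay the same.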

【００４９】First, the process of FIG. 4 is described, in which the per-item probabilities needed for assigning classification probabilities to documents are stored in the probability accumulation device 14. FIG. 5 describes the processing performed when classification probabilities are calculated, and FIG. 6 describes that processing in detail.
Step 101) The document input unit 1 inputs a large number of training documents such as the one shown in FIG. 7. In the example of FIG. 7, several correct classes of the document are input together with its body text. Classes range from fine-grained ones such as "airplane" in FIG. 7 to broad ones such as "crime". A document generally has several candidate classes.
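A training document that carries several correct classes can be represented with a small record type, as in the following sketch. The type name, field names, and sample bodies are hypothetical; FIG. 7 itself is not reproduced here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingDocument:
    body: str            # body text of the document
    classes: List[str]   # every correct class; usually more than one


def group_by_class(docs):
    """Index the training documents by each class they belong to."""
    groups = defaultdict(list)
    for doc in docs:
        for c in doc.classes:
            groups[c].append(doc)
    return dict(groups)


# Hypothetical training set; the bodies are placeholders.
docs = [
    TrainingDocument("hijack report", ["airplane", "crime"]),
    TrainingDocument("flight delay notice", ["airplane"]),
]
groups = group_by_class(docs)
print({c: len(members) for c, members in groups.items()})
```

Because one document may appear under several classes, the group sizes can add up to more than the number of documents.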

【００５０】Step 102) When a document is input, the morphological analysis unit 2 divides it into sentences and performs morphological analysis with reference to the Japanese dictionary 13. For example, the body text of FIG. 7 is divided into sentences such as "The airplane from Cape Town, South Africa, bound for London was hijacked by a group of five just before takeoff." and "A total of 359 crew members and passengers remain on board, and the situation is tense and unpredictable.", and each sentence is analyzed using the Japanese dictionary 13. FIG. 8 shows the result: each sentence is divided into words, and each word is given a reading and a part of speech. For example, the word 「南アフリカ」 (South Africa) is given the reading 「ミナミアフリカ」 and the part of speech "proper noun".

【００５１】Step 103) Once words have been extracted from all the training sentences, the classification item extraction unit 3 extracts the classification items used to calculate classification probabilities, and the item-wise probability calculation unit 4 calculates a classification probability for each classification item, as described in the following steps. To classify documents, attention is paid not to individual words but to pairs of adjacent words and to runs of three adjacent words. FIG. 9 shows examples. The upper table of FIG. 9 lists some adjacent word pairs, formed by grouping the words of FIG. 7 by the specified count in order of appearance. The lower part of FIG. 9 likewise lists examples of three adjacent words. Using these items, the probability that a document d is classified into a classification c is obtained as follows.
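
The grouping of words "by the specified count in order of appearance" can be sketched as a sliding window over the word list of one sentence. This is a minimal illustration, not the patent's implementation: the function name `adjacent_items` and the tokenization of the FIG. 7 sentence are assumptions made for the example.

```python
from typing import List, Tuple


def adjacent_items(words: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n adjacent words, in order of appearance."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical tokenization of the start of the FIG. 7 sentence.
words = ["南アフリカ", "の", "ケープタウン", "発", "ロンドン", "行き"]

pairs = adjacent_items(words, 2)    # adjacent word pairs
triples = adjacent_items(words, 3)  # runs of three adjacent words
```

A sentence shorter than n words simply yields no items for that n.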

【００５２】Basically, the probability that a document d is classified into a classification c is calculated by focusing not on the individual words in the document but on three adjacent words. Hereafter, an arbitrary sequence of adjacent words t_{i-2}, t_{i-1}, t_i is written (t_i, t_{i-1}, t_{i-2}). In the example of FIG. 9, an adjacent word pair is (南アフリカ, の) and a sequence of three adjacent words is (南アフリカ, の, ケープタウン), that is, the words of "Cape Town in South Africa" in order of appearance. The classification probability P(c|d) is then expressed by the following equation.

【００５３】[0053]

【数１０】 [Equation 10]

【００５４】Assuming that, for each given t_i, the probability that document d is classified into classification c is independent, the above equation can be transformed as follows, in the same way as in method 3 described earlier.

【００５５】[0055]

【数１１】 [Equation 11]

【００５６】Several values in the above equation are explained here. P(T = (t_i, t_{i-1}, t_{i-2}) | c) is the probability (A) of classification into c when the three adjacent words t_i, t_{i-1}, t_{i-2} appear consecutively, and P(T = (t_i, t_{i-1}, t_{i-2}) | d) is the probability (B) that, given document d, the three adjacent words appear in it. For example, the document of FIG. 7 is classified into "crime" and "airplane", and it holds three-word sequences such as (南アフリカ, の, ケープタウン), shown in FIG. 9, as classification items. The classification item extraction unit 3 extracts the items with the same word sequence from all documents that share these classification destinations, the classification frequency calculation unit 41 of the item-wise probability calculation unit 4 calculates their frequency, and the document classification probability calculation unit 42 uses this frequency to calculate the conditional probabilities above. For example, if a document other than that of FIG. 7 has "crime" as one of its classification destinations and contains the same three adjacent words (南アフリカ, の, ケープタウン), the occurrences are treated as the same information and added to the frequency information from which the probability is obtained. P(c) is the probability (C) that a randomly selected document d is classified into c. P(T = t_i | t_{i-1}, t_{i-2}) is the probability (D) that t_i appears when the two adjacent words t_{i-1}, t_{i-2} have appeared. Values (A) to (C) can be calculated by checking how often the three adjacent words appear and into which classifications the documents containing them are classified. The last value (D) cannot be calculated reliably when the three adjacent words occur only a few times in the whole set of training documents, so the following device is used to increase the frequencies available for the probability calculation. The frequency with which a variable t appears in the documents is written f(t).

【００５７】
$$
\begin{aligned}
& P(T = t_i \mid t_{i-1}, t_{i-2}) && (18) \\
&= q_1\, f(T = t_i \mid t_{i-1}, t_{i-2}) + q_2\, f(T = t_i \mid t_{i-1}) + q_3\, f(T = t_i) && (19)
\end{aligned}
$$
In this expression, f(T = t_i | t_{i-1}) is the frequency with which t_i appears when the word t_{i-1} has appeared, f(T = t_i) is the frequency with which t_i appears, and q_1, q_2, q_3 are constants determined from the training data, each of which is either a positive number no greater than 1 or 0. To evaluate this expression, f(T = t_i | t_{i-1}, t_{i-2}), f(T = t_i | t_{i-1}), and f(T = t_i) are obtained by the following calculation. First, the classification frequency of each word is calculated. To do so, words with the same surface form are extracted from the words obtained from all the training sentences.
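
The weighted combination in expression (19) can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the helper names are hypothetical, the weights (0.6, 0.3, 0.1) are placeholder values (the text only says that q_1, q_2, q_3 are determined from training data), and each f term is read here as a relative frequency computed from n-gram counts, which the text does not spell out.

```python
from collections import Counter
from typing import Sequence, Tuple


def ngram_counts(words: Sequence[str], n: int) -> Counter:
    """Count every run of n adjacent words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def interpolated_prob(t2: str, t1: str, t: str,
                      uni: Counter, bi: Counter, tri: Counter,
                      q: Tuple[float, float, float] = (0.6, 0.3, 0.1)) -> float:
    """q1*f(t | t1, t2) + q2*f(t | t1) + q3*f(t), where t2, t1, t stand for
    t_{i-2}, t_{i-1}, t_i. Relative frequencies are an assumed reading of f."""
    q1, q2, q3 = q
    total = sum(uni.values())
    f_tri = tri[(t2, t1, t)] / bi[(t2, t1)] if bi[(t2, t1)] else 0.0
    f_bi = bi[(t1, t)] / uni[(t1,)] if uni[(t1,)] else 0.0
    f_uni = uni[(t,)] / total if total else 0.0
    return q1 * f_tri + q2 * f_bi + q3 * f_uni


corpus = ["a", "b", "c", "a", "b", "d"]  # toy word sequence
uni, bi, tri = (ngram_counts(corpus, n) for n in (1, 2, 3))

seen = interpolated_prob("a", "b", "c", uni, bi, tri)    # trigram occurs in corpus
unseen = interpolated_prob("c", "a", "d", uni, bi, tri)  # trigram never occurs
```

Because the lower-order terms still contribute, a three-word sequence that never occurs in the training data receives a small nonzero value instead of zero, which is the point of the device described above.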

【００５８】Step 104) For each extracted word, the classification item extraction unit 3 marks the classification destinations of the documents containing that word. The classification frequency calculation unit 41 of the item-wise probability calculation unit 4 treats occurrences with the same marked classification destination as one set and calculates its frequency. For a word t_1, this frequency f is written f(t_1). Next, the frequency of adjacent word pairs is calculated. As with single words, adjacent words are first extracted from all the training sentences, sets are formed by classification destination, and the frequency is calculated. For adjacent words t_1, t_2, this frequency is written f(t_1, t_2). The frequency of three adjacent words is obtained in the same way and written f(t_1, t_2, t_3). Value (D) can be calculated from these values. As an example, FIG. 10 shows frequency information collected for three adjacent words. FIG. 10 indicates that when, for example, the three words (南アフリカ, の, ケープタウン) appear consecutively, the documents containing them are classified into "crime", "airplane", "Africa", "travel", and so on. Frequency information is attached to each of these classification destinations.
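
The frequency collection of step 104 can be sketched as a table keyed by classification item, with one counter per classification destination, similar in spirit to FIG. 10. The function name, the data layout, and the toy training documents are assumptions. Counting every occurrence, rather than once per document, is also an assumption, since the text does not say which is intended.

```python
from collections import Counter, defaultdict


def collect_frequencies(training_docs, max_n=3):
    """training_docs: iterable of (sentences, labels) pairs, where sentences is
    a list of word lists and labels is the list of correct classifications.
    Returns a mapping: item (tuple of words) -> Counter of classifications."""
    freq = defaultdict(Counter)
    for sentences, labels in training_docs:
        for words in sentences:
            for n in range(1, max_n + 1):
                for i in range(len(words) - n + 1):
                    item = tuple(words[i:i + n])
                    for label in labels:  # mark every destination of the document
                        freq[item][label] += 1
    return freq


# Toy training data (hypothetical): two documents sharing one three-word item.
training_docs = [
    ([["南アフリカ", "の", "ケープタウン"]], ["crime", "airplane"]),
    ([["南アフリカ", "の", "ケープタウン", "へ"]], ["travel"]),
]
freq = collect_frequencies(training_docs)
shared = dict(freq[("南アフリカ", "の", "ケープタウン")])
```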

【００５９】Step 105) When frequency information has been obtained for all documents in this way, the process proceeds to step 106. If documents remain, it returns to step 101.
Step 106) The document classification probability calculation unit 42 of the item-wise probability calculation unit 4 calculates the training data: given a document d, which classifications the document falls under according to the classification items t_i, t_{i-1}, t_{i-2}, etc. in it, and how much probability each classification item carries for classifying the document.
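
One simple way to turn the collected frequencies into per-item probabilities is to normalize each item's counts over its classification destinations. This is only an assumed estimator for illustration. The patent's own formulas appear in equations that are not reproduced in this text, and the function name `item_class_probabilities` is hypothetical.

```python
from collections import Counter


def item_class_probabilities(freq):
    """For each item, divide its count per classification by the item's total,
    giving an estimate of how strongly the item points to each classification."""
    probs = {}
    for item, by_class in freq.items():
        total = sum(by_class.values())
        probs[item] = {c: n / total for c, n in by_class.items()}
    return probs


# Hypothetical frequency table for one three-word item.
freq = {("南アフリカ", "の", "ケープタウン"): Counter({"crime": 3, "airplane": 1})}
probs = item_class_probabilities(freq)
```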

【００６０】Step 107) The classification probability accumulation unit 5 stores these values in the probability storage device 14.
Using equation (19), equation (17) can finally be transformed into the following form.

【００６１】[0061]

【数１２】 (Equation 12)

【００６２】Next, the processing performed when classification probabilities are calculated is described with reference to FIG. 5.
Step 201) After the data for calculating classification probabilities has been accumulated by the classification probability accumulation processing of FIG. 4, a new document is input to the new document input unit 6.
Step 202) The morphological analysis unit 7 divides the input document into sentences and performs morphological analysis to extract words.

【００６３】Step 203) From the words thus obtained, the classification item extraction unit 8 extracts the classification items: single words, two adjacent words, and three adjacent words.
Step 204) The classification probability extraction unit 9 selects, from all the classifications, one classification group as a candidate.

【００６４】Step 205) For each extracted classification item, the classification probability calculation unit 10 retrieves from the probability storage device 14 the probability that the item belongs to the selected classification group. It then uses the classification probabilities of the items to calculate the probability that the document is classified into that classification. The detailed operation is described later with reference to FIG. 6.

【００６５】Step 206) If all classification groups have been examined, the process proceeds to step 207. If any classification group has not yet been examined, it returns to step 204.
Step 207) The classification probability calculation unit 10 sorts the classification groups in descending order of calculated probability.
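
The loop over candidate classification groups (steps 204 to 206) followed by the sort of step 207 can be sketched as below. The scoring function is passed in as a parameter because the patent's probability formula, equation (21), is not reproduced in this text. The `toy_score` used here is a hypothetical stand-in and not that formula, and all names and values are assumptions.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Item = Tuple[str, ...]


def rank_classifications(items: Sequence[Item],
                         classifications: Iterable[str],
                         score: Callable[[Sequence[Item], str], float]
                         ) -> List[Tuple[str, float]]:
    """Score the document's items against every classification, then sort
    the classifications by score, highest first."""
    scored = [(c, score(items, c)) for c in classifications]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stored per-item probabilities and a stand-in scoring rule.
store = {(("a",), "x"): 0.2, (("a",), "y"): 0.5, (("b",), "x"): 0.1}


def toy_score(items: Sequence[Item], c: str) -> float:
    return sum(store.get((item, c), 0.0) for item in items)


ranked = rank_classifications([("a",), ("b",)], ["x", "y"], toy_score)
order = [c for c, _ in ranked]
```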

【００６６】Step 208) The classification probability calculation unit 10 transfers the classification groups in memory to the classification determination unit 11, starting with the one with the highest classification probability.
Step 209) The classification determination unit 11 checks whether each probability exceeds the threshold for extraction. If it does, the process proceeds to step 210. If not, the processing ends.

【００６７】Step 210) Classification groups that exceed the threshold in the classification determination unit 11, and thus pass the check, are handed to the classification result output unit 12. The classification result output unit 12 fixes only those that passed the check as the classification destinations of the document and outputs them in order.
Step 211) When all classification groups have been checked in this way, all processing ends.

【００６８】FIG. 6 is a flowchart explaining the detailed operation of the classification probability calculation unit in an embodiment of the present invention.
Step 301) The counter i is initialized to 1.
Step 302) First, the i adjacent words, beginning with a single word (i = 1), are extracted.

【００６９】Step 303) The classification probability for the i adjacent words is retrieved.
Step 304) If i is 4 or less, the process proceeds to step 305. If i is greater than 4, it proceeds to step 306.
Step 305) Next, the word is replaced with two adjacent words, and the classification probability is retrieved in the same way.

【００７０】Step 306) In this way, the probability values for three adjacent words and other required values such as P(c) are retrieved from the probability storage device 14.
Step 307) The classification probability calculation unit 10 substitutes these values into equation (21), the probability calculation formula.
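
The retrieval part of FIG. 6, which looks up stored probabilities first for single words, then for two adjacent words, then for three, can be sketched as follows. The layout of the store (item mapped to per-classification probabilities), the function name, and the sample values are assumptions. The final substitution into equation (21) is left out because that equation is not reproduced in this text.

```python
def lookup_probabilities(words, classification, store, max_n=3):
    """Collect the stored probability of every 1-, 2-, and 3-word item of the
    sentence for one candidate classification. Items absent from the store
    are skipped."""
    found = {}
    for n in range(1, max_n + 1):  # single word, then 2, then 3 adjacent words
        for i in range(len(words) - n + 1):
            item = tuple(words[i:i + n])
            p = store.get(item, {}).get(classification)
            if p is not None:
                found[item] = p
    return found


# Hypothetical contents of the probability store.
store = {
    ("南アフリカ",): {"crime": 0.2},
    ("南アフリカ", "の", "ケープタウン"): {"crime": 0.6, "travel": 0.4},
}
words = ["南アフリカ", "の", "ケープタウン"]
found = lookup_probabilities(words, "crime", store)
```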

【００７１】Step 308) The classification probability calculation unit 10 thereby calculates the classification probability of the document for the selected classification group.
For example, suppose that the probabilities of the input document being classified into each classification are assigned as shown in FIG. 11. With a threshold of 5.0 × 10^-8, the classifications in FIG. 11 that exceed the threshold are, in descending order of probability, classification numbers 1, 8, 3, and 2. These are determined to be the classification destinations of the input document.
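
The threshold check can be illustrated with a short sketch. The probability values below are hypothetical, because the actual values of FIG. 11 are not reproduced in this text. They are chosen only so that the outcome matches the order 1, 8, 3, 2 stated above.

```python
THRESHOLD = 5.0e-8

# Hypothetical classification number -> probability (not the FIG. 11 values).
probabilities = {1: 9.1e-7, 2: 6.0e-8, 3: 1.2e-7, 4: 3.0e-8, 5: 1.0e-9, 8: 4.4e-7}

# Keep the classifications above the threshold, highest probability first.
selected = sorted(
    (c for c, p in probabilities.items() if p > THRESHOLD),
    key=lambda c: probabilities[c],
    reverse=True,
)
```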

【００７２】As described above, the present invention resolves the problems of conventional probability-based document classification methods and classifies documents more accurately than those methods. In the above embodiment, the document input units 1 and 6, the morphological analysis units 2 and 7, and the classification item extraction units 3 and 8 are drawn separately in FIG. 3 and given separate reference numerals, but each pair has the same configuration. They are separated only for convenience in explaining the accumulation and the calculation of classification probabilities.

【００７３】The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.

【００７４】[0074]

【発明の効果】[Effects of the Invention] As described above, according to the present invention, the problems of conventional probability-based document classification methods, namely that word frequency is not used (problems 1 and 2) and that a shortage of training data cannot be handled (problem 3), are resolved by referring to all the words contained in the document.

【００７５】Furthermore, relationships between words, which conventional probability-based document classification methods did not use, are incorporated into the classification probability by taking into account that three adjacent words appear together. Because the classification probability of a document is calculated with these relationships included, documents can be classified more accurately.

【００７６】Thus, when the probability of document classification is obtained, the present invention takes into account not only the probabilities of words but also the relationships between adjacent words, and obtains the probability that these classify the document into a given document group. This enables document classification that reflects the content of the sentences.

[Brief description of drawings]

【図１】FIG. 1 is a diagram for explaining the principle of the present invention.

【図２】FIG. 2 is a diagram of the principle configuration of the present invention.

【図３】FIG. 3 is a configuration diagram of a document classification device according to an embodiment of the present invention.

【図４】FIG. 4 is a flowchart showing the processing performed when classification probabilities are accumulated in an embodiment of the present invention.

【図５】FIG. 5 is a flowchart showing the processing performed when classification probabilities are calculated in an embodiment of the present invention.

【図６】FIG. 6 is a flowchart explaining the detailed operation of the classification probability calculation unit in an embodiment of the present invention.

【図７】FIG. 7 is a diagram showing an example of a document input to the document input unit in an embodiment of the present invention.

【図８】FIG. 8 is a diagram showing a morphological analysis result in an embodiment of the present invention.

【図９】FIG. 9 is a diagram showing examples of adjacent words and three adjacent words in an embodiment of the present invention.

【図１０】FIG. 10 is a diagram showing an example of frequency information collected for three adjacent words in an embodiment of the present invention.

【図１１】FIG. 11 is a diagram showing an example of probabilities assigned to each document in an embodiment of the present invention.

【符号の説明】[Explanation of symbols]
1 document input unit
2 morphological analysis unit
3 classification item extraction unit
4 item-wise probability calculation unit
5 classification probability accumulation unit
6 document input unit
7 morphological analysis unit
8 classification item extraction unit
9 classification probability extraction unit
10 classification probability calculation unit
11 classification determination unit
12 classification result output unit
13 Japanese dictionary
14 probability storage device
41 classification frequency calculation unit
42 document classification probability calculation unit
100 classification probability calculation/accumulation means
101 first document input means
102 first morphological analysis means
103 first classification item extraction means
104 classification frequency calculation means
105 document classification probability calculation means
106 classification probability accumulation means
200 classification probability reference/calculation means
201 second document input means
202 second morphological analysis means
203 second classification item extraction means
204 classification probability extraction means
205 classification probability calculation means
206 classification determination means
207 classification result output means

Claims

[Claims]

1. A document classification method for selecting, for a document input by a user, an appropriate one from among document classification candidates stored in advance, wherein the input document is classified into pre-classified document groups, and, when the document classification probability of a word included in the document is obtained at the time of classification, not only the probability of the word but also the relationship between the word and its adjacent words is obtained, and the document classification probability is calculated.

2. The document classification method according to claim 1, wherein, to calculate the document classification probability used when classifying documents: documents classified in advance are read; the sentences in the read documents are morphologically analyzed; sequences of n consecutive words (n is a natural number) to be used in calculating classification probabilities are extracted, as classification items of the documents, from the words divided by the morphological analysis; for each extracted classification item, the frequency with which documents are classified into each classification among all classifications is calculated; from the calculated frequencies, the probability with which each classification item classifies a document is calculated; and the calculated probabilities are accumulated for document classification; and wherein, when a document to be classified is newly input: the sentences of that document are morphologically analyzed; sequences of n consecutive words serving as classification items are extracted from the morphologically analyzed words; for each extracted word sequence, the classification probability that a document matching on the n consecutive words belongs to a given classification is retrieved from the probabilities accumulated in advance; the probability that the document is classified into each classification is calculated individually using the classification probabilities of the extracted classification items; the classification result of the document is determined in descending order of the calculated classification probabilities; and the determined result is displayed as the classification result.

3. The document classification method according to claim 2, wherein, when the classification result of the document is determined, those of the calculated classification probabilities that exceed a preset threshold are determined as the classification result.

4. A document classification device for selecting, for a document input by a user, an appropriate one from among document classification candidates stored in advance, comprising means which, when classifying the input document into pre-classified document groups and obtaining the document classification probability of a word included in the document, obtains not only the probability of the word but also the relationship between the word and its adjacent words, and calculates the document classification probability.

5. The document classification device according to claim 4, comprising: first document input means for reading documents into memory in order to calculate the document classification probability used when classifying an input document; first morphological analysis means for dividing the sentences in the documents read by the first document input means into words; first classification item extraction means for extracting, from the words divided by the first morphological analysis means, sequences of n consecutive words (n is a natural number) to be used in calculating classification probabilities, as classification items of the documents; classification frequency calculation means for calculating, for each classification item extracted by the first classification item extraction means, the frequency with which documents are classified into each classification among all classifications; document classification probability calculation means for calculating, from the frequencies calculated by the classification frequency calculation means, the probability with which each classification item classifies a document; classification probability accumulation means for accumulating, for document classification, the probabilities calculated by the document classification probability calculation means; and classification probability reference/calculation means having: second document input means for reading a document to be newly classified when it is input; second morphological analysis means for dividing the sentences in the document input from the second document input means into words; second classification item extraction means for extracting sequences of n consecutive words using the words divided by the second morphological analysis means; classification probability extraction means for retrieving, with reference to the classification probability accumulation means, the classification probability that a document matching on the n consecutive words belongs to a given classification; classification probability calculation means for individually calculating the probability that the document is classified into each classification, using the classification probability of each classification item retrieved by the classification probability extraction means; classification determination means for determining the classification result of the document in descending order of the classification probabilities calculated by the classification probability calculation means; and classification result output means for displaying the result determined by the classification determination means as the classification result.

6. The document classification apparatus according to claim 5, wherein the classification determination unit determines, as a classification result, a document that exceeds a preset threshold value among the calculated classification probabilities when determining the classification result of the document. .