JP3471253B2

JP3471253B2 - Document classification method, document classification device, and recording medium recording document classification program

Info

Publication number: JP3471253B2
Application number: JP14511599A
Authority: JP
Inventors: 隆明長谷川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-05-25
Filing date: 1999-05-25
Publication date: 2003-12-02
Anticipated expiration: 2019-05-25
Also published as: JP2000339310A

Description

[Detailed Description of the Invention]

[0001] [Technical Field of the Invention] The present invention relates to a device that acquires, from an electronic document, a document structure accompanied by semantic attributes and classifies the electronic document.

[0002] [Prior Art] In recent years, computer systems that apply natural language processing technology, such as Japanese word processors and machine translation systems, have advanced, and the demand for better natural language processing technology is growing.

[0003] In natural language processing, the conventional way to obtain sentence structure is to perform syntactic analysis and semantic analysis, as described, for example, on pages 140 to 229 of Natural Language Processing (Iwanami Koza Software Science 15, edited by Makoto Nagao et al., published by Iwanami Shoten). Conventional classification methods, meanwhile, use the frequency of occurrence of nouns.

[0004] [Problems to Be Solved by the Invention] The conventional approach of syntactic and semantic analysis described above is often slow, or produces incorrect analyses, for long sentences and sentences with complex structure. In particular, analysis accuracy is low for documents that have not been proofread, such as e-mail and web pages, which have been increasing in recent years. In addition, classification methods based on noun frequency cannot take the document author's finer intentions into account.

[0005] In view of the above, an object of the present invention is to provide a document classification device that can correctly assign a sentence structure with semantic attributes even to unproofread or low-quality text, and can perform classification that reflects the document author's intention.

[0010] [Means for Solving the Problems] The document classification device of the present invention comprises: document input means for inputting a document; morphological analysis means for analyzing the input document and dividing it into morphemes, each having morpheme information that includes a base form, a part of speech, and an inflected form; meaning assignment means that, for each morpheme constituting the input document, generates a feature vector from the frequencies of the morpheme information of the morphemes appearing before and after that morpheme, calculates the inner product of the generated feature vector and each feature vector characterizing the occurrence of a semantic attribute, the latter having been obtained from the morpheme information of the morphemes appearing before and after morphemes tagged with that semantic attribute in a corpus that stores in advance a plurality of documents composed of morphemes to which morpheme information and semantic attributes have been assigned, and assigns to the morpheme the semantic attribute of the corpus-derived feature vector at the smallest distance; similarity calculation means that, for the input document, creates a morpheme sequence using the semantic attribute for each morpheme to which one has been assigned and, in place of a semantic attribute, the base form obtained by analyzing the input document for each morpheme to which none has been assigned, and creates a feature vector whose components are the frequencies of the resulting morpheme sequences, that, for each document stored in the corpus, creates a morpheme sequence using the semantic attribute for each morpheme to which one has been assigned and the base form stored in the corpus for each morpheme to which none has been assigned, and creates a feature vector whose components are the frequencies of the resulting morpheme sequences, and that calculates a distance using the inner product of the feature vector of the input document and the feature vector of each document stored in the corpus; classification means for classifying the input document into the category of the document stored in the corpus for which the distance is smallest; and output means for outputting the classification result to the outside.

[0013] The present invention is described taking the classification of received e-mail as an example. When a newly received e-mail is input through the input means, the morphological analysis means divides it into morphemes. Next, the meaning assignment means generates, for each morpheme constituting the input document, a feature vector from the frequencies of the morpheme information of the morphemes appearing before and after it. It then calculates the inner product of this vector and the feature vector of each semantic attribute, which has been obtained from the morpheme information of the morphemes appearing before and after morphemes tagged with that attribute in a corpus storing in advance a plurality of documents composed of morphemes with morpheme information and semantic attributes, and assigns to the morpheme the semantic attribute of the corpus-derived feature vector at the smallest distance. Next, the similarity calculation means creates a morpheme sequence for the input document, using the semantic attribute for each morpheme that has one and, in its place, the base form obtained by analyzing the input document for each morpheme that has none, and creates a feature vector whose components are the frequencies of the resulting morpheme sequences. For each document stored in the corpus it likewise creates a morpheme sequence, using the semantic attribute where one has been assigned and the base form stored in the corpus otherwise, and creates a feature vector whose components are the frequencies of the resulting morpheme sequences. It then calculates a distance using the inner product of the feature vector of the input document and the feature vector of each stored document. Finally, the classification means classifies the input document into the category of the stored document in the corpus for which the distance is smallest.

[0014] [Embodiments of the Invention] Next, embodiments of the present invention are described with reference to the drawings. FIG. 1 is a schematic block diagram of the document classification device according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the document classification method according to the first embodiment of the present invention. The document classification device 10 comprises input means 11 for inputting a document, morphological analysis means 12, corpus learning means 13, meaning assignment means 14, similarity calculation means 15, classification means 16, output means 17 for outputting the document classification result, a corpus 18, and a control unit 19 that controls the processing of each means.

[0015] The corpus 18 is a large collection of language data (see Natural Language Processing, Iwanami Koza Software Science 15, edited by Makoto Nagao et al., published by Iwanami Shoten, page 253). Here it is assumed to store in advance, in a predetermined format, a large number of documents composed of morphemes tagged with semantic attributes.

[0016] FIG. 3 shows example documents in the corpus 18. FIG. 3 is an example of tagged stored documents in the corpus used by the document classification method of the first embodiment, and lists two documents as examples: 「私の携帯電話で通信できません」 ("I cannot communicate with my mobile phone") and 「有効な解決策は書かれてありません」 ("No effective solution is written").

[0017] The morphological analysis means 12 divides the sentences of the input document entered through the input means 11 into morphemes having morpheme information that includes base form, part of speech, and inflected form. The corpus learning means 13 retrieves from the corpus 18, in a predetermined form, the stored documents, which are composed of morphemes and tagged with semantic attributes; obtains the frequencies of the morpheme information of the morphemes appearing near each morpheme tagged with a semantic attribute; and learns from those frequencies the features with which that semantic attribute appears. The meaning assignment means 14 compares the learned features of each semantic attribute with a feature formed from the frequencies of the morpheme information of the morphemes located near a morpheme of the input document, and assigns the semantic attribute with the most similar feature as the semantic attribute of that morpheme.

[0018] The similarity calculation means 15 compares the input document, whose morphemes have been assigned semantic attributes, with each stored document in the corpus in terms of the frequencies of morpheme sequences in which semantic attributes are added to the morpheme information, and calculates a document similarity based on those frequencies. When the similarity between the input document and the stored document of the corpus 18 under comparison exceeds a threshold, the classification means 16 classifies the input document into the category of that stored document.

[0019] The operation of the document classification method of the first embodiment is described with reference to FIG. 2. When document classification processing starts (S11), a document is input through the input means 11 (S12), and the morphological analysis means 12 divides the input document into morphemes having morpheme information that includes base form, part of speech, and inflected form (S13).
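
The corpus learning and meaning assignment described in paragraph [0017] can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the tuple layout of a morpheme, the function names, the hand-made analyses in `CORPUS` and `QUERY`, and the use of raw counts as corpus-side weights (the patent only says the weighting follows "a predetermined method"). The sketch reads "smallest distance (vector angle)" as largest cosine, which is one plausible interpretation rather than a formula stated in the text.

```python
from collections import Counter
from math import sqrt

# Hypothetical morpheme record: (base_form, part_of_speech, inflected_form, attribute).
# attribute is None when the morpheme carries no semantic-attribute tag.

def context_features(morphemes, i, n):
    """Count the morpheme information of up to n morphemes on each side of position i."""
    feats = Counter()
    for j in range(max(0, i - n), min(len(morphemes), i + n + 1)):
        if j == i:
            continue
        base, pos, form, _ = morphemes[j]
        feats.update([("base", base), ("pos", pos), ("form", form)])
    return feats

def learn_profiles(corpus, n):
    """Sum the context counts of every tagged morpheme, one profile per attribute."""
    profiles = {}
    for doc in corpus:
        for i, (_, _, _, attr) in enumerate(doc):
            if attr is not None:
                profiles.setdefault(attr, Counter()).update(context_features(doc, i, n))
    return profiles

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (larger means a smaller angle)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def assign_attribute(morphemes, i, profiles, candidates, n):
    """Pick the candidate attribute whose learned profile is closest in angle."""
    # Input-side weights are binary: 1 for each co-occurring item, 0 otherwise.
    feats = Counter(set(context_features(morphemes, i, n)))
    return max(candidates, key=lambda a: cosine(feats, profiles[a]))

# Hand-made toy analyses (assumed, not taken from Tables 1 and 2 of the patent).
CORPUS = [
    [("先生", "noun", "-", None), ("が", "particle", "-", None),
     ("確認", "noun", "-", None), ("する", "verb", "mizen", None),
     ("れる", "aux", "renyo", "honorific"), ("た", "aux", "shushi", None)],
    [("財布", "noun", "-", None), ("を", "particle", "-", None),
     ("盗む", "verb", "mizen", None),
     ("れる", "aux", "renyo", "passive"), ("た", "aux", "shushi", None)],
]
QUERY = [("動作", "noun", "-", None), ("確認", "noun", "-", None),
         ("する", "verb", "mizen", None), ("れる", "aux", "renyo", None),
         ("た", "aux", "shushi", None)]

PROFILES = learn_profiles(CORPUS, n=2)
print(assign_attribute(QUERY, 3, PROFILES, ["honorific", "passive"], n=2))  # honorific
```

In this toy data the context of れる in the query matches the "honorific" context exactly and the "passive" context only partly, so "honorific" is selected. A real system would restrict `candidates` to the attributes a given base form can take and would learn from a far larger corpus.
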
The meaning assignment means 14 then selects one morpheme to which a semantic attribute is to be assigned (S14), calculates a feature from the frequencies of the morpheme information of the n morphemes before and after that morpheme (S15), and compares it with the feature of a semantic attribute learned by the corpus learning means 13 (S16). If another candidate semantic attribute exists (S17N), the next candidate semantic attribute is selected (S18) and processing returns to step S16. If no other candidate semantic attribute exists (S17N), the semantic attribute with the most similar feature is assigned (S19).

[0020] If semantic attributes have not yet been assigned to all morphemes of the input document (S20N), the next morpheme of the input document to receive a semantic attribute is designated (S21) and processing returns to step S14. Once semantic attributes have been assigned to all morphemes of the input document (S20Y), the similarity calculation means 15 selects one of the stored documents (S22) and compares the feature formed from the frequencies of the morpheme sequences, including semantic attributes, of the input document with the corresponding feature of the selected stored document (S23). If the similarity is at or below a predetermined threshold (S24N), the next stored document is designated (S25) and processing returns to step S22. If the similarity is at or above the predetermined threshold (S24Y), the classification means 16 classifies the input document into the category of the stored document whose similarity is at or above the threshold (S26), and document classification processing ends (S27).

Next, the procedure for learning semantic attributes described in steps S14 to S21 is explained in detail with an example. First, the morpheme information of the sentences of the stored documents in the corpus 18 is assumed to consist of the part of speech, base form, and inflected form of each word, obtained by morphological analysis, as shown in FIG. 3. Feature vectors are constructed using this information as cues. For the documents stored in the corpus 18, so-called n-grams (see Natural Language Processing, Iwanami Koza Software Science 15, edited by Makoto Nagao et al., published by Iwanami Shoten, page 15) are obtained from the morpheme information (base form, part of speech, inflected form) of the n morphemes before and the n morphemes after each semantic attribute; the morpheme information of each co-occurring word is weighted over the whole corpus by a predetermined method; and the feature vector of each semantic attribute is constructed from the weights of all the morpheme information.

[0021] For example, for a sentence of a stored document in the corpus 18, 「講演会で話された内容についての質問」 ("a question about what was said at the lecture"), morpheme information such as that shown in Table 1 has been assigned.

[0022] [Table 1]
Here, when the document 「そちらは動作確認されたんですよね」 ("You did check that it works, didn't you?") is input and morphologically analyzed, morpheme information such as that shown in Table 2 is obtained.

[0023] [Table 2]
In Table 1, the base form 「れる」 (reru) of 「れた」 (reta) can take the semantic attributes "honorific", "passive", "potential", and "spontaneous"; here it is tagged "honorific". The purpose of this step is to learn the semantic attribute of the base form 「れる」 of 「れた」 in Table 2.

[0025] For the input document, a feature vector is created from the morpheme information of the n morphemes before and the n morphemes after 「れる」. In this case the weight of every co-occurring item of morpheme information is set to 1, and all other weights are set to 0.

[0026] The inner product is calculated between the feature vector of 「れる」 in the input document and the feature vector of each semantic attribute that 「れる」 can take, as constructed from the corpus 18. The semantic attribute with the smallest distance (vector angle) is obtained and adopted as the semantic attribute of that morpheme in the sentence of the input document. As a result, the semantic attribute of a morpheme that was used in a similar context is selected as the semantic attribute of the morpheme in the input document.

[0027] Next, the method of calculating the similarity between the documents stored in the corpus 18 and the input document, described in steps S22 to S26, is explained in detail through an example. Here, the morpheme sequence of a document is built using the semantic attribute for each morpheme that has one and the base form for each morpheme that has none, and a feature vector is created from the frequencies of these morpheme sequences.

[0028] A document composed of morphemes having the morpheme information and semantic attributes shown in Table 3 is used as an example.
[0029] [Table 3]
Creating n-grams from this document with n = 3 gives (そちら, "limitation", 動作), ("limitation", 動作, 確認), (動作, 確認, する), (確認, する, "honorific"), (する, "honorific", "assertion"), ("honorific", "assertion", "address"), and ("assertion", "address", "confirmation"), where そちら ("you"), 動作 ("operation"), 確認 ("check"), and する ("do") are base forms and the quoted items are semantic attributes. For each stored document in the corpus, a document feature vector using the frequencies of such n-grams is created in advance.

[0030] For the input document that has acquired semantic attributes, a document feature vector using n-gram frequencies is created in the same way, a distance using the inner product with the feature vector of each stored document in the corpus is calculated, and the input document is classified into the category of the stored document, or group of stored documents, with the smallest distance (vector angle).

[0031] FIG. 4 shows, for the morpheme sequences handled by the similarity calculation means 15, examples of the morpheme sequences in the corpus documents of FIG. 3 and of the morpheme sequence of a sentence in a newly input document. In a morpheme sequence, the semantic attribute is used where one exists, and the base form of the morpheme is used where none exists.

[0032] Next, the document classification method and document classification device of a second embodiment of the present invention are described with reference to the drawings. FIG. 5 is a schematic block diagram of the document classification device of the second embodiment.

[0033] FIG. 5 shows the document classification device 100 of the present invention as the computer that constitutes the device. The computer comprises an input unit 110 such as a modem, keyboard, or pointing device; an output unit 120 such as a modem, printer, or display; a data processing device 130; a storage unit 140; and a recording medium 150. The recording medium 150 stores the document classification system control program of the present invention, which can control the operation of each unit; an FD, a CD-ROM, a semiconductor memory, or the like is used.

[0034] The configuration of the document classification device and the document classification method are the same as in the first embodiment, so their description is omitted.

[0035] The control program learns the features with which semantic attributes appear from the frequencies of the morpheme information of the morphemes appearing near morphemes tagged with semantic attributes in the stored documents of the corpus, acquires semantic attributes for the morphemes of a newly input document, and classifies the input document into the group of similar stored documents in the corpus. This program is read from the recording medium 150 into the data processing device 130 and controls the operation of the data processing device 130. Under the control of the program, the data processing device 130 executes the following processing.

[0036] That is, it executes: processing to input a document; processing to analyze the input document and divide it into morphemes having morpheme information that includes base form, part of speech, and inflected form; processing that, for each morpheme constituting the input document, generates a feature vector from the frequencies of the morpheme information of the morphemes appearing before and after that morpheme, calculates the inner product of the generated feature vector and each feature vector characterizing the occurrence of a semantic attribute, the latter having been obtained from the morpheme information of the morphemes appearing before and after morphemes tagged with that semantic attribute in a corpus that stores in advance a plurality of documents composed of morphemes to which morpheme information and semantic attributes have been assigned, and assigns to the morpheme the semantic attribute of the corpus-derived feature vector at the smallest distance; processing that, for the input document, creates a morpheme sequence using the semantic attribute for each morpheme to which one has been assigned and, in its place, the base form obtained by analyzing the input document for each morpheme to which none has been assigned, and creates a feature vector whose components are the frequencies of the resulting morpheme sequences, that, for each document stored in the corpus, creates a morpheme sequence using the semantic attribute where one has been assigned and the base form stored in the corpus otherwise, and creates a feature vector whose components are the frequencies of the resulting morpheme sequences, and that calculates a distance using the inner product of the feature vector of the input document and the feature vector of each document stored in the corpus; and processing to classify the input document into the category of the document stored in the corpus for which the distance is smallest.

[0037] [Effects of the Invention] As described above, the present invention supplements the semantic attributes of morphemes from a corpus and calculates the similarity to the stored documents in the corpus from morpheme sequences that include semantic attributes. This has the effect of enabling finer-grained document classification that takes the author's intention into account.
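
The similarity calculation and classification of paragraphs [0027] to [0030] can be sketched in the same hedged way. The function names, the `(base_form, attribute)` pair layout, the angle-bracket marking of attributes, the category labels, and both stored documents are assumptions made for illustration. The base forms given for attribute-bearing morphemes in `QUERY` are also guesses, since Table 3 is not reproduced in the text, although the resulting token sequence matches the seven trigrams listed in [0029]. The sketch picks the stored document with the largest cosine, following the smallest-angle rule of [0030], rather than the threshold test of steps S24 to S26.

```python
from collections import Counter
from math import sqrt

def token_sequence(morphemes):
    """Use the semantic attribute where one exists, otherwise the base form.

    Attributes are wrapped in angle brackets so that the attribute <confirmation>
    cannot collide with a base form such as 確認.
    """
    return [f"<{attr}>" if attr else base for base, attr in morphemes]

def ngram_vector(tokens, n=3):
    """Frequency vector over the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (larger means a smaller angle)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def classify(query, stored_docs, n=3):
    """Return the category of the stored document closest in angle to the query."""
    q = ngram_vector(token_sequence(query), n)
    category, _ = max(
        stored_docs,
        key=lambda d: cosine(q, ngram_vector(token_sequence(d[1]), n)),
    )
    return category

# (base_form, attribute) pairs; attribute names follow the example in [0029].
QUERY = [("そちら", None), ("は", "limitation"), ("動作", None), ("確認", None),
         ("する", None), ("れる", "honorific"), ("んです", "assertion"),
         ("よ", "address"), ("ね", "confirmation")]

# Hypothetical stored documents with made-up categories and attributes.
STORED = [
    ("operation-check",
     [("こちら", None), ("は", "limitation"), ("動作", None), ("確認", None),
      ("する", None), ("れる", "honorific"), ("んです", "assertion"),
      ("よ", "address"), ("ね", "confirmation")]),
    ("connection-trouble",
     [("私", None), ("の", "possession"), ("携帯電話", None), ("で", "means"),
      ("通信", None), ("できる", None), ("ません", "negation")]),
]

TRIGRAMS = ngram_vector(token_sequence(QUERY))
print(len(TRIGRAMS))            # 7
print(classify(QUERY, STORED))  # operation-check
```

Because semantic attributes replace the surface forms of function words, two sentences that differ in wording but share the same attribute pattern produce overlapping trigrams, which is how this representation can reflect the author's intention rather than noun frequency alone.
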

[Brief Description of the Drawings]
FIG. 1 is a schematic block diagram of the document classification device according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the document classification method according to the first embodiment of the present invention.
FIG. 3 is an example of tagged stored documents in the corpus used by the document classification method of the first embodiment.
FIG. 4 is an example of the morpheme sequences handled by the similarity calculation means.
FIG. 5 is a schematic block diagram of the document classification device according to the second embodiment of the present invention.

[Description of Reference Signs]
10, 100: document classification device
11, 111: input means
12, 112: morphological analysis means
13, 113: corpus learning means
14, 114: meaning assignment means
15, 115: similarity calculation means
16, 116: classification means
17, 117: output means
18, 118: corpus
19, 119: control unit
110: input unit
120: output unit
130: data processing device
140: storage unit
150: recording medium
S11 to S17: steps

Claims

(57) [Claim 1] A document classification device that acquires, from an electronic document, a sentence structure accompanied by semantic attributes and classifies the electronic document, the device comprising: document input means for inputting a document; morphological analysis means for analyzing the input document and dividing it into morphemes having morpheme information that includes a base form, a part of speech, and an inflected form; meaning assignment means that, for each morpheme constituting the input document, generates a feature vector from the frequencies of the morpheme information of the morphemes appearing before and after that morpheme, calculates the inner product of the generated feature vector and each feature vector characterizing the occurrence of a semantic attribute, the latter having been obtained from the morpheme information of the morphemes appearing before and after morphemes tagged with that semantic attribute in a corpus that stores in advance a plurality of documents composed of morphemes to which morpheme information and semantic attributes have been assigned, and assigns to the morpheme the semantic attribute of the corpus-derived feature vector at the smallest distance; similarity calculation means that, for the input document, creates a morpheme sequence using the semantic attribute for each morpheme to which one has been assigned and, in its place, the base form obtained by analyzing the input document for each morpheme to which none has been assigned, and creates a feature vector whose components are the frequencies of the resulting morpheme sequences, that, for each document stored in the corpus, creates a morpheme sequence using the semantic attribute where one has been assigned and the base form stored in the corpus otherwise, and creates a feature vector whose components are the frequencies of the resulting morpheme sequences, and that calculates a distance using the inner product of the feature vector of the input document and the feature vector of each document stored in the corpus; classification means for classifying the input document into the category of the document stored in the corpus for which the distance is smallest; and output means for outputting the classification result to the outside.