JP2000339310A

JP2000339310A - Method and device for classifying document and recording medium with program recorded thereon

Info

Publication number: JP2000339310A
Application number: JP11145115A
Authority: JP
Inventors: Takaaki Hasegawa; 隆明長谷川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-05-25
Filing date: 1999-05-25
Publication date: 2000-12-08
Anticipated expiration: 2019-05-25
Also published as: JP3471253B2

Abstract

PROBLEM TO BE SOLVED: To provide a document classification method and a document classification device that assign a sentence structure with correct semantic attributes even to text that has not been proofread or is of low quality, and that can classify documents in a way that reflects the intention of the document creator. SOLUTION: When an input means 11 inputs a document, a morphological analysis means 12 divides it into morphemes. A corpus learning means 13 learns the characteristic context of each semantic attribute from stored documents held in a corpus 18, to which morphological information and semantic attributes have been attached in advance, on the basis of the frequency of the morphological information in the contexts where the semantic attribute appears. A meaning assigning means 14 assigns to each morpheme the semantic attribute whose learned characteristic is most similar to the frequency of the morphological information around that morpheme. A similarity calculating means 15 obtains a similarity between the input document and each stored document in the corpus 18, taking the semantic attributes into account. A classifying means 16 classifies the input document into the category of a stored document in the corpus 18 with which it has a high similarity.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a document classification method and a document classification device, and more particularly to software for classifying and organizing electronic documents.

[0002]

[Prior art] In recent years, computer systems that apply natural language processing technology, such as Japanese word processors and machine translation systems, have advanced, and demand for improved natural language processing technology has been increasing.

[0003] In natural language processing, the conventional way of obtaining sentence structure has been to perform syntactic analysis and semantic analysis, as described, for example, on pages 140 to 229 of Natural Language Processing (Iwanami Koza Software Science 15, edited by Makoto Nagao et al., published by Iwanami Shoten). Conventional classification methods, meanwhile, have used the frequency of occurrence of nouns.

[0004]

[Problems to be solved by the invention] Conventional syntactic and semantic analysis often takes a long time or produces incorrect analyses for long sentences and for sentences with complicated structure. In particular, analysis accuracy is low for documents that have not been proofread, such as e-mails and web page documents, which have been increasing in recent years. In addition, classification methods based on the frequency of occurrence of nouns cannot take the detailed intention of the document creator into account.

[0005] The present invention has been made in view of the above points. Its object is to provide a document classification method and a document classification device that can assign a sentence structure with correct semantic attributes even to text that has not been proofread or is of low quality, and that can perform classification reflecting the intention of the document creator.

[0006]

[Means for solving the problems] The document classification method of the present invention is a method of acquiring a sentence structure with semantic attributes from an electronic document and classifying the document, comprising: a step of inputting a new document; a step of analyzing the input document and dividing it into morphemes provided with morpheme information including basic form, part of speech, and inflected form; a step of acquiring, from a corpus (a large collected body of language data) in which a plurality of documents composed of morphemes and tagged with semantic attributes are stored in advance, the frequency of the morpheme information of morphemes appearing in the vicinity of each morpheme tagged with a semantic attribute, and learning from the acquired frequency the feature in which that semantic attribute appears; a step of comparing the learned features of the semantic attributes with a feature consisting of the frequency of the morpheme information of the morphemes located in the vicinity of a morpheme of the input document, and assigning the semantic attribute having the most similar feature as the semantic attribute of that morpheme; a step of comparing the input document, whose morphemes have been assigned semantic attributes, with a stored document in the corpus with respect to the frequency of morpheme sequences in which the semantic attributes are added to the morpheme information, and calculating a document similarity based on the frequency of the morpheme sequences; and a step of classifying the input document into the category of the compared stored document when the similarity between the input document and that stored document exceeds a threshold.

[0007] The step of acquiring, from the corpus, the frequency of the morpheme information of morphemes appearing in the vicinity of a tagged morpheme and learning the feature in which the semantic attribute appears may be a step of calculating, for every item of morpheme information present in the corpus, a weight with respect to each semantic attribute from the frequency of the morpheme information of the n morphemes before and after every morpheme in the corpus tagged with that semantic attribute, and obtaining a front feature vector and a rear feature vector for the semantic attribute, each composed of the morpheme information and the calculated weights.

[0008] The step of comparing the learned features of the semantic attributes with the feature consisting of the frequency of the morpheme information of the morphemes located in the vicinity of a morpheme of the input document, and assigning the semantic attribute with the most similar feature, may be a step of comparing the feature vectors representing the preceding and following contexts of each semantic attribute, created from the corpus by a predetermined method, with feature vectors representing the preceding and following contexts created by a predetermined method from the morpheme information of the n morphemes before and the n morphemes after a morpheme of the input document, calculating the distance between the feature vectors for the front and for the rear respectively, and selecting the semantic attribute whose feature vectors are closest.

[0009] The step of calculating the similarity of morpheme sequences in which semantic attributes are added to the morpheme information may be a step in which, when the input document is compared with a stored document in the corpus, a morpheme sequence is created using the semantic attribute for each morpheme to which a semantic attribute has been assigned and the basic form for each morpheme to which none has been assigned, a feature vector is created for each document based on the frequency of the created morpheme sequences, and the documents are compared by calculating the distance between the feature vectors.

[0010] The document classification device of the present invention is a device that acquires a sentence structure with semantic attributes from an electronic document and classifies the document, comprising: document input means for inputting a document; morphological analysis means for analyzing the input document and dividing it into morphemes provided with morpheme information including basic form, part of speech, and inflected form; a corpus in which a plurality of documents composed of morphemes and tagged with semantic attributes are stored in advance; corpus learning means for acquiring from the corpus the frequency of the morpheme information of morphemes appearing in the vicinity of each morpheme tagged with a semantic attribute, and learning from the acquired frequency the feature in which that semantic attribute appears; meaning assigning means for comparing the learned features of the semantic attributes with a feature consisting of the frequency of the morpheme information of the morphemes located in the vicinity of a morpheme of the input document, and assigning the semantic attribute having the most similar feature as the semantic attribute of that morpheme; similarity calculating means for comparing the input document, whose morphemes have been assigned semantic attributes, with a stored document in the corpus with respect to the frequency of morpheme sequences in which the semantic attributes are added to the morpheme information, and calculating a document similarity based on the frequency of the morpheme sequences; classifying means for classifying the input document into the category of the compared stored document when the similarity between the input document and that stored document exceeds a threshold; an output unit for outputting the classification result to the outside; and a control unit for controlling the processing of each means.

[0011] The device may further comprise a recording medium, and the operation of the control unit may be controlled by an input document classification program recorded on the recording medium.

[0012] The recording medium of the present invention records a control program for learning the features in which semantic attributes appear from the frequency of the morpheme information of morphemes appearing in the vicinity of morphemes tagged with semantic attributes in the stored documents held in a corpus, acquiring the semantic attributes of the morphemes obtained from a newly input document, and classifying the input document into a group of similar stored documents in the corpus.

[0013] The features of the processing of the present invention are described taking the classification of received e-mail as an example. When a newly received e-mail is input by the input means, it is divided into morphemes by the morphological analysis means. The corpus learning means learns, for each semantic attribute, the features of all the morpheme information from the documents stored in the corpus, to which morpheme information and semantic attributes have been attached in advance. The meaning assigning means assigns semantic attributes to the morphemes according to the sequence of morpheme information of the divided morphemes. The similarity calculating means obtains the similarity between each stored document in the corpus and the input document. The classifying means classifies the input document into the category of a stored document in the corpus with which it has a high similarity.

[0014]

[Embodiments of the invention] Next, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a schematic block diagram of the document classification device according to the first embodiment of the present invention, and FIG. 2 is a flowchart showing the document classification method according to the first embodiment. The document classification device 10 comprises an input means 11 for inputting a document, a morphological analysis means 12, a corpus learning means 13, a meaning assigning means 14, a similarity calculating means 15, a classifying means 16, an output means 17 for outputting the document classification result, a corpus 18, and a control unit 19 for controlling the processing of each means.

[0015] The corpus 18 is a large collected body of language data (see Natural Language Processing, Iwanami Koza Software Science 15, edited by Makoto Nagao et al., published by Iwanami Shoten, page 253). Here it is assumed that a large number of documents composed of morphemes tagged with semantic attributes are stored in advance in a predetermined format.

[0016] FIG. 3 shows example documents in the corpus 18. FIG. 3 is an example of tagged stored documents in the corpus used in the document classification method of the first embodiment; it shows two documents as examples, 「私の携帯電話で通信できません」 ("I cannot communicate with my mobile phone") and 「有効な解決策は書かれてありません」 ("No effective solution is written").

[0017] The morphological analysis means 12 divides the sentences of the input document input by the input means 11 into morphemes provided with morpheme information including basic form, part of speech, and inflected form. The corpus learning means 13 retrieves from the corpus 18, in a predetermined form, the stored documents composed of morphemes and tagged with semantic attributes, acquires the frequency of the morpheme information of the morphemes appearing in the vicinity of each tagged morpheme, and learns from the acquired frequency the feature in which that semantic attribute appears. The meaning assigning means 14 compares the learned features of the semantic attributes with a feature consisting of the frequency of the morpheme information of the morphemes located in the vicinity of a morpheme of the input document, and assigns the semantic attribute having the most similar feature as the semantic attribute of that morpheme.

[0018] The similarity calculating means 15 compares the input document, whose morphemes have been assigned semantic attributes, with a stored document in the corpus with respect to the frequency of morpheme sequences in which the semantic attributes are added to the morpheme information, and calculates a document similarity based on that frequency. The classifying means 16 classifies the input document into the category of the compared stored document in the corpus 18 when the similarity between the two exceeds a threshold.
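The threshold rule of the classifying means 16, combined with the loop over stored documents in the flowchart of FIG. 2, might be sketched as follows. This is only an illustration under stated assumptions: the function names, the `(category, document)` pair layout, the Jaccard stand-in for the similarity measure, and the `None` result when no stored document qualifies are all hypothetical, since the text does not say what happens when no stored document exceeds the threshold.

```python
def classify_first_match(input_doc, stored_docs, similarity, threshold):
    """Return the category of the first stored document whose similarity
    to input_doc exceeds the threshold.

    stored_docs: iterable of (category, document) pairs (assumed layout).
    similarity:  any function returning a number; larger means more similar.
    """
    for category, stored_doc in stored_docs:                # pick the next stored document
        if similarity(input_doc, stored_doc) > threshold:   # compare with the threshold
            return category                                 # classify into its category
    return None  # assumption: the source does not describe this case


def jaccard(a, b):
    """Stand-in similarity for the demo only; not the measure used in the patent."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


stored = [
    ("billing", ["invoice", "payment"]),
    ("support", ["phone", "cannot", "connect"]),
]
print(classify_first_match(["phone", "connect"], stored, jaccard, 0.5))
```

Passing the similarity function as a parameter keeps the loop independent of how the document feature vectors are built.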

[0019] The operation of the document classification method of the first embodiment of the present invention will be described with reference to FIG. 2. When the document classification process starts (S11), a document is input by the input means 11 (S12), and the morphological analysis means 12 divides the input document into morphemes provided with morpheme information including basic form, part of speech, and inflected form (S13). The meaning assigning means 14 selects one of the morphemes to which a semantic attribute is to be assigned (S14), calculates a feature from the frequency of the morpheme information of the n morphemes before and after that morpheme (S15), and compares it with the feature of a semantic attribute learned by the corpus learning means 13 (S16). If another candidate semantic attribute exists (S17Y), the next candidate semantic attribute is selected (S18) and the process returns to step S16. If no other candidate semantic attribute exists (S17N), the semantic attribute with the most similar feature is assigned (S19).

[0020] If semantic attributes have not yet been assigned to all the morphemes of the input document (S20N), the next morpheme of the input document to be assigned a semantic attribute is designated (S21), and the process returns to step S14. When assignment is complete for all the morphemes of the input document (S20Y), the similarity calculating means 15 selects one of the stored documents (S22) and compares the feature consisting of the frequency of morpheme sequences, including semantic attributes, of the input document with the corresponding feature of the selected stored document (S23). If the similarity is below the predetermined threshold (S24N), the next stored document is designated (S25) and the process returns to step S22. If the similarity is equal to or greater than the predetermined threshold (S24Y), the classifying means 16 classifies the input document into the category of that stored document (S26), and the document classification process ends (S27).

Next, the procedure for learning semantic attributes in steps S14 to S21 will be described in detail with an example. The morpheme information of the sentences of the stored documents in the corpus 18 is assumed to consist of the part of speech, basic form, and inflected form of each word, obtained by morphological analysis as shown in FIG. 3. Vectors are constructed using this information as cues.

[0021] For example, a sentence in a stored document in the corpus 18, 「講演会で話された内容について」 ("About the content spoken of at the lecture"), is annotated with the information shown in Table 1.

[0022]

[Table 1]

Here, when the document 「そちらは動作確認されたんですよね」 ("You did confirm that it works, didn't you?") is input and morphologically analyzed, the information shown in Table 2 is obtained.

[0023]

[Table 2]

In Table 1, the basic form 「れる」 of 「れた」 can take the semantic attributes "honorific" (尊敬), "passive" (受身), "potential" (可能), and "spontaneous" (自発); here it is tagged "honorific". The purpose of this step is to learn the semantic attribute of the basic form 「れる」 of 「れた」 in Table 2.

[0024] For the documents stored in the corpus, so-called n-grams (see Natural Language Processing, Iwanami Koza Software Science 15, edited by Makoto Nagao et al., published by Iwanami Shoten, page 15) are obtained from the morpheme information (basic form, part of speech, inflected form) of the n morphemes before and the n morphemes after each semantic attribute. The morpheme information of each co-occurring word is weighted over the whole corpus by a predetermined method, and a feature vector for each semantic attribute is constructed from the weights of all the morpheme information.
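The learning of per-attribute context vectors described here could be sketched roughly as below. The text leaves the weighting as "a predetermined method", so the relative-frequency weight is an assumption, as are the tuple layout `(base_form, part_of_speech, inflection, attribute_or_None)` and the names `features`, `learn_attribute_vectors`, and `n`.

```python
from collections import Counter, defaultdict


def features(morpheme):
    """Split one morpheme into its three items of morpheme information."""
    base, pos, inflection = morpheme[:3]
    return [("base", base), ("pos", pos), ("infl", inflection)]


def learn_attribute_vectors(corpus, n=2):
    """Return {attribute: {"front": {...}, "back": {...}}} context vectors.

    corpus: list of sentences; each sentence is a list of tuples
            (base_form, part_of_speech, inflection, attribute_or_None).
    """
    counts = defaultdict(lambda: {"front": Counter(), "back": Counter()})
    for sentence in corpus:
        for i, morpheme in enumerate(sentence):
            attribute = morpheme[3]
            if attribute is None:
                continue
            for neighbour in sentence[max(0, i - n):i]:      # n morphemes before
                counts[attribute]["front"].update(features(neighbour))
            for neighbour in sentence[i + 1:i + 1 + n]:      # n morphemes after
                counts[attribute]["back"].update(features(neighbour))

    # Assumed weighting: relative frequency within each side of the context.
    vectors = {}
    for attribute, sides in counts.items():
        vectors[attribute] = {}
        for side, counter in sides.items():
            total = sum(counter.values())
            vectors[attribute][side] = {f: c / total for f, c in counter.items()}
    return vectors
```

Keeping separate front and back vectors follows the description of feature vectors for the contexts before and after each semantic attribute; any other weighting scheme could replace the final loop.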

[0025] For the input document, feature vectors are created from the morpheme information of the n morphemes before and the n morphemes after 「れる」. In this case, the weight of every co-occurring item of morpheme information is set to 1, and all other weights are set to 0.

[0026] The inner product of the feature vector of each semantic attribute that 「れる」 can take, as constructed from the corpus, and the feature vector of 「れる」 in the input document is calculated, and the semantic attribute with the smallest distance (angle between the vectors) is adopted as the semantic attribute of that morpheme in the input sentence. In this way, the semantic attribute of a morpheme that was used in a similar context is selected as the semantic attribute of the morpheme in the input document.
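As a rough illustration of this selection step, the sketch below builds binary context vectors for an input morpheme and picks the candidate attribute whose learned vectors make the smallest angle with them, that is, the largest cosine. The layout of `learned`, the function names, and in particular the choice to add the front and back cosines into one score are assumptions; the text only says that distances are computed for the front and the rear.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def context_vector(morphemes):
    """Binary vector: weight 1 for each co-occurring item, implicitly 0 otherwise."""
    return {(kind, value): 1.0
            for base, pos, infl in morphemes
            for kind, value in (("base", base), ("pos", pos), ("infl", infl))}


def assign_attribute(sentence, i, learned, candidates, n=2):
    """Pick the candidate attribute closest to the context of sentence[i].

    sentence: list of (base_form, part_of_speech, inflection) tuples.
    learned:  {attribute: {"front": {...}, "back": {...}}} (assumed layout).
    """
    front = context_vector(sentence[max(0, i - n):i])
    back = context_vector(sentence[i + 1:i + 1 + n])

    def score(attribute):
        # Assumption: front and back similarities are simply added.
        return (cosine(front, learned[attribute]["front"])
                + cosine(back, learned[attribute]["back"]))

    return max(candidates, key=score)
```

Because a larger cosine corresponds to a smaller angle, taking the maximum score matches choosing the smallest distance in the sense used above.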

[0027] Next, the method of calculating the similarity between a document stored in the corpus and the input document, described in steps S22 to S26, will be explained in detail with an example. Here, a morpheme sequence is created for a document by using the semantic attribute for each morpheme that has one and the basic form for each morpheme that does not, and a feature vector is created from the frequency of these morpheme sequences.

[0028] A document consisting of morphemes with the morpheme information and semantic attributes shown in Table 3 is used as an example.

[0029]

[Table 3]

When n-grams are created from this document with n = 3, the result is:

(そちら, 「限定」, 動作)
(「限定」, 動作, 確認)
(動作, 確認, する)
(確認, する, 「尊敬」)
(する, 「尊敬」, 「断定」)
(「尊敬」, 「断定」, 「呼び掛け」)
(「断定」, 「呼び掛け」, 「確認」)

Items in 「」 are semantic attributes (限定: limitation, 尊敬: honorific, 断定: assertion, 呼び掛け: appeal, 確認: confirmation); the other items are basic forms. For the stored documents in the corpus, document feature vectors using the frequencies of such n-grams are created in advance.
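The n = 3 example above can be reproduced with a short sketch. The helper names are hypothetical, and because Table 3 is not reproduced in this text, the basic forms given for the tagged morphemes other than れる are placeholders inferred from the sentence; they do not affect the result, since a tagged morpheme is represented by its attribute.

```python
def morpheme_sequence(morphemes):
    """Use the semantic attribute when one is present, else the basic form."""
    return [f"「{attribute}」" if attribute else base
            for base, attribute in morphemes]


def ngrams(tokens, n=3):
    """Return all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# (basic form, semantic attribute or None).
# Basic forms of tagged morphemes are placeholders, not taken from Table 3.
sentence = [
    ("そちら", None), ("は", "限定"), ("動作", None), ("確認", None),
    ("する", None), ("れる", "尊敬"), ("だ", "断定"),
    ("よ", "呼び掛け"), ("ね", "確認"),
]

for gram in ngrams(morpheme_sequence(sentence)):
    print(gram)
```

Wrapping attributes in 「」 keeps the attribute 確認 distinct from the basic form 確認, which both occur in this sentence.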

[0030] For the input document whose semantic attributes have been acquired, a document feature vector using n-gram frequencies is created in the same way, a distance based on the inner product with the feature vector of each stored document in the corpus is calculated, and the input document is classified into the category of the stored document or group of stored documents with the smallest distance (angle between the vectors).
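A minimal sketch of this comparison, assuming each document has already been turned into a token sequence of semantic attributes and basic forms: n-gram counts serve as the feature vector, and the stored document with the largest cosine (smallest angle) determines the category. The function names and the `(category, tokens)` layout are hypothetical.

```python
import math
from collections import Counter


def ngram_vector(tokens, n=3):
    """Frequency vector of the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(u, v):
    """Cosine of the angle between two n-gram frequency vectors."""
    dot = sum(count * v.get(gram, 0) for gram, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_best_match(input_tokens, stored, n=3):
    """Return (category, similarity) of the stored document closest in angle.

    stored: non-empty list of (category, tokens) pairs (assumed layout).
    """
    query = ngram_vector(input_tokens, n)
    scored = [(category, cosine(query, ngram_vector(tokens, n)))
              for category, tokens in stored]
    return max(scored, key=lambda pair: pair[1])
```

For example, `classify_best_match(list("abcdx"), [("A", list("abcde")), ("B", list("vwxyz"))])` selects category "A", because two of the three trigrams are shared with the first stored sequence and none with the second.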

[0031] Regarding the morpheme sequences handled by the similarity calculating means, FIG. 4 shows examples of the morpheme sequences of the corpus documents shown in FIG. 3 and of the morpheme sequence of a sentence of a newly input document. In each sequence, the semantic attribute is used where one exists, and the basic form of the morpheme is used where none exists.

[0032] Next, a document classification method and a document classification device according to a second embodiment of the present invention will be described with reference to the drawings. FIG. 5 is a schematic block diagram of the document classification device according to the second embodiment of the present invention.

[0033] FIG. 5 shows the document classification device 100 of the present invention as a computer constituting the device. The computer comprises an input unit 110 such as a modem, keyboard, or pointing device; an output unit 120 such as a modem, printer, or display; a data processing device 130; a storage unit 140; and a recording medium 150. The recording medium 150 records the document classification system control program of the present invention, which can control the operation of each unit; an FD, a CD-ROM, a semiconductor memory, or the like is used as the medium.

[0034] The configuration of the document classification device and the document classification method are the same as in the first embodiment, so their description is omitted.

[0035] The control program learns the features in which semantic attributes appear from the frequency of the morpheme information of morphemes appearing in the vicinity of morphemes tagged with semantic attributes in the stored documents held in the corpus, acquires the semantic attributes of the morphemes obtained from a newly input document, and classifies the input document into a group of similar stored documents in the corpus. This program is read from the recording medium 150 into the data processing device 130 and controls the operation of the data processing device 130. Under the control of the control program, the data processing device 130 executes the following processing.

[0036] That is, the data processing device 130 executes: a process of inputting a new document; a process of analyzing the input document and dividing it into morphemes provided with morpheme information including basic form, part of speech, and inflected form; a process of acquiring from the corpus the frequency of the morpheme information of morphemes appearing in the vicinity of each morpheme tagged with a semantic attribute, and learning from the acquired frequency the feature in which that semantic attribute appears; a process of comparing the learned features of the semantic attributes with a feature consisting of the frequency of the morpheme information of the morphemes located in the vicinity of a morpheme of the input document, and assigning the semantic attribute having the most similar feature as the semantic attribute of that morpheme; a process of comparing the input document, whose morphemes have been assigned semantic attributes, with a stored document in the corpus with respect to the frequency of morpheme sequences in which the semantic attributes are added to the morpheme information, and calculating a document similarity based on the frequency of the morpheme sequences; and a process of classifying the input document into the category of the compared stored document when the similarity between the input document and that stored document exceeds a threshold.

[0037]

[Effects of the invention] As described above, the present invention supplements the semantic attributes of morphemes from a corpus and calculates the similarity with the stored documents in the corpus using morpheme sequences that include semantic attributes. This has the effect of enabling finer-grained document classification that takes the writer's intention into account.

[Brief description of the drawings]

【図１】本発明の第１の実施の形態の文書分類装置の模
式的ブロック構成図である。FIG. 1 is a schematic block configuration diagram of a document classification device according to a first embodiment of this invention.

【図２】本発明の第１の実施の形態の文書分類方法を示
すフローチャートである。FIG. 2 is a flowchart illustrating a document classification method according to the first embodiment of this invention.

【図３】第１の実施の形態の文書分類方法に用いられる
コーパスのタグ付格納文書例である。FIG. 3 is an example of a tagged stored document in the corpus used in the document classification method according to the first embodiment.

【図４】類似度計算手段において対象となる形態素の並
びの例である。FIG. 4 is an example of an arrangement of morphemes to be targeted by the similarity calculation means.

【図５】本発明の第２の実施の形態の文書分類装置の模
式的ブロック構成図である。FIG. 5 is a schematic block diagram of a document classification device according to a second embodiment of the present invention.

[Explanation of symbols]

１０、１００文書分類装置１１、１１１入力手段１２、１１２形態素解析手段１３、１１３コーパス学習手段１４、１１４意味付与手段１５、１１５類似度計算手段１６、１１６分類手段１７、１１７出力手段１８、１１８コーパス１９、１１９制御部１１０入力部１２０出力部１３０データ処理装置１４０記憶部１５０記録媒体Ｓ１１〜Ｓ１７ステップ
10, 100: Document classification device
11, 111: Input means
12, 112: Morphological analysis means
13, 113: Corpus learning means
14, 114: Semantic assignment means
15, 115: Similarity calculation means
16, 116: Classification means
17, 117: Output means
18, 118: Corpus
19, 119: Control unit
110: Input unit
120: Output unit
130: Data processing device
140: Storage unit
150: Recording medium
S11 to S17: Steps

Claims

[Claims]

1. A document classification method for acquiring a sentence structure with semantic attributes from an electronic document and classifying the document, comprising the steps of: inputting a new document; analyzing the input document and dividing it into morphemes having morpheme information including a basic form, a part of speech, and an inflected form; acquiring, from a corpus (a large collected body of language data) in which a plurality of documents composed of morphemes and tagged with semantic attributes are stored in advance, the frequency of the morpheme information of morphemes appearing in the vicinity of a morpheme to which a semantic attribute tag is attached, and learning, from the acquired frequency, the feature with which the semantic attribute appears; comparing the learned feature with which each semantic attribute appears with a feature consisting of the frequency of the morpheme information of morphemes located in the vicinity of a morpheme constituting the input document, and assigning the semantic attribute having the most similar feature as the semantic attribute of that morpheme; comparing the input document, whose morphemes have been assigned semantic attributes, with a stored document stored in the corpus with respect to the frequency of sequences of morphemes whose morpheme information is augmented with semantic attributes, and calculating the similarity of the documents based on that frequency; and classifying the input document into the category of the stored document being compared when the similarity between the input document and that stored document exceeds a threshold value.

2. The document classification method according to claim 1, wherein the step of acquiring, from the corpus, the frequency of morpheme information of morphemes appearing in the vicinity of a morpheme to which a semantic attribute tag is attached, and learning the feature of the semantic attribute from the acquired frequency, comprises: calculating, for all morpheme information present in the corpus, a weight with respect to each semantic attribute from the frequencies of the morpheme information of the n morphemes before and after the morphemes to which each of the semantic attributes present in the corpus is assigned; and learning the feature with which the semantic attribute appears by obtaining preceding and following feature vectors of the semantic attribute, each composed of the morpheme information and the calculated weights.
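The learning step of claim 2 can be illustrated with a short sketch. It is not the claimed implementation: the function name `learn_context_features`, the representation of a document as a list of `(morpheme_info, semantic_attribute_or_None)` pairs, and the use of relative frequency as the weight are all assumptions, since the claim does not fix a weighting formula.

```python
from collections import Counter, defaultdict

def learn_context_features(corpus, n=2):
    """Build preceding and following feature vectors for each semantic attribute.

    corpus: list of documents; each document is a list of
    (morpheme_info, semantic_attribute_or_None) pairs (assumed format).
    n: number of morphemes considered before and after a tagged morpheme.
    """
    before = defaultdict(Counter)
    after = defaultdict(Counter)
    for doc in corpus:
        for i, (_, attr) in enumerate(doc):
            if attr is None:
                continue
            before[attr]  # make sure every attribute has both vectors
            after[attr]
            for info, _ in doc[max(0, i - n):i]:
                before[attr][info] += 1
            for info, _ in doc[i + 1:i + 1 + n]:
                after[attr][info] += 1

    def to_weights(counter):
        # Assumed weighting: relative frequency within the context window.
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()} if total else {}

    return ({a: to_weights(c) for a, c in before.items()},
            {a: to_weights(c) for a, c in after.items()})
```

Each returned dictionary maps a semantic attribute to a sparse vector of morpheme information and weights, one for the preceding context and one for the following context.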

3. The document classification method described above, wherein the step of comparing the feature of the semantic attribute obtained as a result of learning with the feature consisting of the frequency of the morpheme information of morphemes located in the vicinity of a morpheme constituting the input document, and assigning the semantic attribute having the most similar feature as the semantic attribute of that morpheme, comprises: comparing the feature vectors representing the preceding and following features of each semantic attribute, created from the corpus by a predetermined method, with feature vectors representing the preceding and following features created by a predetermined method from the morpheme information of the n morphemes before and after the morpheme; calculating the distances between the feature vectors separately for the preceding and the following context; and selecting the semantic attribute of the feature vector whose distance is the closest.
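The selection step of claim 3 can be sketched as a nearest-vector search. The claim only says the vectors are created "by a predetermined method" and compared by distance, so the cosine distance, the summing of the preceding and following distances, and the names `cosine_distance` and `assign_attribute` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    """Cosine distance between two sparse vectors stored as dicts (assumed metric)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def assign_attribute(morphemes, i, before_feats, after_feats, n=2):
    """Pick the semantic attribute whose context vectors are closest to morpheme i.

    morphemes: list of morpheme-information strings of the input document.
    before_feats / after_feats: {attribute: {morpheme_info: weight}} learned
    from the corpus (hypothetical format).
    """
    ctx_before = Counter(morphemes[max(0, i - n):i])
    ctx_after = Counter(morphemes[i + 1:i + 1 + n])
    best_attr, best_dist = None, float("inf")
    for attr in before_feats:
        # Assumed combination rule: add the preceding and following distances.
        dist = (cosine_distance(ctx_before, before_feats[attr])
                + cosine_distance(ctx_after, after_feats.get(attr, {})))
        if dist < best_dist:
            best_attr, best_dist = attr, dist
    return best_attr
```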

4. The document classification method according to claim 1, wherein the step of calculating the similarity of documents based on the frequency of sequences of morphemes whose morpheme information is augmented with semantic attributes comprises: when comparing the input document with a stored document stored in the corpus, creating morpheme sequences using the semantic attribute for morphemes that have a semantic attribute and the basic form for morphemes that do not; creating a feature vector for each document based on the frequencies of the created morpheme sequences; and comparing the documents by calculating the distance between the feature vectors.
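The similarity calculation of claim 4 can be illustrated as follows. This is a sketch under stated assumptions rather than the claimed method itself: the sequence length `k=2`, the use of cosine similarity in place of an unspecified distance, and the names `sequence_vector` and `cosine_similarity` are all hypothetical.

```python
import math
from collections import Counter

def sequence_vector(doc, k=2):
    """Count length-k morpheme sequences in a document.

    doc: list of (basic_form, semantic_attribute_or_None) pairs (assumed format).
    A morpheme is represented by its semantic attribute when it has one,
    and by its basic form otherwise.
    """
    tokens = [attr if attr is not None else base for base, attr in doc]
    return Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))

def cosine_similarity(u, v):
    """Cosine similarity between two sequence-frequency vectors (assumed measure)."""
    dot = sum(count * v[key] for key, count in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Because tagged morphemes are replaced by their semantic attribute, two documents that differ only in, say, a place name produce the same sequences and therefore a similarity close to 1.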

5. A document classification device for acquiring a sentence structure with semantic attributes from an electronic document and classifying the document, comprising: document input means for inputting a document; morphological analysis means for analyzing the input document and dividing it into morphemes having morpheme information including a basic form, a part of speech, and an inflected form; a corpus (a large collected body of language data) in which a plurality of documents composed of morphemes and tagged with semantic attributes are stored in advance; corpus learning means for acquiring, from the corpus, the frequency of the morpheme information of morphemes appearing in the vicinity of a morpheme to which a semantic attribute tag is attached, and learning the feature of the semantic attribute from the acquired frequency; semantic assignment means for comparing the feature of the semantic attribute obtained by the learning with a feature consisting of the frequency of the morpheme information of morphemes located in the vicinity of a morpheme constituting the input document, and assigning the semantic attribute having the most similar feature as the semantic attribute of that morpheme; similarity calculation means for comparing the input document, whose morphemes have been assigned semantic attributes, with a stored document stored in the corpus with respect to the frequency of sequences of morphemes whose morpheme information is augmented with semantic attributes, and calculating the similarity of the documents based on that frequency; classification means for classifying the input document into the category of the stored document being compared when the similarity between the input document and that stored document exceeds a threshold value; output means for outputting the classification result to the outside; and a control unit for controlling the processing of each of these means.

6. The document classification device according to claim 5, further comprising a recording medium, wherein the operation of the control unit can be controlled by an input document classification program recorded on the recording medium.

7. A machine-readable recording medium on which is recorded a control program for learning the feature with which each semantic attribute appears, based on the frequency of the morpheme information of morphemes appearing near morphemes to which semantic attribute tags are attached in stored documents stored in a corpus (a large collected body of language data), acquiring semantic attributes for the morphemes of a newly input document, and classifying the input document with similar stored documents in the corpus, the program causing execution of: a procedure of newly inputting a document; a procedure of analyzing the input document and dividing it into morphemes; a procedure of acquiring, from the corpus, the frequency of the morpheme information of morphemes appearing near a morpheme to which a semantic attribute tag is attached, and learning the feature of the semantic attribute from the acquired frequency; a procedure of comparing the feature of the semantic attribute obtained as the learning result with a feature consisting of the frequency of the morpheme information of morphemes located near a morpheme constituting the input document, and assigning the semantic attribute having the most similar feature as the semantic attribute of that morpheme; a procedure of comparing the input document, whose morphemes have been assigned semantic attributes, with a stored document stored in the corpus with respect to the frequency of sequences of morphemes whose morpheme information is augmented with semantic attributes, and calculating the similarity of the documents based on that frequency; and a procedure of classifying the input document into the category of the stored document being compared when the similarity between the input document and that stored document exceeds a threshold.