JP2001060199A

JP2001060199A - Device and method for classifying document and computer readable recording medium storing document classification program

Info

Publication number: JP2001060199A
Application number: JP11234749A
Authority: JP
Inventors: Yuji Kyoya; 祐二京屋; Kunio Noguchi; 国雄野口; Chie Sekimoto; 千絵關本
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1999-08-20
Filing date: 1999-08-20
Publication date: 2001-03-06

Abstract

PROBLEM TO BE SOLVED: To perform high-accuracy document classifying processing by classifying a document after the document is converted so as to include necessary and sufficient keywords in the document. SOLUTION: This device is provided with an input part 105 for inputting a document to be classified and attribute information attached to the document, a document analytic part 111 for extracting a keyword by analyzing the inputted document on the basis of morpheme analysis, a classification rule storage part 103 for storing a classification rule for describing the prescribed combination composed of at least one of the keyword showing the characteristic of a group and the attribute information for each group, a similarity degree calculating part 115 for calculating the degree of similarity between each of classification rules and the keyword or attribute information of the document and a document classifying part 116 for finding a group corresponding to the classification rule, with which the document is most similar, on the basis of the calculated similarity degree and making the document correspondent to the group.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a document classification apparatus, a document classification method, and a computer-readable recording medium storing a document classification program, each of which classifies documents written in natural language into a predetermined number of groups. In particular, it relates to a technique that enables highly accurate document classification by using attribute information attached to a document and by complementing keywords when such documents are classified.

[0002]

2. Description of the Related Art

FIG. 8 shows a conventional document classification apparatus that classifies documents written in natural language into a predetermined number of groups.

[0003] Conventionally, in such a document classification apparatus, as shown in FIG. 8, the user manually classifies typical documents into several groups before document classification is performed and stores the result in a suitable recording medium. Actual classification is then performed by comparing and calculating the similarity between a document to be classified (hereinafter, the target document) and the typical documents in each group, and automatically determining the group into which the target document should be classified (see, for example, Japanese Patent Laid-Open No. 11-45247).

[0004]

Problems to Be Solved by the Invention

However, the conventional document classification technique described above has the following problems.

[0005] First, conventional document classification can achieve high accuracy when the target document contains a necessary and sufficient number of keywords. In document categories such as conversational text or summaries, however, some keywords are often omitted or, conversely, more keywords than necessary appear repeatedly. In such cases the information available to the classification process is extremely limited, so highly accurate classification cannot be performed.

[0006] Second, document classification is sometimes performed using information that accompanies a document and is supplied from outside, such as information about the speaker or the setting of an utterance obtained from document categories such as interviews and questionnaires. Conventional document classification, however, uses only the keywords in the target document, so it is practically impossible to reflect such externally supplied accompanying information in the classification, and classification accuracy suffers as a result.

[0007] The present invention has been made to solve the above problems. Its object is to provide a document classification apparatus, a document classification method, and a computer-readable recording medium storing a document classification program that enable highly accurate document classification by using attribute information attached to a document and by complementing keywords.

[0008] Another object of the present invention is to provide a document classification apparatus, a document classification method, and a computer-readable recording medium storing a document classification program that greatly reduce the labor and time required for document classification by making the classification process easier.

[0009]

Means for Solving the Problems

In addressing the above technical problems, the inventors arrived at the idea that highly accurate document classification becomes possible by classifying documents on the basis of classification rules, each describing a predetermined combination of one or more of the keywords in a document and the attribute information attached to the document, and by further complementing keywords that have been omitted from the target document.

[0010] A first feature of the present invention based on this idea is a document classification apparatus that classifies documents written in natural language into a predetermined number of groups, comprising: an input unit for inputting a document to be classified and attribute information attached to the document; a document analysis unit that analyzes the input document on the basis of morphological analysis and extracts keywords; a classification rule storage unit that stores, for each group, a classification rule describing a predetermined combination of one or more of the keywords and attribute information that characterize the group; a similarity calculation unit that calculates the similarity between each classification rule and the keywords or attribute information of the document; and a document classification unit that, on the basis of the calculated similarities, finds the group corresponding to the classification rule to which the document is most similar and associates the document with that group.

[0011] With this configuration, information other than keywords can be reflected in the classification result, so the accuracy of document classification can be improved.

[0012] A second feature of the present invention is that the apparatus comprises a determination unit that determines, on the basis of the keywords or the attribute information, whether a keyword has been omitted from the document, and a keyword complementing unit that, when a keyword is determined to have been omitted, restores the omitted keyword to the document and/or adds an expression similar to a keyword, and outputs the result to the similarity calculation unit.

[0013] With this configuration, even when keywords have been omitted from the target document, necessary and sufficient keywords can be supplied to it, so highly accurate document classification can be performed.

[0014] A third feature of the present invention is that the attribute information includes one or more of: identification information identifying the user corresponding to the document, question information corresponding to the document, scene information, the creation time of the document, the document category with respect to a proposition, and connection information between the document and a preceding utterance.

[0015] A fourth feature of the present invention is that a classification rule is described so as to include a predetermined combination of one or more of nominal (substantive) elements, predicate (declinable-word) elements, and modifier elements.

[0016] A fifth feature of the present invention is that the groups are formed hierarchically according to the level of abstraction of the corresponding classification rules.

[0017] A sixth feature of the present invention is that the apparatus comprises a display unit that lists, for each group, the documents belonging to the group together with the classification rule corresponding to the group or a similar expression of it.

[0018] A seventh feature of the present invention is that the display unit lists the documents that are not associated with any group.
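The listing behaviour described in the sixth and seventh features can be sketched as follows. This is only an illustration under stated assumptions, not the implementation disclosed in the patent: the function name `list_by_group`, the `UNASSIGNED` label, and the tuple and dict formats are all hypothetical.

```python
from collections import defaultdict

UNASSIGNED = "(unassigned)"  # hypothetical label for documents matching no group


def list_by_group(assignments, rules):
    """Return {group: (rule_text, [documents])}, with unassigned documents last.

    assignments: list of (document_text, group_name or None)
    rules: dict mapping group_name -> human-readable classification rule
    """
    grouped = defaultdict(list)
    for doc, group in assignments:
        grouped[group if group is not None else UNASSIGNED].append(doc)

    listing = {}
    # Named groups first, in alphabetical order, each shown with its rule.
    for group in sorted(g for g in grouped if g != UNASSIGNED):
        listing[group] = (rules.get(group, ""), grouped[group])
    # Documents that matched no group are listed separately at the end.
    if UNASSIGNED in grouped:
        listing[UNASSIGNED] = ("", grouped[UNASSIGNED])
    return listing
```

Showing the unassigned documents as their own entry is what lets a reviewer concentrate manual effort on just those documents.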

[0019] An eighth feature of the present invention is that the apparatus comprises an editing unit for instructing that the group with which a displayed document is associated be changed to another group.

[0020] A ninth feature of the present invention is that the apparatus comprises a classification rule generation unit that, for each group, inductively generates a classification rule from the set of documents belonging to the group.

[0021] A tenth feature of the present invention is that the similarity calculation unit can set the weights of the keywords variably according to the attribute information.

[0022] With this configuration, keywords can be given different degrees of importance, so the focus of the classification can be narrowed and classification accuracy improved. Classification accuracy can also be improved by, for example, setting the weights variably according to part of speech.
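As a rough illustration of how variable keyword weights could enter the similarity calculation, the sketch below scores a document by the weighted fraction of a rule's keywords that it contains. The patent does not give a similarity formula or any weight values, so the function `weighted_similarity`, the `POS_WEIGHTS` table, and the scoring rule itself are assumptions.

```python
# Hypothetical part-of-speech weights; the patent does not specify values.
POS_WEIGHTS = {"noun": 2.0, "verb": 1.0, "modifier": 0.5}


def weighted_similarity(rule_keywords, doc_keywords, pos_weights=POS_WEIGHTS):
    """Weighted fraction of the rule's keywords found in the document.

    rule_keywords: dict mapping keyword -> part of speech
    doc_keywords: set of keywords extracted from the document
    """
    total = sum(pos_weights.get(pos, 1.0) for pos in rule_keywords.values())
    if total == 0:
        return 0.0
    matched = sum(
        pos_weights.get(pos, 1.0)
        for word, pos in rule_keywords.items()
        if word in doc_keywords
    )
    return matched / total
```

With the example weights, a document that matches only the noun of a noun-plus-verb rule scores 2/3 rather than 1/2, which reflects the idea of giving some keywords more importance than others.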

[0023] An eleventh feature of the present invention is a document classification method for classifying documents written in natural language into a predetermined number of groups, comprising: an input step of inputting a document to be classified and attribute information attached to the document; a document analysis step of analyzing the input document on the basis of morphological analysis and extracting keywords; a similarity calculation step of calculating the similarity between the keywords or attribute information of the document and each of the classification rules, each rule describing, for one group, a predetermined combination of one or more of the keywords and attribute information that characterize the group; and a document classification step of finding, on the basis of the calculated similarities, the group corresponding to the classification rule to which the document is most similar and associating the document with that group.

[0024] With this configuration, document classification is performed after the document has been converted so that it contains necessary and sufficient keywords, so highly accurate document classification can be performed.

[0025] A twelfth feature of the present invention is a document classification method comprising a determination unit that determines, on the basis of the keywords or the attribute information, whether a keyword has been omitted from the document, and a keyword complementing step of, when a keyword is determined to have been omitted, restoring the omitted keyword to the document and/or adding an expression similar to a keyword, and outputting the result.

[0026] A thirteenth feature of the present invention is a computer-readable recording medium storing a document classification program for classifying documents written in natural language into a predetermined number of groups, the program causing a computer to execute: input processing for inputting a document to be classified and attribute information attached to the document; document analysis processing for analyzing the input document on the basis of morphological analysis and extracting keywords; similarity calculation processing for calculating the similarity between the keywords or attribute information of the document and each of the classification rules, each rule describing, for one group, a predetermined combination of one or more of the keywords and attribute information that characterize the group; and document classification processing for finding, on the basis of the calculated similarities, the group corresponding to the classification rule to which the document is most similar and associating the document with that group.

[0027] With this configuration, document classification is performed after the document has been converted so that it contains necessary and sufficient keywords, so highly accurate document classification can be performed.

[0028] A fourteenth feature of the present invention is a computer-readable recording medium storing a document classification program that causes a computer to execute processing including a determination unit that determines, on the basis of the keywords or the attribute information, whether a keyword has been omitted from the document, and keyword complement processing that, when a keyword is determined to have been omitted, restores the omitted keyword to the document and/or adds an expression similar to a keyword, and outputs the result.

[0029]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

A document classification apparatus, a document classification method, and a computer-readable recording medium storing a document classification program according to an embodiment of the present invention are described in detail below with reference to FIGS. 1 to 7.

[0030] This embodiment provides a function that enables highly accurate document classification: by complementing keywords and, in addition, using attribute information other than keywords, it converts the target document so that the document contains necessary and sufficient keywords.

[0031] First, the configuration of the document classification apparatus according to the embodiment of the present invention is described.

[0032] FIG. 1 is a block diagram showing the configuration of the document classification apparatus according to the embodiment of the present invention.

[0033] As shown in FIG. 1, the document classification apparatus 100 according to the embodiment comprises: a structuring rule database 101 that stores structuring rules for extracting "document structure information"; a complementary keyword database 102 that stores the "keyword information" used in complement processing; a classification rule database 103 that stores, for each group, a "classification rule" describing a combination of one or more of the predetermined keywords and attribute information (synonymous with the accompanying attribute information described later) that characterize the group; a document information storage unit 104 that stores document information such as the documents, document-related information, and classification results; an input unit 105 for inputting a target document and the "accompanying attribute information (= attribute information)" attached to it; a display unit 107 that lists, group by group, the documents belonging to each group together with the classification rule corresponding to the group or a similar expression of it; an editing unit 106 for changing the group with which each displayed document is associated; and a classification processing unit 110 that performs document classification on the input target document.

[0034] The classification processing unit 110 comprises: a document analysis unit 111 that analyzes the target document on the basis of morphological analysis and extracts keyword information; a document structure information extraction unit 112 that extracts document structure information from the target document, the accompanying attribute information, and the keyword information; a determination unit 113 that determines, on the basis of the keywords or attribute information of the target document, whether a keyword has been omitted from the target document; a keyword complementing unit 114 that uses the keyword information and attribute information to supply the keywords the target document lacks; a similarity calculation unit 115 that calculates the similarity between each classification rule and the keyword information or attribute information; a document classification unit 116 that, on the basis of the calculated similarities, determines the group corresponding to the classification rule to which the target document is most similar and associates the target document with that group; and a classification rule generation unit 117 that, for each group, inductively derives the characteristics of the group from the set of documents belonging to it (for example, using a known regression analysis technique) and updates the classification rule.
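The inductive derivation of a rule from a group's documents could, in its simplest form, look like the sketch below. Note the substitution: the text mentions a known regression analysis technique as one example, whereas this sketch uses a plain document-frequency threshold, which is a different and much simpler technique chosen only for illustration. The function name `induce_rule` and the `min_ratio` parameter are hypothetical.

```python
from collections import Counter


def induce_rule(group_docs_keywords, min_ratio=0.6):
    """Return the sorted keywords shared by at least min_ratio of the group's documents.

    group_docs_keywords: list of keyword sets, one per document in the group
    """
    if not group_docs_keywords:
        return []
    counts = Counter()
    for keywords in group_docs_keywords:
        # Count each keyword once per document, however often it occurs there.
        counts.update(set(keywords))
    threshold = min_ratio * len(group_docs_keywords)
    return sorted(word for word, n in counts.items() if n >= threshold)
```

A keyword that appears in most of a group's documents is treated as characteristic of the group; one that appears in only a single document is left out of the rule.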

[0035] The document information storage unit 104 stores, in a format such as that shown in FIG. 5, each input document together with information on its speaker, utterance time, keywords, complemented keywords, analysis attributes, and assigned group.
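One way to picture a record in the document information storage unit is the data class below. The field list follows the items named in the text, but FIG. 5 is not reproduced here, so the field names, types, and the `all_keywords` helper are assumptions rather than the stored format itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocumentRecord:
    """Hypothetical record layout; field names are illustrative only."""
    text: str
    speaker: str
    spoken_at: str
    keywords: List[str] = field(default_factory=list)
    complemented_keywords: List[str] = field(default_factory=list)
    analysis_attributes: dict = field(default_factory=dict)
    group: Optional[str] = None  # None until the document has been classified

    def all_keywords(self) -> List[str]:
        """Extracted keywords followed by complemented ones, without duplicates."""
        seen, merged = set(), []
        for word in self.keywords + self.complemented_keywords:
            if word not in seen:
                seen.add(word)
                merged.append(word)
        return merged
```

Keeping the complemented keywords in a separate field makes it possible to tell afterwards which keywords came from the document and which were supplied.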

[0036] Here, the "document structure information" contains information about the structure of the target document, such as its number of characters, number of phrases, sentence pattern, literary or colloquial style, genre (for example, impression or request), and theme. The number of characters and the number of phrases are extracted from the target document and the morphological analysis result. The sentence pattern referred to here is a pattern obtained from the morphological analysis result, such as S (subject) + V (predicate) or S + O (object) + V. Literary and colloquial style are distinguished by looking for expressions that are used only in colloquial or only in literary text; when the distinction cannot be made, the style is recorded as undeterminable. Compared with literary documents, colloquial documents are more likely to repeat the same keyword many times or to omit keyword information. Classification accuracy can therefore be improved by making the criteria of the classification rules for colloquial documents differ from those for literary documents, for example by using the literary/colloquial determination as part of a classification rule or by weighting the coefficients of a classification rule according to that determination. The genre of a document is the result of judging its content, for example whether the document is an impression of a product, a request, or a question. If a sentence ends with an expression meaning "want to do", for instance, the content of the document is judged to be a request. The theme is a set of character strings indicating what topic the target document concerns. For example, if the target document concerns personal computers, the theme is "personal computer"; if it concerns soccer, the theme is "sports, soccer" (sports is a superordinate concept of soccer, so sports is also included in the theme). Each character string in the theme is added to the keyword information during keyword complement processing. As a result, when the topic is soccer, the keyword information contains "soccer" even if the target document itself does not contain that keyword. Treating the theme as document structure information also improves classification accuracy, particularly for types of document, such as newspaper and news articles, in which the theme is consistent throughout an article.
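The genre judgment and the theme expansion described above can be sketched as follows. The cue phrases and the hypernym table are invented English stand-ins (the original cues are Japanese sentence endings), and the names `expand_theme`, `detect_genre`, and `add_theme_to_keywords` are hypothetical.

```python
# Hypothetical hypernym table: topic -> broader concepts.
HYPERNYMS = {"soccer": ["sports"], "personal computer": []}

# Hypothetical cue phrases for genre detection, checked in order.
GENRE_CUES = [("request", ("want to", "would like")), ("question", ("?",))]


def expand_theme(topic, hypernyms=HYPERNYMS):
    """Return the theme as a list: broader concepts first, then the topic itself."""
    return list(hypernyms.get(topic, [])) + [topic]


def detect_genre(text, cues=GENRE_CUES):
    """Return the first genre whose cue appears in the text, else 'unknown'."""
    lowered = text.lower()
    for genre, patterns in cues:
        if any(p in lowered for p in patterns):
            return genre
    return "unknown"


def add_theme_to_keywords(keywords, theme):
    """Append theme strings that are not already among the keywords."""
    return list(keywords) + [t for t in theme if t not in keywords]
```

For a document about soccer that never uses the word, `add_theme_to_keywords` supplies both "sports" and "soccer", which is the effect the paragraph describes.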

[0037] The "keyword information" is a set of pairs, each consisting of a character string that serves as a keyword and the part-of-speech information for that string. The weights of the keywords in the keyword information can also be set variably according to the accompanying attribute information.
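A minimal sketch of keyword information as string and part-of-speech pairs, with weights adjusted according to accompanying attribute information, might look like this. The `Keyword` class, the `reweight` function, and the shape of the `boosts` table are assumptions; the text states only that the weights can be varied, not how.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyword:
    text: str
    pos: str           # part of speech, e.g. "noun"
    weight: float = 1.0


def reweight(keywords, attributes, boosts):
    """Return new Keyword objects with weights scaled by attribute-dependent factors.

    attributes: dict of accompanying attribute name -> value (values must be hashable)
    boosts: dict mapping (attribute_name, attribute_value) -> {keyword_text: factor}
    """
    factors = {}
    for key, value in attributes.items():
        for word, factor in boosts.get((key, value), {}).items():
            factors[word] = factors.get(word, 1.0) * factor
    return [Keyword(k.text, k.pos, k.weight * factors.get(k.text, 1.0)) for k in keywords]
```

For example, an answer to a question about pricing could have the keyword "price" weighted more heavily than the same word in an unrelated answer.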

[0038] A "classification rule" is a description that includes a predetermined combination of one or more of the nominal (substantive) elements, predicate (declinable-word) elements, and modifier elements in a document. As shown in FIG. 6, classification rules are stored in the classification rule database 103 together with the documents associated with them (their member documents). The classification groups are preferably formed hierarchically according to the level of abstraction of the corresponding classification rules, which makes the classification results easier to search.
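The sketch below shows one possible representation of a rule built from nominal, predicate, and modifier elements, together with one possible reading of "level of abstraction": a rule with fewer elements is treated as more abstract, and a group's parent is the most specific rule whose elements form a strict subset of its own. That ordering is an assumption made for illustration, as are the names `Rule` and `parent_of`.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Rule:
    group: str
    nominal: FrozenSet[str] = frozenset()
    predicate: FrozenSet[str] = frozenset()
    modifier: FrozenSet[str] = frozenset()

    def elements(self):
        """All elements of the rule, tagged with their element type."""
        return (
            {("nominal", w) for w in self.nominal}
            | {("predicate", w) for w in self.predicate}
            | {("modifier", w) for w in self.modifier}
        )


def parent_of(rule, rules) -> Optional[str]:
    """Group of the most specific rule whose elements are a strict subset of `rule`'s."""
    candidates = [r for r in rules if r.elements() < rule.elements()]
    if not candidates:
        return None
    return max(candidates, key=lambda r: len(r.elements())).group
```

Under this reading, a group defined by "battery" alone sits above a group defined by "battery" plus "short", so a search can start at the broad group and narrow down.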

[0039] The "accompanying attribute information (= attribute information)" means identification information for identifying the user corresponding to the document (for example, the speaker's name), the creation time of the target document, question information corresponding to the target document, scene information (information about the setting in which the target document was created, for example that the speaker is elderly), the document category with respect to a proposition, connection information between documents that appear before and after one another (for example, dependency relations between words), and the like.

[0040] Note that accompanying attribute information is attached to the target document from the outset, whereas document structure information is structural information obtained by analyzing the target document. In that sense keyword information, which is obtained by morphological analysis of the target document, is also a kind of document structure information. Because keywords are incorporated into classification rules more often than other structural information, however, they are treated separately from document structure information here.

[0041] In the complement processing performed by the keyword complementing unit, for keywords that are highly related to each other, it is preferable that when one of the keywords appears the other is complemented automatically, and that the complemented keyword is inserted at a specific position in the original document.
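The pairing behaviour described here, where the appearance of one keyword causes a closely related one to be supplied, can be sketched as below. The `RELATED` table stands in for the contents of the complementary keyword database 102 and is invented for the example, as is the function name `complement_keywords`. For simplicity the sketch appends to the keyword list instead of inserting text at a position in the original document.

```python
# Hypothetical contents of the complementary keyword database (102).
RELATED = {"run out": ["battery"], "battery": ["run out"], "screen": ["display"]}


def complement_keywords(keywords, related=RELATED):
    """Return (complemented_list, added), where `added` are the keywords supplied.

    Only keywords present in the input trigger complementing; keywords that were
    themselves added do not trigger further additions.
    """
    result = list(keywords)
    added = []
    for word in keywords:
        for partner in related.get(word, []):
            if partner not in result:
                result.append(partner)
                added.append(partner)
    return result, added
```

A remark such as "it runs out quickly" yields only "run out" as a keyword; the table supplies "battery", so the remark can still match a rule about battery life.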

[0042] Next, the procedure of the document classification processing according to the embodiment of the present invention is described.

[0043] FIG. 2 is a flowchart showing the procedure of the document classification processing according to the embodiment of the present invention.

[0044] (1) The target document is input to the input unit 105 (target document input step, S201).

[0045] (2) The accompanying attribute information attached to the target document is input to the input unit 105 (attribute information input step, S202).

[0046] (3) The document analysis unit 111 analyzes the input target document on the basis of morphological analysis and extracts keyword information (keyword information extraction step, S203). When a keyword is a conjugating part of speech, it is preferably converted to its terminal (dictionary) form, so that the similarity calculation step described later can be executed independently of the conjugated form.

[0047] (4) The document structure information extraction unit 112 extracts document structure information using the target document, the accompanying attribute information, and the keyword information (document structure information extraction step, S204).

[0048] (5) The determination unit 113 determines, on the basis of the keywords or the attribute information, whether a keyword has been omitted from the target document (determination step, S205). If a keyword is determined to have been omitted, processing proceeds to the keyword complementing step (S300); otherwise it proceeds to the similarity calculation step (S207).

[0049] (6) The keyword complementing unit 114 extracts the keywords to be complemented, using the target document, the accompanying attribute information, the keyword information, and the document structure information (keyword complementing step, S300).

[0050] (7) The similarity calculation unit 115 calculates the similarity between each classification rule and the keywords or attribute information of the target document (similarity calculation step, S207).

[0051] (8) The document classification unit 116 extracts, on the basis of the calculated similarities, the group corresponding to the classification rule to which the target document is most similar, and associates the target document with the extracted group (document classification step, S208).

[0052] (9) The display unit 107 displays, for each group, a list of the documents belonging to the group together with the classification rule corresponding to the group or a similar expression of it (classification result display step, S209). Documents that are not associated with any group may also be listed. Configuring the display unit 107 in this way greatly reduces the number of target documents that must be classified manually and shortens the time required for classification.

[0053] (10) The group associated with each displayed document is changed via the editing unit 106 (classification result change step, S210). This step may be omitted.

[0054] (11) The classification rule generation unit 117 inductively extracts, for each group, the characteristics of the group from the documents belonging to it, using, for example, a known technique such as regression analysis, and updates the classification rules (classification rule update step, S211). This step should be performed whenever the classification result change step has been performed; otherwise it need not be performed, which shortens the time required for document classification.

[0055] (12) The determination unit 116 determines whether any unclassified document remains (unprocessed document determination step, S212). If an unprocessed document remains, the process returns to the target document input step S201. If none remains, the document classification process ends.
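The core of steps S207 and S208 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the similarity measure (the fraction of a rule's keywords found in the document), the threshold, the sample rules, and all names are hypothetical, and attribute information is left out for brevity.

```python
from typing import Dict, Optional, Set

# Hypothetical classification rules: group name -> keywords characterising the group.
CLASSIFICATION_RULES: Dict[str, Set[str]] = {
    "mouse": {"mouse"},
    "display": {"screen", "monitor"},
}

def similarity(rule_keywords: Set[str], doc_keywords: Set[str]) -> float:
    """Assumed measure: fraction of the rule's keywords that occur in the document."""
    if not rule_keywords:
        return 0.0
    return len(rule_keywords & doc_keywords) / len(rule_keywords)

def classify(doc_keywords: Set[str], rules: Dict[str, Set[str]],
             threshold: float = 0.5) -> Optional[str]:
    """S207-S208: score every rule and return the most similar group.

    Returns None when no rule reaches the (assumed) threshold, which
    corresponds to a document that is not associated with any group.
    """
    best_score = 0.0
    best_group: Optional[str] = None
    for group, rule_keywords in rules.items():
        score = similarity(rule_keywords, doc_keywords)
        if score > best_score:
            best_score, best_group = score, group
    return best_group if best_score >= threshold else None
```

Documents for which `classify` returns `None` would be the ones listed separately in step S209 for manual handling.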

[0056] Next, the keyword complementing step S300 described above is explained in detail.

[0057] (5-1) First, a complement rule to be processed is retrieved from the complementary keyword database 102, and its complement-source keyword set Cx is extracted (complement-source keyword set extraction step, S301). FIGS. 7(a) and 7(b) show examples of complement rules, composed of keyword information, in the complementary keyword database 102.

[0058] (5-2) It is determined whether all keywords in the complement-source keyword set Cx are contained in the keyword information (determination step, S302). If they are, the process proceeds to the complement-destination keyword storage step S303; if not, it proceeds to the unprocessed keyword determination step S304.

[0059] (5-3) The complement-destination keywords are derived according to the complement rule (for complement rules, which are the rules for deriving complement-destination keywords, see the specific examples given later), and all keywords in the complement-destination keyword set Cy are added to the complementary keyword database 102 (complement-destination keyword storage step, S303).

[0060] (5-4) It is determined whether any unprocessed complement rule remains (unprocessed complement rule determination step, S304). If one remains, the process returns to the complement-source keyword set extraction step S301. If none remains, the keyword complementing step S300 ends and the process moves to the similarity calculation step S207.

[0061] Here, keyword complementing may be performed recursively, using both the keyword information and the complement-destination keywords rather than the keyword information alone. This makes it possible to extract further complement-destination keywords from keywords that were themselves added earlier.
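The loop over complement rules (S301 to S304) and its optional recursive variant can be sketched as follows. This is an illustrative sketch only: the rule representation as (Cx, Cy) pairs, the sample rules, and the function names are assumptions, and the complemented keywords are simply merged into the document's keyword set rather than written to a database.

```python
from typing import FrozenSet, List, Set, Tuple

# A complement rule: (complement-source set Cx, complement-destination set Cy).
ComplementRule = Tuple[FrozenSet[str], FrozenSet[str]]

# Hypothetical rules for illustration only.
COMPLEMENT_RULES: List[ComplementRule] = [
    (frozenset({"click"}), frozenset({"mouse"})),
    (frozenset({"mouse", "keyboard"}), frozenset({"personal computer"})),
]

def complement_once(keywords: Set[str], rules: List[ComplementRule]) -> Set[str]:
    """One pass over all rules; returns only the keywords that are new."""
    added: Set[str] = set()
    for cx, cy in rules:        # S301: take the next rule and its Cx
        if cx <= keywords:      # S302: are all keywords of Cx present?
            added |= cy         # S303: keep the complement-destination keywords
    return added - keywords     # S304 is the loop condition itself

def complement(keywords: Set[str], rules: List[ComplementRule],
               recursive: bool = True) -> Set[str]:
    """Return the keywords plus complemented ones.

    With recursive=True the pass is repeated until no new keyword appears,
    so keywords added earlier can trigger further rules.
    """
    result = set(keywords)
    while True:
        new = complement_once(result, rules)
        result |= new
        if not new or not recursive:
            return result
```

In this sketch, "click" and "keyboard" first yield "mouse"; only the recursive variant then goes on to add "personal computer", because the second rule needs the "mouse" keyword that was added in the first pass.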

[0062] The theme information in the document structure information is extracted using the accompanying attribute information, the keyword information, and other document structure information. Basically, it is best extracted from the keyword information using a database of the same form as complement rule A (described later). For example, when "mouse, keyboard" is contained in the keyword information, the "theme" is set to "personal computer". If the accompanying attribute information already contains attribute information about the theme, that may be used instead. When the target document continues the topic of an upstream document, the union of the theme obtained from the target document itself and the theme of the upstream document is taken as the theme of the target document. Consequently, as long as the topic continues, the number of character strings contained in the theme grows with each downstream document. Conversely, when the topic does not continue, the theme is cleared. Which document is upstream of the target document is determined from the accompanying attribute information. Whether the topic continues is best determined by referring to connective expressions. For example, when the documents are news items, phrases such as "in response to this" or "against this" indicate that the topic continues, whereas phrases such as "next" or "subsequently" indicate that it does not.
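The theme accumulation described above can be sketched as follows. This is a simplified illustration under several assumptions: documents arrive as a list ordered from upstream to downstream (the patent determines the upstream document from attribute information), the cue phrases are the English renderings of the examples in the text, a document without any cue is treated as starting a new topic, and all names are hypothetical.

```python
from typing import List, Optional, Set

# Assumed cue phrases, taken from the news-document examples in the text.
CONTINUE_CUES = ("in response to this", "against this")
BREAK_CUES = ("next", "subsequently")

def topic_continues(text: str) -> Optional[bool]:
    """True or False when a cue phrase opens the text, None when no cue is found."""
    lowered = text.strip().lower()
    if lowered.startswith(CONTINUE_CUES):
        return True
    if lowered.startswith(BREAK_CUES):
        return False
    return None

def propagate_themes(docs: List[dict]) -> List[Set[str]]:
    """Each doc has 'text' and its own 'theme' set; returns the effective theme per doc."""
    themes: List[Set[str]] = []
    carried: Set[str] = set()
    for doc in docs:
        if topic_continues(doc["text"]):
            carried = carried | doc["theme"]  # union with the upstream theme
        else:
            carried = set(doc["theme"])       # topic changed: clear and restart
        themes.append(carried)
    return themes
```

As the text notes, the theme set only grows while the topic continues, and it is reset as soon as a document signals a change of topic.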

[0063] The document classification apparatus 100 according to the embodiment of the present invention also has an editing unit 106 and a display unit 107, and displays, for each group, a list of the documents belonging to the group together with the classification rule corresponding to the group or a similar expression of it. Furthermore, the group associated with each displayed document can be changed via the editing unit 106, which greatly reduces the labor and time required for document classification, a task that has until now been performed manually.

[0064] The document classification apparatus 100 according to the embodiment of the present invention has, for example, the external appearance shown in FIG. 4. That is, the apparatus 100 is configured by incorporating each element of the document classification apparatus into a computer system. The computer system 40 includes a floppy disk drive 41 and an optical disk drive 43. By inserting a floppy disk 42 into the floppy disk drive 41 or an optical disk 44 into the optical disk drive 43 and performing a predetermined read operation, the document classification program stored on these recording media can be installed in the computer system 40. In addition, by connecting a suitable drive device, installation and data reading and writing can also be performed using, for example, a ROM 45 serving as a memory device or a cartridge 46 serving as a magnetic tape device.

[0065] Furthermore, the document classification apparatus 100 according to the embodiment of the present invention may be implemented as a program and stored on a computer-readable recording medium. To perform document classification, the recording medium is read by a computer system, the program is stored in a storage unit such as a memory in the computer system, and the document classification program is executed by the processing unit, thereby realizing the document classification apparatus and method of the present invention. Here, the recording medium includes any computer-readable recording medium capable of recording a program, such as a semiconductor memory, a magnetic disk, an optical disk, a magneto-optical disk, or a magnetic tape.

[0066] Thus, it should be fully understood that the present invention encompasses various embodiments and the like that are not described here. Accordingly, the present invention is to be limited only by the invention-defining matters of the claims that are reasonable in light of this disclosure.

[0067] Finally, a specific example of the complementing operation in the document classification processing according to the embodiment of the present invention is briefly described.

[0068] The complementing patterns according to the embodiment of the present invention fall broadly into the two types of complement rules shown in FIGS. 7(a) and 7(b): complement rule A (FIG. 7(a)), which restores omitted keywords, and complement rule B (FIG. 7(b)), which adds similar expressions. With complement rule A, for example, adding "mouse" as the complement-destination keyword for the complement-source keyword "click" makes it possible to classify the original document into the group whose classification rule is "mouse" even if "mouse" does not appear in the original document. With complement rule B, for example, mutually complementing the keywords "PC", "パソコン" (pasokon, the colloquial Japanese abbreviation), and "personal computer" allows the original document to be classified with equal accuracy whichever of the three expressions it uses. Moreover, a classification rule needs to state only one of the three expressions, so the classification rules can also be simplified.

[0069] In deriving complement-destination keywords under complement rule A, if-then rules (if A then add B) are mainly used. Some specific examples of complement rule A follow.

[0070] (1) For a predicate that is normally used only with a specific object, the noun it pairs with is added, as in "wear" → "clothes" and "brew" → "tea".

[0071] (2) Words that are frequently used together are added, as in "strong yen" → "dollar" and "young" → "child".

[0072] (3) For words that denote a specific kind, a more abstract word that subsumes them is added, as in "beer", "whiskey" → "alcohol".

[0073] (4) Using general knowledge, specialist knowledge, current affairs, or slang, when any one member of a set of interchangeable expressions appears, all the other words or a representative word are added, as in "Keizo Obuchi" → "Prime Minister", "Ichiro" → "professional baseball player", and "Yukichi Fukuzawa" → "10,000-yen note".

[0074] (5) Words that can be omitted because the sender and the receiver are known to share prior information are added, as in "nearest station" → "Tokyo Station" and "dinner" → "curry".

[0075] (6) Rules are supplemented to reflect current social trends, as in "child + favorite item" → "video game" and "high-school girl + favorite item" → "PHS".

[0076] (7) When sentences containing the combinations "A + B" and "C + B" appear frequently, A and C are regarded as likely to be similar words and are complemented with each other.
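Example (7) can be illustrated with a small sketch that looks for words sharing the same co-occurring partners. This is only one possible reading: the patent does not specify how "frequently" is measured, so the criterion used here (a minimum number of shared partner words), the input format, and the function name are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

def similar_word_candidates(pairs: Iterable[Tuple[str, str]],
                            min_shared: int = 2) -> List[Tuple[str, str]]:
    """Return word pairs (A, C) that share at least `min_shared` partner words B.

    `pairs` holds observed (word, partner) combinations such as ("PC", "buy").
    """
    partners: Dict[str, Set[str]] = defaultdict(set)
    for word, partner in pairs:
        partners[word].add(partner)
    candidates = []
    for a, c in combinations(sorted(partners), 2):
        if len(partners[a] & partners[c]) >= min_shared:
            candidates.append((a, c))
    return candidates
```

Pairs returned by such a function would be candidates for mutual complementing, presumably after review, since shared contexts alone do not guarantee that two words are synonyms.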

[0077] Complement rule B, on the other hand, mainly uses complement-set rules (if A then add ^A). For example, when any one of a set of synonyms such as "quick", "rapid", and "fast" appears, all the other words or a representative word are added.
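The two options for complement rule B, adding all other synonyms or adding a representative word, can be sketched as follows. The synonym sets, the choice of the alphabetically first word as the representative, and the function names are assumptions made for illustration.

```python
from typing import List, Set

# Hypothetical synonym sets; the members of each set complement one another.
SYNONYM_SETS: List[Set[str]] = [
    {"PC", "personal computer"},
    {"quick", "rapid", "fast"},
]

def expand_synonyms(keywords: Set[str], synonym_sets: List[Set[str]]) -> Set[str]:
    """If any member of a synonym set is present, add every other member."""
    expanded = set(keywords)
    for group in synonym_sets:
        if group & keywords:
            expanded |= group
    return expanded

def to_representative(keywords: Set[str], synonym_sets: List[Set[str]]) -> Set[str]:
    """Alternative: map each synonym to one representative word.

    The representative is assumed here to be the alphabetically first member.
    """
    result: Set[str] = set()
    for keyword in keywords:
        for group in synonym_sets:
            if keyword in group:
                result.add(min(group))
                break
        else:
            result.add(keyword)  # not in any synonym set: keep as is
    return result
```

Expanding keeps the classification rules simple, since a rule can name any one member of a set. Mapping to a representative keeps the keyword set small instead.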

[0078] When the target document is conversational text, there are cases that neither if-then rules nor complement-set rules can handle: namely, when the target document itself, or a document having the attribute of a preceding utterance relative to the target document, holds information that constitutes prior knowledge, and a demonstrative in the target document stands in for a character string of that prior knowledge. (In English, this corresponds to words marked with the definite article "the", apart from a few exceptions such as "the earth".) In such cases, the utterance is best rewritten with the more specific expression contained in the shared prior information, for example "in that place" → "in Tokyo" or "with yesterday's product" → "with product A".

[0079] As a result of complementing, the target document may match the classification rules of several groups at once. For example, if the classification rules are "include(PC or personal computer)" and "include(information source or CM or commercial)", a document may hit both, depending on the evaluation function. In that case, possible approaches include: (1) assigning the document to both groups, for example by classifying it into every group whose similarity exceeds a threshold; (2) assigning it to only one group, for example by prioritizing the classification rules while using the same similarity; and (3) assigning it to neither group and asking the user to confirm. With method (1), it is desirable that the user be able, by looking at just one group rather than checking all of them, to see similar opinions and to tell that the document has other classification destinations as well.
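The three ways of handling a document that hits several groups can be sketched as follows. This is an illustrative sketch rather than the patent's procedure: the strategy names, the score dictionary, the threshold, and the priority list are all hypothetical.

```python
from typing import Dict, List, Optional

def assign_groups(scores: Dict[str, float], threshold: float, strategy: str,
                  priority: Optional[List[str]] = None) -> List[str]:
    """Resolve the case where several groups exceed the threshold.

    strategy "all":      method (1), assign to every group above the threshold
    strategy "priority": method (2), keep only the highest-priority group
    strategy "ask":      method (3), assign to none and leave it to the user
    """
    hits = [group for group, score in scores.items() if score > threshold]
    if len(hits) <= 1 or strategy == "all":
        return hits
    if strategy == "priority":
        order = priority or sorted(hits)  # assumed fallback: alphabetical order
        return [min(hits, key=lambda g: order.index(g) if g in order else len(order))]
    if strategy == "ask":
        return []
    raise ValueError(f"unknown strategy: {strategy}")
```

An empty result under the "ask" strategy would mark the document for user confirmation. Under "all", the caller would also need to record the other destinations so that they can be shown alongside each group.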

[0080] One field of application for the present invention is, for example, the classification of large volumes of documents obtained from interviews, questionnaires, and the like in market research.

[0081]

[Effects of the Invention] As described above, according to the document classification apparatus, the document classification method, and the computer-readable recording medium storing a document classification program of the present invention, keywords omitted from the document to be classified, or new keywords similar to the extracted keywords, are added while unnecessary keywords in the document are deleted. The document is thereby converted so that it contains the necessary and sufficient keywords, and document classification is then performed on the converted document. Furthermore, attribute information other than keywords can be reflected in the classification result, so highly accurate document classification becomes possible. In addition, because information other than keywords can be reflected in the classification result, even more accurate document classification can be realized.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a document classification apparatus according to an embodiment of the present invention.

FIG. 2 is a flowchart showing a document classification method according to the embodiment of the present invention.

FIG. 3 is a flowchart showing the details of the complementing process in step S300 of FIG. 2.

FIG. 4 is a diagram showing the external appearance of the document classification apparatus according to the embodiment of the present invention.

FIG. 5 is a schematic diagram showing the internal configuration of the document information storage unit according to the embodiment of the present invention.

FIG. 6 is a schematic diagram showing the internal configuration of the classification rule database according to the embodiment of the present invention.

FIG. 7 is a diagram for explaining the complementing process according to the embodiment of the present invention.

FIG. 8 is a schematic diagram for explaining conventional document classification processing.

[Explanation of symbols]

40 computer system
41 floppy disk drive
42 floppy disk
43 optical disk drive
44 optical disk
45 ROM
46 cartridge
100 document classification apparatus
101 structuring rule database
102 complementary keyword database
60, 103 classification rule database
50, 104 document information storage unit
105 input unit
106 editing unit
107 display unit
110 classification processing unit
111 document analysis unit
112 document structure information extraction unit
113 determination unit
114 keyword complementing unit
115 similarity calculation unit
116 document classification unit
117 classification rule generation unit

Continuation of front page
(72) Inventor: Chie Sekimoto, c/o Toshiba Corp. Yanagicho Works, 70 Yanagicho, Saiwai-ku, Kawasaki-shi, Kanagawa
F-terms (reference): 5B075 ND03 NK06 NK32 NR03 NR12 PR06 QM08 UU06

Claims

[Claims]

1. A document classification apparatus for classifying documents written in a natural language into a predetermined number of groups, comprising: an input unit for inputting a document to be classified and attribute information accompanying the document; a document analysis unit that analyzes the input document by morphological analysis and extracts keywords; a classification rule storage unit that stores, for each group, a classification rule describing a predetermined combination of at least one of the keywords and the attribute information that characterize the group; a similarity calculation unit that calculates a similarity between each of the classification rules and the keywords or the attribute information of the document; and a document classification unit that, based on the calculated similarity, finds the group corresponding to the classification rule to which the document is most similar and associates the document with that group.

2. The document classification apparatus according to claim 1, further comprising: a determination unit that determines, based on the keywords or the attribute information, whether a keyword of the document has been omitted; and a keyword complementing unit that, when it is determined that a keyword has been omitted, restores the omitted keyword to the document and/or adds an expression similar to the keyword, and outputs the result to the similarity calculation unit.

3. The document classification apparatus according to claim 1, wherein the attribute information includes one or more of: identification information identifying the user corresponding to the document, question information corresponding to the document, scene information, the creation time of the document, a document category with respect to a proposition, and connection information to a statement preceding the document.

4. The document classification apparatus according to claim 1, wherein the classification rule is described so as to include a predetermined combination of at least one of a nominal element, a predicate element, and a modifier element.

5. The document classification apparatus according to claim 1, wherein the groups are formed hierarchically according to an abstraction level of a corresponding classification rule.

6. The document classification apparatus according to claim 1, 2, 3, 4, or 5, further comprising a display unit that displays, for each group, a list of the documents belonging to the group together with the classification rule corresponding to the group or a similar expression of it.

7. The document classification device according to claim 6, wherein the display unit displays a list of documents that are not associated with any group.

8. The document classification apparatus according to claim 6, further comprising an editing unit for instructing that the group associated with each displayed document be changed to another group.

9. The document classification apparatus according to claim 1, 2, 3, 4, 5, 6, 7, or 8, further comprising a classification rule generation unit that inductively generates, for each group, the classification rule from the documents belonging to the group.

10. The document classification apparatus according to claim 1 or 9, wherein the similarity calculation unit variably sets the weights of the keywords according to the attribute information.

11. A document classification method for classifying documents written in a natural language into a predetermined number of groups, comprising: an input step of inputting a document to be classified and attribute information accompanying the document; a document analysis step of analyzing the input document by morphological analysis and extracting keywords; a similarity calculation step of calculating a similarity between each classification rule, which describes for each group a predetermined combination of at least one of the keywords and the attribute information that characterize the group, and the keywords or the attribute information of the document; and a document classification step of finding, based on the calculated similarity, the group corresponding to the classification rule to which the document is most similar and associating the document with that group.

12. The document classification method according to claim 11, further comprising: a determination step of determining, based on the keywords or the attribute information, whether a keyword of the document has been omitted; and a keyword complementing step of, when it is determined that a keyword has been omitted, restoring the omitted keyword to the document and/or adding an expression similar to the keyword, and outputting the result.

13. A computer-readable recording medium storing a document classification program for classifying documents written in a natural language into a predetermined number of groups, the program causing a computer to execute: an input process of inputting a document to be classified and attribute information accompanying the document; a document analysis process of analyzing the input document by morphological analysis and extracting keywords; a similarity calculation process of calculating a similarity between each classification rule, which describes for each group a predetermined combination of at least one of the keywords and the attribute information that characterize the group, and the keywords or the attribute information of the document; and a document classification process of finding, based on the calculated similarity, the group corresponding to the classification rule to which the document is most similar and associating the document with that group.

14. The computer-readable recording medium storing the document classification program according to claim 13, wherein the program further causes the computer to execute: a determination process of determining, based on the keywords or the attribute information, whether a keyword of the document has been omitted; and a keyword complementing process of, when it is determined that a keyword has been omitted, restoring the omitted keyword to the document and/or adding an expression similar to the keyword, and outputting the result.