JP2003108569A

JP2003108569A - Classification processing device, control method of classification processing device, control program, and recording medium

Info

Publication number: JP2003108569A
Application number: JP2001298558A
Authority: JP
Inventors: Takashige Tanaka; 敬重田中
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2001-09-27
Filing date: 2001-09-27
Publication date: 2003-04-11

Abstract

PROBLEM TO BE SOLVED: To perform optimal classification by extracting an appropriate classification ontology from the text of a document to be classified. SOLUTION: A database unit 11 stores in advance, as the classification ontology, words or compound words contained in a predetermined keyword source document, in association with a specific field. A classification update processing unit 12 analyzes the text data corresponding to the document to be classified, extracts the words or compound words contained in that document as classification target phrases, compares the classification ontology with the classification target phrases, and determines the classification to which the document belongs.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a classification processing device, a control method for a classification processing device, a control program, and a recording medium, and in particular to a technique for classifying documents by determining the classification to which a target document belongs.

[0002]

[Prior Art] With the recent development of information technology, vast amounts of information have become available, and documents of all kinds are no exception. However, the amount of available information and its ease of use pull in opposite directions: the more information (that is, the more documents) there is, the harder it becomes to make use of any given document. To avoid this problem, it has been proposed to classify the documents to be used and to use only those belonging to the required classification. When classifying a document, it is common to extract the keywords it contains and to classify the document on the basis of the extracted keywords.

[0003] One example of a technique for extracting keywords from a document in such a case is the automatic keyword extraction device described in Japanese Patent Laid-Open No. 6-282572. That device performs morphological analysis on a document to obtain part-of-speech information and extracts noun phrases and sahen nouns from the document. It then determines the importance of the extracted noun phrases and sahen nouns within the document and automatically extracts the high-importance keywords as keywords corresponding to the classification of the document.

[0004]

[Problems to Be Solved by the Invention] In the conventional automatic keyword extraction device described above, the extracted keywords also include, for example, compound words formed by combining multiple noun phrases. In that case, both the compound word and the words that make it up are extracted as keywords. However, a compound word and its constituent words do not necessarily belong to the same field, so classifying a document on the basis of both the compound word and its constituent words can assign the document to a classification different from its proper one.

[0005] For example, consider a case where the text to be classified contains the compound word "music CD". In this case, a conventional keyword extraction device extracts three keywords: "music", "CD", and "music CD". However, the word "CD" is also commonly used to mean "CD-ROM", a recording medium, so the classifications to which the word "CD" belongs would also include the "computer field". The word "CD" is therefore not suitable as a keyword for determining a classification (hereinafter referred to as a classification ontology). Accordingly, an object of the present invention is to provide a classification processing device capable of extracting an appropriate classification ontology from text to be classified and performing optimal classification, a control method for the classification processing device, a control program for the classification processing device, and a recording medium on which this control program is recorded.

[0006]

[Means for Solving the Problems] To solve the above problems, the classification processing device comprises: a classification database unit that stores in advance, as classification reference phrases, words or compound words contained in a predetermined classification reference document, in association with a specific field; a phrase extraction unit that analyzes a document to be classified and extracts the words or compound words contained in that document as classification target phrases; and a classification determination unit that compares the classification reference phrases with the classification target phrases and determines the classification to which the document to be classified belongs. With this configuration, the classification database unit stores in advance, as classification reference phrases, the words or compound words contained in the predetermined classification reference document, in association with a specific field. The phrase extraction unit analyzes the document to be classified and extracts the words or compound words it contains as classification target phrases. The classification determination unit compares the classification reference phrases with the classification target phrases and determines the classification to which the document to be classified belongs.

[0007] In this case, the device may further comprise: a morphological analysis unit that performs morphological analysis of the classification reference document to extract the words or compound words; an importance calculation unit that calculates the importance of each extracted word or compound word in the classification reference document; and a reference phrase registration unit that registers each extracted word or compound word in the classification database unit as a classification reference phrase, in association with its importance and the specific field.

[0008] The classification determination unit may preferentially determine, as the classification to which the document to be classified belongs, a classification for which a larger number of the classification target phrases are contained in the classification reference phrases. Furthermore, when the classification target phrases include N words and compound words, the classification determination unit may represent the classification target phrases as a vector in an N-dimensional vector space, represent the classification reference phrases as a vector in the same vector space, and make the determination on the basis of the distance between the two vectors.
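The count-based determination described above can be illustrated with a minimal Python sketch. The function name `classify_by_overlap`, the dictionary layout, and the sample phrases are assumptions made for illustration only; the source does not prescribe a data structure or a tie-breaking rule.

```python
from typing import Dict, Iterable, Set


def classify_by_overlap(target_phrases: Iterable[str],
                        ontology: Dict[str, Set[str]]) -> str:
    """Return the field whose reference phrases contain the most target phrases.

    `ontology` maps a field name to its classification reference phrases
    (a hypothetical layout, not taken from the source).
    """
    targets = set(target_phrases)
    # For each field, count the target phrases that are also reference phrases.
    counts = {field: len(targets & phrases) for field, phrases in ontology.items()}
    # Prefer the field with the largest count; ties go to the first field listed.
    return max(counts, key=counts.get)


# Hypothetical reference phrases for two fields.
ontology = {
    "music": {"music CD", "concert", "album"},
    "computer": {"CD-ROM", "hard disk", "program"},
}
result = classify_by_overlap(["music CD", "album", "program"], ontology)
```

Here two target phrases match the "music" field and one matches the "computer" field, so "music" is preferred.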

[0009] When the classification target phrases include N words and compound words, the classification determination unit may also represent the importance of the classification target phrases as a vector in an N-dimensional vector space, represent the importance of the classification reference phrases as a vector in the same vector space, and make the determination on the basis of the distance between the two vectors.

[0010] Further, the classification determination unit may represent the vector X corresponding to the importance of the classification target phrases as $X = (X_1, X_2, \ldots, X_N)$, represent the vector Y corresponding to the importance of the classification reference phrases as $Y = (Y_1, Y_2, \ldots, Y_N)$, and define the distance D as $D = \sum_{i=1}^{N} (X_i - Y_i)(X_i - Y_i)$. When the distance D is smaller than a predetermined threshold, the unit may determine that the document to be classified belongs to a classification close to the classification to which the classification reference document belongs.
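As a small worked sketch of the distance test above: the Python below computes D as the sum of (Xi − Yi)(Xi − Yi) over i = 1..N and compares it with a threshold. The function names, the sample importance values, and the threshold values are assumptions for illustration; the source only states that the threshold is predetermined.

```python
from typing import Sequence


def squared_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """D = sum over i of (Xi - Yi) * (Xi - Yi) for two N-dimensional vectors."""
    if len(x) != len(y):
        raise ValueError("both vectors must have the same dimension N")
    return sum((xi - yi) * (xi - yi) for xi, yi in zip(x, y))


def is_close(x: Sequence[float], y: Sequence[float], threshold: float) -> bool:
    """True when D is strictly smaller than the threshold."""
    return squared_distance(x, y) < threshold


# Hypothetical importance vectors over the same N = 3 phrases.
x = [0.5, 0.0, 1.0]   # document to be classified
y = [0.5, 1.0, 0.0]   # classification reference document
d = squared_distance(x, y)   # 0 + 1 + 1
```

Note that D as defined is the squared Euclidean distance (no square root is taken), so a threshold has to be chosen on that scale.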

[0011] Furthermore, when a compound word is any of a combination of multiple words, a combination of a word and a compound word having fewer characters than the extracted compound word, or a combination of compound words having fewer characters than the extracted compound word, the morphological analysis unit may extract only that compound word. In addition, in the morphological analysis, the morphological analysis unit may extract, as classification target phrases, words belonging to nouns, which include at least noun phrases and sahen nouns, and words belonging to parts of speech that can be regarded as predetermined noun phrases.
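One possible reading of the "compound word only" rule is sketched below in Python. It assumes the morphological analyzer returns (surface, part-of-speech) pairs and that a maximal run of consecutive nouns forms one compound word; both the token format and that run-based heuristic are assumptions for illustration, not the method specified in the source.

```python
from typing import List, Tuple


def extract_longest_compounds(tokens: List[Tuple[str, str]],
                              sep: str = " ") -> List[str]:
    """Emit each maximal run of nouns as a single phrase.

    Constituent words ("music", "CD") are not emitted on their own when
    they are part of a longer compound ("music CD").
    """
    phrases: List[str] = []
    run: List[str] = []
    for surface, pos in tokens:
        if pos == "noun":
            run.append(surface)          # extend the current noun run
        elif run:
            phrases.append(sep.join(run))  # close the run at a non-noun
            run = []
    if run:                               # flush a run at end of input
        phrases.append(sep.join(run))
    return phrases


# Hypothetical analyzer output for "I bought a music CD yesterday".
tokens = [("I", "pronoun"), ("bought", "verb"), ("a", "article"),
          ("music", "noun"), ("CD", "noun"), ("yesterday", "adverb")]
phrases = extract_longest_compounds(tokens)
```

With this input only "music CD" is extracted, so the ambiguous constituent "CD" never reaches the classification step.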

[0012] Furthermore, the parts of speech that can be regarded as the predetermined noun phrases may include the noun form of adjectival verbs and the continuative form of ichidan (one-step) verbs. The device may also include a reverse-lookup dictionary for morphological analysis in which the classification target phrases are registered, and the morphological analysis unit may perform the morphological analysis on the basis of that dictionary. Further, the morphological analysis unit may register any word or compound word not yet registered in the reverse-lookup dictionary for morphological analysis in that dictionary as an undefined word. Furthermore, when a word or compound word extracted in the morphological analysis contains a predetermined symbol, the morphological analysis unit may remove the symbol from the word or compound word and treat the result as the extracted word or compound word.
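The symbol removal and the registration of unregistered words can be sketched as follows. The symbol set, the plain `dict` used as the dictionary, and the "unknown" label are all hypothetical choices made for this illustration; the source only says the symbols are predetermined.

```python
from typing import Dict

# Hypothetical set of symbols to strip; the source does not list them.
SYMBOLS = set("「」『』()[]\"'")


def strip_symbols(phrase: str) -> str:
    """Remove every predetermined symbol from an extracted phrase."""
    return "".join(ch for ch in phrase if ch not in SYMBOLS)


def register_if_unknown(phrase: str, dictionary: Dict[str, str]) -> bool:
    """Register a phrase missing from the dictionary as an undefined word.

    Returns True when a new entry was added.
    """
    if phrase in dictionary:
        return False
    dictionary[phrase] = "unknown"
    return True


dictionary = {"music CD": "noun"}
cleaned = strip_symbols("「CD-ROM」")            # brackets removed, hyphen kept
added = register_if_unknown(cleaned, dictionary)
```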

[0013] The importance calculation unit may perform the importance calculation after excluding, from the extracted words or compound words, predetermined phrases that are unsuitable as classification target phrases. Further, the importance calculation unit may perform the importance calculation after excluding, from the extracted words or compound words, predetermined phrases that are unsuitable for determining the classification. Furthermore, the device may include a standardization unit that performs predetermined standardization processing on the extracted words or compound words, and the importance calculation unit may calculate the importance for the words or compound words after the standardization processing.
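A minimal sketch of the exclusion step, assuming the unsuitable phrases are held in a simple set that is consulted before the importance calculation. The set contents and the function name are hypothetical.

```python
from typing import Iterable, List, Set

# Hypothetical exclusion list; the source only says it is predetermined.
UNSUITABLE: Set[str] = {"thing", "case", "example"}


def drop_unsuitable(phrases: Iterable[str],
                    unsuitable: Set[str] = UNSUITABLE) -> List[str]:
    """Keep only the phrases that are not on the exclusion list, in order."""
    return [p for p in phrases if p not in unsuitable]


kept = drop_unsuitable(["music CD", "case", "album", "thing"])
```

Only the phrases in `kept` would then be passed on to the importance calculation.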

[0014] A control method for a classification processing device provided with a classification database unit that stores in advance, as classification reference phrases, words or compound words contained in a predetermined classification reference document, in association with a specific field, comprises: a phrase extraction step of analyzing a document to be classified and extracting the words or compound words contained in that document as classification target phrases; and a classification determination step of comparing the classification reference phrases with the classification target phrases and determining the classification to which the document to be classified belongs. With this configuration, the phrase extraction step analyzes the document to be classified and extracts the words or compound words it contains as classification target phrases. The classification determination step compares the classification reference phrases with the classification target phrases and determines the classification to which the document to be classified belongs.

[0015] In this case, the method may further comprise: a morphological analysis step of performing morphological analysis of the classification reference document to extract the words or compound words; an importance calculation step of calculating the importance of each extracted word or compound word in the classification reference document; and a reference phrase registration step of registering each extracted word or compound word in the classification database unit as a classification reference phrase, in association with its importance and the specific field.

[0016] When the classification target phrases include N words and compound words, the classification determination step may represent the importance of the classification target phrases as a vector in an N-dimensional vector space, represent the importance of the classification reference phrases as a vector in the same vector space, and make the determination on the basis of the distance between the two vectors. Further, the classification determination step may represent the vector X corresponding to the importance of the classification target phrases as $X = (X_1, X_2, \ldots, X_N)$, represent the vector Y corresponding to the importance of the classification reference phrases as $Y = (Y_1, Y_2, \ldots, Y_N)$, and define the distance D as $D = \sum_{i=1}^{N} (X_i - Y_i)(X_i - Y_i)$. When the distance D is smaller than a predetermined threshold, the step may determine that the document to be classified belongs to a classification close to the classification to which the classification reference document belongs.

[0017] Furthermore, when a compound word is any of a combination of multiple words, a combination of a word and a compound word having fewer characters than the extracted compound word, or a combination of compound words having fewer characters than the extracted compound word, the morphological analysis step may extract only that compound word. The morphological analysis step may also perform the morphological analysis on the basis of the contents registered in a reverse-lookup dictionary for morphological analysis in which the classification target phrases are registered. Further, the morphological analysis step may register any word or compound word not yet registered in the reverse-lookup dictionary for morphological analysis in that dictionary as an undefined word.

[0018] Furthermore, when a word or compound word extracted in the morphological analysis contains a predetermined symbol, the morphological analysis step may remove the symbol from the word or compound word and treat the result as the extracted word or compound word. The importance calculation step may perform the importance calculation after excluding, from the extracted words or compound words, predetermined phrases that are unsuitable as classification target phrases. Furthermore, the method may include a standardization step of performing predetermined standardization processing on the extracted words or compound words, and the importance calculation step may calculate the importance for the words or compound words after the standardization processing.

[0019] A control program that causes a computer to function as a classification processing device using a classification database unit that stores in advance, as classification reference phrases, words or compound words contained in a predetermined classification reference document, in association with a specific field, causes the computer to: analyze a document to be classified; extract the words or compound words contained in that document as classification target phrases; compare the classification reference phrases with the classification target phrases; and determine the classification to which the document to be classified belongs. In this case, the program may cause the computer to perform morphological analysis of the classification reference document to extract the words or compound words, calculate the importance of each extracted word or compound word in the classification reference document, and register each extracted word or compound word in the classification database unit as a classification reference phrase, in association with its importance and the specific field.

[0020] The program may cause the classification to be determined on the basis of the classification reference phrases and those classification target phrases that are contained in the classification reference phrases. Further, in determining the classification, the program may cause a classification for which a larger number of the classification target phrases are contained in the classification reference phrases to be preferentially determined as the classification to which the document to be classified belongs. Furthermore, when the classification target phrases include N words and compound words, the program may cause the classification target phrases to be represented as a vector in an N-dimensional vector space, cause the classification reference phrases to be represented as a vector in the same vector space, and cause the determination to be made on the basis of the distance between the two vectors.

[0021] When the classification target phrases include N words and compound words, the program may cause the importance of the classification target phrases to be represented as a vector in an N-dimensional vector space, cause the importance of the classification reference phrases to be represented as a vector in the same vector space, and cause the determination to be made on the basis of the distance between the two vectors. Further, the program may cause the vector X corresponding to the importance of the classification target phrases to be represented as $X = (X_1, X_2, \ldots, X_N)$, cause the vector Y corresponding to the importance of the classification reference phrases to be represented as $Y = (Y_1, Y_2, \ldots, Y_N)$, and define the distance D as $D = \sum_{i=1}^{N} (X_i - Y_i)(X_i - Y_i)$. When the distance D is smaller than a predetermined threshold, the program may cause the document to be classified to be determined as belonging to a classification close to the classification to which the classification reference document belongs.

[0022] Furthermore, when a compound word is any of a combination of multiple words, a combination of a word and a compound word having fewer characters than the extracted compound word, or a combination of compound words having fewer characters than the extracted compound word, the program may cause only that compound word to be extracted. In the morphological analysis, the program may cause words belonging to nouns, and to parts of speech that can be regarded as predetermined noun phrases, to be extracted as classification target phrases. Further, the nouns include noun phrases and sahen nouns. Furthermore, the parts of speech that can be regarded as the predetermined noun phrases may include the noun form of adjectival verbs and the continuative form of ichidan (one-step) verbs. The program may also cause the morphological analysis to be performed on the basis of the contents registered in a reverse-lookup dictionary for morphological analysis in which the classification target phrases are registered. Further, the program may cause any word or compound word not yet registered in the reverse-lookup dictionary for morphological analysis to be registered in that dictionary as an undefined word.

[0023] Furthermore, when a word or compound word extracted in the morphological analysis contains a predetermined symbol, the program may cause the symbol to be removed from the word or compound word and the result to be treated as the extracted word or compound word. The program may also cause the importance calculation to be performed after excluding, from the extracted words or compound words, predetermined phrases that are unsuitable as classification target phrases. Furthermore, the program may cause predetermined standardization processing to be performed on the extracted words or compound words and cause the importance to be calculated for the words or compound words after the standardization processing. Each of the above control programs may also be recorded on a recording medium.

[0024]

[Embodiments of the Invention] Next, preferred embodiments of the present invention are described with reference to the drawings.

[1] Schematic configuration of the classification processing system

FIG. 1 shows a schematic block diagram of the classification processing system. The classification processing system 10 broadly comprises: a database unit 11 that stores various data as databases; a classification update processing unit 12 that performs classification processing on the basis of the databases stored in the database unit 11 and updates those databases on the basis of the classification results; a display unit 13 that displays various information; and an input unit 14 for entering various data. The classification processing system 10 can be implemented on a computer system, and the functions of the classification update processing unit 12 are realized by microprocessor-executable programs corresponding to its respective units. Such programs may be executed directly from a recording medium such as a semiconductor memory or a CD-ROM. They may also be installed in advance on an external storage device and executed from there. Further, they may be installed via a network such as the Internet, either each time before execution or only once at the outset.

[0025] The database unit 11 broadly comprises a classification database unit 15, a reverse-lookup dictionary 16 for morphological analysis, and a text database unit 17. The database unit 11 is built on an external storage device such as a hard disk. The classification database unit 15 stores words or compound words that were contained in a keyword source document (classification reference document) as a classification ontology (classification reference phrases), in association with a specific classification (field) designated in advance. The reverse-lookup dictionary 16 for morphological analysis stores, as dictionary data used for morphological analysis, the words or compound words (morphological analysis results) obtained by morphologically analyzing text data. The text database unit 17 stores the results (words and compound words) of morphological analysis of the text data corresponding to the documents to be classified. The classification update processing unit 12 broadly comprises a morphological analysis unit 21, an importance calculation unit 22, a standardization unit 23, and a classification addition unit 24.

[0026] The morphological analysis unit 21 performs morphological analysis of either a text document to be classified or a keyword source document from which keywords are to be extracted, and generates a morphological analysis result. When the target of the analysis is a text document to be classified, the unit outputs the resulting words or compound words to the text database unit 17. When the target of the analysis is a keyword source document, the unit outputs the resulting words or compound words to the classification database unit 15 so that the classification ontology can be generated. The importance calculation unit 22 calculates the importance of each word or compound word in the morphological analysis result of the keyword source document. For example, it calculates a TF-IDF value by the TF-IDF method as the importance within that keyword source document. It then outputs the importance to the classification database unit 15 in association with the corresponding word or compound word from the morphological analysis result of the keyword source document.
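The source names the TF-IDF method but does not give a formula. The Python sketch below uses one common variant as an assumption: term frequency as the relative frequency within the document, and inverse document frequency as log(number of documents / number of documents containing the term). The function name and the sample corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(doc_terms: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Return a TF-IDF importance for each distinct term of one document.

    Assumed variant: tf = count / len(doc_terms), idf = ln(N / df).
    `corpus` is expected to include the document itself.
    """
    counts = Counter(doc_terms)
    total = len(doc_terms)
    n_docs = len(corpus)
    scores: Dict[str, float] = {}
    for term, count in counts.items():
        # Document frequency: how many documents contain the term.
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / df) if df else 0.0
        scores[term] = (count / total) * idf
    return scores


# Hypothetical keyword source documents, already split into phrases.
corpus = [
    ["music CD", "album", "music CD"],
    ["program", "album"],
]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus "album" occurs in every document and so receives an importance of zero, while "music CD" is both frequent in the first document and specific to it.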

[0027] The standardization unit 23 corrects variations in the notation of the words or compound words obtained by morphological analysis, and causes the morphological analysis unit 21 to output the corrected words or compound words as the morphological analysis result. For example, 「パソコン」 ("PC"), 「パーソナルコンピュータ」 and 「パーソナルコンピューター」 (two spellings of "personal computer") are unified by the standardization unit 23 to 「パソコン」 and output as the morphological analysis result. Similarly, 「ジョージ・ワシントン」 and 「ジョージ＝ワシントン」 (two ways of writing "George Washington") are unified to 「ジョージ＝ワシントン」.
The classification adding unit 24 refers to the morphological analysis result (words and compound words) of the text data corresponding to the document to be classified, which is stored in the text database unit 17, and to the classification ontology stored in the classification database unit 15. It determines the classification of the document to be classified and stores the determination result in the text database unit 17, appended to the morphological analysis result.
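The notation unification performed by the standardization unit 23 can be sketched as a simple lookup table. This is only an illustration: the table name `CANONICAL_FORMS` and the functions `standardize` and `standardize_all` are hypothetical, and only the two example groups are taken from the text; the patent does not say how the variant table is built or stored.

```python
# Hypothetical variant table. Only these example groups appear in the text.
CANONICAL_FORMS = {
    "パーソナルコンピュータ": "パソコン",
    "パーソナルコンピューター": "パソコン",
    "ジョージ・ワシントン": "ジョージ＝ワシントン",
}


def standardize(term: str) -> str:
    """Return the unified notation for a term, or the term itself if no variant is known."""
    return CANONICAL_FORMS.get(term, term)


def standardize_all(terms):
    """Apply notation unification to every extracted word or compound word."""
    return [standardize(term) for term in terms]
```

Terms that are already in canonical form, or that have no entry in the table, pass through unchanged, so the step can be applied to every morphological analysis result without special cases.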

[0028] [2] Overall Processing
Next, the operation of the classification processing device of the embodiment will be described with reference to FIGS. 2 to 7. FIG. 2 shows the overall processing flowchart of the classification processing device. First, the user is prompted, on the screen of the display 13, to confirm the morphological analysis dictionary (reverse lookup dictionary) 17 and the text data of the documents to be classified that are to be used in the classification processing (step S1). Once the user has confirmed them, the morphological analysis unit performs morphological analysis on the text data of the documents to be classified and registers the extracted words and compound words, which constitute the morphological analysis result, in the text database (step S2).

[0029] The processing performed by the morphological analysis unit in step S2 will now be described. FIG. 3 shows the processing flowchart of the morphological analysis unit. First, the user is prompted, on the display screen, to confirm the number of unanalyzed documents and the morphological analysis dictionary (step S11). Next, the morphological analysis unit determines whether any unanalyzed document exists (step S12). If no unanalyzed document exists (step S12; No), morphological analysis is unnecessary and the processing ends. If an unanalyzed document exists (step S12; Yes), morphological analysis processing is performed (step S13).
In this morphological analysis processing, part-of-speech processing is performed to extract words and compound words that belong to nouns (noun phrases, sa-hen nouns) and to parts of speech regarded as noun phrases. The parts of speech to be extracted are, first, the same as in the conventional art: adjectival verbs, sa-hen nouns, common nouns, numerals, proper nouns, adnominals, idiomatic phrases, idiomatic single kanji (excluding symbols), rendaku forms (noun rendaku and continuative rendaku), and undefined words. Here, an idiomatic phrase is a set phrase such as 「アーメン」 ("amen") or 「哀悼の意」 ("condolences").
Rendaku is the phenomenon in which, when two terms combine into a single new term, the unvoiced initial sound of the second term becomes voiced. Examples are the 「暮らし（ぐらし）」 part of 「田舎暮らし」 ("country living") and the 「通り（どおり）」 part of 「意向通り」 ("as intended").
An undefined word is a word or compound word that is not contained in the reverse lookup dictionary for morphological analysis.

[0030] The newly added parts of speech for extraction are the noun form of adjectival verbs and the continuative form of ichidan verbs. These are explained concretely as follows. In 「綺麗な花」 ("a beautiful flower"), 「綺麗な」 is the continuative form of an adjectival verb and is not extracted, whereas in 「花が綺麗」 ("the flower is beautiful"), 「綺麗」 is used as the noun form of the adjectival verb and is extracted. Likewise, 「あおむける」 ("to turn face up") is not extracted, but 「あおむけ」 is the continuative form of an ichidan verb and is therefore extracted.
Conversely, the parts of speech that were conventionally extracted but are excluded in the present embodiment are the terminal form of sa-hen nouns and adnominals. For example, in 「行動を共にする」 ("to act together"), 「行動」 ("action") is a sa-hen noun and is extracted, but 「行動する」 ("to act") is the terminal form of a sa-hen noun and is not extracted. Similarly, 「明くる」 in 「明くる朝」 ("the next morning") and 「悪しき」 in 「悪しき習慣」 ("a bad habit") are adnominals and are not extracted.
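The part-of-speech processing described above amounts to a set filter over tagged tokens. The following is a minimal sketch, assuming a tokenizer that already yields `(surface, part_of_speech)` pairs; the English tag names and the function `extract_terms` are hypothetical labels chosen for illustration and do not correspond to any particular morphological analyzer.

```python
# Parts of speech extracted in the conventional art (tag names are hypothetical).
CONVENTIONAL = {
    "adjectival_verb", "sahen_noun", "common_noun", "numeral", "proper_noun",
    "adnominal", "idiomatic_phrase", "idiomatic_single_kanji", "rendaku",
    "undefined_word",
}
# Parts of speech newly added for extraction in the embodiment.
ADDED = {"adjectival_verb_noun_form", "ichidan_verb_continuative"}
# Parts of speech excluded from extraction in the embodiment.
REMOVED = {"sahen_noun_terminal_form", "adnominal"}

TARGET_POS = (CONVENTIONAL | ADDED) - REMOVED


def extract_terms(tokens):
    """Keep only the surface forms whose part of speech is an extraction target."""
    return [surface for surface, pos in tokens if pos in TARGET_POS]
```

With the examples from the text, 「行動」 and 「あおむけ」 would be kept, while 「行動する」 and 「明くる」 would be dropped.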

[0031] Specifically, morphemes are extracted on the basis of the morphological analysis dictionary (reverse lookup dictionary). When a word that is not registered in the morphological analysis dictionary (an undefined word) is extracted, that undefined word is output as a noun phrase. Product model numbers are a typical example of words extracted as undefined words in this way. A compound word that is registered in the dictionary is not further decomposed into its constituent words.
Next, the morphological analysis unit determines whether a noun (noun phrase, sa-hen noun) or a term equivalent to a noun phrase has been extracted (step S14). If no noun phrase or equivalent term has been extracted (step S14; No), the processing returns to step S12 and continues in the same manner. If a noun phrase or equivalent term has been extracted, the morphological analysis unit 21 performs symbol processing.

[0032] The purpose of this symbol processing is to ensure that, when an extracted string contains a character that is inappropriate as the first character of a word, such as a middle dot, the string including that character is not processed as the extracted term. A symbol can also produce different strings for what is substantially the same word, as with 「●ＨＤＤ」 and 「ＨＤＤ」, where only the cut-out position differs. Furthermore, such symbols are inappropriate in product model numbers and the like. The characters that are inappropriate as the first character of a word include the following. In the description below, the code inside < > is the corresponding Shift JIS code.
Half-width characters: 「.」, 「'」, 「`」, 「!」, 「?」, 「-」, 「()、「」」
Full-width characters: 「！<8149>」, 「・<8145>」, 「？<8148>」, 「○<819b>」, 「●<819c>」, 「＊<8196>」, 「．<8144>」, 「‘<8165>」 to 「』<8178>」, 「／<815e>」, 「＼<815f>」, 「＝<8181>」 to 「≧<8186>」, 「、<8141>」 to 「，<8143>」, and the two characters corresponding to Shift JIS codes 8179 and 817a

[0033] For this reason, the morphological analysis unit 21 determines whether the first character of the extracted term is a symbol (step S15). If the first character is a symbol, that symbol may have been used to mark a heading or the like. To remove it, the start of the extracted term is moved to the character following the symbol, and the result is treated as a newly extracted term (step S16).

[0034] Next, the morphological analysis unit 21 determines whether the length of the newly extracted term is greater than 0, that is, whether any characters remain in it (step S17). If the length is 0 (step S17; No), no word remains, so the processing returns to step S14. If the length is greater than 0 (step S17; Yes), the morphological analysis unit 21 performs the determination of step S15 again.
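The loop formed by steps S15 to S17 can be sketched as repeatedly dropping a leading symbol until either a non-symbol character is found or nothing remains. This is an illustrative sketch only: `LEADING_SYMBOLS` contains just a subset of the characters listed in the text, and the function name `strip_leading_symbols` is hypothetical.

```python
# Illustrative subset of the characters listed as inappropriate at the start of a word.
LEADING_SYMBOLS = set(".'`!?-") | set("！・？○●＊．／＼、，")


def strip_leading_symbols(term: str) -> str:
    """Remove leading symbols one at a time (steps S15 and S16).

    An empty result corresponds to the "length is 0" branch of step S17,
    in which case the caller should move on to the next extracted term.
    """
    while term and term[0] in LEADING_SYMBOLS:
        term = term[1:]
    return term
```

Only leading symbols are removed, so a hyphen inside a string such as a model number is left untouched.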

[0035] If, in the determination of step S15, the first character is not a symbol, the newly extracted term is considered to be a word that should be registered in the text database 16, so the standardization unit 23 performs standardization processing (step S18). The classification update processing unit 12 then registers the standardized word in the text database 16 (step S19).
Next, the classification update processing unit 12 registers the words registered in the text database 16 in the reverse lookup dictionary 17 for morphological analysis (step S3). The classification update processing unit 12 then prompts the user, on the screen of the display 13, to confirm the keyword source documents and their classifications (step S4). The morphological analysis unit again performs morphological analysis, this time on the keyword source documents (step S5), and the importance calculation unit 22 calculates the importance (step S6).

[0036] The processing performed by the importance calculation unit 22 in step S6 will now be described. FIG. 4 shows the processing flowchart of the importance calculation unit. First, the importance calculation unit 22 acquires the morphological analysis data of the keyword source documents obtained in step S5 (step S21). Next, the importance calculation unit 22 determines whether any unprocessed document exists (step S22). If there is no keyword source document for which the importance calculation is still incomplete (step S22; No), no importance calculation is needed and the processing ends. If an unprocessed keyword source document exists (step S22; Yes), the importance calculation unit 22 calculates, from the corresponding morphological analysis data, a TFIDF value for each word or compound word contained in that document, and extracts the words or compound words whose TFIDF value is equal to or greater than a certain threshold (step S23).
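Step S23 can be illustrated with a small TFIDF calculation followed by a threshold filter. The text does not give the exact TFIDF formula, so the sketch below assumes one common variant (relative term frequency multiplied by the logarithm of the inverse document frequency); the function names `tfidf_scores` and `select_keywords` and the threshold value are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(doc_terms, corpus):
    """Score each term of one document against a corpus of term lists.

    Assumed formula: (count / len(doc)) * log(N / df). The corpus is
    expected to contain the document itself, so df is at least 1.
    """
    counts = Counter(doc_terms)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        df = max(1, sum(1 for other in corpus if term in other))
        scores[term] = (count / len(doc_terms)) * math.log(n_docs / df)
    return scores


def select_keywords(doc_terms, corpus, threshold):
    """Keep only the terms whose TFIDF value is at or above the threshold (step S23)."""
    scores = tfidf_scores(doc_terms, corpus)
    return {term: score for term, score in scores.items() if score >= threshold}
```

A term that appears in every document of the corpus receives a score of 0 under this variant and is therefore never selected for any positive threshold.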

[0037] Next, the importance calculation unit 22 performs restriction processing on each extracted word or compound word and determines whether it is subject to the restriction processing, that is, whether it should be excluded from further processing (step S24). Specifically, a word or compound word is subject to the restriction processing in the following cases.
(1) The first character of the word or compound word is 「ァ」, 「ィ」, 「ゥ」, 「ェ」, 「ォ」, 「ッ」, 「ャ」, 「ュ」, 「ョ」, 「ヮ」, 「ン」, 「ヵ」, 「ヶ」 or the like, as occurs when morphological analysis has failed.
(2) The word consists of two full-width katakana characters.
(3) The string making up the word or compound word contains a half-width character such as 「%」, 「&」, 「;」, 「:」 or 「+」 in the middle.
(4) The string making up the word or compound word contains a full-width character such as 「〜」, 「×」 or 「＋」 in the middle.
(5) The word is a single kanji.
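The restriction cases listed above can be sketched as a single predicate. This is an illustration under stated assumptions: "in the middle" is interpreted as "neither the first nor the last character", katakana is approximated by the code point range 「ァ」 to 「ヶ」 plus the long-vowel mark, and kanji by the basic CJK ideograph range. The function name `is_restricted` and the constant names are hypothetical.

```python
BAD_FIRST_CHARS = set("ァィゥェォッャュョヮンヵヶ")
HALFWIDTH_INNER = set("%&;:+")
FULLWIDTH_INNER = set("〜×＋")


def _is_katakana(ch: str) -> bool:
    # Approximation: the katakana letters U+30A1..U+30F6 plus the long-vowel mark.
    return "ァ" <= ch <= "ヶ" or ch == "ー"


def is_restricted(term: str) -> bool:
    """Return True if the term should be dropped by the restriction processing (step S24)."""
    if not term:
        return True
    if term[0] in BAD_FIRST_CHARS:  # starts with a small kana or similar
        return True
    if len(term) == 2 and all(_is_katakana(ch) for ch in term):  # two katakana characters
        return True
    inner = term[1:-1]  # assumed meaning of "in the middle"
    if any(ch in HALFWIDTH_INNER or ch in FULLWIDTH_INNER for ch in inner):
        return True
    if len(term) == 1 and "\u4e00" <= term <= "\u9fff":  # a single kanji
        return True
    return False
```

Terms for which the predicate returns False continue to the stop word check.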

[0038] These words or compound words are clearly inappropriate as noun phrases (proper nouns) or sa-hen nouns, and are therefore removed by the restriction processing. If, in step S24, the word or compound word is subject to the restriction processing (step S24; Yes), the importance calculation unit 22 discards it (step S27). The processing then returns to step S22 and the same processing is repeated.

[0039] If, in step S24, the word or compound word is not subject to the restriction processing (step S24; No), the importance calculation unit 22 determines whether it is subject to stop word processing, that is, whether it is a word or compound word that should be rejected as a stop word (step S25). A stop word is a word or compound word that is used across multiple fields; in other words, it is so general that it is unsuitable for estimating a classification. Examples are 「ＴＥＬ」, 「ＦＡＸ」, 「ＯＫ」 and 「ＮＧ」.

[0040] If, in the determination of step S25, the word or compound word is subject to stop word processing (step S25; Yes), the importance calculation unit 22 discards it (step S28). The processing then returns to step S22 and the same processing is repeated. If the word or compound word is not subject to stop word processing (step S25; No), it is registered as part of the classification ontology of the corresponding classification, and the processing returns to step S22 and the same processing is repeated. As a result of the above, the classification adding unit 24 assigns classifications on the basis of the classification keywords (step S7).
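Steps S24 to S28 together form a two-stage filter in front of ontology registration. The sketch below shows one way to express it, assuming the candidates arrive as a mapping from term to importance and that the restriction check is supplied as a function; `build_ontology`, its parameters and the `STOP_WORDS` set (which holds only the four examples from the text) are hypothetical.

```python
# Only the four examples given in the text; a real list would be larger.
STOP_WORDS = {"ＴＥＬ", "ＦＡＸ", "ＯＫ", "ＮＧ"}


def build_ontology(candidates, is_restricted, stop_words=STOP_WORDS):
    """Register the candidate terms that survive both filters.

    candidates: dict mapping term -> importance (e.g. a TFIDF value).
    is_restricted: predicate implementing the restriction processing.
    """
    ontology = {}
    for term, importance in candidates.items():
        if is_restricted(term):   # step S24: discard (step S27)
            continue
        if term in stop_words:    # step S25: discard (step S28)
            continue
        ontology[term] = importance  # register in the classification ontology
    return ontology
```

Passing the restriction check as a parameter keeps the sketch independent of any particular set of restriction rules.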

[0041] A specific classification method will now be described. FIG. 5 is a conceptual diagram of classification using the distance (similarity) between vectors in a two-dimensional (N = 2) vector space. Let the vector X, corresponding to the importance of the words or compound words (classification target terms) contained in the text data of the document to be classified and extracted by morphological analysis, be
$X = (X_1, X_2, \ldots, X_N)$,
and let the vector Y, corresponding to the importance of the classification ontology, that is, the words or compound words (classification criterion terms) contained in the keyword source document and extracted by morphological analysis, be
$Y = (Y_1, Y_2, \ldots, Y_N)$.
With the distance D defined as
$D = \sum_{i=1}^{N} (X_i - Y_i)^2$,
the document to be classified is determined to belong to a classification close to that of the keyword source document (classification reference document) when D is smaller than a predetermined threshold. The threshold may be set as appropriate on the basis of various classification results. Strictly speaking, the distance should be
$D = \sqrt{\sum_{i=1}^{N} (X_i - Y_i)^2}$,
but the square root is omitted in order to shorten the computation time.
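The distance test above is a squared Euclidean distance compared against a threshold. A minimal sketch follows; the function names `squared_distance` and `is_close` are hypothetical, and the threshold is whatever value has been tuned from classification results.

```python
def squared_distance(x, y):
    """Sum of squared component differences (the distance D without the square root)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def is_close(x, y, threshold):
    """True if D is smaller than the threshold, i.e. the document is near the classification."""
    return squared_distance(x, y) < threshold
```

Because the square root is monotonic, comparing squared distances ranks candidates in the same order as comparing true distances; only the threshold has to be expressed on the squared scale.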

[0042] As shown in FIG. 5, vector region A is the region containing the vectors for those words or compound words (classification target terms) that are contained in the text data of the document to be classified and extracted by morphological analysis, but are not contained in the keyword source document. Vector region B is the region containing the vectors for the words or compound words that are both classification target terms extracted from the document to be classified and classification criterion terms extracted from the keyword source document. Vector region C is the region containing the vectors for those words or compound words (classification criterion terms) that are contained in the keyword source document and extracted by morphological analysis, but are not contained in the text data of the document to be classified.
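In terms of the extracted term sets, the three vector regions correspond to two set differences and an intersection. A brief sketch, with the hypothetical function name `partition_terms`:

```python
def partition_terms(target_terms, criterion_terms):
    """Split terms into regions A, B and C.

    A: only in the document to be classified.
    B: in both the document to be classified and the keyword source document.
    C: only in the keyword source document.
    """
    target = set(target_terms)
    criterion = set(criterion_terms)
    return target - criterion, target & criterion, criterion - target
```

The size of the second set returned is the dimension of vector region B.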

[0043] Normally, if the distance D between the document to be classified and the keyword source document, taken over all of vector regions A, B and C, is smaller than the predetermined threshold, the document to be classified can be regarded as belonging to the classification of the keyword source document. The concept of the distance D will be described more concretely with reference to FIG. 6. Suppose that FIG. 6 contains the following documents:
Document D1: a document containing the four words 「山、山、川、川」 (mountain, mountain, river, river)
Document D2: a document containing the three words 「山、川、川」 (mountain, river, river)
Document D3: a document containing the three words 「山、山、山」 (mountain, mountain, mountain)
Document D4: a document containing the single word 「川」 (river)

[0044] In an actual calculation, the distance D is computed after weighting the words or compound words that form the elements of the vectors. TFIDF values are generally used for this weighting. When a word or compound word appears across many documents, its TFIDF value is small, which means that it need not be given much weight in determining the classification. In contrast, when the same word or compound word appears many times within a single document, its TFIDF value is large, which means that it is an important word for determining the classification.
For simplicity of explanation, however, consider a vector space in which the vertical axis is the number of occurrences of the word 「山」 and the horizontal axis is the number of occurrences of the word 「川」. In this case, as shown in FIG. 7, the vectors corresponding to documents D1 to D4 are vectors V1 to V4, respectively. The distance D between document D1 and document D2 is the smallest, so documents D1 and D2 can be judged to belong to the same classification.
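The example with documents D1 to D4 can be checked numerically. The sketch below uses raw word counts rather than TFIDF weights, as the simplified explanation does, and orders each vector as (count of 「川」, count of 「山」); the variable and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

DOCS = {
    "D1": ["山", "山", "川", "川"],
    "D2": ["山", "川", "川"],
    "D3": ["山", "山", "山"],
    "D4": ["川"],
}
AXES = ("川", "山")  # horizontal axis, vertical axis


def count_vector(words):
    """Word-count vector of a document along the two axes."""
    counts = Counter(words)
    return tuple(counts[axis] for axis in AXES)


def squared_distance(x, y):
    """Distance D without the square root."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


vectors = {name: count_vector(words) for name, words in DOCS.items()}
distances = {
    (a, b): squared_distance(vectors[a], vectors[b])
    for a, b in combinations(vectors, 2)
}
closest_pair = min(distances, key=distances.get)
```

Under these counts the vectors are (2, 2), (2, 1), (0, 3) and (1, 0), the distance between D1 and D2 is 1, and every other pair is farther apart, which agrees with the conclusion in the text.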

[0045] When determining the classification, if the number of vectors belonging to vector region A and vector region B is small, calculating the distance D using only the vectors belonging to vector regions B and C (an approximate calculation) can give the same result as calculating D using all vectors belonging to vector regions A, B and C. With the approximate calculation, however, even when vector region B is one-dimensional (N = 1), that is, when only a single word or compound word corresponds to vector region B, a distance D of 0 in vector region B (meaning that the single term matches) leads to the result that the document to be classified belongs to the classification of the keyword source document.

[0046] In other words, because the total number of vectors (number of elements) belonging to vector regions B and C differs from classification to classification, a classification with few vectors and a small dimension (small N) is more likely to yield the result that the document to be classified belongs to it. To avoid this, the distance D should be calculated and used for the determination preferentially for classifications in which vector region B has a large dimension (large N), and, when the dimensions are equal, the classification with the smaller distance D should be judged to be the closer one. This makes it possible to select a closer classification while shortening the processing time.
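The selection rule in this paragraph (larger dimension of vector region B first, then smaller distance D) can be written as a two-part sort key. The sketch assumes each candidate classification has already been reduced to a `(name, dimension_of_B, distance)` tuple; the function name `choose_classification` and the classification names in the usage note are hypothetical.

```python
def choose_classification(candidates):
    """Pick the classification with the largest region-B dimension, breaking ties by smaller D.

    candidates: iterable of (name, dimension_of_B, distance) tuples.
    """
    best = min(candidates, key=lambda c: (-c[1], c[2]))
    return best[0]
```

For example, given `("printers", 1, 0.0)`, `("scanners", 3, 2.5)` and `("projectors", 3, 1.0)`, the one-dimensional exact match is not selected; the two three-dimensional candidates take priority, and the one with the smaller distance is chosen.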

[0047] [3] Effects of the Embodiment
As described above, according to the present embodiment, the part-of-speech processing and symbol processing described above are performed during morphological analysis, so the accuracy and efficiency of morphological analysis are improved and more accurate classification can be performed. In addition, when the importance of each word or compound word is calculated in generating the classification ontology, the restriction processing and stop word processing described above are performed, so the importance can be calculated more accurately and a more effective classification ontology can be generated while reducing the number of registered terms.
Furthermore, by registering the words or compound words obtained by morphological analysis of the text data of the documents to be classified in the reverse lookup dictionary for morphological analysis, repeated morphological analysis behaves like learning, realizing morphological analysis from which an accurate classification ontology can be extracted. The number of words or compound words to be registered in the database can also be reduced, which reduces the capacity of the database.

[0048] [4] Modifications of the Embodiment
[4.1] First Modification
In the above description, the database unit 11 is configured integrally with the classification update processing unit 12, but the two may instead be configured as a distributed processing system connected via a network. In that case, the databases 15 and 16 and the reverse lookup dictionary 17 for morphological analysis, which make up the database unit 11, may also be stored on a separate database server accessed over the network, so that they can be used by multiple computer systems each functioning as a classification update processing unit 12.
[4.2] Second Modification
In the above description, the standardization unit 23 is described as an essential component. However, even without the standardization unit 23, substantially the same effects can be obtained, with the drawback that the capacity of the database increases somewhat.

[0049]

[Effects of the Invention] According to the present invention, the term extraction unit analyzes a document to be classified and extracts the words or compound words contained in that document as classification target terms, and the classification determination unit compares the classification criterion terms stored in the classification database with the classification target terms and determines the classification to which the document belongs. Accurate document classification can therefore be performed easily.
In addition, the morphological analysis unit registers words or compound words that are not registered in the reverse lookup dictionary for morphological analysis in that dictionary as undefined words. Repeated morphological analysis therefore behaves like learning, realizing morphological analysis from which an accurate classification ontology can be extracted.
In addition, the morphological analysis unit registers a word or compound that is not registered in the morphological analysis reverse lookup dictionary as an indefinite term in the morphological analysis reverse lookup dictionary, so that learning is performed accurately by repeatedly performing morphological analysis. A morphological analysis capable of extracting a classification ontology can be realized.

[0050] Furthermore, in morphological analysis, the following measures reduce the capacity of the database, improve the processing speed and enable accurate classification: when an extracted word or compound word contains a predetermined symbol, the symbol is removed and the remainder is treated as the extracted word or compound word; the importance calculation is performed after excluding, from the extracted words or compound words, predetermined terms that are inappropriate as classification target terms; the importance calculation is performed after excluding, from the extracted words or compound words, predetermined terms that are inappropriate for determining the classification; and predetermined standardization processing is applied to the extracted words or compound words, with the importance calculated for the standardized words or compound words.

[Brief description of drawings]

【図１】実施形態の分類処理システムの概要構成ブロ
ック図である。FIG. 1 is a schematic block diagram of a classification processing system according to an embodiment.

【図２】分類処理装置の全体処理フローチャートであ
る。FIG. 2 is an overall processing flowchart of a classification processing device.

【図３】形態素解析部の処理フローチャートである。FIG. 3 is a processing flowchart of a morphological analysis unit.

【図４】重要度計算部の処理フローチャートである。FIG. 4 is a processing flowchart of an importance calculation section.

【図５】２次元（Ｎ＝２）のベクトル空間においてベ
クトル間の距離（類似度）を用いて分類を行う場合の概
念図である。FIG. 5 is a conceptual diagram when classification is performed using a distance (similarity) between vectors in a two-dimensional (N = 2) vector space.

【図６】距離Ｄの概念説明図である。FIG. 6 is a conceptual explanatory diagram of a distance D.

【図７】距離Ｄの具体的説明図である。FIG. 7 is a specific explanatory diagram of a distance D.

[Explanation of symbols]

１０……分類処理システム 10 ... Classification processing system
１１……データベース部（分類データベース） 11 ... Database part (classification database)
１２……分類更新処理部（語句抽出部、分類判別部） 12 ... Classification update processing unit (word extraction unit, classification determination unit)
１３……ディスプレイ部 13 ... Display section
１４……入力部 14 ... Input section
１５……分類データベース部 15 ... Classification database section
１６……形態素解析用逆引き辞書 16 ... Reverse lookup dictionary for morphological analysis
１７……テキストデータベース部 17 ... Text database section

Claims

1. A classification processing device comprising: a classification database unit that stores in advance, as classification criterion phrases, words or compound words contained in a predetermined classification criterion document in association with a specific field; a phrase extraction unit that analyzes a document to be classified and extracts words or compound words contained in the document to be classified as classification target phrases; and a classification determination unit that compares the classification criterion phrases with the classification target phrases and determines the classification to which the document to be classified belongs.

2. The classification processing device according to claim 1, further comprising: a morphological analysis unit that performs morphological analysis on the classification criterion document to extract the words or compound words; an importance calculation unit that calculates the importance, within the classification criterion document, of each extracted word or compound word; and a criterion phrase registration unit that registers each extracted word or compound word in the classification database unit as a classification criterion phrase in association with its importance and the specific field.

3. The classification processing device according to claim 1, wherein, in determining the classification, the classification determination unit preferentially determines, as the classification to which the document to be classified belongs, a classification for which a larger number of the classification target phrases are included in the classification criterion phrases.
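
The priority rule of claim 3 can be sketched as a simple overlap count. This is an illustrative Python sketch, not the patented implementation; the function name `classify_by_overlap`, the dictionary layout, and the tie-breaking behavior (first classification seen wins, no overlap returns `None`) are assumptions.

```python
def classify_by_overlap(target_phrases, criterion_by_class):
    """Pick the classification whose criterion phrases contain the most target phrases.

    criterion_by_class maps a classification name to its criterion phrases
    (a hypothetical in-memory stand-in for the classification database).
    """
    target = set(target_phrases)
    best_class, best_count = None, 0
    for name, criterion in criterion_by_class.items():
        # Count target phrases that also appear among this class's criterion phrases.
        count = len(target & set(criterion))
        if count > best_count:
            best_class, best_count = name, count
    return best_class
```

For example, with criterion phrases `{"printing": ["printer", "ink", "nozzle"], "display": ["pixel", "panel"]}`, the target phrases `["ink", "nozzle", "panel"]` share two phrases with "printing" and one with "display", so "printing" is chosen.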

4. The classification processing device according to claim 1, wherein, when the classification target phrases comprise N of the words and compound words, the classification determination unit represents the classification target phrases as a vector in an N-dimensional vector space, represents the classification criterion phrases as a vector in the same vector space, and performs the determination based on the distance between the two vectors.

5. The classification processing device according to claim 2, wherein, when the classification target phrases comprise N of the words and compound words, the classification determination unit represents the importance of the classification target phrases as a vector in an N-dimensional vector space, represents the importance of the classification criterion phrases as a vector in the same vector space, and performs the determination based on the distance between the two vectors.

6. The classification processing device according to claim 5, wherein the classification determination unit represents the vector X corresponding to the importance of the classification target phrases as $X = (X_1, X_2, \ldots, X_N)$ and the vector Y corresponding to the importance of the classification criterion phrases as $Y = (Y_1, Y_2, \ldots, Y_N)$, defines the distance D as $D = \sum_{i=1}^{N} (X_i - Y_i)(X_i - Y_i)$, and determines that the document to be classified belongs to a classification close to the classification to which the classification criterion document belongs when the distance D is smaller than a predetermined threshold value.
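
The distance test of claim 6 is a squared Euclidean distance compared against a threshold. The Python sketch below is illustrative only. The function names are hypothetical, and treating a phrase that is absent from the criterion document as having importance 0 is an assumption; the claim does not say how missing phrases are handled.

```python
def to_vector(importance, vocabulary):
    """Lay out importance values in a fixed phrase order.

    Assumption: a phrase missing from the mapping gets importance 0.
    """
    return [importance.get(phrase, 0.0) for phrase in vocabulary]


def squared_distance(x, y):
    """D = sum over i of (Xi - Yi) * (Xi - Yi), as written in the claim."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension N")
    return sum((xi - yi) * (xi - yi) for xi, yi in zip(x, y))


def is_close(target_importance, criterion_importance, threshold):
    """Return True when D is smaller than the predetermined threshold."""
    vocabulary = sorted(target_importance)  # the N classification target phrases
    x = to_vector(target_importance, vocabulary)
    y = to_vector(criterion_importance, vocabulary)
    return squared_distance(x, y) < threshold
```

With target importance `{"ink": 3, "nozzle": 1}` and criterion importance `{"ink": 2, "panel": 5}`, the vectors are (3, 1) and (2, 0), so D = 1 + 1 = 2, and the document counts as close for any threshold greater than 2.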

7. The classification processing device according to claim 2, wherein, when an extracted compound word is a combination of a plurality of words, a combination of a word and a compound word having fewer characters than the extracted compound word, or a combination of compound words having fewer characters than the extracted compound word, the morphological analysis unit extracts only that compound word.
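
One rough way to picture the rule of claim 7 is to discard any candidate that is contained in a longer candidate. The Python sketch below approximates "is a constituent of" by plain substring containment over the whole candidate list, which is an assumption: the claim concerns extraction at a given position in the text, whereas this simplification also drops a constituent that occurs independently elsewhere. The function name `keep_longest_compounds` is hypothetical.

```python
def keep_longest_compounds(candidates):
    """Keep only candidates that are not contained in a longer candidate."""
    # Remove duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(candidates))
    return [
        term
        for term in unique
        if not any(term != other and term in other for other in unique)
    ]
```

For instance, given the candidates 形態素, 解析, and 形態素解析, only 形態素解析 is kept.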

8. The classification processing device according to claim 2, wherein, in the morphological analysis, the morphological analysis unit extracts, as the classification target phrases to be extracted, words belonging to nouns, including at least noun phrases and sahen nouns, and to parts of speech that can be regarded as predetermined noun phrases.

9. The classification processing device according to claim 8, wherein the parts of speech that can be regarded as the predetermined noun phrases include the noun form of an adjectival verb (keiyō-dōshi) and the continuative form of an ichidan (one-step) verb.

10. The classification processing device according to claim 8, further comprising a morphological analysis reverse lookup dictionary in which the classification target phrases are registered, wherein the morphological analysis unit performs the morphological analysis based on the registered contents of the morphological analysis reverse lookup dictionary.

11. The classification processing device according to claim 10, wherein the morphological analysis unit registers a word or compound word that is not registered in the morphological analysis reverse lookup dictionary in that dictionary as an indefinite term.
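
The registration behavior of claim 11 can be sketched with an ordinary mapping standing in for the reverse lookup dictionary. This is an illustrative assumption: the source does not describe the dictionary's data structure, and the tag value `"indefinite"` and the function name `lookup_or_register` are hypothetical.

```python
INDEFINITE = "indefinite"  # hypothetical tag for terms of unknown part of speech


def lookup_or_register(dictionary, term):
    """Return the tag for term, registering unknown terms as indefinite."""
    if term not in dictionary:
        # Unknown term: add it so later analyses recognize it.
        dictionary[term] = INDEFINITE
    return dictionary[term]
```

Because the dictionary grows each time an unknown term is met, repeated analysis runs recognize more terms, which matches the learning effect described in the specification.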

12. The classification processing device according to claim 2, wherein, when a word or compound word extracted in the morphological analysis includes a predetermined symbol, the morphological analysis unit removes the symbol from the word or compound word and treats the result as the extracted word or compound word.

13. The classification processing device according to claim 2, wherein the importance calculation unit performs the importance calculation after excluding, from the extracted words or compound words, phrases predetermined to be inappropriate as classification target phrases.

14. The classification processing device according to claim 2, further comprising a standardization unit that performs predetermined standardization processing on the extracted words or compound words, wherein the importance calculation unit calculates the importance of the words or compound words after the standardization processing.

15. A method of controlling a classification processing device that comprises a classification database unit storing in advance, as classification criterion phrases, words or compound words contained in a predetermined classification criterion document in association with a specific field, the method comprising: a phrase extraction step of analyzing a document to be classified and extracting words or compound words contained in the document to be classified as classification target phrases; and a classification determination step of comparing the classification criterion phrases with the classification target phrases and determining the classification to which the document to be classified belongs.

16. The method of controlling a classification processing device according to claim 15, further comprising: a morphological analysis step of performing morphological analysis on the classification criterion document to extract the words or compound words; an importance calculation step of calculating the importance, within the classification criterion document, of each extracted word or compound word; and a criterion phrase registration step of registering each extracted word or compound word in the classification database unit as a classification criterion phrase in association with its importance and the specific field.

17. The method of controlling a classification processing device according to claim 16, wherein, when the classification target phrases comprise N of the words and compound words, the classification determination step represents the importance of the classification target phrases as a vector in an N-dimensional vector space, represents the importance of the classification criterion phrases as a vector in the same vector space, and performs the determination based on the distance between the two vectors.

18. The method of controlling a classification processing device according to claim 17, wherein the classification determination step represents the vector X corresponding to the importance of the classification target phrases as $X = (X_1, X_2, \ldots, X_N)$ and the vector Y corresponding to the importance of the classification criterion phrases as $Y = (Y_1, Y_2, \ldots, Y_N)$, defines the distance D as $D = \sum_{i=1}^{N} (X_i - Y_i)(X_i - Y_i)$, and determines that the document to be classified belongs to a classification close to the classification to which the classification criterion document belongs when the distance D is smaller than a predetermined threshold value.

19. The method of controlling a classification processing device according to claim 16, wherein, in the morphological analysis step, when an extracted compound word is a combination of a plurality of words, a combination of a word and a compound word having fewer characters than the extracted compound word, or a combination of compound words having fewer characters than the extracted compound word, only that compound word is extracted.

20. The method of controlling a classification processing device according to claim 16, wherein the morphological analysis step performs the morphological analysis based on the registered contents of a morphological analysis reverse lookup dictionary in which the classification target phrases are registered.

21. The method of controlling a classification processing device according to claim 20, wherein, in the morphological analysis step, a word or compound word that is not registered in the morphological analysis reverse lookup dictionary is registered in that dictionary as an indefinite term.

22. The method of controlling a classification processing device according to claim 16, wherein, in the morphological analysis step, when a word or compound word extracted in the morphological analysis includes a predetermined symbol, the symbol is removed from the word or compound word and the result is treated as the extracted word or compound word.

23. The method of controlling a classification processing device according to claim 16, wherein the importance calculation step performs the importance calculation after excluding, from the extracted words or compound words, phrases predetermined to be inappropriate as classification target phrases.

24. The method of controlling a classification processing device according to claim 16, wherein the importance calculation step performs the importance calculation after excluding, from the extracted words or compound words, phrases predetermined to be inappropriate for determining the classification.

25. The method of controlling a classification processing device according to claim 16, further comprising a standardization step of performing predetermined standardization processing on the extracted words or compound words, wherein the importance calculation step calculates the importance of the words or compound words after the standardization processing.

26. A control program for causing a computer to function as a classification processing device using a classification database unit that stores in advance, as classification criterion phrases, words or compound words contained in a predetermined classification criterion document in association with a specific field, the control program causing the computer to: analyze a document to be classified; extract words or compound words contained in the document to be classified as classification target phrases; compare the classification criterion phrases with the classification target phrases; and determine the classification to which the document to be classified belongs.

27. The control program according to claim 26, wherein morphological analysis is performed on the classification criterion document to extract the words or compound words, the importance of each extracted word or compound word within the classification criterion document is calculated, and each extracted word or compound word is registered in the classification database unit as a classification criterion phrase in association with its importance and the specific field.

28. The control program according to claim 26, wherein the classification is determined based on the classification criterion phrases and on those classification target phrases that are included in the classification criterion phrases.

29. The control program according to claim 28, wherein, in determining the classification, a classification for which a larger number of the classification target phrases are included in the classification criterion phrases is preferentially determined to be the classification to which the document to be classified belongs.

30. The control program according to claim 26, wherein, when the classification target phrases comprise N of the words and compound words, the classification target phrases are represented as a vector in an N-dimensional vector space, the classification criterion phrases are represented as a vector in the same vector space, and the determination is performed based on the distance between the two vectors.

31. The control program according to claim 27, wherein, when the classification target phrases comprise N of the words and compound words, the importance of the classification target phrases is represented as a vector in an N-dimensional vector space, the importance of the classification criterion phrases is represented as a vector in the same vector space, and the determination is performed based on the distance between the two vectors.

32. The control program according to claim 31, wherein the vector X corresponding to the importance of the classification target phrases is represented as $X = (X_1, X_2, \ldots, X_N)$, the vector Y corresponding to the importance of the classification criterion phrases is represented as $Y = (Y_1, Y_2, \ldots, Y_N)$, the distance D is defined as $D = \sum_{i=1}^{N} (X_i - Y_i)(X_i - Y_i)$, and when the distance D is smaller than a predetermined threshold value, it is determined that the document to be classified belongs to a classification close to the classification to which the classification criterion document belongs.

33. The control program according to claim 27, wherein, when an extracted compound word is a combination of a plurality of words, a combination of a word and a compound word having fewer characters than the extracted compound word, or a combination of compound words having fewer characters than the extracted compound word, only that compound word is extracted.

34. The control program according to claim 27, wherein, in the morphological analysis, words belonging to nouns and to parts of speech that can be regarded as predetermined noun phrases are extracted as the classification target phrases to be extracted.

35. The control program according to claim 34, wherein the noun includes a noun phrase and a sahen noun.

36. The control program according to claim 34, wherein the parts of speech that can be regarded as the predetermined noun phrases include the noun form of an adjectival verb (keiyō-dōshi) and the continuative form of an ichidan (one-step) verb.

37. The control program according to claim 34, wherein the morphological analysis is performed based on the registered contents of a morphological analysis reverse lookup dictionary in which the classification target phrases are registered.

38. The control program according to claim 37, wherein a word or compound word that is not registered in the morphological analysis reverse lookup dictionary is registered in that dictionary as an indefinite term.

39. The control program according to claim 37, wherein, when a word or compound word extracted in the morphological analysis includes a predetermined symbol, the symbol is removed from the word or compound word and the result is treated as the extracted word or compound word.

40. The control program according to claim 37, wherein the importance calculation is performed after excluding, from the extracted words or compound words, phrases predetermined to be inappropriate as classification target phrases.

41. The control program according to claim 37, wherein predetermined standardization processing is performed on the extracted words or compound words, and the importance is calculated for the words or compound words after the standardization processing.

42. A recording medium on which the control program according to any one of claims 26 to 41 is recorded.