JP2005063298A

JP2005063298A - Document processing unit and method

Info

Publication number: JP2005063298A
Application number: JP2003295182A
Authority: JP
Inventors: Yayoi Shibata; 弥生柴田; Hiroshi Umeki; 宏梅基
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2003-08-19
Filing date: 2003-08-19
Publication date: 2005-03-10
Anticipated expiration: 2023-08-19
Also published as: JP4552401B2

Abstract

PROBLEM TO BE SOLVED: To properly extract a label and keywords for a document group.

SOLUTION: A phrase extracting part 1 extracts important phrases from each sentence by morphological analysis. A phrase importance score calculating part 2 calculates a phrase importance score for each phrase. An inclusive relation analyzing part 3 creates a table indicating the inclusive relations among the important phrases. A label extraction score calculating part 4 newly calculates a label extraction score from the phrase importance score of each phrase so that the label extraction score of an included phrase is higher than that of the phrase containing it. A keyword extraction score calculating part 5 calculates a keyword extraction score by adjusting the phrase importance score so that the containing phrase is extracted as a keyword. A label selecting part 6 selects the phrase with the highest label extraction score as the label. A keyword selecting part 7 selects several of the top-scoring phrases by keyword extraction score as keywords.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

本発明は、文書集合の内容を一語で表すラベルと、文書集合の概要を表すキーワードを抽出するための文書処理に関するものである。 The present invention relates to document processing for extracting a label representing the contents of a document set in one word and a keyword representing an outline of the document set.

近年、文書の電子化が進み、大量の文書が電子化されて公開されたり、あるいは共有されたりしている。このような大量の電子文書から、ユーザが必要とする文書を探し出すことは非常に困難である。そのため、ユーザの必要とする文書を大量文書から探し出すための様々な方法が考えられている。 In recent years, digitization of documents has progressed, and a large number of documents have been digitized and released or shared. It is very difficult to find a document that the user needs from such a large amount of electronic documents. For this reason, various methods for searching for documents required by the user from a large number of documents have been considered.

その一つに、文書の内容によって文書を分類し、文書群としてグループ化することで、文書群を処理するという方法がある。文書を文書群に分類するには、人手で分類するほか、自動的に文書を分類する方法や、検索結果として文書群が得られる場合などがある。 One such method classifies documents by content and groups them into document groups, which are then processed as units. Documents may be classified into groups manually or automatically, or a document group may be obtained as a search result.

このように人手や自動的に分類された文書群には、他の文書群と区別するために、文書群の重要な語句をキーワードとして抽出して出力表示したり、文書群の内容を要約して表示したり、文書群に名前やラベルをつけたりするなどが行われている。 To distinguish a document group classified manually or automatically in this way from other document groups, important phrases in the group are extracted and displayed as keywords, the contents of the group are summarized and displayed, or a name or label is attached to the group.

しかし、キーワードを表示するだけでは、文書群の内容を一言で表現するのが困難であり、またラベルだけでは、全体の内容が具体的に何であるかを掴みにくい。 However, it is difficult to express the contents of a group of documents in a single word only by displaying a keyword, and it is difficult to grasp what the entire contents are specifically by using only a label.

それを解消する手段として、文書群の名前を表すラベルと、文書群の内容を表す複数のキーワードを付与することが行われている。この場合、一般的に文書群中の文書から抽出した単語の出現頻度を算出して、キーワード候補の語句を出力し、その中で出現回数の最も多い単語をラベルとし、残りの単語の上位いくつかをキーワードとする方法が取られている。 To solve this problem, a label giving the name of the document group and a plurality of keywords describing its contents are assigned together. In this case, the usual approach is to compute the frequency of occurrence of the words extracted from the documents in the group, output the keyword candidates, use the most frequent word as the label, and use the top several of the remaining words as keywords.

例えば、株式会社ジャストシステムのＣｏｎｃｅｐｔＢａｓｅＣｌｕｓｔｅｒｉｎｇ（商標）では、キーワードの中でトップのものをラベルとして選択している。 For example, in ConceptBase Clustering (trademark) of Just System Co., Ltd., the top one of the keywords is selected as the label.

しかし、ラベルとキーワードでは持っている役割が異なる。ラベルというのは文書群全体を表すもので、文書群に現れる重要概念に共通する概念であることが相応しく、キーワードの先頭のものがふさわしいとは限らない。一方、キーワードというのは文書群の内容をユーザにわかりやすく説明する役割があり、より具体的な語句が相応しい。 However, labels and keywords play different roles. A label represents the document group as a whole, so it should be a concept common to the important concepts that appear in the group; the top-ranked keyword is not necessarily suitable. A keyword, on the other hand, serves to explain the contents of the document group to the user in an easily understood way, so a more specific phrase is suitable.

従って、ラベルとキーワードの抽出方法もそれぞれ変える必要がある。 Therefore, it is necessary to change the method of extracting labels and keywords.

なお、単語をクラスタリングし、各クラスタを、最も重要度の高い主キーワードとともに表示することが特許文献１に記載されているが、これは文書群のラベルを抽出するものではない。 Although Patent Document 1 describes clustering words and displaying each cluster together with its most important main keyword, it does not extract a label for a document group.
特開２００１−３２５２７２ JP 2001-325272 A

この発明は、以上の事情を考慮してなされたものであり、
提供することを目的としている。 This invention was made in consideration of the above circumstances,
It is intended to provide.

本発明は、上述した従来技術の問題を解決するためになされたものであり、文書群に対して、文書の内容を１単語で表すラベルと、文書の内容をより詳しく説明するためのキーワードを、それぞれに適した、異なる方法を用いて抽出することを可能にした、文書処理装置を提供することを目的とするものである。 The present invention has been made in order to solve the above-described problems of the prior art, and for a document group, a label representing the contents of a document in one word and a keyword for explaining the contents of the document in more detail are provided. It is an object of the present invention to provide a document processing apparatus that enables extraction using different methods suitable for each.

この発明によれば、上述の目的を達成するために、特許請求の範囲に記載のとおりの構成を採用している。ここでは、発明を詳細に説明するのに先だって、特許請求の範囲の記載について補充的に説明を行なっておく。 According to this invention, in order to achieve the above-mentioned object, the configuration as described in the claims is adopted. Here, prior to describing the invention in detail, supplementary explanations of the claims will be given.

すなわち、この発明の一側面によれば、上述の目的を達成するために、文書集合の内容を一語で表すラベルと、前記文書集合の概要を表す１個以上のキーワードを抽出する文書処理装置に：文書集合の各文書からラベルとキーワードの候補となる語句を抽出する語句抽出手段と；前記語句の重要度を表す語句重要度スコアをそれぞれの語句ごとに算出する語句重要度スコア計算手段と；前記語句の間の形態上の包含関係を解析する包含関係解析手段と；前記語句重要度スコアを、前記包含関係解析手段によって解析された包含関係に基づいて、他の語句に包含される語句に対し加点が行われるように調整してラベル抽出スコアを算出するラベル抽出スコア計算手段と；前記語句重要度スコアを、前記包含関係解析手段によって解析された包含関係に基づいて、他の語句を包含する語句に対し減点が行われるように調整してキーワード抽出スコアを算出するキーワード抽出スコア計算手段と；前記ラベル抽出スコアに従って前記語句の中からラベルを１つ選択するラベル選択手段と；前記キーワード抽出スコアに基づいてキーワードを選択するキーワード選択手段とを設けるようにしている。 That is, according to one aspect of the present invention, in order to achieve the above object, a document processing apparatus that extracts a label representing the contents of a document set in one word and one or more keywords representing an outline of the document set is provided with: phrase extracting means for extracting, from each document in the document set, phrases that are candidates for the label and keywords; phrase importance score calculating means for calculating, for each phrase, a phrase importance score representing its importance; inclusion relation analyzing means for analyzing morphological inclusion relations between the phrases; label extraction score calculating means for calculating a label extraction score by adjusting the phrase importance score, based on the inclusion relations analyzed by the inclusion relation analyzing means, so that points are added to a phrase included in another phrase; keyword extraction score calculating means for calculating a keyword extraction score by adjusting the phrase importance score, based on the inclusion relations analyzed by the inclusion relation analyzing means, so that points are deducted from a phrase that includes another phrase; label selecting means for selecting one label from among the phrases according to the label extraction score; and keyword selecting means for selecting keywords based on the keyword extraction score.

この構成においては、語句の形態上の包含関係を基準にして包含される語句がラベルに選定される尤度を高くし、また包含する語句がキーワードに選定される尤度を高くし、もって、重要概念に共通する概念である語句がラベルに選ばれやすくし、具体的な意味合いの語句がキーワードに選ばれやすくすることができる。 In this configuration, using the morphological inclusion relation between phrases as the criterion, an included phrase becomes more likely to be selected as the label and a containing phrase becomes more likely to be selected as a keyword. As a result, a phrase expressing a concept common to the important concepts tends to be chosen as the label, and phrases with specific meanings tend to be chosen as keywords.

前記包含関係解析手段は、例えば、文字列の包含関係や、単語列の包含関係を解析するものである。 The inclusion relation analyzing means analyzes, for example, the inclusion relation of character strings and the inclusion relation of word strings.
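The two kinds of morphological inclusion mentioned here can be contrasted with a minimal sketch. The function names and the English sample phrases are assumptions made for illustration; the patent does not prescribe an implementation.

```python
def contains_chars(outer, inner):
    """True if `inner` occurs inside `outer` as a non-empty proper substring."""
    return bool(inner) and inner != outer and inner in outer


def contains_words(outer_tokens, inner_tokens):
    """True if `inner_tokens` occurs as a contiguous, strictly shorter run in `outer_tokens`."""
    n, m = len(outer_tokens), len(inner_tokens)
    if m == 0 or m >= n:
        return False
    return any(outer_tokens[i:i + m] == inner_tokens for i in range(n - m + 1))


# Character-level containment matches inside a word; word-level containment does not.
print(contains_chars("party", "art"))                          # character string inclusion
print(contains_words(["party"], ["art"]))                      # word sequence inclusion
print(contains_words(["nuclear", "weapons"], ["weapons"]))     # word sequence inclusion
```

Character-string inclusion suits unsegmented Japanese text, while word-sequence inclusion avoids accidental matches inside a longer word when a tokenizer is available.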

また、前記キーワード選択手段は、包含される語句はキーワードとして選択しないようにしてもよい。 The keyword selection means may also be configured not to select included phrases as keywords.

また、キーワード中のラベルに相当する部分を他の部分と区別して表示するようにしてもよい。 Further, a portion corresponding to the label in the keyword may be displayed separately from other portions.
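One way to picture this display option is the sketch below, which brackets the label wherever it appears inside a keyword. The function name `mark_label` and the bracket markers are hypothetical; the patent does not say how the label portion is visually distinguished.

```python
def mark_label(keyword, label, left="[", right="]"):
    """Return `keyword` with every occurrence of `label` wrapped in markers."""
    if not label or label not in keyword:
        return keyword
    return keyword.replace(label, left + label + right)


print(mark_label("大量破壊兵器", "兵器"))
print(mark_label("査察", "兵器"))
```

In a real display the markers would more likely be a font or color change than literal brackets.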

また、この文書処理装置に、文書群を分類する手段を付加するようにしてもよい。 Further, means for classifying a document group may be added to the document processing apparatus.

また、本発明の他の側面によれば、文書集合の内容を一語で表すラベルを抽出する文書処理装置に：文書集合の各文書からラベルの候補となる語句を抽出する語句抽出手段と；前記語句の重要度を表す語句重要度スコアをそれぞれの語句ごとに算出する語句重要度スコア計算手段と；前記語句の間の包含関係を解析する包含関係解析手段と；前記語句重要度スコアを、前記包含関係解析手段によって解析された包含関係に基づいて、調整してラベル抽出スコアを算出するラベル抽出スコア計算手段と；前記ラベル抽出スコアに基づいてラベルを選択するラベル選択手段とを設けるようにしている。 According to another aspect of the present invention, a document processing apparatus that extracts a label representing the contents of a document set in one word is provided with: phrase extracting means for extracting, from each document in the document set, phrases that are candidates for the label; phrase importance score calculating means for calculating, for each phrase, a phrase importance score representing its importance; inclusion relation analyzing means for analyzing inclusion relations between the phrases; label extraction score calculating means for calculating a label extraction score by adjusting the phrase importance score based on the inclusion relations analyzed by the inclusion relation analyzing means; and label selecting means for selecting a label based on the label extraction score.

この構成においても、語句の包含関係に基づいて適切にラベルとなる語句を選定できる。 Even in this configuration, it is possible to appropriately select a word / phrase as a label based on the inclusion relation of the word / phrase.

前記語句の間の包含関係は語句の形態上の包含関係や、意味上の包含関係である。語句の形態上の包含関係は、例えば文字列の包含関係や、単語列の包含関係である。意味上の包含関係は例えば辞書を用いて解析できる。 The inclusion relation between the phrases is a morphological (surface-form) inclusion relation or a semantic inclusion relation. A morphological inclusion relation is, for example, an inclusion relation between character strings or between word sequences. A semantic inclusion relation can be analyzed using a dictionary, for example.

なお、この発明は装置またはシステムとして実現できるのみでなく、方法としても実現可能である。また、そのような発明の一部をソフトウェアとして構成することができることはもちろんである。またそのようなソフトウェアをコンピュータに実行させるために用いるソフトウェア製品もこの発明の技術的な範囲に含まれることも当然である。 The present invention can be realized not only as an apparatus or a system but also as a method. Of course, a part of the invention can be configured as software. Of course, software products used to cause a computer to execute such software are also included in the technical scope of the present invention.

この発明の上述の側面および他の側面は特許請求の範囲に記載され以下実施例を用いて詳述される。 These and other aspects of the invention are set forth in the appended claims and will be described in detail below with reference to examples.

本発明によれば、１ないし複数の文書からなる文書群から、適切なラベルやキーワードを抽出することができるという効果がある。このようなラベルやキーワードを付与することによって、文書群を正しく識別することが可能となる。 According to the present invention, there is an effect that appropriate labels and keywords can be extracted from a document group including one or a plurality of documents. By assigning such labels and keywords, it is possible to correctly identify a document group.

以下、この発明の実施例について説明する。 Examples of the present invention will be described below.

図１は、本発明の実施例１の文書処理装置を全体として示すブロック図である。図１において、文書処理装置は、語句抽出部１、語句重要度スコア計算部２、包含関係解析部３、ラベル抽出スコア計算部４、キーワード抽出スコア計算部５、ラベル選択部６、キーワード選択部７、表示出力部８等を含んで構成されている。具体的には、これら各部の機能を実現するコンピュータプログラムを所定のコンピュータあるいはコンピュータ群により実行する。もちろん、これら各部の一部または全部をハードウェアにより構成してもよい。 FIG. 1 is a block diagram showing the entire document processing apparatus according to the first embodiment of the present invention. In FIG. 1, the document processing apparatus includes a phrase extraction unit 1, a phrase importance score calculation unit 2, an inclusion relation analysis unit 3, a label extraction score calculation unit 4, a keyword extraction score calculation unit 5, a label selection unit 6, a keyword selection unit 7, a display output unit 8, and the like. Specifically, a computer program that realizes the functions of these units is executed by a predetermined computer or group of computers. Of course, some or all of these units may be implemented in hardware.

なお、本実施例においては、あらかじめ少なくともテキストを含む文書からなる文書集合が構成されているものとする。これらの文書群は、自動的に分類された結果や、もしくは検索された結果などによって取得されたものである。 In this embodiment, it is assumed that a document set including documents including at least text is configured in advance. These document groups are obtained by automatically classified results or retrieved results.

語句抽出部１は、文書群中の各文書について、テキストの形態素解析を行い、各文から重要語句を抽出する。これらの語句は形態素解析によって抽出するのではなく、他の方法を使用して抽出しても構わない。 The phrase extraction unit 1 performs morphological analysis on the text of each document in the document group and extracts important phrases from each sentence. These phrases need not be extracted by morphological analysis; other methods may be used instead.

語句重要度スコア計算部２は、語句抽出部１で得られた語句に対し、語句の重要度を示す語句重要度スコアをそれぞれの語句ごとに計算する。語句重要度スコアの計算方法としては、従来から利用されているｔｆｉｄｆ（ＴｅｒｍＦｒｅｑｕｅｎｃｙ／ＩｎｖｅｒｓｅＤｏｃｕｍｅｎｔＦｒｅｑｕｅｎｃｙ）法を使うこともできるが、これに限定する必要はない。 The phrase importance score calculation unit 2 calculates a phrase importance score indicating the importance of the phrase for each phrase with respect to the phrase obtained by the phrase extraction unit 1. As a method for calculating the phrase importance score, a conventionally used tfidf (Term Frequency / Inverse Document Frequency) method can be used, but the method is not limited to this.
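As a rough illustration of a tf-idf style phrase importance score, the sketch below scores each phrase of a document group against a larger collection. The function name, the smoothed idf variant, and the toy data are assumptions; the patent only names the tf-idf method and does not fix a particular formula.

```python
import math
from collections import Counter


def tfidf_scores(group_docs, all_docs):
    """Score phrases of a document group by relative frequency times smoothed idf.

    group_docs, all_docs: lists of documents, each document a list of phrases.
    """
    tf = Counter(p for doc in group_docs for p in doc)       # occurrences within the group
    total = sum(tf.values())
    df = Counter(p for doc in all_docs for p in set(doc))    # documents containing each phrase
    n_docs = len(all_docs)
    return {
        p: (count / total) * math.log((1 + n_docs) / (1 + df[p]))
        for p, count in tf.items()
    }


group = [["a", "b"], ["a", "c"]]
collection = group + [["b", "d"], ["b"]]
scores = tfidf_scores(group, collection)
print(sorted(scores, key=scores.get, reverse=True))
```

Phrases that are frequent inside the group but rare across the whole collection receive the highest scores, which is the property the importance score is meant to capture.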

例えば、ある文書群に含まれる要素単語の重みを、対象文書全体に対する文書群の相互情報量を応用した以下の式で表し、語句重要度スコアとすることもできる。

For example, the weight of an element word included in a certain document group can be expressed by the following formula applying the mutual information amount of the document group with respect to the entire target document, and can also be used as a phrase importance score.

包含関係解析部３は、語句の包含関係を解析し、重要語句の包含関係を示すテーブルを作成する。 The inclusion relationship analysis unit 3 analyzes the inclusion relationship of words and creates a table indicating the inclusion relationship of important words.

ラベル抽出スコア計算部４は、包含関係解析部３での語句の包含関係をうけ、包含関係にある語句のうち、包含される語句のラベル抽出スコアが、包含する語句のラベル抽出スコアよりも高くなるように、各語句の語句重要度スコアからラベル抽出スコアを新たに算出する。 The label extraction score calculation unit 4 receives the inclusion relations of the phrases from the inclusion relation analysis unit 3 and newly calculates a label extraction score from the phrase importance score of each phrase so that, among phrases in an inclusion relation, the label extraction score of the included phrase is higher than that of the containing phrase.

キーワード抽出スコア計算部５は、包含関係解析部３での語句の包含関係をうけ、包含関係にある語句のうち、包含する語句をキーワードとして抽出するように、語句重要度スコアを調整し、キーワード抽出スコアを算出する。 The keyword extraction score calculation unit 5 receives the inclusion relations of the phrases from the inclusion relation analysis unit 3 and calculates a keyword extraction score by adjusting the phrase importance score so that, among phrases in an inclusion relation, the containing phrase is extracted as a keyword.

ラベル選択部６は、ラベル抽出スコア計算部４で算出されたラベル抽出スコアのうち、最もスコアの高いものをラベルとして選択する。 The label selection unit 6 selects, as the label, the phrase with the highest label extraction score among those calculated by the label extraction score calculation unit 4.

キーワード選択部７は、キーワード抽出スコア計算部５で算出されたキーワード抽出スコアのうち、スコアの高い上位いくつかをキーワードとして選択する。 The keyword selection unit 7 selects, as keywords, several of the top-scoring phrases by the keyword extraction scores calculated by the keyword extraction score calculation unit 5.

表示出力部８は、ラベル選択部６で選択されたラベルと、キーワード選択部７で選択されたキーワードを表示出力する。 The display output unit 8 displays and outputs the label selected by the label selection unit 6 and the keyword selected by the keyword selection unit 7.

文書群内の各文書が入力されると、まず語句抽出部１が、文書中のテキストの語句抽出を行い、重要語句を抽出する。抽出する語句は、自立語だけ、もしくは名詞だけを抽出してもよい。これらの語句は形態素解析によって抽出するなどの方法があるが、他の方法を使用しても構わない。抽出したこれらの語句を重要語句とする。 When each document in the document group is input, the phrase extraction unit 1 first extracts a phrase from the text in the document and extracts an important phrase. As a word to be extracted, only an independent word or only a noun may be extracted. There are methods such as extracting these words by morphological analysis, but other methods may be used. These extracted phrases are designated as important phrases.

次に語句重要度スコア計算部２で、抽出された各語句の語句重要度スコアが算出される。重要語句と語句重要度スコアの一例を図３に表す。図３では、「フセイン」「大量破壊兵器」「査察」「核兵器」「フセイン政権」「国連」「兵器」が語句として抽出され、各語に対し語句重要度スコアが、「０．４，０．３，０．２，０．２，０．１，０．１，０．１」というように与えられている。 Next, the phrase importance score calculation unit 2 calculates the phrase importance score of each extracted phrase. An example of important phrases and phrase importance scores is shown in FIG. 3. In FIG. 3, “Hussein”, “weapons of mass destruction”, “inspection”, “nuclear weapons”, “Hussein administration”, “United Nations”, and “weapons” are extracted as phrases, and the phrase importance scores 0.4, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1 are given to them, respectively.

次に包含関係解析部３で、重要語句の包含関係を解析する。この時、包含関係にある単語関係を保持しておくテーブルを作成する。図３の例を対象に包含関係解析部で作成したテーブルを図４に示す。 Next, the inclusion relationship analysis unit 3 analyzes the inclusion relationship of the important words. At this time, a table for holding the word relationship in the inclusion relationship is created. FIG. 4 shows a table created by the inclusion relationship analysis unit for the example of FIG.
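A minimal sketch of such a table, using character-string containment on the phrases of FIG. 3, might look as follows. The dictionary layout (included phrase mapped to the phrases that contain it) and the function name are assumptions; the actual layout of FIG. 4 is not reproduced in this text.

```python
def build_inclusion_table(phrases):
    """Map each included phrase to the list of phrases that contain it."""
    table = {}
    for inner in phrases:
        containers = [outer for outer in phrases if outer != inner and inner in outer]
        if containers:
            table[inner] = containers
    return table


fig3_phrases = ["フセイン", "大量破壊兵器", "査察", "核兵器", "フセイン政権", "国連", "兵器"]
for inner, containers in build_inclusion_table(fig3_phrases).items():
    print(inner, "is included in", containers)
```

Phrases with no inclusion relation, such as 査察 and 国連, simply do not appear in the table.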

ラベル抽出スコア計算部４では、包含関係解析部３で作成したテーブルをもとに、包含関係にある語句のうち、包含される語句のラベル抽出スコアが包含する語句のラベル抽出スコアよりも高くなるように各単語の語句重要度スコアを調整して、ラベル抽出スコアを算出する。ここでは、包含する語句の語句重要度スコアと包含される語句の語句重要度スコアを加算することで、ラベル抽出スコアを算出する。 Based on the table created by the inclusion relation analysis unit 3, the label extraction score calculation unit 4 adjusts the phrase importance score of each phrase so that, among phrases in an inclusion relation, the label extraction score of the included phrase is higher than that of the containing phrase, and thereby calculates the label extraction score. Here, the label extraction score is calculated by adding the phrase importance score of the containing phrase to the phrase importance score of the included phrase.

図２−Ａに、ラベル抽出スコアを算出するフロチャートを示す。 FIG. 2-A shows a flowchart for calculating the label extraction score.

ラベル抽出スコア計算部４は、包含関係解析部３で作成したテーブルをもとに、語句のラベル抽出スコアを算出する。包含関係にある語句の場合、包含される語句の語句重要度スコアと、包含する語句の語句重要度スコアを加算し、包含される語句のラベル抽出スコアとする（Ｓ１３、Ｓ１４）。図２−Ａの各ステップＳ１０〜Ｓ１６は図から明らかであるので、詳細な説明は行わない。 The label extraction score calculation unit 4 calculates the label extraction score of each phrase based on the table created by the inclusion relation analysis unit 3. For phrases in an inclusion relation, the phrase importance score of the included phrase and the phrase importance score of the containing phrase are added, and the sum becomes the label extraction score of the included phrase (S13, S14). Steps S10 to S16 in FIG. 2-A are clear from the figure and are not described in detail.

図３の例を対象にラベル抽出スコアを算出すると、フセインとフセイン政権が包含関係にあるので、包含する語句であるフセイン政権の語句重要度スコア０．１と包含される語句のフセインの語句重要度スコア０．４に加算する。その結果、フセインのラベル抽出スコアが０．５になる。 When the label extraction score is calculated for the example of FIG. 3, “Hussein” and “Hussein administration” are in an inclusion relation, so the phrase importance score 0.1 of the containing phrase “Hussein administration” is added to the phrase importance score 0.4 of the included phrase “Hussein”. As a result, the label extraction score of “Hussein” becomes 0.5.

同様に、兵器と大量破壊兵器、核兵器も包含関係にあるので、包含される語句の兵器の語句重要度スコアと、大量破壊兵器と核兵器の語句重要度スコアを加算する。その結果、兵器のラベル抽出スコアが０．６になる。 Similarly, “weapons” is included in both “weapons of mass destruction” and “nuclear weapons”, so the phrase importance score of the included phrase “weapons” (0.1) is added to those of “weapons of mass destruction” (0.3) and “nuclear weapons” (0.2). As a result, the label extraction score of “weapons” becomes 0.6.

「査察」、「国連」などのようにその語が包含されない場合は、語句重要度スコアのスコアをラベル抽出スコアに与える。このようにして、ラベル抽出スコアを算出した結果が、図５である。 When a phrase is not included in any other phrase, as with “inspection” and “United Nations”, its phrase importance score is used unchanged as its label extraction score. The label extraction scores calculated in this way are shown in FIG. 5.
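The additive rule described above can be sketched as follows, using the FIG. 3 scores and the inclusion relations stated in the text. The function name and data layout are assumptions, and how chains of nested inclusion should be handled is not specified in the source, so the sketch only adds the scores of direct containers listed in the table.

```python
def label_extraction_scores(importance, inclusion_table):
    """Add the importance of every containing phrase to each included phrase."""
    scores = dict(importance)  # phrases without an inclusion relation keep their score
    for inner, containers in inclusion_table.items():
        scores[inner] = importance[inner] + sum(importance[o] for o in containers)
    return scores


importance = {"フセイン": 0.4, "大量破壊兵器": 0.3, "査察": 0.2, "核兵器": 0.2,
              "フセイン政権": 0.1, "国連": 0.1, "兵器": 0.1}
inclusion_table = {"フセイン": ["フセイン政権"], "兵器": ["大量破壊兵器", "核兵器"]}

scores = label_extraction_scores(importance, inclusion_table)
label = max(scores, key=scores.get)
print(label, round(scores[label], 2))
```

Because 兵器 collects the scores of both phrases that contain it, it overtakes フセイン even though its own importance score is the lowest in the list.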

キーワード抽出スコア計算部５では、語句抽出部１で抽出した重要語句の、キーワード抽出スコアの算出を行う。 The keyword extraction score calculation unit 5 calculates the keyword extraction scores of the important phrases extracted by the phrase extraction unit 1.

図２−Ｂにキーワード抽出スコアを算出するフロチャートを示す。 FIG. 2-B shows a flowchart for calculating the keyword extraction score.

キーワード抽出スコア算出手段は、包含関係解析部３で作成したテーブルをもとに、語句のキーワード抽出スコアを算出する。包含関係にある語句の場合、包含される語句と包含する語句の語句重要度スコアのうち、高い方のスコアを包含する語句のキーワード抽出スコアとする（Ｓ２３、Ｓ２４）。図２−Ｂの各ステップＳ２０〜Ｓ２６も図から明らかであるので、詳細な説明は行わない。 The keyword extraction score calculation means calculates the keyword extraction score of each phrase based on the table created by the inclusion relation analysis unit 3. For phrases in an inclusion relation, the higher of the phrase importance scores of the included phrase and the containing phrase becomes the keyword extraction score of the containing phrase (S23, S24). Steps S20 to S26 in FIG. 2-B are also clear from the figure and are not described in detail.

キーワードとなる語句は、文書群の内容をできるだけ具体的に示すものが相応しいため、キーワード抽出スコア計算部５では、包含する語句と包含される語句のスコアのうち、高い方のスコアを包含する語句に付与し、キーワード抽出スコアを算出する。 Because a keyword should indicate the contents of the document group as specifically as possible, the keyword extraction score calculation unit 5 assigns the higher of the scores of the containing phrase and the included phrase to the containing phrase and thereby calculates the keyword extraction score.

図６は、図３の例を対象にキーワード抽出スコアを算出したものである。図６では、重要語句として、「フセイン」「大量破壊兵器」「査察」「核兵器」「フセイン政権」「国連」「兵器」があげられ、各語に対し語句重要度スコアが、「０．４，０．３，０．２，０．２，０．１，０．１、０．１」というように与えられている。 FIG. 6 shows the keyword extraction scores calculated for the example of FIG. 3. In FIG. 6, “Hussein”, “weapons of mass destruction”, “inspection”, “nuclear weapons”, “Hussein administration”, “United Nations”, and “weapons” are listed as important phrases, and the phrase importance scores 0.4, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1 are given to them, respectively.

ここでは、「フセイン政権」が「フセイン」を包含する関係にある。この場合、より具体的な語をキーワードとして選択するために、「フセイン政権」のキーワード抽出スコアに「フセイン」と「フセイン政権」の語句重要度スコアの高い方（０．４）を付与する。「大量破壊兵器」と「兵器」、「核兵器」と「兵器」も包含関係にある。「大量破壊兵器」は「兵器」を包含する語句であり、「大量破壊兵器」と「兵器」の語句重要度スコアの高い方（０．３）を「大量破壊兵器」のキーワード抽出スコアに与える。「核兵器」と「兵器」も同様で、「核兵器」と「兵器」の語句重要度スコアの高い方（０．２）を「核兵器」のキーワード抽出スコアに与える。 Here, “Hussein administration” contains “Hussein”. So that the more specific phrase is selected as a keyword, the higher (0.4) of the phrase importance scores of “Hussein” and “Hussein administration” is assigned as the keyword extraction score of “Hussein administration”. “Weapons of mass destruction” and “weapons”, and “nuclear weapons” and “weapons”, are also in inclusion relations. “Weapons of mass destruction” contains “weapons”, so the higher (0.3) of their phrase importance scores is given as the keyword extraction score of “weapons of mass destruction”. Likewise, the higher (0.2) of the phrase importance scores of “nuclear weapons” and “weapons” is given as the keyword extraction score of “nuclear weapons”.
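The "higher of the two scores" rule can be sketched as below with the same FIG. 3 data. The function name is hypothetical, and leaving the included phrase's own score unchanged is an assumption, since the text only states what the containing phrase receives.

```python
def keyword_extraction_scores(importance, inclusion_table):
    """Give each containing phrase the higher of its own and its included phrase's score."""
    scores = dict(importance)  # assumption: included phrases keep their original score
    for inner, containers in inclusion_table.items():
        for outer in containers:
            scores[outer] = max(scores[outer], importance[inner])
    return scores


importance = {"フセイン": 0.4, "大量破壊兵器": 0.3, "査察": 0.2, "核兵器": 0.2,
              "フセイン政権": 0.1, "国連": 0.1, "兵器": 0.1}
inclusion_table = {"フセイン": ["フセイン政権"], "兵器": ["大量破壊兵器", "核兵器"]}

for phrase, score in keyword_extraction_scores(importance, inclusion_table).items():
    print(phrase, score)
```

Only フセイン政権 changes in this example (from 0.1 to 0.4), because the other containing phrases already score higher than 兵器.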

ラベル選択部６では、ラベル抽出スコアが最も高い「兵器」をラベルとして選択する。 The label selection unit 6 selects “weapon” having the highest label extraction score as a label.

キーワード選択部７では、ラベルの「兵器」をキーワードとして選択しないため、「フセイン」「大量破壊兵器」「査察」「核兵器」「フセイン政権」「国連」をキーワードとして抽出する。 Because the keyword selection unit 7 does not select the label “weapons” as a keyword, it extracts “Hussein”, “weapons of mass destruction”, “inspection”, “nuclear weapons”, “Hussein administration”, and “United Nations” as keywords.

キーワードの抽出では、すべてをキーワードにしてもいいし、あらかじめ定めた個数あるいはあらかじめ定めたスコア以上のものをキーワードとして抽出してもよい。また、キーワード選択部７は、包含関係にある語句がある場合、包含する語句のみを選択してもよい。この例の場合「フセイン」は選択されなくなる。 In keyword extraction, all of the phrases may be used as keywords, or only a predetermined number of phrases, or only phrases at or above a predetermined score, may be extracted as keywords. In addition, when phrases are in an inclusion relation, the keyword selection unit 7 may select only the containing phrase. In this example, “Hussein” is then no longer selected.
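The selection options listed here (exclude the label, cap the count, apply a score threshold, drop included phrases) can be combined in one small function. This is a sketch under stated assumptions: the parameter names `top_n`, `min_score`, and `drop_included` are hypothetical, and the keyword extraction scores are the ones derived for the FIG. 3 example.

```python
def select_keywords(keyword_scores, label, inclusion_table=None,
                    top_n=None, min_score=None, drop_included=False):
    """Pick keywords by score, never returning the phrase chosen as the label."""
    candidates = [p for p in keyword_scores if p != label]
    if drop_included and inclusion_table:
        # Keep only phrases that are not included in some other phrase.
        candidates = [p for p in candidates if p not in inclusion_table]
    if min_score is not None:
        candidates = [p for p in candidates if keyword_scores[p] >= min_score]
    candidates.sort(key=lambda p: keyword_scores[p], reverse=True)
    return candidates if top_n is None else candidates[:top_n]


keyword_scores = {"フセイン": 0.4, "大量破壊兵器": 0.3, "査察": 0.2, "核兵器": 0.2,
                  "フセイン政権": 0.4, "国連": 0.1, "兵器": 0.1}
inclusion_table = {"フセイン": ["フセイン政権"], "兵器": ["大量破壊兵器", "核兵器"]}

print(select_keywords(keyword_scores, "兵器"))
print(select_keywords(keyword_scores, "兵器", inclusion_table, drop_included=True))
```

With `drop_included=True` the included phrase フセイン disappears from the result, matching the behaviour described for this example.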

語句重要度スコア、ラベル抽出スコア、キーワード抽出スコアの変化を、図７に示す。 FIG. 7 shows changes in the phrase importance score, the label extraction score, and the keyword extraction score.

このようにして抽出されたラベルおよびキーワードは、表示出力部８により表示出力される。出力方法の１例を図８−Ａ、図８−Ｂ、図８−Ｃに示す。 The labels and keywords extracted in this way are displayed and output by the display output unit 8. An example of the output method is shown in FIGS. 8-A, 8-B, and 8-C.

以上、本発明の実施例１として一つの文書群に対してラベルおよびキーワードを付与する方法を説明した。 The method for assigning labels and keywords to one document group has been described above as the first embodiment of the present invention.

次に、本発明の実施例２として、文書群をクラスタリングしラベルとキーワードを付与する場合について説明する。 Next, as a second embodiment of the present invention, a case will be described in which a group of documents is clustered and labels and keywords are assigned.

図９は実施例２の文書処理装置を全体として示すブロック図である。図９において、文書処理装置は、語句抽出部１、語句重要度スコア計算部２、包含関係解析部３、ラベル抽出スコア計算部４、キーワード抽出スコア計算部５、ラベル選択部６、キーワード選択部７、表示出力部８、ラベル・キーワード保持部９等を含んで構成されている。 FIG. 9 is a block diagram showing the entire document processing apparatus according to the second embodiment. In FIG. 9, the document processing apparatus includes a phrase extraction unit 1, a phrase importance score calculation unit 2, an inclusion relation analysis unit 3, a label extraction score calculation unit 4, a keyword extraction score calculation unit 5, a label selection unit 6, and a keyword selection unit. 7, a display output unit 8, a label / keyword holding unit 9, and the like.

本実施例では、文書群がクラスタリングされると、ラベル・キーワード保持部９によって、各クラスタに対して、ラベルとキーワード集合を保持する領域が確保される。 In this embodiment, when the document group is clustered, the label / keyword holding unit 9 secures an area for holding a label and a keyword set for each cluster.

１番目のクラスタの各文書が入力されると、まず語句抽出部１において、文書中のテキストの語句抽出が行われ、重要語句が抽出される。 When each document of the first cluster is input, the phrase extraction unit 1 first extracts the phrase of the text in the document and extracts the important phrase.

次に語句重要度スコア計算部２で、抽出された各語句の語句重要度スコアが算出される。重要語句と語句重要度スコアの一例を図１０に表す。図１０では、「省エネルギー」「消費電力」「プリンタ」「消費」「高画質」「環境」「エネルギー」が語句として抽出され、各語に対し語句重要度スコアが、０．４，０．４，０．３，０．２，０．１，０．１，０．１というように与えられている。 Next, the phrase importance score calculation unit 2 calculates the phrase importance score of each extracted phrase. An example of important phrases and phrase importance scores is shown in FIG. 10. In FIG. 10, “energy saving”, “power consumption”, “printer”, “consumption”, “high image quality”, “environment”, and “energy” are extracted as phrases, and the phrase importance scores 0.4, 0.4, 0.3, 0.2, 0.1, 0.1, 0.1 are given to them, respectively.

ラベル抽出スコア計算部４では、包含関係解析部３で解析された語句の包含関係から、包含関係にある語句のうち、包含される語句のラベル抽出スコアが包含する語句のラベル抽出スコアよりも高くなるように調整する。ここでは、包含する語句の語句重要度スコアと包含される語句の重要度スコアを加算することで、包含される語句のラベル抽出スコアを算出する。図１０の例を対象に、包含関係解析部３で作成した、重要語句の包含関係を表すテーブルを図１１に表す。 Using the inclusion relations analyzed by the inclusion relation analysis unit 3, the label extraction score calculation unit 4 adjusts the scores so that, among phrases in an inclusion relation, the label extraction score of the included phrase is higher than that of the containing phrase. Here, the label extraction score of the included phrase is calculated by adding the phrase importance score of the containing phrase to that of the included phrase. FIG. 11 shows the table of inclusion relations among the important phrases created by the inclusion relation analysis unit 3 for the example of FIG. 10.

図１０の例を対象にラベル抽出スコアを算出すると、まず、「省エネルギー」と「エネルギー」が包含関係にあるので、包含する語句である「省エネルギー」の語句重要度スコア０．４と包含される語句の「エネルギー」の語句重要度スコア０．１を加算する。その結果、「エネルギー」のラベル抽出スコアが０．５になる。 When the label extraction score is calculated for the example of FIG. 10, “energy saving” and “energy” are in an inclusion relation, so the phrase importance score 0.4 of the containing phrase “energy saving” and the phrase importance score 0.1 of the included phrase “energy” are added. As a result, the label extraction score of “energy” becomes 0.5.

同様に、「消費電力」と「消費」も包含関係にあるので、包含する語句である「消費電力」の語句重要度スコア０．４と包含される語句の「消費」の語句重要度スコア０．２を加算する。その結果、「消費」のラベル抽出スコアが０．６になる。 Similarly, “power consumption” and “consumption” are in an inclusion relation, so the phrase importance score 0.4 of the containing phrase “power consumption” and the phrase importance score 0.2 of the included phrase “consumption” are added. As a result, the label extraction score of “consumption” becomes 0.6.

このようにして、ラベル抽出スコアを算出した結果が、図１２である。 The result of calculating the label extraction score in this way is shown in FIG.

ラベル選択部６は、ラベル抽出スコアの最も高いものをラベルとして選択するので、図１２ではラベル抽出スコアの最も高い「消費」をラベルとして選択する。 Since the label selection unit 6 selects the label with the highest label extraction score as the label, in FIG. 12, “consumption” with the highest label extraction score is selected as the label.

次に、語句抽出部１で抽出された重要語句から、キーワード抽出を行う。キーワードとなる語句は、文書群の内容をできるだけ具体的に示すものが相応しいため、包含関係にある語句がある場合、語句重要度スコアに関係なく包含する語句を選択する。そのため、キーワード抽出スコア計算部５では、包含する語句と包含される語句のスコアのうち、高い方のスコアを包含する語句のキーワード抽出スコアとする。 Next, keywords are extracted from the important phrases extracted by the phrase extraction unit 1. Because a keyword should indicate the contents of the document group as specifically as possible, when phrases are in an inclusion relation the containing phrase is selected regardless of the phrase importance scores. For this purpose, the keyword extraction score calculation unit 5 uses the higher of the scores of the containing phrase and the included phrase as the keyword extraction score of the containing phrase.

図１３は、図１０の例を対象にキーワード抽出スコアを計算したものである。ここでは、「省エネルギー」が、「エネルギー」を包含する語句であるので、「省エネルギー」と「エネルギー」の語句重要度スコアの高い方（０．４）を「省エネルギー」のキーワード抽出スコアに付与する。「消費電力」は「消費」を包含する語句であるので、「消費電力」と「消費」の語句重要度スコアの高い方（０．４）を「消費電力」のキーワード抽出スコアに付与する。 FIG. 13 shows the keyword extraction score calculated for the example of FIG. Here, since “energy saving” is a phrase that includes “energy”, the higher one (0.4) of the phrase importance score of “energy saving” and “energy” is assigned to the keyword extraction score of “energy saving”. . Since “power consumption” is a phrase including “consumption”, the higher word importance score (0.4) of “power consumption” and “consumption” is assigned to the keyword extraction score of “power consumption”.

キーワード選択部７は、ラベルとして選択した語句をキーワードとして選択しない。また、ここでは包含関係がある場合、包含する語句のみを選択するようにすると、「省エネルギー」「消費電力」「プリンタ」「高画質」「環境」をキーワードとして抽出する。 The keyword selection unit 7 does not select the phrase chosen as the label as a keyword. If, in addition, only the containing phrase is selected when an inclusion relation exists, then “energy saving”, “power consumption”, “printer”, “high image quality”, and “environment” are extracted as keywords.
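The first-cluster example can be traced end to end with a compact sketch that uses the FIG. 10 scores. The function name `annotate` is hypothetical, inclusion is taken to be character-string containment, and the keyword extraction scores are omitted because no count or score threshold is applied in this example.

```python
def annotate(importance):
    """Return (label, keywords) for one cluster from its phrase importance scores."""
    phrases = list(importance)
    # For each phrase, the other phrases that contain it as a substring.
    containers = {p: [q for q in phrases if q != p and p in q] for p in phrases}
    label_scores = {p: importance[p] + sum(importance[q] for q in containers[p])
                    for p in phrases}
    label = max(phrases, key=lambda p: label_scores[p])
    # Drop the label and every phrase that is included in another phrase.
    keywords = [p for p in phrases if p != label and not containers[p]]
    return label, keywords


fig10 = {"省エネルギー": 0.4, "消費電力": 0.4, "プリンタ": 0.3, "消費": 0.2,
         "高画質": 0.1, "環境": 0.1, "エネルギー": 0.1}
label, keywords = annotate(fig10)
print(label)
print(keywords)
```

The sketch selects 消費 as the label and keeps the five phrases named in the text as keywords.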

語句重要度スコア、ラベル抽出スコア、キーワード抽出スコアの変化を、図１４に示す。 FIG. 14 shows changes in the phrase importance score, the label extraction score, and the keyword extraction score.

抽出されたラベル及びキーワードは、ラベル・キーワード保持部９によって、保存される。 The extracted label and keyword are stored by the label / keyword holding unit 9.
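A holding unit of this kind could be as simple as one slot per cluster. The class and method names below are hypothetical; the patent only states that an area for a label and a keyword set is reserved for each cluster and that the extracted results are saved there.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClusterAnnotation:
    """Label and keyword set held for one cluster."""
    label: Optional[str] = None
    keywords: List[str] = field(default_factory=list)


class LabelKeywordStore:
    """Reserves one slot per cluster and stores results as they are produced."""

    def __init__(self, n_clusters):
        self._slots = [ClusterAnnotation() for _ in range(n_clusters)]

    def save(self, cluster_index, label, keywords):
        self._slots[cluster_index] = ClusterAnnotation(label, list(keywords))

    def get(self, cluster_index):
        return self._slots[cluster_index]


store = LabelKeywordStore(n_clusters=2)
store.save(0, "消費", ["省エネルギー", "消費電力", "プリンタ", "高画質", "環境"])
print(store.get(0).label, store.get(1).label)
```

Slots for clusters that have not been processed yet stay empty, which mirrors reserving the area before the per-cluster extraction runs.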

Next, when the documents of the second cluster are input, the phrase extraction unit 1 first extracts phrases from the text of the documents, yielding the phrases that are keyword candidates.

Next, the phrase importance score calculation unit 2 calculates the phrase importance score of each extracted phrase.

FIG. 15 shows an example of important phrases and their phrase importance scores. In FIG. 15, "business process", "activity", "document information", "change", "theme", "intellectual activity", and "viewpoint" are extracted as phrases and are given phrase importance scores of 0.3, 0.3, 0.2, 0.2, 0.2, 0.1, and 0.1, respectively.

FIG. 16 shows the table of inclusion relationships between important phrases that the inclusion relation analysis unit 3 creates for the example of FIG. 15. When the label extraction scores are calculated for the example of FIG. 15, "activity" (活動) and "intellectual activity" (知的活動) are found to be in an inclusion relationship, so the phrase importance score 0.1 of the including phrase "intellectual activity" is added to the phrase importance score 0.3 of the included phrase "activity". As a result, the label extraction score of "activity" becomes 0.4.
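The calculation above can be reproduced with a short Python sketch using the phrases and scores given for FIG. 15. It is a sketch under stated assumptions, not the patented implementation: the function names are hypothetical, inclusion is approximated by a plain substring test on the English phrases (the original analysis runs on the Japanese character strings), and the inclusion table is represented as a list of `(including, included)` pairs.

```python
def find_inclusions(phrases):
    """Return (including, included) pairs where one phrase contains another.

    Assumption: morphological inclusion is approximated by a substring test.
    """
    return [(a, b) for a in phrases for b in phrases if a != b and b in a]


def label_extraction_scores(importance, inclusions):
    """Add the including phrase's importance score to the included phrase."""
    scores = dict(importance)
    for including, included in inclusions:
        scores[included] += importance[including]
    return scores


# Phrases and phrase importance scores as listed for FIG. 15.
importance = {
    "business process": 0.3,
    "activity": 0.3,
    "document information": 0.2,
    "change": 0.2,
    "theme": 0.2,
    "intellectual activity": 0.1,
    "viewpoint": 0.1,
}

inclusions = find_inclusions(list(importance))
label_scores = label_extraction_scores(importance, inclusions)

print(inclusions)                          # [('intellectual activity', 'activity')]
print(round(label_scores["activity"], 2))  # 0.4
```

Phrases that are not included in any other phrase keep their phrase importance score as their label extraction score.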

FIG. 17 shows the label extraction scores calculated in this way.

The label selection unit 6 selects the phrase with the highest label extraction score as the label, so in FIG. 17 it selects "activity", which has the highest label extraction score.
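Label selection amounts to taking the phrase with the maximum label extraction score. A minimal sketch follows; the function name `select_label` is hypothetical, the scores are those implied by the text (only "activity" differs from its phrase importance score), and the tie-breaking behavior is an assumption, since the text does not say how ties are resolved.

```python
def select_label(label_scores):
    """Return the phrase with the highest label extraction score.

    Assumption: on ties, the phrase listed first wins (max() keeps the
    first maximal element in dict iteration order).
    """
    return max(label_scores, key=label_scores.get)


# Label extraction scores implied by the text for the FIG. 15 example.
label_scores = {
    "business process": 0.3,
    "activity": 0.4,
    "document information": 0.2,
    "change": 0.2,
    "theme": 0.2,
    "intellectual activity": 0.1,
    "viewpoint": 0.1,
}

label = select_label(label_scores)
print(label)  # activity
```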

FIG. 18 shows the keyword extraction scores calculated for the example of FIG. 15. Here, "intellectual activity" is a phrase that includes "activity". In this case, the higher (0.3) of the phrase importance scores of "intellectual activity" and "activity" is assigned to "intellectual activity" as its keyword extraction score.

Because the keyword selection unit 7 does not use the phrase chosen as the label as a keyword, it extracts "business process", "document information", "change", "theme", "intellectual activity", and "viewpoint" as keywords.
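The keyword selection for this cluster can be sketched as follows. This is an illustrative sketch only: the function name `select_keywords` and the optional `top_n` cutoff are assumptions, and ordering the result by keyword extraction score is a choice made here (the text lists the same phrases without stating an order). The keyword extraction scores are those implied by the text, where only "intellectual activity" is raised to 0.3.

```python
def select_keywords(keyword_scores, inclusions, label, top_n=None):
    """Select keywords, skipping the label and any included phrase.

    keyword_scores: dict phrase -> keyword extraction score.
    inclusions: list of (including, included) pairs.
    top_n: hypothetical optional cutoff on the number of keywords.
    """
    included = {inner for _, inner in inclusions}
    candidates = [p for p in keyword_scores if p != label and p not in included]
    # Stable sort: phrases with equal scores keep their original order.
    ranked = sorted(candidates, key=keyword_scores.get, reverse=True)
    return ranked if top_n is None else ranked[:top_n]


keyword_scores = {
    "business process": 0.3,
    "activity": 0.3,
    "document information": 0.2,
    "change": 0.2,
    "theme": 0.2,
    "intellectual activity": 0.3,  # max(0.1, 0.3), as described for FIG. 18
    "viewpoint": 0.1,
}
inclusions = [("intellectual activity", "activity")]

keywords = select_keywords(keyword_scores, inclusions, label="activity")
print(keywords)
# ['business process', 'intellectual activity', 'document information',
#  'change', 'theme', 'viewpoint']
```

The resulting set of keywords matches the six phrases named in the text; "activity" is excluded both because it is the label and because it is an included phrase.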

FIG. 19 shows how the phrase importance scores, label extraction scores, and keyword extraction scores change.

The same processing is then performed for the remaining clusters.

The labels and keywords extracted in this way and held by the label/keyword holding unit 9 are displayed by the display output unit 8. An example of the output method is shown in FIGS. 20-A, 20-B, and 20-C.

In this case, the same label may be assigned to more than one document group, which would make the contents of the document groups hard to tell apart. Therefore, when clustering is performed, the processing is adjusted so that the same label is not assigned twice. For this technique, see Japanese Patent Application No. 2002-076919.
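The referenced application is not reproduced here, so its actual technique is unknown from this text. Purely as an illustration of the idea stated in the claims (lowering the score of a phrase already selected for another document set), the sketch below applies a multiplicative penalty. The function name `penalize_used_phrases`, the penalty factor 0.5, and the example scores are all assumptions.

```python
def penalize_used_phrases(scores, used_phrases, penalty=0.5):
    """Lower the score of phrases already chosen for other document sets.

    Assumption: a simple multiplicative penalty; the real adjustment
    method is described in the referenced application, not here.
    """
    return {
        phrase: (score * penalty if phrase in used_phrases else score)
        for phrase, score in scores.items()
    }


# Hypothetical situation: "activity" was already the label of another cluster.
scores = {"activity": 0.4, "business process": 0.3}
used = {"activity"}

adjusted = penalize_used_phrases(scores, used)
label = max(adjusted, key=adjusted.get)
print(label)  # business process
```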

In this way, one representative phrase that expresses the contents of a document group is extracted as the label, and specific phrases that complement that summary are extracted as keywords, which makes it possible to present the contents of the document group more clearly.

The present invention is not limited to the embodiments described above, and various modifications can be made without departing from its spirit. For example, although the above examples extract keywords together with the label, only the label may be extracted; conversely, only the keywords may be extracted. Also, although the above examples score labels and keywords using the morphological inclusion relationship between phrases, a semantic inclusion relationship may instead be analyzed using a dictionary or the like, and the same kind of scoring may be performed on that basis. Note, however, that the semantic inclusion relationship runs in the opposite direction to the morphological one, so the score adjustments must be reversed.
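To illustrate the reversal mentioned for semantic inclusion: morphologically, the longer and more specific phrase includes the general one, whereas semantically the general term includes the specific one. The sketch below assumes a hypothetical dictionary `hypernym_of` mapping each specific term to a broader term; the function names, the dictionary contents, and the scores are all assumptions for illustration.

```python
def label_scores_semantic(importance, hypernym_of):
    """Label scores: the semantically including (broader) term is boosted."""
    scores = dict(importance)
    for specific, general in hypernym_of.items():
        scores[general] += importance[specific]
    return scores


def keyword_scores_semantic(importance, hypernym_of):
    """Keyword scores: the semantically included (specific) term gets the
    higher of the two importance scores."""
    scores = dict(importance)
    for specific, general in hypernym_of.items():
        scores[specific] = max(scores[specific], importance[general])
    return scores


# Hypothetical dictionary entry and scores.
importance = {"printer": 0.3, "office equipment": 0.2}
hypernym_of = {"printer": "office equipment"}

label_scores = label_scores_semantic(importance, hypernym_of)
keyword_scores = keyword_scores_semantic(importance, hypernym_of)
print(round(label_scores["office equipment"], 2))  # 0.5
print(keyword_scores["printer"])                   # 0.3
```

In the morphological case the boost for labels goes to the included phrase; here it goes to the including (broader) term, which is the reversal the text refers to.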

A block diagram showing Embodiment 1 of the present invention.
A flowchart of label extraction score calculation in Embodiment 1 of the present invention.
A flowchart of keyword extraction score calculation in Embodiment 1 of the present invention.
An example of important phrases and phrase importance scores in Embodiment 1 of the present invention.
An example of a table showing inclusion relationships between important phrases in Embodiment 1 of the present invention.
An explanatory diagram of an example of label extraction scores in Embodiment 1 of the present invention.
An explanatory diagram of an example of keyword extraction scores in Embodiment 1 of the present invention.
An explanatory diagram of an example of label extraction scores and keyword extraction scores in Embodiment 1 of the present invention.
An explanatory diagram of an example of a display output method in Embodiment 1 of the present invention.
An explanatory diagram of an example of a display output method in Embodiment 1 of the present invention.
An explanatory diagram of an example of a display output method in Embodiment 1 of the present invention.
A block diagram showing Embodiment 2 of the present invention.
An example of important phrases and phrase importance scores in Embodiment 2 of the present invention.
An example of a table showing inclusion relationships between important phrases in Embodiment 2 of the present invention.
An explanatory diagram of an example of label extraction scores in Embodiment 2 of the present invention.
An explanatory diagram of an example of keyword extraction scores in Embodiment 2 of the present invention.
An explanatory diagram of an example of label extraction scores and keyword extraction scores in Embodiment 2 of the present invention.
An example of important phrases and phrase importance scores in Embodiment 2 of the present invention.
An example of a table showing inclusion relationships between important phrases in Embodiment 2 of the present invention.
An explanatory diagram of an example of label extraction scores in Embodiment 2 of the present invention.
An explanatory diagram of an example of keyword extraction scores in Embodiment 2 of the present invention.
An explanatory diagram of an example of label extraction scores and keyword extraction scores in Embodiment 2 of the present invention.
An explanatory diagram of an example of a display output method in Embodiment 2 of the present invention.
An explanatory diagram of an example of a display output method in Embodiment 2 of the present invention.
An explanatory diagram of an example of a display output method in Embodiment 2 of the present invention.

Explanation of symbols

1 ... Phrase extraction unit
2 ... Phrase importance score calculation unit
3 ... Inclusion relation analysis unit
4 ... Label extraction score calculation unit
5 ... Keyword extraction score calculation unit
6 ... Label selection unit
7 ... Keyword selection unit
8 ... Display output unit
9 ... Label/keyword holding unit

Claims

1. A document processing apparatus for extracting a label that represents the contents of a document set in one word and one or more keywords that represent an outline of the document set, comprising:
phrase extraction means for extracting phrases that are candidates for the label and the keywords from each document in the document set;
phrase importance score calculation means for calculating, for each phrase, a phrase importance score representing the importance of the phrase;
inclusion relation analysis means for analyzing morphological inclusion relations between the phrases;
label extraction score calculation means for calculating a label extraction score for each phrase from the phrase importance score, based on the result, obtained by the inclusion relation analysis means, of analyzing whether the phrase is included in another phrase;
keyword extraction score calculation means for calculating a keyword extraction score for each phrase from the phrase importance score, based on the result, obtained by the inclusion relation analysis means, of analyzing whether the phrase includes another phrase;
label selection means for selecting one label from the phrases according to the label extraction score; and
keyword selection means for selecting keywords based on the keyword extraction score.

2. The document processing apparatus according to claim 1, wherein the inclusion relation analysis means analyzes inclusion relations between character strings.

3. The document processing apparatus according to claim 1, wherein the inclusion relation analysis means analyzes inclusion relations between word strings.

4. The document processing apparatus according to claim 1, wherein the keyword selection means does not select an included phrase as a keyword.

5. The document processing apparatus according to claim 1, further comprising display means for displaying the label and the keywords, wherein the display means displays the part of a keyword that corresponds to the label so that it is distinguished from the other parts.

6. A document classification apparatus comprising: document group classification means for classifying a document group into a plurality of document sets; and the document processing apparatus described above, which extracts the label and the keywords for each document set classified by the document group classification means,
wherein the phrase importance score calculation means and the label extraction score calculation means calculate scores such that the score of a phrase selected as a label or a keyword of another document set becomes small.

7. A document processing apparatus that extracts a label representing the contents of a document set in one word, comprising:
phrase extraction means for extracting phrases that are candidates for the label from each document in the document set;
phrase importance score calculation means for calculating, for each phrase, a phrase importance score representing the importance of the phrase;
inclusion relation analysis means for analyzing inclusion relations between the phrases;
label extraction score calculation means for calculating a label extraction score by adjusting the phrase importance score based on the inclusion relations analyzed by the inclusion relation analysis means; and
label selection means for selecting a label based on the label extraction score.

8. The document processing apparatus according to claim 7, wherein the inclusion relationship between the phrases is a morphological inclusion relationship between the phrases.

9. The document processing apparatus according to claim 7, wherein the inclusion relationship between the phrases is a semantic inclusion relationship between the phrases.

10. A document processing apparatus that extracts keywords from a document set, comprising:
phrase extraction means for extracting phrases that are candidates for the keywords from each document in the document set;
phrase importance score calculation means for calculating, for each phrase, a phrase importance score representing the importance of the phrase;
inclusion relation analysis means for analyzing inclusion relations between the phrases;
keyword extraction score calculation means for calculating a keyword extraction score by adjusting the phrase importance score based on the inclusion relations analyzed by the inclusion relation analysis means; and
keyword selection means for selecting keywords based on the keyword extraction score.

11. A document processing method for extracting a label that represents the contents of a document set in one word and one or more keywords that represent an outline of the document set, comprising:
a step in which a phrase extraction unit extracts phrases that are candidates for the label and the keywords from each document in the document set;
a step in which a phrase importance score calculation unit calculates, for each phrase, a phrase importance score representing the importance of the phrase;
a step in which an inclusion relation analysis unit analyzes morphological inclusion relations between the phrases;
a step in which a label extraction score calculation unit calculates a label extraction score for each phrase from the phrase importance score, based on the result, obtained by the inclusion relation analysis unit, of analyzing whether the phrase is included in another phrase;
a step in which a keyword extraction score calculation unit calculates a keyword extraction score for each phrase from the phrase importance score, based on the result, obtained by the inclusion relation analysis unit, of analyzing whether the phrase includes another phrase;
a step in which a label selection unit selects one label from the phrases according to the label extraction score; and
a step in which a keyword selection unit selects keywords based on the keyword extraction score.

12. A document processing method for extracting a label that represents the contents of a document set in one word, comprising:
a step in which a phrase extraction unit extracts phrases that are candidates for the label from each document in the document set;
a step in which a phrase importance score calculation unit calculates, for each phrase, a phrase importance score representing the importance of the phrase;
a step in which an inclusion relation analysis unit analyzes inclusion relations between the phrases;
a step in which a label extraction score calculation unit calculates a label extraction score by adjusting the phrase importance score based on the inclusion relations analyzed in the inclusion relation analysis step; and
a step in which a label selection unit selects a label based on the label extraction score.

13. A computer program for document processing, used to extract a label that represents the contents of a document set in one word and one or more keywords that represent an outline of the document set, the program causing a computer to execute:
a phrase extraction step of extracting phrases that are candidates for the label and the keywords from each document in the document set;
a phrase importance score calculation step of calculating, for each phrase, a phrase importance score representing the importance of the phrase;
an inclusion relation analysis step of analyzing morphological inclusion relations between the phrases;
a label extraction score calculation step of calculating a label extraction score for each phrase from the phrase importance score, based on the result, obtained in the inclusion relation analysis step, of analyzing whether the phrase is included in another phrase;
a keyword extraction score calculation step of calculating a keyword extraction score for each phrase from the phrase importance score, based on the result, obtained in the inclusion relation analysis step, of analyzing whether the phrase includes another phrase;
a label selection step of selecting one label from the phrases according to the label extraction score; and
a keyword selection step of selecting keywords based on the keyword extraction score.

14. A computer program for document processing, used to extract a label that represents the contents of a document set in one word, the program causing a computer to execute:
a phrase extraction step of extracting phrases that are candidates for the label from each document in the document set;
a phrase importance score calculation step of calculating, for each phrase, a phrase importance score representing the importance of the phrase;
an inclusion relation analysis step of analyzing inclusion relations between the phrases;
a label extraction score calculation step of calculating a label extraction score by adjusting the phrase importance score based on the inclusion relations analyzed in the inclusion relation analysis step; and
a label selection step of selecting a label based on the label extraction score.