JP5642229B2

JP5642229B2 - Importance determination system, importance determination method, and computer program

Info

Publication number: JP5642229B2
Application number: JP2013095985A
Authority: JP
Inventors: 済央野本; 茂木　一男; 一男茂木
Original assignee: NTT Communications Corp
Current assignee: NTT Communications Corp
Priority date: 2013-04-30
Filing date: 2013-04-30
Publication date: 2014-12-17
Anticipated expiration: 2033-04-30
Also published as: JP2014215996A

Description

The present invention relates to a technique for determining the importance of words in a document that includes text.

Conventionally, the TF-IDF algorithm has been used to determine important words in a text document (see Non-Patent Document 1). In the TF-IDF algorithm, the index TF-IDF is calculated by dividing the appearance frequency TF (Term Frequency) of a word w by the document frequency DF (Document Frequency), that is, the number of documents that contain the word w. The higher the TF-IDF value, the more important the word is judged to be. In other words, a word that has a high appearance frequency TF only in a specific document and does not appear in other documents receives a high TF-IDF value, as an important word that characterizes that document.

Specifically, the value of the index TF-IDF is often calculated using, for example, the following equations. In the following description, “a_b” indicates that the subscript “b” is attached to the character “a”, and “a_(b,c)” indicates that the subscript “b,c” is attached to the character “a”.

n_(i,j) is the number of times word i appears in document j. |D| is the total number of documents. D_i is the number of documents that contain word i. tfidf_(i,j) is a score representing the importance of word i in document j (hereinafter referred to as the “importance score”). When the value of tfidf_(i,j) exceeds a predetermined threshold, word i is determined to be an important word in document j.
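The equations themselves are not reproduced in this text. A commonly used form of TF-IDF that is consistent with the definitions of n_(i,j), |D|, and D_i, and with the later remarks that idf_i involves a log operation and is multiplied into tfidf_(i,j), would look like the following. The normalization of tf_(i,j) by the total word count of document j is an assumption, as is the exact layout.

```latex
\begin{align}
tf_{(i,j)}    &= \frac{n_{(i,j)}}{\sum_{k} n_{(k,j)}} \\
idf_{i}       &= \log \frac{|D|}{D_{i}} \\
tfidf_{(i,j)} &= tf_{(i,j)} \times idf_{i}
\end{align}
```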

Gerard Salton, Christopher Buckley, “TERM-WEIGHTING APPROACHES IN AUTOMATIC TEXT RETRIEVAL” Information Processing & Management Vol.24, No.5, pp.513-523, 1988.

However, the conventional technique has a problem: it cannot appropriately evaluate important words that appear in common across a plurality of documents.
For example, for a word that appears in all documents, the value of idf_i is 0. Consequently, every value of tfidf_(i,j) obtained by multiplying by idf_i is also 0, and the importance cannot be determined appropriately.

It is possible to prevent the calculated value from becoming 0 by omitting the log operation when calculating idf_i. However, doing so increases the influence of tf_(i,j) on the calculated value of tfidf_(i,j). As a result, for a frequent word that appears in every document, the value of tf_(i,j) exceeds the threshold in every document, and the importance cannot be determined appropriately.
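The two failure modes described above can be illustrated with a small sketch. The counts, the document length of 100 words, the threshold, and all function names are hypothetical values chosen for illustration; the tf normalization is the common one and is an assumption here.

```python
import math

def tf(count, total_words):
    # Term frequency: occurrences divided by the document's total word count (assumed form)
    return count / total_words

def idf_log(num_docs, docs_with_word):
    # Inverse document frequency with the log operation
    return math.log(num_docs / docs_with_word)

def idf_no_log(num_docs, docs_with_word):
    # Variant that omits the log operation
    return num_docs / docs_with_word

# Hypothetical counts of one word in five documents of 100 words each
counts = [10, 40, 12, 9, 11]
total_words = 100
num_docs = len(counts)
docs_with_word = sum(1 for c in counts if c > 0)  # the word appears in all 5 documents

scores_log = [tf(c, total_words) * idf_log(num_docs, docs_with_word) for c in counts]
scores_no_log = [tf(c, total_words) * idf_no_log(num_docs, docs_with_word) for c in counts]

threshold = 0.05  # hypothetical importance threshold
print(scores_log)                                  # every score is 0.0
print([s > threshold for s in scores_no_log])      # every document exceeds the threshold
```

With the log, the word scores 0 in every document, including the one where it is four times as frequent. Without the log, the score reduces to tf alone, so the word passes the threshold everywhere.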

In view of the above circumstances, an object of the present invention is to provide a technique that can determine more accurately the importance of words that appear in common across a plurality of documents.

One aspect of the present invention is an importance determination system comprising an importance determination unit that determines, for each word, whether the word appearance frequency representing how often the word appears in one document differs with a significant difference from the word appearance frequency in other documents, and determines the importance of the word in the one document based on the result of that determination.
One aspect of the present invention is the importance determination system described above, further comprising a word appearance frequency acquisition unit that acquires the word appearance frequency for each document.

One aspect of the present invention is the importance determination system described above, wherein, when a predetermined condition is satisfied indicating that a word appearance frequency determined to differ with a significant difference within the set of word appearance frequencies acquired for a certain word in each document is high, the importance determination unit determines that the word is an important word in the document from which that word appearance frequency was acquired.

One aspect of the present invention is the importance determination system described above, wherein, when a predetermined condition is satisfied indicating that a word appearance frequency determined to differ with a significant difference within the set of word appearance frequencies acquired for a certain word in each document is low, the importance determination unit determines that the word is an important word in each document other than the document from which that word appearance frequency was acquired.

One aspect of the present invention is the importance determination system described above, wherein the importance determination unit determines only words of a predetermined part of speech to be important words.

One aspect of the present invention is the importance determination system described above, further comprising a word conversion unit that, for each document, converts words having the same or similar meaning among the words appearing in the document into a single word, wherein the word appearance frequency acquisition unit acquires the word appearance frequency for each word after conversion by the word conversion unit.

One aspect of the present invention is an importance determination method comprising an importance determination step of determining, for each word, whether the word appearance frequency representing how often the word appears in one document differs with a significant difference from the word appearance frequency in other documents, and determining the importance of the word in the one document based on the result of that determination.

One aspect of the present invention is a computer program for causing a computer to function as the importance determination system described above.

According to the present invention, the importance of words that appear in common across a plurality of documents can be determined more accurately.

FIG. 1 is a schematic block diagram showing the functional configuration of the importance determination system 10 in the first embodiment.
FIG. 2 is a diagram showing an outline of the significance level.
FIG. 3 is a flowchart showing a specific example of the processing of the importance determination system 10.
FIG. 4 is a diagram showing a specific example that illustrates the effect of the importance determination system 10.
FIG. 5 is a schematic block diagram showing the importance determination system 10a as a modification of the first embodiment.
FIG. 6 is a schematic block diagram showing the functional configuration of the importance determination system 20 in the second embodiment.
FIG. 7 is a flowchart showing a specific example of the processing of the importance determination system 20.
FIG. 8 is a schematic block diagram showing the functional configuration of the importance determination system 30 in the third embodiment.
FIG. 9 is a flowchart showing a specific example of the processing of the importance determination system 30.

[First embodiment]
FIG. 1 is a schematic block diagram showing the functional configuration of the importance determination system 10 in the first embodiment. The importance determination system 10 is composed of one or more information processing apparatuses. For example, when the importance determination system 10 is composed of a single information processing apparatus, the information processing apparatus includes a CPU (Central Processing Unit), a memory, an auxiliary storage device, and the like connected by a bus, and executes an importance determination program. By executing the importance determination program, the information processing apparatus functions as an apparatus including the document information storage unit 101, the word extraction unit 102, the word appearance frequency acquisition unit 103, and the importance determination unit 104. All or part of the functions of the importance determination system 10 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The importance determination system 10 may also be realized by dedicated hardware. The importance determination program may be recorded on a computer-readable recording medium. A computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, a CD-ROM, or a semiconductor storage device (for example, an SSD: Solid State Drive), or a storage device such as a hard disk or semiconductor storage device built into a computer system. The importance determination program may be provided via a telecommunication line.

The document information storage unit 101 is configured using a storage device such as a magnetic hard disk device or a semiconductor storage device. The document information storage unit 101 stores document information. The document information represents, for each piece of document identification information, the text included in that document. The document identification information is identification information assigned to each document in advance. The document information may also include information such as the author name, title, publication date and time, and release date and time of the document.

For each document stored in the document information storage unit 101, the word extraction unit 102 extracts individual words from the text included in the document. The word extraction unit 102 extracts individual words by, for example, performing morphological analysis. In other words, the word extraction unit 102 segments each document into words by, for example, performing morphological analysis on each document. The word extraction unit 102 may output the part of speech of each word as a result of the morphological analysis.

Based on the extraction result from the word extraction unit 102, the word appearance frequency acquisition unit 103 acquires, for each document, a word appearance frequency representing how often each word appears. For example, the word appearance frequency acquisition unit 103 may count the number of times each word appears in each document and use the count itself as the word appearance frequency of the word. Alternatively, the word appearance frequency acquisition unit 103 may, for each document, divide the number of times each word appears by the total number of words appearing in that document and use the result as the word appearance frequency of the word. A word appearance frequency calculated in the latter way makes it possible to improve the accuracy of the determination result of the importance determination system 10 when the total number of words differs from document to document.
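The two ways of obtaining the word appearance frequency described above can be sketched as follows. The function names and the sample word list are hypothetical; the word list stands in for the output of the word extraction unit 102.

```python
from collections import Counter

def raw_counts(words):
    # Variant 1: the number of occurrences itself is used as the frequency
    return dict(Counter(words))

def relative_frequencies(words):
    # Variant 2: occurrences divided by the total number of words in the document
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}

# Hypothetical word list for one document
document_words = ["communication", "telephone", "communication", "contract"]

counts = raw_counts(document_words)
frequencies = relative_frequencies(document_words)
print(counts)       # {'communication': 2, 'telephone': 1, 'contract': 1}
print(frequencies)  # {'communication': 0.5, 'telephone': 0.25, 'contract': 0.25}
```

The relative form keeps long and short documents comparable, which is the accuracy benefit mentioned for documents with differing total word counts.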

The importance determination unit 104 determines the importance of each word in each document based on the word appearance frequency of each word acquired by the word appearance frequency acquisition unit 103.

A specific example of the processing performed by the importance determination unit 104 is described below. For example, for a word i, when the word appearance frequency in a certain document j is higher, with a significant difference, than the word appearance frequencies in the other documents, the importance determination unit 104 determines that word i is an important word in document j. As a specific way of determining whether a word appearance frequency is higher than the others with a significant difference, the importance determination unit 104 may use a technique for determining whether a value is an outlier. Specific examples of such techniques include the Smirnov-Grubbs test and the Thompson test. The formula used in the Smirnov-Grubbs test is shown below as an example.

In Equation (4), T_x denotes the test statistic, n denotes the number of samples, and t denotes the percentage point of the t distribution for the required significance level α (the α/n × 100 percentile of the t distribution with n−2 degrees of freedom). The significance level is, for example, 1% or 5%, and is chosen as appropriate by the designer or user of the importance determination system 10.
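Equation (4) itself is not reproduced in this text. The sketch below uses the standard textbook form of the one-sided Smirnov-Grubbs test, which matches the symbols described above (a test statistic, the sample count n, and the upper α/n point of the t distribution with n−2 degrees of freedom). Whether it agrees term by term with Equation (4) is an assumption. All function names and the sample frequencies are hypothetical, and the t quantile is computed with a simple numerical integration that is adequate for small examples only.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    # Density of Student's t distribution with df degrees of freedom
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, steps=2000):
    # P(T > x) for x >= 0, via Simpson's rule on [0, x] (steps must be even)
    if x <= 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 - total * h / 3

def t_upper_quantile(p, df):
    # Value t with P(T > t) = p, found by bracketing and bisection
    lo, hi = 0.0, 1.0
    while t_upper_tail(hi, df) > p:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def grubbs_critical_value(n, alpha):
    # Standard one-sided critical value built from the alpha/n upper point of t(n-2)
    t = t_upper_quantile(alpha / n, n - 2)
    return (n - 1) / math.sqrt(n) * math.sqrt(t * t / (n - 2 + t * t))

def high_outlier_index(values, alpha=0.05):
    # Index of the largest value if it is a significant high outlier, else None
    n = len(values)
    if n < 3:
        return None
    s = stdev(values)  # sample standard deviation
    if s == 0:
        return None
    idx = max(range(n), key=lambda k: values[k])
    statistic = (values[idx] - mean(values)) / s
    return idx if statistic > grubbs_critical_value(n, alpha) else None

# Hypothetical frequencies of two words across five documents
communication = [0.10, 0.40, 0.12, 0.09, 0.11]
telephone = [0.10, 0.12, 0.11, 0.09, 0.10]
print(high_outlier_index(communication))  # 1 (the second document stands out)
print(high_outlier_index(telephone))      # None (no significant difference)
```

Both words appear in every document, yet only the first has a document whose frequency is significantly higher than the rest.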

FIG. 2 is a diagram showing an outline of the significance level. In FIG. 2, the curve represents the t distribution. The ranges of horizontal-axis values in the hatched regions near the left and right ends are the ranges of values that have a significant difference. For example, for a two-sided test with a significance level (α) of 5%, taking the total area enclosed by the t-distribution curve and the horizontal axis as 100, straight lines parallel to the vertical axis are drawn so that the hatched area on each side is 2.5, and the intersections of these lines with the horizontal axis are denoted x1 and x2. On the right side of the graph in FIG. 2, a value larger than x2 is regarded as significantly larger than the mean in a 5% two-sided test. On the left side of the graph in FIG. 2, a value smaller than x1 is regarded as significantly smaller than the mean in a 5% two-sided test. In the present embodiment, in a graph with the word appearance frequency on the horizontal axis and the count on the vertical axis, a value that falls within the significant range on the high side is determined to be higher than the other word appearance frequencies with a significant difference. Conversely, a value that falls within the significant range on the low side is determined to be lower than the other word appearance frequencies with a significant difference.

For each word i, the importance determination unit 104 identifies the word appearance frequencies that are outliers within the set of word appearance frequencies of that word across the documents j. The importance determination unit 104 then determines important words based on the outlier determination result. For example, for every combination of word i and document j whose word appearance frequency was determined to be an outlier, the importance determination unit 104 determines word i to be an important word of document j.

FIG. 3 is a flowchart showing a specific example of the processing of the importance determination system 10. The flowchart in FIG. 3 shows the processing for a configuration in which the importance determination unit 104 determines important words based on outliers. A specific example of the processing of the importance determination system 10 is described below with reference to FIG. 3.

First, for each document stored in the document information storage unit 101, the word extraction unit 102 extracts individual words from the text included in the document (step S101). Next, the word appearance frequency acquisition unit 103 acquires the word appearance frequency of each word for each document based on the extraction result from the word extraction unit 102 (step S102). Next, for each word i, the importance determination unit 104 identifies the word appearance frequencies that are outliers within the set of word appearance frequencies across the documents j (step S103). Then, based on the word i and document j of each word appearance frequency determined to be an outlier, the importance determination unit 104 determines that word i is an important word of document j (step S104).
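Steps S101 to S104 can be sketched end to end as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for morphological analysis, and the outlier rule (value above mean plus k sample standard deviations, with k = 1.5) is a simple stand-in rather than the Smirnov-Grubbs test. All names, the documents, and k are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

def extract_words(text):
    # S101: stand-in for morphological analysis (assumes whitespace-separated text)
    return text.split()

def word_frequencies(words):
    # S102: relative frequency of each word within one document
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

def is_high_outlier(values, index, k=1.5):
    # S103 stand-in rule (assumption): value exceeds mean + k * sample std deviation
    if len(values) < 3:
        return False
    s = stdev(values)
    return s > 0 and values[index] > mean(values) + k * s

def important_words(documents, outlier_test=is_high_outlier):
    doc_ids = list(documents)
    freqs = {d: word_frequencies(extract_words(documents[d])) for d in doc_ids}
    vocabulary = set().union(*(f.keys() for f in freqs.values()))
    result = {d: [] for d in doc_ids}
    for word in sorted(vocabulary):
        # Frequencies of this word across all documents (0.0 where absent)
        values = [freqs[d].get(word, 0.0) for d in doc_ids]
        for index, d in enumerate(doc_ids):
            if outlier_test(values, index):
                result[d].append(word)  # S104: word is important in document d
    return result

# Hypothetical corpus: "net" and "call" both appear in every document
documents = {
    "d1": " ".join(["net"] * 1 + ["call"] * 9),
    "d2": " ".join(["net"] * 6 + ["call"] * 4),
    "d3": " ".join(["net"] * 1 + ["call"] * 9),
    "d4": " ".join(["net"] * 1 + ["call"] * 9),
    "d5": " ".join(["net"] * 1 + ["call"] * 9),
}
print(important_words(documents))
# {'d1': [], 'd2': ['net'], 'd3': [], 'd4': [], 'd5': []}
```

The `outlier_test` parameter is there so that a proper significance test can be swapped in for the stand-in rule without changing the rest of the flow.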

In the importance determination system 10 configured in this way, important words are determined for each word based on how much its appearance frequency in each document differs from its appearance frequency in the other documents. That is, for a word i, when the word appearance frequency in a certain document j is higher, with a significant difference, than the word appearance frequencies in the other documents, word i is determined to be an important word in document j. Therefore, the importance of words that appear in common across a plurality of documents can also be determined more accurately.

FIG. 4 is a diagram showing a specific example that illustrates the effect of the importance determination system 10. FIG. 4 shows the word appearance frequency of each word (telephone, communication, Internet, customer, contract, bank) in documents 1 to 5. With the conventional technique, because the word “communication” appears with high frequency in all of documents 1 to 5, it was difficult to determine accurately whether it is an important word. In contrast, the importance determination system 10 compares the word appearance frequencies of each word across documents and determines the word to be an important word when a frequency is determined to be higher with a significant difference. Therefore, even for a word that appears with high frequency in all documents, it is possible to determine accurately whether it is an important word. For example, the word “communication” can be determined to be an important word in document 2.

In addition, according to the importance determination system 10, even a word that appears with high frequency in all documents can be determined not to be an important word if its word appearance frequency does not differ greatly between documents. For example, the word “telephone”, which like the word “communication” has a high appearance frequency overall, has only small differences in word appearance frequency between documents, so it can be determined not to be an important word in any document.

The importance determination system 10 can also solve the following problem. With the conventional technique, it was difficult to determine accurately whether a word with a low word appearance frequency is an important word. For example, the word “bank” in FIG. 4 has a low word appearance frequency in all documents. Therefore, even if its word appearance frequency in document 1 is distinctly higher than in the other documents, it was conventionally difficult to determine the word “bank” to be an important word of document 1. The importance determination system 10 solves this problem and makes it possible to determine accurately whether a word with a low word appearance frequency is an important word. For example, according to the importance determination system 10, the word appearance frequencies of the word “bank” are compared across documents, and the word is determined to be an important word when a frequency is determined to be higher with a significant difference. Therefore, even the word “bank”, which has a low word appearance frequency, can be determined to be an important word of document 1.

<Modification>
The importance determination unit 104 does not necessarily have to determine important words based on all outliers. For example, the importance determination unit 104 may determine word i to be an important word of document j only for combinations of word i and document j that were determined to be outliers by recursive processing performed no more than a predetermined number of times. Alternatively, the importance determination unit 104 may repeat the recursive processing until a predetermined number of outliers is obtained, and determine word i to be an important word of document j for the combinations of word i and document j determined to be outliers.

The importance determination unit 104 may also determine as outliers a predetermined number of values, taken in descending order, from the test statistics calculated based on Equation (4). The importance determination unit 104 may then determine important words based on the result of this determination.
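Selecting a fixed number of the largest test statistics might look like the sketch below. It assumes the statistic is the deviation from the mean divided by the sample standard deviation, and it ranks all (word, document) pairs together, which is one possible reading of this modification. The function name and the frequencies are hypothetical.

```python
from statistics import mean, stdev

def top_k_outliers(frequencies, k):
    # frequencies: {word: [frequency in document 0, document 1, ...]}
    scored = []
    for word, values in frequencies.items():
        if len(values) < 2:
            continue
        s = stdev(values)
        if s == 0:
            continue  # identical frequencies carry no outlier information
        m = mean(values)
        for doc_index, value in enumerate(values):
            scored.append(((value - m) / s, word, doc_index))
    # Largest statistic first; keep the k highest (word, document) pairs
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(word, doc_index) for _, word, doc_index in scored[:k]]

# Hypothetical frequencies across five documents
frequencies = {
    "communication": [0.10, 0.40, 0.12, 0.09, 0.11],
    "telephone": [0.10, 0.12, 0.11, 0.09, 0.10],
    "bank": [0.03, 0.00, 0.00, 0.01, 0.00],
}
print(top_k_outliers(frequencies, 2))
# [('communication', 1), ('bank', 0)]
```

Because the statistic is scaled by each word's own spread, a rare word such as "bank" can rank above a frequent word such as "telephone".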

For a word i, when the word appearance frequency in a certain document j is lower, with a significant difference, than the word appearance frequencies in the other documents, the importance determination unit 104 may determine word i to be an important word in each document other than document j. With this configuration, a word that is rare only in a specific document can be determined to be an important word in the other documents. Using important words determined in this way makes it possible, for example, to classify clearly whether a document is about a topic related to the important word.

Among the words that satisfy the conditions in the processing described above, the importance determination unit 104 may determine only words of a predetermined part of speech (for example, nouns) to be important words. Limiting the part of speech of words determined to be important words in this way makes it possible to output only words suited to subsequent processing as important words.

Instead of determining the important words of each document, the importance determination unit 104 may determine a value representing the importance of each word in each document. For example, for each word i, the importance determination unit 104 may calculate a statistic indicating the degree to which the word appearance frequency in a certain document j differs from the word appearance frequencies in the other documents, and use the calculated value as the value representing the importance of word i in document j. In this case, the importance determination unit 104 outputs, as the determination result, a value representing the importance of each word for each document. With this configuration, the degree of importance of each word in each document can be expressed as a graded value rather than as a binary output indicating whether or not the word is an important word.

In the example shown in FIG. 1, the importance determination system 10 is implemented as an apparatus that includes the document information storage unit 101, but the document information storage unit 101 may be provided outside the importance determination system 10. FIG. 5 is a schematic block diagram showing the importance determination system 10a as a modification of the first embodiment. In this case, the importance determination system 10a and the document information storage unit 101 are communicably connected via a network. The word extraction unit 102a, the word appearance frequency acquisition unit 103a, and the importance determination unit 104a of the importance determination system 10a function in the same way as the functional units of the same names in the importance determination system 10. The word extraction unit 102a receives document information from the document information storage unit 101 via the network. With this configuration, the importance determination system 10a can determine important words and degrees of importance for document information stored in any document information storage unit 101.

[Second embodiment]
FIG. 6 is a schematic block diagram showing the functional configuration of the importance determination system 20 in the second embodiment. The importance determination system 20 is composed of one or more information processing apparatuses. For example, when the importance determination system 20 is composed of a single information processing apparatus, the information processing apparatus includes a CPU, a memory, an auxiliary storage device, and the like connected by a bus, and executes an importance determination program. By executing the importance determination program, the information processing apparatus functions as an apparatus including the document information storage unit 201, the word extraction unit 202, the word appearance frequency acquisition unit 203, the importance determination unit 204, the conversion dictionary storage unit 211, and the word conversion unit 212. All or part of the functions of the importance determination system 20 may be realized using hardware such as an ASIC, a PLD, or an FPGA. The importance determination system 20 may also be realized by dedicated hardware. The importance determination program may be recorded on a computer-readable recording medium. The importance determination program may be provided via a telecommunication line.

The document information storage unit 201, the word extraction unit 202, the word appearance frequency acquisition unit 203, and the importance determination unit 204 function in the same manner as the functional units of the same names in the first embodiment.
The conversion dictionary storage unit 211 is configured using a storage device such as a magnetic hard disk device or a semiconductor storage device. For each representative word, the conversion dictionary storage unit 211 stores, in association with it, one or more words having the same meaning as the representative word (hereinafter referred to as "synonyms"). For example, for the representative word 「経済」 ("economy"), words such as 「エコノミー」 (the katakana loanword), 「けいざい」 (the hiragana spelling), and 「けーざい」 (a colloquial hiragana variant) are stored in association with it as synonyms.

Based on the representative words and synonyms stored in the conversion dictionary storage unit 211, the word conversion unit 212 converts any synonyms among the words extracted by the word extraction unit 202 into their representative words. For example, if the words extracted by the word extraction unit 202 include one or more of 「エコノミー」, 「けいざい」, and 「けーざい」, all of them are converted into the representative word 「経済」.
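The dictionary-based conversion described above can be illustrated with a minimal sketch. The names `CONVERSION_DICT`, `build_lookup`, and `convert_words` are hypothetical and do not appear in the specification; the dictionary contents are taken from the example in the text, and an in-memory mapping is assumed in place of the conversion dictionary storage unit 211.

```python
# Hypothetical stand-in for the conversion dictionary storage unit 211:
# each representative word maps to the synonyms associated with it.
CONVERSION_DICT = {
    "経済": ["エコノミー", "けいざい", "けーざい"],
}


def build_lookup(conversion_dict):
    """Invert the dictionary so that each synonym maps to its representative word."""
    lookup = {}
    for representative, synonyms in conversion_dict.items():
        for synonym in synonyms:
            lookup[synonym] = representative
    return lookup


def convert_words(words, lookup):
    """Replace every synonym with its representative word; other words are unchanged."""
    return [lookup.get(word, word) for word in words]


LOOKUP = build_lookup(CONVERSION_DICT)
converted = convert_words(["エコノミー", "成長", "けーざい", "経済"], LOOKUP)
```

Inverting the dictionary once keeps each conversion a single lookup, so the cost of the conversion step grows only with the number of extracted words.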

FIG. 7 is a flowchart showing a specific example of the processing of the importance determination system 20. A specific example of the processing of the importance determination system 20 is described below with reference to FIG. 7.
First, for each document stored in the document information storage unit 201, the word extraction unit 202 extracts the individual words from the text contained in that document (step S201). Next, the word conversion unit 212 converts any synonyms among the words extracted by the word extraction unit 202 into their representative words (step S211). Next, the word appearance frequency acquisition unit 203 acquires, for each document, the word appearance frequency of each word based on the conversion result of the word conversion unit 212 (step S202). Next, for each word i, the importance determination unit 204 determines which word appearance frequencies are outliers within the set of that word's appearance frequencies across the documents j (step S203). The importance determination unit 204 then determines, from the word i and the document j of each word appearance frequency judged to be an outlier, that the word i is an important word of the document j (step S204).
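The flow of steps S201 to S204 can be sketched as follows. This is an illustration under stated assumptions, not the method of the specification: whitespace splitting stands in for word extraction (Japanese text would require a morphological analyzer), raw counts stand in for the word appearance frequency, and a simple z-score threshold stands in for the outlier test. The names `extract_words`, `important_words`, and `z_threshold` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def extract_words(text):
    """S201 stand-in (assumption): split on whitespace instead of real word extraction."""
    return text.split()


def important_words(docs, lookup, z_threshold=2.0):
    """Return, for each document ID, the set of words judged important in it.

    docs:   dict mapping document ID -> text
    lookup: dict mapping synonym -> representative word
    """
    counts = {}
    for doc_id, text in docs.items():
        words = [lookup.get(w, w) for w in extract_words(text)]  # S201, S211
        counts[doc_id] = Counter(words)                          # S202

    vocabulary = set().union(*counts.values())
    result = {doc_id: set() for doc_id in docs}
    for word in vocabulary:
        # Frequencies of this word i across every document j.
        freqs = [counts[doc_id][word] for doc_id in docs]
        mu, sigma = mean(freqs), pstdev(freqs)
        if sigma == 0:
            continue  # the word is equally frequent everywhere: no outlier
        for doc_id in docs:
            # S203 (assumption): a frequency far above the mean is an outlier.
            if (counts[doc_id][word] - mu) / sigma > z_threshold:
                result[doc_id].add(word)                         # S204
    return result


docs = {
    "d1": "economy keizai economy the",
    "d2": "the weather",
    "d3": "the game",
    "d4": "the film",
    "d5": "the song",
    "d6": "the book",
}
found = important_words(docs, {"keizai": "economy"})
```

In this toy corpus the word "the" occurs once in every document, so its frequencies have no spread and it is never flagged, while "economy" (after "keizai" is merged into it) stands out only in document d1.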

The importance determination system 20 configured in this way can achieve the same effects as the importance determination system 10 in the first embodiment.
In addition, in the importance determination system 20, any word extracted by the word extraction unit 202 that is a synonym is converted into the representative word stored in the conversion dictionary storage unit 211 in association with that synonym. As a result, important words can be determined accurately even for documents that contain orthographic variation, that is, inconsistent spellings of the same word.
The importance determination system 20 in the second embodiment may be modified in the same ways as the importance determination system 10 in the first embodiment.

[Third Embodiment]
FIG. 8 is a schematic block diagram showing the functional configuration of the importance determination system 30 in the third embodiment. The importance determination system 30 can communicate with a Web server 40 via a network. The Web server 40 is a server that makes text available for viewing. The Web server 40 may be, for example, a server that provides documents registered on a blog, a server that provides news articles for browsing, a server that provides a dictionary, or a server that provides search histories.

The importance determination system 30 is composed of one or more information processing apparatuses. For example, when the importance determination system 30 is composed of a single information processing apparatus, that apparatus includes a CPU, a memory, an auxiliary storage device, and the like connected by a bus, and executes an importance determination program. By executing the importance determination program, the information processing apparatus functions as an apparatus including a document information storage unit 301, a word extraction unit 302, a word appearance frequency acquisition unit 303, an importance determination unit 304, a clustering unit 321, and a word conversion unit 312. All or some of the functions of the importance determination system 30 may be realized using hardware such as an ASIC, a PLD, or an FPGA. The importance determination system 30 may also be realized by dedicated hardware. The importance determination program may be recorded on a computer-readable recording medium, and may be provided via a telecommunication line.

The document information storage unit 301, the word extraction unit 302, the word appearance frequency acquisition unit 303, and the importance determination unit 304 function in the same manner as the functional units of the same names in the first embodiment.
Based on usage examples of each word obtained from the Web server 40, the clustering unit 321 clusters the words extracted by the word extraction unit 302 so that words having the same meaning are classified into a single cluster. As the clustering method, the K-means method may be applied, for example. As a result of the clustering by the clustering unit 321, words such as 「経済」, 「エコノミー」, 「けいざい」, and 「けーざい」 are classified into one cluster. Within each cluster, the clustering unit 321 defines the word with the highest appearance frequency as the representative word and defines the remaining words as synonyms.
Based on the representative words and synonyms defined by the clustering unit 321, the word conversion unit 312 converts any synonyms among the words extracted by the word extraction unit 302 into their representative words.
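The selection of representative words from finished clusters can be sketched as follows. The sketch assumes the clusters themselves have already been produced (for example by K-means, which is not implemented here) and that overall appearance counts for each word are available; `define_representatives` and its parameters are hypothetical names.

```python
from collections import Counter


def define_representatives(clusters, word_counts):
    """Build a synonym -> representative word mapping from clustered words.

    clusters:    list of clusters, each a list of words with the same meaning
    word_counts: mapping of word -> appearance frequency
    In each cluster the most frequent word becomes the representative word and
    every other word becomes a synonym (ties go to the first word listed).
    """
    lookup = {}
    for cluster in clusters:
        representative = max(cluster, key=lambda word: word_counts[word])
        for word in cluster:
            if word != representative:
                lookup[word] = representative
    return lookup


clusters = [["エコノミー", "経済", "けいざい", "けーざい"], ["天気"]]
word_counts = Counter({"経済": 10, "エコノミー": 3, "けいざい": 1, "けーざい": 1, "天気": 4})
lookup = define_representatives(clusters, word_counts)
```

The resulting mapping has the same shape as a lookup derived from a prepared conversion dictionary, so the word conversion step can consume it unchanged.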

FIG. 9 is a flowchart showing a specific example of the processing of the importance determination system 30. A specific example of the processing of the importance determination system 30 is described below with reference to FIG. 9.
First, for each document stored in the document information storage unit 301, the word extraction unit 302 extracts the individual words from the text contained in that document (step S301). Next, the clustering unit 321 clusters the words extracted by the word extraction unit 302 so that words having the same meaning are classified into a single cluster (step S321).

Next, the word conversion unit 312 converts any synonyms among the words extracted by the word extraction unit 302 into their representative words (step S311). Next, the word appearance frequency acquisition unit 303 acquires, for each document, the word appearance frequency of each word based on the conversion result of the word conversion unit 312 (step S302). Next, for each word i, the importance determination unit 304 determines which word appearance frequencies are outliers within the set of that word's appearance frequencies across the documents j (step S303). The importance determination unit 304 then determines, from the word i and the document j of each word appearance frequency judged to be an outlier, that the word i is an important word of the document j (step S304).

The importance determination system 30 configured in this way can achieve the same effects as the importance determination system 10 in the first embodiment and the importance determination system 20 in the second embodiment.

In addition, because the clustering unit 321 defines the representative words and synonyms in the importance determination system 30, there is no need to prepare the conversion dictionary storage unit 211 in advance, unlike in the second embodiment. The cost and time required to prepare the conversion dictionary storage unit 211 can therefore be saved.

The importance determination system 30 in the third embodiment may be modified in the same ways as the importance determination system 10 in the first embodiment.
The clustering unit 321 may instead define the word located at the center of each cluster as the representative word and define the remaining words as synonyms.
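One possible reading of "the word located at the center" is the word whose feature vector lies closest to the cluster centroid. The sketch below follows that reading as an assumption; the specification does not define the feature vectors or the distance measure, so the vectors, the Euclidean distance, and the name `central_word` are all hypothetical.

```python
import math


def central_word(cluster_vectors):
    """Return the word whose vector is nearest to the centroid of its cluster.

    cluster_vectors: dict mapping word -> feature vector (list of floats),
    assumed here to be derived from the word's usage examples.
    """
    vectors = list(cluster_vectors.values())
    dims = len(vectors[0])
    # Centroid: the component-wise mean of all vectors in the cluster.
    centroid = [sum(v[k] for v in vectors) / len(vectors) for k in range(dims)]
    return min(cluster_vectors, key=lambda w: math.dist(cluster_vectors[w], centroid))


cluster = {
    "エコノミー": [0.0, 0.0],
    "経済": [1.0, 0.0],
    "けいざい": [2.0, 0.0],
}
representative = central_word(cluster)
```

Unlike the frequency-based rule, this choice does not depend on how often each spelling appears, only on where the words sit relative to one another in the feature space.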

[Application Examples]
The important words determined by the first to third embodiments configured as described above may be used as follows.

For example, when a text mining tool outputs a ranking of words, counting every word without regard to its importance causes the top of the ranking to be occupied by commonplace words that are unrelated to the topic. This problem can be addressed by generating the ranking only from the important words determined by the first to third embodiments, which makes it possible to analyze topics more accurately.
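The ranking application described above can be sketched as follows, assuming the important words of each document have already been determined. The function `rank_important_words` and the sample data are hypothetical.

```python
from collections import Counter


def rank_important_words(doc_words, important, top_n=10):
    """Rank words by total count, counting only each document's important words.

    doc_words: dict mapping document ID -> list of words in that document
    important: dict mapping document ID -> set of important words for it
    """
    totals = Counter()
    for doc_id, words in doc_words.items():
        keep = important.get(doc_id, set())
        totals.update(word for word in words if word in keep)
    return totals.most_common(top_n)


doc_words = {
    "d1": ["the", "economy", "economy", "the", "the"],
    "d2": ["the", "weather", "the"],
}
important = {"d1": {"economy"}, "d2": {"weather"}}
ranking = rank_important_words(doc_words, important)
```

Counted naively, "the" would lead this ranking with five occurrences; restricting the count to important words leaves only the topic-bearing words.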

As another example, when summarizing or clustering documents, using only the important words determined by the first to third embodiments, rather than processing every word contained in the documents, makes it possible to create summaries and perform document clustering that reflect the characteristics of the documents more accurately.

Embodiments of the present invention have been described in detail above with reference to the drawings. However, the specific configuration is not limited to these embodiments, and designs and the like that do not depart from the gist of the present invention are also included.

The present invention is applicable to techniques that perform processing based on the importance of words contained in documents.

10, 10a, 20, 30 ... importance determination system; 101 ... document information storage unit; 102, 102a, 202, 302 ... word extraction unit; 103, 103a, 203, 303 ... word appearance frequency acquisition unit; 104, 104a, 204, 304 ... importance determination unit; 211 ... conversion dictionary storage unit; 212, 312 ... word conversion unit; 321 ... clustering unit; 40 ... Web server

Claims

An importance determination system comprising an importance determination unit that determines, for each word, whether a word appearance frequency indicating the frequency with which the word appears in one document differs with a significant difference from the word appearance frequencies of the word in other documents, and that determines the importance of the word in the one document based on the result of that determination.

The importance determination system according to claim 1, further comprising a word appearance frequency acquisition unit that acquires the word appearance frequency for each document.

The importance determination system according to claim 1 or 2, wherein, when a word appearance frequency determined to differ with a significant difference within the set of word appearance frequencies of the respective documents acquired for a word satisfies a predetermined condition indicating that the frequency is high, the importance determination unit determines that the word is an important word in the document from which that word appearance frequency was acquired.

The importance determination system according to any one of claims 1 to 3, wherein, when a word appearance frequency determined to differ with a significant difference within the set of word appearance frequencies of the respective documents acquired for a word satisfies a predetermined condition indicating that the frequency is low, the importance determination unit determines that the word is an important word in each document other than the document from which that word appearance frequency was acquired.

The importance determination system according to claim 3 or 4, wherein the importance determination unit determines only words having a predetermined part of speech to be important words.

The importance determination system according to any one of claims 1 to 5, further comprising a word conversion unit that, for each document, converts words having the same or similar meanings among the words appearing in the document into a single word,
wherein the word appearance frequency acquisition unit acquires the word appearance frequency of each word after conversion by the word conversion unit.

An importance determination method comprising an importance determination step of determining, for each word, whether a word appearance frequency indicating the frequency with which the word appears in one document differs with a significant difference from the word appearance frequencies of the word in other documents, and determining the importance of the word in the one document based on the result of that determination.

A computer program for causing a computer to function as the importance determination system according to any one of claims 1 to 6.