JP6194760B2

JP6194760B2 - Keyword generation method, program, and information processing apparatus

Info

Publication number: JP6194760B2
Application number: JP2013230614A
Authority: JP
Inventors: 阿部　修也; 修也阿部
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2013-11-06
Filing date: 2013-11-06
Publication date: 2017-09-13
Anticipated expiration: 2033-11-06
Also published as: JP2015090618A

Description

The present invention relates to a technique for generating keywords for text search.

For example, even when documents in a text database are searched using a keyword specified by the user, the search does not necessarily retrieve every document that matches the user's intention.

When the user tries to retrieve target documents that were missed, coming up with another keyword is burdensome. The next keyword could, for example, be chosen from the words contained in the previously extracted documents, but the number of such candidates is enormous.

When there are many candidate words for a keyword, the processing time becomes long. On the other hand, if the candidate words are restricted, an effective keyword may not be generated, and the target documents may be missed again.

Japanese Patent Laid-Open No. 10-240743

An object of the present invention is, in one aspect, to efficiently extract keywords from a set of documents.

A keyword generation method according to one aspect includes: a detection process of detecting one or more partial character strings whose frequency of appearance in a set of documents is equal to or higher than a reference value; a selection process of calculating, for each detected partial character string, a score indicating the strength of its relation to the set, and selecting keywords from among the partial character strings based on the scores; a determination process of calculating the number of newly selected keywords and determining, based on that number, whether to end the detection process and the selection process; and an iteration process of updating the reference value and repeating the detection process, the selection process, and the determination process when it is determined that the detection process and the selection process are not to be ended.
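The four processes named above can be pictured as a single loop. The following Python sketch is only an illustration under stated assumptions: the function `generate_keywords`, its callback parameters (`detect`, `score`, `update`, `should_stop`), the score threshold, and all toy values are hypothetical and are not taken from this description, which at this point specifies neither the score formula nor the concrete termination rule.

```python
def generate_keywords(detect, score, threshold, reference_value, update, should_stop):
    """Repeat detection, selection, and determination until told to stop.

    detect(reference_value) -> iterable of partial character strings (assumed callback)
    score(s)                -> strength of relation to the document set (assumed callback)
    update(reference_value) -> next reference value (assumed callback)
    should_stop(new_count, reference_value) -> bool (assumed callback)
    """
    keywords = set()
    while True:
        candidates = detect(reference_value)                          # detection process
        selected = {s for s in candidates if score(s) >= threshold}   # selection process
        new_count = len(selected - keywords)                          # newly selected keywords
        keywords |= selected
        if should_stop(new_count, reference_value):                   # determination process
            return keywords
        reference_value = update(reference_value)                     # iteration process


# Toy data, invented purely for illustration.
toy_frequency = {"aa": 5, "bb": 3, "cc": 1}
toy_score = {"aa": 0.9, "bb": 0.7, "cc": 0.1}

result = generate_keywords(
    detect=lambda ref: {s for s, f in toy_frequency.items() if f >= ref},
    score=lambda s: toy_score[s],
    threshold=0.5,
    reference_value=5,
    update=lambda ref: ref - 2,  # assumption: the reference value is lowered each round
    should_stop=lambda new_count, ref: new_count == 0 or ref <= 1,
)
print(sorted(result))
```

Passing the detection, scoring, and update steps in as callbacks keeps the loop itself independent of how each step is actually realized.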

In one aspect, keywords can be efficiently extracted from a set of documents.

FIG. 1 is a diagram showing an outline of a network.
FIG. 2 is a diagram showing an example of text data.
FIG. 3 is a diagram showing a module configuration example of the extraction apparatus.
FIG. 4 is a diagram showing an example of the first extraction result.
FIG. 5 is a diagram showing the main processing flow.
FIG. 6 is a diagram showing an internal module configuration example of the first extraction unit.
FIG. 7 is a diagram showing the first extraction process flow.
FIG. 8 is a diagram showing an internal module configuration example of the second extraction unit.
FIG. 9 is a diagram showing the second extraction process flow.
FIG. 10 is a diagram showing the characteristic of the number of partial character strings with respect to frequency.
FIG. 11 is a diagram showing the characteristic of the number of keywords with respect to frequency.
FIG. 12 is a diagram showing an internal module configuration example of the keyword generation unit.
FIG. 13 is a diagram showing the keyword generation process flow.
FIG. 14 is a diagram showing the partial character string detection process flow.
FIG. 15 is a diagram showing an example of a detection result.
FIG. 16 is a diagram showing the keyword selection process flow.
FIG. 17 is a diagram showing the score calculation process flow.
FIG. 18 is a diagram showing the first appearance probability calculation process flow.
FIG. 19 is a diagram showing the second appearance probability calculation process flow.
FIG. 20 is a diagram showing an example of scores.
FIG. 21 is a diagram showing the end determination process flow.
FIG. 22 is a diagram showing an example of a keyword frequency distribution table.
FIG. 23 is an example of a keyword frequency distribution diagram.
FIG. 24 is a diagram showing an example of a keyword frequency distribution diagram.
FIG. 25 is a diagram showing the minimum frequency update process (A) flow.
FIG. 26 is a diagram showing an example of a detection result.
FIG. 27 is a diagram showing an example of scores.
FIG. 28 is a diagram showing an example of a keyword table.
FIG. 29 is a diagram showing an example of a detection result.
FIG. 30 is a diagram showing an example of scores.
FIG. 31 is a diagram showing an example of a keyword table.
FIG. 32 is a diagram showing the minimum frequency update process (B) flow.
FIG. 33 is a diagram showing the minimum frequency update process (C) flow.
FIG. 34 is a functional block diagram of a computer.

[Embodiment 1]
FIG. 1 shows an outline of the network. The extraction apparatus 101 and the database management system 103 are connected via a network. The database management system 103 has a text database 105, which stores text data. The text data includes a plurality of text units. A text unit is the unit of data managed by the text database 105 and includes character string data. A text unit may also include additional non-text data such as image data, photo data, or audio data. The text unit in this example is a simple blog article, but it may be a document other than a blog article.

An example of the text data stored in the text database 105 will now be described. FIG. 2 shows an example of the text data. The text data in this example is in table form and has one record per article.

This example shows simple blog articles posted during the time slot in which a live broadcast of a sports competition featuring an athlete named 村田佳菜子 (Kanako Murata) was aired. The article in the first record (not counting omitted records; the same applies below) contains the character string data "村田佳菜子ちゃん、良かった！ #xyztv" ("Kanako Murata-chan, that was great! #xyztv"). In this article, a viewer of the live broadcast posted an impression of the sports competition. The "#xyztv" contained in the article is an example of a hashtag. To make clear that an article relates to a specific theme, the poster writes a hashtag identifying that theme in the article. The "#xyztv" in this example indicates that the article concerns a program of the XYZ broadcasting station.

A hashtag is used as a search key when browsing. For example, when a reader searches for articles by specifying a hashtag, the articles containing that hashtag are displayed on a list screen. In other words, hashtags make it possible to collect articles related to, for example, a specific event or matter. Hashtags are also used for surveys and analyses of specific themes. A hashtag consists of the "#" symbol followed by half-width alphanumeric characters.
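The hashtag format described above ("#" followed by half-width alphanumeric characters) can be matched with a regular expression. The following is a minimal Python sketch; the names `HASHTAG_PATTERN` and `find_hashtags` are hypothetical, and the pattern assumes that "half-width alphanumeric" means ASCII digits and letters.

```python
import re

# "#" followed by one or more half-width (ASCII) letters or digits.
HASHTAG_PATTERN = re.compile(r"#[0-9A-Za-z]+")


def find_hashtags(text):
    """Return every hashtag found in the text, in order of appearance."""
    return HASHTAG_PATTERN.findall(text)


print(find_hashtags("むらかなかわいい #xyztv"))
print(find_hashtags("佳菜子ちゃん素敵！"))
```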

The article in the second record contains the character string data "佳菜子ちゃん素敵！" ("Kanako-chan is wonderful!"). This article, too, was posted by a viewer of the live broadcast as an impression of the sports competition. However, it does not contain a hashtag.

The article in the third record contains the character string data "むらかなかわいい #xyztv" ("Murakana is cute #xyztv"). This article, too, was posted by a viewer of the live broadcast as an impression of the sports competition. Like the first record, it contains the hashtag "#xyztv". The "むらかな" (Murakana) in the article is the athlete's nickname, an abbreviation of her full name 村田佳菜子 (Murata Kanako).

The article in the fourth record contains the character string data "むらかな決めろ！" ("Murakana, nail it!"). This article, too, was posted by a viewer of the live broadcast as an impression of the sports competition. However, it does not contain a hashtag. The nickname "むらかな" (Murakana) is used in this article as well.

The article in the fifth record contains the character string data "村田、まずまず良かったかな #xyztv" ("Murata did fairly well, I guess #xyztv"). This article, too, was posted by a viewer of the live broadcast as an impression of the sports competition. Like the first and third records, it contains the hashtag "#xyztv".

The text data also contains other articles that viewers of the live broadcast posted about the sports competition, as well as articles unrelated to the competition.

Returning to the description of FIG. 1, the network shown in FIG. 1 is, for example, the Internet or a LAN (Local Area Network). The extraction apparatus 101 extracts text units (in this example, simple blog articles) related to a specific theme from the text database 105 via the network. In this example, the extraction apparatus 101 extracts, from the text data shown in FIG. 2, the articles about the sports competition posted by viewers of the live broadcast. Although FIG. 1 shows an example in which the extraction apparatus 101 and the database management system 103 are connected via a network, the database management system 103 and the extraction apparatus 101 may be a single integrated apparatus.

A user terminal 107 is also connected to the extraction apparatus 101. The user terminal 107 is a terminal used by a user of the service provided by the extraction apparatus 101. In FIG. 1, the user terminal 107 and the extraction apparatus 101 are connected via the same network as described above, but they may instead be connected via a different network.

Next, the module configuration of the extraction apparatus 101 will be described. FIG. 3 shows a module configuration example of the extraction apparatus 101. The extraction apparatus 101 includes a parameter storage unit 301, a reception unit 303, a first extraction unit 305, a first extraction result storage unit 307, a sampling unit 309, a sampling result storage unit 311, a second extraction unit 313, a second extraction result storage unit 315, and an output unit 317.

The parameter storage unit 301 stores the parameters used in the processing of the extraction apparatus 101. The reception unit 303 receives data input from the user terminal 107, for example the above parameters and the first extraction condition described later. The parameters include, for example, the initial value of the minimum frequency described later.

The extraction apparatus 101 performs, for example, two extractions. The first extraction unit 305 performs the earlier of the two. Specifically, the first extraction unit 305 extracts from the text database 105 the text units (in this example, simple blog articles) that satisfy the first extraction condition. To do so, the first extraction unit 305 transmits a first query containing a hashtag to the database management system 103. A partial match on the hashtag corresponds to the first extraction condition. However, the first extraction condition may be a condition on text data other than a hashtag.

The first extraction unit 305 acquires the extracted result (called the first extraction result) from the database management system 103. In this example, the first extraction result is the set of text units (simple blog articles) that satisfy the first extraction condition. The first extraction result storage unit 307 stores the first extraction result.

FIG. 4 shows an example of the first extraction result. This first extraction result is the set of articles retrieved with the hashtag "#xyztv". The article "村田佳菜子ちゃん、良かった！ #xyztv" in the first record of FIG. 4 was obtained from the first record of FIG. 2. Likewise, the article "むらかなかわいい #xyztv" in the second record of FIG. 4 was obtained from the third record of FIG. 2, and the article "村田、まずまず良かったかな #xyztv" in the third record of FIG. 4 was obtained from the fifth record of FIG. 2.

The articles in the second and fourth records of FIG. 2 are not extracted because they do not contain the hashtag "#xyztv".
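The relationship between FIG. 2 and FIG. 4 can be reproduced with a few lines of Python. This is only a sketch that assumes the text data is held in an in-memory list instead of behind the database management system 103; the names `TEXT_DATA` and `first_extraction` are hypothetical, while the article strings are those quoted from FIG. 2.

```python
# Articles from FIG. 2 (records omitted in the figure are not reproduced here).
TEXT_DATA = [
    "村田佳菜子ちゃん、良かった！ #xyztv",
    "佳菜子ちゃん素敵！",
    "むらかなかわいい #xyztv",
    "むらかな決めろ！",
    "村田、まずまず良かったかな #xyztv",
]


def first_extraction(text_units, hashtag):
    """Return the text units that contain the hashtag (partial match)."""
    return [unit for unit in text_units if hashtag in unit]


first_result = first_extraction(TEXT_DATA, "#xyztv")
for article in first_result:
    print(article)
```

Only the first, third, and fifth articles are printed; the second and fourth, which lack the hashtag, are left behind even though they concern the same broadcast.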

Returning to the description of FIG. 3, the sampling unit 309 executes a sampling process. Specifically, the sampling unit 309 requests the database management system 103 to extract text units (in this example, simple blog articles) at random from the text database 105, and acquires the extracted result (called the sampling result) from the database management system 103. The sampling result is a set of randomly extracted text units. The sampling result storage unit 311 stores the sampling result.

The second extraction unit 313 extracts from the text database 105 the text units (in this example, simple blog articles) related to the first extraction result. Specifically, the second extraction unit 313 generates keywords based on the first extraction result and transmits a second query containing those keywords to the database management system 103.

The second extraction unit 313 acquires the extracted result (called the second extraction result) from the database management system 103. In this example, the second extraction result is the set of text units (simple blog articles) that contain the keywords. The second extraction result storage unit 315 stores the second extraction result.

The output unit 317 outputs the second extraction result. In this example, the output unit 317 transmits the second extraction result to the user terminal 107. However, the form of output is not limited to transmission: the output unit 317 may display the second extraction result, write it to a storage medium, or print it.

The parameter storage unit 301, the reception unit 303, the first extraction unit 305, the first extraction result storage unit 307, the sampling unit 309, the sampling result storage unit 311, the second extraction unit 313, the second extraction result storage unit 315, and the output unit 317 are realized by, for example, the hardware resources shown in FIG. 34. The reception unit 303, the first extraction unit 305, the sampling unit 309, the second extraction unit 313, and the output unit 317 may also be realized by having the CPU 2503 (FIG. 34) sequentially execute a program loaded into the memory 2501 (FIG. 34) to perform part or all of the processing of each module. This concludes the description of the module configuration of the extraction apparatus 101.

Next, the processing of the extraction apparatus 101 will be described. FIG. 5 shows the main processing flow. The reception unit 303 receives parameters from the user terminal 107 (S501) and stores them in the parameter storage unit 301. The received parameters are, for example, the various reference values described later (including the minimum frequency), initial values, and additional values. When parameters that have already been set are to be used, the processing of S501 may be omitted.

The reception unit 303 receives the first extraction condition from the user terminal 107 (S503). The first extraction condition is, for example, a condition for extracting text units (in this example, simple blog articles) related to a specific theme. For example, the first extraction condition is specified by a hashtag as described above. It may instead be another search condition, such as a character string to be searched for within articles.

The first extraction unit 305 executes the first extraction process (S505). The internal module configuration of the first extraction unit 305, which performs the first extraction process, will now be described. FIG. 6 shows an internal module configuration example of the first extraction unit 305. The first extraction unit 305 includes a first query generation unit 601, a first request unit 603, and a first acquisition unit 605. The first query generation unit 601 generates the first query based on the first extraction condition. The first query is a processing request that causes the database management system 103 to extract the text units (in this example, simple blog articles) satisfying the first extraction condition. The first request unit 603 transmits the first query to the database management system 103. The first acquisition unit 605 acquires the first extraction result from the database management system 103.

The first query generation unit 601, the first request unit 603, and the first acquisition unit 605 are realized by, for example, the hardware resources shown in FIG. 34. They may also be realized by having the CPU 2503 (FIG. 34) sequentially execute a program loaded into the memory 2501 (FIG. 34) to perform part or all of the processing of each module.

FIG. 7 shows the first extraction process flow. The first query generation unit 601 of the first extraction unit 305 generates the first query based on the first extraction condition (S701).

The first request unit 603 of the first extraction unit 305 transmits the first query to the database management system 103 (S703). The first acquisition unit 605 of the first extraction unit 305 acquires the first extraction result from the database management system 103 and writes it to the first extraction result storage unit 307 (S705). This concludes the description of the first extraction process.

In this way, for example, the articles shown in FIG. 4 are extracted. The articles in this first extraction result are assumed to relate to the specific theme the user has in mind. At this stage, however, not every article related to the theme has necessarily been collected. For example, the articles in the second and fourth records of FIG. 2 express the same kind of impression as the articles in the first, third, and fifth records, yet they have not been extracted.

Returning to the description of FIG. 5, the sampling unit 309 executes the sampling process (S507). Specifically, the sampling unit 309 requests the database management system 103 to extract text units (in this example, simple blog articles) at random from the text database 105, and acquires the extracted result (the sampling result) from the database management system 103. The sampling result storage unit 311 stores the sampling result obtained by the sampling unit 309. The sampling result includes a plurality of text units.
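As a rough illustration of the sampling process, the following Python sketch draws text units at random from an in-memory list. The function name `sample_text_units` and its parameters are hypothetical. The description does not say how the database management system 103 performs the random extraction, so drawing uniformly without replacement is an assumption.

```python
import random


def sample_text_units(text_units, sample_size, seed=None):
    """Draw text units uniformly at random, without replacement (assumed scheme)."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    size = min(sample_size, len(text_units))
    return rng.sample(list(text_units), size)


articles = ["article %d" % i for i in range(100)]  # placeholder text units
sampling_result = sample_text_units(articles, 10, seed=0)
print(len(sampling_result))
```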

The second extraction unit 313 executes the second extraction process (S509). The internal module configuration of the second extraction unit 313, which performs the second extraction process, will now be described. FIG. 8 shows an internal module configuration example of the second extraction unit 313. The second extraction unit 313 includes a keyword generation unit 801, a keyword storage unit 803, a second query generation unit 805, a second request unit 807, and a second acquisition unit 809. The keyword generation unit 801 generates keywords based on the first extraction result. A keyword serves as a condition for extracting text units (in this example, simple blog articles) in the second extraction process. Keywords are selected from among the partial character strings contained in the text units of the first extraction result. For example, the partial character string "むらかな" (Murakana) contained in the article of the second record shown in FIG. 2 becomes a keyword.

The keyword storage unit 803 stores the generated keywords. The second query generation unit 805 generates the second query based on the keywords. The second request unit 807 transmits the second query to the database management system 103. The second acquisition unit 809 acquires the second extraction result from the database management system 103.

The keyword generation unit 801, the keyword storage unit 803, the second query generation unit 805, the second request unit 807, and the second acquisition unit 809 are realized by, for example, the hardware resources shown in FIG. 34. The keyword generation unit 801, the second query generation unit 805, the second request unit 807, and the second acquisition unit 809 of the second extraction unit 313 may also be realized by having the CPU 2503 (FIG. 34) sequentially execute a program loaded into the memory 2501 (FIG. 34) to perform part or all of the processing of each module.

FIG. 9 shows the second extraction process flow. The keyword generation unit 801 of the second extraction unit 313 executes the keyword generation process (S901). The keyword generation unit 801 detects the partial character strings contained in the text units (in this example, simple blog articles) of the first extraction result and selects, from among them, the partial character strings strongly related to the first extraction result. The selected partial character strings become the keywords.

One way to detect words in a text unit (in this example, a simple blog article) is morphological analysis, and keywords could be selected from among the words it detects. However, the words detected by morphological analysis are limited to those registered in advance as candidates, so words not yet registered are not detected. For example, new words, abbreviations, technical terms, or dialect words contained in a text unit may go undetected.

In the present embodiment, partial character strings are detected by frequent pattern mining so that new words, abbreviations, technical terms, dialect words, and the like contained in text units can also be adopted as keywords.

A partial character string (substring) is a run of two or more consecutive characters. For example, the partial character strings in the sentence "むらかなかわいい" ("Murakana is cute") are "むら", "むらか", "むらかな", (omitted), "むらかなかわいい", "らか", "らかな", "らかなか", (omitted), "らかなかわいい", (omitted), "わい", "わいい", and "いい". In contrast, morphological analysis of the sentence "むらかなかわいい" detects, for example, the word "かわいい" ("cute"). The number of theoretically possible partial character strings is thus far larger than the number of words detected by morphological analysis.
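The enumeration in the example above can be reproduced with a short Python sketch. The function name `substrings` and the parameter `min_length` are hypothetical; the only rule taken from the text is that a partial character string is a run of two or more consecutive characters.

```python
def substrings(text, min_length=2):
    """Return every distinct contiguous substring of at least min_length characters."""
    found = set()
    for start in range(len(text)):
        for end in range(start + min_length, len(text) + 1):
            found.add(text[start:end])
    return found


candidates = substrings("むらかなかわいい")
print(len(candidates))
print(sorted(candidates, key=len)[:7])
```

For this eight-character sentence the sketch yields 28 partial character strings, against the single word "かわいい" mentioned for morphological analysis.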

However, treating every theoretically possible partial character string as a keyword candidate makes keyword selection cumbersome. In the present embodiment, only partial character strings whose frequency is equal to or higher than a predetermined frequency are treated as keyword candidates. This limits the number of candidates and lightens the keyword selection processing. The frequency here is the number of text units (in this example, simple blog articles) in the first extraction result that contain the partial character string. Put differently, it is the number of times the partial character string appears in the text units of the first extraction result, except that a partial character string occurring at several places in one text unit is counted only once. The predetermined frequency is a kind of reference value and is hereinafter called the minimum frequency.
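The frequency defined above is a per-text-unit count, and the minimum frequency is a cut-off on it. The Python sketch below illustrates both with a brute-force enumeration over the three articles of FIG. 4; the names `document_frequencies` and `frequent_substrings` are hypothetical, and brute force is used here only for clarity, not because the embodiment works this way.

```python
from collections import Counter

# First extraction result (FIG. 4).
FIRST_RESULT = [
    "村田佳菜子ちゃん、良かった！ #xyztv",
    "むらかなかわいい #xyztv",
    "村田、まずまず良かったかな #xyztv",
]


def document_frequencies(text_units, min_length=2):
    """Count, for each partial character string, the text units that contain it."""
    counts = Counter()
    for unit in text_units:
        seen = set()  # a string found several times in one unit is counted once
        for start in range(len(unit)):
            for end in range(start + min_length, len(unit) + 1):
                seen.add(unit[start:end])
        counts.update(seen)
    return counts


def frequent_substrings(text_units, min_frequency, min_length=2):
    """Keep only the partial character strings at or above the minimum frequency."""
    counts = document_frequencies(text_units, min_length)
    return {s: n for s, n in counts.items() if n >= min_frequency}


frequencies = document_frequencies(FIRST_RESULT)
print(frequencies["#xyztv"], frequencies["村田"], frequencies["むらかな"])
print(len(frequencies), len(frequent_substrings(FIRST_RESULT, 2)))
```

Raising the minimum frequency from 1 to 2 already removes "むらかな" from the candidates in this tiny set, which shows how the choice of the minimum frequency decides which keywords can be found at all.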

In general, frequent pattern mining detects frequently occurring patterns in a set of element sequences. In the present embodiment, a text unit corresponds to an element sequence, a character corresponds to an element, and a partial character string corresponds to a pattern. In other words, frequent pattern mining is used to detect the partial character strings that occur frequently in a set of text units. Detecting frequent partial character strings by frequent pattern mining imposes a smaller processing load than calculating the frequency of every theoretically possible partial character string.

The present embodiment uses Modified Prefixspan, which is a kind of frequent pattern mining. Modified Prefixspan detects element sequences subject to a condition on the distance between elements. In the present embodiment, character strings are detected under the condition that the distance between characters is 0.

Modified Prefixspan is described in detail in H. Kitakami, T. Kanbara, Y. Mori, S. Kuroki, Y. Yamazaki: "Modified PrefixSpan Method for Motif Discovery in Sequence Databases", PRICAI 2002, pp. 482-491, 2002.

Modified Prefixspan also detects only element sequences whose frequency is at or above the minimum frequency. If the minimum frequency is too high, however, partial character strings that should become keywords are not detected. In that case, text units that should be extracted (in this example, simple blog articles) may be missed.

Conversely, if the minimum frequency is too low, superfluous partial character strings are detected. As a result, the processing load increases and the processing time becomes longer.

In the present embodiment, the minimum frequency is determined by focusing on the frequency of the partial character strings that become keywords.

FIG. 10 shows the characteristic of the number of partial character strings with respect to frequency. The horizontal axis indicates the frequency of a partial character string. The vertical axis indicates the number of partial character strings. This characteristic is assumed to approximate the empirical rule for the frequency of occurrence of words in English (Zipf's law). Under that assumption, as illustrated, partial character strings with a low frequency are extremely numerous, whereas partial character strings with a high frequency are extremely few. The partial character strings in this figure are the theoretically possible partial character strings and include both those that become keywords and those that do not.

FIG. 11 shows the characteristic of the number of keywords with respect to frequency. The horizontal axis indicates the frequency of a keyword. The vertical axis indicates the number of keywords. This figure shows the distribution of those partial character strings in FIG. 10 that become keywords. In the present embodiment, the distribution of keywords over frequency is assumed to be mountain-shaped, that is, to have a single peak. The example in FIG. 11 approximates a normal distribution.

In this figure, if, for example, the minimum frequency is set to a value smaller than the frequency at which the number of keywords is largest (which corresponds to the mean frequency) and Modified Prefixspan is performed, more than half of the keywords are detected. In the present embodiment, the minimum frequency is set based on the number of keywords at each frequency.

The keyword generation unit 801 searches for an appropriate minimum frequency while moving the minimum frequency from larger values toward smaller ones. Specifically, the keyword generation unit 801 performs Modified Prefixspan while decreasing the minimum frequency, and determines an appropriate minimum frequency from the accompanying change in the number of keywords. The keywords that have been detected at the point when the minimum frequency is determined to be appropriate are the result generated by the keyword generation unit 801.

The keyword generation unit 801 determines whether a partial character string is suitable as a keyword based on a score indicating the strength of its relation to the first extraction result.

The internal module configuration of the keyword generation unit 801, which performs the keyword generation process, will now be described. FIG. 12 shows an example of the internal module configuration of the keyword generation unit 801. The keyword generation unit 801 has a partial character string detection unit 1201, a detection result storage unit 1203, a keyword selection unit 1205, a selection result storage unit 1207, an end determination unit 1209, a frequency distribution storage unit 1211, a repetition unit 1213, and a keyword specifying unit 1215.

The partial character string detection unit 1201 uses the Modified Prefixspan described above to detect partial character strings in the text units (in this example, simple blog articles) held in the first extraction result storage unit 307. The detection result storage unit 1203 stores the detection result produced by the partial character string detection unit 1201. The detection result contains the one or more detected partial character strings. The keyword selection unit 1205 selects keywords from among the partial character strings in the detection result. The selection result storage unit 1207 stores the selection result produced by the keyword selection unit 1205. The selection result contains a keyword table for each iteration, and the selected keywords are registered in the keyword table. The end determination unit 1209 generates a frequency distribution table of keywords and determines whether the second extraction process is to end. The frequency distribution storage unit 1211 stores the frequency distribution table of keywords. The repetition unit 1213 updates the minimum frequency used for Modified Prefixspan and repeats the processing by the partial character string detection unit 1201, the processing by the keyword selection unit 1205, and the processing by the end determination unit 1209. The keyword specifying unit 1215 specifies the keywords that are the result of the second extraction process.

The partial character string detection unit 1201, the detection result storage unit 1203, the keyword selection unit 1205, the selection result storage unit 1207, the end determination unit 1209, the frequency distribution storage unit 1211, the repetition unit 1213, and the keyword specifying unit 1215 are realized by, for example, the hardware resources shown in FIG. 34. Part or all of the processing of the partial character string detection unit 1201, the keyword selection unit 1205, the end determination unit 1209, the repetition unit 1213, and the keyword specifying unit 1215 may be realized by the CPU 2503 (FIG. 34) sequentially executing a program loaded into the memory 2501 (FIG. 34).

FIG. 13 shows the keyword generation processing flow. The partial character string detection unit 1201 of the keyword generation unit 801 executes the partial character string detection process (S1301).

FIG. 14 shows the partial character string detection processing flow. The partial character string detection unit 1201 executes the frequent pattern mining process (S1401).

The frequent pattern mining process according to the present embodiment will now be described. As described above, frequent pattern mining is performed by regarding a character as an element, a text unit in the first extraction result (in this example, a simple blog article) as an element sequence, and a partial character string as a pattern. In the present embodiment, Modified Prefixspan, which is a kind of frequent pattern mining, is executed. At this time, the partial character string detection unit 1201 applies the minimum frequency stored in the parameter storage unit 301 to Modified Prefixspan. The partial character string detection unit 1201 also sets the minimum number of elements applied to Modified Prefixspan to 2, and sets the inter-element distance applied to Modified Prefixspan to 0.
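The effect of these three parameters (minimum frequency, minimum number of elements 2, inter-element distance 0) can be illustrated with a brute-force sketch. It is not Modified Prefixspan: it simply enumerates every contiguous partial character string of each text unit, which is the costly approach the embodiment avoids, but it should yield the same kind of detection result on small inputs. The function name and sample data are assumptions.

```python
from collections import Counter


def frequent_substrings(text_units, min_frequency, min_length=2):
    """Return {partial character string: frequency} for contiguous strings
    (inter-element distance 0) of at least `min_length` characters that
    occur in at least `min_frequency` text units."""
    counts = Counter()
    for unit in text_units:
        seen = set()  # each string is counted once per text unit
        for start in range(len(unit)):
            for end in range(start + min_length, len(unit) + 1):
                seen.add(unit[start:end])
        counts.update(seen)
    return {s: f for s, f in counts.items() if f >= min_frequency}


# Hypothetical text units; only "ab" occurs in at least two of them.
print(frequent_substrings(["abc", "abd", "xab"], min_frequency=2))
```

The returned mapping has the same shape as the detection result described for FIG. 15: one entry per detected partial character string, together with its frequency.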

The partial character string detection unit 1201 of the keyword generation unit 801 writes the detection result to the detection result storage unit 1203 (S1403). The detection result is the set of partial character strings detected by Modified Prefixspan.

FIG. 15 shows an example of the detection result. The detection result has a record for each detected partial character string. Each record has a field for storing the partial character string and a field for storing its frequency.

This example shows the detection result of the frequent pattern mining performed first. The minimum frequency at this time is set to 60, so partial character strings with a frequency of 60 or more are detected.

The first record indicates that the partial character string "良かった" ("was good") was detected and that its frequency is 70. The second record indicates that the partial character string "かな" ("I wonder") was detected and that its frequency is 90. This means that the partial character strings "良かった" and "かな" appear frequently in the articles included in the first extraction result.

Returning to the description of FIG. 13, the keyword selection unit 1205 of the keyword generation unit 801 executes the keyword selection process (S1303). FIG. 16 shows the keyword selection processing flow. From among the partial character strings in the detection result stored in the detection result storage unit 1203, the keyword selection unit 1205 identifies one unprocessed partial character string that has not yet been subjected to the score calculation process (S1601). The keyword selection unit 1205 executes the score calculation process for the identified partial character string (S1603). The score calculation process calculates a score for the identified partial character string.

The score in this example is a lift value obtained by dividing a first appearance probability by a second appearance probability. The first appearance probability is the probability that the partial character string appears in the first extraction result. The second appearance probability is the probability that the partial character string appears in the sampling result. A large score means that the partial character string is strongly related to the first extraction result. A small score means that the partial character string is weakly related to the first extraction result.

FIG. 17 shows the score calculation processing flow. The keyword selection unit 1205 executes the first appearance probability calculation process (S1701), which calculates the first appearance probability described above.

FIG. 18 shows the first appearance probability calculation processing flow. The keyword selection unit 1205 identifies the number of text units (in this example, simple blog articles) included in the first extraction result (S1801). Specifically, the keyword selection unit 1205 counts the text units included in the first extraction result stored in the first extraction result storage unit 307. This number of text units is referred to below as the first text unit number.

Next, the keyword selection unit 1205 counts the number of text units (in this example, simple blog articles) in the first extraction result that contain the partial character string (S1803). This number of text units is referred to below as the first appearance frequency.

The keyword selection unit 1205 then calculates the probability that the partial character string appears in the first extraction result (the first appearance probability) (S1805). Specifically, the keyword selection unit 1205 divides the first appearance frequency by the first text unit number; the resulting quotient is the first appearance probability.

Returning to the description of FIG. 17, the keyword selection unit 1205 executes the second appearance probability calculation process (S1703), which calculates the second appearance probability described above.

FIG. 19 shows the second appearance probability calculation processing flow. The keyword selection unit 1205 identifies the number of text units (in this example, simple blog articles) included in the sampling result (S1901). Specifically, the keyword selection unit 1205 counts the text units included in the sampling result stored in the sampling result storage unit 311. This number of text units is referred to below as the second text unit number.

Next, the keyword selection unit 1205 counts the number of text units (in this example, simple blog articles) in the sampling result that contain the partial character string (S1903). This number of text units is referred to below as the second appearance frequency.

The keyword selection unit 1205 then calculates the probability that the partial character string appears in the sampling result (the second appearance probability) (S1905). Specifically, the keyword selection unit 1205 divides the second appearance frequency by the second text unit number; the resulting quotient is the second appearance probability.

Returning to the description of FIG. 17, the keyword selection unit 1205 calculates the ratio of the appearance probabilities (the lift value) (S1705). Specifically, the keyword selection unit 1205 divides the first appearance probability by the second appearance probability; the resulting ratio is the lift value.
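The score calculation of S1701 to S1705 can be sketched as below. The function names and sample data are assumptions. The description does not say what happens when the text unit set is empty or when the second appearance probability is 0, so the handling of those cases here is purely an assumption of this sketch.

```python
def appearance_probability(substring, text_units):
    """Appearance frequency divided by the number of text units."""
    if not text_units:
        return 0.0  # assumption: an empty set gives probability 0
    appearance_frequency = sum(1 for unit in text_units if substring in unit)
    return appearance_frequency / len(text_units)


def lift(substring, first_extraction, sampling):
    """First appearance probability divided by second appearance probability."""
    p1 = appearance_probability(substring, first_extraction)  # S1701
    p2 = appearance_probability(substring, sampling)          # S1703
    if p2 == 0:
        # Assumption: the source does not define this case.
        return float("inf") if p1 > 0 else 0.0
    return p1 / p2                                            # S1705


# Hypothetical data: "ab" is in 3 of 4 extracted units and 1 of 4 sampled units.
first_extraction = ["ab", "ab", "cd", "ab"]
sampling = ["ab", "x", "y", "z"]
print(lift("ab", first_extraction, sampling))  # 0.75 / 0.25
```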

FIG. 20 shows example scores for the partial character strings in the detection result shown in FIG. 15. The first record indicates that the score of the partial character string "良かった" is 4. The second record indicates that the score of the partial character string "かな" is 3. Both scores are small, which means that these partial character strings are only weakly related to the first extraction result. Because these character strings appear frequently in ordinary sentences, their relation to the first extraction result cannot be called particularly strong.

Returning to the description of FIG. 16, the keyword selection unit 1205 determines whether the calculated score exceeds a reference value (S1605). This reference value is a parameter for determining whether the relation to the first extraction result is strong. The reference value is stored in the parameter storage unit 301. The reference value is one of the parameters accepted by the parameter storage unit 301. The reference value is also sometimes called the minimum score.

When the calculated score is determined to exceed the reference value, the keyword selection unit 1205 registers the partial character string in the current keyword table (S1607). The keyword table for each iteration is generated in the selection result storage unit 1207. Specifically, the keyword selection unit 1205 sets the partial character string in the current keyword table stored in the selection result storage unit 1207. At this time, the keyword selection unit 1205 may also add the score corresponding to the partial character string to the current keyword table.

When the calculated score is determined not to exceed the reference value, the keyword selection unit 1205 does not register the partial character string in the current keyword table. For example, none of the partial character strings in the detection result shown in FIGS. 15 and 20 has a score exceeding the reference value, so none of them is registered in the current keyword table.
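The selection of S1605 and S1607 amounts to a strict threshold test on the score. In the sketch below the function name is an assumption, and the reference value of 10 is hypothetical: the description gives the scores of FIG. 20 but not the value of the minimum score.

```python
def select_keywords(scores: dict[str, float], min_score: float) -> dict[str, float]:
    """Build the current keyword table: keep strings whose score exceeds
    `min_score` (strictly greater, following "exceeds" in S1605)."""
    return {s: score for s, score in scores.items() if score > min_score}


# Scores from FIG. 20; the threshold 10 is a hypothetical minimum score.
fig20_scores = {"良かった": 4, "かな": 3}
print(select_keywords(fig20_scores, min_score=10))  # nothing is registered
```

Storing the score alongside each keyword corresponds to the optional step of adding the score to the keyword table.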

The keyword selection unit 1205 determines whether any unprocessed partial character string that has not yet been subjected to the score calculation process remains among the partial character strings in the detection result stored in the detection result storage unit 1203 (S1609). When an unprocessed partial character string remains, processing returns to S1601 and the keyword selection unit 1205 repeats the processing described above. When no unprocessed partial character string remains, the keyword selection unit 1205 ends the keyword selection process, and processing proceeds to S1305 in FIG. 13.

Returning to the description of FIG. 13, the end determination unit 1209 of the keyword generation unit 801 executes the end determination process (S1305).

FIG. 21 shows the end determination processing flow. The end determination unit 1209 determines whether the current minimum frequency is equal to or less than a reference value (S2101). For example, as shown in FIG. 10, when the frequency falls to 2 or 3 the number of partial character strings becomes extremely large, and the load of processing such as the second appearance probability calculation process described above becomes enormous. The present embodiment therefore sets a lower limit on the minimum frequency. This reference value is that lower limit.

When the current minimum frequency is determined to be equal to or less than the reference value, the end determination unit 1209 sets the determination result to "end" (S2103). In this way, no further search is performed once the current minimum frequency is at or below the reference value.

When the current minimum frequency is determined not to be equal to or less than the reference value, the end determination unit 1209 identifies a new range (S2105). As described above, the end determination process is based on the distribution of keywords over frequency. For that purpose, the end determination unit 1209 generates a frequency distribution table of keywords. The new range is a range of frequencies and corresponds to a new class in the frequency distribution table.

FIG. 22 shows an example of the frequency distribution table of keywords. The frequency distribution table has a record for each class, and a class is identified by a range of frequencies. Each record has a field for storing the frequency range and a field for storing the number of keywords. A new record is added each time the end determination process is performed in the repetition.

The first record indicates that the number of keywords in the range of frequency 60 or more is 0. The number of keywords in the first record is the number of keywords generated in the first iteration.

The second record indicates that the number of keywords in the range of frequency 57 or more and less than 60 is 3. The number of keywords in the second record is the difference between the number of keywords generated in the second iteration and the number generated in the first iteration, that is, the number of keywords newly generated in the second iteration.

The third record indicates that the number of keywords in the range of frequency 54 or more and less than 57 is 6. The number of keywords in the third record is the difference between the number of keywords generated in the third iteration and the number generated in the second iteration, that is, the number of keywords newly generated in the third iteration.

The fourth record indicates that the number of keywords in the range of frequency 51 or more and less than 54 is 12. The number of keywords in the fourth record is the difference between the number of keywords generated in the fourth iteration and the number generated in the third iteration, that is, the number of keywords newly generated in the fourth iteration.

The fifth record indicates that the number of keywords in the range of frequency 48 or more and less than 51 is 18. The number of keywords in the fifth record is the difference between the number of keywords generated in the fifth iteration and the number generated in the fourth iteration, that is, the number of keywords newly generated in the fifth iteration.

In this example the ranges are all the same width, but the width of the range may differ from iteration to iteration.

FIG. 23 shows an example of a frequency distribution chart of keywords. This chart is a graph (histogram) of the frequency distribution table shown in FIG. 22.

Returning to the description of FIG. 21, the end determination unit 1209 calculates the number of keywords in the new range (S2107). Specifically, the end determination unit 1209 counts the keywords in the current keyword table and subtracts the previous number of keywords from the current number of keywords to obtain the number of keywords in the new range.

The end determination unit 1209 adds a new record to the frequency distribution table (S2109). The new range and the number of keywords are set in the new record.
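Steps S2105 to S2109 can be sketched as follows, reproducing the records described for FIG. 22. The record layout and names are assumptions. The cumulative keyword totals per iteration (0, 3, 9, 21, 39) are not stated in the description; they are derived here by summing the per-range counts of FIG. 22.

```python
def append_record(table, lower, upper, current_total, previous_total):
    """Add one class to the frequency distribution table.

    `upper` is None for the first class, which has no upper bound.
    The count is the number of keywords newly found in this iteration.
    """
    table.append({"range": (lower, upper),
                  "keywords": current_total - previous_total})


table = []
minimum_frequencies = [60, 57, 54, 51, 48]  # one per iteration (FIG. 22)
keyword_totals = [0, 3, 9, 21, 39]          # derived cumulative totals

previous_total, upper = 0, None
for lower, total in zip(minimum_frequencies, keyword_totals):
    append_record(table, lower, upper, total, previous_total)
    previous_total, upper = total, lower

for record in table:
    print(record)
```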

The end determination unit 1209 determines the trend in the number of keywords (S2111). The trend is, for example, either increasing or decreasing. For example, the end determination unit 1209 determines that the trend is increasing when the number of keywords in the current range is greater than the number in the previous range, and that the trend is decreasing when the number of keywords in the current range is smaller than the number in the previous range. Alternatively, the end determination unit 1209 may compare the number of keywords in the current range with the number of keywords in a range from an iteration earlier than the previous one. The end determination unit 1209 may also compare the number of keywords in the current range with the number of keywords in each of several earlier ranges and determine the trend based on the results of those comparisons.
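The simplest variant of S2111, comparing the current range with the previous one, can be sketched as below. The function name and the return labels are assumptions. The description does not say how equal counts are classified, so the "flat" label is an assumption of this sketch.

```python
def change_trend(previous_count: int, current_count: int) -> str:
    """Classify the trend in the number of keywords between two ranges."""
    if current_count > previous_count:
        return "increasing"
    if current_count < previous_count:
        return "decreasing"
    return "flat"  # assumption: the source does not cover equal counts


# Third and fourth records of FIG. 22: 6 keywords, then 12.
print(change_trend(6, 12))
```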

The end determination unit 1209 determines whether the trend in the number of keywords is increasing (S2113). When the trend is determined to be increasing, the end determination unit 1209 sets the determination result to "do not end" (S2119). For example, as shown in FIG. 23, when the number of keywords in the fourth range is on an increasing trend, it is presumed that many of the partial character strings that should become keywords have not yet been detected, so the keyword generation process is continued.

When the trend in the number of keywords is determined not to be increasing, the end determination unit 1209 determines whether the number of keywords in the current range is equal to or less than a first reference value (S2115). The first reference value is a criterion for determining that the number of keywords is converging as the frequency decreases. The first reference value is stored in the parameter storage unit 301. The first reference value may be accepted by the reception unit 303. The first reference value may also be set according to the first text unit number (in this example, the number of articles included in the first extraction result).

FIG. 24 shows an example of a frequency distribution chart of keywords. The broken line in the figure indicates the first reference value. The number of keywords in the range for the iteration in which partial character strings were detected with the minimum frequency set to 15 is above the first reference value, so the number of keywords is presumed not yet to have reached the stage of convergence.

The number of keywords in the range for the next iteration, in which partial character strings were detected with the minimum frequency set to 12, is below the first reference value, so the number of keywords is presumed to be converging.

The number of keywords in the range for the following iteration, in which partial character strings were detected with the minimum frequency set to 9, is again below the first reference value. At this point the number has been below the first reference value twice in succession. In the present embodiment, keyword generation is ended when the number of keywords in the range has fallen below the first reference value in two successive iterations in this way. The number of successive iterations may be 3 or more. Alternatively, keyword generation may be ended as soon as the number falls below the first reference value even once.

図２１の説明に戻って、Ｓ２１１５で、今回の範囲におけるキーワード数が第１基準値以下ではないと判定した場合には、終了判定部１２０９は、判定結果を「終了しない」と設定する（Ｓ２１１９）。 Returning to the description of FIG. 21, if it is determined in S2115 that the number of keywords in the current range is not equal to or less than the first reference value, the end determination unit 1209 sets the determination result to “do not end” (S2119).

キーワード数が第１基準値以下であると判定した場合には、終了判定部１２０９は、当該判定結果が連続した回数が所定数（この例では、２）に達したか否かを判定する（Ｓ２１１７）。当該判定結果が連続した回数が所定数に達したと判定した場合には、終了判定部１２０９は、判定結果を「終了する」と設定する（Ｓ２１０３）。 When it is determined that the number of keywords is equal to or less than the first reference value, the end determination unit 1209 determines whether the number of consecutive rounds with that determination result has reached a predetermined number (2 in this example) (S2117). If the number of consecutive rounds has reached the predetermined number, the end determination unit 1209 sets the determination result to “end” (S2103).

当該判定結果が連続した回数が所定数に達していないと判定した場合には、終了判定部１２０９は、判定結果を「終了しない」と設定する（Ｓ２１１９）。終了判定処理を終えると、図１３のＳ１３０７に示した処理に戻る。 If it is determined that the number of consecutive rounds has not reached the predetermined number, the end determination unit 1209 sets the determination result to “do not end” (S2119). When the end determination process is finished, the processing returns to S1307 in FIG. 13.
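
The branch of the end determination described above (S2115 to S2119) can be sketched as follows. This is only a minimal illustration, not the patented implementation: the names `EndJudge`, `first_reference`, and `required_streak` are hypothetical, the sketch covers only the case where the trend has already been judged not to be increasing, and it assumes that the consecutive count is reset whenever the keyword count exceeds the first reference value, which the text implies but does not state.

```python
from dataclasses import dataclass


@dataclass
class EndJudge:
    """Counts consecutive rounds at or below the first reference value."""
    first_reference: int       # threshold for "the keyword count is converging"
    required_streak: int = 2   # consecutive rounds needed to end (2 in the example)
    streak: int = 0

    def judge(self, keyword_count: int) -> bool:
        """Return True for "end" and False for "do not end"."""
        if keyword_count > self.first_reference:    # S2115: not at or below the threshold
            self.streak = 0                         # assumption: the streak is reset
            return False                            # S2119
        self.streak += 1                            # S2117: one more consecutive round
        return self.streak >= self.required_streak  # S2103 if reached, otherwise S2119


judge = EndJudge(first_reference=5)
# Hypothetical keyword counts for minimum frequencies 15, 12 and 9 (cf. FIG. 24).
results = [judge.judge(count) for count in (8, 4, 3)]
```

With these made-up counts, only the third round (the second consecutive round at or below the threshold) yields "end".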

図１３の説明に戻って、反復部１２１３は、判定結果が「終了する」と設定されたか否かを判定する（Ｓ１３０７）。判定結果が「終了する」と設定されていないと判定した場合、つまり判定結果が「終了しない」と設定された場合には、キーワード生成部８０１の反復部１２１３は、最低頻度更新処理を実行する（Ｓ１３０９）。最低頻度更新処理によって、次の回における最低頻度が設定される。 Returning to the description of FIG. 13, the iterative unit 1213 determines whether the determination result is set to “end” (S1307). If the determination result is not set to “end”, that is, if it is set to “do not end”, the iterative unit 1213 of the keyword generation unit 801 executes the minimum frequency update process (S1309). The minimum frequency update process sets the minimum frequency for the next round.

本実施の形態では、最低頻度更新処理（Ａ）を行う。図２５に、最低頻度更新処理（Ａ）フローを示す。反復部１２１３は、最低頻度から所定数（この例では、３）を減ずる（Ｓ２５０１）。当該所定数は、度数分布の階級を特定する範囲の大きさに相当する。従って、本実施の形態では、度数分布の階級を特定する範囲の大きさが均等になる。 In this embodiment, the minimum frequency update process (A) is performed. FIG. 25 shows the flow of the lowest frequency update process (A). The iterative unit 1213 subtracts a predetermined number (3 in this example) from the lowest frequency (S2501). The predetermined number corresponds to the size of a range that specifies the class of the frequency distribution. Therefore, in this embodiment, the size of the range for specifying the class of the frequency distribution is uniform.

最低頻度更新処理を終えると、反復部１２１３は、Ｓ１３０１に処理を戻し、Ｓ１３０１の部分文字列検出処理と、Ｓ１３０３のキーワード選択処理と、Ｓ１３０５の終了判定処理とを反復する。 When the minimum frequency update process is finished, the iterative unit 1213 returns the processing to S1301 and repeats the partial character string detection process of S1301, the keyword selection process of S1303, and the end determination process of S1305.

このようにして、最低頻度を減じながら、キーワード生成を終了させると判定されるまで、部分文字列検出処理（Ｓ１３０１）とキーワード選択処理（Ｓ１３０３）と終了判定処理（Ｓ１３０５）とが繰り返される。 In this way, the partial character string detection process (S1301), the keyword selection process (S1303), and the end determination process (S1305) are repeated until it is determined that the keyword generation is to be ended while decreasing the minimum frequency.
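
The loop over S1301 to S1309 with minimum frequency update process (A) can be sketched as follows. This is a hedged illustration only: `generate_keywords` and its callback parameters are hypothetical names, the guard that stops at a minimum frequency of zero is an added assumption, and the `should_end` rule used in the demo (stop when a round adds no new keyword) is a simplified stand-in for the end determination process. The frequencies and scores are the example values given for FIGS. 26, 27, 29 and 30.

```python
from typing import Callable, Dict, List, Set


def generate_keywords(
    detect: Callable[[int], List[str]],       # S1301: substrings with frequency >= min_freq
    select: Callable[[List[str]], Set[str]],  # S1303: keywords chosen by score
    should_end: Callable[[int], bool],        # S1305: decision from the new-keyword count
    initial_min_freq: int,
    step: int = 3,                            # S2501: fixed decrement (3 in the example)
) -> Set[str]:
    min_freq = initial_min_freq
    keywords: Set[str] = set()
    while min_freq > 0:                       # assumption: give up once the threshold hits zero
        selected = select(detect(min_freq))
        newly_selected = len(selected - keywords)
        keywords = selected                   # keyword table of the latest round
        if should_end(newly_selected):        # S1307
            break
        min_freq -= step                      # S1309, update process (A)
    return keywords


FREQUENCY: Dict[str, int] = {"村田": 32, "佳菜子": 29, "佳菜子ちゃん": 13, "良かった": 70,
                             "かな": 90, "むらかな": 10, "かわいい": 38}
SCORE: Dict[str, int] = {"村田": 10, "佳菜子": 100, "佳菜子ちゃん": 150, "良かった": 4,
                         "かな": 3, "むらかな": 60, "かわいい": 5}

result = generate_keywords(
    detect=lambda m: [s for s, f in FREQUENCY.items() if f >= m],
    select=lambda subs: {s for s in subs if SCORE[s] > 50},
    should_end=lambda new_count: new_count == 0,  # simplified stand-in rule
    initial_min_freq=15,
)
```

Passing the detection, selection, and end determination as callbacks keeps the loop itself independent of how scores are computed, which this part of the description does not specify.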

最低頻度が１２に設定された回における検出結果の例を図２６に示す。この回では、頻度が１２以上である部分文字列が検出される。 FIG. 26 shows an example of the detection result for the round in which the minimum frequency is set to 12. In this round, partial character strings with a frequency of 12 or more are detected.

第１レコードは、部分文字列「村田」を検出し、当該部分文字列の頻度が「３２」であることを示している。第２レコードは、部分文字列「佳菜子」を検出し、当該部分文字列の頻度が「２９」であることを示している。第３レコードは、部分文字列「佳菜子ちゃん」を検出し、当該部分文字列の頻度が「１３」であることを示している。第４レコードは、部分文字列「良かった」を検出し、当該部分文字列の頻度が「７０」であることを示している。第５レコードは、部分文字列「かな」を検出し、当該部分文字列の頻度が「９０」であることを示している。第６レコードは、部分文字列「かわいい」を検出し、当該部分文字列の頻度が「３８」であることを示している。今回初めて検出された部分文字列「佳菜子ちゃん」の頻度は、これらの部分文字列の中では最も小さい。つまり、部分文字列「佳菜子ちゃん」が第１抽出結果に含まれる記事中に出現する確率は、比較的低い。 The first record detects the partial character string “Murata” and indicates that the frequency of the partial character string is “32”. The second record detects the partial character string “Kanako” and indicates that the frequency of the partial character string is “29”. The third record detects the partial character string “Kanako-chan” and indicates that the frequency of the partial character string is “13”. The fourth record detects the partial character string “good” and indicates that the frequency of the partial character string is “70”. The fifth record detects the partial character string “Kana” and indicates that the frequency of the partial character string is “90”. The sixth record detects the partial character string “cute” and indicates that the frequency of the partial character string is “38”. The frequency of the substring “Kanako-chan” detected for the first time this time is the lowest among these substrings. That is, the probability that the partial character string “Kanako-chan” appears in the article included in the first extraction result is relatively low.

図２６に示した検出結果に含まれる部分文字列に対するスコアの例を、図２７に示す。第１レコードは、部分文字列「村田」のスコアが「１０」であることを示している。第２レコードは、部分文字列「佳菜子」のスコアが「１００」であることを示している。第３レコードは、部分文字列「佳菜子ちゃん」のスコアが「１５０」であることを示している。第４レコードは、部分文字列「良かった」のスコアが「４」であることを示している。第５レコードは、部分文字列「かな」のスコアが「３」であることを示している。第６レコードは、部分文字列「かわいい」のスコアが「５」であることを示している。今回初めて検出された部分文字列「佳菜子ちゃん」のスコアは、これらの部分文字列の中では最も大きい。つまり、部分文字列「佳菜子ちゃん」は、第１抽出結果に含まれる記事との関連が最も強い。 FIG. 27 shows an example of the scores for the partial character strings included in the detection result shown in FIG. 26. The first record indicates that the score of the partial character string “Murata” is “10”. The second record indicates that the score of the partial character string “Kanako” is “100”. The third record indicates that the score of the partial character string “Kanako-chan” is “150”. The fourth record indicates that the score of the partial character string “good” is “4”. The fifth record indicates that the score of the partial character string “Kana” is “3”. The sixth record indicates that the score of the partial character string “cute” is “5”. The score of the partial character string “Kanako-chan”, detected for the first time in this round, is the highest among these partial character strings. That is, the partial character string “Kanako-chan” has the strongest association with the articles included in the first extraction result.

図２８は、図２７に示したスコアに基づくキーワードテーブルの例を示している。この例で、スコアの基準値は、５０である。従って、スコアが５０を越えた部分文字列「佳菜子」と部分文字列「佳菜子ちゃん」とが選択され、今回のキーワードテーブルに設定されている。スコアが５０を越えていない部分文字列「村田」と部分文字列「良かった」と部分文字列「かな」と部分文字列「かわいい」とは、選択されていない。 FIG. 28 shows an example of a keyword table based on the scores shown in FIG. 27. In this example, the reference value for the score is 50. Therefore, the partial character strings “Kanako” and “Kanako-chan”, whose scores exceed 50, are selected and set in the keyword table for this round. The partial character strings “Murata”, “good”, “Kana”, and “cute”, whose scores do not exceed 50, are not selected.

図２９は、その次の回における検出結果の例を示している。この回では、最低頻度が９に設定されている。従って、頻度が９以上である部分文字列が検出される。第１レコード乃至第５レコードは、図２６の場合と同様である。第６レコードは、部分文字列「むらかな」を検出し、当該部分文字列の頻度が「１０」であることを示している。第７レコードは、図２６の第６レコードと同様である。今回初めて検出された部分文字列「むらかな」の頻度は、前回初めて検出された部分文字列「佳菜子ちゃん」の頻度に比べて、更に低い。部分文字列「むらかな」が第１抽出結果に含まれる記事中に出現する確率は、部分文字列「佳菜子ちゃん」が第１抽出結果に含まれる記事中に出現する確率よりも、更に低い。 FIG. 29 shows an example of the detection result in the next round. In this round, the minimum frequency is set to 9, so partial character strings with a frequency of 9 or more are detected. The first to fifth records are the same as in FIG. 26. The sixth record indicates that the partial character string “Murakana” was detected and that its frequency is “10”. The seventh record is the same as the sixth record in FIG. 26. The frequency of the partial character string “Murakana”, detected for the first time in this round, is even lower than that of “Kanako-chan”, which was detected for the first time in the previous round. The probability that “Murakana” appears in the articles included in the first extraction result is therefore even lower than the probability that “Kanako-chan” appears in them.

図２９に示した検出結果に含まれる部分文字列に対するスコアの例を、図３０に示す。第１レコード乃至第５レコードは、図２７の場合と同様である。第６レコードは、部分文字列「むらかな」のスコアが「６０」であることを示している。第７レコードは、図２７の第６レコードと同様である。今回初めて検出された部分文字列「むらかな」のスコアは、比較的大きい。つまり、部分文字列「むらかな」は、第１抽出結果に含まれる記事との関連が比較的強い。 FIG. 30 shows an example of the scores for the partial character strings included in the detection result shown in FIG. 29. The first to fifth records are the same as in FIG. 27. The sixth record indicates that the score of the partial character string “Murakana” is “60”. The seventh record is the same as the sixth record in FIG. 27. The score of the partial character string “Murakana”, detected for the first time in this round, is relatively high. That is, “Murakana” is relatively strongly associated with the articles included in the first extraction result.

図３１は、図３０に示したスコアに基づくキーワードテーブルの例を示している。この例で、スコアの基準値は、前述した通り５０である。従って、スコアが５０を越えた部分文字列「佳菜子」と部分文字列「佳菜子ちゃん」とに加えて、部分文字列「むらかな」も選択され、今回のキーワードテーブルに設定されている。スコアが５０を越えていない部分文字列「村田」と部分文字列「良かった」と部分文字列「かな」と部分文字列「かわいい」とは、前回と同様に選択されていない。 FIG. 31 shows an example of a keyword table based on the scores shown in FIG. 30. In this example, the reference value for the score is 50, as described above. Therefore, in addition to the partial character strings “Kanako” and “Kanako-chan”, whose scores exceed 50, the partial character string “Murakana” is also selected and set in the keyword table for this round. As in the previous round, the partial character strings “Murata”, “good”, “Kana”, and “cute”, whose scores do not exceed 50, are not selected.
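
The score-based selection in the two rounds above can be reproduced with a short sketch. The function name `select_keywords` and the dictionary layout are hypothetical. The scores and the reference value of 50 are the example values given for FIGS. 27 and 30, and the substrings are kept in their original Japanese form.

```python
from typing import Dict, List

SCORE_REFERENCE = 50  # reference value for the score, as in the example


def select_keywords(scores: Dict[str, int], reference: int = SCORE_REFERENCE) -> List[str]:
    """Keep the substrings whose score exceeds the reference value."""
    return [substring for substring, score in scores.items() if score > reference]


# Scores per detected substring, in record order (FIG. 27 and FIG. 30).
fig27 = {"村田": 10, "佳菜子": 100, "佳菜子ちゃん": 150,
         "良かった": 4, "かな": 3, "かわいい": 5}
fig30 = {"村田": 10, "佳菜子": 100, "佳菜子ちゃん": 150,
         "良かった": 4, "かな": 3, "むらかな": 60, "かわいい": 5}

previous_table = select_keywords(fig27)  # keyword table of FIG. 28
current_table = select_keywords(fig30)   # keyword table of FIG. 31
# Keywords that appear for the first time in the current round.
newly_selected = [k for k in current_table if k not in previous_table]
```

The length of `newly_selected` corresponds to the per-round keyword count that the end determination process compares with the first reference value.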

図１３の説明に戻って、Ｓ１３０７で、判定結果が「終了する」と設定されたと判定した場合には、キーワード生成部８０１のキーワード特定部１２１５は、キーワードを特定する（Ｓ１３１１）。具体的には、キーワード特定部１２１５は、最終回におけるキーワードテーブルに設定されているキーワードを特定する。キーワード生成処理を終えると、図９に示したＳ９０３の処理に戻る。 Returning to the description of FIG. 13, if it is determined in S1307 that the determination result is set to “end”, the keyword specifying unit 1215 of the keyword generation unit 801 specifies the keywords (S1311). Specifically, the keyword specifying unit 1215 specifies the keywords set in the keyword table of the final round. When the keyword generation process is finished, the processing returns to S903 shown in FIG. 9.

図９の説明に戻って、第２抽出部３１３の第２クエリ生成部８０５は、キーワード生成処理で生成されたキーワードを含む第２クエリを生成する（Ｓ９０３）。第２クエリ生成部８０５は、例えば、キーワード生成処理で生成されたキーワードを残らずＯＲ条件で検索するためのクエリを生成する。あるいは、第２クエリ生成部８０５は、キーワード生成処理で生成されたキーワードを残らずＡＮＤ条件で検索するためのクエリを生成するようにしてもよい。あるいは、第２クエリ生成部８０５は、キーワード生成処理で生成されたキーワードの一部をＯＲ条件で検索するためのクエリを生成するようにしてもよい。あるいは、第２クエリ生成部８０５は、キーワード生成処理で生成されたキーワードの一部をＡＮＤ条件で検索するためのクエリを生成するようにしてもよい。 Returning to the description of FIG. 9, the second query generation unit 805 of the second extraction unit 313 generates a second query that includes the keywords generated by the keyword generation process (S903). For example, the second query generation unit 805 generates a query that searches for all of the generated keywords under an OR condition. Alternatively, the second query generation unit 805 may generate a query that searches for all of the generated keywords under an AND condition, a query that searches for some of the generated keywords under an OR condition, or a query that searches for some of the generated keywords under an AND condition.
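
The four query variants described above differ only in which keywords are used and which operator joins them. The following sketch illustrates this under stated assumptions: `build_second_query` is a hypothetical helper, and the quoted-term syntax joined by `OR` or `AND` is invented for illustration, since the description does not specify the query language accepted by the database management system 103.

```python
from typing import Iterable


def build_second_query(keywords: Iterable[str], operator: str = "OR") -> str:
    """Join quoted keywords with OR or AND (hypothetical query syntax)."""
    if operator not in ("OR", "AND"):
        raise ValueError("operator must be 'OR' or 'AND'")
    terms = [f'"{keyword}"' for keyword in keywords]
    return f" {operator} ".join(terms)


generated = ["佳菜子", "佳菜子ちゃん", "むらかな"]

query_all_or = build_second_query(generated)                       # all keywords, OR
query_all_and = build_second_query(generated, operator="AND")      # all keywords, AND
query_some_or = build_second_query(generated[:2])                  # subset, OR
query_some_and = build_second_query(generated[:2], operator="AND") # subset, AND
```

How the subset is chosen is left open by the text. Slicing the first two keywords here is purely illustrative.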

第２抽出部３１３の第２要求部８０７は、第２クエリをデータベース管理システム１０３へ送信する（Ｓ９０５）。第２抽出部３１３の第２取得部８０９は、データベース管理システム１０３から第２抽出結果を取得し、取得した第２抽出結果を第２抽出結果記憶部３１５に書く（Ｓ９０７）。以上で、第２抽出処理についての説明を終える。第２抽出処理を終えると、図５のＳ５１１に示した処理に戻る。 The second request unit 807 of the second extraction unit 313 transmits the second query to the database management system 103 (S905). The second acquisition unit 809 of the second extraction unit 313 acquires the second extraction result from the database management system 103 and writes it to the second extraction result storage unit 315 (S907). This concludes the description of the second extraction process. When the second extraction process is finished, the processing returns to S511 in FIG. 5.

図５の説明に戻って、出力部３１７は、第２抽出結果を出力する（Ｓ５１１）。例えば、出力部３１７は、ユーザ端末１０７へ第２抽出結果を送信する。 Returning to the description of FIG. 5, the output unit 317 outputs the second extraction result (S511). For example, the output unit 317 transmits the second extraction result to the user terminal 107.

本実施の形態によれば、テキスト単位（例えば、簡易ブログの記事のような文書）の集合に潜在しているキーワードが、ある程度抜き出されたことを推測して、処理を終わらせることができる。従って、無駄な処理を省き、更にキーワードの有効性を担保できる。尚、処理終了の時点で適当な最低頻度が特定されている。 According to the present embodiment, the processing can be ended once it is inferred that the keywords latent in a set of text units (for example, documents such as simple blog articles) have been extracted to a sufficient extent. Wasteful processing is therefore avoided while the effectiveness of the keywords is maintained. Note that an appropriate minimum frequency has been identified by the time the processing ends.

また、一般的に部分文字列の数が多くなる範囲（頻度が低い範囲）における一連の処理を省くので、処理負担が軽減される。特に、スコア算出に係る処理負担が軽減される。 In addition, since a series of processing is generally omitted in a range where the number of partial character strings is large (range where the frequency is low), the processing load is reduced. In particular, the processing burden related to score calculation is reduced.

更に、想定されるキーワード数の変化傾向に従って、潜在しているキーワードのうち多くが抜き出されたことを推定できる。 Furthermore, it can be estimated that many of the latent keywords are extracted according to the assumed change tendency of the number of keywords.

［実施の形態２］
上述の実施の形態では、度数分布の階級を特定する頻度の範囲の大きさを均等とする例を示したが、本実施の形態では、頻度が小さくなるにつれて、度数分布の階級を特定する頻度の範囲を狭める例について説明する。
[Embodiment 2]
In the embodiment described above, the frequency ranges that define the classes of the frequency distribution are all the same size. This embodiment describes an example in which the frequency range that defines each class is narrowed as the frequency decreases.

本実施の形態では、最低頻度を求めるための除数のパラメータを設ける。当該除数のパラメータの初期値は、パラメータ記憶部３０１に記憶されている。当該除数のパラメータの初期値は、受付部３０３によって受け付けられるようにしてもよい。 In the present embodiment, a divisor parameter for obtaining the minimum frequency is provided. The initial value of the divisor parameter is stored in the parameter storage unit 301. The initial value of the divisor parameter may be received by the receiving unit 303.

また、最低頻度更新処理において除数のパラメータに加算する所定の付加値を設ける。所定の付加値は、パラメータ記憶部３０１に記憶されている。所定の付加値は、受付部３０３によって受け付けられるようにしてもよい。 In addition, a predetermined additional value to be added to the divisor parameter in the lowest frequency update process is provided. The predetermined additional value is stored in the parameter storage unit 301. The predetermined additional value may be received by the receiving unit 303.

本実施の形態では、図１３に示したＳ１３０９において、キーワード生成部８０１の反復部１２１３は、最低頻度更新処理（Ｂ）を実行する。図３２に、最低頻度更新処理（Ｂ）フローを示す。反復部１２１３は、除数のパラメータに所定の付加値を加算する（Ｓ３２０１）。 In the present embodiment, in S1309 shown in FIG. 13, the iterative unit 1213 of the keyword generation unit 801 executes the minimum frequency update process (B). FIG. 32 shows the flow of the minimum frequency update process (B). The iterative unit 1213 adds a predetermined additional value to the divisor parameter (S3201).

反復部１２１３は、第１テキスト単位数を上記の除数で割り（Ｓ３２０３）、商を最低頻度に設定する（Ｓ３２０５）。 The iterative unit 1213 divides the number of first text units by the divisor (S3203) and sets the quotient as the minimum frequency (S3205).

このようにすれば、頻度が大きい段階では、最低頻度の変化量が大きくなる。従って、反復の回数を少なくすることができる。 In this way, the change amount of the lowest frequency becomes large at the stage where the frequency is high. Therefore, the number of iterations can be reduced.
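
A short numerical sketch shows why update process (B) takes large steps at first and small steps later. The function name `update_min_frequency_b`, the use of integer division, and all numbers (600 text units, an initial divisor of 10, an additional value of 10) are assumptions made for illustration and are not taken from the description.

```python
from typing import List, Tuple


def update_min_frequency_b(text_units: int, divisor: int, addend: int) -> Tuple[int, int]:
    """Return the next minimum frequency and the updated divisor."""
    divisor += addend                   # S3201: add the additional value to the divisor
    min_freq = text_units // divisor    # S3203, S3205: quotient (integer division assumed)
    return min_freq, divisor


divisor = 10                 # hypothetical initial divisor parameter
history: List[int] = []
for _ in range(5):
    min_freq, divisor = update_min_frequency_b(600, divisor, 10)
    history.append(min_freq)
# Differences between consecutive thresholds, to make the shrinking step visible.
steps = [a - b for a, b in zip(history, history[1:])]
```

With these made-up values the threshold runs 30, 20, 15, 12, 10, so the decrement shrinks from 10 to 2 as the frequency falls.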

［実施の形態３］
実施の形態２では、所定の付加値を一定とする例について説明したが、本実施の形態では、キーワードの数が収束し始めるまでの第１の付加値と、キーワードの数が収束し始めてからの第２の付加値とを切り替える例について説明する。
[Embodiment 3]
Embodiment 2 described an example in which the predetermined additional value is constant. This embodiment describes an example that switches between a first additional value, used until the number of keywords starts to converge, and a second additional value, used after the number of keywords has started to converge.

本実施の形態では、最低頻度更新処理において除数のパラメータに加算する付加値を２種類設ける。第１の付加値は、第２の付加値よりも大きい。第１の付加値と第２の付加値とは、パラメータ記憶部３０１に記憶されている。第１の付加値と第２の付加値とは、受付部３０３によって受け付けられるようにしてもよい。 In the present embodiment, two types of additional values to be added to the divisor parameter in the minimum frequency update process are provided. The first additional value is larger than the second additional value. The first additional value and the second additional value are stored in the parameter storage unit 301. The first additional value and the second additional value may be received by the receiving unit 303.

また、キーワード数が収束する段階に近づいているか否かを判定するための第２基準値を設ける。第２基準値は、Ｓ２１１５で用いる第１基準値よりも大きい値である。第２基準値は、パラメータ記憶部３０１に記憶されている。第２基準値は、受付部３０３によって受け付けられるようにしてもよい。第２基準値は、第１テキスト単位数（この例では、第１抽出結果に含まれる記事の数）に応じ設定されるようにしてもよい。 In addition, a second reference value is provided for determining whether the number of keywords is approaching the stage of convergence. The second reference value is larger than the first reference value used in S2115. The second reference value is stored in the parameter storage unit 301. The second reference value may instead be received by the receiving unit 303. The second reference value may also be set according to the number of first text units (in this example, the number of articles included in the first extraction result).

本実施の形態では、図１３に示したＳ１３０９において、キーワード生成部８０１の反復部１２１３は、最低頻度更新処理（Ｃ）を実行する。図３３に、最低頻度更新処理（Ｃ）フローを示す。反復部１２１３は、Ｓ２１１１で判定した変化傾向が増加傾向であるか否かを判定する（Ｓ３３０１）。Ｓ２１１１で判定した変化傾向が増加傾向であると判定した場合には、反復部１２１３は、上述した除数のパラメータに第１の付加値を加算する（Ｓ３３０３）。 In the present embodiment, in S1309 shown in FIG. 13, the iterative unit 1213 of the keyword generation unit 801 executes the minimum frequency update process (C). FIG. 33 shows the flow of the minimum frequency update process (C). The iterative unit 1213 determines whether the change tendency determined in S2111 is an increasing tendency (S3301). If the change tendency determined in S2111 is an increasing tendency, the iterative unit 1213 adds the first additional value to the divisor parameter described above (S3303).

例えば図１１に示した分布において、頻度が３０乃至６０の領域では、変化傾向が増加傾向であるため、除数のパラメータに第１の付加値が加算される。 For example, in the distribution shown in FIG. 11, in the region where the frequency is 30 to 60, the change tendency tends to increase, so the first additional value is added to the divisor parameter.

一方、Ｓ２１１１で判定した変化傾向が増加傾向ではないと判定した場合には、反復部１２１３は、キーワード数が第２基準値以上であるか否かを判定する（Ｓ３３０５）。キーワード数が第２基準値以上であると判定した場合には、反復部１２１３は、除数のパラメータに第１の付加値を加算する（Ｓ３３０３）。 On the other hand, if the change tendency determined in S2111 is not an increasing tendency, the iterative unit 1213 determines whether the number of keywords is equal to or greater than the second reference value (S3305). If the number of keywords is equal to or greater than the second reference value, the iterative unit 1213 adds the first additional value to the divisor parameter (S3303).

例えば図１１に示した分布において、第２基準値がキーワード数１０程度に相当すると想定すると、頻度が２０乃至３０の領域では、キーワード数が第２基準値以上であるため、除数のパラメータに第１の付加値が加算される。 For example, in the distribution shown in FIG. 11, assuming that the second reference value corresponds to about 10 keywords, the number of keywords is equal to or greater than the second reference value in the region where the frequency is 20 to 30, so the first additional value is added to the divisor parameter.

他方、キーワード数が第２基準値未満であると判定した場合には、反復部１２１３は、除数のパラメータに第２の付加値を加算する（Ｓ３３０７）。 On the other hand, if the number of keywords is less than the second reference value, the iterative unit 1213 adds the second additional value to the divisor parameter (S3307).

例えば図１１に示した分布において、頻度が２０を下回る左側の領域では、キーワード数が第２基準値未満であるため、除数のパラメータに第２の付加値が加算される。 For example, in the distribution shown in FIG. 11, in the left region where the frequency is less than 20, the number of keywords is less than the second reference value, so the second additional value is added to the divisor parameter.

反復部１２１３は、第１テキスト単位数を除数で割り（Ｓ３３０９）、反復部１２１３は、商を最低頻度に設定する（Ｓ３３１１）。 The iterative unit 1213 divides the number of first text units by the divisor (S3309) and sets the quotient as the minimum frequency (S3311).

このようにすれば、キーワード数が収束する段階に至るまで、最低頻度の変化量が大きい。従って、反復の回数を少なくすることができる。 In this way, the minimum frequency changes by a large amount in each round until the number of keywords reaches the stage of convergence. Therefore, the number of iterations can be reduced.
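
The branching in update process (C) (S3301 to S3311) can be sketched in the same style. The function name `update_min_frequency_c`, its parameter names, the integer division, and every number in the demo are hypothetical. Only the branch structure follows the description.

```python
from typing import List, Tuple


def update_min_frequency_c(
    text_units: int,
    divisor: int,
    increasing: bool,        # change tendency determined in S2111
    keyword_count: int,
    second_reference: int,
    first_addend: int,       # larger additional value
    second_addend: int,      # smaller additional value
) -> Tuple[int, int]:
    """Return the next minimum frequency and the updated divisor."""
    if increasing or keyword_count >= second_reference:  # S3301, S3305
        divisor += first_addend                          # S3303: keep taking large steps
    else:
        divisor += second_addend                         # S3307: slow down near convergence
    return text_units // divisor, divisor                # S3309, S3311


divisor = 10
thresholds: List[int] = []
# (increasing?, keyword count) per round, all made up for illustration.
for increasing, count in [(True, 5), (False, 12), (False, 6)]:
    min_freq, divisor = update_min_frequency_c(
        600, divisor, increasing, count,
        second_reference=10, first_addend=10, second_addend=2,
    )
    thresholds.append(min_freq)
```

In the third round the count is below the second reference value, so the divisor grows by only 2 and the threshold moves from 20 to 18 instead of dropping to 15.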

以上本発明の実施の形態を説明したが、本発明はこれに限定されるものではない。例えば、上述の機能ブロック構成は実際のプログラムモジュール構成に一致しない場合もある。 Although the embodiment of the present invention has been described above, the present invention is not limited to this. For example, the functional block configuration described above may not match the actual program module configuration.

また、上で説明した各記憶領域の構成は一例であって、上記のような構成でなければならないわけではない。さらに、処理フローにおいても、処理結果が変わらなければ処理の順番を入れ替えることも可能である。さらに、並列に実行させるようにしても良い。 Further, the configuration of each storage area described above is an example, and the above configuration is not necessarily required. Further, in the processing flow, the processing order can be changed if the processing result does not change. Further, it may be executed in parallel.

なお、上で述べた抽出装置１０１は、コンピュータ装置であって、図３４に示すように、メモリ２５０１とＣＰＵ（Central Processing Unit）２５０３とハードディスク・ドライブ（ＨＤＤ：Hard Disk Drive）２５０５と表示装置２５０９に接続される表示制御部２５０７とリムーバブル・ディスク２５１１用のドライブ装置２５１３と入力装置２５１５とネットワークに接続するための通信制御部２５１７とがバス２５１９で接続されている。オペレーティング・システム（ＯＳ：Operating System）及び本実施例における処理を実施するためのアプリケーション・プログラムは、ＨＤＤ２５０５に格納されており、ＣＰＵ２５０３により実行される際にはＨＤＤ２５０５からメモリ２５０１に読み出される。ＣＰＵ２５０３は、アプリケーション・プログラムの処理内容に応じて表示制御部２５０７、通信制御部２５１７、ドライブ装置２５１３を制御して、所定の動作を行わせる。また、処理途中のデータについては、主としてメモリ２５０１に格納されるが、ＨＤＤ２５０５に格納されるようにしてもよい。本発明の実施例では、上で述べた処理を実施するためのアプリケーション・プログラムはコンピュータ読み取り可能なリムーバブル・ディスク２５１１に格納されて頒布され、ドライブ装置２５１３からＨＤＤ２５０５にインストールされる。インターネットなどのネットワーク及び通信制御部２５１７を経由して、ＨＤＤ２５０５にインストールされる場合もある。このようなコンピュータ装置は、上で述べたＣＰＵ２５０３、メモリ２５０１などのハードウエアとＯＳ及びアプリケーション・プログラムなどのプログラムとが有機的に協働することにより、上で述べたような各種機能を実現する。 The extraction device 101 described above is a computer device, and as shown in FIG. 34, a memory 2501, a CPU (Central Processing Unit) 2503, a hard disk drive (HDD: Hard Disk Drive) 2505, and a display device 2509. A display control unit 2507 connected to the computer, a drive device 2513 for a removable disk 2511, an input device 2515, and a communication control unit 2517 for connecting to a network are connected by a bus 2519. An operating system (OS) and an application program for executing the processing in this embodiment are stored in the HDD 2505, and are read from the HDD 2505 to the memory 2501 when executed by the CPU 2503. The CPU 2503 controls the display control unit 2507, the communication control unit 2517, and the drive device 2513 according to the processing content of the application program, and performs a predetermined operation. Further, data in the middle of processing is mainly stored in the memory 2501, but may be stored in the HDD 2505. In the embodiment of the present invention, an application program for performing the above-described processing is stored in a computer-readable removable disk 2511 and distributed, and installed in the HDD 2505 from the drive device 2513. 
The application program may also be installed in the HDD 2505 via a network such as the Internet and the communication control unit 2517. Such a computer apparatus realizes the various functions described above through organic cooperation between hardware, such as the CPU 2503 and the memory 2501 described above, and programs, such as the OS and the application program.

以上述べた本発明の実施の形態をまとめると、以下のようになる。 The embodiment of the present invention described above is summarized as follows.

本実施の形態に係るキーワード生成方法は、文書の集合において出現する頻度が基準値以上である部分文字列を一又は複数検出する検出処理と、検出された部分文字列の各々について上記集合との関連の強さを示すスコアを算出し、当該スコアに基づき部分文字列の中からキーワードを選択する選択処理と、新たに選択されたキーワードの数を算出し、当該キーワードの数に基づいて、検出処理及び選択処理を終了するか否かを判定する判定処理と、検出処理及び選択処理を終了しないと判定した場合に、上記基準値を更新して、検出処理と選択処理と判定処理とを反復する反復処理とを含む。 The keyword generation method according to the present embodiment includes: a detection process of detecting one or more partial character strings whose frequency of appearance in a set of documents is equal to or higher than a reference value; a selection process of calculating, for each detected partial character string, a score indicating the strength of its association with the set, and selecting keywords from among the partial character strings based on the score; a determination process of calculating the number of newly selected keywords and determining, based on that number, whether to end the detection process and the selection process; and an iterative process of, when it is determined not to end the detection process and the selection process, updating the reference value and repeating the detection process, the selection process, and the determination process.

このようにすれば、文書の集合に潜在しているキーワードが、ある程度抜き出されたことを推測して、処理を終わらせることができる。従って、無駄な処理を省き、更にキーワードの有効性を担保できる。 In this way, it is possible to finish the processing by guessing that the keywords that are latent in the document set have been extracted to some extent. Therefore, useless processing can be omitted and the effectiveness of the keyword can be secured.

また、上記反復処理において、反復の度に順次より小さい値へ上記基準値を更新するようにしてもよい。 Further, in the above iterative process, the reference value may be updated to a smaller value sequentially at each iteration.

このようにすれば、一般的に部分文字列の数が多くなる範囲（頻度が低い範囲）における一連の処理を省くようになるので、処理負担が軽減される。特に、スコア算出に係る処理負担が軽減される。 In this way, a series of processing in a range where the number of partial character strings is generally increased (range where the frequency is low) is omitted, so that the processing load is reduced. In particular, the processing burden related to score calculation is reduced.

また、上記判定処理において、キーワードの数が減少する傾向を示し、且つキーワードの数が閾値を下回ったと判定した場合に、検出処理及び選択処理を終了すると判定するようにしてもよい。 In the determination process, when the number of keywords shows a tendency to decrease and it is determined that the number of keywords has fallen below a threshold value, the detection process and the selection process may be determined to end.

このようにすれば、想定されるキーワード数の変化傾向に従って、潜在しているキーワードのうち多くが抜き出されたことを推定できるようになる。 In this way, it is possible to estimate that many of the latent keywords have been extracted according to the assumed change tendency of the number of keywords.

また、上記反復処理において、反復の度に上記基準値の更新量をより小さくするようにしてもよい。 Further, in the above iterative process, the update amount of the reference value may be made smaller for each iteration.

このようにすれば、キーワードの有効性を担保しつつ、反復の回数を減らすことができる。 In this way, the number of iterations can be reduced while ensuring the effectiveness of the keyword.

また、上記反復処理において、キーワードの数が収束する段階に近づいているか否かを判定し、当該段階に近づいていると判定した場合に、上記基準値の更新量を小さくするようにしてもよい。 Further, in the iterative process, it may be determined whether the number of keywords is approaching the stage of convergence, and when it is determined to be approaching that stage, the update amount of the reference value may be reduced.

なお、上記方法による処理をコンピュータに行わせるためのプログラムを作成することができ、当該プログラムは、例えばフレキシブルディスク、ＣＤ−ＲＯＭ、光磁気ディスク、半導体メモリ、ハードディスク等のコンピュータ読み取り可能な記憶媒体又は記憶装置に格納されるようにしてもよい。尚、中間的な処理結果は、一般的にメインメモリ等の記憶装置に一時保管される。 A program for causing a computer to perform the processing according to the above method can be created. The program can be a computer-readable storage medium such as a flexible disk, a CD-ROM, a magneto-optical disk, a semiconductor memory, a hard disk, or the like. It may be stored in a storage device. Note that intermediate processing results are generally temporarily stored in a storage device such as a main memory.

以上の実施例を含む実施形態に関し、さらに以下の付記を開示する。 The following supplementary notes are further disclosed with respect to the embodiments including the above examples.

（付記１）
文書の集合において出現する頻度が基準値以上である部分文字列を一又は複数検出する検出処理と、
検出された前記部分文字列の各々について前記集合との関連の強さを示すスコアを算出し、当該スコアに基づき前記部分文字列の中からキーワードを選択する選択処理と、
新たに選択された前記キーワードの数を算出し、当該キーワードの数に基づいて、前記検出処理及び前記選択処理を終了するか否かを判定する判定処理と、
前記検出処理及び前記選択処理を終了しないと判定した場合に、前記基準値を更新して、前記検出処理と前記選択処理と前記判定処理とを反復する反復処理と
を含み、コンピュータにより実行されるキーワード生成方法。
(Appendix 1)
A keyword generation method executed by a computer, the method comprising:
a detection process of detecting one or more partial character strings whose frequency of appearance in a set of documents is equal to or higher than a reference value;
a selection process of calculating, for each of the detected partial character strings, a score indicating the strength of association with the set, and selecting a keyword from among the partial character strings based on the score;
a determination process of calculating the number of newly selected keywords and determining, based on the number of keywords, whether to end the detection process and the selection process; and
an iterative process of updating the reference value and repeating the detection process, the selection process, and the determination process when it is determined not to end the detection process and the selection process.

（付記２）
前記反復処理において、反復の度に順次より小さい値へ前記基準値を更新する
付記１記載のキーワード生成方法。
(Appendix 2)
The keyword generation method according to appendix 1, wherein, in the iterative process, the reference value is updated to a successively smaller value at each iteration.

（付記３）
前記判定処理において、前記キーワードの数が減少する傾向を示し、且つ前記キーワードの数が閾値を下回ったと判定した場合に、前記検出処理及び前記選択処理を終了すると判定する
付記２記載のキーワード生成方法。
(Appendix 3)
The keyword generation method according to appendix 2, wherein, in the determination process, it is determined to end the detection process and the selection process when the number of keywords shows a decreasing tendency and the number of keywords has fallen below a threshold.

（付記４）
前記反復処理において、反復の度に前記基準値の更新量をより小さくする
付記２又は３記載のキーワード生成方法。
(Appendix 4)
The keyword generation method according to appendix 2 or 3, wherein, in the iterative process, the update amount of the reference value is made smaller at each iteration.

（付記５）
前記反復処理において、前記キーワードの数が収束する段階に近づいているか否かを判定し、当該段階に近づいていると判定した場合に、前記基準値の更新量を小さくする
付記２乃至４のいずれか１つ記載のキーワード生成方法。
(Appendix 5)
The keyword generation method according to any one of appendices 2 to 4, wherein, in the iterative process, it is determined whether the number of keywords is approaching the stage of convergence, and the update amount of the reference value is reduced when it is determined that the number is approaching that stage.

（付記６）
文書の集合において出現する頻度が基準値以上である部分文字列を一又は複数検出する検出処理と、
検出された前記部分文字列の各々について前記集合との関連の強さを示すスコアを算出し、当該スコアに基づき前記部分文字列の中からキーワードを選択する選択処理と、
新たに選択された前記キーワードの数を算出し、当該キーワードの数に基づいて、前記検出処理及び前記選択処理を終了するか否かを判定する判定処理と、
前記検出処理及び前記選択処理を終了しないと判定した場合に、前記基準値を更新して、前記検出処理と前記選択処理と前記判定処理とを反復する反復処理と
をコンピュータに実行させるためのプログラム。
(Appendix 6)
A program for causing a computer to execute:
a detection process of detecting one or more partial character strings whose frequency of appearance in a set of documents is equal to or higher than a reference value;
a selection process of calculating, for each of the detected partial character strings, a score indicating the strength of association with the set, and selecting a keyword from among the partial character strings based on the score;
a determination process of calculating the number of newly selected keywords and determining, based on the number of keywords, whether to end the detection process and the selection process; and
an iterative process of updating the reference value and repeating the detection process, the selection process, and the determination process when it is determined not to end the detection process and the selection process.

(Appendix 7)
An information processing apparatus including:
a detection unit that detects one or more partial character strings whose frequency of occurrence in a set of documents is equal to or higher than a reference value;
a selection unit that calculates, for each of the detected partial character strings, a score indicating the strength of association with the set, and selects a keyword from among the partial character strings based on the score;
a determination unit that calculates the number of newly selected keywords and determines, based on the number of keywords, whether to end the detection process and the selection process; and
an iteration unit that, when it is determined that the detection process and the selection process are not to be ended, updates the reference value and causes the processing by the detection unit, the processing by the selection unit, and the processing by the determination unit to be repeated.

DESCRIPTION OF SYMBOLS
101 extraction apparatus
103 database management system
105 text database
107 user terminal
301 parameter storage unit
303 reception unit
305 first extraction unit
307 first extraction result storage unit
309 sampling unit
311 sampling result storage unit
313 second extraction unit
315 second extraction result storage unit
317 output unit
601 first query generation unit
603 first request unit
605 first acquisition unit
801 keyword generation unit
803 keyword storage unit
805 second query generation unit
807 second request unit
809 second acquisition unit
1201 partial character string detection unit
1203 detection result storage unit
1205 keyword selection unit
1207 selection result storage unit
1209 end determination unit
1211 frequency distribution storage unit
1213 iteration unit
1215 keyword identification unit

Claims

A keyword generation method executed by a computer, the method including:
a detection process of detecting one or more partial character strings whose frequency of occurrence in a set of documents is equal to or higher than a first reference value;
a selection process of calculating, for each of the detected one or more partial character strings, a score indicating the strength of association with the set, and selecting, from among the one or more partial character strings, each partial character string whose score exceeds a second reference value;
a determination process of calculating the number of newly selected partial character strings, determining that the detection process and the selection process are to be ended when a condition is satisfied that the number shows a decreasing tendency and falls below a third reference value, and determining that the detection process and the selection process are not to be ended when the condition is not satisfied; and
an iterative process of, when it is determined that the detection process and the selection process are not to be ended, updating the first reference value to a smaller value and repeating the detection process, the selection process, and the determination process.
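
The four processes recited in the method claim above can be sketched as a single loop. The following Python is only an illustrative sketch under stated assumptions, not the implementation described in this publication: the function names, the `background` corpus, the `max_len` limit on substring length, and in particular the score formula (the share of documents in the set that contain the string divided by a smoothed share of background documents that contain it) are all hypothetical. "Newly selected" is interpreted here as strings that were not selected in an earlier iteration.

```python
from collections import Counter


def substring_doc_freq(docs, max_len=4):
    """Count, for each substring of up to max_len characters,
    how many documents contain it."""
    freq = Counter()
    for doc in docs:
        seen = {doc[i:i + n]
                for n in range(1, max_len + 1)
                for i in range(len(doc) - n + 1)}
        freq.update(seen)
    return freq


def generate_keywords(docs, background, first_ref, second_ref, third_ref,
                      step=1, max_len=4):
    """Repeat detection, selection, and determination while lowering first_ref."""
    in_set = substring_doc_freq(docs, max_len)
    in_bg = substring_doc_freq(background, max_len)
    keywords, prev_new = set(), None
    while first_ref >= 1:
        # Detection: substrings whose frequency is at or above the first reference value.
        detected = [s for s, f in in_set.items() if f >= first_ref]
        # Selection: hypothetical score (an assumption, not taken from the source).
        selected = set()
        for s in detected:
            score = (in_set[s] / len(docs)) / ((in_bg[s] + 1) / (len(background) + 1))
            if score > second_ref:
                selected.add(s)
        new_count = len(selected - keywords)
        keywords |= selected
        # Determination: end when the number of new selections is decreasing
        # and has fallen below the third reference value.
        if prev_new is not None and new_count < prev_new and new_count < third_ref:
            break
        prev_new = new_count
        # Iteration: update the first reference value to a smaller value.
        first_ref -= step
    return keywords
```

For example, `generate_keywords(["ab", "ab", "ac"], ["c", "c"], first_ref=3, second_ref=1.0, third_ref=1, max_len=2)` lowers the first reference value from 3 to 1 and stops in the third pass, when no new string clears the second reference value.
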

The keyword generation method according to claim 1, wherein, in the iterative process, the update amount of the first reference value is made smaller at each iteration.

The keyword generation method according to claim 1, wherein, in the iterative process, it is determined whether the number is approaching a stage of convergence, and when it is determined that the number is approaching that stage, the update amount of the first reference value is reduced.
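
The two dependent claims above concern how much the first reference value is lowered per iteration. A minimal sketch of both variants follows, assuming an integer update amount. The halving factor `shrink`, the floor `min_step`, and the convergence test (the latest count of newly selected strings being less than `near_ratio` times the previous count) are hypothetical choices made for illustration; the claims do not specify them.

```python
def shrink_every_iteration(step, shrink=0.5, min_step=1):
    """Variant 1: make the update amount smaller at each iteration."""
    return max(min_step, int(step * shrink))


def shrink_near_convergence(step, new_counts, shrink=0.5, min_step=1,
                            near_ratio=0.5):
    """Variant 2: reduce the update amount only when the number of newly
    selected strings appears to be approaching convergence.

    new_counts lists the number of newly selected strings per iteration so far.
    """
    # Hypothetical test: the latest count dropped sharply versus the previous one.
    if len(new_counts) >= 2 and new_counts[-1] < near_ratio * new_counts[-2]:
        return shrink_every_iteration(step, shrink, min_step)
    return step
```

A coarse step early on keeps the number of iterations low, while a finer step near convergence avoids lowering the first reference value further than needed.
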

A program that causes a computer to execute:
a detection process of detecting one or more partial character strings whose frequency of occurrence in a set of documents is equal to or higher than a first reference value;
a selection process of calculating, for each of the detected one or more partial character strings, a score indicating the strength of association with the set, and selecting, from among the one or more partial character strings, each partial character string whose score exceeds a second reference value;
a determination process of calculating the number of newly selected partial character strings, determining that the detection process and the selection process are to be ended when a condition is satisfied that the number shows a decreasing tendency and falls below a third reference value, and determining that the detection process and the selection process are not to be ended when the condition is not satisfied; and
an iterative process of, when it is determined that the detection process and the selection process are not to be ended, updating the first reference value to a smaller value and repeating the detection process, the selection process, and the determination process.

An information processing apparatus comprising:
a detection unit that detects one or more partial character strings whose frequency of occurrence in a set of documents is equal to or higher than a first reference value;
a selection unit that calculates, for each of the detected one or more partial character strings, a score indicating the strength of association with the set, and selects, from among the one or more partial character strings, each partial character string whose score exceeds a second reference value;
a determination unit that calculates the number of newly selected partial character strings, determines that the processing by the detection unit and the processing by the selection unit are to be ended when a condition is satisfied that the number shows a decreasing tendency and falls below a third reference value, and determines that the processing by the detection unit and the processing by the selection unit are not to be ended when the condition is not satisfied; and
an iteration unit that, when it is determined that the processing by the detection unit and the processing by the selection unit are not to be ended, updates the first reference value to a smaller value and causes the processing by the detection unit, the processing by the selection unit, and the processing by the determination unit to be repeated.