JP5426710B2 - Search support device, search support method and program

Info

Publication number: JP5426710B2
Application number: JP2012062595A
Authority: JP
Inventors: 博新名; 雅一服部
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2012-03-19
Filing date: 2012-03-19
Publication date: 2014-02-26
Anticipated expiration: 2032-03-19
Also published as: CN103324646B; JP2013196358A; CN103324646A

Description

Embodiments described herein relate generally to a search support apparatus, a search support method, and a program.

Document search is a technique for retrieving, from a set of documents to be searched, the documents that contain a search keyword designated by a user. Here, a document includes not only a digitized document itself but also any kind of content that has text data. Various search support methods have been proposed to reduce the user's operation burden in document search.

For example, a method is known that presents search keyword candidates to the user based on a history of past search expressions. In this method, when the user inputs a search keyword such as “diffusion weighted image”, words that frequently co-occur with it in the history of past search expressions, such as “delayed phase”, “fat”, “high signal”, and “axial position”, are proposed as candidates for the next search keyword. This method makes it easier to build a search expression containing a plurality of search keywords and reduces the user's operation burden. However, it needs a large history in order to propose appropriate search keyword candidates. If the history is biased or insufficient, the quality of the proposals deteriorates, and the target document may not be found.

A method is also known that proposes search keyword candidates to the user by using a co-occurrence dictionary that defines combinations of two words having a co-occurrence relationship. In this method, when the user inputs a word as a search keyword, the words registered in the co-occurrence dictionary as having a high co-occurrence probability with the input search keyword are proposed as candidates for the next search keyword. This method also makes it easier to build a search expression containing a plurality of search keywords and reduces the user's operation burden. However, it requires a co-occurrence dictionary defining the co-occurrence relationships between words to be prepared in advance. If the prepared co-occurrence dictionary does not fit the document set to be searched, the quality of the proposals deteriorates, and the target document may not be found.

As described above, with the conventional techniques that facilitate the generation of a search expression containing a plurality of search keywords, the quality of the proposals may deteriorate so that the target document cannot be found, and improvement is required.

Japanese Patent No. 2850952
JP 2006-48286 A

The problem to be solved by the present invention is to provide a search support apparatus, a search support method, and a program that can propose search keyword candidates suited to the document set to be searched, without requiring the user to perform complicated operations in advance, and can thereby appropriately support document search.

The search support apparatus of the embodiment includes an extraction unit, a calculation unit, a first detection unit, a first generation unit, a second generation unit, a first recognition unit, a co-occurrence propagation unit, a second detection unit, a third generation unit, a second recognition unit, a presentation unit, and a search unit. The extraction unit extracts keyword candidates from a document set to be searched. The calculation unit calculates, for each combination of two extracted keyword candidates, a co-occurrence probability, which is the probability that one keyword candidate appears together with the other keyword candidate in the same document in the document set. The first detection unit detects co-occurrence keyword sets, each of which is a combination of two keyword candidates whose co-occurrence probability satisfies a first condition. The first generation unit generates a co-occurrence dictionary, which is a set of dictionary elements each having one keyword candidate of a co-occurrence keyword set as a headword and the other keyword candidate as a co-occurrence word. The second generation unit generates a character string completion rule, which is a rule for completing an input character string to obtain a keyword candidate included in the co-occurrence keyword sets. The first recognition unit recognizes, as an input keyword, a keyword candidate obtained by completing the input character string in accordance with the character string completion rule. The co-occurrence propagation unit refers to the co-occurrence dictionary, acquires the co-occurrence words of the dictionary elements whose headword is the input keyword, and repeats the process of acquiring the co-occurrence words of the dictionary elements whose headword is an acquired co-occurrence word. The second detection unit detects zero co-occurrence keyword sets, each of which is a combination of two keyword candidates whose co-occurrence probability is zero. The third generation unit generates a zero co-occurrence dictionary, which is a set of dictionary elements each having one keyword candidate of a zero co-occurrence keyword set as a headword and the other keyword candidate as a zero co-occurrence word. The second recognition unit refers to the zero co-occurrence dictionary and, when the word strings formed by connecting the input keyword with the co-occurrence words acquired by the co-occurrence propagation unit include a word string that contains both keyword candidates of a zero co-occurrence keyword set, excludes that word string. Among the remaining word strings, it recognizes those that satisfy a second condition as proposed word strings. The presentation unit presents the proposed word strings. When a presented proposed word string is selected, the search unit generates a search expression based on that proposed word string and searches the document set.

FIG. 1 is a block diagram illustrating a functional configuration of the search support apparatus according to the first embodiment.
FIG. 2 is a diagram illustrating an example of a document set to be searched.
FIG. 3 is a diagram illustrating an example of keyword candidates extracted from the document set to be searched.
FIG. 4 is a diagram illustrating an example of the relationship between the extracted keyword candidates and their appearance frequencies.
FIG. 5 is a diagram illustrating an example of the appearance frequencies of combinations of two keyword candidates.
FIG. 6 is a diagram illustrating an example of the co-occurrence probabilities between two keyword candidates.
FIG. 7 is a diagram illustrating an example of a co-occurrence network.
FIG. 8 is a flowchart illustrating an example of processing by the co-occurrence keyword set detection unit.
FIG. 9 is a diagram illustrating an example of the data structure of the dictionary elements that constitute the co-occurrence dictionary.
FIG. 10 is a diagram illustrating an example of the co-occurrence dictionary.
FIG. 11 is a diagram illustrating an example of a Patricia tree.
FIG. 12 is a diagram illustrating an example of the data structure of a word string.
FIG. 13 is a flowchart illustrating an example of processing by the input keyword recognition unit, the co-occurrence propagation unit, and the proposed word string recognition unit.
FIG. 14 is a block diagram illustrating a functional configuration of the search support apparatus according to the second embodiment.
FIG. 15 is a diagram illustrating combinations of two keyword candidates whose co-occurrence probability is zero.
FIG. 16 is a diagram illustrating an example of the data structure of the dictionary elements that constitute the zero co-occurrence dictionary.
FIG. 17 is a diagram illustrating an example of the zero co-occurrence dictionary.
FIG. 18 is a flowchart illustrating an example of processing by the input keyword recognition unit, the co-occurrence propagation unit, and the proposed word string recognition unit.

Hereinafter, the search support apparatus and the search support method of the embodiments will be described in detail with reference to the drawings.

(First embodiment)
FIG. 1 is a block diagram illustrating a functional configuration of the search support apparatus 100 according to the first embodiment. As shown in FIG. 1, the search support apparatus 100 according to the present embodiment includes a keyword candidate extraction unit (extraction unit) 101, an index generation unit 102, a co-occurrence probability calculation unit (calculation unit) 103, a co-occurrence keyword set detection unit (first detection unit) 104, a co-occurrence dictionary generation unit (first generation unit) 105, a Patricia tree generation unit (second generation unit) 106, an input reception unit 107, an input keyword recognition unit (first recognition unit) 108, a co-occurrence propagation unit 109, a proposed word string recognition unit (second recognition unit) 110, a display unit (presentation unit) 111, and a search unit 112.

The keyword candidate extraction unit 101 extracts keyword candidates, which are candidates for search keywords, from the document set 300 to be searched. The method for extracting keyword candidates from the document set 300 is not particularly limited, and various methods can be used. For example, the unit can refer to a vocabulary dictionary in which a large number of words usable as search keywords are registered, and extract, from each document included in the document set 300, the words registered in the vocabulary dictionary as keyword candidates.

Here, the document set 300 to be searched can be designated freely by the user of the search support apparatus 100 of the present embodiment. That is, in the search support apparatus 100 according to the present embodiment, when the user designates the document set 300 to be searched, the keyword candidate extraction unit 101 executes the process of extracting keyword candidates from the designated document set 300. In the following, a set of book information as shown in FIG. 2 is used as a specific example of the document set 300. In the example of FIG. 2, five pieces of book information are shown as documents 1 to 5. In each of documents 1 to 5, the information on the corresponding book is described, for example as text data, for each of the items “book title”, “author”, “translator”, “publisher”, “ISBN”, and “summary”. The keyword candidate extraction unit 101 extracts keyword candidates from the text data described for each item of each document. Note that the document set 300 to be searched is generally a set of many documents, but only five documents are illustrated here to simplify the description.

FIG. 3 shows an example of the keyword candidates extracted by the keyword candidate extraction unit 101 from the document set 300 illustrated in FIG. 2. FIG. 4 shows an example of the relationship between the extracted keyword candidates and their appearance frequencies. The appearance frequency of a keyword candidate is the number of documents in the document set 300 in which that keyword candidate appears. For example, in the document set 300 illustrated in FIG. 2, the keyword candidate “Hanako” is extracted from each of documents 1 to 5, so its appearance frequency is 5.
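The appearance frequency defined above is a per-document count, not a count of total occurrences. A minimal Python sketch of that counting follows. The function name `appearance_frequency` and the sample documents are hypothetical; only the facts that “Hanako” appears in all five documents and “Fuchu” in documents 1, 2, and 5 come from the text.

```python
from collections import Counter


def appearance_frequency(doc_keywords):
    """Count, for each keyword candidate, the number of documents containing it."""
    freq = Counter()
    for keywords in doc_keywords.values():
        # set() ensures a candidate is counted at most once per document.
        freq.update(set(keywords))
    return freq


# Hypothetical keyword candidates per document (document ID -> candidates).
doc_keywords = {
    1: ["Hanako", "Fuchu", "Hanako"],
    2: ["Hanako", "Fuchu"],
    3: ["Hanako"],
    4: ["Hanako", "Tokyo"],
    5: ["Hanako", "Fuchu", "Tokyo"],
}
freq = appearance_frequency(doc_keywords)
print(freq["Hanako"], freq["Fuchu"])
```

Document 1 lists “Hanako” twice on purpose: the repeated mention does not raise the count.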

After extracting keyword candidates from the document set 300, the keyword candidate extraction unit 101 outputs each extracted keyword candidate to the index generation unit 102 together with document pointer information indicating the documents in which that keyword candidate appears. The keyword candidate extraction unit 101 also outputs the combinations of two extracted keyword candidates to the co-occurrence probability calculation unit 103. In addition, the keyword candidate extraction unit 101 preferably holds information such as that shown in FIG. 4, which indicates the relationship between the extracted keyword candidates and their appearance frequencies, in a storage unit (not shown).

The index generation unit 102 generates an index 301 for the document set 300 to be searched, based on the keyword candidates and the document pointer information input from the keyword candidate extraction unit 101. For a given search keyword, the index 301 holds the document pointer information indicating the documents in which that search keyword appears.
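The index 301 described above behaves like an inverted index. The following Python sketch illustrates the idea under the assumption that document pointer information can be represented by integer document IDs. The name `build_index` and the sample data are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def build_index(doc_keywords):
    """Map each keyword candidate to the sorted IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}


# Hypothetical input: document ID -> keyword candidates found in that document.
doc_keywords = {1: {"Hanako", "Fuchu"}, 2: {"Hanako"}, 3: {"Fuchu", "Tokyo"}}
index = build_index(doc_keywords)
print(index["Hanako"])
```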

For each combination of two keyword candidates input from the keyword candidate extraction unit 101, the co-occurrence probability calculation unit 103 calculates the co-occurrence probability, which is the probability that one keyword candidate appears together with the other keyword candidate in the same document in the document set 300.

FIG. 5 is a diagram illustrating an example of the appearance frequencies of combinations of two keyword candidates. FIG. 6 is a diagram illustrating an example of the co-occurrence probabilities between two keyword candidates, calculated from the appearance frequencies of FIG. 5 and the total number of documents in the document set 300. FIG. 7 is a diagram illustrating an example of a co-occurrence network generated based on the co-occurrence probabilities of FIG. 6. The appearance frequency of a combination of two keyword candidates is the number of documents in the document set 300 in which both keyword candidates appear. For example, in the document set 300 illustrated in FIG. 2, the keyword candidates “Hanako” and “Fuchu” both appear in document 1, document 2, and document 5, so the appearance frequency of the combination of “Hanako” and “Fuchu” is 3.

For each combination of two keyword candidates input from the keyword candidate extraction unit 101, the co-occurrence probability calculation unit 103 first obtains the appearance frequency of the combination. It then divides that appearance frequency by the total number of documents in the document set 300 to obtain the co-occurrence probability between the two keyword candidates. Finally, it generates a co-occurrence network as shown in FIG. 7, which holds the calculated co-occurrence probabilities, and outputs the network to the co-occurrence keyword set detection unit 104. As shown in FIG. 7, the co-occurrence network has the keyword candidates as nodes, connects nodes that are in a co-occurrence relationship with links, and associates a co-occurrence probability with each link.
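The calculation above divides the number of documents containing both keyword candidates by the total number of documents. The Python sketch below works through it for the “Hanako” and “Fuchu” example (3 of 5 documents, giving 0.6). The function name and the remaining sample data are hypothetical, and the returned mapping from pairs to probabilities is only one possible way to store the links of the co-occurrence network.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probabilities(doc_keywords):
    """Return {(a, b): probability} for every pair that shares at least one document."""
    total_docs = len(doc_keywords)
    pair_counts = Counter()
    for keywords in doc_keywords.values():
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(set(keywords)), 2):
            pair_counts[(a, b)] += 1
    return {pair: count / total_docs for pair, count in pair_counts.items()}


# Hypothetical data in which "Hanako" and "Fuchu" co-occur in documents 1, 2, and 5.
doc_keywords = {
    1: {"Hanako", "Fuchu"},
    2: {"Hanako", "Fuchu"},
    3: {"Hanako"},
    4: {"Hanako", "Tokyo"},
    5: {"Hanako", "Fuchu", "Tokyo"},
}
probs = cooccurrence_probabilities(doc_keywords)
print(probs[("Fuchu", "Hanako")])
```

Pairs that never share a document are simply absent from the result, which corresponds to a co-occurrence probability of zero.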

Based on the co-occurrence network input from the co-occurrence probability calculation unit 103, the co-occurrence keyword set detection unit 104 detects co-occurrence keyword sets, each of which is a combination of two keyword candidates whose co-occurrence probability satisfies the first condition, and outputs them to the co-occurrence dictionary generation unit 105 and the Patricia tree generation unit 106. As the first condition, for example, the condition that the co-occurrence probability is greater than a threshold α can be used, where α is set so that the number of co-occurrence keyword sets finally obtained is equal to or greater than a threshold β. The first condition is not limited to this example, and various predetermined conditions can be used.

FIG. 8 is a flowchart illustrating an example of the processing performed by the co-occurrence keyword set detection unit 104 when it detects co-occurrence keyword sets whose co-occurrence probability is greater than the threshold α.

The co-occurrence keyword set detection unit 104 first sets the threshold α to a predetermined initial value and extracts the combinations of keyword candidates whose co-occurrence probability, as calculated by the co-occurrence probability calculation unit 103, is greater than α (step S101). Specifically, the co-occurrence keyword set detection unit 104 takes, for example, one node of the co-occurrence network input from the co-occurrence probability calculation unit 103, identifies the nodes connected to it by links whose co-occurrence probability is greater than α, and extracts the combination of the two keyword candidates represented by each such pair of nodes. It repeats this processing for all nodes of the co-occurrence network and thereby extracts all combinations of keyword candidates whose co-occurrence probability is greater than α.

Next, the co-occurrence keyword set detection unit 104 counts the combinations of keyword candidates extracted in step S101 and determines whether their number is equal to or greater than the threshold β (step S102). If the number of extracted combinations is less than β (step S102: No), the co-occurrence keyword set detection unit 104 decreases the threshold α by a predetermined amount (step S103), returns to step S101, and repeats the above processing. If the number of extracted combinations is equal to or greater than β (step S102: Yes), it outputs the extracted combinations, as co-occurrence keyword sets, to the co-occurrence dictionary generation unit 105 and the Patricia tree generation unit 106 (step S104).
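Steps S101 to S104 form a loop that relaxes α until enough pairs pass. The Python sketch below is one way to write that loop. The function name, the `step` parameter, and the sample probabilities are hypothetical. The extra stop when α reaches zero is an assumption added here so that the loop terminates even if fewer than β pairs exist; the flowchart as described does not mention it.

```python
def detect_cooccurrence_keyword_sets(probs, alpha, beta, step):
    """Lower alpha by `step` until at least `beta` pairs have probability > alpha."""
    while True:
        # S101: extract pairs whose co-occurrence probability exceeds alpha.
        pairs = {pair for pair, p in probs.items() if p > alpha}
        # S102: enough pairs? (alpha <= 0 is an added safety stop, not in the source.)
        if len(pairs) >= beta or alpha <= 0:
            return pairs, alpha  # S104: output the co-occurrence keyword sets
        alpha -= step  # S103: decrease alpha by a predetermined amount


# Hypothetical co-occurrence probabilities.
probs = {("A", "B"): 0.6, ("A", "C"): 0.4, ("B", "C"): 0.2}
pairs, final_alpha = detect_cooccurrence_keyword_sets(probs, alpha=0.5, beta=2, step=0.2)
print(sorted(pairs), final_alpha)
```

With these values, the first pass (α = 0.5) finds one pair, so α is lowered once and the second pass finds two.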

The co-occurrence dictionary generation unit 105 generates the co-occurrence dictionary 302 based on the co-occurrence keyword sets input from the co-occurrence keyword set detection unit 104. As shown in FIG. 9, the co-occurrence dictionary 302 is a set of dictionary elements, each of which has a data structure consisting of the three elements (headword, co-occurrence word, co-occurrence probability). When a headword is specified, the dictionary elements having that headword can be looked up to obtain the co-occurrence words and the co-occurrence probabilities. For every co-occurrence keyword set input from the co-occurrence keyword set detection unit 104, the co-occurrence dictionary generation unit 105 generates a dictionary element that has one keyword candidate of the set as the headword and the other keyword candidate as the co-occurrence word and that describes the co-occurrence probability between them. The set of these dictionary elements is the co-occurrence dictionary 302. FIG. 10 shows an example of the co-occurrence dictionary 302 generated by the co-occurrence dictionary generation unit 105. Because the entire co-occurrence dictionary 302 is large, FIG. 10 shows only a part of it for convenience.
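The (headword, co-occurrence word, co-occurrence probability) elements can be grouped by headword for fast lookup. The Python sketch below illustrates this with hypothetical names and data. Registering each pair in both directions and sorting entries by probability are assumptions made for convenience; the text only says that one candidate becomes the headword and the other the co-occurrence word.

```python
from collections import defaultdict


def build_cooccurrence_dictionary(pair_probs):
    """Return headword -> list of (co-occurrence word, probability), highest first."""
    dictionary = defaultdict(list)
    for (a, b), p in pair_probs.items():
        # Assumption: each keyword of a pair serves as a headword for the other.
        dictionary[a].append((b, p))
        dictionary[b].append((a, p))
    for entries in dictionary.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return dict(dictionary)


# Hypothetical co-occurrence keyword sets with their probabilities.
pair_probs = {("Hanako", "Tokyo"): 0.4, ("Hanako", "Fuchu"): 0.6}
cooc_dict = build_cooccurrence_dictionary(pair_probs)
print(cooc_dict["Hanako"])
```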

Based on the co-occurrence keyword sets input from the co-occurrence keyword set detection unit 104, the Patricia tree generation unit 106 generates a Patricia tree 303 (character string completion rule) for completing an input character string to obtain the keyword candidates included in the co-occurrence keyword sets. FIG. 11 shows an example of the Patricia tree 303 generated by the Patricia tree generation unit 106. Because the entire Patricia tree 303 is large, FIG. 11 shows only a part of it for convenience.

As shown in FIG. 11, the Patricia tree 303 is represented as a tree structure in which each node holds a partial character string. The nodes are followed in order from the root toward the leaves while the partial character strings of the nodes are concatenated, and the character string obtained on reaching a leaf (a terminal node) is the output character string. The Patricia tree generation unit 106 divides each keyword candidate included in the co-occurrence keyword sets input from the co-occurrence keyword set detection unit 104 into partial character strings and merges the common partial character strings, thereby generating a Patricia tree 303 whose output character strings are the keyword candidates included in the co-occurrence keyword sets. In the part of the Patricia tree 303 shown in FIG. 11, the output character strings are “おりおん” (orion), “おりんぴっく” (orinpikku), “おりんぽす” (orinposu), “とうよう” (touyou), “とうきょう” (toukyou), “はるこ” (haruko), and “はなこ” (hanako).
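The structure described above can be illustrated with a small compressed trie (radix tree) in which each edge carries a partial character string. The Python sketch below is not the patent's implementation: the class and function names are hypothetical, and the `is_word` flag is an assumption added so that a keyword that is a prefix of another keyword could also be stored.

```python
class Node:
    def __init__(self):
        self.children = {}    # partial character string -> child Node
        self.is_word = False  # True if the path from the root spells a keyword


def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def insert(root, word):
    node = root
    while True:
        for label in list(node.children):
            k = common_prefix_len(label, word)
            if k == 0:
                continue
            child = node.children[label]
            if k < len(label):
                # Split the edge: the shared prefix becomes an intermediate node.
                mid = Node()
                mid.children[label[k:]] = child
                del node.children[label]
                node.children[label[:k]] = mid
                child = mid
            node, word = child, word[k:]
            break
        else:
            # No edge shares a prefix with the remainder: attach it as a new leaf.
            if word:
                node.children[word] = Node()
                node = node.children[word]
            node.is_word = True
            return
        if not word:
            node.is_word = True
            return


def words(node, prefix=""):
    """Concatenate partial strings from the root and yield each stored keyword."""
    if node.is_word:
        yield prefix
    for label, child in node.children.items():
        yield from words(child, prefix + label)


keywords = ["おりおん", "おりんぴっく", "おりんぽす", "とうよう", "とうきょう", "はるこ", "はなこ"]
root = Node()
for keyword in keywords:
    insert(root, keyword)
print(sorted(root.children))
```

For the seven keywords named in the text, the root ends up with three edges, one for each shared leading substring (“おり”, “とう”, and “は”).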

The input reception unit 107 receives operation inputs from the user. For example, when the user enters a character string using an input device such as a keyboard, the input reception unit 107 receives the entered character string (input character string) and outputs it to the input keyword recognition unit 108. Here, the input character string may be a partial character string that does not yet form a complete word, and even a single character is treated as an input character string. Every time the input character string changes, the input reception unit 107 outputs the new input character string to the input keyword recognition unit 108. When the user performs a predetermined operation that confirms the input character string, such as pressing the Enter key, the input reception unit 107 outputs the confirmed input character string to the search unit 112 as a search keyword. Furthermore, when the user performs an operation that selects one of the proposed word strings displayed on the display unit 111 (described later), the input reception unit 107 receives that operation and outputs each keyword candidate included in the selected proposed word string to the search unit 112 as a search keyword.

The input keyword recognition unit 108 recognizes, as input keywords, the keyword candidates obtained by completing the input character string received by the input reception unit 107 in accordance with the Patricia tree 303 generated by the Patricia tree generation unit 106, and outputs the recognized input keywords to the co-occurrence propagation unit 109. That is, the input keyword recognition unit 108 receives from the input reception unit 107 the character string the user has entered so far and searches the Patricia tree 303 to estimate the characters that may be entered next, thereby obtaining the character strings that begin with the input character string and match keyword candidates included in the co-occurrence keyword sets. The character strings obtained in this way are all the keyword candidates in the co-occurrence keyword sets that have the input character string as a prefix. If all of them were treated as input keywords, the processing load of the subsequent stages could increase significantly, so the input keyword recognition unit 108 preferably narrows down the character strings that it recognizes as input keywords. As a narrowing-down method, for example, when information such as that shown in FIG. 4, which indicates the relationship between the keyword candidates extracted by the keyword candidate extraction unit 101 and their appearance frequencies, is held as described above, the unit can refer to this information and recognize the top M0 candidates in descending order of appearance frequency as input keywords (M0 is a predetermined natural number).

When the user enters a new character, the input keyword recognition unit 108 starts from the node of the Patricia tree 303 reached by the input so far, searches for the next node corresponding to the newly entered character, and obtains the keyword candidates that have the extended input character string as a prefix. It then recognizes the top M0 of these as input keywords and outputs them to the co-occurrence propagation unit 109. The input keyword recognition unit 108 repeats this processing every time the user enters a new character.
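The narrowing-down step (prefix match, then top M0 by appearance frequency) can be sketched as follows in Python. For brevity, a linear scan over the keyword candidates stands in for the Patricia tree lookup. The function name, the sample frequencies, and the alphabetical tie-break are hypothetical.

```python
def recognize_input_keywords(input_string, frequency, m0):
    """Return up to m0 keyword candidates starting with input_string, most frequent first."""
    # A linear scan replaces the Patricia tree lookup in this sketch.
    matches = [kw for kw in frequency if kw.startswith(input_string)]
    # Assumption: ties in appearance frequency are broken by string order.
    matches.sort(key=lambda kw: (-frequency[kw], kw))
    return matches[:m0]


# Hypothetical appearance frequencies (number of documents per keyword candidate).
frequency = {"はなこ": 5, "はるこ": 2, "はくぶつかん": 3, "とうきょう": 4}
print(recognize_input_keywords("は", frequency, m0=2))
print(recognize_input_keywords("はる", frequency, m0=2))
```

Calling the function again with each longer input string mirrors how the unit re-runs the recognition every time the user types a character.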

The co-occurrence propagation unit 109 refers to the co-occurrence dictionary 302 generated by the co-occurrence dictionary generation unit 105, acquires the co-occurrence words of the dictionary elements whose headword is an input keyword received from the input keyword recognition unit 108, and then repeats the process of acquiring the co-occurrence words of the dictionary elements whose headword is an acquired co-occurrence word. Specifically, for each of the M0 input keywords received from the input keyword recognition unit 108, the co-occurrence propagation unit 109 searches the co-occurrence dictionary 302 for the co-occurrence words of that input keyword and, from the co-occurrence words found, temporarily stores the top M1 (M1 is a predetermined natural number) in descending order of co-occurrence probability with the input keyword as first-stage co-occurrence words. Next, for each of the M1 first-stage co-occurrence words, it searches the co-occurrence dictionary 302 for the co-occurrence words of that first-stage co-occurrence word and temporarily stores the top M2 (M2 is a predetermined natural number) in descending order of co-occurrence probability with the first-stage co-occurrence word as second-stage co-occurrence words. At this time, the co-occurrence propagation unit 109 calculates and temporarily stores the cumulative co-occurrence probability, which is the co-occurrence probability between the input keyword and the first-stage co-occurrence word multiplied by the co-occurrence probability between the first-stage and second-stage co-occurrence words. If the cumulative co-occurrence probability is greater than a threshold γ, the unit repeats the process, moving from second-stage to third-stage co-occurrence words, from third-stage to fourth-stage co-occurrence words, and so on, so that it successively obtains the keyword candidates to which the co-occurrence relationship propagates from the input keyword, while successively multiplying the co-occurrence probabilities between the linked keyword candidates into the cumulative co-occurrence probability. The co-occurrence propagation unit 109 stops the process when the cumulative co-occurrence probability becomes equal to or less than the threshold γ.

In other words, let the co-occurrence words of the dictionary elements whose headword is an input keyword be first-stage co-occurrence words, and let the co-occurrence words of the dictionary elements whose headword is an (L-1)th-stage co-occurrence word (L is a natural number of 2 or more) be Lth-stage co-occurrence words. The co-occurrence propagation unit 109 repeats the process of acquiring Lth-stage co-occurrence words while incrementing L by 1, and stops when the cumulative co-occurrence probability, obtained by successively multiplying the co-occurrence probability between the input keyword and the first-stage co-occurrence word by the co-occurrence probabilities between each (L-1)th-stage and Lth-stage co-occurrence word, becomes equal to or less than the threshold γ. It then outputs to the proposed word string recognition unit 110 the word strings obtained when the process stopped, each linking the input keyword through to the Lth-stage co-occurrence word or, alternatively, through to the (L-1)th-stage co-occurrence word, together with the cumulative co-occurrence probability of each word string. To prevent propagation from cycling between keywords with high mutual co-occurrence probabilities, word strings in which the same keyword candidate appears more than once are excluded from processing. The co-occurrence propagation unit 109 performs the above processing each time it receives new input keywords from the input keyword recognition unit 108.
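The propagation just described can be sketched as a stage-by-stage expansion. The sketch below is one possible reading, not a definitive implementation, and it makes several assumptions: the co-occurrence dictionary is modeled as a nested `dict` (headword to co-occurrence word to probability), a single limit `m_per_stage` plays the role of M1, M2, and so on, the limit is applied across the whole stage, later stages keep only extensions whose cumulative probability exceeds γ, and the function returns the last stage that still contained word strings. The function name `propagate` is hypothetical; the probabilities are an excerpt of the values used in the worked example later in this description.

```python
def propagate(input_keywords, dictionary, gamma, m_per_stage):
    """Return [(word_string, cumulative_probability), ...] for the deepest stage reached."""
    # Stage 1: co-occurrence words of the input keywords, best m_per_stage by probability.
    stage = []
    for keyword in input_keywords:
        for word, p in dictionary.get(keyword, {}).items():
            if word != keyword:
                stage.append(((keyword, word), p))
    stage.sort(key=lambda item: -item[1])
    stage = stage[:m_per_stage]

    while True:
        next_stage = []
        for chain, cumulative in stage:
            for word, p in dictionary.get(chain[-1], {}).items():
                if word in chain:
                    continue  # exclude word strings that repeat a keyword candidate
                product = cumulative * p  # cumulative co-occurrence probability
                if product > gamma:
                    next_stage.append((chain + (word,), product))
        if not next_stage:
            # No extension stays above gamma: stop and return the current stage.
            return stage
        next_stage.sort(key=lambda item: -item[1])
        stage = next_stage[:m_per_stage]


# Excerpt of dictionary elements from the worked example (headword -> {word: probability}).
dictionary = {
    "Orion": {"Hanako": 0.75, "Komukai": 0.5},
    "Olympic": {"Toyo": 0.67},
    "Hanako": {"Orion": 0.75, "Fuchu": 0.6, "Kamakura": 0.4},
    "Toyo": {"Olympic": 0.67, "Publishing": 0.6, "Komukai": 0.33},
    "Komukai": {"Sugita": 0.5, "Tokyo": 0.5, "Orion": 0.5},
    "Fuchu": {"Kamakura": 0.67, "Publishing": 0.6, "Hanako": 0.6},
    "Publishing": {"Toyo": 0.6, "Fuchu": 0.6, "Orion": 0.4},
}

result = propagate(["Olympic", "Orion", "Olympos"], dictionary, gamma=0.37, m_per_stage=3)
```

With these values the expansion stops after the second stage, leaving the word strings (Orion, Hanako, Fuchu) and (Olympic, Toyo, Publishing) with cumulative probabilities of roughly 0.45 and 0.40. Because a word string never repeats a keyword, its length is bounded by the vocabulary size and the loop always terminates.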

FIG. 12 is a diagram illustrating an example of the data structure of a word string output from the co-occurrence propagation unit 109 to the proposed word string recognition unit 110. As shown in FIG. 12, for example, the co-occurrence propagation unit 109 outputs to the proposed word string recognition unit 110 data that combines a word string linking the input keyword, the first-stage co-occurrence word, ..., and the Lth-stage co-occurrence word with the cumulative co-occurrence probability of that word string. When the word string links the input keyword only through to the (L-1)th-stage co-occurrence word, the output data combines the word string linking the input keyword, ..., and the (L-1)th-stage co-occurrence word with the cumulative co-occurrence probability of that word string.

In the above example, the co-occurrence propagation unit 109 stops acquiring Lth-stage co-occurrence words when the cumulative co-occurrence probability becomes equal to or less than the threshold γ, but the stop condition is not limited to this. For example, the co-occurrence propagation unit 109 may stop when the number of word strings obtained reaches an upper limit, or when the number of co-occurrence words linked to a first-stage co-occurrence word reaches an upper limit. Also, in the above example, the first-stage through Lth-stage co-occurrence words are narrowed down to the top M1 through top ML in descending order of co-occurrence probability. Alternatively, a co-occurrence probability serving as the acquisition condition may be determined in advance, and the words whose co-occurrence probability is equal to or greater than this predetermined value may be acquired successively as the first-stage through Lth-stage co-occurrence words.

The proposed word string recognition unit 110 recognizes, among the word strings received from the co-occurrence propagation unit 109, the word strings that satisfy a second condition as proposed word strings, and outputs them to the display unit 111. As the second condition, for example, the top N word strings (N is a predetermined natural number) in descending order of cumulative co-occurrence probability can be used. The second condition is not limited to this example, and various predetermined conditions can be used.

FIG. 13 is a flowchart illustrating an example of the processing performed by the input keyword recognition unit 108, the co-occurrence propagation unit 109, and the proposed word string recognition unit 110.

On receiving an input character string from the input reception unit 107, the input keyword recognition unit 108 uses the Patricia tree 303 to acquire the keyword candidates obtained by completing the input character string, and recognizes the top M0 of these keyword candidates in descending order of appearance frequency as input keywords (step S201). The input keyword recognition unit 108 then outputs the recognized input keywords to the co-occurrence propagation unit 109.

Next, on receiving the M0 input keywords from the input keyword recognition unit 108, the co-occurrence propagation unit 109 refers to the co-occurrence dictionary 302 to find the co-occurrence words of each of the M0 input keywords, and acquires the top M1 keyword candidates in descending order of co-occurrence probability with the input keyword as first-stage co-occurrence words (step S202). The co-occurrence propagation unit 109 also refers to the co-occurrence dictionary 302 to find the co-occurrence words of each (L-1)th-stage co-occurrence word (L is a natural number of 2 or more), and acquires the top ML keyword candidates in descending order of co-occurrence probability with the (L-1)th-stage co-occurrence word as Lth-stage co-occurrence words (step S203). The co-occurrence propagation unit 109 then determines whether the cumulative co-occurrence probability, obtained by successively multiplying the co-occurrence probability between the input keyword and the first-stage co-occurrence word by the co-occurrence probabilities between each (L-1)th-stage and Lth-stage co-occurrence word, is equal to or less than the threshold γ (step S204). If the cumulative co-occurrence probability is greater than the threshold γ (step S204: No), the co-occurrence propagation unit 109 increments L by 1 (step S205), returns to step S203, and repeats the process of acquiring Lth-stage co-occurrence words. If the cumulative co-occurrence probability is equal to or less than the threshold γ (step S204: Yes), the co-occurrence propagation unit 109 outputs to the proposed word string recognition unit 110 the word strings linking the input keyword and the acquired co-occurrence words, together with the cumulative co-occurrence probability of each word string.

Next, on receiving the word strings from the co-occurrence propagation unit 109, the proposed word string recognition unit 110 recognizes the top N word strings in descending order of cumulative co-occurrence probability as proposed word strings (step S206). The proposed word string recognition unit 110 then outputs the recognized proposed word strings to the display unit 111 (step S207).

The display unit 111 displays the proposed word strings received from the proposed word string recognition unit 110 and presents them to the user. The user can select any of the displayed proposed word strings with an input device such as a mouse or a keyboard. When the user selects one of the proposed word strings displayed on the display unit 111, the input reception unit 107 accepts the operation, and each keyword candidate included in the selected proposed word string is output to the search unit 112 as a search keyword. The display unit 111 also displays search results when it receives them from the search unit 112 described below.

When the user selects one of the proposed word strings displayed on the display unit 111, the search unit 112 receives the keyword candidates included in the selected proposed word string, generates a search expression that contains all of them as search keywords, and searches the document set 300. Specifically, the search unit 112 refers to the index 301 generated by the index generation unit 102, successively looks up the document pointer information for each search keyword in the search expression, and determines the document pointer information of the documents that contain all of the search keywords. The search unit 112 then retrieves the documents indicated by this document pointer information from the document set 300 and outputs the necessary information to the display unit 111 as the search result.
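The conjunctive lookup performed by the search unit 112 can be sketched as an intersection of posting sets. This is an illustrative sketch only: the index 301 is modeled as a `dict` mapping each keyword to a set of document identifiers, and both the function name `search_all` and the index contents are hypothetical.

```python
def search_all(index, keywords):
    """Return the ids of documents that contain every keyword (AND search)."""
    if not keywords:
        return set()
    postings = [index.get(keyword, set()) for keyword in keywords]
    # Start from the smallest posting set so the intersection shrinks quickly.
    postings.sort(key=len)
    result = set(postings[0])  # copy, so the index itself is never modified
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical index contents: keyword -> ids of the documents that contain it.
index = {
    "Orion": {2, 5},
    "Hanako": {1, 2, 7},
    "Fuchu": {2, 3},
}

hits = search_all(index, ["Orion", "Hanako", "Fuchu"])
```

A keyword that is missing from the index yields an empty posting set, so the whole search returns no documents, which matches the requirement that every search keyword must be present.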

The search support apparatus 100 of the present embodiment can adopt the hardware configuration of an ordinary computer that includes, for example, a control device such as a CPU, internal storage devices such as a ROM and a RAM, input devices such as a keyboard and a mouse, a display device such as a liquid crystal panel, and external storage devices such as a hard disk, a CD, a DVD, and a flash memory. In this case, the functional configuration of the search support apparatus 100 described above is realized by the computer executing a program. The index 301, the co-occurrence dictionary 302, and the Patricia tree 303 can be stored, for example, in an external storage device such as a hard disk or a flash memory.

Next, a specific example of the processing performed by the search support apparatus 100 of the present embodiment is described, taking as an example a search over the document set 300 illustrated in FIG. 2. In the following, it is assumed that the co-occurrence dictionary 302 illustrated in FIG. 10 and the Patricia tree 303 illustrated in FIG. 11 have been generated from the document set 300 shown in FIG. 2.

First, when the user enters the character string “お” (“o”) with an input device such as a keyboard, the input reception unit 107 accepts the input and outputs the input character string “お” to the input keyword recognition unit 108. The input keyword recognition unit 108 receives the input character string “お” from the input reception unit 107, searches the Patricia tree 303 illustrated in FIG. 11, and performs the processing for recognizing input keywords. To find the node corresponding to the input character string “お”, the input keyword recognition unit 108 first traverses the Patricia tree 303 illustrated in FIG. 11 from the root and obtains the node “おり” (“ori”) as the node corresponding to “お”. Next, the input keyword recognition unit 108 identifies the leaves reachable from the “おり” node, obtains the character strings indicated by those leaves as the continuations of “おり”, and thereby obtains the keyword candidates that complete the input character string “お”. In this case, “Olympic”, “Orion”, and “Olympos” are obtained. The input keyword recognition unit 108 recognizes the top M0 of the obtained keyword candidates in descending order of appearance frequency as input keywords. Here, M0 = 3 is assumed. In this case, the input keyword recognition unit 108 recognizes “Olympic”, “Orion”, and “Olympos” as input keywords and outputs them to the co-occurrence propagation unit 109.

The co-occurrence propagation unit 109 receives the input keywords “Olympic”, “Orion”, and “Olympos” from the input keyword recognition unit 108 and refers to the co-occurrence dictionary 302 illustrated in FIG. 10 to find the dictionary elements whose headword is one of these input keywords. Here, the following dictionary elements are found.
(Orion, Hanako, 0.75)
(Orion, Komukai, 0.5)
(Orion, Isogo, 0.5)
(Orion, Jiro, 0.5)
(Orion, World Cup, 0.5)
(Orion, Publishing, 0.4)
(Orion, Kamakura, 0.33)
(Orion, Sugita, 0.33)
(Orion, Toyo, 0.25)
(Orion, Fuchu, 0.25)
(Olympic, Toyo, 0.67)
(Olympic, Horikawa, 0.5)
(Olympic, Shibaura, 0.5)
(Olympic, Kawasaki, 0.5)
(Olympic, Haruko, 0.5)
(Olympic, Jiro, 0.5)
(Olympic, World, 0.5)
(Olympic, Track and Field, 0.5)
(Olympic, Taro, 0.5)
(Olympic, Publishing, 0.4)
(Olympic, Hanako, 0.4)
(Olympic, Fuchu, 0.25)
(Olympos, Kamakura, 0.5)
(Olympos, Tokyo, 0.5)
(Olympos, Fuchu, 0.33)
(Olympos, Publishing, 0.2)
(Olympos, Hanako, 0.2)

Next, from the co-occurrence words of the dictionary elements found, the co-occurrence propagation unit 109 acquires the top M1 co-occurrence words in descending order of co-occurrence probability with the headword (input keyword) as first-stage co-occurrence words. Here, M1 = 3 is assumed. When several co-occurrence words have the same co-occurrence probability with their headword, the co-occurrence words to acquire can be decided, for example, in descending order of the appearance frequency of the headword or in descending order of the appearance frequency of the co-occurrence word. In this case, “Hanako”, with a co-occurrence probability of 0.75 with the input keyword “Orion”, “Toyo”, with a co-occurrence probability of 0.67 with the input keyword “Olympic”, and “Komukai”, with a co-occurrence probability of 0.5 with the input keyword “Orion”, are acquired as first-stage co-occurrence words.

Furthermore, to find the second-stage co-occurrence words that co-occur with the first-stage co-occurrence words, the co-occurrence propagation unit 109 refers to the co-occurrence dictionary 302 again and finds the dictionary elements whose headword is a first-stage co-occurrence word. Here, the following dictionary elements are found.
(Hanako, Orion, 0.75)
(Hanako, Fuchu, 0.6)
(Hanako, Kamakura, 0.4)
(Hanako, Sugita, 0.4)
(Hanako, Olympic, 0.4)
(Toyo, Olympic, 0.67)
(Toyo, Publishing, 0.6)
(Toyo, Komukai, 0.33)
(Toyo, Taro, 0.33)
(Toyo, Horikawa, 0.33)
(Toyo, Shibaura, 0.33)
(Komukai, Sugita, 0.5)
(Komukai, Tokyo, 0.5)
(Komukai, Orion, 0.5)

Next, from the co-occurrence words of the dictionary elements found, the co-occurrence propagation unit 109 acquires those for which the cumulative co-occurrence probability is greater than the threshold γ as second-stage co-occurrence words linked to the first-stage co-occurrence words. Here, the threshold γ = 0.37 is assumed. For the first-stage co-occurrence word “Hanako”, the co-occurrence probability with the input keyword “Orion” is 0.75, so for the cumulative co-occurrence probability to exceed 0.37, a second-stage co-occurrence word with a co-occurrence probability of 0.49 or more is required. “Orion” and “Fuchu” satisfy this condition and are acquired as second-stage co-occurrence words. However, “Orion” duplicates the input keyword and is therefore excluded from processing. For the first-stage co-occurrence word “Toyo”, the co-occurrence probability with the input keyword “Olympic” is 0.67, so for the cumulative co-occurrence probability to exceed 0.37, a second-stage co-occurrence word with a co-occurrence probability of 0.55 or more is required. “Olympic” and “Publishing” satisfy this condition and are acquired as second-stage co-occurrence words. However, “Olympic” duplicates the input keyword and is therefore excluded from processing. For the first-stage co-occurrence word “Komukai”, the co-occurrence probability with the input keyword “Orion” is 0.5, so for the cumulative co-occurrence probability to exceed 0.37, a second-stage co-occurrence word with a co-occurrence probability of 0.8 or more is required. No co-occurrence word satisfies this condition, so no second-stage co-occurrence word is obtained.

As a result of the above processing, (Orion, Hanako, Fuchu, 0.45) and (Olympic, Toyo, Publishing, 0.4) are obtained as the word strings linking an input keyword, a first-stage co-occurrence word, and a second-stage co-occurrence word.
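The two resulting figures follow from the product rule for the cumulative co-occurrence probability. As a worked check, write $p(a, b)$ for the probability listed in the dictionary element $(a, b, p)$, $w_0$ for the input keyword, $w_L$ for the Lth-stage co-occurrence word, and $P_L$ for the cumulative co-occurrence probability after stage $L$. This notation is introduced here only for illustration and does not appear in the original description.

```latex
\begin{align*}
P_1 &= p(w_0, w_1), \qquad P_L = P_{L-1}\, p(w_{L-1}, w_L) \quad (L \ge 2),\\
P_L > \gamma &\iff p(w_{L-1}, w_L) > \frac{\gamma}{P_{L-1}},\\
\text{Hanako:}\quad & \frac{0.37}{0.75} \approx 0.49, \qquad
  P_2(\text{Orion, Hanako, Fuchu}) = 0.75 \times 0.6 = 0.45,\\
\text{Toyo:}\quad & \frac{0.37}{0.67} \approx 0.55, \qquad
  P_2(\text{Olympic, Toyo, Publishing}) = 0.67 \times 0.6 \approx 0.40.
\end{align*}
```

The second line shows why the threshold on the cumulative value can be checked stage by stage as a lower bound on the next dictionary probability.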

After that, to find the third-stage co-occurrence words that co-occur with the second-stage co-occurrence words, the co-occurrence propagation unit 109 refers to the co-occurrence dictionary 302 again and finds the dictionary elements whose headword is a second-stage co-occurrence word. Here, the following dictionary elements are found.
(Fuchu, Kamakura, 0.67)
(Fuchu, Publishing, 0.6)
(Fuchu, Hanako, 0.6)
(Publishing, Toyo, 0.6)
(Publishing, Fuchu, 0.6)
(Publishing, Orion, 0.4)

Next, from the co-occurrence words of the dictionary elements found, the co-occurrence propagation unit 109 attempts to acquire those for which the cumulative co-occurrence probability is greater than the threshold γ = 0.37 as third-stage co-occurrence words linked to the second-stage co-occurrence words. For the second-stage co-occurrence word “Fuchu”, the cumulative co-occurrence probability so far is 0.45, so for it to remain above 0.37, a third-stage co-occurrence word with a co-occurrence probability of 0.82 or more is required. No co-occurrence word satisfies this condition, so no third-stage co-occurrence word is obtained. For the second-stage co-occurrence word “Publishing”, the cumulative co-occurrence probability so far is 0.4, so for it to remain above 0.37, a third-stage co-occurrence word with a co-occurrence probability of 0.99 or more is required. No co-occurrence word satisfies this condition either, so no third-stage co-occurrence word is obtained. The co-occurrence propagation unit 109 therefore stops propagating co-occurrence words and outputs the word strings obtained so far, (Orion, Hanako, Fuchu, 0.45) and (Olympic, Toyo, Publishing, 0.4), to the proposed word string recognition unit 110.

The proposed word string recognition unit 110 receives the word strings (Orion, Hanako, Fuchu, 0.45) and (Olympic, Toyo, Publishing, 0.4) from the co-occurrence propagation unit 109 and recognizes the top N of them in descending order of cumulative co-occurrence probability as proposed word strings. Here, N = 2 is assumed. When several word strings have the same cumulative co-occurrence probability, the N proposed word strings can be decided, for example, in descending order of the appearance frequency of the input keyword, of the first-stage co-occurrence word, or of the second-stage co-occurrence word. In this case, the proposed word string recognition unit 110 recognizes (Orion, Hanako, Fuchu) and (Olympic, Toyo, Publishing) as proposed word strings and outputs them to the display unit 111.
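The selection of the top N word strings, including the tie-breaking by appearance frequency mentioned above, might look like the following sketch. It assumes one particular tie-breaking order (input keyword first, then each later word in turn), which the description offers only as an example; the function name `select_proposals` and the empty frequency table are hypothetical.

```python
def select_proposals(word_strings, frequency, n):
    """Pick the n word strings with the highest cumulative co-occurrence probability.

    word_strings: list of (words, cumulative_probability) pairs, words being a tuple.
    frequency: mapping from keyword to appearance frequency, used only to break ties.
    """
    def sort_key(item):
        words, cumulative = item
        # Higher cumulative probability first; ties fall back to the appearance
        # frequency of the input keyword, then of each following word in order.
        return (-cumulative, [-frequency.get(word, 0) for word in words])

    ranked = sorted(word_strings, key=sort_key)
    return [words for words, _ in ranked[:n]]


word_strings = [
    (("Olympic", "Toyo", "Publishing"), 0.4),
    (("Orion", "Hanako", "Fuchu"), 0.45),
]
proposals = select_proposals(word_strings, frequency={}, n=2)
```

Here both word strings are kept because N = 2, and the one with the higher cumulative probability is listed first.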

On receiving the proposed word strings (Orion, Hanako, Fuchu) and (Olympic, Toyo, Publishing) from the proposed word string recognition unit 110, the display unit 111 displays them and presents them to the user. When the user then uses an input device such as a mouse to select, for example, (Orion, Hanako, Fuchu) from the displayed proposed word strings, the input reception unit 107 accepts the selection, and the keyword candidates “Orion”, “Hanako”, and “Fuchu” included in the proposed word string are each output to the search unit 112. If, while the proposed word strings are displayed on the display unit 111, the user continues typing without selecting any of them and enters, for example, the character string “おりん” (“orin”), the input keyword recognition unit 108 recognizes “Olympic” and “Olympos” as input keywords, and the same processing is performed thereafter.

On receiving “Orion”, “Hanako”, and “Fuchu” from the input reception unit 107, the search unit 112 generates a search expression that contains all of these keyword candidates as search keywords and searches the document set 300. As a result, [Document 2] is obtained from the document set 300 illustrated in FIG. 2, and information about [Document 2] is displayed on the display unit 111 as the search result.

As described above in detail with specific examples, with the search support apparatus 100 of the present embodiment the user only has to specify the document set 300 to be searched. The apparatus then detects co-occurrence keyword sets, which are combinations of keyword candidates in the document set 300 with a high co-occurrence probability, generates the co-occurrence dictionary 302 based on these sets, and generates the Patricia tree 303, a character string completion rule for completing an input character string into a keyword candidate that belongs to a co-occurrence keyword set. When the user enters a character string, proposed word strings that link an input keyword corresponding to the input character string with the co-occurrence words to which the co-occurrence relationship propagates from that input keyword are presented to the user. When the user selects one of the presented proposed word strings, the document set 300 is searched for documents that contain all of the keyword candidates in that proposed word string. The search support apparatus 100 of the present embodiment can therefore propose search keyword candidates suited to the document set 300 and appropriately support document search, without requiring complicated operations from the user in advance.

Further, in the search support apparatus 100 of the present embodiment, the co-occurrence propagation unit 109 refers to the co-occurrence dictionary 302, treats the co-occurrence words of the dictionary elements whose headword is the input keyword as first-stage co-occurrence words, and treats the co-occurrence words of the dictionary elements whose headword is an (L-1)-stage co-occurrence word as L-stage co-occurrence words. It repeats the process of acquiring L-stage co-occurrence words while incrementing L by 1 and stops when the cumulative co-occurrence probability falls to or below the threshold γ, so the cumulative co-occurrence probabilities of the suggested word strings presented to the user are all close to the threshold γ. This makes it possible to propose multiple keyword candidates contained in the document set 300 effectively as search keywords without making useless suggestions. A word string whose cumulative co-occurrence probability is excessively high is in many cases merely a list of words that are generally closely related. Because the search support apparatus 100 of the present embodiment presents suggested word strings whose cumulative co-occurrence probability is close to the threshold γ, it can make suggestions considered preferable to the user rather than proposing such lists of generally related words.

(Second Embodiment)
Next, a second embodiment will be described. In the second embodiment, zero co-occurrence keyword sets, each a combination of two keyword candidates whose co-occurrence probability is zero, are detected in advance and a zero co-occurrence dictionary is generated. If a word string connecting an input keyword and co-occurrence words contains both keyword candidates of a zero co-occurrence keyword set, that word string is excluded from the suggested word strings. When keyword candidates following an input keyword are acquired one after another by propagating the co-occurrence relationship from the input keyword, a co-occurrence relationship is guaranteed between consecutive keyword candidates, such as between the input keyword and a first-stage co-occurrence word or between a first-stage and a second-stage co-occurrence word. Between keyword candidates that are not adjacent, such as the input keyword and a second-stage co-occurrence word, a co-occurrence relationship does not necessarily exist. If, for example, the input keyword and the second-stage co-occurrence word have no co-occurrence relationship (that is, no document in the document set to be searched contains both), and such a word string is presented as a suggested word string, then a search of the document set with that suggested word string returns no results (zero hits). In the present embodiment, word strings containing two keyword candidates whose co-occurrence probability is zero are therefore excluded from the suggested word strings, so that searching the document set with a suggested word string always returns a result. In the following, only the characteristic parts of the present embodiment are described; components common to the first embodiment are given the same reference numerals and redundant description is omitted.

FIG. 14 is a block diagram illustrating the functional configuration of the search support apparatus 200 of the second embodiment. As shown in FIG. 14, the search support apparatus 200 of the present embodiment includes, in addition to the configuration of the search support apparatus 100 of the first embodiment, a zero co-occurrence keyword set detection unit (second detection unit) 201 and a zero co-occurrence dictionary generation unit (third generation unit) 202. It also includes a co-occurrence propagation unit 209 and a suggested word string recognition unit 210 in place of the co-occurrence propagation unit 109 and the suggested word string recognition unit 110 of the search support apparatus 100 of the first embodiment.

The zero co-occurrence keyword set detection unit 201 receives a co-occurrence network such as the one illustrated in FIG. 7 from the co-occurrence probability calculation unit 103, detects from it zero co-occurrence keyword sets, each a combination of two keyword candidates whose co-occurrence probability is zero, and outputs them to the zero co-occurrence dictionary generation unit 202. Specifically, the zero co-occurrence keyword set detection unit 201, for example, selects from the keyword candidates extracted by the keyword candidate extraction unit 101 those whose appearance frequency exceeds a predetermined value and, for each selected keyword candidate, finds the keyword candidates with which its co-occurrence probability is zero; each such combination is a zero co-occurrence keyword set. Only keyword candidates whose appearance frequency exceeds the predetermined value are considered because keyword candidates with a very low appearance frequency are assumed not to belong to any co-occurrence keyword set in the first place.

FIG. 15 shows combinations of two keyword candidates whose co-occurrence probability is zero, extracted from the co-occurrence probabilities between pairs of keyword candidates illustrated in FIG. 6. In the example of FIG. 15, for each keyword candidate with an appearance frequency of 2 or more, the combinations with other keyword candidates whose co-occurrence probability is zero are shown.
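The detection of zero co-occurrence keyword sets can be illustrated with the following minimal sketch. It assumes that each document is represented as a set of keyword candidates and that the appearance frequency is the number of documents containing a keyword; the function name, the toy document set, and the threshold are hypothetical and are not taken from FIG. 2, FIG. 6, or FIG. 15.

```python
from collections import Counter
from itertools import combinations


def detect_zero_cooccurrence_pairs(documents, min_frequency):
    """Return (headword, other) pairs that never appear in the same document.

    documents: list of sets of keyword candidates (one set per document).
    min_frequency: a headword must appear in more than this many documents.
    """
    frequency = Counter(word for doc in documents for word in doc)

    # Record every ordered pair of keywords that shares at least one document.
    cooccurring = set()
    for doc in documents:
        for a, b in combinations(sorted(doc), 2):
            cooccurring.add((a, b))
            cooccurring.add((b, a))

    # Only sufficiently frequent keywords are used as headwords.
    frequent = sorted(w for w, n in frequency.items() if n > min_frequency)
    pairs = []
    for head in frequent:
        for other in sorted(frequency):
            if other != head and (head, other) not in cooccurring:
                pairs.append((head, other))
    return pairs


# Hypothetical toy document set (not the document set 300 of FIG. 2).
toy_documents = [
    {"Orion", "Hanako", "Fuchu"},
    {"Orion", "Komukai", "Sugita"},
    {"Komukai", "Tokyo"},
]
zero_pairs = detect_zero_cooccurrence_pairs(toy_documents, min_frequency=1)
print(zero_pairs)
```

With `min_frequency=1`, only keywords appearing in two or more documents become headwords, which corresponds to the frequency condition used for FIG. 15.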

The zero co-occurrence dictionary generation unit 202 generates a zero co-occurrence dictionary 304 from the zero co-occurrence keyword sets received from the zero co-occurrence keyword set detection unit 201. As shown in FIG. 16, the zero co-occurrence dictionary 304 is a set of dictionary elements, each with a data structure that pairs two elements (headword, zero co-occurrence word); when a headword is specified, the dictionary elements with that headword can be looked up to obtain its zero co-occurrence words. For every zero co-occurrence keyword set received from the zero co-occurrence keyword set detection unit 201, the zero co-occurrence dictionary generation unit 202 generates a dictionary element with the keyword candidate that passed the appearance-frequency filter as the headword and the other keyword candidate as the zero co-occurrence word, and generates the zero co-occurrence dictionary 304 as the set of these dictionary elements. FIG. 17 shows an example of the zero co-occurrence dictionary 304 generated by the zero co-occurrence dictionary generation unit 202. Because the complete zero co-occurrence dictionary 304 is large, FIG. 17 shows only part of it for convenience.
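One possible in-memory representation of such a dictionary is sketched below, assuming the (headword, zero co-occurrence word) elements are grouped by headword in a hash map. The function names are hypothetical. Of the two sample elements, only (Orion, Tokyo) appears in the description; (Orion, Kawasaki) is invented for illustration.

```python
from collections import defaultdict


def build_zero_cooccurrence_dictionary(zero_pairs):
    """Group (headword, zero co-occurrence word) elements by headword."""
    dictionary = defaultdict(set)
    for headword, zero_word in zero_pairs:
        dictionary[headword].add(zero_word)
    return dict(dictionary)


def zero_cooccurrence_words(dictionary, headword):
    """Return the zero co-occurrence words registered for a headword."""
    return dictionary.get(headword, set())


# ("Orion", "Tokyo") is quoted in the description; ("Orion", "Kawasaki") is hypothetical.
example_dictionary = build_zero_cooccurrence_dictionary(
    [("Orion", "Tokyo"), ("Orion", "Kawasaki")]
)
print(zero_cooccurrence_words(example_dictionary, "Orion"))
```

Grouping by headword makes the lookup described above (headword in, zero co-occurrence words out) a single hash access.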

For each of the M0 input keywords received from the input keyword recognition unit 108, the co-occurrence propagation unit 209 searches the co-occurrence dictionary 302 for the co-occurrence words of the input keyword and temporarily stores, as first-stage co-occurrence words, the top M1 co-occurrence words (M1 is a predetermined natural number) in descending order of co-occurrence probability with the input keyword. For each of the M1 first-stage co-occurrence words, the co-occurrence propagation unit 209 then searches the co-occurrence dictionary 302 for its co-occurrence words and temporarily stores, as second-stage co-occurrence words, those whose co-occurrence probability with the first-stage co-occurrence word is equal to or greater than a threshold δ. The co-occurrence propagation unit 209 repeats this process, obtaining third-stage co-occurrence words from second-stage ones, fourth-stage co-occurrence words from third-stage ones, and so on, until the number of co-occurrence words following the first-stage co-occurrence word reaches a predetermined value A, at which point it stops.

In other words, the co-occurrence propagation unit 209 treats the co-occurrence words of the dictionary elements whose headword is an input keyword as first-stage co-occurrence words and the co-occurrence words of the dictionary elements whose headword is an (L-1)-stage co-occurrence word (L is a natural number of 2 or more) as L-stage co-occurrence words. It repeats the process of acquiring L-stage co-occurrence words while incrementing L by 1 and, when L reaches the predetermined value A, calculates the cumulative co-occurrence probability from the input keyword to the L-stage co-occurrence word and stops. The co-occurrence propagation unit 209 then outputs the word strings obtained at that point, each connecting an input keyword through to an L-stage co-occurrence word, together with the cumulative co-occurrence probability of each word string, to the suggested word string recognition unit 210. As in the first embodiment, word strings in which the same keyword candidate appears more than once are excluded from processing so that propagation does not bounce back and forth between keywords with a high co-occurrence probability. The co-occurrence propagation unit 209 performs the above processing each time it receives a new input keyword from the input keyword recognition unit 108.

In the above example, the co-occurrence propagation unit 209 stops when the number of co-occurrence words following the first-stage co-occurrence word reaches the predetermined value A, but the stop condition is not limited to this. For example, as in the first embodiment, the co-occurrence propagation unit 209 may stop when the cumulative co-occurrence probability falls to or below the threshold γ, or when the number of word strings obtained reaches an upper limit.

Like the suggested word string recognition unit 110 of the first embodiment, the suggested word string recognition unit 210 recognizes, among the word strings received from the co-occurrence propagation unit 209, those that satisfy a second condition as suggested word strings and outputs them to the display unit 111. Before doing so, however, the suggested word string recognition unit 210 refers to the zero co-occurrence dictionary 304 and determines whether any of the word strings received from the co-occurrence propagation unit 209 contains both keyword candidates of a zero co-occurrence keyword set. If such a word string exists, it is excluded, and among the remaining word strings those that satisfy the second condition are output to the display unit 111 as suggested word strings. As in the first embodiment, the second condition may be, for example, being among the top N word strings (N is a predetermined natural number) in descending order of cumulative co-occurrence probability.

FIG. 18 is a flowchart illustrating an example of the processing performed by the input keyword recognition unit 108, the co-occurrence propagation unit 209, and the suggested word string recognition unit 210 of the search support apparatus 200 of the present embodiment.

On receiving an input character string from the input receiving unit 107, the input keyword recognition unit 108 uses the Patricia tree 303 to obtain the keyword candidates that complete the input character string and recognizes, as input keywords, the top M0 of these keyword candidates in descending order of appearance frequency (step S301). The input keyword recognition unit 108 then outputs the recognized input keywords to the co-occurrence propagation unit 209.

Next, on receiving the M0 input keywords from the input keyword recognition unit 108, the co-occurrence propagation unit 209 refers to the co-occurrence dictionary 302 to obtain the co-occurrence words of each of the M0 input keywords and acquires, as first-stage co-occurrence words, the top M1 of these in descending order of co-occurrence probability (step S302). The co-occurrence propagation unit 209 also refers to the co-occurrence dictionary 302 to obtain the co-occurrence words of each (L-1)-stage co-occurrence word (L is a natural number of 2 or more) and acquires, as L-stage co-occurrence words, those whose co-occurrence probability with the (L-1)-stage co-occurrence word is equal to or greater than the threshold δ (step S303). The co-occurrence propagation unit 209 then determines whether the value of L has reached the predetermined value A (step S304). If the value of L is less than the predetermined value A (step S304: No), the co-occurrence propagation unit 209 increments L by 1 (step S305), returns to step S303, and repeats the process of acquiring L-stage co-occurrence words. If the value of L has reached the predetermined value A (step S304: Yes), the co-occurrence propagation unit 209 outputs the word strings connecting each input keyword with the acquired co-occurrence words, together with the cumulative co-occurrence probability of each word string, to the suggested word string recognition unit 210.

Next, on receiving the word strings from the co-occurrence propagation unit 209, the suggested word string recognition unit 210 refers to the zero co-occurrence dictionary 304 and determines whether any of the received word strings contains both keyword candidates of a zero co-occurrence keyword set (step S306). If such a word string exists (step S306: Yes), the suggested word string recognition unit 210 deletes it (step S307). If no such word string exists (step S306: No), step S307 is skipped. The suggested word string recognition unit 210 then recognizes, among the word strings remaining after the deletion in step S307, the top N word strings in descending order of cumulative co-occurrence probability as suggested word strings (step S308), and outputs the recognized suggested word strings to the display unit 111 (step S309).

Like the search support apparatus 100 of the first embodiment, the search support apparatus 200 of the present embodiment can adopt the hardware configuration of an ordinary computer, including, for example, a control device such as a CPU, internal storage devices such as a ROM and a RAM, input devices such as a keyboard and a mouse, a display device such as a liquid crystal panel, and external storage devices such as a hard disk, CD, DVD, or flash memory. In this case, the functional configuration of the search support apparatus 200 described above is realized by the computer executing a program. The index 301, the co-occurrence dictionary 302, the Patricia tree 303, and the zero co-occurrence dictionary 304 can be stored in an external storage device such as a hard disk or flash memory.

Next, a specific example of the processing performed by the search support apparatus 200 of the present embodiment will be described, taking as an example a search over the document set 300 illustrated in FIG. 2. In the following, it is assumed that the co-occurrence dictionary 302 illustrated in FIG. 10, the Patricia tree 303 illustrated in FIG. 11, and the zero co-occurrence dictionary 304 illustrated in FIG. 17 have been generated from the document set 300 shown in FIG. 2.

First, when the user enters the character string “o” using an input device such as a keyboard, the input receiving unit 107 accepts the input and outputs the input character string “o” to the input keyword recognition unit 108. The input keyword recognition unit 108 receives the input character string “o” from the input receiving unit 107 and searches the Patricia tree 303 illustrated in FIG. 11 to recognize input keywords. To find the node corresponding to the input character string “o”, the input keyword recognition unit 108 first traverses the Patricia tree 303 illustrated in FIG. 11 from the root and obtains the “ori” node as the node corresponding to “o”. Next, the input keyword recognition unit 108 identifies the leaves reachable from the “ori” node, takes the character strings indicated by those leaves as the continuations of “ori”, and thereby obtains the keyword candidates that complete the input character string “o”. In this case, “Olympic”, “Orion”, and “Olympos” are obtained. The input keyword recognition unit 108 recognizes the top M0 of the obtained keyword candidates, in descending order of appearance frequency, as input keywords. Here, M0 = 3 is assumed. The input keyword recognition unit 108 therefore recognizes “Olympic”, “Orion”, and “Olympos” as input keywords and outputs them to the co-occurrence propagation unit 209.
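The completion step can be illustrated with the following minimal sketch. It does not implement a Patricia tree; as a simpler stand-in it performs prefix completion by binary search over a sorted list of readings, which returns the same set of candidates for a given prefix. The romanized readings, the appearance frequencies, and the function names are hypothetical.

```python
from bisect import bisect_left


def complete_prefix(sorted_readings, prefix):
    """Return all readings in the sorted list that start with the prefix."""
    start = bisect_left(sorted_readings, prefix)
    matches = []
    for reading in sorted_readings[start:]:
        if not reading.startswith(prefix):
            break  # readings sharing a prefix are contiguous in sorted order
        matches.append(reading)
    return matches


def recognize_input_keywords(entries, prefix, m0):
    """entries maps a reading to (keyword, frequency); return the top-m0 keywords."""
    readings = sorted(entries)
    candidates = [entries[r] for r in complete_prefix(readings, prefix)]
    candidates.sort(key=lambda item: -item[1])  # descending appearance frequency
    return [keyword for keyword, _ in candidates[:m0]]


# Romanized readings and appearance frequencies are hypothetical.
entries = {
    "orinpikku": ("Olympic", 6),
    "orion": ("Orion", 4),
    "orinposu": ("Olympos", 3),
    "tokyo": ("Tokyo", 2),
}
input_keywords = recognize_input_keywords(entries, "o", m0=3)
print(input_keywords)
```

A Patricia tree answers the same prefix query by descending to the node that covers the prefix and enumerating its leaves, which avoids scanning a sorted list when the vocabulary is large.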

The co-occurrence propagation unit 209 receives the input keywords “Olympic”, “Orion”, and “Olympos” from the input keyword recognition unit 108 and refers to the co-occurrence dictionary 302 illustrated in FIG. 10 to find the dictionary elements whose headword is one of these input keywords. Here, the following dictionary elements are found.
(Orion, Hanako, 0.75)
(Orion, Komukai, 0.5)
(Orion, Isogo, 0.5)
(Orion, Jiro, 0.5)
(Orion, World Cup, 0.5)
(Orion, Publishing, 0.4)
(Orion, Kamakura, 0.33)
(Orion, Sugita, 0.33)
(Orion, Toyo, 0.25)
(Orion, Fuchu, 0.25)
(Olympic, Toyo, 0.67)
(Olympic, Horikawa, 0.5)
(Olympic, Shibaura, 0.5)
(Olympic, Kawasaki, 0.5)
(Olympic, Haruko, 0.5)
(Olympic, Jiro, 0.5)
(Olympic, World, 0.5)
(Olympic, Track and Field, 0.5)
(Olympic, Taro, 0.5)
(Olympic, Publishing, 0.4)
(Olympic, Hanako, 0.4)
(Olympic, Fuchu, 0.25)
(Olympos, Kamakura, 0.5)
(Olympos, Tokyo, 0.5)
(Olympos, Fuchu, 0.33)
(Olympos, Publishing, 0.2)
(Olympos, Hanako, 0.2)

Next, from the co-occurrence words of the dictionary elements found, the co-occurrence propagation unit 209 acquires as first-stage co-occurrence words the top M1 co-occurrence words in descending order of co-occurrence probability with the headword (input keyword). Here, M1 = 3 is assumed. Among co-occurrence words with the same co-occurrence probability, the ones to acquire can be decided, for example, in descending order of the appearance frequency of the headword or of the co-occurrence word. In this case, “Hanako” (co-occurrence probability 0.75 with the input keyword “Orion”), “Toyo” (co-occurrence probability 0.67 with the input keyword “Olympic”), and “Komukai” (co-occurrence probability 0.5 with the input keyword “Orion”) are acquired as first-stage co-occurrence words.
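The selection of the top M1 first-stage co-occurrence words can be sketched as follows, using a subset of the dictionary elements listed above. The function name is hypothetical. Ties are resolved here by listing order (Python's `sorted` is stable), which is an assumption standing in for the frequency-based tie-breaking described in the text, since the appearance frequencies are not given in this passage.

```python
def select_first_stage(elements, m1):
    """Pick the m1 dictionary elements with the highest co-occurrence probability.

    sorted() is stable, so elements with equal probability keep their listing
    order; the frequency-based tie-breaking of the text is not reproduced.
    """
    ranked = sorted(elements, key=lambda element: -element[2])
    return ranked[:m1]


# A subset of the (headword, co-occurrence word, probability) elements listed above.
elements = [
    ("Orion", "Hanako", 0.75),
    ("Orion", "Komukai", 0.5),
    ("Orion", "Isogo", 0.5),
    ("Olympic", "Toyo", 0.67),
    ("Olympic", "Horikawa", 0.5),
    ("Olympos", "Kamakura", 0.5),
    ("Olympos", "Fuchu", 0.33),
]
first_stage = select_first_stage(elements, m1=3)
print(first_stage)
```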

Further, to find the second-stage co-occurrence words that co-occur with the first-stage co-occurrence words, the co-occurrence propagation unit 209 refers to the co-occurrence dictionary 302 again and finds the dictionary elements whose headword is a first-stage co-occurrence word. Here, the following dictionary elements are found.
(Hanako, Orion, 0.75)
(Hanako, Fuchu, 0.6)
(Hanako, Kamakura, 0.4)
(Hanako, Sugita, 0.4)
(Hanako, Olympic, 0.4)
(Toyo, Olympic, 0.67)
(Toyo, Publishing, 0.6)
(Toyo, Taro, 0.33)
(Toyo, Horikawa, 0.33)
(Toyo, Shibaura, 0.33)
(Komukai, Sugita, 0.5)
(Komukai, Tokyo, 0.5)
(Komukai, Orion, 0.5)

Next, from the co-occurrence words of the dictionary elements found, the co-occurrence propagation unit 209 acquires as second-stage co-occurrence words the keyword candidates whose co-occurrence probability with the headword (first-stage co-occurrence word) is equal to or greater than the threshold δ. Here, the threshold δ = 0.4 is assumed. For the first-stage co-occurrence word “Hanako”, “Orion”, “Fuchu”, “Kamakura”, and “Sugita”, each with a co-occurrence probability of 0.4 or more, are acquired as second-stage co-occurrence words; “Orion” duplicates the input keyword and is therefore excluded from processing. For the first-stage co-occurrence word “Toyo”, “Olympic” and “Publishing”, each with a co-occurrence probability of 0.4 or more, are acquired as second-stage co-occurrence words; “Olympic” duplicates the input keyword and is therefore excluded from processing. For the first-stage co-occurrence word “Komukai”, “Sugita”, “Tokyo”, and “Orion”, each with a co-occurrence probability of 0.4 or more, are acquired as second-stage co-occurrence words; “Orion” duplicates the input keyword and is therefore excluded from processing.

As a result of the above processing, the following word strings connecting an input keyword, a first-stage co-occurrence word, and a second-stage co-occurrence word are obtained: (Orion, Hanako, Fuchu), (Orion, Hanako, Kamakura), (Orion, Hanako, Sugita), (Olympic, Toyo, Publishing), (Orion, Komukai, Sugita), and (Orion, Komukai, Tokyo).

Next, the co-occurrence propagation unit 209 checks whether the number of co-occurrence words following the first-stage co-occurrence word has reached the predetermined value A. Here, for simplicity, the predetermined value A = 1 is assumed. In this case, the co-occurrence propagation unit 209 stops once the second-stage co-occurrence words have been acquired, calculates the cumulative co-occurrence probability of each word string obtained so far, and outputs (Orion, Hanako, Fuchu, 0.45), (Orion, Hanako, Kamakura, 0.45), (Orion, Hanako, Sugita, 0.3), (Olympic, Toyo, Publishing, 0.4), (Orion, Komukai, Sugita, 0.25), and (Orion, Komukai, Tokyo, 0.25) to the suggested word string recognition unit 210.
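The propagation from first-stage to second-stage co-occurrence words can be sketched as follows, using the dictionary elements listed above. The sketch rests on three assumptions that are not stated in this passage: the cumulative co-occurrence probability is taken to be the product of the pairwise co-occurrence probabilities along the word string (consistent with 0.75 × 0.6 = 0.45 and 0.5 × 0.5 = 0.25), the predetermined value A is interpreted as the number of words appended after the first-stage co-occurrence word, and every recognized input keyword is excluded from later stages. Under the product assumption, (Orion, Hanako, Kamakura) evaluates to 0.75 × 0.4 = 0.3 rather than the 0.45 given in the text. The function and parameter names are hypothetical.

```python
from math import prod


def propagate(first_stage, dictionary, input_keywords, delta, extra_stages):
    """Extend each (input keyword, first-stage word) pair by `extra_stages` words.

    Returns (word string, cumulative probability) pairs; the cumulative
    probability is assumed to be the product of the pairwise probabilities.
    """
    strings = [([head, word], [p]) for head, word, p in first_stage]
    for _ in range(extra_stages):
        extended = []
        for words, probs in strings:
            for nxt, p in dictionary.get(words[-1], []):
                # Skip weak links, repeated words, and (by assumption) input keywords.
                if p < delta or nxt in words or nxt in input_keywords:
                    continue
                extended.append((words + [nxt], probs + [p]))
        strings = extended
    return [(tuple(words), round(prod(probs), 2)) for words, probs in strings]


first_stage = [
    ("Orion", "Hanako", 0.75),
    ("Olympic", "Toyo", 0.67),
    ("Orion", "Komukai", 0.5),
]
dictionary = {
    "Hanako": [("Orion", 0.75), ("Fuchu", 0.6), ("Kamakura", 0.4),
               ("Sugita", 0.4), ("Olympic", 0.4)],
    "Toyo": [("Olympic", 0.67), ("Publishing", 0.6), ("Taro", 0.33),
             ("Horikawa", 0.33), ("Shibaura", 0.33)],
    "Komukai": [("Sugita", 0.5), ("Tokyo", 0.5), ("Orion", 0.5)],
}
word_strings = propagate(
    first_stage, dictionary, {"Olympic", "Orion", "Olympos"},
    delta=0.4, extra_stages=1,
)
for words, probability in word_strings:
    print(words, probability)
```

Excluding all input keywords, rather than only the one that starts the word string, is what keeps (Orion, Hanako, Olympic) out of the result, which matches the six word strings listed in the text.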

On receiving the word strings (Orion, Hanako, Fuchu, 0.45), (Orion, Hanako, Kamakura, 0.45), (Orion, Hanako, Sugita, 0.3), (Olympic, Toyo, Publishing, 0.4), (Orion, Komukai, Sugita, 0.25), and (Orion, Komukai, Tokyo, 0.25) from the co-occurrence propagation unit 209, the suggested word string recognition unit 210 first refers to the zero co-occurrence dictionary 304 and, if any of these word strings contains both keyword candidates of a zero co-occurrence keyword set, deletes that word string. The zero co-occurrence dictionary 304 illustrated in FIG. 17 contains a dictionary element with “Orion” as the headword and “Tokyo” as the zero co-occurrence word. Because (Orion, Komukai, Tokyo, 0.25) contains both “Orion” and “Tokyo”, it is deleted.

Next, from the word strings that remain after deletion, namely (Orion, Hanako, Fuchu, 0.45), (Orion, Hanako, Kamakura, 0.45), (Orion, Hanako, Sugita, 0.3), (Olympic, Toyo, Publishing, 0.4), and (Orion, Komukai, Sugita, 0.25), the proposed word string recognition unit 210 recognizes the top N word strings, in descending order of cumulative co-occurrence probability, as proposed word strings. Here, N is assumed to be 4. When word strings have the same cumulative co-occurrence probability, the N proposed word strings can be determined using criteria such as descending appearance frequency of the input keyword, descending appearance frequency of the first-stage co-occurrence word, or descending appearance frequency of the second-stage co-occurrence word. In this case, the proposed word string recognition unit 210 recognizes (Orion, Hanako, Fuchu), (Orion, Hanako, Kamakura), (Orion, Hanako, Sugita), and (Olympic, Toyo, Publishing) as proposed word strings and outputs them to the display unit 111.
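
The two steps performed by the proposed word string recognition unit 210 (deleting word strings that contain a zero co-occurrence pair, then keeping the top N by cumulative co-occurrence probability) might be sketched as below. The word strings, probabilities, and the single zero co-occurrence entry are the ones quoted in the text; the function names and data layout are assumptions made for illustration, and the frequency-based tie-breaking mentioned above is not implemented (ties simply keep their input order).

```python
# Word strings and cumulative co-occurrence probabilities quoted in the text.
WORD_STRINGS = [
    (("Orion", "Hanako", "Fuchu"), 0.45),
    (("Orion", "Hanako", "Kamakura"), 0.45),
    (("Orion", "Hanako", "Sugita"), 0.3),
    (("Olympic", "Toyo", "Publishing"), 0.4),
    (("Orion", "Komukai", "Sugita"), 0.25),
    (("Orion", "Komukai", "Tokyo"), 0.25),
]

# Zero co-occurrence dictionary entry mentioned in the text: headword -> zero co-occurrence words.
ZERO_DICT = {"Orion": {"Tokyo"}}


def contains_zero_pair(words, zero_dict):
    """True if the word string contains both members of a zero co-occurrence pair."""
    word_set = set(words)
    return any(zero_dict.get(w, set()) & word_set for w in words)


def recognize_proposals(word_strings, zero_dict, n):
    """Drop word strings with a zero co-occurrence pair, then keep the top n."""
    kept = [(w, p) for w, p in word_strings if not contains_zero_pair(w, zero_dict)]
    # list.sort is stable, so equal probabilities keep their original order.
    kept.sort(key=lambda item: item[1], reverse=True)
    return [w for w, _ in kept[:n]]


for proposal in recognize_proposals(WORD_STRINGS, ZERO_DICT, 4):
    print(proposal)
```

With N = 4 this yields the same four proposed word strings as the text, ordered by descending probability.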

Every proposed word string that the proposed word string recognition unit 210 outputs to the display unit 111 is a word string that is certain to yield a search result when the user selects it (that is, the hit count never becomes zero). For example, selecting (Orion, Hanako, Fuchu) retrieves [Document 2] from the document set 300 illustrated in FIG. 2; selecting (Orion, Hanako, Kamakura) also retrieves [Document 2]; selecting (Orion, Hanako, Sugita) retrieves [Document 3]; and selecting (Olympic, Toyo, Publishing) retrieves [Document 1].

On receiving the proposed word strings (Orion, Hanako, Fuchu), (Orion, Hanako, Kamakura), (Orion, Hanako, Sugita), and (Olympic, Toyo, Publishing) from the proposed word string recognition unit 210, the display unit 111 displays them and presents them to the user. When the user uses an input device such as a mouse to select one of the displayed proposed word strings, for example (Olympic, Toyo, Publishing), the input receiving unit 107 accepts the selection operation, and the keyword candidates "Olympic", "Toyo", and "Publishing" contained in the proposed word string are each output to the search unit 112. If, while the proposed word strings are displayed on the display unit 111, the user continues typing without selecting any of them and enters, for example, the character string "おりお" ("orio"), the input keyword recognition unit 108 recognizes "Orion" as the input keyword, and the same processing is performed thereafter.

On receiving "Olympic", "Toyo", and "Publishing" from the input receiving unit 107, the search unit 112 generates a search expression containing all of these keyword candidates as search keywords and searches the document set 300. As a result, [Document 1] is retrieved from the document set 300 illustrated in FIG. 2, and information about [Document 1] is displayed on the display unit 111 as the search result.
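
As a rough illustration of the search step, the sketch below builds a conjunctive (AND) search expression from the selected proposed word string and matches it against a toy document set. The contents of `DOCUMENTS` are a hypothetical stand-in, made up only to be consistent with the retrieval results stated in the text, and are not copied from FIG. 2; `build_query` and `search` are likewise invented names.

```python
# Hypothetical stand-in for the document set 300: document name -> keywords it contains.
DOCUMENTS = {
    "Document 1": {"Olympic", "Toyo", "Publishing"},
    "Document 2": {"Orion", "Hanako", "Fuchu", "Kamakura"},
    "Document 3": {"Orion", "Hanako", "Sugita"},
}


def build_query(keywords):
    """Join every keyword of the proposed word string into one AND expression."""
    return " AND ".join(keywords)


def search(documents, keywords):
    """Return the documents that contain all of the given keywords."""
    required = set(keywords)
    return [name for name, words in documents.items() if required <= words]


selected = ("Olympic", "Toyo", "Publishing")
print(build_query(selected))
print(search(DOCUMENTS, selected))
```

Because every keyword must be present, a word string containing a zero co-occurrence pair could never match any document, which is why such strings are removed before presentation.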

As described above in detail with specific examples, with the search support device 200 of the present embodiment, as with the search support device 100 of the first embodiment, the user only has to specify the document set 300 to be searched. Co-occurrence keyword sets, which are combinations of keyword candidates in the document set 300 that have a high co-occurrence probability, are then detected; the co-occurrence dictionary 302 is generated from the co-occurrence keyword sets; and the Patricia tree 303 is generated as a character string completion rule for completing an input character string to obtain, as the input keyword, a keyword candidate that belongs to a co-occurrence keyword set. When the user enters a character string, proposed word strings that connect the input keyword corresponding to that character string with the co-occurrence words to which the co-occurrence relationship propagates from the input keyword are presented to the user. When the user selects one of the presented proposed word strings, the document set 300 is searched for documents that contain all the keyword candidates in the selected proposed word string. Therefore, the search support device 200 of the present embodiment can propose search keyword candidates suited to the document set 300 to be searched and appropriately support document search without requiring the user to perform cumbersome operations in advance.

In addition, with the search support device 200 of the present embodiment, zero co-occurrence keyword sets, which are combinations of two keyword candidates whose co-occurrence relationship is zero, are detected, and the zero co-occurrence dictionary 304 is generated. The proposed word string recognition unit 210 refers to the zero co-occurrence dictionary 304 and outputs to the display unit 111, as proposed word strings, only the word strings that remain after removing every word string containing both keyword candidates of a zero co-occurrence keyword set. Therefore, the search support device 200 of the present embodiment presents to the user only proposed word strings that are certain to yield search results, and thus appropriately supports the search so that the user does not perform futile searches.
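
One simple way to picture the generation of a zero co-occurrence dictionary is sketched below. It assumes that a zero co-occurrence keyword set is any pair of keyword candidates that never appear together in the same document; the two toy documents and the function name `build_zero_dict` are hypothetical, and the embodiment may restrict which candidates are considered.

```python
from itertools import combinations

# Hypothetical toy documents, each represented by the keyword candidates it contains.
DOCUMENTS = [
    {"Orion", "Komukai", "Sugita"},
    {"Komukai", "Tokyo"},
]


def build_zero_dict(documents):
    """Map each headword to the keyword candidates it never shares a document with."""
    vocabulary = set().union(*documents)
    zero = {word: set() for word in vocabulary}
    for a, b in combinations(sorted(vocabulary), 2):
        if not any(a in doc and b in doc for doc in documents):
            # The pair never co-occurs, so record it in both directions.
            zero[a].add(b)
            zero[b].add(a)
    # Keep only headwords that have at least one zero co-occurrence word.
    return {word: others for word, others in zero.items() if others}


print(build_zero_dict(DOCUMENTS))
```

In this toy set "Komukai" co-occurs with both "Orion" and "Tokyo", yet "Orion" and "Tokyo" never share a document, which mirrors how (Orion, Komukai, Tokyo) can arise from propagation and still have no matching document.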

The functional components of the search support devices 100 and 200 of the first and second embodiments described above can be realized, for example, when a computer is used as the hardware of the search support devices 100 and 200, by executing a predetermined program on that computer. The program executed on the computer used as the search support device 100 or 200 is recorded, as a file in an installable or executable format, on a computer-readable recording medium such as a CD-ROM (Compact Disk Read Only Memory), a flexible disk (FD), a CD-R (Compact Disk Recordable), or a DVD (Digital Versatile Disc), and is provided as a computer program product.

The program executed on the computer used as the search support device 100 or 200 may also be stored on another computer connected to a network such as the Internet and provided by being downloaded over the network. The program may also be provided or distributed via a network such as the Internet. Alternatively, the program may be provided by being incorporated in advance in a ROM or the like inside the computer.

The program executed on the computer used as the search support device 100 or 200 has a modular structure that includes the main components of the search support devices 100 and 200 (the keyword candidate extraction unit 101, co-occurrence probability calculation unit 103, co-occurrence keyword set detection unit 104, zero co-occurrence keyword set detection unit 201, co-occurrence dictionary generation unit 105, Patricia tree generation unit 106, zero co-occurrence dictionary generation unit 202, input keyword recognition unit 108, co-occurrence propagation units 109 and 209, proposed word string recognition units 110 and 210, display unit 111, and search unit 112). In terms of actual hardware, for example, a CPU (processor) reads the program from a storage medium and executes it, whereby each of the above components is loaded onto, and generated on, the main storage device. Some or all of the main components of the search support devices 100 and 200 of the embodiments can also be realized using dedicated hardware such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field-Programmable Gate Array).

According to the search support device of at least one of the embodiments described above, providing the main components listed above makes it possible to propose search keyword candidates suited to the document set to be searched and to appropriately support document search, without requiring the user to perform cumbersome operations in advance.

Although several embodiments of the present invention have been described above, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are also included in the invention described in the claims and its equivalents.

DESCRIPTION OF SYMBOLS
100 Search support device
101 Keyword candidate extraction unit
103 Co-occurrence probability calculation unit
104 Co-occurrence keyword set detection unit
105 Co-occurrence dictionary generation unit
106 Patricia tree generation unit
108 Input keyword recognition unit
109 Co-occurrence propagation unit
110 Proposed word string recognition unit
111 Display unit
112 Search unit
200 Search support device
201 Zero co-occurrence keyword set detection unit
202 Zero co-occurrence dictionary generation unit
209 Co-occurrence propagation unit
210 Proposed word string recognition unit

Claims

A search support device comprising:
an extraction unit that extracts keyword candidates from a document set to be searched;
a calculation unit that calculates, for each combination of two extracted keyword candidates, a co-occurrence probability, which is the probability that one keyword candidate appears together with the other keyword candidate in the same document in the document set;
a first detection unit that detects a co-occurrence keyword set, which is a combination of two keyword candidates whose co-occurrence probability satisfies a first condition;
a first generation unit that generates a co-occurrence dictionary, which is a set of dictionary elements each having one keyword candidate of a co-occurrence keyword set as a headword and the other keyword candidate as a co-occurrence word;
a second generation unit that generates a character string completion rule, which is a rule for completing an input character string to obtain a keyword candidate included in a co-occurrence keyword set;
a first recognition unit that recognizes, as an input keyword, a keyword candidate obtained by completing an input character string in accordance with the character string completion rule;
a co-occurrence propagation unit that refers to the co-occurrence dictionary, acquires the co-occurrence word of a dictionary element having the input keyword as its headword, and repeats a process of acquiring the co-occurrence word of a dictionary element having an acquired co-occurrence word as its headword;
a second detection unit that detects a zero co-occurrence keyword set, which is a combination of two keyword candidates whose co-occurrence probability is zero;
a third generation unit that generates a zero co-occurrence dictionary, which is a set of dictionary elements each having one keyword candidate of a zero co-occurrence keyword set as a headword and the other keyword candidate as a zero co-occurrence word;
a second recognition unit that refers to the zero co-occurrence dictionary and, when the word strings connecting the input keyword and the co-occurrence words acquired by the co-occurrence propagation unit include a word string that contains both keyword candidates of a zero co-occurrence keyword set, excludes that word string and recognizes, as a proposed word string, a word string among the remaining word strings that satisfies a second condition;
a presentation unit that presents the proposed word string; and
a search unit that, when the presented proposed word string is selected, generates a search expression based on the proposed word string and performs a search on the document set.

The search support device according to claim 1, wherein the co-occurrence propagation unit, taking the co-occurrence word of a dictionary element having the input keyword as its headword to be a first-stage co-occurrence word, and taking the co-occurrence word of a dictionary element having an (L-1)-stage co-occurrence word as its headword (L being a natural number of 2 or more) to be an L-stage co-occurrence word, repeats the process of acquiring the L-stage co-occurrence word while incrementing L by 1, and stops the process when a cumulative co-occurrence probability, which is a value obtained by successively multiplying the co-occurrence probability between the input keyword and the first-stage co-occurrence word by the co-occurrence probability between the (L-1)-stage co-occurrence word and the L-stage co-occurrence word, becomes equal to or less than a first threshold.

The search support device according to claim 2, wherein the second recognition unit recognizes, as the proposed word strings, the top N word strings (N being a predetermined natural number), in descending order of the cumulative co-occurrence probability, among the word strings connecting the input keyword and the co-occurrence words acquired while the co-occurrence propagation unit continues the process.

The search support device according to claim 1, wherein the co-occurrence propagation unit, taking the co-occurrence word of a dictionary element having the input keyword as its headword to be a first-stage co-occurrence word, and taking the co-occurrence word of a dictionary element having an (L-1)-stage co-occurrence word as its headword (L being a natural number of 2 or more) to be an L-stage co-occurrence word, repeats the process of acquiring the L-stage co-occurrence word while incrementing L by 1, and stops the process when L reaches a predetermined value.

The search support device according to claim 1, wherein the first detection unit obtains the number of combinations of two keyword candidates whose co-occurrence probability is greater than a second threshold; if the obtained number is less than a third threshold, repeats a process of reducing the second threshold by a predetermined amount and obtaining the number of combinations of two keyword candidates whose co-occurrence probability is greater than the reduced second threshold; and, when the obtained number becomes equal to or greater than the third threshold, detects those combinations of two keyword candidates as the co-occurrence keyword sets.

A search support method executed in a search support device, the method comprising:
a step in which an extraction unit of the search support device extracts keyword candidates from a document set to be searched;
a step in which a calculation unit of the search support device calculates, for each combination of two extracted keyword candidates, a co-occurrence probability, which is the probability that one keyword candidate appears together with the other keyword candidate in the same document in the document set;
a step in which a first detection unit of the search support device detects a co-occurrence keyword set, which is a combination of two keyword candidates whose co-occurrence probability satisfies a first condition;
a step in which a first generation unit of the search support device generates a co-occurrence dictionary, which is a set of dictionary elements each having one keyword candidate of a co-occurrence keyword set as a headword and the other keyword candidate as a co-occurrence word;
a step in which a second generation unit of the search support device generates a character string completion rule, which is a rule for completing an input character string to obtain a keyword candidate included in a co-occurrence keyword set;
a step in which a first recognition unit of the search support device recognizes, as an input keyword, a keyword candidate obtained by completing an input character string in accordance with the character string completion rule;
a step in which a co-occurrence propagation unit of the search support device refers to the co-occurrence dictionary, acquires the co-occurrence word of a dictionary element having the input keyword as its headword, and repeats a process of acquiring the co-occurrence word of a dictionary element having an acquired co-occurrence word as its headword;
a step in which a second detection unit of the search support device detects a zero co-occurrence keyword set, which is a combination of two keyword candidates whose co-occurrence probability is zero;
a step in which a third generation unit of the search support device generates a zero co-occurrence dictionary, which is a set of dictionary elements each having one keyword candidate of a zero co-occurrence keyword set as a headword and the other keyword candidate as a zero co-occurrence word;
a step in which a second recognition unit of the search support device refers to the zero co-occurrence dictionary and, when the word strings connecting the input keyword and the co-occurrence words acquired by the co-occurrence propagation unit include a word string that contains both keyword candidates of a zero co-occurrence keyword set, excludes that word string and recognizes, as a proposed word string, a word string among the remaining word strings that satisfies a second condition;
a step in which a presentation unit of the search support device presents the proposed word string; and
a step in which a search unit of the search support device, when the presented proposed word string is selected, generates a search expression based on the proposed word string and performs a search on the document set.

A program for causing a computer to realize:
a function of an extraction unit that extracts keyword candidates from a document set to be searched;
a function of a calculation unit that calculates, for each combination of two extracted keyword candidates, a co-occurrence probability, which is the probability that one keyword candidate appears together with the other keyword candidate in the same document in the document set;
a function of a first detection unit that detects a co-occurrence keyword set, which is a combination of two keyword candidates whose co-occurrence probability satisfies a first condition;
a function of a first generation unit that generates a co-occurrence dictionary, which is a set of dictionary elements each having one keyword candidate of a co-occurrence keyword set as a headword and the other keyword candidate as a co-occurrence word;
a function of a second generation unit that generates a character string completion rule, which is a rule for completing an input character string to obtain a keyword candidate included in a co-occurrence keyword set;
a function of a first recognition unit that recognizes, as an input keyword, a keyword candidate obtained by completing an input character string in accordance with the character string completion rule;
a function of a co-occurrence propagation unit that refers to the co-occurrence dictionary, acquires the co-occurrence word of a dictionary element having the input keyword as its headword, and repeats a process of acquiring the co-occurrence word of a dictionary element having an acquired co-occurrence word as its headword;
a function of a second detection unit that detects a zero co-occurrence keyword set, which is a combination of two keyword candidates whose co-occurrence probability is zero;
a function of a third generation unit that generates a zero co-occurrence dictionary, which is a set of dictionary elements each having one keyword candidate of a zero co-occurrence keyword set as a headword and the other keyword candidate as a zero co-occurrence word;
a function of a second recognition unit that refers to the zero co-occurrence dictionary and, when the word strings connecting the input keyword and the co-occurrence words acquired by the co-occurrence propagation unit include a word string that contains both keyword candidates of a zero co-occurrence keyword set, excludes that word string and recognizes, as a proposed word string, a word string among the remaining word strings that satisfies a second condition;
a function of a presentation unit that presents the proposed word string; and
a function of a search unit that, when the presented proposed word string is selected, generates a search expression based on the proposed word string and performs a search on the document set.