JP3385913B2

JP3385913B2 - Related word presentation device and medium recording related word presentation program

Info

Publication number: JP3385913B2
Application number: JP13730097A
Authority: JP
Inventors: 博増市
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1997-05-27
Filing date: 1997-05-27
Publication date: 2003-03-10
Anticipated expiration: 2017-05-27
Also published as: JPH10334105A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a related word presentation device and to a medium recording a related word presentation program. More particularly, it relates to a related word presentation device that presents words related to a search condition, and to a medium recording a related word presentation program that causes a computer to present words related to a search condition.

[0002]

[Prior Art] Search systems generally use keyword-based search. When an arbitrary keyword (search word) is entered into the search system as a search condition, every document whose content contains the search word is returned as a search result. When such a system searches a large body of literature, however, the search results are inevitably numerous as well, and a great deal of effort is spent judging their relevance. Although it varies between individuals, this effort has a physical and psychological limit called the futility point. When the search results exceed that limit, users tend to narrow the results in order to reduce the effort needed to judge relevance. In other words, users tend to care more about bringing the number of results below the futility point than about obtaining relevant results. They therefore overuse the logical AND operator to narrow the search target, and as a result useful documents that could otherwise have been retrieved are missed. This tendency is especially pronounced in searches of large-scale full-text databases.

[0003] For this reason, as proposed in Japanese Patent Laid-Open No. 2-297290, a method is used in which words related to a search word are presented to the user by means of a related word dictionary. The user selects appropriate words from among the related words of the search words in the search expression, joins them with the logical AND operator, and narrows the search results, thereby avoiding the blind narrowing described above.

[0004] For example, when the search word entered by the user is "bearing", related words such as "ball bearing", "micro ball bearing", "liquid bearing", "gas-lubricated bearing", and "magnetic bearing" are obtained from the related word dictionary and presented to the user. If the user wants to know about magnetic bearings, the search results can be narrowed appropriately by using "magnetic bearing" as a new search word (or by joining "magnetic" to "bearing" with the logical AND operator).

[0005] That is, by selecting, from the words presented as related words of the search word, the words that match the search intention and joining them with the logical AND operator, the search results can be narrowed accurately.

[0006]

[Problems to Be Solved by the Invention] However, because ordinary keyword search returns only documents that exactly match the search words, the search results are often narrowed more than necessary, and yet the user has no easy way of knowing whether they have been narrowed more than necessary.

[0007] To confirm that the narrowing is not excessive (that important search results have not been dropped), the user must compare the contents of the search results before narrowing with those after narrowing. Given the large volume of search results, such confirmation is impossible in practice. With the prior art described above, therefore, the user learns nothing about what kind of search results the narrowing produced (how the narrowing changed the search results). Consequently, even when the results obtained by narrowing do not match the search intention (a very large number of relevant documents are missed), those results end up being adopted as the final results.

[0008] The present invention has been made in view of these points, and its object is to provide a related word presentation device with which the effect that a change in search conditions has on the search results can be easily confirmed.

[0009] Another object of the present invention is to provide a medium recording a related word presentation program that causes a computer to function so that the effect that a change in search conditions has on the search results can be easily confirmed.

[0010]

[Means for Solving the Problems] To solve the above problems, the present invention provides a related word presentation device that presents words related to a search condition, comprising: document storage means for storing a plurality of documents; search condition receiving means for receiving a plurality of input search conditions; document search means for obtaining, from the document storage means, the document set that matches each search condition received by the search condition receiving means; relevance calculation means that treats each word present in the document set obtained by the document search means as a related word candidate, obtains a first value, which is the number of documents obtained by the document search means, a second value for each related word candidate, which is the number of documents in the obtained document set that contain that candidate, and a third value for each related word candidate, which is the number of documents stored in the document storage means that contain that candidate, calculates for each related word candidate a fourth value that is the product or the sum of the first value and the third value, and, based on the ratio between the second value and the fourth value, calculates for each search condition the degree of relevance between the search condition received by the search condition receiving means and each related word candidate; related word calculation means that compares the degrees of relevance of each related word candidate calculated by the relevance calculation means for the respective search conditions received by the search condition receiving means, and determines related words based on the change in each candidate's degree of relevance; and related word display means for displaying the related words determined by the related word calculation means on a display device.

[0011] In this related word presentation device, the plurality of input search conditions are received by the search condition receiving means, and the document search means obtains from the document storage means the document set that matches each search condition. Next, the relevance calculation means calculates, for each search condition, the degree of relevance between the search condition received by the search condition receiving means and each related word candidate. The related word calculation means then compares the degrees of relevance of each related word candidate calculated for the respective search conditions, and determines the related words based on the change in each word's degree of relevance. The determined related words are displayed on the display device by the related word display means. In this way, the differences between the search results obtained from the respective search conditions can be easily confirmed through the change in the degree of relevance of the words extracted from those search results.

[0012] The present invention also provides a medium recording a related word presentation program for causing a computer to present words related to a search condition, the program causing the computer to function as: document storage means for storing a plurality of documents; search condition receiving means for receiving a plurality of input search conditions; document search means for obtaining, from the document storage means, the document set that matches each search condition received by the search condition receiving means; relevance calculation means that treats each word present in the document set obtained by the document search means as a related word candidate, obtains a first value, which is the number of documents obtained by the document search means, a second value for each related word candidate, which is the number of documents in the obtained document set that contain that candidate, and a third value for each related word candidate, which is the number of documents stored in the document storage means that contain that candidate, calculates for each related word candidate a fourth value that is the product or the sum of the first value and the third value, and, based on the ratio between the second value and the fourth value, calculates for each search condition the degree of relevance between the search condition received by the search condition receiving means and each related word candidate; related word calculation means that compares the degrees of relevance of each related word candidate calculated by the relevance calculation means for the respective search conditions received by the search condition receiving means, and determines related words based on the change in each candidate's degree of relevance; and related word display means for displaying the related words determined by the related word calculation means on a display device.

[0013] When the related word presentation program recorded on this medium is executed by a computer, the computer implements: document storage means for storing a plurality of documents; search condition receiving means for receiving a plurality of input search conditions; document search means for obtaining, from the document storage means, the document set that matches each search condition received by the search condition receiving means; relevance calculation means that obtains a first value, which is the number of documents obtained by the document search means, a second value for each word, which is the number of documents containing each of the words present in the document set obtained by the document search means, and a third value for each word, which is the number of documents stored in the document storage means that contain that word, calculates for each word a fourth value that is the product or the sum of the first value and the third value, and, based on the ratio between the second value and the fourth value, calculates for each search condition the degree of relevance between the search condition received by the search condition receiving means and each word; related word calculation means that compares the degrees of relevance of each word calculated by the relevance calculation means for the respective search conditions received by the search condition receiving means, and determines related words based on the change in each word's degree of relevance; and related word display means for displaying the related words determined by the related word calculation means on a display device.

[0014]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a diagram showing the principle configuration of the present invention. The document storage means 1 stores a plurality of documents. The search condition receiving means 2 receives a plurality of search conditions "Sd" and "Sl" entered by the user with an input device such as a keyboard. The document search means 3 obtains from the document storage means 1 the document sets "Xd" and "Xl" that match the respective search conditions received by the search condition receiving means 2.

[0015] The relevance calculation means 4 first treats each word present in the document sets obtained by the document search means 3 as a related word candidate. It then obtains a first value, which is the number of documents obtained by the document search means 3, a second value for each related word candidate, which is the number of documents in the obtained document set that contain that candidate, and a third value for each related word candidate, which is the number of documents stored in the document storage means that contain that candidate. For each related word candidate it calculates a fourth value that is the product or the sum of the first value and the third value and, based on the ratio between the second value and the fourth value, calculates for each search condition the degree of relevance between the search condition received by the search condition receiving means 2 and each related word candidate. For example, it calculates the extended mutual information "MI₀(Sd, Wdn)" and "MI₀(Sl, Wln)" described later (where "Wdn" is the identifier of a related word candidate contained in "Xd" and "Wln" is the identifier of a related word candidate contained in "Xl") and uses these values as the degrees of relevance. The larger the extended mutual information, the higher the degree of relevance between the search condition and the related word candidate.

[0016] The related word calculation means 5 compares the degrees of relevance of each related word candidate calculated by the relevance calculation means 4 for the respective search conditions "Sd" and "Sl" received by the search condition receiving means 2, and determines the related words based on the change in each candidate's degree of relevance. For example, a threshold "Tu" for relevance-increased words and a threshold "Ti" for relevance-decreased words are set in advance. Then, for every related word candidate contained in "Wdn" and "Wln", the degree of relevance to one search condition "Sd" is subtracted from the degree of relevance to the other search condition "Sl". If the result of the subtraction is greater than the threshold "Tu", the candidate is treated as a relevance-increased word; if it is smaller than the threshold "Ti", it is treated as a relevance-decreased word. These relevance-increased words and relevance-decreased words are the related words.
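The threshold comparison described in the preceding paragraph can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name `classify_related_words`, the dictionary representation of the relevance scores, and the treatment of a candidate that appears under only one search condition (its missing relevance is taken as 0.0) are all assumptions made for the example.

```python
def classify_related_words(rel_d, rel_l, t_up, t_down):
    """Split candidates into relevance-increased and relevance-decreased words.

    rel_d, rel_l: dicts mapping a candidate word to its relevance under the
    search conditions Sd and Sl (hypothetical representation).
    t_up, t_down: the thresholds called Tu and Ti in the text.
    """
    increased, decreased = [], []
    for word in sorted(set(rel_d) | set(rel_l)):
        # Relevance to Sl minus relevance to Sd; a missing score is
        # treated as 0.0, which is an assumption of this sketch.
        change = rel_l.get(word, 0.0) - rel_d.get(word, 0.0)
        if change > t_up:
            increased.append(word)
        elif change < t_down:
            decreased.append(word)
    return increased, decreased


rel_d = {"magnetic": 0.5, "ball": 1.2, "gas": 0.9}
rel_l = {"magnetic": 2.0, "ball": 0.1, "gas": 1.0}
print(classify_related_words(rel_d, rel_l, t_up=1.0, t_down=-1.0))
```

With these made-up scores, "magnetic" rises by 1.5 and is reported as a relevance-increased word, "ball" falls by about 1.1 and is reported as a relevance-decreased word, and "gas" changes too little to be reported.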

[0017] The related word display means 6 displays the related words determined by the related word calculation means 5 on the display device. With such a related word presentation device, when the user enters the search conditions used before and after narrowing in the course of a search, the degree of relevance of each word to those search conditions is calculated, and suitable related words (for example, words whose degree of relevance changes greatly) are presented to the user according to the change. This makes it possible to learn how the characteristics of the search result set changed when the search expression was modified, without closely reading the document sets obtained as search results. Specifically, when the degree of relevance of a word that matches the search intention has decreased greatly (or when that of a word that does not match the search intention has increased greatly), the user can tell that the modification of the search expression was inappropriate (and hence that the search expression should be revised again so that appropriate search results are obtained), and erroneous narrowing that does not follow the user's search intention can be avoided. Furthermore, by referring to the displayed related words, the user can efficiently revise the search expression used for narrowing.

[0018] The functions of the components of the principle configuration described above can be implemented by having a computer execute a program in which the instructions for each processing function are written. In that case, the program is stored on a computer-readable recording medium. A semiconductor memory device, a magnetic recording device, an optical disc, or the like can be used as the recording medium.

[0019] In the related word calculation means of the present invention, the similarity between a search expression and a word can be calculated by extending mutual information, the Dice coefficient, and the t-score, which are statistics originally used as measures of similarity between words, and that similarity can be used as the degree of relevance. Examples in which mutual information, the Dice coefficient, and the t-score were used to calculate similarity between words include "Haruno, Yamazaki: Bilingual alignment using dictionaries and statistics, IPSJ SIG Natural Language Processing research report, 96-NL-112, pp. 23-30 (1996)" and "Omori, Tsutsumi, Nakanishi: Construction of a bilingual word dictionary using statistical information, Proceedings of the 2nd Annual Meeting of the Association for Natural Language Processing, pp. 49-52 (1996)".

[0020] The following describes how mutual information and the other statistics are extended for application to the present invention. The mutual information (MI) between words word1 and word2 is defined by Equation (1):

[0021]

[Equation 1]

[0022] Here, when the total number of documents to be searched is M, the number of documents containing both word1 and word2 is a, the number of documents containing only word1 is b, and the number of documents containing only word2 is c, the following Equations (2) to (4) hold:

[0023]

[Equation 2]

[0024]

[Equation 3]

[0025]

[Equation 4]

[0026] In the present invention, by contrast, the mutual information (MI₀) between a search expression S and a word "word" is defined by Equation (5):

[0027]

[Equation 5]

[0028] Here, when the total number of documents to be searched is M, the number of documents that contain word and are obtained from the search expression S is a₀, the number of documents obtained from the search expression S that do not contain word is b₀, and the number of documents that contain word, excluding those obtained from the search expression S, is c₀, the following Equations (6) to (8) hold:

[0029]

[Equation 6]

[0030]

[Equation 7]

[0031]

[Equation 8]

[0032] Here, "a₀ + b₀" corresponds to the "first value" in the description of FIG. 1, "a₀" corresponds to the "second value", and "a₀ + c₀" corresponds to the "third value". Equation (5) can therefore be rewritten as Equation (9):

[0033]

[Equation 9]

[0034] This gives a formula whose variables are the total number of documents to be searched M, the "first value", the "second value", and the "third value". Like mutual information, the Dice coefficient and the t-score are statistics for measuring the similarity between words. The Dice coefficient (DC) and the t-score (TS) are defined by Equations (10) and (11):

[0035]

[Equation 10]

[0036]

[Equation 11]

[0037] Like mutual information, these can be extended as in Equations (12) and (13) to calculate the similarity between a search expression and a word:

[0038]

[Equation 12]

[0039]

[Equation 13]

[0040] For each of MI₀(S, word), DC₀(S, word), and TS₀(S, word), a larger value means a higher similarity between the search expression S and the word. Hereinafter, MI₀(S, word) is called the "extended mutual information", DC₀(S, word) the "extended DC", and TS₀(S, word) the "extended TS". As with mutual information, the extended DC and the extended TS can be expressed by Equations (14) and (15), respectively:

[0041]

[Equation 14]

[0042]

[Equation 15]

[0043] As Equation (14) shows, the total number of documents to be searched M is not needed to obtain the extended DC. Next, an embodiment of the related document search device of the present invention is described concretely.
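The equation images are not reproduced in this text, so the exact forms of Equations (1) to (15) cannot be quoted here. As a hedged reconstruction only, the standard pointwise mutual information and Dice coefficient, written with the counts a₀, b₀, c₀ and M defined above, give forms that agree with the surrounding statements (relevance based on the ratio of the second value to the product or sum of the first and third values, and no dependence on M for the extended DC). The logarithm base and the exact layout are assumptions, and the extended t-score is not attempted.

```latex
% Reconstruction from the textual definitions; not copied from the patent's equation images.
% first value = a_0 + b_0, second value = a_0, third value = a_0 + c_0
\begin{align*}
\mathrm{MI}_0(S,\mathit{word}) &= \log \frac{a_0 \, M}{(a_0 + b_0)\,(a_0 + c_0)} \\
\mathrm{DC}_0(S,\mathit{word}) &= \frac{2\,a_0}{(a_0 + b_0) + (a_0 + c_0)}
\end{align*}
```

In the first expression the fourth value is the product of the first and third values; in the second it is their sum.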

[0044] FIG. 2 is a block diagram showing the configuration of an embodiment of the present invention. The document storage means 11 is a storage device that stores the content of each digitized document to be searched, paired with the document identifier assigned by the morphological analysis means 12.

[0045] The morphological analysis means 12 assigns a document identifier to each document stored in the document storage means 11, then performs morphological analysis on each document to extract independent words (words that should serve as keywords) and stores them paired with the corresponding document identifier.

[0046] Based on the results of the morphological analysis by the morphological analysis means 12, the index structure generation means 13 creates, as the index structure, a word-word identifier list 14a, a word identifier-document identifier list 14b, and a document identifier-word identifier list 14c.

[0047] The index structure storage means 14 is a storage device that stores the word-word identifier list 14a, the word identifier-document identifier list 14b, and the document identifier-word identifier list 14c created by the index structure generation means 13.

[0048] The word-word identifier list 14a is a list describing the correspondence between each word character string and the word identifier that identifies that word. The word identifier-document identifier list 14b is a list describing, for each word identifier, the set of document identifiers of the documents that contain the word character string indicated by that word identifier.

[0049] The document identifier-word identifier list 14c is a list describing, for each document identifier, the set of word identifiers of the words contained in the document indicated by that document identifier.
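The three lists 14a, 14b, and 14c can be pictured as three mappings built in one pass over the analyzed documents. The sketch below is an assumption-laden illustration rather than the patent's implementation: it presumes that morphological analysis has already produced a list of words per document, and the names `build_index`, `word_to_id`, `word_docs`, and `doc_words` are hypothetical.

```python
def build_index(docs):
    """Build the three index mappings from already-tokenized documents.

    docs: dict mapping a document identifier to the list of words
    extracted from that document (tokenization is outside this sketch).
    """
    word_to_id = {}  # plays the role of list 14a: word -> word identifier
    word_docs = {}   # plays the role of list 14b: word identifier -> document identifiers
    doc_words = {}   # plays the role of list 14c: document identifier -> word identifiers
    for doc_id, words in docs.items():
        ids = set()
        for word in words:
            # Assign the next free integer to a word seen for the first time.
            word_id = word_to_id.setdefault(word, len(word_to_id))
            ids.add(word_id)
            word_docs.setdefault(word_id, set()).add(doc_id)
        doc_words[doc_id] = ids
    return word_to_id, word_docs, doc_words


docs = {1: ["bearing", "magnetic"], 2: ["bearing", "ball"]}
word_to_id, word_docs, doc_words = build_index(docs)
print(word_to_id, word_docs, doc_words)
```

Storing sets rather than lists means each word is recorded at most once per document, which matches the description of 14b and 14c as sets of identifiers.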

[0050] The search condition receiving means 21 is a user interface that accepts, multiple times, input from an input device such as a keyboard of a search condition (search expression) formed by connecting words with logical OR or logical AND operators.

[0051] The document search means 22 obtains the document identifiers of all documents that match the search condition entered into the search condition receiving means 21 by referring to the word-word identifier list 14a and the word identifier-document identifier list 14b, and saves the obtained set of document identifiers. To the relevance calculation means 25 it passes the number of identifiers in the saved document identifier set, together with the total number of documents containing the word that corresponds to a word identifier supplied by the relevance calculation means 25.
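A lookup of this kind, resolving a search expression of AND and OR operators against lists 14a and 14b, can be sketched as follows. The nested-tuple query representation, the function name `search`, and the sample index data are all assumptions introduced for illustration; the text does not specify how search expressions are represented internally.

```python
def search(query, word_to_id, word_docs):
    """Return the set of document identifiers that match a query.

    query is either a word (str) or a tuple ("and" | "or", operand, ...),
    where each operand is itself a query (hypothetical representation).
    """
    if isinstance(query, str):
        word_id = word_to_id.get(query)          # lookup in list 14a
        return set(word_docs.get(word_id, ()))   # lookup in list 14b
    operator, *operands = query
    results = [search(q, word_to_id, word_docs) for q in operands]
    if operator == "and":
        return set.intersection(*results)
    if operator == "or":
        return set.union(*results)
    raise ValueError(f"unknown operator: {operator!r}")


# Hypothetical index contents for the example.
word_to_id = {"bearing": 0, "magnetic": 1, "ball": 2}
word_docs = {0: {1, 2, 3}, 1: {1}, 2: {2, 3}}

matched = search(("and", "bearing", "magnetic"), word_to_id, word_docs)
print(matched, len(matched))          # matched set and its size (the "first value")
print(len(word_docs[word_to_id["ball"]]))  # documents containing "ball" (a "third value")
```

The size of the matched set and the per-word document count from list 14b are the two numbers that the paragraph above says are passed on for the relevance calculation.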

[0052] The in-document word search means 23 obtains, by referring to the document identifier-word identifier list 14c, the set of identifiers of the words contained in each document of the document set that matches the search condition, as obtained from the document search means 22, and concatenates these sets into a single word identifier set.

[0053] The word occurrence count calculation means 24 calculates the number of occurrences of each word identifier in the word identifier set obtained from the in-document word search means 23, and creates a list of pairs of word identifier and occurrence count.
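Concatenating the per-document word identifier sets and then counting occurrences can be sketched with a counter. Because each document contributes a given word identifier at most once, the resulting count equals the number of matched documents that contain the word. The function name `count_word_occurrences` and the sample data are hypothetical.

```python
from collections import Counter


def count_word_occurrences(matched_doc_ids, doc_words):
    """Count, per word identifier, how many matched documents contain it.

    doc_words: dict mapping a document identifier to the set of word
    identifiers in that document (the role of list 14c).
    """
    counts = Counter()
    for doc_id in matched_doc_ids:
        # Adding a set increments each of its word identifiers by one,
        # which is equivalent to concatenating the sets and then counting.
        counts.update(doc_words[doc_id])
    return counts


doc_words = {1: {0, 1}, 2: {0, 2}, 3: {0, 2}}
counts = count_word_occurrences({2, 3}, doc_words)
print(sorted(counts.items()))
```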

[0054] The relevance calculation means 25 calculates the extended mutual information between the search condition entered into the search condition receiving means 21 and the word corresponding to each word identifier, based on three values: the occurrence frequency of each word identifier calculated by the word occurrence count calculation means 24, the total number of document identifiers matching the search condition obtained from the document search means 22, and the number of documents containing the word corresponding to the word identifier, also obtained from the document search means 22.

[0055] The relevance storage means 26 stores each word and its extended mutual information calculated by the relevance calculation means 25, together with the corresponding search condition. The search condition designating means 27 is a user interface that displays the search conditions stored in the relevance storage means 26 (that is, the search conditions previously input to the search condition receiving means 21) and lets the user designate, from among them, the search condition to be compared with the one most recently input to the search condition receiving means 21.

[0056] The related word calculation means 28 compares the extended mutual information of each word for the search condition most recently input to the search condition receiving means 21 with the extended mutual information of each word for the search condition designated through the search condition designating means 27, and obtains the absolute value of the difference between the two. It then takes as related words those words for which this absolute difference exceeds a preset threshold.

[0057] The related word display means 29 is a user interface that outputs the related words calculated by the related word calculation means 28. The search result display means 30 is a user interface that, by referring to the document storage means 11, outputs the set of documents obtained from the document search means 22 that match the search condition input to the search condition receiving means 21.

[0058] The functions of the components described above are realized by a computer executing predetermined program modules. The computer program that realizes them is recorded on a recording medium such as a semiconductor memory or a magnetic recording medium. The document storage means 11 and the index structure storage means 14, however, are realized by controlling an actual storage device such as an HDD (hard disk drive) with a predetermined program.

[0059] The components of the related word presentation device of FIG. 2 correspond to the components of FIG. 1 as follows. The document storage means 11 and the index structure storage means 14 correspond to the document storage means 1. The search condition receiving means 21 corresponds to the search condition receiving means 2. The document search means 22 corresponds to the document search means 3. The in-document word search means 23, the word occurrence count calculation means 24, and the relevance calculation means 25 correspond to the relevance calculation means 4. The relevance storage means 26 and the related word calculation means 28 correspond to the related word calculation means 5. The related word display means 29 corresponds to the related word display means 6.

[0060] In this embodiment, the index structures must be generated before a related document search is performed. The index structure generation process is therefore described first.

[0061] The index structure generation process presupposes that a morphological analysis result list has been generated. FIG. 3 shows an example of the morphological analysis result list 12a stored in the morphological analysis means 12. The morphological analysis means 12 assigns an identifier to each search target document stored in the document storage means 11, applies morphological analysis to each document to extract its independent words, and stores them paired with the corresponding document identifier. When the same independent word is extracted more than once from the same document, the second and subsequent extractions are ignored, so that the independent words associated with one document identifier contain no duplicates.
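A minimal sketch of how such a list might be built is given below. The function `extract_independent_words` is a hypothetical placeholder that merely splits on whitespace; a real system would call a morphological analyzer, which this text does not specify.

```python
def extract_independent_words(text):
    """HYPOTHETICAL stand-in for a morphological analyzer: split on whitespace."""
    return text.split()


def build_analysis_list(documents):
    """Assign document identifiers 1, 2, ... and keep each word once per document.

    Words are kept in order of first extraction; repeats within the same
    document are ignored, as described for the morphological analysis means 12.
    """
    result = []
    for doc_id, text in enumerate(documents, start=1):
        seen = set()
        words = []
        for word in extract_independent_words(text):
            if word not in seen:
                seen.add(word)
                words.append(word)
        result.append((doc_id, words))
    return result


analysis = build_analysis_list(["quake damage quake", "damage report"])
```

The resulting list of (document identifier, words) pairs plays the role of the morphological analysis result list 12a in this example.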

[0062] Based on this morphological analysis result list 12a, the index structure generation means 13 generates the index structures. FIGS. 4 to 6 show examples of the index structures created by the index structure generation means 13 and stored in the index structure storage means 14. The data in FIGS. 4 to 6 are examples created from the data of FIG. 3.

[0063] FIG. 4 shows an example of the word-word identifier list. The word-word identifier list 14a stores each extracted word paired with the identifier assigned to that word.

[0064] FIG. 5 shows an example of the word identifier-document identifier list. The word identifier-document identifier list 14b stores each word identifier paired with the identifiers (document identifiers) of the documents that contain the word to which that word identifier is assigned.

[0065] FIG. 6 shows an example of the document identifier-word identifier list. The document identifier-word identifier list 14c stores each document identifier paired with the word identifiers of the words contained in the document to which that document identifier is assigned.

[0066] The index structure generation means 13 generates the index structures by the following algorithm. FIG. 7 is a flowchart of the index structure generation algorithm.
[S1] Generation of the word-word identifier list 14a. Create a list of all words in the morphological analysis result list stored in the morphological analysis means 12, without duplicates and sorted in order of the value of the word character string. Assign natural numbers starting from 1 to the words, in order from the head of the list, as word identifiers.
[S2] Generation of the document identifier-word identifier list 14c. Replace each word in the morphological analysis result list stored in the morphological analysis means 12 with the word identifier assigned in step S1, and for each document identifier sort the corresponding word identifiers in ascending order.
[S3] Generation of the word identifier-document identifier list 14b. Arrange the word identifiers in order from 1. For each word identifier, extract the document identifiers of the documents containing the corresponding word by referring to the document identifier-word identifier list 14c created in step S2, and store them paired with the word identifier.
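Steps S1 to S3 can be sketched as follows. The input format (a list of document identifier and word list pairs) and the use of Python dictionaries for the three lists are assumptions of this example; the patent does not prescribe a storage layout.

```python
def build_indexes(analysis):
    """Build L1 (14a), L2 (14b) and L3 (14c) from (document id, unique words) pairs."""
    # S1: unique words sorted by string value, numbered from 1.
    vocabulary = sorted({word for _, words in analysis for word in words})
    word_to_id = {word: i for i, word in enumerate(vocabulary, start=1)}

    # S2: replace words by identifiers and sort them per document.
    doc_to_words = {
        doc_id: sorted(word_to_id[word] for word in words)
        for doc_id, words in analysis
    }

    # S3: for each word identifier in order from 1, collect the documents
    # that contain it, using the list produced in S2.
    word_to_docs = {word_id: [] for word_id in range(1, len(vocabulary) + 1)}
    for doc_id in sorted(doc_to_words):
        for word_id in doc_to_words[doc_id]:
            word_to_docs[word_id].append(doc_id)

    return word_to_id, word_to_docs, doc_to_words


analysis = [(1, ["quake", "damage"]), (2, ["damage", "report"])]
L1, L2, L3 = build_indexes(analysis)
```

Here `L1`, `L2` and `L3` follow the shorthand the description later uses for lists 14a, 14b and 14c.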

[0067] The index structures are generated by the above algorithm. Once they have been generated, a search expression can be input to the search condition receiving means 21. When the user enters a desired search expression with an input device such as a keyboard and issues a search start command, the related word presentation process begins.

[0068] FIG. 8 is a flowchart showing the algorithm for obtaining relevance from the search expression input to the search condition receiving means 21. The steps of FIG. 8 are described below. In the following description, the word-word identifier list 14a is denoted L1, the word identifier-document identifier list 14b is denoted L2, and the document identifier-word identifier list 14c is denoted L3.
[S11] The search condition receiving means 21 receives a search expression in which words are joined by logical AND or logical OR operators. This search expression is called S.
[S12] The document search means 22 obtains the document identifiers of the documents matching S by referring to L1 and L2. The resulting set of document identifiers is called X, and the number of elements of X is N.
[S13] If N = 0 in step S12, the process proceeds to step S14; otherwise it proceeds to step S15.
[S14] The document search means 22 concludes that there are no documents related to S, and the process ends.
[S15] The in-document word search means 23 obtains all word identifiers corresponding to each document identifier in X by referring to L3. The collection of word identifiers obtained is called Y.
[S16] The word occurrence count calculation means 24 removes duplicate word identifiers from Y and records how many times each word identifier occurred. The deduplicated set of word identifiers is newly called Y, and the number of occurrences of each element Wn (n = 1, 2, ..., P) of Y is R(Wn), where P is the number of elements of Y.
[S17] For every word identifier Wn (n = 1, 2, ..., P) in Y, the document search means 22 obtains from L2 the total number of document identifiers corresponding to Wn. This number is F(Wn).
[S18] For every word identifier Wn (n = 1, 2, ..., P) in Y, the relevance calculation means 25, taking M as the total number of search target documents, calculates

[0069]

[Equation 16]

[0070]

[Equation 17] prob(Wn) = F(Wn) / M ... (17)

and stores these values in a list paired with Wn. It also calculates

[0071]

[Equation 18] prob(S) = N / M ... (18)

[S19] For each word identifier Wn (n = 1, 2, ..., P) in Y, the relevance calculation means 25 calculates the extended mutual information MI₀(S, Wn) according to equation (5), stores the resulting value in the relevance storage means 26 as the relevance, and the process ends. Words for which MI₀(S, Wn) is negative are not stored in the relevance storage means 26.
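The flow of steps S13 to S19 can be sketched as below. Equations (17) and (18) are taken from the text, but equation (5) and equation (16) are not reproduced in this section, so the score used here is ordinary pointwise mutual information with an assumed joint probability R(Wn)/M. It is a stand-in chosen for illustration, not the patent's extended mutual information. The sample lists `L2` and `L3` are likewise invented.

```python
import math
from collections import Counter


def relevance_table(X, L2, L3, M):
    """Follow steps S13-S19 for a matching document set X; return {Wn: score}."""
    N = len(X)
    if N == 0:                       # S13-S14: no matching documents
        return {}
    # S15-S16: concatenate word identifiers, then count occurrences R(Wn).
    R = Counter(word_id for doc_id in X for word_id in L3[doc_id])
    prob_S = N / M                   # equation (18)
    stored = {}
    for word_id, r in R.items():
        F = len(L2[word_id])         # S17: documents containing Wn
        prob_w = F / M               # equation (17)
        # ASSUMPTION: stand-in score (pointwise mutual information with
        # joint probability r / M), NOT equation (5) of the source.
        score = math.log((r / M) / (prob_S * prob_w))
        if score >= 0:               # S19: negative values are not stored
            stored[word_id] = score
    return stored


# Invented sample data: four documents, three word identifiers.
L3 = {1: [1, 2], 2: [1, 3], 3: [2, 3], 4: [3]}
L2 = {1: [1, 2], 2: [1, 3], 3: [2, 3, 4]}
stored = relevance_table({1, 2}, L2, L3, M=4)
```

In this toy run, word 3 appears in most documents regardless of the query, so its stand-in score is negative and it is left out, which mirrors the filtering rule at the end of step S19.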

[0072] Through the above steps, the relevance of each word contained in the documents retrieved by an input search expression can be obtained for that expression. The relevance storage means 26 thus holds, for each search expression, a set of three items: the search expression, the list of words appearing in the document set retrieved by the search expression, and the list of relevance values (extended mutual information) corresponding to those words.
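One way to picture the three-item record is the sketch below. The class names `RelevanceRecord` and `RelevanceStore` and their methods are hypothetical; the source only states which three items are kept per search expression.

```python
from dataclasses import dataclass, field


@dataclass
class RelevanceRecord:
    """The three items kept for one search expression."""
    expression: str
    words: list = field(default_factory=list)
    relevances: list = field(default_factory=list)


class RelevanceStore:
    """Hypothetical in-memory stand-in for the relevance storage means 26."""

    def __init__(self):
        self._records = {}                       # keeps insertion order

    def save(self, expression, word_scores):
        words = list(word_scores)
        self._records.pop(expression, None)      # re-entering moves it to the end
        self._records[expression] = RelevanceRecord(
            expression, words, [word_scores[w] for w in words]
        )

    def history(self):
        """Search expressions in the order they were entered."""
        return list(self._records)

    def lookup(self, expression):
        record = self._records[expression]
        return dict(zip(record.words, record.relevances))


store = RelevanceStore()
store.save("quake or tremor", {"damage": 1.2, "isolation": 0.8})
store.save("(quake or tremor) and resistance", {"furniture": 0.9})
```

The `history` method corresponds to what a designation interface would list, and the last entry is the most recently input expression.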

[0073] With information for several search expressions stored in the relevance storage means 26, the user designates, with the search condition designating means 27, the search expression against which the search results are to be compared. The related word calculation means 28 then performs the related word calculation. The related word calculation algorithm executed by the related word calculation means 28 is as follows.

[0074] FIG. 9 is a flowchart showing the related word calculation algorithm. All of the processing shown in this flowchart is executed by the related word calculation means 28. A related word here is a word that characteristically indicates the difference between the search results obtained from two search expressions: the one designated by the search condition designating means 27 and the one most recently input to the search condition receiving means 21.
[S21] From the relevance storage means 26, obtain the words (Wd1, Wd2, ..., Wdn) corresponding to the search expression (Sd) designated by the search condition designating means 27 together with their relevance values (MI₀(Sd, Wd1), MI₀(Sd, Wd2), ..., MI₀(Sd, Wdn)), and the words (Wl1, Wl2, ..., Wlm) corresponding to the search expression (Sl) most recently input to the search condition receiving means 21 together with their relevance values (MI₀(Sl, Wl1), MI₀(Sl, Wl2), ..., MI₀(Sl, Wlm)).
[S22] For each word Wdi (1 ≤ i ≤ n) corresponding to the search expression Sd, obtain the relevance difference

[0075]

[Equation 19]

If the value obtained is greater than a preset threshold Tu, add Wdi, together with its relevance difference, to the relevance-increased word list. If the value obtained is smaller than a threshold Tl (which is smaller than Tu), add Wdi, together with its relevance difference, to the relevance-decreased word list. If Wdi is not among (Wl1, Wl2, ..., Wlm), MI₀(Sl, Wdi) = 0 is used.
[S23] For each word Wlj (1 ≤ j ≤ m) corresponding to the search expression Sl, obtain the relevance difference

[0076]

[Equation 20]

If the value obtained is greater than the preset threshold Tu, add Wlj, together with its relevance difference, to the relevance-increased word list. If the value obtained is smaller than the threshold Tl, add Wlj, together with its relevance difference, to the relevance-decreased word list. If Wlj is not among (Wd1, Wd2, ..., Wdn), MI₀(Sd, Wlj) = 0 is used. A word Wlj that is already in a list is not added again.
[S24] Sort the words in the relevance-increased word list in descending order of relevance difference, and sort the words in the relevance-decreased word list in ascending order of relevance difference.
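Steps S21 to S24 can be sketched as follows. Equations (19) and (20) are not reproduced in this text, so the sketch assumes that the relevance difference is the relevance under the latest expression Sl minus the relevance under the designated expression Sd; that sign convention is an inference from the names of the two lists, not a quotation. The function name and sample values are invented.

```python
def related_words(designated, latest, t_upper, t_lower):
    """designated, latest: {word: relevance} for Sd and Sl respectively.

    ASSUMPTION: relevance difference = latest - designated, with a missing
    word counted as 0 (as stated in steps S22 and S23).
    """
    increased, decreased = {}, {}
    for word in list(designated) + list(latest):        # S22, then S23
        if word in increased or word in decreased:
            continue                                    # never add a word twice
        diff = latest.get(word, 0.0) - designated.get(word, 0.0)
        if diff > t_upper:
            increased[word] = diff
        elif diff < t_lower:
            decreased[word] = diff
    # S24: largest differences first for increases, smallest first for decreases.
    increased_list = sorted(increased.items(), key=lambda kv: kv[1], reverse=True)
    decreased_list = sorted(decreased.items(), key=lambda kv: kv[1])
    return increased_list, decreased_list


designated = {"seismic isolation": 2.0, "damage": 1.0}
latest = {"damage": 1.1, "furniture": 1.5, "bookshelf": 2.5}
inc, dec = related_words(designated, latest, t_upper=0.5, t_lower=-0.5)
```

With these invented values, "damage" changes too little to appear in either list, while the words present under only one expression dominate the two lists.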

[0077] The related word calculation is performed as described above, producing a relevance-increased word list sorted in descending order of relevance difference and a relevance-decreased word list sorted in ascending order of relevance difference. The related word display means 29 then displays each list on the screen of the display device.

[0078] FIGS. 10, 11, 12 and 13 show the user interface of this embodiment. FIG. 10 shows the related word search screen. This related word search screen 40 is divided into four sub-windows 41 to 44.

[0079] The sub-window 41 is a window for entering a search expression, and has a text input field 41a and a search button 41b. Using an input device such as a keyboard, the user can enter a search expression in the text input field 41a and press the search button 41b to issue a search command.

[0080] The sub-window 42 is a window for displaying the search expression history. It shows the search expressions previously entered in the text input field 41a, in the order in which they were entered. By designating one of the search expressions displayed in the sub-window 42, the user selects the search expression to be compared with the most recently entered one.

[0081] The sub-window 43 is a window for displaying related words. It has a relevance-decreased word display field 43a and a relevance-increased word display field 43b. The relevance-decreased word display field 43a shows the relevance-decreased word list obtained in step S24. The relevance-increased word display field 43b shows the relevance-increased word list obtained in step S24.

[0082] The sub-window 44 is a window for displaying search results. It shows information on the documents that match the search expression entered in the text input field 41a.

[0083] In FIG. 10, the text input field 41a is a user interface provided by the search condition receiving means 21, the sub-window 42 is a user interface provided by the search condition designating means 27, the sub-window 43 is a user interface provided by the related word display means 29, and the sub-window 44 is a user interface provided by the search result display means 30.

[0084] Suppose, for example, that the user's search intention is "I want to know about earthquake-resistant structures that are effective in an earthquake." In this case, the user designates the search expression "earthquake (地震) or earthquake disaster (震災) or tremor (震動)" from the sub-window 42. The user then joins "seismic resistance" (耐震) to this search expression with a logical AND operator to form a new search expression.

[0085] FIG. 11 shows the display screen when "(earthquake or earthquake disaster or tremor) and seismic resistance" has been entered as the search expression. The search expression "earthquake or earthquake disaster or tremor" selected in the sub-window 42 is highlighted. When the search button 41b is pressed, the processing of steps S11 to S19 (shown in FIG. 8) and steps S21 to S24 (shown in FIG. 9) is executed, and the related words obtained are displayed in the sub-window 43.

[0086] FIG. 12 shows the related words displayed when "(earthquake or earthquake disaster or tremor) and seismic resistance" has been entered as the search expression. As shown, the relevance-decreased word display field 43a and the relevance-increased word display field 43b of the sub-window 43 show the contents of the relevance-decreased word list and of the relevance-increased word list, respectively.

[0087] By referring to the related words displayed in this way, the user can discover the following two problems that result from changing the search expression "earthquake or earthquake disaster or tremor" to "(earthquake or earthquake disaster or tremor) and seismic resistance".
(1) Information about words similar to "seismic resistance", such as "seismic isolation" (免震) and "vibration control" (制震), has been dropped from the results.
(2) The results include many documents about earthquake-proofing methods for things that are not buildings, such as "furniture" (家具), "bookshelf" (本棚) and "chest of drawers" (箪笥).

[0088] To remedy these problems, the user changes the search expression to a new one such as "(earthquake or earthquake disaster or tremor) and (seismic resistance or seismic isolation or vibration control) and (not (furniture or bookshelf or chest of drawers))".

[0089] FIG. 13 shows the related words displayed when "(earthquake or earthquake disaster or tremor) and (seismic resistance or seismic isolation or vibration control) and (not (furniture or bookshelf or chest of drawers))" has been entered as the search expression. In the sub-window 42, the search expression "earthquake or earthquake disaster or tremor" remains selected. Compared with the case where "(earthquake or earthquake disaster or tremor) and seismic resistance" was entered, entering this search expression removes "seismic isolation" and "vibration control" from the relevance-decreased words and removes "furniture", "bookshelf" and "chest of drawers" from the relevance-increased word list.

[0090] In this way, knowing which words have decreased and which have increased in relevance makes it possible to quickly create a search expression that matches the search intention. In the embodiment described above, the relevance of each word is obtained for each of the two search conditions, and the related words that characterize the change in the search result set are then determined from the relevance differences. Alternatively, related words can be determined by taking the difference between the document sets obtained from the two search conditions and obtaining the relevance of the words contained in only one of the document sets. Calculating relevance after taking the difference between the document sets yields the same effect as the embodiment described above.

[0091] Such an embodiment is described below. Because this embodiment can be realized by a system with the same configuration as the embodiment shown in FIG. 2, it is described using the reference numerals of the components in FIG. 2. However, the search condition designated by the search condition designating means 27 is passed to the document search means 22 rather than to the relevance storage means 26.

[0092] FIG. 14 is a flowchart showing the procedure for determining related words based on the difference between document sets.
[S31] The search condition receiving means 21 receives a search expression in which words are joined by logical AND or logical OR operators.
[S32] The document search means 22 obtains, by referring to L1 (the word-word identifier list) and L2 (the word identifier-document identifier list), the document identifiers of the documents matching the search condition designated by the search condition designating means 27. The resulting set of document identifiers is A. The document search means 22 likewise obtains, by referring to L1 and L2, the document identifiers of the documents matching the search condition received by the search condition receiving means 21. The resulting set of document identifiers is B.
[S33] Taking the set of document identifiers corresponding to the document set "A and (not B)" as X in step S12 of FIG. 8, the document search means 22, the in-document word search means 23, the word occurrence count calculation means 24 and the relevance calculation means 25 execute the same processing as steps S12 to S19 (shown in FIG. 8). The relevance of each word contained in the document set "A and (not B)" is thereby calculated and stored in the relevance storage means 26.
[S34] The related word calculation means 28 adds the words whose relevance obtained in step S33 is greater than a preset threshold T to the relevance-decreased word list, and sorts them in descending order of relevance.
[S35] Taking the set of document identifiers corresponding to the document set "B and (not A)" as X in step S12 of FIG. 8, the document search means 22, the in-document word search means 23, the word occurrence count calculation means 24 and the relevance calculation means 25 execute the same processing as steps S12 to S19 (shown in FIG. 8). The relevance of each word contained in the document set "B and (not A)" is thereby calculated and stored in the relevance storage means 26.
[S36] The related word calculation means 28 adds the words whose relevance obtained in step S35 is greater than the preset threshold T to the relevance-increased word list, and sorts them in descending order of relevance.
[S37] The related word display means 29 displays, on the screen of the display device, the contents of the relevance-decreased word list and the relevance-increased word list sorted by the related word calculation means 28.
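The set-difference variant of steps S32 to S36 can be sketched as below. The relevance calculation itself is passed in as a function, because equation (5) is not reproduced in this section; `toy_score` is a hypothetical scorer invented purely so the example runs, and the sample documents are likewise invented.

```python
def rank_above(scores, threshold):
    """Keep words scoring above the threshold, highest relevance first."""
    kept = [(word, score) for word, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


def difference_related_words(A, B, score_words, threshold):
    """A: documents for the designated expression, B: for the latest expression."""
    decreased = rank_above(score_words(A - B), threshold)   # S33-S34: A and (not B)
    increased = rank_above(score_words(B - A), threshold)   # S35-S36: B and (not A)
    return increased, decreased


# Invented sample data: document identifier -> words.
DOCS = {
    1: ["isolation", "quake"],
    2: ["quake", "furniture"],
    3: ["quake", "furniture", "bookshelf"],
}


def toy_score(doc_ids):
    """HYPOTHETICAL scorer: share of a word's documents that lie in doc_ids."""
    totals, inside = {}, {}
    for doc_id, words in DOCS.items():
        for word in words:
            totals[word] = totals.get(word, 0) + 1
            if doc_id in doc_ids:
                inside[word] = inside.get(word, 0) + 1
    return {word: inside[word] / totals[word] for word in inside}


increased, decreased = difference_related_words({1, 2}, {2, 3}, toy_score, 0.4)
```

Passing the scorer as a parameter keeps the sketch independent of how relevance is actually computed, so the same skeleton applies once steps S12 to S19 are implemented.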

[0093] In this way, relevance-increased words and relevance-decreased words can be presented to the user. In this embodiment, the related words are presented from among the words that dropped out of the search results when the search expression was changed and the words that newly appeared.

[0094]

[Effects of the Invention] As described above, the related word presentation device of the present invention calculates the relevance of each related word candidate for a plurality of search conditions, determines related words based on the change in relevance, and presents those related words to the user. The user can therefore learn how the characteristics of the search result set changed when the search condition was changed, without closely reading the document set obtained as the search result.

[0095] Further, with the medium on which the related word presentation program of the present invention is recorded, a computer that executes the recorded program can calculate the relevance of each related word candidate for a plurality of search conditions, determine related words based on the change in relevance, and present those related words to the user. Using the related word presentation program recorded on this medium therefore makes it possible to have a computer present related words from which the user can easily learn how the characteristics of the search result set changed when the search condition was changed.

[Brief description of drawings]

FIG. 1 is a diagram showing the principle configuration of the present invention.

FIG. 2 is a block diagram showing the configuration of an embodiment of the present invention.

FIG. 3 is a diagram showing an example of the morphological analysis result list stored in the morphological analysis means.

FIG. 4 is a diagram showing an example of a word / word identifier list.

FIG. 5 is a diagram showing an example of a word identifier / document identifier list.

FIG. 6 is a diagram showing an example of a document identifier / word identifier list.

FIG. 7 is a flowchart showing the algorithm for generating the index structure.

FIG. 8 is a flowchart showing the algorithm for obtaining the degree of relevance from a search expression input to the search expression receiving means.

FIG. 9 is a flowchart showing the related word calculation algorithm.

FIG. 10 is a diagram showing the related word search screen.

FIG. 11 is a diagram showing the display screen when "(地震 or 震災 or 震動) and 耐震" ("(earthquake or earthquake disaster or tremor) and earthquake resistance") is entered as the search expression.

FIG. 12 is a diagram showing the related words displayed when "(地震 or 震災 or 震動) and 耐震" ("(earthquake or earthquake disaster or tremor) and earthquake resistance") is entered as the search expression.

FIG. 13 is a diagram showing the related words displayed when "(地震 or 震災 or 震動) and (耐震 or 免震 or 制震) and (not (家具 or 本棚 or 箪笥))" ("(earthquake or earthquake disaster or tremor) and (earthquake resistance or seismic isolation or vibration control) and (not (furniture or bookshelf or chest of drawers))") is entered as the search expression.

FIG. 14 is a flowchart showing the procedure for determining related words based on the difference between document sets.

[Explanation of symbols]

1 Document storage means
2 Search condition receiving means
3 Document search means
4 Relevance calculation means
5 Related word calculation means
6 Related word display means

Continuation of the front page
(56) References
JP-A-5-81327 (JP, A)
Masahiko Haruno and one other, "Bilingual alignment using dictionaries and statistics," IPSJ SIG Natural Language Processing research report, March 14, 1996, 96-NL-112, pp. 23-30
Mihoko Kitamura and one other, "Automatic extraction of translation expressions using a bilingual corpus," Transactions of the Information Processing Society of Japan, April 15, 1997, Vol. 38, No. 4, pp. 727-735
(58) Fields searched (Int. Cl.7, DB name): G06F 17/30

Claims

(57) [Claims]

1. A related word presentation device for presenting words related to a search condition, comprising: document storage means for storing a plurality of documents; search condition receiving means for receiving a plurality of input search conditions; document search means for acquiring, from the document storage means, a document set matching each search condition received by the search condition receiving means; relevance calculation means for taking each word present in the document set acquired by the document search means as a related word candidate, acquiring a first value that is the number of documents acquired by the document search means, a second value for each related word candidate that is the number of documents in the document set acquired by the document search means that contain that related word candidate, and a third value for each related word candidate that is the number of documents stored in the document storage means that contain that related word candidate, calculating for each related word candidate a fourth value that is the product or sum of the first value and the third value, and calculating, for each search condition, the degree of relevance between the search condition received by the search condition receiving means and each related word candidate based on the ratio between the second value and the fourth value; related word calculation means for comparing the degrees of relevance of each related word candidate calculated by the relevance calculation means for the respective search conditions received by the search condition receiving means, and determining related words based on the change in the value of the degree of relevance of each related word candidate; and related word display means for displaying the related words determined by the related word calculation means on a display device.
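
The three counts recited in claim 1 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the stored documents are held as a mapping from a document identifier to the set of words in that document and that a search has already produced the set of matching document identifiers; the function name `candidate_counts` and this data layout are hypothetical and are not taken from the specification.

```python
from typing import Dict, Set, Tuple


def candidate_counts(
    corpus: Dict[str, Set[str]], hits: Set[str]
) -> Tuple[int, Dict[str, int], Dict[str, int]]:
    """Return (alpha, beta, gamma) for one search condition.

    alpha: number of retrieved documents (first value)
    beta:  per candidate, retrieved documents containing it (second value)
    gamma: per candidate, stored documents containing it (third value)
    """
    alpha = len(hits)
    beta: Dict[str, int] = {}
    doc_freq: Dict[str, int] = {}
    for doc_id, words in corpus.items():
        for word in words:
            doc_freq[word] = doc_freq.get(word, 0) + 1
            if doc_id in hits:
                beta[word] = beta.get(word, 0) + 1
    # Only words that occur in the retrieved set are related word candidates.
    gamma = {word: doc_freq[word] for word in beta}
    return alpha, beta, gamma


# Hypothetical three-document store and a search that matched d1 and d2.
corpus = {
    "d1": {"earthquake", "damage"},
    "d2": {"earthquake", "furniture"},
    "d3": {"damage", "weather"},
}
alpha, beta, gamma = candidate_counts(corpus, {"d1", "d2"})
```

In this toy store, "weather" is not a candidate because it does not occur in any retrieved document, while "damage" has a second value of 1 and a third value of 2.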

2. The related word presentation device according to claim 1, wherein the relevance calculation means, where M is the number of all documents stored in the document storage means, α is the first value, β is the second value for each related word candidate, and γ is the third value for each related word candidate, uses as the degree of relevance between the search condition received by the search condition receiving means and each related word candidate the value of the extended mutual information obtained by the following formula: extended mutual information = log₂{(Mβ)/(αγ)}.

3. The related word presentation device according to claim 1, wherein the relevance calculation means, where M is the number of all documents stored in the document storage means, α is the first value, β is the second value for each related word candidate, and γ is the third value for each related word candidate, uses as the degree of relevance between the search condition received by the search condition receiving means and each related word candidate the value of the extended TS obtained by the following formula: extended TS (t-score) = M{(Mβ − αγ)/(αγ)}.

4. The related word presentation device according to claim 1, wherein the relevance calculation means, where α is the first value, β is the second value for each related word candidate, and γ is the third value for each related word candidate, uses as the degree of relevance between the search condition received by the search condition receiving means and each related word candidate the value of the extended DC obtained by the following formula: extended DC (Dice coefficient) = 2β/(α + γ).
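
As a rough illustration of the three measures recited in claims 2 to 4, the sketch below evaluates each formula for one related word candidate. The function names are hypothetical, the extended TS formula is transcribed exactly as it appears in the claim text here, and the sketch assumes α, β, and γ are all positive (which holds for a candidate taken from a non-empty retrieved set).

```python
import math


def extended_mi(M: int, alpha: int, beta: int, gamma: int) -> float:
    """Extended mutual information: log2((M * beta) / (alpha * gamma))."""
    return math.log2((M * beta) / (alpha * gamma))


def extended_ts(M: int, alpha: int, beta: int, gamma: int) -> float:
    """Extended TS as printed in the claim: M * ((M*beta - alpha*gamma) / (alpha*gamma))."""
    return M * ((M * beta - alpha * gamma) / (alpha * gamma))


def extended_dc(alpha: int, beta: int, gamma: int) -> float:
    """Extended Dice coefficient: 2 * beta / (alpha + gamma)."""
    return 2 * beta / (alpha + gamma)


# Hypothetical counts: 1000 stored documents, 50 retrieved,
# 20 of the retrieved documents and 100 of all documents contain the word.
M, alpha, beta, gamma = 1000, 50, 20, 100
mi = extended_mi(M, alpha, beta, gamma)
ts = extended_ts(M, alpha, beta, gamma)
dc = extended_dc(alpha, beta, gamma)
```

The first two measures use the product αγ as the fourth value and the third uses the sum α + γ, which matches the "product or sum" wording of claim 1.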

5. The related word presentation device according to claim 1, wherein the related word calculation means, when determining related words based on two search conditions before and after narrowing down a document search, takes as a relevance-increased word a related word candidate whose degree of relevance to the search condition after narrowing down is larger than its degree of relevance to the search condition before narrowing down, takes as a relevance-decreased word a related word candidate whose degree of relevance to the search condition after narrowing down is smaller than its degree of relevance to the search condition before narrowing down, and takes the relevance-increased words and the relevance-decreased words as the related words.
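
The comparison described in claim 5 might be sketched as follows. The sketch assumes the degree of relevance has already been computed for each candidate under both search conditions and is held in two dictionaries. Comparing only the words that have a score under both conditions is an assumption made here for simplicity; the claim does not state how a candidate present under only one condition is treated. The function name `classify_changes` is hypothetical.

```python
from typing import Dict, List, Tuple


def classify_changes(
    before: Dict[str, float], after: Dict[str, float]
) -> Tuple[List[str], List[str]]:
    """Split candidates into relevance-increased and relevance-decreased words."""
    increased: List[str] = []
    decreased: List[str] = []
    # Assumption: only candidates scored under both conditions are compared.
    for word in sorted(set(before) & set(after)):
        if after[word] > before[word]:
            increased.append(word)
        elif after[word] < before[word]:
            decreased.append(word)
    return increased, decreased


# Hypothetical scores before and after narrowing down the search.
before = {"furniture": 1.5, "damage": 0.5, "building": 1.0}
after = {"furniture": 0.2, "damage": 1.2, "building": 1.0, "isolation": 2.0}
increased, decreased = classify_changes(before, after)
```

A candidate whose score is unchanged, such as "building" in this example, falls into neither group.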

6. A related word presentation device for presenting words related to a search condition, comprising: document storage means for storing a plurality of documents; search condition receiving means for receiving a pair of search conditions before and after narrowing down a document search; document search means for acquiring, from the document storage means, a document set that satisfies one of the search conditions received by the search condition receiving means and does not satisfy the other search condition; relevance calculation means for taking each word present in the document set acquired by the document search means as a related word candidate, acquiring a first value that is the number of documents acquired by the document search means, a second value for each related word candidate that is the number of documents in the document set acquired by the document search means that contain that related word candidate, and a third value for each related word candidate that is the number of documents stored in the document storage means that contain that related word candidate, calculating for each related word candidate a fourth value that is the product or sum of the first value and the third value, and calculating, based on the ratio between the second value and the fourth value, the degree of relevance between the pair of search conditions received by the search condition receiving means and each related word candidate; related word calculation means for taking as related words the related word candidates whose degree of relevance obtained from the relevance calculation means is equal to or greater than a certain value; and related word display means for displaying the related words determined by the related word calculation means on a display device.

7. The related word presentation device according to claim 6, wherein the related word calculation means takes as a relevance-decreased word a related word candidate whose degree of relevance, obtained based on the document set that satisfies the search condition before narrowing down and does not satisfy the search condition after narrowing down, is equal to or greater than a predetermined value, takes as a relevance-increased word a related word candidate whose degree of relevance, obtained based on the document set that does not satisfy the search condition before narrowing down and satisfies the search condition after narrowing down, is equal to or greater than a predetermined value, and takes the relevance-increased words and the relevance-decreased words as the related words.
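
The difference-based variant of claims 6 and 7 can be sketched with ordinary set operations. The sketch assumes each search condition has already been evaluated into a set of document identifiers and that a degree of relevance has been computed for the candidates of each difference set by some measure; the function names and the threshold value are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def difference_sets(
    before_hits: Set[str], after_hits: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (dropped, added) document sets for a pair of search conditions."""
    dropped = before_hits - after_hits  # satisfies "before" only
    added = after_hits - before_hits    # satisfies "after" only
    return dropped, added


def words_over_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep candidates whose degree of relevance is at least the threshold."""
    return sorted(word for word, score in scores.items() if score >= threshold)


# Hypothetical result sets before and after changing the search condition.
dropped, added = difference_sets({"d1", "d2", "d3"}, {"d2", "d3", "d4"})

# Hypothetical relevance scores for candidates found in the dropped set;
# words at or above the threshold would be relevance-decreased words.
decreased = words_over_threshold({"furniture": 2.1, "shelf": 0.4}, threshold=1.0)
```

Scoring the dropped set yields the relevance-decreased words and scoring the added set yields the relevance-increased words, so the two sets are processed the same way.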

8. The related word presentation device according to claim 6, wherein the relevance calculation means, where M is the number of all documents stored in the document storage means, α is the first value, β is the second value for each related word candidate, and γ is the third value for each related word candidate, uses as the degree of relevance between the search condition received by the search condition receiving means and each related word candidate the value of the extended mutual information obtained by the following formula: extended mutual information = log₂{(Mβ)/(αγ)}.

9. The related word presentation device according to claim 6, wherein the relevance calculation means, where M is the number of all documents stored in the document storage means, α is the first value, β is the second value for each related word candidate, and γ is the third value for each related word candidate, uses as the degree of relevance between the search condition received by the search condition receiving means and each related word candidate the value of the extended TS obtained by the following formula: extended TS (t-score) = M{(Mβ − αγ)/(αγ)}.

10. The related word presentation device according to claim 6, wherein the relevance calculation means, where α is the first value, β is the second value for each related word candidate, and γ is the third value for each related word candidate, uses as the degree of relevance between the search condition received by the search condition receiving means and each related word candidate the value of the extended DC obtained by the following formula: extended DC (Dice coefficient) = 2β/(α + γ).

11. A medium recording a related word presentation program for causing a computer to present words related to a search condition, the program causing the computer to function as: document storage means for storing a plurality of documents; search condition receiving means for receiving a plurality of input search conditions; document search means for acquiring, from the document storage means, a document set matching each search condition received by the search condition receiving means; relevance calculation means for taking each word present in the document set acquired by the document search means as a related word candidate, acquiring a first value that is the number of documents acquired by the document search means, a second value for each related word candidate that is the number of documents in the document set acquired by the document search means that contain that related word candidate, and a third value for each related word candidate that is the number of documents stored in the document storage means that contain that related word candidate, calculating for each related word candidate a fourth value that is the product or sum of the first value and the third value, and calculating, for each search condition, the degree of relevance between the search condition received by the search condition receiving means and each related word candidate based on the ratio between the second value and the fourth value; related word calculation means for comparing the degrees of relevance of each related word candidate calculated by the relevance calculation means for the respective search conditions received by the search condition receiving means, and determining related words based on the change in the value of the degree of relevance of each related word candidate; and related word display means for displaying the related words determined by the related word calculation means on a display device.

12. A medium recording a related word presentation program for causing a computer to present words related to a search condition, the program causing the computer to function as: document storage means for storing a plurality of documents; search condition receiving means for receiving a pair of search conditions before and after narrowing down a document search; document search means for acquiring, from the document storage means, a document set that satisfies one of the search conditions received by the search condition receiving means and does not satisfy the other search condition; relevance calculation means for taking each word present in the document set acquired by the document search means as a related word candidate, acquiring a first value that is the number of documents acquired by the document search means, a second value for each related word candidate that is the number of documents in the document set acquired by the document search means that contain that related word candidate, and a third value for each related word candidate that is the number of documents stored in the document storage means that contain that related word candidate, calculating for each related word candidate a fourth value that is the product or sum of the first value and the third value, and calculating, based on the ratio between the second value and the fourth value, the degree of relevance between the pair of search conditions received by the search condition receiving means and each related word candidate; related word calculation means for taking as related words the related word candidates whose degree of relevance obtained from the relevance calculation means is equal to or greater than a certain value; and related word display means for displaying the related words determined by the related word calculation means on a display device.