JPH10171821A

JPH10171821A - Method for presenting retrieval word candidate and device therefor

Info

Publication number: JPH10171821A
Application number: JP8327275A
Authority: JP
Inventors: Takashi Inoue; 孝史井上; Kazuo Tanaka; 一男田中; Masakatsu Ookubo; 雅且大久保; Masayuki Sugizaki; 正之杉崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1996-12-06
Filing date: 1996-12-06
Publication date: 1998-06-26

Abstract

PROBLEM TO BE SOLVED: To provide a retrieval word candidate presenting method and device that enable efficient retrieval and yield more appropriate texts. SOLUTION: Words are extracted from a text that the user has designated, among the texts included in the previous retrieval result, as being related to the retrieval request. The group of words remaining after narrowing-down, such as restriction to a particular part of speech, is presented to the user, and the retrieval expression is reconstructed only from the words that the user designates among them as appropriate. Candidates for the words that become elements of a new retrieval expression are thus presented to the user from the text the user designated, and the retrieval expression is generated from the words the user designates among them, so that more appropriate texts can be obtained by the current retrieval.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to searching a text database, that is, to a method and apparatus for presenting search word candidates in text search.

[0002]

[Prior Art] Text search is a technique in which a collection of documents is registered in a database and the documents related to a search formula given by the user are retrieved from that database. In many systems the search formula is not limited to a single word such as "communication": a formula such as "communication AND computer", which matches documents related to both words, or "communication OR computer", which matches documents related to either word, is also accepted.
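As an illustration only, the AND and OR search formulas described above can be sketched in a few lines of Python. The document collection, the whitespace tokenizer, and the function names `search_and` and `search_or` are assumptions made for this sketch and do not appear in the specification.

```python
def tokenize(text):
    """Lowercase a text and split it on whitespace (a stand-in for real word segmentation)."""
    return set(text.lower().split())


def search_and(docs, *words):
    """Return the ids of documents that contain every one of the given words."""
    return [doc_id for doc_id, text in docs.items()
            if all(w.lower() in tokenize(text) for w in words)]


def search_or(docs, *words):
    """Return the ids of documents that contain at least one of the given words."""
    return [doc_id for doc_id, text in docs.items()
            if any(w.lower() in tokenize(text) for w in words)]


# Hypothetical three-document database, invented for this example.
docs = {
    1: "communication protocols for satellites",
    2: "computer architecture basics",
    3: "computer communication networks",
}

print(search_and(docs, "communication", "computer"))  # [3]
print(search_or(docs, "communication", "computer"))   # [1, 2, 3]
```

The OR form returns a superset of the AND form, which is why switching to OR broadens a search.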

[0003] As shown in FIG. 1, it is also common for the user to give a search formula and run a search (step (hereinafter "S") 11), have the search results displayed (S12), change the search formula on the basis of those results (S13), and search again (S14). For example, if a first search using the word "communication" as the search formula returns only a small set of documents from a narrower range than desired, the user broadens the search condition by changing the search formula to something like "communication OR computer".

[0004] In text search, a technique called relevance feedback has been proposed as a way of changing the search formula and searching again with some degree of automation. The method is described in detail in, for example, Chapter 11 of the book "Information Retrieval" (William B. Frakes and Ricardo Baeza-Yates, published by Prentice Hall); only a brief explanation of relevance feedback is given here.

[0005] First, FIG. 2 shows the flow of a search that includes general relevance feedback. A search is performed based on the search formula specified by the user (S21), and the search results are displayed (S22). A new search formula is automatically generated from those texts among the search results that the user has designated as appropriate (S23). The search is performed again based on this new search formula (S24). In this way, automatic generation of a search formula from the designated texts and re-searching are repeated until the necessary information is obtained.
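The loop of S21 to S24 can be sketched roughly as follows. This is a minimal sketch under several assumptions that are not in the specification: documents are whitespace-tokenized, a search formula is modeled as a set of words joined by OR, the user's judgment is modeled by a hypothetical `judge` callback, and the loop stops when the generated formula no longer changes or after `max_rounds` rounds.

```python
def words_of(text):
    """Whitespace tokenizer used as a stand-in for real word segmentation."""
    return set(text.lower().split())


def search(docs, query_words):
    """Return ids of documents containing at least one query word (OR search)."""
    return [doc_id for doc_id, text in docs.items() if words_of(text) & query_words]


def feedback_search(docs, query_words, judge, max_rounds=5):
    """Repeat formula generation and re-search, as in S21 to S24."""
    query_words = set(query_words)
    results = search(docs, query_words)          # S21: initial search
    for _ in range(max_rounds):
        relevant = judge(results)                # S22: display, user designates texts
        if not relevant:
            break
        new_query = set()
        for doc_id in relevant:                  # S23: formula from designated texts
            new_query |= words_of(docs[doc_id])
        if new_query == query_words:             # nothing new to search for
            break
        query_words = new_query
        results = search(docs, query_words)      # S24: re-search
    return results


# Hypothetical data: the user always marks document 1 as appropriate.
docs = {1: "communication satellite", 2: "satellite orbit", 3: "computer chess"}
found = feedback_search(docs, {"communication"}, lambda ids: [i for i in ids if i == 1])
print(found)  # [1, 2]
```

Here the second search also retrieves document 2, because the word "satellite" from the designated text entered the formula automatically.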

[0006] Next, FIG. 3 shows the flow of a general search formula generation process in relevance feedback. First, morphological analysis, which performs word segmentation and identifies the part of speech of each word, is carried out on the designated text (S31). For example, if part of the designated text is "... Service C is a new service for competing against rival companies ...", the result of morphological analysis for this part is as shown in FIG. 4. For example, "Service C" (サービスＣ) and the particle "wa" (は) are each segmented as one word, and their parts of speech are noun and particle, respectively. Second, valid words are extracted from the designated text (S32). A commonly used method is to extract only the words of predetermined parts of speech from the words contained in the designated text; in many cases only nouns are extracted. For example, when part of a text is as shown above, "Service C" (サービスＣ), "competition" (競合), "each company" (各社), and "service" (サービス) are selected from that part.
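The part-of-speech filtering of S32 can be illustrated as below. Only the tags for "サービスＣ" (noun) and "は" (particle) are stated in the text; the remaining segmentation and tags are simplified assumptions chosen for this sketch, not a reproduction of FIG. 4, and the names `Morpheme` and `extract_valid_words` are hypothetical. A real system would obtain the tagged list from a morphological analyzer.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    surface: str  # the word as it appears in the text
    pos: str      # its part of speech


# Hand-tagged result for the example fragment (segmentation and tags partly assumed).
analysed = [
    Morpheme("サービスＣ", "noun"),
    Morpheme("は", "particle"),
    Morpheme("競合", "noun"),
    Morpheme("各社", "noun"),
    Morpheme("に", "particle"),
    Morpheme("対抗する", "verb"),
    Morpheme("ための", "other"),
    Morpheme("新", "prefix"),
    Morpheme("サービス", "noun"),
    Morpheme("で", "particle"),
]


def extract_valid_words(morphemes, keep_pos=frozenset({"noun"})):
    """Return the surface forms whose part of speech is in keep_pos, in text order."""
    return [m.surface for m in morphemes if m.pos in keep_pos]


print(extract_valid_words(analysed))
# ['サービスＣ', '競合', '各社', 'サービス']
```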

[0007] The words extracted so far are often given scores based on their frequency of occurrence and narrowed down to only those with high scores. Various scoring methods are conceivable; one is as follows. Among the words appearing in the designated text, a word that occurs less frequently in the whole text database being searched is considered to express the characteristics of the designated text more strongly, and is given a higher score. A simple formula that realizes this is:

score given to a word = log(1 ÷ number of texts in the text database that contain the word)

Only words whose score exceeds a threshold are extracted. The occurrence frequencies are computed in advance, when the texts are registered.
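The scoring formula can be written out as a short sketch. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the document counts, the threshold, and the function names `word_score` and `narrow_by_score` are likewise invented for illustration.

```python
import math


def word_score(doc_count: int) -> float:
    """Score = log(1 / number of texts containing the word); natural log assumed."""
    if doc_count < 1:
        raise ValueError("a word taken from a registered text occurs in at least one text")
    return math.log(1 / doc_count)


def narrow_by_score(doc_counts: dict[str, int], threshold: float) -> dict[str, float]:
    """Keep only the words whose score exceeds the threshold."""
    scores = {word: word_score(n) for word, n in doc_counts.items()}
    return {word: s for word, s in scores.items() if s > threshold}


# Hypothetical counts of texts containing each word (computed at registration time).
doc_counts = {"Service C": 3, "competition": 40, "each company": 500, "service": 2000}

# Keep words that occur in fewer than 100 texts, i.e. score > log(1/100).
kept = narrow_by_score(doc_counts, threshold=math.log(1 / 100))
print(sorted(kept))  # ['Service C', 'competition']
```

Because a word occurs in at least one text, every score is zero or negative; a rarer word has a score closer to zero, so the threshold is itself a negative number.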

[0008] Third, the search formula is constructed (S33). A simple method is to join the extracted valid words with OR, that is, to construct a search formula that is satisfied if a text contains any one of the valid words. From the text example given above, the following search formula is constructed.

[0009] "... OR Service C OR competition OR each company OR service OR ..."

By extracting related terms from the designated text in this way and reflecting them in the search formula, the next search can obtain a wide range of related texts that the previous search could not.
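A minimal sketch of the OR construction of S33 follows. The function name `build_or_query` is hypothetical, and dropping duplicate words is an addition made for this sketch rather than something the text requires.

```python
def build_or_query(words):
    """Join the extracted valid words with OR, keeping first occurrences in order."""
    unique = list(dict.fromkeys(words))  # drop duplicates, preserve order
    if not unique:
        raise ValueError("at least one word is needed to build a search formula")
    return " OR ".join(unique)


print(build_or_query(["Service C", "competition", "each company", "service"]))
# Service C OR competition OR each company OR service
```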

[0010] In some cases the texts obtained by the re-search are also scored so that they can be ordered by appropriateness, but this is omitted here.

[0011]

[Problems to Be Solved by the Invention] With the conventional search technique described above, the search condition can be changed with reference to the results of the previous search, but there is no guidance on how to change it, so the user proceeds by trial and error, which is inefficient. With relevance feedback, on the other hand, although narrowing such as restriction to particular parts of speech is applied, the search formula is generated mechanically from the words extracted from the designated text. Unnecessary words therefore end up in the generated search formula, and unrelated texts are retrieved as well when the search is run again.

[0012] An object of the present invention is to provide a search word candidate presentation method and apparatus that make it possible to search efficiently and obtain more appropriate texts.

[0013]

[Means for Solving the Problems] To achieve the above object, the present invention extracts words from a text that the user has designated, among the texts included in the previous search results, as being relevant to the search request; presents to the user the set of words remaining after narrowing such as restriction to particular parts of speech; and reconstructs the search formula only from the words that the user designates from that set as appropriate. When the words are presented, they are presented in order, starting from the words most characteristic of the designated text.

[0014] According to the present invention, candidates for the words that become elements of a new search formula are presented to the user from the text the user designated, and the search formula is generated from the words the user designates among them. This makes it possible to obtain more appropriate texts in the current search.

[0015]

[Embodiments of the Invention] An embodiment of the present invention is described below.

[0016] FIG. 5 is a block diagram showing a search word candidate presentation apparatus according to the present invention, and FIG. 6 is a flowchart of the search word candidate presentation process. The apparatus comprises a control unit 51, a display unit 52, a user designation unit 53, a valid word extraction unit 54, a score calculation unit 55, a narrowing unit 56, a search formula generation unit 57, and a search unit 58. The flow of the search word candidate presentation process is described below.

[0017] First, the search results are displayed using the display unit 52 (S61). Using the user designation unit 53, the user designates, from among the displayed search result texts, the text related to the search request, from which search word candidates are to be presented (S62).

[0018] Second, valid words are extracted from the designated text by the valid word extraction unit 54 (S63). As explained in the prior art section, valid word extraction uses restriction by part of speech (normally nouns only) and scoring based on word frequency. The scores of the valid words are calculated by the score calculation unit 55 (S64), and the narrowing unit 56 narrows the words down using these scores (S65). Up to this point the process is the same as in the prior art.

[0019] Third, the set of words extracted by the method so far is presented to the user on the display unit 52 as search word candidates, in descending order of score (S66). The user designates a plurality of appropriate words from among those presented (S67). This is the point that differs from the prior art.

[0020] Fourth, the search formula generation unit 57 generates a search formula using the words designated by the user (S68), and the search unit 58 performs the search (S69).

[0021] The flow of processing in the present invention is now explained more concretely using an example. Suppose that, for the displayed search results, the user designates the following as a text related to the search request: "Service C is also being actively promoted in terms of advertising such as TV commercials (CF), which shows how much effort Company C is putting into it. In fact, the number of contracts for these services has increased greatly, and they have become a cash cow." The nouns taken from this text by valid word extraction are CF, advertising, aspect, Service C, Company A, effort, input, degree, service, contract, and cash cow. Scores are then assigned using the same frequency statistics as in the description of the prior art above, and the words are narrowed down by removing those whose scores are at or below the threshold. Suppose that the additional word candidates finally extracted in this way, arranged in order of score, are cash cow, CF, Service C, Company A, competition, series, and degree. These are presented to the user, and a search formula is generated by joining with OR the ones that the user designates. For example, if the words the user designates are Company A, Service C, and series, the generated search formula is "Company A OR Service C OR series", and the re-search is performed with this as the new search condition.
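The example can be traced with a small sketch of steps S65 to S68. The numeric scores and the threshold below are invented so that the resulting order matches the candidate list given in the text; they are not values from the specification, and the function names `present_candidates` and `build_query_from_selection` are hypothetical. The user's choice is passed in as a plain list.

```python
def present_candidates(scores, threshold):
    """Narrow by threshold (S65) and return candidates in descending score order (S66)."""
    kept = [(word, s) for word, s in scores.items() if s > threshold]
    return [word for word, _ in sorted(kept, key=lambda pair: pair[1], reverse=True)]


def build_query_from_selection(candidates, selected):
    """Build an OR formula (S68) from the words the user designated (S67)."""
    unknown = [w for w in selected if w not in candidates]
    if unknown:
        raise ValueError(f"not among the presented candidates: {unknown}")
    return " OR ".join(selected)


# Hypothetical scores; only their relative order matters for this example.
scores = {
    "series": -4.1, "Company A": -3.0, "cash cow": -1.2, "degree": -4.6,
    "CF": -2.0, "competition": -3.7, "Service C": -2.5,
    "service": -7.5, "contract": -6.8,
}
THRESHOLD = -5.0

candidates = present_candidates(scores, THRESHOLD)
print(candidates)
# ['cash cow', 'CF', 'Service C', 'Company A', 'competition', 'series', 'degree']

query = build_query_from_selection(candidates, ["Company A", "Service C", "series"])
print(query)  # Company A OR Service C OR series
```

Unlike purely automatic relevance feedback, the words "cash cow", "CF", "competition", and "degree" never reach the search formula, because the user did not designate them.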

[0022]

[Effects of the Invention] As described above, the present invention makes it possible to obtain more appropriate texts in the current re-search. The user can thereby obtain the information that is really needed in a shorter time and more easily.

[Brief description of the drawings]

[FIG. 1] Flowchart showing the general flow of "search, change of search formula, re-search".

[FIG. 2] Flowchart showing the flow of general relevance feedback.

[FIG. 3] Flowchart showing a general search formula generation process in relevance feedback.

[FIG. 4] Diagram showing an example of morphological analysis.

[FIG. 5] Block diagram of a search word candidate presentation apparatus according to an embodiment of the present invention.

[FIG. 6] Flowchart showing the search word candidate presentation process according to an embodiment of the present invention.

Continuation of front page: (72) Inventor: Masayuki Sugizaki, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo

Claims

[Claims]

1. A method of presenting search word candidates in text search, characterized by: displaying search results and search word candidates; having a user designate a text among the displayed search results and search words among the search word candidates; extracting valid search words from the designated text; calculating scores for the search word candidates; narrowing down the search word candidates based on the scores; generating a search formula from the designated search words; and performing a search using the generated search formula.

2. An apparatus for presenting search word candidates in text search, characterized by comprising: a display unit that displays search results and search word candidates; a user designation unit with which a user designates a text among the displayed search results and search words among the search word candidates; a valid word extraction unit that extracts valid search words from the designated text; a score calculation unit that calculates scores for the search word candidates; a narrowing unit that narrows down the search word candidates based on the scores; a search formula generation unit that generates a search formula from the designated search words; and a search unit that performs a search using the generated search formula.