JP3438947B2

JP3438947B2 - Information retrieval device

Info

Publication number: JP3438947B2
Application number: JP12210194A
Authority: JP
Inventors: 朋之宮下; 克信柴田
Original assignee: NS Solutions Corp
Current assignee: NS Solutions Corp
Priority date: 1994-06-03
Filing date: 1994-06-03
Publication date: 2003-08-18
Anticipated expiration: 2018-08-18
Also published as: JPH07334515A

Description

[Detailed Description of the Invention]

[0001] [Field of Industrial Application] The present invention relates to an information retrieval device for document databases and the like, and in particular to an information retrieval device capable of full-text search or search over all items.

[0002] [Prior Art] When a user retrieves the items they need from a database that holds a large number of items of information, such as documents, two methods are available: a keyword search method, in which keywords are assigned to each item in advance and the desired items are found by looking for keywords that match the search key, and a direct search method, in which the full text of all the information is searched for items that contain the search key. The keyword search method has the following problems: (1) keywords must be assigned in advance to every item stored in the database, which takes considerable manual work; (2) if keywords are assigned freely, their number becomes enormous and they must be managed with a thesaurus; and (3) assigning apt keywords is difficult, so a desired item may not be reached. The direct search method has the problems that (1) when there are many items to search or each item is large, the search cannot be completed within a realistic time, and (2) so-called fuzzy search generally cannot be performed.

[0003] To solve these problems of the conventional information retrieval methods, the present applicant proposed, in Japanese Patent Application Laid-Open No. H4-326164, a database search system that requires no keyword assignment and can search at high speed. In this system, autocorrelation information is computed in advance for each item to be searched. At search time, autocorrelation information is computed for the search key, the degree of match between the key's autocorrelation information and each item's autocorrelation information is determined item by item, and items are output in descending order of the degree of match. The autocorrelation information is, for example, a fixed-size binary matrix. This search system has the following advantages: (1) because the search works on fixed-size autocorrelation information regardless of the size of each item, search time is greatly reduced; (2) an item containing a string written slightly differently from the search key still receives a fairly high degree of match, so fuzzy search is possible; (3) the autocorrelation information can be computed automatically, so far less work is needed to build the database than with the keyword search method; and (4) an item that contains the search key verbatim receives the maximum degree of match, so such items are never missed.

[0004] A concrete example of the database search system described in Japanese Patent Application Laid-Open No. H4-326164 follows. Assume that the items to be searched are English text and that each character of the text is represented by an ASCII code. An ASCII code normally occupies 8 bits, but the most significant bit is unused as long as only ordinary English characters appear, so only the lower 7 bits are considered. Because each character is then represented by a 7-bit code, that is, an integer from 0 to 127, a 128 × 128 binary matrix is used as the autocorrelation information. Every element of this matrix is assumed to be initialized to "0".

[0005] First, for each character of the English text, the n-gram (a run of consecutive characters, 連字) of a predetermined number of characters that begins with that character is extracted. If the English text is "This_is_a_pen." (where "_" denotes a space) and the predetermined number of characters is three, the n-grams "Thi", "his", "is_", "s_i", "_is", "is_", "s_a", "_a_", "a_p", "_pe", "pen", "en." and "n." are extracted. Next, if the character codes of the first, second and third characters of an extracted n-gram are c₁, c₂ and c₃, the elements (c₁, c₂) and (c₁, c₃) of the binary matrix representing the autocorrelation information are set to "1". Performing this operation for every extracted n-gram yields the autocorrelation information of the English text in question.

[0006] At search time, the autocorrelation information of the search key is first obtained by the same procedure. Then, for each item, the binary matrix of the item's autocorrelation information is compared with the binary matrix obtained from the search key, checking whether each element that is "1" in the key's matrix is also "1" in the item's matrix. The degree of match is the proportion of the "1" elements of the key's matrix that are also "1" in the item's matrix. Items are then output in descending order of the degree of match.

[0007] In this example the autocorrelation information is computed from n-grams. Roughly speaking, the degree of match reflects how many of the n-grams contained in the search key are also contained in the item. Because the degree of match is computed after conversion to autocorrelation information, rather than by checking n-gram matches literally, very fast search is possible.

[0008] [Problems to Be Solved by the Invention] As described above, the database search system disclosed in Japanese Patent Application Laid-Open No. H4-326164 made fast document search possible without keywords. However, the number of documents and the volume of information stored in document databases keep growing rapidly, and faster and more reliable document search is required.

[0009] An object of the present invention is to provide an information retrieval device that can retrieve what the searcher intends at higher speed.

[0010] [0011] [Means for Solving the Problems] The information retrieval device of the present invention retrieves desired documents from a set of search target documents on the basis of an input search string, and comprises: input means for inputting the search string; n-gram extraction means for extracting n-grams of a predetermined character length from the search string to form a search n-gram group; importance storage means for storing an importance for each search target document; search means for determining, for each n-gram in the search n-gram group, an importance for that n-gram according to the combination of character types of the characters that make it up and the order of those character types, and for checking, for each search target document and each n-gram in the search n-gram group, whether the document contains the n-gram and, if it does, adding the importance corresponding to that n-gram to the importance of that document in the importance storage means; and output means for referring to the importance storage means and outputting search results according to the importance of each search target document.

[0012] [Operation] When searching a document database, it is usual to focus on the local structure of the documents themselves.
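The n-gram extraction and binary-matrix matching of the prior-art example above are one way of capturing this local structure. The following is a minimal Python sketch of that scheme, not the cited publication's implementation. The function names are hypothetical, and as a simplifying assumption only full-length n-grams are extracted, so a shortened trailing n-gram such as "n." is ignored.

```python
from typing import List


def extract_ngrams(text: str, n: int) -> List[str]:
    """Return every run of n consecutive characters, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def autocorrelation_matrix(text: str, n: int = 3, size: int = 128) -> List[List[int]]:
    """Build the size x size binary matrix: for each n-gram c1 c2 c3,
    set elements (c1, c2) and (c1, c3) to 1."""
    matrix = [[0] * size for _ in range(size)]
    for gram in extract_ngrams(text, n):
        c1 = ord(gram[0]) & 0x7F            # keep only the lower 7 bits
        for ch in gram[1:]:
            matrix[c1][ord(ch) & 0x7F] = 1
    return matrix


def matching_degree(key: List[List[int]], item: List[List[int]]) -> float:
    """Share of the 1-elements of the key matrix that are also 1 in the item matrix."""
    ones = [(i, j) for i, row in enumerate(key) for j, v in enumerate(row) if v]
    if not ones:
        return 0.0
    return sum(item[i][j] for i, j in ones) / len(ones)


item = autocorrelation_matrix("This is a pen.")
print(matching_degree(autocorrelation_matrix("pen"), item))   # exact key
print(matching_degree(autocorrelation_matrix("pin"), item))   # slightly different key
```

With this sketch the key "pen" gets a degree of match of 1.0 against "This is a pen.", and the misspelled key "pin" still gets 0.5, which illustrates the fuzzy-search behaviour the text attributes to the scheme.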
A searcher will hardly ever want to search on the correlation between what is written at the place of interest and what is written, say, five pages later. When the focus is on local structure, it is more efficient, as noted under "Prior Art", to break the search key into n-grams and search on those n-grams than to use a search key several to several tens of words long as it is.

[0013] Consider now a searcher who is actually about to run a search: the searcher must have chosen the search key to convey some meaning. As for a document to be searched, the words it contains do not all carry the same weight; words that are characteristic and help identify the document are mixed with words that are not. In conventional search methods, particularly those based on n-grams, every n-gram contributes equally to the search result, so neither the meaning the searcher intended nor the characteristic content of the documents is reflected, and entirely unintended documents were often hit.

[0014] In the present invention, the search key (search string) is broken into n-grams, each n-gram is judged as to whether it is characteristic for the search, and more characteristic n-grams are made to contribute more to the search result, so that a search can be performed reliably in line with the searcher's intention. Specifically, the proportion by which each n-gram contributes to the search result is defined as its importance, and when a search target document contains the n-gram, the n-gram's importance is added to the importance (evaluation value) of that document. The importance of an n-gram can be determined, for example, by (1) finding, for each n-gram used in the search, its appearance frequency across all search target documents and giving lower importance to n-grams with higher appearance frequency, or (2) finding the character types that make up each n-gram and setting its importance according to the combination of those character types. Methods (1) and (2) may also be used together.

[0015] Method (1) rests, simply put, on the fact that the more commonly something appears across the search target documents, the less it helps to identify the desired document. Things that tend to appear commonly include inflectional endings, "the" and auxiliary verbs in English text, and particles and auxiliary verbs in Japanese text. Method (2) rests on the fact that, for example, Japanese text mixes kanji, hiragana, numerals and so on, and a sequence of kanji often carries a characteristic meaning as a compound word. The same idea applies to languages other than Japanese. In Chinese, for example, characters that can serve as particles can be distinguished from those that cannot, and character types can be defined according to that distinction. In English and similar languages, character types can be defined according to whether a character is an uppercase letter, a lowercase letter, a numeral, a Greek letter, or a symbol such as a hyphen. For English, endings such as the past-tense "ed" and the adverbial "ly" can also be treated as separate character types.

[0016] In the present invention, the specific method of computing the importance of each n-gram can be varied according to, for example, the language and field of the documents to be searched (technical literature, newspaper articles, literary works and so on), the searcher's habits (such as which search keys they use often), the purpose of the search (whether a fuzzy search is to be performed), and even the search string itself; it can also be varied adaptively. Varying the computation method as needed makes still more accurate searches possible.

[0017] [Embodiment] An embodiment of the present invention is now described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an information retrieval device according to one embodiment of the present invention, and FIG. 5 is a flowchart showing the processing procedure when information is retrieved by the method of the present invention using this device.

[0018] The information retrieval device 11 searches the document information accumulated as search target documents in the database storage unit 10. It consists of: a search string input unit 12 for inputting the search string specified by the searcher; an n-gram extraction unit 13 that extracts n-grams of a predetermined character length from the input search string; a search engine unit 14 that computes the importance of each n-gram extracted from the search string and actually searches the search target documents in the database storage unit 10 on the basis of those n-grams; a parameter storage unit 15 that stores the parameters used in computing the importance of each n-gram; an importance storage unit 16 that stores the importance of each search target document; and an output unit 17 that refers to the importance storage unit 16, normalizes the importance of each search target document, and outputs the search results. The set of n-grams extracted from the search string is called the search n-gram group.

[0019] The search engine unit 14 computes an importance for each n-gram in the search n-gram group. In this embodiment two kinds of importance are obtained for each n-gram: a first type of importance determined from the n-gram's appearance frequency, and a second type of importance determined from the character types that make up the n-gram. Each is 0 or a positive real number and is set so that larger values contribute more to the search result. Both are described below.

[0020] For the first type of importance, the search engine unit 14 computes the appearance frequency of the n-gram across all search target documents stored in the database storage unit 10, and the importance is set so that it is larger the lower the appearance frequency and smaller the higher the appearance frequency. Here the appearance frequency is the number of documents containing the n-gram divided by the total number of documents, and is a real number from 0 to 1. FIG. 2 is a graph with the appearance frequency on the x-axis and the value of the first type of importance on the y-axis, showing an example of a function relating the two.
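The appearance frequency defined above can be computed directly from the document set. The sketch below does so in Python and then maps the frequency to a first-type importance. The curve of FIG. 2 is not reproduced in the text, so the linear mapping `1 - frequency` is purely an assumption chosen so that rarer n-grams score higher; the function names are likewise hypothetical.

```python
from typing import Sequence


def appearance_frequency(gram: str, documents: Sequence[str]) -> float:
    """Number of documents containing the n-gram divided by the total number of documents."""
    if not documents:
        return 0.0
    return sum(gram in doc for doc in documents) / len(documents)


def type1_importance(frequency: float) -> float:
    """Assumed mapping: linear and decreasing in the appearance frequency."""
    return 1.0 - frequency


documents = ["大阪に出張する。", "東京に出張する。", "京都を観光する。", "出張の報告をする。"]
for gram in ("大阪", "出張", "する"):
    f = appearance_frequency(gram, documents)
    print(gram, f, type1_importance(f))
```

Any other non-negative, decreasing function could be substituted for `type1_importance` without changing the rest of the procedure.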
As the graph shows, the first type of importance is a monotonically decreasing function of the appearance frequency. An n-gram that appears in every search target document is meaningless for searching, so its first type of importance should preferably be 0.

[0021] When the n-gram length is two characters, the second type of importance is determined by the search engine unit 14 according to the combination of the character types of the first and second characters. Specifically, a calculation table 21 giving the value of the second type of importance for each combination of character types is provided in the parameter storage unit 15, and the value for each n-gram is obtained by looking up this table. FIG. 3 shows an example of the contents of the calculation table 21 for the case where the search target documents are in Japanese. In the illustrated example, characters are classified into hiragana, katakana, kanji, alphanumerics and symbols, and the second type of importance is largest when the n-gram consists only of kanji. As the kanji and hiragana combinations show, swapping the character types of the first and second characters does not necessarily give the same value. An n-gram whose first character is kanji and whose second is hiragana is overwhelmingly the last character of a compound (noun) followed by a particle, and is therefore considered less characteristic than the reverse order. N-grams of three or more characters are handled in basically the same way.
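The table lookup for the second type of importance might be sketched as follows. The actual values of FIG. 3 do not appear in the text, so every number in `TYPE2_TABLE` is a hypothetical placeholder. The placeholders respect only the two properties stated above: kanji followed by kanji scores highest, and kanji followed by hiragana scores lower than hiragana followed by kanji. The Unicode ranges used to classify characters, and the choice to score any n-gram containing a symbol as 0, are also assumptions of this sketch.

```python
def char_type(ch: str) -> str:
    """Classify one character into the five character types named in the text."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalnum():
        return "alnum"
    return "symbol"


# Hypothetical values keyed by (type of 1st character, type of 2nd character).
TYPE2_TABLE = {
    ("kanji", "kanji"): 1.0,
    ("katakana", "katakana"): 0.8,
    ("alnum", "alnum"): 0.6,
    ("hiragana", "kanji"): 0.5,
    ("kanji", "hiragana"): 0.2,
    ("hiragana", "hiragana"): 0.1,
}
DEFAULT_IMPORTANCE = 0.3   # assumed value for combinations not listed above


def type2_importance(gram: str) -> float:
    """Look up the second-type importance of a two-character n-gram."""
    first, second = char_type(gram[0]), char_type(gram[1])
    if "symbol" in (first, second):
        return 0.0          # assumption: punctuation carries no weight
    return TYPE2_TABLE.get((first, second), DEFAULT_IMPORTANCE)


for gram in ("大阪", "阪に", "に出", "する", "る。"):
    print(gram, type2_importance(gram))
```

Because the table is keyed by an ordered pair, the order dependence described in the text falls out naturally; extending the key to a tuple of three or more types would cover longer n-grams.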
[0022] After obtaining the first and second types of importance, the search engine unit 14 performs the actual search. One usable search algorithm is that described in the above-mentioned Japanese Patent Application Laid-Open No. H4-326164. In the search process, for each search target document and each n-gram in the search n-gram group, it is determined whether the document contains the n-gram; if it does, the first and second types of importance for that n-gram are added to the importance of that document in the importance storage unit 16. In practice, the search may be run in one batch after the first and second types of importance have been computed for all n-grams in the group. Alternatively, the following may be repeated for each n-gram: take one n-gram from the group, obtain its first and second types of importance, and then check whether each search target document contains it.

[0023] FIG. 4 shows an example of the configuration of the importance storage unit 16. It is organized as a table of the document number assigned to each search target document in the database storage unit 10 and the importance of the corresponding document. The initial value of each importance is 0. The per-document importance corresponds to the degree of match described under "Prior Art".

[0024] Next, the information retrieval procedure performed with the information retrieval device 11 according to the method of the present invention is described with reference to FIG. 5.

[0025] First, a search string is input through the search string input unit 12 (step 101), and the n-gram extraction unit 13 breaks it into n-grams of the predetermined character length (step 102). If the n-gram length is two characters and the search string is, for example, 「大阪に出張する。」 ("I am going on a business trip to Osaka."), the n-grams 「大阪」, 「阪に」, 「に出」, 「出張」, 「張す」, 「する」 and 「る。」 are extracted and form the search n-gram group.

[0026] The following processing is then performed for each n-gram in the search n-gram group. It is determined whether any unprocessed n-gram remains (step 103). If so, for an unprocessed n-gram the search engine unit 14 computes its appearance frequency across all search target documents (step 104), computes the first type of importance from that frequency (step 105), and determines the second type of importance from the combination of character types making up the n-gram (step 106). Next, it is checked whether any search target document in the database storage unit 10 has not yet been searched for this n-gram (step 107). If so, one such document is selected and it is determined whether it contains the n-gram (step 108). If it does not, processing returns directly to step 107; if it does, the first and second types of importance for the n-gram are added to the document's importance in the importance storage unit 16 (step 109), and processing returns to step 107. When step 107 finds no unsearched documents left, processing returns to step 103 to search with the next n-gram in the group.

[0027] When two-character n-grams are extracted from the search string 「大阪に出張する。」, the n-grams 「する」 and 「る。」 tend to appear in every document, so their first type of importance is small. With a calculation table like that of FIG. 3, the second type of importance of 「阪に」 is also small. The n-gram 「大阪」 (Osaka), on the other hand, consists of two kanji, so its second type of importance is large. If, moreover, 「大阪」 appears infrequently across the search target documents while 「出張」 (business trip) appears fairly frequently, the first type of importance is larger for 「大阪」 than for 「出張」. Overall, then, the search is one to which 「大阪」 contributes heavily.

[0028] If step 103 determines that no unprocessed n-gram remains, the search based on every n-gram in the search n-gram group has been completed.
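The loop of steps 101 to 109 can be sketched in Python as below. This is an illustration under stated assumptions rather than the device's implementation. The weighting function is passed in as a parameter, and the stand-in `rarity` uses only an assumed frequency-based weight (1 minus the appearance frequency) instead of the sum of the first and second types of importance. Repeated n-grams are used once, reading the "set of n-grams" literally. All names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def extract_ngrams(text: str, n: int = 2) -> List[str]:
    """Step 102: break the search string into n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def score_documents(
    query: str,
    documents: Sequence[str],
    gram_importance: Callable[[str, Sequence[str]], float],
    n: int = 2,
) -> Dict[int, float]:
    """Steps 103 to 109: accumulate an importance for every document number."""
    scores = {doc_no: 0.0 for doc_no in range(len(documents))}   # initial value 0
    for gram in dict.fromkeys(extract_ngrams(query, n)):          # step 103, distinct n-grams
        weight = gram_importance(gram, documents)                 # steps 104 to 106
        for doc_no, doc in enumerate(documents):                  # step 107
            if gram in doc:                                       # step 108
                scores[doc_no] += weight                          # step 109
    return scores


def rarity(gram: str, documents: Sequence[str]) -> float:
    """Stand-in weight (assumption): 1 minus the share of documents containing the n-gram."""
    return 1.0 - sum(gram in doc for doc in documents) / len(documents)


documents = ["大阪に出張する。", "東京に出張する。", "京都を観光する。"]
print(extract_ngrams("大阪に出張する。"))
print(score_documents("大阪に出張する。", documents, rarity))
```

In this small example 「する」 and 「る。」 occur in every document and so add nothing, while 「大阪」 and 「阪に」 occur only in the first document and account for most of its lead over the second.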
Control therefore passes to the output unit 17, which normalizes the importance values stored for each document in the importance storage unit 16 (step 110). Normalization here means multiplying every importance by the same coefficient so that the largest importance becomes 1. The longer the search string (search key), the more n-grams the search n-gram group contains, and so the raw per-document importance tends to be larger; normalizing corrects for this variation in importance values caused by differences between search strings. The normalized importances are then sorted in descending order (step 111), the search results are output as a list of the importance of each document (step 112), and processing ends.

[0029] [Effects of the Invention] As described above, the present invention extracts n-grams from the search string, obtains an importance for each, and varies each n-gram's contribution to the search result according to that importance. N-grams that are characteristic for the search are thereby reflected more strongly in the result, so that a search can be performed reliably in line with the searcher's intention. A still more reliable search becomes possible by determining the importance of an n-gram by (1) finding, for each n-gram used in the search, its appearance frequency across all search target documents and giving lower importance to n-grams with higher appearance frequency, or (2) finding the character types that make up each n-gram and setting its importance according to the combination of those character types.
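The output stage (steps 110 to 112) amounts to a rescaling followed by a sort. A minimal Python sketch, assuming the importance store is a mapping from document number to raw importance (the names `normalize` and `ranked` are hypothetical):

```python
from typing import Dict, List, Tuple


def normalize(scores: Dict[int, float]) -> Dict[int, float]:
    """Step 110: multiply every importance by one coefficient so the maximum becomes 1."""
    top = max(scores.values(), default=0.0)
    if top <= 0.0:
        return dict(scores)      # nothing matched; leave the zeros untouched
    return {doc_no: value / top for doc_no, value in scores.items()}


def ranked(scores: Dict[int, float]) -> List[Tuple[int, float]]:
    """Steps 111 and 112: (document number, normalized importance), largest first."""
    return sorted(normalize(scores).items(), key=lambda item: item[1], reverse=True)


print(ranked({0: 2.0, 1: 4.0, 2: 0.0}))
```

Guarding the all-zero case is a choice made in this sketch; the text does not say what the device does when no document contains any n-gram of the search string.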

[Brief Description of the Drawings]
FIG. 1 is a block diagram showing the configuration of an information retrieval device according to one embodiment of the present invention.
FIG. 2 is a graph showing an example of a function relating the appearance frequency to the first type of importance.
FIG. 3 is a diagram showing an example of the configuration of a calculation table giving the correspondence between character types and the second type of importance.
FIG. 4 is a diagram showing the configuration of the importance storage unit.
FIG. 5 is a flowchart showing the processing procedure when information is retrieved by the method of the present invention using the information retrieval device of FIG. 1.

[Description of Reference Signs]
10 Database storage unit
11 Information retrieval device
12 Search string input unit
13 N-gram extraction unit
14 Search engine unit
15 Parameter storage unit
16 Importance storage unit
17 Output unit
21 Calculation table
101 to 112 Steps

Continuation of front page: (58) Fields searched (Int. Cl.⁷, DB name): G06F 17/30

Claims

(57) [Claim 1] An information retrieval device that retrieves desired documents from a set of search target documents on the basis of an input search string, comprising: input means for inputting the search string; n-gram extraction means for extracting n-grams (runs of consecutive characters) of a predetermined character length from the search string to form a search n-gram group; importance storage means for storing an importance for each search target document; search means for determining, for each n-gram included in the search n-gram group, an importance for that n-gram according to the combination of character types of the characters constituting the n-gram and the order of those character types, and for checking, for each search target document and each n-gram included in the search n-gram group, whether the n-gram is contained in the search target document and, if it is, adding the importance corresponding to that n-gram to the importance of that search target document in the importance storage means; and output means for referring to the importance storage means and outputting search results according to the importance of each search target document.