JP3913495B2

JP3913495B2 - Route search method, route search device, program and recording medium

Info

Publication number: JP3913495B2
Application number: JP2001150934A
Authority: JP
Inventors: 正之杉崎; 信行大森; 大二郎森; 博人稲垣
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2001-05-21
Filing date: 2001-05-21
Publication date: 2007-05-09
Anticipated expiration: 2021-05-21
Also published as: JP2002342381A

Description

[0001]
[Technical field of the invention]
The present invention relates to a shortest path search method, a shortest path search apparatus, a recording medium, and a program for searching for the shortest path within a large set of document data that carries hyperlink information referring to other documents.
[0002]
[Prior art]
In recent years, computer networks such as the Internet have made it possible to exchange large volumes of electronic documents and to publish information to an unspecified large audience. As a result, services that let individuals search such document information for what they need have been made available on networks to the general public.
[0003]
Documents expressed on computer networks use forms of expression that take advantage of the medium. Among them, documents on the WWW (World Wide Web), as shown in FIG. 2, not only record information but also provide a "hyperlink" function for referring to documents written by other people that reside on other computers. A hyperlink is used when an author relies on another document to supplement the information being written, or when pointing to a document with the same content. In particular, when documents refer to each other, the contents of those documents or document groups are closely related.
[0004]
FIG. 3 shows visible information that carries a link function. Conventionally, to reach a target document, a user chose links by experience and intuition alone, relying on the visible information attached to each link. In FIG. 3, the user chose links relying on visible information in document A such as "professional financial technology know-how" and "practical financial technology university".
[0005]
[Problems to be solved by the invention]
Conventionally, links were chosen by experience and intuition alone. Users want to reach the target document by following as few links as possible. The route traced by following links is called a path, and the path with the fewest link traversals is called the shortest path. Because links were chosen by experience and intuition alone, however, the number of links followed was likely to be large, which reduced convenience for the user.
[0006]
FIG. 4 shows an example of a hypertext space. For example, to reach document L from document A in FIG. 4, a user might follow six links along the route A, B, E, D, G, K, L.
If the search space is finite, as in FIG. 4 (finite here means that a path can be found within a time the user can accept), the shortest path can be found by searching the space exhaustively. A search space such as the WWW, however, is not finite in this sense, and the shortest path could not be obtained.
The present invention has been made in view of these circumstances. Its object is to provide a shortest path search method, a shortest path search device, a recording medium, and a program that can search along the shortest route when following hyperlinks between documents from one document to another.
[0007]
[Means for Solving the Problems]
To achieve the above object, the present invention is a route search method for a route search device that has document category storage means for storing documents belonging to categories, path search means, and result output means, and that obtains a path from a starting document to an end document by following hyperlinks between the documents belonging to the categories. The method comprises a path search process, performed by the path search means, of assigning weights to hyperlinks, searching a weighted directed graph using the hyperlink weight values, and obtaining a path from the starting document to the end document; and a result output process, performed by the result output means, of outputting the path obtained in the path search process. The weight of a hyperlink is the product of a relevance, which is the similarity between the feature vector of the category of the link source document and the feature vector of the category of the link destination document, and a fitness, which is the similarity between the feature vector of the category of the link destination document and the feature vector of the link destination document.
[0009]
In the route search method described above, when the link destination document belongs to a plurality of categories, the hyperlink weight used in the route search is the maximum of the weights calculated for each category to which the link destination document belongs.
[0010]
The present invention is also a route search device that obtains a path from a starting document to an end document by following hyperlinks between documents belonging to categories, the device comprising: document category storage means for storing the documents belonging to the categories; weight calculation means for calculating the weight of a hyperlink as the product of a relevance, which is the similarity between the feature vector of the category of the link source document and the feature vector of the category of the link destination document connected by the hyperlink, and a fitness, which is the similarity between the feature vector of the category of the link destination document and the feature vector of the link destination document; path search means for searching a weighted directed graph using the hyperlink weight values to obtain a path from the starting document to the end document; and result output means for outputting the path obtained in the path search process.
[0012]
In the route search device described above, when the link destination document belongs to a plurality of categories, the path search means uses, as the hyperlink weight in the route search, the maximum of the weights calculated for each category to which the link destination document belongs.
[0013]
Further, the present invention is a program for causing a computer to execute the route search method described above.
[0014]
The present invention is also a computer-readable recording medium that records the above-described program.
[0015]
The present invention repeats a search within a partial search space and forms the overall path by connecting, in each partial search space, the path with the highest weight between documents. The shortest path is not guaranteed, but a path closer to the shortest path is found, which improves convenience for the user.
Specifically, a directory for classifying documents is created; when the two documents between which a route is wanted are input, the weights between documents are calculated using the directory, a path is searched for based on the weights, and the final path is output as the result.
[0016]
In more detail, the path search sets the origin document of the search, extracts the link destination documents from this origin document, determines the categories (a category is one field of the directory) to which the origin document and the link destination documents belong, obtains the weights between the origin document and the link destination documents, and determines the optimal path based on the weights.
[0017]
The directory input unit creates a directory. A document belongs to at least one category (one field of a directory) and is used for calculating weights.
The search condition input unit inputs search conditions such as two documents for which a route is desired.
The path search unit obtains the path between the two input documents.
The result output unit displays the result obtained by the path search unit.
Within the path search unit, the starting point setting unit sets the origin document of the search.
The link destination document extraction unit extracts the link destination documents from the origin document.
The category determination unit determines the categories to which the origin document and the link destination documents belong.
The weight calculation unit obtains the weights between the origin document and the link destination documents.
The path determination recording unit determines the optimal path from the weights and stores it.
The search end determination unit determines whether to continue the search.
[0018]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. Note that, in all drawings illustrating the embodiment of the present invention, the same elements are denoted by the same reference numerals, and redundant description is omitted.
FIG. 1 shows a configuration of a shortest path search device according to an embodiment of the present invention. In the figure, the shortest path search apparatus has a directory input unit 101, a search condition input unit 102, a path search unit 103, and a result output unit 104.
The path search unit 103 includes a starting point setting unit 201, a link destination document extraction unit 202, a category determination unit 203, a weight calculation unit 204, a path determination recording unit 205, and a search end determination unit 206.
[0019]
FIGS. 5 and 6 show the processing flow of the shortest path search device according to the embodiment of the present invention. The reference numbers correspond to those in the block diagram of FIG. 1; the same number denotes both the means that performs a process and the corresponding step.
The directory input unit 101 inputs directory configuration information prepared separately and a sample document assigned to a category for automatic classification. For example, a directory as shown in FIG. 7 is created.
The search condition input unit 102 designates two documents whose paths are to be obtained. For example, documents A and B are entered.
[0020]
The path search unit 103 searches for a path that is as short as possible between documents A and B input via the search condition input unit 102. It calculates the weight of each hyperlink from the relevance between the categories of the directory and the fitness between documents and categories, and searches using those weights. To obtain the shortest path from an arbitrary document A to an arbitrary document B, an algorithm such as the A* algorithm, which searches for the shortest path in a weighted directed graph using the hyperlink weight values, is applied (see "Basic Knowledge of Artificial Intelligence", Taihara, Kindai Kagaku-sha, 1988). This makes it possible to find a path more efficiently than with shortest path search techniques such as breadth-first search or depth-first search (see the same reference).
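The text does not say how the hyperlink weights become path costs or which heuristic A* would use. The following is a minimal sketch under two assumptions that are not in the original: each weight lies in (0, 1] and is converted to a cost of -log(weight), so that minimizing the summed cost maximizes the product of weights, and the heuristic is zero, which reduces A* to uniform-cost search. The graph and the function name `best_product_path` are hypothetical.

```python
import heapq
import math

def best_product_path(graph, start, goal):
    """Uniform-cost search (A* with a zero heuristic) over costs -log(weight).

    graph maps a document to {linked document: hyperlink weight in (0, 1]}.
    Returns (path, product of weights), or (None, 0.0) if goal is unreachable.
    """
    frontier = [(0.0, start, [start])]
    settled = set()
    while frontier:
        cost, doc, path = heapq.heappop(frontier)
        if doc == goal:
            return path, math.exp(-cost)
        if doc in settled:
            continue
        settled.add(doc)
        for nxt, w in graph.get(doc, {}).items():
            if w > 0 and nxt not in settled:
                heapq.heappush(frontier, (cost - math.log(w), nxt, path + [nxt]))
    return None, 0.0

# Hypothetical weights, for illustration only.
graph = {
    "A": {"B": 0.4, "C": 0.5},
    "B": {"L": 0.9},
    "C": {"D": 0.2},
    "D": {"L": 0.9},
}
path, product = best_product_path(graph, "A", "L")
print(path, round(product, 3))
```

Note that this sketch optimizes the product of weights rather than the number of links followed; whether the two coincide depends on how well the weights reflect closeness to the target.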
[0021]
The result output unit 104 outputs and displays information on the shortest path obtained by the path search unit 103.
The overall flow has been described above. The path search by the path search unit 103 is described in detail below. Because the path search is based on the weights between documents, the method of calculating the weights is described first, together with the quantities it requires: the feature vector of a document, the feature vector of a category, the fitness, which indicates how well a document suits a category, and the relevance, which indicates the strength of the relationship between categories.
[0022]
Documents are automatically classified into a directory prepared in advance. For automatic document classification, techniques such as the one explained as "Prior Art" in Japanese Patent Application No. 10-281621 ("Automatic Information Classification Method and Device, and Recording Medium Recording an Automatic Information Classification Program", Sugisaki et al.) can be used. Briefly, a feature vector is first created from the words in a document and their frequency of occurrence. The value ww_ik of word k in document i is
$$ww_{ik} = (\text{number of occurrences of word } k \text{ in document } i) \times \log\frac{\text{total number of documents}}{\text{number of documents in which word } k \text{ appears}}$$
and the feature vector vec_i of document i is
$$vec_i = (ww_{i1}, \ldots, ww_{ik}, \ldots, ww_{in}) \qquad (1)$$
[0023]
Where n is the total number of words that appear in all documents (for the creation of feature vectors, see “Automatic Text Processing” Gerard Salton, ADDISON-WESLEY pub. 1989).
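As an illustration of equation (1), the following minimal sketch builds the feature vectors for a toy corpus. The function name `feature_vectors`, the alphabetical ordering of the vocabulary, and the sample words are assumptions made for the example; the text does not specify tokenization or the logarithm base (the natural logarithm is used here).

```python
import math
from collections import Counter

def feature_vectors(docs):
    """Return (vocab, vectors) with vectors[i][k] = count * log(N / df).

    docs is a list of word lists, one per document.
    """
    n_docs = len(docs)
    counts = [Counter(words) for words in docs]
    vocab = sorted({w for c in counts for w in c})
    # df[w]: number of documents in which word w appears
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    vectors = [[c[w] * math.log(n_docs / df[w]) for w in vocab] for c in counts]
    return vocab, vectors

# Toy corpus, for illustration only.
docs = [
    ["stock", "fund", "stock"],
    ["novel", "fund"],
    ["novel", "poem"],
]
vocab, vectors = feature_vectors(docs)
print(vocab)
print(vectors[0])
```

A word that appears in every document gets the value 0, because log(N / N) = 0, so it does not help distinguish documents.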
Similarly, a feature vector catev_x of category x is created from the words in the sample documents and their frequency of occurrence. The value of each element of the category feature vector is the average of the corresponding elements of the feature vectors of the sample documents assigned to the category.
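The element-wise averaging can be sketched as follows. The function name `category_vector` and the sample values are hypothetical; the only step taken from the text is that each element of the category vector is the mean of the corresponding elements of the sample document vectors.

```python
def category_vector(sample_vectors):
    """Element-wise mean of the feature vectors of a category's sample documents."""
    if not sample_vectors:
        raise ValueError("a category needs at least one sample document")
    n = len(sample_vectors)
    return [sum(column) / n for column in zip(*sample_vectors)]

# Two hypothetical sample document vectors for one category.
samples = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 0.0],
]
catev = category_vector(samples)
print(catev)
```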
[0024]
The cosine cos θ of the angle between the feature vector of a document and the feature vector of a category is calculated, and from that value the fitness of the document for each category (how appropriate the judgment "belongs to that category" is) can be defined. The fitness between document i and category x is written rel_ix. When cos θ is used, rel_ix takes values between 0 and 1, with 1 being the highest fitness.
The relevance between categories is also defined for each pair of categories. The relevance between category x and category y is written Caterel_xy. When it is calculated automatically, cos θ of the angle between the category feature vectors is used as the relevance between the categories (see "Introducing the distance between categories" in Japanese Patent Application No. 10-281621).
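Both the fitness rel_ix and the automatically calculated relevance Caterel_xy are cosines of the angle between two vectors, so one helper can serve for both. The sketch below is a minimal version; returning 0.0 for a zero vector is an assumption, since the text does not cover that case, and the vectors are made up for the example.

```python
import math

def cosine(u, v):
    """cos θ between two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical vectors: one document and two categories.
vec_i = [1.0, 0.0, 1.0]
catev_x = [1.0, 0.0, 0.0]
catev_y = [0.0, 1.0, 0.0]

rel_ix = cosine(vec_i, catev_x)          # fitness of document i for category x
caterel_xy = cosine(catev_x, catev_y)    # relevance between categories x and y
print(round(rel_ix, 4), caterel_xy)
```

The value stays between 0 and 1 because the vector elements defined in equation (1) are never negative.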
[0025]
Using the relevance between categories and the fitness of a document for each category, a weight is defined for the hyperlink from one document (i) to another document (j). The weight w_ij of the hyperlink from document i assigned to category x to document j assigned to category y is
$$w_{ij} = Caterel_{xy} \cdot rel_{jy} \qquad (2)$$
An example of the hyperlink weight is described with reference to FIG. 8.
[0026]
Suppose document A belongs to the category "Company T" with fitness 0.7, and document D belongs to the category "pure literature" with fitness 0.9. If the relevance between "Company T" and "pure literature" is 0.11, the weight of the link from document A to document D is 0.099 (0.11 * 0.9).
A document does not necessarily belong to only one category; it may belong to several. In that case, the maximum weight is adopted. FIG. 9 shows an example of hyperlink weights when there are multiple routes between documents. In the figure, document D belongs to the categories "pure literature" and "parts". The weight via "pure literature" is 0.099 (0.11 * 0.9) and the weight via "parts" is 0.12 (0.3 * 0.4), so the larger value, 0.12, is adopted as the weight.
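The two worked examples can be reproduced with a short sketch of equation (2) that takes the maximum over the categories of the link destination document. The function name `link_weight` and the dictionary layout are assumptions. By analogy with the first example, 0.3 is read here as the relevance between "Company T" and "parts" and 0.4 as the fitness of document D for "parts"; the text gives only the product.

```python
def link_weight(source_category, target_fitness, caterel):
    """Equation (2), maximized over the destination document's categories.

    target_fitness maps category y to rel_jy for the destination document.
    caterel maps (x, y) to Caterel_xy.
    """
    return max(
        caterel[(source_category, y)] * rel_jy
        for y, rel_jy in target_fitness.items()
    )

caterel = {
    ("Company T", "pure literature"): 0.11,
    ("Company T", "parts"): 0.3,   # assumed split of the 0.3 * 0.4 product
}

# FIG. 8 case: document D belongs only to "pure literature".
w_single = link_weight("Company T", {"pure literature": 0.9}, caterel)

# FIG. 9 case: document D belongs to "pure literature" and "parts".
w_multi = link_weight("Company T", {"pure literature": 0.9, "parts": 0.4}, caterel)

print(round(w_single, 3), round(w_multi, 3))
```

The fitness 0.7 of document A for "Company T" does not enter the weight: equation (2) uses only the relevance between the two categories and the fitness of the destination document.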
[0027]
Next, path search in the path search unit 103 will be described. As a whole, the optimum path is determined from the weight of the hyperlink in the partial search space, the search in the partial search space is repeated, and the entire path is determined.
A case where a path from the document A to the document L is obtained will be described with reference to FIG. The numerical values in FIG. 10 indicate weights, and larger values indicate more desirable paths.
In the path search unit 103, the starting point setting unit 201 first sets the starting point of the search. In the first round it is the document input as a search condition; in the second and subsequent rounds it is the solution of the previous partial search. In FIG. 10, for example, it is document A.
[0028]
Next, the link destination document extraction unit 202 extracts the documents to which the origin document has hyperlinks. In FIG. 10, document B and document C are extracted. The category determination unit 203 determines the categories of the origin document and the link destination documents using the automatic classification technique, and obtains the fitness between each document and its category.
In this example, the categories to which document A, document B, and document C belong are determined, and the fitness values are obtained.
The weight calculation unit 204 then calculates the weight of each hyperlink from the result obtained by the category determination unit 203 and the relevance between the categories, using equation (2). FIG. 10 shows the calculated weights.
[0029]
The path determination recording unit 205 compares the calculated weights, determines the optimal path, and records it. In the example shown in FIG. 10, the weight of the hyperlink from document A to document B is 0.4 and the weight of the hyperlink from document A to document C is 0.5, so the hyperlink from document A to document C is selected and recorded.
The search end determination unit 206 then checks whether the selected document is the target document. If it is, the process ends; if not, the process returns to step 201 and the search is performed again.
[0030]
In the example shown in FIG. 10, since the document L has not been reached, the search is performed again.
In the second and subsequent searches, the process proceeds by reflecting the previous search result. In the example shown in FIG. 10, the second search is performed starting from the document C as the first search result. Document E and document F are compared, and document F is selected. The above processing is repeated until the document L is obtained.
In the second and subsequent searches, hyperlinks are compared by cumulative weight: the weights of all hyperlinks already selected along the path are multiplied together with the weight of the candidate hyperlink, and the hyperlink with the largest cumulative weight is sought.
Thus, in the example of FIG. 10, the route A, C, F, I, L is obtained as the solution. The search process is summarized in FIG. 11. Shaded rows indicate the optimal solution of each partial search.
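The loop of steps 201 to 206 can be sketched as a greedy search that keeps one solution per round. Only the weights 0.4 (A to B) and 0.5 (A to C) come from the text; every other weight and the documents H and K in the graph below are hypothetical values chosen so that the route A, C, F, I, L results. The visited set and the `max_steps` limit are safeguards added for the sketch and are not part of the described procedure.

```python
def greedy_path(graph, start, goal, max_steps=100):
    """Follow the highest-weight hyperlink from each origin document.

    Returns (path, cumulative weight), or (None, 0.0) when the search
    gets stuck or exceeds max_steps links.
    """
    path, cumulative, visited = [start], 1.0, {start}
    while path[-1] != goal:                      # step 206: target reached?
        if len(path) > max_steps:
            return None, 0.0
        origin = path[-1]                        # step 201: set the origin
        candidates = {                           # steps 202 to 204: links and weights
            doc: w for doc, w in graph.get(origin, {}).items()
            if doc not in visited
        }
        if not candidates:
            return None, 0.0
        best = max(candidates, key=candidates.get)   # step 205: pick and record
        cumulative *= candidates[best]
        path.append(best)
        visited.add(best)
    return path, cumulative

graph = {
    "A": {"B": 0.4, "C": 0.5},   # from the text
    "C": {"E": 0.3, "F": 0.6},   # hypothetical
    "F": {"H": 0.2, "I": 0.7},   # hypothetical
    "I": {"K": 0.4, "L": 0.8},   # hypothetical
}
path, cumulative = greedy_path(graph, "A", "L")
print(path, round(cumulative, 3))
```

When only one solution is kept per round, multiplying by the weights already accumulated scales every candidate by the same factor, so it does not change which hyperlink is chosen. The cumulative product starts to matter once several partial paths are compared against each other.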
[0031]
The stage at which the relevance is calculated is decided by the trade-off between calculation time and memory usage. It can be calculated anew in every partial search (which costs calculation time), or calculated and recorded at the directory input stage so that the recorded values are used later (which costs memory).
The relevance values may also be specified individually by hand. In that case, this is done in the directory input process. Because partial search is a conventional technique, it is not described in further detail here. As a brief supplement, the search may be performed in a partial search space that spans several levels of link destinations.
[0032]
The search may also proceed while holding a plurality of solutions in each partial search. The processing time or the number of searches may be used as the search end condition. These alternative ways of searching may be fixed by the system or specified by the user in the search condition input process.
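Holding several solutions per partial search and stopping after a fixed number of rounds resembles what is commonly called beam search. The text does not name that technique, so the following is only a sketch under that interpretation. The parameters `beam_width` and `max_rounds`, the function name, and the graph are hypothetical. Partial paths are ranked by the product of the weights along them.

```python
def beam_search(graph, start, goal, beam_width=2, max_rounds=10):
    """Keep the beam_width best partial paths, ranked by cumulative weight.

    Returns (path, cumulative weight), or (None, 0.0) if the goal is not
    reached within max_rounds rounds.
    """
    if start == goal:
        return [start], 1.0
    beam = [(1.0, [start])]
    for _ in range(max_rounds):
        expanded = [
            (score * w, path + [doc])
            for score, path in beam
            for doc, w in graph.get(path[-1], {}).items()
            if doc not in path          # avoid revisiting a document
        ]
        if not expanded:
            return None, 0.0
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_width]
        for score, path in beam:        # best-scoring goal path in this round
            if path[-1] == goal:
                return path, score
    return None, 0.0

# Hypothetical weights: the best first link (A to C) leads to a longer route.
graph = {
    "A": {"B": 0.4, "C": 0.5},
    "B": {"L": 0.9},
    "C": {"D": 0.2},
    "D": {"L": 0.9},
}
path, score = beam_search(graph, "A", "L")
print(path, round(score, 3))
```

With these weights, keeping only the single best link per round would give A, C, D, L (three links), while keeping two solutions finds A, B, L (two links).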
[0033]
In addition, the shortest path search program shown in FIGS. 5 and 6, which realizes the functions of the shortest path search device shown in FIG. 1, may be recorded on a computer-readable recording medium, and the path search may be performed by having a computer system read and execute the program recorded on the recording medium.
[0034]
That is, the path search may be performed by recording, on a computer-readable recording medium, a shortest path search program for searching for a path that follows hyperlinks between documents from one document to another via the shortest path, and by having a computer system read and execute it. The shortest path search program comprises a directory input step for classifying documents, a search condition input step for specifying search conditions that include at least the two documents between which a route is wanted, a path search step for obtaining the route between the two input documents, and a result output step for outputting the result obtained in the path search process.
[0035]
In the shortest path search program recorded on the recording medium, the path search step comprises a starting point setting step for setting the origin document of the search, a link destination document extraction step for extracting the link destination documents of the origin document, a category determination step for determining the categories to which the origin document and the link destination documents each belong, a weight calculation step for calculating the weights between documents for all link destination documents, a path determination recording step for comparing the calculated weights, determining the optimal path from the origin document to a link destination document, and recording that path, and a search end determination step for determining whether to continue the search.
[0036]
Here, the “computer system” includes an OS and hardware such as peripheral devices.
The “computer-readable recording medium” refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system.
[0037]
Furthermore, the “computer-readable recording medium” also includes media that hold a program dynamically for a short time (a transmission medium or transmission wave), such as a communication line when a program is transmitted via a network such as the Internet or a communication line such as a telephone line, and media that hold a program for a certain period of time, such as the volatile memory inside a computer system that serves as the server or client in that case.
The program may realize only a part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.
[0038]
[Effects of the invention]
As described above, the present invention obtains the path by focusing on the weights between documents, so the probability of presenting an off-target path (one that requires an extremely large number of link selections) is low, and the probability of presenting an accurate path (one close to the shortest path) is high.
In addition, since the search is performed while being divided into partial spaces, the processing time can be suppressed within a range that can be allowed by the user.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a shortest path search device according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing a hyperlink function.
FIG. 3 is an explanatory diagram showing visible information for realizing a hyperlink function.
FIG. 4 is an explanatory diagram showing a search space for hypertext.
FIG. 5 is a flowchart showing the processing contents of the shortest path search device shown in FIG. 1.
FIG. 6 is a flowchart showing details of the path search process in FIG. 5.
FIG. 7 is an explanatory diagram showing an example of a directory.
FIG. 8 is an explanatory diagram showing an example of hyperlink weights.
FIG. 9 is an explanatory diagram illustrating an example of the weight of a hyperlink when there are a plurality of search routes.
FIG. 10 is an explanatory diagram showing an example of a hypertext search space together with hyperlink weights.
FIG. 11 is an explanatory diagram showing a partial search process during hypertext path search.
[Explanation of symbols]
101 Directory input part
102 Search condition input part
103 Path search part
104 Result output part
201 Origin point setting part
202 Link destination document extraction part
203 Category determination part
204 Weight calculation part
205 Path determination recording part
206 Search end determination part

Claims

A route search method for a route search device that has document category storage means for storing documents belonging to categories, path search means, and result output means, and that obtains a path from a starting document to an end document by following hyperlinks between the documents belonging to the categories, the method comprising:
a path search process, performed by the path search means, of assigning weights to the hyperlinks, searching a weighted directed graph using the hyperlink weight values, and obtaining a path from the starting document to the end document; and
a result output process, performed by the result output means, of outputting the path obtained in the path search process,
wherein the weight of a hyperlink is
the product of a relevance, which is the similarity between the feature vector of the category of the link source document and the feature vector of the category of the link destination document, and a fitness, which is the similarity between the feature vector of the category of the link destination document and the feature vector of the link destination document.
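The weight defined in this claim is a product of two similarities between feature vectors. The following Python sketch shows that arithmetic under one assumption: cosine similarity is used as the similarity measure, which the claim itself does not specify. The names `cosine` and `hyperlink_weight` are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity (assumed measure); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hyperlink_weight(
    src_category_vec: Sequence[float],  # feature vector of the link source document's category
    dst_category_vec: Sequence[float],  # feature vector of the link destination document's category
    dst_doc_vec: Sequence[float],       # feature vector of the link destination document itself
) -> float:
    # Relevance: similarity between the two category vectors.
    relevance = cosine(src_category_vec, dst_category_vec)
    # Fitness: how well the destination document matches its own category.
    fitness = cosine(dst_category_vec, dst_doc_vec)
    # Weight of the hyperlink: product of relevance and fitness.
    return relevance * fitness
```

With non-negative feature vectors both factors lie between 0 and 1, so a link scores highly only when the two categories are related and the destination document is typical of its category.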

The route search method according to claim 1, wherein, when the link destination document belongs to a plurality of categories, the maximum of the weights calculated for each category to which the link destination document belongs is used as the weight of the hyperlink in the route search.
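The selection rule in this claim amounts to taking a maximum over per-category weights. The Python sketch below assumes those weights have already been computed; the function name `weight_for_multi_category` and the category labels in the usage note are hypothetical.

```python
from typing import Dict


def weight_for_multi_category(per_category_weights: Dict[str, float]) -> float:
    """Return the largest of the weights computed for each category
    the link destination document belongs to."""
    if not per_category_weights:
        # Assumption of this sketch: a document with no category is an error.
        raise ValueError("link destination document has no category")
    return max(per_category_weights.values())
```

For example, a document with weights `{"finance": 0.42, "education": 0.77}` would contribute 0.77 as the weight of the hyperlink pointing to it.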

A route search device that obtains a path from a source document to a destination document by following a hyperlink between documents belonging to a category,
Document category storage means for storing documents belonging to the category;
Weight calculation means for calculating the weight of a hyperlink by multiplying a relevance, which is the similarity between the feature vector of the category of the link source document and the feature vector of the category of the link destination document connected by the hyperlink, by a fitness, which is the similarity between the feature vector of the category of the link destination document and the feature vector of the link destination document;
Path search means for searching a weighted directed graph using the hyperlink weight values and obtaining a path from the starting document to the end document;
A result output means for outputting a path obtained in the path search process;
A route search apparatus comprising:

The route search device according to claim 3, wherein, when the link destination document belongs to a plurality of categories, the path search means uses, as the weight of the hyperlink in the route search, the maximum of the weights calculated for each category to which the link destination document belongs.

A program that causes a computer to execute the route search method according to claim 1 or claim 2.

A computer-readable recording medium on which the program according to claim 5 is recorded.