JP2005018530A - Information processor, information processing program, and information processing method

Info

Publication number: JP2005018530A
Application number: JP2003183975A
Authority: JP
Inventors: Kazuaki Kidokoro; 和明城所
Original assignee: Toshiba Corp; Toshiba TEC Corp
Current assignee: Toshiba Corp; Toshiba TEC Corp
Priority date: 2003-06-27
Filing date: 2003-06-27
Publication date: 2005-01-20

Abstract

PROBLEM TO BE SOLVED: To facilitate document search by effective keyword information, by integrating document search through the use of information on an operation history.

SOLUTION: A history collection means 2 of a client terminal 1 collects history data indicating the operation history, and transmits the data to a history server 4. A history acquisition part 5 of the history server 4 stores the received history data in a history DB 10, and a keyword extraction part 7 extracts a keyword included in the history data, weights the keyword and registers the weighted keyword with an index DB 12. In addition, if it has been set to analyze the history data, a history analysis part 6 analyzes the history data. Then the keyword extraction part 7 extracts a keyword based on the analysis result, weights the keyword, and registers the weighted keyword with the index DB 12. Thereby, the client terminal 1 retrieves a desired document on the basis of the keyword information registered with the index DB 12, by using an index search part 3.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、例えば、オフィス文書やＷｅｂ文書などの電子ドキュメントを利用するコンピュータシステムにおいて、キーワードを用いてドキュメントの検索を行うための、情報処理装置、情報処理プログラム及び情報処理方法に関するものである。
【０００２】
【従来の技術】
昨今、ネットワークや高性能なパーソナルコンピュータ（以下、ＰＣと云う）の普及によって、オフィス環境における電子ドキュメントの作成と利用がより一般的になり、電子ドキュメントの情報量は日々増大の一途をたどっている。このように多量な電子ドキュメントの情報から必要な情報を検索するためのドキュメント検索技術は、電子ドキュメントの氾濫する環境においては必須の技術となっており、ドキュメント検索システムの性能はオフィス業務の効率化に重要な影響を与えている。また、ドキュメントの検索技術としては、あらかじめユーザがドキュメントに割り当てたキーワードを用いたキーワード検索技術、ドキュメントのコンテンツから指定文字列を検索する全文検索技術、または両者の特徴を合わせた検索技術であって、ドキュメントのコンテンツから自動的にキーワードを抽出してあらかじめインデックス情報を作成し、全文検索の精度の向上と検索の高速化を行うインデックス型検索技術などが存在する。その中でも、インデックスを用いた検索システムであるインデックス型検索技術は、不特定多数のドキュメントを検索する場合の性能に優れており、Ｗｅｂ情報検索サイトなどを含めて広く利用されている検索技術である。
【０００３】
インデックス型検索の性能はドキュメントのコンテンツから如何に重要なキーワードを抽出できるかに大きく依存しており、適当なキーワードを抽出するための自然言語解析処理やキーワードの出現頻度に基づくインデックス情報の重み付け処理など、インデックス情報の生成には様々な工夫が施されている。また、検索精度を向上させるために、コンテンツに含まれる文字列以外の情報を利用しようという試みも見られる。例えば、Ｇｏｏｇｌｅ社（登録商標）のＷｅｂ検索エンジンではＷｅｂドキュメント間のリンク情報をキーワードの重み付けに利用し、検索対象をＷｅｂとドキュメントに特定した範囲で高い検索成果を得ている。
【０００４】
また、例えば下記特許文献１には、簡単な操作で画像検索を行う技術が開示されている。この技術は、画像データの過去の操作内容と操作日時を検索情報として記録し、これらの検索情報と検索時に指定した検索項目が一致したときに所望の画像を抽出するものである。
さらに、例えば下記特許文献２には、Ｗｅｂブラウザ上で所望の情報を指定できるブックマークに基づいて情報検索を行う技術が開示されている。この技術は、ブックマーク情報とブックマークの利用履歴からキーワードを抽出し、キーワードにマッチする情報をユーザに通知するものである。
【０００５】
【特許文献１】
特開平７−２３９８５４号公報（段落番号００１７〜００４０、及び図１〜図３参照）
【特許文献２】
特開２００１−１９５４２２号公報（段落番号０００５〜００１３、及び図１〜図４）
【０００６】
【発明が解決しようとする課題】
しかしながら、インデックス情報の生成は依然として情報検索システムの大きな技術課題であり、検索精度の向上だけではなく、インデックス生成処理の高度化によるＣＰＵに対する負荷の増大への対応、検索対象ドキュメントのカバー率を向上させるための多種多様なドキュメントフォーマットへの対応、などといった検索を行うためのシステム全体の技術的向上もドキュメント検索システムの大きな技術課題となっている。そのような面から考えると、従来のドキュメント検索装置ないしドキュメント検索装置システムは、最適なシステム環境でインデックス情報が生成されているとは云えず、結果的には、フィールドにおける検索の柔軟性が欠けると共に、検索精度のさらなる向上を阻んでいる。上記の特許文献１及び特許文献２の技術においてもこれらの問題点は依然として解決されていない。
【０００７】
本発明は、上述の課題に鑑みてなされたもので、その目的とするところは、ドキュメントの使われ方や使われた経緯などを示すドキュメントへのアクセス履歴情報を用いてドキュメント検索方法を一元化することにより、効果的なキーワード情報を少ない処理負荷で生成することによってドキュメントの検索を容易にすることができる情報処理装置、情報処理プログラム及び情報処理方法を提供することにある。
【０００８】
【課題を解決するための手段】
上記の目的を達成するため、本発明は、ドキュメントの検索を行うためのキーワードを生成する情報処理装置であって、前記ドキュメントを操作したときのアクセス履歴から得られる履歴情報を取得する履歴取得手段と、前記履歴取得手段により取得された履歴情報に基づいて、該履歴情報から前記ドキュメント検索用のキーワードを抽出するキーワード抽出手段とを備えてなるものである。
【０００９】
また、本発明は、ドキュメントの検索を行うためのキーワードを生成する処理をコンピュータに実行させるための情報処理プログラムであって、前記ドキュメントを操作したときのアクセス履歴から得られる履歴情報を取得する履歴取得機能と、前記履歴取得ステップにより取得された履歴情報に基づいて、該履歴情報から前記ドキュメント検索用のキーワードを抽出するキーワード抽出機能とをコンピュータに実行させるものである。
【００１０】
また、本発明は、ドキュメントの検索を行うためのキーワードを生成する情報処理方法であって、前記ドキュメントを操作したときのアクセス履歴から得られる履歴情報を取得する履歴取得ステップと、前記履歴取得ステップにより取得された履歴情報に基づいて、該履歴情報から前記ドキュメント検索用のキーワードを抽出するキーワード抽出ステップとを備えてなるものである。
【００１１】
【発明の実施の形態】
本発明のドキュメント検索装置は、ドキュメントの操作内容情報、ドキュメントを使った業務内容情報、又はドキュメントの関連文書情報などのように、ドキュメントを操作したときの履歴から得られる操作履歴情報に基づいてドキュメント検索用のキーワードを生成し、このキーワードを用いてインデックス情報を生成してデータベースに登録し、以後このインデックス情報を用いてドキュメントの検索を行うようにしたり、或いは、生成されたキーワードをドキュメントに埋め込むことにより以後そのキーワードを用いてドキュメント検索を行うようにする。これによって、実際に使われる可能性が高くて精度の高い情報を検索キーワードとすることができると共に、非テキスト文章の検索にも適用することができる。また、全文検索によって全てのドキュメントを網羅した高精度な検索を行うこともできる。
【００１２】
以下、本発明の実施の形態を説明する。
図１は本発明の実施の形態におけるドキュメント検索システムの構成を示すブロックである。このドキュメント検索システムは、ユーザがドキュメントの操作を行うクライアント端末１と、ネットワーク上に存在し、履歴の取得（保存）、履歴の解析、キーワードの抽出及びインデックス情報の作成、及び各種データベース（以下、データベースをＤＢと略す）の管理などを行う履歴サーバ４と、ユーザのドキュメントを保存し、ネットワークを介してドキュメントへのアクセスを可能にするファイルサーバ１３とがＬＡＮ１４に接続された構成となっている。
【００１３】
ユーザがドキュメント操作を行うクライアント端末１は、ドキュメントの履歴情報を収集する履歴収集手段としての履歴収集部２と所定のキーワードを含むドキュメントを検索するインデックス検索手段としてのインデックス検索部３とを備えた構成となっている。
【００１４】
履歴収集部２は、クライアント端末１に組み込まれたソフトウェアモジュールであり、ユーザの行うアプリケーション操作のイベントを検知してアクセスしたドキュメント情報を収集し、このドキュメント情報を履歴サーバ４に送信する機能を備えている。
【００１５】
インデックス検索部３は、クライアント端末１のソフトウェアアプリケーションであり、履歴サーバ４に存在するインデックスＤＢ１２からユーザの指定したキーワードを含むドキュメントまたはドキュメントの部位を検索し、検索結果を重みの大きい順にユーザに提示する機能を備えている。
【００１６】
履歴サーバ４は、履歴取得手段としての履歴取得部５、履歴解析手段としての履歴解析部６、キーワード抽出手段としてのキーワード抽出部７、インデックス作成部９、履歴ＤＢ１０、ドキュメント情報ＤＢ１１、及びインデックスＤＢ１２を備えた構成となっている。
【００１７】
尚、履歴サーバ４は、図１に示す構成では、ネットワーク上に存在してサーバ機能として実現し、履歴情報の保存、履歴情報の解析、及びインデックス管理を行う。しかし、このようなサーバ機能はクライアント端末１に存在させてもよい。また、履歴情報の保存、解析、及びインデックス管理の機能はそれぞれ別々のサーバに持たせてもよい。尚、インデックス管理の機能は既存のインデックスを用いた検索システムを使用する。
【００１８】
履歴取得部５は、履歴サーバ４上で動作するソフトウェアモジュールであり、クライアント端末１から送信されたドキュメントアクセスの履歴データを受信し、履歴ＤＢ１０へ保存する機能を備えている。
【００１９】
履歴解析部６は、履歴ＤＢ１０に保存されたドキュメントアクセスの履歴データを解析し、ドキュメントのアクセス回数や特定のドキュメントまたはドキュメントの部位の関連文書を抽出し、ドキュメント情報ＤＢ１１へ保存する機能を備えている。ここで、関連文書とは、そのドキュメント／部位を作成するときに参照した文書／部位、そのドキュメント／部位を利用して作成された派生文書、そのドキュメント／部位にアクセスがあるときに並行して利用されることの多い同時利用文書などの情報をいう。
【００２０】
なお、この履歴解析部６は、本発明による検索効果を増大するための付加的なものであり、履歴解析部６が本発明の全てにおける構成要件となるものではない。
【００２１】
履歴情報から関連文書を抽出する場合、履歴情報だけでは何れが関連する履歴であるか分からないため、別途ドキュメントアクセス履歴からの業務解析処理と、業務解析処理によって作成された業務履歴からのドキュメント関連付け処理を行う。
ドキュメントアクセス履歴からの業務履歴生成処理は例えば次のような処理を行う。業務の区切りの際に発生する印刷やファイルの保存、メールの送信といった特徴的なドキュメントアクセス履歴に着目し、時系列に連続するドキュメントアクセス履歴を区切ることで業務履歴を作成している。
業務履歴からドキュメントの関連付け処理は例えば次のような処理を行う。スプレッドシートでデータの集計を行ってから、プレゼンテーション用のドキュメントに集計結果を貼り付けたり過去のメールを参照して、新しいメールを作成するように、業務の種別に関らずある程度決まったパターンに従ったアクセスが発生するものと考えられるので、このアクセスパターンを予め定義しておき業務履歴内のドキュメントアクセス履歴と比較することで、関連を導くことができる。例えば、「同じ業務内でドキュメントＡを開いた後、更新されたドキュメントＢ」というパターンからは、「ＡはＢの参照情報である」という関連と逆に「ＢはＡの派生情報である」という関連を導くことができる。
【００２２】
キーワード抽出部７は、重み付け手段としての重み付け部８を備えており、履歴ＤＢ１０及びドキュメント情報ＤＢ１１に保存された情報からキーワード情報を抽出し、このキーワード情報に重み付けを行ってインデックスＤＢ１２へ記録する。尚、キーワードの抽出方法や重み付けの方法については後述する。
インデックス作成手段としてのインデックス作成部９は、キーワード抽出部７により抽出されたキーワード及び履歴ＤＢ１０に保存された履歴情報、及びドキュメント情報ＤＢ１１に保存されたドキュメント情報（履歴解析部６の解析結果）に基づいてインデックス情報を作成する。
【００２３】
履歴ＤＢ１０は、クライアント端末１から送信された履歴情報を保存するデータベースである。尚、履歴ＤＢ１０の詳細な構成については図２を用いて後述する。
ドキュメント情報ＤＢ１１は、履歴ＤＢ１０に保存された履歴情報を解析して得られた情報を保存するデータベースである。尚、ドキュメント情報ＤＢ１１の詳細な構成については図３を用いて後述する。
インデックスＤＢ１２は、インデックス作成部９により作成されたインデックス情報を記録するデータベースである。インデックスＤＢ１２の詳細な構成については図４を用いて後述する。尚、このインデックスＤＢ１２は、既存のインデックス検索システムに付随するインデックスＤＢを流用しても構わない。
【００２４】
ファイルサーバ１３は、ユーザのドキュメントを保存するファイルＤＢ１３ａを有し、ネットワークを介してドキュメントへのアクセスを可能にするサーバであるが、本発明での管理対象となるドキュメントはこのファイルサーバ１３上だけでなく、クライアント端末１上、または、インターネット上のドキュメントであっても構わない。
【００２５】
図２は、図１に示す履歴ＤＢ１０の構成と内容の一例を示す図である。図２において、『日時』の項目欄には、クライアント端末１でドキュメントアクセスが発生した日時を記録する。例えば、「２００３／０３／１９　１４：３４：１２」と云うように、アクセス発生日時を年月日から時分秒まで記録する。
『操作内容』の項目欄には、ユーザがドキュメントに対して行った操作内容を記録する。尚、操作の種別には、Ｏｐｅｎ、Ｓａｖｅ、Ｄｅｌｅｔｅ、Ｐｒｉｎｔ、Ｓｅｎｄ（メール送信）などが含まれる。
【００２６】
『ユーザ名』の項目欄には、操作を行ったユーザを識別するためのユーザ情報を記録する。例えば、操作を行ったユーザ個人の名前を記録する。
『文書』の項目欄には、操作対象となったドキュメントを識別する情報を記録する。例えば、操作対象がｗｅｂ文書であればＵＲＬ（例えば、ｈｔｔｐ：／／ｗｗｗ．ｂ−ｃａｒ．ｃｏ．ｊｐ）を記録し、Ｗｉｎｄｏｗｓ共有ファイルであれば、Ｗｉｎｄｏｗｓ、サーバ名、共有名、パス名、ファイル名を含むネットワークパス（例えば、￥￥ｓｅｒｖｅｒ１￥ｄｓｒ￥ｃａｒｓ．ｄｏｃ）を記録する。
【００２７】
『文書タイトル』の項目欄には、ユーザがドキュメントを操作した際にアプリケーションから得られるドキュメントのタイトルを記録する。例えば、新車情報のドキュメントであれば各自動車会社の新車情報（例えば、Ａ社新車情報）や新車に関するイベント情報（例えば、春の新車）などを記録する。
『ページ』の項目には、操作を行った対象のページ番号を記録する。尚、操作の対象がドキュメント全体である場合はこのフィールドは空白になる。
『送信／出力先』の項目欄には、ドキュメントの送信や出力操作を行った履歴に対して、送信先または出力先の情報を記録する。例えば、操作の内容がプリントである場合は、プリントを行ったプリンタ名（例えば、ＰｒｉｎｔｅｒＡ）、ＥメールでのＳｅｎｄ（送信）であれば、あて先に含まれるユーザのユーザ名（例えば、藤原）を記録する。
【００２８】
『周辺キーワード』の項目欄には、操作を行ったドキュメントに含まれるキーワード情報（例えば、車種Ａ、車種Ｂ、バイクなど）を記録する。尚、キーワード情報は操作対象となったドキュメントにあらかじめ含まれている場合もあるし、クライアント端末１の履歴収集手段２がドキュメントのコンテンツから動的に生成したものでもよい。キーワード情報があらかじめドキュメントに含まれる例としては、ＭｉｃｒｏｓｏｆｔＯｆｆｉｃｅアプリケーション（登録商標）で作成された文書などがある。ドキュメントのコンテンツから動的にキーワードを生成するには、既存の方法がいくつか用意されている。
【００２９】
例えば、コンピュータがドキュメントの中の品詞を分解して自然言語を解析する形態素解析処理によってテキスト情報を語彙の単位に分割し、それぞれの語彙のドキュメント内での出現回数をカウントして、出現する回数の多いものをキーワードとするような方法が用いられる。また、検索操作の対象がドキュメント全体ではなく、ページなどのようにドキュメントの一部であった場合は、ドキュメント全体からキーワードを抽出するのではなく、対象のページに含まれる情報からキーワードを抽出する方法が用いられる。このとき、周辺キーワードは必須ではなく、システムの負荷に応じて処理を省いてもよい。
【００３０】
図３は、図１に示すドキュメント情報ＤＢの構成と内容の一例を示す図である。すなわち、この図は、履歴ＤＢ１０に保存された履歴情報を解析して得られるドキュメント情報を記録するドキュメント情報ＤＢ１１の構成と内容を示している。
『文書』の項目欄には、ドキュメント情報の検索対象となる文書の識別子（例えば、ｃ：￥ａｂｃ．ｄｏｃ）を記述する。このフィールドに含まれる情報の記述方法は、図２に示す履歴ＤＢ１０の構成と内容で説明したときの『文書』項目の情報と同じである。すなわち、検索対象がｗｅｂ文書であればＵＲＬを記録し、Ｗｉｎｄｏｗｓ共有ファイルであれば、Ｗｉｎｄｏｗｓ、サーバ名、共有名、パス名、ファイル名を含むネットワークパスを記録する。
【００３１】
『ページ』の項目欄には、検索対象がドキュメント全体でなく、ドキュメント中の一つの部位である場合は検索対象となるページ番号を記録する。尚、検索対象がドキュメント全体である場合はこのフィールドは空白となる。
『アクセス回数』の項目欄には、検索対象のドキュメントまたは部位に対して、発生したアクセスの回数を記録する。
『参照文書／参照回数』の項目欄には、検索対象となるドキュメントまたは部位の作成や更新時に参照したドキュメント（例えば、ｄｅｆ．ｄｏｃ）を記録する。また、頻繁に参照したドキュメントを他の文書と区別するために参照回数をドキュメント名と併せて記録する。例えば、ｄｅｆ．ｄｏｃを３回参照した場合には、ｄｅｆ．ｄｏｃ／３と云うように記録する。
【００３２】
『派生文書／被参照回数』の項目欄には、検索対象となるドキュメントまたは部位を参照して作成されたドキュメント（例えば、ｃａｒｓ．ｄｏｃ）を記録する。また、派生文書作成時には検索対象文書がどれだけ参照されたかの被参照回数を記録する。
『同時利用文書』の項目欄には、検索対象となるドキュメントまたは部位にアクセスが発生するときに、同時に利用されることの多いドキュメント（例えば、ｈｔｔｐ：／／ｗｗｗ．ａ−ｃａｒ．ｃｏ．ｊｐ）を記録する。
『キーワード』の項目欄には、検索対象となるドキュメントまたは部位に含まれるキーワード情報（例えば、車種Ａ、車種Ｂなど）を記録する。このキーワード情報は、履歴ＤＢ１０の周辺キーワードに記録されているキーワード情報と、履歴ＤＢ１０の文書タイトルに記録されているタイトルから抽出したキーワードとを含んでいる。
【００３３】
図４は、図１に示すインデックスＤＢ１２の構成と内容の一例を示す図である。すなわち、この図は、キーワードを用いたドキュメントの検索に用いるインデックスＤＢ１２の構成と内容を示している。実際に利用されるインデックスＤＢ１２は、検索処理を高速化するためにＤＢの構成も効率化されているが、ここでは簡単のために単純な構成を例示している。
【００３４】
『文書』の項目欄には、検索対象となるドキュメントの識別情報（例えば、ｃ：￥ａｂｃ．ｄｏｃ）を記録する。この例では、先に説明した履歴ＤＢ１０またはドキュメント情報ＤＢ１１に含まれる『文書』の項目欄のフィールドの情報と同じ情報が記録されている。
『ページ』の項目欄には、検索の結果がドキュメント全体ではなく、ドキュメントの部位である場合は部位のページ番号を記録する。
『キーワード』の項目欄には、対象となるドキュメント、または部位に含まれるキーワードのリスト（例えば、Ａ社、車、カーライフなど）を記録する。インデックスＤＢ１２を既存のドキュメント検索システムと本発明によるドキュメント検索システムで共有する場合は、既存のドキュメント検索システムで生成されたキーワード情報と、本発明によるドキュメント検索システムで生成されたキーワード情報の両方がこのキーワードのリストに存在する。
【００３５】
『重み』の項目欄には、それぞれのキーワードに対する重み（重要度）を記録する。重みの値の大きいキーワードの方が対象のドキュメント／部位をよりよく表すキーワードと認識される。例えば、『文書』項目欄の“ｃ：￥ａｂｃ．ｄｏｃ”においては、「車種Ａ」の重みは“１０”であって、「車種Ｂ」の重みは“３”であるので、「車種Ａ」は「車種Ｂ」よりキーワードの重要度がはるかに高いことを意味している。
尚、インデックスＤＢ１２を用いたドキュメントの検索処理では、ユーザが指定したキーワードでこのインデックスＤＢ１２のキーワードフィールドを検索し、キーワードを含むドキュメント／部位を特定して、キーワードに割り当てられた重みの大きい順にソートして検索結果とする。
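The lookup described in paragraph 0035 (match the user's keyword against the keyword field, then sort the hits by weight) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the in-memory row layout `IndexRow` and the function `search_index` are hypothetical stand-ins for index DB 12, and the third example row (`c:\def.doc`) is invented for the demonstration. The first two rows reuse the example values given in the text.

```python
from typing import NamedTuple, Optional


class IndexRow(NamedTuple):
    """One simplified row of the index DB (hypothetical layout)."""
    document: str          # document identifier (path or URL)
    page: Optional[int]    # page number, or None when the row covers the whole document
    keyword: str
    weight: int            # larger weight = keyword describes the target better


def search_index(rows, keyword):
    """Return the rows whose keyword matches, ordered from largest to smallest weight."""
    hits = [row for row in rows if row.keyword == keyword]
    return sorted(hits, key=lambda row: row.weight, reverse=True)


rows = [
    IndexRow(r"c:\abc.doc", None, "車種Ａ", 10),
    IndexRow(r"c:\abc.doc", None, "車種Ｂ", 3),
    IndexRow(r"c:\def.doc", 2, "車種Ａ", 4),   # invented row for illustration
]
results = search_index(rows, "車種Ａ")
for row in results:
    print(row.document, row.page, row.weight)
```

A production index would use an inverted structure rather than a linear scan; the text itself notes that the real index DB is organized for speed and that the figure shows a simplified layout.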
【００３６】
次に、フローチャートを用いて、本発明のドキュメント検索システムが行う動作シーケンスを詳細に説明する。図５は、本実施の形態におけるドキュメント検索システムにおいてクライアント端末が行う履歴収集処理の流れを示すフローチャートである。まず、クライアント端末１に組み込まれた履歴収集部２がユーザのアプリケーション操作イベント処理を開始する。つまり、履歴収集部２がユーザの行うドキュメントの操作を検知して一連の検索処理を開始する（ステップＳ１）。検索処理が開始されると、履歴収集部２は、イベントの内容に応じて、アプリケーションからファイル名、ページ番号、タイトル、操作内容などのドキュメント情報を操作履歴情報として収集する。このとき、履歴収集部２は、開かれているドキュメントのコンテンツから周辺キーワードの抽出を行う（ステップＳ２）。そして、履歴収集部２は、ステップＳ２で収集した情報を操作履歴情報として履歴サーバ４へ送信する（ステップＳ３）。
【００３７】
図６は、本実施の形態におけるドキュメント検索システムにおいて履歴サーバ４が行う、操作履歴情報の受信からインデックスＤＢの更新までの処理の流れを示すフローチャートである。つまり、この図は、履歴サーバ４が操作履歴情報からキーワードを抽出して登録するまでの処理の流れを示している。
【００３８】
まず、履歴サーバ４が履歴の収集態勢に入ると（ステップＳ１１）、履歴サーバ４の履歴取得部５が、クライアント端末１から履歴データを受信したか否かを判断する（ステップＳ１２）。ここで、まだ履歴データを受信していなければ（ステップＳ１２，Ｎｏ）、履歴データを受信するまで待機する。一方、履歴取得部５がクライアント端末１から履歴データを受信したことを検知したときは、キーワード検索を行うための一連の処理を開始する（ステップＳ１２，Ｙｅｓ）。
【００３９】
すなわち、履歴サーバ４においては、履歴取得部５が、受信した履歴データを履歴ＤＢ１０に保存し、キーワード抽出部７が、受信した履歴データに含まれるキーワード情報を抽出する（ステップＳ１３）。尚、キーワード抽出部７によるキーワードの抽出処理の内容は図７を用いて後述する。さらに、キーワード抽出部７は、重み付け部８により抽出されたキーワード情報を重み付けした上で、このキーワード情報をインデックスＤＢ１２へ登録する（ステップＳ１４）。これによって、クライアント端末１のインデックス検索部３は、インデックスＤＢ１２に登録されているキーワード情報を重みの高い順に並べて所望のドキュメントを検索することができる。尚、キーワード抽出部７が行うキーワードの重み付け処理の内容は図８を用いて後述する。
【００４０】
さらに、本実施の形態では、次のステップＳ１５からステップＳ１８までを付加することによってキーワードの抽出と登録を効果的に行うことができる。すなわち、ドキュメント検索システムが履歴の解析を行うように設定されているか否か（つまり、履歴解析部６が構成されているか否か）を判断し（ステップＳ１５）、履歴の解析を行うように設定されている（つまり、履歴解析部６が構成されている）場合は（ステップＳ１５，Ｙｅｓ）、履歴解析部６が履歴データの解析処理を開始する。そして、履歴解析部６は、履歴ＤＢ１０に記録されている履歴データを用いて、ドキュメントのアクセス回数、関連文書などの履歴データの解析を行う（ステップＳ１６）。尚、履歴データから関連する文書を解析する処理は既存の処理手順を用いて行う。また、ステップＳ１５でドキュメント検索システムが履歴の解析を行うように設定されていない（つまり、履歴解析部６が構成されていない）場合は（ステップＳ１５，Ｎｏ）、そのまま検索処理を終了する。
【００４１】
次に、ステップＳ１６で履歴解析部６が履歴の解析を行った場合は、キーワード抽出部７が、履歴解析部６の行った解析結果（ドキュメント情報ＤＢの内容）からキーワード情報を抽出して重み付けを行う（ステップＳ１７）。尚、キーワード抽出部７が行うキーワード情報の抽出と重み付けの処理内容は図７と図８を用いて後述する。そして、インデックス作成部９が、キーワード抽出部７により抽出されたキーワード及び重み付けの情報を用いてインデックスを作成し、インデックスＤＢ１２へ登録する（ステップＳ１８）。これによって、クライアント端末１のインデックス検索部３は、インデックスＤＢ１２に登録されているインデックス情報に基づいてキーワード情報を重みの高い順に並べて所望のドキュメントを検索することができる。
【００４２】
図７は、履歴サーバ４のキーワード抽出部７がキーワードを抽出するときのルールの一例を示す図である。すなわち、図７は、図６に示すフローチャートのステップＳ１３、Ｓ１４において、キーワード抽出部７が、履歴ＤＢ１０及びドキュメント情報ＤＢ１１から履歴データに含まれるキーワードを抽出し、インデックス作成部９がインデックスＤＢ１２へ追加キーワードとして登録したり、ステップＳ１８、Ｓ１９において、キーワード抽出部７が解析結果から抽出したキーワードをインデックス作成部９がインデックスＤＢ１２へ追加キーワードとして登録したりするときのルールの一例を示したものである。尚、図７に示すキーワード抽出ルールは、ドキュメント検索システムの構成や負荷に応じて任意に選択して使用することができ、全てのルールを使用する必要はない。また、新たにルールを追加して使用することもできる。
【００４３】
以下、キーワード抽出部７が行うキーワード抽出のルールを図７に従って詳細に説明する。
『作成者を追加する』ルールにおいては、ドキュメントを作成した作成者名を追加キーワードとして抽出し、インデックス作成部９によりインデックスＤＢ１２へ登録させる。
『作成日を追加する』ルールにおいては、ドキュメントを作成した日時を追加キーワードとして抽出し、インデックス作成部９によりインデックスＤＢ１２へ登録させる。
『ドキュメントの所在するサーバ名を追加する』ルールにおいては、ドキュメントの所在するサーバのサーバ名を追加キーワードとして抽出し、インデックス作成部９によりインデックスＤＢ１２へ登録させる。
【００４４】
『ドキュメントへのアクセス内容を追加する』ルールにおいては、閲覧、印刷、削除、転送、送信など、ドキュメントに対してユーザが行った操作内容を追加キーワードとして抽出し、インデックス作成部９によりインデックスＤＢ１２へ登録させる。
『送付先を追加する』ルールにおいては、ドキュメントがＥメールなどにより送信されていた場合には、あて先（送信先）のユーザ名を追加キーワードとして抽出し、インデックス作成部９によりインデックスＤＢ１２へ登録させる。
『アクセス頻度を追加する』ルールにおいては、ドキュメントのアクセス頻度に応じて、「高頻度」「低頻度」などの追加キーワードを抽出し、インデックス作成部９によりインデックスＤＢ１２へ登録させる。
すなわち、アクセス頻度が高いときは「高頻度」、アクセス頻度が低いときは「低頻度」としてインデックスＤＢ１２へ登録する。
【００４５】
『アクセス回数を追加する』ルールにおいては、ドキュメントのアクセス回数に応じて、トータルアクセス回数の多い場合には「人気」、トータルアクセス回数が少ない場合には「不人気」などの追加キーワードを抽出し、インデックス作成部９によりインデックスＤＢ１２へ登録させる。
『出力先を追加する』ルールにおいては、ドキュメントが印刷出力されている場合は出力先のプリンタ名を追加キーワードとして抽出し、インデックス作成部９によりインデックスＤＢ１２へ登録させる。
すなわち、データの出力先のデバイス名をインデックスＤＢ１２へ登録する。
【００４６】
『関連文書のキーワードを追加する』ルールにおいては、ドキュメント情報ＤＢ１１に記録されている参照文書、派生文書、同時利用文書の関連文書に含まれるキーワード情報を対象文書の追加キーワードとして抽出し、インデックス作成部９によりインデックスＤＢ１２へ登録させる。
『関連文書のタイトルを追加する』ルールにおいては、ドキュメント情報ＤＢ１１に記録された参照文書、派生文書、同時利用文書の関連文書のタイトルに含まれるキーワード情報を対象文書の追加キーワードとして抽出し、インデックス作成部９によりインデックスＤＢ１２へ登録させる。
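The rule set of FIG. 7 (paragraphs 0043 to 0046) can be read as a list of independent, selectable rules, each of which turns one history record into zero or more additional keywords. The sketch below illustrates that reading for three of the rules only. The dictionary keys, the function names, and the way the server name is parsed out of a network path are all assumptions made for this example.

```python
def add_operation(record):
    """'Add the access content' rule: the operation itself becomes a keyword."""
    return [record["operation"]]


def add_server_name(record):
    """'Add the server name' rule, assuming a \\\\server\\share\\path style identifier."""
    document = record["document"]
    if document.startswith("\\\\"):
        return [document[2:].split("\\", 1)[0]]
    return []


def add_destination(record):
    """'Add the destination / output device' rules: recipient name or printer name."""
    destination = record.get("destination")
    return [destination] if destination else []


# Rules may be enabled selectively, as the text allows.
ENABLED_RULES = [add_operation, add_server_name, add_destination]


def extract_additional_keywords(record, rules=ENABLED_RULES):
    """Apply every enabled rule and collect the keywords without duplicates."""
    keywords = []
    for rule in rules:
        for keyword in rule(record):
            if keyword not in keywords:
                keywords.append(keyword)
    return keywords


record = {
    "operation": "Print",
    "document": r"\\server1\dsr\cars.doc",
    "destination": "PrinterA",
}
keywords = extract_additional_keywords(record)
print(keywords)
```

Keeping each rule as a separate function mirrors the statement that rules can be chosen according to system configuration and load, and that new rules can be added.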
【００４７】
図８は、キーワード抽出部７がキーワードに重み付けをするときのルールの一例を示す図である。すなわち、図８は、図６に示すフローチャートのステップＳ１４及びステップＳ１８において、キーワード抽出部７がキーワードに重み付けをしてインデックスＤＢ１２へ登録するときのルールの一例を示した図である。
尚、図８に示すルールは、ドキュメント検索システムの構成や負荷に応じて任意に選択して使用することができ、全てのルールを使用する必要はない。また、新たにルールを追加して使用することもできる。
【００４８】
『キーワードの含まれるドキュメントのアクセス回数＞１０』である場合のように、関連文書に対するアクセス回数が多い場合のルールとしては、ドキュメントの関連文書のキーワードを登録するときにキーワードの重みを増加する。すなわち、インデックス作成部９は、キーワードの重みを“１”増加してインデックスＤＢ１２へ登録する。
『キーワードの含まれる部位のアクセス回数＞１０』のように、関連文書の部位に対するアクセス回数が多い場合のルールとして、ドキュメント関連文書の部位のキーワードを登録するときにキーワードの重みを増加する。すなわち、インデックス作成部９は、キーワードの重みを“１”増加してインデックスＤＢ１２へ登録する。
【００４９】
『キーワードの含まれる部位の出力回数＞１』のように、キーワードを含む部位を過去に印刷した履歴がある場合のルールとして、重要なドキュメント／部位であると考えてキーワードの重みを増加する。すなわち、インデックス作成部９は、キーワードの重みを“１”増加してインデックスＤＢ１２へ登録する。
『参照回数が２以上の関連ドキュメントのキーワード』のように、ドキュメントを作成したり更新したりしたときの参照文書／部位のキーワードを登録するとき、参照文書／部位に対する参照回数が多い場合は、キーワードの関連性が高いと考えてキーワードの重みを増加する。すなわち、インデックス作成部９は、キーワードの重みを“１”増加してインデックスＤＢ１２へ登録する。
【００５０】
『派生ドキュメントのキーワード』のルールとして、ドキュメントを参照して作成された派生文書／部位のキーワードを登録するとき、派生文書／部位からの参照回数が多い場合には、キーワードの関連性が強いと考えてキーワードの重みを増加する。すなわち、インデックス作成部９は、キーワードの重みを“１”増加してインデックスＤＢ１２へ登録する。
『印刷を行った関連ドキュメントのキーワード』のルールとして、関連ドキュメントに含まれるキーワードのうち、印刷や送信など特定の重要度の高い処理を行った関連文書のキーワードの重みを増加する。すなわち、インデックス作成部９は、キーワードの重みを“１”増加してインデックスＤＢ１２へ登録する。
『特定ユーザの作成文書』のルールとして、特定のユーザの作成した文書に含まれるキーワードの重みを増加する。すなわち、インデックス作成部９は、キーワードの重みを“１”増加してインデックスＤＢ１２へ登録する。
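The weighting rules of FIG. 8 (paragraphs 0048 to 0050) each raise a keyword's weight by 1 when their condition holds. A minimal sketch of that scheme follows. The context dictionary, its key names, and the function `weight_increment` are hypothetical. The derived-document rule is left out because the text gives no threshold for it, and the output-count rule follows the rule label literally ("> 1").

```python
# Each entry: (rule label, predicate over a context dictionary).
WEIGHT_RULES = [
    ("access count of the containing document > 10",
     lambda c: c.get("doc_access_count", 0) > 10),
    ("access count of the containing part > 10",
     lambda c: c.get("part_access_count", 0) > 10),
    ("output count of the containing part > 1",
     lambda c: c.get("part_output_count", 0) > 1),
    ("keyword of a related document referenced 2 or more times",
     lambda c: c.get("reference_count", 0) >= 2),
    ("keyword of a related document that was printed or sent",
     lambda c: c.get("related_printed_or_sent", False)),
    ("document created by a designated user",
     lambda c: c.get("author") in c.get("designated_users", ())),
]


def weight_increment(context, rules=WEIGHT_RULES):
    """Return how much to add to the keyword's weight: +1 per matching rule."""
    return sum(1 for _label, applies in rules if applies(context))


context = {
    "doc_access_count": 12,
    "reference_count": 3,
    "author": "藤原",
    "designated_users": {"藤原"},
}
print(weight_increment(context))
```

The function returns only the increment, so the starting weight of a keyword, which the text does not specify, is left to the caller.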
【００５１】
以上述べた実施の形態は本発明を説明するための一例であり、本発明は、上記の実施の形態に限定されるものではなく、発明の要旨の範囲で種々の変形が可能である。すなわち、上記の実施の形態では、抽出したキーワード情報をインデックスＤＢ１２へ登録するようにしたが、これに限ることはなく、ドキュメント自体にキーワード情報を埋め込んでもよい（付加してもよい）。例えば、ＭｉｃｒｏｓｏｆｔＯｆｆｉｃｅアプリケーション（登録商標）で作成される文書には、文書内にキーワード情報を記録する領域が用意されているし、ＨＴＭＬ、ＸＭＬ文書でもキーワード情報をドキュメントのコンテンツとは別のメタ情報として記録するための領域が規定されている。また、本発明のドキュメント検索システムで抽出されたキーワード情報がドキュメント自体に含まれることによって、インデックスＤＢを持たない全文検索システムなどと併用した場合であっても、本発明のドキュメント検索システムは前述の実施の形態と同様な作用効果を実現することができる。
【００５２】
なお、本発明の実施の形態では、装置内部に発明を実施する機能が予め記録されている場合で説明したが、これに限らず同様の機能をネットワークから装置にダウンロードしても良いし、同様の機能を記録媒体に記憶させたものを装置にインストールしても良い。記録媒体としては、ＣＤ−ＲＯＭ等プログラムを記憶でき、且つ装置が読取り可能な記録媒体であれば，その形態は何れの形態であっても良い。またこのように予めインストールやダウンロードにより得る機能は装置内部のＯＳ（オペレーティング・システム）等と協働してその機能を実現させるものであっても良い。
【００５３】
【発明の効果】
以上説明したように、本発明によれば、ユーザの操作履歴に基づいた信頼性の高いキーワード情報をドキュメントに付加しているので検索精度を高めることができる。また、関連文書の情報を用いることでドキュメント自体には含まれていないキーワードを付加することができ、検索の柔軟性を高めることができる。さらに、検索対象のドキュメントが画像や音声など、テキスト情報を含まないドキュメントであっても、履歴情報や関連文書を用いてキーワード情報を付加し、テキストを含むドキュメントと同様にキーワードによる検索の対象とすることができる。
【図面の簡単な説明】
【図１】本発明の実施の形態におけるドキュメント検索システムの構成を示すブロックである。
【図２】図１に示す履歴ＤＢの構成と内容の一例を示す図である。
【図３】図１に示すドキュメント情報ＤＢの構成と内容の一例を示す図である。
【図４】図１に示すインデックスＤＢの構成と内容の一例を示す図である。
【図５】本発明の実施の形態におけるドキュメント検索システムにおいてクライアント端末が行う履歴収集処理の流れを示すフローチャートである。
【図６】本発明の実施の形態におけるドキュメント検索システムにおいて履歴サーバが行う、操作履歴情報の受信からインデックスＤＢの更新までの処理の流れを示すフローチャートである。
【図７】履歴サーバのキーワード抽出部がキーワードを抽出するときのルールの一例を示す図である。
【図８】キーワード抽出部がキーワードに重み付けをするときのルールの一例を示す図である。
【符号の説明】
１クライアント端末、２履歴収集部、３インデックス検索部、４履歴サーバ、５履歴取得部、６履歴解析部、７キーワード抽出部、１０履歴ＤＢ、１１ドキュメント情報ＤＢ、１２インデックスＤＢ、１３ファイルサーバ、１４ＬＡＮ。
[0001]
[Technical field to which the invention belongs]
The present invention relates to an information processing apparatus, an information processing program, and an information processing method for searching for a document using a keyword in a computer system that uses an electronic document such as an office document or a Web document.
[0002]
[Prior art]
In recent years, with the spread of networks and high-performance personal computers (hereinafter referred to as PCs), the creation and use of electronic documents in office environments has become commonplace, and the volume of information held in electronic documents grows day by day. Document retrieval technology for finding the necessary information in this mass of electronic documents has become indispensable in an environment flooded with them, and the performance of a document retrieval system has a significant effect on the efficiency of office work. Document search techniques include keyword search, which uses keywords that a user has assigned to documents in advance; full-text search, which searches document contents for a specified character string; and index-type search, which combines features of both by automatically extracting keywords from document contents and creating index information in advance, thereby improving the accuracy of full-text search and speeding it up. Of these, index-type search, that is, search using an index, performs well when searching a large, unspecified number of documents and is widely used, including by Web search sites.
[0003]
The performance of index-type search depends heavily on how well important keywords can be extracted from document contents, and various measures are taken when generating index information, such as natural language analysis to extract appropriate keywords and weighting of index information based on how often keywords appear. There are also attempts to use information other than the character strings in the content in order to improve search accuracy. For example, the Web search engine of Google (registered trademark) uses link information between Web documents to weight keywords, and achieves good search results within a scope where the search target is restricted to Web documents.
[0004]
Further, for example, Patent Document 1 below discloses a technique for performing an image search with a simple operation. This technique records past operation contents and operation date / time of image data as search information, and extracts a desired image when the search information matches a search item specified at the time of search.
Furthermore, for example, Patent Document 2 below discloses a technique for performing information retrieval based on a bookmark that can specify desired information on a Web browser. In this technique, keywords are extracted from bookmark information and bookmark usage history, and information matching the keywords is notified to the user.
[0005]
[Patent Document 1]
JP-A-7-239854 (see paragraph numbers 0017 to 0040 and FIGS. 1 to 3)
[Patent Document 2]
JP 2001-195422 A (paragraph numbers 0005 to 0013 and FIGS. 1 to 4)
[0006]
[Problems to be solved by the invention]
However, generating index information remains a major technical challenge for information retrieval systems. Beyond improving search accuracy, document search systems also need to improve technically as a whole, for example by coping with the increased CPU load caused by more sophisticated index generation and by supporting a wide variety of document formats in order to improve coverage of the documents to be searched. Seen in this light, conventional document search devices and document search systems cannot be said to generate index information in an optimal system environment. As a result, search in the field lacks flexibility, and further improvement in search accuracy is hindered. The techniques of Patent Document 1 and Patent Document 2 above still do not solve these problems.
[0007]
The present invention has been made in view of the above problems. Its object is to provide an information processing apparatus, an information processing program, and an information processing method that make document search easier by unifying document search methods around document access history information, which indicates how a document has been used and under what circumstances, and by thereby generating effective keyword information with a small processing load.
[0008]
[Means for Solving the Problems]
In order to achieve the above object, the present invention is an information processing apparatus that generates keywords for searching documents, comprising: history acquisition means for acquiring history information obtained from the access history produced when the document is operated; and keyword extraction means for extracting keywords for document search from the history information, based on the history information acquired by the history acquisition means.
[0009]
Further, the present invention is an information processing program for causing a computer to execute processing that generates keywords for searching documents, the program causing the computer to execute: a history acquisition function of acquiring history information obtained from the access history produced when the document is operated; and a keyword extraction function of extracting keywords for document search from the history information, based on the history information acquired by the history acquisition function.
[0010]
The present invention is also an information processing method for generating keywords for searching documents, comprising: a history acquisition step of acquiring history information obtained from the access history produced when the document is operated; and a keyword extraction step of extracting keywords for document search from the history information, based on the history information acquired in the history acquisition step.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
The document search apparatus of the present invention generates keywords for document search based on operation history information obtained from the history of operations on a document, such as information on what operations were performed on the document, information on the work for which the document was used, or information on documents related to it. It then either generates index information from these keywords, registers it in a database, and thereafter searches for documents using the index information, or embeds the generated keywords in the document so that subsequent document searches can use them. In this way, accurate information that is likely to be used in practice can serve as search keywords, and the approach can also be applied to searching non-text documents. Full-text search can also be used to perform a high-accuracy search that covers all documents.
[0012]
Embodiments of the present invention will be described below.
FIG. 1 is a block diagram showing the configuration of a document search system according to an embodiment of the present invention. In this document search system, the following are connected to a LAN 14: a client terminal 1 on which a user operates documents; a history server 4 that resides on the network and acquires (stores) history, analyzes history, extracts keywords, creates index information, and manages various databases (hereinafter, database is abbreviated as DB); and a file server 13 that stores users' documents and makes them accessible over the network.
[0013]
The client terminal 1, on which a user operates documents, includes a history collection unit 2 serving as history collection means that collects document history information, and an index search unit 3 serving as index search means that searches for documents containing a given keyword.
[0014]
The history collection unit 2 is a software module built into the client terminal 1. It detects events from the user's application operations, collects information on the documents accessed, and transmits this document information to the history server 4.
[0015]
The index search unit 3 is a software application on the client terminal 1. It searches the index DB 12 on the history server 4 for documents, or parts of documents, that contain the keyword specified by the user, and presents the search results to the user in descending order of weight.
[0016]
The history server 4 includes a history acquisition unit 5 serving as history acquisition means, a history analysis unit 6 serving as history analysis means, a keyword extraction unit 7 serving as keyword extraction means, an index creation unit 9, a history DB 10, a document information DB 11, and an index DB 12.
[0017]
In the configuration shown in FIG. 1, the history server 4 resides on the network and is implemented as a server function that stores history information, analyzes history information, and manages the index. However, this server function may instead reside on the client terminal 1. The functions of storing history information, analyzing it, and managing the index may also be assigned to separate servers. The index management function uses an existing index-based search system.
[0018]
The history acquisition unit 5 is a software module that operates on the history server 4 and has a function of receiving document access history data transmitted from the client terminal 1 and storing it in the history DB 10.
[0019]
The history analysis unit 6 analyzes the document access history data stored in the history DB 10, extracts the access count of each document and the related documents of a specific document or document part, and stores them in the document information DB 11. Here, related documents means information such as the documents or parts referenced when the document or part was created, derived documents created using the document or part, and concurrently used documents that are often used in parallel when the document or part is accessed.
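As a rough illustration of the access-count part of this analysis, the sketch below tallies accesses per document and per document part. It assumes, purely for the example, that each history entry has been reduced to a `(document, page)` pair with `page` set to `None` for whole-document access. The function name `count_accesses` and the sample entries are hypothetical.

```python
from collections import Counter


def count_accesses(history):
    """Count accesses per document and per (document, page) part."""
    per_document = Counter(document for document, _page in history)
    per_part = Counter(
        (document, page) for document, page in history if page is not None
    )
    return per_document, per_part


history = [
    (r"c:\abc.doc", None),
    (r"c:\abc.doc", 2),
    (r"c:\abc.doc", 2),
    (r"c:\def.doc", None),
]
per_document, per_part = count_accesses(history)
print(per_document[r"c:\abc.doc"], per_part[(r"c:\abc.doc", 2)])
```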
[0020]
The history analysis unit 6 is an additional component for enhancing the search effect of the present invention; it is not an essential component of every form of the invention.
[0021]
When extracting related documents from history information, the history information alone does not show which history entries are related. Therefore, business analysis processing is performed on the document access history, followed by document association processing on the business history created by the business analysis processing.
The process of generating business history from the document access history works, for example, as follows. It focuses on characteristic document access events that occur at the boundary between tasks, such as printing, saving a file, or sending mail, and creates business history by splitting the chronologically continuous document access history at those events.
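The splitting step can be sketched as follows. This is a minimal illustration that assumes each history entry is an `(operation, document)` pair already sorted by time, that the boundary operations are `Print`, `Save`, and `Send` (names taken from the operation types listed for the history DB), and that a boundary event closes the task it belongs to. The function name `split_into_tasks` and the sample file names are hypothetical.

```python
# Operations treated as marking the end of a task (an assumption for this sketch).
BOUNDARY_OPERATIONS = {"Print", "Save", "Send"}


def split_into_tasks(history):
    """Split a chronological access history into tasks at boundary operations."""
    tasks = []
    current = []
    for operation, document in history:
        current.append((operation, document))
        if operation in BOUNDARY_OPERATIONS:
            tasks.append(current)   # the boundary event ends the current task
            current = []
    if current:                     # trailing entries form an unfinished task
        tasks.append(current)
    return tasks


history = [
    ("Open", "sales.xls"),
    ("Open", "report.ppt"),
    ("Save", "report.ppt"),
    ("Open", "memo.doc"),
]
tasks = split_into_tasks(history)
print(len(tasks), [len(task) for task in tasks])
```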
The process of associating documents based on the business history works, for example, as follows. Regardless of the type of work, accesses are thought to follow patterns that are fixed to some extent, such as aggregating data in a spreadsheet and then pasting the result into a presentation document, or referring to past mail to write new mail. Relations can therefore be derived by defining these access patterns in advance and comparing them with the document access history within the business history. For example, from the pattern “document B was updated after document A was opened within the same task”, the relation “A is reference information for B” and, conversely, the relation “B is derived information from A” can be derived.
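The single pattern given as an example (document A opened, then document B updated in the same task) can be sketched as follows. The sketch assumes a task is a chronological list of `(operation, document)` pairs and that `Save` stands for "updated". The relation labels and the function name `derive_relations` are hypothetical, and a real system would define several such patterns.

```python
def derive_relations(task):
    """Derive reference/derived relations from one task's access history."""
    opened = []      # documents opened so far in this task, in order
    relations = []
    for operation, document in task:
        if operation == "Open":
            if document not in opened:
                opened.append(document)
        elif operation == "Save":
            for source in opened:
                if source != document:
                    relations.append((source, "reference of", document))
                    relations.append((document, "derived from", source))
    return relations


task = [("Open", "A"), ("Open", "B"), ("Save", "B")]
relations = derive_relations(task)
print(relations)
```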
[0022]
The keyword extraction unit 7 includes a weighting unit 8 serving as weighting means. It extracts keyword information from the information stored in the history DB 10 and the document information DB 11, weights this keyword information, and records it in the index DB 12. The keyword extraction method and the weighting method are described later.
The index creation unit 9, serving as index creation means, creates index information based on the keywords extracted by the keyword extraction unit 7, the history information stored in the history DB 10, and the document information stored in the document information DB 11 (the analysis results of the history analysis unit 6).
[0023]
The history DB 10 is a database that stores history information transmitted from the client terminal 1. The detailed configuration of the history DB 10 will be described later with reference to FIG. 2.
The document information DB 11 is a database that stores information obtained by analyzing the history information stored in the history DB 10. The detailed configuration of the document information DB 11 will be described later with reference to FIG. 3.
The index DB 12 is a database that records the index information created by the index creation unit 9. The detailed configuration of the index DB 12 will be described later with reference to FIG. 4. The index DB 12 may reuse the index DB that comes with an existing index search system.
[0024]
The file server 13 has a file DB 13a that stores users' documents, and is a server that makes the documents accessible over the network. However, the documents managed by the present invention are not limited to those on the file server 13; they may also be documents on the client terminal 1 or on the Internet.
[0025]
FIG. 2 is a diagram showing an example of the configuration and contents of the history DB 10 shown in FIG. In FIG. 2, the date and time when the document access occurred in the client terminal 1 is recorded in the “date and time” item column. For example, as in “2003/03/19 14:34:12”, the date and time of access occurrence is recorded from the year, month, day to hour, minute, second.
In the “operation content” item column, the operation content performed on the document by the user is recorded. The types of operations include Open, Save, Delete, Print, Send (mail transmission), and the like.
[0026]
In the “user name” item column, user information for identifying the user who performed the operation is recorded. For example, the name of the user who performed the operation is recorded.
In the “document” item column, information identifying the document that was operated on is recorded. For example, if the operation target is a Web document, a URL (for example, http://www.b-car.co.jp) is recorded; if it is a Windows shared file, a Windows network path containing the server name, share name, path name, and file name (for example, \\server1\dsr\cars.doc) is recorded.
[0027]
In the “document title” item column, the title of the document obtained from the application when the user operates the document is recorded. For example, in the case of a new car information document, new car information (for example, company A new car information) of each automobile company, event information about the new car (for example, spring new car), and the like are recorded.
In the “page” item column, the page number targeted by the operation is recorded. Note that this field is blank when the operation target is the entire document.
In the “transmission / output destination” item column, information on the transmission destination or output destination is recorded for document transmission and output operations. For example, if the operation is Print, the name of the printer that performed the printing (for example, Printer A) is recorded; if it is Send (transmission by e-mail), the user name of a user included in the destination (for example, Fujiwara) is recorded.
[0028]
In the “peripheral keyword” item column, keyword information (for example, vehicle type A, vehicle type B, motorcycle, etc.) included in the operated document is recorded. Note that the keyword information may be included in advance in the document to be operated, or may be generated dynamically from the document content by the history collection means 2 of the client terminal 1. An example of a document in which keyword information is included in advance is a document created by a Microsoft Office (registered trademark) application. There are several existing methods for dynamically generating keywords from document content.
[0029]
For example, a computer divides the text information into vocabulary units by morphological analysis, which analyzes natural language by decomposing the text of the document into parts of speech, counts the number of occurrences of each vocabulary item in the document, and adopts the vocabulary items with the highest occurrence counts as keywords. In addition, if the target of the operation is not the entire document but a part of it, such as a page, the keywords are extracted from the information contained in the target page rather than from the entire document. Note that the peripheral keywords are not essential, and this processing may be omitted depending on the system load.
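The frequency-based method described above can be sketched as follows. This is only a minimal illustration and not the procedure of the embodiment: a regular-expression tokenizer stands in for true morphological analysis, and the function name `extract_peripheral_keywords` and the parameters `top_n` and `min_length` are assumptions introduced here.

```python
import re
from collections import Counter


def extract_peripheral_keywords(text: str, top_n: int = 3, min_length: int = 3) -> list[str]:
    """Return the top_n most frequent terms found in text."""
    # Stand-in for morphological analysis: lowercase the text and
    # split it into alphabetic terms, dropping very short ones.
    terms = [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= min_length]
    # Count occurrences of each term and keep the most frequent ones.
    counts = Counter(terms)
    return [term for term, _ in counts.most_common(top_n)]


page_text = "Sedan sedan coupe sedan coupe wagon"
print(extract_peripheral_keywords(page_text, top_n=2))
```

To obtain keywords for a single page rather than the whole document, only the text of that page would be passed in.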
[0030]
FIG. 3 is a diagram showing an example of the configuration and contents of the document information DB shown in FIG. 1. That is, this figure shows the configuration and contents of the document information DB 11 that records the document information obtained by analyzing the history information stored in the history DB 10.
In the “document” item column, an identifier (for example, c:\abc.doc) of the document that the document information describes is recorded. The description method of the information in this field is the same as that of the “document” item described in the configuration and contents of the history DB 10 shown in FIG. 2. That is, if the search target is a Web document, a URL is recorded, and if it is a Windows shared file, a network path consisting of the server name, share name, path name, and file name is recorded.
[0031]
In the “page” item column, the page number to be searched is recorded when the search target is not the whole document but one part in the document. If the search target is the entire document, this field is blank.
In the “access count” field, the number of accesses that have occurred for the document or part to be searched is recorded.
In the item column of “reference document / reference count”, a document (for example, def.doc) referred to when creating or updating the document or part to be searched is recorded. In order to distinguish frequently referenced documents from other documents, the number of references is recorded together with the document name. For example, when def.doc is referenced three times, it is recorded as def.doc/3.
[0032]
In the item column of “derived document / referenced count”, a document (for example, cars.doc) created by referring to the document or part to be searched is recorded. In addition, the number of times the search target document was referenced when the derived document was created is recorded.
In the item column of “simultaneous use document”, a document that is frequently used at the same time when the document or part to be searched is accessed (for example, http://www.a-car.co.jp) is recorded.
In the “keyword” item column, keyword information (for example, vehicle type A, vehicle type B, etc.) included in the document or part to be searched is recorded. The keyword information includes keyword information recorded in the peripheral keywords of the history DB 10 and keywords extracted from the titles recorded in the document titles of the history DB 10.
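One row of the document information DB described above could be modeled as in the following sketch. The class name `DocumentInfo`, its field names, and the helper `reference_column` are hypothetical; only the item columns themselves and the `def.doc/3` notation come from the description.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DocumentInfo:
    """One row of the document information DB (field names are assumptions)."""
    document: str
    page: str = ""                 # blank when the row covers the whole document
    access_count: int = 0
    reference_documents: Counter = field(default_factory=Counter)
    keywords: set = field(default_factory=set)

    def reference_column(self) -> list[str]:
        # Render each reference document as "name/count", e.g. "def.doc/3".
        return [f"{name}/{count}" for name, count in self.reference_documents.most_common()]


info = DocumentInfo(document="abc.doc")
for _ in range(3):
    info.reference_documents["def.doc"] += 1   # def.doc was referenced three times
info.access_count += 1
info.keywords.update({"vehicle type A", "vehicle type B"})
print(info.reference_column())
```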
[0033]
FIG. 4 is a diagram showing an example of the configuration and contents of the index DB 12 shown in FIG. 1. That is, this figure shows the configuration and contents of the index DB 12 used for searching for documents using keywords. An index DB 12 in actual use has an efficient DB configuration in order to speed up the search process, but a simple configuration is illustrated here for ease of explanation.
[0034]
In the “document” item column, identification information (for example, c:\abc.doc) of a document to be searched is recorded. In this example, the same information as the field information in the “document” item column included in the history DB 10 or the document information DB 11 described above is recorded.
In the “page” item column, the page number of the part is recorded when the search result is not the whole document but the part of the document.
In the “keyword” item column, a list of keywords included in the target document or part (for example, company A, car, car life, etc.) is recorded. When the index DB 12 is shared by an existing document search system and the document search system according to the present invention, both the keyword information generated by the existing document search system and the keyword information generated by the document search system according to the present invention exist in the same keyword list.
[0035]
In the “weight” item column, the weight (importance) of each keyword is recorded. A keyword with a larger weight value is regarded as a keyword that better represents the target document/part. For example, for “c:\abc.doc” in the “document” item column, the weight of “car type A” is “10” and the weight of “car type B” is “3”, which means that the keyword “car type A” is much more important than “car type B”.
In the document search process using the index DB 12, the keyword field of the index DB 12 is searched with the keyword specified by the user, the documents/parts including the keyword are identified, and they are returned as the search result, sorted in descending order of the weight assigned to the keyword.
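The search procedure described above can be sketched as follows, using the simple table layout of FIG. 4 rather than an efficient index structure. The names `IndexEntry` and `search_index` are hypothetical, and the `cars.doc` row with weight 4 is invented sample data; the weights 10 and 3 follow the example in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One (document, page, keyword, weight) row of the simplified index DB."""
    document: str
    page: str      # blank when the entry refers to the whole document
    keyword: str
    weight: int


def search_index(index: list, query: str) -> list:
    """Return entries whose keyword matches query, heaviest weight first."""
    hits = [entry for entry in index if entry.keyword == query]
    return sorted(hits, key=lambda entry: entry.weight, reverse=True)


index = [
    IndexEntry(r"c:\abc.doc", "", "car type A", 10),
    IndexEntry(r"c:\abc.doc", "", "car type B", 3),
    IndexEntry("cars.doc", "2", "car type A", 4),   # invented sample row
]
results = search_index(index, "car type A")
print([(entry.document, entry.page, entry.weight) for entry in results])
```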
[0036]
Next, an operation sequence performed by the document search system of the present invention will be described in detail using a flowchart. FIG. 5 is a flowchart showing the flow of history collection processing performed by the client terminal in the document search system according to the present embodiment. First, the history collection unit 2 incorporated in the client terminal 1 starts a user application operation event process. That is, the history collection unit 2 detects a document operation performed by the user and starts a series of search processes (step S1). When the search process is started, the history collection unit 2 collects document information such as a file name, a page number, a title, and operation contents from the application as operation history information according to the contents of the event. At this time, the history collection unit 2 extracts peripheral keywords from the content of the opened document (step S2). Then, the history collection unit 2 transmits the information collected in step S2 to the history server 4 as operation history information (step S3).
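Steps S1 to S3 can be sketched as a single event handler on the client side. This is an illustration under stated assumptions: the record fields mirror the item columns of the history DB, while the names `HistoryRecord` and `on_document_event`, the shape of the `event` dictionary, the JSON serialization, and the `send` callback standing in for network transmission are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class HistoryRecord:
    """One row of operation history, mirroring the item columns of the history DB."""
    date_time: str
    operation: str            # e.g. Open, Save, Delete, Print, Send
    user_name: str
    document: str             # URL or network path
    document_title: str = ""
    page: str = ""            # blank when the whole document is the target
    destination: str = ""     # printer name or mail recipient, if any
    peripheral_keywords: list = field(default_factory=list)


def on_document_event(event: dict,
                      extract_keywords: Callable[[str], list],
                      send: Callable[[str], None]) -> HistoryRecord:
    """Handle one detected document operation (step S1)."""
    # Step S2: collect document information and peripheral keywords.
    record = HistoryRecord(
        date_time=event["date_time"],
        operation=event["operation"],
        user_name=event["user_name"],
        document=event["document"],
        document_title=event.get("title", ""),
        page=event.get("page", ""),
        destination=event.get("destination", ""),
        peripheral_keywords=extract_keywords(event.get("content", "")),
    )
    # Step S3: transmit the collected information to the history server.
    send(json.dumps(asdict(record)))
    return record


outbox = []  # stands in for the connection to the history server
on_document_event(
    {
        "date_time": "2003/03/19 14:34:12",
        "operation": "Open",
        "user_name": "Fujiwara",
        "document": r"\\server1\dsr\cars.doc",
        "title": "Company A new car information",
        "content": "car type A car type B",
    },
    extract_keywords=lambda text: ["car type A", "car type B"] if text else [],
    send=outbox.append,
)
print(outbox[0])
```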
[0037]
FIG. 6 is a flowchart showing a flow of processing from reception of operation history information to update of the index DB, which is performed by the history server 4 in the document search system according to the present embodiment. That is, this figure shows the flow of processing until the history server 4 extracts and registers keywords from the operation history information.
[0038]
First, when the history server 4 starts the history collection process (step S11), the history acquisition unit 5 of the history server 4 determines whether or not history data has been received from the client terminal 1 (step S12). Here, if history data has not yet been received (step S12, No), it waits until history data is received. On the other hand, when the history acquisition unit 5 detects that history data has been received from the client terminal 1 (step S12, Yes), a series of processes for enabling keyword search is started.
[0039]
That is, in the history server 4, the history acquisition unit 5 stores the received history data in the history DB 10, and the keyword extraction unit 7 extracts the keyword information included in the received history data (step S13). The content of the keyword extraction process by the keyword extraction unit 7 will be described later with reference to FIG. 7. Further, the keyword extraction unit 7 weights the extracted keyword information using the weighting unit 8 and registers this keyword information in the index DB 12 (step S14). Thereby, the index search unit 3 of the client terminal 1 can search for a desired document by arranging the keyword information registered in the index DB 12 in descending order of weight. The content of the keyword weighting process performed by the keyword extraction unit 7 will be described later with reference to FIG. 8.
[0040]
Furthermore, in this embodiment, keywords can be extracted and registered effectively by adding the following steps S15 to S18. That is, it is determined whether or not the document search system is set to perform history analysis (that is, whether or not the history analysis unit 6 is configured) (step S15). If it is set to perform history analysis (that is, the history analysis unit 6 is configured) (step S15, Yes), the history analysis unit 6 starts analysis processing of the history data. The history analysis unit 6 then uses the history data recorded in the history DB 10 to analyze information such as the number of document accesses and related documents (step S16). The processing for analyzing related documents from the history data is performed using an existing processing procedure. If the document search system is not set to perform history analysis in step S15 (that is, the history analysis unit 6 is not configured) (step S15, No), the process ends as it is.
[0041]
Next, when the history analysis unit 6 has analyzed the history in step S16, the keyword extraction unit 7 extracts keyword information from the analysis result produced by the history analysis unit 6 (the contents of the document information DB 11) and weights it (step S17). The keyword information extraction and weighting processing performed by the keyword extraction unit 7 will be described later with reference to FIGS. 7 and 8. Then, the index creation unit 9 creates an index using the keywords extracted by the keyword extraction unit 7 and the weighting information, and registers it in the index DB 12 (step S18). Accordingly, the index search unit 3 of the client terminal 1 can search for a desired document by arranging the keyword information in descending order of weight based on the index information registered in the index DB 12.
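The server-side flow of steps S13 to S18 can be sketched as one function, with the optional analysis branch of step S15 expressed as an optional argument. This is a simplified illustration: the function name `handle_history`, the dictionary-based records, the in-memory `history_db` list and `index_db` dictionary, and the sample callbacks are all assumptions rather than the structure of the embodiment.

```python
def handle_history(record, history_db, index_db, extract, weigh, analyze=None):
    """Process one received history record on the history server."""
    # Step S13: store the history data and extract the keywords it contains.
    history_db.append(record)
    document = record["document"]
    # Step S14: weight each keyword and register it in the index.
    for keyword in extract(record):
        index_db[(document, keyword)] = weigh(keyword, record)
    # Step S15: stop here if history analysis is not configured.
    if analyze is None:
        return
    # Step S16: analyze the accumulated history for this document.
    analysis = analyze(history_db, document)
    # Steps S17 and S18: extract and weight keywords from the analysis
    # result, then register them in the index.
    for keyword in extract(analysis):
        index_db[(document, keyword)] = weigh(keyword, analysis)


history_db, index_db = [], {}
record = {"document": "cars.doc", "keywords": ["car type A"]}
handle_history(
    record, history_db, index_db,
    extract=lambda data: data["keywords"],
    weigh=lambda keyword, data: data.get("access_count", 0) + 1,
    analyze=lambda history, doc: {
        "keywords": ["car type B"],   # e.g. a keyword taken from a related document
        "access_count": sum(1 for r in history if r["document"] == doc),
    },
)
print(index_db)
```

Passing `analyze=None` corresponds to the "No" branch of step S15, in which only the keywords contained in the history data itself are registered.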
[0042]
FIG. 7 is a diagram illustrating an example of rules used when the keyword extraction unit 7 of the history server 4 extracts keywords. That is, FIG. 7 shows an example of the rules applied when, in steps S13 and S14 of the flowchart shown in FIG. 6, the keyword extraction unit 7 extracts keywords included in the history data from the history DB 10 and the document information DB 11 and the index creation unit 9 registers them in the index DB 12 as additional keywords, or when, in steps S18 and S19, the keywords extracted from the analysis result by the keyword extraction unit 7 are registered in the index DB 12 as additional keywords. Note that the keyword extraction rules shown in FIG. 7 can be arbitrarily selected and used according to the configuration and load of the document search system, and it is not necessary to use all the rules. New rules can also be added and used.
[0043]
Hereinafter, the keyword extraction rules performed by the keyword extraction unit 7 will be described in detail with reference to FIG.
In the “add creator” rule, the name of the creator who created the document is extracted as an additional keyword, and is registered in the index DB 12 by the index creation unit 9.
In the “add creation date” rule, the date and time when the document was created is extracted as an additional keyword and is registered in the index DB 12 by the index creation unit 9.
In the “add server name where document is located” rule, the server name of the server where the document is located is extracted as an additional keyword and registered in the index DB 12 by the index creation unit 9.
[0044]
In the “add contents of access to document” rule, the operation contents performed by the user, such as browsing, printing, deletion, transfer, and transmission, are extracted as additional keywords and are registered in the index DB 12 by the index creation unit 9.
In the “add destination” rule, when the document is transmitted by e-mail or the like, the user name of the destination (transmission destination) is extracted as an additional keyword and is registered in the index DB 12 by the index creation unit 9.
In the “add access frequency” rule, additional keywords such as “high frequency” and “low frequency” are extracted according to the access frequency of the document, and are registered in the index DB 12 by the index creation unit 9.
That is, “high frequency” is registered in the index DB 12 when the access frequency is high, and “low frequency” is registered when the access frequency is low.
[0045]
In the “add access count” rule, additional keywords such as “popular” when the total access count is high and “unpopular” when the total access count is low are extracted according to the document access count. Then, the index creation unit 9 registers it in the index DB 12.
In the “add output destination” rule, when the document is printed out, the printer name of the output destination is extracted as an additional keyword, and is registered in the index DB 12 by the index creation unit 9.
That is, the device name of the data output destination is registered in the index DB 12.
[0046]
In the “add related document keyword” rule, keyword information included in the related documents recorded in the document information DB 11 (reference documents, derived documents, and simultaneous use documents) is extracted as additional keywords of the target document and is registered in the index DB 12 by the index creation unit 9.
In the “add related document title” rule, keyword information included in the titles of the related documents recorded in the document information DB 11 (reference documents, derived documents, and simultaneous use documents) is extracted as additional keywords of the target document and is registered in the index DB 12 by the index creation unit 9.
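Because the rules above can be selected freely and new ones can be added, one natural way to picture them is as a table of small functions. The sketch below covers only a subset of the rules; every function name, the dictionary keys, and the threshold of 10 separating “popular” from “unpopular” are assumptions made for illustration.

```python
def rule_server_name(history, info):
    # For a network path such as \\server1\dsr\cars.doc, the server name is "server1".
    document = history["document"]
    return [document[2:].split("\\")[0]] if document.startswith("\\\\") else []


def rule_access_content(history, info):
    return [history["operation"]]


def rule_destination(history, info):
    sent = history["operation"] == "Send" and history.get("destination")
    return [history["destination"]] if sent else []


def rule_output_destination(history, info):
    printed = history["operation"] == "Print" and history.get("destination")
    return [history["destination"]] if printed else []


def rule_access_count(history, info, threshold=10):
    # The threshold is an assumption; the description gives no concrete value.
    return ["popular" if info.get("access_count", 0) > threshold else "unpopular"]


def rule_related_keywords(history, info):
    return list(info.get("related_keywords", []))


RULES = {
    "server_name": rule_server_name,
    "access_content": rule_access_content,
    "destination": rule_destination,
    "output_destination": rule_output_destination,
    "access_count": rule_access_count,
    "related_keywords": rule_related_keywords,
}


def extract_additional_keywords(history, info, enabled):
    """Apply only the enabled rules and collect keywords without duplicates."""
    keywords = []
    for name in enabled:
        for keyword in RULES[name](history, info):
            if keyword not in keywords:
                keywords.append(keyword)
    return keywords


history = {"document": r"\\server1\dsr\cars.doc", "operation": "Print", "destination": "Printer A"}
info = {"access_count": 12, "related_keywords": ["car type A"]}
print(extract_additional_keywords(history, info, list(RULES)))
```

Adding a new rule then amounts to writing one more function and registering it in `RULES`, and reducing system load amounts to passing a shorter `enabled` list.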
[0047]
FIG. 8 is a diagram illustrating an example of rules when the keyword extraction unit 7 weights keywords. That is, FIG. 8 is a diagram showing an example of rules when the keyword extraction unit 7 weights the keywords and registers them in the index DB 12 in step S14 and step S18 of the flowchart shown in FIG.
Note that the rules shown in FIG. 8 can be arbitrarily selected and used according to the configuration and load of the document search system, and it is not necessary to use all the rules. Also, a new rule can be added and used.
[0048]
As a rule for the case where the number of accesses to a related document is large, such as “the number of accesses of the document including the keyword > 10”, the weight of the keyword is increased when a keyword of a related document of the document is registered. That is, the index creation unit 9 increases the keyword weight by “1” and registers it in the index DB 12.
As a rule for the case where the number of accesses to a part of a related document is large, such as “the number of accesses of the part including the keyword > 10”, the weight of the keyword is increased when a keyword of a part of a related document is registered. That is, the index creation unit 9 increases the keyword weight by “1” and registers it in the index DB 12.
[0049]
As a rule for the case where there is a history of a part including the keyword having been printed in the past, such as “number of outputs of the part including the keyword > 1”, the weight of the keyword is increased on the assumption that the document/part is important. That is, the index creation unit 9 increases the keyword weight by “1” and registers it in the index DB 12.
As a rule such as “keyword of a related document with a reference count of 2 or more”, when a keyword of a reference document/part used when a document was created or updated is registered and the number of references to that reference document/part is large, the weight of the keyword is increased on the assumption that the keyword is highly relevant. That is, the index creation unit 9 increases the keyword weight by “1” and registers it in the index DB 12.
[0050]
As a rule of “derived keyword”, when a keyword of a derived document/part created by referring to a document is registered and the number of references from the derived document/part is large, the weight of the keyword is increased on the assumption that the keyword is strongly related. That is, the index creation unit 9 increases the keyword weight by “1” and registers it in the index DB 12.
As a rule of “keyword of related document that has been printed”, among the keywords included in the related document, the weight of the keyword of the related document that has been subjected to specific high importance processing such as printing or transmission is increased. That is, the index creation unit 9 increases the keyword weight by “1” and registers it in the index DB 12.
As a rule of “document created by a specific user”, the weight of a keyword included in a document created by a specific user is increased. That is, the index creation unit 9 increases the keyword weight by “1” and registers it in the index DB 12.
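The weighting rules above each add “1” to the keyword weight when their condition holds, which can be sketched as a list of predicates. The thresholds “> 10”, “> 1”, and “2 or more” follow the rule names in the description; the threshold for the derived-document rule, the dictionary keys, and the names `WEIGHT_RULES` and `weigh_keyword` are assumptions.

```python
WEIGHT_RULES = [
    ("document accesses > 10", lambda s: s.get("document_access_count", 0) > 10),
    ("part accesses > 10", lambda s: s.get("part_access_count", 0) > 10),
    ("part outputs > 1", lambda s: s.get("part_output_count", 0) > 1),
    ("reference count of 2 or more", lambda s: s.get("reference_count", 0) >= 2),
    # No threshold is given for derived documents; 2 is assumed here.
    ("derived keyword", lambda s: s.get("derived_reference_count", 0) >= 2),
    ("related document printed or sent", lambda s: bool(s.get("printed_or_sent", False))),
]


def weigh_keyword(stats, base_weight=0, priority_users=frozenset()):
    """Add 1 to the weight for every rule whose condition holds."""
    weight = base_weight + sum(1 for _, applies in WEIGHT_RULES if applies(stats))
    # "Document created by a specific user" rule.
    if stats.get("creator") in priority_users:
        weight += 1
    return weight


stats = {
    "document_access_count": 12,
    "part_output_count": 2,
    "reference_count": 3,
    "creator": "Fujiwara",
}
print(weigh_keyword(stats, priority_users={"Fujiwara"}))
```

In this sample, four conditions hold (document accesses, part outputs, reference count, and creator), so the resulting weight is 4.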
[0051]
The embodiment described above is an example for explaining the present invention, and the present invention is not limited to the above-described embodiment; various modifications can be made within the scope of the gist of the invention. That is, in the above embodiment, the extracted keyword information is registered in the index DB 12. However, the present invention is not limited to this, and the keyword information may be embedded in (added to) the document itself. For example, in a document created by a Microsoft Office (registered trademark) application, an area for recording keyword information is provided in the document, and in HTML and XML documents, an area for recording keyword information separately from the document content is defined. In addition, since the keyword information extracted by the document search system of the present invention is then included in the document itself, the same effect as in the above embodiment can be achieved even when the document search system of the present invention is used in combination with a full-text search system that has no index DB.
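For HTML, one commonly used area of this kind is a `meta` element with `name="keywords"` in the document head. The following sketch embeds extracted keywords there; the function name `embed_keywords` is hypothetical, and the use of this particular element is an assumption, since the description does not name the area.

```python
import html
import re


def embed_keywords(html_text: str, keywords: list) -> str:
    """Insert a keywords meta element just before the closing head tag."""
    content = html.escape(", ".join(keywords), quote=True)
    meta = '<meta name="keywords" content="{}">'.format(content)
    match = re.search(r"</head\s*>", html_text, flags=re.IGNORECASE)
    if match is None:
        return html_text  # no head element found; leave the document unchanged
    return html_text[:match.start()] + meta + html_text[match.start():]


page = "<html><head><title>Cars</title></head><body></body></html>"
print(embed_keywords(page, ["car type A", "car type B"]))
```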
[0052]
In the embodiment of the present invention, the case where the functions for carrying out the invention are recorded in the apparatus in advance has been described. However, the present invention is not limited to this; the same functions may be downloaded to the apparatus from a network, or the same functions stored on a recording medium may be installed in the apparatus. The recording medium may be in any form, such as a CD-ROM, as long as it can store a program and can be read by the apparatus. In addition, the functions obtained by installation or download in this way may be realized in cooperation with the OS (operating system) or the like inside the apparatus.
[0053]
[Effects of the invention]
As described above, according to the present invention, since highly reliable keyword information based on a user's operation history is added to a document, search accuracy can be improved. Further, by using related document information, keywords that are not included in the document itself can be added, and search flexibility can be enhanced. In addition, even if the document to be searched does not contain text information, such as an image or audio document, keyword information is added using history information and related documents, so that it can be made a target of keyword search in the same way as a document that contains text.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a document search system in an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the configuration and contents of the history DB shown in FIG. 1.
FIG. 3 is a diagram showing an example of the configuration and contents of the document information DB shown in FIG. 1.
FIG. 4 is a diagram showing an example of the configuration and contents of the index DB shown in FIG. 1.
FIG. 5 is a flowchart showing a flow of history collection processing performed by a client terminal in the document search system in the embodiment of the present invention.
FIG. 6 is a flowchart showing a flow of processing from operation history information reception to index DB update performed by the history server in the document search system according to the embodiment of the present invention.
FIG. 7 is a diagram illustrating an example of a rule when a keyword extraction unit of a history server extracts a keyword.
FIG. 8 is a diagram illustrating an example of rules when a keyword extraction unit weights keywords.
[Explanation of symbols]
1 Client terminal
2 History collection part
3 Index search part
4 History server
5 History acquisition part
6 History analysis part
7 Keyword extraction part
10 History DB
11 Document information DB
12 Index DB
13 File server
14 LAN

Claims

An information processing apparatus for generating a keyword for searching a document,
History acquisition means for acquiring history information obtained from an access history when the document is operated;
An information processing apparatus comprising: keyword extraction means for extracting the document search keyword from the history information based on the history information acquired by the history acquisition means.

The information processing apparatus according to claim 1,
An information processing apparatus wherein the keyword extracted from the history information by the keyword extraction means, based on the history information acquired by the history acquisition means, relates to at least one of the creator, creation date and time, creation location, access content, and access time of the document.

The information processing apparatus according to claim 1,
wherein the keyword extraction unit includes a weighting unit that adds weighting information to the keyword extracted by the keyword extraction unit based on the history information.

The information processing apparatus according to claim 1,
A history analysis means for analyzing the history information based on the history information acquired by the history acquisition means,
The keyword extracting unit extracts a keyword using the history information and an analysis result of the history analyzing unit.

The information processing apparatus according to claim 1,
An information processing apparatus comprising keyword adding means for adding a keyword extracted by the keyword extracting means to the document.

The information processing apparatus according to claim 1,
An information processing apparatus comprising: index information generating means for generating index information for searching the document based on the keyword extracted by the keyword extracting means.

The information processing apparatus according to claim 6,
Index information storage means for storing index information generated by the index information generation means;
An information processing apparatus comprising: search means for searching for a document using index information stored in the index information storage means.

An information processing program for causing a computer to execute processing for generating a keyword for searching for a document,
A history acquisition function for acquiring history information obtained from an access history when the document is operated;
An information processing program for causing a computer to execute a keyword extraction function for extracting a keyword for document search from the history information based on the history information acquired by the history acquisition step.

An information processing program according to claim 8,
An information processing program for causing a computer to execute an index information generation function for generating index information for searching the document based on a keyword extracted by the keyword extraction function.

An information processing program according to claim 9,
An index information storage function for storing the index information generated by the index information generation function;
An information processing program for causing a computer to execute a search function for searching for a document using index information stored in the index information storage function.

An information processing method for generating a keyword for searching a document,
A history acquisition step of acquiring history information obtained from an access history when the document is operated;
And a keyword extracting step of extracting the document search keyword from the history information based on the history information acquired by the history acquiring step.

An information processing method according to claim 11,
An index information generating step of generating index information for searching the document based on the keyword extracted by the keyword extracting step.

An information processing method according to claim 12,
An index information storage step for storing the index information generated by the index information generation step;
A search step for searching for a document using the index information stored in the index information storage step.