JP5057512B2

JP5057512B2 - File search system

Info

Publication number: JP5057512B2
Application number: JP2007161612A
Authority: JP
Inventors: 治夫横田; 徹太郎渡部; 隆志小林
Original assignee: Tokyo Institute of Technology NUC
Current assignee: Tokyo Institute of Technology NUC
Priority date: 2007-06-19
Filing date: 2007-06-19
Publication date: 2012-10-24
Anticipated expiration: 2027-06-19
Also published as: JP2009003553A

Description

The present invention relates to a file search system that retrieves work-related files from a group of files that have no link relationship, such as shared keywords, with one another.

In the file systems of recent servers, both the number and the variety of files have grown explosively. File systems therefore provide a hierarchical directory structure to manage the logical location at which each file is stored. As the number of files grows, however, it has become difficult to find the file one wants to use. File search systems that find files containing a keyword have been provided as a solution to this problem.

A file search system searches for files whose file name (including the directory path name) contains the keyword, or whose content contains a character string including the keyword. For example, Patent Document 1 discloses an invention that defines a link importance between keyword-searched files having a link relationship, so that files with a high link importance are more readily extracted.

Non-Patent Document 1 describes a memory recall support tool that stores images so that scenes in which the user was browsing Web pages can be reproduced, letting the user view the replayed scene and recall past memories. As a search function of this tool, it discloses ranking a Web page higher in the search results when that page was displayed during a period in which many Web pages containing the keyword remembered by the user were actively displayed, on the grounds that such a page is highly relevant to the keyword.
Patent Document 1: JP 2001-290843 A (paragraphs [0019] to [0033])
Non-Patent Document 1: Tetsuyuki Morita, Tetsuo Hidaka, Akinori Tanaka, Yasuhisa Kato, "Memory Recall Support Tool 'Memory-Retriever'", INTERACTION2007, March 15, 2007

However, both Patent Document 1 and Non-Patent Document 1 presuppose a keyword search. It is therefore difficult to detect image files, data files, and the like whose file name and content do not contain the keyword, even when they hold content related to the keyword.

In view of the above problems of the prior art, an object of the present invention is to provide a file search system that retrieves work-related files from a group of files that have no link relationship such as shared keywords.

To solve the above problem, the file search system retrieves work-related files from a group of files stored on a server, and comprises relevance deriving means for deriving the degree of relevance between files from the history of accesses to the files, and output means for outputting, for a search file found by keyword search, the group of files whose degree of relevance to that search file is at or above a predetermined level. The relevance deriving means refers to the history to calculate the activity time intervals during which work was being performed, calculates the file usage time during which a file was in use within the range of the activity time intervals, and calculates the degree of relevance based on the overlap of file usage times. It calculates the activity time intervals by reading the access times and file names recorded in the history and, when an access time is recorded within a predetermined time width, determining that work was being performed within that time width. It determines, according to predetermined conditions, the close that corresponds to each file open recorded in the history, and calculates the file usage time as the time during which the period from the open time to the close time so determined overlaps the activity time intervals.

With this configuration, even for files that have no link relationship such as shared keywords, the activity time intervals (the time periods in which the user was working) can be estimated from the history of accesses to the server, and files whose use overlapped within those intervals can be determined to be related to one another. There is no need for a separate, special application to work out the working time (activity time intervals); the intervals can be calculated easily from the recorded access times. Furthermore, by checking the time a file was open against the activity time intervals, the time the file was actually in use can be estimated accurately, based on actual working conditions.

In the file search system, the relevance deriving means takes one file selected from all the files in the file group and another file and, when their file usage times overlap, defines the overlapping time as a co-occurrence time; when there are multiple co-occurrence times, their number as the co-occurrence count; the period from the end of one co-occurrence time to the start of the next as a co-occurrence interval; and the difference between the start of the file usage time of the one file, or of the other file, and the start of the co-occurrence time as a use start pattern. It then computes the degree of relevance between the selected file and the other file from at least one of the co-occurrence time, the co-occurrence count, the co-occurrence interval, and the use start pattern.
The computation uses a relevance formula in which a predetermined value is the exponent and the base is: the cumulative co-occurrence time when only the co-occurrence time is used; the co-occurrence count when only the co-occurrence count is used; the reciprocal of the cumulative co-occurrence intervals when only the co-occurrence interval is used; and the cumulative use start pattern when only the use start pattern is used. These formulas may also be combined into a relevance formula that multiplies two or more of them together.
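As a hedged sketch of the form described above, the four single-factor formulas and their product might be written as follows. The symbols $T$, $C$, $I$, $P$ and the exponents $\alpha$, $\beta$, $\gamma$, $\delta$ are notation assumed here for illustration; the text above only states that each base is raised to a predetermined value and that two or more formulas may be multiplied.

```latex
% Assumed notation: t_k = k-th co-occurrence time, C = co-occurrence count,
% i_k = k-th co-occurrence interval, p_k = k-th use start pattern.
\begin{align}
  R_T &= \Bigl(\sum_{k} t_k\Bigr)^{\alpha}, &
  R_C &= C^{\beta}, \\
  R_I &= \Bigl(\frac{1}{\sum_{k} i_k}\Bigr)^{\gamma}, &
  R_P &= \Bigl(\sum_{k} p_k\Bigr)^{\delta}, \\
  R   &= R_T \cdot R_C \cdot R_I \cdot R_P
  && \text{(or the product of any two or more factors)}.
\end{align}
```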

With this configuration, the degree of relevance is calculated by a relevance formula that uses at least one of the co-occurrence time, the co-occurrence count, the co-occurrence interval, and the use start pattern. The files whose relevance is calculated need not have any link relationship such as shared keywords.

In the file search system, the predetermined conditions are at least one of the following three processes.
The first process, applied when the close record corresponding to an open is missing, deletes that open from the history.
The second process, applied to a file that has been left open, treats the time overlapping the earlier activity time interval as the open-to-close period of the file when the period from the end of that activity time interval to the start of the next is at least a predetermined time.
The third process, applied to file types whose interval between open time and close time is at or below a predetermined value, treats the period from the first open to the last close within an activity time interval as the open-to-close period of the file.

With this configuration, the first process makes it possible to calculate relevance when a close record is missing from the file access history (access log). If such a history were used as is, the file search system would treat the file as open indefinitely and analyze relevance on that basis. Deleting the unmatched open as preprocessing, so that every open has a corresponding close, reduces the disturbance in the relevance calculation.
The second process handles the case in which open and close are paired in the history but the file was, for example, left open for a long time, including periods in which it was probably not in use. If, while the file remains open, the period from the end of one activity time interval to the start of the next is at least a predetermined time, the file is presumed not to have been used for work during that period. The file usage time is then taken as the overlap between the time the file was open and the activity time intervals that precede that gap, which reduces the disturbance in the relevance calculation.
The third process handles files that are closed (the lock is released) immediately after being opened, so that the recorded interval between open and close is at or below a predetermined value. In this case the file is regarded as having been in use continuously from the first open to the last close within the activity time interval, which gives an estimate of file usage time closer to actual use.
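The third process can be illustrated with a minimal Python sketch. The function name `span_within_activity`, the `(time, kind)` tuple layout, and the use of plain numbers for times are assumptions made for illustration, not part of the described system.

```python
def span_within_activity(events, activity_start, activity_end):
    """Return (first open, last close) for one file inside one activity interval.

    events: iterable of (time, kind) with kind "open" or "close" (assumed layout).
    Returns None when the interval holds no usable open/close pair.
    """
    opens = [t for t, kind in events
             if kind == "open" and activity_start <= t < activity_end]
    closes = [t for t, kind in events
              if kind == "close" and activity_start <= t <= activity_end]
    if not opens or not closes:
        return None
    first, last = min(opens), max(closes)
    # Guard against a close that precedes every open in the interval.
    return (first, last) if first <= last else None


# Hypothetical events for a file that releases its lock at once (times in minutes).
events = [(5, "open"), (5, "close"), (40, "open"), (41, "close"), (200, "open")]
span = span_within_activity(events, 0, 60)
```

Here the file would be treated as in use from minute 5 to minute 41, rather than for two near-zero-length periods.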

According to the present invention, work-related files can be retrieved from a group of files that have no link relationship such as shared keywords.

Next, the best mode for carrying out the present invention (hereinafter, the "embodiment") is described in detail with reference to the drawings as appropriate.

(Embodiment)
The configuration of a file search system 1 according to an embodiment of the present invention is described with reference to FIG. 1, which shows the configuration of the file search system 1.
In the file search system 1, servers 11 and 12, terminals 21 and 22, and a file relevance management apparatus 100 are connected via a network 30 so that they can communicate with one another.

The servers 11 and 12 store a large number of files and record a history (access log) of who accessed which file. Most current file systems provide a hierarchical directory structure, and the history is managed with the directory path name attached to each file.
The terminals 21 and 22 are used by users who perform work such as document creation and file search. A terminal may be a PC (Personal Computer) or a dedicated terminal connected to the servers.
The file relevance management apparatus 100 acquires the history from the servers 11 and 12 and calculates the degree of relevance between files that a user uses at the same time. When a keyword search is executed from terminal 21 or 22, the apparatus obtains the file information (file names) of the files found to contain the keyword, looks up the stored relevance values for those file names, and displays the names of highly related files on terminals 21 and 22.
The number of servers and the number of terminals may each be one, or three or more.

Next, the hardware configuration of the file relevance management apparatus 100 is described with reference to FIG. 2, which shows that hardware configuration.
As shown in FIG. 2, the file relevance management apparatus 100 comprises a processing unit 110, an input/output unit 120, a storage unit 130, and a communication control unit 140, which are connected via a bus so that they can communicate with one another.

The processing unit 110 includes a CPU (Central Processing Unit) 111 that executes arithmetic processing and a main memory 112 that the CPU 111 uses as working storage for that processing. The main memory 112 is implemented with RAM (Random Access Memory) or the like. An application program stored in the storage unit 130 is loaded into the main memory 112, and the CPU 111 executes it to carry out the various processes.

The input/output unit 120 consists of input devices (not shown), such as a keyboard and a mouse, connected to the file relevance management apparatus 100, and a display device (not shown), such as a monitor, that displays various data including the results of processing by the processing unit 110.

The storage unit 130 stores the various data and calculation results used by the CPU 111 in its processing, as well as data sent and received through the input/output unit 120. The storage unit 130 is implemented with a ROM (Read Only Memory), a hard disk device, or the like (not shown).

The communication control unit 140 includes a communication interface (not shown). It controls the transmission of information processed by the processing unit 110 to other devices via the network 30 (see FIG. 1) and the reception of information from other devices.

Next, the functions of the file relevance management apparatus 100 are described with reference to FIG. 3 (see also FIG. 1 as appropriate). FIG. 3 shows the functions of the file relevance management apparatus 100.

The processing unit 110 of the file relevance management apparatus 100 includes an access log collection unit 113, an access log analysis unit 114, an access log preprocessing unit 115, an activity time calculation unit 116, a file usage time calculation unit 117, and a relevance calculation unit 118.

The access log collection unit 113 acquires the access log (history) from the servers 11 and 12. The history may be acquired periodically or whenever it is needed.
The access log analysis unit 114 classifies the acquired access log (history) by file name and, as described later, extracts the extensions of files that release their lock immediately.
The access log preprocessing unit 115 supplements and corrects the raw access log (history) data according to predetermined conditions before the degree of relevance is calculated. Examples of the predetermined conditions are deleting history entries whose open has no corresponding close, and determining the file usage time of a file that has been left open for a long time.
The activity time calculation unit 116 calculates the time intervals in which the user is working (activity time intervals). Outside the activity time intervals, a file is not regarded as in use even if it is open.
The file usage time calculation unit 117 calculates the time during which an activity time interval calculated by the activity time calculation unit 116 overlaps the time a file is open (the file usage time). This process uses the history as corrected by the access log analysis unit 114 and the access log preprocessing unit 115. A file usage time is calculated for each file.
The relevance calculation unit 118 compares the file usage times calculated by the file usage time calculation unit 117 across file names and calculates the time during which file usage times overlap (the co-occurrence time). It then calculates the degree of relevance with a predetermined formula whose variables include the co-occurrence time and the co-occurrence count. This formula is described later. The calculated degree of relevance is stored for each file name in a relevance DB (Data Base) 131.
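The co-occurrence quantities used by the relevance calculation unit can be illustrated with a short Python sketch. The function names and the representation of usage times as `(start, end)` pairs of plain numbers (for example, seconds) are assumptions made for illustration.

```python
def co_occurrences(usage_a, usage_b):
    """Return the sorted overlaps between two files' usage intervals.

    usage_a, usage_b: lists of (start, end) pairs, one list per file.
    """
    overlaps = []
    for a_start, a_end in usage_a:
        for b_start, b_end in usage_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:  # keep only non-empty overlaps
                overlaps.append((start, end))
    return sorted(overlaps)


def co_occurrence_stats(overlaps):
    """Return (cumulative co-occurrence time, co-occurrence count, intervals).

    The intervals are the gaps from the end of one co-occurrence
    to the start of the next.
    """
    total_time = sum(end - start for start, end in overlaps)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(overlaps, overlaps[1:])]
    return total_time, len(overlaps), gaps


# Hypothetical usage times for two files.
overlaps = co_occurrences([(0, 10), (20, 30)], [(5, 25)])
stats = co_occurrence_stats(overlaps)
```

In this example the two files are used together twice, for a cumulative 10 units, with a gap of 10 units between the two co-occurrences.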

The relevance deriving means recited in the claims is a collective term for the functions of the access log collection unit 113, the access log analysis unit 114, the access log preprocessing unit 115, the activity time calculation unit 116, the file usage time calculation unit 117, and the relevance calculation unit 118.

The input/output unit 120 includes an operation input unit 121 for operations such as starting the file relevance management apparatus 100, and a display unit 122 that displays various data.
The storage unit 130 stores the degree of relevance calculated by the processing unit 110 in the relevance DB 131. When a file name is obtained from the communication control unit 140 or the operation input unit 121, the relevance DB 131 is consulted and the group of files related to the obtained file name is extracted.
The communication control unit 140 receives from the terminals 21 and 22 the file names resulting from a file search (file set F), and sends the extracted names of the related files to the terminals 21 and 22.

The history acquired by the access log collection unit 113 is now described with reference to FIG. 4 (see also FIG. 3 as appropriate). FIG. 4 shows an example of the history.

In FIG. 4, each history entry consists of an access time, an access type, an access ID, and a file name. The history shown in FIG. 4 contains only the elements needed to calculate the degree of relevance.
The access time is the time at which the file was accessed. Entry number 1, for example, indicates an access at 16:51:09 on May 2, 2007.
The access type is either open or close. An open indicates that the file was opened (made available for use), and a close indicates that the file was closed (use of it was ended).
The access ID identifies who made the access or which terminal it came from. For example, the access ID may be the login name used on the servers 11 and 12 (see FIG. 1), or the IP address or name assigned to a terminal.
The file name is the name given to the file. When the servers 11 and 12 have a file system that provides a hierarchical directory structure, the file name includes the directory path name.
FIG. 4 shows the history extracted for access ID A11. Even when entries for several access IDs are mixed together, the entries for a specific access ID can be extracted.
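A history entry with these four elements could be modeled as in the following Python sketch. The tab-separated text layout, the class and function names, and the path in the usage line are hypothetical; the source describes only the four elements, not a storage format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AccessRecord:
    time: datetime    # access time
    kind: str         # access type: "open" or "close"
    access_id: str    # login name, terminal IP address, or similar
    file_name: str    # file name, including the directory path if any


def parse_record(line: str) -> AccessRecord:
    """Parse one line of an assumed 'time<TAB>type<TAB>id<TAB>file' layout."""
    time_s, kind, access_id, file_name = line.rstrip("\n").split("\t")
    if kind not in ("open", "close"):
        raise ValueError(f"unknown access type: {kind!r}")
    return AccessRecord(datetime.strptime(time_s, "%Y-%m-%d %H:%M:%S"),
                        kind, access_id, file_name)


def records_for(records, access_id):
    """Extract the entries of one access ID from a mixed history."""
    return [r for r in records if r.access_id == access_id]


# Hypothetical line modeled on entry number 1 (the path is invented).
record = parse_record("2007-05-02 16:51:09\topen\tA11\t/home/a11/AAA.doc")
```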

Next, the flow of correcting the history and calculating the degree of relevance is described with reference to FIG. 5 (see also FIGS. 1 and 4 as appropriate). FIG. 5 shows this flow.

The following description assumes that the acquired history is the data shown in FIG. 4.
First, the file relevance management apparatus 100 acquires the access log (history) from the servers 11 and 12 (step S201).
Next, for a specific access ID (A11 in FIG. 4), it calculates the time during which work was being performed, that is, the activity time (activity time intervals) (step S202).

The calculation of the activity time in step S202 is now described in more detail with reference to FIG. 6. FIG. 6(a) shows the concept of the activity time calculation, FIG. 6(b) shows its processing flow, and FIG. 6(c) shows an example of its result.

The activity time is defined as follows: if an access time is recorded within a window of a predetermined time width, that window is regarded as time spent working. For example, with the predetermined time width set to 30 minutes, plotting the open access times of FIG. 4 as in FIG. 6(a) gives activity time intervals of 16:30 to 17:30 and 18:30 to 19:00.

FIG. 6(b) shows the flow for calculating the activity time for one window of the predetermined time width. First, the access log (history) of the access ID to be processed is taken from the access logs acquired in step S201 (step S301). Next, it is determined whether an access log entry exists within the preset time window, that is, whether any access time falls within the window (step S302). If an entry exists in the window (Yes in step S302), that window is regarded as activity time and its activity flag is set to 1 (step S303). If no entry exists in the window (No in step S302), processing for that window ends. Processing then returns to step S302 for the next window.
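The window-flagging flow can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are invented, windows are aligned to multiples of the width counted from a fixed midnight reference, and only the 16:51:09 access comes from the example history; the other two times are made up.

```python
from datetime import datetime, timedelta

WIDTH = timedelta(minutes=30)
REFERENCE = datetime(1970, 1, 1)  # assumed alignment origin (midnight)


def activity_windows(access_times, width=WIDTH):
    """Return the sorted start times of windows containing at least one access."""
    starts = set()
    for t in access_times:
        index = (t - REFERENCE) // width  # timedelta // timedelta gives an int
        starts.add(REFERENCE + index * width)
    return sorted(starts)


def merge_windows(starts, width=WIDTH):
    """Merge adjacent flagged windows into (start, end) activity intervals."""
    intervals = []
    for s in starts:
        if intervals and intervals[-1][1] == s:
            intervals[-1] = (intervals[-1][0], s + width)
        else:
            intervals.append((s, s + width))
    return intervals


times = [datetime(2007, 5, 2, 16, 51, 9),   # entry number 1 in the example
         datetime(2007, 5, 2, 17, 10, 0),   # illustrative
         datetime(2007, 5, 2, 18, 40, 0)]   # illustrative
intervals = merge_windows(activity_windows(times))
```

With these inputs the sketch yields the intervals 16:30 to 17:30 and 18:30 to 19:00, the same shape as the example given for a 30-minute width.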

FIG. 6(c) shows an example of the result. The time windows whose activity flag is 1 are the activity time intervals. The result is stored in the storage unit 130 (see FIG. 3), and the result of step S202 is passed to steps S206 and S207.
The activity flag need not be 1; any other distinguishable code may be used.

Returning to FIG. 5, in step S203 the access logs (history) acquired in step S201 are classified by file name. This step is needed in order to calculate the file usage time for each file name. The result of step S203 is passed to steps S205, S206, and S207.

Next, in step S204, statistical processing is applied to the access log acquired in step S201 to extract the extensions of files that release their lock immediately. "Releasing the lock immediately" refers to cases in which the open and close entries in the access log (history) are recorded at almost the same time. This happens because the file is read into memory when it is opened and the lock is then released. As a result, even though the file is actually in use, the history shows it open for almost zero seconds, and it would be treated as if it had not been used. To prevent this, the open-to-close durations are aggregated for each extension indicating the file type (for example, doc or txt) and averaged; if the average is at or below a predetermined value, files with that extension are made subject to correction. The extensions selected for correction are passed to step S207.
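The per-extension averaging could look like the following Python sketch. The function name, the `(file_name, open_seconds)` input layout, the one-second threshold, and the sample durations are all assumptions for illustration; the source says only that the average open duration per extension is compared with a predetermined value.

```python
import os
from collections import defaultdict


def quick_release_extensions(sessions, threshold_seconds=1.0):
    """Return the extensions whose mean open-to-close duration is at or below the threshold.

    sessions: iterable of (file_name, open_seconds) pairs, one per open/close pair.
    """
    totals = defaultdict(lambda: [0.0, 0])  # extension -> [sum of seconds, count]
    for file_name, seconds in sessions:
        ext = os.path.splitext(file_name)[1].lstrip(".").lower()
        totals[ext][0] += seconds
        totals[ext][1] += 1
    return {ext for ext, (total, n) in totals.items()
            if total / n <= threshold_seconds}


# Hypothetical durations: txt files are released at once, doc files are held open.
sessions = [("a.txt", 0.1), ("b.txt", 0.3), ("c.doc", 1200.0), ("d.doc", 0.2)]
extensions = quick_release_extensions(sessions)
```

Averaging per extension, rather than judging each open/close pair on its own, means a single short-lived doc session in this example does not flag the doc type.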

Next, preprocessing A (step S205) is described. Its input is the output of step S203, and it handles the case in which the close corresponding to an open is missing. In the history of FIG. 4, entry number 3 is such a case. A missing close can be found by checking whether a close exists for the file name that was opened. The open entry concerned is then deleted, so that every open in the history as a whole has a corresponding close.
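A minimal Python sketch of preprocessing A follows. The function name, the `(time, kind, file_name)` tuple layout, and the file names are hypothetical. Leaving a close without a preceding open untouched is also an assumption, since the source does not discuss that case.

```python
def drop_unmatched_opens(events):
    """Remove open entries that have no later close for the same file.

    events: list of (time, kind, file_name) tuples sorted by time.
    """
    pending = {}                  # file_name -> indexes of opens awaiting a close
    keep = [True] * len(events)
    for i, (_, kind, name) in enumerate(events):
        if kind == "open":
            pending.setdefault(name, []).append(i)
        elif kind == "close" and pending.get(name):
            pending[name].pop()   # pair this close with the most recent open
    for indexes in pending.values():
        for i in indexes:
            keep[i] = False       # these opens were never closed
    return [event for event, kept in zip(events, keep) if kept]


# Hypothetical history: CCC.xls is opened but never closed.
events = [(1, "open", "AAA.doc"), (2, "open", "CCC.xls"), (3, "close", "AAA.doc")]
cleaned = drop_unmatched_opens(events)
```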

Preprocessing B (step S206) determines the file usage time for a file that was left open even though it was not actually in use. Its inputs are the outputs of steps S202 and S203. It differs from preprocessing A in that a close corresponding to the opened file name does exist.
In the history of FIG. 4, entries number 4 and 13 are such a case. The file was opened at 16:54:15 on May 2, 2007 and closed at 8:37:56 on May 3, 2007. Moreover, 14 hours 6 minutes 34 seconds elapse between entries number 12 and 13, with no access in between.
When there is no access for such a long time, the file is judged not to have been in use.

Preprocessing B is described in more detail with reference to FIG. 7. FIG. 7(a) shows the concept of preprocessing B, FIG. 7(b) shows its processing flow, and FIG. 7(c) shows an example of its processing result.

First, to calculate the file usage time, preprocessing B sets the following conditions: (1) the usage time is the interval in which the time the file is open overlaps the activity time (activity time interval); and (2) when the time that is not activity time is equal to or longer than a predetermined time, only the history up to the activity time interval preceding that inactive interval is used.
As noted above, in the history of FIG. 4 the file DDD.doc of numbers 4 and 13 is the target.
In FIG. 7(a), as shown by the dotted horizontal bars 400, the intervals that overlap the activity time (activity time intervals) are determined to be the state in which the file is in use, that is, the file usage time. Open and close entries are added to the history accordingly.

FIG. 7(b) shows the processing flow of preprocessing B. First, the access log (history) for the access ID to be processed is acquired (step S401), and it is determined whether the time interval that is not activity time is shorter than the predetermined time (step S402). If it is shorter (Yes in step S402), the file is regarded as having been used during the activity time interval, and the open and close entries of the access log (history) are corrected (step S403). The process then returns to step S402 and repeats. If, on the other hand, the time interval that is not activity time (from the end of one activity time interval to the start of the next) is equal to or longer than the predetermined time (No in step S402), the access log entries for that access ID from that point on (after the time at which the predetermined time was reached) are deleted (step S404), and the process ends.
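The loop of steps S402 to S404 can be sketched as follows, under stated assumptions: times are plain seconds, the activity time intervals are given as a sorted, non-overlapping list of `(start, end)` pairs, and the function name `usage_intervals` is hypothetical. The 5-hour default is the predetermined time used in the embodiment.

```python
def usage_intervals(open_ts, close_ts, activity, max_gap=5 * 3600):
    """Clip one open/close period to the activity time intervals.

    Stops at the first inactive gap of max_gap seconds or more (step S404);
    everything after that gap is discarded.
    """
    result = []
    prev_end = None  # end of the last activity interval that was used
    for start, end in activity:
        if end <= open_ts:
            continue  # activity interval lies entirely before the open
        if start >= close_ts:
            break  # activity interval lies entirely after the close
        if prev_end is not None and start - prev_end >= max_gap:
            break  # long inactive gap: the file is treated as unused from here on
        # Step S403: the corrected open/close pair is the overlap with the interval.
        result.append((max(start, open_ts), min(end, close_ts)))
        prev_end = end
    return result
```

With a file left open overnight, only the overlaps before the long gap survive, which mirrors the deletion of number 13 in FIG. 7(c).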

FIG. 7(c) shows an example of the processing result. Numbers 3-1, 3-2, and 3-3 are corrected access log entries, adjusted so that they overlap the activity time. Because there is a period of more than 14 hours without access between numbers 12 and 13, number 13 is deleted. In this embodiment the predetermined time is set to 5 hours, but it is not limited to this value.

Next, returning to FIG. 5, preprocessing C (step S207) is described. Its input is the output of steps S202, S203, and S204.

Preprocessing C is described in more detail with reference to FIG. 8. FIG. 8(a) shows the concept of preprocessing C, FIG. 8(b) shows its processing flow, and FIG. 8(c) shows an example of its processing result.

First, to calculate the file usage time, preprocessing C targets files that have one of the extensions obtained from step S204 (html, tex, etc.) and sets the condition that, within each activity time interval, only the first open entry and the last close entry are kept and all other entries are deleted.
For example, in the history of FIG. 4, the file CCC.html of numbers 5, 6, 8, 9, 11, and 12 is the target.
In FIG. 8(a), as shown by the dotted horizontal bar 500, the interval between the first open entry and the last close entry within an activity time interval is determined to be the state in which the file is in use. The intermediate open and close entries are deleted accordingly.

FIG. 8(b) shows the processing flow of preprocessing C for one activity time interval. First, the access log (history) for the access ID to be processed is acquired (step S501), and it is determined whether the entry is the last access in the activity time interval (step S502). If it is the last access (Yes in step S502), the first open and the last close in the activity time interval are kept and the rest are deleted (step S503); steps S502 and S503 are then repeated in the same way for the other activity time intervals. If it is not the last access (No in step S502), the process returns to step S502 and continues.
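A compact sketch of preprocessing C follows. It assumes the entries of a single quick-release file are given as `(timestamp, op)` pairs and that the activity time intervals are `(start, end)` pairs; both layouts and the function name are illustrative, not taken from the source.

```python
def merge_within_activity(entries, activity):
    """Keep only the first open and the last close inside each activity interval.

    entries: list of (timestamp, op) for one file, op is "open" or "close".
    activity: list of (start, end) activity time intervals.
    """
    kept = []
    for start, end in activity:
        inside = [(t, op) for t, op in entries if start <= t <= end]
        opens = [t for t, op in inside if op == "open"]
        closes = [t for t, op in inside if op == "close"]
        # An interval contributes only if it holds an open followed by a close.
        if opens and closes and min(opens) <= max(closes):
            kept.append((min(opens), "open"))
            kept.append((max(closes), "close"))
    return kept
```

In the test data below, six entries spread over two activity intervals are reduced to four, in the same way that numbers 6 and 8 are removed in FIG. 8(c).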

FIG. 8(c) shows an example of the processing result. In the history of FIG. 4, numbers 6 and 8, which are neither the first open nor the last close, are deleted.

Returning to FIG. 5, after the history has been corrected by preprocessing A, preprocessing B, and preprocessing C, the file usage time is aggregated for each file name (step S208).
The file usage time is described here with reference to FIG. 9. FIG. 9(a) shows an example of a file usage time table for one file name, FIG. 9(b) shows an example of a file usage time table for another file name, and FIG. 9(c) shows an example of a co-occurrence time table for the periods in which both files are in use at the same time.
FIG. 9 assumes history acquired for the same access ID. The file usage time tables shown in FIGS. 9(a) and 9(b) associate each file usage start time with the length of the file usage time. The co-occurrence time table shown in FIG. 9(c) is obtained by matching the two file usage time tables and computing, for each period in which the file usage times overlap, the file co-occurrence start time (the start of the co-occurrence time) and the file co-occurrence time length.
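The matching of two file usage time tables into a co-occurrence time table amounts to intersecting intervals. The sketch below assumes each table is a list of `(start, length)` rows in seconds; the function name and row layout are illustrative.

```python
def co_occurrence_table(table_x, table_y):
    """Intersect two file usage time tables.

    Each table is a list of (usage_start, usage_length) rows.
    Returns sorted (co_occurrence_start, co_occurrence_length) rows.
    """
    rows = []
    for start_x, length_x in table_x:
        for start_y, length_y in table_y:
            start = max(start_x, start_y)
            end = min(start_x + length_x, start_y + length_y)
            if end > start:  # the two usage periods actually overlap
                rows.append((start, end - start))
    return sorted(rows)
```

The nested loop is quadratic in the table sizes, which is acceptable for a sketch; sorted tables would allow a linear merge.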

Returning again to FIG. 5, the calculation of the degree of relevance (step S209) is described next.
First, the file usage start times and file usage times are obtained from the file usage time tables of the two file names to be examined and matched against each other, and, as shown in FIG. 9(c), the file co-occurrence start time and the file co-occurrence time length of each co-occurring (overlapping) period are calculated. Then the cumulative file co-occurrence time T, the number of file co-occurrences K, the similarity P of the file use start pattern, and the co-occurrence interval degree D are obtained, and the degree of relevance R is calculated by the relevance calculation formula R = T^α K^β P^γ D^δ (0 ≤ α, β, γ, δ ≤ 1). Here α, β, γ, and δ are exponents that weight T, K, P, and D, respectively.
The formula for the degree of relevance is not limited to the one above; it may use any one of the variables T, K, P, and D described below, or a combination of them.

The cumulative file co-occurrence time T, the number of file co-occurrences K, the similarity P of the file use start pattern, and the co-occurrence interval degree D are now described in detail with reference to FIG. 10. FIG. 10 illustrates the variables of the relevance calculation formula.

In FIG. 10, the horizontal axis represents time, and the file usage time table data for file X and file Y are plotted. The co-occurrence time table of files X and Y is plotted in the top row.
The cumulative file co-occurrence time (co-occurrence time) T is T = Σ t_i (i = 1 to n), where t_i is the file co-occurrence time length. In FIG. 10, n is 4.
The number of file co-occurrences (co-occurrence count) K is K = n.
The similarity P of the file use start pattern is P = 1 / Σ p_i (i = 1 to n), where p_i is the difference between the file use start time and the co-occurrence start time. When p_i is 0, P = 1.
The co-occurrence interval degree D is D = Σ d_i(i+1) (i = 1 to n-1), where d_i(i+1) is the co-occurrence interval. When n = 1, D = 1.
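The definitions of T, K, P, and D combine into R = T^α K^β P^γ D^δ as sketched below. Several points are assumptions made for illustration: times are in seconds, the co-occurrence rows are `(start, length)` pairs sorted by start, the p_i values are supplied by the caller, P = 1 is applied when the sum of p_i is 0, and a pair with no co-occurrence is given R = 0 (the source does not define that case).

```python
def relevance(cooc, p, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Compute R = T^alpha * K^beta * P^gamma * D^delta.

    cooc: sorted list of (co_occurrence_start, co_occurrence_length).
    p: list of p_i, the differences between a file use start time and the
       corresponding co-occurrence start time.
    """
    n = len(cooc)
    if n == 0:
        return 0.0  # assumption: no co-occurrence means no relevance
    T = sum(length for _, length in cooc)  # cumulative co-occurrence time
    K = n  # number of co-occurrences
    total_p = sum(p)
    P = 1.0 if total_p == 0 else 1.0 / total_p  # similarity of use start pattern
    if n == 1:
        D = 1.0
    else:
        # Gap from the end of each co-occurrence to the start of the next one.
        D = sum(cooc[i + 1][0] - (cooc[i][0] + cooc[i][1]) for i in range(n - 1))
    return (T ** alpha) * (K ** beta) * (P ** gamma) * (D ** delta)
```

Setting an exponent to 0 removes the corresponding variable from the product, which is one way to realize the single-variable and partial-combination formulas the description allows.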

The variables T, K, P, and D are formulated to express the following ideas:
(1) The longer the cumulative co-occurrence time, the higher the relevance.
(2) The greater the number of co-occurrences, the higher the relevance.
(3) The more similar the file use start patterns, the higher the relevance.
(4) The longer the interval between co-occurrences, the higher the relevance.
Items (1) to (3) match the user's intuition based on actual usage. As for (4), when files co-occur again even after a long interval between one co-occurrence and the next, they are considered to be more closely related.

As shown in FIG. 11, the degree of relevance R is stored in the relevance DB 131 (see FIG. 3) of the storage unit 130 as a database indicating the degree of relevance between each pair of files.
The relevance DB 131 may be updated by periodically adding new history and performing the processing shown in FIG. 5 (steps S201 to S210). It may also be updated at any time as needed.

Next, the flow of the relevance calculation processing and the search processing in the file relevance management apparatus 100 is described with reference to FIG. 12. FIG. 12 shows the flow of the relevance calculation processing and the search processing in the file relevance management apparatus 100.

First, in the relevance calculation processing, the file relevance management apparatus 100 transmits an access log request in order to acquire the access log (history) from the servers 11 and 12 (step S601). On receiving the access log request, the servers 11 and 12 transmit the access log (step S602). The file relevance management apparatus 100 acquires the access log (step S603) and calculates, stores, and updates the degree of relevance (step S604).
Steps S601 to S604 may be performed periodically or whenever the need arises. When the functions of the file relevance management apparatus 100 are provided in the servers 11 and 12, steps S601 and S602 are unnecessary, and the processing starts directly from acquiring the access log (step S603).

Next, in the search processing, the terminals 21 and 22 instruct execution of a keyword search (step S605) and obtain the files containing the keyword (file set F) (step S606). The terminals 21 and 22 then transmit the file set F to the file relevance management apparatus 100 (step S607). The file relevance management apparatus 100 receives the information on the file set F, refers to the relevance DB 131 (see FIG. 3) using the file names in the file set F, and extracts the degrees of relevance of the file names related to those in the file set F (step S608). The file relevance management apparatus 100 then transmits the file names with a high degree of relevance to the terminals 21 and 22 (step S609), and the terminals 21 and 22 display them (step S610).
Steps S605 to S607 may instead be performed by the servers 11 and 12.
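Steps S608 and S609 can be pictured as a lookup over stored pairwise relevance values. The sketch below is only an illustration: the relevance DB is modeled as an in-memory dict keyed by file-name pairs, the function name is hypothetical, and excluding files that are already in the file set F from the result is an assumption not stated in the source.

```python
def related_files(file_set, relevance_db, threshold):
    """Return file names related to any file in file_set, most relevant first.

    relevance_db: dict mapping (file_a, file_b) -> degree of relevance R
    (hypothetical in-memory stand-in for the relevance DB 131).
    """
    best = {}
    for (a, b), r in relevance_db.items():
        # Treat each stored pair as symmetric.
        for hit, other in ((a, b), (b, a)):
            if hit in file_set and other not in file_set and r >= threshold:
                best[other] = max(best.get(other, 0.0), r)
    # Sort by descending relevance, breaking ties by file name.
    return sorted(best, key=lambda name: (-best[name], name))
```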

The output means recited in the claims corresponds to the processing unit 110 (see FIG. 2) of the file relevance management apparatus 100 executing steps S608 to S609.

In step S608, searching only for file names whose degree of relevance is at or above a preset threshold is also effective for shortening the search time.
To that end, the threshold R_th is calculated as follows.
First, let H be the set of files in the learning set. For each file h (h ∈ H), information is collected from the subjects who took part in the experiment for gathering learning data on whether they judged each co-occurrence to be relevant.
Let Q(h) be the set of files that co-occur with file h. Each element q_i of the file set Q(h) is labeled correct if the subject judged it to be relevant and incorrect if the subject judged it not to be relevant.
Next, among the elements of Q(h), the element q_k that is labeled correct and has the smallest degree of relevance to h is found, and the threshold R(h, q_k) for h is obtained. R(h, q_k) is calculated for every h in the file set H, and the average of these values is taken as R_th.
That is, the threshold R_th is calculated by
R_th = Σ_{h∈H} R(h, q_k) / |H|
where |H| is the number of files in the file set H.
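The threshold computation can be sketched as follows. The input layout (a dict from each learning file h to its judged co-occurring files) and the function name are assumptions. The source does not say what happens when some h has no element judged correct, so this sketch raises an error in that case rather than guessing.

```python
def relevance_threshold(judgments):
    """Average, over all h, of the smallest relevance among elements judged correct.

    judgments: dict mapping h -> list of (R(h, q), is_correct) for each q in Q(h).
    """
    minima = []
    for h, rows in judgments.items():
        correct = [r for r, is_correct in rows if is_correct]
        if not correct:
            # Undefined in the source; refusing is an assumption of this sketch.
            raise ValueError(f"no element judged correct for {h!r}")
        minima.append(min(correct))  # R(h, q_k)
    return sum(minima) / len(minima)  # divide by |H|
```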

(Evaluation experiment example)
Next, an evaluation experiment conducted in advance for the embodiment of the present invention is described.

(Experimental environment)
Access logs were collected using Samba, a Windows (registered trademark) compatible file server currently in wide use. Samba was used in the evaluation experiment because file usage times can be extracted without installing any special application in the user's environment; however, the proposed method can be applied in the same way to any file server other than Samba from which file open/close information can be obtained.
In the evaluation experiment, the file server Samba 2.2.3a was started at log level 2, and two subjects used it from their terminals for about four months. During this period the subjects were, among other things, preparing conference presentations, and access logs were collected for the text files of papers, image files, experimental data files, and the like. To ignore accesses to system files and the like, the extensions analyzed were bib, doc, gif, htm, html, jpg, mpg, mpeg, pdf, ppt, tex, txt, and xls. The activity time width was 30 minutes, the predetermined value for the time that is not activity time in preprocessing B was 5 hours, and the predetermined value for the average open time in step S204 was 10 seconds.

(Evaluation metrics)
Recall and precision, which are commonly used in the field of information retrieval systems, were used as the evaluation metrics.
Recall is a measure of coverage: of the relevant files (correct files) in the file group being searched, how many were retrieved. Precision is a measure of accuracy: of the files obtained as search results, how many are relevant.
Recall = number of relevant files retrieved / number of correct files among all target documents
Precision = number of relevant files retrieved / number of files retrieved
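The two metrics translate directly into set operations. In the sketch below the function name is illustrative, and returning 0.0 when a denominator is empty is an assumption made to avoid division by zero.

```python
def recall_precision(retrieved, correct):
    """Return (recall, precision) for a set of retrieved files.

    retrieved: files returned by the search.
    correct: files judged relevant (correct files).
    """
    retrieved, correct = set(retrieved), set(correct)
    hits = len(retrieved & correct)  # relevant files that were retrieved
    recall = hits / len(correct) if correct else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```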

(Evaluation results)
First, a file usage time table was created from each subject's access log. Next, co-occurrence times were calculated for files selected at random from all files. The subjects then evaluated each co-occurrence, marking it correct if they judged it to be related to their work and incorrect if they judged it to be unrelated.
Table 1 shows the number of files evaluated and the total number of files.

Next, half of the files were selected at random from the set of evaluated files shown in Table 1 to form the learning set. The degree of relevance R was calculated by R = T^α K^β P^γ D^δ. The exponents (α, β, γ, δ) of the degree of relevance R were set to the values that maximize the average recall over all files in the learning set. The relevance threshold G_th was also calculated. The results are shown in Table 2.

Using the exponents shown in Table 2, recall and precision were obtained for each file in the evaluation set. The results are shown in Table 3.

The results shown in Table 3 are considered acceptable for practical use. It thus becomes possible to retrieve files related to a task from a group of files that have no link relationships such as keywords.

FIG. 1 is a diagram showing the configuration of the file search system.
FIG. 2 is a diagram showing the hardware configuration of the file relevance management apparatus.
FIG. 3 is a diagram showing the functions of the file relevance management apparatus.
FIG. 4 is a diagram showing an example of the history.
FIG. 5 is a diagram showing the flow of the correction processing applied to the history and of the relevance calculation.
FIG. 6(a) is a diagram showing the concept of the activity time calculation, FIG. 6(b) is a diagram showing its processing flow, and FIG. 6(c) is a diagram showing an example of its processing result.
FIG. 7(a) is a diagram showing the concept of preprocessing B, FIG. 7(b) is a diagram showing its processing flow, and FIG. 7(c) is a diagram showing an example of its processing result.
FIG. 8(a) is a diagram showing the concept of preprocessing C, FIG. 8(b) is a diagram showing its processing flow, and FIG. 8(c) is a diagram showing an example of its processing result.
FIG. 9(a) is an explanatory diagram showing an example of a file usage time table for one file name, FIG. 9(b) is an explanatory diagram showing an example of a file usage time table for another file name, and FIG. 9(c) is an explanatory diagram showing an example of a co-occurrence time table for periods in which both files are in use at the same time.
FIG. 10 is a diagram showing the variables of the relevance calculation formula.
FIG. 11 is a diagram showing the degree of relevance between files.
FIG. 12 is a diagram showing the flow of the relevance calculation processing and the search processing in the file relevance management apparatus.

Explanation of symbols

1 File search system
11, 12 Server
21, 22 Terminal
30 Network
100 File relevance management apparatus
110 Processing unit
113 Access log collection unit
114 Access log analysis unit
115 Access log preprocessing unit
116 Activity time calculation unit
117 File usage time calculation unit
118 Relevance calculation unit
120 Input/output unit
130 Storage unit
131 Relevance DB
140 Communication control unit

Claims

A file search system for searching for files related to work from a group of files stored in a server, comprising:
relevance deriving means for deriving the degree of relevance between files from a history of accesses to the files; and
output means for outputting a group of files whose degree of relevance to a search file retrieved by a keyword is greater than or equal to a predetermined degree,
wherein the relevance deriving means refers to the history, calculates activity time intervals in which work was performed, calculates file usage times using files determined to be within the range of the activity time intervals, and calculates the degree of relevance based on the overlap of the file usage times,
the activity time interval is calculated by reading the access times and file names recorded in the history and, when access times are recorded within a predetermined time width, determining that work was performed within that predetermined time width,
the close corresponding to each open of a file recorded in the history is determined according to a predetermined condition, and
the file usage time is calculated as the time during which the period from the open to the close of the file, as determined according to the predetermined condition, overlaps the activity time interval.

The file search system according to claim 1, wherein the relevance deriving means,
for one file selected from all the files in the file group stored in the server and another file,
takes, when their file usage times overlap, the overlapping time as a co-occurrence time,
takes, when there are a plurality of co-occurrence times, their number as a co-occurrence count,
takes the period from the end of one co-occurrence time to the start of the next co-occurrence time as a co-occurrence interval,
takes the difference between the start time of the file usage time of the one file or the start time of the file usage time of the other file and the start time of the co-occurrence time as a use start pattern, and
performs a calculation based on at least one of the co-occurrence time, the co-occurrence count, the co-occurrence interval, and the use start pattern to calculate the degree of relevance between the selected one file and the other file.

The file search system according to claim 2, wherein the calculation uses a relevance calculation formula in which the cumulative co-occurrence time is the base and a predetermined value is the exponent.

The file search system according to claim 2, wherein the calculation uses a relevance calculation formula in which the co-occurrence count is the base and a predetermined value is the exponent.

The file search system according to claim 2, wherein the calculation uses a relevance calculation formula in which the reciprocal of the cumulative co-occurrence interval is the base and a predetermined value is the exponent.

The file search system according to claim 2, wherein the calculation uses a relevance calculation formula in which the cumulative use start pattern is the base and a predetermined value is the exponent.

The file search system according to claim 2, wherein the calculation uses a relevance calculation formula obtained by multiplying together formulas from the combination of: a relevance calculation formula in which the cumulative co-occurrence time is the base and a predetermined value is the exponent; a relevance calculation formula in which the co-occurrence count is the base and a predetermined value is the exponent; a relevance calculation formula in which the reciprocal of the cumulative co-occurrence interval is the base and a predetermined value is the exponent; and a relevance calculation formula in which the cumulative use start pattern is the base and a predetermined value is the exponent.

The file search system according to claim 1, wherein the predetermined condition is at least one of:
a first process of deleting an open entry from the history when the close entry corresponding to that open is missing;
a second process of determining, for a file that remains open, when a predetermined time or more elapses between the end of one activity time interval and the start of the next activity time interval, the time overlapping the earlier activity time interval as the period from the open of the file to the corresponding close; and
a third process of determining, for file types in which the interval between the time a file is opened and the time it is closed is less than or equal to a predetermined value, the period from the first open to the last close within the activity time interval as the period from the open of the file to the corresponding close.