JP5265420B2

JP5265420B2 - Document search system

Info

Publication number: JP5265420B2
Application number: JP2009063884A
Authority: JP
Inventors: 昌平阿部
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2009-03-17
Filing date: 2009-03-17
Publication date: 2013-08-14
Anticipated expiration: 2029-03-17
Also published as: JP2010218191A

Description

The present invention relates to a technique for searching for documents.

Techniques related to document search include, for example, those disclosed in Patent Documents 1 and 2.

According to Patent Document 1, one type of information among a keyword, a document ID, and a document fragment is first input, and the input information is sent to a similar-document database. A plurality of feature words contained in the document identified from that information are extracted from the similar-document database. Each extracted feature word is then sent to a keyword database. In this way, documents containing the feature words are retrieved.

According to Patent Document 2, a target document (search-condition document) is first input, a plurality of keywords are extracted from it, and a keyword search is performed using each keyword. The keyword search retrieves documents containing the keywords. Documents similar to the target document are then retrieved from among the documents found by that search (that is, a similar document search is performed).

Patent Document 1: JP 2002-222208 A
Patent Document 2: JP 2004-151959 A

In general, however, the accuracy of similar document search is low. In other words, the probability that a document retrieved by a similar document search is the document the user wants is low. Specifically, although a document that can be regarded as similar to the target document from a certain viewpoint may be retrieved, it is not uncommon for the retrieved document to differ considerably, on the whole, from what the user wants.

Accordingly, an object of the present invention is to improve the accuracy of similar document search.

Different document spaces, each following a different viewpoint, are defined in advance. Each document space has a plurality of similar categories determined based on the viewpoint of that document space. Each of the plurality of documents to be searched is classified into one of the similar categories in each of two or more document spaces.

The search means executes the following processes (A) to (D):
(A) based on the viewpoint of the document space that is the initial search range for the target document, identify the similar category of the target document in that document space;
(B) search the initial document space for documents classified into the same similar category as the identified one;
(C) in a document space different from the search range immediately preceding this process (C), search for documents classified into the same similar category as a document found by the process immediately preceding this process (C);
(D) determine whether a document found by process (C) has a predetermined relationship with the target document.
If the result of the determination in (D) is negative, the search means re-executes (C). If the result is affirmative, the search means determines that the document found by process (C) is a document similar to the target document.

By combining a plurality of similar document searches in this way, the low accuracy of each individual similar document search can be compensated for, and the accuracy of similar document search as a whole can be improved.

The predetermined relationship is, for example, that the document found by process (C) and a document classified into the target document's similar category in the initial document space are classified into the same similar category in the document space that was the search range immediately before process (C).

The "search range immediately before process (C)" is the search range used in process (B) or the search range used in the immediately preceding execution of process (C).

The documents retrieved in (B) and/or (C) above may be documents that satisfy a predetermined condition relating to a keyword. Here, a "document that satisfies a predetermined condition relating to a keyword" is, for example, a document that contains, or does not contain, k occurrences of the keyword (k is a natural number). Typically k is 1, but k may be 2 or more. For example, if the condition is "contains two occurrences of the keyword 'ABC'", a document containing only one occurrence of "ABC" does not qualify as a "document that satisfies a predetermined condition relating to a keyword".
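The keyword condition above can be sketched as a small predicate. This is a minimal illustration, not the patented implementation: the function name `meets_keyword_condition` and its parameters are hypothetical, and "contains k occurrences" is assumed here to mean "at least k", which the text does not state explicitly.

```python
def meets_keyword_condition(text: str, keyword: str, k: int = 1,
                            must_contain: bool = True) -> bool:
    """Check a document against a keyword condition.

    Assumption (not stated in the source): "contains k occurrences" is read
    as "at least k", and "does not contain" as "fewer than k".
    """
    if k < 1:
        raise ValueError("k must be a natural number (>= 1)")
    occurrences = text.count(keyword)  # non-overlapping occurrences
    return occurrences >= k if must_contain else occurrences < k


# Condition: "contains two occurrences of the keyword 'ABC'"
two_hits = meets_keyword_condition("ABC and ABC", "ABC", k=2)
one_hit = meets_keyword_condition("only ABC", "ABC", k=2)
```

With k = 2, the document containing "ABC" only once fails the "contains" condition, matching the example in the text.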

A "document" may be any data as long as it is electronic data containing characters (for example, it may be a document containing tables or images).

FIG. 1 shows a computer system having a mail audit system to which the document search system according to the first embodiment of the present invention is applied.
FIG. 2 shows an example of the flow of search processing in the first embodiment of the present invention.
FIG. 3 is an explanatory diagram of the secondary search in the search processing of the first embodiment of the present invention.
FIG. 4 shows a computer system having a mail audit system to which the document search system according to the second embodiment of the present invention is applied.
FIG. 5 is an explanatory diagram of search processing in the second embodiment of the present invention.
FIG. 6(A) is an explanatory diagram of a first example of a method for selecting at least one mail from the mail group classified into a similar category of a similar mail space (hereinafter, the mail selection method). FIG. 6(B) is an explanatory diagram of a second example of the mail selection method. FIG. 6(C) is an explanatory diagram of a third example of the mail selection method. FIG. 6(D) is an explanatory diagram of a fourth example of the mail selection method.
FIG. 7 shows an example of the flow of search processing in the second embodiment of the present invention.
FIG. 8 is a supplementary diagram for the description of the mail selection method shown in FIG. 6(D).

Several embodiments of the present invention are described in detail below with reference to the drawings, taking as an example the case where the document is an electronic mail (hereinafter simply "mail").

FIG. 1 shows a computer system having a mail audit system 103 to which the document search system according to the first embodiment of the present invention is applied.

Mail is sent and received between an in-house terminal 101 and an external terminal 105 via an in-house network 111 and an external network 112. The in-house terminal 101 and the external terminal 105 are, for example, personal computers or server machines. The in-house network 111 is, for example, a LAN (Local Area Network). The external network 112 is, for example, a network that includes the Internet or an external intranet different from the in-house network 111.

A mail server 107 and the mail audit system 103 are connected to the in-house network 111.

The mail server 107 stores the mails sent and received via the in-house network 111.

The mail audit system 103 checks mails that pass through the in-house network 111, in particular, for example, mails sent from inside the company to outside (so-called outbound mail).

The mail audit system 103 includes, for example, a CPU 131, storage resources (for example, a memory 132 and a storage device 135), and an interface device (communication I/F) 133 that controls communication with external devices.

The memory 132 stores, for example, various computer programs and various data used in processing performed by the CPU 131. One such computer program is a search program 321.

The CPU 131 can perform mail search by executing a computer program stored in the memory 132. Specifically, for example, the CPU 131 realizes the function of a search unit 301 by executing the search program 321.

The storage device 135 stores a similar mail database (similar mail DB) 441 and a keyword database (keyword DB) 442. The similar mail DB 441 is used for similar mail search. The keyword DB 442 is used for keyword search.

In the present embodiment, the search unit 301 searches the mail server 107 for mails using both a keyword and a target mail. The search unit 301 can, for example, receive a search instruction from a user via a user terminal capable of communicating with the mail audit system 103 (for example, the in-house terminal 101, or a communication terminal connected to the mail audit system 103 without going through the in-house network 111), perform search processing in response to that instruction, and output the search result to the user terminal. The output search result is displayed on the display screen of the user terminal.

FIG. 2 shows the flow of search processing performed by the search unit 301.

A keyword is input (S101). This keyword is, for example, a keyword entered by the user. It is of course not limited to this and may be, for example, a keyword selected arbitrarily from a keyword list stored in the memory 132.

The search unit 301 performs a primary search (keyword search) using the input keyword (S102). Specifically, for example, the search unit 301 searches the mails held by the mail server 107 for mails containing the input keyword. Suppose that, as a result, 10 mails (group A) are found among the 10,000 mails (group C) held by the mail server 107.

At least one mail is automatically selected from the 10 mails (group A) found by the search in S102 (S103). One possible selection criterion is how many occurrences of the keyword input in S101 a mail contains. Suppose here that mail X, which contains the most occurrences of that keyword, and mail Y, which contains the second most, are selected.

The mails selected here become the target mails. A "target mail" is a mail that serves as the subject (starting point) of a similar mail search.
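The automatic selection in S103 can be sketched as a ranking by keyword count. This is a minimal sketch under stated assumptions: the function name `select_target_mails`, the `top_n` parameter, and the tie-break by mail ID are hypothetical and are not taken from the patent.

```python
def select_target_mails(mails: dict[int, str], keyword: str,
                        top_n: int = 2) -> list[int]:
    """Return the IDs of the top_n mails with the most keyword occurrences.

    `mails` maps a mail ID to its body text (group A in the description).
    Ties are broken by ascending mail ID, an assumption made only so that
    the result is deterministic.
    """
    counts = {mail_id: body.count(keyword) for mail_id, body in mails.items()}
    hits = [mail_id for mail_id, count in counts.items() if count > 0]
    hits.sort(key=lambda mail_id: (-counts[mail_id], mail_id))
    return hits[:top_n]


group_a = {1: "ABC", 2: "ABC ABC ABC", 3: "ABC ABC", 4: "no keyword here"}
targets = select_target_mails(group_a, "ABC")  # mail X and mail Y
```

Here mail 2 plays the role of mail X and mail 3 the role of mail Y.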

The search unit 301 performs a secondary search, which combines a keyword search using the keyword with a similar mail search using the target mail (S104). In this secondary search, mails that do not contain the keyword input in S101 and that are similar to the target mail are retrieved. Suppose that, as a result, 18 mails similar to target mail X and 22 mails similar to target mail Y are found in group B (9,990 mails), that is, the mails of group C other than group A.

The search unit 301 outputs the search result (S105). The search result includes, for example, the number of search hits and a mail list. The number of search hits is, for example, the sum of the number of mails found in the primary search and the number of mails found in the secondary search. The mail list includes, for example, information on the mails found in the primary search (for example, the sender and recipient mail addresses and the mail body) and information on the mails found in the secondary search.

The above is the flow of the search processing. This flow is only an example, and any of the following modifications may be adopted. In S102, mails that do not contain the keyword input in S101 may be retrieved instead. In S103, the user may select the target mail from group A. In S104, a keyword different from the one input in S101 (for example, a keyword contained in the target mail selected in S103, or a keyword entered separately by the user) may be used. In S104, mails that contain the keyword may be retrieved.

The secondary search is described in detail with reference to FIG. 3.

The secondary search uses the similar mail DB 441 and the keyword DB 442.

The similar mail DB 441 is a table that represents the relationship between each similarity value and mails. The similar mail search in the present embodiment uses LSH (Locality Sensitive Hashing), an approximate nearest neighbor search technique. The similarity value of a mail is therefore the hash value of the mail.

The keyword DB 442 is a table that represents the relationship between each keyword and the mails containing that keyword.

In the present embodiment, each mail 451 in the existing mail group 443 (the mail group stored in the mail server 107) is registered in advance (for example, by a nightly batch) in the similar mail DB 441 and the keyword DB 442. Specifically, for the similar mail DB 441, the hash value of each mail 451 is calculated, and the ID of the mail 451 is appended to the field corresponding to that hash value (if the hash value is not yet registered in the similar mail DB 441, the hash value and the mail ID are registered). For the keyword DB 442, words are extracted from the mail 451, and the ID of the mail 451 is appended to the field corresponding to the keyword that matches each word (if an extracted word is not yet registered in the keyword DB 442, the word (keyword) and the mail ID are registered).
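The registration step can be sketched as building two inverted tables. This is an illustrative sketch only. The patent does not specify the LSH family, so a single-function MinHash over the word set (`toy_minhash`, using `zlib.crc32`) is swapped in as a stand-in; whitespace tokenization and all names are likewise assumptions.

```python
import zlib
from collections import defaultdict


def toy_minhash(body: str) -> int:
    """Stand-in LSH: one-function MinHash over the word set.

    Not the patent's hash. Assumes a non-empty body. Mails with the same
    word set always collide; others collide with probability equal to
    their Jaccard similarity under an ideal hash.
    """
    return min(zlib.crc32(word.encode("utf-8")) for word in set(body.split()))


def register_mails(mails: dict[int, str], hash_fn=toy_minhash):
    """Build the similar mail table and the keyword table from existing mails."""
    similar_db = defaultdict(list)  # hash value -> mail IDs
    keyword_db = defaultdict(list)  # keyword    -> mail IDs
    for mail_id in sorted(mails):
        body = mails[mail_id]
        similar_db[hash_fn(body)].append(mail_id)
        for word in set(body.split()):  # each word registered once per mail
            keyword_db[word].append(mail_id)
    return similar_db, keyword_db


mails = {1: "budget report draft", 2: "draft budget report", 3: "lunch on friday"}
similar_db, keyword_db = register_mails(mails)
```

Using `defaultdict(list)` mirrors the "register if absent, otherwise append" behaviour described above without a separate existence check.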

The secondary search consists of a similar mail search, a keyword search, and search result integration. In the similar mail search, the search unit 301 calculates the hash value of the target mail and obtains from the similar mail DB 441 all mail IDs corresponding to that hash value (the first mail ID group). In the keyword search, the search unit 301 obtains from the keyword DB 442 all mail IDs corresponding to the keyword (the second mail ID group). In the search result integration, if the condition is, for example, "does not contain the specified keyword", the search unit 301 obtains from the first mail ID group all mail IDs that are not in the second mail ID group (if the condition is "contains the specified keyword", the search unit 301 obtains from the first mail ID group all mail IDs that are also in the second mail ID group). The mail IDs obtained in this way are the IDs of the mails hit by the secondary search. If the target mail is target mail X of FIG. 2, the IDs of the mails hit by this secondary search are the IDs of the 18 mails mentioned above.
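The search result integration reduces to a set difference or a set intersection of the two mail ID groups. The sketch below assumes the two DBs are plain dictionaries; the function name `secondary_search` and the sample IDs are hypothetical.

```python
def secondary_search(target_hash: int, keyword: str,
                     similar_db: dict, keyword_db: dict,
                     exclude_keyword: bool = True) -> list[int]:
    """Combine a similar mail lookup with a keyword lookup.

    exclude_keyword=True  -> condition "does not contain the keyword"
    exclude_keyword=False -> condition "contains the keyword"
    """
    first_ids = set(similar_db.get(target_hash, []))  # first mail ID group
    second_ids = set(keyword_db.get(keyword, []))     # second mail ID group
    hits = first_ids - second_ids if exclude_keyword else first_ids & second_ids
    return sorted(hits)


similar_db = {48: [1, 4, 5, 7]}   # hash value -> mail IDs (made-up data)
keyword_db = {"ABC": [1, 7, 9]}   # keyword    -> mail IDs (made-up data)
without_kw = secondary_search(48, "ABC", similar_db, keyword_db)
with_kw = secondary_search(48, "ABC", similar_db, keyword_db, exclude_keyword=False)
```

With the made-up data, the "does not contain" condition yields mails 4 and 5, and the "contains" condition yields mails 1 and 7.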

The above is the description of the first embodiment. The similar mail search is not limited to one based on LSH; a similar mail search based on another technique may be adopted.

According to the first embodiment described above, the secondary search uses the similar mail search together with the keyword search, so the result of the similar mail search is given a certain direction. The probability of obtaining the mail the user wants can therefore be improved. Specifically, that probability improves in, for example, the first and second cases below.

First case: preventing search omissions in keyword-type auditing.
"Keyword-type auditing" is, for example, a technique in which sent mails containing an NG keyword registered in a previously prepared NG keyword list are extracted, and the body of each extracted mail is checked manually or by other means, in order to audit whether sent mail that is disadvantageous to the company (hereinafter, dangerous mail) has leaked outside the company. The present embodiment can be applied to this case as follows. The search unit 301 retrieves mails containing the NG keyword "ABC" in the primary search, takes a mail (dangerous mail) selected from the mail group found by the primary search as the target mail, and retrieves in the secondary search "mails that are similar to the target mail and do not contain the keyword 'ABC'". In this way, mails similar to a dangerous mail containing the NG keyword "ABC" can be found among the mails that do not contain the NG keyword "ABC". That is, the comprehensiveness of the search is ensured, and leakage of dangerous mail is detected more reliably.

Second case: narrowing down the results of a keyword search.
The present embodiment can be applied to this case as follows. The search unit 301 retrieves mails containing the keyword "DFG" supplied by the user in the primary search, takes a mail (a mail the user wants) selected from the mail group found by the primary search as the target mail, and retrieves in the secondary search "mails that are similar to the target mail and contain the keyword 'DFG'". That is, the results are narrowed down by similar mail search in addition to the keyword "DFG" entered by the user. The search results can thus be narrowed down so that fewer mails other than those the user wants remain, without the user having to enter additional keywords.

A second embodiment of the present invention is now described. The description focuses on the differences from the first embodiment; points in common with the first embodiment are omitted or simplified. In the following description, the mail with mail ID = x (x is an integer) is written as "mail #x".

FIG. 4 shows a computer system having a mail audit system 803 to which the document search system according to the second embodiment of the present invention is applied.

A plurality of similar mail DBs 441 are provided in the storage device 135. In other words, a plurality of similar mail spaces are defined in the present embodiment. Here it is assumed that similar mail spaces A and B are defined, and that similar mail DBs 441A and 441B are accordingly provided as the similar mail DBs 441.

A search unit 801 realized by the CPU 131 (a search program 821 executed by the CPU 131) has, instead of or in addition to the functions of the search unit 301, a function of performing a similar mail search that traverses the similar mail spaces.

FIG. 5 is an explanatory diagram of the similar mail search performed by the search unit 801.

The similar mail spaces A and B are spaces defined on the basis of different LSH similarity models. That is, the hash values (categories) in similar mail space A (the first similar mail DB 441A) and the hash values (categories) in similar mail space B (the second similar mail DB 441B) are calculated according to different similarity models. For example, hash values for similar mail space A are obtained by a method following similarity model A, and hash values for similar mail space B are obtained by a method following similarity model B.

For each mail in the mail group stored in the mail server 107 (the existing mail group), a hash value is calculated in advance under each of the similarity models of similar mail spaces A and B. Each mail is then classified in both similar mail space A and similar mail space B according to its respective hash values.

When a target mail is input, the search unit 801 can, for example, retrieve mail #4 as a mail similar to the target mail by performing the following processes (5-1) to (5-7).
(5-1): According to similarity model A of similar mail space A, which is the initial search range for the target mail, the hash value of the target mail (= 48) is calculated.
(5-2): The mail group corresponding to hash value = 48 is retrieved from similar mail space A.
(5-3): Mail #5 is selected by a predetermined method from the mail group found by the search in (5-2).
(5-4): Using ID = 5 of mail #5 selected in (5-3) as a key, similar mail space B, which is a search range different from the one immediately before (5-4), is referenced. The mail group of hash value = 948, into which mail #5 is classified, is thereby found in similar mail space B.
(5-5): Mail #8 is selected by a predetermined method from the mail group found in (5-4).
(5-6): Using ID = 8 of mail #8 selected in (5-5) as a key, similar mail space A, which is a search range different from the one immediately before (5-6), is referenced. The mail group of hash value = 18, into which mail #8 is classified, is thereby found in similar mail space A.
(5-7): Mail #4, which is included in the mail group found in (5-6), has a hash value of 483 under similarity model B. In similar mail space B, which is the search range immediately before (5-6), mail #1 is also classified under hash value = 483. Mail #1 is a mail classified under the same hash value (= 48) as the target mail in similar mail space A, the initial search range. It follows that mail #4 is similar to mail #1, and mail #1 is similar to the target mail. Mail #4 is therefore determined to be a mail similar to the target mail.
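The walk-through above can be reproduced with a short sketch that alternates between the spaces. Everything beyond the hash values and mail IDs quoted in the text is an assumption: each space is modelled as a dictionary from mail ID to category, already-visited mails and the initial group are excluded from the candidates, and the selection rule (largest candidate ID) is chosen only so that the sketch follows the same path as the example; the patent's own selection methods are those of FIG. 6.

```python
def trace_similar(target_category, spaces, select=max, max_hops=10):
    """Traverse similar mail spaces starting from the target's category.

    spaces: list of dicts {mail ID: category}; spaces[0] is the initial
    search range. Returns the mails judged similar to the target mail.
    """
    def group(space, category):
        return {m for m, c in space.items() if c == category}

    current = 0
    initial_group = group(spaces[current], target_category)  # (5-1), (5-2)
    candidates = set(initial_group)
    visited = set()
    for _ in range(max_hops):
        if not candidates:
            break
        pivot = select(candidates)                            # (5-3), (5-5)
        visited.add(pivot)
        previous, current = current, (current + 1) % len(spaces)
        # (5-4), (5-6): mails sharing the pivot's category in the other space
        found = group(spaces[current], spaces[current][pivot])
        found -= visited | initial_group
        # (5-7): does a found mail share a category, in the previous space,
        # with a mail from the target's initial group?
        initial_categories = {spaces[previous][m] for m in initial_group}
        similar = {m for m in found if spaces[previous][m] in initial_categories}
        if similar:
            return sorted(similar)
        candidates = found
    return []


space_a = {1: 48, 5: 48, 8: 18, 4: 18}      # hash values under model A
space_b = {5: 948, 8: 948, 4: 483, 1: 483}  # hash values under model B
result = trace_similar(48, [space_a, space_b])
```

With the values from the example, the traversal goes from hash value 48 in space A to mail #5, then to hash value 948 in space B and mail #8, then back to hash value 18 in space A, where mail #4 passes the check against mail #1.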

In this way, it is possible to find mail #4, which is dissimilar to the target mail in similar mail space A, the initial search range, but similar to the target mail in the other similar mail space B.

A mail similar to the target mail may not be found by referring to only the single similar mail space A. Put another way, a mail found by referring to only similar mail space A on the basis of the target mail is not necessarily similar to the target mail.

In this embodiment, a plurality of similar mail spaces following a plurality of similarity models are defined, each mail is classified in each similar mail space, and the similar mail spaces are traversed starting from the target mail (in the example above, similar mail spaces A and B are referenced alternately). That is, mails similar to the target mail are searched for from a plurality of viewpoints. This makes it possible to retrieve mail #4, which is in fact similar to the target mail. In other words, the accuracy of similar mail search as a whole can be improved.

The above processing is only an example, and any of the following modifications may be adopted.

For example, although similar mail space A is referenced first in the example above, which similar mail space is referenced first may be defined in advance, or may be changed randomly or by some other method.

As another example, when the mail group classified under the relevant hash value is retrieved (for example, in the searches of (5-2) and (5-4) above), a keyword search may be used together with it, as in the first embodiment. Specifically, in the search of (5-2), for example, mails that are classified under hash value = 48 and do not contain (or do contain) the keyword "HIJ" are retrieved. The keyword "HIJ" may be a keyword extracted from the target mail, or a keyword that is not contained in the target mail and was entered by the user. Likewise, in the search of (5-4), for example, mails that are classified under hash value = 948 and do not contain (or do contain) the keyword "KLM" are retrieved. The keyword "KLM" may be a keyword extracted from the target mail or from any of the mails retrieved (or selected) so far (for example, any mail classified under hash value = 48, or the selected mail #5), or a keyword entered separately by the user.

Also, for example, keyword search may be used for the final narrowing down. That is, from the group of mails similar to the target mail that is obtained by tracing the similar mail spaces starting from the target mail, mails that do not contain (or do contain) the keyword "NOP" may be retrieved. The keyword "NOP" may be a keyword extracted from the target mail or from any mail retrieved (or selected) so far (for example, any mail classified under hash value = 48 in space A, any mail classified under hash value = 483 in space B, or the selected mail #5 or mail #8), or a keyword entered separately by the user.

Also, for example, keyword search need not be used at all when tracing the similar mail spaces. In other words, the keyword DB 442 may be omitted.

The plurality of similar mail spaces may also include similar mail spaces that follow different kinds of similar mail search methods. In terms of the above example, in addition to the similar mail spaces that follow LSH, a similar mail space that follows another similar mail search method may be included (for example, a space whose categories are not hash values but mail types such as "business" and "private").
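As a minimal sketch of how spaces of different kinds might be held side by side, each space can be treated as a mapping from mail ID to category, whatever the category type is. The hash values for space A follow the example in this description; `space_c`, its labels, and the helper `build_index` are hypothetical and are not part of the disclosed system.

```python
from collections import defaultdict

def build_index(space):
    """Invert a mail-ID -> category mapping into category -> set of mail IDs."""
    index = defaultdict(set)
    for mail_id, category in space.items():
        index[category].add(mail_id)
    return dict(index)

# Space A: categories are LSH hash values (values taken from the example).
space_a = {1: 48, 5: 48, 8: 18, 4: 18}
# Space C: categories are mail types (hypothetical labels for illustration).
space_c = {1: "business", 4: "business", 5: "private", 8: "business"}

index_a = build_index(space_a)
index_c = build_index(space_c)
```

Because both spaces share the same shape, the same lookup by mail ID works regardless of whether the category is a hash value or a label.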

Incidentally, the "predetermined method" in (5-3) and (5-5) (that is, the method of selecting one mail #5 from hash value = 48 of similar mail space A, and the method of selecting one mail #8 from hash value = 948 of similar mail space B) is, for example, any of the methods (A) to (D) shown in FIG. 6.

The method of FIG. 6(A) selects by keyword. In the example of FIG. 6(A), mail #5, which does not contain (or does contain) the input keyword, is selected from the mail group classified under hash value = 48 in space A. The input keyword may be a keyword extracted from the target mail or from any mail retrieved (or selected) so far, or a keyword entered separately by the user.
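The keyword-based selection of FIG. 6(A) can be sketched as a simple filter. This is only an illustration under assumptions: the mail bodies, the keyword, and the function name `select_by_keyword` are hypothetical, and plain substring matching stands in for whatever keyword search the keyword DB actually performs.

```python
def select_by_keyword(mails, keyword, include=True):
    """Return IDs of mails whose body contains (include=True) or lacks
    (include=False) the keyword. Substring matching is an assumption."""
    return [mail_id for mail_id, body in mails.items()
            if (keyword in body) == include]

# Hypothetical bodies for the group classified under hash value 48 in space A.
group_48 = {
    1: "quarterly report attached",
    5: "please settle the advance payment",
    7: "meeting moved to Friday",
}

containing = select_by_keyword(group_48, "advance payment", include=True)
lacking = select_by_keyword(group_48, "report", include=False)
```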

The method of FIG. 6(B) is manual selection by the user. In the example of FIG. 6(B), the user browses each mail classified under hash value = 48 (for example, browses the body of each mail) and selects the desired mail #5.

The method of FIG. 6(C) selects by full-text word search. In the example of FIG. 6(C), various words are extracted from all mails classified under hash value = 48, word statistics for each mail and word statistics for the entire mail group are calculated, and mail #5 is selected on the basis of both. Specifically, for example, the statistics for the entire mail group show that the word "立て替え" (paying on another's behalf) occurs most often, and among the mails classified under hash value = 48 the mail in which that word occurs most often is mail #5, so mail #5 is selected. The full-text word search may also take into account the words contained in the target mail.
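The word-statistics selection of FIG. 6(C) might be sketched as follows. The sketch assumes whitespace tokenization and English stand-in text; real Japanese mail would need a morphological analyzer, and the mail bodies and the name `select_by_word_stats` are hypothetical.

```python
from collections import Counter

def select_by_word_stats(mails):
    """Pick the mail that contains the group's most frequent word most often.

    Returns (most_frequent_word, selected_mail_id).
    """
    # Word statistics for each mail.
    per_mail = {mail_id: Counter(body.lower().split())
                for mail_id, body in mails.items()}
    # Word statistics for the entire mail group.
    overall = Counter()
    for counts in per_mail.values():
        overall.update(counts)
    top_word, _ = overall.most_common(1)[0]
    # The mail in which the top word occurs most often is selected.
    selected = max(per_mail, key=lambda mail_id: per_mail[mail_id][top_word])
    return top_word, selected

# Hypothetical bodies for the group classified under hash value 48.
group_48 = {
    1: "advance invoice sent",
    5: "advance advance advance payment",
    7: "invoice for advance",
}
result = select_by_word_stats(group_48)
```

Here "advance" stands in for the word "立て替え" of the example, so the sketch selects mail 5.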

The method of FIG. 6(D) selects the mails that lie within a radius R of the input mail, where the radius R is adjusted so that the number of mails obtained is P (P is a natural number). The "input mail" is the target mail in (5-2) and mail #5 in (5-4). As shown in FIG. 8, the input mail is converted into a vector and projected into the similar mail space. The conversion into a vector may be performed in any way; here, for example, the words contained in the mail text are used. Each word in a word list prepared separately in advance is checked against the mail text, and the corresponding element is set to "1" if the word is contained in the mail text and to "0" if it is not. This forms the vector. Different similar mail spaces are formed by changing the kinds of words in the list. In the example of FIG. 6(D), the radius R centered on the target mail W in the similar vector space is reduced from R1 to R2, which narrows the mail group classified under hash value = 48 down to the single mail #5.

Besides words, the elements used for the vector conversion may include the transmission time of the mail, attached files, or the transmission form (whether the mail is a new mail, a forward, a reply, and so on). Even when words are used, the number of occurrences of each word may be used instead of its presence or absence.
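The vector conversion and radius adjustment of FIG. 6(D) can be illustrated with a short sketch. Several things here are assumptions not stated in the description: the word list, the whitespace tokenization, the use of Euclidean distance, and the function names. The radius is shrunk to the smallest value that still contains P mails; if several mails tie at that distance, more than P are returned.

```python
import math

# Hypothetical word list prepared in advance; changing it yields a different space.
WORD_LIST = ["advance", "invoice", "meeting", "report"]

def to_vector(body):
    """Binary vector: 1 if the listed word occurs in the mail text, else 0."""
    words = set(body.lower().split())
    return [1 if word in words else 0 for word in WORD_LIST]

def distance(u, v):
    """Euclidean distance (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_within_radius(input_body, mails, p):
    """Return (radius, mail_ids) where radius is the smallest R covering p mails.

    Assumes 1 <= p <= len(mails).
    """
    center = to_vector(input_body)
    ranked = sorted((distance(center, to_vector(body)), mail_id)
                    for mail_id, body in mails.items())
    radius = ranked[p - 1][0]
    return radius, [mail_id for d, mail_id in ranked if d <= radius]

group_48 = {
    1: "report meeting",
    5: "advance invoice report",
    7: "meeting invoice",
}
r2, narrowed = select_within_radius("advance invoice", group_48, p=1)
r1, wider = select_within_radius("advance invoice", group_48, p=2)
```

With P = 1 the radius shrinks until only mail 5 remains, which mirrors reducing R1 to R2 in the figure.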

At least one of the methods of FIGS. 6(A) to 6(D) may also be used to narrow down the results of searching the mail group classified under the relevant hash value (for example, the results of the searches of (5-2) and (5-4) above). For example, if 10,000 mails are classified under hash value = 48 in similar mail space A, the first space for the target mail, at least one of the methods of FIGS. 6(A) to 6(D) may be used to narrow down those 10,000 mails.

FIG. 7 shows the flow of the search processing performed by the search unit 801. To make the description easier to follow, the hash values and mail IDs shown in FIG. 5 are used below where appropriate.

When the target mail is input (S701), the search unit 801 calculates the hash value of the target mail (for example, 48) according to the similarity model of the similar mail space (space A) that is the first search range for the target mail (S702).
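The description only states that the hash value follows the similarity model of the space (LSH in the example) and does not specify the hash function. As one possible illustration, the sketch below uses sign-of-projection hashing against a fixed set of hyperplanes, a common LSH family that is swapped in here as an assumption. The hyperplanes in `PLANES_A` and the name `lsh_hash` are hypothetical; a different set of hyperplanes would play the role of a different similarity model.

```python
def lsh_hash(vector, hyperplanes):
    """Sign-of-projection LSH: one bit per hyperplane, packed into an integer."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * w for v, w in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

# Hypothetical hyperplanes standing in for the similarity model of space A.
PLANES_A = [
    [1, -1, 0, 0],
    [0, 1, -1, 0],
    [0, 0, 1, -1],
]

h_target = lsh_hash([1, 1, 0, 0], PLANES_A)
h_scaled = lsh_hash([2, 2, 0, 0], PLANES_A)   # same direction, same bucket
h_other = lsh_hash([1, 1, 0, 1], PLANES_A)
```

Vectors pointing in the same direction fall into the same bucket, which is the property S703 relies on when it looks up mails sharing the target mail's hash value.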

Next, the search unit 801 searches the first similar mail space (space A) for the mail group classified under the same hash value as the calculated hash value (S703).

Next, the search unit 801 searches a similar mail space (space B) different from the similar mail space of the immediately preceding search range (space A) for the mail group classified under the hash value to which the mail (mail #5) selected from the mail group found by the immediately preceding search belongs (S704). Specifically, for example, the search unit 801 refers to the other similar mail space using the mail ID of the selected mail as a key, and thereby retrieves from that space the mail group classified under the hash value to which the selected mail belongs. Here, the "selected mail" is a mail selected, by any of the methods of FIGS. 6(A) to 6(D), from the mail group found by the immediately preceding search (the same applies in the description below).

Next, the search unit 801 determines whether the mail (mail #8) selected from the mail group found by the search of S704 has a predetermined relationship with the target mail (S705). In other words, the search unit 801 determines whether the similar mail search may be ended.

If the result of the determination in S705 is negative (S705: NO), the search unit 801 executes S704 again. Specifically, for example, because the search range of the previous S704 was similar mail space B, the search unit 801 refers to the other similar mail space A using ID = 8 of the mail selected the previous time as the key. This finds, in similar mail space A, the mail group classified under hash value = 18, to which mail #8 belongs.

Thus, the search unit 801 repeats S704 until the determination in S705 yields an affirmative result. Accordingly, the "immediately preceding search range" in S704 is either the search range of S703 (similar mail space A, the first search range) or the search range of the previous execution of S704.

If the result of the determination in S705 is affirmative (S705: YES), the search unit 801 determines that the mail (mail #4) found by the last search of S704 is a mail similar to the target mail (S706).

Here, the "predetermined relationship" in the determination of S705 is that the mail found by the search of S704 (mail #4) and a mail (mail #1) classified under the hash value (= 48) of the target mail in its first search range (similar mail space A) are classified under the same hash value (= 483) in the search range immediately preceding that S704 (similar mail space B).

This search processing carries out the search described with reference to FIG. 5. That is, similar mail spaces A and B are referred to alternately, and as a result mail #4, which is in fact similar to the target mail, is retrieved.
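The flow of S703 to S706 can be sketched as a loop that alternates between two spaces. This is an illustrative reading of the flow, not the disclosed implementation: the function names, the `max_steps` guard, the exclusion of already visited mails, and the use of `max` as a stand-in for the FIG. 6 selection methods are all assumptions. The hash values and mail IDs follow the example in this description.

```python
def group_of(space, hash_value):
    """IDs of all mails classified under hash_value in the given space."""
    return {mail_id for mail_id, h in space.items() if h == hash_value}

def related(candidate, seeds, space):
    """S705: does the candidate share a hash value in `space` with a seed mail?"""
    return any(space[candidate] == space[s] for s in seeds if s != candidate)

def trace(spaces, target_hash, select, max_steps=10):
    """Alternate between spaces[0] and spaces[1] until S705 is affirmative."""
    current, other = 0, 1
    seeds = group_of(spaces[current], target_hash)            # S703
    visited = set()
    picked = select(seeds)
    visited.add(picked)
    for _ in range(max_steps):                                # guard (assumption)
        current, other = other, current                       # switch spaces
        group = group_of(spaces[current], spaces[current][picked])  # S704
        candidates = group - visited
        if not candidates:
            return None
        picked = select(candidates)
        visited.add(picked)
        # S705 is checked in the range immediately preceding this S704.
        if related(picked, seeds, spaces[other]):
            return picked                                     # S706
    return None

space_a = {1: 48, 5: 48, 8: 18, 4: 18}
space_b = {1: 483, 4: 483, 5: 948, 8: 948}
found = trace([space_a, space_b], target_hash=48, select=max)
```

With these values the loop selects mail 5 in space A, mail 8 in space B, and then mail 4 in space A, at which point mail 4 and mail 1 share hash value 483 in space B and the search ends.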

As described above, according to the second embodiment, combining a plurality of similar mail searches compensates for the low accuracy of each individual similar mail search and improves the accuracy of similar mail search as a whole.

The embodiments of the present invention described above are examples for explaining the present invention and are not intended to limit the scope of the present invention to those embodiments. The present invention can be implemented in various other modes without departing from its gist.

103, 803 ... Mail audit system

Claims

Different document spaces according to different perspectives,
A target document input means for inputting the target document;
Search means for searching a document similar to the target document from a plurality of documents,
Each document space has a number of similar categories determined from the perspective of that document space,
Each document is classified into one of the similar categories in any two or more document spaces,
The search means executes the following processes (A) to (D):
(A) identifying a similar category of the target document in the document space that is the first search range of the target document, based on the viewpoint of that document space;
(B) searching the first document space for documents classified in the same similar category as the identified similar category;
(C) searching, from a document space different from the search range immediately before the process of (C), for a document classified in the same similar category as the document found by the process immediately before (C); and
(D) determining whether the document found by the process of (C) has a predetermined relationship with the target document,
If the result of the determination in (D) is negative, the search means re-executes the process in (C),
If the result of the determination in (D) is affirmative, the search means determines that the document found by the process in (C) is a document similar to the target document.
Document search system.

The predetermined relationship is that the document found by the process of (C) and a document classified in the similar category of the target document in the first document space are classified in the same similar category in the document space that was the search range immediately before the process of (C),
The document search system according to claim 1.

The document found from the search range immediately before the process (C) is a document in which two or more documents found from the search range are narrowed down using keywords.
The document search system according to claim 1 or 2.

At least one document space is a space based on a similarity model of LSH (Locality Sensitive Hashing),
Each similarity category is a hash value,
The document found from the search range immediately before the process of (C) is, among a plurality of documents having the same hash value as the document input for the search in (C), a document belonging to a range within an adjusted radius R centered on that input document,
The document search system according to claim 1 or 2.

The document found by the processing of (B) and / or (C) is a document that meets a predetermined condition regarding keywords.
The document search system according to any one of claims 1 to 4.

A computer program for causing a computer to execute a step of inputting a target document and
a step of searching for a document similar to the target document from a plurality of documents,
There are different document spaces that follow different perspectives, and each document space has multiple similar categories that are determined based on that document space perspective,
Each document is classified into one of the similar categories in any two or more document spaces,
In the searching step, the following processes (A) to (D) are executed:
(A) identifying a similar category of the target document in the document space that is the first search range of the target document, based on the viewpoint of that document space;
(B) searching the first document space for documents classified in the same similar category as the identified similar category;
(C) searching, from a document space different from the search range immediately before the process of (C), for a document classified in the same similar category as the document found by the process immediately before (C); and
(D) determining whether the document found by the process of (C) has a predetermined relationship with the target document,
If the result of the determination in (D) is negative, re-execute the process in (C),
If the result of the determination in (D) is affirmative, the document found by the process in (C) is determined as a document similar to the target document.
Computer program.