JP2010231252A - System and method of detecting containment of email content

Info

Publication number: JP2010231252A
Application number: JP2009074811A
Authority: JP
Inventors: Guy Barry Owen Bunker; ガイ・バリー・オーウェン・バンカー; Tsuen Wan Ngan; ツェン・ワン・ガン
Original assignee: Symantec Corp
Current assignee: NortonLifeLock Inc
Priority date: 2009-03-25
Filing date: 2009-03-25
Publication date: 2010-10-14
Anticipated expiration: 2029-03-25
Also published as: JP5731740B2

Abstract

PROBLEM TO BE SOLVED: To provide a system and a method of detecting containment of email content.

SOLUTION: The method includes generating a first set of hash values that includes a hash value for each of a plurality of character strings of a first email document, generating a second set of hash values that includes a hash value for each of a plurality of character strings of a second email document, and determining whether the first set of hash values is a subset of the second set of hash values.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to electronic mail systems and, more particularly, to detecting content containment in email documents.

It is often desirable to efficiently find similar emails in a database. For example, in electronic discovery (e-discovery) for litigation, a large database of emails may need to be searched to determine which emails are relevant to a case. Searching a large database and comparing emails to identify potentially similar ones can be a problematic and tedious process. One approach to comparing emails for similarity is to compute a hash value from the content of each email and then compare the hash values for equality. Unfortunately, such an approach generally identifies only emails that are exact duplicates; any difference between two emails generally results in different hash values. Another possible approach is to compare every word of one email with the words of another email to determine similarity. However, such an approach is generally very computationally intensive.

Emails are often near duplicates of one another because they are forwarded or replied to without much text being added. When an original email is repeatedly replied to and/or forwarded, it may be desirable to find only the last email in the chain, because the last email often contains all of the content of the preceding emails. Thus, in an e-discovery setting, it is preferable to find the last email in a chain of reply emails so that the smallest number of emails can be reviewed without missing any information.

Systems and methods for detecting containment of email content are disclosed. In one embodiment, a method includes generating a first set of hash values corresponding to a first email document, where the first set of hash values includes a respective hash value for each of a plurality of character strings of the first email document. The method further includes generating a second set of hash values corresponding to a second email document, where the second set of hash values includes a respective hash value for each of a plurality of character strings of the second email document. Finally, the method includes determining whether the first set of hash values is a subset of the second set of hash values.

In certain embodiments, the method may further include generating a first Bloom filter representing the first set of hash values corresponding to the first email document, generating a second Bloom filter representing the second set of hash values corresponding to the second email document, and comparing the first Bloom filter with the second Bloom filter. The first and second Bloom filters can be compared by performing a bitwise OR operation. In various embodiments, the method further includes providing an indication, based on the result of the determining, of whether the content of the first email document is contained in the second email document.

FIG. 1 is a block diagram of a computer system including an email database and containment detection code.

FIG. 2 is a flowchart of one embodiment of a method for detecting content containment in email documents.

FIG. 3 shows the content of two exemplary emails.

FIG. 4 shows the two exemplary emails with extraneous content removed.

FIG. 5 shows exemplary hashes.

FIG. 6 is a flowchart of one embodiment of a method for comparing hash values using a Bloom filter approach.

FIG. 7 shows exemplary Bloom filters.

FIG. 8 shows an exemplary bitwise OR comparison of Bloom filters.

While the invention is susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are described in detail herein. It should be understood, however, that the drawings and detailed description are not intended to limit the invention to the particular forms disclosed; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the claims. Note that, in this application, the word "may" is used in a permissive sense (i.e., having the potential to, being able to), not in a mandatory sense (i.e., must).

Referring now to FIG. 1, a block diagram of one embodiment of a computer system 100 is shown. Computer system 100 includes a storage subsystem 110 coupled to a processor subsystem 150. Storage subsystem 110 stores an email database 120 and containment detection code 130. Computer system 100 may be any of various types of devices, including, but not limited to, a personal computer system, desktop computer, laptop or notebook computer, mainframe computer system, handheld computer, workstation, network computer, or a consumer device such as a mobile phone, pager, or personal digital assistant (PDA). Computer system 100 may also be any type of networked peripheral device, such as a storage device, switch, modem, or router. Although a single computer system 100 is shown in FIG. 1, system 100 may also be implemented as two or more computer systems operating together.

Processor subsystem 150 represents one or more processors capable of executing containment detection code 130. Various specific types of processors may be used, such as an x86 processor, a PowerPC processor, an IBM Cell processor, or an ARM processor.

Storage subsystem 110, which may also be referred to as a "computer-readable storage medium", represents a variety of storage media. Storage subsystem 110 may be implemented using any suitable media type and/or storage architecture. For example, storage subsystem 110 may be implemented using storage media such as hard disk storage, floppy disk storage, removable disk storage, flash memory, or semiconductor memory such as random access memory or read-only memory. Note that storage subsystem 110 may be implemented at a single location or may be distributed (e.g., in a SAN configuration).

Email database 120 contains a plurality of email messages, each referred to herein as an email document, associated with one or more email system users. Note that various email documents in email database 120 may be duplicates of one another, or may contain content that is substantially similar to that of other emails in the database (e.g., an original email and a corresponding reply email that includes the original email).

As described in more detail below, containment detection code 130 includes instructions executable by processor subsystem 150 to identify whether the content of one email document in database 120 is contained (or potentially contained) in another email document. In various embodiments, email documents identified by containment detection code 130 as potentially being contained in, or as containing the content of, other emails may be reported to a user (e.g., at least the last email in a chain of reply emails). In certain embodiments, the identified emails may be evaluated further. For example, following identification, the email documents may be analyzed or compared by other code to determine and/or verify the extent to which the content of one email is contained in another, and/or to identify chains of emails. Executing containment detection code 130 may thus allow efficient filtering of email documents that have no content contained in other email documents.

FIG. 2 is a flowchart illustrating various operations that may be performed in accordance with execution of one embodiment of containment detection code 130. The operations illustrated in FIG. 2 are discussed in connection with the example situation shown in FIG. 3, which depicts the content of two possible email documents 301A and 301B. As shown, email document 301B is a reply to email document 301A. Note that, in this example, emails 301A and 301B contain different email headers (e.g., the sender, recipient, and subject portions). Note also that the end of email document 301B contains the character string "The fox was cunning", which is not contained in email document 301A.

In step 210, extraneous email content of the email documents being processed is removed or ignored. This extraneous content may include common recurring phrases found in typical email documents, such as "From [Name], To [Name], Subject [TITLE], [DATE], [TIME], [NAME] wrote", "Begin forwarded message", and "-----Original Message-----". FIG. 4 shows an example of the result of this step, in which the headers have been removed from email documents 301A and 301B. In various embodiments, the extraneous email content removed or ignored from each email document during step 210 may be predetermined or preselected words or phrases (e.g., phrases generally common to email documents). In other embodiments, the extraneous email content to be removed or ignored may be controlled or specified by input from a user. Note that step 210 may be omitted in certain embodiments.
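By way of illustration only, step 210 might be sketched as follows. The function name `strip_boilerplate` and the regular expressions are assumptions chosen to resemble the example phrases above; they are not taken from the disclosure, and a practical embodiment would likely use a tuned or user-supplied pattern list.

```python
import re

# Hypothetical patterns for common recurring email phrases. These are
# assumptions for illustration, not patterns specified by the disclosure.
BOILERPLATE_PATTERNS = [
    re.compile(r"^(From|To|Cc|Subject|Date|Sent):.*$", re.IGNORECASE),
    re.compile(r"^On .+ wrote:$", re.IGNORECASE),
    re.compile(r"^Begin forwarded message:?$", re.IGNORECASE),
    re.compile(r"^-+\s*Original Message\s*-+$", re.IGNORECASE),
]


def strip_boilerplate(text, patterns=BOILERPLATE_PATTERNS):
    """Return text with every line that matches a boilerplate pattern removed."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if any(p.match(stripped) for p in patterns):
            continue  # drop header lines and reply/forward markers
        kept.append(line)
    return "\n".join(kept)
```

Passing a different `patterns` list corresponds to the embodiments in which the extraneous content is specified by user input.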

Next, in step 220, a first set of hash values is generated, one for each paragraph of the first email document being processed, and in step 230, a second set of hash values is generated, one for each paragraph of the second email document being processed. FIG. 5 shows an example in which hash values 501A-E are generated for the paragraphs "The quick brown fox jumped over the lazy dog", "The dog was sleeping", and "The fox was cunning". In this particular embodiment, each hash value is generated by summing the alphabet positions of the characters in the paragraph. For example, the letter "T" is the 20th letter of the alphabet and the letter "h" is the 8th. Accordingly, a hash value of "464" is generated based on the sum of the alphabet positions of the characters in the paragraph "The quick brown fox jumped over the lazy dog". Similarly, hash values of "189" and "203" are computed for the paragraphs "The dog was sleeping" and "The fox was cunning", respectively.
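The alphabet-position hash described above might be sketched as follows. The function name `alphabet_sum_hash` is hypothetical, and the choices to fold case and to skip spaces, digits, and punctuation are assumptions about details the text leaves open. The sketch is checked here only against the two shorter example paragraphs; values for other text depend on the exact characters hashed.

```python
def alphabet_sum_hash(paragraph):
    """Sum the 1-based alphabet positions of the letters in a paragraph.

    Assumption: case is ignored and non-letters (spaces, digits,
    punctuation) contribute nothing to the sum.
    """
    total = 0
    for ch in paragraph.lower():
        if "a" <= ch <= "z":
            total += ord(ch) - ord("a") + 1  # 'a' -> 1, ..., 'z' -> 26
    return total


print(alphabet_sum_hash("The dog was sleeping"))  # 189
print(alphabet_sum_hash("The fox was cunning"))   # 203
```

Such a hash is convenient for a worked example but collides easily (any rearrangement of the same letters gives the same sum), which is one reason a stronger hash function may be preferred in practice.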

Note that any of a variety of other hash functions may be used to compute the hash value for a given paragraph. Generally speaking, a "hash function" is any function that maps an input to a number (i.e., a hash value). Thus, in various embodiments, specific hash algorithms such as an MD5 hash or an SHA-1 hash may be used. In illustrative embodiments, the input to the hash function may be the characters that form the paragraph, or values that represent those characters, such as their ASCII ordinal values or the alphabet positions of the characters within each paragraph. Depending on the embodiment, characters such as punctuation marks and/or digits may or may not be included as input to the hash function.
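As one possible illustration of using a standard hash algorithm, the sketch below computes one digest per paragraph with Python's `hashlib`. The function name `paragraph_hashes`, the blank-line paragraph delimiter, and the whitespace normalization are assumptions made for the example, not requirements of the disclosure.

```python
import hashlib


def paragraph_hashes(text, algorithm="sha1"):
    """Return the set of hex digests, one per non-empty paragraph.

    Assumptions: paragraphs are separated by blank lines, and runs of
    whitespace inside a paragraph are collapsed so that a re-wrapped
    quotation hashes the same as the original.
    """
    digests = set()
    for block in text.split("\n\n"):
        normalized = " ".join(block.split())
        if not normalized:
            continue
        h = hashlib.new(algorithm)  # e.g. "sha1" or "md5"
        h.update(normalized.encode("utf-8"))
        digests.add(h.hexdigest())
    return digests
```

Because the result is a set, two emails can be compared directly with Python's set operators, for example `paragraph_hashes(a) <= paragraph_hashes(b)`.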

Note also that, in certain embodiments, different hash functions may be used to generate multiple hash values for each paragraph. Further, in certain alternative embodiments, hash values may be computed for character strings other than paragraphs, such as sentences, portions of paragraphs, or any other grouping of characters.

In step 240, the first set of hash values generated in step 220 is compared with the second set of hash values generated in step 230 to determine whether the first set of hash values is a subset of the second set of hash values. If the first set is a subset of the second set, containment detection code 130 may provide an indication in step 250A that the content of the first email is contained (or potentially contained) in the second email. Conversely, if the first set is not a subset of the second set, containment detection code 130 may provide an indication in step 250B that the content of the first email is not contained (or potentially not contained) in the second email. As shown in FIG. 5, hash values 464 and 189 are generated from the respective paragraphs "The quick brown fox jumped over the lazy dog" and "The dog was sleeping" of email document 301A. Because these paragraphs of email document 301A are also contained in the content of email document 301B, the hash values "464" and "189" are also generated for email document 301B. In contrast, because "The fox was cunning" is contained only in email document 301B, the hash value "203" is generated only for email document 301B. Because the set of hash values "464" and "189" corresponding to email document 301A is a proper subset of the set of hash values "464", "189", and "203" corresponding to email document 301B, containment detection code 130 may provide an indication that the content of email document 301A is contained in email document 301B. As used herein, a first set of hash values generated for a first email document is a proper subset of a second set generated for a second email document if the second set contains every hash value in the first set as well as additional hash values generated from paragraphs not contained in the first email document. In certain embodiments, if the first set is identical to the second set (i.e., each set contains the same hash values), containment detection code 130 may also provide the indication of content containment in step 250A.
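The comparison of step 240 maps naturally onto set operations. The following is a minimal sketch using the hash values of the example; the function name `classify_containment` and the returned labels are hypothetical.

```python
def classify_containment(first_hashes, second_hashes):
    """Classify how the first set of hash values relates to the second."""
    first, second = set(first_hashes), set(second_hashes)
    if first < second:
        return "proper subset"   # second has every hash of first, plus more
    if first == second:
        return "identical"       # same hash values in both sets
    return "not contained"


hashes_301a = {464, 189}
hashes_301b = {464, 189, 203}
print(classify_containment(hashes_301a, hashes_301b))  # proper subset
print(classify_containment(hashes_301b, hashes_301a))  # not contained
```

An embodiment that also reports containment for identical sets would treat both "proper subset" and "identical" as the step 250A outcome.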

By repeatedly comparing different pairs of email documents, a chain of reply emails can be identified by determining which emails have content that is contained in other emails. If one email is determined to contain all of the other content in the chain, that email can be inferred to be the last email in the chain. For example, in FIG. 5, email documents 301A and 301B are in the same chain of reply emails, and email document 301B is the last email. In certain embodiments, containment detection code 130 may be configured to determine that a particular email contains the content of several other emails and to provide an indication that the particular email is the last in the chain.
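One way to picture the pairwise comparison is the sketch below, which keeps only those emails whose hash set is not a proper subset of any other email's hash set. The function name `find_chain_tails` and the dictionary layout (email identifier mapped to a set of paragraph hashes) are assumptions for illustration. The loop is quadratic in the number of emails and is meant only to show the logic.

```python
def find_chain_tails(hash_sets):
    """Return ids of emails not properly contained in any other email.

    hash_sets maps a (hypothetical) email id to the set of hash values
    generated for that email's paragraphs.
    """
    tails = []
    for email_id, hashes in hash_sets.items():
        contained = any(
            other_id != email_id and hashes < other_hashes
            for other_id, other_hashes in hash_sets.items()
        )
        if not contained:
            tails.append(email_id)
    return tails


corpus = {"301A": {464, 189}, "301B": {464, 189, 203}}
print(find_chain_tails(corpus))  # ['301B']
```

Exact duplicates have equal hash sets, so in this sketch each duplicate is reported; an embodiment could collapse them in a separate pass.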

Note that unrelated emails may occasionally contain content (e.g., common recurring paragraphs) that leads to a false indication that the content of one email document is contained (or potentially contained) in another email. Thus, in various embodiments, containment detection code 130 may be programmed to ignore, during step 240, particular hash values corresponding to content that appears in multiple unrelated emails.
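A possible sketch of ignoring such hash values follows. How the ignored values are chosen is not specified above; the frequency threshold `max_fraction`, the function names, and the rule that an email left with no hash values is not reported as contained are all assumptions made for this example.

```python
from collections import Counter


def common_hashes(hash_sets, max_fraction=0.5):
    """Return hash values that occur in more than max_fraction of emails."""
    counts = Counter(h for hashes in hash_sets for h in set(hashes))
    limit = max_fraction * len(hash_sets)
    return {h for h, n in counts.items() if n > limit}


def is_contained_ignoring(first, second, ignored):
    """Subset test performed after dropping the ignored hash values."""
    first = set(first) - ignored
    second = set(second) - ignored
    # Assumption: an email with nothing left to compare is not reported.
    return bool(first) and first <= second


emails = [{1, 10}, {1, 20}, {1, 30}, {1, 10, 40}]
ignored = common_hashes(emails)  # {1}: present in every email
print(is_contained_ignoring({1, 10}, {1, 10, 40}, ignored))  # True
print(is_contained_ignoring({1}, {1, 20}, ignored))          # False
```

In this toy data the hash value 1 plays the role of a shared signature or disclaimer paragraph that would otherwise make unrelated emails appear to contain one another.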

FIG. 6 is a flowchart illustrating one particular implementation of step 240 for determining whether one set of hash values is a proper subset of another set. Accordingly, the following operations may be performed in conjunction with the method described above.

In step 610, the first set of hash values generated in step 220 is reflected in a Bloom filter corresponding to the first email document. Generally speaking, a "Bloom filter" is a data structure in the form of a bit vector that represents a set of elements and is used to test whether an element is a member of that set. Initially, an empty Bloom filter can be characterized as a bit array of all zeros. As elements are added to the Bloom filter, corresponding representative bits are set.

Thus, as shown in FIG. 7, the computed hash value 501A of "464" and the computed hash value 501B of "189", corresponding to the paragraphs of email document 301A, are reflected in Bloom filter 701A by setting selected bits. In particular, for the specific Bloom filter algorithm shown in this example, bit positions 4 and 6 of Bloom filter 701A are set based on the digits forming the computed hash value "464", and the bits corresponding to positions 1, 8, and 9 are likewise set for the hash value "189". In step 620, the computed hash values generated in step 230, corresponding to the paragraphs of the second email document 301B, are reflected in Bloom filter 701B by similarly setting selected bits, as shown.
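The digit-based bit setting of this example can be sketched as follows, with the bit vector held in a Python integer. The function names `digit_bloom` and `set_positions` are hypothetical, and the ten-position vector (one position per decimal digit) is inferred from the positions named in the example.

```python
def digit_bloom(hash_values):
    """Build a 10-bit vector by setting bit d for every decimal digit d."""
    bits = 0
    for value in hash_values:
        for digit in str(abs(value)):
            bits |= 1 << int(digit)
    return bits


def set_positions(bits):
    """List the positions of the bits that are set, lowest first."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


print(set_positions(digit_bloom([464, 189])))       # [1, 4, 6, 8, 9]
print(set_positions(digit_bloom([464, 189, 203])))  # [0, 1, 2, 3, 4, 6, 8, 9]
```

The first line corresponds to the bits described for email document 301A; the second adds positions 2, 0, and 3 for the hash value "203" of email document 301B.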

Note that any of a variety of other Bloom filter algorithms may be used in other embodiments. For example, the size (i.e., number of bits) of the vector forming the Bloom filter data structure may be much larger than that shown in FIG. 7, and a given hash value may be represented in the Bloom filter by setting other specific bit positions as dictated by the algorithm.
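For comparison, a more conventional Bloom filter with a larger bit vector might look like the sketch below. It is not the algorithm of the figures: the 1024-bit size, the three bit positions per item, and the use of salted SHA-256 digests to derive positions are all assumptions chosen for illustration.

```python
import hashlib


def bloom_positions(item, num_bits=1024, num_hashes=3):
    """Derive num_hashes bit positions for item from salted SHA-256 digests."""
    positions = []
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def build_bloom(items, num_bits=1024, num_hashes=3):
    """Return the Bloom filter for items as an integer bit vector."""
    bits = 0
    for item in items:
        for pos in bloom_positions(item, num_bits, num_hashes):
            bits |= 1 << pos
    return bits


def might_contain(bits, item, num_bits=1024, num_hashes=3):
    """True if every bit for item is set (false positives are possible)."""
    return all(
        (bits >> pos) & 1 for pos in bloom_positions(item, num_bits, num_hashes)
    )
```

A larger vector lowers the chance that two different sets of hash values produce overlapping bit patterns, at the cost of more storage per email.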

In step 630, the Bloom filters generated in steps 610 and 620 are compared to determine the degree of overlap. As shown in FIG. 7, because the computed hash values "464" and "189" are represented in both Bloom filters 701A and 701B, the bits at positions 1, 4, 6, 8, and 9 are correspondingly set in both Bloom filters 701A and 701B. In contrast, because the hash value "203" is represented only in Bloom filter 701B, the bits at positions 2, 0, and 3 are not set in Bloom filter 701A.

In one particular embodiment, shown in FIG. 8, the Bloom filters of two email documents may be compared by performing a bitwise OR. In this example, bit vector 801 is generated from a bitwise OR of the bit vectors of Bloom filters 701A and 701B and is then compared with each of Bloom filters 701A and 701B. If the resulting bitwise-OR bit vector 801 matches either input Bloom filter 701A or 701B, containment detection code 130 may provide an indication in step 250A that the content of one email is contained (or potentially contained) in the content of the other email. Conversely, if the resulting bit vector 801 matches neither Bloom filter 701A nor 701B, containment detection code 130 may provide an indication in step 250B that the content of neither email is contained (or potentially contained) in the content of the other. Note that, in the particular example shown in FIG. 8, bit vector 801 matches Bloom filter 701B, so containment detection code 130 provides an indication that the content of email document 301A is contained in email document 301B.
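The bitwise OR comparison can be sketched as follows, again holding each Bloom filter in a Python integer. The names `from_positions` and `compare_bloom` and the returned labels are hypothetical. If the OR of the two vectors equals one of them, every bit of the other vector is already set in it.

```python
def from_positions(positions):
    """Build a bit vector with the given bit positions set."""
    bits = 0
    for pos in positions:
        bits |= 1 << pos
    return bits


def compare_bloom(first, second):
    """Classify two Bloom filters by comparing first | second with each input."""
    union = first | second
    if union == first and union == second:
        return "identical"
    if union == second:
        return "first potentially contained in second"
    if union == first:
        return "second potentially contained in first"
    return "no containment"


filter_701a = from_positions([1, 4, 6, 8, 9])
filter_701b = from_positions([0, 1, 2, 3, 4, 6, 8, 9])
print(compare_bloom(filter_701a, filter_701b))
# first potentially contained in second
```

Because different hash values can set the same bits, a match indicates only potential containment; a negative result, by contrast, rules containment out for the hash values represented.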

Although specific embodiments have been described above, these embodiments are not intended to limit the scope of the present disclosure, even where only a single embodiment is described with respect to a particular feature. Examples of features provided in the disclosure are intended to be illustrative rather than restrictive unless stated otherwise. The above description is intended to cover such alternatives, modifications, and equivalents as would be apparent to a person skilled in the art having the benefit of this disclosure. The scope of the present disclosure includes any feature or combination of features disclosed herein (either explicitly or implicitly), or any generalization thereof, whether or not it mitigates any or all of the problems addressed by the various described embodiments. Accordingly, new claims may be formulated during prosecution of this application (or an application claiming priority thereto) to any such combination of features. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims, and features from respective independent claims may be combined in any appropriate manner and not merely in the specific combinations enumerated in the claims.

100 computer system; 110 storage subsystem;
120 email database; 130 containment detection code;
150 processor subsystem.

Claims

A method comprising:
generating a first set of hash values corresponding to a first email document, the first set including a respective hash value for each of a plurality of character strings of the first email document;
generating a second set of hash values corresponding to a second email document, the second set including a respective hash value for each of a plurality of character strings of the second email document; and
determining whether the first set of hash values is a proper subset of the second set of hash values.

The method of claim 1, wherein each of the plurality of character strings of the first email document is a respective paragraph of the first email document, and each of the plurality of character strings of the second email document is a respective paragraph of the second email document.

The method of claim 1, further comprising:
generating a first Bloom filter representing the first set of hash values corresponding to the first email document;
generating a second Bloom filter representing the second set of hash values corresponding to the second email document; and
comparing the first Bloom filter with the second Bloom filter.

The method of claim 2, wherein the determining includes performing a bitwise OR operation on the first and second Bloom filters.

The method of claim 4, further comprising providing an indication, based on a result of the determining, of whether the content of the first email document may be contained in the second email document.

The method of claim 1, further comprising providing an indication, based on a result of the determining, of whether the content of the first email document may be contained in the second email document.

A computer-readable memory medium storing program instructions that are computer-executable to perform:
generating a first set of hash values corresponding to a first email document, the first set including a respective hash value for each of a plurality of character strings of the first email document;
generating a second set of hash values corresponding to a second email document, the second set including a respective hash value for each of a plurality of character strings of the second email document; and
determining whether the first set of hash values is a proper subset of the second set of hash values.

The computer-readable memory medium of claim 7, wherein each of the plurality of character strings of the first email document is a respective paragraph of the first email document, and each of the plurality of character strings of the second email document is a respective paragraph of the second email document.

The computer-readable memory medium of claim 7, wherein the program instructions are further computer-executable to perform:
generating a first Bloom filter representing the first set of hash values corresponding to the first email document; and
generating a second Bloom filter representing the second set of hash values corresponding to the second email document;
wherein determining whether the first set of hash values is a proper subset of the second set of hash values includes comparing the first Bloom filter with the second Bloom filter.

The computer-readable memory medium of claim 9, wherein the program instructions are computer-executable to provide an indication, based on the comparison of the first and second Bloom filters, of whether the content of the first email document may be contained in the second email document.

The computer-readable memory medium of claim 9, wherein the program instructions are computer-executable to perform the comparison of the first and second Bloom filters by performing a bitwise OR operation.

The computer-readable memory medium of claim 7, wherein the program instructions are computer-executable to ignore predetermined content of the first and second email documents.

The computer-readable memory medium of claim 12, wherein the predetermined content includes email header information.

A system comprising:
one or more processors; and
memory storing program instructions that are executable by the one or more processors to perform:
generating a first set of hash values corresponding to a first email document, the first set including a respective hash value for each of a plurality of character strings of the first email document;
generating a second set of hash values corresponding to a second email document, the second set including a respective hash value for each of a plurality of character strings of the second email document; and
determining whether the first set of hash values is a proper subset of the second set of hash values.

The system of claim 14, wherein each of the plurality of character strings of the first email document is a respective paragraph of the first email document, and each of the plurality of character strings of the second email document is a respective paragraph of the second email document.

The system of claim 14, wherein the program instructions are further executable to perform:
generating a first Bloom filter representing the first set of hash values corresponding to the first email document; and
generating a second Bloom filter representing the second set of hash values corresponding to the second email document;
wherein determining whether the first set of hash values is a proper subset of the second set of hash values includes comparing the first Bloom filter with the second Bloom filter.

The system of claim 16, wherein the program instructions are executable to perform the comparison of the first and second Bloom filters by performing a bitwise OR operation.

The system of claim 14, wherein the program instructions are executable to ignore one or more hash values of the first or second set when determining whether the first set of hash values is a proper subset of the second set of hash values.

The system of claim 14, wherein the program instructions are executable to identify the second email document as a reply to the first email document based on determining whether the first set of hash values is a proper subset of the second set of hash values.

The system of claim 14, wherein one or more hash values of the first and second sets are generated using an MD5 or SHA-1 hash algorithm.