JP4903386B2 - Searchable information content for pre-selected data

Info

Publication number: JP4903386B2
Application number: JP2004568963A
Authority: JP
Inventors: Kevin T. Rowney; Michael R. Wolfe; Mythili Gopalakrishnan; Vitali Fridman; Joseph Ansanelli
Original assignee: Symantec Corp
Current assignee: NortonLifeLock Inc
Priority date: 2002-09-18
Filing date: 2003-09-17
Publication date: 2012-03-28
Anticipated expiration: 2023-09-17
Also published as: JP2005539334A; AU2003270883A8; EP1540542A2; WO2004027653A3; WO2004027653A2; CA2499508A1; AU2003270883A1

Description

The present invention relates to the field of processing data; more particularly, it relates to detecting preselected (e.g., proprietary) data in information content.

Many organizations store large amounts of security-sensitive information in relational databases. This type of data is usually subjected to very thorough security measures, including physical security, access control, perimeter security restrictions, and, in some cases, encryption. Because access to database data is essential to the job function of many employees in an enterprise, there are many possible points at which this information can be stolen or accidentally distributed. Theft of information represents a significant business risk, both in terms of the value of the intellectual property and in terms of the legal liability associated with regulatory compliance.

Relational database systems
Relational database systems are useful for a huge range of applications. Relational structures hold data in a fashion that presents naturally intuitive ways to query the data, with the added advantage of hiding the details of the underlying disk storage system from the user. The typical application of a database system is the storage and retrieval of large amounts of small pieces of data that can be naturally formatted into a table structure. Relational databases are highly useful because the types of queries that most people care about, outlined below, can be optimized using well-known index structures.

Queries submitted to relational database systems use a naturally intuitive predicate logic called Structured Query Language (SQL), which allows users to describe succinctly the tabular data they are looking for. Database tables almost always come equipped with indexes that make SQL-based queries more efficient. These indexes are stored in memory using a data structure called a B-tree. The salient features of B-trees most relevant to the current discussion are as follows:

A B-tree is an abstract data structure based on the binary tree;
A B-tree must contain copies of the data that it indexes; and
B-trees are most efficient for the query examples outlined below.

There are a number of query examples:
Exact match queries of the form A = v, where:
A is a column or "attribute" of a given database table,
v is a specific attribute value,
e.g., SELECT * FROM CUSTOMERS WHERE Income = 30,000
Range queries of the form v1 < A < v2, where:
A is a column or "attribute" of a given database table,
e.g., SELECT * FROM CUSTOMERS WHERE 30 < Income < 40
Prefix queries of the form A MATCHES s*, where:
"s" is a specific string value,
"s*" is a regular expression,
e.g., Last_Name MATCHES "Smith*"
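The three query forms above can be tried out with a minimal sketch using Python's built-in sqlite3 module. The table contents and index names are invented for illustration, and the SQL is adapted to what SQLite accepts: the range is written as two comparisons joined by AND, and the prefix match uses SQLite's GLOB operator in place of the MATCHES notation used in the text.

```python
import sqlite3

# Hypothetical CUSTOMERS table; the column names follow the examples in the text,
# the rows are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMERS (Last_Name TEXT, Income INTEGER)")
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?)",
    [("Smith", 30000), ("Smithson", 35), ("Jones", 38), ("Lee", 45)],
)

# Indexes on the queried columns (SQLite stores its indexes as B-trees).
conn.execute("CREATE INDEX idx_income ON CUSTOMERS (Income)")
conn.execute("CREATE INDEX idx_last_name ON CUSTOMERS (Last_Name)")

# Exact match query: A = v
exact = conn.execute(
    "SELECT * FROM CUSTOMERS WHERE Income = ?", (30000,)
).fetchall()

# Range query: v1 < A < v2, written as two comparisons
in_range = conn.execute(
    "SELECT * FROM CUSTOMERS WHERE Income > ? AND Income < ? ORDER BY Income",
    (30, 40),
).fetchall()

# Prefix query: A MATCHES s*, expressed here with GLOB
prefix = conn.execute(
    "SELECT * FROM CUSTOMERS WHERE Last_Name GLOB ? ORDER BY Last_Name",
    ("Smith*",),
).fetchall()
```

Each of these predicates constrains an indexed column to a single value, an interval, or a shared prefix, which is the kind of lookup an ordered index is suited to.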

There are many references to original works in the field of database systems. The first is the seminal work on relational databases by E. F. Codd, "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, 13(6), 377-387, 1970.

The second reference is one of the first published works on the "B-tree" data structure, the fundamental data structure that enables efficient queries of the type outlined above. See Rudolf Bayer and Edward M. McCreight, "Organization and Maintenance of Large Ordered Indices," Record of the 1970 ACM SIGFIDET Workshop on Data Description and Access, November 15-16, 1970, Rice University, Houston, Texas, USA (Second Edition with an Appendix), pages 107-141, ACM, 1970.

Information retrieval systems
Information retrieval is a broad field that deals with the storage and retrieval of textual data found in documents. These systems differ from database systems in that they focus primarily on standard documents rather than tabular data. An early example of such a system was developed as part of the SMART system at Cornell University. Today, the best-known information retrieval applications are web-based search engines such as Google, Inktomi, and AltaVista. The typical way to use these systems is to find a reference to a document that is part of a larger set of digital documents. The user experience for these applications usually consists of a series of queries interleaved with browsing of the results. Query results are presented in order of descending relevance, and the user can refine the query after further browsing. As with relational databases, the extraordinary popularity of these systems stems from the ability of the underlying indexes to respond quickly to the types of queries that people find most useful.

Most of these systems are based on indexes derived from so-called "concordances" that are built up from the collection of documents being indexed. These concordances contain a data structure that lists, for each term, the location of each occurrence of that term in each document. Such a data structure allows quick lookup of all documents that contain a particular term. For user queries that ask for all documents containing a collection of terms, the index is structured so that it represents a large number of vectors in a high-dimensional Euclidean space. The user's list of query terms is then reinterpreted as a vector in this space as well. The query is executed by finding which vectors in the document space are nearest to the query vector. This last step, to which a variety of optimizations for accuracy and speed have been applied, is called the "cosine metric."
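As a rough illustration of the two ideas in this paragraph, the following sketch builds a small concordance (term to list of occurrences) and ranks documents against a query by cosine similarity of raw term-count vectors. The sample documents, the whitespace tokenizer, and all function names are assumptions made for this sketch; production systems apply term weighting and many optimizations that are not shown.

```python
import math
from collections import Counter, defaultdict

# Toy document collection, invented for this sketch.
docs = {
    "d1": "database index structures speed up database queries",
    "d2": "the dog chased the ball",
    "d3": "an index maps each term to its locations",
}

def build_concordance(documents):
    """Map each term to the list of (doc_id, position) pairs where it occurs."""
    concordance = defaultdict(list)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            concordance[term].append((doc_id, position))
    return concordance

def term_vector(text):
    """Represent a text as a sparse vector of term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, documents):
    """Return ids of documents sharing terms with the query, most similar first."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]
```

For example, `build_concordance(docs)["index"]` lists one occurrence in d1 and one in d3, and `rank("database index", docs)` places d1 ahead of d3 because d1 shares both query terms.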

As mentioned above, the typical user interaction with this kind of system is an iterative cycle of querying, browsing, refining, and querying again. Query results are usually a large number of documents ranked in order of descending relevance, and the false positive rate can be very high. Some standard examples of queries follow.

Boolean queries, such as:
a) all documents that contain the terms "database" and "index";
b) all documents that contain the terms "database" or "index" but not "Sybase".

Link-based queries, such as:
a) all documents that are linked to by documents that contain the term "dog";
b) the most "popular" (i.e., most linked-to) document that contains the term "dog".

One of the first significant implementation projects for information retrieval systems was the SMART system at Cornell University. This system contains many of the essential components of information retrieval systems still in use today: C. Buckley, "Implementation of the SMART Information Retrieval System," Technical Report TR85-686, Cornell University, 1985.

The WAIS project was an early application of the massively parallel supercomputer produced by Thinking Machines, Inc. It was one of the first information retrieval systems made available over the Internet. The primary reference for this work is Brewster Kahle and Art Medlar, "An Information System for Corporate Users: Wide Area Information Servers," Technical Report TMC-199, Thinking Machines, Inc., April 1991, version 3.19.

Among the many current commercial vendors of Internet search services is Google. Google's real breakthrough in search accuracy is its ability to harvest data from both the text of the indexed documents and their hyperlink structure. See Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," http://dbpubs.stanford.edu:8090/pub/1998-8.

File shingling systems
The growth of the Internet and of affordable means of copying and distributing digital documents has spurred research interest in technologies that help detect illegal or inappropriate copies of documents. The primary applications of this work are detecting violations of copyright law and detecting plagiarism. The problem is also of considerable interest because it relates to the detection and automatic elimination of spam (also known as unsolicited commercial email). The technical term used to describe most of these techniques is "file shingling," in which adjacent sequences of document fragments are reduced to "shingles" by hash codes and then stored in a lookup table in the same sequence in which they are found in the document.

File shingling provides a very quick way to look for similarity between two documents. To protect a specific document (e.g., a text file), the document is shingled by hashing it sentence by sentence and storing the hashed sentences in a table for quick lookup. To test whether a new document contains fragments of copyrighted content, the same hash function is applied to each fragment of the test message to see whether the fragments appear in an order similar to the one in which they appear in the copyrighted content. The technique is quick because the time required to look up an individual fragment is very short.
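The procedure just described can be sketched as follows. This is an illustrative reading rather than the method of any particular shingling system: the sentence splitter, the choice of SHA-256 truncated to 16 hex digits, and the "longest ordered run" measure are all assumptions introduced here.

```python
import hashlib
import re

def shingles(text):
    """Split text into sentences and reduce each one to a short hash code."""
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [hashlib.sha256(s.encode("utf-8")).hexdigest()[:16] for s in sentences]

def build_table(protected_text):
    """Lookup table: shingle -> position of that sentence in the protected document."""
    return {h: i for i, h in enumerate(shingles(protected_text))}

def matched_positions(table, test_text):
    """Positions in the protected document of the test fragments found in the table."""
    return [table[h] for h in shingles(test_text) if h in table]

def longest_ordered_run(positions):
    """Length of the longest run of matches that keeps the protected document's order."""
    best = run = 0
    prev = None
    for p in positions:
        run = run + 1 if prev is not None and p > prev else 1
        best = max(best, run)
        prev = p
    return best

protected = "Alpha is first. Beta is second. Gamma is third."
table = build_table(protected)
copied = matched_positions(table, "Hello there. Beta is second. Gamma is third. Bye.")
```

Each lookup is a single dictionary access, which reflects why the text calls the technique quick; a caller would compare `longest_ordered_run(copied)` against a threshold of its own choosing.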

The typical user interaction with a file shingling system is passive rather than active. File shingling systems are usually set up to process documents automatically and deliver the query results to a user asynchronously. A typical file shingling application is spam prevention, in which a set of messages is used to create an index of restricted content that an organization does not want delivered to its own email system. In this scenario, the "query" is simply the automatic processing of email messages and the appropriate automatic routing.

For document equivalency queries, for each test document t, find all documents d in the collection of indexed documents that have the same contents as t. In the case of spam detection, the set d could be all known active spam messages, and the document t could be an incoming email message.

For cut-and-paste detection queries, for each test document t, find all documents d in the collection of indexed documents for which some fragment of d occurs in t. In the case of plagiarism detection, the set d could be all essays previously submitted for a particular class, and the document t could be a new paper written by a student suspected of plagiarism.
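The two query types can be contrasted in a short sketch over a hashed-sentence index. The fragment granularity (one sentence), the hash function, and the sample collection are assumptions chosen for illustration only.

```python
import hashlib
import re

def fragments(text):
    """Hash each sentence of a text; sentences are this sketch's 'fragments'."""
    parts = [p.strip().lower() for p in re.split(r"[.!?]+", text) if p.strip()]
    return [hashlib.sha256(p.encode("utf-8")).hexdigest() for p in parts]

def index_collection(collection):
    """Index every document d in the collection by its list of fragment hashes."""
    return {doc_id: fragments(text) for doc_id, text in collection.items()}

def equivalent_documents(index, t):
    """Document equivalency: every indexed d with the same contents as t."""
    ft = fragments(t)
    return sorted(d for d, fd in index.items() if fd == ft)

def cut_and_paste_sources(index, t):
    """Cut-and-paste detection: every indexed d with some fragment occurring in t."""
    ft = set(fragments(t))
    return sorted(d for d, fd in index.items() if ft.intersection(fd))

# Hypothetical indexed collection: one known spam message and one earlier essay.
index = index_collection({
    "spam1": "Buy now. Limited offer.",
    "essay1": "Whales are mammals. They breathe air.",
})
```

With this index, `equivalent_documents(index, "buy now. limited offer.")` returns the spam message, while a new essay that reuses only the sentence "They breathe air." is flagged by `cut_and_paste_sources` but not by the equivalency query.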

The main published research projects in file shingling are called KOALA, COPS, and SCAM. All of them use variants of the basic file shingling approach described above, with variations that optimize performance and accuracy. For information on KOALA, see N. Heintze, "Scalable Document Fingerprinting," Proceedings of the Second USENIX Workshop on Electronic Commerce, November 1996, http://www-2.c.s.cmu.edu/afs/cs/user/nch/www/koala/main.html. For information on COPS, see S. Brin, J. Davis, and H. Garcia-Molina, "Copy Detection Mechanisms for Digital Documents," Proceedings of the ACM SIGMOD Annual Conference, May 1995. For information on SCAM, see N. Shivakumar and H. Garcia-Molina, "SCAM: A Copy Detection Mechanism for Digital Documents," Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries (DL'95), June 1995, http://www-db.stanford.edu/~shiva/SCAM/scaminfo.html, and N. Shivakumar and H. Garcia-Molina, "Building a Scalable and Accurate Copy Detection Mechanism," Proceedings of the 1st ACM Conference on Digital Libraries (DL'96), March 1996, http://www-db.stanford.edu/pub/paper/performance.ps.

Internet content filtering systems
A variety of commercial applications, referred to as content filtering systems, implement protection measures. There are two major types of applications in this category: website restriction/monitoring software and email content control. In both cases, the main algorithm currently in use is pattern matching against a set of regular expressions for a set of text fragments that would indicate data misuse. An example is restricting all browsing at URLs that contain the text fragment "XXX". A typical example in the email content control category is stopping and blocking all emails that contain the terms "proprietary" and "confidential" but do not contain the terms "joke" or "kidding".
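The two filtering rules given as examples can be written down directly with Python's re module. This is a minimal sketch: the function names, the case-insensitive matching, and the word-boundary anchors are choices made here, not details from the source.

```python
import re

# URL rule from the text: restrict browsing at URLs containing the fragment "XXX".
BLOCKED_URL = re.compile(r"XXX")

# Email rule from the text: block when both required terms are present
# and neither exempting term appears.
REQUIRED = [re.compile(r"\bproprietary\b", re.IGNORECASE),
            re.compile(r"\bconfidential\b", re.IGNORECASE)]
EXEMPT = [re.compile(r"\bjoke\b", re.IGNORECASE),
          re.compile(r"\bkidding\b", re.IGNORECASE)]

def block_url(url):
    """True if the URL matches the restricted pattern."""
    return BLOCKED_URL.search(url) is not None

def block_email(body):
    """True if the body contains every required term and no exempting term."""
    has_all_required = all(p.search(body) for p in REQUIRED)
    has_exempt = any(p.search(body) for p in EXEMPT)
    return has_all_required and not has_exempt
```

The sketch also makes the weakness of keyword rules easy to see: a message that leaks the same data without using either required term passes straight through.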


A method and apparatus for detecting preselected data stored on a personal computing device are described. In one embodiment, the method includes monitoring messages electronically transmitted over a network for embedded preselected data, and performing content searches on the messages to detect the presence of the embedded preselected data using an abstract data structure derived from the preselected data.

The present invention will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the invention, which, however, should not be taken to limit the invention to the specific embodiments, but are for explanation and understanding only.

A system and method for tracking and monitoring the use of sensitive information anywhere on a personal computing device are described herein. In one embodiment, this monitoring is implemented by performing content searches of data storage media of a personal computing device such as a desktop computer or a portable computer. In another embodiment, the monitoring is implemented by performing content searches of messages as they are transmitted from or received by the personal computing device. In yet another embodiment, the monitoring is implemented by performing content searches before, during, and after the use of potentially sensitive information inside any application running on the personal computing device. In one embodiment, the system described herein is able to detect this information in a secure and scalable fashion that is capable of handling large amounts of database data. Database data may comprise any form of tabular-formatted data stored in a variety of systems including, but not limited to, relational databases, spreadsheets, and flat files.

In the following description, numerous details are set forth to provide a thorough explanation of the present invention. It will be apparent to one skilled in the art, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

Some portions of the detailed description that follows are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to convey the substance of their work most effectively to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout the description, discussions using terms such as "processing," "computing," "calculating," "determining," or "displaying" refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system's memories or registers or other such information storage, conversion, or display devices.

The present invention also relates to an apparatus for performing the operations described herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.

The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.

A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; and electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).

Components of an exemplary embodiment
In one embodiment, the system for performing the detection scheme described herein consists of two main components: a Policy Management System (PMS) and a Message Monitoring System (MMS). The PMS is responsible for accepting user input that determines information protection policies for the use and transmission of data (e.g., database data) that is contained inside messages sent over the network or is stored on data storage media of personal computing devices such as portable computers, desktop computers, Personal Digital Assistants, and cell phones. This data is, therefore, preselected. The term "data storage medium of a personal computing device" as used herein refers to any storage within the personal computing device, or accessible to the personal computing device, that stores data for the personal computing device temporarily or permanently.

The MMS is responsible for performing content searches on messages sent over the network, on data processed by personal computing devices, or on data stored on data storage media of personal computing devices, and for implementing the policies that the user has specified to the PMS. In one embodiment, both of these systems are coupled to a computer network that communicates using any of the standard protocols for the exchange of information.

In this embodiment, in the course of normal operation, a user may decide to implement a given policy that restricts the use or transmission of database data by certain individuals, and then manually enters this policy into the PMS using a graphical user interface and one or more user input devices (e.g., a mouse, a keyboard). The user interface receives the input and may run on the computer system that hosts the PMS or on a separate machine. An example policy might stop individuals in a given customer service group from saving data files containing preselected data to a removable media device attached to a personal computing device. In one embodiment, the policy includes the nature of the protection desired (e.g., restrict only a subset of employees), the type of data that requires protection (e.g., database data), and the network location of the database data that requires protection (e.g., database table name, IP address of the server, server or file name). Again, all of this information is specified using a standard graphical user interface that prompts the user to enter the specific information in the correct fields.
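One way to picture the information a policy carries is as a small record type. The sketch below is purely illustrative: the class and field names, the action label, and the sample server address (taken from a documentation-only IP range) are hypothetical and do not come from the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataLocation:
    """Network location of the data to protect (all field names hypothetical)."""
    server: str          # server name or IP address
    table_or_file: str   # database table name or file name

@dataclass(frozen=True)
class Policy:
    """Nature of protection, type of data, and location of the protected data."""
    name: str
    restricted_groups: frozenset  # subset of employees the policy restricts
    data_type: str                # e.g. "database data"
    location: DataLocation
    blocked_action: str           # action the policy stops

    def applies_to(self, user_group, action):
        """True if this policy restricts the given group from the given action."""
        return user_group in self.restricted_groups and action == self.blocked_action

# The example policy from the text: stop a customer service group from saving
# files containing preselected data to removable media.
policy = Policy(
    name="no-removable-media",
    restricted_groups=frozenset({"customer-service"}),
    data_type="database data",
    location=DataLocation(server="192.0.2.10", table_or_file="CUSTOMERS"),
    blocked_action="save_to_removable_media",
)
```

A monitoring component would still have to confirm that the file actually contains preselected data before acting on `policy.applies_to(...)`; that content check is a separate step.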

At regular intervals, which in one embodiment are adjustable by the user but by default occur once per specified interval (e.g., one day), the PMS queries the database, extracts a copy of the database data to be protected, and derives from that data an abstract data structure (hereafter called the "index"), which is described in more detail below.

The PMS then sends this index, along with the details of the policy to be implemented, to the MMS so that the MMS can begin enforcing the policy. The MMS uses the index and the policy information to enforce the policy specified by the user. In one embodiment, the MMS uses the index to search each outgoing message (e.g., email, webmail message) for the database data that is to be protected, as discussed in more detail below. In another embodiment, the MMS uses the index to search the contents of data storage media of a personal computing device, and/or the content of interactions between the user and the personal computing device, for the database data that is to be protected, as discussed in more detail below.
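The division of labor between the two components can be sketched as a pair of functions. The actual index structure is deferred by the text to a later section, so the stand-in used here (a set of SHA-256 digests of normalized cell values) is an assumption of this sketch, as are the function names and the naive tokenizer.

```python
import hashlib

def _digest(value):
    """Hash a normalized value so the index need not hold the raw data."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()

def derive_index(rows):
    """PMS side: derive a stand-in index from a copy of the protected rows."""
    return frozenset(_digest(str(cell)) for row in rows for cell in row)

def search_message(index, message):
    """MMS side: return the message tokens whose hashes occur in the index."""
    tokens = message.replace(",", " ").split()
    return [tok for tok in tokens if _digest(tok) in index]

# Hypothetical protected table extracted by the PMS.
index = derive_index([("Alice", "4111-1111"), ("Bob", "5500-0000")])
hits = search_message(index, "Please send 4111-1111 to alice today")
```

The point of the sketch is only the data flow: the PMS periodically rebuilds the index from the database, and the MMS receives the index rather than the raw table and consults it for each message it inspects.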

A summary of an exemplary workflow is shown in FIG. 1, in which the highest-value information is identified, policies are authored, and monitoring and enforcement are performed, leading to actionable business intelligence.

Modes of Network-Based Operation
In one embodiment, the message monitoring system can be configured in one of two ways: "monitor mode" and "enforcement mode". FIG. 2 illustrates two network configurations. In monitor mode, the MMS is placed somewhere on the network where it can watch traffic and report on policy violations, but it is not configured to block messages as they leave. This is shown in FIG. 2A, where the PMS has access to the information. The PMS is coupled to the Internet via a switch, a tap, and a firewall. The MMS monitors network messages using the tap. In enforcement mode, the MMS can watch traffic and report on violations, and can also intercept and reroute messages so that their ultimate destination is changed. This is shown in FIG. 2B, where the PMS has access to the information and is coupled to the Internet via a switch and a firewall. In this embodiment, the MMS monitors traffic using a series of servers and reroutes traffic, for example to a particular server, if it determines that a message is likely to contain preselected information. The MMS may use different servers for each of the various layer protocols.

Message rerouting is not mandatory. Alternatively, the MMS can be configured to simply intercept and stop the outgoing message. An example policy in enforcement mode would be to route all messages that violate a policy to the manager of the person who violated it, so that appropriate disciplinary action can be taken.

In both modes of operation, it is possible to install multiple MMSs, each with its own copy of the indexes required to detect content. This parallel processing configuration helps with problems of scale and with protecting multiple possible points of egress of information.

In both configurations, the MMS actively parses messages that are transported using various application-layer protocols (e.g., SMTP, HTTP, FTP, AIM, ICQ, SOAP, etc.).

In one embodiment, the two subsystems (PMS and MMS) run on one local area network (LAN). However, the PMS and MMS may also be incorporated into the same physical or logical system. This consolidated configuration can be preferable because it helps control the cost of goods required to produce the system.

In yet another alternative embodiment, the PMS and MMS are not necessarily on the same LAN. The PMS may reside on the same LAN as the database information, while the MMS resides on a different LAN from the one on which the PMS resides. In this configuration, the two distinct LANs may ultimately be coupled together via the Internet but are separated by firewalls, routers, and/or other network devices. This is an advantageous configuration when one company wants to restrict another company that needs its database data (such as a law firm or research agency) from violating the first company's database data policy.

FIG. 3 is a flow diagram of one embodiment of a process for protecting database data. The process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general-purpose computer system or a dedicated machine), or a combination of both.

Referring to FIG. 3, processing logic monitors messages for preselected data (processing block 301). Next, processing logic determines whether a message contains preselected data (processing block 302). If not, processing returns to processing block 301. If so, processing logic determines whether the individual sending or receiving the message is authorized to send or receive the information in the message (processing block 303). If so, handling of that message ends and processing returns to processing block 301. If not, processing logic takes one or more actions, such as intercepting the message, rerouting the message, or logging the message (processing block 304), and processing returns to processing block 301.
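The decision sequence of FIG. 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function `process_message`, its callback parameters, and the toy detector and action functions are all hypothetical stand-ins, since the description does not specify how detection, authorization, or the actions are implemented.

```python
from typing import Callable, Iterable, List


def process_message(
    message: str,
    contains_preselected: Callable[[str], bool],
    sender_is_authorized: Callable[[str], bool],
    actions: Iterable[Callable[[str], str]],
) -> List[str]:
    """Run one message through blocks 302-304 and return the actions taken."""
    if not contains_preselected(message):   # block 302: no protected data
        return []
    if sender_is_authorized(message):       # block 303: sender is approved
        return []
    return [action(message) for action in actions]   # block 304


# Toy stand-ins for the real detector, authorization check, and actions.
def has_secret(msg: str) -> bool:
    return "4111-1111" in msg


def never_authorized(msg: str) -> bool:
    return False


def log_action(msg: str) -> str:
    return "logged"


def reroute_action(msg: str) -> str:
    return "rerouted"
```

The caller would invoke `process_message` once per monitored message, which corresponds to returning to block 301 after each pass.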

Modes of Client-Based Operation
The client-based mode of operation is directed to monitoring actions taken by a user of a personal computing device in order to detect user operations that may involve a potential misuse of data. These user operations include, for example, saving or accessing restricted database data on any storage device of the computing system, using restricted database data in an application, printing restricted database data, and using restricted database data in any network communication protocol. In one embodiment, user actions are monitored by parsing and searching content that is either accessed or saved on the local storage system of the personal computing device, or transported using various application-layer protocols (e.g., SMTP, HTTP, FTP, AIM, ICQ, SOAP, etc.). In another embodiment, user actions are monitored by intercepting and interpreting data exchanged between the user and the personal computing device.

FIG. 9 is a block diagram of one embodiment of a system for client-based protection of preselected sensitive data.

Referring to FIG. 9, a server 902 communicates with client computers (referred to as clients) 910 via a network 906. The network 906 may be a private network (e.g., a local area network (LAN)) or a public network (e.g., a wide area network (WAN)). The clients 910 are computers belonging to different employees within an organization. Each client 910 may be, for example, a desktop computer, a portable computer (e.g., a laptop), or any other computer that may operate with an intermittent network connection. A content monitoring system (also referred to herein as a message monitoring system, or MMS) 912 resides on each client 910 and is responsible for searching the contents of that client's data storage media for preselected sensitive data and for intercepting and interpreting content exchanged between the user and the client 910. The data storage media may include, for example, main memory, static memory, mass storage memory (e.g., a hard disk), or any other storage device that temporarily or permanently stores files or other documents for the client computer. In one embodiment, the MMS 912 monitors specific data operations, such as file reads, file writes, file updates, and reads from and writes to removable media devices (e.g., floppy drives, universal serial bus (USB) devices, compact disc recordable (CD-R) devices, etc.). The operation of the MMS 912 helps prevent the loss of sensitive data through removable and portable devices. For example, it guards against leaks that occur when a user copies sensitive data stored on the client 910 to a floppy disk, transfers a file containing sensitive data to a USB-based removable storage device, prints or emails sensitive data from a laptop or desktop computer, or uses sensitive data in an unauthorized application.

The server 902 is responsible for configuring the detection scheme described herein within the organization. The server 902 includes a PMS 904 and a message collector 914. The PMS 904 maintains a set of security policies that control the use of sensitive data. The set of security policies identifies employees whose computers must be monitored for potential misuse of sensitive data, specifies the sensitive data for which the search is performed, and defines the scope of the search (e.g., specific storage media, data operations, etc.). Based on this information, the PMS 904 instructs each MMS 912 whether to search the corresponding client 910 and sends the index to be used in the search. The index is derived from the specific sensitive data that was preselected for one or more clients 910 based on the security policies. The message collector 914 is responsible for collecting messages received from the MMSs 912 that report misuse of data by users of the clients 910.

In one embodiment, each MMS 912 can operate standalone when it cannot maintain network contact with the server 902 (e.g., when a laptop 910 is taken home for the weekend, moved to another network, or stolen). For example, when a user disconnects the laptop 910 from the network 906, the MMS 912 running on the laptop 910 performs periodic content searches of the laptop's data storage media while the user works on the laptop at home. Specifically, the MMS 912 searches the laptop's local file system, email message archives, and so on. In addition, if instructed by the PMS 904, the MMS 912 monitors specific data operations (e.g., file reads, file writes, file updates, and reads from and writes to removable media devices such as floppy disks). In one embodiment, when the MMS 912 detects preselected data on any data storage medium of the client 910, it creates a message containing a notification of the detection and places this message in a transmission queue. Later, when the network connection is re-established, the messages in the transmission queue are sent to the message collector 914. In one embodiment, a policy maintained by the PMS 904 requires the MMS 912 to prevent access to the preselected data once it has been detected.

FIG. 10 is a flow diagram of one embodiment of a process for personal computing device-based protection of preselected data. The process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general-purpose computer system or a dedicated machine), or a combination of both. The processing logic resides on a personal computing device such as a client 910.

Referring to FIG. 10, processing logic receives instructions that define the scope of the content searches to be performed on the personal computing device (processing block 1002). In one embodiment, the instructions specify the data storage media to be searched and the periodicity of the searches. In one embodiment, the instructions also specify the data operations to be monitored for the presence of preselected sensitive data.

Next, processing logic receives an abstract data structure, or index, derived from the preselected sensitive data (processing block 1004). Several embodiments of the abstract data structure are discussed in more detail below.

At processing block 1006, processing logic uses the abstract data structure to search the contents of the data storage media of the personal computing device for the preselected sensitive data. The scope of the content search is defined by the instructions received from the server. The search covers the contents of the data storage media of the personal computing device and/or content exchanged between the user and the personal computing device. In one embodiment, the content search is performed periodically at predetermined time intervals. Several embodiments of the search techniques used by processing logic are discussed in more detail below. The term "data storage media of a personal computing device" refers to any form of data storage accessible to the personal computing device, including, for example, magnetic disks, volatile random access memory, removable media, tape backup systems, and remote network-addressable storage. In one embodiment, processing logic searches volatile storage to detect use of the preselected data by an application running on the personal computing device. If such use is detected, processing logic identifies the application that is using the preselected data.

If processing logic detects the presence of the preselected data (or a portion of it) (processing box 1008), it determines whether the policy maintained by the PMS requires blocking access to the preselected data (processing box 1009). In one embodiment, access to the detected data is blocked for the application that is attempting to access it.

If blocking is required, processing logic blocks access to the preselected data (processing block 1010). Further, processing logic determines whether the personal computing device can maintain network contact with the server or any other designated device (processing box 1011). If so, processing logic sends a message containing a notification of the detection to the server (processing block 1012). The notification identifies the personal computing device and the detected data. In one embodiment, the notification also identifies the application that was using the preselected data while running on the personal computing device.

If there is no connection between the personal computing device and the server, processing logic places the message in a queue for transmission to the server once the network connection is re-established (processing block 1014).
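The send-or-queue behavior of processing blocks 1011 through 1014 can be sketched as a small store-and-forward helper. The class name `Notifier`, its callbacks, and the choice to flush older queued messages before sending a new one are assumptions of this sketch; the description only states that queued messages are sent once the connection is re-established.

```python
from collections import deque
from typing import Callable, Deque


class Notifier:
    """Hypothetical store-and-forward queue for detection notifications."""

    def __init__(self, send: Callable[[str], None],
                 is_connected: Callable[[], bool]) -> None:
        self._send = send                   # delivers one message to the server
        self._is_connected = is_connected   # reports current connectivity
        self.pending: Deque[str] = deque()  # messages awaiting a connection

    def flush(self) -> None:
        """Send queued messages, oldest first, while the link stays up."""
        while self.pending and self._is_connected():
            self._send(self.pending.popleft())

    def report(self, notification: str) -> None:
        """Send immediately when connected (block 1012), else queue (1014)."""
        if self._is_connected():
            self.flush()                    # deliver older messages first
            self._send(notification)
        else:
            self.pending.append(notification)
```

In use, the client would call `report` for each detection and call `flush` whenever it notices that connectivity has returned.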

As discussed above, personal computing device-based monitoring allows for surveillance of content stored and processed on the personal computing device. Unlike existing desktop-based firewalls, which filter based on protocol, the content search described here addresses the specific search problem of tracking preselected database data in file systems, in memory banks, or in process data being accessed by applications.

In contrast to access control techniques, which prohibit unauthorized access by prompting the requester for credentials (e.g., a password), the client-based protection of sensitive data described here monitors content stored on the personal computing device after that content has been downloaded or accessed through the access control system.

Desktop-based encryption/decryption package systems typically rely on a server-based mechanism to encrypt data and a desktop-based mechanism to decrypt and view it; such systems work to prevent misuse of data by limiting access to the cryptographic keys that decrypt the data. The client-based protection of sensitive data described here can be used to protect data that has been left "in the clear", outside the cryptographic envelope, and is therefore vulnerable to theft by third parties.

In contrast to anti-virus solutions, which are typically used to detect the presence of hostile code hidden in attachments, the client-based protection of sensitive data described here is directed at detecting the presence of preselected database data rather than hidden code.

Driver filters, which take the form of software written to drive the operation of hardware with a content filter that monitors all content sent to the personal computing device, lack the ability to search the data storage media of the personal computing device for preselected data.

Security Requirements for One or More System Embodiments
Since embodiments of this detection system are used to enforce information security policies, the security properties of the system are paramount. In one embodiment, the chief objective of the system is to enforce security policies relevant to database data. This implies that the system must be very secure in the way it handles database data. If, in the process of protecting the database data, the system opens up new avenues for stealing that data, its ultimate purpose is defeated.

In one embodiment, the MMS is deployed in such a way as to monitor and/or block a very large number of messages flowing through the network. This means installing the MMS at various points in the network where traffic is concentrated (e.g., routers, mail systems, firewalls, desktop computers, email archive systems, etc.), either behind or in front of one of these points of concentration. Such placement gives the system an exceptional view of message traffic and increases its utility to the organization using it. Unfortunately, such placement also makes the MMS highly vulnerable to network-based attacks (commonly called "hacking"), in which a third party uses unauthorized network access to breach the security perimeter surrounding the network and steal the data contained within it. Such placement also makes the MMS vulnerable to hacking attacks by the very employees it monitors.

In another embodiment, the MMS is deployed locally on a personal computing device and is responsible for monitoring the use of local storage media, the use of classified data by applications running on the personal computing device, and network communications to and from the device. Such placement gives the system an exceptional view of the information accessed and used by the person operating the computing device and increases its utility to the organization using it. However, such placement also makes the MMS vulnerable to hacking attacks by the very employees it monitors.

Security concerns for the PMS are also high, in that its software directly queries the information sources in order to build the index that the MMS uses.

Thus, the placement of the MMS on the network in one embodiment, or on the personal computing device in another embodiment, exposes the MMS to attack. In one embodiment, these attacks can come from inside the local area network (LAN) or from outside the LAN via the WAN and/or Internet link that the organization maintains. In another embodiment, the attacks can come from the user of the personal computing device. The specific security concern here is that the MMS may contain valuable database data from the relational database that it is trying to protect. The concern is that a hacker or the user of the personal computing device may try to steal the data from the MMS rather than from the more thoroughly guarded computer on which the relational database actually runs.

A second and related security concern for the application arises when the MMS is deployed on a different LAN from the one on which the PMS is deployed. As mentioned above, this can be an important configuration that helps implement security policy across two organizations that share database data. Here again, the information stored in the MMS is exposed to information security threats.

Various embodiments address these security threats directly. One novel aspect of the embodiments described here is that the PMS/MMS pair exchanges indexes that contain no copies of the data they seek to protect. As covered above, the PMS sends abstract data structures derived from the database data to the MMS so that the MMS can enforce policy. One conceivable way to achieve this protection would be simply to copy the database into the MMS, or (equivalently, from a security perspective) to allow the MMS to query the database directly in order to check that content is consistent with policy. The problem with this approach is that it introduces significant security vulnerabilities where there were none before. With this insecure approach, the cure is worse than the disease.

In one embodiment, the PMS creates an index from the database that contains no copies of the database data, or contains only encrypted or hashed copies of the database data. Such an index may be created using a tuple storage mechanism, which provides a data structure for storing multiple tuples associated with fragments of the database data. Examples of tuple storage mechanisms include a hash table, a vector, an array, a tree, a list, or a table in a relational database management system. In the process described below, the data stored in the index retains only the relative placement of the elements in the database in relation to other elements. For example, in the case of a hash table, the index may store, for each fragment of the database data (e.g., a data fragment in a database cell), the fragment's hash code together with its row number, column number, and column type.
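A minimal Python sketch of such a hash-table index follows. It assumes SHA-256 as the hash function and a list-of-rows input table; neither choice comes from the description, and the function names are hypothetical. Each cell is stored only as a hash code mapped to its (row number, column number, column type) tuples.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Entry = Tuple[int, int, str]   # (row number, column number, column type)


def fragment_hash(fragment: str) -> str:
    """Hash code for one data fragment (SHA-256 is an assumed choice)."""
    return hashlib.sha256(fragment.encode("utf-8")).hexdigest()


def build_index(table: Sequence[Sequence[str]],
                column_types: Sequence[str]) -> Dict[str, List[Entry]]:
    """Map each cell's hash code to the positions where the cell occurs."""
    index: Dict[str, List[Entry]] = defaultdict(list)
    for row_no, row in enumerate(table):
        for col_no, cell in enumerate(row):
            index[fragment_hash(cell)].append(
                (row_no, col_no, column_types[col_no]))
    return dict(index)


def lookup(index: Dict[str, List[Entry]], token: str) -> List[Entry]:
    """Return the table positions whose cell hashes match the token."""
    return index.get(fragment_hash(token), [])
```

One caveat for any real design: an unkeyed hash of low-entropy values such as short numeric identifiers can be reversed by enumerating candidates, so a keyed hash may be worth considering. The description does not address this point.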

Other embodiments of this same solution use indexes that contain fragments of the protected intellectual property, which reduces the value of the solution by exposing that information to security threats. In one embodiment, the techniques described here specifically avoid storing any representation of the data itself, so that if a hacker breaks into the host that runs the MMS, the data exposed to theft is inconsequential.

An alternative embodiment, described within the process below, can be implemented to enhance performance. In this alternative embodiment, copies of only a small number of frequently used strings and numbers from the database, which together represent a large proportion of the data in the system, are stored directly in the index along with the rest of the information about the relative placement of data in the database tables. This is done by storing copies of these common strings themselves instead of their hash codes. For these common terms, the system stores the row number, column number, and type of the database data, but stores the string itself in place of the hash code. For the remaining, less common cells of the database, only the row number, column number, and type of the database data are stored, with no copy of the string itself. This approach exploits the fact that the statistical distribution of string and numeric data in databases is often skewed, so that the most common terms account for a very large percentage of the overall volume of stored data. Because a small number of common terms make up the bulk of lookups, storing these common terms in a separate index makes index queries efficient, and those queries can be run using standard, fast techniques from the literature (e.g., hash table lookups, bitmaps, etc.). This is not a security vulnerability because the small number of terms that account for a disproportionate share of the volume of database data are the least valuable pieces of data. The terms "John" and "Smith" are very common in databases that contain names, but the theft of these terms is of relatively little value. In this embodiment, the system carefully avoids storing copies of higher-value data made up of less common terms (e.g., credit card numbers, SSNs, uncommon names, etc.). As in the previously described embodiment, the system avoids storing any copies of such sensitive information by storing only hash codes and tuples of information related to the placement of cells in the database.
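The split between common and uncommon terms can be sketched as two lookup tables: one keyed by the plain string for frequent terms, one keyed by hash code for everything else. The frequency threshold `min_count`, the use of SHA-256, and all function names are assumptions of this sketch; the description does not say how "common" is decided.

```python
import hashlib
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Entry = Tuple[int, int, str]   # (row number, column number, column type)


def _hash(fragment: str) -> str:
    return hashlib.sha256(fragment.encode("utf-8")).hexdigest()


def build_hybrid_index(table: Sequence[Sequence[str]],
                       column_types: Sequence[str],
                       min_count: int = 2):
    """Store frequent terms verbatim and all other terms as hash codes."""
    counts = Counter(cell for row in table for cell in row)
    plain: Dict[str, List[Entry]] = defaultdict(list)
    hashed: Dict[str, List[Entry]] = defaultdict(list)
    for row_no, row in enumerate(table):
        for col_no, cell in enumerate(row):
            entry = (row_no, col_no, column_types[col_no])
            if counts[cell] >= min_count:    # common term: keep the string
                plain[cell].append(entry)
            else:                            # uncommon term: keep hash only
                hashed[_hash(cell)].append(entry)
    return dict(plain), dict(hashed)


def hybrid_lookup(plain, hashed, token: str) -> List[Entry]:
    """Check the plain-string table first, then fall back to hash codes."""
    if token in plain:
        return plain[token]
    return hashed.get(_hash(token), [])
```

A pure frequency cutoff is only a rough proxy for "low value": a repeated value is not necessarily harmless, so a real deployment would likely need an additional rule for which columns may ever be stored verbatim.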

Detection of Preselected Data
In one embodiment, the process of detecting preselected data includes two major operations, or phases: indexing and searching. In the indexing phase, the system builds an index from the preselected data. The preselected data may be any data whose relationships can be organized in tabular format. That is, the preselected data is either stored in tabular format (e.g., data in a relational database, data in an Excel spreadsheet) or is not stored in tabular format but has relationships that would allow it to be stored that way (e.g., data stored as comma-separated values in a flat file or a password database, relational data in an object-oriented database).

FIG. 4 is a flow diagram of one embodiment of a process for indexing preselected data. The process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as that run on a general-purpose computer system or a dedicated machine), or a combination of both.

Referring to FIG. 4, processing logic begins by determining whether the preselected data is stored in a standard tabular format (processing box 402). If not, processing logic converts the preselected data into a standard tabular format (processing block 404). Each cell in the resulting table stores a fragment of the preselected data. In one embodiment, each data fragment is a token. A token may be a single word or a cluster of words (e.g., words enclosed in quotation marks). For example, the word "this" may represent a token stored in a database cell, whereas the phrase "this token" represents a standalone token if it is stored in a database cell as a single string.

Next, processing logic creates a tuple storage structure derived from the preselected data (processing block 406). A tuple storage structure provides a mechanism for storing multiple tuples associated with the fragments of the preselected data. Examples of tuple storage structures include a hash table, a vector, an array, a tree, or a list. Each type of tuple storage structure is associated with a method for retrieving a set of tuples for any given content fragment (the set of tuples is empty if no match is found in the tuple storage structure).

Further, processing logic stores information about the position of each data fragment within the database in a corresponding tuple (processing block 408). In one embodiment, the information about the position of a data fragment includes the number of the row that stores the data fragment in the database. In another embodiment, this information also includes the number of the column that stores the data fragment in the database and, optionally, the data type of that column.

Thereafter, processing logic sorts the tuples in a predetermined order (e.g., ascending lexicographic order) (processing block 410).

Thus, the resulting abstract data structure (i.e., the index) contains only information about the relative placement of data records in the context of the larger whole, and does not contain any fragments of the preselected data itself.
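The indexing steps of FIG. 4 can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: a dictionary keyed by a SHA-256 digest stands in for the tuple storage structure, and `token_type` is a hypothetical two-way type classifier.

```python
import hashlib


def token_type(token):
    # Hypothetical, deliberately crude type detection.
    return "number" if token.replace(".", "", 1).isdigit() else "string"


def cell_key(cell):
    # Only this digest is stored, never the cell text itself.
    return hashlib.sha256(cell.encode("utf-8")).hexdigest()


def build_index(rows):
    """Build a tuple storage structure from tabular data (blocks 406-410)."""
    index = {}
    for row_no, row in enumerate(rows):
        for col_no, cell in enumerate(row):
            # Block 408: record position (and type) of the fragment.
            index.setdefault(cell_key(cell), []).append(
                (row_no, col_no, token_type(cell))
            )
    for tuples in index.values():
        tuples.sort()  # Block 410: ascending lexicographic order.
    return index


rows = [
    ["John", "Smith", "4111111111111111", "1200"],
    ["Ana", "Lopez", "5500005555555559", "75"],
    ["Mary", "Smith", "340000000000009", "310"],
]
index = build_index(rows)
```

Looking up `cell_key("Smith")` returns the two tuples for rows 0 and 2, while the index itself holds no cell text.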

In one embodiment, the contents of the index are treated cryptographically (e.g., with a hash function or with an encryption function that uses a cryptographic key) to further protect the index from theft.

The "search" phase of the preselected data detection process is now discussed in more detail. FIG. 5 is a flow diagram of one embodiment of a process for searching information content for preselected data. The process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as that run on a general-purpose computer system or a dedicated machine), or a combination of both.

Referring to FIG. 5, processing logic begins by receiving information content (processing block 502). The information content may be contained in a file (e.g., an archived email message stored on a computer's hard drive) or in a block of data transmitted over a network (e.g., an email message transmitted over the network using any type of network protocol).

Next, processing logic detects, within the information content, a sequence of content fragments that may contain a portion of the preselected data (processing block 504). As discussed above, the preselected data may be proprietary database data that needs to be protected, or any other kind of data that has an inherent tabular structure. That is, the preselected data may be stored in tabular format (e.g., data in a relational database, data in an Excel spreadsheet), or it may not be stored in tabular format but have relationships that would allow it to be stored that way (e.g., data stored as comma-separated values in a flat file or a password database, relational data in an object-oriented database).

In one embodiment, the detected sequence of content fragments is a set of adjacent tokens within the information content. Each token corresponds to either a word or a phrase. The detected sequence of content fragments may be a portion of the received information content or the entire information content.

In one embodiment, processing logic decides that a sequence of content fragments may contain a portion of the preselected data upon determining that the sequence of content fragments resembles column-formatted data. This determination is made by parsing the received information content to identify separate lines (which may be indicated, for example, by the tags <cr> or <cr><lf>) and finding that these lines contain the same number of tokens and, optionally, tokens of similar data types.
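As a rough illustration of this column-format heuristic, the Python sketch below flags runs of adjacent lines whose tokens agree in number and type. The whitespace tokenizer, the two-way type classifier, and the `min_run` parameter are assumptions for the example, not details given in the source.

```python
def token_type(token):
    # Hypothetical lexical typing: digits (with one optional dot) vs. the rest.
    return "number" if token.replace(".", "", 1).isdigit() else "string"


def signature(line):
    # Number of tokens and the type of each, as one comparable tuple.
    return tuple(token_type(tok) for tok in line.split())


def columnar_lines(text, min_run=2):
    """Return indexes of lines that sit in a run of >= min_run adjacent
    lines sharing the same multi-token signature."""
    lines = text.splitlines()  # handles both <cr> and <cr><lf> endings
    sigs = [signature(line) for line in lines]
    flagged = set()
    start = 0
    for i in range(1, len(lines) + 1):
        if i == len(lines) or sigs[i] != sigs[start]:
            if len(sigs[start]) > 1 and i - start >= min_run:
                flagged.update(range(start, i))
            start = i
    return sorted(flagged)


message = "Hello team,\r\nJohn Smith 1200\r\nMary Jones 450\r\nBye\r\n"
suspect = columnar_lines(message)
```

Here only the two middle lines share a signature (string, string, number), so they are the ones passed on to the index lookup.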

In another embodiment, processing logic decides that a sequence of content fragments may contain a portion of the preselected data upon parsing the entire information content and searching blocks of contiguous tokens for preselected data. In one embodiment, the blocks of contiguous tokens are defined based on user-specified parameters such as a user-specified width of each block and a user-specified position of each block within the information content (e.g., the user may require that two adjacent blocks be separated by a certain number of tokens).

In yet another embodiment, processing logic decides that a sequence of content fragments may contain a portion of the preselected data upon finding in the information content an expression of a predefined format. Such an expression may be, for example, an account number, a social security number, a credit card number, a phone number, a postal code, an email address, or a text format indicating a monetary or numeric value (e.g., a "$" sign followed by digits). Once the expression is found, processing logic decides that the region of text surrounding the expression may contain a portion of the preselected data. The size of this region is defined by a predetermined number of tokens on each side of the found expression.

In yet another embodiment, processing logic decides that a sequence of content fragments may contain a portion of the preselected data upon determining that certain properties associated with the received information content indicate, based on a history of previous violations, that the information content may contain preselected data. These properties may include, for example, the destination of the information content (e.g., the recipient of an electronic message), the origin of the information content, the time of transmission associated with the information content, the size of the transmission, and the types of files contained in the transmission (e.g., Multipurpose Internet Mail Extensions (MIME) types of files). In one embodiment, the history of previous violations is maintained by identifying, upon each detection of preselected data, the properties of the information content in which the preselected data was detected and recording these properties in a previous-violation database. Subsequently, when deciding whether a sequence of content fragments in new information content may contain a portion of the preselected data, processing logic identifies the properties of the new information content and searches the previous-violation database for these properties. If a match is found, processing logic determines whether the previous violations associated with the matching property indicate a possibility that the new information content contains preselected data. This indication may be based on the number of previous violations associated with the matching property or on their frequency. For example, it may be based on the total number of violations committed by a particular sender, or on the frequency of those violations over a given period.
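A small Python sketch of such a previous-violation database follows. The class name, the dictionary-of-properties representation, and the `min_count` and `window` thresholds are hypothetical; the source only says that the indication may rest on the number or frequency of earlier violations.

```python
from collections import defaultdict


class ViolationHistory:
    """In-memory stand-in for the previous-violation database."""

    def __init__(self):
        # (property name, property value) -> timestamps of past violations
        self._events = defaultdict(list)

    def record(self, properties, timestamp):
        # Called each time preselected data is detected in some content.
        for item in properties.items():
            self._events[item].append(timestamp)

    def is_suspicious(self, properties, now, min_count=3, window=None):
        # Flag new content if any of its properties is tied to at least
        # `min_count` violations, optionally only within the last `window`.
        for item in properties.items():
            times = self._events.get(item, [])
            if window is not None:
                times = [t for t in times if now - t <= window]
            if len(times) >= min_count:
                return True
        return False


history = ViolationHistory()
for day in (1, 2, 40):
    history.record({"sender": "a@example.com", "mime": "text/csv"}, day)

new_mail = {"sender": "a@example.com", "mime": "text/plain"}
by_total = history.is_suspicious(new_mail, now=41)
by_recent = history.is_suspicious(new_mail, now=41, window=7)
```

With these values the sender is flagged on total count (three violations) but not on recent frequency (only one in the last seven days).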

Thereafter, upon detecting a sequence of content fragments that may contain a portion of the preselected data, processing logic determines whether any subset of these content fragments matches a subset of the preselected data (processing block 506). This determination is made using an index (referred to herein as an abstract data structure) that defines the tabular structure of the preselected data.

FIG. 6 is a flow diagram of one embodiment of a process for finding a match for a subset of content fragments in an abstract data structure derived from preselected data. The process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as that run on a general-purpose computer system or a dedicated machine), or a combination of both.

Referring to FIG. 6, processing logic begins by parsing the sequence of content fragments identified at processing block 504 of FIG. 5 into content fragments (e.g., tokens). Then, for each content fragment, processing logic searches the abstract data structure for a set of matching tuples (processing block 602). For example, the word "Smith" contained in the information content may occur several times in the preselected data reflected in the abstract data structure. Specifically, each of these occurrences has a corresponding tuple in the abstract data structure. During the search, processing logic retrieves the set of tuples corresponding to the occurrences of the word "Smith" in the preselected data. Each tuple stores information about the position of this data fragment within the database or table that stores the preselected data. In one embodiment, the position information includes the row number of the cell that stores the data fragment. In another embodiment, the position information also includes the column number of this cell and, optionally, the data type of that column.

Next, processing logic combines the sets of matching tuples found for all the content fragments (processing block 604) and then groups the combined sets of matching tuples by row number into groups L (processing block 606). As a result, each group L (referred to herein as an accumulator) contains sets of matching tuples that all have the same row number; that is, the matching tuples in each group L correspond to fragments of preselected data that all appear to come from the same row in the database.

Further, processing logic sorts the groups L by the number of matching tuple sets contained in each group (processing block 608) and, in one embodiment, selects those groups whose tuple sets have distinct column numbers (processing block 610). Thereafter, processing logic determines whether any of the selected groups has a sufficiently large number of matching tuple sets (processing block 612). For example, if the number of matching tuple sets in one group exceeds "3", there is a high likelihood that the information content includes data from four or more columns of the same row in the database.
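The accumulator logic of FIG. 6 can be sketched in Python as shown below. This is one possible reading under stated assumptions: the index is a dictionary keyed by SHA-256 digests, tuples carry only (row, column), and `min_columns` is a hypothetical name for the "sufficiently large" threshold.

```python
import hashlib
from collections import defaultdict


def h(token):
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def build_index(rows):
    index = defaultdict(list)
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            index[h(cell)].append((r, c))
    return dict(index)


def find_matching_rows(tokens, index, min_columns=4):
    # Blocks 602-606: look up each token and group hits by row number.
    accumulators = defaultdict(set)  # row number -> distinct column numbers
    for token in tokens:
        for row, col in index.get(h(token), []):
            accumulators[row].add(col)
    # Block 608: largest accumulators first (row number breaks ties).
    ranked = sorted(accumulators.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    # Blocks 610-612: keep rows matched in enough distinct columns.
    return [(row, sorted(cols)) for row, cols in ranked
            if len(cols) >= min_columns]


rows = [
    ["John", "Smith", "4111111111111111", "1200"],
    ["John", "Jones", "5500005555555559", "75"],
]
index = build_index(rows)
tokens = "John Smith owes 1200 on card 4111111111111111".split()
strong = find_matching_rows(tokens, index)
everything = find_matching_rows(tokens, index, min_columns=1)
```

Row 0 is matched in four distinct columns and is reported; row 1 shares only "John" and is dropped by the threshold.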

Exemplary embodiments of the search process are now described. FIGS. 7A-7C are flow diagrams of alternative embodiments of a process for searching an incoming message using a hash table index of preselected data. The process is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as that run on a general-purpose computer system or a dedicated machine), or a combination of both.

Referring to FIG. 7A, processing logic begins by parsing an incoming message (processing block 702). Next, processing logic determines whether the parsed portions of the incoming message contain column-formatted data (processing block 704). In one embodiment, lexical analysis is used to identify lines in the parsed portions of the incoming message (e.g., by finding the tags <cr> or <cr><lf> that are used to separate lines) and then to detect that the tokens found in adjacent lines are identical in number and in type. In one embodiment, processing logic stores the type of each token along with the total number of tokens.

If the determination made at processing box 704 is negative, processing transitions to processing block 702. Otherwise, processing transitions to processing block 706, where processing logic sets i equal to the first line that resembles column-formatted data.

Next, processing logic applies a hash function H(k) to each token in line i (processing block 708), finds a set of tuples at H(k) in the hash table for each token in line i, adds the tuples to list L, and regroups list L into a set of accumulators in which the tuples of each accumulator have the same row number value (processing block 712). Further, processing logic sorts list L by the length of each Ai (processing block 714) and checks for unique occurrences of columns in the sorted list L (processing block 716). At processing block 710, optional preprocessing logic may be performed to filter the tokens before insertion into list L so that only those tuples whose type matches the lexical type of the original token k are added to L. In some embodiments, the check for unique occurrences of columns is omitted for reasons of speed or simplicity. In yet other embodiments, the tuples are simple "singletons" containing row numbers only (i.e., no column number and no type indicator).

Thereafter, if the incoming message contains more lines that resemble column-formatted data (processing box 718), processing logic advances i to the next line that resembles column-formatted data (processing block 722), and processing transitions to processing block 706. Otherwise, processing logic reports the lines of text whose Ai exceeds a predetermined size and has unique column numbers (processing block 720).

Referring to FIG. 7B, processing logic begins by receiving user-specified parameters "width" (W) and "jump" (J) (processing block 732) and parsing an incoming message (processing block 734). Parameter W specifies the number of contiguous tokens in each block of contiguous tokens that is to be searched during a single iteration, and parameter J specifies the required number of tokens between two adjacent blocks.

Next, processing logic sets the value of the position variable (S_t) to zero (processing block 736) and defines a block to be searched (a "text block") by collecting W contiguous tokens of the message starting at S_t (processing block 738).

Further, processing logic applies a hash function H(k) to each token in the text block (processing block 740), finds a set of tuples at H(k) in the hash table for each token in the text block, adds to list L the tuples that have the same type as the corresponding tokens in the text block (processing block 742), regroups list L into a set of accumulators (processing block 744), sorts list L by the length of each Ai (processing block 746), and checks for unique occurrences of columns in the sorted list L (processing block 748).

Thereafter, processing logic increments S_t by J tokens (processing block 750) and determines whether position S_t is still within the message (processing box 752). If the determination is positive, processing transitions to processing block 738. Otherwise, processing logic reports the text blocks whose Ai exceeds a predetermined size and has unique column numbers (processing block 758).
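The way W and J carve a message into text blocks can be shown with a short Python generator. It follows the flow described for FIG. 7B literally (start at zero, take W tokens, advance the start position by J); the function and parameter names are illustrative only.

```python
def text_blocks(tokens, width, jump):
    """Yield (start, block) pairs: `width` contiguous tokens starting at
    S_t, with S_t advanced by `jump` tokens after each block."""
    if width < 1 or jump < 1:
        raise ValueError("width and jump must be positive")
    start = 0                                     # block 736: S_t = 0
    while start < len(tokens):                    # box 752: still in message?
        yield start, tokens[start:start + width]  # block 738
        start += jump                             # block 750


tokens = "a b c d e f g".split()
blocks = list(text_blocks(tokens, width=3, jump=2))
```

With `width=3` and `jump=2` consecutive blocks overlap by one token; choosing `jump` equal to `width` gives non-overlapping blocks. Each block would then be hashed and grouped into accumulators as in blocks 740 to 748.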

Referring to FIG. 7C, processing logic begins by parsing an incoming message (processing block 764) and looking for a first expression that has a user-specified format (processing block 766). Such an expression may be, for example, an account number, a social security number, a credit card number, or a text format indicating a monetary or numeric value (e.g., a "$" sign followed by digits). If no matching expression is found, processing transitions to processing block 764. Otherwise, processing transitions to processing block 768, where processing logic defines a block to be searched (a "text block") by collecting W contiguous tokens before and after the matching expression. For example, the text block may consist of the 10 tokens immediately preceding the matching expression, the matching expression itself, and the 10 tokens immediately following it.

Further, processing logic applies a hash function H(k) to each token in the text block (processing block 770), finds a set of tuples at H(k) in the hash table for each token in the text block, adds to list L the tuples that have the same type as the corresponding tokens in the text block (processing block 772), regroups list L into a set of accumulators (processing block 774), sorts list L by the length of each Ai (processing block 776), and checks for unique occurrences of columns in the sorted list L (processing block 778).

Thereafter, processing logic determines whether the message contains any more expressions of the user-specified format (processing box 780). If the determination is positive, processing transitions to processing block 768. Otherwise, processing logic reports the text blocks whose Ai exceeds a predetermined size and has unique column numbers (processing block 782).
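The expression-anchored blocks of FIG. 7C might be formed as in the Python sketch below. The social-security-number pattern is just one example of a user-specified format, and matching whole whitespace-delimited tokens is a simplifying assumption.

```python
import re

# Hypothetical user-specified format: a U.S. social security number.
SSN = re.compile(r"\d{3}-\d{2}-\d{4}")


def blocks_around(tokens, pattern, width):
    """Yield one text block per token that fully matches `pattern`:
    up to `width` tokens before it, the match, and `width` tokens after."""
    for i, token in enumerate(tokens):
        if pattern.fullmatch(token):
            yield tokens[max(0, i - width): i + width + 1]


tokens = "please bill John Smith 123-45-6789 for the order".split()
found = list(blocks_around(tokens, SSN, width=2))
```

Each yielded block would then go through the same hashing and accumulator steps (blocks 770 to 778) as any other text block.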

Exemplary Application
In one embodiment, in the normal course of operation, it is assumed that the PMS is positioned on a corporate network so that it can communicate securely with the organization's database (where the records that need protection reside). In the normal course of operation, it is also assumed that the MMS is positioned so that it can monitor and/or intercept all of the organization's external email communications.

In this example, assume that the organization seeks to protect a database table called "CustomerRecords" that contains four columns: 1) first name, 2) last name, 3) credit card number, and 4) account balance. An employee of this organization would use the user interface application provided by the PMS to specify that the customer record table requires protection against theft via email. The PMS would then build an index of the records in the customer record table, consisting of a hash table derived from the string values of the cells in the database. That is, the values contained in the cells are used to look up entries in the hash table. The hash table itself contains records of the respective row number, column number, and data type of the cell itself. For the collisions that frequently occur in a hash table, a "collision list" holds multiple such records of row number, column number, and type. Once all cells in the database table have been hashed into such a structure, the index is complete and ready to be transmitted to the MMS. Note that the index contains no records of the database data itself. This is a key security constraint satisfied by this system.

After the MMS receives the index, it parses the message and re-creates the hash table in memory in the same form in which it was created by the PMS.

When the MMS picks up an external email message and parses it, it uses this index in the manner described below to detect whether any of these emails contain data from the database. This is done by parsing the individual lines of text from the email message. This involves decoding the surrounding file types and converting everything to raw text (e.g., stripping all formatting information from a Microsoft Word file and leaving only the text itself). This series of lines of text is then parsed into individual words by looking for separation marks such as the "space" character or other forms of punctuation. These words are the text tokens. For each line of text tokens, the system consults the index by applying the hash function to each token. The result of this operation is a hash table collision list for each token on the line. As described above, each collision list is itself a set of data elements that store possible triplets of row number, column number, and type. If the union of all triplets from all collision lists is taken, and a set of triplets is found that share the same row number but have different column numbers, then with high probability this line of text from the email message contains a record from the database. Note that the term "tuple" as used here is not limited to the specific case of triplets of row number, column number, and type, and may also refer to data structures that do not contain all three of these parameters. For example, in one embodiment a tuple contains the row number but not the column number or the type of the database data.
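The per-line lookup against collision lists can be sketched in Python as follows. Everything concrete here is an assumption for illustration: the bucket count, the use of SHA-256 to pick a bucket, the crude type classifier, and the `min_columns` threshold.

```python
import hashlib

NUM_BUCKETS = 1 << 16  # hypothetical hash table size


def token_type(token):
    return "number" if token.replace(".", "", 1).isdigit() else "string"


def bucket(token):
    # H(k): map a token to a bucket number; the token itself is not kept.
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def build_hash_table(rows):
    # Each bucket holds a collision list of (row, column, type) triplets.
    table = {}
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            table.setdefault(bucket(cell), []).append((r, c, token_type(cell)))
    return table


def scan_line(line, table, min_columns=3):
    # Union the triplets for every token, then look for rows that were
    # hit in at least `min_columns` different columns.
    columns_by_row = {}
    for token in line.split():
        for row, col, typ in table.get(bucket(token), []):
            if typ == token_type(token):
                columns_by_row.setdefault(row, set()).add(col)
    return sorted(r for r, cols in columns_by_row.items()
                  if len(cols) >= min_columns)


rows = [
    ["John", "Smith", "4111111111111111", "1200"],
    ["Mary", "Jones", "5500005555555559", "75"],
]
table = build_hash_table(rows)
hits = scan_line("Re: John Smith card 4111111111111111 is overdue", table)
```

Because unrelated tokens can land in the same bucket, a single hit proves little; requiring several distinct columns of one row is what makes a report meaningful.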

Comparison with the Prior Art
Database query mechanisms differ significantly from the teachings described here. One difference is that B-trees actually contain fragments of the database tables that they index. In the approach described above, no copies of the database data are stored inside the index. This is important because, as noted above, the MMS must have a copy of the index in order to protect the data from escape, yet the MMS is also best deployed at a position in the network where it may be exposed to significant threats. Keeping the index that the MMS uses free of any components of the database data is a key requirement.

Another difference between standard database query mechanisms and the invention outlined here concerns the type of query required. The standard set of queries used with relational databases is based on predicate logic using connectives such as AND and OR. This basic system does not work well for detecting database data that is typically cut and pasted into email and webmail messages. Database data cut and pasted into an email message usually comes from a report, and each line often contains extraneous data that is not found in the database table. One example is an email message containing accounting information for a group of consumers. Such a message contains many records from the core database that requires protection (for example, first name, last name, and social security number), but it may also contain information that is not in the core database tables. A typical example is information "joined" from other databases. Another example is the simple line-formatting tokens that separate the fields of database data. Because this extra data may appear on each line, standard predicate-logic connectives such as AND and OR, applied to each token on a line of an outgoing message, produce either too many hits (in the case of OR) or zero hits (in the case of AND). As described here, the system can detect the presence of n or more tokens that all come from the same row of a database table, even when n is much smaller than the total number of tokens on the line. This is another important difference between the present invention and the prior art on database and document query mechanisms described above.
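The contrast between AND/OR matching and the "n or more tokens from the same row" test can be sketched as follows. This is an illustrative Python example under stated assumptions: the tokenizer, the unsalted SHA-256 hash, the sample table, and all function names are hypothetical and do not come from this specification.

```python
import hashlib
import re
from collections import defaultdict


def h(token: str) -> str:
    """Hash a normalized token (SHA-256 is an assumed choice)."""
    return hashlib.sha256(token.strip().lower().encode("utf-8")).hexdigest()


def build_index(rows):
    """Map each cell hash to the (row, column) positions where it occurs."""
    index = defaultdict(list)
    for row_no, row in enumerate(rows):
        for col_no, cell in enumerate(row):
            index[h(cell)].append((row_no, col_no))
    return index


def rows_matched(line: str, index, n: int):
    """Return row numbers contributing at least n distinct columns to the line."""
    cols_by_row = defaultdict(set)
    for token in re.findall(r"[\w-]+", line):
        for row_no, col_no in index.get(h(token), []):
            cols_by_row[row_no].add(col_no)
    return sorted(r for r, cols in cols_by_row.items() if len(cols) >= n)


table = [
    ["Alice", "Smith", "123-45-6789"],
    ["Bob", "Jones", "987-65-4321"],
]
index = build_index(table)

# A report line with reordered fields, separators, and data joined from elsewhere.
line = "Q3 report | Smith | Alice | acct 123-45-6789 | balance 1,024.50 | Jones"
tokens = re.findall(r"[\w-]+", line)

and_hit = all(h(t) in index for t in tokens)  # strict AND over every token: False
or_hit = any(h(t) in index for t in tokens)   # OR over every token: True
matched = rows_matched(line, index, n=3)      # threshold match per row: [0]
```

With the threshold at 3, only row 0 is reported, even though the line also carries a stray token ("Jones") from row 1 and several tokens that appear nowhere in the table. The AND test fails because of the extra tokens, and the OR test would fire on any line containing a single common surname.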

There are several significant differences between the technique described above and information retrieval technologies. First, the indexes of those systems contain (in a concordance) the same terms that are stored in the database being protected. Here again, because the system deploys this index at a position on the network that may be under threat from hackers, this is a clear drawback. Second, these query systems execute Boolean queries using forms of predicate logic such as AND and OR. As noted earlier, this approach is at a clear disadvantage for detecting database records that may have been "joined" with extraneous data from other tables.

The technique of file shingling is similar to the technique described here but differs from it substantially. In file shingling, the subject of interest is text data (prose, software, outlines, and the like). The technique described here focuses on protecting database data. One difference is that database data from a given database table may appear in a test message with its rows or columns in an arbitrarily permuted order. These permutations are usually a simple consequence of the query mechanism applied to extract the database data. A database query can produce a block of database data in any column order and any row order. For this reason, the basic technique of file shingling does not work when applied to database data. File shingling assumes that the same linear sequence is followed in both the protected document and the test document.
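The sensitivity of shingling to column order can be illustrated with a small Python sketch. The shingle width, the Jaccard measure, and the sample data are assumptions chosen for illustration; the order-insensitive comparison at the end is only a contrast and is not the method of this specification.

```python
def shingles(tokens, k=3):
    """Return the set of k-token contiguous windows (shingles)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two rows of a hypothetical protected table, flattened in stored column order.
protected = ["alice", "smith", "123-45-6789", "bob", "jones", "987-65-4321"]
# The same cells as they might appear when a query returns the columns reversed.
reordered = ["123-45-6789", "smith", "alice", "987-65-4321", "jones", "bob"]

shingle_score = jaccard(shingles(protected), shingles(reordered))  # 0.0
set_score = jaccard(set(protected), set(reordered))                # 1.0
```

Every cell of the protected data is present in the reordered text, yet no 3-token shingle survives the column permutation, so a shingle-based comparison reports no overlap at all.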

There are many important differences between Internet content filtering systems and the teachings described here. As noted earlier, Internet content filtering systems are based on keyword searches. The technique described above builds an abstract data structure from the database to be protected. This abstract data structure does not contain fragments of the text it is trying to protect. A keyword filtering system, by contrast, must contain some representation of the text it is searching for in order to run queries on that text. The second important difference is that these Internet content filtering systems are not intended to protect database data. Using regular expression matching to detect violations of an organization's confidentiality policy on database data makes for a very inaccurate method of detection. These systems are used primarily to stop employees from abusing the Internet where pornographic or abusive content and language are concerned. Such a system, if applied to the protection of database data, would use regular expressions to match database records. This would also transfer fragments of the database data to the computers on the network where the security risk is greatest.

Exemplary Computer System
FIG. 8 is a block diagram of an exemplary computer system that performs one or more of the operations described herein. As shown in FIG. 8, computer system 800 comprises an exemplary client 850 or server 800 computer system. Computer system 800 comprises a communication mechanism or bus 811 for communicating information, and a processor 812 coupled to bus 811 for processing information. Processor 812 includes, but is not limited to, a microprocessor such as a Pentium™, PowerPC™, or Alpha™.

System 800 further comprises a random access memory (RAM) or other dynamic storage device 804 (referred to as main memory), coupled to bus 811, for storing information and instructions to be executed by processor 812. Main memory 804 is also used to store temporary variables or other intermediate information while processor 812 is executing instructions.

Computer system 800 also comprises a read-only memory (ROM) and/or other static storage device 806, coupled to bus 811, for storing static information and instructions for processor 812, and a data storage device 807, such as a magnetic disk or optical disk and its corresponding disk drive. Data storage device 807 is coupled to bus 811 for storing information and instructions.

Computer system 800 is further coupled to a display device 821, such as a cathode ray tube (CRT) or liquid crystal display (LCD), which is coupled to bus 811 for displaying information to a computer user. An alphanumeric input device 822, including alphanumeric and other keys, is also coupled to bus 811 for communicating information and command selections to processor 812. An additional user input device is cursor control 823, such as a mouse, trackball, trackpad, stylus, or cursor direction keys, coupled to bus 811 for communicating direction information and command selections to processor 812 and for controlling cursor movement on the display.

Another device coupled to bus 811 is hard copy device 824, which is used for printing instructions, data, or other information on a medium such as paper, film, or similar types of media. In addition, a sound recording and playback device, such as a speaker and/or microphone, is optionally coupled to bus 811 for audio interfacing with computer system 800. Another device that may be coupled to bus 811 is a wired/wireless communication device 825 for communicating with a telephone or handheld device.

Note that any or all of the components of system 800 and associated hardware may be used in the present invention. However, other configurations of the computer system may include some or all of these devices.

Whereas many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after reading the foregoing description, the particular embodiments shown and described by way of illustration are not intended to limit the invention. Therefore, references to details of the various embodiments are not intended to limit the scope of the claims, which themselves recite only those features regarded as essential to the invention.

Illustrates one embodiment of a workflow.
Illustrates an exemplary mode of operation.
Illustrates an exemplary mode of operation.
A flow diagram of one embodiment of a process for protecting database data.
A flow diagram of one embodiment of a process for indexing database data.
A flow diagram of one embodiment of a process for searching information content for preselected data.
A flow diagram of one embodiment of a process for finding a match for a subset of content fragments in an abstract data structure derived from preselected data.
A flow diagram of an alternative embodiment of a process for searching an incoming message using a hash table index of preselected data.
A flow diagram of an alternative embodiment of a process for searching an incoming message using a hash table index of preselected data.
A flow diagram of an alternative embodiment of a process for searching an incoming message using a hash table index of preselected data.
A block diagram of an exemplary computer system that performs one or more of the operations described herein.
A block diagram of one embodiment of a system for client-based protection of preselected sensitive data.
A flow diagram of one embodiment of a process for client-based protection of preselected sensitive data.

Claims

A method comprising:
monitoring, by a message monitoring system, for preselected data, a plurality of messages electronically transmitted over a network toward individual destinations;
performing, by the message monitoring system, a content search of the plurality of messages to determine whether one or more of the messages include at least a portion of the preselected data, wherein performing the content search of a message containing a plurality of content fragments comprises:
searching an abstract data structure to identify entries in the abstract data structure that include a pre-created hash or ciphertext of the preselected data corresponding to a hash or ciphertext of a content fragment, wherein the abstract data structure is an index of the preselected data having a table structure, each entry in the abstract data structure includes a pre-created hash or ciphertext of the contents of a cell in the table structure and the row number of the cell, and the abstract data structure does not represent the contents of the cells in the table structure of the preselected data;
determining groups of content fragments, each group containing content fragments associated with hashes or ciphertexts that correspond to pre-created hashes or ciphertexts of entries in the abstract data structure having the same row number; and
determining that the message includes at least a portion of the preselected data if any group satisfies at least one condition; and
preventing the message from reaching its individual destination in response to the message monitoring system determining that the message on which the content search was performed includes at least a portion of the preselected data.

The method of claim 1, wherein the abstract data structure does not include a copy of the preselected data, and wherein the at least one condition is a quantity threshold that is satisfied if any group includes a quantity of content fragments that reaches or exceeds the quantity threshold.

The method of claim 1, wherein each entry in the abstract data structure further comprises a column number of a cell in the table structure and a column type of a cell in the table structure.

The method of claim 1, further comprising creating the abstract data structure based on the preselected data extracted from a database prior to performing the content search.

The method of claim 1, wherein each entry in the abstract data structure further includes the column number of the cell, the condition is a quantity threshold for content fragments, and performing the content search of the message further comprises:
parsing individual lines of text in the message into content fragments that are words or phrases;
applying a hash function to at least a portion of the content fragments to create hashes of the content fragments;
creating a hash table collision list for each content fragment, wherein the hash table collision list is a list of matches between the pre-created hashes of entries in the abstract data structure and the hash of the content fragment, and each of the groups has content fragments associated with hash table collision lists identifying entries in the abstract data structure having the same row number; and
determining that the message includes at least a portion of the preselected data if any group includes at least the quantity threshold of content fragments associated with hash table collision lists identifying entries in the abstract data structure having different column numbers.

The method of claim 1, wherein performing the content search of the message further comprises:
prior to searching the abstract data structure, parsing each line of text in the message into content fragments that are words or phrases, and detecting in the message a sequence of content fragments that may contain a portion of the preselected data;
applying a hash function or encryption function to at least a portion of the content fragments in the sequence to create hashes or ciphertexts of the content fragments; and
searching the abstract data structure to identify entries in the abstract data structure that include a pre-created hash or ciphertext of the preselected data corresponding to a hash or ciphertext of a content fragment in the sequence.

Detecting the sequence of content fragments that may contain a portion of the preselected data comprises:
searching the body of the message, using a search rule that defines a predetermined format, for an expression having the predetermined format, wherein the expression having the predetermined format is an account number, a social security number, a credit card number, a telephone number, a postal code, an email address, a cash amount, or a driver's license number; and
determining that a region surrounding the expression may include a portion of the preselected data, wherein the region surrounding the expression includes a quantity of data fragments adjacent to the expression before and after it.

An apparatus comprising:
means for monitoring, for embedded preselected data, a plurality of messages electronically transmitted over a network toward individual destinations;
means for performing a content search of the plurality of messages to determine whether one or more of the messages include at least a portion of the preselected data, wherein performing the content search of a message containing a plurality of content fragments comprises:
searching an abstract data structure to identify entries in the abstract data structure that include a pre-created hash or ciphertext of the preselected data corresponding to a hash or ciphertext of a content fragment, wherein the abstract data structure is an index of the preselected data having a table structure, each entry in the abstract data structure includes a pre-created hash or ciphertext of the contents of a cell in the table structure and the row number of the cell, and the abstract data structure does not represent the contents of the cells in the table structure of the preselected data;
determining groups of content fragments, each group containing content fragments associated with hashes or ciphertexts that match pre-created hashes or ciphertexts of entries in the abstract data structure having the same row number; and
determining that the message includes at least a portion of the preselected data if any group satisfies at least one condition; and
means for preventing the message from reaching its individual destination in response to determining that the message on which the content search was performed includes at least a portion of the preselected data.

A computer-readable storage medium storing instructions that, when executed on a processor, cause the processor to perform a method comprising:
monitoring, for preselected data, a plurality of messages electronically transmitted over a network toward individual destinations;
performing a content search of the plurality of messages to determine whether one or more of the messages include at least a portion of the preselected data, wherein performing the content search of a message containing a plurality of content fragments comprises:
searching an abstract data structure to identify entries in the abstract data structure that include a pre-created hash or ciphertext of the preselected data corresponding to a hash or ciphertext of a content fragment, wherein the abstract data structure is an index of the preselected data having a table structure, each entry in the abstract data structure includes a pre-created hash or ciphertext of the contents of a cell in the table structure and the row number of the cell, and the abstract data structure does not represent the contents of the cells in the table structure of the preselected data;
determining groups of content fragments, each group containing content fragments associated with hashes or ciphertexts that match pre-created hashes or ciphertexts of entries in the abstract data structure having the same row number; and
determining that the message includes at least a portion of the preselected data if any group satisfies at least one condition; and
preventing the message from reaching its individual destination in response to determining that the message on which the content search was performed includes at least a portion of the preselected data.

A method comprising:
receiving, by a personal computer device from a server, an abstract data structure derived from data elements of preselected sensitive data having a table structure, wherein the abstract data structure is an index of the preselected data having the table structure, each entry in the abstract data structure includes a pre-created hash or ciphertext of the contents of a cell in the table structure and the row number of the cell, and the abstract data structure does not contain the contents of the cells in the table structure of the preselected sensitive data;
searching the contents of a plurality of data storage media of the personal computer device for indicia indicating that at least a portion of the preselected sensitive data stored on the server is included in the contents of the plurality of data storage media, wherein the searching comprises:
searching the abstract data structure to identify entries in the abstract data structure that include a pre-created hash or ciphertext of the preselected data matching a hash or ciphertext of a content fragment of content from the storage media, the content fragment including a word or phrase; and
determining groups of content fragments, each group containing content fragments associated with hashes or ciphertexts that match pre-created hashes or ciphertexts of entries having the same row number in the abstract data structure;
detecting, in the contents of at least one of the plurality of data storage media of the personal computer device, at least a portion of the preselected sensitive data based on a group satisfying at least one condition, thereby detecting that a portion of the preselected sensitive data stored on the server has been stored on the personal computer device by a user of the personal computer device; and
sending a notification from the personal computer device to the server regarding the detection of the portion of the preselected sensitive data.

The method of claim 10, further comprising preventing access to the detected data if at least a portion of the preselected sensitive data is detected.

The method of claim 10, wherein the content is searched periodically.

The method of claim 10, wherein the content is searched when the personal computer device is disconnected from the network.

The method of claim 13, wherein sending the notification comprises:
creating, upon detecting the preselected sensitive data, a message that includes a notification of the detection of the preselected sensitive data;
placing the message in a transmission queue; and
transmitting the message to the system after the personal computer device is reconnected to the system.

The method of claim 10, further comprising receiving instructions from the server that define a search range for the personal computer device.

The method of claim 10, wherein searching the contents of the plurality of data storage media of the personal computer device includes monitoring one or more specific data operations for the presence of at least a portion of the preselected sensitive data.

The method of claim 16, wherein at least one of the one or more specific data operations is selected from the group consisting of a file read, a file write, a file update, a read from a removable media device, a write to a removable media device, and access to data stored in any of the plurality of data storage media by a program running on the personal computer device.

The method of claim 10, wherein the preselected sensitive data is maintained by an organization in at least one of a spreadsheet, a flat file, and a database.

The method of claim 18, wherein each entry in the abstract data structure further includes the column number of the cell and optionally a column type, and the searching further comprises:
parsing the contents of the storage media into content fragments;
applying a hash function to at least a portion of the content fragments to create hashes of the content fragments; and
creating a hash table collision list for each of the content fragments, wherein the hash table collision list is a list of matches between the pre-created hashes of entries in the abstract data structure and the hash of the content fragment, and each group has content fragments associated with hash table collision lists that identify entries in the abstract data structure having the same row number.

The method of claim 10, wherein the plurality of data storage media are selected from the group consisting of main memory, static memory, and mass storage memory.

The method of claim 10, wherein searching the contents of the plurality of data storage media comprises:
searching the contents of each volatile storage device in the plurality of data storage media; and
searching the contents of each permanent storage device in the plurality of data storage media.

The method of claim 21, further comprising detecting use of the preselected data by an application running on the personal computing device.

The method of claim 21, further comprising:
identifying the application using the preselected data; and
reporting the identified application.

An apparatus comprising:
means for receiving, by a personal computer device from a server, an abstract data structure derived from data elements of preselected sensitive data having a table structure, wherein the abstract data structure is an index of the preselected data having the table structure, each entry in the abstract data structure includes a pre-created hash or ciphertext of the contents of a cell in the table structure and the row number of the cell, and the abstract data structure does not contain the contents of the cells in the table structure of the preselected sensitive data;
means for searching the contents of a plurality of data storage media of the personal computer device for indicia indicating that at least a portion of the preselected sensitive data stored on the server is included in the contents of the plurality of data storage media, wherein the means for searching comprises:
means for searching the abstract data structure to identify entries in the abstract data structure that include a pre-created hash or ciphertext of the preselected data matching a hash or ciphertext of a content fragment of content from the storage media, the content fragment including a word or phrase; and
means for determining groups of content fragments, each group containing content fragments associated with hashes or ciphertexts that match pre-created hashes or ciphertexts of entries having the same row number in the abstract data structure;
means for detecting, in the contents of at least one of the plurality of data storage media of the personal computer device, at least a portion of the preselected sensitive data based on a group satisfying at least one condition, thereby detecting that a portion of the preselected sensitive data stored on the server has been stored on the personal computer device by a user of the personal computer device; and
means for sending a notification from the personal computer device to the server regarding the detection of the portion of the preselected sensitive data.

The apparatus of claim 24, wherein the content is searched periodically.

The apparatus of claim 24, wherein the content is searched when the personal computer device is disconnected from the network.

The apparatus of claim 24, wherein the means for sending the notification comprises:
means for creating, upon detecting the preselected sensitive data, a message that includes a notification of the detection of the preselected sensitive data;
means for placing the message in a transmission queue; and
means for transmitting the message to the system after the personal computer device is reconnected to the system.

The apparatus of claim 24, further comprising means for receiving instructions from the server that define a search range for the personal computer device.

The apparatus of claim 24, wherein the means for searching the contents of the plurality of data storage media of the personal computer device comprises means for monitoring one or more specific data operations for the presence of at least a portion of the preselected sensitive data.

The apparatus of claim 29, wherein at least one of the one or more specific data operations is selected from the group consisting of a file read, a file write, a file update, a read from a removable media device, a write to a removable media device, and access to data stored in any of the plurality of data storage media by a program running on the personal computer device.

The apparatus of claim 24, wherein the plurality of data storage media are selected from the group consisting of main memory, static memory, and mass storage memory.

The apparatus of claim 24, wherein the means for searching the contents of the plurality of data storage media comprises:
means for searching the contents of each volatile storage device in the plurality of data storage media; and
means for searching the contents of each permanent storage device in the plurality of data storage media.

The apparatus of claim 32, further comprising means for detecting use of the preselected data by an application running on the personal computing device.

The apparatus of claim 32, further comprising:
means for identifying the application using the preselected data; and
means for reporting the identified application.

A computer-readable storage medium storing instructions that, when executed on a processor, cause the processor to perform a method comprising:
receiving, by a personal computer device from a server, an abstract data structure derived from data elements of preselected sensitive data having a table structure, wherein the abstract data structure is an index of the preselected data having the table structure, each entry in the abstract data structure includes a pre-created hash or ciphertext of the contents of a cell in the table structure and the row number of the cell, and the abstract data structure does not contain the contents of the cells in the table structure of the preselected sensitive data;
searching, by the personal computer device, the contents of a plurality of data storage media of the personal computer device for indicia indicating that at least a portion of the preselected sensitive data stored on the server is included in the contents of the plurality of data storage media, wherein the searching comprises:
searching the abstract data structure to identify entries in the abstract data structure that include a pre-created hash or ciphertext of the preselected data matching a hash or ciphertext of a content fragment of content from the storage media, the content fragment including a word or phrase; and
determining groups of content fragments, each group containing content fragments associated with hashes or ciphertexts that match pre-created hashes or ciphertexts of entries having the same row number in the abstract data structure;
detecting, by the personal computer device, in the contents of at least one of the plurality of data storage media of the personal computer device, at least a portion of the preselected sensitive data based on a group satisfying at least one condition, thereby detecting that a portion of the preselected sensitive data stored on the server has been stored on the personal computer device by a user of the personal computer device; and
sending a notification from the personal computer device to the server regarding the detection of the portion of the preselected sensitive data.

The method of claim 1, wherein the at least one condition is a quantity threshold that is satisfied if any group includes a quantity of content fragments that reach or exceed the quantity threshold.

The computer-readable storage medium of claim 9, wherein the at least one condition is a quantity threshold that is satisfied if any group includes a quantity of content fragments that reaches or exceeds the quantity threshold.

The method of claim 10, wherein the at least one condition is a quantity threshold that is satisfied if any group includes a quantity of content fragments that reach or exceed the quantity threshold.

The apparatus of claim 24, wherein the at least one condition is a quantity threshold that is satisfied if any group includes a quantity of content fragments that reaches or exceeds the quantity threshold.