JP4667362B2

JP4667362B2 - Identifying similarity and revision history in large unstructured data sets

Info

Publication number: JP4667362B2
Application number: JP2006501066A
Authority: JP
Inventors: Dwayne A. Carson; Donato Buccella; Michael Smolsky
Original assignee: Verdasys Inc
Current assignee: Verdasys Inc
Priority date: 2003-01-23
Filing date: 2004-01-21
Publication date: 2011-04-13
Anticipated expiration: 2024-01-21
Also published as: JP2006516775A; CA2553654C; EP1590748A2; WO2004066086A2; CA2553654A1; EP1590748A4; WO2004066086A3

Description

This application is a continuation-in-part claiming priority to U.S. Patent Application No. 10/738,924, filed December 17, 2003, and U.S. Patent Application No. 10/738,919, filed December 17, 2003, both of which claim the benefit of U.S. Provisional Application No. 60/442,464, filed January 23, 2003, entitled "Method and System for Adaptive Identification and Protection of Proprietary Electronic Information". The entire teachings of the above applications are incorporated herein by reference.

Almost all organizations now store large amounts of their information, including sensitive information that may encompass intellectual property, as electronic files in a variety of formats. There are many reasons for this trend, including the low cost and wide availability of computers, the steadily decreasing cost of electronic and magnetic storage media, and the relative ease of maintaining backups as information archives.

One strong motivation for storing data electronically is the ease with which large collections of files can be efficiently queried for specific information. Several algorithmic techniques have been proposed to address this problem. One widely known technique is limited to textual content and is most commonly used in Web search engines. In this technique, a user types a word or set of words into a search engine, and the search engine then processes a pre-indexed image of a huge collection of data to fetch the documents that contain the words specified in the search criteria.

Refinements of this technique let the user enter information in a more user-friendly human-language form, rather than as a set of words such as "Boston AND sale". These so-called "natural language" interfaces permit a user to enter a query such as "Which merchants in the Boston area are currently advertising sales?". Other techniques, such as image pattern recognition and mathematical correlation, can be used to find information in collections of non-text data such as images (for example, to determine whether a person whose face was captured by a security camera is present in a database of known criminals).

As technology has advanced and hardware has become more available and affordable, computer users have gained the ability to keep multiple copies of the same document (and indeed prefer to do so). Such copies often differ only by small amounts of editing: text added, deleted, or rearranged; an image cropped; one document split into two; or several documents merged. A document may also be converted to a different format; for example, a text file with typesetting instructions can be converted into a printable form. Multiple copies of these identical or very similar documents may be kept on the same computer. They may, however, also be distributed among many computers connected to a local area network (LAN) or wide area network (WAN), and thus located in different departments or even in places physically thousands of miles apart.

However, the ease with which many copies of the same document can be created causes certain problems. These include the following:
- Data security: the more copies of a document exist, the harder it becomes to control access to its contents.
- Document classification: copies of similar documents may need to be processed in the same way without user intervention, and it is desirable to do this automatically.
- Genealogy: identifying the history of how a particular document evolved.
- Forensics: identifying who has tampered with a document.
- Regulatory compliance: certain laws and rules in the health care and financial industries now require that access to documents be controlled and/or that documents be automatically destroyed after a predetermined period of time.

Existing data retrieval algorithms are not sufficiently efficient, accurate, or scalable for calculating similarity between documents and reconstructing document distribution paths.

According to one aspect of the present invention, a method and system are provided for efficiently discovering similarities between data from a large collection of documents and a given piece of data (which may be new or may belong to that collection).

More particularly, the system can be implemented as a software program distributed across the computers in an organization. A client-side monitor process reports activity relating to the digital assets of computer users (for example, sensitive user documents being copied, modified, deleted, or transmitted). Using these activity reports, a data security application can maintain a document distribution path (DDP) as a directed graph that represents the historical relationships between documents. The DDP is built from the system's observation of the history of user activity.

In addition, the system maintains a greatly reduced ("lossy") hierarchical representation of user data files, indexed to enable fast queries for similar (but not necessarily identical) information. The system can thus respond to queries such as "find documents similar to a given document". This information is then used to further supplement the DDP graph in cases where an operation is not visible to the client monitor process.

Document similarity queries can be initiated manually by a user, or can be applied and/or implemented as part of a distributed data processing system service. A document similarity service, called the Similarity Detection Engine (SDE), can be used to provide an organization-wide security solution that finds existing files that "contain data similar to a new file" and automatically applies the appropriate controls to the new file. In a preferred embodiment, the SDE uses sparse representations of documents to speed up the similarity determination. The sparse representation preferably consists of a hierarchy of responsive Fourier coefficients determined from selected portions, or "chunks", of a file. An algorithm selects the Fourier coefficient components that best represent the document.
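The idea of a sparse, Fourier-based chunk representation can be illustrated with a minimal sketch. The text does not specify how coefficients are selected or how the hierarchy is built, so the rule used here (keep the largest-magnitude non-DC coefficients of a single chunk) and the names `dft` and `sparse_signature` are assumptions for illustration only, not the patented algorithm.

```python
import cmath


def dft(samples):
    """Naive discrete Fourier transform, O(n^2); adequate for a short illustration."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def sparse_signature(chunk: bytes, keep: int = 4) -> dict:
    """Map coefficient index -> value for the `keep` strongest components.

    Assumption: "best represents" is taken to mean "largest magnitude".
    Index 0 (the mean) is skipped, and only the first half of the spectrum
    is considered because the input bytes are real-valued.
    """
    coeffs = dft(list(chunk))
    candidates = range(1, len(chunk) // 2 + 1)
    strongest = sorted(candidates, key=lambda k: abs(coeffs[k]), reverse=True)[:keep]
    return {k: coeffs[k] for k in sorted(strongest)}


# Hypothetical chunk of file data.
chunk = b"confidential quarterly forecast"
print(sorted(sparse_signature(chunk, keep=4)))
```

Two chunks could then be compared through their few retained coefficients instead of their full contents, which is the kind of speed-up a sparse representation is meant to provide.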

The system is transparent to the end user and uses only a small fraction of the resources available on a modern computer workstation. The system may require a dedicated server or server cluster to support a large number of client workstations.

The system can thus be used to provide a data management application that is able to automatically maintain and/or reconstruct document distribution paths. Such a path can identify (1) the origin of a document, (2) its distribution path from the point of origin, and (3) the name of each user who altered the document and the time at which the alteration occurred.

Organizations can apply this capability of the invention to a number of end uses. For example, the invention can be used to monitor document flow and streamline business practices by identifying and resolving critical information-exchange bottlenecks that affect workflow.

This arrangement can also be implemented in information security applications, since it allows similar documents to be identified automatically in real time, even across an enterprise's very large collection of documents. Document similarity analysis can be used to determine document sensitivity, an essential data security function, in order to prevent improper access to or distribution of sensitive data without interfering with the exchange of non-sensitive documents.

The foregoing and other objects, features, and advantages of the invention will be apparent from the following more detailed description of preferred embodiments of the invention, as illustrated in the accompanying drawings. In the drawings, like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed on illustrating the principles of the invention.

<Overview of system environment>
FIG. 1 is a high-level conceptual diagram of a data similarity discovery system 100. Client computers 102 and server computers 104 (where used) continuously monitor user work and collect information about data files and other "digital assets", such as document files that contain valuable information. The monitored events consist only of detecting and recording information about documents that are modified (created, copied, moved, deleted, edited, or merged) by the computer's operating system (OS) and its users. This information is represented as a data structure called a document distribution path (DDP) 150, typically implemented as a directed graph. Vertices of the directed graph represent documents, and edges describe the historic relationships between documents. The DDP 150 is stored in a database together with other information about files and their chunks (contiguous groups of data).
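As a rough illustration of the DDP as a directed graph, the following sketch stores documents as vertices and labeled historic relationships as edges. The class name, the method names, and the relation labels ("copy", "edit") are hypothetical; the text does not prescribe a storage layout.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DocumentDistributionPath:
    """Hypothetical in-memory DDP: vertices are document IDs, edges are relationships."""

    vertices: set = field(default_factory=set)
    # edges[source] -> list of (relation, target) pairs
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def record(self, source: str, relation: str, target: str) -> None:
        """Add one historic relationship, e.g. record("A", "copy", "A'")."""
        self.vertices.update((source, target))
        self.edges[source].append((relation, target))

    def descendants(self, doc: str) -> set:
        """All documents reachable from `doc` by following edges forward."""
        seen, stack = set(), [doc]
        while stack:
            current = stack.pop()
            for _, target in self.edges.get(current, []):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen


ddp = DocumentDistributionPath()
ddp.record("A", "copy", "A'")
ddp.record("A'", "edit", "A''")
print(sorted(ddp.descendants("A")))
```

A forward traversal like `descendants` answers "where did this document go?"; a genealogy query ("where did this document come from?") would walk the same edges in reverse.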

In many cases, the OS and network protocol architecture prevent the system 100 from reconstructing the historic relationships between all documents. In particular, when a user receives a document as an e-mail attachment and saves it to disk, existing e-mail protocols do not support applications that trace the file back to its origin on another workstation on the organization's network. In such cases the system 100 can use the Similarity Detection Engine (SDE) 160 (described in detail below) to query the received document against the database of existing documents. The system can then use the query results to initially construct the DDP 150.

The SDE 160 maintains a database of "chunks" of the documents available on the system. It converts the data in these chunks into a highly compressed hierarchical representation 170, a form well suited to approximate similarity comparison between chunks. The SDE 160 also maintains information about the provenance of chunks in a document chunk database 175.

The system can be configured to run on a single standalone local machine 102, in which case the DDP 150, SDE 160, and hierarchical structure 170 all reside on that local machine 102. It should be understood, however, that the system can also be implemented as an enterprise-wide data management or security solution. In that case, client devices 102 and servers 104 are connected via local area network and/or inter-network connections 106. Such a system may also have connections to an outside network, such as the Internet 108, so that files may be created outside the enterprise and/or distributed outside the enterprise.

In a networked environment, the components such as the DDP 150, SDE 160, and hierarchical structure 170 are typically distributed among multiple clients 102 and servers 104 and/or server clusters. The SDE 160 can thus maintain a hierarchical database 170 representation of the documents on a local machine 102 and, through distribution, the same compressed representation on a server 104 and/or cluster of servers 104. In a cluster and/or distributed implementation, when a local SDE 160 cannot answer a query about a newly received document, it queries the server SDE 104. The local SDE 160 then updates the server SDE 104 whenever a user creates a new document or modifies an existing one. As soon as an update reaches the server SDE 104, it becomes available to queries from other local SDEs 160 running on other client workstations. When a client 102 is disconnected from the network 106 (for example, when a laptop user is traveling away from the office), communication requests are deferred and queued until the network connection is restored.
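The local-first lookup with server fallback and deferred communication can be sketched as follows. This is a simplified model under stated assumptions: the class `LocalSDE`, the exact-match "fingerprint" keys, and the shared dictionary standing in for the server SDE are all hypothetical, and a real engine would perform approximate similarity matching rather than dictionary lookups.

```python
from collections import deque


class LocalSDE:
    """Hypothetical client-side engine with server fallback and an offline queue."""

    def __init__(self, server_index: dict):
        self.local_index = {}             # fingerprint -> document ID known locally
        self.server_index = server_index  # stands in for the server SDE
        self.online = True
        self.pending = deque()            # requests deferred while disconnected

    def query(self, fingerprint: str):
        """Answer locally when possible, otherwise ask the server."""
        if fingerprint in self.local_index:
            return self.local_index[fingerprint]
        if not self.online:
            self.pending.append(("query", fingerprint))
            return None
        return self.server_index.get(fingerprint)

    def update(self, fingerprint: str, doc_id: str) -> None:
        """Record a new or modified document and propagate it to the server."""
        self.local_index[fingerprint] = doc_id
        if self.online:
            self.server_index[fingerprint] = doc_id
        else:
            self.pending.append(("update", fingerprint, doc_id))

    def reconnect(self) -> None:
        """Flush deferred updates once the network connection is restored.

        Deferred queries are simply discarded in this sketch; a real client
        would presumably re-issue them.
        """
        self.online = True
        while self.pending:
            request = self.pending.popleft()
            if request[0] == "update":
                _, fingerprint, doc_id = request
                self.server_index[fingerprint] = doc_id


server = {"f1": "doc-1"}
sde = LocalSDE(server)
sde.online = False           # laptop leaves the office
sde.update("f2", "doc-2")    # queued, not yet visible to other clients
sde.reconnect()              # queue is flushed to the server
print(server)
```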

The DDP 150 and SDE 160 can be used in many different applications 120. In one such application, a data security application uses them to establish a perimeter of accountability for document usage at the point of use (the time and place of use). This accountability model can not only track authorized users' access to documents but, more importantly, can monitor attempts to access or move copies of sensitive documents to peripherals or over network connections. The SDE-dependent security application 120 can thus be used to control or thwart attempts to distribute or record sensitive intellectual property or other information, and other possible abuses of authority.

A system component called the transparent system event monitor 180 acts as an agent of the application 120. The monitor 180 is interposed between the operating system (OS) running on the client 102 and the end-user applications 190. The monitor process 180 has sensors, or shims, that detect read or write operations to the file system 192, network interfaces 194, ports 196, and/or the system clipboard 198. The sensors of the monitor process 180 may be used to detect potentially abusive events that can occur whenever a user accesses devices that are neither visible to nor controllable by a local file server. These events include writing documents to uncontrolled media such as compact disc read-write (CD-RW) drives, personal digital assistants (PDAs), Universal Serial Bus (USB) storage devices, wireless devices, and digital video recorders, as well as printing documents. Other suspicious events can be detected by the network sensors 194, such as running external peer-to-peer (P2P) applications, sending documents via external e-mail applications, running instant messaging (IM) applications, and uploading documents to web sites via the Internet 108.

The data typically collected with an event depends on the event type and on the type of information to be maintained in the DDP 150. Such information can include:
- For file operations: source/destination file name, operation type (open, write, delete, rename, move to recycle bin), device type, and first and last access times
- For invoked applications: the identity of the invoking process, executable name, start time, end time, and process owner
- For user operations such as logon or logoff: the time and user identification (ID)
- For network operations: source/destination address, port and host names, start/end time stamps, bytes sent and received, and inbound and outbound data transmission times
- For clipboard operations: destination process ID, event start time, and full path of the file name involved
- For other high-level operations, such as access to removable storage media: file name, device ID, time of day, bytes transferred, and the like
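The per-event fields listed above map naturally onto small record types. The sketch below models two of them; the class names, field names, and sample values are illustrative assumptions, not a schema taken from the text.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FileEvent:
    """Fields the text lists for a file operation; names are illustrative."""

    operation: str                 # "open", "write", "delete", "rename", "move_to_recycle_bin"
    source_name: str
    device_type: str
    first_access: datetime
    last_access: datetime
    destination_name: Optional[str] = None  # set only when the operation has a target


@dataclass(frozen=True)
class ClipboardEvent:
    """Fields the text lists for a clipboard operation; names are illustrative."""

    destination_process_id: int
    started_at: datetime
    file_path: str                 # full path of the file involved


# Hypothetical rename of document A to A''.
event = FileEvent(
    operation="rename",
    source_name="A",
    device_type="local_disk",
    first_access=datetime(2004, 1, 21, 9, 0),
    last_access=datetime(2004, 1, 21, 9, 5),
    destination_name="A''",
)
print(event.operation, event.source_name, "->", event.destination_name)
```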

When the similarity discovery system is part of a security system, the monitor process 180 can also be used to receive and enforce access policies defined by the security application 120, such as restricting access to local documents, forbidding writes to removable media, or limiting network traffic.

The event monitor 180 process may also include heuristics that limit the processing done by the application 120, DDP 150, and/or SDE 160. A typical heuristic may include an approved-file filter that automatically filters out the many insignificant events generated by standard calls to system files. For example, it is quite common for operating system files such as executables and dynamic libraries, font files, and the like to be opened and accessed repeatedly by the same application.
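An approved-file filter of this kind might look like the following sketch. Filtering by file extension is an assumption made for brevity; the text only says that events from standard calls to system files are filtered, and the extension set and function names here are hypothetical.

```python
from pathlib import PureWindowsPath

# Hypothetical allow-list of file types whose accesses are treated as noise.
APPROVED_SUFFIXES = {".dll", ".exe", ".so", ".ttf"}


def is_approved(path: str) -> bool:
    """True if the path looks like an executable, library, or font file."""
    # PureWindowsPath accepts both "\\" and "/" separators.
    return PureWindowsPath(path).suffix.lower() in APPROVED_SUFFIXES


def filter_events(events):
    """Drop events on approved files before they reach the DDP or SDE."""
    return [event for event in events if not is_approved(event["path"])]


events = [
    {"path": "C:\\Windows\\Fonts\\arial.ttf", "op": "open"},
    {"path": "C:\\Windows\\System32\\kernel32.dll", "op": "open"},
    {"path": "C:\\Users\\alice\\plan.doc", "op": "write"},
]
print([event["path"] for event in filter_events(events)])
```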

Further details of the event monitor 180 and its interaction with the security application 120 are contained in co-pending related U.S. Patent Application No. 10/706,871 by Verdasys, Inc., filed November 12, 2003, entitled "Managed Distribution of Digital Assets", which is incorporated herein by reference in its entirety. It should be understood, however, that other types of SDE-dependent applications can also make use of the present invention.

<Generation of document distribution path (DDP) 150 representing document genealogy>
As part of the data security application 120, the system typically generates a document distribution path (DDP) 150, a representation of the historic events concerning the flow of documents within the system. The DDP may typically be a directed graph whose nodes, or vertices, are document identifiers and whose edges describe the historic relationships between documents. Maintaining such a graph allows security policies to be applied in real time as documents are created, modified, and/or accessed.

Further, the similarity between new versions of documents and the source documents from which they originated can often be established by monitoring the operation of the computer system (for example, whenever a document is renamed, copied, or merged). In other cases (for example, when a document is received from the network 108), this similarity can be established only by determining whether the document is similar to an existing document in the database. This is another example of a situation in which the SDE 160 becomes an important part of the security application 120.

FIG. 2 illustrates one example scenario of the path of documents through a computer system, and how a representative DDP 150 might be constructed. At an initial time t_0, the system has no information about the origins of three documents in the database (labeled documents A, B, and C in FIG. 2). The security application can, however, use the SDE 160 to compare documents A, B, and C and establish an initial conclusion that documents A and C are similar. This result is stored as an entry 301 in a set of relational data entries in the DDP 150, as shown in FIG. 3.

Further, if document A carries a high-security setting but document C has not been so identified, the security application 120 now applies the same security setting to document C, because the SDE 160 has determined that the documents are similar. The general algorithm applied by the security application 120 when it encounters a new document is thus to use the SDE 160 to search for similar documents; if a similar document is found, the same security setting can be assumed for the new document.
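The general rule for new documents can be sketched as a small function. The label names, the ordering in `LEVELS`, and the choice to take the most restrictive label when several similar documents are found are assumptions added for illustration; the text only states that the same setting can be assumed.

```python
# Hypothetical security levels, ordered from least to most restrictive.
LEVELS = {"unclassified": 0, "internal": 1, "high": 2}


def classify_new_document(doc_id: str, similar_docs: list, labels: dict) -> str:
    """Give `doc_id` the strictest label found among its similar documents.

    `similar_docs` stands in for the result of an SDE similarity query.
    """
    inherited = [labels[d] for d in similar_docs if d in labels]
    labels[doc_id] = max(inherited, key=LEVELS.__getitem__, default="unclassified")
    return labels[doc_id]


labels = {"A": "high", "B": "internal"}
# The SDE has reported that document C is similar to document A.
print(classify_new_document("C", ["A"], labels))
```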

At time t_2, the event monitor 180 (FIG. 1) detects a copy event 202 and reports that document A has been copied and saved as document A'. This is recorded in the DDP 150 as a further entry 302 (see FIG. 3). Because this is a simple copy operation, the documents are presumed to be similar, and the SDE 160 need not be used to complete the relationship between the two documents.

At time t_3, a file merge event 203 is seen, in which document B and document C are merged into a new document BC. Because document C carries a high-security label, one result might be that this label is automatically applied to the merged document BC.

At time t_4, the event monitor 180 reports a rename 204 of document A to document A''. This event is stored in the DDP 150 as an entry 304 (see FIG. 3).

Next, at t_5, two events occur. This is an example of a situation that would be difficult for forensics to decipher without both the event monitor 180 and the other parts of the SDE 160. Event 205-1 reports that the sensitive document A'' has been loaded into an editing program (such as Microsoft Word). Event 205-3 reports that document D has been received from the Internet and has also been opened in the editor. The SDE 160, however, does not presently know the origin of document D (in this example, document D is in fact a personal birthday party invitation that the user is working on, and to make an accurate decision the system must not classify it as a sensitive document). At time t_6, a cut-and-paste clipboard operation is seen as event 206. A problem remains, however, because cut-and-paste operations inside Microsoft Word are "out of scope" for the security application 120. It is therefore difficult to trace the genealogy of a document simply by tracking file names and save operations, and the extent of the detected operation is not known to the security application 120.

At t_7, the event monitor 180 sees a save operation to document E, and at time t_8 the event monitor 180 reports event 208, in which document E is sent over the Internet. Did the user save information from the sensitive document A'' as document E and send it, compromising security? Or did the user simply create document E as a birthday invitation from document D?

Here, the accuracy of the security classification can be greatly improved by asking the SDE 160 to compare document A'' with document E, and document D with document E. If the SDE reports that document E is very similar to document D, this is a low-security event, no violation has occurred, and the Internet transfer is allowed to proceed (and/or is not reported). If, however, document E is similar to document A'', a violation is considered to have occurred, and the security application takes the appropriate action as stipulated by the enterprise security policy. Misclassifying low-risk events as high-risk events is generally unacceptable, because such mistakes lead to many false alarms and greatly increase the cost of operating the security system.
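The decision described above can be expressed as a short sketch. The function name, the 0.5 threshold, and the numeric scores are hypothetical; the text does not define how similar two documents must be before a transfer counts as a violation.

```python
def classify_outbound(similarity: dict, sensitive: set, threshold: float = 0.5) -> str:
    """Return "violation" if the closest, sufficiently similar source is sensitive.

    `similarity` maps candidate source documents to SDE similarity scores in [0, 1].
    """
    if not similarity:
        return "allow"
    closest = max(similarity, key=similarity.get)
    if similarity[closest] >= threshold and closest in sensitive:
        return "violation"
    return "allow"


# Document E compared against the two documents that were open in the editor.
print(classify_outbound({"A''": 0.60, "D": 0.32}, sensitive={"A''"}))  # prints "violation"
print(classify_outbound({"A''": 0.05, "D": 0.90}, sensitive={"A''"}))  # prints "allow"
```

Raising the threshold trades missed leaks for fewer false alarms, which matters given the operating cost the text attributes to misclassified low-risk events.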

Appropriate entries 306, 307 and 308 (see FIG. 3) recording these events are entered into the DDP 150. They record the history of where the new documents D and E originated and the fact that document E was sent out.

At time t_8, a save event 209 is detected from some application. In this event 209, different data is saved to a new file that has the same name as an old file, document C′. Here again, rather than simply assuming that files with the same file name belong to the same security classification, the SDE 160 engine can be used to classify document C′ by comparing its contents against the database.

At time t_9, a forensic investigation is requested because the company's security department has received a report of a leak of proprietary information. Such an investigation is greatly simplified, and can be carried out more accurately, if the information in the DDP 150 is available to the investigators. Thus, even if the system is not configured to block the distribution of confidential information outside the enterprise, once appropriate logging and reporting are in place, a later investigation can uncover such leaks and legal action can be taken against the violators.

The SDE 160 can also report the degree of similarity (an actual number) as the result of comparing two files. This number can then be used and/or stored in the DDP. Thus, if the SDE 160 reports, for example, that the new document E is 60% similar to document A″ and 32% similar to document D, this information can also be important in forensically inferring how the document was created.
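The way such reported similarity numbers could feed a policy decision can be sketched as follows. This is only an illustration: the function name `classify_transfer`, the 0.5 threshold and the two decision labels are assumptions made for the example and are not taken from the described system.

```python
def classify_transfer(similarities, confidential_docs, threshold=0.5):
    """Decide how to treat an outgoing document.

    similarities: maps a known document name to the reported similarity (0..1).
    confidential_docs: set of names classified as confidential.
    threshold: hypothetical policy cut-off, not specified in the source text.
    """
    # Keep only confidential documents that the outgoing file resembles closely.
    flagged = {doc: s for doc, s in similarities.items()
               if doc in confidential_docs and s >= threshold}
    if flagged:
        # Report the confidential document with the highest similarity.
        return "violation", max(flagged, key=flagged.get)
    return "allow", None


# Document E from the example: 60% similar to A'', 32% similar to D.
decision = classify_transfer({"A''": 0.60, "D": 0.32}, {"A''"})
```

A real deployment would take the threshold and the resulting action from the enterprise security policy rather than from constants in code.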

The degree of document-to-document similarity is preferably calculated from the number of similar chunks relative to the total number of "chunks" in the two documents (one such algorithm is described in detail below). If one of the files is unavailable, and its similarity must be calculated from known similarities to other files, a formula common in probability theory may be used as an estimate. For example, if the similarity of the unavailable document A to document B is known to be S_AB, and the similarity of document B to document C is known to be S_BC, then the similarity between document A and document C can be estimated from S_AB and S_BC. With such a formula, the estimate of the similarity between the unavailable file A and the queried file C can become considerably more accurate as the number of files with known similarities increases.
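The chunk-counting measure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `split_chunks` and `similarity_degree`, the default exact-equality test for "similar" chunks, and the choice to count matching chunks on both sides are all hypothetical. The SDE 160 itself uses the approximate, Fourier-based chunk matching described later.

```python
def split_chunks(data, size=1024):
    """Split a byte string into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def similarity_degree(doc_a, doc_b, size=1024, is_similar=None):
    """Fraction of chunks, over both documents, that have a similar partner."""
    if is_similar is None:
        # Placeholder predicate: exact equality stands in for approximate matching.
        is_similar = lambda x, y: x == y
    chunks_a = split_chunks(doc_a, size)
    chunks_b = split_chunks(doc_b, size)
    total = len(chunks_a) + len(chunks_b)
    if total == 0:
        return 0.0
    matched_a = sum(any(is_similar(a, b) for b in chunks_b) for a in chunks_a)
    matched_b = sum(any(is_similar(b, a) for a in chunks_a) for b in chunks_b)
    return (matched_a + matched_b) / total
```

The `is_similar` argument is the place where a more tolerant chunk comparison would be plugged in.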

<Compressed internal representation of data used by the SDE 160>
Next, how the SDE 160 determines whether two documents are similar is described. A practical implementation of the SDE 160 must satisfy several requirements. Typically, for the purposes described above, even rather dissimilar information should be considered similar (for example, a heavily modified document should still be considered similar to the original). Because the amount of information handled by a typical user can now be very large, and because large amounts of data can often be transferred between systems very quickly, the SDE 160 must be computationally very efficient and accurate. The amounts of memory and disk space required by the SDE 160 must be very limited in order to satisfy the requirement that it be transparent to the user.

One general observation underlying an efficient implementation of the SDE 160 is that, for two chunks of the same size taken from two binary data streams, the chunks are usually considered similar if a long sequence of bytes in one chunk approximately matches (not necessarily exactly) a long sequence of bytes in the other. Mathematically, a quantity measuring such similarity may be the "covariance" between the two chunks (the pairs of bytes taken from each chunk for the covariance calculation are treated as two-dimensional data points). In the implementation of the SDE 160 described here, the preferred chunk size is an adjustable parameter, with 1 kilobyte (KB) being a typical value. This value is a system parameter and can be made larger or smaller depending on the trade-off between the desired speed and accuracy of the SDE 160, the amount of information that must be retained, the size of a typical document, and so on.
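The covariance between two equally sized chunks, with byte pairs treated as two-dimensional data points, can be computed as in the sketch below. The function name `chunk_covariance` and the use of the population (divide-by-n) covariance are assumptions made for the example.

```python
def chunk_covariance(chunk_a, chunk_b):
    """Population covariance of two equal-length byte strings.

    Each position contributes one 2-D data point (byte from a, byte from b).
    """
    if len(chunk_a) != len(chunk_b) or len(chunk_a) == 0:
        raise ValueError("chunks must be non-empty and of equal size")
    n = len(chunk_a)
    mean_a = sum(chunk_a) / n
    mean_b = sum(chunk_b) / n
    return sum((a - mean_a) * (b - mean_b)
               for a, b in zip(chunk_a, chunk_b)) / n
```

A large positive value indicates that the bytes of the two chunks rise and fall together at the same positions. The measure is sensitive to any offset between the chunks, which is the limitation that the Fourier-based comparison is meant to address.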

A typical operational scenario thus involves a data stream containing two or more chunks and, separately, a (possibly large) set of chunks against which this data stream is compared. The goal is to find out whether the data set contains chunks similar to chunks from the stream. Traditional algorithms such as "substring search" or "number of edits" are impractical because they query every chunk of the stream, starting at every character position, against the data set of chunks. If the traditional algorithms are modified to query only non-overlapping chunks from a given stream, they will find hardly any pairs of similar chunks, because the positional shift, or "phase", of the split cannot be guessed correctly when the data stream is divided.

In the preferred embodiment, the SDE 160 instead compares the absolute values of the Fourier coefficients of the chunks, which reveals similarities between chunks even when they are out of phase with each other by a considerable amount. Using the hierarchical chunk representation described below, the SDE 160 needs only about 10% of the full set of Fourier coefficients to identify matches accurately, and it can keep them in a low-precision form (one byte, or even half a byte, each).

Thus, the compressed internal representation of the data that is used for efficient data comparison is a subset of the absolute values of the Fourier coefficients of short chunks of the data, kept in low-precision form.
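A sketch of such a compressed chunk signature is shown below. It is illustrative only: a naive O(n²) DFT from the standard library replaces a real FFT, and the names `dft_magnitudes`, `quantize` and `signature`, the removal of the mean, the choice of the lowest non-zero frequencies and the scaling to one byte are assumptions rather than details confirmed by the text.

```python
import cmath


def dft_magnitudes(chunk, keep_fraction=0.1):
    """Absolute values of the lowest-frequency DFT coefficients of a chunk."""
    n = len(chunk)
    mean = sum(chunk) / n
    centered = [b - mean for b in chunk]          # remove the mean (k = 0 term)
    keep = max(1, int(n * keep_fraction))         # about 10% of the coefficients
    mags = []
    for k in range(1, keep + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * q / n)
                    for q, v in enumerate(centered))
        mags.append(abs(coeff))
    return mags


def quantize(mags):
    """Scale magnitudes to the range 0..255 and store one byte per coefficient."""
    peak = max(mags) or 1.0
    return bytes(round(255 * m / peak) for m in mags)


def signature(chunk, keep_fraction=0.1):
    """Low-precision signature: quantized low-frequency magnitudes."""
    return quantize(dft_magnitudes(chunk, keep_fraction))
```

For a circular shift of the chunk the magnitudes are unchanged up to rounding error; for the non-circular shifts that occur when a real stream is split at a different offset, the invariance is only approximate.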

<Clustering and indexing algorithms>
Chunk comparison based on the Fourier transform is thus the core technique of the SDE 160 (see FIG. 1). Existing source document files (for example, documents A, A′, A″, B, C and so on described above) are divided into small chunks (about 1 KB each), and some of their Fourier coefficients are stored in the chunk database 175. When a new data stream is received, the SDE 160 breaks the stream into a set of chunks and compares them against the database 175. The SDE 160 returns the result of the comparison as a degree of similarity between the new data stream and the existing documents whose chunks make up the database.

FIG. 4 is a representative high-level flowchart of the SDE 160 process. A first step 400 receives a stream of data, and step 410 then determines the chunks of the stream. In step 420, the Fourier coefficients of each chunk are calculated; only some of them are retained and the rest are discarded (as detailed below). A series of steps 430 is then carried out in an orderly manner to compare the Fourier coefficients of each chunk with the Fourier coefficients of the chunks of the files in the database. Finally, the degree of similarity is determined in step 440.

The number of chunks obtained by splitting a typical file system is very large, so an efficient means of querying the database of their Fourier coefficients, and a way of keeping the data in a compressed format, are needed. In particular, a simple SQL-based query cannot locate similar data chunks, because it treats a large difference in just a few Fourier coefficients as a mismatch even when that difference is outweighed by a good match in the other coefficients. The SDE 160, however, uses a so-called nearest neighbor search and does not regard a mismatch in a small number of Fourier coefficients as a significant difference.

Thus, in step 420, an efficient representation of the set of vectors made up of the chunks' coefficients is a tree-like structure of large clusters of coefficients, which are split into successively smaller clusters until the cluster size is small enough to represent a group of sufficiently similar chunks. This clustering algorithm implements the notion of a hash function for sets of Fourier coefficients and plays a role somewhat analogous to that of a database index.

In further detail of step 420, the SDE 160 first searches the highest-level clusters to find the cluster that contains the queried chunk. This process continues until a matching chunk (or set of chunks) is reached at the bottom of the cluster hierarchy, or until it is determined that no similar chunk exists. In this way, the SDE 160 can map similar documents to the same set of clusters, and a high level of data compression is achieved by storing only the coordinates of the clusters themselves rather than the coordinates of every chunk that fits into a cluster.

As shown below, it is not critical to the overall performance of the SDE 160 that a single chunk lookup query be guaranteed to find similar chunks whenever they exist. In contrast to deterministic database architectures, in which retrieval of a matching record is guaranteed, an SDE 160 query finds the correct match in many cases but not in all; in the remaining cases it returns a formally incorrect mismatch or "not found" response. Under such relaxed requirements, queries can be heavily optimized for speed.

Because the clusters in the hierarchy overlap to a considerable degree, descending every branch of the tree in which a similar cluster might be found would send the query down most of the branches, negating the benefit of having a hierarchy (as opposed to a simple set of clusters). The query therefore uses probabilistic estimates to determine which clusters are most likely to host a given chunk, and proceeds to explore only those branches of the hierarchy that pass through these clusters. This multi-branch probabilistic search provides a configurable balance between the required accuracy and performance, which is essential for determining document similarity in real time.
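One way to picture a multi-branch search of this kind is a beam-style descent that follows only the few best-matching clusters at each level. The sketch below is an assumption-laden illustration, not the described implementation: the `Cluster` class, the `beam` and `min_corr` parameters, and the use of a plain correlation score in place of the probabilistic estimate are all hypothetical.

```python
import math


def correlation(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return sum(a * b for a, b in zip(du, dv)) / den if den else 0.0


class Cluster:
    """A node of the hierarchy: a center plus child clusters or chunk ids."""

    def __init__(self, center, children=(), chunk_ids=()):
        self.center = center
        self.children = list(children)
        self.chunk_ids = list(chunk_ids)


def search(roots, query, beam=2, min_corr=0.5):
    """Descend only through the `beam` best-correlated clusters per level."""
    frontier, hits = list(roots), []
    while frontier:
        scored = sorted(((correlation(c.center, query), c) for c in frontier),
                        key=lambda pair: pair[0], reverse=True)
        survivors = [c for score, c in scored[:beam] if score >= min_corr]
        frontier = []
        for cluster in survivors:
            if cluster.children:
                frontier.extend(cluster.children)   # keep descending
            else:
                hits.extend(cluster.chunk_ids)      # bottom level reached
    return hits
```

Raising `beam` explores more branches and trades speed for a lower chance of missing a similar chunk, which mirrors the configurable balance described above.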

The accuracy of the query in step 440 is greatly improved if the SDE 160 launches two additional, similar queries alongside the original one. In these queries, only the data from either the first half or the second half of the original chunk is used for the Fourier transform, and the data from the unused half is set to zero. If a chunk similar to the queried chunk exists in the system, it will contain (rather than merely overlap) one of the queried half-chunks, and their similarity will be considerably higher. Of the three queries, the one that retrieves the set of most similar chunks will produce the most reliable result.
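The three-query scheme can be sketched as follows. The helper names are hypothetical, and `run_query` is an assumed callback that takes a chunk and returns a `(score, matches)` pair; the text does not specify how the results of the three queries are scored.

```python
def half_chunk_variants(chunk):
    """Return the original chunk plus two copies with one half zeroed out."""
    half = len(chunk) // 2
    first_only = chunk[:half] + bytes(len(chunk) - half)   # second half zeroed
    second_only = bytes(half) + chunk[half:]               # first half zeroed
    return [chunk, first_only, second_only]


def best_query(chunk, run_query):
    """Run all three queries and keep the result with the highest score."""
    results = [run_query(variant) for variant in half_chunk_variants(chunk)]
    return max(results, key=lambda result: result[0])
```

Zero-padding keeps every variant at the original chunk length, so the same transform size and the same database of coefficients can be used for all three queries.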

Because many chunks from the file system can, and usually do, belong to a single cluster, a single chunk query cannot determine which document contains a chunk similar to the given chunk. A query interpretation procedure 440 therefore combines the results of the multiple queries 430 that the SDE 160 executes for several consecutive chunks of a given file or stream, and outputs the names (or identifiers) of the few files most similar to the given file. In addition, the SDE 160 outputs a probabilistic measure for the result to back up the accuracy of the query result. This measure is used as an estimate of similarity within the document distribution path, or as a certainty factor in an information security system.
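Combining per-chunk query results into a ranked list of files can be illustrated with a simple voting scheme. The function name `rank_files` and the score (the fraction of queried chunks that hit a file) are assumptions; the score is only a crude stand-in for the probabilistic measure mentioned above.

```python
from collections import Counter


def rank_files(per_chunk_hits, top=3):
    """Rank files by the fraction of queried chunks that matched them.

    per_chunk_hits: one list of candidate file identifiers per queried chunk.
    """
    n = len(per_chunk_hits)
    if n == 0:
        return []
    votes = Counter()
    for hits in per_chunk_hits:
        for file_id in set(hits):        # one vote per file per chunk
            votes[file_id] += 1
    return [(file_id, count / n) for file_id, count in votes.most_common(top)]
```

A file that appears in the results of many consecutive chunk queries rises to the top, even though any single query may return candidates from several unrelated files.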

<Data extraction from files with heterogeneous content>
Some common types of files (for example, office documents) hold information of different natures separately in different streams. There are several ways to separate this information on a per-stream basis, and these can be used to speed up chunk database lookups. For example, text information need not be compared against a database of images, and a given implementation may decide not to treat certain kinds of information (for example, downloaded web pages) as confidential.

<Mathematical aspects of the design of the preferred embodiment>
The purpose of designing the comparison process that uses a sparse representation of Fourier coefficients was to obtain an algorithm capable of comparing data from a stream against a predefined database containing all the chunks from all the documents available to the SDE 160. Consider two n-dimensional data vectors x and y (not necessarily of the same length). The convolution of these vectors, conv(x, y), is defined as a function of an index q.

If the convolution, as a function of the index q, takes a value at some q that is large compared with its average, then two chunks of these vectors are probably similar to each other. If the convolution exhibits several peaks, the vectors x and y contain many matching chunks, and the matching pairs of chunks lie at different offsets from the start of the vectors to which they belong.

FIG. 5 is an example of a convolution result. The following MATLAB script was used to generate the signal shown.
clear
n=1000;
% two random column vectors
a1=rand(n,1);a2=rand(size(a1));
lpart=n/4;n1part=1;n2part=n1part+lpart-1;
j1part=n1part:n2part;j2part=n/2+(n1part:n2part);
% copy two segments of a1 into a2, the second with a 100-sample offset
a2(j1part)=a1(j1part);a2(j2part)=a1(j2part+100);
% remove the means, then convolve with one vector reversed
a1=a1-mean(a1);a2=a2-mean(a2);
c=conv(a1,flipud(a2));plot(c)

The function conv(x, y), or more precisely the height of its peaks, is a good indicator of the similarity between the vectors x and y. The following property of this function can be used in constructing the algorithm. Consider the Fourier spectra of the vectors x, y and conv(x, y). According to the convolution theorem, F(conv(x, y)) = F(x) · F(y), where F denotes the application of Fourier decomposition to a vector. This can easily be verified by multiplying both sides of the definition of the convolution by exp(ikq), summing over q, and interchanging the sums on the right-hand side. Fourier coefficients are in general complex numbers. Taking absolute values and then averaging both sides of the equation gives <|F(conv(x, y))|> = <|F(x)| · |F(y)|>.

Here |·| denotes the absolute value of a complex number and <·> denotes averaging after the mean has been removed. If the vectors x and y match with no phase shift between them, the average on the right-hand side will be larger than the average obtained for arbitrary vectors of the same amplitude and length. Moreover, even if there is a phase shift between x and y, it is reflected in the phases of their Fourier coefficients (as opposed to their magnitudes), and the effect of the phase shift is removed by taking absolute values.
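For reference, the verification step mentioned above can be written out under standard definitions. The displayed formulas are not reproduced in this text, so the circular form of the convolution, the sign and index conventions, and the symbol c for conv(x, y) are assumptions chosen for this sketch:

```latex
\begin{align*}
c_q &= \sum_{p=0}^{n-1} x_p \, y_{(q-p) \bmod n},
\qquad
F(v)_k = \sum_{q=0}^{n-1} v_q \, e^{-2\pi i k q / n} \\
F(c)_k &= \sum_{q=0}^{n-1} \sum_{p=0}^{n-1} x_p \, y_{(q-p) \bmod n} \, e^{-2\pi i k q / n} \\
       &= \sum_{p=0}^{n-1} x_p \, e^{-2\pi i k p / n}
          \sum_{q=0}^{n-1} y_{(q-p) \bmod n} \, e^{-2\pi i k (q-p) / n}
        = F(x)_k \, F(y)_k \\
\lvert F(c)_k \rvert &= \lvert F(x)_k \rvert \, \lvert F(y)_k \rvert
\end{align*}
```

A circular shift of y by s positions multiplies each F(y)_k by a phase factor of unit modulus, so the last line is unaffected by the shift. For real data, reversing one of the vectors (as the script for FIG. 5 does with `flipud`) replaces its coefficients by their complex conjugates, which likewise leaves the magnitudes unchanged.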

This formula provides one of the mathematical foundations of the comparison algorithm. For a description of several similar algorithms, see, for example, M. J. Atallah, F. Chyzak and P. Dumas, "A Randomized Algorithm for Approximate String Matching", http://algo.inria.fr/dumas/AtChDu99/.

On its right-hand side, this formula closely resembles the formula for the correlation between the absolute values of the Fourier coefficients of the two vectors. The problem of comparing two streams therefore reduces to the problem of calculating the correlation coefficient between their Fourier coefficients. To estimate the correlation coefficient with sufficient accuracy for our needs, it is not necessary to retain all of the Fourier coefficients of the data stored in a document chunk. Our experiments show that only about 10% of all the Fourier coefficients are actually needed. Different approaches to selecting the indices of the coefficients to retain were tried, and the approach that retains the lower-frequency coefficients gave the best results.

One advantage of this approach over finding the peaks of conv(x, y) stems from the fact that the Fourier coefficients of a vector of suitable length (a product of small primes, preferably an integer power of 2) can be calculated in time that is almost linear in the length of the vector. The well-known Fast Fourier Transform algorithm runs in O(n log n) time for a vector of length n. Applying this algorithm makes calculating the average of the Fourier coefficients of the convolution of two vectors much faster than directly calculating the convolution itself, which takes time proportional to the square of the vector size.

The problem of calculating the correlation between the coefficients of the vectors is now examined in more detail. Consider two arbitrary vectors whose components are normally distributed (Gaussian), and examine the distribution function of their correlation coefficient r. It is a known fact from statistics that, if the vectors have a sufficiently large length k (for the purposes of this description, k > 10 can be regarded as sufficiently large), the distribution of a quantity y derived from r is approximately normal with variance D = 1/(k - 3).

See G. A. Korn and T. M. Korn, "Mathematical Handbook for Scientists and Engineers", McGraw-Hill, 1968.

This statement means that, under the above conditions, the correlation coefficient measured for two vectors differs from its theoretical value, and that the difference decreases almost exponentially with the length of the vectors.

The above statement does not apply directly to the correlation coefficient of the Fourier components of the data stored in document chunks, because it is not clear how the Fourier coefficients of the data stored in a given chunk are distributed (normally or in some other way). In fact, we have found that in many real-world cases the distribution function of the Fourier coefficients of the data stored in document chunks is not normal. We have also found that applying simple techniques (such as discarding outliers) is sufficient to bring the distribution of the Fourier coefficients close to normal.

The question "Are two chunks of documents similar?" is now placed within a standard statistical framework. We set out to test the statistical hypothesis that "the two document chunks are unrelated". Under the assumption that the absolute values of the Fourier coefficients of the data stored in document chunks are normally distributed, this hypothesis reduces to "the quantity y introduced above belongs to a normal distribution with zero mean and variance 1/(k - 3)" (where k is the number of Fourier coefficients used). This test is one of the most common and best studied in statistics. This reformulation of the problem lets us use the two qualitative statements "the file chunks are similar" and "the Fourier coefficients of the data stored in the file chunks are well correlated" interchangeably.
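The hypothesis test can be sketched as follows. The text does not reproduce the definition of y; this sketch assumes it is the Fisher transformation of the correlation coefficient, y = atanh(r), which is the standard quantity with approximate variance 1/(k - 3). The function names, the one-sided form of the test and the critical value of 3 standard deviations are likewise assumptions.

```python
import math


def fisher_z(r):
    """Fisher transformation of a correlation coefficient (assumed form of y)."""
    r = max(-0.999999, min(0.999999, r))   # keep atanh finite at r = +/-1
    return math.atanh(r)


def chunks_related(r, k, z_crit=3.0):
    """Reject the hypothesis 'the two chunks are unrelated' when y is too large.

    r: measured correlation between the retained Fourier magnitudes.
    k: number of Fourier coefficients used (must exceed 3).
    z_crit: hypothetical critical value, in standard deviations.
    """
    if k <= 3:
        raise ValueError("need more than 3 coefficients")
    sigma = math.sqrt(1.0 / (k - 3))       # standard deviation under the hypothesis
    return fisher_z(r) / sigma > z_crit
```

With k = 103 coefficients the standard deviation under the hypothesis is 0.1, so a measured correlation of 0.5 (y of about 0.55) is strong evidence that the chunks are related, while a correlation of 0.1 is not.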

Our conclusion is that, to test a document chunk against a database of available chunks, one could choose to test the hypothesis "the two chunks are unrelated" for the given chunk and every chunk in the database. However, because the portion of the file system exposed to the SDE 160, and hence the chunk database 175 (see FIG. 1), may be very large, such testing is prohibitively expensive. A technique more efficient than the "test every chunk", or exhaustive search, approach therefore has to be devised for our problem. In an attempt to address this, we designed a tree-like "world inside the world" structure of document chunks (see FIG. 6). The crucial point is that the correlation relation is approximately transitive: if a correlates with b and b correlates with c, then a correlates with c. In other words, if the center of a small cluster of vectors does not show a strong correlation with a given vector, then that vector is unlikely to be strongly correlated with any vector in the cluster.

Consider a space whose elements are the Fourier coefficients of the data stored in the chunks of the documents exposed to the SDE 160, normalized to have unit L2 norm. We use the correlation between vectors as the measure of similarity between elements of this space.

With this approximate transitivity in mind, a hierarchical structure of clusters such as that shown in FIG. 6 is built. The following is a detailed description of how this structure 600 supports queries for chunks similar to a given chunk that are more efficient than the "check everything", or exhaustive search, approach. In particular, a query descends into those branches of the structure 600 that pass through the centers of clusters correlated with the queried vector.

Next, the clustering method we used to build the cluster hierarchy 600 is described. In general, the clustering problem is NP-hard (intractable at realistic sizes) and requires the application of advanced algorithms (K-means, genetic algorithms, and so on). In our case, it is not even possible to hold all the chunks in memory at once (let alone to scan that data a very large number of times), a serious complication that rules out the use of conventional clustering techniques. We need to build an online algorithm that examines every chunk only once, or at most a few times, during the whole process of building the hierarchy.

We chose to build an algorithm similar to the "Growing Hierarchical Self-Organizing Maps" method described in Dittenbach, M., Rauber, A., and Merkl, D., "Uncovering the Hierarchical Structure in Data Using the Growing Hierarchical Self-Organizing Maps", Neurocomputing, 2002, 48(1-4): 199-216, http://www.ifs.tuwien.ac.at/~mbach/ghsom/.

In this algorithm, every cluster changes its position in space when a new element is inserted into it, but such an insertion occurs only if the element fits within the cluster (if no such cluster exists, the structure automatically creates another cluster). The clusters we use in our structure are spheres of a given radius. Clusters at the same level of the hierarchy have the same radius, and the radius decreases progressively from the top of the hierarchy to the bottom. Several branches of the hierarchy originate from each cluster that is not at the bottom level. All branches reach a common bottom, and elements are registered at the bottom level of the structure. For the purposes of building our theory, we use the phrase "the cluster is similar to the element" in place of the more rigorous "the cluster has a center that is similar to the element". The radius of a cluster corresponds to the minimum correlation coefficient that a member may have with its center.

When a cluster holds only a few elements, it moves substantially each time an element is inserted, "learning" its proper position in space. The size of the cluster's steps decreases as the cluster grows, until the cluster eventually becomes practically immobile. We chose to update the coordinates of the cluster center whenever a new element is inserted, so that the center is always the mean of all elements belonging to the cluster. Once a cluster has moved from its original position, it can no longer be guaranteed that its elements still lie within it. According to the central limit theorem, however, the total distance that the cluster center travels from its initial position as new chunks are inserted is finite, no matter how many chunks the cluster comes to hold. For this reason, elements rarely fall outside the cluster to which they belong. The algorithm periodically examines the hierarchical structure 600 and the movement of its clusters, and estimates the probability that the elements of each cluster have fallen outside their host cluster. It then automatically rechecks the elements of every cluster for which this probability exceeds a threshold (typically 10⁻³).
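The center update described above, in which the center always equals the mean of the cluster's members, can be written incrementally. The following is a minimal sketch; the class name `Cluster` and its fields are hypothetical and chosen only for illustration.

```python
class Cluster:
    """Keeps a cluster center equal to the mean of all inserted elements."""

    def __init__(self, first_element):
        self.center = [float(v) for v in first_element]
        self.count = 1

    def insert(self, element):
        # Incremental mean: the center moves toward the new element by
        # 1/count of the gap, so the step shrinks as the cluster grows.
        self.count += 1
        step = 1.0 / self.count
        self.center = [c + step * (e - c) for c, e in zip(self.center, element)]
```

The factor `1 / count` is what makes a young cluster move substantially and an old one become practically immobile.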

The clusters 610 of our structure 600 tend to overlap one another to a large degree. For an element 620 (i.e., a set of Fourier coefficients) that is to be inserted into the structure 600, there are often several clusters 610 similar enough to the element that it could be inserted into any of them. All of these clusters exhibit some degree of similarity to the element. It is therefore often necessary to decide which of them is the most suitable host for the element being inserted. We clarify this logic further in this section.

Our hierarchical structure 600 has several problems common to all tree-like structures. First, such structures work well only when they are properly balanced, that is, when the number of elements in each branch starting from a given level is roughly the same. Simple tree structures can be balanced on the fly (as elements are inserted), whereas more complex tree structures require a periodic rebalancing procedure. Our structure also requires such a procedure, and the SDE 160 invokes the appropriate method while the workstation 102 (see FIG. 1) is idle.

Next, with reference to the flowchart of FIG. 7, we describe the procedure for querying the hierarchical structure of clusters of elements for the set of clusters that exhibit a sufficiently high correlation with a given element. In data mining, such a procedure is called a "similarity search." The goal of the procedure is to locate as many clusters satisfying the search condition as possible while traversing as few branches of the structure as possible (and thus reducing the time needed to satisfy the query). Formally, our search condition is always that the correlation between the given element and the cluster center exceeds a specified threshold. The value of this threshold r_q is an external parameter of the procedure; it, and how it is chosen, are discussed later in this section. In keeping with the overall goal of the algorithm, the accuracy of the procedure is expressed in probabilistic terms: the procedure does not guarantee that it will locate every cluster satisfying the given condition.

Let q be the element being queried (also called the "base of the query"). Denote all clusters located at the top level of the hierarchy (see FIG. 6) by [symbol omitted] and their centers by [symbol omitted]. The procedure first examines the top level of the hierarchy (see FIG. 7, step 701). By geometric considerations, the probability of finding in some cluster [symbol omitted] an element x that exhibits high similarity to q increases with the correlation coefficient between [symbol omitted] and q. For an element x with corr(q, x) > r:

[formula omitted] (approximate)

This formula is the basis of our query procedure. The next step 703 of the procedure computes the correlation coefficient with q for all [symbol omitted].

The next step 705 sorts the clusters by the values of these coefficients. In the next step 707, a subset of clusters [symbol omitted] is selected from [symbol omitted]: the clusters most likely to hold elements that exhibit high similarity to q.

The probability threshold P_q, used to distinguish the clusters that fall within [symbol omitted] from the other clusters, is an external parameter of the procedure. This parameter is usually chosen in the range P_q ~ 10⁻² to 10⁻⁴, which we have found to be an acceptable trade-off between the speed and the accuracy of the procedure. The parameter P_q is the probability that the procedure fails to report an element that exhibits high similarity to q. The procedure automatically computes the correlation threshold [symbol omitted] that corresponds to P_q at the top level of the hierarchy. The subset of clusters [symbol omitted] that the procedure selects identifies the subset of branches in the hierarchy that are worth examining in more detail.

In the next step 709, the procedure examines the next (lower) level of the hierarchy. It collects all clusters that belong to this level of the structure and to the subset of branches found worth entering in the first step of the procedure.

In this way, a subset of clusters [symbol omitted] is formed in step 709, and the analysis described above is applied to it in place of [symbol omitted]. As a result of this analysis, the subset [symbol omitted] is reduced to a further subset [symbol omitted], formed by the clusters that exhibit high similarity to q, and the value of the required correlation threshold [symbol omitted] is computed.

These steps are repeated until the procedure finds, at state 712, that it has reached the bottom level of the hierarchy. At this level, the clusters whose centers exhibit a correlation with q greater than r_q (the external parameter of the procedure; see above) are reported as the result of the procedure (step 714).
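The level-by-level descent of steps 701 to 714 can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the invention: the per-level pruning thresholds are passed in as the list `level_thresholds` instead of being derived from P_q (that derivation is not reproduced in this text), the correlation is assumed to be the Pearson coefficient, every branch is assumed to reach the bottom at the same depth, and the names `Node`, `corr`, and `similarity_search` are hypothetical.

```python
import math

class Node:
    """A cluster in the hierarchy; bottom-level clusters have no children."""
    def __init__(self, center, children=None):
        self.center = center
        self.children = children or []

def corr(x, y):
    """Pearson correlation coefficient (assumed similarity measure)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denominator if denominator else 0.0

def similarity_search(top_level, q, level_thresholds, r_q):
    """Return the bottom-level clusters whose centers correlate with q above r_q."""
    candidates = list(top_level)                                   # step 701
    level = 0
    while candidates:
        scored = [(corr(c.center, q), c) for c in candidates]      # step 703
        scored.sort(key=lambda pair: pair[0], reverse=True)        # step 705
        if not scored[0][1].children:                              # state 712
            return [c for r, c in scored if r > r_q]               # step 714
        kept = [c for r, c in scored if r > level_thresholds[level]]  # step 707
        candidates = [child for c in kept for child in c.children]    # step 709
        level += 1
    return []
```

Pruning at each level is what keeps the number of visited branches small; it is also why the result is probabilistic, since a matching bottom-level cluster under a pruned branch is never examined.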

As already mentioned, when an element q is inserted into the hierarchy, there are often two or more clusters [symbol omitted] at level l of the hierarchy that could host it. These clusters satisfy [condition omitted], where r^l is the correlation threshold that defines the cluster radius at level l. From this subset of clusters suitable for hosting q, the cluster that is likely to be the most suitable host for q must be selected. We next describe how this selection is made.

Suppose that we select some cluster [symbol omitted] at the bottom level of the hierarchy, together with the other clusters on its branch, to host the element q (here L denotes the bottom level of the hierarchy). Suppose further that we then run a similarity query for the same element q, using it as the base of the query as described above. The following criterion designates a bottom-level cluster as the most suitable host for q: it is the cluster in which a subsequent similarity query finds the same element with the greatest certainty. Note that "greedy" insertion logic, in which the cluster most similar to q is found at each level of the hierarchy and its branch is chosen as the host for q, does not necessarily satisfy this criterion. Indeed, if some cluster at the top level is very similar to q, greedy logic selects this cluster as the host for q and continues selecting host clusters at lower levels only along the branches that start from it. However, the cluster [symbol omitted] that belongs to the selected branch and is most similar to q at the next level of our structure may be, and often is, quite dissimilar to q. In particular, consider the case [condition omitted], where r^1 and [symbol omitted] have the meanings given above. Under these circumstances, a subsequent query procedure would not consider [symbol omitted] a possible host for q and would therefore fail to find q in the hierarchy. Another important aspect to take into account in designing the element insertion procedure is that clusters at all levels move when new elements are inserted into them. As a result, a branch of the hierarchy that appears to be a good candidate for element q at one point in time may cease to be a good candidate as the structure grows.

The method we prefer for locating the most suitable branch of the hierarchy in which to insert a given element q is described below.

We first run the similarity query procedure to find the group of clusters at the bottom of the hierarchy that are similar to q.

We then find, within this group, the cluster that belongs to the branch that is, on average, most similar to q across all levels of the hierarchy. We define this average as the root mean square of the weighted L_2 distances between q and the centers of the clusters that make up the branch at all levels of the hierarchy. The weights in this calculation are the radii corresponding to [symbol omitted] in the preceding query procedure.
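The branch-selection rule above can be sketched as follows. The text does not spell out exactly how the radii enter the average, so this sketch assumes one possible reading: a weighted mean of squared L_2 distances, with one weight per level, followed by a square root. The function names and the `weights` argument are hypothetical.

```python
import math

def l2(x, y):
    """Euclidean (L_2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def branch_score(q, branch_centers, weights):
    """Weighted root mean square of the distances from q to the centers of the
    clusters on one branch, ordered from top level to bottom level.
    Assumed weighting: weights[i] scales the squared distance at level i."""
    weighted = sum(w * l2(q, c) ** 2 for c, w in zip(branch_centers, weights))
    return math.sqrt(weighted / sum(weights))

def best_branch(q, branches, weights):
    """Pick the branch whose clusters are, on average, closest to q."""
    return min(branches, key=lambda centers: branch_score(q, centers, weights))
```

With q = [0, 0], a branch with centers [[0, 0], [3, 4]] wins at the top level under greedy logic but loses to a branch with centers [[1, 0], [1, 0]] under this average, which is the situation the preceding discussion of greedy insertion warns about.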

As already mentioned, an element similarity query (step 714; see FIG. 7) normally returns a set of clusters similar to the queried element (the query base). Each cluster in this set contains data chunks from various documents. A single query is therefore not sufficient to determine which individual document holds the queried chunk. However, the SDE 160 can run multiple similarity queries using consecutive chunks of a document as bases and then infer from the results which documents contain the chunks of interest. To this end, the SDE 160 maintains a document chunk database that maps each chunk to the cluster of the hierarchy that contains it. Once the SDE 160 has run several similarity queries on consecutive chunks of an unknown document and obtained the sets of clusters similar to those chunks, a separate procedure is executed. This procedure accesses the document chunk database and retrieves the documents whose consecutive chunks lie in the same clusters as those found by the similarity queries, in the same order. These documents are reported as similar to the queried unknown document. The accuracy of this post-processing increases exponentially with the number of chunks of the unknown document that are queried. Consequently, only a few consecutive chunks of a document need to be examined to discover, with high certainty, its similarity to one of the preprocessed documents.
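The post-processing step described above can be sketched as follows. The layout of the document chunk database is an assumption made for illustration: here it is a dictionary mapping a document identifier to the ordered list of cluster identifiers of its consecutive chunks. The function name `matching_documents` is hypothetical.

```python
def matching_documents(chunk_db, query_results):
    """Return the documents that contain a run of consecutive chunks whose
    clusters match the query results in the same order.

    chunk_db: {document id: [cluster id of chunk 0, cluster id of chunk 1, ...]}
    query_results: one set of cluster ids per consecutive chunk of the unknown
                   document, as returned by the similarity queries.
    """
    n = len(query_results)
    if n == 0:
        return []
    hits = []
    for doc_id, clusters in chunk_db.items():
        # Slide a window of n consecutive chunks over the stored document.
        for start in range(len(clusters) - n + 1):
            if all(clusters[start + i] in query_results[i] for i in range(n)):
                hits.append(doc_id)
                break
    return hits
```

Each additional queried chunk adds one more condition that a chance match would have to satisfy, which is consistent with the statement that accuracy improves rapidly with the number of chunks queried.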

Next, we consider the parameter r_q, which in our typical similarity query procedure (see above) specifies the threshold of similarity between the query base and the clusters the procedure retrieves. To simplify post-processing of the query, this parameter must be high enough that the procedure does not retrieve clusters whose similarity to the base is merely accidental. At the same time, it must not be set too high, since that would prevent the procedure from retrieving clusters that hold chunks similar to the element that is the ultimate target of the query. The parameter therefore depends on how the query post-processing procedure is implemented, and also on the dimension of the space of the hierarchical structure (i.e., the number of Fourier modes included). In our experiments, 70 dimensions were found to be adequate for our purposes, and the parameter r_q is chosen so that accidental cluster retrievals amount to about 1%.
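One way to picture how a 1% accidental retrieval rate translates into a value of r_q is to estimate the distribution of chance correlations empirically. The sketch below is only an illustration: it assumes unrelated elements behave like independent Gaussian random vectors, which real chunk coefficients need not do, and the function name `chance_threshold` and its parameters are hypothetical.

```python
import math
import random

def corr(x, y):
    """Pearson correlation coefficient (assumed similarity measure)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denominator if denominator else 0.0

def chance_threshold(dim=70, trials=2000, false_rate=0.01, seed=0):
    """Estimate the correlation exceeded by only `false_rate` of unrelated pairs."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        y = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        samples.append(corr(x, y))
    samples.sort()
    # Index of the (1 - false_rate) empirical quantile, clamped to the last sample.
    index = min(len(samples) - 1, int((1.0 - false_rate) * len(samples)))
    return samples[index]
```

Lowering `dim` widens the spread of chance correlations and pushes the threshold up, which illustrates why r_q depends on the dimension of the coefficient space.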

While this invention has been particularly shown and described with reference to preferred embodiments thereof, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the invention encompassed by the appended claims.

FIG. 1 shows the components of a similarity discovery system according to the present invention. On the client side, a similarity detection engine (SDE) supports document chunk similarity queries from SDE-dependent applications. To this end, the SDE monitors system events related to document management and maintains a hierarchical structure representation of document files. The elements of the hierarchical structure for a given file are referred to as the Fourier components of data "chunks," whose identifiers (IDs) and positions in the original file are stored in an on-board document chunk database. The client-side database also stores the document distribution path (DDP). Optionally, an enterprise-wide server can be used to collect data from the clients' SDEs and to service queries that a local SDE cannot service.

FIG. 2 shows an example (scenario) of the paths along which documents flow in a computer system. At time t_0, the SDE has no information about the origin of the documents and scans the file system to generate the initial hierarchical structure and document distribution path (DDP). The similarity between new versions of documents and the original documents from which they derive may not be fully captured merely by monitoring the activity of the computer system (for example, when a document is renamed, copied, or merged). In other cases (for example, when a document is received from the network), this similarity is best revealed by a query to the SDE.

FIG. 3 shows an example of relational database entries for a representation of the document distribution path (DDP), recording the relationships between documents and how they were created.

FIG. 4 is a high-level flow diagram of the algorithm that the SDE uses to identify similar documents.

FIG. 5 shows the convolution of two vectors, each representing a lowest-level component of the document chunk hierarchy. Here the convolution has two relatively offset common parts, each one quarter of the vector length, as well as two peaks of random noise.

FIG. 6 shows the architecture of the hierarchical structure used by the SDE to represent data files. The structure represents the vector space of Fourier coefficients of the data stored in document chunks. Each higher-level cluster holds references to a set of lower-level clusters. The bottom-level clusters hold the elements of the Fourier coefficient space.

FIG. 7 is a flowchart of the operations used to query the hierarchical structure for clusters similar to a given element (referred to as the "base of the query").

Claims

A method of maintaining a representation of a history of operations performed on documents in a data processing environment of a computing device, the method comprising:
a maintaining step of maintaining, in a database of the computing device, a document distribution path representation that includes entries each having (i) an identifier of one or more original documents, (ii) an identifier of at least one destination document, and (iii) a relationship descriptor representing the degree to which the one or more original documents were used to generate the destination document;
a detecting step in which monitoring means in the operating system kernel of the computing device detects access events that may affect relationships between documents; and
a generating step in which, in response to detecting an access event that creates a new original document or an access event that changes the relationship descriptor of an existing document, means of the computing device generates a new entry in the document distribution path representation, with a relationship descriptor determined by the detected access event, so that the access event is reflected in the history of operations,
wherein, if the relationship descriptor cannot be determined from the detected access event,
similarity detection means queries the destination document against a database of existing documents and determines an appropriate relationship descriptor for the destination document.

The method of claim 1, wherein the document distribution path representation includes an event identifier, the event identifier being selected from the group consisting of a user identifier, a calling process identifier, a network operation identifier, and a storage medium identifier.

The method of claim 2, wherein the storage medium is a removable storage medium.

The method of claim 1, wherein the data distribution path representation is a graph data structure having vertices that represent documents and edges that represent relationship descriptors.

The method of claim 1, wherein the relationship descriptor specifies whether the original document and the destination document are identical.

The method of claim 1, wherein the relationship descriptor specifies the means by which the original document was changed to become the destination document.

The method of claim 6, wherein the relationship descriptor is selected from the group consisting of editing, merging, and copying.

The method of claim 6, wherein the relationship descriptor further identifies the means by which the original document was changed, as determined by monitoring user access to the original document and/or by monitoring change events relating to the original document.

The method of claim 1, wherein the relationship descriptor quantifies the degree to which the original document was changed to become the destination document.

The method of claim 1, wherein the data distribution path is used to implement a data security application on the computing device, and
wherein, if the step of querying the destination document against the database of existing documents determines that a similar document exists, the data security application applies to the destination document the security classification already assigned to the similar document.

The method of claim 10, wherein the monitoring means can restrict user access to a document according to the security classification of the document.

The method of claim 10, wherein the monitoring means can restrict user control over a document according to the security classification of the document.

The method of claim 10, wherein the security classification is applied to a new document in real time when the new document is first stored on the computing device.

The method of claim 1, wherein the relationship descriptor specifies an initial change relationship between at least one pair of documents, the change relationship being determined by the degree of similarity between the documents.

The method of claim 14, wherein the change relationship is further determined by at least one of an access time and a modification time of the documents.

The method of claim 1, wherein the data distribution path is used to implement a document deletion function, and wherein, when the querying step determines that a document similar to the document requested for deletion exists, the similar document is also deleted.

The method of claim 1, wherein determining the appropriate relationship descriptor uses a determination method that determines whether a first document and a second document stored in digital form in a data processing system are similar by comparing sparse representations of the first and second documents, the determination method comprising:
a dividing step of dividing the first and second documents into chunks of data of a predetermined size;
a selecting step of selecting a subset of all the chunks as representative of the data in each document;
a determining step of determining a set of coefficients representing each selected chunk;
a combining step of combining the sets of coefficients into coefficient clusters, each coefficient cluster containing coefficients that are similar according to a predetermined similarity criterion; and
an evaluating step in which the similarity detection means evaluates the degree of similarity between the documents by counting the clusters into which chunks from both documents fall.

The method of claim 17, wherein the coefficients representing a given chunk are selected to be Fourier transform coefficients of the data values that make up the chunk.

The method of claim 18, wherein the selected coefficients are the absolute values of the Fourier transform coefficients.

The method of claim 18, wherein, before the Fourier coefficients are calculated, the data in the chunk is mapped onto the unit circle in the complex plane.

The method of claim 17, wherein the degree of similarity is determined by calculating the correlation of the coefficients of the data stored in the chunks.

The method of claim 21, wherein the correlation is a linear correlation computed after outlier values have been removed from the vectors of coefficients.

The method of claim 17, wherein the step of evaluating the degree of similarity accounts for shifts that may occur in the position of similar data within the two documents.

The method of claim 17, wherein the cluster representation comprises a hierarchy having at least two levels, each successively lower level of the hierarchy representing only a portion of the chunks at the higher levels of the hierarchy.

The method of claim 17, wherein the comparing step first operates at a higher level of the hierarchy and continues comparing coefficients at lower levels of the hierarchy only when a sufficient degree of similarity has been found between the coefficients of the queried chunk and the cluster centers at the higher level.

The method of claim 25, wherein the comparison of chunk coefficients with clusters at a given lower level of the hierarchy is limited to examining only those clusters that belong to branches of the hierarchy passing through associated higher-level clusters already determined to be similar to the coefficients of the queried document.

The method of claim 25, further comprising:
a. a selecting step of selecting, for the first document, a cluster search set derived from the sets of coefficients located at a given level of the hierarchy;
b. a computing step of calculating a similarity for the clusters in the cluster search set by comparing them with at least one chunk of the second document selected as a base element;
c. a reordering step of reordering the compared clusters according to their degree of similarity to the chunk from the second document;
d. a calculating step of calculating a similarity threshold for proceeding;
e. a selecting step of selecting a subset of the cluster search set as the clusters most similar to the base element;
f. a handling step of treating the subset as the next cluster search set;
g. a repeating step of repeating steps b through f until the bottom of the hierarchy is reached; and
h. a returning step of returning, when the iteration is complete, the subset generated in step f as the solution.

The method of claim 17, wherein the comparing step further includes a query interpretation process that combines the results of queries for a plurality of chunks in the hierarchy so as to determine an overall degree of similarity between the two documents.

The method of claim 28, further comprising:
a step of determining, for the first document and for every document in a set of processed documents that have already been processed by this similarity determination method, the number of similar chunks in these documents,
whereby the first document is determined to be similar to a group of documents within the larger set of processed documents.

The method of claim 29, wherein documents in the set of processed documents that have fewer than a predetermined number of chunks similar to the first document are not determined to be similar.

The method of claim 25, wherein, among the subset of clusters generated in the handling step f, the cluster that, together with its parent clusters at higher levels of the hierarchy, is on average most similar to a given coefficient set is selected as the host that stores the corresponding coefficient set.

The method of claim 31, wherein the average similarity of the clusters located at different levels of the hierarchy to the corresponding coefficient set is given by the arithmetic mean of the squares of the similarities of the clusters to the coefficient set at the different levels, the arithmetic mean being weighted by the cluster dimensions at those levels.