KR100860821B1

KR100860821B1 - Computing system, method for establishing an identifier and recording medium with a computer readable program for use in a commonality factoring system

Info

Publication number: KR100860821B1
Application number: KR1020027010804A
Authority: KR
Inventors: 그레고리 하간 멀톤; 스테판 비. 화이트힐
Original assignee: 이엠씨 코포레이션
Priority date: 2000-02-18
Filing date: 2001-02-14
Publication date: 2008-09-30
Also published as: KR20020082851A; CA2399555A1; AU2001238269B2; EP1269350A1; WO2001061563A1; EP1269350A4; JP4846156B2; AU3826901A; JP2003524243A

Abstract

Disclosed is a computer file system that is organized around hashes and/or strings of digits of constant, differing, or varying length (304) and that can eliminate or screen redundant copies of aggregate blocks of data (or parts of data blocks) from the system. The hash file system of the present invention uses hash values (310) for computer files or file fragments (306), which may be generated by a checksum-generating program, engine, or algorithm such as the industry-standard MD4, MD5, SHA, or SHA-1 algorithms. Alternatively, the hash value may be generated (308) by a checksum program, engine, or algorithm, or by other means that produce an effectively unique hash value for a block of data of indeterminate size based on a mathematical algorithm.

Checksum, packet, hard disk, RAID, NAS, MD5

Description

Computing system, method for establishing an identifier, and recording medium with a computer-readable program for use in a commonality factoring system

The present invention relates generally to the field of hash file systems and commonality factoring systems. More particularly, it relates to a system and method for determining a correspondence between electronic files in a distributed computer data environment, and to particular applications thereof.

Economic, political, and social power are increasingly managed by data. Transactions and wealth are represented by data. Political power is analyzed and modified based on data. Human interactions and relationships are defined by data exchanges. The efficient distribution, storage, and management of data is therefore expected to play an increasingly vital role in human society.

The quantity of data that must be managed, in the form of computer programs, databases, files, and the like, is increasing exponentially. As computer processing power increases, operating systems and application software become larger. Moreover, the desire to access larger data sets, such as multimedia files and large databases, further increases the quantity of data that is managed. This ever-growing data load must be transported between computing devices and stored in an accessible form. The exponential growth of data is expected to outpace improvements in communication bandwidth and storage capacity, making data management with conventional methods an ever more pressing problem.

In conventional data storage systems, many factors must be balanced and are often compromised. Because the quantity of data is becoming extremely large, there is continuing pressure to reduce the cost per bit of storage. A data management system must also be scalable so that it can address future needs as well as present ones. Preferably, a storage system is incrementally scalable so that a user can purchase only the capacity needed at any particular time. High reliability and high availability are also considerations, because data users are becoming increasingly intolerant of lost, damaged, and unavailable data. Conventional data management architectures must compromise among these factors, so no single architecture provides a cost-effective, highly reliable, highly available, and scalable solution.

Conventional RAID (Redundant Array of Independent Disks) is a way of storing the same data in different places (thus, redundantly) on multiple storage devices such as hard disks. By placing data on multiple disks, input/output ("I/O") operations can overlap in a balanced way, improving performance. Because using multiple disks increases the mean time between failures ("MTBF"), storing data redundantly also increases fault tolerance. A RAID system relies on a hardware or software controller to hide the complexities of the actual data management, so that it appears to the operating system as a single logical hard disk. However, RAID systems are difficult to scale because of the physical limitations of cabling and controllers. In addition, the availability of a RAID system depends heavily on the functioning of the controller itself, so when a controller fails, the data stored behind that controller becomes unavailable. Moreover, RAID systems require specialized rather than commodity hardware and therefore tend to be expensive solutions.

NAS (network-attached storage) refers to hard disk storage that is set up with its own network address rather than being attached to an application server. File requests are mapped by the NAS file server. NAS can provide transparent I/O operations using hardware-based or software-based RAID. NAS can also automate the mirroring of data to one or more other NAS devices to further improve fault tolerance. Because NAS devices can be added to a network, they make it possible to expand the total storage capacity available to the network. In RAID applications, however, NAS devices are constrained by the capabilities of conventional RAID controllers. NAS systems also do not allow mirroring and parity across nodes, and so are a limited solution.

In addition to the data storage problem, data transport is evolving rapidly with improvements in wide area network ("WAN") and internetworking technology. The Internet, for example, has created a globally networked environment with almost ubiquitous access. Despite rapid improvements in network infrastructure, the rate of growth in the quantity of data that must be transported is expected to outpace improvements in available bandwidth.

In fact, the way data is conventionally managed is out of step with the hardware devices and infrastructures that have been developed to manipulate and transport it. For example, a computer is by nature a general-purpose machine that is readily programmed to perform a virtually unlimited variety of functions. In large part, however, computers are loaded with a fixed, slowly changing set of data that limits their general-purpose nature so that they serve special purposes. In commodity computers, the improvements in processing speed, peripheral performance, and data storage capacity have been most dramatic. Yet many data storage solutions cannot take advantage of these improvements because they are constrained, rather than extended, by the storage controllers on which they are based. Similarly, the Internet was developed as a fault-tolerant, multi-path interconnection network. Conventionally, however, network resources are implemented in specific network nodes, so that failure of a node makes the resource unavailable despite the fault tolerance of the network to which the node is connected. There is a continuing need for highly available, highly reliable, and highly scalable data storage solutions.

Disclosed herein is a system and method for a computer file system that is organized around hashes and/or strings of digits of constant, differing, or varying length and that can eliminate or screen redundant copies of blocks of data (or parts of data blocks) from the system. Also disclosed herein is a system and method for a computer file system in which the hashes may be produced by a checksum-generating program, engine, or algorithm such as the industry-standard Message Digest 4 ("MD4"), MD5, Secure Hash Algorithm ("SHA"), or SHA-1 algorithms. Further disclosed herein is a system and method for a computer file system in which the hashes may be generated by a checksum program, engine, algorithm, or other means that produces a probabilistically unique hash value for a block of data of indeterminate size, based on a non-linear probabilistic mathematical algorithm or on any industry-standard technique for generating pseudo-random values from an input text or other data or numeric sequence.
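
As a minimal illustration of the hash generation described above, the following Python sketch computes a SHA-1 digest over a block of arbitrary size with the standard `hashlib` module. The function name `block_hash` is hypothetical and is not taken from the specification, which names the algorithms but prescribes no particular implementation.

```python
import hashlib

def block_hash(data: bytes) -> str:
    """Return the 160-bit SHA-1 digest of a data block as 40 hex characters."""
    return hashlib.sha1(data).hexdigest()

# Identical content always yields the identical hash value,
# regardless of where or when the hash is computed.
a = block_hash(b"hello world")
b = block_hash(b"hello world")
# A one-byte difference in the content yields a different hash value.
c = block_hash(b"hello world!")
```

The digest length is fixed at 160 bits whatever the input size, which is what allows the hash to stand in for a block of indeterminate size.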

In particular applications disclosed herein, the system and method of the present invention can be used to factor out data redundancy automatically, allowing potentially very large quantities of unfactored storage to be reduced in size, often by several orders of magnitude. In this regard, the system and method allow all computers, regardless of their particular hardware or software characteristics, to share data simply, efficiently, and securely, and they provide a uniquely advantageous means of reading, writing, or referencing data. The system and method are especially well suited to networked computers or computer systems, but they can also be applied to isolated data storage with comparable results.

The hash file system of the present invention advantageously solves a number of problems that plague conventional storage architectures. For example, the system and method eliminate the need to manage huge collections of directories and files, together with the system resources that are inevitably consumed by duplicates and slightly different copies. Maintaining and storing duplicate files burdens traditional corporate and personal computer systems and generally requires tedious human involvement to "clean up disk space". The hash file system of the present invention effectively removes this problem by eliminating the disk space used for copies and nearly eliminating the disk space used for partial copies. For example, a traditional computer system that copies a gigabyte directory structure to a new location requires another gigabyte of storage. In particular applications, the hash file system of the present invention reduces the disk space used for this operation by a factor of 100,000 or more.
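
The effect of copying a directory can be sketched with a small content-addressed store. This is an illustrative toy, not the patented implementation: the class name `HashStore`, the in-memory dictionary, and the sample file names are all assumptions made for the example.

```python
import hashlib

class HashStore:
    """Hypothetical content-addressed store: each unique block is kept once."""

    def __init__(self):
        self.blocks = {}  # hex digest -> block bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.blocks.setdefault(key, data)
        return key

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())

store = HashStore()
original = {"a.txt": b"alpha" * 1000, "b.txt": b"beta" * 1000}

# First entry of the directory: both files are new, so their bytes are stored.
dir1 = {name: store.put(data) for name, data in original.items()}
size_before = store.stored_bytes()

# "Copying" the directory presents the same content again;
# only hash references are recorded, no block bytes are added.
dir2 = {"copy/" + name: store.put(data) for name, data in original.items()}
size_after = store.stored_bytes()
```

In this sketch the copy costs only the directory entries that map names to hashes, which is why the ratio between a conventional copy and a factored copy can become very large for big directory trees.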

Some file systems currently have a mechanism for eliminating duplicate copies, but none can perform this operation in a short amount of time as the system scales; in technical terms, none factors copies in O(1) ("on the order of constant time") time. O(1) means that the time per unit of work remains constant, in contrast to other systems that require O(N^2), O(N), or O(log N) time, where the time depends on the amount of storage being factored. Factoring storage in non-constant time may be marginally satisfactory for systems with little storage, but as system size grows, even the most efficient non-constant-time factoring systems become unsatisfactory. The hash file system of the present invention is designed to factor storage on a scale never previously attempted; in a first implementation it can factor 2 million petabytes of storage, with the ability to expand to much larger sizes. Existing file systems cannot manage data on this scale.
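
The difference between non-constant-time and constant-time factoring can be sketched as follows. The two function names are hypothetical, and the Python `set` stands in for whatever index a real system would use; the specification does not describe its index at this point, so this is only an assumption for illustration.

```python
import hashlib

def duplicates_pairwise(blocks):
    """O(N^2) overall: compare each block with every earlier block."""
    dup = 0
    for i, b in enumerate(blocks):
        if any(b == blocks[j] for j in range(i)):
            dup += 1
    return dup

def duplicates_hashed(blocks):
    """Expected O(1) per block: one digest and one hash-table lookup."""
    seen = set()
    dup = 0
    for b in blocks:
        h = hashlib.sha1(b).digest()
        if h in seen:
            dup += 1
        else:
            seen.add(h)
    return dup

blocks = [b"x", b"y", b"x", b"z", b"y", b"x"]
pairwise_count = duplicates_pairwise(blocks)
hashed_count = duplicates_hashed(blocks)
```

Both functions find the same duplicates, but the work per block in the hashed version does not grow with the number of blocks already stored, whereas the pairwise version slows down as the store fills.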

Moreover, the hash file system of the present invention can be used to provide inexpensive, global data protection and backup for computer systems. Because computer file systems rarely change by more than a few percent of their total storage between backup operations, the factoring function works very efficiently on typical backup data sets. The hash file system can also serve as the basis for an efficient messaging (e-mail) system. An e-mail system is fundamentally a data-copying mechanism in which an author writes a message and sends it to a list of recipients. E-mail systems effectively implement the "send" operation by copying data from one place to another. The author generally keeps copies of the messages he or she sends, and each recipient keeps his or her own copy. These copies are in turn attached to replies, which are also kept (that is, copies of copies). The commonality factoring feature of the present invention can eliminate this enormous inefficiency while transparently preserving the familiar copy-oriented paradigm for e-mail users.

As already noted, because most of the data in a computer system changes rarely, the hash file system of the present invention makes it possible to reconstruct a complete snapshot of an entire system at whatever interval the system requires, for example every hour of every day the system exists, or at intervals of minutes or less. In addition, conventional computer systems often provide only limited versioning of files (for example, the VAX® VMS® file system from Digital Equipment Corporation), and the hash file system of the present invention offers significant advantages in this respect as well. Versioning in conventional systems has both a good side and a bad side. On the good side, it helps prevent accidents; on the bad side, it requires regular purging to reduce the disk space consumed. The hash file system of the present invention provides file versioning with almost no overhead, because identical or edited copies are factored and consume very little extra space. For example, saving one hundred revisions of a typical document normally requires about one hundred times the space of the original file. With the hash file system disclosed herein, those revisions might require only about three times the original space (depending on the size of the document, the degree and type of editing, and external factors).
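
The storage cost of versioning can be made concrete with a toy calculation. The sketch below assumes fixed 1 KB chunks, a 100-chunk document, and 100 revisions that each rewrite exactly one chunk; these parameters, and the names `save`, `unique`, and `stats`, are invented for the example. The bytes needed for the per-version lists of hashes are ignored.

```python
import hashlib

CHUNK = 1024  # fixed chunk size in bytes, chosen only for this toy example

def chunks_of(doc: bytes):
    return [doc[i:i + CHUNK] for i in range(0, len(doc), CHUNK)]

unique = {}            # digest -> chunk, shared by all saved versions
stats = {"naive": 0}   # bytes needed if every version were stored in full

def save(doc: bytes):
    """Save one version; return its list of chunk hashes."""
    stats["naive"] += len(doc)
    version = []
    for c in chunks_of(doc):
        h = hashlib.sha1(c).digest()
        unique.setdefault(h, c)  # only previously unseen chunks cost space
        version.append(h)
    return version

# Original document: 100 distinct chunks.
parts = [f"chunk-{i:04d}".encode().ljust(CHUNK, b".") for i in range(100)]
save(b"".join(parts))

# 100 revisions, each replacing one chunk with new content.
for r in range(100):
    parts[r] = f"rev-{r:04d}".encode().ljust(CHUNK, b".")
    save(b"".join(parts))

original_size = 100 * CHUNK
factored_bytes = sum(len(c) for c in unique.values())
```

Under these assumptions, storing all versions in full costs 101 times the original size, while the factored store holds only twice the original size. Real documents and edits will behave differently, which is consistent with the text's caveat that the result depends on document size and on the degree and type of editing.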

Other potential applications of the hash file system of the present invention include web serving. Because the method used to factor commonality (hashing) also produces a uniform distribution across all hash file system servers, the hash file system can be used to distribute web content efficiently. This even distribution allows a large array of servers to function as a giant web server farm with an evenly distributed load. In another application, the hash file system can serve as a network accelerator, since it can reduce network traffic by sending proxies (hashes) for the data instead of the data itself. A large percentage of current network traffic is redundant data moving between locations. Sending proxies for the data allows local caching mechanisms to work effectively, making it possible to reduce traffic on the Internet substantially.
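
The even distribution across servers can be sketched by mapping each hash to a server number. Taking the digest modulo the server count is only one simple placement rule, assumed here for illustration; the function name `server_for`, the four-server farm, and the `page-N` keys are likewise hypothetical.

```python
import hashlib
from collections import Counter

def server_for(data: bytes, n_servers: int) -> int:
    """Pick a server from the content hash alone, with no central allocator."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest, "big") % n_servers

# Place 10,000 distinct items on a hypothetical farm of 4 servers.
counts = Counter(server_for(f"page-{i}".encode(), 4) for i in range(10000))
```

Because the hash values are spread nearly uniformly over their range, each server ends up with close to a quarter of the items, and any client can compute the placement independently from the content.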

In particular, as disclosed herein, the hash file system and method of the present invention may be implemented using 160-bit hashsums as universal pointers. This differs from conventional file systems, which use pointers assigned by a central authority (for example, in Unix a 32-bit "inode" is assigned by the kernel's file system in a lock-step operation that guarantees uniqueness). In the hash file system of the present invention, the 160-bit hashsums are assigned by the hashing algorithm without any central authority (that is, without locking and without synchronization).

Known hashing algorithms produce probabilistically unique numbers that uniformly span a range of values. For the hash function SHA-1, which produces 160-bit values, that range is between 0 and 2^160, or roughly 1.46 × 10^48. The hashing is done by examining only the contents of the data being stored, and it can therefore be performed in complete isolation, asynchronously, and without interlocking.
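
The size of the SHA-1 value range follows directly from its 160-bit digest length. The short Python check below uses only the standard `hashlib` module; the variable names are illustrative.

```python
import hashlib

# SHA-1 digests are 20 bytes long, i.e. 160 bits.
digest_bits = hashlib.sha1().digest_size * 8

# Number of distinct values a 160-bit digest can take.
space = 2 ** digest_bits

# Decimal length of that number, to see its order of magnitude.
decimal_digits = len(str(space))
```

The value 2^160 has 49 decimal digits and begins 1461..., so it is approximately 1.46 × 10^48.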

Hashing is an operation that can be verified by any component of the system, which eliminates the need for trusted operations among those components. The hash file system and method disclosed herein therefore eliminate the critical bottleneck of conventional large-scale distributed file systems, namely a trusted encompassing central authority. This permits the construction of large-scale distributed file systems that operate without the risk of inconsistency and without the bottleneck limitations of the prior art.
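
Verification by an arbitrary component amounts to recomputing the hash from the received bytes and comparing it with the claimed value. The sketch below assumes SHA-1 and a hypothetical function name `verify`; the expected digest used in the example is the standard published SHA-1 test vector for the input "abc".

```python
import hashlib

def verify(data: bytes, claimed_hex: str) -> bool:
    """Recompute the hash locally; no trust in the sender is required."""
    return hashlib.sha1(data).hexdigest() == claimed_hex

# Standard SHA-1 test vector for the three-byte input b"abc".
ABC_SHA1 = "a9993e364706816aba3e25717850c26c9cd0d89d"

ok = verify(b"abc", ABC_SHA1)        # content matches its claimed hash
tampered = verify(b"abd", ABC_SHA1)  # altered content fails the check
```

Since every node can run this check on its own, no node has to be designated as the trusted issuer of identifiers.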

FIG. 1 is a high-level illustration of a representative networked computer environment in which the system and method of the present invention may be implemented.

FIG. 2 is a more detailed conceptual representation of a possible operating environment for the system and method of the present invention, showing that files maintained on any number of data centers or computers may be stored in a distributed computer system through an Internet connection to a number of RAIN (Redundant Arrays of Independent Nodes) racks located, for example, at geographically separate locations.

FIG. 3 is a logic flow chart depicting the steps in the entry of a computer file into the hash file system of the present invention, in which the hash value for the file is checked against the hash values for files previously maintained in a set or database.

FIG. 4 is a logic flow chart depicting the steps in breaking up a file or other data sequence into hashed pieces, producing a number of data pieces as well as a corresponding probabilistically unique hash value for each piece.

FIG. 5 is another logic flow chart depicting the comparison of the hash values for each piece of a file with the existing hash values in the set (or database), the production of records showing the equivalence of a single hash value for all of the file pieces with the hash values of the individual pieces, and the addition of new data pieces and their corresponding new hash values to the set.

FIG. 6 is yet another logic flow chart depicting the steps in comparing file hash or directory list hash values with existing directory list hash values and adding new file or directory list hash values to the set directory list.

FIG. 7 is a comparison of the pieces of a representative computer file, with their corresponding hash values, before and after the editing of a particular piece of the example file.

FIG. 8 is a conceptual representation of the fact that composite data derived by means of the system and method of the present invention is the same as data that is represented explicitly, but may instead be created by a "recipe", such as the concatenation of the data represented by its corresponding hashes or the result of a function applied to the data represented by the hashes.

FIG. 9 is another conceptual representation of how the hash file system and method of the present invention may be used to organize data so as to optimize the reuse of redundant sequences, using hash values as pointers to the data they represent, in which the data may be represented either as explicit byte sequences (atomic data) or as groups of sequences (composites).

FIG. 10 is a simplified diagram illustrating a hash file system address translation function for an example 160-bit hash value.

FIG. 11 is a simplified diagram of an index stripe splitting function for use with the system and method of the present invention.

FIG. 12 is a simplified diagram of the overall functionality of the system and method of the present invention as used to back up data for a representative home computer holding a number of program and document files on Day 1, in which one of the document files is edited on Day 2 and a third document file is added.

FIG. 13 is a comparison of the various pieces of a particular document file, marked by a number of "sticky bytes", before and after editing, in which one of the pieces is changed while the other pieces remain the same.

The foregoing and other features and objects of the present invention, and the manner of attaining them, will become more apparent and will be best understood by reference to the following description of a preferred embodiment taken in conjunction with the accompanying drawings.

In a particular implementation of the hash file system of the present invention as disclosed herein, the application is a high-availability, high-reliability data storage system that matches the rapid advances in commodity computing devices and the robust nature of internetworking technologies such as the Internet. In particular, disclosed herein is a hash file system that manages the correspondence of one or more blocks of data (including but not limited to files, directories, drive images, software applications, digitized voice, and rich media content) with one or more symbols for that block of data, where the symbol is derived from the block of data itself and may be a number, hash, checksum, binary sequence, or other identifier that is statistically, probabilistically, or otherwise effectively unique to that block of data. The system itself runs on any computer system, including without limitation personal computers, supercomputers, distributed or non-distributed networks, storage area networks ("SAN") using IDE, SCSI, or other disk buses, network-attached storage ("NAS"), or other systems capable of storing and/or processing data.

In a particular implementation of the hash file system disclosed herein, the symbol or symbols may be derived using one or more hash or checksum generating engines, programs, or algorithms, including but not limited to MD4, MD5, SHA, SHA-1, or their derivatives. The symbols may also comprise parts of variable-length or fixed-length symbols derived using a hash or checksum generating engine, program, or algorithm, including but not limited to MD4, MD5, SHA, SHA-1, or other methods of generating probabilistically unique identifiers based on data content. In the particular implementation disclosed herein, file seeks or lookups for retrieving data, or for checking on the existence or availability of data, may be accelerated by looking at all or a smaller portion of the symbol, with that portion of the symbol indicating or otherwise providing the routing information needed to find, retrieve, or check on the existence or availability of the data.
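
Using a portion of the symbol as routing information can be sketched by taking its leading bits as a bucket number. The function name `route`, the default of 8 prefix bits, and the notion of "index buckets" are assumptions made for this example; the specification does not fix how the symbol portion is chosen.

```python
import hashlib

def route(symbol: bytes, prefix_bits: int = 8) -> int:
    """Return the leading prefix_bits of the symbol as an integer bucket number."""
    value = int.from_bytes(symbol, "big")
    return value >> (len(symbol) * 8 - prefix_bits)

symbol = hashlib.sha1(b"some data block").digest()

# With 8 prefix bits there are 256 hypothetical index buckets.
bucket = route(symbol)

# A longer prefix gives a finer-grained route within the same bucket.
finer = route(symbol, 16)
```

A lookup then needs to consult only the bucket selected by the prefix, and the full symbol is compared only within that bucket.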

Disclosed herein are a system and method for a hash file system in which the symbols allow for the identification of redundant copies within the system and/or the identification of copies within the system that duplicate data presented to the system for storage. The symbols allow redundant copies of the data, and/or of portions of data within the system or within data presented to the system, to be screened out without loss of data integrity, and allow data to be distributed evenly across the available storage capacity of the system. The system and method of the present invention as disclosed herein require no central operating point, and they balance the processing and/or input/output ("I/O") load across all computers, supercomputers, or other devices capable of storing and processing data presented to the system. The screening of redundant copies of data and/or portions of data provided herein enables the creation, iterative generation, or maintenance of intelligent boundaries for screening other data in the system, future data presented to the system, or future data stored by the system.

The present invention is illustrated and described in terms of a distributed computing environment, such as an enterprise computing system using public communication channels such as the Internet. An important feature of the present invention, however, is that it is readily scaled upward and downward to meet the needs of a particular application. Accordingly, unless specified to the contrary, the present invention is applicable to very large, complex network environments as well as to small network environments such as conventional LAN systems.

Referring to FIG. 1, the present invention may be used in conjunction with a novel data storage system on a network 10. In this figure, an exemplary internetwork environment 10 may include the Internet, which comprises a global internetwork formed by logical and physical connections between multiple WANs 14 and LANs 16. An Internet backbone 12 represents the main lines and routers that carry data traffic. The backbone 12 is formed by the largest networks in the system, which are operated by major Internet service providers ("ISPs") such as, for example, GTE, MCI, Sprint, UUNet, and America Online. While single connection lines are used for convenience to illustrate the connections of the WANs 14 and LANs 16 to the Internet backbone 12, it should be understood that in reality multi-path, routable physical connections exist between multiple WANs 14 and LANs 16. This makes the internetwork 10 robust when faced with single or multiple failures.

It is important to distinguish network connections from internal data pathways implemented between peripheral devices within a computer. A "network" comprises a general-purpose system, usually made up of switched physical connections, that enables logical connections between processes operating on nodes 18. The physical connections implemented by a network are typically independent of the logical connections that are established between processes using the network. In this manner, a heterogeneous set of processes ranging from file transfer to mail transfer and the like can use the same physical network. Conversely, the network can be formed from a heterogeneous set of physical network technologies that are invisible to the logically connected processes using the network. Because the logical connection between processes implemented by a network is independent of the physical connection, internetworks are readily scaled to a virtually unlimited number of nodes over long distances.

In contrast, internal data pathways such as a system bus, Peripheral Component Interconnect ("PCI") bus, Intelligent Drive Electronics ("IDE") bus, Small Computer System Interface ("SCSI") bus, and the like define physical connections that implement special-purpose connections within a computer system. These connections implement physical connections between physical devices, as opposed to logical connections between processes. Such physical connections are characterized by a limited distance between components, a limited number of devices that can be coupled to the connection, and constrained formats of the devices that can be connected over the connection.

In a particular implementation of the present invention, storage devices may be placed at the nodes 18. The storage at any node 18 may comprise a single hard drive, or may comprise a managed storage system such as a conventional RAID device having multiple hard drives configured as a single logical volume. Significantly, the present invention manages redundancy operations across nodes, as opposed to within nodes, so that the specific configuration of the storage within any given node is less relevant.

Optionally, one or more of the nodes 18 may implement storage allocation management ("SAM") processes that manage data storage across the nodes 18 in a distributed, collaborative fashion. The SAM processes preferably operate with little or no centralized control over the system as a whole. The SAM processes provide data distribution across the nodes 18 and implement recovery in a fault-tolerant fashion across the network nodes 18, in a manner similar to the paradigms found in RAID storage subsystems.

However, because the SAM processes operate across nodes rather than within a single node or a single computer, they allow for greater fault tolerance and greater levels of storage efficiency than conventional RAID systems. For example, the SAM processes can recover even when a network node 18, LAN 16, or WAN 14 becomes unavailable. Moreover, even when a portion of the Internet backbone 12 becomes unavailable through failure or congestion, the SAM processes can recover using data distributed on the nodes 18 that remain accessible. In this manner, the present invention leverages the robust nature of internetworks to provide unprecedented availability, reliability, fault tolerance, and robustness.

Referring to FIG. 2, a more detailed conceptual view of an exemplary network computing environment in which the present invention is implemented is shown. The internetwork 10 of the preceding figure (or the Internet 118 in this figure) enables an interconnected network 100 of a heterogeneous set of computing devices and mechanisms 102, ranging from a supercomputer or data center 104 to a handheld or pen-based device 114. While such devices have disparate data storage needs, they share the ability to retrieve data via the network 100 and to operate on that data within their own resources. Disparate computing devices, including mainframe computers (e.g., VAX station 106 and IBM AS/400 station 116) as well as personal computer or workstation-class devices such as IBM-compatible device 108, Macintosh device 110, and laptop computer 112, are readily interconnected via the internetwork 10 and the network 100. Although not illustrated, mobile and other wireless devices may also be coupled to the internetwork 10.

The Internet-based network 120 comprises a set of logical connections, some of which are made through the Internet 118, between a plurality of internal networks 122. Conceptually, the Internet-based network 120 is akin to a WAN 14 (FIG. 1) in that it enables logical connections between geographically distant nodes. The Internet-based network 120 may be implemented using the Internet 118 or other public and private WAN technologies, including leased lines, Fibre Channel, and the like.

Similarly, the internal networks 122 are conceptually akin to LANs 16 in that they enable logical connections across more limited distances than a WAN 14. The internal networks 122 may be implemented using various LAN technologies, including Ethernet, Fiber Distributed Data Interface ("FDDI"), Token Ring, AppleTalk, Fibre Channel, and the like.

Each internal network 122 connects one or more redundant arrays of independent nodes ("RAIN") elements 124 to implement the RAIN nodes 18 (see FIG. 1). Each RAIN element 124 comprises a processor, memory, and one or more mass storage devices such as hard disks. The RAIN elements 124 also include hard disk controllers, which may be conventional IDE or SCSI controllers, or managing controllers such as RAID controllers. The RAIN elements 124 may be physically dispersed, or co-located in one or more racks sharing resources such as cooling and power. Each node 18 is independent of the other nodes 18 in that the failure or unavailability of one node 18 does not affect the availability of the other nodes 18, and data stored on one node 18 may be reconstructed from data stored on other nodes 18.

In one particular embodiment, the RAIN elements 124 may comprise computers built from commodity components, such as an Intel-based microprocessor mounted on a motherboard supporting a PCI bus, with 256 megabytes of random access memory ("RAM"), housed in a conventional AT or ATX case. SCSI or IDE controllers may be implemented on the motherboard or by expansion cards connected to the PCI bus. Where the controllers are implemented only on the motherboard, a PCI expansion bus may optionally still be used. In a particular implementation, the motherboard may include two EIDE channels together with a PCI expansion card used to implement two additional mastering EIDE channels, so that each RAIN element 124 includes four or more EIDE hard disks. In a particular implementation, each hard disk may be an 80 gigabyte hard disk, for a total storage capacity of 320 gigabytes or more per RAIN element. The hard disk capacity and configuration within the RAIN elements 124 can be readily increased or decreased to meet the needs of a particular application. The case also houses supporting mechanisms such as power supplies and cooling devices (not shown).

Each RAIN element 124 executes an operating system. In a particular implementation, UNIX or a UNIX-variant operating system such as Linux may be used. It is contemplated, however, that other operating systems, including DOS, Microsoft Windows, Apple Macintosh OS, OS/2, Microsoft Windows NT, and the like, may be equivalently substituted, with predictable changes in performance. The operating system chosen forms a platform for executing application software and processes, and implements a file system for accessing mass storage via the hard disk controller(s). Various application software and processes can be implemented on each RAIN element 124 to provide network connectivity via a network interface using appropriate network protocols such as User Datagram Protocol ("UDP"), Transmission Control Protocol ("TCP"), Internet Protocol ("IP"), and the like.

Referring to FIG. 3, a logic flowchart is shown depicting the steps in the entry of a computer file into the hash file system of the present invention, wherein the hash value for the file is checked against the hash values for files previously maintained in a set or database.

The process 200 begins with the entry of computer file data 202 (e.g., "File A") into the hash file system ("HFS") of the present invention, upon which a hash function is performed at step 204. The data 206 representing the hash of File A is then compared, at decision step 208, to the contents of a set containing hash file values. If the data 206 representing the hash of File A is already in the set, the file's hash value is added to a directory list at step 210. The contents of the set 212, comprising hash values and corresponding data, are provided in the form of existing hash values 214 for the comparison operation of decision step 208. On the other hand, if the hash value for File A is not currently in the set, the file is broken into hashed pieces at step 216 (as described more fully hereinafter).
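
By way of illustration only, the decision of process 200 may be sketched in Python as follows. The function name `enter_file`, the use of a dictionary for the set 212, the use of SHA-1 as the hash function of step 204, and the returned status strings are all assumptions of this sketch and are not taken from the figures.

```python
import hashlib

def enter_file(data: bytes, known: dict, directory_list: list) -> str:
    """Sketch of process 200: hash the file, then check the hash against the set."""
    file_hash = hashlib.sha1(data).hexdigest()   # step 204: perform the hash function
    if file_hash in known:                       # decision step 208: already in the set?
        directory_list.append(file_hash)         # step 210: add hash to the directory list
        return "already-stored"
    return "needs-splitting"                     # step 216 (splitting) would follow

known_hashes = {}      # stands in for the set 212 (hash value -> data)
directory = []         # stands in for the directory list
file_a = b"contents of File A"

first = enter_file(file_a, known_hashes, directory)
known_hashes[hashlib.sha1(file_a).hexdigest()] = file_a   # pretend step 216 onward stored it
second = enter_file(file_a, known_hashes, directory)
print(first, second, len(directory))
```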

Referring to FIG. 4, a logic flowchart is shown depicting the steps in a process 300 for breaking a digital sequence (e.g., a file or other data sequence) into hashed pieces. This process 300 ultimately results in the production of a number of data pieces as well as a corresponding, probabilistically unique hash value for each piece.

The file data 302 is divided into pieces at step 304 based on commonality with other pieces in the system, or on the likelihood that the pieces will be found to be in common in the future. In the representative example shown, the result of the operation of step 304 upon the file data 302 is the production of five file pieces 306, denominated A1 through A5.

Each of the file pieces 306 is then operated on at step 308 by putting it through an individual hash function operation, which assigns a probabilistically unique number to each of the pieces 306 (A1 through A5). The result of the operation at step 308 is that each of the pieces 306 (A1 through A5) has an associated, probabilistically unique hash value 310 (shown as A1 Hash through A5 Hash, respectively). The file division process of step 304 is described in greater detail hereinafter in conjunction with the unique "sticky byte" operation disclosed herein.
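
Steps 304 and 308 may be sketched, purely for illustration, as follows. The division rule of step 304 is not reproduced here; a fixed-size split (the hypothetical helper `split_fixed`) is substituted solely so that the example is runnable, and SHA-1 is assumed as the hash function of step 308.

```python
import hashlib

def split_fixed(data: bytes, size: int) -> list:
    """Stand-in for step 304: a plain fixed-size split, NOT the commonality-based division."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def hash_pieces(pieces: list) -> list:
    """Step 308: put each piece through its own hash function operation."""
    return [hashlib.sha1(piece).hexdigest() for piece in pieces]

file_data = b"A1A2A3A4A5"
pieces = split_fixed(file_data, 2)       # five pieces, like A1 through A5
piece_hashes = hash_pieces(pieces)       # one hash value per piece

for piece, digest in zip(pieces, piece_hashes):
    print(piece, digest)
```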

Referring to FIG. 5, another logic flowchart is shown depicting a process 400 for comparing the hash values 310 of each piece 306 of the file with the existing hash values 214 maintained in the set 212. In particular, at step 402, the hash values 310 for each piece 306 of the file are compared with the existing hash values 214, and the new hash values 408 and corresponding new data pieces 406 are added to the set 212. In this fashion, hash values 408 not previously present in the database set 212 are added together with their associated data pieces 406. The process 400 also results in the production of a record 404 showing the equivalence of a single hash value for all of the file pieces with the hash values 310 of the various pieces 306.
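
The following Python fragment is a minimal sketch of process 400 under stated assumptions: the set 212 is modeled as a dictionary, SHA-1 is the hash function, and the single file-level hash value in the record is assumed to be the hash of the concatenated pieces. The names `store_file` and `sha1_hex` are hypothetical.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def store_file(pieces: list, store: dict) -> dict:
    """Sketch of process 400: add only unseen pieces, then build a record for the file."""
    piece_hashes = []
    for piece in pieces:
        h = sha1_hex(piece)
        if h not in store:          # step 402: compare with existing hash values
            store[h] = piece        # new hash value and its new data piece are added
        piece_hashes.append(h)
    # Record: one hash value for the whole file alongside the hashes of its pieces.
    return {"file": sha1_hex(b"".join(pieces)), "pieces": piece_hashes}

store = {}
record_1 = store_file([b"A1", b"A2", b"A3"], store)
size_after_first = len(store)
record_2 = store_file([b"A1", b"A2-b", b"A3"], store)   # only the middle piece differs
size_after_second = len(store)
print(size_after_first, size_after_second)
```

In this sketch the second file adds a single new entry to the store, because two of its three pieces are already present.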

Referring to FIG. 6, yet another logic flowchart is shown illustrating a process 500 for comparing a file hash or directory list hash value with the existing directory list hash values and adding new file or directory list hash values to the database directory list. The process 500 operates on stored data 502, which comprises an accumulated list of the file names within a directory, file metadata (e.g., date, time, file length, file type, etc.), and the file's hash value for each item. At step 504, the hash function is run on the contents of the directory list. Decision step 506 determines whether the hash value for the directory list is in the set 212 of existing hash values 214. If it is, the process 500 returns to add another file hash or directory list hash to a directory list. Alternatively, if the hash value for the directory list is not already in the database set 212, the hash value and data for the directory list are added to the database set 212 at step 508.
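
One possible, non-authoritative reading of process 500 is sketched below. The serialization of the directory list (sorted JSON), the field names in each entry, and the function names are assumptions introduced for this example; the source does not specify how the directory list contents are encoded before hashing.

```python
import hashlib
import json

def directory_list_hash(entries: list):
    """Step 504: hash the contents of the directory list (hypothetical JSON encoding)."""
    ordered = sorted(entries, key=lambda e: e["name"])
    encoded = json.dumps(ordered, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest(), encoded

def add_directory_list(entries: list, store: dict):
    """Steps 506 and 508: add the list and its hash only if the hash is not yet present."""
    h, encoded = directory_list_hash(entries)
    if h in store:                 # decision step 506
        return h, False
    store[h] = encoded             # step 508
    return h, True

entries = [
    {"name": "X.doc", "length": 120, "hash": hashlib.sha1(b"x").hexdigest()},
    {"name": "Y.doc", "length": 340, "hash": hashlib.sha1(b"y").hexdigest()},
]
store = {}
hash_1, added_1 = add_directory_list(entries, store)
hash_2, added_2 = add_directory_list(list(reversed(entries)), store)
print(added_1, added_2, hash_1 == hash_2)
```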

Referring to FIG. 7, a comparison of the pieces 306 of a representative computer file (i.e., "File A") with their corresponding hash values 310 is shown both before and after editing of a particular piece of the exemplary file. In this example, the record 404 contains the hash value of File A as well as the hash values 310 of each of the pieces of the file, A1 through A5. A representative edit of File A may produce a change in the data for piece A2 (now represented by "A2-b") of the file pieces 306A, along with a corresponding change in the hash value A2-b of the hash values 310A. The edited file piece produces an updated record 404A that includes the modified hash value of File A and the modified hash value of piece A2-b.

Referring to FIG. 8, a conceptual representation 700 is shown illustrating the fact that composite data (such as composite data 702 and 704) derived by means of the system and method of the present invention is effectively the same as the data 706 represented explicitly, but is instead created by a "recipe" or formula. In the example shown, this recipe includes the concatenation of the data represented by its corresponding hashes 708, or the result of a function using the data represented by the hashes. As shown, the data blocks 706 are of variable length, and the hash values 708 are derived from their associated data blocks. As previously stated, the hash values 708 are probabilistically unique identifiers of the corresponding data pieces, but truly unique identifiers can be used instead of, or together with, them. Furthermore, the composite data 702, 704 can reference other composite data many levels deep, and the hash values 708 for the composite data can be derived from the value of the data the recipe creates or from the hash value of the recipe itself.

Referring to FIG. 9, another conceptual representation 800 is shown of how the hash file system and method of the present invention may be used to organize data 802 so as to optimize the reuse of redundant sequences, through the use of hash values 806 as pointers to the data they represent, where the data 802 may be represented either as explicit byte sequences (atomic data) 808 or as groups of sequences (composites) 804.

The representation 800 illustrates the tremendous commonality of recipes and data that are reused at every level. The basic structure of the hash file system of the present invention is essentially that of a "tree" or "bush" in which the hash values 806 are used in place of conventional pointers. The hash values 806 are used in the recipes to point to data, or to other hash values that may themselves be recipes. In essence, then, recipes can point to other recipes, which point to still other recipes, which ultimately point to some specific data; that data may itself point to other recipes that point to even more data, eventually getting down to nothing but data.
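
The tree-like structure described above can be sketched, as a simplified and non-limiting example, with a dictionary in which hash values act as pointers. Here a recipe is assumed to be a plain concatenation of its children, and the hash of a composite is assumed to be derived from the recipe itself (the list of child hashes); the tuple layout and the names `put_atomic`, `put_composite`, and `resolve` are hypothetical.

```python
import hashlib

def _sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def put_atomic(store: dict, data: bytes) -> str:
    """Store an explicit byte sequence (atomic data) under its own hash."""
    h = _sha1(data)
    store[h] = ("atomic", data)
    return h

def put_composite(store: dict, child_hashes: list) -> str:
    """Store a recipe: a list of hashes whose data is to be concatenated."""
    children = list(child_hashes)
    h = _sha1("".join(children).encode("ascii"))   # hash derived from the recipe itself
    store[h] = ("composite", children)
    return h

def resolve(store: dict, h: str) -> bytes:
    """Follow hash 'pointers' recursively until nothing but data remains."""
    kind, payload = store[h]
    if kind == "atomic":
        return payload
    return b"".join(resolve(store, child) for child in payload)

store = {}
h1 = put_atomic(store, b"hello ")
h2 = put_atomic(store, b"world")
inner = put_composite(store, [h1, h2])
outer = put_composite(store, [inner, h1, h2])    # a recipe that references another recipe
print(resolve(store, outer))
```

Note that the atomic blocks are stored once even though the outer recipe reaches each of them twice.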

Referring to FIG. 10, a simplified diagram 900 is shown illustrating a hash file system address translation function for an exemplary 160-bit hash value 902. The hash value 902 comprises a data structure having a front portion 904 and a back portion 906 as shown, and the diagram 900 illustrates a particular "0 or 1" operation that enables the hash value 902 to be used to go to the location of the particular node in the system that contains the corresponding data.

The diagram 900 illustrates how the front portion 904 of the hash value 902 data structure may be used to map the hash prefix to a stripe identification ("ID") 908, and how that, in turn, is used to map the stripe ID to an IP address and the ID class to an IP address 910. In this example, "S2" indicates stripe 2 of index node 37, indicated by reference numeral 912. The index stripe 912 of node 37 then indicates stripe 88 of data node 73, indicated by reference numeral 914. In operation, then, one portion of the hash value 902 itself may be used to indicate which node in the system contains the relevant data, another portion of the hash value 902 may be used to indicate which stripe of data at that particular node, and yet another portion of the hash value 902 may be used to indicate where within that stripe the data resides. Through this three-step process, it can rapidly be determined whether the data represented by the hash value 902 is already present in the system.
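
A hedged sketch of this kind of lookup is given below. It assumes a four-bit hash prefix, in-memory dictionaries in place of the index and data nodes, and invented node names; none of these specifics come from FIG. 10, and the sketch collapses the three steps into two dictionary lookups for brevity.

```python
import hashlib

def hash_prefix(digest: bytes, bits: int) -> int:
    """Return the leading `bits` bits of a digest as an integer."""
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)

def locate(digest: bytes, prefix_bits: int, stripe_to_index_node: dict, index_stripes: dict):
    """Resolve a hash value to (index node, data location or None)."""
    stripe_id = hash_prefix(digest, prefix_bits)       # hash prefix -> stripe ID
    index_node = stripe_to_index_node[stripe_id]       # stripe ID -> index node address
    location = index_stripes[stripe_id].get(digest)    # index stripe -> (data node, data stripe)
    return index_node, location

PREFIX_BITS = 4  # hypothetical prefix width, giving 16 index stripes
stripe_to_index_node = {s: f"index-node-{s % 3}" for s in range(2 ** PREFIX_BITS)}
index_stripes = {s: {} for s in range(2 ** PREFIX_BITS)}

stored = hashlib.sha1(b"some stored block").digest()   # a 160-bit hash value
index_stripes[hash_prefix(stored, PREFIX_BITS)][stored] = ("data-node-73", 88)
missing = hashlib.sha1(b"never stored").digest()

print(locate(stored, PREFIX_BITS, stripe_to_index_node, index_stripes))
print(locate(missing, PREFIX_BITS, stripe_to_index_node, index_stripes))
```

Because the stripe is chosen from the hash value alone, any participant can compute where to ask about a block without consulting a central directory.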

Referring to FIG. 11, a simplified exemplary illustration of an index stripe splitting function for use with the system and method of the present invention is shown. In this illustration, an exemplary function 1000 is shown that may be used to effectively split a stripe 1002 (S2) into two stripes 1004 (S2) and 1006 (S7) when a stripe becomes too full. In this example, the odd entries have been moved to stripe 1006 (S7), while the even entries remain in stripe 1004. The function 1000 is one example of how stripe entries may be handled as the overall system grows in size and complexity.
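
As a simple illustration of such a split, the fragment below divides the entries of one stripe by the parity of their keys. Treating each entry's key as an integer, and the names `split_stripe`, `s2`, and `s7`, are assumptions of this example only.

```python
def split_stripe(entries: dict) -> tuple:
    """Split one index stripe in two: even keys stay, odd keys move to the new stripe."""
    staying = {k: v for k, v in entries.items() if k % 2 == 0}
    moving = {k: v for k, v in entries.items() if k % 2 == 1}
    return staying, moving

# Hypothetical stripe S2 that has become too full.
s2 = {10: "loc-a", 11: "loc-b", 12: "loc-c", 13: "loc-d"}
s2_after, s7 = split_stripe(s2)
print(s2_after, s7)
```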

Referring to FIG. 12, a simplified illustration of the overall functionality of the system and method of the present invention is shown for use in the backup of data for a representative home computer having a number of program and document files 1102A and 1104A on Day 1, wherein the program files 1102B remain the same on Day 2, while one of the document files 1104B (Y.doc) is edited on Day 2 and a third document file (Z.doc) is added.

The illustration 1100 shows the details of how a computer file system may be broken into pieces and then listed as a series of recipes on a global data protection network ("gDPN") so that the original data can be reconstructed from the pieces. A very small computer system is shown in the form of a "snapshot" on "Day 1" and then subsequently on "Day 2". On "Day 1", the "program files H5" and "my documents H6" are indicated by reference numeral 1106, with the former being represented by a recipe 1108 in which a first executable file is represented by a hash value H1 1114 and a second executable file is represented by a hash value H2 1112. The document files are represented by hash value H6 1110, with the first document represented by hash value H3 1118 and the second by hash value H4 1116. Thereafter, on "Day 2", the "program files H5" and "my documents H10" indicated by reference numeral 1120 show that the "program files H5" have not changed, but that "my documents H10" have. H10, indicated by reference numeral 1122, shows that "X.doc" is still represented by hash value H3, while "Y.doc" is now represented by hash value H8 at reference numeral 1124. The new document file "Z.doc" is represented by hash value H9 at reference numeral 1126.

In this example, some files changed on Day 2 while others did not. Within the changed files, some fragments changed and others did not. With the hash file system of the present invention, a "snapshot" of the computer system can be taken on Day 1, producing the recipes needed to reconstruct the computer files as they exist at that time. On Day 2, a snapshot is taken by reusing some of the previous day's recipes, reformulating others, and adding new ones to describe the system at that time. In this manner, the files of the computer system can be recreated as they existed at any point in time: on Day 1, on Day 2, or on any subsequent day.
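The reuse of recipes between the two snapshots can be sketched as follows. This is a minimal illustration and not the patented implementation: the `snapshot` and `recipe` helpers, the "name hash" recipe format, the file contents, and the executable names `A.exe` and `B.exe` are all assumptions made for the example. SHA-1 is used only because it is one of the algorithms the description names.

```python
import hashlib

def h(data: bytes) -> str:
    """Content hash of a file or recipe (SHA-1 chosen for illustration)."""
    return hashlib.sha1(data).hexdigest()

def recipe(children: dict) -> bytes:
    """Hypothetical recipe format: one sorted 'name hash' line per child."""
    lines = [f"{name} {digest}" for name, digest in sorted(children.items())]
    return "\n".join(lines).encode()

def snapshot(tree: dict, store: dict) -> str:
    """Store every file and recipe under its hash and return the root hash."""
    directories = {}
    for dirname, files in tree.items():
        entries = {}
        for fname, data in files.items():
            store[h(data)] = data          # unchanged data lands on the same key
            entries[fname] = h(data)
        dir_recipe = recipe(entries)
        store[h(dir_recipe)] = dir_recipe
        directories[dirname] = h(dir_recipe)
    root = recipe(directories)
    store[h(root)] = root
    return h(root)

store = {}
day1 = {
    "Program Files": {"A.exe": b"program A", "B.exe": b"program B"},
    "My Documents": {"X.doc": b"document X", "Y.doc": b"document Y, v1"},
}
root1 = snapshot(day1, store)
objects_after_day1 = len(store)

day2 = {
    "Program Files": day1["Program Files"],            # unchanged
    "My Documents": {"X.doc": b"document X",
                     "Y.doc": b"document Y, v2",        # edited
                     "Z.doc": b"document Z"},           # added
}
root2 = snapshot(day2, store)
new_objects = len(store) - objects_after_day1
```

In this sketch the Day 2 snapshot adds only the edited Y.doc, the new Z.doc, the changed "My Documents" recipe, and a new root recipe. The "Program Files" recipe and every unchanged file are reused, and both root hashes remain in the store, so either day can be reconstructed.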

Referring to FIG. 13, a comparison is shown of the fragments of a particular document file, marked by a number of "sticky bytes" 1204, before editing (Day 1, 1202A) and after editing (Day 2, 1202B), in which one of the fragments changes while the others remain the same.

For example, on Day 1, file 1202A comprises variable-length fragments 1206 (1.1), 1208 (1.2), 1210 (2.1), 1212 (2.2), 1214 (2.3), and 1216 (3.1). On Day 2, fragments 1206, 1208, 1210, 1214, and 1216 remain the same (and thus have the same hash values), whereas fragment 1212 has been edited to produce fragment 1212A (which thus has a different hash value).
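A short sketch of this comparison, assuming each fragment is identified by its SHA-1 digest: only fragments whose digest is not already known need to be stored on Day 2. The fragment contents and the `digest` helper are placeholders invented for the example; the dictionary keys are the reference numerals of FIG. 13.

```python
import hashlib

def digest(fragment: bytes) -> str:
    """Identifier of a fragment (SHA-1 chosen for illustration)."""
    return hashlib.sha1(fragment).hexdigest()

# Day 1 fragments, keyed by reference numeral (contents are placeholders).
day1 = {
    1206: b"alpha", 1208: b"bravo", 1210: b"charlie",
    1212: b"delta", 1214: b"echo", 1216: b"foxtrot",
}

# Day 2: only fragment 1212 is edited.
day2 = dict(day1)
day2[1212] = b"delta, edited"

already_stored = {digest(f) for f in day1.values()}
needs_storing = [ref for ref, f in day2.items()
                 if digest(f) not in already_stored]
```

Here `needs_storing` contains only 1212, matching the single edited fragment in the figure.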

Data sticky bytes (or "sticky points") are a unique, fully automated way to sub-divide computer files so that common elements can be found on multiple related and unrelated computers without any communication between those computers. The means by which data sticky points are found is completely mathematical in nature and performs equally well regardless of the data content of the files. In the hash file system of the present invention, all data objects may be indexed, stored, and retrieved using, for example (but not limited to), an industry standard checksum such as MD4, MD5, SHA, or SHA-1. If two files are computed to have the same checksum, they may be considered highly likely to be the same file. With the system and method disclosed herein, data sticky points may be produced with a standard mathematical distribution and with standard deviations that are a small percentage of the target size.
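Indexing, storing, and retrieving objects by checksum can be sketched with a small content-addressed store. The `HashStore` class and its method names are hypothetical and exist only for this illustration; `hashlib.sha1` is the standard-library SHA-1 implementation.

```python
import hashlib

class HashStore:
    """Toy content-addressed store: objects are keyed by their SHA-1 digest."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical content maps to the same key, so it is kept only once.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self):
        return len(self._objects)

store = HashStore()
key_a = store.put(b"shared library bytes")   # e.g. seen on one computer
key_b = store.put(b"shared library bytes")   # the same bytes seen on another
```

Because the key is derived only from the content, two computers that never communicate still compute the same key for the same bytes, which is the property that makes common elements discoverable.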

A data sticky point is a statistically infrequent arrangement of bytes. In this case, an example with 32 bytes is given because of its ease of implementation on current microprocessor technology.

A 32-bit rolling hash can be generated for the file "f" as follows.

// f[i] is the i-th byte of the file "f".
// scramble is a 256-entry array of integers, each 32 bits wide;
// those integers are typically chosen to uniformly span the range.
int t = 8; // target number of trailing zeros
int hash = 0;
int sticky_bits;
int number_of_bits;
for (int i = 0; i < filesize; i++) {
    hash = hash >> 1 | scramble[f[i]];
    // At every byte in the file, hash represents the rolling hash of the file.
    sticky_bits = (hash - 1) ^ hash;
    // sticky_bits is a variable whose number of ones corresponds to
    // the number of trailing zeros in "hash".
    number_of_bits = count_ones(sticky_bits);
    if (number_of_bits > t)
        output_sticky_point(i);
}

A sticky point is defined as a position at which the rolling hash, expressed in binary, has at least the target number of trailing zeros. Statistically speaking, this algorithm finds points spaced 2^t bytes apart on average, where t is the target number of trailing zeros. In this example t = 8, so sticky points are found on average 2^8 = 256 bytes apart.
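The definition above can be tried out with the following sketch. It is an illustration under stated assumptions and not the patent's exact computation: the `make_scramble` helper and its seed are invented here, and the sketch combines the shifted hash with the table entry using XOR rather than the OR shown in the listing, because repeatedly OR-ing random 32-bit values quickly saturates the low bits and leaves almost no trailing zeros to detect.

```python
import random

MASK32 = 0xFFFFFFFF

def make_scramble(seed=0):
    """Hypothetical helper: 256 pseudo-random 32-bit table entries."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(256)]

def count_ones(x):
    return bin(x).count("1")

def sticky_points(data, scramble, t=8):
    """Return the offsets whose rolling hash has at least t trailing zeros."""
    points = []
    h = 0
    for i, byte in enumerate(data):
        # Each byte's contribution is shifted out after 32 steps, so the
        # hash depends only on the most recent 32 bytes.
        h = ((h >> 1) ^ scramble[byte]) & MASK32   # XOR is an assumption
        # (h - 1) ^ h sets one bit per trailing zero of h, plus one more.
        sticky_bits = ((h - 1) ^ h) & MASK32
        if count_ones(sticky_bits) > t:
            points.append(i)
    return points
```

Because the hash at any offset depends only on the preceding 32 bytes, inserting bytes at the start of a file shifts the later sticky points by the same amount rather than moving them relative to the data, which is what lets unchanged fragments be recognized after an edit.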

A 32-bit rolling hash can be generated for the file f as follows.

// f[i] is the i-th byte of the file f.
// scramble is a 256-entry array of random elements, each n bits wide.
int t = 8; // target number of trailing zeros
int target_distance = 256; // 2 to the power of 8
int hash = 0;
int sticky_bits;
int number_of_bits;
int distance = 0;
int last_point = 0;
for (int i = 0; i < filesize; i++) {
    hash = hash >> 1 | scramble[f[i]];
    // At every byte in the file, hash represents the rolling hash of the file.
    sticky_bits = (hash - 1) ^ hash;
    // sticky_bits is a variable whose number of ones corresponds to
    // the number of trailing zeros in "hash".
    number_of_bits = count_ones(sticky_bits);
    distance = i - last_point;
    if (number_of_bits * distance / target_distance > t) {
        last_point = i;
        output_sticky_point(i);
    }
}
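A sketch of this distance-weighted variant follows, under the same assumptions as any illustration of these listings: the `make_scramble` helper is invented, XOR replaces the listing's OR so that the low bits of the hash do not saturate, and integer division mirrors C integer arithmetic. It is not presented as the patent's exact implementation.

```python
import random

MASK32 = 0xFFFFFFFF

def make_scramble(seed=0):
    """Hypothetical helper: 256 pseudo-random 32-bit table entries."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(256)]

def sticky_points_weighted(data, scramble, t=8, target_distance=256):
    """Sticky points whose threshold relaxes as the current fragment grows."""
    points = []
    h = 0
    last_point = 0
    for i, byte in enumerate(data):
        h = ((h >> 1) ^ scramble[byte]) & MASK32   # XOR is an assumption
        sticky_bits = ((h - 1) ^ h) & MASK32
        number_of_bits = bin(sticky_bits).count("1")
        distance = i - last_point
        # Integer arithmetic, as in the C-style listing.
        if number_of_bits * distance // target_distance > t:
            last_point = i
            points.append(i)
    return points
```

Weighting the bit count by the distance since the last point makes a cut unlikely soon after the previous one and increasingly likely as the fragment grows. With t = 8 and a target distance of 256, `number_of_bits` lies between 1 and 32, so in this sketch no fragment is shorter than 72 bytes or longer than 2304 bytes, which narrows the spread of fragment sizes around the target.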

The hash function used to implement the hash file system of the present invention requires a moderately complex computation, but one well within the capability of present-day computer systems. Hashing functions are inherently probabilistic, and any hashing function can produce incorrect results when two different data objects happen to have the same hash value. However, the system and method disclosed herein mitigate this problem by using well-known and well-studied hashing functions that reduce the probability of collision to a level acceptable for reliable use (i.e., on the order of one in trillions), far below the error rates otherwise tolerated in conventional computer hardware operation.
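To give a feel for the scale of the collision risk, the following sketch evaluates the standard union (birthday) bound, which says that among n objects with b-bit hashes the probability of any collision is at most n(n-1)/2 divided by 2^b. The function name and the object counts are assumptions chosen for illustration. The bound also assumes hash outputs behave as uniformly random values and that nobody is deliberately constructing colliding inputs.

```python
def collision_bound(num_objects: int, hash_bits: int) -> float:
    """Upper bound on the probability that any two objects share a hash."""
    pairs = num_objects * (num_objects - 1) // 2
    return pairs / 2 ** hash_bits

# A hypothetical store holding a trillion distinct fragments.
bound_sha1 = collision_bound(10 ** 12, 160)   # 160-bit digest, e.g. SHA-1
bound_md5 = collision_bound(10 ** 12, 128)    # 128-bit digest, e.g. MD5
```

Under these assumptions the bound for a 160-bit digest is below 10^-24 even for a trillion stored fragments, which is the sense in which accidental collisions are far rarer than ordinary hardware errors.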

As used herein, the term "Internet infrastructure" encompasses a variety of hardware and software mechanisms, but refers primarily to routers, router software, and the physical links between those routers that function to transport data packets from one network node to another. As also used herein, a "digital sequence" may comprise, without limitation, computer program files, computer applications, data files, network packets, streaming data such as multimedia (including audio and video), telemetry data, and any other form of data that can be represented by a digital or numeric sequence. The probabilistically unique identifiers produced by the hash file system and method of the present invention may also be used as URLs in network applications.

While the principles of the present invention have been described above in conjunction with specific operations and system configurations, it is to be clearly understood that the foregoing description is made only by way of example and not as a limitation on the scope of the invention. In particular, it is recognized that the teachings of the foregoing disclosure will suggest other modifications to those skilled in the art. Such modifications may involve other features that are already known per se and that may be used instead of or in addition to features already described herein. Although claims have been formulated in this application to particular combinations of features, the scope of the disclosure herein also includes any novel feature or any novel combination of features disclosed either explicitly or implicitly, or any generalization or modification thereof that would be apparent to those skilled in the art, whether or not it relates to the same invention as presently claimed in any claim and whether or not it mitigates any or all of the same technical problems as confronted by the present invention. The applicant reserves the right to formulate new claims to such features and combinations of such features during the prosecution of the present application or of any further application derived therefrom.

Claims

delete

A computing system comprising:

at least one list for maintaining portions of digital sequences and corresponding probabilistic unique identifiers for each of the portions of digital sequences; and

at least one processing unit,

wherein the at least one processing unit is configured to:

receive at least one new digital sequence;

perform at least one partitioning mechanism for splitting the new digital sequence into a plurality of shorter digital sequences and generating a probabilistic unique identifier for each of the shorter digital sequences; and

perform a comparison mechanism to determine whether any one of the probabilistic unique identifiers for each of the plurality of shorter digital sequences is currently maintained in the list.

delete

27. The computing system of claim 25, wherein the at least one process unit is configured to perform the at least one partitioning mechanism, and the at least one list is connected by a network.

34. The computing system of claim 33 wherein said network comprises a public network, such as the Internet.

35. The computing system of claim 34 wherein said at least one partitioning mechanism and said at least one list are physically distributed.

27. The computing system of claim 25 wherein the probabilistic unique identifiers are generated by a hash function.

37. The computing system of claim 36 wherein said hash function comprises an industry standard digest algorithm.

38. The computing system of claim 37 wherein said hash function comprises an MD4, MD5, SHA or SHA-1 algorithm.

37. The computing system of claim 36 wherein the probabilistic unique identifiers are generated by a checksum.

26. The computing system of claim 25 wherein said digital sequences are of variable length.

26. The computing system of claim 25 wherein said digital sequences are of fixed length.

26. The computing system of claim 25 wherein the comparison mechanism is operative to use at least one portion of the probabilistic unique identifiers for each of the plurality of shorter digital sequences as a locator correlated with the list partitions.

The computing system of claim 25, wherein the digital sequence has at least one of the following features:

the digital sequence comprises a data file;

the digital sequence comprises a data stream;

the digital sequence comprises an executable file;

the digital sequence comprises a database record;

the digital sequence comprises a database index;

the digital sequence comprises a digital device image;

the digital sequence comprises a network packet; or

the digital sequence comprises a digitized analog signal.

delete

27. The computing system of claim 25, wherein at least one of the probabilistic unique identifiers, and the corresponding one of the plurality of shorter digital sequences, that is not determined to be maintained in the at least one list is added to the at least one list.

delete

A method of setting an identifier for at least one portion of a digital sequence, the method comprising:

performing a hashing function on the at least one portion of the digital sequence to produce a probabilistic unique code for the at least one portion of the digital sequence;

establishing a correspondence between the at least one portion of the digital sequence and the probabilistic unique code; and

using the probabilistic unique code as the identifier.

67. The method of claim 66, wherein the identifier and the at least one portion of the digital sequence corresponding to the identifier are maintained in at least one data list.

68. The method of claim 67, wherein at least one portion of the identifier is available as a pointer to the location of at least one portion of the digital sequence corresponding to the identifier in the at least one data list.

The method of claim 66, having at least one of the following features:

the at least one portion of the digital sequence comprises at least one portion of a data file, and the identifier indicates the contents of the at least one portion of the data file;

the at least one portion of the digital sequence comprises at least one portion of a data stream, and the identifier indicates the contents of the at least one portion of the data stream; or

the at least one portion of the digital sequence comprises at least one portion of an executable file, and the identifier indicates the contents of the at least one portion of the executable file.

delete

67. The method of claim 66, wherein the hashing function is performed by an industry standard digest algorithm.

74. The method of claim 73, wherein performing the hashing function is performed by one of an MD4, MD5, SHA, or SHA-1 algorithm.

A recording medium having embodied therein a computer readable program for setting an identifier for at least one portion of a digital sequence,

wherein the computer readable program includes a plurality of computer readable codes, and

wherein, when the computer readable codes are executed on a computer:

the computer performs a hashing function on the at least one portion of the digital sequence to produce a probabilistic unique code;

the computer establishes a correspondence between the at least one portion of the digital sequence and the probabilistic unique code; and

the computer uses the probabilistic unique code as the identifier.

76. The recording medium of claim 75, wherein the identifier and at least a portion of the digital sequence corresponding to the identifier are maintained in at least one data list.

77. The recording medium of claim 76, wherein at least one portion of the identifier is usable as a pointer to a location of the at least one portion of the digital sequence corresponding to the identifier in the at least one data list.

76. The recording medium of claim 75, having at least one of the following features:

the at least one portion of the digital sequence comprises at least one portion of a data file, and the identifier indicates the contents of the at least one portion of the data file; or

the at least one portion of the digital sequence comprises at least one portion of an executable file, and the identifier indicates the contents of the at least one portion of the executable file.

delete

76. The recording medium of claim 75, wherein the hashing function that the computer readable codes cause the computer to perform is performed by an industry standard digest algorithm.

83. The recording medium of claim 82, wherein the hashing function that the computer readable codes cause the computer to perform is performed by one of an MD4, MD5, SHA, or SHA-1 algorithm.

delete