CN101339494A - Hardware acceleration of commonality factoring on removable media - Google Patents

Info

Publication number
CN101339494A
Authority
CN
China
Prior art keywords
module
chunk
removable storage
hash
storage cassette
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008101305497A
Other languages
Chinese (zh)
Inventor
Matthew D. Bondurant
Steven W. Scroggs
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Prostor Systems Inc
Original Assignee
Prostor Systems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Prostor Systems Inc

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention describes systems and methods for commonality factoring for storing data on removable storage media. The systems and methods allow for highly compressed data to be stored in an efficient manner on portable memory devices. The methods include breaking data into unique chunks and calculating identifiers based on the unique chunks. Redundant chunks can be identified by calculating identifiers and comparing identifiers of other chunks to the identifiers of unique chunks previously calculated. When a redundant chunk is identified, a reference to the existing unique chunk is generated such that the chunk can be reconstituted in relation to other chunks in order to recreate the original data. The method further includes storing one or more of the unique chunks, the identifiers and/or the references on the removable storage medium. The acceleration hardware and/or software can reside in multiple devices, depending on the embodiment.

Description

Hardware Acceleration of Commonality Factoring on Removable Media
This application claims priority to the following co-pending applications: U.S. Provisional Application No. 60/948,387, filed July 6, 2007; U.S. Provisional Application No. 60/948,394, filed July 6, 2007; U.S. Application No. 12/167,867, filed July 3, 2008; and U.S. Application No. 12/167,872, filed July 3, 2008, the entire contents of which are expressly incorporated herein by reference.
Technical field
The present invention relates generally to data storage systems and, more particularly but not exclusively, to data storage systems that store information on removable media.
Background
Traditional backup involves a series of full, incremental, or differential backups that preserve multiple copies of identical or slowly changing data. This approach to backup leads to a high degree of data redundancy.
For many years, tape-based storage was the cheaper option, and a sizable gap existed between the price of tape-based and disk-based storage systems. Traditional data storage solutions have therefore been tape-based storage systems, which compress data using conventional algorithms with an average compression ratio of about 2:1. Advantageously, tape-based storage systems use removable tape cartridges that can be taken off-site for disaster recovery. However, recovering data from a tape-based storage system is slow, complex, and unreliable.
Data deduplication, also known as commonality factoring, is a process that reduces storage requirements by eliminating redundant data. Deduplication is used in disk-based data storage systems, where it greatly reduces the demand for disk space. However, disk-based data storage systems that use deduplication cannot easily export data to removable media. To export deduplicated data to removable media, the data must first be reconstituted into its original form and then recorded on removable tape cartridges, which requires more storage space than the deduplicated version.
Data deduplication is a resource-intensive process that is performed in software as part of a commonality factoring solution. Because the processing is computationally intensive, high-end multi-threaded, multi-core/multi-processor servers are used to provide sufficient performance for deduplication. The performance gained from multi-core/multi-processor servers depends on the algorithm used and on its software implementation. However, the overall cost and power consumption of such servers are high.
Summary of the invention
In various embodiments, systems and methods of commonality factoring for storing data on removable storage media are described. These systems and methods allow highly compressed data (for example, data compressed by an archiving or backup method that includes deduplication) to be stored efficiently on portable storage devices such as removable storage cartridges. The method includes breaking data (for example, data files to be backed up) into unique chunks and calculating identifiers (for example, hash identifiers) based on those unique chunks. Redundant chunks can be identified by calculating identifiers and comparing the identifiers of other chunks with the identifiers of previously calculated unique chunks. When a redundant chunk is identified, a reference to the existing unique chunk is generated so that the chunk can be reconstituted in relation to other chunks in order to recreate the original data. The method further includes storing one or more of the unique chunks, the identifiers, and/or the references on the removable storage medium.
In some aspects, hardware and/or software can be used to accelerate commonality factoring. Depending on the embodiment, the acceleration hardware and/or software can reside in multiple devices. For example, hardware and/or software for the chunking and/or hashing functions can reside in one or more of the host, the removable storage device, the removable cartridge holder (for example, a socket), and the removable storage cartridge.
In one embodiment, a commonality factoring system for storing data with a removable storage cartridge is disclosed. The system includes a processor, an expansion bus coupled to the processor, and a socket coupled to the expansion bus. The socket is configured to accept the removable storage cartridge. An expansion module is removably coupled to the expansion bus and is configured to transfer data to the removable storage cartridge. The expansion module includes a chunking module and a hashing module. The chunking module is configured to break an original data stream into a number of chunks. The hashing module is coupled to the chunking module in a pipeline fashion, such that at least part of the input to the hashing module is the output of the chunking module. The hashing module is configured to determine whether each chunk is unique and to forward chunks determined to be unique toward the removable storage cartridge.
In another embodiment, a commonality factoring method for storing data with a removable storage cartridge is disclosed. In one step, an original data stream is received at an expansion module that is removably coupled to a host. The expansion module includes a chunking module and a hashing module, arranged in a pipeline architecture such that at least part of the input to the hashing module is the output of the chunking module. At the chunking module, the original data stream is broken into a number of chunks, which are forwarded to the hashing module. The hashing module calculates an identifier for each forwarded chunk, stores the identifiers, and determines, based on the identifiers, whether each chunk is unique. At least one of the identifiers and the unique chunks is forwarded to the removable storage cartridge. The removable storage cartridge includes a storage drive.
In yet another embodiment, an expansion card for commonality factoring for storing data with a removable storage cartridge is disclosed. The expansion card includes a chunking module and a hashing module. The chunking module is configured to receive an original data stream from a host and to break it into a number of chunks. The expansion card is configured to be removably coupled to the host and to the removable storage cartridge and to store data on the removable storage cartridge. The hashing module is coupled to the chunking module in a pipeline fashion, such that at least part of the input to the hashing module is the output of the chunking module. The hashing module is configured to receive the chunks from the chunking module, calculate an identifier for each received chunk, determine based on the identifiers whether each chunk is unique, and store the unique chunks on the removable storage cartridge.
Further areas of applicability of the present disclosure will become apparent from the detailed description provided below. It should be understood that the detailed description and specific examples, while indicating various embodiments, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Brief description of the drawings
Fig. 1 shows a block diagram of an embodiment of a data storage system.
Fig. 2 shows a block diagram of an embodiment of a system for performing commonality factoring.
Fig. 3 shows a block diagram of an alternative embodiment of a system for performing commonality factoring.
Fig. 4 shows a block diagram of another alternative embodiment of a system for performing commonality factoring.
Fig. 5A, Fig. 5B, and Fig. 5C show schematic diagrams of alternative embodiments of a data storage system for performing commonality factoring.
Fig. 6 shows a flowchart of an example process for storing data on a removable data cartridge.
In the drawings, similar components and/or features may have the same reference label. Further, components of the same type may be distinguished by following the reference label with a dash and a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description applies to any one of the similar components having the same first reference label, irrespective of the second reference label.
Detailed description
The following description provides preferred exemplary embodiments only and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the following description of the preferred exemplary embodiments will provide those skilled in the art with an enabling description for implementing a preferred exemplary embodiment. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope set forth in the appended claims.
This disclosure relates generally to data storage systems for data backup, storage, and archiving applications. In particular, it relates to a new generation of removable storage cartridges that house a hard disk drive (HDD) as the storage medium. Throughout the specification, the storage medium is described as an HDD, but it should be understood that flash memory or a solid-state disk (SSD) drive could be used instead.
Embodiments of the invention are directed to a system that stores more data on a single storage cartridge than traditional Lempel-Ziv (LZ) compression methods allow. This is achieved by implementing commonality factoring (or deduplication). In particular, a system according to the invention accelerates the processing so that data reduction is performed at speeds competitive with Linear Tape-Open (LTO) tape drives without requiring a high-end server to do the processing.
According to one embodiment of the invention, an accelerated commonality factoring system for storing data with a storage cartridge is provided. The system includes a chunking module that breaks an original data stream into a number of chunks. Pipelining and lookup tables are used for optimization within the chunking module. The system also includes a hashing module that determines whether each chunk is unique or is a copy of any previously stored chunk. Parallelism is achieved by having the hashing module process the first byte of each chunk before the chunking module has processed the last byte of that chunk.
In this embodiment, the chunking module may include a section for Rabin fingerprinting or a section for computing a sliding-window checksum. Also in this embodiment, the hashing module may include one or more of a section for Message Digest algorithm 5 (MD5) hashing, a section for Secure Hash Algorithm 1 (SHA-1) hashing, and a section for Secure Hash Algorithm 2 (SHA-2) hashing.
According to another embodiment of the invention, another accelerated commonality factoring system for storing data with a storage cartridge is provided. The system includes the chunking and hashing modules described above, together with an additional data processing module.
In this embodiment, the additional data processing module may include one or more of a data compression module, an encryption module, and an error correction coding (ECC) module. The data compression module may include a section for performing the Lempel-Ziv-Stac (LZS) algorithm. The encryption module may include a section for performing the Triple Data Encryption Standard (3DES) algorithm, the Advanced Encryption Standard 128 (AES-128) algorithm, or the Advanced Encryption Standard 256 (AES-256) algorithm.
According to yet another embodiment of the invention, another accelerated commonality factoring system for storing data with a storage cartridge is provided. The system includes the chunking module, hashing module, and additional data processing module described above, and further includes a database search module that is followed by the additional data processing module. The database search module searches a chunk database based on the output of the hashing module and passes only unique chunks to the additional data processing module. The purpose is to reduce the bandwidth requirement of the additional data processing module.
According to yet another embodiment of the invention, another accelerated commonality factoring system for storing data with a storage cartridge is provided. The system includes the chunking module and its associated modules, arranged in multiple parallel data paths. The purpose is to accelerate commonality factoring.
In this embodiment, the multiple data paths may carry a single data stream that is split across multiple instances by truncating the data stream at positions that do not necessarily align with the chunk boundaries calculated by the chunking module. The truncated portions of the data stream may be fixed or variable in size.
Referring first to Fig. 1, an embodiment of a data storage system 100 is shown. The data storage system 100 may include a host 102 and a removable drive bay 104. The host 102 includes a processor and an expansion bus. The expansion bus is coupled to the processor and is configured to transfer data to the drive bay 104 over a standard interface. The removable drive bay 104 may include a removable cartridge device 110 and a removable cartridge holder 106. The host 102 may be communicatively coupled to the removable cartridge device 110. By way of example, the interface from the removable cartridge device 110 to the host 102 may be any version of the Small Computer System Interface (SCSI), a Fibre Channel (FC) interface, an Ethernet interface, an Advanced Technology Attachment (ATA) interface, or any other type of interface that allows the removable cartridge device 110 to communicate with the host 102. The cartridge holder 106 may be a plastic socket and a circuit board that can be physically mounted to the removable cartridge device 110. The cartridge holder 106 may also include an eject and lock mechanism. A removable storage cartridge 108 provides storage capacity for the data storage system 100 and is removably coupled to the removable cartridge device 110. The portable storage cartridge 108 may also be optionally locked in the cartridge holder 106. In an alternative embodiment, the host 102 may be communicatively coupled to the cartridge holder 106 through an interface cable 112.
As described in detail in several embodiments below, the commonality factoring function can be implemented as an expansion module in one or more of the following locations: 1) in the storage cartridge 108; 2) in the removable cartridge device 110 and external to the cartridge holder 106; and 3) in the host 102.
As explained above, the invention identifies duplicate portions of an original data stream that have been stored previously, so that a reference to the data portion can be stored in place of the duplicate portion itself. The process involves the following steps: (1) breaking the original data stream into small chunks (data portions) that can be analyzed for redundancy; (2) calculating an identifier for each chunk; (3) determining, by searching a database of identifiers, whether each chunk is unique in that an identical chunk has not been found among previous chunks; and (4) organizing the unique chunks, identifiers, and associated metadata so that the original data stream can be regenerated. The original data stream can represent any form of data, such as audio, video, or text, and can be a number of files or objects.
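The four steps can be sketched in software as follows. This is a minimal illustration and not the hardware implementation the specification describes: fixed-size chunking stands in for content-defined chunking, SHA-256 is an arbitrary choice of identifier, and the names `chunk_fixed`, `factor`, and `reconstitute` are hypothetical.

```python
import hashlib


def chunk_fixed(data: bytes, size: int = 4096) -> list:
    """Step 1 (simplified): fixed-size chunks stand in for content-defined chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def factor(data: bytes, size: int = 4096):
    """Return (store, references) for one data stream.

    store      maps identifier -> unique chunk (a stand-in for a chunk/ID database)
    references lists identifiers in stream order (a stand-in for a reference stream)
    """
    store = {}
    references = []
    for chunk in chunk_fixed(data, size):
        ident = hashlib.sha256(chunk).hexdigest()  # step 2: identifier per chunk
        if ident not in store:                     # step 3: search for the identifier
            store[ident] = chunk                   # keep only the first copy
        references.append(ident)                   # step 4: metadata for regeneration
    return store, references


def reconstitute(store: dict, references: list) -> bytes:
    """Regenerate the original stream from unique chunks and references."""
    return b"".join(store[ident] for ident in references)
```

For a stream made of five 4096-byte blocks in which four are identical, `factor` keeps two unique chunks and five references, and `reconstitute` returns the original bytes.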
Steps (1) and (2) of the above process are processor intensive and are therefore well suited to hardware acceleration. These steps can also be combined with other data modification steps, such as conventional data compression and encryption, as part of the overall data storage process. All of these steps are considered in an integrated fashion to maximize system throughput.
Rabin fingerprinting is a method of breaking an incoming data stream into smaller chunks of data that can be analyzed for redundancy. The method has tractable statistical properties that simpler methods, such as a rolling checksum, do not exhibit, but any chunking algorithm can be used in various embodiments. The method has been implemented in software as part of commonality factoring solutions, which may have been done to speed time to market for those products at the expense of cost and/or performance. Top-of-the-line multi-core/multi-processor servers are used to give the software algorithms adequate performance. Instead of implementing the method in software, one embodiment of the invention implements it in hardware, which provides increased performance at lower cost and lower power dissipation. Details of a pipelined hardware implementation of Rabin fingerprinting are described in U.S. Provisional Patent Application No. 60/948,394, filed July 6, 2007. By implementing the hardware in a pipelined fashion, high throughput can be obtained at reasonable clock rates with minimal logic.
Rabin fingerprinting is fundamentally a polynomial operation applied to the data stream one bit at a time. Because most systems work well with data aligned to 8-bit byte boundaries, the result of the polynomial operation is relevant only every eighth bit. Because the intermediate calculations are not of interest, the calculation can be optimized by computing each fingerprint value directly for the next 8 bits.
The Rabin fingerprint can be calculated over a sliding window of data (for example, 48 bytes in a buffer array). For each calculation, the oldest byte in the array is replaced by the newest byte. The first pipeline stage replaces the oldest byte with the newest byte and performs a lookup based on the oldest byte, which yields a value that can be used to remove the oldest byte's contribution from the fingerprint. The next pipeline stage takes this input, removes the oldest data from the fingerprint, and then combines the fingerprint with the new data using another lookup table to produce the new fingerprint. The final pipeline stage determines whether a portion of the new fingerprint matches a predetermined check value used to define chunk boundaries and verifies that the chunk size falls within the minimum/maximum range.
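The three stages can be mimicked in software with a rolling hash over a 48-byte window. The sketch below is a simplification under stated assumptions: it swaps in a multiplicative rolling hash modulo 2^32 for the true Rabin polynomial arithmetic, uses a single lookup table (for removing the oldest byte) where the text describes two, and all constants (`BASE`, `CHECK`, the size limits) are hypothetical values chosen for illustration.

```python
WINDOW = 48                  # sliding window length in bytes (from the text)
BASE = 16777619              # arbitrary odd multiplier (assumption)
MOD = 1 << 32                # fingerprint width of 32 bits (assumption)
BOUNDARY_BITS = 9            # top bits compared against the check value
SHIFT = 32 - BOUNDARY_BITS
CHECK = 0x1FF                # predetermined check value (assumption)
MIN_CHUNK = 256              # minimum chunk size (assumption)
MAX_CHUNK = 4096             # maximum chunk size (assumption)

# Lookup table: contribution of a byte that is about to leave the window.
_POW = pow(BASE, WINDOW - 1, MOD)
OUT_TABLE = [(b * _POW) % MOD for b in range(256)]


def split_chunks(data: bytes) -> list:
    """Split data at content-defined boundaries found by a rolling hash."""
    window = bytearray(WINDOW)   # circular buffer, initially all zero
    pos = 0                      # slot holding the oldest byte
    fp = 0                       # current fingerprint of the window
    start = 0                    # start offset of the chunk being built
    chunks = []
    for i, new in enumerate(data):
        # Stage 1: replace the oldest byte and look up its contribution.
        old = window[pos]
        window[pos] = new
        pos = (pos + 1) % WINDOW
        # Stage 2: remove the oldest byte, then fold in the newest byte.
        fp = ((fp - OUT_TABLE[old]) * BASE + new) % MOD
        # Stage 3: test the boundary condition and the size limits.
        size = i + 1 - start
        at_boundary = (fp >> SHIFT) == CHECK
        if size >= MAX_CHUNK or (size >= MIN_CHUNK and at_boundary):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because boundaries depend on the window contents rather than on absolute offsets, an insertion early in a stream tends to shift only nearby boundaries, which is the property that makes content-defined chunking useful for finding redundancy.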
The output of the chunking step, whether it uses Rabin fingerprinting or a simpler method such as a sliding-window checksum, is a sequence of data called a chunk, which can be analyzed to determine whether it has previously been stored by the storage system. One way to determine efficiently whether a chunk has been stored before is to compute a one-way function of the data, called a hash, which allows a determination, with a known statistical likelihood, of whether the data is a copy of any previously stored data. Many hash algorithms (such as MD5, SHA-1, and the SHA-2 family) can be used for this purpose. The goal is to select an algorithm whose probability of collision is statistically small enough that it can be assumed not to produce false matches. The hash algorithm should resist intentional or malicious attempts to cause such collisions. The hash algorithm should be secure; MD5 is not truly considered secure and SHA-1 has some potential weaknesses, but these weaknesses may not apply to some applications. The type of hash algorithm can be chosen according to the application. In addition, multiple hash algorithms can be used in some embodiments.
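To give a feel for what "statistically small enough" means, the standard birthday bound can be evaluated for different digest widths. The sketch below is illustrative only: the function name `collision_bound` and the example workload (1 PiB of data in 8 KiB chunks) are assumptions, and the bound covers accidental collisions of an ideal hash, not deliberate attacks.

```python
def collision_bound(num_chunks: int, hash_bits: int) -> float:
    """Birthday upper bound on the chance that any two chunks share a hash.

    There are n(n-1)/2 chunk pairs, and an ideal hash makes each pair
    collide with probability 2**-hash_bits.
    """
    return num_chunks * (num_chunks - 1) / 2 ** (hash_bits + 1)


# Hypothetical workload: 1 PiB stored as 8 KiB chunks gives 2**37 chunks.
n = 2 ** 50 // 2 ** 13
for name, bits in [("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
    print(f"{name}: accidental collision bound {collision_bound(n, bits):.3e}")
```

Even for the 128-bit digest the bound is far below typical hardware error rates, which is why the choice among algorithms is driven mainly by resistance to deliberate collisions.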
Referring next to Fig. 2, a block diagram of an embodiment of a system 200 for performing commonality factoring is shown. The system 200 includes a chunking module 202 coupled directly to a hashing module 204. In this embodiment, the chunking module 202 breaks the original data stream 206 into smaller chunks using the Rabin fingerprinting algorithm on a sliding window of the data stream 206. Other embodiments may use different methods and algorithms, such as a sliding-window checksum. Referring back to Fig. 1, the original data stream 206 can be supplied from different sources depending on the location of the expansion module, as discussed in the embodiments below. For example, in various embodiments the host 102, the removable cartridge device 110, or the storage cartridge 108 can forward the original data stream 206 to the chunking module 202.
The chunking module 202 outputs a sequence of data bytes called chunks 206-1, along with an indication 208 of whether a chunk boundary has been reached for each sequence of data bytes. This end-of-sequence indication 208 is also called an end of record, or EOR. It allows the EOR 208 and the data chunks 206-1 to stay synchronized as they pass to the hashing module 204. In this embodiment, the chunking module 202 is coupled to the hashing module 204 in a pipeline fashion, such that at least part of the input to the hashing module 204 is the output of the chunking module 202. In one embodiment, the hashing module 204 processes the first byte of each chunk in the sequence of data bytes 206-1 before the chunking module 202 has processed the last byte of the same chunk. Other embodiments may obtain a complete chunk from the chunking module 202 before running it through the hashing module 204.
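The pipelined hand-off can be illustrated with two generators, where the hashing stage consumes data as soon as the chunking stage emits it, guided by an EOR flag. This is a sketch under assumptions: the function names, the fixed-size boundaries, the 3-byte transfer unit, and the use of SHA-1 are all illustrative choices rather than details from the specification.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def chunk_stage(data: bytes, chunk_size: int = 8,
                piece: int = 3) -> Iterator[Tuple[bytes, bool]]:
    """Emit small pieces of data with an EOR flag on the piece that ends a chunk.

    Fixed-size boundaries are used only to keep the example short.
    """
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        for off in range(0, len(chunk), piece):
            part = chunk[off:off + piece]
            yield part, off + piece >= len(chunk)


def hash_stage(pieces: Iterable[Tuple[bytes, bool]]) -> Iterator[Tuple[str, int]]:
    """Hash each chunk incrementally, starting before its last byte arrives."""
    digest = hashlib.sha1()
    length = 0
    for part, eor in pieces:
        digest.update(part)       # work begins on the first piece of the chunk
        length += len(part)
        if eor:                   # EOR keeps data and boundaries synchronized
            yield digest.hexdigest(), length
            digest = hashlib.sha1()
            length = 0
```

Calling `list(hash_stage(chunk_stage(data)))` yields one (identifier, length) pair per chunk, and each identifier equals the hash of the complete chunk even though the chunk was never buffered as a whole.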
The hashing module 204 calculates an identifier for each chunk in the sequence of data bytes 206-1 and then determines whether the chunk is unique. The uniqueness determination can be performed by storing the identifiers in a database and searching that database for each identifier. When a chunk is found to be unique, the unique chunk and its identifier are stored in a chunk/ID database 220 on the removable storage cartridge 108. Table I shows an example of a stream of unique chunks and their identifiers stored in the chunk/ID database 220 on the removable storage cartridge 108.
[Table I: image A20081013054900161, not reproduced]
If a chunk is not unique, the redundant chunk is discarded and a reference to the existing unique chunk is created, so that the redundant chunk can be reconstituted in relation to other chunks to regenerate the original data stream 206. The reference to the existing unique chunk is then forwarded to the removable storage cartridge 108 for storage in a reference database 222. An example of a stream of references stored in the reference database 222 is shown in Table II. Other embodiments may include a separate module for determining chunk uniqueness. The hashing module 204 outputs a stream of unique chunks 206-2, the end-of-record indication 208, a stream of lengths 210 for each unique chunk, a stream of hash values 212 for each unique chunk, and a stream of references 214.
Referring to Fig. 3, a block diagram of an alternative embodiment of a system 300 for performing commonality factoring is shown. In system 300, an additional module 308 for data processing is coupled directly to the output of the combined chunking module 202 and hashing module 204. This helps optimize the data flow between the modules. The additional data processing module 308 includes a compression module 302 and an encryption module 304. The stream of unique chunks 206-2, the stream of references 214, and the end-of-record stream 208 from the hashing module 204 are sent to the compression module 302. The compression module 302 performs conventional data compression of the unique chunks 206-2 using, for example, the Lempel-Ziv-Stac (LZS) algorithm.
The compressed unique chunks 306, the references 214, and the end of record 208 are then sent to the encryption module 304. The encryption module 304 can use different algorithms, for example Triple Data Encryption Standard (3DES) or Advanced Encryption Standard 128/256 (AES-128/256). Other embodiments may also include an additional data processing module such as an error correction coding module (for example, Reed-Solomon). Outputs from the hashing module (for example, the lengths 210 and hash values 212 of the unique chunks) can optionally be passed to the compression module 302 and the encryption module 304, which provides a synchronized output for each module.
A system like the one shown in Fig. 3 has the advantage of reducing data traffic between the expansion module and main system memory (where the data probably resides when it is not being processed). Table III shows an example of the savings in bus bandwidth when LZS compression and commonality factoring each provide about a 2:1 data reduction. In this example, the final output data rate is denoted "D".
[Table III: image A20081013054900181, not reproduced]
As the example figures in Table III show, one advantage of full integration is the reduction in bandwidth from system memory to the hardware accelerator, which is nearly a factor of three compared with a non-integrated approach. Full integration also provides a better balance of bandwidth than a partially integrated approach, which is advantageous when using a bidirectional serial interface such as PCI Express.
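Table III is not reproduced here, so its exact rows are unknown. The following sketch shows one plausible way to account for the memory-to-accelerator traffic that is consistent with the "nearly a factor of three" figure. The stage-by-stage accounting is an assumption made for illustration (each non-integrated stage reads its input from system memory separately), and the function name is hypothetical.

```python
def accelerator_input_traffic(d: float = 1.0, cf_ratio: float = 2.0,
                              lzs_ratio: float = 2.0):
    """Return (non_integrated, fully_integrated) memory-to-accelerator traffic.

    d is the final output data rate "D"; each reduction step is about 2:1.
    """
    raw = d * cf_ratio * lzs_ratio    # 4D of original data
    unique = raw / cf_ratio           # 2D after commonality factoring
    compressed = unique / lzs_ratio   # D after LZS compression
    # Assumed non-integrated flow: chunking and hashing each read the raw
    # stream, compression reads unique chunks, encryption reads compressed data.
    non_integrated = raw + raw + unique + compressed
    # Assumed fully integrated flow: the raw stream crosses the bus once.
    fully_integrated = raw
    return non_integrated, fully_integrated


non_int, full = accelerator_input_traffic()
print(non_int, full, non_int / full)
```

Under these assumptions the non-integrated flow moves 11D toward the accelerator and the fully integrated flow moves 4D, a ratio of 2.75.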
Referring now to Fig. 4, a block diagram of yet another embodiment of a system 400 for performing commonality factoring is shown. The system 400 includes the chunking module 202 (not shown in this figure), a hashing module 204-1, and the additional data processing module 308. The hashing module 204-1 includes a hash calculator 402, a search module 404, and an identifier database 406. In this embodiment, the hash calculator 402 calculates an identifier for each chunk without determining whether the chunk is unique. The hash calculator 402 therefore outputs a sequence of data chunks 206-3, an end-of-record indication 208-1, the length 210-1 of each chunk, and the identifier 212-1 of each chunk. The search module 404 searches the database 406 based on the identifiers 212-1 output by the hash calculator 402 and sends the unique chunks 206-4 to the data processing module 308.
The search module 404 contains enough buffering to store the output data from the hash calculator 402 while it determines whether each chunk should be discarded or passed on to the rest of the data path. Placing the search module 404 inline in the data stream can halve the bandwidth requirement of the rest of the data path, assuming that on average only half of the chunks are unique in this embodiment. This can simplify the design of the remaining data processing module 308 and reduce the load on the interface between the expansion module and the rest of the system. The data processing module 308 may include a compression module and an encryption module. Other embodiments may also include an error correction coding module.
In some embodiments, it may not be practical to run a single instance of the chunking module 202 at a speed high enough to meet the bandwidth requirements of the system. In those cases, it can make sense to provide multiple instances of the chunking module 202 and its associated modules, creating multiple parallel data paths that increase total bandwidth (for example, various embodiments may have two, three, four, five, six, seven, eight, or more parallel data paths). When multiple input data streams 206 are received and processed, the streams can simply be mapped to the multiple instances of the data path. When a single data stream 206 needs more bandwidth than a single instance can provide, the data stream can be spread across multiple instances. The simplest method is to truncate the data stream 206 to one instance, as if the end of the data had been reached, and redirect the data to another instance.
This solution has the side effect of creating false chunk boundaries at the truncation points. The final chunk before a truncation is unlikely to match an existing chunk, because its boundary was determined in a different way than the boundary of an otherwise similar chunk that may already be stored. As long as the truncated portions of the data stream are large relative to the individual chunk size, this is unlikely to make a noticeable difference in commonality factoring efficiency. For example, if the truncated portions are 10 MB and the average chunk size is 8 KB, each portion contains about 1250 chunks. Only 2 of those chunks (the first and the last) are affected by the truncation, so efficiency is reduced by only 0.16% in one embodiment.
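The arithmetic behind the 0.16% figure can be checked directly. The helper below is a small sketch with a hypothetical name; it uses decimal units (10 MB as 10,000,000 bytes and 8 KB as 8,000 bytes), which is the reading that gives exactly 1250 chunks per portion.

```python
def truncation_loss(segment_bytes: int, avg_chunk_bytes: int,
                    affected_chunks: int = 2) -> float:
    """Fraction of chunks per truncated portion whose boundaries are disturbed."""
    chunks_per_segment = segment_bytes / avg_chunk_bytes
    return affected_chunks / chunks_per_segment


loss = truncation_loss(10_000_000, 8_000)   # 2 of 1250 chunks
print(f"{loss:.2%}")
```

Larger truncated portions or smaller average chunks reduce the loss further, since the two affected chunks are spread over more chunks per portion.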
Referring next to Figure 5A, a schematic diagram of an embodiment of a data storage system 500-1 for performing common factor decomposition is shown. In this embodiment, the common factor decomposition is implemented as an expansion module 502-1 within the storage cassette 508-1. The expansion module 502-1, also referred to as the chunk and hash (C/H) module, represents the primary engine for common factor decomposition in this embodiment. The expansion module 502-1 can be implemented in hardware, software, or a combination of the two. For a hardware implementation, the processing unit can be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microprocessors, other electronic units designed to perform the functions described above, and/or a combination thereof. Other embodiments can include only one or more portions of the C/H module in one location and the remaining portions in another location.
The original data stream is first sent from the host 102 to the tape frame 506-1 via any standard interface, such as SCSI (Small Computer System Interface), Serial ATA (Advanced Technology Attachment), Parallel ATA, SAS (Serial Attached SCSI), FireWire™, Ethernet, UWB, USB, Bluetooth™ or WiFi. The tape frame 506-1 can include electrical, optical and/or wireless interfaces for exchanging data and commands with the storage cassette 508-1. The interface between the tape frame 506-1 and the storage cassette 508-1 can also use standard interfaces similar to those above. Thus, the storage cassette 508-1 can be removably connected to the removable tape cartridge device 510-1 via the tape frame 506-1, by mating with the tape frame 506-1 through electrical, optical and/or wireless connectors.
In this embodiment, the expansion module 502-1 is incorporated into the removable storage cassette 508-1 itself, producing a self-contained, disk-based storage cassette. Some embodiments can use a processor in the storage cassette 508-1 to execute the C/H module. This processor can be inside or outside the hard disk drive, but in either case it is within the cassette. In one embodiment, a firmware upgrade to the hard disk drive enables the C/H functionality. In another embodiment, the C/H module is outside the hard disk drive, on a circuit card within the storage cassette 508-1.
Referring to Figure 5B, a schematic diagram of another embodiment of a data storage system 500-2 is shown. In this embodiment, the common factor decomposition is implemented as an expansion module 502-2 within the removable tape cartridge device 510-2. The data to be stored on the storage cassette 508-2 is first sent from the host 102 to the removable tape cartridge device 510-2 through any of the standard interfaces described in the previous embodiment. The original data stream then enters the expansion module 502-2 for processing at the level of the removable tape cartridge device. The deduplicated data is then sent to the storage cassette 508-2 via the tape frame 506-2, through standard interfaces similar to those described above.
Referring next to Figure 5C, a schematic diagram of an embodiment of a data storage system 500-3 for performing common factor decomposition with a storage cassette is shown. In this embodiment, the common factor decomposition is implemented as an expansion card 502-3 within the host 102-3. The host 102-3 includes an expansion bus that is connected to the host's processor. In various embodiments, the expansion card can be plugged into a computer expansion bus (for example, a PCI, ISA or AGP bus). The common factor decomposition on the expansion card can be implemented in hardware and/or software. Once the chunk and hash module 502-3 in the host 102-3 has performed the common factor decomposition on the data, the deduplicated data can be sent to the tape frame 506-3 and then to the storage cassette 508-3 via a standard interface.
Referring next to Figure 6, an embodiment of a process 600 for storing data on a removable storage cassette 108 is shown. The depicted portion of the process 600 begins at block 602, where the original data stream is received from one of several sources, depending on where the expansion module 502 is located. For example, in various embodiments the expansion module 502 can receive the original data from the host 102, the removable tape cartridge device 510, the tape frame 506 or the storage cassette 508. In some embodiments, the original data stream can include multiple files. In some embodiments, the stream is not divided into files in any discernible way.
At block 604, the chunk module 202 is used to divide the data stream into a sequence of data chunks, generating an end-of-record (EOR) marker to define each chunk boundary. Once the chunk module 202 has produced the chunks, processing continues to block 606, where the hash module 204 calculates an identifier for each chunk. In various embodiments, different algorithms can be used, for example Message Digest algorithm 5 (MD5), Secure Hash Algorithm 1 (SHA-1) and Secure Hash Algorithm 2 (SHA-2). The identifiers are then stored in the identifier database 406 at block 608.
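Blocks 604 and 606 can be sketched in software as follows. The boundary rule below (a simple running value tested against a bit mask) is an illustrative assumption and is not the chunking algorithm of this disclosure; the names `chunk_stream` and `identify`, the mask and the minimum chunk size are likewise hypothetical. Only the hash algorithms come from the text, accessed here through Python's standard `hashlib` module.

```python
import hashlib

def chunk_stream(data: bytes, mask: int = 0x3F, min_size: int = 16):
    """Split `data` at content-defined boundaries (illustrative rule only)."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        # Running value over recent bytes; older bytes shift out of the 32-bit window.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        if i + 1 - start >= min_size and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])   # boundary found: end of record
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])            # final chunk ends at end of data
    return chunks

def identify(chunk: bytes, algorithm: str = "sha1") -> str:
    """Return the chunk identifier as a hex digest (e.g. 'md5', 'sha1', 'sha256')."""
    return hashlib.new(algorithm, chunk).hexdigest()

data = bytes((i * 37 + 11) % 251 for i in range(2000))
chunks = chunk_stream(data)
identifiers = [identify(c, "sha256") for c in chunks]
print(len(chunks), identifiers[0][:16])
```

Because boundaries depend on content rather than on fixed offsets, identical runs of data tend to produce identical chunks, and therefore identical identifiers, even when they appear at different positions in the stream.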
At block 610, a determination is made as to whether each chunk is unique, in the sense that the same chunk has not been found among the previous chunks. At block 610, the hash module 204 is used to determine whether each chunk is unique by searching the identifier database 406. Some embodiments can use a separate search module 404 to determine the uniqueness of each chunk at block 610. In that case, the search module 404 determines whether each chunk is unique by searching the identifier database 406 based on the output of the hash calculator 402. If the chunk is unique, processing flows from block 610 to optional block 612 to perform additional data processing such as compression, encryption and error-correction coding. The unique chunk and its associated identifier are then stored on the removable medium at block 614. If the chunk is not unique, processing goes from block 610 to block 616, where the redundant data chunk is discarded and a reference to the existing unique chunk is generated. The reference to the existing unique chunk is then forwarded to the removable medium for storage at block 618. Processing then returns to block 602 to continue the common factor decomposition.
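The decision at block 610 and the two storage branches can be sketched as below. This is a simplified model under stated assumptions: a Python list stands in for the removable medium, a set stands in for the identifier database 406, SHA-256 (a member of the SHA-2 family) is the identifier, and the optional processing of block 612 is omitted. The function names and record layout are hypothetical. The sketch also looks the identifier up before recording it, which is one way to make the uniqueness test meaningful.

```python
import hashlib

def store_stream(chunks, identifier_db: set, medium: list) -> None:
    """Write unique chunks, or references to existing chunks, to `medium`."""
    for chunk in chunks:
        ident = hashlib.sha256(chunk).hexdigest()        # block 606: compute identifier
        if ident not in identifier_db:                   # block 610: is the chunk unique?
            identifier_db.add(ident)                     # block 608: record the identifier
            medium.append(("chunk", ident, chunk))       # block 614: store chunk and identifier
        else:
            medium.append(("ref", ident, None))          # blocks 616/618: store a reference only

def restore_stream(medium) -> bytes:
    """Reconstitute the original data from stored chunks and references."""
    unique, out = {}, []
    for kind, ident, payload in medium:
        if kind == "chunk":
            unique[ident] = payload
        out.append(unique[ident])                        # a reference resolves to its unique chunk
    return b"".join(out)

identifier_db, medium = set(), []
store_stream([b"aa", b"bb", b"aa"], identifier_db, medium)
kinds = [kind for kind, _, _ in medium]
restored = restore_stream(medium)
print(kinds, restored)
```

The third chunk repeats the first, so only a reference is written for it, and reading the records back in order recreates the original data.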
Although the principles of the invention have been described above in connection with specific apparatus, it is to be clearly understood that this description is made only by way of example and not as a limitation on the scope of the invention.

Claims (20)

1. A common factor decomposition system for storing data using a removable storage cassette, comprising:
a processor;
an expansion bus connected to the processor;
a socket connected to the expansion bus and configured to receive the removable storage cassette; and
an expansion module removably connected to the expansion bus, wherein the expansion module is configured to send data to the removable storage cassette, and the expansion module comprises:
a chunk module configured to divide an original data stream into a plurality of chunks; and
a hash module configured to be connected to the chunk module in a pipelined manner, such that at least a portion of the input to the hash module comprises output from the chunk module, the hash module being configured to:
determine whether each of the chunks is unique, and
forward the chunks determined to be unique to the removable storage cassette.
2. The common factor decomposition system for storing data using a removable storage cassette according to claim 1, wherein the first byte of each chunk is processed by the hash module before the last byte of that chunk is processed by the chunk module.
3. The common factor decomposition system for storing data using a removable storage cassette according to claim 1, wherein the original data stream comprises a plurality of files.
4. The common factor decomposition system for storing data using a removable storage cassette according to claim 1, wherein the hash module is further configured to:
calculate an identifier for each of the forwarded chunks;
store the identifiers; and
determine, based on the identifiers, whether the chunks are unique.
5. The common factor decomposition system for storing data using a removable storage cassette according to claim 1, further comprising an additional data processing module.
6. The common factor decomposition system for storing data using a removable storage cassette according to claim 5, wherein the additional data processing module comprises one or more of a data compression module, an encryption module and an error-correction coding module.
7. The common factor decomposition system for storing data using a removable storage cassette according to claim 5, further comprising an identifier database search module for searching an identifier database based on the output from the hash module and sending the unique chunks to the additional data processing module.
8. The common factor decomposition system for storing data using a removable storage cassette according to claim 1, wherein a plurality of parallel data paths are used.
9. The common factor decomposition system for storing data using a removable storage cassette according to claim 8, wherein
the plurality of data paths carry a single data stream that is split among a plurality of instances by truncating the data stream at positions that are not necessarily aligned with the chunk boundaries calculated by the chunk module, and wherein the size of the truncated portions of the data stream is fixed or variable.
10. The common factor decomposition system for storing data using a removable storage cassette according to claim 1, wherein the hash module is configured to use one or more of a Message Digest algorithm 5 (MD5) hash, a Secure Hash Algorithm 1 (SHA-1) hash and a Secure Hash Algorithm 2 (SHA-2) hash.
11. A common factor decomposition method for storing data using a removable storage cassette, comprising:
receiving an original data stream at an expansion module removably connected to a host, wherein the expansion module comprises:
a chunk module, and
a hash module,
wherein the hash module and the chunk module are configured in a pipelined architecture, such that at least a portion of the input to the hash module comprises output from the chunk module;
dividing, at the chunk module, the original data stream into a plurality of chunks; and
forwarding a plurality of the chunks to the hash module,
wherein the hash module performs the following steps:
calculating an identifier for each of the forwarded chunks;
storing the identifiers; and
determining, based on the identifiers, whether each of the chunks is unique; and
sending at least one of the identifiers and a plurality of unique chunks to the removable storage cassette, wherein the removable storage cassette comprises a storage drive.
12. The common factor decomposition method for storing data using a removable storage cassette according to claim 11, wherein the identifiers are stored in a low-latency memory.
13. The common factor decomposition method for storing data using a removable storage cassette according to claim 11, wherein at least one of the chunk module and the hash module is located outside the host.
14. An expansion card for common factor decomposition for storing data using a removable storage cassette, comprising:
a chunk module configured to:
receive an original data stream from a host, and
divide the original data stream into a plurality of chunks, wherein the expansion card is configured to:
be removably connected to the host and to the removable storage cassette, and
store data on the removable storage cassette; and
a hash module connected to the chunk module in a pipelined manner, such that at least a portion of the input to the hash module comprises output from the chunk module, wherein the hash module is configured to:
receive a plurality of the chunks from the chunk module,
calculate an identifier for each of the received chunks,
determine, based on the identifiers, whether each of the chunks is unique, and
store the unique chunks on the removable storage cassette.
15. The expansion card for common factor decomposition for storing data using a removable storage cassette according to claim 14, wherein the original data stream comprises a plurality of files.
16. The expansion card for common factor decomposition for storing data using a removable storage cassette according to claim 14, wherein a plurality of parallel data paths are used.
17. The expansion card for common factor decomposition for storing data using a removable storage cassette according to claim 14, wherein the first byte of each chunk is processed by the hash module before the last byte of that chunk is processed by the chunk module.
18. The expansion card for common factor decomposition for storing data using a removable storage cassette according to claim 14, further comprising an additional data processing module.
19. The expansion card for common factor decomposition for storing data using a removable storage cassette according to claim 18, further comprising an identifier database search module for searching an identifier database based on the output from the hash module and forwarding the unique chunks to the additional data processing module.
20. The expansion card for common factor decomposition for storing data using a removable storage cassette according to claim 18, wherein the additional data processing module comprises one or more of a data compression module, an encryption module and an error-correction coding module.
CNA2008101305497A 2007-07-06 2008-07-07 Common factor disintegration hardware acceleration on mobile medium Pending CN101339494A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US94839407P 2007-07-06 2007-07-06
US60/948,387 2007-07-06
US60/948,394 2007-07-06
US12/167,872 2008-07-03
US12/167,867 2008-07-03

Publications (1)

Publication Number Publication Date
CN101339494A true CN101339494A (en) 2009-01-07

Family

ID=40213571

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008101305497A Pending CN101339494A (en) 2007-07-06 2008-07-07 Common factor disintegration hardware acceleration on mobile medium

Country Status (1)

Country Link
CN (1) CN101339494A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102460371A (en) * 2009-04-30 2012-05-16 网络存储技术公司 Flash-based data archive storage system
CN101764811B (en) * 2009-12-30 2013-02-13 飞天诚信科技股份有限公司 Method for generating data flow
CN102541989A (en) * 2010-10-28 2012-07-04 微软公司 Robust auto-correction for data retrieval
CN102541989B (en) * 2010-10-28 2015-12-09 微软技术许可有限责任公司 The sane automatic correction of data retrieval
CN102736961A (en) * 2011-03-11 2012-10-17 微软公司 Backup and restore strategies for data deduplication
CN102736961B (en) * 2011-03-11 2017-08-29 微软技术许可有限责任公司 The backup-and-restore strategy of data deduplication
US9823981B2 (en) 2011-03-11 2017-11-21 Microsoft Technology Licensing, Llc Backup and restore strategies for data deduplication
CN107315653A (en) * 2017-03-02 2017-11-03 陈辉 A kind of band deletes the portable storage device and implementation method of calculating and processing function again

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20090107