CN104408111A - Method and device for deleting duplicate data - Google Patents

Publication number
CN104408111A
Authority
CN
China
Application number
CN201410682621.2A
Other languages
Chinese (zh)
Other versions
CN104408111B (en)
Inventor
张朝潞
Original Assignee
浙江宇视科技有限公司
Application filed by 浙江宇视科技有限公司
Priority to CN201410682621.2A
Publication of CN104408111A
Application granted
Publication of CN104408111B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments

Abstract

The invention discloses a method and device for deleting duplicate data based on the Openstack Object Storage system. A proxy node comprises a deduplication middleware module, and a storage node comprises a deduplication service process module. A deduplication hash ring is built in the deduplication middleware module, and each node of the ring is the root node of a red-black tree. The deduplication service process module sends fingerprint files to the deduplication middleware module. After locating the red-black tree root node, the middleware module determines whether a duplicate file exists. If so, one copy of the duplicate file is retained in the storage node, the other copies are deleted, and a redirection file pointing to the retained copy is stored at each location where a deleted copy used to be. If no duplicate file is found, the virtual node (partition) value in the fingerprint file and the MD5 value of the file content are inserted into a child node of the red-black tree. The method and device make full use of the horizontal scalability of Openstack Object Storage, whose performance grows linearly as nodes are added, so nodes can be added easily.

Description

Method and device for deleting duplicate data

Technical field

The application relates to Openstack Object Storage cloud storage technology, and in particular to a method and device for deleting duplicate data based on the Openstack Object Storage system.

Background

Openstack Object Storage (Swift) is the object storage subproject of the Openstack open-source cloud computing project, and provides strong scalability, redundancy and durability. Its architecture is shown in Fig. 1, the Openstack Object Storage architecture diagram.

As shown in Fig. 1, Openstack Object Storage consists mainly of two kinds of nodes: proxy nodes and storage nodes. A proxy node receives client requests and communicates with the storage nodes: based on the object named in a client request, it locates the responsible storage node and forwards the request to it. A storage node is mainly responsible for storing data, and provides data safety guarantees such as backup, fault tolerance, consistency, and automatic data migration for data balancing. All nodes in the cluster can be scaled horizontally, in two respects: first, adding storage nodes expands data storage capacity; second, performance (such as QPS and throughput) improves linearly. Swift distributes data and metadata completely uniformly and at random, which, combined with the scalability of all its nodes, ensures that the architecture and design contain no single point of failure for the service.

Openstack Object Storage makes it easy to provide cloud storage and data access services. Cloud storage, however, must hold massive amounts of data, and every cloud storage provider is looking for ways to reduce storage cost while keeping the service running normally. Data deduplication is one way to address this problem. In a massive data set there will inevitably be many files with different names but identical content, for example files stored in different directories or belonging to different user accounts. For such data, cloud storage should retain only one copy and let the others merely point to the retained copy. This greatly reduces the total amount of storage required, improves storage space utilization, and saves storage cost.

Schemes for data deduplication in Openstack Object Storage already exist; see the paper "A Deduplication Campus-based Cloud Storage System Based on Swift" for a detailed description.

That paper adds two parts to the original Openstack Object Storage architecture to implement deduplication: a deduplication client and a deduplication middleware. Because all file access must pass through the deduplication client, this prior art loses the benefit of Openstack Object Storage's horizontal scaling, namely the linear growth of performance. The added deduplication client also becomes a single point of failure in the cluster. In addition, the scheme makes almost no use of the existing mechanisms of Openstack Object Storage and does not build on its characteristics; it merely adds functionality at the periphery, which complicates the system structure.

The deduplication client added in the prior art stores its data in SQLite, which uses coarse-grained locking: when one connection writes to the database, every other connection is blocked until the writing connection finishes its transaction. Under high concurrency, system performance drops sharply, and as the data volume grows, retrieval from SQLite also becomes a bottleneck. The present invention instead builds a deduplication hash ring in memory from a hash ring and red-black trees, in combination with the storage characteristics of Openstack Object Storage, and maintains the data dynamically, so it does not need SQLite to store static data. It therefore avoids the problems that SQLite storage causes in the prior art, and also shortens the time needed to find duplicate data when the data volume is large, improving system performance.

Summary of the invention

The application provides a method and device for deleting duplicate data based on the Openstack Object Storage system, addressing the problems of duplicate data deletion in the prior art.

According to a first aspect of the embodiments of the application, a method for deleting duplicate data based on the Openstack Object Storage system is provided. The Openstack Object Storage system comprises proxy nodes and storage nodes; a proxy node comprises a deduplication middleware module and a storage node comprises a deduplication service process module. The deduplication middleware module builds a deduplication hash ring, each node of which is the root node of a red-black tree. The method comprises the following steps:

The deduplication service process module scans the files stored under each file partition, generates a fingerprint for each file, and sends these fingerprints to the deduplication middleware module in a fingerprint file. The fingerprint file contains the value of one virtual node (partition) and the MD5 value of each file's content;

For each fingerprint, the deduplication middleware module locates the corresponding red-black tree root node from the result of the MD5 value modulo n, where n is the number of nodes in the deduplication hash ring;

After the root node is found, the module checks whether the MD5 value in each fingerprint is already present in a child node of the corresponding red-black tree. If the MD5 value of a fingerprint is present, the module further checks whether the partition value stored in that child node is the same as the partition value in the fingerprint file containing the fingerprint. If the two differ, a duplicate file is confirmed: one copy of the duplicate file is retained in the storage node, the other copies are deleted, and a redirection file pointing to the retained copy is saved at each location where a deleted copy was originally stored;

If no duplicate file exists, the partition value from the fingerprint file and the MD5 value of the file content are inserted into the corresponding child node of the red-black tree, with the MD5 value as the key.
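
The check described in these steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: a plain Python dict stands in for each red-black tree, and the names `DedupRing` and `check_or_insert` are hypothetical.

```python
class DedupRing:
    """Deduplication hash ring with n slots; each slot holds one lookup tree.

    Assumption: a dict keyed by MD5 hex digest stands in for the red-black tree.
    """

    def __init__(self, n: int):
        self.n = n
        self.slots = [dict() for _ in range(n)]

    def check_or_insert(self, md5_hex: str, partition: int) -> bool:
        """Return True when the fingerprint reveals a duplicate file."""
        tree = self.slots[int(md5_hex, 16) % self.n]  # MD5 value modulo n
        known_partition = tree.get(md5_hex)
        if known_partition is None:
            tree[md5_hex] = partition  # first sighting: record it, no duplicate
            return False
        # Same MD5 under a different partition means a second copy exists.
        return known_partition != partition
```

A fingerprint that is seen again with the same partition value is treated as the same stored file and is not reported as a duplicate, which mirrors the partition comparison in the step above.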

According to a second aspect of the embodiments of the application, another method for deleting duplicate data based on the Openstack Object Storage system is provided. The system comprises proxy nodes and storage nodes; a proxy node comprises a deduplication middleware module and a storage node comprises a deduplication service process module. The deduplication middleware module builds a deduplication hash ring, each node of which is the root node of a red-black tree, and the child nodes of each red-black tree store a partition value and the MD5 value of a file's content. The method comprises the following steps:

Receiving a file storage request from a client;

Obtaining the MD5 value of the file to be stored, and locating the red-black tree root node from the result of that MD5 value modulo n, where n is the number of nodes in the deduplication hash ring;

After the root node is found, checking whether the MD5 value is present in a child node of that red-black tree. If it is present, a duplicate file is confirmed, and a redirection file pointing to the duplicate file is saved at the location where the file was to be stored. If it is not present, no duplicate file exists: the file is stored at that location, and its partition value and MD5 value are inserted into the corresponding child node of the red-black tree, with the MD5 value as the key.

According to a third aspect of the embodiments of the application, a device for deleting duplicate data based on the Openstack Object Storage system is provided. The system comprises proxy nodes and storage nodes, and the device comprises:

A deduplication service process module located in the storage node, configured to scan the files stored under each file partition in the storage node, generate a fingerprint for each file, and send these fingerprints to the deduplication middleware module in a fingerprint file; and, when a duplicate file exists, to notify the storage node to retain one copy of the duplicate file, delete the other copies, and save a redirection file pointing to the retained copy at each location where a deleted copy was originally stored. The fingerprint file contains the value of one virtual node (partition) and the MD5 value of each file's content;

A deduplication middleware module located in the proxy node, configured to build a deduplication hash ring, each node of which is the root node of a red-black tree, and to locate the red-black tree root node for each fingerprint from the result of its MD5 value modulo n, where n is the number of nodes in the deduplication hash ring. After the root node is found, the module checks whether the MD5 value in each fingerprint is already present in a child node of the corresponding red-black tree. If it is, the module further checks whether the partition value stored in that child node is the same as the partition value in the fingerprint file containing the fingerprint; if the two differ, it confirms that a duplicate file exists and informs the deduplication service process module. If no duplicate file exists, the partition value from the fingerprint file and the MD5 value of the file content are inserted into the corresponding child node of the red-black tree, with the MD5 value as the key.

According to a fourth aspect of the embodiments of the application, a device for deleting duplicate data based on the Openstack Object Storage system is provided. The system comprises proxy nodes and storage nodes, and the device comprises:

A deduplication middleware module, configured to build a deduplication hash ring, each node of which is the root node of a red-black tree whose child nodes store a partition value and the MD5 value of a file's content; when a file storage request is received from a client, to obtain the MD5 value of the file to be stored and locate the red-black tree root node from the result of that MD5 value modulo n, where n is the number of nodes in the deduplication hash ring; and, after the root node is found, to check whether the MD5 value is present in a child node of that red-black tree. If it is present, the module confirms that a duplicate file exists and notifies the deduplication service process module. If it is not present, the partition value and MD5 value of the file to be stored are inserted into the corresponding child node of the red-black tree, with the MD5 value as the key;

A deduplication service process module, configured, when a duplicate file exists, to notify the storage node to save a redirection file pointing to the duplicate file at the location where the file was to be stored; and, when no duplicate file exists, to notify the storage node to store the file at that location.

Without introducing any new node, the application adds new functions inside the system's existing nodes, in combination with mechanisms that Openstack Object Storage itself provides, and thereby achieves data deduplication for the system. This solves the prior-art problem of a complicated system architecture caused by adding extra components. Moreover, because all proxy nodes in the architecture of the application are functionally equivalent, as are all storage nodes, the horizontal scaling advantage of Openstack Object Storage (linear growth of performance) is fully exploited: nodes can be added easily, and the cluster has no single point of failure.

Brief description of the drawings

Fig. 1 is the Openstack Object Storage system architecture diagram of the prior art;

Fig. 2 is the hardware structure diagram of the Openstack Object Storage system in an embodiment of the application;

Fig. 3 is a schematic diagram of the Openstack Object Storage system architecture in an embodiment of the application;

Fig. 4 is a schematic diagram of the deduplication hash ring in an embodiment of the application;

Fig. 5 is a flowchart of building the deduplication hash ring in an embodiment of the application;

Fig. 6 is a schematic diagram of the signal flow in the system when the hash ring is built in an embodiment of the application;

Fig. 7 is a flowchart of a client storing a file in an embodiment of the application;

Fig. 8 is a flowchart of a client deleting a file in an embodiment of the application;

Fig. 9 is a flowchart of a client reading a file in an embodiment of the application.

Detailed description of the embodiments

Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the application. Rather, they are merely examples of devices and methods consistent with some aspects of the application, as detailed in the appended claims.

The terms used in the application serve only to describe particular embodiments and are not intended to limit the application. The singular forms "a", "said" and "the" used in the application and the appended claims are also intended to include the plural, unless the context clearly indicates otherwise.

The application provides a complete solution that implements data deduplication on top of the Openstack Object Storage system, in order to address the large number of identical files in a cloud storage environment, reduce the total amount of stored data, improve space utilization, and minimize the impact on the cluster's read and write performance. The following first sets out the system architecture of Openstack Object Storage, then describes in detail how the deduplication hash ring, which plays the key role in identifying duplicate data, is built in this system, and then describes in detail how the ring is used to deduplicate data. Deduplication may be synchronous or asynchronous, and the application describes a solution for each of the two modes.

To aid understanding of the technical scheme, several key concepts of the Openstack Object Storage system are defined first. The system stores three kinds of data: account, container and object.

Account: an account; one account holds multiple containers.

Container: a container; one container holds multiple objects.

Object: an object, composed of file data, metadata, and so on.

Partition: a virtual node of the consistent hashing algorithm in Openstack Object Storage. Each actual node (storage node) corresponds to several virtual nodes, and the virtual nodes are arranged by hash value in the hash space. The file corresponding to an object is mapped to a virtual node by the hash algorithm, and the virtual node is mapped in turn to an actual node. In the application, the value of a virtual node is its sequence number; once the virtual node has been found by sequence number, the actual storage node can be found through the mapping between virtual nodes and actual nodes.
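
The object to partition to storage node mapping described above can be illustrated with a simplified sketch. It is loosely modelled on the idea of taking the top bits of an MD5 hash of the object path; the real Swift ring is more involved (for example, it mixes a per-cluster secret into the hash and handles replicas). `PART_POWER`, the node addresses and the function names are assumptions made for this example.

```python
import hashlib
import struct

PART_POWER = 10  # assumption: 2**10 = 1024 virtual nodes (partitions)

# Assumption: a toy partition -> storage node table with three storage nodes.
STORAGE_NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
PARTITION_TO_NODE = [STORAGE_NODES[p % len(STORAGE_NODES)]
                     for p in range(2 ** PART_POWER)]


def object_partition(account: str, container: str, obj: str) -> int:
    """Map an object path to a partition number (virtual node sequence number)."""
    path = f"/{account}/{container}/{obj}".encode("utf-8")
    digest = hashlib.md5(path).digest()
    top32 = struct.unpack_from(">I", digest)[0]  # first 4 bytes, big-endian
    return top32 >> (32 - PART_POWER)            # keep the top PART_POWER bits


def storage_node_for(account: str, container: str, obj: str) -> str:
    """Resolve object -> partition -> actual storage node."""
    return PARTITION_TO_NODE[object_partition(account, container, obj)]
```

Because the partition number is derived only from the object path, every proxy node computes the same partition for the same object without any shared state.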

The system architecture of Openstack Object Storage in the application is described as follows:

The application adds two components to the Openstack Object Storage system: the deduplication middleware module and the deduplication service process module shown in Fig. 2. The deduplication middleware module is installed on the proxy node, and the deduplication service process module runs on the storage node.

The Openstack Object Storage system of the application is implemented on multiple servers. The deduplication middleware can be implemented on a proxy server, and the deduplication service process on a storage server. Both modules can be implemented in software, in hardware, or in a combination of the two. In a software implementation, each module is a logical device formed when the processor of its host reads the corresponding computer program instructions from non-volatile memory into main memory and runs them. At the hardware level, Fig. 2 shows a hardware structure for the deduplication service process and the deduplication middleware; besides the processor, network interface, memory and non-volatile memory shown in Fig. 2, each server in an embodiment may include other hardware, which is not shown in Fig. 2.

Fig. 3 is the logical architecture diagram of the Openstack Object Storage system in an embodiment. The deduplication middleware module builds and maintains the deduplication hash ring and a number of red-black trees. Each node of the ring serves as the root node of one red-black tree. The child nodes of each tree store the key information of stored objects, through which every object stored in the system can be found. This key information includes at least the sequence number of the virtual node where the object resides (the partition value), the MD5 value of the file content, and the name of the fingerprint file. Fig. 4 is a schematic diagram of the deduplication hash ring and the red-black trees.

The deduplication hash ring is built as an array. The number of array elements can be chosen freely according to need; in one embodiment, the number of virtual nodes is used. The application uses position to denote the array index, i.e. the sequence number of an array element. A position value locates an array element, which is a node of the deduplication hash ring and also the root node of a red-black tree, and thus gives access to that tree. How the position value is calculated is explained below.

How to construct the deduplication hash ring of Fig. 4 and each red-black tree is explained below with reference to Fig. 5 and Fig. 6. In this embodiment the ring and the trees can be built at system initialization; other moments can of course be chosen as required.

In this embodiment, the configuration of the deduplication service can be written to object-server.conf by adding an object-deduper field; the array structure of the deduplication hash ring can be constructed from this configuration. Fig. 6 is the flowchart for building the nodes of the red-black trees.

The deduplication service process module scans each file stored under each file partition (i.e. device) and generates a fingerprint file containing the fingerprint of each file (a fingerprint uniquely identifies a file) (step 701).

As can be seen from Fig. 3, the fingerprint file contains the fingerprints of all objects under the device. Each fingerprint in the fingerprint file comprises the MD5 value of the file content and the account, container and object to which the file belongs. The MD5 value is a 128-bit signature obtained by applying the public MD5 algorithm to the original data. The fingerprint file thus contains the fingerprints of the files, the partition value, the IP address of the storage node, and the device path. With this information the file behind a specific object can be located: the partition value identifies the storage node, the IP address of the storage node and the device path identify the device, and the MD5 value in the fingerprint identifies the specific file.
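
As a rough illustration of step 701, the sketch below scans a directory tree and collects one fingerprint per file. The on-disk layout `<root>/<account>/<container>/<object>` is a hypothetical simplification (Swift's actual layout differs), and the class and function names are assumptions made for this example.

```python
import hashlib
import os
from dataclasses import dataclass, field


@dataclass
class Fingerprint:
    md5: str        # MD5 hex digest of the file content
    account: str
    container: str
    obj: str


@dataclass
class FingerprintFile:
    storage_ip: str
    device_path: str
    partition: int
    fingerprints: list = field(default_factory=list)


def file_md5(path: str, chunk_size: int = 65536) -> str:
    """Compute the MD5 of a file in chunks so large files are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_device(root: str, storage_ip: str, partition: int) -> FingerprintFile:
    """Walk the hypothetical layout <root>/<account>/<container>/<object>."""
    result = FingerprintFile(storage_ip, root, partition)
    for dirpath, _dirs, files in os.walk(root):
        for file_name in sorted(files):
            full = os.path.join(dirpath, file_name)
            parts = os.path.relpath(full, root).split(os.sep)
            if len(parts) != 3:
                continue  # skip anything outside the assumed layout
            account, container, obj = parts
            result.fingerprints.append(
                Fingerprint(file_md5(full), account, container, obj))
    return result
```

The resulting `FingerprintFile` carries everything the text lists: the per-file fingerprints plus the partition value, storage node IP address and device path.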

The deduplication service process module collects the fingerprint file of each device and sends it to the deduplication middleware module (step 702). In this embodiment the fingerprint file is sent over HTTP, as shown in Fig. 5.

The deduplication middleware module parses the fingerprint file and obtains the fingerprint of each file in it (step 703). The red-black trees are built from the MD5 values in these fingerprints, as follows. First, the MD5 value of each file's content is taken modulo the number of array elements forming the deduplication hash ring to obtain a position value, and each position value locates the position r of a red-black tree root node (step 704);

Then, with each MD5 value as the key, the key information of each object is added to the corresponding child node under that root node (step 705). In Fig. 4 the ring has n+1 nodes, i.e. the array has n+1 elements. If the MD5 value of an object is m, the root node position r of that object's red-black tree is r = m mod (n+1). Clearly, the root node position for any object lies in the range 0 to n.
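
Steps 704 and 705 can be sketched as follows. A dict per array slot is used here as a stand-in for the red-black tree, and the function names and the sample fingerprint file name are assumptions made for illustration.

```python
import hashlib


def ring_position(md5_hex: str, slot_count: int) -> int:
    """r = m mod (n + 1): slot_count is n + 1, m is the MD5 value as an integer."""
    return int(md5_hex, 16) % slot_count


def build_ring(entries, slot_count: int):
    """Build the ring array from (md5_hex, partition, fingerprint_file_name) tuples."""
    ring = [dict() for _ in range(slot_count)]  # one tree stand-in per position
    for md5_hex, partition, fp_name in entries:
        tree = ring[ring_position(md5_hex, slot_count)]     # step 704
        tree.setdefault(md5_hex, {"partition": partition,   # step 705
                                  "fingerprint_file": fp_name})
    return ring


md5_hex = hashlib.md5(b"example file content").hexdigest()
ring = build_ring([(md5_hex, 17, "10.0.0.1_sdb1")], slot_count=1024)
r = ring_position(md5_hex, 1024)
```

Since the 128-bit MD5 value is interpreted as one large integer, Python's arbitrary-precision `int` makes the modulo step a one-liner; in other languages the same result can be obtained by reducing the digest byte by byte.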

In an object's fingerprint information, the partition value and the MD5 value are used to decide whether duplicate data exists. The name of the fingerprint file leads to the fingerprint file itself, and the account information of the fingerprints stored in it shows which account holds the duplicate file data. In one embodiment, key information can be saved in a red-black tree child node in step 705 as follows. The IP address of the storage node and the path of the file within that storage node are already encoded in the name of the fingerprint file, which is named IP_device: the IP in the name is the IP of the storage node holding the mounted file partition, and the device in the name is the path of the mounted file partition. The fingerprint file can be stored at a designated location on the proxy, so in a preferred embodiment a red-black tree node needs to store only the name of the fingerprint file, the partition value and the MD5 value of the file content.

As shown in Fig. 5, a preferred embodiment exploits the horizontal scalability of the Openstack Object Storage system by distributing the deduplication hash ring over multiple proxy nodes, each of which maintains one segment of the ring, forming a distributed deduplication hash ring. Specifically, according to the position range option in the configuration file, the whole ring is split into several parts; each proxy node stores only its part of the ring, and proxy nodes query one another over HTTP. Fig. 5 shows a distributed ring split into two parts: a hash ring with n nodes is divided into two rings stored on different proxies, one holding the array for positions 0 to (n/2-1) and the other the array for positions n/2 to n.
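
One possible way to derive the per-proxy position ranges is sketched below. The even split with the remainder assigned to the last proxy is an assumption chosen to reproduce the two-way example (0 to n/2-1 and n/2 to n); the function name and proxy labels are hypothetical.

```python
def split_positions(slot_count: int, proxies: list) -> list:
    """Assign contiguous, inclusive position ranges to proxies.

    Returns a list of (proxy, first_position, last_position). The last proxy
    absorbs any remainder. Assumes slot_count >= len(proxies).
    """
    base = slot_count // len(proxies)
    ranges = []
    for index, proxy in enumerate(proxies):
        start = index * base
        is_last = index == len(proxies) - 1
        end = slot_count - 1 if is_last else start + base - 1
        ranges.append((proxy, start, end))
    return ranges
```

For a ring with positions 0 to 8 and two proxies, this yields positions 0 to 3 for the first proxy and 4 to 8 for the second.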

In a preferred embodiment the deduplication middleware module takes the form of a WSGI plug-in, which can be switched on and off through the configuration file. By the nature of WSGI, the plug-in can preprocess every HTTP request received; that is, a data write request reaches the deduplication middleware for preprocessing before it reaches the storage node, which makes it possible to prevent duplicate data from being written. In one embodiment, the following settings can be added to the deduplication middleware module through a configuration file (referred to below as proxy-server.conf):
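
A minimal sketch of such a WSGI plug-in is given below. It only shows how a middleware wraps the downstream application, sees every request first, and can be switched off by configuration; the option name `dedup_enabled` and the class name are assumptions, and the actual ring lookup is left out.

```python
class DedupMiddleware:
    """WSGI middleware that inspects write requests before they reach storage."""

    def __init__(self, app, conf):
        self.app = app
        # Hypothetical on/off switch read from the configuration file.
        self.enabled = str(conf.get("dedup_enabled", "true")).lower() == "true"
        self.seen_etags = []

    def __call__(self, environ, start_response):
        if self.enabled and environ.get("REQUEST_METHOD") == "PUT":
            etag = environ.get("HTTP_ETAG")  # MD5 carried in the ETag header
            if etag:
                # A real plug-in would consult the deduplication hash ring here.
                self.seen_etags.append(etag)
        return self.app(environ, start_response)


def filter_factory(global_conf, **local_conf):
    """Paste Deploy style factory, the usual entry point for WSGI filters."""
    conf = dict(global_conf, **local_conf)

    def make_filter(app):
        return DedupMiddleware(app, conf)

    return make_filter
```

Because the middleware is just a callable that wraps another callable, disabling it leaves the request path unchanged: the request is passed straight through to the wrapped application.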

1) The number of array elements; in one embodiment this defaults to the number of partitions, which simplifies configuration;

2) The position range, between 0 and the total number of array elements minus 1.

Other information can also be configured as required. For example, when the distributed hash ring architecture described above is used, the settings can also include the IP addresses of the other proxies that hold part of the deduplication hash ring, together with the position range configured for each of them.
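
To make these settings concrete, the sketch below parses a hypothetical configuration section. The section name `filter:dedup` and the option names `ring_size`, `position_start`, `position_end` and `peer_proxies` are all invented for this example; the patent does not specify them.

```python
import configparser

# Hypothetical fragment of proxy-server.conf; all option names are assumptions.
SAMPLE_CONF = """
[filter:dedup]
ring_size = 1024
position_start = 0
position_end = 511
peer_proxies = 10.0.0.12:512-1023
"""


def load_dedup_conf(text: str) -> dict:
    """Parse the array size, the local position range and the peer proxy ranges."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["filter:dedup"]
    peers = {}
    for item in section.get("peer_proxies", "").split(","):
        item = item.strip()
        if not item:
            continue
        ip, span = item.split(":")
        start, end = span.split("-")
        peers[ip] = (int(start), int(end))
    return {
        "ring_size": section.getint("ring_size"),
        "local_range": (section.getint("position_start"),
                        section.getint("position_end")),
        "peers": peers,
    }
```

Keeping the peer ranges in the same file lets every proxy decide locally, without a coordinator, which proxy is responsible for a given position.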

In one embodiment under the distributed hash ring architecture, when fingerprint file data is read and the position value has been obtained by taking the MD5 value modulo the number of nodes in the deduplication hash ring, the proxy node must also check whether that position belongs to it. If it does not, the proxy node looks up in the configuration file which proxy node is responsible for that fingerprint and passes the fingerprint to that proxy node using its IP address.
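
The ownership check can be sketched as below. The range table, the IP addresses and the function names are assumptions for illustration, and the actual HTTP forwarding is reduced to returning a decision.

```python
# Assumption: two proxies, each owning half of a 1024-position ring.
RANGES = [("10.0.0.11", 0, 511), ("10.0.0.12", 512, 1023)]


def owner_of(position: int, ranges=RANGES) -> str:
    """Return the IP of the proxy whose position range contains this position."""
    for proxy_ip, start, end in ranges:
        if start <= position <= end:
            return proxy_ip
    raise ValueError(f"position {position} is not covered by any proxy")


def route_fingerprint(md5_hex: str, slot_count: int, local_ip: str, ranges=RANGES):
    """Decide whether a fingerprint is handled locally or forwarded to a peer."""
    position = int(md5_hex, 16) % slot_count
    owner = owner_of(position, ranges)
    action = "handle_locally" if owner == local_ip else "forward"
    return action, owner, position
```

A proxy that receives a fingerprint it does not own would then send it to the returned owner address over HTTP, as the text describes.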

With this distributed architecture, the proxies can build and maintain their parts of the deduplication hash ring in parallel. As the following analysis of the time and space complexity of building the ring on each proxy shows, the distributed architecture greatly reduces the load on any single server and also shortens the time needed to rebuild the ring when a server fails.

Time complexity of the deduplication hash ring:

Time complexity = O(1) + O(log N), where N is the average number of objects per red-black tree: N = (total number of objects stored in the system) / (number of positions).

Space complexity of the deduplication hash ring:

Suppose each red-black tree node occupies about 65 bytes: the MD5 value takes 16 bytes, the IP address at most 15 bytes, the path of the mounted file partition about 9 bytes, and the partition value 4 bytes. The red-black tree structure itself takes 25 bytes: the three pointers parent, left and right take 24 bytes and the color takes 1 byte.

At 65 bytes per node, the memory required for 100 million files is 65 × 100,000,000 = 6.5 × 10^9 bytes, i.e. about 6.5 GB (roughly 6.05 GiB).
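
The estimates above can be reproduced with a few lines of arithmetic. The 65-byte node size and the 100 million file count come from the text; the figure of 1024 positions used in the usage note is an assumption.

```python
import math

NODE_BYTES = 65            # per-node size assumed in the text
FILE_COUNT = 100_000_000   # 100 million files

total_bytes = NODE_BYTES * FILE_COUNT
total_gb = total_bytes / 10**9    # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes


def average_tree_size(total_objects: int, positions: int) -> float:
    """N = total number of stored objects / number of ring positions."""
    return total_objects / positions


def lookup_steps(total_objects: int, positions: int) -> float:
    """Rough O(1) + O(log N) estimate: one array index plus about log2(N) tree steps."""
    return 1 + math.log2(max(average_tree_size(total_objects, positions), 1))
```

With, say, 1024 positions, `average_tree_size(FILE_COUNT, 1024)` gives the average number of nodes per tree, and `lookup_steps` shows that the logarithmic term stays small even for 100 million objects.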

The method of deleting duplicate data using the deduplication hash ring and the red-black trees is described in detail below.

The process of removing duplicate data synchronously (synchronous deduplication) is set out first. Synchronous deduplication means that when a file is uploaded and the system already holds that file, the file is not written; the upload is reported as complete immediately, so the upload finishes within seconds. Synchronous deduplication is triggered when a client sends a file upload request.

As shown in Fig. 7, after the deduplication middleware module receives a client's request to upload a file, it looks up the red-black tree root node position from the MD5 value of the file content. In a preferred embodiment the MD5 value is carried in the file storage request message; in one embodiment it can be the ETag in the HTTP message header.

The root node position is located from the MD5 value of the file content (step 801), and the red-black tree under that root node is searched, with the MD5 value as the key, for a child node holding the same MD5 value (step 802). If one exists, a duplicate file is confirmed, and a redirection file pointing to the duplicate file is saved at the location where the file was to be stored. If none exists, no duplicate file exists: the file is stored at that location, and its partition value and MD5 value are inserted into the corresponding child node of the red-black tree, with the MD5 value as the key. (Storing the file at that location may mean writing the uploaded file into the corresponding account on the storage node; the prior art already explains how a file uploaded to an account is written to storage, so this is not described in detail here.)
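
The synchronous path can be sketched as follows. This is a simplified illustration: dicts stand in for the red-black trees and for the storage node, the redirection file is a small dict rather than a real file, and `handle_upload` and `content_md5` are hypothetical names.

```python
import hashlib


def content_md5(content: bytes) -> str:
    """MD5 hex digest of the uploaded content (in practice taken from the ETag)."""
    return hashlib.md5(content).hexdigest()


def handle_upload(ring, storage, object_path, md5_hex, partition, content):
    """Store a file, or store only a redirection if its content already exists."""
    tree = ring[int(md5_hex, 16) % len(ring)]   # step 801: locate the root position
    existing = tree.get(md5_hex)                # step 802: search by MD5 key
    if existing is not None:
        # Duplicate: keep only a pointer to the copy that is already stored.
        storage[object_path] = {"redirect_to": existing["path"]}
        return "redirected"
    # No duplicate: store the content and register it in the tree.
    storage[object_path] = {"content": content}
    tree[md5_hex] = {"partition": partition, "path": object_path}
    return "stored"
```

In the duplicate case the file body never needs to be written, which is what makes the upload appear to complete almost immediately.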

It should be noted that the present application implements the operations on redirection files through the open interface that the Openstack Object Storage system provides to users. This open interface allows users to add their own implementation of the file read/write class DiskFile, so redirection files can be generated and maintained by adding such an implementation. To add the deduplication function, it suffices that, on a read, if the file is found to be a redirection file, the content of the redirection file is read and the real file content is fetched and returned according to the partition, account, container, and object recorded in it.

Steps 8031 to 8034 in the figure describe, for one application example, the process of steps 803 to 804, namely generating the redirection file and writing it to storage. Specifically:

dedupe_account is an account generated by the deduplication service process module when the Openstack Object Storage system is initialized (it may of course be generated at another time as needed) and is used to store file content. The deduplication middleware module uses the MD5 value of the content of the file to be uploaded to find the red-black tree child node where the duplicate data resides, uses the fingerprint file name stored in that child node to find the corresponding fingerprint file, and then locates the fingerprint corresponding to this MD5 value. From the account name stored in the fingerprint, it can determine whether the content of the file to be uploaded is already stored in dedupe_account (step 8031). If it is, the redirection file is generated and written to the account in storage corresponding to the MD5 value, and the number of current redirection files is recorded in dedupe_account (step 8034); that is, the counter refer in the figure is incremented by 1.

If the file content to be stored is not in dedupe_account, the file content corresponding to the MD5 value is written to dedupe_account (step 8032). The redirection file is then generated and written to the account in storage corresponding to the MD5 value, and the number of current redirection files is recorded in dedupe_account (step 8033); that is, the counter refer in the figure is incremented by 1, meaning that the reference count increases by 1. The fingerprint of the updated file in dedupe_account can be sent to the deduplication middleware by the deduplication service process in real time, or it can be sent according to the cycle on which the deduplication service process collects fingerprint files.
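
Steps 8031 to 8034 amount to "keep one real copy, write a redirection file, count the reference". The sketch below models that with in-memory dicts. It is only an illustration: the storage layout, the `redirect_to` field, and the function name are assumptions, and a real implementation would go through the storage node rather than a dict.

```python
DEDUPE = "dedupe_account"


def store_duplicate(storage, refer, account, name, md5_hex, data):
    """Handle an upload whose MD5 value was already found in the index.

    storage: dict mapping account name -> {object name -> stored value}
    refer:   dict mapping MD5 value -> number of redirection files
    """
    shared = storage.setdefault(DEDUPE, {})
    if md5_hex not in shared:   # step 8031 answered "no"
        shared[md5_hex] = data  # step 8032: keep the one real copy
    # Steps 8033/8034: write the redirection file, then count the reference.
    storage.setdefault(account, {})[name] = {"redirect_to": md5_hex}
    refer[md5_hex] = refer.get(md5_hex, 0) + 1


storage, refer = {}, {}
store_duplicate(storage, refer, "alice", "a.bin", "abc123", b"payload")
store_duplicate(storage, refer, "bob", "b.bin", "abc123", b"payload")
print(refer)
```

After the two calls the content exists once in dedupe_account, both user accounts hold only a redirection entry, and the counter for that MD5 value is 2.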

As shown in Figure 8, when a client's request to delete a file (DELETE) is received, it is determined whether the file to be deleted is a redirection file (step 1001). If not, a normal delete operation is performed (step 1007). If so, the redirection file is read (step 1002), the redirection file is deleted (step 1005), and the number of current redirection files is updated (step 1003); that is, the counter refer in the figure, which represents the reference count in dedupe_account, is decremented by 1. If the updated number of current redirection files is zero (step 1004), the file content in dedupe_account is deleted, and the red-black tree child node corresponding to it is deleted (step 1006).
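
The delete flow is a reference-counted release. The following sketch mirrors the step numbers of the figure; the dict-based containers, the `redirect_to` field, and the function name are hypothetical stand-ins for the storage node, dedupe_account, and the red-black tree.

```python
def delete_object(objects, content, refer, index, name):
    """Delete `name`; drop the shared content once no redirection file is left."""
    entry = objects.pop(name)
    if not isinstance(entry, dict) or "redirect_to" not in entry:
        return "normal delete"             # steps 1001 -> 1007
    md5_hex = entry["redirect_to"]         # steps 1002 and 1005
    refer[md5_hex] -= 1                    # step 1003
    if refer[md5_hex] == 0:                # step 1004
        del refer[md5_hex]
        del content[md5_hex]               # step 1006: drop the real content
        index.pop(md5_hex, None)           # ... and its tree node
        return "redirect and content deleted"
    return "redirect deleted"


objects = {"a": {"redirect_to": "m1"}, "b": {"redirect_to": "m1"}, "c": b"raw"}
content = {"m1": b"payload"}   # stand-in for dedupe_account
refer = {"m1": 2}              # counter refer per MD5 value
index = {"m1": 1021}           # stand-in for the red-black tree
results = [delete_object(objects, content, refer, index, n) for n in ("c", "a", "b")]
print(results)
```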

As can be seen, when files are stored, only one copy of the real file content is retained in dedupe_account, and other accounts storing that content need only store a redirection file, which achieves synchronous deduplication.

Fig. 9 is a flowchart of reading a file from the Openstack Object Storage system. When a file read (GET) request is received, it is determined whether the file to be read is a redirection file (step 901). If so, the redirection file is read and its content (account, container, and object information) is obtained (step 902); the corresponding ring file is looked up with this information, the file content is obtained from dedupe_account (step 903), and the file content is returned to the proxy server (step 904). The ring is the most important component of Swift and records the mapping between stored objects and their physical locations. Whenever information such as an account or object is queried, the ring file has to be consulted. The lookup in the ring file can be implemented as in the prior art.
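
The read path can be illustrated in a few lines. In this sketch a dict keyed by (account, container, object) replaces the ring lookup and the storage node, and the shape of the redirection entry is an assumption made for the example.

```python
def read_object(storage, account, container, obj):
    """Return file content, following a redirection file if one is found."""
    entry = storage[(account, container, obj)]
    if isinstance(entry, dict) and "redirect" in entry:  # step 901
        target = entry["redirect"]                       # step 902
        key = (target["account"], target["container"], target["object"])
        return storage[key]                              # step 903
    return entry


storage = {
    ("dedupe_account", "c0", "abc123"): b"payload",
    ("alice", "photos", "a.bin"): {
        "redirect": {"account": "dedupe_account", "container": "c0", "object": "abc123"}
    },
    ("bob", "docs", "plain.txt"): b"plain",
}
redirected = read_object(storage, "alice", "photos", "a.bin")
direct = read_object(storage, "bob", "docs", "plain.txt")
print(redirected, direct)
```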

The process of asynchronously removing duplicate data (asynchronous deduplication) is described below:

Asynchronous deduplication takes place while the deduplication hash ring and red-black trees are being built. It can also be started on events such as adding or removing a device or restarting a server.

During asynchronous deduplication, the deduplication middleware module determines whether duplicate data exists from the key information stored in each child node of the red-black tree. The decision rule is as follows. If the MD5 values are identical and the partition values differ, the two objects store duplicate data. If the MD5 values are identical and the partition values are also identical, duplicate data may exist, but it may be that the same user deliberately stored two identical files, one of which is a replica; no special handling is needed in this case. If the MD5 values differ, there is no duplicate data. How the present application uses the constructed deduplication hash ring to remove duplicate data is described below with reference to Fig. 7.
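
The three-way decision rule can be written as a small function. The function name and the returned labels are illustrative; only the comparison logic comes from the text.

```python
def classify(node_md5, node_partition, fp_md5, fp_partition):
    """Compare a tree node with an incoming fingerprint."""
    if node_md5 != fp_md5:
        return "not duplicate"
    if node_partition != fp_partition:
        return "duplicate"            # same content, different partition
    return "no special handling"      # same content, same partition


a = classify("m1", 7, "m1", 9)
b = classify("m1", 7, "m1", 7)
c = classify("m1", 7, "m2", 7)
print(a, b, c)
```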

When the deduplication service process module has finished collecting a fingerprint file, it sends the data to the deduplication middleware module over HTTP. For each fingerprint in the fingerprint file, the deduplication middleware module searches for a red-black tree child node according to the MD5 value contained in the fingerprint. If a corresponding child node is found, the module checks whether the partition value recorded in the fingerprint file is the same as the partition value recorded in that child node; if they differ, duplicate data is confirmed to exist. The corresponding fingerprint is then located by the MD5 value of the file content in the fingerprint file, and the account name stored in the fingerprint shows whether the account holding the duplicate data is dedupe_account. If the duplicate data is not held in dedupe_account, the duplicate data stored in the account of the node with the identical MD5 value is moved into dedupe_account, a redirection file pointing to this duplicate data is stored in that account, and the number of current redirection files is recorded. In one embodiment, after the file content is written to dedupe_account, the fingerprint information corresponding to the updated file content can be sent to the deduplication middleware module in a fingerprint file, so that the information stored in each red-black tree child node is updated.
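
Processing one fingerprint file might look like the sketch below. It assumes a fingerprint file carries one partition value plus a list of fingerprints, each with an MD5 value and an account name; a dict replaces the red-black tree, and the function only reports which entries would have to be moved into dedupe_account rather than moving any data. All names are hypothetical.

```python
DEDUPE = "dedupe_account"


def process_fingerprint_file(index, partition, fingerprints):
    """Update the index and return fingerprints whose data must be moved.

    index:        dict mapping MD5 value -> partition (stand-in for the tree)
    partition:    the partition value carried by this fingerprint file
    fingerprints: list of {"md5": ..., "account": ...}
    """
    to_move = []
    for fp in fingerprints:
        known = index.get(fp["md5"])
        if known is None:
            index[fp["md5"]] = partition  # no duplicate: insert a new node
        elif known != partition and fp["account"] != DEDUPE:
            to_move.append(fp)            # duplicate held outside dedupe_account
    return to_move


index = {"m1": 7}
moves = process_fingerprint_file(index, 9, [
    {"md5": "m1", "account": "alice"},
    {"md5": "m2", "account": "bob"},
])
print(moves, index)
```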

After considering the specification and practicing the invention disclosed herein, those skilled in the art will readily conceive of other embodiments of the present application. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only; the true scope and spirit of the present application are indicated by the claims below.

It should be understood that the present application is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (10)

1. A method for deleting duplicate data based on an Openstack Object Storage system, the Openstack Object Storage system comprising a proxy node and a storage node, characterized in that the proxy node comprises a deduplication middleware module and the storage node comprises a deduplication service process module, wherein a deduplication hash ring is built in the deduplication middleware module and each node of the deduplication hash ring is the root node of a red-black tree, the method comprising the steps of:
the deduplication service process module scanning the files saved under each file partition, generating a fingerprint for each file, and sending these fingerprints to the deduplication middleware module in a fingerprint file, the fingerprint file comprising the value of one virtual node partition and the MD5 value of the content of each file;
the deduplication middleware module looking up the red-black tree root node corresponding to each fingerprint according to the result of each MD5 value modulo n, where n is the number of nodes in the deduplication hash ring;
after the red-black tree root node is found, determining for each fingerprint whether the MD5 value it contains is already present in a child node of the corresponding red-black tree; where the MD5 value of a fingerprint is present in a red-black tree child node, further determining whether the virtual node partition value stored in that child node is the same as the virtual node partition value contained in the fingerprint file to which the fingerprint belongs, and if they differ, confirming that a duplicate file exists; retaining one duplicate file in the storage node, deleting the other duplicate files, and saving, at the positions where the other duplicate files were originally stored, a redirection file pointing to the retained duplicate file;
if no duplicate file exists, inserting the value of the virtual node partition contained in each fingerprint file and the MD5 value of the file content into the corresponding child node of the red-black tree, with each MD5 value as the key.
2. The method of claim 1, characterized in that retaining one duplicate file in the storage node is specifically: saving one duplicate file in dedupe_account, the dedupe_account being an account created for storing duplicate data.
3. A method for deleting duplicate data based on an Openstack Object Storage system, the Openstack Object Storage system comprising a proxy node and a storage node, characterized in that the proxy node comprises a deduplication middleware module and the storage node comprises a deduplication service process module, wherein a deduplication hash ring is built in the deduplication middleware module, each node of the deduplication hash ring is the root node of a red-black tree, and the child nodes of the red-black tree hold the value of the virtual node partition and the MD5 value of the file content, the method comprising the steps of:
receiving a file storage request from a client;
obtaining the MD5 value of the file to be stored, and looking up a red-black tree root node according to the result of the MD5 value modulo n, where n is the number of nodes in the deduplication hash ring;
after the root node of the red-black tree is found, determining whether the MD5 value is present in a child node of that red-black tree; if it is present in a child node, confirming that a duplicate file exists and saving, at the position for storing the file to be stored, a redirection file pointing to the duplicate file; if it is not present in any child node, confirming that no duplicate file exists, storing the file to be stored at the position for storing it, and inserting the value of the virtual node partition of the file to be stored and its MD5 value into the corresponding child node of the red-black tree, with the MD5 value of the file to be stored as the key.
4. The method according to claim 3, characterized in that the method further comprises: when a request to delete a file is received, if the file to be deleted is a redirection file, deleting the redirection file; if there is no redirection file, deleting the file content in the storage node and deleting the red-black tree child node corresponding to the fingerprint of this file.
5. A device for deleting duplicate data based on an Openstack Object Storage system, the Openstack Object Storage system comprising a proxy node and a storage node, characterized in that the device comprises:
a deduplication service process module located in the storage node, configured to scan the files saved under each file partition in the storage node, generate a fingerprint for each file, and send these fingerprints to the deduplication middleware module in a fingerprint file; and, when duplicate files exist, to notify the storage node to retain one duplicate file in the storage node, delete the other duplicate files, and save, at the positions where the other duplicate files were originally stored, a redirection file pointing to the retained duplicate file; the fingerprint file comprising the value of one virtual node partition and the MD5 value of the content of each file;
a deduplication middleware module located in the proxy node, configured to build a deduplication hash ring, each node of which is the root node of a red-black tree, and to look up the red-black tree root node corresponding to each fingerprint according to the result of each MD5 value modulo n, where n is the number of nodes in the deduplication hash ring; after the red-black tree root node is found, to determine for each fingerprint whether the MD5 value it contains is already present in a child node of the corresponding red-black tree; where the MD5 value of a fingerprint is present in a red-black tree child node, to further determine whether the virtual node partition value stored in that child node is the same as the virtual node partition value contained in the fingerprint file to which the fingerprint belongs, and if they differ, to confirm that a duplicate file exists and inform the deduplication service process module; and, if no duplicate file exists, to insert the value of the virtual node partition contained in each fingerprint file and the MD5 value of the file content into the corresponding child node of the red-black tree, with each MD5 value as the key.
6. The device of claim 5, characterized in that the deduplication service process module notifying the storage node to retain one duplicate file in the storage node is specifically: saving one duplicate file in dedupe_account, the dedupe_account being an account created for storing duplicate data.
7. A device for deleting duplicate data based on an Openstack Object Storage system, the Openstack Object Storage system comprising a proxy node and a storage node, characterized in that the device comprises:
the deduplication middleware module, configured to build a deduplication hash ring, each node of which is the root node of a red-black tree, the child nodes of the red-black tree holding the value of the virtual node partition and the MD5 value of the file content; when a file storage request from a client is received, to obtain the MD5 value of the file to be stored and look up a red-black tree root node according to the result of the MD5 value modulo n, where n is the number of nodes in the deduplication hash ring; after the root node of the red-black tree is found, to determine whether the MD5 value is present in a child node of that red-black tree; if it is present in a child node, to confirm that a duplicate file exists and notify the deduplication service process; if it is not present in any child node, to insert the value of the virtual node partition of the file to be stored and its MD5 value into the corresponding child node of the red-black tree, with the MD5 value of the file to be stored as the key;
the deduplication service process module, configured to, when a duplicate file exists, notify the storage node to save, at the position for storing the file to be stored, a redirection file pointing to the duplicate file; and, when no duplicate file exists, to notify the storage node to store the file to be stored at the position for storing it.
8. The device according to claim 7, characterized in that the deduplication service process module is further configured to, when a request to delete a file is received, notify the storage node to delete the redirection file if the file to be deleted is a redirection file; and, if there is no redirection file, to notify the storage node to delete the file content and notify the deduplication middleware module to delete the red-black tree child node corresponding to the fingerprint of this file.
9. The device according to claim 7, characterized in that the deduplication hash ring is split into multiple rings according to a predetermined rule, and the rings are stored on multiple proxy nodes respectively.
10. The device according to claim 7, characterized in that the deduplication middleware module is installed on the proxy node in WSGI form.
CN201410682621.2A 2014-11-24 2014-11-24 A kind of method and device of deleting duplicated data CN104408111B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410682621.2A CN104408111B (en) 2014-11-24 2014-11-24 A kind of method and device of deleting duplicated data

Publications (2)

Publication Number Publication Date
CN104408111A true CN104408111A (en) 2015-03-11
CN104408111B CN104408111B (en) 2017-12-15

Family

ID=52645742

Country Status (1)

Country Link
CN (1) CN104408111B (en)

Also Published As

Publication number Publication date
CN104408111B (en) 2017-12-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant