CN105068757B - Redundant data deduplication method based on file semantics and real-time system status - Google Patents


Publication number
CN105068757B (application CN201510435945.0A)
Authority
CN
China
Prior art keywords
file
deduplication
SLA
deduplicator
Prior art date
Legal status (an assumption, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510435945.0A
Other languages
Chinese (zh)
Other versions
CN105068757A (en)
Inventor
尹建伟
唐彦
邓水光
李莹
吴健
吴朝晖
Current Assignee (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201510435945.0A
Publication of CN105068757A
Application granted
Publication of CN105068757B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a redundant data deduplication method based on file semantics and real-time system status. The method is realized mainly by three functional modules: a deduplication priority computation module based on multi-dimensional semantic partitioning (MPD module), a hierarchical data deduplication module (deduplicator), and a deduplication control module based on real-time system status (controller). Based on multi-dimensional file semantics, the MPD module outputs the file objects on which deduplication should preferentially be performed; the deduplicator then executes hierarchically on this output, applying a global file-level deduplication strategy and a local block-level deduplication strategy. While the deduplicator runs, the controller dynamically adjusts it according to the real-time status of the system, so that more storage-space cost overhead is saved while the read-request response performance of the distributed primary storage system is preserved.

Description

Redundant data deduplication method based on file semantics and real-time system status
Technical field
The invention belongs to the technical field of computer information management, and in particular relates to a redundant data deduplication method based on file semantics and real-time system status.
Background technology
With the further popularization and deep application of cloud computing and the mobile Internet, web-based applications and services play an ever more important role in every industry, and as the user base of Internet applications rises steeply, the total volume of global information also grows at an astonishing speed. Distributed storage systems are the back-end support systems of all kinds of cloud services: all service data is stored in the distributed storage system, which exposes a unified read-write interface through which users or upper-layer application services access or modify the data stored on disk. According to application scenario and primary use, distributed storage systems can be coarsely divided into two classes: backup storage systems and primary storage systems. Backup storage systems are mainly used for backing up cold data such as system logs and historical archives. They are typically built on storage hardware of relatively moderate cost, and may even be based on tape storage at the bottom layer, because the access frequency of the data in a backup system is extremely low (historical data is generally read out only when a specific need arises), so backup storage systems have no demanding requirements on data read-write performance. By contrast, a primary storage system, generally meaning a system storing data that upper-layer application services access directly, has higher requirements on data access performance, because the efficiency of reads and writes directly determines the user experience of upper-layer applications and services. A primary storage system usually takes the data block as its basic unit: files are split into data blocks stored on the underlying disks, while an index over all data blocks is maintained in the primary storage system's memory. The purpose of the index is to record which file each data block belongs to, and the physical location of the block on disk. Because a primary storage system needs high data-access performance, in a distributed environment system developers usually provide a redundant data storage and management mechanism: the same piece of data may be kept in more than one copy in the distributed primary storage system, and two structurally very similar files, such as two project folders containing a large number of identical files, may likewise both be retained in full.
Before the big data era arrived, the amount of data our storage devices needed to store and process was relatively small; even though the hardware required to build a distributed primary storage system was expensive, hardware expenditure was not a factor service providers needed to pay much attention to. In the big data era, with the surge of Internet applications and services, the number of users grows every day and the scale of services and applications becomes ever larger and more complex, so the volume of data that must be stored in the primary storage systems supporting cloud services has grown explosively. This data must be directly accessible to upper-layer applications and services, so cheap backup storage systems cannot be used to help store it. Although the cost of storage hardware keeps falling as technology improves, and ever larger capacity can be bought at a lower price, data volume in the big data era grows at an exponential speed; in practice, the growth rate of stored data has already exceeded the rate at which hardware storage costs decline, and buying more storage devices cannot fundamentally cope with the data surge. From the standpoint of economic interest, it is moreover an extremely cost-consuming approach.
Driven by this background and challenge, redundant data deduplication technology has gradually attracted the attention of more and more service providers, especially those that must keep massive data in primary storage systems. Described abstractly, redundant data deduplication compares the "signatures" of data items; if items with the same signature are found, they are judged to be redundant, the redundant copies are deleted, and the index entries of the deleted data are updated so that their physical disk locations point to the retained redundant copy. The next time a user or application accesses the deleted data, the system redirects the request, according to the index, to the location of the retained copy on disk, and performs the requested operation on that data. By granularity, deduplication techniques generally fall into two classes: file-level deduplication and block-level deduplication. Briefly, file-level deduplication deletes only redundant identical files; its unit of comparison is the whole file. Block-level deduplication compares and processes at the finer granularity of data blocks: as mentioned earlier, the bottom-layer storage of a distributed primary storage system operates in units of blocks, so a file may be split into multiple blocks stored on disk; in this case a signature is computed for each block, and deduplication deletes redundant blocks on the basis of comparing block signatures. By scope, deduplication divides into global deduplication and local deduplication. The notion of scope applies in a distributed storage system: briefly, global deduplication can detect all redundant data across the servers of the system, even when the servers holding that data are geographically separated, whereas local deduplication only considers redundant data on the same server or the same storage device. By execution timing, deduplication divides into offline deduplication and online deduplication. Offline deduplication generally means the deduplication control program runs in the system background, detecting and deleting redundant data after new file data has already been written to disk, whereas online deduplication performs detection and redundancy elimination while new data is being written.
Combining the classifications above, the deduplication schemes commonly implemented and used in real storage systems are combinations of the categories just mentioned, and the two main schemes in use are: Global File-level Deduplication (GFD) and Local Chunk-level Deduplication (LCD). Although both can satisfy certain deduplication needs to some extent, neither design considers the balance and trade-offs against other factors in the application scenario of primary storage. First, data deduplication incurs a loss of data read performance: GFD may require transmitting the data a user accesses from a remote server, introducing network transmission delay, while LCD causes fragmentation of the local disk, so reading some data may require multiple disk seeks, delaying read-operation response. The heavier the deduplication, the more storage-device overhead is saved, but the more seriously read performance suffers; the balance between deduplication degree and read performance is therefore the primary consideration. Second, primary storage differs from backup storage: the latter's data is cold, rarely accessed, and generally preserved as a historical record, so from the system's point of view it can be uniformly treated as binary byte streams. In a primary storage system, because it directly supports upper-layer services, the stored data is highly diverse, and its access characteristics differ with the kind of service and the user base; the data carries definite semantics. These file semantics should be used in the design of deduplication schemes oriented to primary storage. Third, because a primary storage system is in direct contact with users, users' access patterns differ across time, region, and service changes, and also vary from individual to individual, so a deduplication scheme oriented to primary storage must be dynamic and able to adjust, in order to make better choices in the balance between read performance and storage-space efficiency.
In summary, against the historical background of big data, cloud service providers on the one hand have an urgent need to reduce the cost overhead of storage space, and on the other hand hope that redundant data deduplication will not affect the performance of upper-layer applications and services too much, so that the user experience is guaranteed. How to design and realize an efficient redundant data deduplication scheme, targeting the usage and data characteristics of distributed primary storage systems and exploiting the rich file semantics (absent in backup storage) and the changing system state, so as to achieve an efficient balance between system space efficiency and data read performance, has become a major problem that those skilled in the art urgently need to solve.
Summary of the invention
In view of the above technical problems in the prior art, the invention provides a redundant data deduplication method based on file semantics and real-time system status, which enables a distributed primary storage system to maintain high read-request response performance while reducing storage-space cost overhead.
A redundant data deduplication method based on file semantics and real-time system status proceeds as follows:
Periodically measure the read response delay and deduplication ratio of the distributed storage system; according to the read response delay and deduplication ratio of the system at the current moment, adjust the system's deduplicator using the following dynamic regulation mechanism based on the SLA (Service Level Agreement):
According to the SLA currently referenced by the system, judge whether the current read response delay exceeds 1.1 times the upper limit of that SLA's read-response-delay interval:
If so, the deduplicator stops executing GFD and LCD on the system in the next cycle; if not, judge whether the current read response delay is below the upper limit of the SLA's read-response-delay interval:
If not, the deduplicator keeps executing GFD on the system in the next cycle but stops executing LCD; if so, judge whether the current deduplication ratio is below the lower limit of the SLA's deduplication-ratio interval:
If so, the deduplicator executes GFD and LCD normally on the system in the next cycle; if not, the SLA currently referenced by the system is raised by one grade, and the judgment is repeated under the new SLA according to the above dynamic mechanism.
The read response delay of the system at the current moment is the average response delay of all user read requests to the system during the previous cycle; the deduplication ratio at the current moment is the ratio of the total volume of redundant data removed from the initial moment up to the current moment, to the total volume of data that would have accumulated in storage over that period had no redundant data been removed.
If the current read response delay exceeds 1.1 times the upper limit of the read-response-delay interval of the SLA the system currently references, the deduplicator stops executing GFD and LCD in the next cycle; if, counting the next cycle, the deduplicator has stopped executing GFD and LCD for three consecutive cycles, the SLA currently referenced by the system is lowered by one grade.
The grades of the SLA are as follows:
Grade 1 SLA: read-response-delay interval [0, 600ms], deduplication-ratio interval [0.25, 1);
Grade 2 SLA: read-response-delay interval [0, 600ms], deduplication-ratio interval [0.1, 0.25];
Grade 3 SLA: read-response-delay interval [600ms, 750ms], deduplication-ratio interval [0.25, 1);
Grade 4 SLA: read-response-delay interval [600ms, 750ms], deduplication-ratio interval [0.1, 0.25];
Grade 5 SLA: read-response-delay interval [750ms, 900ms], deduplication-ratio interval [0.25, 1);
Grade 6 SLA: read-response-delay interval [750ms, 900ms], deduplication-ratio interval [0.1, 0.25];
Grade 7 SLA: read-response-delay interval [900ms, 1200ms], deduplication-ratio interval [0.25, 1);
Grade 8 SLA: read-response-delay interval [900ms, 1200ms], deduplication-ratio interval [0.1, 0.25].
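The SLA-driven regulation mechanism described above can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the function name, grade indexing (index 0 is Grade 1, so "raising" the SLA decrements the index), and the returned action strings are illustrative, not part of the patent.

```python
# Sketch of the SLA-driven dynamic regulation mechanism; grades are indexed
# from 0 (Grade 1, strictest) to 7 (Grade 8, most lenient).

SLA_GRADES = [  # (read-response-delay upper bound in ms, dedup-ratio lower bound)
    (600, 0.25), (600, 0.10),
    (750, 0.25), (750, 0.10),
    (900, 0.25), (900, 0.10),
    (1200, 0.25), (1200, 0.10),
]

def decide(grade, read_delay_ms, dedup_ratio, stopped_cycles):
    """Return (action for the next cycle, SLA grade index, stopped-cycle count)."""
    upper, ratio_lo = SLA_GRADES[grade]
    if read_delay_ms > 1.1 * upper:
        stopped_cycles += 1
        # Three consecutive fully-stopped cycles lower the SLA by one grade.
        if stopped_cycles >= 3 and grade < len(SLA_GRADES) - 1:
            grade, stopped_cycles = grade + 1, 0
        return "stop GFD and LCD", grade, stopped_cycles
    if read_delay_ms >= upper:
        # Between the upper limit and 1.1x the upper limit: keep GFD, stop LCD.
        return "GFD only", grade, 0
    if dedup_ratio < ratio_lo:
        return "GFD and LCD", grade, 0
    # Delay and ratio targets are both met with headroom: raise the SLA one
    # grade and re-judge under the stricter agreement.
    if grade > 0:
        return decide(grade - 1, read_delay_ms, dedup_ratio, 0)
    return "GFD and LCD", grade, 0
```

For example, a 650 ms delay under Grade 1 (upper limit 600 ms, so 1.1x is 660 ms) keeps GFD but suspends LCD.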
The deduplicator performs deduplication on the system using a hierarchical, priority-based redundant data deduplication scheme, as follows:
(1) Based on file semantics, compute the deduplication priority of every file in the system;
(2) Extract from the system's global file index the file-location records of the n files with the highest deduplication priority, where n is a natural number greater than 1;
(3) Perform hierarchical redundant data deduplication on these n files according to their location records:
First, perform GFD on the n files and update the global file index accordingly; initially the location records of the n files are all marked "dirty", and the location record of a file whose redundant data has been removed by GFD is then marked "clean";
Then, perform LCD on the files whose location records are still marked "dirty" after GFD, and update the local block index accordingly;
(4) Repeat steps (1) to (3).
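One round of steps (1) to (3) can be sketched as follows. The `gfd` and `lcd` functions here are placeholder stubs (the real passes involve index queries and deletions described later in the embodiment), and the dictionary-based file records are an assumption for illustration.

```python
# Sketch of one hierarchical (GFD-then-LCD) round over the candidate files.

def gfd(f, global_index):
    # Placeholder: GFD "succeeds" when another server holds a whole-file copy,
    # i.e. the file's signature appears more than once in the global index.
    return global_index.count(f["signature"]) > 1

def lcd(f):
    # Placeholder for the local block-level pass; assume it always succeeds.
    return True

def dedup_round(files, global_index, n=50):
    # (1)-(2): pick the n files with the highest deduplication priority.
    candidates = sorted(files, key=lambda f: f["priority"], reverse=True)[:n]
    flags = {f["name"]: "dirty" for f in candidates}
    # (3a): global file-level pass; a file handled by GFD becomes "clean".
    for f in candidates:
        if gfd(f, global_index):
            flags[f["name"]] = "clean"
    # (3b): local block-level pass over the files GFD left "dirty".
    for f in candidates:
        if flags[f["name"]] == "dirty" and lcd(f):
            flags[f["name"]] = "clean"
    return flags
```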
The deduplication priority of each file in step (1) is computed by the following formula:
ρ = α₁H + α₂G + α₃E
where ρ is the deduplication priority of the file, H is the priority value of the file's last access time, G is the priority value of the file size, E is the priority value of the file type, and α₁ to α₃ are the weight coefficients corresponding to H, G and E respectively.
The priority value H is determined as follows:
If the time since the file's last access exceeds 1 month, H = 7;
if it exceeds 15 days and is at most 1 month, H = 6;
if it exceeds 7 days and is at most 15 days, H = 5;
if it exceeds 3 days and is at most 7 days, H = 4;
if it exceeds 1 day and is at most 3 days, H = 3;
if it exceeds 12 hours and is at most 1 day, H = 2;
if it exceeds 6 hours and is at most 12 hours, H = 1;
if it is at most 6 hours, H = -1.
The priority value G is determined as follows:
If the file size exceeds 1GB, G = 7;
if it exceeds 512MB and is at most 1GB, G = 6;
if it exceeds 256MB and is at most 512MB, G = 5;
if it exceeds 64MB and is at most 256MB, G = 4;
if it exceeds 8MB and is at most 64MB, G = 3;
if it exceeds 1MB and is at most 8MB, G = 2;
if it exceeds 128KB and is at most 1MB, G = 1;
if it is at most 128KB, G = -1.
The priority value E is determined as follows:
If the file type is backup log, E = 7;
if the file type is mirror image, E = 6;
if the file type is project, E = 5;
if the file type is video, E = 4;
if the file type is audio, E = 3;
if the file type is document, E = 2;
if the file type is picture, E = 1;
if the file type is other, E = -1.
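The three criteria above combine into ρ = α₁H + α₂G + α₃E; a minimal sketch follows, using the default weights given later in the embodiment (α₁ = 0.5, α₂ = 0.3, α₃ = 0.2). Treating "1 month" as 30 days is an assumption, as are the function names.

```python
# Sketch of the multi-dimensional deduplication-priority computation.

KB, MB, GB = 1024, 1024**2, 1024**3
HOUR, DAY = 3600, 86400

def h_value(age_seconds):
    # Last-access-time priority value H ("1 month" assumed to mean 30 days).
    for limit, h in [(30 * DAY, 7), (15 * DAY, 6), (7 * DAY, 5), (3 * DAY, 4),
                     (DAY, 3), (12 * HOUR, 2), (6 * HOUR, 1)]:
        if age_seconds > limit:
            return h
    return -1

def g_value(size_bytes):
    # File-size priority value G.
    for limit, g in [(GB, 7), (512 * MB, 6), (256 * MB, 5), (64 * MB, 4),
                     (8 * MB, 3), (MB, 2), (128 * KB, 1)]:
        if size_bytes > limit:
            return g
    return -1

E_VALUES = {"backup log": 7, "mirror image": 6, "project": 5, "video": 4,
            "audio": 3, "document": 2, "picture": 1}  # anything else: -1

def dedup_priority(age_seconds, size_bytes, file_type):
    # rho = 0.5*H + 0.3*G + 0.2*E with the default weights of the embodiment.
    e = E_VALUES.get(file_type, -1)
    return 0.5 * h_value(age_seconds) + 0.3 * g_value(size_bytes) + 0.2 * e
```

A 40-day-old, 2GB backup log scores the maximum in every dimension, giving ρ = 7; a 128KB picture accessed an hour ago scores -1 in every dimension, giving ρ = -1.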
As a result of the above technical scheme, the present invention has the following notable beneficial effects compared with the prior art:
(1) Compared with traditional redundant data deduplication schemes, a clear advantage of the method of the invention is that it fully exploits the rich file semantics contained in a primary storage system. By quantitatively partitioning the attribute values of multi-dimensional file semantics, each file is assigned specific per-dimension deduplication priority values, and by giving the different dimensions different weights, a numeric deduplication priority is computed for each file, so that the deduplicator treats different files differently, maintaining efficient response to upper-layer read requests while saving storage space.
(2) The method of the invention introduces a dynamic adjustment mechanism for the running deduplication control program, together with a mechanism for uniformly describing deduplication requirements. The unified requirement-description mechanism lets the functional modules of the deduplication scheme obtain the user's real needs qualitatively, clearly arranging by priority the user's multi-level trade-offs between read performance and space efficiency. Dynamic adjustment makes the deduplication of the invention aware of system state: when read-request response performance is affected, the working mode of the deduplicator can be adjusted to varying degrees.
(3) The method of the invention introduces a two-layer deduplication mechanism combining GFD and LCD, applying different mechanisms at different times. Compared with traditional single-mechanism redundant data deduplication schemes, the proposed two-layer hybrid mechanism adds flexibility in deduplication granularity and scope, allowing the controller, according to user demand and the real-time status of the system, to satisfy the higher-priority SLA by adjusting when each of the two schemes runs.
Brief description of the drawings
Fig. 1 is a schematic diagram of the functional architecture of the redundant data deduplication method of the invention.
Fig. 2 is a schematic flowchart of the redundant data deduplication method of the invention.
Embodiment
To describe the present invention more specifically, the technical scheme of the invention is described in detail below with reference to the accompanying drawings and an embodiment.
As shown in Fig. 1, in a real operating environment the redundant data deduplication scheme of the invention, based on file semantics and real-time system status, runs on a general distributed primary storage system. The storage system mainly comprises a metadata server and an object storage server cluster, wherein:
The metadata server is responsible for receiving user requests and directing them to the corresponding object storage server; it is also responsible for monitoring the running state of the whole distributed primary storage system, and maintains the global index keyed by file name. The index contains the "signature", location information and metadata of each file; the signature is unique to each file and is computed with the SHA-1 algorithm over the binary content of the file.
The object storage server cluster contains multiple independent object storage servers. Each server holds a certain number of files, its local disks store file data in units of data blocks, and each object storage server is responsible for maintaining the block-level index of the files stored locally. Likewise, this index contains each block's physical location on the local disk and the signature computed with the SHA-1 algorithm over the block's binary content.
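The file-level and block-level signatures described above can be illustrated with a minimal sketch. The patent specifies SHA-1 over binary content but no particular block size; the 4096-byte block size and the function names here are assumptions.

```python
import hashlib

def file_signature(data: bytes) -> str:
    # File-level signature: SHA-1 over the whole binary content of the file.
    return hashlib.sha1(data).hexdigest()

def block_signatures(data: bytes, block_size: int = 4096):
    # Block-level signatures: SHA-1 over each fixed-size block of the file.
    return [hashlib.sha1(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]
```

Two files (or blocks) with identical binary content produce identical signatures, which is what makes the signature comparison in GFD and LCD possible.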
This embodiment mainly comprises three functional modules: the deduplication priority computation module based on multi-dimensional semantic partitioning (MPD module), the hierarchical data deduplication module (deduplicator), and the deduplication control module based on real-time system status (controller). These three modules run on the metadata server or on the object storage servers, and interact with and operate on the global file index of the metadata server and the block index of each object storage server, wherein:
The MPD module runs on the metadata server. By obtaining the files and their auxiliary information kept on the metadata server, it computes the list of files with high deduplication priority. Based on multi-dimensional file semantics, the MPD module periodically outputs a list containing a fixed number of file-location records; these files are the objects on which the deduplicator should preferentially perform deduplication. The MPD module of this scheme computes deduplication priority from three semantic dimensions of a file: the last-access timestamp, the file size, and the file type. On this basis, the MPD module makes an explicit assignment for each range of values in every dimension, so that a file's attribute value in each of the three dimensions corresponds to a definite number; this number indicates the file's deduplication priority within that dimension, and the larger the number, the more preferentially deduplication should be performed. Moreover, the three dimensions carry different weight coefficients, and a file's final deduplication priority is the weighted sum of its values in the three dimensions. In practice these assignments can be customized according to the specific service scenario and demand; the default partitioning criteria, the corresponding assignments, and the weight of each dimension in this embodiment are shown in Table 1:
Table 1
Priority value | Last access time (weight 0.5) | File size (weight 0.3) | File type (weight 0.2)
7  | > 1 month           | > 1GB          | backup log
6  | 15 days to 1 month  | 512MB to 1GB   | mirror image
5  | 7 to 15 days        | 256 to 512MB   | project
4  | 3 to 7 days         | 64 to 256MB    | video
3  | 1 to 3 days         | 8 to 64MB      | audio
2  | 12 hours to 1 day   | 1 to 8MB       | document
1  | 6 to 12 hours       | 128KB to 1MB   | picture
-1 | <= 6 hours          | <= 128KB       | other
According to the entries of Table 1, each file's final deduplication priority = last-access-time priority value × 0.5 + file-size priority value × 0.3 + file-type priority value × 0.2. The MPD module periodically scans the file index on the metadata server and writes the information of the 50 files with the highest deduplication priority into a list, called the deduplication candidate list. Each file's information occupies one row of the list, and each row carries a flag bit. At initialization every flag bit is "dirty", indicating that the file has not yet been processed by the deduplicator; correspondingly, once the deduplicator has finished processing the file of a row, the flag is changed to "clean".
The deduplicator runs on each object storage server. It takes the list output by the MPD module as input and performs hierarchical deduplication on the files in the list. The deduplicator executes two layers of deduplication: global file-level deduplication (GFD) and local block-level deduplication (LCD). It periodically fetches the latest deduplication candidate list from the MPD module and then starts deduplication from the file with the highest deduplication priority. Where the controller permits, the deduplicator first applies GFD to the file: to find out whether redundant copies of the file exist in the global scope, the local deduplicator issues a query to the metadata server; since the global file index is kept on the metadata server, the query can tell whether the file has redundant backups distributed on different object storage servers. If a redundant file backup is found, the current deduplicator sends a deduplication request to the object server holding the redundant file, and that server notifies its local deduplicator to delete the redundant file from its local disk. After learning that the redundant file has been deleted, the deduplicator that initiated the request sets the flag bit of the file's record in the deduplication candidate list to "clean", and notifies the metadata server to update the file index so that the location information of the just-deleted file points to the object storage server where this deduplicator resides.
After the deduplicator has traversed the deduplication candidate list in GFD mode, and where the controller permits, it traverses the list again from the beginning in LCD mode. To perform LCD, the deduplicator searches the local block index to determine whether any local block is redundant with the current file or with some of its blocks. If redundant blocks exist, then after the redundant blocks are deleted, the index of the local object storage server is updated accordingly, and the flag bit of the corresponding file entry in the deduplication candidate list is marked "clean". Once the LCD pass has also traversed the deduplication candidate list, the local deduplicator notifies the metadata server that the deduplication candidate list of this cycle has been fully processed.
While the deduplicator operates, the controller dynamically adjusts its deduplication strategy according to the real-time status of the system. The controller is a distributed component, running simultaneously on the metadata server and on every object storage server. The part on the metadata server is responsible for monitoring the read-request response delay of the whole distributed primary storage system, while the part running on each object storage server is responsible for monitoring the storage-space usage on that server. The controller dynamically adjusts the progress of the deduplicator according to a set of ⟨read-response reference delay, deduplication-ratio range⟩ demand pairs preset by the user and arranged by priority level. We call this prioritized, hierarchical set of demand pairs a Service Level Agreement, i.e. an SLA. This embodiment gives the format of the SLA, so the user can refine its settings according to this format as the input to the controller; a sample SLA is shown in Table 2:
Table 2

Grade	Read response latency interval	Deduplication ratio interval
1	[0, 600ms]	[0.25, 1)
2	[0, 600ms]	[0.1, 0.25]
3	[600ms, 750ms]	[0.25, 1)
4	[600ms, 750ms]	[0.1, 0.25]
5	[750ms, 900ms]	[0.25, 1)
6	[750ms, 900ms]	[0.1, 0.25]
7	[900ms, 1200ms]	[0.25, 1)
8	[900ms, 1200ms]	[0.1, 0.25]
When a read request from a user arrives at the metadata server, the metadata server takes the current timestamp as the start-of-response timestamp of that read request, inserts the timestamp into the request message, and forwards the request to the corresponding object storage server. The object storage server receiving the read request reads out the requested object in full and, when it begins sending the last data block to the user, takes the current timestamp as the end timestamp of the read request. The interval between the start timestamp and the end timestamp is the response latency of the request. This latency is captured by the controller sub-component in each object storage server and sent to the controller component on the metadata server. The controller component on the metadata server collects the response latency of each read request reported by the controller components on the object storage servers and, at every fixed cycle T, computes the average of all read response latencies within that cycle; this average serves as the read response latency reference value for the cycle.
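The per-cycle averaging can be sketched as below; the class and method names are assumptions for illustration, and transport between servers is omitted:

```python
class LatencyAggregator:
    """Metadata-server side controller component (illustrative sketch).

    Collects per-request read response latencies reported by the object
    storage servers and, at the end of each cycle T, emits their average
    as the read response latency reference value for that cycle.
    """

    def __init__(self, cycle_seconds):
        self.cycle_seconds = cycle_seconds
        self.latencies_ms = []   # latencies reported during the current cycle

    def report(self, start_ts, end_ts):
        # Response latency = interval between the start-of-response timestamp
        # (stamped by the metadata server) and the end timestamp (stamped when
        # the last data block begins transmission to the user), in ms.
        self.latencies_ms.append((end_ts - start_ts) * 1000.0)

    def close_cycle(self):
        # Average of all read response latencies observed within cycle T;
        # None if no read request completed in this cycle.
        if not self.latencies_ms:
            return None
        avg = sum(self.latencies_ms) / len(self.latencies_ms)
        self.latencies_ms = []   # start the next cycle empty
        return avg
```
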
The deduplication ratio is the ratio of the volume of redundant data removed to the total volume of stored data before deduplication. Each time the deduplicator on an object storage server completes a GFD or LCD deduplication operation on a file, it sets the flag bit of that file in the deduplication priority list to "clean" and appends to the file's entry the volume of redundant data removed. When the metadata server reclaims each fully traversed deduplication priority list, it sums the removed-data volumes recorded in the list and divides the sum by the total size of all files in the distributed in-memory storage system, which is obtained directly from the global file index, yielding the real-time deduplication ratio.
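The ratio computation is a single division; a minimal sketch, with parameter names chosen for illustration:

```python
def dedup_ratio(removed_volumes, total_volume_before_dedup):
    """Real-time deduplication ratio (illustrative sketch).

    removed_volumes: per-file volumes of redundant data removed, as recorded
        in the reclaimed deduplication priority lists.
    total_volume_before_dedup: total size of all files, as indexed by the
        global file index.
    """
    if total_volume_before_dedup <= 0:
        return 0.0   # guard: nothing stored yet
    return sum(removed_volumes) / total_volume_before_dedup
```
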
In each cycle T, the controller on the metadata server queries the read response latency value and the deduplication ratio of the whole distributed in-memory storage system and, against the demand ranges in the SLA, dynamically adjusts the deduplicator. As shown in Fig. 2, the main flow is as follows:
The controller starts from the SLA of the lowest priority level. If the current read response latency meets or is better than the demand range of the current SLA, and the instantaneous deduplication ratio meets or exceeds the deduplication ratio upper bound of the current level, the deduplicator is allowed to continue normal operation and the current SLA priority level is raised by one;
If the current read response latency meets or is better than the demand range of the current level, but the instantaneous deduplication ratio is below the deduplication ratio lower bound of the current level, the deduplicator is allowed to continue normal operation and the SLA is kept at the current level;
If the current read response latency does not meet the current demand range, there are two cases: a) if the current read response latency exceeds 1.1 times the upper bound of the read latency demand range of the current level, the controller stops all operations of the deduplicator; if for three consecutive cycles the system's read response latency never falls back below 1.1 times the upper bound of the read latency demand range of the current level, the current SLA grade is lowered by one; b) if the current read response latency exceeds the upper bound of the current demand range but is below 1.1 times that upper bound, the controller stays at this demand level and instructs the deduplicator to stop LCD deduplication and retain only GFD operation.
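The decision taken once per cycle T can be sketched as follows. The SLA tuple layout is an assumption, and the three-consecutive-cycle demotion counter of case (a) is left outside the sketch:

```python
def adjust(read_latency_ms, dedup_ratio, sla, level):
    """One controller decision per cycle T (illustrative sketch).

    sla: list of (latency_upper_ms, ratio_lower, ratio_upper) tuples ordered
        from lowest to highest priority level; `level` indexes the current one.
    Returns (action, new_level), where action is 'run', 'gfd_only' or 'stop'.
    """
    latency_upper, ratio_lower, ratio_upper = sla[level]
    if read_latency_ms <= latency_upper:
        if dedup_ratio >= ratio_upper:
            # Latency and ratio both satisfied: raise the SLA level by one.
            return 'run', min(level + 1, len(sla) - 1)
        # Latency satisfied but ratio not at its upper bound: stay at this level.
        return 'run', level
    if read_latency_ms > 1.1 * latency_upper:
        # Case (a): severely over budget - stop both GFD and LCD.  Demotion
        # after three consecutive over-budget cycles is tracked by the caller.
        return 'stop', level
    # Case (b): mildly over budget - keep file-level GFD, stop block-level LCD.
    return 'gfd_only', level
```
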
The above description of the embodiments is provided so that those skilled in the art can understand and use the invention. A person skilled in the art can evidently make various modifications to the above embodiments and apply the general principles described herein to other embodiments without creative effort. Therefore, the invention is not limited to the above embodiments; improvements and modifications made by those skilled in the art according to the disclosure of the invention shall all fall within the scope of protection of the invention.

Claims (7)

1. A redundant data deduplication method based on file semantics and real-time system status, comprising the following steps:
periodically detecting the read response latency and deduplication ratio of the distributed storage system; and, according to the read response latency and deduplication ratio of the system at the current time, adjusting the deduplicator's work on the system using the following SLA-based dynamic mechanism:
according to the SLA currently referenced by the system, judging whether the read response latency of the system at the current time exceeds 1.1 times the upper bound of the read response latency interval of that SLA:
if so, making the deduplicator stop performing GFD and LCD on the system within the next cycle; if not, judging whether the read response latency of the system at the current time is below the upper bound of the read response latency interval of the SLA:
if not, making the deduplicator, within the next cycle, retain GFD on the system and stop performing LCD; if so, judging whether the deduplication ratio of the system at the current time is below the lower bound of the deduplication ratio interval of the SLA:
if so, making the deduplicator perform GFD and LCD on the system normally within the next cycle; if not, raising the SLA currently referenced by the system by one grade and performing the judgment again with the new SLA according to the above dynamic mechanism;
wherein GFD denotes global deduplication at the file level, LCD denotes local deduplication at the data-block level, and the grades of the SLA are detailed as follows:
the SLA of the first grade, whose read response latency interval is [0, 600ms] and whose deduplication ratio interval is [0.25, 1);
the SLA of the second grade, whose read response latency interval is [0, 600ms] and whose deduplication ratio interval is [0.1, 0.25];
the SLA of the third grade, whose read response latency interval is [600ms, 750ms] and whose deduplication ratio interval is [0.25, 1);
the SLA of the fourth grade, whose read response latency interval is [600ms, 750ms] and whose deduplication ratio interval is [0.1, 0.25];
the SLA of the fifth grade, whose read response latency interval is [750ms, 900ms] and whose deduplication ratio interval is [0.25, 1);
the SLA of the sixth grade, whose read response latency interval is [750ms, 900ms] and whose deduplication ratio interval is [0.1, 0.25];
the SLA of the seventh grade, whose read response latency interval is [900ms, 1200ms] and whose deduplication ratio interval is [0.25, 1);
the SLA of the eighth grade, whose read response latency interval is [900ms, 1200ms] and whose deduplication ratio interval is [0.1, 0.25].
2. The redundant data deduplication method according to claim 1, wherein: if the read response latency of the system at the current time exceeds 1.1 times the upper bound of the read response latency interval of the SLA the system currently resides at, the deduplicator stops performing GFD and LCD on the system within the next cycle; and if, counting from the next cycle, the deduplicator stops performing GFD and LCD on the system for three consecutive cycles, the SLA currently referenced by the system is lowered by one grade.
3. The redundant data deduplication method according to claim 1, wherein the deduplicator deduplicates the system using a priority-based hierarchical redundant-data deduplication scheme, the detailed process of which is as follows:
(1) computing, based on file semantics, the deduplication priority of each file in the system;
(2) extracting, from the global file index of the system, the file location records of the n files with the highest deduplication priority, n being a natural number greater than 1;
(3) performing hierarchical redundant-data deduplication on the above n files according to their file location records:
first, performing GFD on the n files and updating the global file index accordingly; initially, the file location records of the n files are marked "dirty", and after a file's redundant data has been removed by GFD its file location record is marked "clean";
then, performing LCD on the files whose file location records are still marked "dirty" after GFD processing, and updating the local data-block index accordingly;
(4) repeating steps (1) to (3).
4. The redundant data deduplication method according to claim 3, wherein in step (1) the deduplication priority of each file is computed by the following formula:
ρ = α1H + α2G + α3E
wherein ρ is the deduplication priority of the file, H is the preference value of the file's last access time, G is the preference value of the file size, E is the preference value of the file type, and α1 to α3 are the weight coefficients corresponding to H, G and E respectively.
5. The redundant data deduplication method according to claim 4, wherein the calibration criteria for the preference value H are as follows:
if the time since the file was last accessed exceeds 1 month, H=7;
if the time since the file was last accessed exceeds 15 days and is at most 1 month, H=6;
if the time since the file was last accessed exceeds 7 days and is at most 15 days, H=5;
if the time since the file was last accessed exceeds 3 days and is at most 7 days, H=4;
if the time since the file was last accessed exceeds 1 day and is at most 3 days, H=3;
if the time since the file was last accessed exceeds 12 hours and is at most 1 day, H=2;
if the time since the file was last accessed exceeds 6 hours and is at most 12 hours, H=1;
if the time since the file was last accessed is at most 6 hours, H=-1.
6. The redundant data deduplication method according to claim 4, wherein the calibration criteria for the preference value G are as follows:
if the file size exceeds 1GB, G=7;
if the file size exceeds 512MB and is at most 1GB, G=6;
if the file size exceeds 256MB and is at most 512MB, G=5;
if the file size exceeds 64MB and is at most 256MB, G=4;
if the file size exceeds 8MB and is at most 64MB, G=3;
if the file size exceeds 1MB and is at most 8MB, G=2;
if the file size exceeds 128KB and is at most 1MB, G=1;
if the file size is at most 128KB, G=-1.
7. The redundant data deduplication method according to claim 4, wherein the calibration criteria for the preference value E are as follows:
if the file type is backup log, E=7;
if the file type is mirror image, E=6;
if the file type is project, E=5;
if the file type is video, E=4;
if the file type is audio, E=3;
if the file type is document, E=2;
if the file type is picture, E=1;
if the file type is any other type, E=-1.
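The priority computation of claims 4 to 7 can be sketched as follows. The weight values α1 to α3 and the file-type keys are illustrative assumptions; the calibration tables themselves follow claims 5 to 7:

```python
# Calibration tables from claims 5-7: (threshold, score) pairs, checked in order.
H_TABLE = [(30 * 86400, 7), (15 * 86400, 6), (7 * 86400, 5), (3 * 86400, 4),
           (86400, 3), (12 * 3600, 2), (6 * 3600, 1)]   # seconds since last access
G_TABLE = [(1 << 30, 7), (512 << 20, 6), (256 << 20, 5), (64 << 20, 4),
           (8 << 20, 3), (1 << 20, 2), (128 << 10, 1)]  # file size in bytes
E_TABLE = {'backup_log': 7, 'mirror_image': 6, 'project': 5, 'video': 4,
           'audio': 3, 'document': 2, 'picture': 1}     # file type (keys assumed)

def _lookup(value, table):
    for threshold, score in table:
        if value > threshold:
            return score
    return -1   # at or below the smallest threshold

def dedup_priority(seconds_since_access, size_bytes, file_type,
                   a1=1.0, a2=1.0, a3=1.0):
    """rho = a1*H + a2*G + a3*E per claim 4; the weights a1..a3 are assumptions."""
    h = _lookup(seconds_since_access, H_TABLE)
    g = _lookup(size_bytes, G_TABLE)
    e = E_TABLE.get(file_type, -1)   # any other file type scores -1 per claim 7
    return a1 * h + a2 * g + a3 * e
```

With unit weights, a month-old 2GB backup log scores the maximum 21, while a recently accessed small picture scores -1, matching the intent that cold, large, backup-like files are deduplicated first.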
CN201510435945.0A 2015-07-23 2015-07-23 A kind of redundant data De-weight method based on file semantics and system real-time status Active CN105068757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510435945.0A CN105068757B (en) 2015-07-23 2015-07-23 A kind of redundant data De-weight method based on file semantics and system real-time status


Publications (2)

Publication Number Publication Date
CN105068757A CN105068757A (en) 2015-11-18
CN105068757B true CN105068757B (en) 2017-12-22

Family

ID=54498139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510435945.0A Active CN105068757B (en) 2015-07-23 2015-07-23 A kind of redundant data De-weight method based on file semantics and system real-time status

Country Status (1)

Country Link
CN (1) CN105068757B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832406B (en) * 2017-11-03 2020-09-11 北京锐安科技有限公司 Method, device, equipment and storage medium for removing duplicate entries of mass log data
CN107819697B (en) * 2017-11-27 2020-03-27 锐捷网络股份有限公司 Data transmission method, switch and data center
CN108710681B (en) * 2018-05-18 2022-02-22 腾讯科技(深圳)有限公司 File acquisition method, device, equipment and storage medium
CN110413235B (en) * 2019-07-26 2020-07-24 华中科技大学 SSD (solid State disk) deduplication oriented data distribution method and system

Citations (2)

Publication number Priority date Publication date Assignee Title
CN102523285A (en) * 2011-12-15 2012-06-27 杭州电子科技大学 Storage caching method of object-based distributed file system
CN102662859A (en) * 2012-03-14 2012-09-12 北京神州数码思特奇信息技术股份有限公司 Data cache system based on service grade and method thereof

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US7689602B1 (en) * 2005-07-20 2010-03-30 Bakbone Software, Inc. Method of creating hierarchical indices for a distributed object system


Non-Patent Citations (1)

Title
* Deng Shuiguang et al., "Semantic Web Service Discovery Method Based on Bipartite Graph Matching" (《基于二分图匹配的语义Web服务发现方法》), Chinese Journal of Computers (《计算机学报》), vol. 31, no. 8, Aug. 2008, pp. 1364-1375 *

Also Published As

Publication number Publication date
CN105068757A (en) 2015-11-18


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant