CN112929395A - Cloud data duplicate removal method and system

Info

Publication number: CN112929395A
Application number: CN201911237434.2A
Authority: CN
Inventors: 唐鑫; 周琳娜; 胡冰蔚; 单伟杰; 刘丹; 刘小梅
Original assignee: University of International Relations
Current assignee: University of International Relations
Priority date: 2019-12-05
Filing date: 2019-12-05
Publication date: 2021-06-08
Anticipated expiration: 2039-12-05
Also published as: CN112929395B

Abstract

The invention provides a cloud data duplicate removal method and system. The cloud data deduplication method comprises the following steps: acquiring detection identification information from a received file uploading request; determining the number of the missed blocks and the number of the additional blocks according to the detection identification information; judging whether the number of the missed blocks is equal to the number of the additional blocks; when the number of the missed blocks is equal to the number of the additional blocks, returning a first deduplication response, wherein the identification information in the first deduplication response comprises the identification information of the hit blocks and the identification information of the missed blocks; when the number of the missed blocks is not equal to the number of the additional blocks, returning a second deduplication response, wherein the identification information in the second deduplication response comprises the identification information of the missed blocks; and the number of the identification information in the first deduplication response is equal to that of the identification information in the second deduplication response. The method and the device can avoid privacy disclosure, improve the security of cloud data and reduce communication overhead.

Description

Cloud data duplicate removal method and system

Technical Field

The invention relates to the field of data deduplication, in particular to a cloud data deduplication method and system.

Background

In a cloud storage scenario, cross-user deduplication is widely used to reduce the overhead of storing and managing cloud data; extending the deduplication scope from a single user to multiple users further improves deduplication efficiency. However, under a side channel attack, an attacker who knows all the public information of a cloud target file can generate candidate values for the sensitive information purely by guessing, synthesize a complete file from each candidate, upload it to the cloud deduplication system for detection, and judge from the deduplication response whether the guessed sensitive information is correct. If the sensitive information of a file A stored in the cloud has n possible values, the attacker can compromise the existence privacy of the file with at most n detections.

FIG. 1 is a schematic diagram of a side channel attack model. As shown in FIG. 1, assume that all sensitive information of the cloud target file A is contained in one data block B and that the remaining data blocks are public information. To obtain the sensitive information, the attacker generates n files (A_1, A_2, ..., A_n) that differ only in the data block B_i (i = 1, 2, ..., n) containing the sensitive information; the other, public blocks are identical. The attacker uploads the generated files to the cloud deduplication system for detection. If the deduplication response indicates that an uploaded file A_k (k ∈ [1, n]) duplicates the cloud target file A, the attacker can conclude that the sensitive block B_k in A_k is identical to the data block B of the cloud target file A, that is, the content of B_k is the sensitive information in A, so the privacy of the cloud target file A is revealed.
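The attack of FIG. 1 can be illustrated with a minimal Python sketch. It assumes a naive deduplication server that simply reports whether an uploaded file already exists. The class `NaiveDedupServer`, the functions `tag` and `guess_sensitive_block`, and the use of SHA-256 as the block tag are hypothetical names and choices made for illustration; none of them comes from the patent.

```python
import hashlib
from typing import List, Optional, Tuple


def tag(block: bytes) -> str:
    """Hypothetical block tag: SHA-256 hex digest of the block content."""
    return hashlib.sha256(block).hexdigest()


class NaiveDedupServer:
    """Toy server whose response leaks whether a file already exists."""

    def __init__(self) -> None:
        self._files = set()

    def store(self, blocks: List[bytes]) -> None:
        # A file is remembered as the ordered tuple of its block tags.
        self._files.add(tuple(tag(b) for b in blocks))

    def is_duplicate(self, blocks: List[bytes]) -> bool:
        return tuple(tag(b) for b in blocks) in self._files


def guess_sensitive_block(
    server: NaiveDedupServer,
    public_blocks: List[bytes],
    candidates: List[bytes],
) -> Optional[Tuple[int, bytes]]:
    """Try candidates B_1..B_n and return (k, B_k) for the reported duplicate."""
    for k, candidate in enumerate(candidates, start=1):
        # File A_k = the public blocks plus the guessed sensitive block B_k.
        if server.is_duplicate(public_blocks + [candidate]):
            return k, candidate
    return None
```

With n candidates the loop issues at most n detection requests, which matches the bound stated in the background.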

FIG. 2 is a schematic diagram of an additional block attack model. As shown in FIG. 2, in the additional block attack, an attacker adds redundant blocks that do not exist in the cloud target file A to the file to be detected. Because the number of additional blocks is chosen randomly by the attacker, it is difficult for the cloud to judge whether the detected file actually exists. In an attack with X' additional blocks, the number of missed blocks detected by the cloud may be X' + 1 or X', corresponding to the detected file being absent or present, respectively. However, the cloud does not know the value of X', since it is chosen randomly by the attacker. Under an additional block attack, therefore, the cloud cannot determine the number of additional blocks in the detected file from the deduplication result, and so cannot confuse the attacker by obfuscating the response. The additional block attack is thus a serious threat to the security of cloud data.

In the prior art, the random redundancy values of the deduplication responses for hit blocks and for missed blocks fall in different value ranges, which leaves a possibility of privacy disclosure and also incurs a large communication overhead.

Disclosure of Invention

The embodiment of the invention mainly aims to provide a cloud data duplicate removal method and system, so as to avoid privacy disclosure, improve the security of cloud data and reduce communication overhead.

In order to achieve the above object, an embodiment of the present invention provides a cloud data deduplication method, including:

acquiring detection identification information from a received file uploading request;

determining the number of the missed blocks and the number of the additional blocks according to the detection identification information;

judging whether the number of the missed blocks is equal to the number of the additional blocks;

when the number of the missed blocks is equal to the number of the additional blocks, returning a first deduplication response, wherein the identification information in the first deduplication response comprises the identification information of the hit blocks and the identification information of the missed blocks;

when the number of the missed blocks is not equal to the number of the additional blocks, returning a second deduplication response, wherein the identification information in the second deduplication response comprises the identification information of the missed blocks;

and the number of the identification information in the first deduplication response is equal to that of the identification information in the second deduplication response.

An embodiment of the present invention further provides a cloud data deduplication system, including:

the acquisition unit is used for acquiring the detection identification information from the received file uploading request;

a determining unit configured to determine the number of the missed blocks and the number of the additional blocks according to the detection identification information;

a judging unit for judging whether the number of the missed blocks is equal to the number of the additional blocks;

a first returning unit, configured to return a first deduplication response when the number of the missed blocks is equal to the number of the additional blocks, where identification information in the first deduplication response includes identification information of the hit blocks and identification information of the missed blocks;

a second returning unit, configured to return a second deduplication response when the number of the missed blocks is not equal to the number of the additional blocks, where identification information in the second deduplication response includes identification information of the missed blocks;

The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the steps of the cloud data deduplication method are realized when the processor executes the computer program.

The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the cloud data deduplication method are realized.

The cloud data deduplication method and the cloud data deduplication system of the embodiment of the invention firstly acquire detection identification information from a received file uploading request, then determine the number of missed blocks and the number of additional blocks according to the detection identification information, and then judge whether the number of the missed blocks is equal to the number of the additional blocks; returning a first deduplication response when the number of the missed blocks is equal to the number of the additional blocks; and when the number of the missed blocks is not equal to that of the additional blocks, returning a second duplicate removal response, wherein the number of the identification information in the first duplicate removal response is equal to that of the identification information in the second duplicate removal response, so that privacy disclosure can be avoided, the security of cloud data is improved, and the communication overhead is reduced.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

FIG. 1 is a schematic diagram of a side channel attack model;

FIG. 2 is a schematic diagram of an additional block attack model;

FIG. 3 is a flowchart of a cloud data deduplication method in an embodiment of the present invention;

FIG. 4 is a schematic diagram of a file to be detected and its detection identification information according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a file to be detected and its detection identification information according to another embodiment of the present invention;

FIG. 6 is a schematic diagram of a set of tags and a cloud tag in an embodiment of the present invention;

FIG. 7 is a block diagram of a cloud data deduplication system in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

In view of the fact that the possibility of privacy disclosure exists in the prior art and a large communication overhead is caused, the embodiment of the invention provides a cloud data deduplication method to avoid privacy disclosure, improve the security of cloud data and reduce the communication overhead. The present invention will be described in detail below with reference to the accompanying drawings.

Fig. 3 is a flowchart of a cloud data deduplication method in an embodiment of the present invention. As shown in fig. 3, the cloud data deduplication method includes:

S101: Acquire the detection identification information from the received file upload request.

Before S101 is executed, the method further includes receiving a file upload request from a detection system. The detection system stores a file to be detected, which comprises a plurality of data blocks, and it generates and uploads corresponding detection identification information for each data block.

S102: Determine the number of missed blocks and the number of additional blocks according to the detection identification information.

The detection identification information may be a hash value of the data block.

S103: Determine whether the number of missed blocks is equal to the number of additional blocks.

S104: When the number of missed blocks is equal to the number of additional blocks, return a first deduplication response, in which the identification information comprises the identification information of the hit blocks and the identification information of the missed blocks.

S105: When the number of missed blocks is not equal to the number of additional blocks, return a second deduplication response, in which the identification information comprises the identification information of the missed blocks.

The cloud data deduplication method shown in FIG. 3 may be executed by a computer. As can be seen from the flow shown in FIG. 3, the cloud data deduplication method of the embodiment of the present invention first obtains the detection identification information from the received file upload request, determines the number of missed blocks and the number of additional blocks according to the detection identification information, and then determines whether the number of missed blocks is equal to the number of additional blocks. When the two numbers are equal, a first deduplication response is returned; when they are not equal, a second deduplication response is returned. The number of pieces of identification information in the first deduplication response is equal to that in the second deduplication response, so that privacy disclosure can be avoided, the security of cloud data is improved, and the communication overhead is reduced.
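As noted above, the detection identification information may be a hash value of each data block. The following Python sketch shows one way a detection system could derive it. The fixed block size, the choice of SHA-256, and the function names `split_into_blocks` and `detection_ids` are assumptions made for illustration; the patent does not specify a chunking scheme or a hash function.

```python
import hashlib
from typing import List


def split_into_blocks(data: bytes, block_size: int = 4096) -> List[bytes]:
    """Split a file into fixed-size data blocks (the last block may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def detection_ids(data: bytes, block_size: int = 4096) -> List[str]:
    """Return one identifier (SHA-256 hex digest) per data block, in file order."""
    return [
        hashlib.sha256(block).hexdigest()
        for block in split_into_blocks(data, block_size)
    ]
```

The resulting list is what a file upload request would carry in this sketch, one identifier per data block.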

In one embodiment, determining the number of missed blocks comprises:

determining the number of the hit blocks according to the detection identification information;

determining the number of the missed blocks according to the number of the detection identification information and the number of the hit blocks; in specific implementation, the number of the missed blocks is the difference between the number of the detection identification information and the number of the hit blocks.

Wherein, determining the number of the hit blocks comprises:

determining a label set corresponding to the detection identification information; acquiring a cloud label in a label set; and matching the cloud label with the detection identification information to determine the number of the hit blocks.
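The counting steps above can be sketched in Python as follows. The patent does not say how the tag set corresponding to the detection identification information is located, so `find_tag_set` picks the stored tag set with the largest overlap; that rule is an assumption. The function names are hypothetical, and the identifiers in one request are assumed to be distinct.

```python
from typing import FrozenSet, Iterable, List, Optional, Tuple


def find_tag_set(
    detection_ids: List[str],
    cloud_tag_sets: Iterable[FrozenSet[str]],
) -> Optional[FrozenSet[str]]:
    """Assumed lookup rule: choose the tag set sharing the most identifiers."""
    ids = set(detection_ids)
    best, best_overlap = None, 0
    for tag_set in cloud_tag_sets:
        overlap = len(ids & tag_set)
        if overlap > best_overlap:
            best, best_overlap = tag_set, overlap
    return best  # None when no stored tag set shares any identifier


def count_hit_and_missed(
    detection_ids: List[str],
    cloud_tags: FrozenSet[str],
) -> Tuple[int, int]:
    """Match identifiers against the cloud tags and return (hits, missed)."""
    hits = sum(1 for t in detection_ids if t in cloud_tags)
    # Missed blocks = number of identifiers minus number of hit blocks.
    missed = len(detection_ids) - hits
    return hits, missed
```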

When the number of the missed blocks is equal to that of the additional blocks, it is indicated that detection identification information in the file uploading request is completely matched with a cloud tag in a cloud target file, sensitive blocks in the cloud target file are hit, and the cloud target file comprises a file to be detected corresponding to the detection identification information; at this time, identification information of a hit block needs to be added into the returned first deduplication response to enable the number of the identification information in the first deduplication response to be equal to the number of the identification information in the second deduplication response, so that response fuzzification is achieved, and the purpose of confusing attackers is achieved.

In one embodiment, determining the number of additional blocks comprises:

determining the number of cloud tags;

and determining the number of the additional blocks according to the number of the detection identification information and the number of the cloud labels.

In specific implementation, the number of the additional blocks is the difference between the number of the detection identification information and the number of the cloud end tags.
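As a small worked example of this difference, the Python sketch below computes the number of additional blocks; the function name `count_additional_blocks` and the sample identifiers are hypothetical.

```python
from typing import Collection


def count_additional_blocks(
    detection_ids: Collection[str],
    cloud_tags: Collection[str],
) -> int:
    """Additional blocks = number of identifiers minus number of cloud tags."""
    return len(detection_ids) - len(cloud_tags)


# Example: the cloud file has three tags and the request carries five
# identifiers, so two of the uploaded blocks are additional blocks.
example = count_additional_blocks(
    ["c1", "c2", "cs", "e1", "e2"], {"c1", "c2", "cs"}
)
```

The sketch returns the plain difference as described; how a request with fewer identifiers than cloud tags should be handled is not stated in the text and is left open here.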

Assuming that all unpublished sensitive information in the cloud target file is contained in one data block, the data block is called a cloud sensitive block, and other published data blocks are public blocks; the file to be detected comprises a plurality of public blocks and a detection sensitive block generated by a detection system; the flow of one embodiment of the invention is as follows:

1. Acquire the detection identification information from the received file upload request.

FIG. 4 is a schematic diagram of a file to be detected and its detection identification information in an embodiment of the present invention. FIG. 5 is a schematic diagram of a file to be detected and its detection identification information in another embodiment of the present invention. As shown in FIG. 4 and FIG. 5, the file to be detected includes a plurality of data blocks (including Y' additional blocks), where the data block Cs and the data block Cs' are detection sensitive blocks generated by the detection system, the detection tag sets t_F and t_F' are sets of detection identification information, and t_Cs and t_Cs' are the detection identification information corresponding to the detection sensitive blocks.

2. Determining a label set corresponding to the detection identification information; and acquiring cloud tags in the tag set, and matching the cloud tags with the detection identification information to determine the number of the hit blocks.

FIG. 6 is a schematic diagram of a tag set and cloud tags in an embodiment of the present invention. As shown in FIG. 6, the cloud file includes a plurality of tag sets. The tag set t_F corresponding to the detection identification information is determined first, and then the cloud tags in that tag set, such as t_C1, t_C2, t_Cs and t_Cn, are obtained, where t_Cs is the cloud tag corresponding to the cloud sensitive block. When the detection identification information includes t_Cs, the number of hit blocks is the number of public blocks plus one (the number of all data blocks in the file to be detected, excluding any additional blocks); when the detection identification information does not include t_Cs, the number of hit blocks is the number of public blocks (the number of all data blocks in the file to be detected, excluding any additional blocks, minus one).

3. And determining the number of the missed blocks according to the number of the detection identification information and the number of the hit blocks.

The number of the missed blocks is the difference between the number of the detection identification information and the number of the hit blocks.

As shown in FIG. 6, the number of missed blocks when the detection identification information does not include t_Cs is one more than the number of missed blocks when it includes t_Cs. When the detection identification information does not include t_Cs, the number of missed blocks is the number of additional blocks plus one; when the detection identification information includes t_Cs, the number of missed blocks is the number of additional blocks.

4. Determining the number of cloud tags; and determining the number of the additional blocks according to the number of the detection identification information and the number of the cloud labels.

The number of the additional blocks is the difference between the number of the detection identification information and the number of the cloud end tags. Under normal conditions, no additional block exists in the file to be detected corresponding to the detection identification information uploaded by the ordinary user.

5. It is determined whether the number of missed blocks equals the number of additional blocks.

As shown in FIG. 6, when the detection identification information includes t_Cs, the number of missed blocks equals the number of additional blocks, and the first deduplication response is returned. When the detection identification information does not include t_Cs, the number of missed blocks equals the number of additional blocks plus one, and the second deduplication response is returned.

The first deduplication response contains one fewer missed-block identifier than the second deduplication response. The first deduplication response therefore comprises the identification information of the missed blocks plus the identification information of one hit block, whereas the second deduplication response comprises only the identification information of the missed blocks. As a result, the number of pieces of identification information is the same in both responses, which obfuscates the response and confuses the attacker: the attacker cannot judge from the deduplication response whether the file to be detected exists in the cloud. In addition, the number of pieces of identification information returned in the first and second deduplication responses is the minimum needed to confuse the attacker, so the communication overhead can be reduced.
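The response construction described above can be sketched in Python under the single-sensitive-block assumption used in this embodiment. The function name `build_dedup_response` is hypothetical. Which hit block is used as padding, and the shuffling of the response order, are not specified in the text and are assumptions of this sketch.

```python
import secrets
from typing import List, Set


def build_dedup_response(detection_ids: List[str], cloud_tags: Set[str]) -> List[str]:
    """Return the identifiers carried by the first or second deduplication response."""
    hit = [t for t in detection_ids if t in cloud_tags]
    missed = [t for t in detection_ids if t not in cloud_tags]
    additional = len(detection_ids) - len(cloud_tags)

    if len(missed) == additional and hit:
        # First response: the missed identifiers plus one hit identifier
        # as padding (assumption: the hit identifier is chosen at random).
        response = missed + [secrets.choice(hit)]
    else:
        # Second response: only the missed identifiers.
        response = list(missed)

    # Assumption: shuffle so that position does not mark the padding entry.
    secrets.SystemRandom().shuffle(response)
    return response
```

For a cloud file with tags {c1, c2, cs} and a request carrying two additional blocks, both a correct guess (cs present) and a wrong guess yield a response with three identifiers. The sketch reproduces only this equal-count property and makes no further security claim.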

To sum up, the cloud data deduplication method of the embodiment of the present invention obtains detection identification information from a received file upload request, determines the number of missed blocks and the number of additional blocks according to the detection identification information, and then determines whether the number of missed blocks is equal to the number of additional blocks; returning a first deduplication response when the number of the missed blocks is equal to the number of the additional blocks; and when the number of the missed blocks is not equal to that of the additional blocks, returning a second duplicate removal response, wherein the number of the identification information in the first duplicate removal response is equal to that of the identification information in the second duplicate removal response, so that privacy disclosure can be avoided, the security of cloud data is improved, and the communication overhead is reduced.

Based on the same inventive concept, the embodiment of the invention also provides a cloud data deduplication system, and as the problem solving principle of the system is similar to that of a cloud data deduplication method, the implementation of the system can refer to the implementation of the method, and repeated parts are not described again.

Fig. 7 is a block diagram of a cloud data deduplication system in the embodiment of the present invention. As shown in fig. 7, the cloud data deduplication system includes:

In one embodiment, the determining unit is specifically configured to:

and determining the number of the missed blocks according to the number of the detection identification information and the number of the hit blocks.

In one embodiment, the determining unit is specifically configured to:

determining a label set corresponding to the detection identification information;

acquiring a cloud label in a label set;

and matching the cloud label with the detection identification information to determine the number of the hit blocks.

In one embodiment, the determining unit is specifically configured to:

determining the number of cloud tags;

To sum up, the cloud data deduplication system of the embodiment of the present invention first obtains detection identification information from a received file upload request, determines the number of the missed blocks and the number of the additional blocks according to the detection identification information, and then determines whether the number of the missed blocks is equal to the number of the additional blocks; returning a first deduplication response when the number of the missed blocks is equal to the number of the additional blocks; and when the number of the missed blocks is not equal to that of the additional blocks, returning a second duplicate removal response, wherein the number of the identification information in the first duplicate removal response is equal to that of the identification information in the second duplicate removal response, so that privacy disclosure can be avoided, the security of cloud data is improved, and the communication overhead is reduced.

The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, all or part of contents of the cloud data deduplication method may be implemented, for example, when the processor executes the computer program, the following contents may be implemented:

To sum up, the computer device of the embodiment of the present invention first obtains the detection identification information from the received file upload request, determines the number of the missed blocks and the number of the additional blocks according to the detection identification information, and then determines whether the number of the missed blocks is equal to the number of the additional blocks; returning a first deduplication response when the number of the missed blocks is equal to the number of the additional blocks; and when the number of the missed blocks is not equal to that of the additional blocks, returning a second duplicate removal response, wherein the number of the identification information in the first duplicate removal response is equal to that of the identification information in the second duplicate removal response, so that privacy disclosure can be avoided, the security of cloud data is improved, and the communication overhead is reduced.

An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, may implement all or part of contents of a cloud data deduplication method, for example, when the processor executes the computer program, the following contents may be implemented:

To sum up, the computer-readable storage medium of the embodiment of the present invention first obtains the detection identification information from the received file upload request, determines the number of the missed blocks and the number of the additional blocks according to the detection identification information, and then determines whether the number of the missed blocks is equal to the number of the additional blocks; returning a first deduplication response when the number of the missed blocks is equal to the number of the additional blocks; and when the number of the missed blocks is not equal to that of the additional blocks, returning a second duplicate removal response, wherein the number of the identification information in the first duplicate removal response is equal to that of the identification information in the second duplicate removal response, so that privacy disclosure can be avoided, the security of cloud data is improved, and the communication overhead is reduced.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.

The various illustrative logical blocks, or elements, or devices described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.

In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium and is thus included if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., by infrared, radio, or microwave. Disks and discs include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.

Claims

1. A cloud data deduplication method, characterized by comprising the following steps:

when the number of the missed blocks is equal to the number of the additional blocks, returning a first deduplication response, wherein identification information in the first deduplication response comprises identification information of the hit blocks and identification information of the missed blocks;

when the number of the missed blocks is not equal to the number of the additional blocks, returning a second deduplication response, wherein identification information in the second deduplication response comprises identification information of the missed blocks;

wherein the number of pieces of identification information in the first deduplication response is equal to the number of pieces of identification information in the second deduplication response.

2. The cloud data deduplication method of claim 1, wherein determining the number of the missed blocks comprises:

determining the number of hit blocks according to the detection identification information;

3. The cloud data deduplication method of claim 2, wherein determining the number of hit blocks comprises:

acquiring cloud tags in the tag set;

and matching the cloud tags with the detection identification information to determine the number of the hit blocks.

4. The cloud data deduplication method of claim 3, wherein determining the number of additional blocks comprises:

determining the number of the cloud tags;

and determining the number of the additional blocks according to the number of pieces of detection identification information and the number of the cloud tags.

5. A cloud data deduplication system, comprising:

a determining unit, configured to determine the number of the missed blocks and the number of the additional blocks according to the detection identification information;

a judging unit configured to judge whether the number of the missed blocks is equal to the number of the additional blocks;

a first returning unit, configured to return a first deduplication response when the number of the missed blocks is equal to the number of the additional blocks, wherein identification information in the first deduplication response includes identification information of the hit blocks and identification information of the missed blocks;

6. The cloud data deduplication system of claim 5, wherein the determining unit is specifically configured to:

7. The cloud data deduplication system of claim 6, wherein the determining unit is specifically configured to:

acquiring cloud tags in the tag set;

8. The cloud data deduplication system of claim 7, wherein the determining unit is specifically configured to:

determining the number of the cloud tags;

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the cloud data deduplication method of any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the cloud data deduplication method according to any one of claims 1 to 4.
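
The decision flow recited in claims 1 to 4 can be illustrated with a minimal Python sketch. It is not the claimed implementation, and several points are assumptions made only for illustration: the function name `build_dedup_response` is hypothetical; a missed block is taken to be any detection identifier that matches no cloud tag; the formula for the number of additional blocks is not given in the claims, so it is passed in as the hypothetical parameter `additional_count_fn`; and because the claims do not state how the second response reaches the same number of identifiers as the first, the sketch pads it with a placeholder value `FILLER_ID`.

```python
from typing import Callable, Dict, Sequence, Set

FILLER_ID = "<filler>"  # hypothetical padding value, not taken from the patent


def build_dedup_response(
    detection_ids: Sequence[str],
    cloud_tags: Set[str],
    additional_count_fn: Callable[[int, int], int],
) -> Dict[str, object]:
    # Claim 3: match the cloud tags against the detection identifiers to find hits.
    hit_ids = [d for d in detection_ids if d in cloud_tags]
    # Assumption: every detection identifier that is not a hit is a missed block.
    missed_ids = [d for d in detection_ids if d not in cloud_tags]
    # Claim 4: the additional-block count depends on the number of detection
    # identifiers and the number of cloud tags; the formula is caller-supplied.
    additional = additional_count_fn(len(detection_ids), len(cloud_tags))

    if len(missed_ids) == additional:
        # First response: identifiers of both hit and missed blocks.
        return {"kind": "first", "ids": hit_ids + missed_ids}

    # Second response: identifiers of missed blocks, padded (assumption) so
    # that both response types carry the same number of identifiers.
    padding = [FILLER_ID] * (len(detection_ids) - len(missed_ids))
    return {"kind": "second", "ids": missed_ids + padding}
```

In this sketch both branches return exactly `len(detection_ids)` identifiers, so the size of the response alone does not reveal which branch was taken.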